Can you explain the roles of metric monitoring and distributed tracing in troubleshooting issues online? #138
For those new to observability systems, this is an excellent question. Observability systems are generally divided into metric monitoring, distributed tracing, and logging, collectively referred to as the "three pillars of observability."

What does metric monitoring give us? We divide metrics into two types.

The first type is request-based metrics, which reflect how an interface handles requests: queries per second (QPS), response time, the proportion of exceptions and slow calls, availability (SLA) as seen by the caller, and so on. These metrics faithfully depict the health of a given interface, and once alerting is configured, they are the source of most of the alerts you will receive.

The second type is runtime metrics, which reveal the operational state of the service itself. Examples include JVM metrics such as heap memory usage and reclamation, GC (garbage collection) stop-the-world (STW) counts and durations, CPU, disk, and network usage on the container or physical machine, and overall system load. When the logs show no obvious errors and the path of the abnormal request looks clean, these are the metrics to check for anomalies.

Consider a concrete scenario: you receive an alert that an API call timed out. You open distributed tracing to examine the call details along the path; if an anomaly is found on a span, the detailed exception information can be found in the event logs. This flow resolves over 90% of issues. Sometimes, however, the bottleneck shows up on the caller's side as a long response time even though server-side processing is fast. In that case you need to inspect the runtime state of the instances on both sides of the call: whether the CPU is saturated, whether GC STW pauses are excessively long, and so on.
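To make the request-based metrics concrete, here is a minimal sketch of how QPS, average latency, slow-call ratio, and availability could be computed over a window of recorded calls. The `RequestMetrics` class and its methods are hypothetical helpers for illustration only, not part of OzHera or OpenTelemetry.

```java
import java.util.List;

// Hypothetical sketch: deriving request-based metrics from a window of calls.
public class RequestMetrics {
    // One recorded call: latency in milliseconds and whether it failed.
    public record Call(long latencyMs, boolean error) {}

    // QPS: number of calls divided by the window length in seconds.
    public static double qps(List<Call> calls, long windowSeconds) {
        return (double) calls.size() / windowSeconds;
    }

    // Average response time over the window.
    public static double avgLatencyMs(List<Call> calls) {
        return calls.stream().mapToLong(Call::latencyMs).average().orElse(0);
    }

    // Proportion of calls slower than the given threshold.
    public static double slowCallRatio(List<Call> calls, long thresholdMs) {
        if (calls.isEmpty()) return 0;
        long slow = calls.stream().filter(c -> c.latencyMs() > thresholdMs).count();
        return (double) slow / calls.size();
    }

    // Availability (SLA): share of calls that completed without error.
    public static double availability(List<Call> calls) {
        if (calls.isEmpty()) return 1.0;
        long ok = calls.stream().filter(c -> !c.error()).count();
        return (double) ok / calls.size();
    }
}
```

In a real system these aggregates are computed by the metrics pipeline, typically as pre-aggregated time series rather than per-call records, but the definitions are the same.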
As for distributed tracing data, it is currently generated with OpenTelemetry. OpenTelemetry is a project aimed at standardizing observability, and it ships many instrumentation implementations. In Java, it uses bytecode enhancement to generate Spans (the individual nodes of a distributed trace) at specified points in the running code. For Dubbo, for instance, OpenTelemetry provides a Dubbo filter that creates a Span when a request enters it. In essence, the code that generates Spans is woven into the JVM at startup via bytecode enhancement. Here is a demo of OzHera: Ozhera-demo (username: ozhera@ozhera.com, password: 123456)
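The filter pattern described above can be illustrated with a hand-written sketch: an interceptor that opens a Span around a call, records its parent linkage, and exports it when the call finishes. This is roughly what the agent-woven code does, but `TracingFilter` and this `Span` record are hypothetical simplifications, not OpenTelemetry's actual API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
import java.util.function.Supplier;

// Hypothetical simplification of a tracing filter, for illustration only.
public class TracingFilter {
    // A minimal span: trace/span/parent IDs plus operation name and timing.
    public record Span(String traceId, String spanId, String parentId,
                       String operation, long startNanos, long endNanos) {}

    private final List<Span> exported = new ArrayList<>();

    // Wraps the business call: record a span around it, even if it throws.
    public <T> T invoke(String operation, String traceId, String parentId,
                        Supplier<T> call) {
        String spanId = UUID.randomUUID().toString();
        long start = System.nanoTime();
        try {
            return call.get();
        } finally {
            exported.add(new Span(traceId, spanId, parentId, operation,
                                  start, System.nanoTime()));
        }
    }

    public List<Span> exported() { return exported; }
}
```

In the real agent, the trace and parent span IDs are propagated across processes in request headers (the W3C `traceparent` header), which is how spans from many machines stitch together into one end-to-end trace.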
Metric monitoring: I have a blind spot in performance optimization, and there's another team nearby working in a similar business direction, so I'm a bit curious. I hope to learn some basics about performance, capacity, and stability from this part.
Distributed tracing: this is what I'm most curious about. When a company runs a large number of machines, how does it achieve end-to-end tracing, and what technologies sit behind it?