Can you explain the roles of metric monitoring and distributed tracing in troubleshooting issues online? #138
For those new to observability systems, this is an excellent question. Observability systems are generally divided into metric monitoring, distributed tracing, and logging, collectively referred to as the "three pillars of observability."

What does metric monitoring give us? We divide metrics into two types.

The first type is request-based metrics, which reflect how an interface handles requests: queries per second (QPS), response time, the proportion of exceptions and slow calls, availability (SLA) as seen by the caller, and so on. These metrics faithfully depict the health of a given interface, and once alerting is configured, they are the source of most of the alerts you will receive.

The second type is runtime metrics, which reveal the operational state of the service itself. Examples include JVM metrics such as heap memory usage and reclamation, GC (garbage collection) stop-the-world (STW) counts and durations, CPU, disk, and network usage on the container or physical machine, and overall system load. When the logs show no obvious errors and the path of the abnormal request looks clean, these are the metrics to check for anomalies.

Consider a concrete scenario: you receive an alert that an API call timed out. You open distributed tracing to examine the call details along the path; if an anomaly is found on a span, the detailed exception information can be found in the event logs. This flow resolves over 90% of issues. Sometimes, however, the bottleneck shows up on the caller's side as a long response time even though server-side processing is fast. In that case you need to inspect the runtime state of the instances on both sides of the call: whether the CPU is saturated, whether GC STW pauses are excessively long, and so on.
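To make the request-based metrics concrete, here is a minimal sketch of how QPS, average latency, slow-call ratio, and availability could be computed over a window of recorded calls. The `RequestMetrics` class and its methods are hypothetical helpers for illustration only, not part of OzHera or OpenTelemetry.

```java
import java.util.List;

// Hypothetical sketch: deriving request-based metrics from a window of calls.
public class RequestMetrics {
    // One recorded call: latency in milliseconds and whether it failed.
    public record Call(long latencyMs, boolean error) {}

    // QPS: number of calls divided by the window length in seconds.
    public static double qps(List<Call> calls, long windowSeconds) {
        return (double) calls.size() / windowSeconds;
    }

    // Average response time over the window.
    public static double avgLatencyMs(List<Call> calls) {
        return calls.stream().mapToLong(Call::latencyMs).average().orElse(0);
    }

    // Proportion of calls slower than the given threshold.
    public static double slowCallRatio(List<Call> calls, long thresholdMs) {
        if (calls.isEmpty()) return 0;
        long slow = calls.stream().filter(c -> c.latencyMs() > thresholdMs).count();
        return (double) slow / calls.size();
    }

    // Availability (SLA): share of calls that completed without error.
    public static double availability(List<Call> calls) {
        if (calls.isEmpty()) return 1.0;
        long ok = calls.stream().filter(c -> !c.error()).count();
        return (double) ok / calls.size();
    }
}
```

In a real system these aggregates are computed by the metrics pipeline, typically as pre-aggregated time series rather than per-call records, but the definitions are the same.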
As for distributed tracing data, it is currently generated with OpenTelemetry. OpenTelemetry is a project aimed at standardizing observability, and it ships many instrumentation implementations. In Java, it uses bytecode enhancement to generate Spans (the individual nodes of a distributed trace) at specified points in the running code. For Dubbo, for instance, OpenTelemetry provides a Dubbo filter that creates a Span when a request enters it. In essence, the code that generates Spans is woven into the JVM at startup via bytecode enhancement. Here is a demo of OzHera: Ozhera-demo (username: ozhera@ozhera.com, password: 123456)
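The filter pattern described above can be illustrated with a hand-written sketch: an interceptor that opens a Span around a call, records its parent linkage, and exports it when the call finishes. This is roughly what the agent-woven code does, but `TracingFilter` and this `Span` record are hypothetical simplifications, not OpenTelemetry's actual API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
import java.util.function.Supplier;

// Hypothetical simplification of a tracing filter, for illustration only.
public class TracingFilter {
    // A minimal span: trace/span/parent IDs plus operation name and timing.
    public record Span(String traceId, String spanId, String parentId,
                       String operation, long startNanos, long endNanos) {}

    private final List<Span> exported = new ArrayList<>();

    // Wraps the business call: record a span around it, even if it throws.
    public <T> T invoke(String operation, String traceId, String parentId,
                        Supplier<T> call) {
        String spanId = UUID.randomUUID().toString();
        long start = System.nanoTime();
        try {
            return call.get();
        } finally {
            exported.add(new Span(traceId, spanId, parentId, operation,
                                  start, System.nanoTime()));
        }
    }

    public List<Span> exported() { return exported; }
}
```

In the real agent, the trace and parent span IDs are propagated across processes in request headers (the W3C `traceparent` header), which is how spans from many machines stitch together into one end-to-end trace.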
Metric monitoring: I have a blind spot in performance optimization, and there's another team nearby working in a similar business direction, so I'm a bit curious. I hope to learn some basics about performance, capacity, and stability from this part.
Distributed tracing: this is what I'm most curious about. When a company runs a large number of machines, how does it achieve end-to-end tracing, and what technologies sit behind it?