Can you explain the roles of metric monitoring and distributed tracing in troubleshooting issues online? #138

Open
DFZ-Icey opened this issue Nov 13, 2023 · 1 comment
Labels
good first issue Good for newcomers question Further information is requested

Comments

@DFZ-Icey

Metric Monitoring: Performance optimization is a blind spot for me, and a nearby team is working in a similar business direction, so I am curious. I hope to learn some basics about performance, capacity, and stability from this part.
Distributed Tracing: This is what I'm most curious about. When a company has a large number of machines, how does it achieve end-to-end tracing, and what technologies are behind it?

@sadadw1 sadadw1 added the question Further information is requested label Nov 14, 2023
@sadadw1
Collaborator

sadadw1 commented Nov 16, 2023

This is an excellent question for anyone new to observability systems. Observability is usually divided into metric monitoring, distributed tracing, and logging, often called the "three pillars" of observability. So what does metric monitoring give us?

First, we divide metrics into two types. The first type is request-based metrics, which describe what happens when an interface processes requests. Examples include Queries Per Second (QPS), response time, the proportion of failed and slow calls, and the availability seen by callers (SLA). These metrics directly reflect the health of a particular interface, and alerts configured on them are usually the ones you receive most often.
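To make the request-based metrics above concrete, here is a minimal sketch of how they can be derived from raw request samples in one time window. The class names (`RequestSample`, `RequestMetrics`, `RequestMetricsCalculator`) and the slow-call threshold are assumptions made for illustration; this is not how OzHera computes them internally.

```java
import java.util.List;

/** One finished request as a monitoring agent might record it (hypothetical shape). */
final class RequestSample {
    final long durationMillis;
    final boolean failed;

    RequestSample(long durationMillis, boolean failed) {
        this.durationMillis = durationMillis;
        this.failed = failed;
    }
}

/** Request-based metrics for one interface over one time window. */
final class RequestMetrics {
    final double qps;          // requests per second in the window
    final double avgMillis;    // mean response time
    final double errorRatio;   // failed calls / all calls
    final double slowRatio;    // calls slower than the threshold / all calls
    final double availability; // here simply 1 - errorRatio

    RequestMetrics(double qps, double avgMillis, double errorRatio, double slowRatio) {
        this.qps = qps;
        this.avgMillis = avgMillis;
        this.errorRatio = errorRatio;
        this.slowRatio = slowRatio;
        this.availability = 1.0 - errorRatio;
    }
}

final class RequestMetricsCalculator {
    private RequestMetricsCalculator() {}

    static RequestMetrics summarize(List<RequestSample> samples,
                                    double windowSeconds,
                                    long slowThresholdMillis) {
        if (windowSeconds <= 0) {
            throw new IllegalArgumentException("windowSeconds must be positive");
        }
        int total = samples.size();
        if (total == 0) {
            return new RequestMetrics(0, 0, 0, 0); // no traffic, nothing failed
        }
        long sumMillis = 0;
        int failed = 0;
        int slow = 0;
        for (RequestSample s : samples) {
            sumMillis += s.durationMillis;
            if (s.failed) failed++;
            if (s.durationMillis > slowThresholdMillis) slow++;
        }
        return new RequestMetrics(
                total / windowSeconds,
                (double) sumMillis / total,
                (double) failed / total,
                (double) slow / total);
    }

    public static void main(String[] args) {
        List<RequestSample> window = List.of(
                new RequestSample(20, false),
                new RequestSample(35, false),
                new RequestSample(1200, false), // slow but successful
                new RequestSample(50, true));   // fast but failed
        RequestMetrics m = summarize(window, 2.0, 1000);
        System.out.printf("qps=%.1f avg=%.2fms errors=%.2f slow=%.2f availability=%.2f%n",
                m.qps, m.avgMillis, m.errorRatio, m.slowRatio, m.availability);
    }
}
```

In practice these values are aggregated per interface and per window by the metrics backend, and alert rules compare them against thresholds.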

image

The second type is runtime metrics, which show the operational status of the service itself. Examples include JVM metrics such as heap memory usage and garbage collection (GC) activity, the number and duration of GC Stop-the-World (STW) pauses, CPU, disk, and network usage on containers or physical machines, and overall system load. When neither the logs nor the path of the abnormal request shows an obvious error, check whether these metrics indicate an anomaly.
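As a rough illustration of where runtime metrics like these come from, the sketch below reads a few of them from the JVM's standard `java.lang.management` MXBeans. The `RuntimeSnapshot` class is a hypothetical name used for this example. Note that `getCollectionTime()` reports accumulated collection time, which only approximates STW pause time for some collectors.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.lang.management.OperatingSystemMXBean;

/** A point-in-time view of a few JVM and host metrics (hypothetical holder class). */
final class RuntimeSnapshot {
    final long heapUsedBytes;
    final long heapMaxBytes;        // -1 when the maximum is undefined
    final long gcCount;             // collections since JVM start, all collectors
    final long gcTimeMillis;        // accumulated collection time, all collectors
    final double systemLoadAverage; // negative when the platform does not report it
    final int availableProcessors;

    private RuntimeSnapshot(long heapUsedBytes, long heapMaxBytes, long gcCount,
                            long gcTimeMillis, double systemLoadAverage,
                            int availableProcessors) {
        this.heapUsedBytes = heapUsedBytes;
        this.heapMaxBytes = heapMaxBytes;
        this.gcCount = gcCount;
        this.gcTimeMillis = gcTimeMillis;
        this.systemLoadAverage = systemLoadAverage;
        this.availableProcessors = availableProcessors;
    }

    static RuntimeSnapshot capture() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();

        long gcCount = 0;
        long gcTime = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both getters return -1 when the collector does not support the value.
            long count = gc.getCollectionCount();
            long time = gc.getCollectionTime();
            if (count > 0) gcCount += count;
            if (time > 0) gcTime += time;
        }

        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        return new RuntimeSnapshot(heap.getUsed(), heap.getMax(), gcCount, gcTime,
                os.getSystemLoadAverage(), os.getAvailableProcessors());
    }

    public static void main(String[] args) {
        RuntimeSnapshot s = capture();
        System.out.printf("heap used=%d max=%d, gc count=%d time=%dms, load=%.2f, cpus=%d%n",
                s.heapUsedBytes, s.heapMaxBytes, s.gcCount, s.gcTimeMillis,
                s.systemLoadAverage, s.availableProcessors);
    }
}
```

A monitoring agent would typically take such snapshots on a fixed interval and report the differences between consecutive snapshots, so that a sudden rise in GC time or load stands out on a dashboard.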

image

image

Take a concrete scenario: we receive an alert that an API call timed out. We can open the trace for that request in the distributed tracing view and examine each call along its path. If something went wrong on the path, the detailed exception information can be found in the event logs. This approach can resolve over 90% of issues.

image

However, sometimes the caller sees a long response time even though the server processed the request quickly. In such cases, inspect the runtime metrics of the instances on both sides of the call, for example whether the CPU is saturated or whether GC STW pauses are excessively long.
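The reasoning above can be sketched as a small heuristic: compare the duration measured on the caller with the duration measured on the server for the same call, and look at where most of the time went. The `CallGapAnalyzer` class, its `Suspect` categories, and the "gap larger than server time" rule are assumptions chosen for illustration, not a rule taken from OzHera.

```java
/** Decides where to look first when a call is slow (hypothetical heuristic). */
final class CallGapAnalyzer {
    enum Suspect {
        NONE,              // the call was not slow from the caller's point of view
        SERVER_PROCESSING, // most of the time was spent inside the server
        CALLER_OR_NETWORK  // most of the time was spent outside the server
    }

    private CallGapAnalyzer() {}

    /**
     * @param clientMillis        duration measured by the caller
     * @param serverMillis        duration measured by the server for the same call
     * @param slowThresholdMillis caller-side duration above which the call counts as slow
     */
    static Suspect classify(long clientMillis, long serverMillis, long slowThresholdMillis) {
        if (clientMillis < slowThresholdMillis) {
            return Suspect.NONE;
        }
        // Time not explained by server processing: network, queueing on the caller,
        // or pauses such as GC STW and CPU starvation on the caller's instance.
        long gapMillis = clientMillis - serverMillis;
        return gapMillis > serverMillis ? Suspect.CALLER_OR_NETWORK : Suspect.SERVER_PROCESSING;
    }

    public static void main(String[] args) {
        System.out.println(classify(1200, 30, 1000));   // server was fast, caller saw 1.2 s
        System.out.println(classify(1200, 1100, 1000)); // server itself took most of the time
    }
}
```

When the result points at the caller or the network, the next step is the one described above: check CPU usage and GC pauses on the instances at both ends of the call.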

As for the distributed tracing data, it is currently generated using OpenTelemetry. OpenTelemetry is an open-source project that aims to establish observability standards and has implementations for many languages. In Java, it uses bytecode enhancement to generate Spans (the individual nodes of a distributed trace) at specified positions in the running code. For instance, for Dubbo, the OpenTelemetry instrumentation adds a Dubbo filter that creates a Span when a request passes through it. In essence, you can think of it as span-generating code being woven into the application at JVM startup through bytecode enhancement.
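To show the idea behind end-to-end tracing without depending on any library, here is a simplified, hand-written sketch. It is not the OpenTelemetry API and not the real Dubbo filter: `TraceContext`, `FinishedSpan`, and `TracingFilter` are hypothetical names. The header layout follows the W3C `traceparent` format (`00-<trace id>-<span id>-<flags>`). Every service keeps the trace id, creates its own span id, and passes the header on to the next hop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Function;

/** Identifiers that travel with a request from service to service. */
final class TraceContext {
    final String traceId; // 32 hex chars, shared by every span of one request
    final String spanId;  // 16 hex chars, unique per span

    TraceContext(String traceId, String spanId) {
        this.traceId = traceId;
        this.spanId = spanId;
    }

    private static String hex64() {
        return String.format("%016x", ThreadLocalRandom.current().nextLong());
    }

    static TraceContext newRoot() {
        return new TraceContext(hex64() + hex64(), hex64());
    }

    TraceContext newChild() {
        return new TraceContext(traceId, hex64());
    }

    String toTraceparent() {
        return "00-" + traceId + "-" + spanId + "-01"; // version 00, sampled flag set
    }

    /** Parses a traceparent header; this sketch only checks the overall shape. */
    static TraceContext fromTraceparent(String header) {
        String[] parts = header.split("-");
        if (parts.length != 4 || parts[1].length() != 32 || parts[2].length() != 16) {
            throw new IllegalArgumentException("malformed traceparent: " + header);
        }
        return new TraceContext(parts[1], parts[2]);
    }
}

/** What a filter would hand to the tracing backend once a call has finished. */
final class FinishedSpan {
    final String name;
    final String traceId;
    final String spanId;
    final String parentSpanId; // null for the first span of a trace
    final long durationNanos;
    final boolean failed;

    FinishedSpan(String name, String traceId, String spanId, String parentSpanId,
                 long durationNanos, boolean failed) {
        this.name = name;
        this.traceId = traceId;
        this.spanId = spanId;
        this.parentSpanId = parentSpanId;
        this.durationNanos = durationNanos;
        this.failed = failed;
    }
}

/** Wraps a call in a span, the way an RPC filter added by instrumentation might. */
final class TracingFilter {
    final List<FinishedSpan> exported = new ArrayList<>();

    /** The wrapped call receives the header it should send to downstream services. */
    <T> T invoke(String name, String incomingTraceparent, Function<String, T> call) {
        TraceContext parent = incomingTraceparent == null
                ? null : TraceContext.fromTraceparent(incomingTraceparent);
        TraceContext current = parent == null ? TraceContext.newRoot() : parent.newChild();
        long start = System.nanoTime();
        boolean failed = true;
        try {
            T result = call.apply(current.toTraceparent());
            failed = false;
            return result;
        } finally {
            exported.add(new FinishedSpan(name, current.traceId, current.spanId,
                    parent == null ? null : parent.spanId,
                    System.nanoTime() - start, failed));
        }
    }

    public static void main(String[] args) {
        TracingFilter filter = new TracingFilter();
        // Service A handles a request and calls service B, forwarding the header.
        filter.invoke("serviceA", null,
                header -> filter.invoke("serviceB", header, h -> "ok"));
        for (FinishedSpan s : filter.exported) {
            System.out.println(s.name + " trace=" + s.traceId
                    + " span=" + s.spanId + " parent=" + s.parentSpanId);
        }
    }
}
```

Because both spans share one trace id and the second span records the first as its parent, a backend can reassemble the full call path no matter how many machines the request crossed. With bytecode enhancement, wrappers of this kind are injected automatically instead of being written by hand.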

Here is a demo of OzHera: Ozhera-demo (username: ozhera@ozhera.com, password: 123456)

@sadadw1 sadadw1 added the good first issue Good for newcomers label Nov 20, 2023