prometheus metrics #195
I believe there are some initiatives on metrics underway. What do you think would be the minimum set of metrics that Divolte should expose? Failure to deliver events downstream should be logged at either …
I would be most interested in latency timings of the sinks; a bucketed histogram would be a practical way to capture those, I guess. For monitoring purposes I think it would also be nice to have an indication of health for each sink.
I believe that analysing logs for this would be error-prone; I would much prefer to rely on Prometheus to handle this job. Hope this helps.
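For illustration, a minimal sketch of such a bucketed latency histogram using the Prometheus Java client (simpleclient); the metric name, label name, and `deliver` helper are hypothetical, not anything Divolte ships:

```java
import io.prometheus.client.Histogram;

// Hypothetical metric: per-sink delivery latency, bucketed so that
// histogram_quantile() can be used on the Prometheus side.
static final Histogram SINK_DELIVERY_LATENCY = Histogram.build()
    .name("divolte_sink_delivery_latency_seconds")
    .help("Time taken to deliver an event to a sink.")
    .labelNames("sink")
    .buckets(0.005, 0.01, 0.05, 0.1, 0.5, 1, 5)
    .register();

void deliver(final String sinkName, final Runnable delivery) {
    // Time the delivery and record it under the sink's label.
    final Histogram.Timer timer = SINK_DELIVERY_LATENCY.labels(sinkName).startTimer();
    try {
        delivery.run();
    } finally {
        timer.observeDuration();
    }
}
```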
What would you consider the latency for a sink? Is this the latency from event generation to event delivery? Or just RPC latency for the upstream delivery?
Sorry for the delay. I think that delivery upstream is my main concern.
I'm looking into the details on this a bit. Adding metrics to Divolte can be separated into two concerns: 1) instrumentation and 2) reporting. For the sake of supporting different deployment architectures and monitoring solutions, Divolte needs to support multiple reporting backends, ideally based on a single instrumentation code path. To cover a broad range of contemporary deployment scenarios, we should consider at least JMX and Prometheus as backends.
Several (collections of) libraries for both instrumentation and reporting exist, with overlapping functionality. The two most serious contenders appear to be Dropwizard Metrics and Prometheus' own instrumentation library (simpleclient). Instrumentation code written against these libraries differs minimally, although there are a few notable, challenging differences.
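To illustrate how similar the instrumentation code is, here is roughly the same per-sink timing as the earlier Prometheus sketch, written against Dropwizard Metrics (names again hypothetical):

```java
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;

// A shared registry; Divolte would hold one of these centrally.
static final MetricRegistry REGISTRY = new MetricRegistry();

void deliver(final String sinkName, final Runnable delivery) {
    // Timer.Context is Closeable; closing it records the elapsed time.
    final Timer timer = REGISTRY.timer(MetricRegistry.name("sink", sinkName, "delivery"));
    try (Timer.Context ignored = timer.time()) {
        delivery.run();
    }
}
```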
For the sake of supporting JMX as first-class, one implementation direction would be to instrument the Divolte code using Metrics and to create a Prometheus-compatible HTTP endpoint based on the collected metrics in Divolte. This is currently my preferred path. Doing the reverse, adding JMX reporting support to Prometheus' instrumentation library, would probably require low-level interactions with the MBean API, which I fear would be time-consuming. Finally, I want to develop at least an intuition, and preferably some insights, into the overhead of instrumentation, both under normal load and under contention. I'll likely work on some of this first.
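As a sketch of that direction, assuming the `simpleclient_dropwizard` bridge (and noting that the `JmxReporter` package location differs between Metrics 3.x and 4.x), the Metrics-to-Prometheus path could look like this:

```java
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.jmx.JmxReporter;
import io.prometheus.client.dropwizard.DropwizardExports;

final MetricRegistry registry = new MetricRegistry();

// JMX as first-class: Metrics ships a JMX reporter out of the box.
JmxReporter.forRegistry(registry).build().start();

// Bridge the same registry into Prometheus' collector model, so a single
// instrumentation code path feeds both reporting backends.
new DropwizardExports(registry).register();
```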
One more thing to note is that you'll want a separate server endpoint for exposing the metrics (on a different port number than the primary one), because it's not always a good idea to expose your metrics to the internet or rely on load balancer filtering rules to exclude external requests for metrics. |
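With the Prometheus simpleclient that separation is straightforward; the port number below is an arbitrary choice for illustration:

```java
import io.prometheus.client.exporter.HTTPServer;
import java.io.IOException;

public final class MetricsEndpoint {
    public static void main(final String[] args) throws IOException {
        // Serve /metrics on a dedicated port (9090 here), separate from the
        // main event-collection listener, so it never needs to be exposed to
        // the internet or filtered at the load balancer.
        new HTTPServer(9090);
    }
}
```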
Can you update this ticket? What about the work done on issue #11?
We had an issue where one of the sinks was not properly configured, which resulted in Divolte retrying delivery to it non-stop. Meanwhile the other sink was working and Divolte was happily taking in traffic, so we did not notice right away.
One way of detecting this situation is to have Divolte report its health (in a more granular manner than a ping) through metrics. Since we are using Prometheus and Kubernetes, it would be nice to have some metrics as well as an idea of how happy Divolte is at any given time.
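One possible shape for this, as a sketch with hypothetical metric names: a per-sink up/down gauge plus a failure counter that alerting rules can watch for steady growth:

```java
import io.prometheus.client.Counter;
import io.prometheus.client.Gauge;

// Hypothetical metric: 1 = sink healthy, 0 = sink failing.
static final Gauge SINK_UP = Gauge.build()
    .name("divolte_sink_up")
    .help("Whether the sink is currently delivering successfully.")
    .labelNames("sink")
    .register();

// Hypothetical metric: monotonically increasing failure count per sink.
static final Counter SINK_FAILURES = Counter.build()
    .name("divolte_sink_delivery_failures_total")
    .help("Total number of failed delivery attempts, per sink.")
    .labelNames("sink")
    .register();

void onDeliveryResult(final String sinkName, final boolean succeeded) {
    if (succeeded) {
        SINK_UP.labels(sinkName).set(1);
    } else {
        SINK_UP.labels(sinkName).set(0);
        SINK_FAILURES.labels(sinkName).inc();
    }
}
```

An alert on the gauge dropping to 0 (or on the failure counter's rate) would have caught the misconfigured sink above even while the other sink kept traffic flowing.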