This repository has been archived by the owner on Apr 22, 2022. It is now read-only.

prometheus metrics #195

Open
leo-baltus opened this issue Mar 13, 2018 · 7 comments

@leo-baltus

We had an issue where one of the sinks was not properly configured, which resulted in Divolte trying non-stop to deliver to it. Meanwhile the other sink was working and Divolte was happily taking in traffic, so we did not notice right away.

One way of detecting this situation is to have Divolte report its health (in a more granular manner than a ping) through metrics. Since we are using Prometheus and Kubernetes, it would be nice to have some metrics exposed, giving an idea of how healthy Divolte is at any given time.

@friso
Collaborator

friso commented Mar 14, 2018

I believe there are some initiatives on metrics underway. What do you think would be the minimum set of metrics that Divolte should expose?

Failure to deliver events downstream should be logged at either WARN or ERROR level. Depending on your logging setup, you could already capture those and report them as metrics.
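
For example, with a logback-based setup (a rough sketch, not Divolte code: it uses Dropwizard Metrics' logback instrumentation and leaves the reporting side out), log events per level can be turned into meters:

import org.slf4j.LoggerFactory;

import ch.qos.logback.classic.Logger;
import ch.qos.logback.classic.LoggerContext;

import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.logback.InstrumentedAppender;

public class LogLevelMetricsSketch {
    public static void main(String[] args) {
        MetricRegistry registry = new MetricRegistry();

        // Attach an instrumented appender to the root logger; it keeps a meter
        // per log level, so WARN/ERROR rates can be reported to a metrics backend.
        LoggerContext context = (LoggerContext) LoggerFactory.getILoggerFactory();
        Logger root = context.getLogger(Logger.ROOT_LOGGER_NAME);

        InstrumentedAppender appender = new InstrumentedAppender(registry);
        appender.setContext(context);
        appender.start();
        root.addAppender(appender);
    }
}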

@leo-baltus
Author

I would be most interested in latency timings of the sinks; a bucketed histogram would be a practical way to expose those, I guess.

For monitoring purposes I think it would be nice to have an indication of health for each sink:

# HELP divolte_sink_health indicates if the sink is up (1) or down (0)
# TYPE divolte_sink_health gauge
divolte_sink_health{sink="hadoop"} 1
divolte_sink_health{sink="kafka"} 0

I believe that analysing logs would be error prone; I would much prefer to rely on Prometheus to handle this job.

Hope this helps.

@friso
Collaborator

friso commented Apr 3, 2018

What would you consider the latency for a sink? Is this the latency from event generation to event delivery? Or just RPC latency for the upstream delivery?

@leo-baltus
Author

Sorry for the delay. I think that delivery to upstream is my main concern.
Hope this helps.

@friso
Collaborator

friso commented Apr 20, 2018

I'm looking into the details on this a bit.

Addition of metrics to Divolte can be separated into two concerns: 1) instrumentation and 2) reporting.

For the sake of supporting different deployment architectures and monitoring solutions, Divolte needs to support multiple reporting backends. This should ideally be based on a single instrumentation code path.

To cover a broad range of contemporary deployment scenarios, we should consider the following backends:

  • JMX
  • statsd (this enables support for some popular hosted metrics solutions, such as Datadog)
  • Prometheus (natively, without push gateway)
  • graphite
  • logging / console / local CSV writing (these make sense for local debugging and testing; see the sketch just below)
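
Several of these backends can hang off a single instrumentation code path. For instance, with the Metrics library (discussed below), multiple reporters can be attached to one registry; a minimal sketch with a made-up metric name and arbitrary interval:

import java.util.concurrent.TimeUnit;

import com.codahale.metrics.ConsoleReporter;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.jmx.JmxReporter; // metrics-jmx module; the package differs in older Metrics versions

public class MultiBackendSketch {
    public static void main(String[] args) throws InterruptedException {
        // Single instrumentation code path: everything is registered here once...
        MetricRegistry registry = new MetricRegistry();
        registry.counter("divolte.sink.kafka.failed.deliveries").inc(); // hypothetical metric name

        // ...and multiple reporting backends are attached to the same registry.
        JmxReporter jmx = JmxReporter.forRegistry(registry).build();
        jmx.start();

        ConsoleReporter console = ConsoleReporter.forRegistry(registry)
                .convertRatesTo(TimeUnit.SECONDS)
                .convertDurationsTo(TimeUnit.MILLISECONDS)
                .build();
        console.start(10, TimeUnit.SECONDS);

        Thread.sleep(30_000); // keep the process alive long enough to observe a few reports
    }
}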

Several (collections of) libraries for both instrumentation and reporting exist, with overlapping functionality. The two most serious contenders appear to be:

  • The Prometheus Java client library
  • The Metrics library (Dropwizard Metrics)

Instrumentation code with these libraries differs minimally. The notable differences that pose challenges are:

  • The Prometheus library doesn't support JMX as a backend
  • The Metrics library doesn't support Prometheus as a backend (further complicated by the fact that Prometheus is pull based, whereas all of Metrics' supported backends are push based)
  • Prometheus' metric types overlap with those available in Metrics, but are not the same. Some mapping is required to go from one to the other.

For the sake of supporting JMX as a first-class backend, one implementation direction would be to instrument the Divolte code using Metrics and create a Prometheus-compatible HTTP endpoint based on the collected metrics in Divolte. This is currently my preferred path. Doing the reverse, adding JMX reporting support to Prometheus' instrumentation library, would probably require low-level interaction with the MBean API, which I fear would be time consuming.
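
Roughly, that direction would look something like this (a minimal sketch, not Divolte's actual wiring: it uses the Prometheus simpleclient Dropwizard bridge, and the metric name and port are illustrative):

import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;

import io.prometheus.client.CollectorRegistry;
import io.prometheus.client.dropwizard.DropwizardExports;
import io.prometheus.client.exporter.HTTPServer;

public class PrometheusBridgeSketch {
    public static void main(String[] args) throws Exception {
        MetricRegistry metrics = new MetricRegistry();

        // Bridge all Dropwizard metrics into the Prometheus default registry and
        // serve them over HTTP on a dedicated metrics port for scraping.
        CollectorRegistry.defaultRegistry.register(new DropwizardExports(metrics));
        HTTPServer scrapeEndpoint = new HTTPServer(9090); // hypothetical metrics port

        // Hypothetical instrumentation: time upstream delivery for one sink.
        Timer kafkaDelivery = metrics.timer("divolte.sink.kafka.delivery");
        try (Timer.Context ignored = kafkaDelivery.time()) {
            Thread.sleep(5); // stand-in for the actual delivery call
        }
    }
}

If I recall correctly, the bridge exposes Dropwizard timers and histograms as Prometheus summaries rather than histograms, which is part of the type mapping mentioned above.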

Finally, I want to develop at least an intuition, and preferably some insight, into the overhead of instrumentation both under normal load and under contention.

I'll likely work on some of this first.

@friso
Collaborator

friso commented Apr 20, 2018

One more thing to note is that you'll want a separate server endpoint for exposing the metrics (on a different port number than the primary one), because it's not always a good idea to expose your metrics to the internet or rely on load balancer filtering rules to exclude external requests for metrics.

@Jimmy-Newtron

Jimmy-Newtron commented Apr 19, 2021

Can you update this ticket?
I am also interested in retrieving some metrics, but apparently this issue is not a priority.
Can you confirm?

What about the work done on issue #11?
