This repository has been archived by the owner on Apr 22, 2022. It is now read-only.

prometheus metrics #195

Open
leo-baltus opened this issue Mar 13, 2018 · 7 comments

@leo-baltus

We had an issue where one of the sinks was not properly configured, which resulted in Divolte trying non-stop to deliver to it. Meanwhile the other sink was working and Divolte was happily taking in traffic, so we did not notice right away.

One way of detecting this situation is to have Divolte report its health (in a more granular manner than a ping) through metrics. Since we are using Prometheus and Kubernetes, it would be nice to have some metrics exposed, giving an idea of how healthy Divolte is at any given time.

@friso
Collaborator

friso commented Mar 14, 2018

I believe there are some initiatives on metrics underway. What do you think would be the minimum set of metrics that Divolte should expose?

Failure to deliver events downstream should be logged at either WARN or ERROR level. Depending on your logging setup, you could already capture those and report them as metrics.
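
For example, with a logback-based setup (a rough sketch, not Divolte code: it uses Dropwizard Metrics' logback instrumentation and leaves the reporting side out), log events per level can be turned into meters:

import org.slf4j.LoggerFactory;

import ch.qos.logback.classic.Logger;
import ch.qos.logback.classic.LoggerContext;

import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.logback.InstrumentedAppender;

public class LogLevelMetricsSketch {
    public static void main(String[] args) {
        MetricRegistry registry = new MetricRegistry();

        // Attach an instrumented appender to the root logger; it keeps a meter
        // per log level, so WARN/ERROR rates can be reported to a metrics backend.
        LoggerContext context = (LoggerContext) LoggerFactory.getILoggerFactory();
        Logger root = context.getLogger(Logger.ROOT_LOGGER_NAME);

        InstrumentedAppender appender = new InstrumentedAppender(registry);
        appender.setContext(context);
        appender.start();
        root.addAppender(appender);
    }
}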

@leo-baltus
Author

I would be most interested in latency timings of the sinks; a bucketed histogram would be a practical way to expose those, I guess.

For monitoring purposes I think it would be nice to have an indication of health for each sink:

# HELP divolte_sink_health indicates if the sink is up (1) or down (0)
# TYPE divolte_sink_health gauge
divolte_sink_health{sink="hadoop"} 1
divolte_sink_health{sink="kafka"} 0

I believe that analysing logs would be error prone; I would much prefer to rely on Prometheus to handle this job.

Hope this helps.

@friso
Collaborator

friso commented Apr 3, 2018

What would you consider the latency for a sink? Is this the latency from event generation to event delivery? Or just RPC latency for the upstream delivery?

@leo-baltus
Author

Sorry for the delay. I think that delivery to upstream is my main concern.
Hope this helps.

@friso
Collaborator

friso commented Apr 20, 2018

I'm looking into the details on this a bit.

Addition of metrics to Divolte can be separated into two concerns: 1) instrumentation and 2) reporting.

For the sake of supporting different deployment architectures and monitoring solutions, Divolte needs to support multiple reporting backends. This should ideally be based on a single instrumentation code path.

To cover a broad range of contemporary deployment scenarios, we should consider the following backends:

  • JMX
  • statsd (this enables support for some popular hosted metrics solutions, such as Datadog)
  • Prometheus (natively, without push gateway)
  • graphite
  • logging / console / local CSV writing (these make sense for local debugging and testing; see the sketch just below)
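
Several of these backends can hang off a single instrumentation code path. For instance, with the Metrics library (discussed below), multiple reporters can be attached to one registry; a minimal sketch with a made-up metric name and arbitrary interval:

import java.util.concurrent.TimeUnit;

import com.codahale.metrics.ConsoleReporter;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.jmx.JmxReporter; // metrics-jmx module; the package differs in older Metrics versions

public class MultiBackendSketch {
    public static void main(String[] args) throws InterruptedException {
        // Single instrumentation code path: everything is registered here once...
        MetricRegistry registry = new MetricRegistry();
        registry.counter("divolte.sink.kafka.failed.deliveries").inc(); // hypothetical metric name

        // ...and multiple reporting backends are attached to the same registry.
        JmxReporter jmx = JmxReporter.forRegistry(registry).build();
        jmx.start();

        ConsoleReporter console = ConsoleReporter.forRegistry(registry)
                .convertRatesTo(TimeUnit.SECONDS)
                .convertDurationsTo(TimeUnit.MILLISECONDS)
                .build();
        console.start(10, TimeUnit.SECONDS);

        Thread.sleep(30_000); // keep the process alive long enough to observe a few reports
    }
}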

Several (collections of) libraries for both instrumentation and reporting exist, with overlapping functionality. The two most serious contenders appear to be:

  • The Prometheus Java client library
  • The Metrics library (Dropwizard Metrics)

Instrumentation code with these libraries differs minimally. The notable differences that pose challenges are:

  • The Prometheus library doesn't support JMX as a backend
  • The Metrics library doesn't support Prometheus as a backend (further complicated by the fact that Prometheus is pull based, whereas all of Metrics' supported backends are push based)
  • Prometheus' metric types overlap with those available in Metrics, but are not the same. Some mapping is required to go from one to the other.

For the sake of supporting JMX as a first-class backend, one implementation direction would be to instrument the Divolte code using Metrics and create a Prometheus-compatible HTTP endpoint based on the collected metrics in Divolte. This is currently my preferred path. Doing the reverse, adding JMX reporting support to Prometheus' instrumentation library, would probably require low-level interaction with the MBean API, which I fear would be time consuming.
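
Roughly, that direction would look something like this (a minimal sketch, not Divolte's actual wiring: it uses the Prometheus simpleclient Dropwizard bridge, and the metric name and port are illustrative):

import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;

import io.prometheus.client.CollectorRegistry;
import io.prometheus.client.dropwizard.DropwizardExports;
import io.prometheus.client.exporter.HTTPServer;

public class PrometheusBridgeSketch {
    public static void main(String[] args) throws Exception {
        MetricRegistry metrics = new MetricRegistry();

        // Bridge all Dropwizard metrics into the Prometheus default registry and
        // serve them over HTTP on a dedicated metrics port for scraping.
        CollectorRegistry.defaultRegistry.register(new DropwizardExports(metrics));
        HTTPServer scrapeEndpoint = new HTTPServer(9090); // hypothetical metrics port

        // Hypothetical instrumentation: time upstream delivery for one sink.
        Timer kafkaDelivery = metrics.timer("divolte.sink.kafka.delivery");
        try (Timer.Context ignored = kafkaDelivery.time()) {
            Thread.sleep(5); // stand-in for the actual delivery call
        }
    }
}

If I recall correctly, the bridge exposes Dropwizard timers and histograms as Prometheus summaries rather than histograms, which is part of the type mapping mentioned above.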

Finally, I want to develop at least an intuition, and preferably some insight, into the overhead of instrumentation both under normal load and under contention.

I'll likely work on some of this first.

@friso
Collaborator

friso commented Apr 20, 2018

One more thing to note is that you'll want a separate server endpoint for exposing the metrics (on a different port number than the primary one), because it's not always a good idea to expose your metrics to the internet or rely on load balancer filtering rules to exclude external requests for metrics.

@Jimmy-Newtron

Jimmy-Newtron commented Apr 19, 2021

Can you update this ticket?
I am also interested in retrieving some metrics, but apparently this issue is not a priority.
Can you confirm?

What about the work done on issue #11?
