Per #3, thought I'd open a PR and we can move specific discussions here. Consider this an early draft; here's what the added panels look like at the moment.
A short running job was added to show the difference.
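For context, the two demo jobs are presumably just sleeps; something like this (illustrative - the only details taken from this PR are the `random.randint(2, 10)` runtime and the div-10 short variant mentioned further down):

```python
import random
import time


def long_running_job():
    # Sleeps 2-10 seconds, per the random.randint(2, 10) runtime noted below.
    time.sleep(random.randint(2, 10))


def short_running_job():
    # The short variant: a tenth of the long job's range.
    time.sleep(random.randint(2, 10) / 10)
```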
Some decisions that were made:

- Using `SummaryMetricFamily` - the other options would have been `GaugeMetricFamily` or `HistogramMetricFamily`. Histograms require pre-defined buckets, which I don't think would be a good fit since the runtimes wouldn't be generic; a Summary instead uses rolling time windows. It adds the data individually per scraped job, so the `count_value` is `1` and the `sum_value` is only that job's calculated runtime. That's a bit strange - Gauge Metrics may work instead. (A rough sketch of what I mean follows this list.)
- Currently it scrapes the 3 most recently completed jobs per queue, which gives the above panels. Will look at option flags for this.
- Timestamps are specified when adding job runtime data, taken from `job.ended_at`. Since Prometheus is append-only, this means that when we scrape jobs which completed prior to the last scrape, those old jobs will not be added; Prometheus will instead throw an `Error on ingesting out-of-order samples` and drop them. I believe this means it never stores jobs as duplicates and only displays the latest data. Could also use a better approach for this.
- Data labels are the `job.func_name` and the queue.
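Roughly the shape of the collector I'm describing, for concreteness - the class name is made up and the registry/fetch calls are my best reading of the RQ API, so treat it as a sketch rather than the actual diff:

```python
from datetime import timezone

from prometheus_client.core import SummaryMetricFamily


class RQJobCollector:
    """Sketch of a custom collector emitting one Summary sample per scraped job."""

    def __init__(self, queues, jobs_per_queue=3):
        self.queues = queues
        self.jobs_per_queue = jobs_per_queue

    def collect(self):
        family = SummaryMetricFamily(
            'rq_job_runtime_seconds',
            'Runtime of recently completed RQ jobs',
            labels=['queue', 'func_name'],
        )
        for queue in self.queues:
            registry = queue.finished_job_registry
            # Take the tail of the registry's sorted set: with a uniform TTL,
            # the highest-scored (latest-expiring) entries are the most
            # recently completed jobs.
            for job_id in registry.get_job_ids(-self.jobs_per_queue, -1):
                job = queue.fetch_job(job_id)
                if job is None or job.started_at is None or job.ended_at is None:
                    continue
                runtime = (job.ended_at - job.started_at).total_seconds()
                # One sample per job: count_value=1, sum_value=that job's
                # runtime. Timestamping with job.ended_at is what makes
                # Prometheus reject already-seen samples as out-of-order
                # on the next scrape.
                family.add_metric(
                    [queue.name, job.func_name],
                    count_value=1,
                    sum_value=runtime,
                    # Assumes RQ's naive datetimes are UTC.
                    timestamp=job.ended_at.replace(tzinfo=timezone.utc).timestamp(),
                )
        yield family
```

Registering it would just be `REGISTRY.register(RQJobCollector(queues))` against the default `prometheus_client` registry.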
Still left to do:

- `enqueued_at` in a different metric

Other things:
- To your earlier point #3 (comment), if the job completes and is removed before a scrape, there is still no information about it. So scrape intervals have to fall within the job's TTL, which should be the case unless the TTL was manually set to be pretty short.
- I would say I'm still uncertain about using a Summary Metric - I think it's meant for when the exporter can calculate something like total response time across a whole group of responses? In other words, when the `count_value` is > 1 and the `sum_value` is the sum of those individual samples. (A sketch of that aggregated shape follows this list.) Running `avg()` on this metric seems to make logical sense for now: `long_running_job`s complete within `random.randint(2, 10)` seconds as displayed, and `short_running_job`s complete within a tenth of that.
- I think using timestamps does get around issues with the data being inaccurate, as it should hold only the latest jobs completed since the last scrape. If that works, it should mean not having to scrape all jobs from the finished and failed registries, and we can instead focus on answering a "what's the performance like right now" question.
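If the `count_value` > 1 semantics turn out to matter, one hypothetical alternative (not what the PR currently does) would be to fold everything seen in a single scrape into one sample per queue/function pair, so the sum/count relationship means what a Summary intends:

```python
from collections import defaultdict

from prometheus_client.core import SummaryMetricFamily


def aggregate_runtimes(jobs):
    """Hypothetical: one Summary sample per (queue, func_name) per scrape,
    rather than one sample per job."""
    totals = defaultdict(lambda: [0, 0.0])  # key -> [job count, total seconds]
    for job in jobs:
        runtime = (job.ended_at - job.started_at).total_seconds()
        key = (job.origin, job.func_name)  # job.origin is the queue name
        totals[key][0] += 1
        totals[key][1] += runtime

    family = SummaryMetricFamily(
        'rq_job_runtime_seconds',
        'Aggregated runtime of jobs completed since the last scrape',
        labels=['queue', 'func_name'],
    )
    for (queue_name, func_name), (count, total) in totals.items():
        family.add_metric([queue_name, func_name], count_value=count, sum_value=total)
    return family
```

At query time, `rq_job_runtime_seconds_sum / rq_job_runtime_seconds_count` would then give mean runtime per queue/function directly, rather than relying on `avg()` over per-job samples.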
Anyways, let me know your thoughts and we can go from there.