
Export job start/end times #3

Open
kevinkle opened this issue Jul 30, 2020 · 6 comments

Comments

@kevinkle

Firstly, RQ Exporter has been excellent!

Do you think it'd be possible to include an option to export job start/end times? I believe these should be accessible via job.started_at and job.ended_at per the RQ docs.
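
For reference, here's a minimal sketch of what I mean, using RQ's Job API (assuming a recent RQ version; the job ID below is just a placeholder):

```python
from redis import Redis
from rq.job import Job

# Minimal sketch: read the start/end times off a fetched job.
# 'my-job-id' is only a placeholder, not a real ID.
job = Job.fetch('my-job-id', connection=Redis())

# Both attributes are None until the job has actually started/finished.
if job.started_at and job.ended_at:
    duration = (job.ended_at - job.started_at).total_seconds()
    print(f'{job.func_name} took {duration:.2f}s')
```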

@mdawar
Owner

mdawar commented Jul 31, 2020

Hi,

Thank you for your suggestion. How do you think these job start/end times should be exposed?
The only thing I can think of right now is exposing the job duration, but I haven't checked how to do it yet. If you have any suggestions about how we should implement this, please provide more details.

I'm not a Prometheus expert, but I think this level of detail about the jobs may be more suitable for another monitoring system; for example, check out:

But I haven't checked if these projects expose the job start/end times.

@kevinkle
Author

kevinkle commented Jul 31, 2020

I think there are two kinds of monitoring systems I've seen so far for RQ:

  1. Snapshots: what's going on right now. For example, using RQ Dashboard to check which workers are registered to which queue, what errors are coming up, etc. I hadn't looked at RQ Monitor before, but at first glance it seems to fall in this category, plus a bunch of nice new features.
  2. Load monitoring: what the current processing capacity is at. For example, using RQ Exporter's Grafana dashboard to see the % of worker usage over time.

The start/end times aren't as important as the job duration, like you suggested. The use cases I'm thinking of would be:

  • Say I have a set of queues, some running at high priority, where I care about the response time. Are the jobs on these specific queues, at some point in time and for whatever reason, completing slower than normal?
  • Another point would be: is my job.func_name job, for whatever reason, suddenly running much slower than normal?

So, roughly, an average job duration view per queue name and/or per job.func_name.
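
Something along these lines is roughly what I have in mind, as a sketch only (the metric name and helper are hypothetical, not anything rq-exporter exposes today):

```python
from prometheus_client import Histogram

# Hypothetical metric, labelled so the average duration can be derived
# per queue and per function name.
JOB_DURATION = Histogram(
    'rq_job_duration_seconds',
    'Duration of finished RQ jobs in seconds',
    ['queue', 'func_name'],
)

def observe_job(job, queue_name):
    # Hypothetical helper: record one finished job's duration.
    if job.started_at and job.ended_at:
        duration = (job.ended_at - job.started_at).total_seconds()
        JOB_DURATION.labels(queue=queue_name, func_name=job.func_name).observe(duration)
```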

I'm not a Prometheus expert either, though, so happy to hear your thoughts.

@mdawar
Owner

mdawar commented Jul 31, 2020

Actually, I thought we could calculate the duration from the start and end times of the job.

As per the Prometheus documentation on histograms and summaries:

Histograms and summaries both sample observations, typically request durations or response sizes. They track the number of observations and the sum of the observed values, allowing you to calculate the average of the observed values. Note that the number of observations (showing up in Prometheus as a time series with a _count suffix) is inherently a counter (as described above, it only goes up). The sum of observations (showing up as a time series with a _sum suffix) behaves like a counter, too, as long as there are no negative observations. Obviously, request durations or response sizes are never negative.
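
So assuming we had such a metric, say a hypothetical rq_job_duration_seconds histogram, the average duration per queue could be derived at query time from those two series, e.g. rate(rq_job_duration_seconds_sum[5m]) / rate(rq_job_duration_seconds_count[5m]).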

The thing is that RQ is not like Celery, which uses events. The RQ exporter queries Redis on each Prometheus scrape to get the jobs and workers at the time of the scrape. So if, for example, a job finished and was deleted before this scrape, we won't know anything about it. It also means we would need to fetch all the jobs in the finished and failed registries and calculate their durations from their start and end times on each scrape.
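
To illustrate the cost, the per-scrape work would look roughly like this (a sketch only, using RQ's public API; collect_durations is a made-up helper, not something the exporter implements):

```python
from rq.job import Job

def collect_durations(queue):
    # Sketch of the per-scrape work: fetch every job still present in the
    # finished/failed registries and derive its duration. Any job that
    # finished and was deleted between scrapes is simply missed.
    durations = []
    for registry in (queue.finished_job_registry, queue.failed_job_registry):
        job_ids = registry.get_job_ids()
        for job in Job.fetch_many(job_ids, connection=queue.connection):
            if job and job.started_at and job.ended_at:
                durations.append((job.ended_at - job.started_at).total_seconds())
    return durations
```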

That's why I can't think of a good way to track the job duration without events dispatched from the jobs.

@kevinkle
Author

kevinkle commented Aug 3, 2020

Hmm, BaseRegistry.get_job_ids() does accept a start/end range, which maps to ZRANGE. Any thoughts on scraping the last 10 jobs or so from each queue.finished_job_registry? I'm thinking it'd have the same effect for the use case with a histogram, though this could get expensive very fast, with one call per queue to get the IDs and another per job to get the start/end times.

If I remember correctly, the finished_job_registry includes the failed_job_registry.
The jobs should also be ordered by score, which is the timestamp + TTL, with jobs that never expire set to +inf.
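
As a rough sketch of what I mean (the queue name and slice size are just examples):

```python
from redis import Redis
from rq import Queue
from rq.job import Job

queue = Queue('high', connection=Redis())  # 'high' is only an example queue name

# get_job_ids(start, end) maps to ZRANGE, so negative indices select the
# highest-scored entries, i.e. the jobs whose expiry is furthest away.
job_ids = queue.finished_job_registry.get_job_ids(-10, -1)
jobs = [job for job in Job.fetch_many(job_ids, connection=queue.connection) if job]

durations = [
    (job.ended_at - job.started_at).total_seconds()
    for job in jobs
    if job.started_at and job.ended_at
]
```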

@mdawar
Owner

mdawar commented Aug 4, 2020

I could be wrong, but I still don't think we can accurately record this data using a histogram/summary metric, and as you said, this operation could get very expensive.

About the job registries: the finished and failed registries hold the successful and failed jobs respectively; the finished_job_registry does not include the failed jobs.

Honestly, I really don't have time to start working on this feature. Of course pull requests are welcome, but if the operation is expensive, you'll need to implement it behind a flag and disable it by default.
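
For example, a hypothetical opt-in setting along these lines (the variable name is made up, not an existing rq-exporter option):

```python
import os

# Made-up env var following the exporter's env-var style of configuration;
# the expensive duration collection stays disabled unless explicitly enabled.
EXPORT_JOB_DURATIONS = os.environ.get('RQ_EXPORTER_JOB_DURATIONS', '0') == '1'

if EXPORT_JOB_DURATIONS:
    # Only do the expensive registry scan / duration collection when the
    # operator has opted in.
    ...
```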

@kevinkle
Author

kevinkle commented Aug 7, 2020

That works. I'll have to take a further look into a good way to record this. Thanks for the insight.

We'll have a use case for this in Q4 this year, so I'll open a PR if it works out.
