
Export job start/end times #3

Open
kevinkle opened this issue Jul 30, 2020 · 6 comments

Comments

@kevinkle

Firstly, RQ Exporter has been excellent!

Do you think it'd be possible to include an option to export job start/end times? I believe these should be accessible via job.started_at and job.ended_at per the RQ docs.
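
For reference, here's a minimal sketch of what I mean, using RQ's Job API (assuming a recent RQ version; the job ID below is just a placeholder):

```python
from redis import Redis
from rq.job import Job

# Minimal sketch: read the start/end times off a fetched job.
# 'my-job-id' is only a placeholder, not a real ID.
job = Job.fetch('my-job-id', connection=Redis())

# Both attributes are None until the job has actually started/finished.
if job.started_at and job.ended_at:
    duration = (job.ended_at - job.started_at).total_seconds()
    print(f'{job.func_name} took {duration:.2f}s')
```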

@mdawar
Owner

mdawar commented Jul 31, 2020

Hi,

Thank you for your suggestion. How do you think these job start/end times should be exposed?
The only thing I can think of right now is exposing the job duration, but I haven't checked how to do it yet. If you have any suggestions about how we should implement this, please provide more details.

I'm not a Prometheus expert, but I think this level of detail about the jobs may be more suitable for another monitoring system; for example, check out:

But I haven't checked if these projects expose the job start/end times.

@kevinkle
Author

kevinkle commented Jul 31, 2020

I think there are two kinds of monitoring systems I've seen so far for RQ:

  1. Snapshots: what's going on right now. For example, using RQ Dashboard to check which workers are registered to which queue, what errors are coming up, etc. I hadn't looked at RQ Monitor before, but at first glance it seems to fall in this category, plus a bunch of nice new features.
  2. Load monitoring: what the current processing capacity is at. For example, using RQ Exporter's Grafana dashboard to see the % of worker usage over time.

The start/end times aren't as important as the job duration, like you suggested. The use cases I'm thinking of would be:

  • Say I have a set of queues, some running at high priority, where I care about the response time. Are the jobs on these specific queues, at some point in time and for whatever reason, completing slower than normal?
  • Another point would be: is my job.func_name job, for whatever reason, suddenly running much slower than normal?

So, roughly, an average job duration view per queue name and/or per job.func_name.
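
Something along these lines is roughly what I have in mind, as a sketch only (the metric name and helper are hypothetical, not anything rq-exporter exposes today):

```python
from prometheus_client import Histogram

# Hypothetical metric, labelled so the average duration can be derived
# per queue and per function name.
JOB_DURATION = Histogram(
    'rq_job_duration_seconds',
    'Duration of finished RQ jobs in seconds',
    ['queue', 'func_name'],
)

def observe_job(job, queue_name):
    # Hypothetical helper: record one finished job's duration.
    if job.started_at and job.ended_at:
        duration = (job.ended_at - job.started_at).total_seconds()
        JOB_DURATION.labels(queue=queue_name, func_name=job.func_name).observe(duration)
```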

I'm not a Prometheus expert either, though, so happy to hear your thoughts.

@mdawar
Owner

mdawar commented Jul 31, 2020

Actually, I thought we could calculate the duration from the start and end times of the job.

As per the Prometheus documentation on histograms and summaries:

Histograms and summaries both sample observations, typically request durations or response sizes. They track the number of observations and the sum of the observed values, allowing you to calculate the average of the observed values. Note that the number of observations (showing up in Prometheus as a time series with a _count suffix) is inherently a counter (as described above, it only goes up). The sum of observations (showing up as a time series with a _sum suffix) behaves like a counter, too, as long as there are no negative observations. Obviously, request durations or response sizes are never negative.
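
So assuming we had such a metric, say a hypothetical rq_job_duration_seconds histogram, the average duration per queue could be derived at query time from those two series, e.g. rate(rq_job_duration_seconds_sum[5m]) / rate(rq_job_duration_seconds_count[5m]).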

The thing is that RQ is not like Celery, which uses events. The RQ exporter queries Redis on each Prometheus scrape to get the jobs and workers at the time of the scrape. So if, for example, a job finished and was deleted before this scrape, we won't know anything about it. It also means we would need to fetch all the jobs in the finished and failed registries and calculate their durations from their start and end times on each scrape.
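
To illustrate the cost, the per-scrape work would look roughly like this (a sketch only, using RQ's public API; collect_durations is a made-up helper, not something the exporter implements):

```python
from rq.job import Job

def collect_durations(queue):
    # Sketch of the per-scrape work: fetch every job still present in the
    # finished/failed registries and derive its duration. Any job that
    # finished and was deleted between scrapes is simply missed.
    durations = []
    for registry in (queue.finished_job_registry, queue.failed_job_registry):
        job_ids = registry.get_job_ids()
        for job in Job.fetch_many(job_ids, connection=queue.connection):
            if job and job.started_at and job.ended_at:
                durations.append((job.ended_at - job.started_at).total_seconds())
    return durations
```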

That's why I can't think of a good way to track the job duration without events dispatched from the jobs.

@kevinkle
Author

kevinkle commented Aug 3, 2020

Hmm, BaseRegistry.get_job_ids() does accept a start/end range, which maps to ZRANGE. Any thoughts on scraping the last 10 jobs or so from each queue.finished_job_registry? I'm thinking it'd have the same effect for the use case with a histogram, though this could get expensive very fast, with one call per queue to get the IDs and another per job to get the start/end times.

If I remember correctly, the finished_job_registry includes the failed_job_registry.
The jobs should also be ordered by score, which is the timestamp + TTL, with jobs that never expire set to +inf.
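
As a rough sketch of what I mean (the queue name and slice size are just examples):

```python
from redis import Redis
from rq import Queue
from rq.job import Job

queue = Queue('high', connection=Redis())  # 'high' is only an example queue name

# get_job_ids(start, end) maps to ZRANGE, so negative indices select the
# highest-scored entries, i.e. the jobs whose expiry is furthest away.
job_ids = queue.finished_job_registry.get_job_ids(-10, -1)
jobs = [job for job in Job.fetch_many(job_ids, connection=queue.connection) if job]

durations = [
    (job.ended_at - job.started_at).total_seconds()
    for job in jobs
    if job.started_at and job.ended_at
]
```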

@mdawar
Owner

mdawar commented Aug 4, 2020

I could be wrong, but I still don't think we can accurately record this data using a histogram/summary metric, and as you said, this operation could get very expensive.

About the job registries: the finished and failed registries hold the successful and failed jobs respectively; the finished_job_registry does not include the failed jobs.

Honestly, I really don't have time to start working on this feature. Of course pull requests are welcome, but if the operation is expensive, you'll need to implement it behind a flag and disable it by default.
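
For example, a hypothetical opt-in setting along these lines (the variable name is made up, not an existing rq-exporter option):

```python
import os

# Made-up env var following the exporter's env-var style of configuration;
# the expensive duration collection stays disabled unless explicitly enabled.
EXPORT_JOB_DURATIONS = os.environ.get('RQ_EXPORTER_JOB_DURATIONS', '0') == '1'

if EXPORT_JOB_DURATIONS:
    # Only do the expensive registry scan / duration collection when the
    # operator has opted in.
    ...
```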

@kevinkle
Author

kevinkle commented Aug 7, 2020

That works. I'll have to take a further look into a good way to record this. Thanks for the insight.

We'll have a use case for this in Q4 this year, so I'll open a PR if it works out.
