Things to check first
I have checked that my issue does not already have a solution in the FAQ
I have searched the existing issues and didn't find my bug already reported there
I have checked that my bug is still present in the latest release
Version
4.0.0a4
What happened?
For my use case I need separate scheduler and worker instances (there may be thousands of jobs to execute in a short time, and the system has to scale). Everything was stable until our team ran a stress test to find out where the system breaks. We found that as long as we throttle the number of jobs submitted to the scheduler, everything is fine. However, if we submit around 1000 jobs at once, the workers are overwhelmed and the worker's job-processing loop deadlocks: the worker stops acquiring and executing jobs entirely, without crashing.
I inspected the code and pinpointed the issue. It happens inside the _process_jobs function of the async scheduler class (around line 905 at the time of reporting; commit hash f375b67). On line 938, inside the loop, the worker awaits a wakeup event, which is controlled by the job_added function defined there. That function is called only when a new job is added:
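From memory, the callback is shaped roughly like this (paraphrased and simplified, not the exact code from that commit; the subscription call in particular is approximated):

```python
def job_added(event: Event) -> None:
    # Only wakes the processing loop when a *new* job is added and there is
    # still spare capacity; nothing wakes the loop when a running job
    # finishes and frees a slot.
    if len(self._running_jobs) < self.max_concurrent_jobs:
        wakeup_event.set()

self.event_broker.subscribe(job_added, {JobAdded})
```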
Combined with the max_concurrent_jobs limit checked on line 927, this means that if there are more than max_concurrent_jobs jobs in the database, the worker acquires as many as it can and starts executing them, i.e. appends them to its queue; but if no newer jobs are scheduled after that, wakeup_event is never set again. The result is a deadlock that prevents the loop from acquiring more jobs, even once the queue has drained.
To fix this, I propose only a small change to the structure:
1. Rename job_added to check_queue_capacity.
2. Subscribe that function to both JobAdded and JobReleased events.
This way, if there are more jobs than the worker can handle at once, the worker is notified again as soon as the queue frees up.
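To make the failure mode concrete, here is a minimal, self-contained illustration of the pattern in plain asyncio (hypothetical toy code, not APScheduler's implementation): the loop acquires up to MAX_CONCURRENT jobs, then blocks on an event; if nothing sets the event when a running job finishes, the remaining jobs are never picked up.

```python
import asyncio

MAX_CONCURRENT = 2


async def run_job(job: int) -> None:
    await asyncio.sleep(0.1)
    print(f"finished job {job}")


async def main() -> None:
    pending = list(range(5))             # five jobs already waiting in the store
    running: set[asyncio.Task] = set()
    wakeup = asyncio.Event()

    def job_released(task: asyncio.Task) -> None:
        running.discard(task)
        # The proposed fix: wake the loop when a job finishes (JobReleased),
        # not only when a new job is added. Remove the next line and the
        # loop below stalls forever after the first batch of MAX_CONCURRENT.
        wakeup.set()

    while pending:
        # Acquire jobs up to the concurrency limit.
        while pending and len(running) < MAX_CONCURRENT:
            task = asyncio.create_task(run_job(pending.pop(0)))
            task.add_done_callback(job_released)
            running.add(task)

        # Without a "job released" wake-up, nothing ever sets this event again.
        await wakeup.wait()
        wakeup.clear()

    await asyncio.gather(*running)


asyncio.run(main())
```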
How can we reproduce the bug?
Schedule around 1000 jobs, run a couple of workers (fewer than 4), and observe that after a while some of the jobs are never executed unless you restart the workers manually.