
Worker stops if there are "too many" simultaneous jobs #881

Closed
mavahedinia opened this issue Mar 25, 2024 · 2 comments

@mavahedinia (Contributor)

Things to check first

  • I have checked that my issue does not already have a solution in the FAQ

  • I have searched the existing issues and didn't find my bug already reported there

  • I have checked that my bug is still present in the latest release

Version

4.0.0a4

What happened?

For my use case, I need separate scheduler and worker instances, since there may be thousands of jobs to execute in a short window and the system has to scale. Everything was fine and stable until our team ran a stress test to find out where the system breaks. We found that if we throttle the number of jobs submitted to the scheduler, everything works; but if we submit around 1,000 jobs to the scheduler at once, the workers get overwhelmed and their job-processing loop deadlocks, so the workers stop acquiring and executing jobs entirely, without crashing.

I inspected the code and pinpointed the issue. It happens inside the _process_jobs function of the (async) scheduler class (around line 905 as of commit f375b67). On line 938, inside the loop, the worker awaits the wakeup event, which is controlled by the job_added function defined there. That function is called only when a new job is added:

async def job_added(event: Event) -> None:
    if len(self._running_jobs) < self.max_concurrent_jobs:
        wakeup_event.set()
...
self.event_broker.subscribe(job_added, {JobAdded})

Combined with the max_concurrent_jobs constraint enforced on line 927, this means that if there are more than max_concurrent_jobs jobs in the database, the worker acquires them and tries to execute them, i.e., appends them to the queue; but if no newer jobs are scheduled after that, wakeup_event is never set again. The loop is then deadlocked and cannot acquire more jobs, even once the queue has drained.
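To make the failure mode concrete, here is a small self-contained asyncio toy model (not APScheduler code; all names are illustrative) that reproduces the same shape of stall: the loop is only woken by "job added" notifications, so once the initial burst is in the store and the concurrency cap has been reached, nothing ever sets the event again and the remaining jobs are never picked up.

import asyncio

MAX_CONCURRENT = 2

async def main() -> None:
    pending = list(range(10))            # jobs already sitting in the "data store"
    running: set[asyncio.Task] = set()
    wakeup = asyncio.Event()

    def job_added() -> None:
        # Mirrors the job_added subscriber above; it never fires after the
        # initial burst because no new jobs are submitted.
        if len(running) < MAX_CONCURRENT:
            wakeup.set()

    async def run_job(n: int) -> None:
        await asyncio.sleep(0.1)
        # Nothing wakes the loop here -- this is the missing JobReleased
        # notification that the proposed fix adds.

    wakeup.set()                         # the initial burst wakes the loop once
    while pending:
        try:
            await asyncio.wait_for(wakeup.wait(), timeout=2)
        except asyncio.TimeoutError:
            print(f"stalled with {len(pending)} jobs still pending")
            return
        wakeup.clear()
        while pending and len(running) < MAX_CONCURRENT:
            task = asyncio.create_task(run_job(pending.pop()))
            running.add(task)
            task.add_done_callback(running.discard)

asyncio.run(main())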

To fix this, I propose a small structural change:

  1. Rename job_added to check_queue_capacity.
  2. Subscribe that function to both the JobAdded and JobReleased events.

This way, if there are more jobs than the worker can handle at once, the worker is notified again as soon as the queue frees up (a sketch follows below).
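Concretely, the change would look roughly like this (a sketch based on the excerpt above; only the subscriber's name and the set of subscribed events change):

async def check_queue_capacity(event: Event) -> None:
    # Wake the processing loop whenever there is room for more work,
    # whether a job was just added or one was just released.
    if len(self._running_jobs) < self.max_concurrent_jobs:
        wakeup_event.set()
...
self.event_broker.subscribe(check_queue_capacity, {JobAdded, JobReleased})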

How can we reproduce the bug?

Simply schedule around 1,000 jobs, run a couple of workers (fewer than 4), and observe that after a while some of the jobs are never executed unless you restart the workers manually.

@mavahedinia (Contributor, Author)

The proposed solution is available in PR #882.

@agronholm (Owner)

Fixed via 0596db7.
