Working with @emmaoberstein on setting up a new queue, we discovered that there is a condition in which the worker will get stuck in a spinloop, running the pickup query multiple times per second (effectively disobeying the `sleep_delay` config).
The likeliest reproduction would be to enqueue a job that fails deserialization completely, or that otherwise prevents the usual job cleanup (e.g. it crashes the thread).
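For example (a contrived sketch, not an exact reproduction script; the class name and the constant-removal trick are made up for illustration), enqueueing a job and then removing its class before pickup should make `payload_object` raise a deserialization error on every attempt:

```ruby
# Sketch of a job that will fail deserialization once its class disappears.
class DoomedJob
  def perform; end
end

Delayed::Job.enqueue(DoomedJob.new)

# Simulate a deploy that deleted or renamed the class before the worker picks
# the job up. Deserializing the stored handler now raises a
# Delayed::DeserializationError, the job is never marked failed, and
# reserve_jobs hands back the same row on the very next loop iteration.
Object.send(:remove_const, :DoomedJob)
```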
Here's where the `work_off` method hits the spinloop:
```ruby
def work_off(num = 100)
  success = Concurrent::AtomicFixnum.new(0)
  failure = Concurrent::AtomicFixnum.new(0)

  num.times do
    jobs = reserve_jobs
    break if jobs.empty?
    # jobs is not empty

    pool = Concurrent::FixedThreadPool.new(jobs.length)
    jobs.each do |job|
      pool.post do
        # Exception encountered when `payload_object` is first called, e.g. job fails to deserialize
        # - success and failure are never incremented
        # - job remains in queue and is immediately returned by the next `reserve_jobs` without waiting
      end
    end

    pool.shutdown
    pool.wait_for_termination
    break if stop?
  end

  [success, failure].map(&:value)
end
```
There are a few ways I could think of to fix this:
Add a new "inner loop" delay that sets a minimum amount of time in between each iteration of num.times do.
Bail from the loop if neither success nor failure were incremented (i.e. no work got done).
Ensure that job cleanup happens in all cases (except for complete loss of DB connection), to ensure that reserve_jobs won't immediately return the same job again (due to exponential backoff).
All of these feel fairly reasonable, though I'd be inclined to explore the second and third. (Adding a new delay would require more tuning & testing and would not actually address the underlying failure mode for the job.) So, actually, I'd want to start with the third option, since it would likely also address the remaining issue in #23.
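As a rough sketch of what the third option could look like (assuming `run(job)` returns true/false as in the current worker, and with `handle_failed_job` standing in for whatever failure path reschedules the job with backoff), the `pool.post` body would rescue everything and always settle the job:

```ruby
pool.post do
  begin
    run(job) ? success.increment : failure.increment
  rescue StandardError => e
    # Even a deserialization error marks the job failed, so the backoff kicks
    # in and the next reserve_jobs call doesn't hand the same row straight back.
    failure.increment
    handle_failed_job(job, e) # assumed name for the worker's failure path
  end
end
```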
smudge added a commit to smudge/delayed that referenced this issue on Jun 26, 2024:
This ensures that exceptions raised in thread callback hooks are rescued
and properly mark jobs as failed.
This is also a good opportunity to change the `num` argument (of
`work_off(num)`) to mean number of jobs (give or take a few due to
`max_claims`), not number of iterations. Previously (before threading
was introduced) I think it meant number of jobs (though jobs and
iterations were 1:1). I would not have done this before the refactor,
because there was no guarantee that one of `success` or `failure` would
be incremented (the thread might crash for many reasons). Now, we only
increment `success` and treat `total - success` as the "failure" number
when we return from the method.
Fixes #23 and #41
This is also a prereq for a resolution I'm cooking up for #36
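To make the counting change described in that commit message concrete, here's an illustrative sketch (not the actual commit) where `num` caps the number of jobs, only successes are counted inside the threads, and failures are derived as `total - success` on return. `reserve_jobs` and `run` follow the snippet quoted earlier; the failure hook name is an assumption:

```ruby
def work_off(num = 100)
  success = Concurrent::AtomicFixnum.new(0)
  total = 0

  while total < num
    jobs = reserve_jobs
    break if jobs.empty?

    total += jobs.length
    pool = Concurrent::FixedThreadPool.new(jobs.length)
    jobs.each do |job|
      pool.post do
        begin
          success.increment if run(job)
        rescue StandardError => e
          # Any crash (deserialization included) leaves `success` untouched,
          # so the job naturally counts toward `total - success`.
          handle_failed_job(job, e) # assumed failure hook
        end
      end
    end

    pool.shutdown
    pool.wait_for_termination
    break if stop?
  end

  [success.value, total - success.value]
end
```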