[WIP] Wait to reenqueue job post-interrupt until after callbacks are finished #66
+43
−1
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
[WIP] Test suite on Core is failing, need to investigate
Problem
When Sidekiq is being used as the queue adapter, it is possible for
job-iteration
to reenqueue the job after an interruption prior to the callbacks for the original job completing. This can create race conditions ifon_shutdown
or{after|around}_perform
callbacks touch data that the new job will also touch.This can be demonstrated by keeping the new test
"allows callbacks to finish before reenqueuing job after interrupt"
but removing the rest of the changes. The callback order becomes[["before_enqueue"], ["before_enqueue"], ["on_shutdown"]]
, which is not what we want.Solution
This can be solved by waiting until after all the callbacks on the original job are complete to reenqueue the iteration job. Rather than reenqueuing the job from within
#iterate_with_enumerator
, we can prepend anafter_perform
callback (which ensures the callback runs last) whenJobIteration::Iteration
is included and reenqueue the job there.More Context
We're leveraging the
job-iteration
gem in the new Maintenance Tasks Framework, and we use aRun
object to keep track of a maintenance task's progress under the hood. We encountered a race condition which lead to an invalid state and raised an exception. This happened because theon_shutdown
callback:https://github.com/Shopify/maintenance_tasks/blob/fd096a5d2bcbd8340370b950c927076296ece245/app/jobs/maintenance_tasks/task_job.rb#L88-L98
occurred after the new job had already gone through the
before_perform
hook:https://github.com/Shopify/maintenance_tasks/blob/fd096a5d2bcbd8340370b950c927076296ece245/app/jobs/maintenance_tasks/task_job.rb#L64-L77
and started performing, which led the
@run
object to be in an invalid state.Considerations
I don't think there are any unintended consequences to delaying the reenqueue - the only incompatibility would be if users were depending on
reenqueue_iteration_job
being called prior to their callbacks finishing and making use of one of the variables (ie.self.already_in_queue
or@retried
). I've double checked inShopify/shopify
and it doesn't seem that this is the case.Suggestions for a better way to test the order of the callbacks are also welcome - matching on output sent to
$STDOUT
in the callbacks seemed like a reasonable and simple way to do it, rather than fiddling around with class variables or something similar.Any other things I'm missing?