Possible unthrottled spinloop when jobs fail to deserialize #41

Closed
smudge opened this issue Jun 14, 2024 · 1 comment

smudge commented Jun 14, 2024

Working with @emmaoberstein on setting up a new queue, we discovered a condition in which the worker gets stuck in a spinloop, running the pickup query multiple times per second (effectively disobeying the `sleep_delay` config).

The likeliest reproduction would be to enqueue a job that fails deserialization completely, or that otherwise prevents the usual job cleanup (e.g. it crashes the thread).
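
For example, a job whose class constant no longer resolves when the worker picks it up will raise during deserialization. A minimal sketch of that reproduction (the class name is made up for illustration, and the exact error raised on pickup may differ):

    # A class we can delete after enqueueing, to force a deserialization failure:
    class SoonToVanishJob
      def perform; end
    end

    Delayed::Job.enqueue(SoonToVanishJob.new)

    # Simulate a bad deploy: the constant no longer exists when the worker runs.
    Object.send(:remove_const, :SoonToVanishJob)

    # On the next pickup, `payload_object` fails to deserialize the payload,
    # neither counter is incremented, the job is never cleaned up, and
    # `reserve_jobs` returns the same row again with no sleep in between.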

Here's where the `work_off` method hits the spinloop:

    def work_off(num = 100)
      success = Concurrent::AtomicFixnum.new(0)
      failure = Concurrent::AtomicFixnum.new(0)

      num.times do
        jobs = reserve_jobs
        break if jobs.empty? # jobs is not empty

        pool = Concurrent::FixedThreadPool.new(jobs.length)
        jobs.each do |job|
          pool.post do
            # Exception encountered when `payload_object` is first called, e.g. job fails to deserialize
            # - success and failure are never incremented
            # - job remains in queue and is immediately returned by the next `reserve_jobs` without waiting
          end
        end

        pool.shutdown
        pool.wait_for_termination

        break if stop?
      end

      [success, failure].map(&:value)
    end

There are a few ways I can think of to fix this:

  1. Add a new "inner loop" delay that enforces a minimum amount of time between each iteration of `num.times do`.
  2. Bail out of the loop if neither success nor failure was incremented (i.e. no work got done).
  3. Ensure that job cleanup happens in all cases (short of a complete loss of the DB connection), so that `reserve_jobs` won't immediately return the same job again (thanks to exponential backoff).

All of these feel fairly reasonable, though I'd be inclined to explore the second and third. (Adding a new delay would require more tuning & testing and would not actually address the underlying failure mode for the job.) So, actually, I'd want to start with the third option, since it would likely also address the remaining issue in #23.
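
As a rough sketch of what the third option might look like inside `work_off` (not an actual patch; `mark_failed` is a hypothetical stand-in for whatever failure handling the worker already does):

    jobs.each do |job|
      pool.post do
        begin
          run(job) # placeholder for the existing per-job execution
        rescue StandardError => e
          # Anything that escapes the normal path (e.g. `payload_object`
          # raising during deserialization) still gets counted and cleaned
          # up, so the next `reserve_jobs` applies backoff instead of
          # handing the same row straight back.
          failure.increment
          mark_failed(job, e) # hypothetical helper, not the gem's real API
        end
      end
    end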

smudge added a commit to smudge/delayed that referenced this issue Jun 26, 2024
This ensures that exceptions raised in thread callback hooks are rescued
and properly mark jobs as failed.

Fixes Betterment#23 and Betterment#41
smudge added a commit that referenced this issue Jun 27, 2024
This ensures that exceptions raised in thread callback hooks are rescued
and properly mark jobs as failed.

This is also a good opportunity to change the `num` argument (of
`work_off(num)`) to mean number of jobs (give or take a few due to
`max_claims`), not number of iterations. Previously (before threading
was introduced) I think it meant number of jobs (though jobs and
iterations were 1:1). I would not have done this before the refactor,
because there was no guarantee that one of `success` or `failure` would
be incremented (the thread might crash for many reasons). Now, we only
increment `success` and treat `total - success` as the "failure" number
when we return from the method.

Fixes #23 and #41

This is also a prereq for a resolution I'm cooking up for #36
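
Roughly, the accounting described in that commit message would give `work_off` a shape like this (a sketch, not the literal diff; the pool wiring is elided):

    def work_off(num = 100)
      success = Concurrent::AtomicFixnum.new(0)
      total = 0

      while total < num
        jobs = reserve_jobs
        break if jobs.empty?

        total += jobs.length
        # ... post jobs to the pool; only successful runs increment `success` ...
        break if stop?
      end

      # Anything reserved that didn't succeed counts as a failure, even if
      # its thread crashed before it could report back.
      [success.value, total - success.value]
    end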

smudge commented Jun 27, 2024

Should be resolved by #42

smudge closed this as completed Jun 27, 2024