Possible unthrottled spinloop when jobs fail to deserialize #41

Closed
smudge opened this issue Jun 14, 2024 · 1 comment

smudge commented Jun 14, 2024

Working with @emmaoberstein on setting up a new queue, we discovered a condition in which the worker gets stuck in a spinloop, running the pickup query multiple times per second (effectively disobeying the `sleep_delay` config).

The likeliest reproduction would be to enqueue a job that fails deserialization completely, or that otherwise prevents the usual job cleanup (e.g. it crashes the thread).
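
For example, a job whose class constant no longer resolves when the worker picks it up will raise during deserialization. A minimal sketch of that reproduction (the class name is made up for illustration, and the exact error raised on pickup may differ):

    # A class we can delete after enqueueing, to force a deserialization failure:
    class SoonToVanishJob
      def perform; end
    end

    Delayed::Job.enqueue(SoonToVanishJob.new)

    # Simulate a bad deploy: the constant no longer exists when the worker runs.
    Object.send(:remove_const, :SoonToVanishJob)

    # On the next pickup, `payload_object` fails to deserialize the payload,
    # neither counter is incremented, the job is never cleaned up, and
    # `reserve_jobs` returns the same row again with no sleep in between.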

Here's where the `work_off` method hits the spinloop:

    def work_off(num = 100)
      success = Concurrent::AtomicFixnum.new(0)
      failure = Concurrent::AtomicFixnum.new(0)

      num.times do
        jobs = reserve_jobs
        break if jobs.empty? # jobs is not empty

        pool = Concurrent::FixedThreadPool.new(jobs.length)
        jobs.each do |job|
          pool.post do
            # Exception encountered when `payload_object` is first called, e.g. job fails to deserialize
            # - success and failure are never incremented
            # - job remains in queue and is immediately returned by the next `reserve_jobs` without waiting
          end
        end

        pool.shutdown
        pool.wait_for_termination

        break if stop?
      end

      [success, failure].map(&:value)
    end

There are a few ways I can think of to fix this:

  1. Add a new "inner loop" delay that enforces a minimum amount of time between each iteration of `num.times do`.
  2. Bail out of the loop if neither success nor failure was incremented (i.e. no work got done).
  3. Ensure that job cleanup happens in all cases (short of a complete loss of the DB connection), so that `reserve_jobs` won't immediately return the same job again (thanks to exponential backoff).

All of these feel fairly reasonable, though I'd be inclined to explore the second and third. (Adding a new delay would require more tuning & testing and would not actually address the underlying failure mode for the job.) So, actually, I'd want to start with the third option, since it would likely also address the remaining issue in #23.
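
As a rough sketch of what the third option might look like inside `work_off` (not an actual patch; `mark_failed` is a hypothetical stand-in for whatever failure handling the worker already does):

    jobs.each do |job|
      pool.post do
        begin
          run(job) # placeholder for the existing per-job execution
        rescue StandardError => e
          # Anything that escapes the normal path (e.g. `payload_object`
          # raising during deserialization) still gets counted and cleaned
          # up, so the next `reserve_jobs` applies backoff instead of
          # handing the same row straight back.
          failure.increment
          mark_failed(job, e) # hypothetical helper, not the gem's real API
        end
      end
    end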

smudge added a commit to smudge/delayed that referenced this issue Jun 26, 2024
This ensures that exceptions raised in thread callback hooks are rescued
and properly mark jobs as failed.

Fixes Betterment#23 and Betterment#41
smudge added a commit that referenced this issue Jun 27, 2024
This ensures that exceptions raised in thread callback hooks are rescued
and properly mark jobs as failed.

This is also a good opportunity to change the `num` argument (of
`work_off(num)`) to mean number of jobs (give or take a few due to
`max_claims`), not number of iterations. Previously (before threading
was introduced) I think it meant number of jobs (though jobs and
iterations were 1:1). I would not have done this before the refactor,
because there was no guarantee that one of `success` or `failure` would
be incremented (the thread might crash for many reasons). Now, we only
increment `success` and treat `total - success` as the "failure" number
when we return from the method.

Fixes #23 and #41

This is also a prereq for a resolution I'm cooking up for #36
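
Roughly, the accounting described in that commit message would give `work_off` a shape like this (a sketch, not the literal diff; the pool wiring is elided):

    def work_off(num = 100)
      success = Concurrent::AtomicFixnum.new(0)
      total = 0

      while total < num
        jobs = reserve_jobs
        break if jobs.empty?

        total += jobs.length
        # ... post jobs to the pool; only successful runs increment `success` ...
        break if stop?
      end

      # Anything reserved that didn't succeed counts as a failure, even if
      # its thread crashed before it could report back.
      [success.value, total - success.value]
    end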

smudge commented Jun 27, 2024

Should be resolved by #42

smudge closed this as completed Jun 27, 2024