Hanging jobs and postgres lock timeout #1461

antulik · 2024-08-08T07:42:12Z


I just spent 8 hours debugging this issue, so I want to document it for others.

Symptoms

The job just hangs and never finishes (i.e. it is frozen).
The issue happens only sometimes and appears to be caused by a race condition.
If the Postgres lock_timeout setting is configured, the blocked statement fails with a lock timeout error instead of waiting forever.
If you debug Postgres locks, you'll find that the SQL statement below is the blocking one, and all other jobs are blocked by it.
- Note: I think this is simply the last query executed in the transaction.

SELECT "good_jobs"."id", "good_jobs"."queue_name", "good_jobs"."priority", "good_jobs"."serialized_params", "good_jobs"."scheduled_at", "good_jobs"."performed_at", "good_jobs"."finished_at", "good_jobs"."error", "good_jobs"."created_at", "good_jobs"."updated_at", "good_jobs"."active_job_id", "good_jobs"."concurrency_key", "good_jobs"."cron_key", "good_jobs"."cron_at", "good_jobs"."batch_id", "good_jobs"."batch_callback_id", "good_jobs"."executions_count", "good_jobs"."job_class", "good_jobs"."error_event", "good_jobs"."labels", "good_jobs"."locked_by_id", "good_jobs"."locked_at" FROM "good_jobs" WHERE "good_jobs"."id" IN (WITH "rows" AS  MATERIALIZED (SELECT "good_jobs"."id" FROM "good_jobs" WHERE "good_jobs"."finished_at" IS NULL AND ("good_jobs"."scheduled_at" <= $1 OR "good_jobs"."scheduled_at" IS NULL) ORDER BY priority ASC NULLS LAST, "good_jobs"."created_at" ASC) SELECT "rows"."id" FROM "rows" WHERE pg_try_advisory_lock(('x' || substr(md5('good_jobs' || '-' || "rows"."id"::text), 1, 16))::bit(64)::bigint) LIMIT $2) ORDER BY priority ASC NULLS LAST, "good_jobs"."created_at" ASC

None of that pointed to the actual cause of the issue.
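If you want to map advisory locks seen in pg_locks back to individual jobs, the key expression from the query above (md5 of the table name, a dash and the id, first 16 hex characters, cast to a signed bigint) can be reproduced in plain Ruby. This is a minimal sketch, not GoodJob code: the helper names (lock_key_for_string, good_job_lock_key, pg_locks_columns) are my own, and the job id in the usage line is a made-up placeholder. Postgres shows a bigint advisory key in pg_locks with its high 32 bits in classid and its low 32 bits in objid.

```ruby
require "digest/md5"

# Mirrors ('x' || substr(md5(text), 1, 16))::bit(64)::bigint from the query:
# take the first 16 hex chars of the MD5 and read them as a signed 64-bit integer.
def lock_key_for_string(text)
  unsigned = Digest::MD5.hexdigest(text)[0, 16].to_i(16)
  [unsigned].pack("Q").unpack1("q") # reinterpret unsigned 64-bit as signed
end

# Hypothetical helper: builds the same 'good_jobs' || '-' || id string as the query.
def good_job_lock_key(job_id, table_name = "good_jobs")
  lock_key_for_string("#{table_name}-#{job_id}")
end

# Splits a bigint key the way pg_locks displays it (classid = high half, objid = low half).
def pg_locks_columns(key)
  unsigned = key & 0xFFFF_FFFF_FFFF_FFFF
  { classid: unsigned >> 32, objid: unsigned & 0xFFFF_FFFF }
end

# Placeholder id, replace with a real good_jobs.id value
key = good_job_lock_key("00000000-0000-0000-0000-000000000000")
puts key
puts pg_locks_columns(key).inspect
```

Comparing the printed classid and objid against the advisory rows in pg_locks tells you which job id a given lock belongs to.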

What's happening

This is the use case I had:

Feature A: simple job
Feature B: processing (A) jobs with a Batch feature
Feature C: processing (B) jobs with a Batch feature

Feature B was implemented so that it can run either as a standalone job or as part of a Feature C batch.

# Feature B 
# ...
batch = job.batch || GoodJob::Batch.new
batch.add do 
   # Queue (A) jobs here
end
batch.enqueue
# ...

This works well if job B is queued as a standalone job.
It sometimes hangs if job B was queued by C as part of its batch.

Issue & Solution

# Feature B 
# ...
batch = job.batch || GoodJob::Batch.new
batch.add do 
   # Queue (A) jobs here
end
batch.enqueue unless batch.enqueued? # <==== FIX HERE
# ...

The issue: when job B runs inside C's batch, the existing batch is re-enqueued while it's still running.
The solution is to add unless batch.enqueued? so that only new batches are enqueued.
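To make the guard concrete without needing a database, here is a small self-contained sketch. FakeBatch and run_feature_b are hypothetical stand-ins I made up for illustration; they only mimic the enqueue / enqueued? calls used above and are not GoodJob's implementation.

```ruby
# Hypothetical stand-in for a batch object; NOT GoodJob::Batch.
class FakeBatch
  attr_reader :enqueue_calls

  def initialize(enqueued: false)
    @enqueued = enqueued
    @enqueue_calls = 0
  end

  def enqueued?
    @enqueued
  end

  # Counts calls so we can see whether the guard skipped the enqueue.
  def enqueue
    @enqueue_calls += 1
    @enqueued = true
  end
end

# Mirrors the fixed Feature B logic: reuse the parent batch if there is one,
# and only enqueue a batch that this job created itself.
def run_feature_b(existing_batch = nil)
  batch = existing_batch || FakeBatch.new
  # (A) jobs would be added to the batch here
  batch.enqueue unless batch.enqueued?
  batch
end

standalone = run_feature_b                               # B run on its own
inside_c   = run_feature_b(FakeBatch.new(enqueued: true)) # B run inside C's batch

puts standalone.enqueue_calls # 1: the new batch is enqueued once
puts inside_c.enqueue_calls   # 0: the already-running batch is left alone
```

Without the unless guard, the second call would enqueue the parent batch again, which is the situation that led to the hang described above.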
