I just spent 8 hours debugging this issue, so I want to document it for others.
Symptoms
The job just hangs and never finishes (i.e., it appears frozen)
The issue happens sometimes and appears to be affected by a race condition
If the Postgres lock_timeout setting is configured, the lock will expire
If you debug Postgres locks, you'll find that the SQL statement below is the blocking one, and all other jobs are blocked by it.
Note: I think this is the last executed query in the transaction.
SELECT "good_jobs"."id", "good_jobs"."queue_name", "good_jobs"."priority", "good_jobs"."serialized_params", "good_jobs"."scheduled_at", "good_jobs"."performed_at", "good_jobs"."finished_at", "good_jobs"."error", "good_jobs"."created_at", "good_jobs"."updated_at", "good_jobs"."active_job_id", "good_jobs"."concurrency_key", "good_jobs"."cron_key", "good_jobs"."cron_at", "good_jobs"."batch_id", "good_jobs"."batch_callback_id", "good_jobs"."executions_count", "good_jobs"."job_class", "good_jobs"."error_event", "good_jobs"."labels", "good_jobs"."locked_by_id", "good_jobs"."locked_at" FROM "good_jobs" WHERE "good_jobs"."id" IN (WITH "rows" AS MATERIALIZED (SELECT "good_jobs"."id" FROM "good_jobs" WHERE "good_jobs"."finished_at" IS NULL AND ("good_jobs"."scheduled_at" <= $1 OR "good_jobs"."scheduled_at" IS NULL) ORDER BY priority ASC NULLS LAST, "good_jobs"."created_at" ASC) SELECT "rows"."id" FROM "rows" WHERE pg_try_advisory_lock(('x' || substr(md5('good_jobs' || '-' || "rows"."id"::text), 1, 16))::bit(64)::bigint) LIMIT $2) ORDER BY priority ASC NULLS LAST, "good_jobs"."created_at" ASC
None of that was pointing to the cause of the issue.
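If you are inspecting pg_locks while debugging, it can help to know which job a held advisory lock belongs to. The sketch below reproduces, in plain Ruby, the key expression from the query above (the first 16 hex characters of the MD5 of "good_jobs-<id>", read as a signed 64-bit integer). The helper names advisory_lock_key and pg_locks_columns are my own for illustration, not part of GoodJob. The split into classid/objid follows how Postgres displays bigint advisory lock keys in pg_locks (high 32 bits in classid, low 32 bits in objid).

```ruby
require "digest"

# Hypothetical helper: mirrors
#   ('x' || substr(md5('good_jobs' || '-' || id::text), 1, 16))::bit(64)::bigint
def advisory_lock_key(job_id, table_name = "good_jobs")
  hex = Digest::MD5.hexdigest("#{table_name}-#{job_id}")[0, 16]
  unsigned = hex.to_i(16)
  # bit(64)::bigint is two's complement, so wrap values >= 2**63 into negatives
  unsigned >= 2**63 ? unsigned - 2**64 : unsigned
end

# Hypothetical helper: how a bigint advisory lock key shows up in pg_locks
def pg_locks_columns(key)
  unsigned = key % 2**64
  { classid: unsigned >> 32, objid: unsigned & 0xFFFF_FFFF }
end
```

Compute the key for a suspect job id, then compare the resulting classid and objid against the rows in pg_locks where locktype is 'advisory'.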
What's happening
This is a use case I had
Feature A: simple job
Feature B: processing (A) jobs with a Batch feature
Feature C: processing (B) jobs with a Batch feature
Feature B was implemented so that it can run either as a standalone job or as part of a Feature C batch.
# Feature B
# ...
batch = job.batch || GoodJob::Batch.new
batch.add do
  # Queue (A) jobs here
end
batch.enqueue
# ...
This works well if the B job is queued standalone
It sometimes hangs if the B job was queued by C as part of its batch
Issue & Solution
# Feature B
# ...
batch = job.batch || GoodJob::Batch.new
batch.add do
  # Queue (A) jobs here
end
batch.enqueue unless batch.enqueued? # <==== FIX HERE
# ...
The issue: the existing batch is re-enqueued while it is still running.
The solution is to add unless batch.enqueued? so that only new batches are enqueued
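To make the effect of the guard concrete without a database, here is a minimal sketch using a stand-in class. FakeBatch and run_feature_b are hypothetical names of mine. The stand-in only mimics the three calls used above (add, enqueue, enqueued?) and is not how GoodJob::Batch is implemented.

```ruby
# Stand-in for GoodJob::Batch (assumption: only the three calls used above matter)
class FakeBatch
  attr_reader :enqueue_calls

  def initialize(enqueued: false)
    @enqueued = enqueued
    @enqueue_calls = 0
  end

  def enqueued?
    @enqueued
  end

  def add
    yield if block_given?
  end

  def enqueue
    @enqueue_calls += 1
    @enqueued = true
  end
end

# Feature B with the guard: reuse the parent batch if present, else create one
def run_feature_b(existing_batch = nil)
  batch = existing_batch || FakeBatch.new
  batch.add do
    # Queue (A) jobs here
  end
  batch.enqueue unless batch.enqueued?
  batch
end

standalone = run_feature_b                              # new batch: enqueued once
inside_c   = run_feature_b(FakeBatch.new(enqueued: true)) # running batch: left alone
```

With the guard in place, the standalone path still enqueues its new batch, while the path that runs inside C's batch never calls enqueue a second time.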