
Don't schedule using stale job info #389

Merged · 3 commits into main from fix-scheduling-for-cancelled-jobs · Oct 6, 2024

Conversation

DrJosh9000 (Contributor)

Why

Fixes #382

What

Change the Handler interface to accept a new Job type that includes a channel, closed when the job information becomes stale. The limiter can use this to time out while waiting for a token to become available, so that stale jobs are never handed to the scheduler:

```go
// Job is the new Handler input: the API job plus a staleness signal.
type Job struct {
	*api.CommandJob

	// Closed when the job information becomes stale.
	StaleCh <-chan struct{}
}
```
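
For reference, a hedged sketch of the consuming side: the Create method name matches the discussion below, but the exact interface shape is assumed, not copied from the PR.

```go
// Handler is what the monitor hands jobs to (sketch; shape assumed).
type Handler interface {
	Create(ctx context.Context, job Job) error
}
```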
Contributor

Hm, I thought the fix would be to double-check the cancellation (etc.) state of the Job/Build before actually scheduling it onto K8s.

Having a callback (channel) for when information becomes stale seems like a much more complicated way to achieve that, with unclear benefits, and it requires running a polling loop to keep the channel updated, which will load up the Buildkite backend quite a lot with these polling checks…

Sorry, I'm not very proficient with Golang. Can you please help me understand why this approach is better? Thanks!

DrJosh9000 (Contributor, Author) · Oct 1, 2024

If there were no limiter in place, Buildkite would receive a query every PollInterval for jobs to run (call this the query loop), and the monitor would loop over the results (the inner loop), calling handler.Create for each of them.

In the previous PR on the limiter, I changed how the limiter worked to introduce blocking on the "token bucket", but this caused the bug: a job returned from the query a long time ago could be next in the monitor's inner loop around handler.Create, so by the time the limiter finally gets a token, the job could already be cancelled.

With the StaleCh approach, the limiter can block until either there's a token or the job information is stale. If it's stale, the limiter shouldn't pass the job on to the scheduler, and it can return early. The query loop can then wait until the next PollInterval to run the main query; no extra query is needed.

Double-checking the cancellation state of the job before passing it on to the scheduler, or within the scheduler, would mean making another query to Buildkite at that point.

So while I think my approach doesn't look particularly clean (I could probably make it look nicer), it does avoid the extra query.
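
To make the block-until-token-or-stale idea concrete, here is a minimal sketch, assuming the token bucket is a buffered channel pre-filled with max-in-flight tokens. Names and structure are illustrative, not the PR's actual code.

```go
import "context"

// Limiter caps in-flight jobs; a sketch of the idea, shape assumed.
type Limiter struct {
	tokens  chan struct{} // pre-filled with one value per allowed job
	handler Handler       // next handler in the chain (the scheduler)
}

func (l *Limiter) Create(ctx context.Context, job Job) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-job.StaleCh:
		// The job info went stale while waiting for a token: skip it.
		// The next PollInterval query returns the job again if it is
		// still meant to run.
		return nil
	case <-l.tokens:
		// Token acquired while the info is still fresh: pass it on.
		// (The token goes back into the bucket when the job finishes.)
		return l.handler.Create(ctx, job)
	}
}
```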

Contributor

I see, so the idea is to keep the query loop as the main source of truth for up-to-date data and avoid creating a separate code path for explicit job state refresh? Sounds reasonable!

I guess, coming from Reactive Streams, I'd expect one channel to be enough to either provide a valid object or indicate cancellation/staleness by closing it. The query-loop logic would then distribute updates about jobs to the interested channels, or close those channels when a job is cancelled.

Your approach seems to solve that, so there is no issue; just thinking out loud :)
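
One plausible producer-side shape, as a hedged sketch: every job in a poll batch shares a single stale channel that is closed after a fixed age, so no per-job polling of Buildkite is needed to keep the signal fresh. The staleAge parameter and runBatch helper are hypothetical, not taken from the PR.

```go
import (
	"context"
	"time"
)

// runBatch hands one poll's results to the handler; sketch only.
func runBatch(ctx context.Context, h Handler, jobs []*api.CommandJob, staleAge time.Duration) {
	stale := make(chan struct{})
	time.AfterFunc(staleAge, func() { close(stale) }) // one close per batch

	for _, cj := range jobs { // the "inner loop" from the discussion above
		// Create may block in the limiter, but it can bail out via the
		// shared stale channel instead of scheduling outdated work.
		_ = h.Create(ctx, Job{CommandJob: cj, StaleCh: stale})
	}
}
```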

DrJosh9000 force-pushed the fix-scheduling-for-cancelled-jobs branch 3 times, most recently from eb4ddd9 to befd443 (October 1, 2024, 23:56).
DrJosh9000 force-pushed the fix-scheduling-for-cancelled-jobs branch from befd443 to 1b33b1e (October 2, 2024, 04:01).
wolfeidau (Contributor) left a comment:

Tough problem.

DrJosh9000 merged commit 68932d3 into main on Oct 6, 2024 (1 check passed).
DrJosh9000 deleted the fix-scheduling-for-cancelled-jobs branch on October 6, 2024 at 23:50.
Successfully merging this pull request may close issue #382: Controller keeps scheduling Pods from a cancelled Build.