
Don't schedule using stale job info #389

Merged · 3 commits into main from fix-scheduling-for-cancelled-jobs · Oct 6, 2024

Conversation

DrJosh9000 (Contributor)

Why

Fixes #382

What

Change the Handler interface to accept a new Job type that includes a channel, closed when the job information becomes stale. The limiter can use this to time out while waiting for a token to become available, so that stale jobs are never handed to the scheduler:

```go
// Job is the new Handler input: the API job plus a staleness signal.
type Job struct {
	*api.CommandJob

	// Closed when the job information becomes stale.
	StaleCh <-chan struct{}
}
```
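
For reference, a hedged sketch of the consuming side: the Create method name matches the discussion below, but the exact interface shape is assumed, not copied from the PR.

```go
// Handler is what the monitor hands jobs to (sketch; shape assumed).
type Handler interface {
	Create(ctx context.Context, job Job) error
}
```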
Contributor

Hm, I thought the fix would be to double-check the cancellation (etc.) state of the Job/Build before actually scheduling it onto K8s.

Having a callback (channel) for when information becomes stale seems like a much more complicated way to achieve that, with unclear benefits, and it requires running a polling loop to keep the channel updated, which will load up the Buildkite backend quite a lot with these polling checks…

Sorry, I'm not very proficient with Golang. Can you please help me understand why this approach is better? Thanks!

DrJosh9000 (Contributor, Author) · Oct 1, 2024

If there were no limiter in place, Buildkite would receive a query every PollInterval for jobs to run (call this the query loop), and the monitor would loop over the results (the inner loop), calling handler.Create for each of them.

In the previous PR on the limiter, I changed how the limiter worked to introduce blocking on the "token bucket", but this caused the bug: a job returned from the query a long time ago could be next in the monitor's inner loop around handler.Create, so by the time the limiter finally gets a token, the job could already be cancelled.

With the StaleCh approach, the limiter can block until either there's a token or the job information is stale. If it's stale, the limiter shouldn't pass the job on to the scheduler, and it can return early. The query loop can then wait until the next PollInterval to run the main query; no extra query is needed.

Double-checking the cancellation state of the job before passing it on to the scheduler, or within the scheduler, would mean making another query to Buildkite at that point.

So while I think my approach doesn't look particularly clean (I could probably make it look nicer), it does avoid the extra query.
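
To make the block-until-token-or-stale idea concrete, here is a minimal sketch, assuming the token bucket is a buffered channel pre-filled with max-in-flight tokens. Names and structure are illustrative, not the PR's actual code.

```go
import "context"

// Limiter caps in-flight jobs; a sketch of the idea, shape assumed.
type Limiter struct {
	tokens  chan struct{} // pre-filled with one value per allowed job
	handler Handler       // next handler in the chain (the scheduler)
}

func (l *Limiter) Create(ctx context.Context, job Job) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-job.StaleCh:
		// The job info went stale while waiting for a token: skip it.
		// The next PollInterval query returns the job again if it is
		// still meant to run.
		return nil
	case <-l.tokens:
		// Token acquired while the info is still fresh: pass it on.
		// (The token goes back into the bucket when the job finishes.)
		return l.handler.Create(ctx, job)
	}
}
```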

Contributor

I see, so the idea is to keep the query loop as the main source of truth for up-to-date data and avoid creating a separate code path for explicit job state refresh? Sounds reasonable!

I guess, coming from Reactive Streams, I'd expect one channel to be enough to either provide a valid object or indicate cancellation/staleness by closing it. The query-loop logic would then distribute updates about jobs to the interested channels, or close those channels when a job is cancelled.

Your approach seems to solve that, so there is no issue; just thinking out loud :)
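
One plausible producer-side shape, as a hedged sketch: every job in a poll batch shares a single stale channel that is closed after a fixed age, so no per-job polling of Buildkite is needed to keep the signal fresh. The staleAge parameter and runBatch helper are hypothetical, not taken from the PR.

```go
import (
	"context"
	"time"
)

// runBatch hands one poll's results to the handler; sketch only.
func runBatch(ctx context.Context, h Handler, jobs []*api.CommandJob, staleAge time.Duration) {
	stale := make(chan struct{})
	time.AfterFunc(staleAge, func() { close(stale) }) // one close per batch

	for _, cj := range jobs { // the "inner loop" from the discussion above
		// Create may block in the limiter, but it can bail out via the
		// shared stale channel instead of scheduling outdated work.
		_ = h.Create(ctx, Job{CommandJob: cj, StaleCh: stale})
	}
}
```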

DrJosh9000 force-pushed the fix-scheduling-for-cancelled-jobs branch 3 times, most recently from eb4ddd9 to befd443 (October 1, 2024, 23:56).
DrJosh9000 force-pushed the fix-scheduling-for-cancelled-jobs branch from befd443 to 1b33b1e (October 2, 2024, 04:01).
wolfeidau (Contributor) left a comment:

Tough problem.

DrJosh9000 merged commit 68932d3 into main on Oct 6, 2024 (1 check passed).
DrJosh9000 deleted the fix-scheduling-for-cancelled-jobs branch on October 6, 2024 at 23:50.
Successfully merging this pull request may close issue #382: Controller keeps scheduling Pods from a cancelled Build.