Don't schedule using stale job info #389
Conversation
```go
	*api.CommandJob

	// Closed when the job information becomes stale.
	StaleCh <-chan struct{}
```
Hm, I thought the fix would be to double-check the cancellation (etc.) state of the Job/Build before actually scheduling it onto K8S.
Having a callback (channel) for when information becomes stale seems like a much more complicated way to achieve that, with unclear benefits, and it requires running a polling loop to keep the channel updated, which will load up the Buildkite backend quite a lot with these polling checks…
Sorry, I'm not very proficient with Golang - can you please help me understand why this approach is better? Thanks!
If there was no limiter in place, then Buildkite would receive a query every `PollInterval` for jobs to run (call this the query loop), and monitor would loop over the result (the inner loop), calling `handler.Create` for each of them.

In the previous PR on the limiter, I changed how the limiter worked to introduce blocking on the "token bucket", but this causes the bug: a job that was returned from the query a long time ago could be next in the inner loop in the Monitor around `handler.Create`, so that by the time the limiter finally gets a token, the job could already be cancelled.

With the `StaleCh` approach, the limiter can block until either there's a token or the job information is stale. If it's stale, the limiter shouldn't pass the job on to the scheduler, and it can return early. Then the query loop can wait until the next `PollInterval` to run the main query - no extra query is needed.

Double-checking the cancellation state of the job before passing it on to the scheduler, or double-checking within the scheduler, would mean making another query to Buildkite at that point. So while I think my approach doesn't look particularly clean (I could probably make it look nicer), it does avoid the extra query.
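For concreteness, here's a minimal sketch of that blocking step, assuming a buffered channel as the token bucket; the identifiers (`Job`, `Handler`, `MaxInFlightLimiter`, `tokens`) are illustrative, not necessarily the PR's exact code:

```go
package limiter

import "context"

// Job wraps the command job plus the channel closed when its info goes stale.
type Job struct {
	// (*api.CommandJob fields elided)
	StaleCh <-chan struct{}
}

// Handler is the next stage in the pipeline (ultimately the scheduler).
type Handler interface {
	Create(ctx context.Context, job *Job) error
}

// MaxInFlightLimiter admits at most cap(tokens) jobs at once.
type MaxInFlightLimiter struct {
	next   Handler
	tokens chan struct{} // token bucket: capacity = max in-flight jobs
}

// Create blocks until a token is free or the job information becomes stale.
func (l *MaxInFlightLimiter) Create(ctx context.Context, job *Job) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-job.StaleCh:
		// Stale: drop the job instead of scheduling it. The next
		// PollInterval query will return it again if it's still
		// runnable, so no extra query to Buildkite is needed here.
		return nil
	case l.tokens <- struct{}{}:
		// Token acquired; safe to hand the job to the scheduler.
		return l.next.Create(ctx, job)
	}
}
```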
I see - so the idea is to keep the query loop as the main source of truth for up-to-date data and avoid creating a separate code path for explicit job-state refresh? Sounds reasonable!
I guess, coming from Reactive Streams, I'd expect one channel to be enough to either provide a valid object or indicate cancellation/staleness by closing it. The query-loop logic would then distribute the updates about jobs to the interested channels, or close those channels when a job is cancelled.
Your approach seems to solve that, so there is no issue - just thinking out loud :)
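As a rough illustration of that single-channel alternative (purely hypothetical, reusing the `Job` type from the sketch above - not code from this PR):

```go
// The query loop would send refreshed job data on the channel and close it
// to signal cancellation/staleness, so one channel carries both meanings.
func consume(updates <-chan *Job) {
	for job := range updates {
		// Each receive is the newest known state of the job.
		_ = job
	}
	// Channel closed: the job was cancelled or its info went stale.
}
```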
force-pushed from eb4ddd9 to befd443
Co-authored-by: Artem Zinnatullin <ceo@abstractny.gay>
force-pushed from befd443 to 1b33b1e
Tough problem.
Why
Fixes #382

What
Change the `Handler` interface to accept a new `Job` type that includes a channel that is closed when the job information becomes stale. The limiter can use this to time out while waiting for a token to become available, so that stale jobs won't be given to the scheduler.
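One plausible shape for the producer side, assuming staleness is tied to `PollInterval` with a timer (an illustration, not necessarily the PR's exact mechanics; `Handler` and `Job` as in the sketch above):

```go
import (
	"context"
	"time"
)

// runBatch hands one poll result to the handler, marking the whole batch
// stale once pollInterval elapses, since the next query supersedes it.
// handler.Create may block in the limiter; the closed StaleCh is what lets
// it bail out instead of scheduling a cancelled job.
func runBatch(ctx context.Context, handler Handler, jobs []*Job, pollInterval time.Duration) {
	staleCh := make(chan struct{})
	timer := time.AfterFunc(pollInterval, func() { close(staleCh) })
	defer timer.Stop()

	for _, job := range jobs {
		job.StaleCh = staleCh
		if err := handler.Create(ctx, job); err != nil {
			return
		}
	}
}
```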