
Reduce GetWorkflowJobByID GH requests #253

Conversation

bavarianbidi (Contributor):

As our garm instance runs at the enterprise level, we were seeing a high error rate on GetWorkflowJobByID requests towards GitHub.

This is mostly because our enterprise token does not have (and also does not need) repo-level read permissions.
The current implementation on main already checks whether the garm instance is responsible for a given webhook event (by filtering on pool labels). The proposed implementation only moves that label check a few lines up, so we return a bit earlier and thereby reduce the number of API calls towards GitHub.


Commit a0a344ba99097fb7a1be959c11c0458558f3b600 is a small restructuring that should make some conditions obsolete and improve readability.

It's only necessary to fetch the workflow job information if the runs-on
labels defined by the job match the labels GARM is responsible for.

This will reduce the number of API calls to a minimum.

Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
@gabriel-samfira gabriel-samfira (Member) left a comment:

This is great! Just a couple of comments.

slog.WarnContext(
    r.ctx, "no pools matching tags; not recording job",
    "requested_tags", strings.Join(jobParams.Labels, ", "))
return errors.New("no pools matching tags")
gabriel-samfira (Member):

Do we need to return an error? Not having a pool to handle a job isn't really an error. We could get all sorts of webhooks we don't need, even for jobs meant for the builtin GitHub runners. We can just return nil here.
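A minimal sketch of the suggested change (function and type names here are hypothetical stand-ins, not GARM's actual API): when no pool matches the requested labels, the handler logs and returns nil instead of surfacing an error.

```go
package main

import "fmt"

// poolMatches reports whether any pool serves all of the requested labels.
// Hypothetical stand-in for GARM's real pool lookup.
func poolMatches(poolLabels map[string][]string, requested []string) bool {
	for _, labels := range poolLabels {
		set := make(map[string]bool, len(labels))
		for _, l := range labels {
			set[l] = true
		}
		matched := 0
		for _, r := range requested {
			if set[r] {
				matched++
			}
		}
		if matched == len(requested) {
			return true
		}
	}
	return false
}

// handleJob ignores jobs no pool can serve instead of treating them as errors.
func handleJob(poolLabels map[string][]string, requested []string) error {
	if !poolMatches(poolLabels, requested) {
		// Not an error: webhooks for the builtin GitHub runners land here too.
		return nil
	}
	// ... record the job and scale the matching pool ...
	return nil
}

func main() {
	pools := map[string][]string{"linux-pool": {"self-hosted", "linux"}}
	// "ubuntu-latest" targets the builtin runners; the event is silently ignored.
	fmt.Println(handleJob(pools, []string{"ubuntu-latest"}))
}
```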

bavarianbidi (Contributor, Author):

You're right, but I used an error here to be able to identify it within apiserver/controllers/controller.go when we generate a metric.

TBH, during implementation I didn't like it that much, but I decided to do a first implementation that way and wait for your feedback :-). Will do something different here.

gabriel-samfira (Member):

In the current implementation from main, the check for a suitable pool is done only if the job is new or if it was not yet recorded by GARM (we record even jobs that are seen for the first time in a state other than queued). This is done because between the time when a queued job is recorded and the time that same job transitions to another state, a suitable pool may be removed. We may have a pool when the job comes in as queued but by the time it transitions to completed a user may have deleted that pool. This can happen if a different runner picked up the job and we were free to remove the pool.

That being said, we still need to transition the job status correctly. If a queued job comes in and we have no pool, we can simply not record it. But if an in_progress update comes in for a job we already recorded, we need to properly transition that job even if we're not the ones handling it.

I think that the current implementation of HandleWorkflowJob() does too many things and should be refactored. But that will probably be done as part of a larger effort to refactor the whole pool/loop workflow.

Regarding the issue at hand of too many attempts to GetWorkflowJobByID, that call should only run in one state of the job, namely when the job is in completed state. The implementation in main checks if the job is not in queued state before attempting to fetch runner info from the workflow job. So that check should only run for situations where the job is either in_progress or completed and only if the runner name is empty. We do a further check if the job is skipped or canceled. In those 2 cases, we don't attempt to get the runner name, because a runner may not have been allocated.

I am curious what state the job is in when this actually does get called. We may be able to tweak the logic to not look up the runner name. We may even decide to remove that whole block of code altogether if it doesn't even make sense to try and get the runner name. Would it be possible to log the job itself and get that info in your env?
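The gating described above can be sketched as a single predicate (field names follow the webhook payload; the helper itself is illustrative, not GARM's code): only call GetWorkflowJobByID when the job is completed, was not skipped or cancelled, and the payload carried no runner name. Both spellings of "cancelled" are accepted defensively.

```go
package main

import "fmt"

// needsRunnerLookup reports whether fetching the workflow job via the API
// is worthwhile: only for completed jobs that were neither skipped nor
// cancelled and whose webhook payload carried no runner name.
// Hypothetical helper sketching the logic discussed above.
func needsRunnerLookup(status, conclusion, runnerName string) bool {
	if status != "completed" {
		// queued / in_progress events never trigger the lookup.
		return false
	}
	switch conclusion {
	case "skipped", "canceled", "cancelled":
		// A runner may never have been allocated; nothing to look up.
		return false
	}
	return runnerName == ""
}

func main() {
	fmt.Println(needsRunnerLookup("completed", "success", ""))    // true: one API call
	fmt.Println(needsRunnerLookup("queued", "", ""))              // false
	fmt.Println(needsRunnerLookup("completed", "cancelled", "")) // false
}
```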

gabriel-samfira (Member):

okay, I found why we still make API calls needlessly.

if job.WorkflowJob.Conclusion == "skipped" || job.WorkflowJob.Conclusion == "canceled" {

I used the US English form canceled instead of the UK English form cancelled (which has two Ls). 😞

if job.WorkflowJob.RunnerName == "" {
    // We may still get no runner name. In situations such as jobs being cancelled before a runner had the chance
    // to pick up the job, the runner name is not available from the API.
    if job.WorkflowJob.Conclusion != "skipped" && job.WorkflowJob.Conclusion != "canceled" {
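Since the webhook payload uses the UK spelling, one way to avoid this class of typo (a sketch, not GARM's actual fix) is a small helper that accepts both forms:

```go
package main

import "fmt"

// concludedWithoutRunner reports whether a job finished in a state where
// no runner name is expected. It accepts both US and UK spellings to
// avoid the canceled/cancelled typo described above. Hypothetical helper.
func concludedWithoutRunner(conclusion string) bool {
	switch conclusion {
	case "skipped", "canceled", "cancelled":
		return true
	}
	return false
}

func main() {
	fmt.Println(concludedWithoutRunner("cancelled")) // true: skip the API call
	fmt.Println(concludedWithoutRunner("success"))   // false
}
```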
gabriel-samfira (Member):

There is very little chance that a workflow job that is in_progress will have the conclusion set to skipped or canceled. I think we can drop this conditional statement. I am curious if the GH API would be so flaky as to not have the runner name set in a job that is marked as in_progress. I think that would be a huge bug, and we shouldn't really account for it.

Long story short, I think we can drop this check and just leave the one in the case where the job is completed.

Maybe just error out if the name is empty.

bavarianbidi (Contributor, Author):

I was already wondering about this condition, TBH.
To get a better understanding, I'd like to add a few more debug log messages here so we can see some of the relevant job events we get from GH.

@@ -121,6 +121,11 @@ func (a *APIController) handleWorkflowJobEvent(ctx context.Context, w http.Respo
"false", // label: valid
"signature_invalid", // label: reason
).Inc()
case strings.Contains(err.Error(), "no pools matching tags"):
gabriel-samfira (Member):

Not sure this is needed given the comment below.

@bavarianbidi (Contributor, Author):

Closing this PR in favor of #254.
