
Reduce GetWorkflowJobByID GH requests #253

Conversation

bavarianbidi (Contributor):

As our garm instance runs at the enterprise level, we were seeing a high error rate on GetWorkflowJobByID requests towards GitHub.

This is mostly because our enterprise token does not have (and also does not need) repo-level read permissions.
The current implementation on main already checks whether the garm instance is responsible for a given webhook event (by filtering on pool labels). The proposed implementation only moves that label check a few lines up, so we return a bit earlier and thereby reduce the number of API calls towards GitHub.


Commit a0a344ba99097fb7a1be959c11c0458558f3b600 is a small restructuring that should make some conditions obsolete and improve readability.

It's only necessary to fetch the workflow job information if the runs-on
labels defined by the job match the labels GARM is responsible for.

This will reduce the number of API calls to a minimum.

Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
@gabriel-samfira gabriel-samfira (Member) left a comment:

This is great! Just a couple of comments.

slog.WarnContext(
    r.ctx, "no pools matching tags; not recording job",
    "requested_tags", strings.Join(jobParams.Labels, ", "))
return errors.New("no pools matching tags")
gabriel-samfira (Member):

Do we need to return an error? Not having a pool to handle a job isn't really an error. We could get all sorts of webhooks we don't need, even for jobs meant for the builtin GitHub runners. We can just return nil here.
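A minimal sketch of the suggested change (function and type names here are hypothetical stand-ins, not GARM's actual API): when no pool matches the requested labels, the handler logs and returns nil instead of surfacing an error.

```go
package main

import "fmt"

// poolMatches reports whether any pool serves all of the requested labels.
// Hypothetical stand-in for GARM's real pool lookup.
func poolMatches(poolLabels map[string][]string, requested []string) bool {
	for _, labels := range poolLabels {
		set := make(map[string]bool, len(labels))
		for _, l := range labels {
			set[l] = true
		}
		matched := 0
		for _, r := range requested {
			if set[r] {
				matched++
			}
		}
		if matched == len(requested) {
			return true
		}
	}
	return false
}

// handleJob ignores jobs no pool can serve instead of treating them as errors.
func handleJob(poolLabels map[string][]string, requested []string) error {
	if !poolMatches(poolLabels, requested) {
		// Not an error: webhooks for the builtin GitHub runners land here too.
		return nil
	}
	// ... record the job and scale the matching pool ...
	return nil
}

func main() {
	pools := map[string][]string{"linux-pool": {"self-hosted", "linux"}}
	// "ubuntu-latest" targets the builtin runners; the event is silently ignored.
	fmt.Println(handleJob(pools, []string{"ubuntu-latest"}))
}
```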

bavarianbidi (Contributor, Author):

You're right, but I used an error here to be able to identify it within apiserver/controllers/controller.go when we generate a metric.

TBH, during implementation I didn't like it that much, but I decided to do a first implementation that way and wait for your feedback :-). Will do something different here.

gabriel-samfira (Member):

In the current implementation from main, the check for a suitable pool is done only if the job is new or if it was not yet recorded by GARM (we record even jobs that are seen for the first time in a state other than queued). This is done because between the time when a queued job is recorded and the time that same job transitions to another state, a suitable pool may be removed. We may have a pool when the job comes in as queued but by the time it transitions to completed a user may have deleted that pool. This can happen if a different runner picked up the job and we were free to remove the pool.

That being said, we still need to transition the job status correctly. If a queued job comes in and we have no pool, we can simply not record it. But if an in_progress update comes in for a job we already recorded, we need to properly transition that job even if we're not the ones handling it.

I think that the current implementation of HandleWorkflowJob() does too many things and should be refactored. But that will probably be done as part of a larger effort to refactor the whole pool/loop workflow.

Regarding the issue at hand of too many attempts to GetWorkflowJobByID, that call should only run in one state of the job, namely when the job is in completed state. The implementation in main checks if the job is not in queued state before attempting to fetch runner info from the workflow job. So that check should only run for situations where the job is either in_progress or completed and only if the runner name is empty. We do a further check if the job is skipped or canceled. In those 2 cases, we don't attempt to get the runner name, because a runner may not have been allocated.

I am curious what state the job is in when this actually does get called. We may be able to tweak the logic to not look up the runner name. We may even decide to remove that whole block of code altogether if it doesn't even make sense to try and get the runner name. Would it be possible to log the job itself and get that info in your env?
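The gating described above can be sketched as a single predicate (field names follow the webhook payload; the helper itself is illustrative, not GARM's code): only call GetWorkflowJobByID when the job is completed, was not skipped or cancelled, and the payload carried no runner name. Both spellings of "cancelled" are accepted defensively.

```go
package main

import "fmt"

// needsRunnerLookup reports whether fetching the workflow job via the API
// is worthwhile: only for completed jobs that were neither skipped nor
// cancelled and whose webhook payload carried no runner name.
// Hypothetical helper sketching the logic discussed above.
func needsRunnerLookup(status, conclusion, runnerName string) bool {
	if status != "completed" {
		// queued / in_progress events never trigger the lookup.
		return false
	}
	switch conclusion {
	case "skipped", "canceled", "cancelled":
		// A runner may never have been allocated; nothing to look up.
		return false
	}
	return runnerName == ""
}

func main() {
	fmt.Println(needsRunnerLookup("completed", "success", ""))    // true: one API call
	fmt.Println(needsRunnerLookup("queued", "", ""))              // false
	fmt.Println(needsRunnerLookup("completed", "cancelled", "")) // false
}
```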

gabriel-samfira (Member):

okay, I found why we still make API calls needlessly.

if job.WorkflowJob.Conclusion == "skipped" || job.WorkflowJob.Conclusion == "canceled" {

I used the US English form canceled instead of the UK English form cancelled (which has two Ls). 😞

if job.WorkflowJob.RunnerName == "" {
    // We may still get no runner name. In situations such as jobs being cancelled before a runner had the chance
    // to pick up the job, the runner name is not available from the API.
    if job.WorkflowJob.Conclusion != "skipped" && job.WorkflowJob.Conclusion != "canceled" {
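Since the webhook payload uses the UK spelling, one way to avoid this class of typo (a sketch, not GARM's actual fix) is a small helper that accepts both forms:

```go
package main

import "fmt"

// concludedWithoutRunner reports whether a job finished in a state where
// no runner name is expected. It accepts both US and UK spellings to
// avoid the canceled/cancelled typo described above. Hypothetical helper.
func concludedWithoutRunner(conclusion string) bool {
	switch conclusion {
	case "skipped", "canceled", "cancelled":
		return true
	}
	return false
}

func main() {
	fmt.Println(concludedWithoutRunner("cancelled")) // true: skip the API call
	fmt.Println(concludedWithoutRunner("success"))   // false
}
```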
gabriel-samfira (Member):

There is very little chance that a workflow job that is in_progress will have the conclusion set to skipped or canceled. I think we can drop this conditional statement. I am curious if the GH API would be so flaky as to not have the runner name set in a job that is marked as in_progress. I think that would be a huge bug, and we shouldn't really account for it.

Long story short, I think we can drop this check and just leave the one in the case where the job is completed.

Maybe just error out if the name is empty.

bavarianbidi (Contributor, Author):

I was already wondering about this condition, TBH.
To get a better understanding, I'd like to add a few more debug log messages here so we can see some of the relevant job events we get from GH.

@@ -121,6 +121,11 @@ func (a *APIController) handleWorkflowJobEvent(ctx context.Context, w http.Respo
"false", // label: valid
"signature_invalid", // label: reason
).Inc()
case strings.Contains(err.Error(), "no pools matching tags"):
gabriel-samfira (Member):

Not sure this is needed given the comment below.

@bavarianbidi (Contributor, Author):

Closing this PR in favor of #254.
