Reduce getworkflowjobbyid gh requests #253

Closed
5 changes: 5 additions & 0 deletions apiserver/controllers/controllers.go
@@ -121,6 +121,11 @@ func (a *APIController) handleWorkflowJobEvent(ctx context.Context, w http.Respo
"false", // label: valid
"signature_invalid", // label: reason
).Inc()
case strings.Contains(err.Error(), "no pools matching tags"):
Member:

Not sure this is needed given the comment below.

metrics.WebhooksReceived.WithLabelValues(
"false", // label: valid
"not_responsible", // label: reason
).Inc()
default:
metrics.WebhooksReceived.WithLabelValues(
"false", // label: valid
77 changes: 46 additions & 31 deletions runner/pool/pool.go
@@ -131,7 +131,7 @@
keyMux *keyMutex
}

func (r *basePoolManager) HandleWorkflowJob(job params.WorkflowJob) error {

Check failure on line 134 in runner/pool/pool.go (GitHub Actions / Linters): cyclomatic complexity 37 of func `(*basePoolManager).HandleWorkflowJob` is high (> 30) (gocyclo)
if err := r.ValidateOwner(job); err != nil {
return errors.Wrap(err, "validating owner")
}
@@ -139,6 +139,23 @@
var jobParams params.Job
var err error
var triggeredBy int64

// especially on an enterprise level we get a lot of webhooks that are not meant for us.
// if there is no pool that matches the labels, we should ignore the job and return earlier.
potentialPools, err := r.store.FindPoolsMatchingAllTags(r.ctx, r.entity.EntityType, r.entity.ID, jobParams.Labels)
if err != nil {
slog.With(slog.Any("error", err)).ErrorContext(
r.ctx, "failed to find pools matching tags; not recording job",
"requested_tags", strings.Join(jobParams.Labels, ", "))
return errors.Wrap(err, "getting pools matching tags")
}
if len(potentialPools) == 0 {
slog.WarnContext(
r.ctx, "no pools matching tags; not recording job",
"requested_tags", strings.Join(jobParams.Labels, ", "))
return errors.New("no pools matching tags")
Member:

Do we need to return an error? Not having a pool to handle a job isn't really an error. We could get all sorts of webhooks we don't really need, even for jobs meant for the builtin github runners. We can just return nil here.
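A minimal sketch of that suggestion, reusing the surrounding code from this diff (if the controller still needs to count these webhooks, it would have to key off something other than the error string):

if len(potentialPools) == 0 {
	slog.WarnContext(
		r.ctx, "no pools matching tags; not recording job",
		"requested_tags", strings.Join(jobParams.Labels, ", "))
	// Not a job for this installation; nothing to record and nothing to error about.
	return nil
}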

Contributor Author:

You're right, but I used an error here to be able to identify it within apiserver/controllers/controllers.go when we generate a metric.

TBH, during implementation I didn't like it that much, but I decided to do a first impl. that way and wait for your feedback :-) - will do something different here.
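One way to keep the "not_responsible" metric without matching on the error string would be a sentinel error the controller can test with errors.Is — a hypothetical sketch, not part of this PR (the name ErrNoPoolsMatchingTags is made up):

// In runner/pool (hypothetical sentinel):
var ErrNoPoolsMatchingTags = errors.New("no pools matching tags")

// HandleWorkflowJob would return ErrNoPoolsMatchingTags, and the switch in
// apiserver/controllers/controllers.go could match it without string comparison:
case errors.Is(err, pool.ErrNoPoolsMatchingTags):
	metrics.WebhooksReceived.WithLabelValues(
		"false",           // label: valid
		"not_responsible", // label: reason
	).Inc()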

Member:

In the current implementation from main, the check for a suitable pool is done only if the job is new or if it was not yet recorded by GARM (we record even jobs that are seen for the first time in a state other than queued). This is done because between the time when a queued job is recorded and the time that same job transitions to another state, a suitable pool may be removed. We may have a pool when the job comes in as queued but by the time it transitions to completed a user may have deleted that pool. This can happen if a different runner picked up the job and we were free to remove the pool.

That being said, we still need to transition the job status correctly. If a queued job comes in and we have no pool, we can just not record it. If an in_progress update comes in for a job we already recorded, we need to properly transition that even if we're not the ones handling it.

I think that the current implementation of HandleWorkflowJob() does too many things and should be refactored. But that will probably be done as part of a larger effort to refactor the whole pool/loop workflow.

Regarding the issue at hand of too many attempts to GetWorkflowJobByID, that call should only run in one state of the job, namely when the job is in completed state. The implementation in main checks if the job is not in queued state before attempting to fetch runner info from the workflow job. So that check should only run for situations where the job is either in_progress or completed and only if the runner name is empty. We do a further check if the job is skipped or canceled. In those 2 cases, we don't attempt to get the runner name, because a runner may not have been allocated.

I am curious what state the job is in when this actually does get called. We may be able to tweak the logic to not look up the runner name. We may even decide to remove that whole block of code altogether if it doesn't even make sense to try and get the runner name. Would it be possible to log the job itself and get that info in your env?
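A rough sketch of the flow described above, using only helpers already present in this diff (simplified; not the final implementation):

// Only completed jobs should ever need the extra API call, and only when the
// webhook did not already carry a runner name. Skipped/cancelled jobs may never
// have had a runner allocated, so the lookup is skipped for those as well.
if job.Action == "completed" && job.WorkflowJob.RunnerName == "" {
	if job.WorkflowJob.Conclusion != "skipped" && job.WorkflowJob.Conclusion != "canceled" {
		runnerInfo, err := r.getRunnerDetailsFromJob(job)
		if err != nil && !errors.Is(err, runnerErrors.ErrNotFound) {
			return errors.Wrap(err, "fetching runner details")
		}
		jobParams.RunnerName = runnerInfo.Name
	}
}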

Member:

Okay, I found why we still make API calls needlessly:

if job.WorkflowJob.Conclusion == "skipped" || job.WorkflowJob.Conclusion == "canceled" {

I used the US English form "canceled" instead of the UK English form "cancelled" (which has two Ls). 😞
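If GitHub indeed delivers the two-L spelling, one defensive option (a hypothetical helper, not part of this PR) is to accept both forms, which would also address the goconst lint warning above:

// conclusionMeansNoRunner reports whether a job conclusion implies that no
// runner was ever allocated. Both spellings are accepted because GitHub has
// been observed to send "cancelled" while the original check only looked for
// "canceled".
func conclusionMeansNoRunner(conclusion string) bool {
	switch conclusion {
	case "skipped", "canceled", "cancelled":
		return true
	default:
		return false
	}
}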

defer func() {
// we're updating the job in the database, regardless of whether it was successful or not.
// or if it was meant for this pool or not. Github will send the same job data to all hierarchies
@@ -156,20 +173,6 @@
"job_id", jobParams.ID)
return
}
// This job is new to us. Check if we have a pool that can handle it.
potentialPools, err := r.store.FindPoolsMatchingAllTags(r.ctx, r.entity.EntityType, r.entity.ID, jobParams.Labels)
if err != nil {
slog.With(slog.Any("error", err)).ErrorContext(
r.ctx, "failed to find pools matching tags; not recording job",
"requested_tags", strings.Join(jobParams.Labels, ", "))
return
}
if len(potentialPools) == 0 {
slog.WarnContext(
r.ctx, "no pools matching tags; not recording job",
"requested_tags", strings.Join(jobParams.Labels, ", "))
return
}
}

if _, jobErr := r.store.CreateOrUpdateJob(r.ctx, jobParams); jobErr != nil {
@@ -210,6 +213,20 @@
return errors.Wrap(err, "converting job to params")
}

// Runner name was not set in WorkflowJob by github. We can still attempt to fetch the info we need,
// using the workflow run ID, from the API.
if job.WorkflowJob.RunnerName == "" {
// We may still get no runner name. In situations such as jobs being cancelled before a runner had the chance
// to pick up the job, the runner name is not available from the API.
if job.WorkflowJob.Conclusion != "skipped" && job.WorkflowJob.Conclusion != "canceled" {

Check failure on line 221 in runner/pool/pool.go (GitHub Actions / Linters): string `canceled` has 3 occurrences, make it a constant (goconst)
runnerInfo, err := r.getRunnerDetailsFromJob(job)
if err != nil && !errors.Is(err, runnerErrors.ErrNotFound) {
return errors.Wrap(err, "fetching runner details")
}
jobParams.RunnerName = runnerInfo.Name
}
}

// update instance workload state.
if _, err := r.setInstanceRunnerStatus(jobParams.RunnerName, params.RunnerTerminated); err != nil {
if errors.Is(err, runnerErrors.ErrNotFound) {
@@ -246,6 +263,20 @@
return errors.Wrap(err, "converting job to params")
}

// Runner name was not set in WorkflowJob by github. We can still attempt to fetch the info we need,
// using the workflow run ID, from the API.
if job.WorkflowJob.RunnerName == "" {
// We may still get no runner name. In situations such as jobs being cancelled before a runner had the chance
// to pick up the job, the runner name is not available from the API.
if job.WorkflowJob.Conclusion != "skipped" && job.WorkflowJob.Conclusion != "canceled" {
Member:

There is very little chance that a workflow job that is in_progress will have the conclusion set to skipped or canceled. I think we can drop this conditional statement. I am curious if the GH API would be so flaky as to not have the runner name set in a job that is marked as in_progress. I think that would be a huge bug, and we shouldn't really account for it.

Long story short, I think we can drop this check and just leave the one in the case where the job is completed.

Maybe just error out if the name is empty.
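A sketch of the in_progress branch with that suggestion applied (assumes the surrounding code from this diff):

// For an in_progress job the webhook should already carry the runner name,
// so a missing name is treated as an error instead of triggering an API call.
if job.WorkflowJob.RunnerName == "" {
	return errors.New("in_progress workflow job has no runner name set")
}
jobParams.RunnerName = job.WorkflowJob.RunnerName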

Contributor Author:

Was already wondering about this condition, tbh.
To get a better understanding, I would like to add a few more debug-log messages in here to be able to see some of the relevant job events we get from GH.
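For example, a debug line like this (hypothetical, using only fields already referenced elsewhere in this diff) would show which events actually reach this code path:

slog.DebugContext(
	r.ctx, "workflow job event received",
	"action", job.Action,
	"conclusion", job.WorkflowJob.Conclusion,
	"runner_name", job.WorkflowJob.RunnerName,
	"job_name", job.WorkflowJob.Name)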

runnerInfo, err := r.getRunnerDetailsFromJob(job)
if err != nil && !errors.Is(err, runnerErrors.ErrNotFound) {
return errors.Wrap(err, "fetching runner details")
}
jobParams.RunnerName = runnerInfo.Name
}
}

// update instance workload state.
instance, err := r.setInstanceRunnerStatus(jobParams.RunnerName, params.RunnerActive)
if err != nil {
@@ -954,6 +985,7 @@
StartedAt: job.WorkflowJob.StartedAt,
CompletedAt: job.WorkflowJob.CompletedAt,
Name: job.WorkflowJob.Name,
RunnerName: job.WorkflowJob.RunnerName,
GithubRunnerID: job.WorkflowJob.RunnerID,
RunnerGroupID: job.WorkflowJob.RunnerGroupID,
RunnerGroupName: job.WorkflowJob.RunnerGroupName,
@@ -962,23 +994,6 @@
Labels: job.WorkflowJob.Labels,
}

runnerName := job.WorkflowJob.RunnerName
if job.Action != "queued" && runnerName == "" {
if job.WorkflowJob.Conclusion != "skipped" && job.WorkflowJob.Conclusion != "canceled" {
// Runner name was not set in WorkflowJob by github. We can still attempt to fetch the info we need,
// using the workflow run ID, from the API.
// We may still get no runner name. In situations such as jobs being cancelled before a runner had the chance
// to pick up the job, the runner name is not available from the API.
runnerInfo, err := r.getRunnerDetailsFromJob(job)
if err != nil && !errors.Is(err, runnerErrors.ErrNotFound) {
return jobParams, errors.Wrap(err, "fetching runner details")
}
runnerName = runnerInfo.Name
}
}

jobParams.RunnerName = runnerName

switch r.entity.EntityType {
case params.GithubEntityTypeEnterprise:
jobParams.EnterpriseID = &asUUID