garm webhook && metrics/o11y #272
Comments
Hi @pathcl. The scope of GARM is limited to successfully spinning up runners and making them available to the workflow jobs that are triggered on GitHub. Everything we add to GARM is geared towards that scope. I'll go through each point in turn:
Indeed, this is something that can be addressed at the same time as adding metrics to providers (see below). If the stuck workflow is a symptom of a stuck runner or provider, then metrics added to providers should surface it. Otherwise, it's outside the scope of GARM to watch workflows themselves. We only care about the workflow jobs that we record. The distinction is important. We may not record all jobs for various reasons:
Sadly, there is no efficient way to fix either of the last two scenarios. We have orgs which may have many repos, and we have enterprises which may have many orgs, each of which may have many repos. Workflows only exist at the repo level, so if we attempt to ingest any workflows we missed, it means hammering the GH API for workflows on potentially thousands of repos. The only potential workaround: if the operator of GARM knows that GARM or GitHub was down for a while and some jobs were missed, they can raise min-idle-runners to match max-runners until the queue on their GitHub repos has been consumed, and then set min-idle-runners back to its original value.
This is one area where we need to improve. We have metrics for the GH API calls, but no metrics for provider calls. We don't currently see if a runner just failed to reach
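To make the provider-metrics idea a bit more concrete, here is a minimal sketch of how calls into a provider could be wrapped so that totals, failures and durations are recorded per operation. This is not GARM's implementation: `providerMetrics`, `observe`, `errorRate` and the provider and operation labels are all hypothetical names made up for illustration, and it uses only the Go standard library. A real version would presumably export these values through the same metrics endpoint as the existing GH API metrics rather than keep them in a map.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// callStats accumulates outcomes for one provider operation.
type callStats struct {
	Total    int
	Failed   int
	Duration time.Duration
}

// providerMetrics is a hypothetical in-memory registry keyed by
// "provider/operation". It is an illustration, not part of GARM.
type providerMetrics struct {
	mu    sync.Mutex
	stats map[string]callStats
}

func newProviderMetrics() *providerMetrics {
	return &providerMetrics{stats: map[string]callStats{}}
}

// observe runs call, then records whether it failed and how long it took.
// The error from call is returned unchanged to the caller.
func (m *providerMetrics) observe(provider, op string, call func() error) error {
	start := time.Now()
	err := call()
	elapsed := time.Since(start)

	m.mu.Lock()
	defer m.mu.Unlock()
	key := provider + "/" + op
	s := m.stats[key]
	s.Total++
	if err != nil {
		s.Failed++
	}
	s.Duration += elapsed
	m.stats[key] = s
	return err
}

// errorRate returns failed/total for an operation, or 0 if it was never called.
func (m *providerMetrics) errorRate(provider, op string) float64 {
	m.mu.Lock()
	defer m.mu.Unlock()
	s := m.stats[provider+"/"+op]
	if s.Total == 0 {
		return 0
	}
	return float64(s.Failed) / float64(s.Total)
}

// errSimulated stands in for a real provider failure in this sketch.
var errSimulated = errors.New("simulated provider failure")

func main() {
	m := newProviderMetrics()
	// The labels below are illustrative only.
	_ = m.observe("example-provider", "CreateInstance", func() error { return nil })
	_ = m.observe("example-provider", "CreateInstance", func() error { return errSimulated })
	fmt.Printf("error rate: %.2f\n", m.errorRate("example-provider", "CreateInstance"))
}
```

Wrapping at the call site keeps the providers themselves untouched, at the cost of only seeing what the caller sees: success or failure and wall-clock time, not why a runner failed to come up.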
We have that. If you look at the function you highlighted above, you should see them in the body of the function: garm/apiserver/controllers/controllers.go Lines 112 to 115 in 8f0d447
garm/apiserver/controllers/controllers.go Lines 120 to 123 in 8f0d447
garm/apiserver/controllers/controllers.go Lines 125 to 128 in 8f0d447
garm/apiserver/controllers/controllers.go Lines 134 to 137 in 8f0d447
We can improve on this. If you have suggestions about what extra info would make sense, we can find a way to add it.
That is something that we can't fix. The scope of GARM is to make a runner available to a workflow job. As long as we receive a webhook for a |
we are able to "calculate" some kind of an error rate when it came to provider-interaction (this metric is part of the |
Slightly off topic, but related to this discussion. We are operating GARM at the enterprise level. To make this work, we receive every action event from GitHub (as described in the GARM documentation). That means we get a lot of events, even for jobs we are not responsible for. To get more insight into our users/customers, we store the information that is already in the event payload in a database. With that we are able, for example, to see how late GitHub delivers events to our system. |
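As a rough illustration of that delay measurement, the sketch below compares a timestamp from a `workflow_job` webhook payload with the time the event was received. It is a minimal example under stated assumptions, not what anyone in this thread actually runs: `deliveryDelay`, `eventTime`, `mustParse` and `samplePayload` are hypothetical names, and the mapping of action to timestamp (`queued` to `created_at`, `in_progress` to `started_at`, `completed` to `completed_at`) is one reasonable choice rather than an established convention.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"time"
)

// workflowJobEvent holds only the fields of a workflow_job webhook
// payload that this sketch needs. Pointers let JSON null stay nil.
type workflowJobEvent struct {
	Action      string `json:"action"`
	WorkflowJob struct {
		CreatedAt   *time.Time `json:"created_at"`
		StartedAt   *time.Time `json:"started_at"`
		CompletedAt *time.Time `json:"completed_at"`
	} `json:"workflow_job"`
}

// eventTime picks the payload timestamp that matches the action.
// The action-to-field mapping is an assumption of this sketch.
func (e workflowJobEvent) eventTime() (time.Time, error) {
	var ts *time.Time
	switch e.Action {
	case "queued":
		ts = e.WorkflowJob.CreatedAt
	case "in_progress":
		ts = e.WorkflowJob.StartedAt
	case "completed":
		ts = e.WorkflowJob.CompletedAt
	}
	if ts == nil {
		return time.Time{}, errors.New("no usable timestamp for action " + e.Action)
	}
	return *ts, nil
}

// deliveryDelay returns how long after the event's own timestamp the
// webhook was received.
func deliveryDelay(payload []byte, receivedAt time.Time) (time.Duration, error) {
	var ev workflowJobEvent
	if err := json.Unmarshal(payload, &ev); err != nil {
		return 0, err
	}
	ts, err := ev.eventTime()
	if err != nil {
		return 0, err
	}
	return receivedAt.Sub(ts), nil
}

// mustParse is a small helper for the example below.
func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

// samplePayload is a hand-written, heavily trimmed payload for illustration.
const samplePayload = `{"action":"queued","workflow_job":{"created_at":"2024-01-01T10:00:00Z","started_at":"2024-01-01T10:00:00Z","completed_at":null}}`

func main() {
	d, err := deliveryDelay([]byte(samplePayload), mustParse("2024-01-01T10:00:07Z"))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("delivery delay:", d)
}
```

The result is only as good as the clocks involved: the payload timestamps have one-second resolution, and any skew between GitHub's clock and the receiver's shows up directly in the number, so it is better read as a trend than as an exact latency.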
Hello folks,
One of the challenges with runners and GitHub Actions, even after years, is still observability.
I'd like to know if we have plans to work on o11y for garm's webhook.
garm/apiserver/controllers/controllers.go
Line 98 in 8f0d447
Use case(s)