Auto scaling not taking into account whether a repo has access to a runner group. #236
Comments
Hmm. This is an interesting case. The ability to schedule a runner based on runner group is currently missing in GARM. For example, we completely ignore workflows such as the following:

    runs-on:
      group: my-runner-group

We need to be able to schedule a pool based on group as well as labels:

    runs-on:
      group: my-runner-group
      labels: ["large", "self-hosted"]

And currently GARM doesn't do anything with the group name. We can register runners in a runner group, but if a workflow comes in with a
Do your workflows explicitly target groups or do they just use
This week I'll be traveling, but next week I will definitely attempt to fix this huge missing feature. We already record the runner group name in the pool definition. There is no reason we can't schedule the proper pool when a job comes in that targets a specific pool. So even if you do use
As a side note, I highly advise to avoid using any of the default labels and use unique label sets if at all possible for a better scheduling experience. |
Yeah unfortunately we already have too many repos with self-hosted as the only label so runner group control is the only way we have to prevent breaking those repos. Currently working on getting the repos updated but we don't own the majority of them so I can't go and make the change directly. Any thoughts on the "selected repos" feature of runner groups? |
I'm guessing the repos that use So there are actually 3 features to potentially implement here:
We need to figure out what the impact is (in terms of API calls to GitHub) of keeping that allow list in sync with repos that have been granted access to a runner group. We also need to take into account that runner groups can be created at the enterprise level and shared with orgs, which in turn can share them with repos. This last aspect may be inconsequential, but we still need to explore it. Using the runner group as a source of truth for 2) seems correct, but those runner groups can be updated at any time to allow/deny public repos, and a runner group can be shared with more repos at any time, all without sending out a webhook. That means we need to poll the runner groups as well. This is a potential pain in the neck, especially if you have huge orgs with many repos and many runner groups. But this is something that definitely needs to be explored more. I can definitely have a look at points 1) and 2) starting next week. For 3) I need some time to think, and hopefully access to an enterprise account. |
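One way to bound the polling cost mentioned above is to cache each runner group's allow list and refresh it only after a time-to-live expires. The following is a minimal sketch, not GARM code: `AllowListCache`, `NewAllowListCache`, and the `fetch` callback are hypothetical names, and the callback stands in for whatever GitHub API calls end up being used.

```go
package main

import (
	"sync"
	"time"
)

// cacheEntry holds one runner group's allow list and when it was fetched.
type cacheEntry struct {
	repos   map[string]bool
	fetched time.Time
}

// AllowListCache is a hypothetical TTL cache keyed by runner group name.
// It is an illustration only and not part of GARM.
type AllowListCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	fetch   func(group string) (map[string]bool, error) // assumed API wrapper
	entries map[string]cacheEntry
}

// NewAllowListCache builds a cache that re-fetches a group's allow list
// once the cached copy is at least ttl old.
func NewAllowListCache(ttl time.Duration, fetch func(string) (map[string]bool, error)) *AllowListCache {
	return &AllowListCache{ttl: ttl, fetch: fetch, entries: map[string]cacheEntry{}}
}

// Allowed reports whether repo is on the allow list of the given group,
// polling the source again only when the cached entry is missing or stale.
func (c *AllowListCache) Allowed(group, repo string) (bool, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	e, ok := c.entries[group]
	if !ok || time.Since(e.fetched) >= c.ttl {
		repos, err := c.fetch(group)
		if err != nil {
			return false, err
		}
		e = cacheEntry{repos: repos, fetched: time.Now()}
		c.entries[group] = e
	}
	return e.repos[repo], nil
}
```

The trade-off is staleness: a repo added to a runner group is not seen until the entry expires, so the TTL would have to be chosen against the API rate limit.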
Yeah. For enterprises, we can only see orgs we've shared the runner group with. If we have an enterprise with many orgs, and those orgs have many repos, we'd need to first list all orgs with access to the runner group, then for each org we'd need to see which repo they share that runner group with and add those repos to the list. It can potentially generate a lot of API requests, and it can be really slow. |
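The fan-out described above (one call to list the orgs, then one call per org to list its repos) can be sketched as follows. This is only an illustration under assumptions: `GroupLister` and `EnterpriseAllowList` are hypothetical names, the two callbacks stand in for the real GitHub API requests, and pagination is ignored, so the real call count would be higher.

```go
package main

// GroupLister bundles the two lookups the traversal needs. Both fields are
// hypothetical stand-ins for GitHub API requests, not GARM functions.
type GroupLister struct {
	// OrgsWithAccess lists the orgs an enterprise runner group is shared with.
	OrgsWithAccess func(group string) ([]string, error)
	// ReposWithAccess lists the repos in org that may use the runner group.
	ReposWithAccess func(org, group string) ([]string, error)
}

// EnterpriseAllowList collects every repo that can reach the runner group
// and also returns how many lookups were issued (1 + number of orgs).
func EnterpriseAllowList(l GroupLister, group string) (map[string]bool, int, error) {
	calls := 1
	orgs, err := l.OrgsWithAccess(group)
	if err != nil {
		return nil, calls, err
	}

	repos := make(map[string]bool)
	for _, org := range orgs {
		calls++
		names, err := l.ReposWithAccess(org, group)
		if err != nil {
			return nil, calls, err
		}
		for _, name := range names {
			repos[name] = true
		}
	}
	return repos, calls, nil
}
```

The returned call count makes the cost visible: it grows linearly with the number of orgs, before pagination is even considered.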
Yeah, didn't know that enterprises had runner groups, that definitely adds an additional dimension of complexity to the solution. You are correct that the self-hosted workflows don't target a runner group, they just rely on there not being any "global" runner groups and on the built-in runner group allow list. For organization runner groups the API impact is 1 API call to get the runner group ID from the runner group name (unless you cache the runner group ID somewhere) and 1 API call to get the list of repositories, more or less |
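The two organization-level calls mentioned above can be sketched roughly like this. The paths are the GitHub REST endpoints for listing an org's runner groups and for listing repository access to a runner group; everything else is an assumption for illustration: `Fetcher`, `RunnerGroupID`, and `AllowedRepos` are hypothetical names, the `Fetcher` is expected to handle authentication and the base URL, and only the first page of results is read.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Fetcher performs an authenticated GET against the GitHub REST API and
// returns the raw body. It is a hypothetical seam, not a GARM type.
type Fetcher func(path string) ([]byte, error)

// runnerGroupList mirrors the fields used from the runner group listing.
type runnerGroupList struct {
	RunnerGroups []struct {
		ID   int64  `json:"id"`
		Name string `json:"name"`
	} `json:"runner_groups"`
}

// repoList mirrors the fields used from the repository access listing.
type repoList struct {
	Repositories []struct {
		FullName string `json:"full_name"`
	} `json:"repositories"`
}

// RunnerGroupID resolves a runner group name to its ID (API call 1).
func RunnerGroupID(get Fetcher, org, name string) (int64, error) {
	body, err := get(fmt.Sprintf("/orgs/%s/actions/runner-groups?per_page=100", org))
	if err != nil {
		return 0, err
	}
	var list runnerGroupList
	if err := json.Unmarshal(body, &list); err != nil {
		return 0, err
	}
	for _, g := range list.RunnerGroups {
		if g.Name == name {
			return g.ID, nil
		}
	}
	return 0, fmt.Errorf("runner group %q not found in org %q", name, org)
}

// AllowedRepos returns the repos granted access to the group (API call 2).
func AllowedRepos(get Fetcher, org string, groupID int64) (map[string]bool, error) {
	body, err := get(fmt.Sprintf("/orgs/%s/actions/runner-groups/%d/repositories?per_page=100", org, groupID))
	if err != nil {
		return nil, err
	}
	var list repoList
	if err := json.Unmarshal(body, &list); err != nil {
		return nil, err
	}
	repos := make(map[string]bool, len(list.Repositories))
	for _, r := range list.Repositories {
		repos[r.FullName] = true
	}
	return repos, nil
}
```

Caching the ID returned by `RunnerGroupID` would reduce this to a single call per refresh, as noted above.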
This is a mess. It seems that jobs in queued state do not set the runner group, even if the workflow explicitly sets:

    runs-on:
      group: my-runner-group
      labels: ["large", "self-hosted"]

So until a job is picked up, we don't know the runner group. This means that we can't really schedule by group name. We can still attempt to create that allowlist/blocklist of repos that we should react to when a job comes in. If the pool has an explicit runner group set and the entity requesting a runner is not allowed in any of the runner groups set on pools, we can ignore the job. But we still have to solve the issue of API requests. Will think about how we can do this in a sane way. |
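The ignore rule described above could look something like the sketch below. It is an assumption-laden illustration, not GARM's implementation: `Pool`, `GroupAccess`, and `ShouldIgnoreJob` are hypothetical names, the pools passed in are assumed to be the ones whose labels already match the job, and a pool without an explicit runner group is treated as "cannot rule the job out".

```go
package main

// Pool is a simplified, hypothetical stand-in for a GARM pool definition.
type Pool struct {
	ID          string
	RunnerGroup string // empty means no explicit runner group is set
}

// GroupAccess maps a runner group name to the set of repos allowed to use it.
type GroupAccess map[string]map[string]bool

// ShouldIgnoreJob reports whether a queued job from repo can be skipped.
// It returns true only when every candidate pool has an explicit runner
// group and repo is not allowed in any of those groups.
func ShouldIgnoreJob(pools []Pool, access GroupAccess, repo string) bool {
	sawGroup := false
	for _, p := range pools {
		if p.RunnerGroup == "" {
			// Assumption: without a group we have no allow list to consult,
			// so the job must still be handled.
			return false
		}
		sawGroup = true
		if access[p.RunnerGroup][repo] {
			// The repo can reach at least one pool's runner group.
			return false
		}
	}
	return sawGroup
}
```

For example, with a single pool in `my-runner-group` and an allow list containing only `acme/api`, a job from `acme/api` is handled while a job from any other repo in the org is ignored.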
Hey @gabriel-samfira have there been any recent breakthroughs on this issue? We have a really inconvenient custom solution in place and would like to move away from it. Just want to add that on top of selected-repos, runner groups also have a selected-workflow functionality that is not being taken into account by the auto-scaler. |
sadly no. They still don't send the runner group as part of a |
Hello, I have found that the current auto scaling mechanism does not take into account whether a repo is registered to the runner group specified by an org pool.
Currently we have many repos that are not registered under the runner group specified for pools. When any of those repos has a queued workflow job, it triggers the auto scale because they are in the same org and the tags match.
Unfortunately most of our existing workflows are set to the self-hosted tag, so they trigger autoscaling of pools that should not scale (since the repos aren't in the runner group, their jobs won't be picked up). This causes all of our pools to constantly pin the runner count at the maximum and recreate runners non-stop (cleaned up by the scaleDown loop).
As a temporary workaround I have created a fork of garm and have implemented the following:
common.GithubRunnerGroup
object with runner group and selected repos info from ghcli.