Configuration

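The GARM configuration is a simple TOML file. The sample config file in the testdata folder is fairly well commented and should be enough to get you started. The configuration file is split into several sections, each of which is documented below. The sections are:

  • The default config section
  • The logging section
  • Database configuration
  • Provider configuration
  • The metrics section
  • The JWT authentication config section
  • The API server config section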

The default config section

The default config section holds configuration options that don't need a category of their own, but are essential to the operation of the service. In this section we will detail each of the options available in the default section.

[default]
# Uncomment this line if you'd like to log to a file instead of standard output.
# log_file = "/tmp/runner-manager.log"

# Enable streaming logs via web sockets. Use garm-cli debug-log.
enable_log_streamer = false

# Enable the golang debug server. See the documentation in the "doc" folder for more information.
debug_server = false

The callback_url option

Your runners will call back home with status updates as they install. Once they are set up, they will also send the GitHub agent ID they were allocated. You will need to configure the callback_url option in the garm server config. This URL needs to point to the following API endpoint:

POST /api/v1/callbacks/status

Example of a runner sending status updates:

garm-cli runner show garm-DvxiVAlfHeE7
+-----------------+------------------------------------------------------------------------------------+
| FIELD           | VALUE                                                                              |
+-----------------+------------------------------------------------------------------------------------+
| ID              | 16b96ba2-d406-45b8-ab66-b70be6237b4e                                               |
| Provider ID     | garm-DvxiVAlfHeE7                                                                  |
| Name            | garm-DvxiVAlfHeE7                                                                  |
| OS Type         | linux                                                                              |
| OS Architecture | amd64                                                                              |
| OS Name         | ubuntu                                                                             |
| OS Version      | jammy                                                                              |
| Status          | running                                                                            |
| Runner Status   | idle                                                                               |
| Pool ID         | 8ec34c1f-b053-4a5d-80d6-40afdfb389f9                                               |
| Addresses       | 10.198.117.120                                                                     |
| Status Updates  | 2023-07-08T06:26:46: runner registration token was retrieved                       |
|                 | 2023-07-08T06:26:46: using cached runner found in /opt/cache/actions-runner/latest |
|                 | 2023-07-08T06:26:50: configuring runner                                            |
|                 | 2023-07-08T06:26:56: runner successfully configured after 1 attempt(s)             |
|                 | 2023-07-08T06:26:56: installing runner service                                     |
|                 | 2023-07-08T06:26:56: starting service                                              |
|                 | 2023-07-08T06:26:57: runner successfully installed                                 |
+-----------------+------------------------------------------------------------------------------------+

This URL must be set and must be accessible by the instance. If you wish to restrict access to it, a reverse proxy can be configured to accept requests only from the networks in which the runners managed by garm are spun up. This URL doesn't need to be globally accessible; it only needs to be reachable by the instances.

For example, in a scenario where you expose the API endpoint directly, this setting could look like the following:

callback_url = "https://garm.example.com/api/v1/callbacks"

Authentication is done using a short-lived JWT token that is generated for the particular instance being spun up. That token only allows the instance to update its own status and to fetch metadata for itself. No other API endpoints will work with it. The validity of the token is equal to the pool bootstrap timeout value (default 20 minutes) plus the garm polling interval (5 minutes).
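
As a quick check of that arithmetic, here is a minimal sketch using the default values quoted above. The helper name instance_token_validity is hypothetical and only illustrates the sum; GARM computes the actual expiry internally and may do so differently.

```python
from datetime import timedelta

# Values quoted in the text above; they are defaults, not fixed constants.
DEFAULT_BOOTSTRAP_TIMEOUT = timedelta(minutes=20)
POLLING_INTERVAL = timedelta(minutes=5)


def instance_token_validity(bootstrap_timeout=DEFAULT_BOOTSTRAP_TIMEOUT,
                            polling_interval=POLLING_INTERVAL):
    """Return how long an instance token stays valid (hypothetical helper)."""
    return bootstrap_timeout + polling_interval


print(instance_token_validity())  # prints 0:25:00
```

With the defaults, an instance has 25 minutes to finish bootstrapping before its token stops working. Raising the bootstrap timeout on a pool extends the window by the same amount.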

There is a sample nginx config in the testdata folder. Feel free to customize it in any way you see fit.

The metadata_url option

The metadata URL is the base URL for any information an instance may need to fetch in order to finish setting itself up. As this URL may be placed behind a reverse proxy, you'll need to configure it in the garm config file. Ultimately this URL will need to point to the following garm API endpoint:

GET /api/v1/metadata

This URL needs to be accessible only by the instances garm sets up. It will not be used by anyone else. To configure it, add the following line to the [default] section of your garm config:

metadata_url = "https://garm.example.com/api/v1/metadata"

The debug_server option

GARM can optionally enable the golang profiling server. This is useful if you suspect garm may have a bottleneck. To enable the profiling server, add the following to the garm config:

[default]

debug_server = true

And restart garm. You can then use the following command to start profiling:

go tool pprof http://127.0.0.1:9997/debug/pprof/profile?seconds=120

IMPORTANT NOTE on profiling when behind a reverse proxy: The above command blocks while the profile is being collected (120 seconds in this example). Most reverse proxies will time out after about 60 seconds. To avoid this, you should only profile on localhost by connecting directly to garm.

It's also advisable to exclude the debug server URLs from your reverse proxy and only make them available locally.

Now that the debug server is enabled, here is a blog post on how to profile golang applications: https://blog.golang.org/profiling-go-programs

The log_file option

By default, GARM logs everything to standard output.

You can optionally log to file by adding the following to your config file:

[default]
# Use this if you'd like to log to a file instead of standard output.
log_file = "/tmp/runner-manager.log"

Rotating log files

GARM automatically rotates the log file when it reaches 500 MB in size or 28 days in age, whichever comes first.
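
The "whichever comes first" rule can be sketched as a simple either/or check. This only illustrates the rule as stated. The function name is hypothetical, and treating "500 MB" as 500 * 1024 * 1024 bytes is an assumption, not something this document specifies.

```python
from datetime import timedelta

# Assumption: "500 MB" is interpreted as 500 MiB.
MAX_SIZE_BYTES = 500 * 1024 * 1024
MAX_AGE = timedelta(days=28)


def should_rotate(size_bytes, age):
    """Return True when either limit is reached (hypothetical helper)."""
    return size_bytes >= MAX_SIZE_BYTES or age >= MAX_AGE


# A small but old log file is rotated because of its age alone.
print(should_rotate(10 * 1024 * 1024, timedelta(days=30)))  # prints True
```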

However, if you want to manually rotate the log file, you can send a SIGHUP signal to the GARM process.

You can add the following to your systemd unit file to enable reload:

[Service]
ExecReload=/bin/kill -HUP $MAINPID

Then you can simply run:

systemctl reload garm

The enable_log_streamer option

This option allows you to stream garm logs directly to your terminal. Set this option to true, then use the following command to stream logs:

garm-cli debug-log

An important note on enabling this option when behind a reverse proxy: the log streamer uses websockets to stream logs to you, so you will need to configure your reverse proxy to allow websocket connections. If you're using nginx, you will need to add the following to your nginx server config:

location /api/v1/ws {
    proxy_pass http://garm_backend;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "Upgrade";
    proxy_set_header Host $host;
}

The logging section

GARM has switched to the slog package for logging, adding structured logging. As such, we added a dedicated logging section to the config to tweak the logging settings. We moved the enable_log_streamer and the log_file options from the default section to the logging section. They are still available in the default section for backwards compatibility, but they are deprecated and will be removed in a future release.

An example of the new logging section:

[logging]
# Uncomment this line if you'd like to log to a file instead of standard output.
# log_file = "/tmp/runner-manager.log"

# enable_log_streamer enables streaming the logs over websockets
enable_log_streamer = true
# log_format is the output format of the logs. GARM uses structured logging and can
# output as "text" or "json"
log_format = "text"
# log_level is the logging level GARM will output. Available log levels are:
#  * debug
#  * info
#  * warn
#  * error
log_level = "debug"
# log_source will output information about the function that generated the log line.
log_source = false

By default, GARM logs everything to standard output. You can optionally log to a file by adding the log_file option to the logging section. The enable_log_streamer option allows you to stream GARM logs directly to your terminal. Set this option to true, then use the following command to stream logs:

garm-cli debug-log

The log_format, log_level and log_source options allow you to tweak the logging output. The log_format option can be set to text or json. The log_level option can be set to debug, info, warn or error. The log_source option will output information about the function that generated the log line. All these options influence how the structured logging is output.

This will allow you to ingest GARM logs in a central location such as an ELK stack or similar.
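
To illustrate what ingestion could look like with log_format = "json", here is a minimal sketch that filters log lines by level. The sample lines are made up. The keys "time", "level" and "msg" are the defaults written by Go's slog JSON handler, and any other fields GARM adds are not shown here.

```python
import json

# Made-up sample lines; only the default slog JSON keys are assumed.
SAMPLE = [
    '{"time":"2024-01-01T10:00:00Z","level":"INFO","msg":"example info line"}',
    '{"time":"2024-01-01T10:00:01Z","level":"ERROR","msg":"example error line"}',
]

LEVELS = {"DEBUG": 0, "INFO": 1, "WARN": 2, "ERROR": 3}


def filter_logs(lines, min_level="WARN"):
    """Yield parsed records at or above min_level, skipping unparsable lines."""
    threshold = LEVELS[min_level]
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not JSON, e.g. a line written with log_format = "text"
        if not isinstance(record, dict):
            continue
        if LEVELS.get(record.get("level", ""), -1) >= threshold:
            yield record


for record in filter_logs(SAMPLE):
    print(record["time"], record["msg"])
```

A real pipeline would normally leave this to the log shipper; the sketch only shows that each line is a self-contained JSON object.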

Database configuration

GARM currently supports SQLite3. Support for other stores will be added in the future.

[database]
  # Turn on/off debugging for database queries.
  debug = false
  # Database backend to use. Currently supported backends are:
  #   * sqlite3
  backend = "sqlite3"
  # the passphrase option is a temporary measure by which we encrypt the webhook
  # secret that gets saved to the database, using AES256. In the future, secrets
  # will be saved to something like Barbican or Vault, eliminating the need for
  # this. This string needs to be 32 characters in size.
  passphrase = "shreotsinWadquidAitNefayctowUrph"
  [database.sqlite3]
    # Path on disk to the sqlite3 database file.
    db_file = "/home/runner/garm.db"
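
Because the passphrase must be exactly 32 characters long, it can be convenient to generate it rather than type it. A minimal sketch, assuming any mix of ASCII letters and digits is acceptable (the function name is hypothetical):

```python
import secrets
import string


def generate_passphrase(length=32):
    """Return a random passphrase of ASCII letters and digits."""
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))


print(generate_passphrase())
```

Keep the generated value safe. The webhook secrets stored in the database are encrypted with this passphrase, so changing it later would presumably leave the existing secrets unreadable.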

Provider configuration

GARM was designed to be extensible. Providers can be written as external executables which implement the needed interface to create/delete/list compute systems that are used by GARM to create runners.

Providers

GARM delegates the functionality needed to create the runners to external executables. These executables can be either binaries or scripts. As long as they adhere to the needed interface, they can be used to create runners in any target IaaS. You might find this behavior familiar if you've ever had to deal with installing CNIs in containerd. The principle is the same.

The configuration for an external provider is quite simple:

# This is an example external provider. External providers are executables that
# implement the needed interface to create/delete/list compute systems that are used
# by GARM to create runners.
[[provider]]
name = "openstack_external"
description = "external openstack provider"
provider_type = "external"
  [provider.external]
  # config file passed to the executable via GARM_PROVIDER_CONFIG_FILE environment variable
  config_file = "/etc/garm/providers.d/openstack/keystonerc"
  # Absolute path to an executable that implements the provider logic. This executable can be
  # anything (bash, a binary, python, etc). See documentation in this repo on how to write an
  # external provider.
  provider_executable = "/etc/garm/providers.d/openstack/garm-external-provider"
  # This option will pass all environment variables that start with AWS_ to the provider.
  # To pass in individual variables, you can add the entire name to the list.
  environment_variables = ["AWS_"]

The external provider has three options:

  • provider_executable
  • config_file
  • environment_variables

The provider_executable option is the absolute path to an executable that implements the provider logic. GARM will delegate all provider operations to this executable. This executable can be anything (bash, python, perl, go, etc). See Writing an external provider for more details.

The config_file option is a path on disk to an arbitrary file that is passed to the external executable via the environment variable GARM_PROVIDER_CONFIG_FILE. This file is only relevant to the external provider. GARM itself does not read it. Let's take the OpenStack provider as an example. The config file contains access information for an OpenStack cloud as well as some provider specific options like whether or not to boot from volume and which tenant network to use.
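
From the provider's side, locating that file only requires reading the environment variable. A minimal sketch of what a provider written in Python might do first. The function name is hypothetical, and what the provider does with the file afterwards is entirely up to the provider.

```python
import os


def provider_config_path(environ=None):
    """Return the config file path GARM passed in, or fail loudly."""
    environ = os.environ if environ is None else environ
    path = environ.get("GARM_PROVIDER_CONFIG_FILE")
    if not path:
        raise RuntimeError("GARM_PROVIDER_CONFIG_FILE is not set")
    return path


# Example with an explicit environment instead of the real one.
example_env = {
    "GARM_PROVIDER_CONFIG_FILE": "/etc/garm/providers.d/openstack/keystonerc",
}
print(provider_config_path(example_env))
```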

The environment_variables option is a list of environment variables that will be passed to the external provider. By default, GARM will pass a clean env to providers, consisting only of variables that the provider interface expects. However, in some situations, a provider may need access to certain environment variables set in the env of GARM itself. This might be needed to enable access to IAM roles (ec2) or managed identity (azure). This option takes a list of environment variables or prefixes of environment variables that will be passed to the provider. For example, if you want to pass all environment variables that start with AWS_ to the provider, you can set this option to ["AWS_"].
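
The prefix matching described above can be pictured with the following sketch. It is an illustration of the idea, not GARM's implementation. The function name is hypothetical, and full variable names are simply treated as prefixes here, which GARM may handle differently.

```python
def select_environment(env, allowed):
    """Keep only variables whose name starts with an entry in `allowed`."""
    return {
        name: value
        for name, value in env.items()
        if any(name.startswith(prefix) for prefix in allowed)
    }


garm_env = {
    "AWS_REGION": "eu-west-1",
    "AWS_PROFILE": "garm",
    "HOME": "/home/runner",
}
print(select_environment(garm_env, ["AWS_"]))
```

Only the two AWS_ variables survive; HOME is dropped because it matches no entry in the list.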

If you want to implement an external provider, you can use the config_file for anything you need to pass into the binary when GARM calls it to execute a particular operation.

Available external providers

For non-testing purposes, these are the external providers currently available:

Details on how to install and configure them are available in their respective repositories.

If you wrote a provider and would like to add it to the above list, feel free to open a PR.

The metrics section

This is one of the features in GARM that I really love having. For one thing, it's community contributed and for another, it really adds value to the project. It allows us to create some pretty nice visualizations of what is happening with GARM.

Common metrics

| Metric name | Type | Labels | Description |
| --- | --- | --- | --- |
| garm_health | Gauge | controller_id=<controller id>, callback_url=<callback url>, controller_webhook_url=<controller webhook url>, metadata_url=<metadata url>, webhook_url=<webhook url>, name=<hostname> | This is a gauge that is set to 1 if GARM is healthy and 0 if it is not. This is useful for alerting. |
| garm_webhooks_received | Counter | valid=<valid request>, reason=<reason for invalid requests> | This is a counter that increments every time GARM receives a webhook from GitHub. |

Enterprise metrics

| Metric name | Type | Labels | Description |
| --- | --- | --- | --- |
| garm_enterprise_info | Gauge | id=<enterprise id>, name=<enterprise name> | This is a gauge that is set to 1 and exposes enterprise information |
| garm_enterprise_pool_manager_status | Gauge | id=<enterprise id>, name=<enterprise name>, running=<true\|false> | This is a gauge that is set to 1 if the enterprise pool manager is running and set to 0 if not |

Organization metrics

| Metric name | Type | Labels | Description |
| --- | --- | --- | --- |
| garm_organization_info | Gauge | id=<organization id>, name=<organization name> | This is a gauge that is set to 1 and exposes organization information |
| garm_organization_pool_manager_status | Gauge | id=<organization id>, name=<organization name>, running=<true\|false> | This is a gauge that is set to 1 if the organization pool manager is running and set to 0 if not |

Repository metrics

| Metric name | Type | Labels | Description |
| --- | --- | --- | --- |
| garm_repository_info | Gauge | id=<repository id>, name=<repository name> | This is a gauge that is set to 1 and exposes repository information |
| garm_repository_pool_manager_status | Gauge | id=<repository id>, name=<repository name>, running=<true\|false> | This is a gauge that is set to 1 if the repository pool manager is running and set to 0 if not |

Provider metrics

| Metric name | Type | Labels | Description |
| --- | --- | --- | --- |
| garm_provider_info | Gauge | description=<provider description>, name=<provider name>, type=<internal\|external> | This is a gauge that is set to 1 and exposes provider information |

Pool metrics

| Metric name | Type | Labels | Description |
| --- | --- | --- | --- |
| garm_pool_info | Gauge | flavor=<flavor>, id=<pool id>, image=<image name>, os_arch=<defined OS arch>, os_type=<defined OS name>, pool_owner=<owner name>, pool_type=<repository\|organization\|enterprise>, prefix=<prefix>, provider=<provider name>, tags=<concatenated list of pool tags> | This is a gauge that is set to 1 and exposes pool information |
| garm_pool_status | Gauge | enabled=<true\|false>, id=<pool id> | This is a gauge that is set to 1 if the pool is enabled and set to 0 if not |
| garm_pool_bootstrap_timeout | Gauge | id=<pool id> | This is a gauge that is set to the pool bootstrap timeout |
| garm_pool_max_runners | Gauge | id=<pool id> | This is a gauge that is set to the pool max runners |
| garm_pool_min_idle_runners | Gauge | id=<pool id> | This is a gauge that is set to the pool min idle runners |

Runner metrics

| Metric name | Type | Labels | Description |
| --- | --- | --- | --- |
| garm_runner_status | Gauge | name=<runner name>, pool_owner=<owner name>, pool_type=<repository\|organization\|enterprise>, provider=<provider name>, runner_status=<running\|stopped\|error\|pending_delete\|deleting\|pending_create\|creating\|unknown>, status=<idle\|pending\|terminated\|installing\|failed\|active> | This is a gauge value that gives us details about the runners garm spawns |
| garm_runner_operations_total | Counter | provider=<provider name>, operation=<CreateInstance\|DeleteInstance\|GetInstance\|ListInstances\|RemoveAllInstances\|Start\|Stop> | This is a counter that increments every time a runner operation is performed |
| garm_runner_errors_total | Counter | provider=<provider name>, operation=<CreateInstance\|DeleteInstance\|GetInstance\|ListInstances\|RemoveAllInstances\|Start\|Stop> | This is a counter that increments every time a runner operation errored |

GitHub metrics

| Metric name | Type | Labels | Description |
| --- | --- | --- | --- |
| garm_github_operations_total | Counter | operation=<ListRunners\|CreateRegistrationToken\|...>, scope=<Organization\|Repository\|Enterprise> | This is a counter that increments every time a GitHub operation is performed |
| garm_github_errors_total | Counter | operation=<ListRunners\|CreateRegistrationToken\|...>, scope=<Organization\|Repository\|Enterprise> | This is a counter that increments every time a GitHub operation errored |

Enabling metrics

Metrics are disabled by default. To enable them, add the following to your config file:

[metrics]

# Toggle to disable authentication (not recommended) on the metrics endpoint.
# If you do disable authentication, I encourage you to put a reverse proxy in front
# of garm and limit which systems can access that particular endpoint. Ideally, you
# would enable some kind of authentication using the reverse proxy, if the built-in auth
# is not sufficient for your needs.
#
# Default: false
disable_auth = true

# Toggle metrics. If set to false, the API endpoint for metrics collection will
# be disabled.
#
# Default: false
enable = true

# period is the interval at which the internal metrics about controller
# specific objects (e.g. runners, pools, etc.) are updated.
#
# Default: "60s"
period = "30s"

You can choose to disable authentication if you wish. However, it's not terribly difficult to set up, so I generally advise against disabling it.

Configuring Prometheus

The following section assumes that your garm instance is running at garm.example.com and has TLS enabled.

First, generate a new JWT token valid only for the metrics endpoint:

garm-cli metrics-token create

Note: The token validity is equal to the TTL you set in the JWT config section.

Copy the resulting token and add it to your Prometheus config file. The following is an example of how to add garm as a target in your Prometheus config file:

scrape_configs:
  - job_name: "garm"
    # Connect over https. If you don't have TLS enabled, change this to http.
    scheme: https
    static_configs:
      - targets: ["garm.example.com"]
    authorization:
      credentials: "superSecretTokenYouGeneratedEarlier"
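
Before pointing Prometheus at garm, it can help to verify the token by hand. With the config above, Prometheus scrapes the default /metrics path and sends the credentials as a Bearer token. The sketch below builds an equivalent request without sending it. The function name is hypothetical, and the host is the example one used in this section.

```python
import urllib.request


def build_metrics_request(token, base_url="https://garm.example.com"):
    """Build (but do not send) a request equivalent to a Prometheus scrape."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/metrics",
        headers={"Authorization": "Bearer " + token},
    )


req = build_metrics_request("superSecretTokenYouGeneratedEarlier")
print(req.full_url)
# Passing req to urllib.request.urlopen() would perform the actual request.
```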

The JWT authentication config section

This section configures the JWT authentication used by the API server. GARM is currently a single user system and that user has the right to do anything and everything GARM is capable of. As a result, the JWT auth we have does not include a refresh token. The token is valid for the duration of the time to live (TTL) set in the config file. Once the token expires, you will need to log in again.

It is recommended that the secret be a long, randomly generated string. Changing the secret at any time will invalidate all existing tokens.
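
One possible way to produce such a secret is Python's secrets module. This is a sketch, not a requirement. The function name and the choice of 48 random bytes are arbitrary, and any sufficiently long random string will do.

```python
import secrets


def generate_jwt_secret(nbytes=48):
    """Return a URL-safe random string built from nbytes of randomness."""
    return secrets.token_urlsafe(nbytes)


print(generate_jwt_secret())
```

The output only contains letters, digits, "-" and "_", so it can be pasted into a double-quoted TOML string without escaping.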

[jwt_auth]
# A JWT token secret used to sign tokens. Obviously, this needs to be changed :).
secret = ")9gk_4A6KrXz9D2u`0@MPea*sd6W`%@5MAWpWWJ3P3EqW~qB!!(Vd$FhNc*eU4vG"

# Time to live for tokens. Both the instances and you will use JWT tokens to
# authenticate against the API. However, this TTL is applied only to tokens you
# get when logging into the API. The tokens issued to the instances we manage
# have a TTL based on the runner bootstrap timeout set on each pool. The minimum
# TTL for this token is 24h.
time_to_live = "8760h"
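
For reference, "8760h" is 365 days. The sketch below converts an hours-only value and compares it against the 24h minimum mentioned in the comment above. The parser is a hypothetical helper that deliberately handles only the "<number>h" form and is not how GARM parses durations.

```python
from datetime import timedelta

MIN_TTL = timedelta(hours=24)  # minimum stated in the config comment


def parse_hours(value):
    """Parse a value such as "8760h"; other duration units are not handled."""
    if not value.endswith("h") or not value[:-1].isdigit():
        raise ValueError("expected a value like '24h', got %r" % value)
    return timedelta(hours=int(value[:-1]))


ttl = parse_hours("8760h")
print(ttl.days)        # prints 365
print(ttl >= MIN_TTL)  # prints True
```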

The API server config section

This section allows you to configure the GARM API server. The API server is responsible for serving all the API endpoints used by the garm-cli, the runners that phone home their status and by GitHub when it sends us webhooks.

The config options are fairly straightforward.

[apiserver]
  # Bind the API to this IP
  bind = "0.0.0.0"
  # Bind the API to this port
  port = 9997
  # Whether or not to set up TLS for the API endpoint. If this is set to true,
  # you must have a valid apiserver.tls section.
  use_tls = false
  # Set a list of allowed origins
  # By default, if this option is omitted or empty, we will check
  # only that the origin is the same as the originating server.
  # A literal of "*" will allow any origin
  cors_origins = ["*"]
  [apiserver.tls]
    # Path on disk to a x509 certificate bundle.
    # NOTE: if your certificate is signed by an intermediary CA, this file
    # must contain the entire certificate bundle needed for clients to validate
    # the certificate. This usually means concatenating the certificate and the
    # CA bundle you received.
    certificate = ""
    # The path on disk to the corresponding private key for the certificate.
    key = ""
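
The cors_origins rules described in the comments above can be sketched as follows. This illustrates the documented behavior, not GARM's code. The function name is hypothetical, and the exact-match comparison for explicitly listed origins is an assumption.

```python
from urllib.parse import urlsplit


def origin_allowed(origin, request_host, cors_origins=()):
    """Decide whether a request Origin header would be accepted."""
    if not cors_origins:
        # Option omitted or empty: only same-origin requests pass.
        return urlsplit(origin).netloc == request_host
    # A literal "*" allows any origin; otherwise require a listed origin.
    return "*" in cors_origins or origin in cors_origins


print(origin_allowed("https://garm.example.com", "garm.example.com"))
print(origin_allowed("https://other.example.com", "garm.example.com"))
```

The first call prints True and the second prints False, since no origins are configured and only the same host is accepted.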

The GARM API server has the option to enable TLS, but I suggest you use a reverse proxy and enable TLS termination in that reverse proxy. There is an nginx sample in this repository with TLS termination enabled.

You can of course enable TLS in both garm and the reverse proxy. The choice is yours.