feat: add new metrics #180

Merged: 1 commit, Oct 6, 2023
52 changes: 48 additions & 4 deletions doc/config_metrics.md
@@ -2,11 +2,55 @@

This is one of the features in GARM that I really love having. For one thing, it's community-contributed, and for another, it really adds value to the project. It allows us to create some pretty nice visualizations of what is happening with GARM.

At the moment there are only three meaningful metrics being collected, besides the default ones that the Prometheus Go client package enables. These are:

* `garm_health` - This is a gauge that is set to 1 if GARM is healthy and 0 if it is not. This is useful for alerting.
* `garm_runner_status` - This is a gauge value that gives us details about the runners GARM spawns.
* `garm_webhooks_received` - This is a counter that increments every time GARM receives a webhook from GitHub.

## Common metrics

| Metric name | Type | Labels | Description |
|--------------------------|---------|-------------------------------------------------------------------|------------------------------------------------------------------------------------------------------|
| `garm_health` | Gauge | `controller_id`=&lt;controller id&gt; <br>`name`=&lt;hostname&gt; | This is a gauge that is set to 1 if GARM is healthy and 0 if it is not. This is useful for alerting. |
| `garm_webhooks_received` | Counter | `controller_id`=&lt;controller id&gt; <br>`name`=&lt;hostname&gt; | This is a counter that increments every time GARM receives a webhook from GitHub. |

## Enterprise metrics

| Metric name | Type | Labels | Description |
|---------------------------------------|-------|-------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------|
| `garm_enterprise_info` | Gauge | `id`=&lt;enterprise id&gt; <br>`name`=&lt;enterprise name&gt; | This is a gauge that is set to 1 and exposes enterprise information |
| `garm_enterprise_pool_manager_status` | Gauge | `id`=&lt;enterprise id&gt; <br>`name`=&lt;enterprise name&gt; <br>`running`=&lt;true\|false&gt; | This is a gauge that is set to 1 if the enterprise pool manager is running and to 0 if it is not |

## Organization metrics

| Metric name | Type | Labels | Description |
|-----------------------------------------|-------|-----------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------|
| `garm_organization_info` | Gauge | `id`=&lt;organization id&gt; <br>`name`=&lt;organization name&gt; | This is a gauge that is set to 1 and exposes organization information |
| `garm_organization_pool_manager_status` | Gauge | `id`=&lt;organization id&gt; <br>`name`=&lt;organization name&gt; <br>`running`=&lt;true\|false&gt; | This is a gauge that is set to 1 if the organization pool manager is running and to 0 if it is not |

## Repository metrics

| Metric name | Type | Labels | Description |
|---------------------------------------|-------|-------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------|
| `garm_repository_info` | Gauge | `id`=&lt;repository id&gt; <br>`name`=&lt;repository name&gt; | This is a gauge that is set to 1 and exposes repository information |
| `garm_repository_pool_manager_status` | Gauge | `id`=&lt;repository id&gt; <br>`name`=&lt;repository name&gt; <br>`running`=&lt;true\|false&gt; | This is a gauge that is set to 1 if the repository pool manager is running and to 0 if it is not |

## Provider metrics

| Metric name | Type | Labels | Description |
|----------------------|-------|-------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------|
| `garm_provider_info` | Gauge | `description`=&lt;provider description&gt; <br>`name`=&lt;provider name&gt; <br>`type`=&lt;internal\|external&gt; | This is a gauge that is set to 1 and exposes provider information |

## Pool metrics

| Metric name | Type | Labels | Description |
|-------------------------------|-------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------|
| `garm_pool_info` | Gauge | `flavor`=&lt;flavor&gt; <br>`id`=&lt;pool id&gt; <br>`image`=&lt;image name&gt; <br>`os_arch`=&lt;defined OS arch&gt; <br>`os_type`=&lt;defined OS name&gt; <br>`pool_owner`=&lt;owner name&gt; <br>`pool_type`=&lt;repository\|organization\|enterprise&gt; <br>`prefix`=&lt;prefix&gt; <br>`provider`=&lt;provider name&gt; <br>`tags`=&lt;concatenated list of pool tags&gt; | This is a gauge that is set to 1 and exposes pool information |
| `garm_pool_status` | Gauge | `enabled`=&lt;true\|false&gt; <br>`id`=&lt;pool id&gt; | This is a gauge that is set to 1 if the pool is enabled and to 0 if it is not |
| `garm_pool_bootstrap_timeout` | Gauge | `id`=&lt;pool id&gt; | This is a gauge that is set to the pool's bootstrap timeout |
| `garm_pool_max_runners` | Gauge | `id`=&lt;pool id&gt; | This is a gauge that is set to the maximum number of runners for the pool |
| `garm_pool_min_idle_runners` | Gauge | `id`=&lt;pool id&gt; | This is a gauge that is set to the minimum number of idle runners for the pool |

## Runner metrics

| Metric name | Type | Labels | Description |
|----------------------|-------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------|
| `garm_runner_status` | Gauge | `controller_id`=&lt;controller id&gt; <br>`hostname`=&lt;hostname&gt; <br>`name`=&lt;runner name&gt; <br>`pool_id`=&lt;pool id&gt; <br>`pool_owner`=&lt;owner name&gt; <br>`pool_type`=&lt;repository\|organization\|enterprise&gt; <br>`provider`=&lt;provider name&gt; <br>`runner_status`=&lt;running\|stopped\|error\|pending_delete\|deleting\|pending_create\|creating\|unknown&gt; <br>`status`=&lt;idle\|pending\|terminated\|installing\|failed\|active&gt; | This is a gauge value that gives us details about the runners GARM spawns |

More metrics will be added in the future.
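
For orientation, each row in the tables above becomes one sample line per label combination when the endpoint is scraped. The sketch below renders such lines in the Prometheus text exposition format using only the standard library. The `formatSample` helper and the pool ID are hypothetical and exist only for illustration; GARM itself relies on the Prometheus client library to produce this output.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// labelEscaper escapes the three characters that the text exposition
// format requires to be escaped inside label values.
var labelEscaper = strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`)

// formatSample is a hypothetical helper that renders one sample line.
// Labels are sorted by name only to keep the output deterministic.
func formatSample(name string, labels map[string]string, value float64) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, fmt.Sprintf(`%s="%s"`, k, labelEscaper.Replace(labels[k])))
	}
	return fmt.Sprintf("%s{%s} %s", name, strings.Join(pairs, ","),
		strconv.FormatFloat(value, 'g', -1, 64))
}

func main() {
	// "example-pool-id" is a made-up ID; real IDs come from GARM.
	fmt.Println(formatSample("garm_pool_status",
		map[string]string{"id": "example-pool-id", "enabled": "true"}, 1))
	fmt.Println(formatSample("garm_pool_max_runners",
		map[string]string{"id": "example-pool-id"}, 10))
}
```

The first call prints `garm_pool_status{enabled="true",id="example-pool-id"} 1`, which is the shape a dashboard query would match on.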

50 changes: 50 additions & 0 deletions metrics/enterprise.go
@@ -0,0 +1,50 @@
package metrics

import (
	"log"
	"strconv"

	"github.com/cloudbase/garm/auth"
	"github.com/prometheus/client_golang/prometheus"
)

// CollectEnterpriseMetric collects the metrics for the enterprise objects
func (c *GarmCollector) CollectEnterpriseMetric(ch chan<- prometheus.Metric, hostname string, controllerID string) {
	ctx := auth.GetAdminContext()

	enterprises, err := c.runner.ListEnterprises(ctx)
	if err != nil {
		log.Printf("listing enterprises: %s", err)
		return
	}

	for _, enterprise := range enterprises {
		enterpriseInfo, err := prometheus.NewConstMetric(
			c.enterpriseInfo,
			prometheus.GaugeValue,
			1,
			enterprise.Name, // label: name
			enterprise.ID,   // label: id
		)
		if err != nil {
			log.Printf("cannot collect enterpriseInfo metric: %s", err)
			continue
		}
		ch <- enterpriseInfo

		enterprisePoolManagerStatus, err := prometheus.NewConstMetric(
			c.enterprisePoolManagerStatus,
			prometheus.GaugeValue,
			bool2float64(enterprise.PoolManagerStatus.IsRunning),
			enterprise.Name, // label: name
			enterprise.ID,   // label: id
			strconv.FormatBool(enterprise.PoolManagerStatus.IsRunning), // label: running
		)
		if err != nil {
			log.Printf("cannot collect enterprisePoolManagerStatus metric: %s", err)
			continue
		}
		ch <- enterprisePoolManagerStatus
	}
}
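
The collector above calls `bool2float64`, which is not part of the files shown in this diff. A minimal sketch, assuming it simply maps `true` to 1 and `false` to 0 to match the gauge semantics described in `doc/config_metrics.md`, could look like this:

```go
package main

import "fmt"

// bool2float64 converts a boolean into the 1/0 value used by the status
// gauges. This is an assumed implementation; the real helper lives
// elsewhere in the metrics package and is not shown in this diff.
func bool2float64(b bool) float64 {
	if b {
		return 1
	}
	return 0
}

func main() {
	fmt.Println(bool2float64(true), bool2float64(false)) // prints "1 0"
}
```

Because the same boolean is also written into the `running` label, the value and the label of a pool manager status series always change together.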
22 changes: 22 additions & 0 deletions metrics/health.go
@@ -0,0 +1,22 @@
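package metrics

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
)

// CollectHealthMetric emits the health gauge with a constant value of 1,
// labeled with the hostname and the controller ID.
func (c *GarmCollector) CollectHealthMetric(ch chan<- prometheus.Metric, hostname string, controllerID string) {
	m, err := prometheus.NewConstMetric(
		c.healthMetric,
		prometheus.GaugeValue,
		1,
		hostname,     // documented as the `name` label
		controllerID, // documented as the `controller_id` label
	)
	if err != nil {
		log.Printf("error on creating health metric: %s", err)
		return
	}
	ch <- m
}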
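
The documentation describes `garm_health` as useful for alerting. As shown in this diff, the collector only ever emits the value 1, so a consumer may want to treat a missing sample the same as an unhealthy one. The sketch below is one way to read the value out of a scraped text-format body. The `healthFromScrape` helper and the sample payload are hypothetical and not part of GARM.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// healthFromScrape scans text-format scrape output and returns the value of
// the first garm_health sample. ok is false when no such sample is present.
func healthFromScrape(body string) (value float64, ok bool) {
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		// Skip comments (# HELP / # TYPE) and every other metric.
		if !strings.HasPrefix(line, "garm_health{") && !strings.HasPrefix(line, "garm_health ") {
			continue
		}
		// The value follows the closing brace of the label set, so spaces
		// inside label values do not interfere with the split.
		rest := strings.TrimPrefix(line, "garm_health")
		if idx := strings.LastIndex(line, "}"); idx >= 0 {
			rest = line[idx+1:]
		}
		fields := strings.Fields(rest)
		if len(fields) == 0 {
			continue
		}
		v, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			continue
		}
		return v, true
	}
	return 0, false
}

func main() {
	// Made-up scrape output with hypothetical label values.
	body := "# TYPE garm_health gauge\n" +
		"garm_health{controller_id=\"example-id\",name=\"example host\"} 1\n"
	v, ok := healthFromScrape(body)
	fmt.Println(v, ok)
}
```

In practice this check would normally be expressed as an alerting rule in the monitoring system; the Go version only illustrates the logic.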
79 changes: 79 additions & 0 deletions metrics/instance.go
@@ -0,0 +1,79 @@
package metrics

import (
	"log"

	"github.com/cloudbase/garm/auth"
	"github.com/prometheus/client_golang/prometheus"
)

// CollectInstanceMetric collects the metrics for the runner instances
// reflecting the statuses and the pool they belong to.
func (c *GarmCollector) CollectInstanceMetric(ch chan<- prometheus.Metric, hostname string, controllerID string) {
	ctx := auth.GetAdminContext()

	instances, err := c.runner.ListAllInstances(ctx)
	if err != nil {
		log.Printf("cannot collect metrics, listing instances: %s", err)
		return
	}

	pools, err := c.runner.ListAllPools(ctx)
	if err != nil {
		log.Printf("listing pools: %s", err)
		return
	}

	type poolInfo struct {
		Name         string
		Type         string
		ProviderName string
	}

	poolNames := make(map[string]poolInfo)
	for _, pool := range pools {
		// The owner is the enterprise if set, then the organization,
		// and the repository otherwise.
		ownerName := pool.RepoName
		if pool.EnterpriseName != "" {
			ownerName = pool.EnterpriseName
		} else if pool.OrgName != "" {
			ownerName = pool.OrgName
		}
		poolNames[pool.ID] = poolInfo{
			Name:         ownerName,
			Type:         string(pool.PoolType()),
			ProviderName: pool.ProviderName,
		}
	}

	for _, instance := range instances {
		m, err := prometheus.NewConstMetric(
			c.instanceMetric,
			prometheus.GaugeValue,
			1,
			instance.Name,                           // label: name
			string(instance.Status),                 // label: status
			string(instance.RunnerStatus),           // label: runner_status
			poolNames[instance.PoolID].Name,         // label: pool_owner
			poolNames[instance.PoolID].Type,         // label: pool_type
			instance.PoolID,                         // label: pool_id
			hostname,                                // label: hostname
			controllerID,                            // label: controller_id
			poolNames[instance.PoolID].ProviderName, // label: provider
		)
		if err != nil {
			log.Printf("cannot collect runner metric: %s", err)
			continue
		}
		ch <- m
	}
}
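
The owner lookup in `CollectInstanceMetric` can be tried in isolation. The sketch below reproduces the enterprise, organization, repository precedence with a hypothetical `pool` struct that carries only the fields the collector reads; it is not GARM's real pool type. It also shows that looking up a pool ID that is missing from the map yields a zero value, which in the collector would surface as empty `pool_owner`, `pool_type` and `provider` labels.

```go
package main

import "fmt"

// pool is a hypothetical stand-in for GARM's pool type, reduced to the
// fields used for owner resolution.
type pool struct {
	ID             string
	EnterpriseName string
	OrgName        string
	RepoName       string
}

// ownerName returns the name of the entity that owns the pool, checking
// the enterprise first, then the organization, then the repository.
func ownerName(p pool) string {
	switch {
	case p.EnterpriseName != "":
		return p.EnterpriseName
	case p.OrgName != "":
		return p.OrgName
	default:
		return p.RepoName
	}
}

func main() {
	owners := make(map[string]string)
	for _, p := range []pool{
		{ID: "pool-a", OrgName: "example-org"},
		{ID: "pool-b", RepoName: "example-repo"},
	} {
		owners[p.ID] = ownerName(p)
	}

	fmt.Printf("%q\n", owners["pool-a"])  // owner resolved from OrgName
	fmt.Printf("%q\n", owners["missing"]) // zero value: empty string
}
```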