Merge pull request #690 from cloudflare/stats
Collect query stats and use it in query/cost
prymitive authored Aug 4, 2023
2 parents 86ba0ef + 305f0eb commit 2b94079
Showing 13 changed files with 626 additions and 47 deletions.
8 changes: 8 additions & 0 deletions docs/changelog.md
@@ -1,5 +1,13 @@
# Changelog

## v0.45.0

### Added

- The `query/cost` check can now use Prometheus query stats to verify query
evaluation time and the number of samples used by a query. See
[query/cost](checks/query/cost.md) docs for details.

## v0.44.2

### Fixed
68 changes: 63 additions & 5 deletions docs/checks/query/cost.md
@@ -9,11 +9,41 @@ grand_parent: Documentation
This check is used to calculate the cost of a query and optionally report an
issue if that cost is too high. It will run the `expr` query from every rule
against selected Prometheus servers and report the results.
This check can be used for both recording and alerting rules, but is mostly
useful for recording rules.

## Query evaluation duration

The total duration of a query comes from the Prometheus query stats included
in the API response when `?stats=1` is passed.
When enabled, pint can report if `evalTotalTime` is higher than the configured
limit, which can be used either for informational purposes or to fail checks on
queries that are too expensive (depending on the configured `severity`).
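
To make the mechanics concrete, here is a minimal, hypothetical Go sketch (not
pint's actual client code) that passes `stats=1` to the Prometheus query API and
decodes the stats fields this check relies on; the server URL and query are
placeholders, and field availability depends on your Prometheus version. It also
decodes the sample stats described in the next section.

```go
package main

import (
    "encoding/json"
    "fmt"
    "log"
    "net/http"
    "net/url"
)

// Subset of the Prometheus query API response, keeping only the stats
// fields discussed here. JSON tags match the payload field names.
type apiResponse struct {
    Status string `json:"status"`
    Data   struct {
        Stats struct {
            Timings struct {
                EvalTotalTime float64 `json:"evalTotalTime"` // seconds
            } `json:"timings"`
            Samples struct {
                TotalQueryableSamples int `json:"totalQueryableSamples"`
                PeakSamples           int `json:"peakSamples"`
            } `json:"samples"`
        } `json:"stats"`
    } `json:"data"`
}

func main() {
    // Placeholder server and query.
    u := "http://localhost:9090/api/v1/query?" + url.Values{
        "query": {"sum(rate(http_requests_total[5m]))"},
        "stats": {"1"}, // ask Prometheus to include query stats
    }.Encode()

    resp, err := http.Get(u)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    var r apiResponse
    if err := json.NewDecoder(resp.Body).Decode(&r); err != nil {
        log.Fatal(err)
    }
    fmt.Printf("evalTotalTime=%.3fs totalQueryableSamples=%d peakSamples=%d\n",
        r.Data.Stats.Timings.EvalTotalTime,
        r.Data.Stats.Samples.TotalQueryableSamples,
        r.Data.Stats.Samples.PeakSamples)
}
```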

## Query evaluation samples

Similar to the evaluation duration, this information comes from Prometheus query stats.
There are two different stats that give us information about the number of samples
used by a given query:

- `totalQueryableSamples` - the total number of samples read during the query execution.
- `peakSamples` - the maximum number of samples kept in memory during the query
execution, which shows how close the query was to reaching the `--query.max-samples` limit.

In general, a higher `totalQueryableSamples` means that a query either reads a lot of
time series and/or queries a large time range, both of which translate into longer
query execution times.
Looking at `peakSamples`, on the other hand, can be useful for finding queries that are
complex and perform some operation on a large number of time series, for example
when you run `max(...)` on a query that returns a huge number of results.

## Series returned by the query

For recording rules anything returned by the query will be saved into Prometheus
as new time series. Checking how many time series a rule returns allows us
to estimate how much extra memory will be needed.
`pint` will try to estimate the number of bytes needed per single time series
and use that to estimate the amount of memory needed to store all the time series
returned by a given query.
The `bytes per time series` number is calculated using this query:

@@ -23,7 +53,7 @@ avg(avg_over_time(go_memstats_alloc_bytes[2h]) / avg_over_time(prometheus_tsdb_head_series[2h]))

Since Go uses a garbage collector, total Prometheus process memory will be more than the
sum of all memory allocations, depending on many factors like memory pressure,
Go version, `GOGC` settings etc. The estimate `pint` gives you should be considered
a `best case` scenario.
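
As a concrete illustration of the arithmetic, with entirely hypothetical numbers:
if the bytes-per-series query above reports around 4096 bytes and a recording rule
returns 2000 series, the best-case estimate works out like this.

```go
package main

import "fmt"

func main() {
    // Hypothetical inputs: bytes-per-series as measured by the query above,
    // and the number of series returned by a recording rule.
    bytesPerSeries := 4096.0
    series := 2000.0

    // Best-case estimate: extra memory needed to store all returned series.
    estimate := bytesPerSeries * series
    fmt.Printf("estimated extra memory: %.1f MiB\n", estimate/1024/1024) // ~7.8 MiB
}
```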

## Configuration
@@ -32,8 +62,11 @@ Syntax:

```js
cost {
  severity              = "bug|warning|info"
  maxSeries             = 5000
  maxPeakSamples        = 10000
  maxTotalSamples       = 200000
  maxEvaluationDuration = "1m"
}
```

@@ -43,6 +76,15 @@ cost {
report it as information.
- `maxSeries` - if set and the number of results for a given query exceeds this value
it will be reported as a bug (or with a custom severity if `severity` is set).
- `maxPeakSamples` - setting this to a non-zero value will tell pint to report
any query with a higher `peakSamples` value than the limit configured here.
Nothing will be reported if this option is not set.
- `maxTotalSamples` - setting this to a non-zero value will tell pint to report
any query with a higher `totalQueryableSamples` value than the limit
configured here. Nothing will be reported if this option is not set.
- `maxEvaluationDuration` - setting this to a non-zero value will tell pint to
report any query whose `evalTotalTime` is higher than the limit configured
here. Nothing will be reported if this option is not set.

## How to enable it

@@ -68,6 +110,22 @@ rule {
}
```

Fail checks if any recording rule uses more than 300000 peak samples
or takes more than 30 seconds to evaluate.

```js
rule {
  match {
    kind = "recording"
  }
  cost {
    maxPeakSamples        = 300000
    maxEvaluationDuration = "30s"
    severity              = "bug"
  }
}
```

## How to disable it

You can disable this check globally by adding this config block:
16 changes: 12 additions & 4 deletions internal/checks/base_test.go
@@ -298,6 +298,7 @@ func (pe promError) respond(w http.ResponseWriter, _ *http.Request) {

type vectorResponse struct {
    samples model.Vector
    stats   promapi.QueryStats
}

func (vr vectorResponse) respond(w http.ResponseWriter, _ *http.Request) {
@@ -306,17 +307,20 @@ func (vr vectorResponse) respond(w http.ResponseWriter, _ *http.Request) {
    result := struct {
        Status string `json:"status"`
        Data   struct {
            ResultType string             `json:"resultType"`
            Result     model.Vector       `json:"result"`
            Stats      promapi.QueryStats `json:"stats"`
        } `json:"data"`
    }{
        Status: "success",
        Data: struct {
            ResultType string             `json:"resultType"`
            Result     model.Vector       `json:"result"`
            Stats      promapi.QueryStats `json:"stats"`
        }{
            ResultType: "vector",
            Result:     vr.samples,
            Stats:      vr.stats,
        },
    }
    d, err := json.MarshalIndent(result, "", " ")
@@ -328,6 +332,7 @@ func (vr vectorResponse) respond(w http.ResponseWriter, _ *http.Request) {

type matrixResponse struct {
    samples []*model.SampleStream
    stats   promapi.QueryStats
}

func (mr matrixResponse) respond(w http.ResponseWriter, r *http.Request) {
@@ -357,15 +362,18 @@ func (mr matrixResponse) respond(w http.ResponseWriter, r *http.Request) {
        Data   struct {
            ResultType string                `json:"resultType"`
            Result     []*model.SampleStream `json:"result"`
            Stats      promapi.QueryStats    `json:"stats"`
        } `json:"data"`
    }{
        Status: "success",
        Data: struct {
            ResultType string                `json:"resultType"`
            Result     []*model.SampleStream `json:"result"`
            Stats      promapi.QueryStats    `json:"stats"`
        }{
            ResultType: "matrix",
            Result:     samples,
            Stats:      mr.stats,
        },
    }
    d, err := json.MarshalIndent(result, "", " ")
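
As an illustration, a hypothetical test case could use these extended response
types to fake a Prometheus server that reports stats. This fragment is not part
of the commit; only field paths visible in this diff (`Stats.Timings.EvalTotalTime`,
`Stats.Samples.*`) are used, and value (non-pointer) nested structs are assumed.

```go
// Hypothetical test fixture: an empty vector result whose stats exceed
// limits such as maxEvaluationDuration="30s" and maxPeakSamples=10000.
var stats promapi.QueryStats
stats.Timings.EvalTotalTime = 35.0 // seconds
stats.Samples.TotalQueryableSamples = 300001
stats.Samples.PeakSamples = 10001
resp := vectorResponse{samples: model.Vector{}, stats: stats}
```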
53 changes: 46 additions & 7 deletions internal/checks/query_cost.go
@@ -3,6 +3,7 @@ package checks
import (
    "context"
    "fmt"
    "time"

    "github.com/cloudflare/pint/internal/discovery"
    "github.com/cloudflare/pint/internal/output"
@@ -15,18 +16,24 @@ const (
    BytesPerSampleQuery = "avg(avg_over_time(go_memstats_alloc_bytes[2h]) / avg_over_time(prometheus_tsdb_head_series[2h]))"
)

func NewCostCheck(prom *promapi.FailoverGroup, maxSeries, maxTotalSamples, maxPeakSamples int, maxEvaluationDuration time.Duration, severity Severity) CostCheck {
    return CostCheck{
        prom:                  prom,
        maxSeries:             maxSeries,
        maxTotalSamples:       maxTotalSamples,
        maxPeakSamples:        maxPeakSamples,
        maxEvaluationDuration: maxEvaluationDuration,
        severity:              severity,
    }
}

type CostCheck struct {
    prom                  *promapi.FailoverGroup
    maxSeries             int
    maxTotalSamples       int
    maxPeakSamples        int
    maxEvaluationDuration time.Duration
    severity              Severity
}
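
For reference, wiring the documented configuration options into this constructor
might look like the following fragment; the `prom` failover group and the `Bug`
severity value are assumed from the surrounding package, and all numbers are
hypothetical.

```go
// Hypothetical wiring of the documented options:
//   maxSeries=5000, maxTotalSamples=200000, maxPeakSamples=10000,
//   maxEvaluationDuration="1m", severity="bug".
check := NewCostCheck(prom, 5000, 200000, 10000, time.Minute, Bug)
```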

func (c CostCheck) Meta() CheckMeta {
@@ -95,5 +102,37 @@ func (c CostCheck) Check(ctx context.Context, _ string, rule parser.Rule, _ []di
        Text:     fmt.Sprintf("%s returned %d result(s)%s%s", promText(c.prom.Name(), qr.URI), series, estimate, above),
        Severity: severity,
    })

    if c.maxTotalSamples > 0 && qr.Stats.Samples.TotalQueryableSamples > c.maxTotalSamples {
        problems = append(problems, Problem{
            Fragment: expr.Value.Value,
            Lines:    expr.Lines(),
            Reporter: c.Reporter(),
            Text:     fmt.Sprintf("%s queried %d samples in total when executing this query, which is more than the configured limit of %d", promText(c.prom.Name(), qr.URI), qr.Stats.Samples.TotalQueryableSamples, c.maxTotalSamples),
            Severity: c.severity,
        })
    }

    if c.maxPeakSamples > 0 && qr.Stats.Samples.PeakSamples > c.maxPeakSamples {
        problems = append(problems, Problem{
            Fragment: expr.Value.Value,
            Lines:    expr.Lines(),
            Reporter: c.Reporter(),
            Text:     fmt.Sprintf("%s queried %d peak samples when executing this query, which is more than the configured limit of %d", promText(c.prom.Name(), qr.URI), qr.Stats.Samples.PeakSamples, c.maxPeakSamples),
            Severity: c.severity,
        })
    }

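    // Prometheus reports evalTotalTime as seconds in a float64 (for example
    // 0.5 means 500ms), so convert it to a time.Duration before comparing.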
    evalDur := time.Duration(qr.Stats.Timings.EvalTotalTime * float64(time.Second))
    if c.maxEvaluationDuration > 0 && evalDur > c.maxEvaluationDuration {
        problems = append(problems, Problem{
            Fragment: expr.Value.Value,
            Lines:    expr.Lines(),
            Reporter: c.Reporter(),
            Text:     fmt.Sprintf("%s took %s when executing this query, which is more than the configured limit of %s", promText(c.prom.Name(), qr.URI), output.HumanizeDuration(evalDur), output.HumanizeDuration(c.maxEvaluationDuration)),
            Severity: c.severity,
        })
    }

    return problems
}