First of all, I do not think this is a bug in general; it looks more like a problem with the recording rules, and possibly a GKE-specific issue.
The error budget is a large negative percentage on all our apiserver_request:availability30d metrics. For example, this metric currently has a value of -717.9133737393183:
1 - ((sum by (cluster) (cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d{verb=~"POST|PUT|PATCH|DELETE"}) - sum by (cluster) (cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d{le="1",verb=~"POST|PUT|PATCH|DELETE"})) + sum by (cluster) (code:apiserver_request_total:increase30d{code=~"5..",verb="write"} or vector(0))) / sum by (cluster) (code:apiserver_request_total:increase30d{verb="write"})
I split the recording rule into smaller chunks to understand what is going on. As I understand it, the metric is:
1 - ((all requests - fast requests) + error requests) / all requests
The problem seems to be that the "all requests" term in the numerator is cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d{verb=~"POST|PUT|PATCH|DELETE"}, while the "all requests" term in the denominator is code:apiserver_request_total:increase30d{verb="write"}, and on GKE these two values are far apart.
When I replace the denominator with the same expression used in the numerator, like this:
1 - ((sum by (cluster) (cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d{verb=~"POST|PUT|PATCH|DELETE"}) - sum by (cluster) (cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d{le="1",verb=~"POST|PUT|PATCH|DELETE"})) + sum by (cluster) (code:apiserver_request_total:increase30d{code=~"5..",verb="write"} or vector(0))) / sum by (cluster) (cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d{verb=~"POST|PUT|PATCH|DELETE"})
I get a result of ~0.1%, which seems reasonable.
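The effect of the mismatched denominator can be sketched numerically. The snippet below is only an illustration of the simplified formula above; the function name and all numbers are hypothetical and are not taken from the cluster in question.

```python
def availability(count_total, fast, errors, denominator_total):
    """Simplified rule: 1 - ((all - fast) + errors) / all.

    count_total and denominator_total are separate arguments because the
    recording rule takes the two "all requests" terms from different series.
    """
    slow = count_total - fast
    return 1 - (slow + errors) / denominator_total


# Hypothetical 30-day increases, chosen only for illustration.
count_total = 1_000_000  # SLI duration count for write verbs
fast = 999_000           # requests in the le="1" bucket
errors = 5               # 5xx write requests

# Denominator taken from the same series as the numerator.
consistent = availability(count_total, fast, errors, count_total)

# Denominator taken from a much smaller series (assumed value).
mismatched = availability(count_total, fast, errors, 100)

print(f"consistent denominator: {consistent}")
print(f"mismatched denominator: {mismatched}")
```

With a consistent denominator the result stays between 0 and 1. As soon as the denominator series is much smaller than the count series, the slow-request term alone exceeds it and the result drops below zero.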
So, I wonder:
Why the rule is built on two different metrics.
What cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d and code:apiserver_request_total:increase30d actually express, and why these values aren't the same on GKE.
I'm grateful for any hints pointing me in the right direction.
Ah, thanks for the tip. It was a coincidence that I had just started configuring our GKE cluster alerts and hadn't seen it work before, so I thought it had never worked before.
What parts of the codebase are affected?
Rules