[Bug]: SLO is negative all the time on GKE clusters #1000

TheKangaroo · 2024-12-16T17:34:27Z

What happened?

First of all, I do not think this is a general bug. It looks more like a problem with the recording rules, possibly specific to GKE.

We see a large negative error budget on all our apiserver_request:availability30d metrics. For example, the following expression currently evaluates to -717.9133737393183:

1 - ((sum by (cluster) (cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d{verb=~"POST|PUT|PATCH|DELETE"}) - sum by (cluster) (cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d{le="1",verb=~"POST|PUT|PATCH|DELETE"})) + sum by (cluster) (code:apiserver_request_total:increase30d{code=~"5..",verb="write"} or vector(0))) / sum by (cluster) (code:apiserver_request_total:increase30d{verb="write"})

I split the recording rule into smaller chunks to understand what's going on. As far as I understand, the metric is:

1 - ((all requests - fast requests) + error requests) / all requests

The problem seems to be that the "all requests" in the numerator is cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d{verb=~"POST|PUT|PATCH|DELETE"}, while the "all requests" in the denominator is code:apiserver_request_total:increase30d{verb="write"}, and on GKE these two values differ widely.

When I replace the denominator with the same series used in the numerator, like this:
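To illustrate the effect, here is a minimal Python sketch of the formula above. All numbers are hypothetical and chosen only to show the mechanism; they are not taken from the cluster in this issue, and the function and variable names are made up for the example.

```python
def availability(sli_count, fast_count, error_count, total_count):
    """Compute 1 - ((all - fast) + errors) / all.

    sli_count is the "all requests" used in the numerator,
    total_count is the "all requests" used in the denominator.
    """
    slow = sli_count - fast_count
    return 1 - (slow + error_count) / total_count


# Hypothetical 30-day increases (assumptions, not real data).
sli_count = 1_000_000   # stands in for the SLI duration histogram count
fast_count = 999_000    # stands in for the le="1" bucket
error_count = 50        # stands in for 5xx responses
small_total = 1_000     # a denominator series that reports far fewer requests

# Numerator and denominator disagree: the ratio exceeds 1, so the result is negative.
mismatched = availability(sli_count, fast_count, error_count, small_total)

# Same series in numerator and denominator: the result stays between 0 and 1.
consistent = availability(sli_count, fast_count, error_count, sli_count)

print(mismatched)  # roughly -0.05
print(consistent)  # roughly 0.99895
```

The more the denominator undercounts relative to the numerator, the more negative the result becomes, which would match a value such as -717.9.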

1 - ((sum by (cluster) (cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d{verb=~"POST|PUT|PATCH|DELETE"}) - sum by (cluster) (cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d{le="1",verb=~"POST|PUT|PATCH|DELETE"})) + sum by (cluster) (code:apiserver_request_total:increase30d{code=~"5..",verb="write"} or vector(0))) / sum by (cluster) (cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d{verb=~"POST|PUT|PATCH|DELETE"})

I get a result of ~0.1%, which seems reasonable.
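The difference between the two expressions can also be written out symbolically. The symbols here are shorthand introduced for this note and do not appear in the rules: N_sli is the SLI duration count, F the le="1" bucket, E the 5xx requests, and N_total the apiserver_request_total count.

```latex
\begin{align}
A_{\text{orig}} &= 1 - \frac{(N_{\text{sli}} - F) + E}{N_{\text{total}}},
  & A_{\text{orig}} < 0 &\iff (N_{\text{sli}} - F) + E > N_{\text{total}} \\
A_{\text{mod}} &= 1 - \frac{(N_{\text{sli}} - F) + E}{N_{\text{sli}}}
  = \frac{F - E}{N_{\text{sli}}},
  & A_{\text{mod}} \ge 0 &\iff E \le F
\end{align}
```

With a shared denominator the result cannot exceed 1 (since F is at most N_sli) and only goes negative if errors outnumber fast requests. With two different denominators there is no such bound.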

So I wonder:

Why the rule is built on two different metrics.
What cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d and code:apiserver_request_total:increase30d actually express, and why their values differ on GKE.

I'm grateful for any hints pointing me in the right direction.

Please provide any helpful snippets.

No response

What parts of the codebase are affected?

Rules

I agree to the following terms:

I agree to follow this project's Code of Conduct.
I have filled out all the required information above to the best of my ability.
I have searched the issues of this repository and believe that this is not a duplicate.
I have confirmed this bug exists in the default branch of the repository, as of the latest commit at the time of submission.


skl · 2024-12-16T18:39:30Z

Looks like this might be fixed by:

fix: apiserver availability apiserver_request_sli_duration_seconds_count:increase30d #998

TheKangaroo · 2024-12-16T19:34:47Z

Ah, thanks for the tip. By coincidence, I had only just started configuring our GKE cluster alerts and had never seen the rule work, so I assumed it had never worked.

skl · 2024-12-17T10:15:12Z

@TheKangaroo #998 was merged, let me know if that resolves the issue on your side 👍

TheKangaroo · 2024-12-18T14:59:07Z

I just checked, and this has indeed resolved my issue. Thank you so much for pointing me in this direction! :)

skl self-assigned this Dec 17, 2024

skl added the bug Something isn't working label Dec 17, 2024

TheKangaroo closed this as completed Dec 18, 2024
