[Bug]: SLO is negative all the time on GKE clusters #1000

Closed
4 tasks done
TheKangaroo opened this issue Dec 16, 2024 · 4 comments

@TheKangaroo

What happened?

I don't think this is a general bug; it looks more like a problem with the recording rules, possibly specific to GKE.

We see a large negative error budget on all our apiserver_request:availability30d metrics. For example, this expression currently evaluates to -717.9133737393183:

1 - ((sum by (cluster) (cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d{verb=~"POST|PUT|PATCH|DELETE"}) - sum by (cluster) (cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d{le="1",verb=~"POST|PUT|PATCH|DELETE"})) + sum by (cluster) (code:apiserver_request_total:increase30d{code=~"5..",verb="write"} or vector(0))) / sum by (cluster) (code:apiserver_request_total:increase30d{verb="write"})

I split the recording rule into smaller chunks to understand what's going on. As far as I can tell, the metric is:

1 - ((all requests - fast requests) + error requests) / all requests

The problem seems to be that the "all requests" in the numerator is cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d{verb=~"POST|PUT|PATCH|DELETE"}, while the "all requests" in the denominator is code:apiserver_request_total:increase30d{verb="write"}, and on GKE these two are nowhere near the same value.

When I replace the denominator with the same series used in the numerator, like this:

1 - ((sum by (cluster) (cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d{verb=~"POST|PUT|PATCH|DELETE"}) - sum by (cluster) (cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d{le="1",verb=~"POST|PUT|PATCH|DELETE"})) + sum by (cluster) (code:apiserver_request_total:increase30d{code=~"5..",verb="write"} or vector(0))) / sum by (cluster) (cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d{verb=~"POST|PUT|PATCH|DELETE"})

I get a result of ~0.1%, which seems reasonable.
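The effect of the mismatched denominator can be illustrated with a small arithmetic sketch. All counts below are made-up numbers chosen only to show the mechanism, not values from a real cluster, and the `availability` helper is a hypothetical name that mirrors the simplified formula above.

```python
def availability(sli_total, sli_fast, errors, denominator):
    """1 - ((all requests - fast requests) + error requests) / all requests."""
    slow = sli_total - sli_fast
    return 1 - (slow + errors) / denominator


# Hypothetical 30-day counts (assumptions for illustration only).
sli_total = 1_000_000   # stands in for ...sli_duration_seconds_count:increase30d
sli_fast = 999_000      # stands in for the le="1" bucket
errors = 0              # stands in for the 5xx count
request_total = 100     # stands in for code:apiserver_request_total:increase30d,
                        # assumed here to be far smaller than sli_total

# Denominator taken from a different, much smaller series: result is negative.
mismatched = availability(sli_total, sli_fast, errors, request_total)

# Denominator taken from the same series as the numerator: result stays in [0, 1].
consistent = availability(sli_total, sli_fast, errors, sli_total)

print(mismatched, consistent)
```

Whenever the denominator series is much smaller than the series used for "all requests" in the numerator, the fraction exceeds 1 and the availability drops below zero, which matches the kind of value reported above.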

So, I wonder:

  1. Why the rule is built on two different metrics.
  2. What cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d and code:apiserver_request_total:increase30d actually express, and why these values aren't the same on GKE.

I'm grateful for any hints pointing me in the right direction.

Please provide any helpful snippets.

No response

What parts of the codebase are affected?

Rules

I agree to the following terms:

  • I agree to follow this project's Code of Conduct.
  • I have filled out all the required information above to the best of my ability.
  • I have searched the issues of this repository and believe that this is not a duplicate.
  • I have confirmed this bug exists in the default branch of the repository, as of the latest commit at the time of submission.
@skl
Collaborator

skl commented Dec 16, 2024

@TheKangaroo
Author

Ah, thanks for the tip. By coincidence, I had only just started configuring our GKE cluster alerts and had never seen the rule work, so I assumed it never had.

@skl
Collaborator

skl commented Dec 17, 2024

@TheKangaroo #998 was merged, let me know if that resolves the issue on your side 👍

skl self-assigned this Dec 17, 2024
skl added the bug (Something isn't working) label Dec 17, 2024

@TheKangaroo
Author

I just checked, and this has indeed resolved my issue. Thank you so much for pointing me in this direction! :)
