BUG: Fix missing sleep in _watch_resource_loop #373

Merged
1 commit merged into kiwigrid:master on Dec 18, 2024

Conversation

yetisage (Contributor)

When upgrading a Loki Helm release, I noticed a sharp increase in the Kubernetes API server's memory usage immediately afterwards.
I found that the loki-sc-rules sidecars (which use the kiwigrid/k8s-sidecar image) were suddenly logging a lot more than usual, with all log lines looking something like this:

{"time": "2024-11-24T15:56:24.320161+00:00", "taskName": null, "msg": "ApiException when calling kubernetes: (403)\nReason: Forbidden\nHTTP response headers: HTTPHeaderDict({'Audit-Id': '33df569c-1218-4e1b-ad8e-5092c02b0d98', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'X-Kubernetes-Pf-Flowschema-Uid': 'e3350d13-36fe-460d-9422-d90ba1a8d608', 'X-Kubernetes-Pf-Prioritylevel-Uid': '7c4d615c-8ab4-4786-b3c8-1f8725853156', 'Date': 'Sun, 24 Nov 2024 15:56:24 GMT', 'Content-Length': '295'})\nHTTP response body: b'{\"kind\":\"Status\",\"apiVersion\":\"v1\",\"metadata\":{},\"status\":\"Failure\",\"message\":\"secrets is forbidden: User \\\\\"system:serviceaccount:monitoring:loki\\\\\" cannot watch resource \\\\\"secrets\\\\\" in API group \\\\\"\\\\\" in the namespace \\\\\"monitoring\\\\\"\",\"reason\":\"Forbidden\",\"details\":{\"kind\":\"secrets\"},\"code\":403}\\n'\n\n", "level": "ERROR"}

Looking into it, _watch_resource_loop was changed in #326, where the sleeps were split into the individual except clauses. However, the ApiException except clause did not get its own sleep, which causes it to create watch requests as fast as the loop allows.
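For illustration, here is a minimal sketch of what such a watch-and-retry loop looks like, assuming the official kubernetes Python client; this is not the actual k8s-sidecar code, and the function name, watched resource, and 5-second interval are assumptions. Without a sleep in the ApiException branch, a persistent 403 makes the loop open a new watch request on every iteration.

import time
from kubernetes import client, config, watch
from kubernetes.client.rest import ApiException

def watch_resource_loop(namespace):
    # Minimal sketch of a watch/retry loop, not the actual k8s-sidecar code.
    config.load_incluster_config()
    v1 = client.CoreV1Api()
    while True:
        try:
            w = watch.Watch()
            for event in w.stream(v1.list_namespaced_secret, namespace=namespace):
                # Handle ADDED/MODIFIED/DELETED events here.
                print(event['type'], event['object'].metadata.name)
        except ApiException as e:
            print(f"ApiException when calling kubernetes: ({e.status})")
            # Before the fix there was no sleep here, so a persistent 403 (or any
            # other 4xx) caused a new watch request on every loop iteration.
            time.sleep(5)  # the fix: back off before retrying; interval is illustrative
        except Exception as e:
            print(f"Exception when calling kubernetes: {e}")
            time.sleep(5)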

I created my own patched image with the change and ran a small test on a single-node Kubernetes cluster.
The test consisted of spinning up a small cluster, installing Loki with the Helm chart, and breaking the ClusterRoleBinding for the service account so that the sidecar receives a 403 status code.
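As a hedged sketch, one way to break the RBAC with the kubernetes Python client is shown below; the ClusterRoleBinding name is an assumption and depends on the Helm release (any change that removes the watch permission from the service account will do).

from kubernetes import client, config

# Hypothetical reproduction step: delete the ClusterRoleBinding that grants the
# Loki service account its watch permissions, so the sidecar starts receiving 403s.
# The name "loki-clusterrolebinding" is an assumption; adjust it to your release.
config.load_kube_config()
rbac = client.RbacAuthorizationV1Api()
rbac.delete_cluster_role_binding(name="loki-clusterrolebinding")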

I labeled the pods with sidecar_version to more easily distinguish between the log rates:

Query:
sum by(level, sidecar_version) (count_over_time({container="loki-sc-rules"} | json [$__auto]))
(screenshot: log rate per level and sidecar_version)

After changing to the patched image, the rate of ERROR logs dropped from 200-300 per second to roughly 2 per 5 seconds.

Add sleep to the ApiException except clause to reduce watch requests
when receiving a non-500 status code.
This avoids spamming the Kubernetes API with repeated watch requests
if, for example, the API returns a 4xx status code.
@yetisage (Contributor, Author)

Ironically, this will probably also happen if the Kubernetes API server returns a 429 (Too Many Requests) error 😃

@ChristianGeie added the labels bug (Something isn't working) and python (Pull requests that update Python code) on Dec 18, 2024
@ChristianGeie (Collaborator) left a comment:


I understand the problem and believe this solves the issue of the watch request being triggered again immediately. Thanks for reporting and for the contribution.

@ChristianGeie merged commit 5342afb into kiwigrid:master on Dec 18, 2024
10 checks passed