BUG: Fix missing sleep in _watch_resource_loop #373

Merged
1 commit merged into kiwigrid:master on Dec 18, 2024

Conversation

yetisage (Contributor)

When upgrading a Loki Helm release, I noticed a sharp increase in the Kubernetes API server's memory usage immediately afterwards.
I found that the loki-sc-rules sidecars (which use the kiwigrid/k8s-sidecar image) were suddenly logging a lot more than usual, with all log lines looking something like this:

{"time": "2024-11-24T15:56:24.320161+00:00", "taskName": null, "msg": "ApiException when calling kubernetes: (403)\nReason: Forbidden\nHTTP response headers: HTTPHeaderDict({'Audit-Id': '33df569c-1218-4e1b-ad8e-5092c02b0d98', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'X-Kubernetes-Pf-Flowschema-Uid': 'e3350d13-36fe-460d-9422-d90ba1a8d608', 'X-Kubernetes-Pf-Prioritylevel-Uid': '7c4d615c-8ab4-4786-b3c8-1f8725853156', 'Date': 'Sun, 24 Nov 2024 15:56:24 GMT', 'Content-Length': '295'})\nHTTP response body: b'{\"kind\":\"Status\",\"apiVersion\":\"v1\",\"metadata\":{},\"status\":\"Failure\",\"message\":\"secrets is forbidden: User \\\\\"system:serviceaccount:monitoring:loki\\\\\" cannot watch resource \\\\\"secrets\\\\\" in API group \\\\\"\\\\\" in the namespace \\\\\"monitoring\\\\\"\",\"reason\":\"Forbidden\",\"details\":{\"kind\":\"secrets\"},\"code\":403}\\n'\n\n", "level": "ERROR"}

Looking into it, _watch_resource_loop was changed in #326, where the sleeps were split into the individual except clauses. However, the ApiException except clause did not get its own sleep, which causes it to create watch requests as fast as the loop allows.
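For illustration, here is a minimal sketch of what such a watch-and-retry loop looks like, assuming the official kubernetes Python client; this is not the actual k8s-sidecar code, and the function name, watched resource, and 5-second interval are assumptions. Without a sleep in the ApiException branch, a persistent 403 makes the loop open a new watch request on every iteration.

import time
from kubernetes import client, config, watch
from kubernetes.client.rest import ApiException

def watch_resource_loop(namespace):
    # Minimal sketch of a watch/retry loop, not the actual k8s-sidecar code.
    config.load_incluster_config()
    v1 = client.CoreV1Api()
    while True:
        try:
            w = watch.Watch()
            for event in w.stream(v1.list_namespaced_secret, namespace=namespace):
                # Handle ADDED/MODIFIED/DELETED events here.
                print(event['type'], event['object'].metadata.name)
        except ApiException as e:
            print(f"ApiException when calling kubernetes: ({e.status})")
            # Before the fix there was no sleep here, so a persistent 403 (or any
            # other 4xx) caused a new watch request on every loop iteration.
            time.sleep(5)  # the fix: back off before retrying; interval is illustrative
        except Exception as e:
            print(f"Exception when calling kubernetes: {e}")
            time.sleep(5)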

I created my own patched image with the change and ran a small test on a single-node Kubernetes cluster.
The test consisted of spinning up a small cluster, installing Loki with the Helm chart, and breaking the ClusterRoleBinding for the service account so that the sidecar receives a 403 status code.
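As a hedged sketch, one way to break the RBAC with the kubernetes Python client is shown below; the ClusterRoleBinding name is an assumption and depends on the Helm release (any change that removes the watch permission from the service account will do).

from kubernetes import client, config

# Hypothetical reproduction step: delete the ClusterRoleBinding that grants the
# Loki service account its watch permissions, so the sidecar starts receiving 403s.
# The name "loki-clusterrolebinding" is an assumption; adjust it to your release.
config.load_kube_config()
rbac = client.RbacAuthorizationV1Api()
rbac.delete_cluster_role_binding(name="loki-clusterrolebinding")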

I labeled the pods with sidecar_version to more easily distinguish between the log rates:

Query:
sum by(level, sidecar_version) (count_over_time({container="loki-sc-rules"} | json [$__auto]))
(screenshot: log rate per level and sidecar_version)

After changing to the patched image, the rate of ERROR logs dropped from 200-300 per second to roughly 2 per 5 seconds.

Add sleep to the ApiException except clause to reduce watch requests
when receiving a non-500 status code.
This avoids spamming the Kubernetes API with repeated watch requests
if, for example, the API returns a 4xx status code.
@yetisage (Contributor, Author)

Ironically, this will probably also happen if the Kubernetes API server returns a 429 (Too Many Requests) error 😃

@ChristianGeie added the labels bug (Something isn't working) and python (Pull requests that update Python code) on Dec 18, 2024
@ChristianGeie (Collaborator) left a comment:


I understand the problem and believe this solves the issue of the watch request being triggered again immediately. Thanks for reporting and for the contribution.

@ChristianGeie merged commit 5342afb into kiwigrid:master on Dec 18, 2024
10 checks passed