BUG: Fix missing sleep in _watch_resource_loop #373
Merged
When upgrading a Loki Helm release, I noticed a sharp increase in the Kubernetes API server's memory usage immediately afterwards.
I found that the `loki-sc-rules` sidecars (which use the kiwigrid/k8s-sidecar image) were suddenly logging a lot more than usual, with all log lines being something like:
Looking into it, the `_watch_resource_loop` seems to have had some changes in #326 where the sleeps were split into the except clauses. However, the `ApiException` except clause did not get its own sleep, which causes the loop to create watch requests as fast as it can iterate.
I created my own patched image with the change and ran a small test on a single-node Kubernetes cluster.
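The missing pause described above can be sketched roughly as follows. This is a simplified illustration, not the actual k8s-sidecar code: `ApiException` here is a local stand-in for the Kubernetes Python client's exception class, and `watch_resource_loop`, `watch_once`, `max_iterations`, and `ERROR_BACKOFF_SECONDS` are hypothetical names and values chosen for the example. The loop is bounded so the sketch terminates, whereas a real watch loop would presumably run indefinitely.

```python
import time


class ApiException(Exception):
    """Local stand-in for the Kubernetes client's ApiException (assumption)."""

    def __init__(self, status):
        super().__init__(f"API error, status={status}")
        self.status = status


ERROR_BACKOFF_SECONDS = 5  # hypothetical pause length, not taken from the PR


def watch_resource_loop(watch_once, max_iterations, sleep=time.sleep):
    """Call watch_once repeatedly and pause after every failure.

    Returns the number of pauses taken. The sleep function is injectable
    so the behavior can be checked without actually waiting.
    """
    pauses = 0
    for _ in range(max_iterations):
        try:
            watch_once()
        except ApiException as e:
            # This is the kind of clause the PR says had no sleep. Without
            # the pause, a persistent 403 is retried immediately, flooding
            # the API server with watch requests.
            print(f"API error while watching: status={e.status}")
            sleep(ERROR_BACKOFF_SECONDS)
            pauses += 1
        except Exception as e:
            # Every other failure path pauses as well.
            print(f"unexpected error while watching: {e}")
            sleep(ERROR_BACKOFF_SECONDS)
            pauses += 1
    return pauses


def always_forbidden():
    """Simulate a watch call that is rejected, as with a broken RBAC binding."""
    raise ApiException(403)


# Record the requested pauses instead of sleeping.
recorded = []
pauses = watch_resource_loop(always_forbidden, max_iterations=3, sleep=recorded.append)
```

With the pause in place, each failed watch attempt costs at least one backoff interval, so the request rate is bounded by the backoff rather than by how fast the loop can spin.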
The test consisted of spinning up a small Kubernetes cluster, installing Loki using the Helm chart, and breaking the ClusterRoleBinding to the ServiceAccount so that the sidecar's API requests are rejected with a 403 status code.
I labeled the pods with `sidecar_version` to more easily distinguish between the log rates.
Query:
sum by(level, sidecar_version) (count_over_time({container="loki-sc-rules"} | json [$__auto]))
After changing to the patched image, the rate of `ERROR` logs dropped from 200-300 per second to about 2 every 5 seconds.
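To put the two rates on the same scale, here is a small back-of-the-envelope calculation. It assumes the patched figure means 2 log lines every 5 seconds, which is one reading of the number given above; the variable names are illustrative only.

```python
# ERROR lines per second before the patch (range reported in the PR).
before_low, before_high = 200, 300

# Assumed reading of the patched rate: 2 lines every 5 seconds.
after_lines, after_seconds = 2, 5

# Reduction factor = old rate / new rate, with the new rate kept as a
# fraction to avoid floating-point rounding.
factor_low = before_low * after_seconds / after_lines
factor_high = before_high * after_seconds / after_lines

print(f"roughly {factor_low:.0f}x to {factor_high:.0f}x fewer ERROR lines per second")
```

Under that assumption the patched sidecar logs about 0.4 `ERROR` lines per second, a reduction of several hundred times.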