spire-agent gets OOMKilled after pod restart #5067
Comments
I think this could be related to the fix introduced in #4231. The valyala/fastjson library has some pretty poor memory usage characteristics:

If there is a spike in the number of pods running on the kubelet, the pods response might be quite large. On a 64-bit platform, the size of fastjson.Value is 80 bytes. Even if we assume a 500KiB response, that works out to roughly 40MiB of parsed values (on the order of one fastjson.Value per byte of input). This is a per-attestation cost (we don't share the kubelet output, yet). Further, fastjson has an outstanding bug that causes memory to be held onto a little longer, meaning that if you are undergoing many attestations at once, the GC might not be able to release memory fast enough. There is a PR open to fix this (valyala/fastjson#101), but fastjson may not be actively maintained, so it is hard to know if/when that PR would land. We've considered moving to another, more frequently maintained library...
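For context, here is a minimal sketch of the parsing pattern being discussed. This is my illustration, not SPIRE's actual attestor code, and it assumes a hypothetical local file pods.json containing a copy of the kubelet /pods (PodList) response. The point it shows is that fastjson materializes an in-memory tree of Value nodes for the whole response, which is where the per-attestation memory amplification comes from.

```go
package main

import (
	"fmt"
	"os"

	"github.com/valyala/fastjson"
)

func main() {
	// Hypothetical local copy of the kubelet /pods (PodList) response.
	raw, err := os.ReadFile("pods.json")
	if err != nil {
		panic(err)
	}

	// ParseBytes builds a tree of fastjson.Value nodes (80 bytes each on
	// 64-bit platforms, as noted above) covering the entire document, so a
	// large PodList fans out to tens of MiB of live heap per attestation.
	var p fastjson.Parser
	v, err := p.ParseBytes(raw)
	if err != nil {
		panic(err)
	}

	items := v.GetArray("items")
	fmt.Printf("kubelet reports %d pods\n", len(items))

	// The Parser reuses its internal buffers across Parse calls, so parsed
	// data can stay reachable longer than expected under many concurrent
	// attestations (the retention issue mentioned above).
}
```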
Hi @azdagron,

We found that even though Spire v1.8.7 is much better than the older releases, spire-agent still got OOMKilled after a while, so the spike is still there. I therefore made a patched version with the parser from valyala/fastjson#101 and tested it; it seems to be working without issues. So, unfortunately, upgrading to Spire v1.8.7 or a later release is not enough.
That's what I suspected. Unfortunately, fastjson no longer seems to be actively maintained. It's probably to the benefit of the project to move to a different, actively maintained JSON parsing library (see the sketch below for what that could look like).
I'll close this issue in favor of those. |
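As a hedged illustration of that suggestion (this is not a SPIRE design decision; the field names follow the Kubernetes PodList schema, and pods.json is again a hypothetical local copy of the kubelet response), decoding into narrow typed structs with the standard library avoids materializing a generic value node for every JSON token:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// Only the fields an attestor might need are declared; everything else in
// the (potentially very large) PodList is skipped during decoding instead
// of being kept as generic value nodes.
type podList struct {
	Items []struct {
		Metadata struct {
			Name string `json:"name"`
			UID  string `json:"uid"`
		} `json:"metadata"`
		Status struct {
			ContainerStatuses []struct {
				ContainerID string `json:"containerID"`
			} `json:"containerStatuses"`
		} `json:"status"`
	} `json:"items"`
}

func main() {
	// Hypothetical local copy of the kubelet /pods response.
	f, err := os.Open("pods.json")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	var pods podList
	// Streaming decode straight into the typed structs; no intermediate
	// generic document tree is built.
	if err := json.NewDecoder(f).Decode(&pods); err != nil {
		panic(err)
	}
	fmt.Printf("kubelet reports %d pods\n", len(pods.Items))
}
```

Whichever maintained parser the project settles on, the design idea is the same: keep only the fields attestation actually needs rather than the full parsed document.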
When I delete a spire-agent pod in the cluster, it becomes unstable: the container gets OOMKilled a couple of times before becoming stable. These are the current resource settings:

There is only a warning about a container ID that is not found at attestation, and then the agent exits with reason OOMKilled. As can be seen on this graph, the average memory consumption is always below 200Mi, but after a restart there is a spike that causes the container to be restarted:

Can you please help me understand whether this behavior is normal? Is there any way, other than further increasing the memory resource limit, to make it stable?