
spire-agent gets OOMKilled after pod restart #5067

Closed
szvincze opened this issue Apr 15, 2024 · 6 comments

szvincze (Contributor) commented Apr 15, 2024

  • Version: up to 1.8.6
  • Platform: Linux 6.5.0-26-generic 26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Mar 12 10:22:43 UTC 2 x86_64 GNU/Linux
  • Subsystem: agent

When I delete a spire-agent pod in the cluster, the replacement pod becomes unstable: the container gets OOMKilled a couple of times before it stabilizes.

These are the current resource settings:

    Limits:
      cpu:     1
      memory:  570Mi
    Requests:
      cpu:     800m
      memory:  512Mi

There is only a warning about a container ID that is not found during workload attestation, and then the agent exits with reason OOMKilled.

time="2024-04-11T13:40:53Z" level=warning msg="Current umask 0022 is too permissive; setting umask 0027"
time="2024-04-11T13:40:53Z" level=info msg="Starting agent with data directory: "/run/spire/temp""
time="2024-04-11T13:40:53Z" level=warning msg="Agent is now configured to accept remote network connections for Prometheus stats collection. Please ensure access to this port is tightly controlled" subsystem_name=telemetry
time="2024-04-11T13:40:53Z" level=info msg="Plugin loaded" external=false plugin_name=k8s_psat plugin_type=NodeAttestor subsystem_name=catalog
time="2024-04-11T13:40:53Z" level=info msg="Plugin loaded" external=false plugin_name=memory plugin_type=KeyManager subsystem_name=catalog
time="2024-04-11T13:40:53Z" level=info msg="Plugin loaded" external=false plugin_name=k8s plugin_type=WorkloadAttestor subsystem_name=catalog
time="2024-04-11T13:40:53Z" level=info msg="Bundle loaded" subsystem_name=attestor trust_domain_id="spiffe://infra"
time="2024-04-11T13:40:53Z" level=info msg="SVID is not found. Starting node attestation" subsystem_name=attestor trust_domain_id="spiffe://infra"
time="2024-04-11T13:40:53Z" level=info msg="Node attestation was successful" rettestable=true spiffe_id="spiffe://infra/spire/agent/k8s_psat/infra-cluster/890be4ad-1618-4379-9bc9-c54bb55223d5" subsystem_name=attestor trust_domain_id="spiffe://infra"
time="2024-04-11T13:40:53Z" level=info msg="Renewing X509-SVID" entry_id=28908b2f-d02b-4269-8ecf-de78d350bf5d spiffe_id="spiffe://infra/ns/infra/pod/nsmgr-72bjt" subsystem_name=manager
time="2024-04-11T13:40:53Z" level=info msg="Renewing X509-SVID" entry_id=deaeb5c6-ab93-4851-88c5-e13497486f09 spiffe_id="spiffe://infra/ns/infra/pod/forwarder-vpp-xgbgq" subsystem_name=manager
time="2024-04-11T13:40:53Z" level=info msg="Renewing X509-SVID" entry_id=610ecd0f-e2c9-4e38-922d-9e325a3dd6cb spiffe_id="spiffe://infra/ns/cndsc3/pod/proxy-vpn1-6vkn4" subsystem_name=manager
time="2024-04-11T13:40:53Z" level=info msg="Renewing X509-SVID" entry_id=0415f6a1-240c-49b3-8215-7e337bcead79 spiffe_id="spiffe://infra/ns/cndsc3/pod/proxy-vpn2-pvb7p" subsystem_name=manager
time="2024-04-11T13:40:53Z" level=info msg="Renewing X509-SVID" entry_id=0d2837cb-9aa8-4189-b2d7-c3ddeb4ee587 spiffe_id="spiffe://infra/ns/cndsc3/pod/stateless-lb-frontend-attr-vpn2-5444c987npdbt" subsystem_name=manager
time="2024-04-11T13:40:53Z" level=info msg="Renewing X509-SVID" entry_id=cfdb2a8a-e340-4cef-8d3f-8944af4e4758 spiffe_id="spiffe://infra/ns/cndsc3/pod/fdr-fdcb9f8f6-tqqcj" subsystem_name=manager
time="2024-04-11T13:40:53Z" level=info msg="Renewing X509-SVID" entry_id=14d49a0a-2b58-4a7c-a0de-551df85d2fc1 spiffe_id="spiffe://infra/ns/infra/pod/registry-k8s-5cc46b8bbf-bk6wc" subsystem_name=manager
time="2024-04-11T13:40:53Z" level=info msg="Starting Workload and SDS APIs" address=/run/spire/sockets/agent.sock network=unix subsystem_name=endpoints
time="2024-04-11T13:40:53Z" level=warning msg="Container id not found" attempt=1 container_id=0c2c4b591c6e68635a6cfff55c91a90244bfb85c2091d0f43874785ff39789a7 external=false plugin_name=k8s plugin_type=WorkloadAttestor pod_uid=1f39d257-6173-4a4e-ac25-3706dd13db63 retry_interval=500ms subsystem_name=catalog
2024-04-11T13:40:58.743 Agent exit code: 137

As can be seen in the graph below, the average memory consumption stays below 200Mi, but after a restart there is a spike that causes the container to be restarted:
[Graph: spire-agent memory usage over time, showing a spike after pod restart (spire-agent-oomkilled)]

Can you please help me understand whether this behavior is normal? Is there any way to make it stable other than further increasing the memory resource limit?

MarcosDY added the triage/in-progress (Issue triage is in progress) label on Apr 16, 2024
azdagron (Member) commented

I think this could be related to the fix introduced in #4231. The valyala/fastjson library has some pretty poor memory usage characteristics:

fastjson requires up to sizeof(Value) * len(inputJSON) bytes of memory for parsing inputJSON string. Limit the maximum size of the inputJSON before parsing it in order to limit the maximum memory usage.

If there is a spike in the number of pods running on the kubelet, the pods response might be quite large. On a 64-bit platform, the size of fastjson.Value is 80 bytes. Even if we assume only a 500KiB response, that is 40MiB. This is a per-attestation cost (we don't share the kubelet output, yet).
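
To make the quoted recommendation concrete, here is a minimal Go sketch (illustration only, not SPIRE's actual attestor code) that bounds the kubelet /pods response size before handing it to fastjson. The maxPodsResponseBytes constant and parsePodsResponse helper are hypothetical names, and the unsafe.Sizeof call shows where the 80-bytes-per-value figure comes from:

    // Illustration only, not SPIRE's actual code: bound the input size before
    // parsing, since fastjson can need up to sizeof(Value) * len(input) bytes.
    package main

    import (
        "fmt"
        "unsafe"

        "github.com/valyala/fastjson"
    )

    // Hypothetical cap on the kubelet /pods response size.
    const maxPodsResponseBytes = 512 * 1024 // 512 KiB

    func parsePodsResponse(body []byte) (*fastjson.Value, error) {
        if len(body) > maxPodsResponseBytes {
            return nil, fmt.Errorf("kubelet /pods response too large: %d bytes", len(body))
        }
        var p fastjson.Parser
        return p.ParseBytes(body)
    }

    func main() {
        // Prints 80 on a typical 64-bit platform, which is where the
        // "80 bytes per value" figure above comes from; a 500KiB response can
        // therefore cost up to roughly 40MiB for a single parse.
        fmt.Println(unsafe.Sizeof(fastjson.Value{}))
    }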

Further, fastjson has an outstanding bug that causes memory to be held onto a little longer, meaning that if you are undergoing many attestations at once, the GC might not be able to release memory fast enough. There is a PR open (valyala/fastjson#101), but fastjson may not be actively maintained, so it is hard to know if or when that PR would land.

We've considered moving to another, more frequently maintained library...
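
For what it's worth, here is a rough sketch of that direction (my illustration under stated assumptions, not a decided approach): decoding the kubelet /pods response with the standard library's encoding/json into a minimal struct instead of building a full JSON DOM. The struct follows the general shape of the kubelet /pods (v1.PodList) response, and findPodUIDByContainerID is a hypothetical helper:

    // Sketch only: decode the kubelet /pods response into a minimal struct,
    // keeping just the fields a container-to-pod lookup needs.
    package main

    import (
        "encoding/json"
        "fmt"
        "strings"
    )

    type podList struct {
        Items []struct {
            Metadata struct {
                UID string `json:"uid"`
            } `json:"metadata"`
            Status struct {
                ContainerStatuses []struct {
                    ContainerID string `json:"containerID"`
                } `json:"containerStatuses"`
            } `json:"status"`
        } `json:"items"`
    }

    // findPodUIDByContainerID returns the UID of the pod owning containerID.
    // The kubelet prefixes container IDs with the runtime (e.g. "containerd://"),
    // so the match is on the suffix.
    func findPodUIDByContainerID(body []byte, containerID string) (string, bool) {
        var pl podList
        if err := json.Unmarshal(body, &pl); err != nil {
            return "", false
        }
        for _, item := range pl.Items {
            for _, cs := range item.Status.ContainerStatuses {
                if strings.HasSuffix(cs.ContainerID, containerID) {
                    return item.Metadata.UID, true
                }
            }
        }
        return "", false
    }

    func main() {
        body := []byte(`{"items":[{"metadata":{"uid":"pod-123"},"status":{"containerStatuses":[{"containerID":"containerd://abc123"}]}}]}`)
        uid, ok := findPodUIDByContainerID(body, "abc123")
        fmt.Println(uid, ok) // prints: pod-123 true
    }

Decoding into a trimmed struct keeps peak memory roughly proportional to the fields retained rather than to the size of the whole response.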

szvincze (Contributor, Author) commented

Hi @azdagron,
Thanks for the comment. In the meantime I updated the description, because it turned out that the spike was originally observed on v1.8.4. I managed to reproduce it up to v1.8.6, but it does not occur on v1.8.7 and later releases.
However, I am not 100% sure that I am seeing exactly the same behavior as my colleague who reported it first, so there will be a test in the original environment with SPIRE v1.8.7, and hopefully I can come back with the outcome soon.

szvincze (Contributor, Author) commented

We found that, even though SPIRE v1.8.7 is much better than the older releases, spire-agent still got OOMKilled after a while, so the spike is still there. I therefore made a patched version with the parser from valyala/fastjson#101 and tested it; it seems to work without issues. So, unfortunately, upgrading to SPIRE v1.8.7 or a later release is not enough.

azdagron (Member) commented

That's what I suspected. Unfortunately, fastjson seems to no longer be actively maintained. It would probably benefit the project to move to a different JSON parsing library that is actively maintained.

azdagron (Member) commented

I've opened #5109 and #5111 to track potential mitigations. I believe #5109 should be done no matter what, considering valyala/fastjson is not actively maintained. If that isn't enough, we could consider #5111, though it is more complicated.

azdagron (Member) commented

I'll close this issue in favor of those.
