
spire-agent gets OOMKilled after pod restart #5067

Closed
szvincze opened this issue Apr 15, 2024 · 6 comments

szvincze (Contributor) commented Apr 15, 2024

  • Version: up to 1.8.6
  • Platform: Linux 6.5.0-26-generic 26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Mar 12 10:22:43 UTC 2 x86_64 GNU/Linux
  • Subsystem: agent

When I delete a spire-agent pod in the cluster, the replacement pod becomes unstable: the container gets OOMKilled a couple of times before it stabilizes.

These are the current resource settings:

    Limits:
      cpu:     1
      memory:  570Mi
    Requests:
      cpu:     800m
      memory:  512Mi

There is only a warning about a container ID that is not found during workload attestation, and then the agent exits with reason OOMKilled.

time="2024-04-11T13:40:53Z" level=warning msg="Current umask 0022 is too permissive; setting umask 0027"
time="2024-04-11T13:40:53Z" level=info msg="Starting agent with data directory: "/run/spire/temp""
time="2024-04-11T13:40:53Z" level=warning msg="Agent is now configured to accept remote network connections for Prometheus stats collection. Please ensure access to this port is tightly controlled" subsystem_name=telemetry
time="2024-04-11T13:40:53Z" level=info msg="Plugin loaded" external=false plugin_name=k8s_psat plugin_type=NodeAttestor subsystem_name=catalog
time="2024-04-11T13:40:53Z" level=info msg="Plugin loaded" external=false plugin_name=memory plugin_type=KeyManager subsystem_name=catalog
time="2024-04-11T13:40:53Z" level=info msg="Plugin loaded" external=false plugin_name=k8s plugin_type=WorkloadAttestor subsystem_name=catalog
time="2024-04-11T13:40:53Z" level=info msg="Bundle loaded" subsystem_name=attestor trust_domain_id="spiffe://infra"
time="2024-04-11T13:40:53Z" level=info msg="SVID is not found. Starting node attestation" subsystem_name=attestor trust_domain_id="spiffe://infra"
time="2024-04-11T13:40:53Z" level=info msg="Node attestation was successful" rettestable=true spiffe_id="spiffe://infra/spire/agent/k8s_psat/infra-cluster/890be4ad-1618-4379-9bc9-c54bb55223d5" subsystem_name=attestor trust_domain_id="spiffe://infra"
time="2024-04-11T13:40:53Z" level=info msg="Renewing X509-SVID" entry_id=28908b2f-d02b-4269-8ecf-de78d350bf5d spiffe_id="spiffe://infra/ns/infra/pod/nsmgr-72bjt" subsystem_name=manager
time="2024-04-11T13:40:53Z" level=info msg="Renewing X509-SVID" entry_id=deaeb5c6-ab93-4851-88c5-e13497486f09 spiffe_id="spiffe://infra/ns/infra/pod/forwarder-vpp-xgbgq" subsystem_name=manager
time="2024-04-11T13:40:53Z" level=info msg="Renewing X509-SVID" entry_id=610ecd0f-e2c9-4e38-922d-9e325a3dd6cb spiffe_id="spiffe://infra/ns/cndsc3/pod/proxy-vpn1-6vkn4" subsystem_name=manager
time="2024-04-11T13:40:53Z" level=info msg="Renewing X509-SVID" entry_id=0415f6a1-240c-49b3-8215-7e337bcead79 spiffe_id="spiffe://infra/ns/cndsc3/pod/proxy-vpn2-pvb7p" subsystem_name=manager
time="2024-04-11T13:40:53Z" level=info msg="Renewing X509-SVID" entry_id=0d2837cb-9aa8-4189-b2d7-c3ddeb4ee587 spiffe_id="spiffe://infra/ns/cndsc3/pod/stateless-lb-frontend-attr-vpn2-5444c987npdbt" subsystem_name=manager
time="2024-04-11T13:40:53Z" level=info msg="Renewing X509-SVID" entry_id=cfdb2a8a-e340-4cef-8d3f-8944af4e4758 spiffe_id="spiffe://infra/ns/cndsc3/pod/fdr-fdcb9f8f6-tqqcj" subsystem_name=manager
time="2024-04-11T13:40:53Z" level=info msg="Renewing X509-SVID" entry_id=14d49a0a-2b58-4a7c-a0de-551df85d2fc1 spiffe_id="spiffe://infra/ns/infra/pod/registry-k8s-5cc46b8bbf-bk6wc" subsystem_name=manager
time="2024-04-11T13:40:53Z" level=info msg="Starting Workload and SDS APIs" address=/run/spire/sockets/agent.sock network=unix subsystem_name=endpoints
time="2024-04-11T13:40:53Z" level=warning msg="Container id not found" attempt=1 container_id=0c2c4b591c6e68635a6cfff55c91a90244bfb85c2091d0f43874785ff39789a7 external=false plugin_name=k8s plugin_type=WorkloadAttestor pod_uid=1f39d257-6173-4a4e-ac25-3706dd13db63 retry_interval=500ms subsystem_name=catalog
2024-04-11T13:40:58.743 Agent exit code: 137

As can be seen in the graph below, the average memory consumption stays below 200Mi, but after a restart there is a spike that causes the container to be restarted:
[Graph: spire-agent memory usage over time, showing a spike after pod restart (spire-agent-oomkilled)]

Can you please help me understand whether this behavior is normal? Is there any way to make it stable other than further increasing the memory resource limit?

MarcosDY added the triage/in-progress (Issue triage is in progress) label on Apr 16, 2024
azdagron (Member) commented

I think this could be related to the fix introduced in #4231. The valyala/fastjson library has some pretty poor memory usage characteristics:

fastjson requires up to sizeof(Value) * len(inputJSON) bytes of memory for parsing inputJSON string. Limit the maximum size of the inputJSON before parsing it in order to limit the maximum memory usage.

If there is a spike in the number of pods running on the kubelet, the pods response might be quite large. On a 64-bit platform, the size of fastjson.Value is 80 bytes. Even if we assume only a 500KiB response, that is 40MiB. This is a per-attestation cost (we don't share the kubelet output, yet).
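
To make the quoted recommendation concrete, here is a minimal Go sketch (illustration only, not SPIRE's actual attestor code) that bounds the kubelet /pods response size before handing it to fastjson. The maxPodsResponseBytes constant and parsePodsResponse helper are hypothetical names, and the unsafe.Sizeof call shows where the 80-bytes-per-value figure comes from:

    // Illustration only, not SPIRE's actual code: bound the input size before
    // parsing, since fastjson can need up to sizeof(Value) * len(input) bytes.
    package main

    import (
        "fmt"
        "unsafe"

        "github.com/valyala/fastjson"
    )

    // Hypothetical cap on the kubelet /pods response size.
    const maxPodsResponseBytes = 512 * 1024 // 512 KiB

    func parsePodsResponse(body []byte) (*fastjson.Value, error) {
        if len(body) > maxPodsResponseBytes {
            return nil, fmt.Errorf("kubelet /pods response too large: %d bytes", len(body))
        }
        var p fastjson.Parser
        return p.ParseBytes(body)
    }

    func main() {
        // Prints 80 on a typical 64-bit platform, which is where the
        // "80 bytes per value" figure above comes from; a 500KiB response can
        // therefore cost up to roughly 40MiB for a single parse.
        fmt.Println(unsafe.Sizeof(fastjson.Value{}))
    }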

Further, fastjson has an outstanding bug that causes memory to be held onto a little longer, meaning that if you are undergoing many attestations at once, the GC might not be able to release memory fast enough. There is a PR open (valyala/fastjson#101), but fastjson may not be actively maintained, so it is hard to know if or when that PR would land.

We've considered moving to another, more frequently maintained library...
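
For what it's worth, here is a rough sketch of that direction (my illustration under stated assumptions, not a decided approach): decoding the kubelet /pods response with the standard library's encoding/json into a minimal struct instead of building a full JSON DOM. The struct follows the general shape of the kubelet /pods (v1.PodList) response, and findPodUIDByContainerID is a hypothetical helper:

    // Sketch only: decode the kubelet /pods response into a minimal struct,
    // keeping just the fields a container-to-pod lookup needs.
    package main

    import (
        "encoding/json"
        "fmt"
        "strings"
    )

    type podList struct {
        Items []struct {
            Metadata struct {
                UID string `json:"uid"`
            } `json:"metadata"`
            Status struct {
                ContainerStatuses []struct {
                    ContainerID string `json:"containerID"`
                } `json:"containerStatuses"`
            } `json:"status"`
        } `json:"items"`
    }

    // findPodUIDByContainerID returns the UID of the pod owning containerID.
    // The kubelet prefixes container IDs with the runtime (e.g. "containerd://"),
    // so the match is on the suffix.
    func findPodUIDByContainerID(body []byte, containerID string) (string, bool) {
        var pl podList
        if err := json.Unmarshal(body, &pl); err != nil {
            return "", false
        }
        for _, item := range pl.Items {
            for _, cs := range item.Status.ContainerStatuses {
                if strings.HasSuffix(cs.ContainerID, containerID) {
                    return item.Metadata.UID, true
                }
            }
        }
        return "", false
    }

    func main() {
        body := []byte(`{"items":[{"metadata":{"uid":"pod-123"},"status":{"containerStatuses":[{"containerID":"containerd://abc123"}]}}]}`)
        uid, ok := findPodUIDByContainerID(body, "abc123")
        fmt.Println(uid, ok) // prints: pod-123 true
    }

Decoding into a trimmed struct keeps peak memory roughly proportional to the fields retained rather than to the size of the whole response.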

szvincze (Contributor, Author) commented

Hi @azdagron,
Thanks for the comment. In the meantime I updated the description, because it turned out that the spike was originally observed on v1.8.4. I managed to reproduce it up to v1.8.6, but it does not occur on v1.8.7 and later releases.
However, I am not 100% sure that I am seeing exactly the same behavior as my colleague who reported it first, so there will be a test in the original environment with SPIRE v1.8.7, and hopefully I can come back with the outcome soon.

szvincze (Contributor, Author) commented

We found that, even though SPIRE v1.8.7 is much better than the older releases, spire-agent still got OOMKilled after a while, so the spike is still there. I therefore made a patched version with the parser from valyala/fastjson#101 and tested it; it seems to work without issues. So, unfortunately, upgrading to SPIRE v1.8.7 or a later release is not enough.

azdagron (Member) commented

That's what I suspected. Unfortunately, fastjson seems to no longer be actively maintained. It would probably benefit the project to move to a different JSON parsing library that is actively maintained.

azdagron (Member) commented

I've opened #5109 and #5111 to track potential mitigations. I believe #5109 should be done no matter what, considering valyala/fastjson is not actively maintained. If that isn't enough, we could consider #5111, though it is more complicated.

azdagron (Member) commented

I'll close this issue in favor of those.
