[Meta][Heartbeat] NodeJS process leak in elastic-agent-complete #32386

Closed · emilioalvap opened this issue on Jul 18, 2022 · 2 comments
Labels: bug, Team:obs-ds-hosted-services

emilioalvap (Collaborator) commented on Jul 18, 2022

Summary

After reports of memory leaks in elastic-agent-complete containers running on AWS Fargate, we've identified a couple of scenarios in which elastic-agent-complete spawns unresponsive or orphaned node processes:

  • The node process occasionally becomes unresponsive and never exits.
  • Adding a new Fleet integration triggers an elastic-agent-issued restart while a synthetics monitor is still running.

Details

NodeJS goes unresponsive and doesn't exit

On some containers used for testing, node processes become unresponsive seemingly at random. Once a process is unresponsive, Heartbeat (HB) blocks that monitor from running again while it waits for node to exit.
Even though node is unresponsive, it keeps allocating memory over time until the kernel issues an OOM kill or V8 reaches its max heap size:

[Screenshot: memory usage of the unresponsive node process growing over time]

/// APM logs
11:07:16.802 elastic_agent.filebeat [elastic_agent.filebeat][info] Non-zero metrics in the last 30s
11:07:36.268 elastic_agent.heartbeat [elastic_agent.heartbeat][info] stderr: 
11:07:36.268 elastic_agent.heartbeat [elastic_agent.heartbeat][info] stderr: <--- Last few GCs --->
11:07:36.269 elastic_agent.heartbeat [elastic_agent.heartbeat][info] stderr: 
11:07:36.270 elastic_agent.heartbeat [elastic_agent.heartbeat][info] stderr: [51102:0x5e2e970] 87623078 ms: Scavenge 2026.0 (2080.2) -> 2019.5 (2080.2) MB, 4.5 / 0.0 ms  (average mu = 0.992, current mu = 0.994) allocation failure 
11:07:36.270 elastic_agent.heartbeat [elastic_agent.heartbeat][info] stderr: [51102:0x5e2e970] 87623104 ms: Scavenge 2026.5 (2080.2) -> 2019.8 (2080.2) MB, 4.7 / 0.0 ms  (average mu = 0.992, current mu = 0.994) allocation failure 
11:07:36.270 elastic_agent.heartbeat [elastic_agent.heartbeat][info] stderr: [51102:0x5e2e970] 87623132 ms: Scavenge 2027.3 (2080.2) -> 2020.2 (2096.2) MB, 5.2 / 0.0 ms  (average mu = 0.992, current mu = 0.994) allocation failure 
11:07:36.270 elastic_agent.heartbeat [elastic_agent.heartbeat][info] stderr: 
11:07:36.270 elastic_agent.heartbeat [elastic_agent.heartbeat][info] stderr: 
11:07:36.270 elastic_agent.heartbeat [elastic_agent.heartbeat][info] stderr: <--- JS stacktrace --->
11:07:36.270 elastic_agent.heartbeat [elastic_agent.heartbeat][info] stderr: 
11:07:36.271 elastic_agent.heartbeat [elastic_agent.heartbeat][info] stderr: FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
11:07:36.271 elastic_agent.heartbeat [elastic_agent.heartbeat][info] stderr:  1: 0xb09c10 node::Abort() [node]
11:07:36.271 elastic_agent.heartbeat [elastic_agent.heartbeat][info] stderr:  2: 0xa1c193 node::FatalError(char const*, char const*) [node]
11:07:36.271 elastic_agent.heartbeat [elastic_agent.heartbeat][info] stderr:  3: 0xcf8dbe v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [node]
11:07:36.271 elastic_agent.heartbeat [elastic_agent.heartbeat][info] stderr:  4: 0xcf9137 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [node]
11:07:36.271 elastic_agent.heartbeat [elastic_agent.heartbeat][info] stderr:  5: 0xeb09d5  [node]
11:07:36.272 elastic_agent.heartbeat [elastic_agent.heartbeat][info] stderr:  6: 0xec069d v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [node]
11:07:36.272 elastic_agent.heartbeat [elastic_agent.heartbeat][info] stderr:  7: 0xec339e v8::internal::Heap::AllocateRawWithRetryOrFailSlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [node]
11:07:36.272 elastic_agent.heartbeat [elastic_agent.heartbeat][info] stderr:  8: 0xe848da v8::internal::Factory::NewFillerObject(int, bool, v8::internal::AllocationType, v8::internal::AllocationOrigin) [node]
11:07:36.272 elastic_agent.heartbeat [elastic_agent.heartbeat][info] stderr:  9: 0x11fd626 v8::internal::Runtime_AllocateInYoungGeneration(int, unsigned long*, v8::internal::Isolate*) [node]
11:07:36.272 elastic_agent.heartbeat [elastic_agent.heartbeat][info] stderr: 10: 0x15f2099  [node]
11:07:36.634 elastic_agent.heartbeat [elastic_agent.heartbeat][warn] error scanning synthetics runner results read |0: file already closed
11:07:36.634 elastic_agent.heartbeat [elastic_agent.heartbeat][warn] could not scan stdout events from synthetics: read |0: file already closed
11:07:36.635 elastic_agent.heartbeat [elastic_agent.heartbeat][warn] error scanning synthetics runner results read |0: file already closed
11:07:36.635 elastic_agent.heartbeat [elastic_agent.heartbeat][warn] could not scan stderr events from synthetics: read |0: file already closed
11:07:36.635 elastic_agent.heartbeat [elastic_agent.heartbeat][info] Command has completed(-1): /usr/share/elastic-agent/.node/node/bin/elastic-synthetics elastic-synthetics --screenshots on --throttling 5d/3u/20l --inline --rich-events
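For context, here is a minimal sketch of the failure mode, not Heartbeat's actual implementation: a runner that simply waits on the synthetics child process is stuck indefinitely if node hangs instead of exiting, whereas running the command under a context with a deadline bounds that wait. The binary path and flags are copied from the "Command has completed" log line above; the timeout value is an illustrative assumption.

```go
// Illustrative sketch only (not Heartbeat's code): if node hangs, a plain
// wait on the child never returns and the monitor stays blocked. A context
// deadline guarantees the child is eventually killed.
package main

import (
	"context"
	"os/exec"
	"time"
)

func runSyntheticsOnce(parent context.Context) error {
	// Timeout value is an assumption for illustration purposes.
	ctx, cancel := context.WithTimeout(parent, 10*time.Minute)
	defer cancel()

	// Binary path and flags taken from the log output above.
	cmd := exec.CommandContext(ctx,
		"/usr/share/elastic-agent/.node/node/bin/elastic-synthetics",
		"--screenshots", "on", "--throttling", "5d/3u/20l", "--inline", "--rich-events")

	// Without the context, this call never returns for a hung node process,
	// which is what blocks the monitor from running again.
	return cmd.Run()
}
```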

Fleet integration policy update

When a new Synthetics integration is added to an agent's policy, elastic-agent triggers a restart of all beats processes. At the moment, heartbeat exits without checking whether any synthetics monitors are still running.

Since the monitors run under a different PID, when HB exits these node processes are orphaned and end up reparented as children of PID 1. These children are never reaped and never release the resources already allocated to them:

[Screenshot: orphaned node processes reparented to PID 1]
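One possible mitigation, sketched below under the assumption that the node children run on Linux, is to start the synthetics child in its own process group and signal the whole group on shutdown, so node (and any browser processes it spawned) are not silently reparented to PID 1. This is not the actual Heartbeat implementation, just an illustration of the mechanism.

```go
// Hypothetical mitigation sketch (not Heartbeat's implementation): give the
// synthetics child its own process group so the whole group can be signalled
// when heartbeat shuts down, instead of leaving orphans behind.
package main

import (
	"os/exec"
	"syscall"
)

func startSynthetics() (*exec.Cmd, error) {
	cmd := exec.Command(
		"/usr/share/elastic-agent/.node/node/bin/elastic-synthetics",
		"--inline")
	// Place the child in its own process group (Linux).
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	if err := cmd.Start(); err != nil {
		return nil, err
	}
	return cmd, nil
}

// stopSynthetics signals the entire process group; the negative PID addresses
// the group rather than just the node process itself.
func stopSynthetics(cmd *exec.Cmd) error {
	return syscall.Kill(-cmd.Process.Pid, syscall.SIGTERM)
}
```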

Tracking

These are the individual issues:

@emilioalvap added the bug and Team:obs-ds-hosted-services labels on Jul 18, 2022
elasticmachine (Collaborator) commented

Pinging @elastic/uptime (Team:Uptime)

@emilioalvap changed the title from "[Meta] NodeJS process leak in elastic-agent-complete" to "[Meta][Heartbeat] NodeJS process leak in elastic-agent-complete" on Jul 18, 2022
paulb-elastic (Contributor) commented

Closing to leave the two remaining separate issues to be tracked on their own.
