Long-running agents lose RabbitMQ connection #168

Open
tskluzac opened this issue May 24, 2024 · 2 comments

Comments

@tskluzac
Collaborator

I have noticed that when a Zambeze agent runs for an extended period of time, it often hits a "RabbitMQ::AMQP -- Channel closed" error that seems to cause processing to halt. We should find a way to be robust to these failures. We should probably implement:

Steps to reproduce:

  • Start the agent, launch a campaign, wait 30 minutes, then launch another campaign. (This has happened to me twice today.)

I would like to propose the following test to close this issue:

  • Run two Zambeze agents, perhaps Defiant and Frontier.
  • Write a script that intermittently launches jobs to Zambeze, with gaps of 1 to 10 minutes between them, over the course of 3 hours.
  • Ensure that all tasks successfully execute.
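
The proposed test could be driven by something like the sketch below. It only covers the scheduling and bookkeeping: `launch_job` is a hypothetical callable that would submit one campaign to Zambeze and raise on failure, and the function names, defaults, and the `seed` parameter are assumptions for illustration, not part of Zambeze.

```python
import random
import time


def launch_schedule(total_seconds=3 * 60 * 60, min_gap=60, max_gap=600, seed=None):
    """Return launch times (seconds from start) separated by random gaps.

    Defaults follow the proposal: gaps of 1 to 10 minutes over 3 hours.
    """
    rng = random.Random(seed)
    offsets = []
    t = 0.0
    while True:
        t += rng.uniform(min_gap, max_gap)
        if t > total_seconds:
            break
        offsets.append(t)
    return offsets


def run_soak_test(launch_job, offsets, sleep=time.sleep, clock=time.monotonic):
    """Call launch_job(i) at each offset and return a list of (i, exception).

    launch_job is a hypothetical stand-in for "submit a campaign and wait
    for it to finish"; an empty return value means every launch succeeded.
    """
    start = clock()
    failures = []
    for i, offset in enumerate(offsets):
        delay = offset - (clock() - start)
        if delay > 0:
            sleep(delay)
        try:
            launch_job(i)
        except Exception as exc:  # record the failure and keep going
            failures.append((i, exc))
    return failures
```

The issue could then be considered closed when `run_soak_test(launch_job, launch_schedule())` returns an empty list for both agents.
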
@wigging
Collaborator

wigging commented Jul 18, 2024

I replicated this issue with the following steps:

  1. Start up RabbitMQ using the official Docker image/container
  2. Start zambeze. This creates 4 AMQP connections with RabbitMQ.
  3. After 3 minutes, RabbitMQ reports errors caused by missed heartbeats from the client, and two of the connections are closed.

The errors displayed in the terminal are:

2024-07-18 18:22:01.351298+00:00 [error] <0.834.0> closing AMQP connection <0.834.0> (192.168.65.1:23193 -> 172.17.0.2:5672):
2024-07-18 18:22:01.351298+00:00 [error] <0.834.0> missed heartbeats from client, timeout: 60s
2024-07-18 18:22:01.352443+00:00 [error] <0.828.0> closing AMQP connection <0.828.0> (192.168.65.1:48626 -> 172.17.0.2:5672):
2024-07-18 18:22:01.352443+00:00 [error] <0.828.0> missed heartbeats from client, timeout: 60s
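
Independently of why the heartbeats are missed, the agent could tolerate a connection that the broker has already closed by reopening it before retrying. The following is a minimal sketch of that idea, not code from Zambeze: `ConnectionLost`, `ReconnectingSender`, and the `connect` factory (returning an object with a `send` method) are all hypothetical stand-ins for whatever the real client library provides.

```python
import time


class ConnectionLost(Exception):
    """Hypothetical stand-in for the client library's 'channel closed' error."""


class ReconnectingSender:
    """Keep one connection open and reopen it with backoff when a send fails."""

    def __init__(self, connect, max_attempts=5, base_delay=1.0, sleep=time.sleep):
        self._connect = connect          # hypothetical factory for a new connection
        self._max_attempts = max_attempts
        self._base_delay = base_delay
        self._sleep = sleep
        self._conn = None

    def send(self, message):
        for attempt in range(self._max_attempts):
            try:
                if self._conn is None:
                    self._conn = self._connect()
                self._conn.send(message)
                return
            except ConnectionLost:
                self._conn = None        # drop the dead connection
                if attempt + 1 == self._max_attempts:
                    raise                # out of attempts: surface the error
                # Exponential backoff: 1 s, 2 s, 4 s, ... with the defaults.
                self._sleep(self._base_delay * 2 ** attempt)
```

A wrapper like this only hides the symptom. The connections would still be dropped until the client services heartbeats within the 60 s timeout shown in the log.
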

@wigging
Collaborator

wigging commented Jul 19, 2024

Note that pull request #180 partially fixes this issue. The pull request fixes the reconnection problem but subsequent campaign runs still fail.
