Add connection recovery to rmq on publish #180

Merged
merged 5 commits into from
Jul 19, 2024

Contributor

@flourishingtune flourishingtune commented Jul 7, 2024

Issue description:

If campaigns are launched with long intervals between them, the RabbitMQ server closes the connection. This happens because the RabbitMQ server expects regular heartbeats from its connections and closes any connection that sends none within the default timeout. This is the log on the RabbitMQ side:

2024-07-07 21:45:42.724472+00:00 [error] <0.2779.0> closing AMQP connection <0.2779.0> (172.17.0.1:44258 -> 172.17.0.2:5672):
2024-07-07 21:45:42.724472+00:00 [error] <0.2779.0> missed heartbeats from client, timeout: 60s

Within Zambeze, we maintain 4 connections to the RabbitMQ server, all within message_handler. Two of them are consumers (recv_activity, recv_control); they keep sending heartbeats once they call start_consuming, so they don't lose the connection. This can be verified by launching tshark and monitoring port 5672.

With the producers (send_activity_dag, send_control), basic_publish doesn't run during gaps between launches, so RabbitMQ can't tell whether the connection is still active and closes it after the default timeout. Unlike start_consuming on the consumer side, basic_publish on the producer side only publishes messages to channels and doesn't maintain heartbeats.

Resolution:

The issue can be resolved either by maintaining regular heartbeats from the producers, or by implementing connection recovery when basic_publish fails to publish a message because the connection was lost. It seems we could also turn off heartbeats altogether, but that is not recommended for reasons such as monitoring and resource management. I found that both implementations are documented in the pika library repository (maintaining heartbeats and recovery). I have implemented the second approach, recovery, since it doesn't require maintaining heartbeats and is a bit easier on the CPU.
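As a rough illustration of the recovery approach (not the code in this PR), here is a minimal sketch. `RecoveringPublisher`, `open_channel`, and `recoverable` are hypothetical names introduced for the example; the only thing assumed about the channel is that it exposes a `basic_publish(exchange, routing_key, body)` method, as pika channels do.

```python
from typing import Any, Callable, Optional, Tuple, Type


class RecoveringPublisher:
    """Publishes via a channel and rebuilds the channel when a publish fails."""

    def __init__(
        self,
        open_channel: Callable[[], Any],
        recoverable: Tuple[Type[BaseException], ...],
        max_attempts: int = 2,
    ) -> None:
        self._open_channel = open_channel  # hypothetical factory: returns a fresh channel
        self._recoverable = recoverable    # exception types treated as "connection lost"
        self._max_attempts = max_attempts
        self._channel: Optional[Any] = None

    def publish(self, exchange: str, routing_key: str, body: bytes) -> int:
        """Publish one message and return how many attempts were needed."""
        last_error: Optional[BaseException] = None
        for attempt in range(1, self._max_attempts + 1):
            try:
                # Connect lazily, so a dropped channel is replaced on the next attempt.
                if self._channel is None:
                    self._channel = self._open_channel()
                self._channel.basic_publish(
                    exchange=exchange, routing_key=routing_key, body=body
                )
                return attempt
            except self._recoverable as err:
                # Discard the stale channel; the next loop iteration reconnects.
                last_error = err
                self._channel = None
        assert last_error is not None
        raise last_error
```

With pika, `open_channel` would presumably create a new `pika.BlockingConnection` and return its `channel()`, and `recoverable` would include `pika.exceptions.StreamLostError`, the error shown in the "Before" log. Errors outside `recoverable` propagate to the caller unchanged.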

Tests:

I tested the recovery approach with varying intervals (10-30 minutes) using a dummy bash script that intermittently launches jobs, and made sure the connection-lost message appeared in the RabbitMQ logs before launching new campaigns. I verified that the publisher was able to send messages by looking at the corresponding logs. I then followed the relay of the new messages from the message_handler to the executor and verified that they were relayed properly within Zambeze.

Before:

2024-07-07 16:37:27,007 - zambeze.cli_agent - ERROR - [mh] UNABLE TO SEND ACTIVITY MESSAGE! CAUGHT: StreamLostError: Transport indicated EOF

After:

2024-07-07 16:46:25,668 - zambeze.cli_agent - DEBUG - [send_activity] Successfully sent activity!

I verified that the RabbitMQ received the new recovery connections:

2024-07-07 21:46:25.636570+00:00 [info] <0.2846.0> accepting AMQP connection <0.2846.0> (172.17.0.1:48978 -> 172.17.0.2:5672)
2024-07-07 21:46:25.649824+00:00 [info] <0.2846.0> connection <0.2846.0> (172.17.0.1:48978 -> 172.17.0.2:5672): user 'guest' authenticated and granted access to vhost '/'
2024-07-07 21:46:32.697239+00:00 [info] <0.2859.0> accepting AMQP connection <0.2859.0> (172.17.0.1:54738 -> 172.17.0.2:5672)
2024-07-07 21:46:32.702999+00:00 [info] <0.2859.0> connection <0.2859.0> (172.17.0.1:54738 -> 172.17.0.2:5672): user 'guest' authenticated and granted access to vhost '/'

I noticed that subsequent runs of campaigns (i.e. more than 1) are failing, and that they had issues even without RabbitMQ connection loss. I think it would be good to address those in a different PR.

Please let me know if there are other tests you would like me to look into. Thanks!

@wigging
Collaborator

wigging commented Jul 9, 2024

Is there a way to write a test for this without having to actually run it for 30 minutes?
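One possible way, sketched under assumptions: simulate the lost connection with `unittest.mock` so that the first publish raises immediately instead of waiting for a real heartbeat timeout. `send_with_recovery` and `get_channel` are hypothetical stand-ins for the publish path, not Zambeze functions, and `ConnectionError` stands in for the pika exception.

```python
from unittest import mock


def send_with_recovery(get_channel, body, recoverable=(ConnectionError,)):
    """Hypothetical stand-in for the publish path: retry once on a fresh channel."""
    try:
        get_channel().basic_publish(exchange="", routing_key="activity", body=body)
    except recoverable:
        # The first channel was stale; fetch a new one and publish again.
        get_channel().basic_publish(exchange="", routing_key="activity", body=body)


def test_publish_recovers_after_lost_connection():
    # The stale channel fails the way a connection dropped by the broker would.
    stale = mock.Mock()
    stale.basic_publish.side_effect = ConnectionError("Transport indicated EOF")
    fresh = mock.Mock()
    get_channel = mock.Mock(side_effect=[stale, fresh])

    send_with_recovery(get_channel, b"payload")

    # One reconnect happened and the message went out on the new channel.
    assert get_channel.call_count == 2
    fresh.basic_publish.assert_called_once_with(
        exchange="", routing_key="activity", body=b"payload"
    )
```

A test like this runs in milliseconds, but it only checks the retry logic; it does not exercise a real RabbitMQ timeout.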

@flourishingtune
Contributor Author

@wigging Thanks for the review. Looking at it...

@wigging wigging requested review from wigging and tskluzac July 11, 2024 17:05
- Add consumer recovery for general connection errors.
- Propagate errors for non-recovering cases.
- TODO: Add tests.
@wigging
Collaborator

wigging commented Jul 18, 2024

I was able to reproduce the problem with the following steps (I also posted this on the related issue):

  1. Start up RabbitMQ using the official Docker image/container
  2. Start zambeze. This creates 4 AMQP connections with RabbitMQ.
  3. After 3 minutes, RabbitMQ reports errors caused by missed heartbeats from the client. This causes two connections to close.

The errors displayed in the terminal are:

2024-07-18 18:22:01.351298+00:00 [error] <0.834.0> closing AMQP connection <0.834.0> (192.168.65.1:23193 -> 172.17.0.2:5672):
2024-07-18 18:22:01.351298+00:00 [error] <0.834.0> missed heartbeats from client, timeout: 60s
2024-07-18 18:22:01.352443+00:00 [error] <0.828.0> closing AMQP connection <0.828.0> (192.168.65.1:48626 -> 172.17.0.2:5672):
2024-07-18 18:22:01.352443+00:00 [error] <0.828.0> missed heartbeats from client, timeout: 60s

So I'm seeing the connection errors occur much sooner at 3 minutes compared to the 10-30 minutes that is reported here and in the related issue. I will pull down this pull request and see if I still get these errors after 3 minutes.

@wigging
Collaborator

wigging commented Jul 18, 2024

@flourishingtune I pulled down your changes and ran the steps that I listed in my previous comment. I still get the RabbitMQ errors from missed heartbeats and the connections close after 3 minutes.

@flourishingtune
Contributor Author

Hi @wigging, thanks for checking out the PR. The second approach (reconnection), which this PR implements, reconnects when the next campaign is launched so that the message can be relayed. So we still see producer timeouts in the RabbitMQ logs, but they won't affect Zambeze's message relays, since we reconnect when required, i.e. when publishing messages from campaigns.

There are a few things, such as the number of reconnection attempts (in both consumer and producer), that I would like to add going forward.

If we maintain heartbeats from the producer as well (the first approach I described in the PR), then the RabbitMQ connection can be maintained without loss. Let me know if you think that's a better approach.
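For comparison, a minimal sketch of what the heartbeat approach could look like, assuming a background thread that periodically services the producer connection. `HeartbeatKeeper` and `service` are hypothetical names for this example; with pika's `BlockingConnection`, `service` would presumably be something like `connection.process_data_events`, and because that connection is not thread-safe, the publishing code would need to hold the same lock.

```python
import threading


class HeartbeatKeeper:
    """Calls `service` every `interval_s` seconds on a background thread."""

    def __init__(self, service, interval_s, lock=None):
        self._service = service          # hypothetical callable that services the connection
        self._interval_s = interval_s
        self.lock = lock or threading.Lock()  # share this lock with the publishing code
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def _run(self):
        # Event.wait returns False on timeout and True once stop() has been called.
        while not self._stop.wait(self._interval_s):
            with self.lock:
                self._service()
```

The interval would have to stay well below the 60s timeout shown in the RabbitMQ logs. The trade-off is an extra thread per producer that wakes up periodically, which is the CPU cost the recovery approach avoids.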

@wigging
Collaborator

wigging commented Jul 19, 2024

@flourishingtune I ran this pull request again and I can confirm that the reconnection is indeed implemented. But as you noted, subsequent runs of a campaign fail even if the reconnection is made. The only way I can run another campaign is to restart zambeze. So I guess this pull request is good to merge since it fixes the reconnection issue. The problem with running subsequent campaigns can be addressed in a separate pull request.

@wigging
Collaborator

wigging commented Jul 19, 2024

I removed the "closes issue #168" text from the pull request description. Making a note of it here so we know this pull request is related to the issue but doesn't completely fix it.

This pull request partially fixes issue #168 regarding reconnection issues, but running subsequent campaigns still fails.

@tskluzac
Collaborator

This sounds good. Thanks @flourishingtune for the implementation and @wigging for the review.

@tskluzac tskluzac merged commit b676134 into ORNL:main Jul 19, 2024