Add connection recovery to rmq on publish #180

flourishingtune · 2024-07-07T23:17:32Z

Issue description:

If campaigns are launched with long intervals between them, the RabbitMQ server closes the connection. This happens because the RabbitMQ server expects regular heartbeats from its connections but does not receive them within the default timeout. This is the log on the RabbitMQ side:

2024-07-07 21:45:42.724472+00:00 [error] <0.2779.0> closing AMQP connection <0.2779.0> (172.17.0.1:44258 -> 172.17.0.2:5672):
2024-07-07 21:45:42.724472+00:00 [error] <0.2779.0> missed heartbeats from client, timeout: 60s

Within Zambeze, we maintain 4 different connections to the RabbitMQ server, all within message_handler. Two of them are consumers (recv_activity, recv_control); they keep the heartbeats going once they call start_consuming, so they don't lose the connection. This can be verified by launching tshark and monitoring port 5672.

With the producers (send_activity_dag, send_control), basic_publish doesn't run during the gaps between launches, so RabbitMQ cannot tell whether the connection is still active and closes it after the default timeout. Unlike start_consuming on the consumer side, basic_publish on the producer side is only responsible for publishing messages to channels and doesn't maintain heartbeats.

Resolution:

The issue can be resolved either by maintaining regular heartbeats from the producers, or by implementing connection recovery when basic_publish fails to publish a message because the connection was lost. It seems we could turn off heartbeats altogether, but that is not recommended for reasons like monitoring and resource management. Both implementations are documented in the pika library repository (maintaining heartbeats and recovery). I have implemented the second approach, recovery, since it doesn't require maintaining heartbeats and is a bit easier on the CPU.
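For readers who want a picture of the recovery approach, here is a minimal sketch; it is not the code in this PR. The class name `RecoveringPublisher`, the `connect` and `recoverable` parameters, and `max_attempts` are all made up for illustration. The connection factory and the exception types are passed in so the sketch runs without a broker; the docstring notes how they might be wired to pika.

```python
import logging

logger = logging.getLogger(__name__)


class RecoveringPublisher:
    """Publish messages, reopening the connection if it was lost while idle.

    Hypothetical wiring with pika (an assumption, not taken from this PR):
        connect = lambda: pika.BlockingConnection(
            pika.ConnectionParameters(host="localhost"))
        recoverable = (pika.exceptions.AMQPConnectionError,
                       pika.exceptions.ChannelWrongStateError)
    """

    def __init__(self, connect, recoverable, max_attempts=2):
        if max_attempts < 1:
            raise ValueError("max_attempts must be at least 1")
        self._connect = connect          # callable returning a new connection
        self._recoverable = recoverable  # tuple of exception types to retry on
        self._max_attempts = max_attempts
        self._channel = None

    def _open(self):
        # Open a fresh connection and channel, replacing any stale one.
        self._channel = self._connect().channel()

    def publish(self, exchange, routing_key, body):
        """Publish once, reconnecting on recoverable errors.

        Returns the number of attempts that were needed.
        """
        last_error = None
        for attempt in range(1, self._max_attempts + 1):
            try:
                if self._channel is None:
                    self._open()
                self._channel.basic_publish(
                    exchange=exchange, routing_key=routing_key, body=body
                )
                return attempt
            except self._recoverable as err:
                logger.warning("publish attempt %d failed: %s", attempt, err)
                last_error = err
                self._channel = None  # force a reconnect on the next attempt
        # Errors that are not recoverable propagate immediately; recoverable
        # ones are re-raised here once the attempts are exhausted.
        raise last_error
```

The connection is only reopened when a publish actually fails, so nothing runs while the agent is idle. The trade-off is that the broker still logs a missed-heartbeat timeout for the idle producer.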

Tests:

I tested the recovery approach with varying intervals (10-30 minutes) via a dummy bash script that intermittently launches jobs, and made sure the connection-lost log appeared in the RabbitMQ logs before launching new campaigns. I verified that the publisher was able to send messages by looking at the corresponding logs. I then followed the relay of the new messages from the message_handler to the executor and verified that they were relayed properly within Zambeze.

Before:

2024-07-07 16:37:27,007 - zambeze.cli_agent - ERROR - [mh] UNABLE TO SEND ACTIVITY MESSAGE! CAUGHT: StreamLostError: Transport indicated EOF

After:

2024-07-07 16:46:25,668 - zambeze.cli_agent - DEBUG - [send_activity] Successfully sent activity!

I verified that the RabbitMQ received the new recovery connections:

2024-07-07 21:46:25.636570+00:00 [info] <0.2846.0> accepting AMQP connection <0.2846.0> (172.17.0.1:48978 -> 172.17.0.2:5672)
2024-07-07 21:46:25.649824+00:00 [info] <0.2846.0> connection <0.2846.0> (172.17.0.1:48978 -> 172.17.0.2:5672): user 'guest' authenticated and granted access to vhost '/'
2024-07-07 21:46:32.697239+00:00 [info] <0.2859.0> accepting AMQP connection <0.2859.0> (172.17.0.1:54738 -> 172.17.0.2:5672)
2024-07-07 21:46:32.702999+00:00 [info] <0.2859.0> connection <0.2859.0> (172.17.0.1:54738 -> 172.17.0.2:5672): user 'guest' authenticated and granted access to vhost '/'

I noticed that subsequent runs of campaigns (i.e. more than 1) are failing, and that they had issues even without RabbitMQ connection loss. I think it would be good to address those in a different PR.

Please let me know if there are other tests you would like me to look into. Thanks!

wigging · 2024-07-09T14:12:32Z

Is there a way to write a test for this without having to actually run it for 30 minutes?
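One possible way to do this, sketched under assumptions: replace the channel with a mock that raises on the first publish, so the lost connection is simulated rather than waited for. `StreamLost`, `send_with_recovery`, the `state` dict, and the `"activity"` routing key are hypothetical stand-ins; a real test would target the actual message_handler code and pika's own exception class.

```python
import unittest
from unittest import mock


class StreamLost(Exception):
    """Stand-in for pika.exceptions.StreamLostError so no broker is needed."""


def send_with_recovery(state, connect, body):
    """Hypothetical function under test: publish, reconnect once if lost."""
    try:
        state["channel"].basic_publish(
            exchange="", routing_key="activity", body=body
        )
    except StreamLost:
        # The old channel is dead: open a new connection and retry once.
        state["channel"] = connect().channel()
        state["channel"].basic_publish(
            exchange="", routing_key="activity", body=body
        )


class RecoveryTest(unittest.TestCase):
    def test_reconnects_after_stream_lost(self):
        # A channel whose publish fails the way an idle, closed one would.
        dead = mock.Mock()
        dead.basic_publish.side_effect = StreamLost("Transport indicated EOF")
        # The channel handed out by the replacement connection.
        fresh = mock.Mock()
        connect = mock.Mock()
        connect.return_value.channel.return_value = fresh
        state = {"channel": dead}

        send_with_recovery(state, connect, b"msg")

        connect.assert_called_once_with()
        fresh.basic_publish.assert_called_once_with(
            exchange="", routing_key="activity", body=b"msg"
        )
        self.assertIs(state["channel"], fresh)
```

Such a test runs in milliseconds under `python -m unittest`. It only checks the client-side recovery logic, not the broker's heartbeat timeout itself.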

flourishingtune · 2024-07-11T00:06:04Z

@wigging Thanks for the review. Looking at it...

- Add consumer recovery for general connection errors.
- Propagate errors for non-recovering cases.
- TODO: Add tests.

wigging · 2024-07-18T18:37:04Z

I was able to reproduce the problem with the following steps (I also posted this on the related issue):

1. Start up RabbitMQ using the official Docker image/container.
2. Start zambeze. This creates 4 AMQP connections with RabbitMQ.
3. After 3 minutes, RabbitMQ reports errors caused by missed heartbeats from the client. This causes two connections to close.

The errors displayed in the terminal are:

2024-07-18 18:22:01.351298+00:00 [error] <0.834.0> closing AMQP connection <0.834.0> (192.168.65.1:23193 -> 172.17.0.2:5672):
2024-07-18 18:22:01.351298+00:00 [error] <0.834.0> missed heartbeats from client, timeout: 60s
2024-07-18 18:22:01.352443+00:00 [error] <0.828.0> closing AMQP connection <0.828.0> (192.168.65.1:48626 -> 172.17.0.2:5672):
2024-07-18 18:22:01.352443+00:00 [error] <0.828.0> missed heartbeats from client, timeout: 60s

So I'm seeing the connection errors occur much sooner at 3 minutes compared to the 10-30 minutes that is reported here and in the related issue. I will pull down this pull request and see if I still get these errors after 3 minutes.

wigging · 2024-07-18T19:05:48Z

@flourishingtune I pulled down your changes and ran the steps that I listed in my previous comment. I still get the RabbitMQ errors from missed heartbeats and the connections close after 3 minutes.

flourishingtune · 2024-07-19T01:26:24Z

Hi @wigging, thanks for checking out the PR. The second approach (reconnection), implemented in this PR, reconnects when the next campaign is launched so that the message can be relayed. Thus, we still see producer timeouts in the RabbitMQ logs, but this won't affect Zambeze's message relays since we reconnect when required, i.e. while publishing messages from campaigns.

There are a few things, like the number of reconnection attempts (in both consumer and producer), that I wish to include going forward.

If we maintain heartbeats from the producers as well (the first approach I described in the PR), then the RabbitMQ connection can be maintained without loss. Let me know if you think that's a better approach.
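For comparison, a rough sketch of what the heartbeat approach might look like, assuming a pika `BlockingConnection`, whose `process_data_events` services pending I/O including heartbeats. `HeartbeatKeeper`, the shared `lock`, and the `interval` value are illustrative names and choices, not part of this PR. The lock is there because `BlockingConnection` is not thread-safe, so the publisher would need to hold the same lock around `basic_publish`.

```python
import threading


class HeartbeatKeeper(threading.Thread):
    """Periodically service an idle producer connection from a daemon thread."""

    def __init__(self, connection, lock, interval=10.0):
        super().__init__(daemon=True)
        self._connection = connection  # e.g. a pika.BlockingConnection
        self._lock = lock              # shared with the publishing code
        self._interval = interval      # seconds; keep well under the timeout
        self._stop_event = threading.Event()

    def run(self):
        # Event.wait returns False on timeout, so the body runs once per
        # interval until stop() sets the event.
        while not self._stop_event.wait(self._interval):
            with self._lock:
                # Lets the connection process I/O, which includes heartbeats.
                self._connection.process_data_events(time_limit=0)

    def stop(self):
        self._stop_event.set()
        self.join()
```

This keeps the connection open at the cost of a thread that wakes up periodically, which is the extra work the recovery approach avoids.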

wigging · 2024-07-19T17:13:05Z

@flourishingtune I ran this pull request again and I can confirm that the reconnection is indeed implemented. But as you noted, subsequent runs of a campaign fail even if the reconnection is made. The only way I can run another campaign is to restart zambeze. So I guess this pull request is good to merge since it fixes the reconnection issue. The problem with running subsequent campaigns can be addressed in a separate pull request.

wigging · 2024-07-19T17:24:26Z

I removed "closes issue #168" from the pull request description. Making note of it here so we know this pull request is related to the issue but doesn't completely fix it.

This pull request partially fixes issue #168 regarding reconnection issues. But running subsequent campaigns still fails.

tskluzac · 2024-07-19T18:03:50Z

This sounds good. Thanks @flourishingtune for the implementation and @wigging for the review.

flourishingtune added 2 commits July 7, 2024 16:52

Add connection recovery to rmq on publish

e397b6b

Minor: Fix imports formatting

e589ad2

wigging requested review from wigging and tskluzac July 11, 2024 17:05

flourishingtune added 3 commits July 11, 2024 16:53

[WIP] Add consumer recovery and error propagation

8e2e16b

- Add consumer recovery for general connection errors.
- Propagate errors for non-recovering cases.
- TODO: Add tests.

Reformat strings

864997b

Clear logging on connection close

967c104

wigging approved these changes Jul 19, 2024

wigging mentioned this pull request Jul 19, 2024

Long-running agents lose RabbitMQ connection #168

Open

tskluzac merged commit b676134 into ORNL:main Jul 19, 2024
