Add connection recovery to rmq on publish #180
Merged
Issue description:
If campaigns are launched with long intervals between them, the RabbitMQ server closes the connection. This happens because the RabbitMQ server expects regular heartbeats from its connections and closes any connection from which it receives no heartbeat within the default timeout. This is the log on the RabbitMQ side:
Within Zambeze, we maintain 4 different connections to the RabbitMQ server, all within `message_handler`. Two of them are consumers (`recv_activity`, `recv_control`) and are able to maintain the heartbeats once they call `start_consuming`, so they don't lose the connection. This can be verified by launching `tshark` and monitoring port `5672`.
With producers (`send_activity_dag`, `send_control`), when we have gaps between launches, `basic_publish` doesn't run, so RabbitMQ doesn't know whether the connection is still active and closes it after the default timeout. Unlike `start_consuming` on the consumer side, `basic_publish` on the producer is only responsible for publishing messages to channels and doesn't maintain heartbeats.
Resolution:
The issue can be resolved either by maintaining regular heartbeats from the producers, or by implementing connection recovery when `basic_publish` fails to publish a message because the connection was lost. It seems we could also turn off heartbeats altogether, but that is not recommended for reasons such as monitoring and resource management. I discovered that both implementations are documented in the `pika` library repository (maintaining heartbeats and recovery). I have implemented the second approach, recovery, since it doesn't require maintaining heartbeats and is a bit easier on the CPU.
Tests:
I tested the recovery approach with varying intervals (10-30 minutes) via a dummy bash script that intermittently launches jobs, and made sure the connection-lost log appeared in the RabbitMQ logs before launching new campaigns. I verified that the publisher was able to send messages by looking at the corresponding logs. I then followed the relay of the new messages from the `message_handler` to the `executor` and verified that they were relayed properly within Zambeze.
Before:
After:
I verified that RabbitMQ received the new recovery connections:
I noticed that subsequent runs of campaigns (i.e. more than 1) are failing, and that they had issues even without the RabbitMQ connection loss. I think it would be good to address those in a different PR.
Please let me know if there are other tests you would like me to look into. Thanks!
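For anyone skimming the description, the recovery-on-publish idea from the Resolution section can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code in this PR: `RecoveringPublisher`, its `connect` factory, and the `lost_errors` tuple are hypothetical names, and the connection is injected so the sketch runs without a broker.

```python
import logging

logger = logging.getLogger(__name__)


class RecoveringPublisher:
    """Hypothetical helper: publish, and reconnect if the connection was lost."""

    def __init__(self, connect, lost_errors, max_attempts=2):
        self._connect = connect            # assumed factory returning a new connection
        self._lost_errors = lost_errors    # exception types treated as "connection lost"
        self._max_attempts = max_attempts
        self._connection = None
        self._channel = None

    def _ensure_channel(self):
        # Open a connection and channel lazily, or reuse the live ones.
        if self._channel is None:
            self._connection = self._connect()
            self._channel = self._connection.channel()
        return self._channel

    def publish(self, exchange, routing_key, body):
        """Publish one message; return the attempt number that succeeded."""
        for attempt in range(1, self._max_attempts + 1):
            try:
                self._ensure_channel().basic_publish(
                    exchange=exchange, routing_key=routing_key, body=body
                )
                return attempt
            except self._lost_errors as exc:
                logger.warning("publish attempt %d failed: %r", attempt, exc)
                # Drop the stale handles so the next attempt reconnects.
                self._connection = None
                self._channel = None
                if attempt == self._max_attempts:
                    raise

# With pika, the wiring might look like this (assumed host and error set):
#   import pika, pika.exceptions
#   publisher = RecoveringPublisher(
#       connect=lambda: pika.BlockingConnection(
#           pika.ConnectionParameters("localhost")),
#       lost_errors=(pika.exceptions.AMQPConnectionError,
#                    pika.exceptions.ChannelWrongStateError),
#   )
```

The retry is bounded so that a broker that is genuinely down still surfaces as an error instead of looping forever. Which exception types count as a lost connection is a judgment call and would need to match what the broker and client actually raise.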