[Ineligible for community support] Dynamic shovel stops working #10545
-
Describe the bug

There are two shovels with the same source but different destinations. They are meant to shovel events across virtual hosts on the same instance so that both our old system and our new system have a chance to process these events. After a Google outage, both of these shovels stopped sending but continued to consume messages. These are the only two shovels that share a source, and all other shovels continued to operate as normal. Restarting shovel-1 fixed shovel-1. We then realized that shovel-2 was also misbehaving and restarted that. Now shovel-2 works but shovel-1 does not.

Reproduction steps

It is not currently clear how to reproduce the issue. At this point we can only guess, because this is the first time we have ever seen this, and our other environments, which experienced the same Google outage, did not have this problem. Roughly:
Expected behavior

Shovels should operate as expected across all process restarts. Additionally, restarting one shovel should have zero impact on other shovels.

Additional context

We run RabbitMQ in a Kubernetes cluster. There was a Google incident which caused the pods to be cycled. When the pods came back up, two of the shovels were displaying "running", but messages were not being shoveled: we have metrics on both sides which show that we published messages the shovels should have picked up, but those messages were never received on the other end. The queues for the shovels were empty, so I assume they were consuming messages but failing to send them to the destination. The shovels are configured to move messages between virtual hosts on the same instance (using localhost), not between instances. They also use the same source exchange and routing key. We are running RabbitMQ 3.11.8. I have reviewed the change log through 3.11.28, but there do not appear to be any changes relevant to this issue. We are asking our DevOps team to prioritize an upgrade, but we cannot be sure that this will resolve the issue.
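For context, the two shovels are dynamic shovels declared as runtime parameters, shaped roughly like the sketch below. This is illustrative only: the shovel, vhost, exchange and routing-key names are placeholders, not our real configuration.

```bash
# Illustrative sketch only: shovel, vhost, exchange and routing-key names
# are placeholders. Both shovels read from the same source exchange on
# localhost and write to different destination vhosts on the same node.

# shovel-1: copy events from the shared source vhost to the old system's vhost.
rabbitmqctl set_parameter -p source-vhost shovel shovel-1 '{
  "src-protocol": "amqp091",
  "src-uri": "amqp://localhost/source-vhost",
  "src-exchange": "events",
  "src-exchange-key": "events.#",
  "dest-protocol": "amqp091",
  "dest-uri": "amqp://localhost/old-system",
  "dest-exchange": "events"
}'

# shovel-2: same source exchange and routing key, different destination vhost.
rabbitmqctl set_parameter -p source-vhost shovel shovel-2 '{
  "src-protocol": "amqp091",
  "src-uri": "amqp://localhost/source-vhost",
  "src-exchange": "events",
  "src-exchange-key": "events.#",
  "dest-protocol": "amqp091",
  "dest-uri": "amqp://localhost/new-system",
  "dest-exchange": "events"
}'
```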
-
@luke-lacroix-healthy we cannot suggest anything without relevant logs and data from all nodes.
We do not guess in this community, in particular when someone is seeking free support.
-
There is only one scenario where failures of some shovels can affect others: if they fail at such a high rate that the entire Shovel supervisor (an Erlang concept) considers the rate too high to recover from and stops. This will be visible in the logs. IIRC it takes at least several hundred shovel failures per second. Otherwise one shovel is completely independent of the others, aside from the fact that they all run on one cluster node and therefore share resources, plus share all TLS keys/certificates/settings.
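To check whether that is what happened here, something along these lines can be used. The log path below is an assumption (the default for package installs); on Kubernetes the node typically logs to stdout, so piping `kubectl logs <pod>` into grep works the same way. The shovel name is a placeholder.

```bash
# List every dynamic shovel on this node and its reported state
# ("running", "starting", "terminated", ...).
rabbitmqctl shovel_status

# Restart a single shovel by name without touching the others.
# If the shovel parameter lives in a non-default vhost, a vhost flag
# may be needed as well.
rabbitmqctl restart_shovel shovel-1

# Look for supervisor restart-intensity messages in the node log.
# Log path is an assumption; adjust for your deployment.
grep -iE "restart_intensity|supervisor" /var/log/rabbitmq/rabbit@$(hostname).log
```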
-
Also, RabbitMQ 3.11.8 is only covered by extended commercial support and is 20 patch releases behind the latest 3.11.x version. If you have a support subscription, please collect relevant data from all nodes (feel free to edit out sensitive log values such as hostnames and usernames) and then file a support ticket. An even better option would be to first upgrade to at least the latest 3.11. If you do not have a subscription, you can either buy one or move to 3.12.x, which is covered by community support until the end of June 2024.