[Ineligible for community support] Dynamic shovel stops working #10545
-
Describe the bug

There are two shovels with the same source but different destinations. They are meant to shovel events across virtual hosts on the same instance so that both our old system and our new system have a chance to process these events. After a Google outage, both of these shovels stopped sending but continued to consume messages. These are the only two shovels that share a source, and all other shovels continued to operate as normal. Restarting shovel-1 fixed shovel-1. We then realized that shovel-2 was also misbehaving and restarted that. Now shovel-2 works but shovel-1 does not.

Reproduction steps

It is not currently clear how to reproduce the issue. At this point we can only guess, because this is the first time we have ever seen this, and our other environments, which experienced the same Google outage, did not have this problem. Roughly:
Expected behavior

Shovels should operate as expected across all process restarts. Additionally, restarting one shovel should have zero impact on other shovels.

Additional context

We run RabbitMQ in a Kubernetes cluster. There was a Google incident which caused the pods to be cycled. When the pods came back up, two of the shovels were displaying "running", but messages were not being shoveled: we have metrics on both sides which show that we published messages the shovels should have picked up, but those messages were never received on the other end. The queues for the shovels were empty, so I assume they were consuming messages but failing to send them to the destination. The shovels are configured to move messages between virtual hosts on the same instance (using localhost), not between instances. They also use the same source exchange and routing key. We are running RabbitMQ 3.11.8. I have reviewed the change log through 3.11.28, but there do not appear to be any changes relevant to this issue. We are asking our DevOps team to prioritize an upgrade, but we cannot be sure that this will resolve the issue.
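For context, the two shovels are dynamic shovels declared as runtime parameters, shaped roughly like the sketch below. This is illustrative only: the shovel, vhost, exchange and routing-key names are placeholders, not our real configuration.

```bash
# Illustrative sketch only: shovel, vhost, exchange and routing-key names
# are placeholders. Both shovels read from the same source exchange on
# localhost and write to different destination vhosts on the same node.

# shovel-1: copy events from the shared source vhost to the old system's vhost.
rabbitmqctl set_parameter -p source-vhost shovel shovel-1 '{
  "src-protocol": "amqp091",
  "src-uri": "amqp://localhost/source-vhost",
  "src-exchange": "events",
  "src-exchange-key": "events.#",
  "dest-protocol": "amqp091",
  "dest-uri": "amqp://localhost/old-system",
  "dest-exchange": "events"
}'

# shovel-2: same source exchange and routing key, different destination vhost.
rabbitmqctl set_parameter -p source-vhost shovel shovel-2 '{
  "src-protocol": "amqp091",
  "src-uri": "amqp://localhost/source-vhost",
  "src-exchange": "events",
  "src-exchange-key": "events.#",
  "dest-protocol": "amqp091",
  "dest-uri": "amqp://localhost/new-system",
  "dest-exchange": "events"
}'
```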
-
@luke-lacroix-healthy we cannot suggest anything without relevant logs and data from all nodes.
We do not guess in this community, in particular when someone is seeking free support.
-
There is only one scenario where failures of some shovels can affect others: if they fail at such a high rate that the entire Shovel supervisor (an Erlang concept) considers the rate too high to recover from and stops. This will be visible in the logs. IIRC it takes at least several hundred shovel failures per second. Otherwise one shovel is completely independent of the others, aside from the fact that they all run on one cluster node and therefore share resources, plus share all TLS keys/certificates/settings.
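To check whether that is what happened here, something along these lines can be used. The log path below is an assumption (the default for package installs); on Kubernetes the node typically logs to stdout, so piping `kubectl logs <pod>` into grep works the same way. The shovel name is a placeholder.

```bash
# List every dynamic shovel on this node and its reported state
# ("running", "starting", "terminated", ...).
rabbitmqctl shovel_status

# Restart a single shovel by name without touching the others.
# If the shovel parameter lives in a non-default vhost, a vhost flag
# may be needed as well.
rabbitmqctl restart_shovel shovel-1

# Look for supervisor restart-intensity messages in the node log.
# Log path is an assumption; adjust for your deployment.
grep -iE "restart_intensity|supervisor" /var/log/rabbitmq/rabbit@$(hostname).log
```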
-
Also, RabbitMQ 3.11.8 is only covered by extended commercial support and is 20 patch releases behind the latest 3.11.x version. If you have a support subscription, please collect relevant data from all nodes (feel free to edit out sensitive log values such as hostnames and usernames) and then file a support ticket. An even better option would be to first upgrade to at least the latest 3.11. If you do not have a subscription, you can either buy one or move to 3.12.x, which is covered by community support until the end of June 2024.