Possible race condition in classic queue deletion/declaration handling for multi-node clusters #11001
-
It is possible that the client reconnects and does the re-declare before the node is properly down, which is why it appears successful. Try a force kill of the node and see if this changes behaviour. There is, however, a relatively simple change you can make for this: use exclusive, server-named queues for your reply queues. That way there is no requirement on transient deletion (which is unlikely to ever be atomic anyway). Also see: https://www.rabbitmq.com/blog/2021/08/21/4.0-deprecation-announcements#removal-of-transient-non-exclusive-queues
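For illustration, a minimal pika sketch of that approach: let the broker generate the reply queue name and scope the queue to the connection with exclusive=True. Host, exchange and queue names here are hypothetical, not from the report.

```python
import pika

# Assumed connection details; adjust host/credentials for your cluster.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbit-node-1"))
channel = connection.channel()

# An empty queue name asks the broker to generate one; exclusive=True ties the
# queue to this connection, so it is deleted when the connection closes and no
# other client can accidentally declare or consume from it.
result = channel.queue_declare(queue="", exclusive=True)
reply_queue = result.method.queue  # server-generated name, e.g. "amq.gen-..."

# Advertise the server-generated name to peers via the reply_to property.
channel.basic_publish(
    exchange="",
    routing_key="rpc_requests",  # hypothetical request queue
    properties=pika.BasicProperties(reply_to=reply_queue),
    body=b"request payload",
)
```

Because the name is unique per connection, a reconnecting client simply declares a fresh reply queue instead of racing the cluster over the deletion of the old one.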
-
This is a classic problem with auto-delete queues with known (shared) names. RabbitMQ cannot do much about it without a very expensive cluster-wide locking of the queue. Clients that reconnect will try to declare a queue that another node may be in the process of removing. Using server-named queues or not using auto-delete queues naturally makes the problem go away.
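As a rough sketch of the second mitigation (the queue name is illustrative): declaring the shared, well-known queue as durable and non-auto-delete means it is not torn down when the last consumer disconnects, so a reconnecting client's re-declare is an idempotent no-op rather than a race against node-side deletion.

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbit-node-2"))
channel = connection.channel()

# A fixed, shared name is fine as long as the queue is not auto-deleted:
# it survives consumer disconnects, so nothing is being removed underneath
# the clients that reconnect and re-declare it.
channel.queue_declare(
    queue="service.replies",  # hypothetical shared queue name
    durable=True,
    auto_delete=False,
)
```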
-
Describe the bug
I'm experiencing issues with non-HA classic queues on RabbitMQ v3.12.4. I've also seen it in v3.11.10 and likely earlier, but it wasn't as obvious due to a client library bug. We're running a three-node cluster where most queues are HA, but for this specific case they aren't. I have yet to test a more recent RabbitMQ release, as I'm testing in a staging environment built for other things, but I will attempt to when I can and update here.
When clients initially connect, they declare a non-HA classic queue against the node they have connected to. If that node later goes down, the expectation is that they will reconnect to one of the remaining nodes and re-declare their queue in order to continue operations. This works correctly in most cases, but I'm seeing what I believe could be a server-side race condition during reconnection: the node we are connecting to believes the queue exists, but it doesn't. The client therefore connects and believes it has successfully declared the queue, yet when other clients attempt to send messages to that queue they can't find it. The owning client never realises its queue doesn't exist, as it only checks for this at connection time.
These issues have been seen across three deployments of OpenStack, all exhibiting the same behaviour. Having traced through the client libraries, observed the logs, and written a test case with a different client library, I believe this points at a server-side issue, unless of course the way in which the queues or clients are being used goes against RabbitMQ's expectations.
Reproduction steps
Run rabbitmqctl stop_app against the RMQ node which the queue has been created against.

Example script (adapted from pika library examples):
Note that I've commented out a sleep towards the end of this script, in the channel exception handling. With the sleep in place it appears much less likely that the issue will occur, presumably because the server nodes have time to synchronise their state with regard to which queues are declared and operational. Without the sleep I've so far had a 100% success rate in replicating the issue, but it's possible that, as these RMQ nodes are otherwise in use, the additional load on them makes the issue more likely to occur in our case.
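The original example script is not reproduced here. As a rough sketch of the connect/declare/reconnect loop being described, assuming pika's BlockingConnection (the queue name, node list and the commented-out sleep are illustrative, not the reporter's actual values):

```python
import time
import pika

QUEUE = "repro.classic.queue"  # hypothetical fixed queue name
NODES = ["rabbit-node-1", "rabbit-node-2", "rabbit-node-3"]


def on_message(channel, method, properties, body):
    print("received:", body)


node = 0
while True:
    try:
        params = pika.ConnectionParameters(host=NODES[node % len(NODES)])
        connection = pika.BlockingConnection(params)
        channel = connection.channel()

        # Non-passive declare: the client assumes the queue now exists on
        # whichever node it has connected to.
        channel.queue_declare(queue=QUEUE, auto_delete=True)
        channel.basic_consume(queue=QUEUE, on_message_callback=on_message,
                              auto_ack=True)
        channel.start_consuming()
    except pika.exceptions.AMQPError as exc:
        print("connection/channel problem, reconnecting:", exc)
        node += 1
        # time.sleep(5)  # with this sleep in place the race is much harder to hit
```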
Expected behavior
When a client has lost its connection to the server and its single-node classic queue has been lost with it, I would expect the queue to be re-created successfully on a different RabbitMQ cluster node once the connection has been re-established and a non-passive queue declaration has been accepted. Otherwise the client should receive some indication that the queue isn't functional, so it can raise an exception.
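As an illustration of the kind of check a client could do today to detect this situation (a sketch only): a passive declare does not create anything, and if the queue is missing the broker closes the channel with a 404, so the check is best done on a throwaway channel.

```python
import pika


def queue_really_exists(connection, queue_name):
    """Return True if the broker can find the queue, using a passive declare.

    If the queue is missing, the broker closes the channel with a 404, which
    pika raises as ChannelClosedByBroker. Using a dedicated channel keeps the
    client's main channel usable either way.
    """
    check_channel = connection.channel()
    try:
        check_channel.queue_declare(queue=queue_name, passive=True)
        check_channel.close()
        return True
    except pika.exceptions.ChannelClosedByBroker:
        return False
```

A client could run such a check shortly after its non-passive re-declare (or periodically) and treat a False result as grounds to reconnect or raise an error.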
Additional context
No response