Broken binding between queue and routing key #4237
-
Short description

We have encountered an issue where a queue suddenly stops receiving messages. The effect is as if the queue-to-routing-key binding no longer exists, yet the binding is still visible in the RabbitMQ management console.

Long description

The problem occurs at least with RabbitMQ server version 3.9.10, in a three-node cluster on Kubernetes running on AWS. It occurs quite rarely, but the last time it happened I was able to debug it a little and confirmed that the problem is on the RabbitMQ server side. I have renamed the exchange, routing keys, and queues for this issue report. Note that the queue name and the routing key in this case happen to be identical (this is probably not important, but I have retained it here for the sake of accuracy). The exchange type is "topic". The queue type is "classic".
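For reference, the routing expected from a topic exchange in this setup can be sketched in Python. This is only an illustration of the AMQP 0-9-1 topic matching rules (`*` matches exactly one word, `#` matches zero or more words), not RabbitMQ's actual implementation. The function names and the second queue `foo.test.two` are hypothetical.

```python
def topic_matches(binding_key: str, routing_key: str) -> bool:
    """Return True if a topic binding key matches a routing key."""
    pattern = binding_key.split(".") if binding_key else []
    words = routing_key.split(".") if routing_key else []

    def match(p: int, w: int) -> bool:
        if p == len(pattern):
            # Pattern exhausted: a match only if all words were consumed too.
            return w == len(words)
        if pattern[p] == "#":
            # "#" matches zero or more words: try every possible split point.
            return any(match(p + 1, k) for k in range(w, len(words) + 1))
        if w == len(words):
            return False
        if pattern[p] == "*" or pattern[p] == words[w]:
            return match(p + 1, w + 1)
        return False

    return match(0, 0)


def route(bindings: dict, routing_key: str) -> list:
    """bindings maps queue name -> binding key; returns the matching queue names."""
    return sorted(q for q, key in bindings.items() if topic_matches(key, routing_key))


# Hypothetical bindings: each queue is bound with a key equal to its own name.
bindings = {"foo.test.one": "foo.test.one", "foo.test.two": "foo.test.two"}
print(route(bindings, "foo.test.one"))
```

With a healthy binding, a message published with routing key `foo.test.one` should reach the queue `foo.test.one`. The symptom described in this report is the server behaving as if that entry were missing.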
Debugging with "recon_trace" plugin
I then sent the messages. Indeed, on one of the three nodes, the trace was as follows:
confirming once again that the RabbitMQ server just couldn't resolve the "foo.test.one" <-> "foo.test.one" binding.
So in the second environment, the rabbit_exchange_type_topic:route function returned both queue bindings, as expected.
Questions
Replies: 5 comments 4 replies
-
I will convert this issue to a GitHub discussion. Currently GitHub will automatically close and lock the issue even though your question will be transferred and responded to elsewhere. This is to let you know that we do not intend to ignore this but this is how the current GitHub conversion mechanism makes it seem for the users :(
-
Bindings are deleted when their queue or exchange is. If an application rapidly deletes and immediately re-declares one of the binding ends, there can be concurrent schema database operations. Applications can also delete bindings (unbind queues) at any time.

The management UI uses a separate database for many things. It is updated based on certain events, many of which can be monitored.

Since we can't know what the exact scenario was, publishers should handle unroutable messages and rebind. The unroutable message rate is exposed as a metric available via Prometheus and the management UI alike.
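A minimal sketch of a publisher that handles unroutable messages and rebinds, written in Python. It assumes a pika-style blocking channel with publisher confirms already enabled via `confirm_delivery()`, where `basic_publish(..., mandatory=True)` raises `pika.exceptions.UnroutableError` for a returned message. The function name `publish_with_rebind` and its parameters are hypothetical. Unbinding before binding is also an assumption: it supposes that a stale binding may need to be deleted and recreated rather than just declared again.

```python
import logging

log = logging.getLogger(__name__)


def publish_with_rebind(channel, exchange, routing_key, body, queue, unroutable_error):
    """Publish with mandatory=True; if the message is returned, rebind once and retry.

    channel: pika BlockingChannel-like object (assumption) with confirms enabled.
    unroutable_error: exception type raised for returned messages, for example
                      pika.exceptions.UnroutableError when using pika.
    Returns True if a rebind was needed, False otherwise.
    """
    try:
        channel.basic_publish(exchange=exchange, routing_key=routing_key,
                              body=body, mandatory=True)
        return False
    except unroutable_error:
        log.warning("unroutable: exchange=%s routing_key=%s, rebinding %s",
                    exchange, routing_key, queue)

    # Assumption: delete and recreate the binding instead of only re-declaring it.
    channel.queue_unbind(queue=queue, exchange=exchange, routing_key=routing_key)
    channel.queue_bind(queue=queue, exchange=exchange, routing_key=routing_key)

    # Retry once; a second unroutable return propagates to the caller.
    channel.basic_publish(exchange=exchange, routing_key=routing_key,
                          body=body, mandatory=True)
    return True
```

With pika this could be called as `publish_with_rebind(ch, "my.exchange", "foo.test.one", b"payload", "foo.test.one", pika.exceptions.UnroutableError)`, where `my.exchange` is a placeholder name. Logging each rebind also gives a record of when the problem occurred, which can be compared against the unroutable message metric.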
-
Thanks for the hints. I will try to set up the alternate exchange and monitor the internal events. An automatic rebind in case of an unroutable message also sounds like a reasonable workaround. It is still interesting to note that our consumers always (re)bind everything on startup, just in case. As we have also restarted the consumer apps several times, this should mean that just calling
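A rough sketch of the alternate exchange setup in Python, assuming a pika-style channel. The `alternate-exchange` exchange argument is a standard RabbitMQ feature; the function name and the exchange and queue names passed to it are hypothetical.

```python
def declare_with_alternate_exchange(channel, exchange, alternate_exchange, unrouted_queue):
    """Declare a topic exchange whose unroutable messages go to a fanout exchange."""
    # Fanout alternate exchange: accepts every message regardless of routing key.
    channel.exchange_declare(exchange=alternate_exchange, exchange_type="fanout",
                             durable=True)
    # Queue that collects whatever the main exchange could not route.
    channel.queue_declare(queue=unrouted_queue, durable=True)
    channel.queue_bind(queue=unrouted_queue, exchange=alternate_exchange)
    # Main topic exchange, pointing at the alternate exchange.
    channel.exchange_declare(exchange=exchange, exchange_type="topic", durable=True,
                             arguments={"alternate-exchange": alternate_exchange})
```

Two caveats apply. Re-declaring an exchange that already exists with different arguments is rejected by the server, so for an existing exchange the alternate exchange is usually configured through a policy instead. Also, a message that reaches a queue via the alternate exchange counts as routed, so it is not returned to a publisher that set the mandatory flag.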
-
Seeing this exact issue as well, on 3.9.13. Some bindings seem perfectly valid but messages published to the exchange with the correct routing key just disappear. Deleting and rebuilding the binding fixes it, but I haven't found a way to identify broken bindings yet.
-
I've seen this occur quite a few times when performing rolling cycles/upgrades on clusters with 10k+ auto-delete and non-mirrored queues. For queues that only have a single consumer, I introduced a callback in the Java client [1] to support renaming the queue as part of queue recovery, to avoid the timing scenario that @michaelklishin explains above. For queues that have multiple consumers from multiple different app instances, and hence can't easily be renamed without coordination, I ended up making the queues durable and mirroring them to 1 other node. That way, when a node is cycled or crashes, the queue and its bindings don't get scheduled for deletion, and you can't run into timing issues with clients reconnecting and re-declaring the queue and bindings on a new node. I do have some concerns with this approach going forward, however, as it sounds like classic queue mirroring support will be removed in a future release.