Broken binding between queue and routing key #4237

andres-3 · 2022-03-04T20:50:18Z

Short description

We have encountered an issue where a queue suddenly stops receiving messages. The effect is as if the binding between the queue and its routing key no longer existed, yet the binding is still visible in the RabbitMQ admin console.

Long description

The problem occurs at least with RabbitMQ server version 3.9.10, in a three-node cluster running on Kubernetes on AWS. It occurs quite rarely, but the last time it happened I was able to debug the issue a little and confirmed that the problem is on the RabbitMQ server side.

I have renamed the exchange, routing keys, and queues for the sake of this issue report. Note that the queue name and the routing key in this case happen to be identical (this is probably not important, but I have retained it here for the sake of accuracy). The exchange type is "topic". The queue type is "classic".

Initial configuration, as seen from the admin console: the queue "foo.test.one" and the routing key "foo.test.one", between which the binding seems to be broken.

queue "foo.test.one", bindings (2):
From         | Routing key  | Arguments
---------------------------------------
(Default exchange binding)
foo-exchange | foo.test.one |

For testing, I created another queue, "foo.test.two", and bound it to the original routing key "foo.test.one" as well.

queue "foo.test.two", bindings (2):
From         | Routing key  | Arguments
---------------------------------------
(Default exchange binding)
foo-exchange | foo.test.one |

Posted some messages to exchange "foo-exchange", with routing key "foo.test.one"
- monitored queue "foo.test.one" - no messages received
- monitored queue "foo.test.two" - all messages received
Posted some messages directly to queue "foo.test.one"
- all messages received

Debugging with "recon_trace" plugin

First of all, I have no prior experience with Erlang, so bear with me if some of my assumptions are incorrect. I identified rabbit_exchange_type_topic:route/2 as most probably the function called when the server resolves queue-to-routing-key bindings. By opening an "erl" shell on all three RabbitMQ nodes, using kubectl and /bin/sh, I managed to confirm that. So I executed

> recon_trace:calls({rabbit_exchange_type_topic, route, fun(_) -> return_trace() end}, 100).

and then sent the messages. Indeed, on one of the three nodes, the trace was as follows:

19:35:02.602787 <0.9503.0> rabbit_exchange_type_topic:route({exchange,{resource,<<"/">>,exchange,<<"foo-exchange">>},
          topic,true,false,false,[],undefined,undefined,undefined,
          {[],[]},
          #{user => <<"rabbitmq">>}}, {delivery,false,true,<0.9503.0>,
    {basic_message,
        {resource,<<"/">>,exchange,<<"foo-exchange">>},
        [<<"foo.test.one">>],
        %
        % removed the content of message
        %
    14,noflow})

19:35:02.607428 <0.9503.0> rabbit_exchange_type_topic:route/2 --> [{resource,<<"/">>,queue,<<"foo.test.two">>}]

confirming once again that the RabbitMQ server just couldn't resolve the "foo.test.one" <-> "foo.test.one" binding.

I retested all of this in another, identical environment where the original binding was still working, and ran another recon_trace. This time the output was:

19:39:02.164337 <0.9918.0> rabbit_exchange_type_topic:route({exchange,{resource,<<"/">>,exchange,<<"foo-exchange">>},
          topic,true,false,false,[],undefined,undefined,undefined,
          {[],[]},
          #{user => <<"rabbitmq">>}}, {delivery,false,true,<0.9918.0>,
    {basic_message,
        {resource,<<"/">>,exchange,<<"foo-exchange">>},
        [<<"foo.test.one">>],
        %
        % removed the content of message
        %
    14,noflow})
                         
19:39:02.166676 <0.9918.0> rabbit_exchange_type_topic:route/2 --> [{resource,<<"/">>,queue,<<"foo.test.one">>},
 {resource,<<"/">>,queue,<<"foo.test.two">>}]

So in the second environment, the rabbit_exchange_type_topic:route function returned both bound queues, as expected.
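For reference, the expected outcome in both environments can be modelled with a small sketch of AMQP topic matching, where `*` matches exactly one dot-separated word and `#` matches zero or more words. This is an illustrative Python model only, not RabbitMQ's actual implementation, and the names `topic_matches`, `expected_queues`, and `missing_queues` are made up for this example. It compares the queues that the declared bindings should select with the queues that `route/2` returned in a trace.

```python
def topic_matches(pattern: str, routing_key: str) -> bool:
    """Return True if an AMQP topic binding pattern matches a routing key."""
    p = pattern.split(".")
    k = routing_key.split(".")

    def match(i: int, j: int) -> bool:
        if i == len(p):
            # Pattern exhausted: it matches only if the key is exhausted too.
            return j == len(k)
        if p[i] == "#":
            # '#' may absorb zero or more words of the routing key.
            return any(match(i + 1, n) for n in range(j, len(k) + 1))
        if j == len(k):
            return False
        if p[i] == "*" or p[i] == k[j]:
            # '*' matches exactly one word; a literal must be equal.
            return match(i + 1, j + 1)
        return False

    return match(0, 0)


def expected_queues(bindings, routing_key):
    """bindings: iterable of (binding_pattern, queue_name) pairs."""
    return {queue for pattern, queue in bindings
            if topic_matches(pattern, routing_key)}


def missing_queues(bindings, routing_key, routed):
    """Queues that the bindings select but the broker did not route to."""
    return sorted(expected_queues(bindings, routing_key) - set(routed))


# Bindings as shown in the admin console for this report.
bindings = [
    ("foo.test.one", "foo.test.one"),
    ("foo.test.one", "foo.test.two"),
]
# Queues returned by route/2 in the first (broken) environment.
print(missing_queues(bindings, "foo.test.one", {"foo.test.two"}))
```

With the bindings from this report, the sketch reports "foo.test.one" as missing for the first trace and nothing as missing for the second.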

Questions

Any theory as to what could cause the binding to break?
Even if the binding is somehow lost, why does the admin console still report it as existing?
Do you have any recommendations on what we should pay attention to and what we should monitor the next time a similar incident occurs?

michaelklishin · 2022-03-05T13:23:12Z

michaelklishin
Mar 5, 2022
Maintainer

I will convert this issue to a GitHub discussion. Currently GitHub will automatically close and lock the issue even though your question will be transferred and responded to elsewhere. This is to let you know that we do not intend to ignore this but this is how the current GitHub conversion mechanism makes it seem for the users :(

0 replies

michaelklishin · 2022-03-05T13:26:59Z

michaelklishin
Mar 5, 2022
Maintainer

Bindings are deleted when their queue or exchange is. If an application rapidly deletes and immediately re-declares one of the binding ends, there can be concurrent schema database operations. Applications can also delete bindings (unbind queues) at any time.

The management UI uses a separate database for many things. It is updated based on certain events, many of which can be monitored.

Since we can't know what the exact scenario was, publishers should handle unroutable messages and rebind. The unroutable message rate is exposed as a metric available via Prometheus and the management UI alike.
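As a rough illustration of watching the unroutable message metric, the sketch below sums unroutable counters from a Prometheus text scrape and computes the increase between two scrapes. The metric names are an assumption based on what the rabbitmq_prometheus plugin typically exposes and should be checked against your own /metrics output; the function names and the sample text are hypothetical. The line parser is deliberately simple and does not handle label values that contain `}`.

```python
import re

# Assumed metric names; verify them against your server's /metrics output.
UNROUTABLE_METRICS = (
    "rabbitmq_channel_messages_unroutable_dropped_total",
    "rabbitmq_channel_messages_unroutable_returned_total",
)

# name, optional {labels}, then the sample value
_SAMPLE_LINE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)(\{[^}]*\})?\s+(\S+)")


def unroutable_totals(metrics_text: str) -> dict:
    """Sum the unroutable counters over all label sets in one scrape."""
    totals = {name: 0.0 for name in UNROUTABLE_METRICS}
    for line in metrics_text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP and TYPE comment lines
        m = _SAMPLE_LINE.match(line)
        if m and m.group(1) in totals:
            totals[m.group(1)] += float(m.group(3))
    return totals


def unroutable_increase(previous: dict, current: dict) -> dict:
    """Per-metric increase between two scrapes."""
    return {name: current[name] - previous.get(name, 0.0) for name in current}


# Hypothetical scrape output, for illustration only.
sample = """\
# TYPE rabbitmq_channel_messages_unroutable_dropped_total counter
rabbitmq_channel_messages_unroutable_dropped_total 3
rabbitmq_channel_messages_unroutable_returned_total 0
rabbitmq_queues 12
"""
print(unroutable_totals(sample))
```

A non-zero increase in the "dropped" counter between scrapes would suggest that messages are being published to an exchange without any matching binding, which is the symptom described in this thread.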

0 replies

andres-3 · 2022-03-05T20:11:32Z

andres-3
Mar 5, 2022
Author

Thanks for the hints. I will try to set up an alternate exchange and monitor the internal events. An automatic rebind in case of an unroutable message also sounds like a reasonable workaround.

It is still interesting to note that our consumers always (re)bind everything at startup, just in case. As we have also restarted the consumer apps several times, this should mean that just calling org.springframework.amqp.rabbit.core.RabbitAdmin.declareBinding was not enough to fix it. However, I could check whether unbinding first and then binding again makes a difference.
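One way to try the "unbind, then bind" idea without touching application code is through the management HTTP API. The sketch below only builds the two requests (it performs no network calls) from a binding object shaped like the entries returned by `GET /api/queues/{vhost}/{queue}/bindings`. The helper name `rebind_requests` is hypothetical, and the assumption that `properties_key` should be percent-encoded once more when placed in the URL path should be verified against your server before relying on it.

```python
import json
from urllib.parse import quote


def _seg(value: str) -> str:
    """Percent-encode one URL path segment (the vhost "/" becomes %2F)."""
    return quote(value, safe="")


def rebind_requests(binding: dict) -> list:
    """Return (method, path, body) tuples: delete the binding, then re-create it."""
    if binding.get("destination_type", "queue") != "queue":
        raise ValueError("this sketch only handles exchange-to-queue bindings")
    base = "/api/bindings/{}/e/{}/q/{}".format(
        _seg(binding["vhost"]),
        _seg(binding["source"]),
        _seg(binding["destination"]),
    )
    body = json.dumps({
        "routing_key": binding["routing_key"],
        "arguments": binding.get("arguments", {}),
    })
    return [
        ("DELETE", base + "/" + _seg(binding["properties_key"]), None),
        ("POST", base, body),
    ]


# Binding from this report, in the shape the management API uses.
binding = {
    "source": "foo-exchange",
    "vhost": "/",
    "destination": "foo.test.one",
    "destination_type": "queue",
    "routing_key": "foo.test.one",
    "arguments": {},
    "properties_key": "foo.test.one",
}
for method, path, body in rebind_requests(binding):
    print(method, path, body)
```

The resulting tuples can be sent with any HTTP client using management credentials. Whether a delete followed by a re-create actually repairs the routing state is exactly what would need to be tested.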

0 replies

FalconerTC · 2022-03-23T00:44:40Z

FalconerTC
Mar 23, 2022

Seeing this exact issue as well, on 3.9.13. Some bindings seem perfectly valid but messages published to the exchange with the correct routing key just disappear. Deleting and rebuilding the binding fixes it but I haven't found a way to identify broken bindings yet.

3 replies

michaelklishin Mar 23, 2022
Maintainer

Please see Unroutable Message Handling

klarkent Feb 6, 2024

I’m not sure this works for everyone, but I may be missing something. I have a publishing app which does not even know who consumes the messages and I have three queues consuming from the same exchange with identical bindings but one queue gets no traffic. How can the publisher even fix this?

michaelklishin Feb 6, 2024
Maintainer

@klarkent please start a new discussion with details (such as an executable way to reproduce). We do not guess or "recycle" existing discussions in this community.

vikinghawk · 2022-03-23T19:55:25Z

vikinghawk
Mar 23, 2022

I've seen this occur quite a few times when performing rolling cycles/upgrades on clusters with 10k+ auto-delete and non-mirrored queues.

For queues that only have a single consumer, I introduced a callback in the Java client [1] that supports renaming the queue as part of queue recovery, to avoid the timing scenario that @michaelklishin explains above.

For queues that have multiple consumers from multiple different app instances and hence can't be easily renamed without requiring coordination... I ended up making the queues durable and mirroring them to 1 other node. That way when a node is cycled/crashes the queue/bindings don't get scheduled for deletion and you can't run into any timing issues with clients reconnecting and re-declaring the queue & bindings on a new node. I do have some concerns with this approach going forward however as it sounds like classic queue mirroring support will be removed in a future release.

[1] rabbitmq/rabbitmq-java-client#693

1 reply

michaelklishin Mar 23, 2022
Maintainer

This is a known side effect of exclusive queues and, possibly in some scenarios, auto-delete queues: they can be deleted concurrently with a client that reconnects and re-declares the same topology. There is no solution that would not significantly affect availability characteristics. I don't think mirroring is very relevant here.

Using a static topology, possibly with queue TTL, avoids this problem in environments where you can observe it.
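As one possible reading of "static topology with queue TTL", the sketch below builds declaration properties for a durable, non-exclusive, non-auto-delete queue that the broker expires only after it has been unused for a while, using the `x-expires` queue argument (in milliseconds). The helper name and the returned dict layout are hypothetical; the keys mirror the fields of AMQP 0-9-1 `queue.declare` and would need to be mapped onto whatever client library is in use.

```python
def static_queue_declaration(name: str, idle_ttl_seconds: int) -> dict:
    """Declaration properties for a statically named queue with an idle TTL."""
    if idle_ttl_seconds <= 0:
        raise ValueError("idle_ttl_seconds must be positive")
    return {
        "queue": name,
        "durable": True,       # the definition survives a node restart
        "exclusive": False,    # not tied to the declaring connection
        "auto_delete": False,  # not deleted when the last consumer goes away
        # The broker deletes the queue after it has been unused this long (ms).
        "arguments": {"x-expires": idle_ttl_seconds * 1000},
    }


print(static_queue_declaration("foo.test.one", 1800))
```

Because such a queue is not deleted when a connection closes or the last consumer cancels, a reconnecting client re-declaring it does not race with a pending deletion in the way described above.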
