RabbitMQ queue issue after cluster restart #10654

mpmatti · 2024-03-04T07:34:15Z

We are seeing recurring problems with classic mirrored queues whenever RabbitMQ is restarted.

Our three-node RabbitMQ cluster runs on Kubernetes, deployed with the Bitnami Helm chart.

The HA policy is as follows:
ha-mode: all
ha-promote-on-failure: always
ha-promote-on-shutdown: when-synced
ha-sync-mode: manual

Each classic queue is mirrored to all three nodes.
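
For readers who want to reproduce the setup, here is a minimal sketch of how the policy keys listed above could be turned into a `rabbitmqctl set_policy` invocation. The policy name `ha-all` and the pattern `.*` are assumptions made for illustration (the post does not state them); the vhost is taken from the listings in this post.

```python
import json
import shlex

# Policy keys exactly as listed in the post.
HA_DEFINITION = {
    "ha-mode": "all",
    "ha-promote-on-failure": "always",
    "ha-promote-on-shutdown": "when-synced",
    "ha-sync-mode": "manual",
}


def build_set_policy_command(vhost, name, pattern, definition):
    """Return the argv list for `rabbitmqctl set_policy`.

    The policy definition is passed to rabbitmqctl as a JSON document.
    """
    return [
        "rabbitmqctl", "set_policy",
        "--vhost", vhost,
        "--apply-to", "queues",
        name, pattern, json.dumps(definition),
    ]


# "ha-all" and ".*" are hypothetical values, not taken from the post.
command = build_set_policy_command("/rabbit-tester", "ha-all", ".*", HA_DEFINITION)
print(shlex.join(command))
```

The argv list can be handed to `subprocess.run` on a cluster node; printing it first makes it easy to review the quoting of the JSON definition.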
After a rolling restart of the RabbitMQ cluster, one of the nodes appears to lose track of some (not all) of the classic queues.

node-0 sees two instances of the queue:

I have no name!@rabbitmq-bitnami-0:/$ rabbitmqctl list_queues --vhost /rabbit-tester name owner_pid mirror_pids messages
Timeout: 60.0 seconds ...
Listing queues for vhost /rabbit-tester ...
name    owner_pid       mirror_pids     messages
test-queue
test-queue              [<rabbit@rabbitmq-bitnami-1.rabbitmq-bitnami-headless.rabbitmq.svc.cluster.local.1709376722.5152.0>]    0

node-1 sees one instance of the queue (which should be the case):

I have no name!@rabbitmq-bitnami-1:/$ rabbitmqctl list_queues --vhost /rabbit-tester name owner_pid mirror_pids messages
Timeout: 60.0 seconds ...
Listing queues for vhost /rabbit-tester ...
name    owner_pid       mirror_pids     messages
test-queue              [<rabbit@rabbitmq-bitnami-1.rabbitmq-bitnami-headless.rabbitmq.svc.cluster.local.1709376722.5152.0>]    0

node-2 sees one instance as well:

I have no name!@rabbitmq-bitnami-2:/$ rabbitmqctl list_queues --vhost /rabbit-tester name owner_pid mirror_pids messages
Timeout: 60.0 seconds ...
Listing queues for vhost /rabbit-tester ...
name    owner_pid       mirror_pids     messages
test-queue              [<rabbit@rabbitmq-bitnami-1.rabbitmq-bitnami-headless.rabbitmq.svc.cluster.local.1709376722.5152.0>]    0
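
Since the duplicate row only shows up on one node, a small check run against each node's `list_queues` output could flag it. The following is a rough sketch, assuming the output is tab-separated and that the banner lines look like the ones pasted above; the sample strings are shortened stand-ins for the real output, with `<pid>` as a placeholder.

```python
from collections import Counter


def queue_names(listing):
    """Extract queue names from `rabbitmqctl list_queues` text output."""
    names = []
    for line in listing.splitlines():
        # Skip blank lines and the informational banner lines.
        if not line.strip() or line.startswith(("Timeout:", "Listing queues")):
            continue
        fields = line.rstrip().split("\t")
        # Skip the column header row.
        if fields[:2] == ["name", "owner_pid"]:
            continue
        names.append(fields[0])
    return names


def duplicated_queues(listing):
    """Return the sorted queue names that appear more than once."""
    counts = Counter(queue_names(listing))
    return sorted(name for name, count in counts.items() if count > 1)


HEADER = "name\towner_pid\tmirror_pids\tmessages\n"
BANNER = "Timeout: 60.0 seconds ...\nListing queues for vhost /rabbit-tester ...\n"

# Shortened stand-ins for the node-0 and node-1 output shown above.
NODE_0_SAMPLE = BANNER + HEADER + "test-queue\ntest-queue\t\t[<pid>]\t0\n"
NODE_1_SAMPLE = BANNER + HEADER + "test-queue\t\t[<pid>]\t0\n"

print(duplicated_queues(NODE_0_SAMPLE))
print(duplicated_queues(NODE_1_SAMPLE))
```

In practice the listing would come from running the command on every node, since the inconsistency is only visible from the affected one.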

The queue is bound to a topic exchange but does not receive any messages. The publisher uses publisher confirms. Neither the publisher nor the consumer logs any errors.

node-0 has lost the bindings in the topic exchange:

I have no name!@rabbitmq-bitnami-0:/$ rabbitmqctl list_bindings --vhost /rabbit-tester
Listing bindings for vhost /rabbit-tester...
I have no name!@rabbitmq-bitnami-0:/$

while node-1 and node-2 seem normal:

I have no name!@rabbitmq-bitnami-1:/$ rabbitmqctl list_bindings --vhost /rabbit-tester
Listing bindings for vhost /rabbit-tester...
source_name     source_kind     destination_name        destination_kind        routing_key     arguments
        exchange        test-queue      queue   test-queue      []
topic_exchange  exchange        test-queue      queue   test.classic.*  []
I have no name!@rabbitmq-bitnami-1:/$

I have no name!@rabbitmq-bitnami-2:/$ rabbitmqctl list_bindings --vhost /rabbit-tester
Listing bindings for vhost /rabbit-tester...
source_name     source_kind     destination_name        destination_kind        routing_key     arguments
        exchange        test-queue      queue   test-queue      []
topic_exchange  exchange        test-queue      queue   test.classic.*  []
I have no name!@rabbitmq-bitnami-2:/$
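
The binding mismatch between nodes can be checked the same way. Below is a hedged sketch that diffs the `list_bindings` output of a suspect node against a reference node. It assumes tab-separated output and that only the command's stdout is captured (no shell prompt lines); the sample strings mirror the listings above.

```python
def parse_bindings(listing):
    """Parse `rabbitmqctl list_bindings` text output into a set of row tuples."""
    bindings = set()
    for line in listing.splitlines():
        if not line.strip() or line.startswith("Listing bindings"):
            continue
        # Do not strip leading whitespace: the default exchange has an
        # empty source_name, so its row starts with a tab.
        fields = tuple(line.split("\t"))
        if fields[0] == "source_name":  # column header row
            continue
        bindings.add(fields)
    return bindings


def missing_bindings(reference, candidate):
    """Return bindings present in `reference` but absent from `candidate`."""
    return sorted(parse_bindings(reference) - parse_bindings(candidate))


BANNER = "Listing bindings for vhost /rabbit-tester...\n"
HEADER = (
    "source_name\tsource_kind\tdestination_name\t"
    "destination_kind\trouting_key\targuments\n"
)

# Stand-ins for the node-1 (healthy) and node-0 (empty) output shown above.
NODE_1_SAMPLE = (
    BANNER + HEADER
    + "\texchange\ttest-queue\tqueue\ttest-queue\t[]\n"
    + "topic_exchange\texchange\ttest-queue\tqueue\ttest.classic.*\t[]\n"
)
NODE_0_SAMPLE = BANNER

for binding in missing_bindings(NODE_1_SAMPLE, NODE_0_SAMPLE):
    print(binding)
```

A non-empty result for any node after a rolling restart would be the signal that is currently missing from the logs.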

Scaling the RabbitMQ cluster down and back up solves the problem, but only until the next rolling restart. The most serious part is that nothing in the RabbitMQ logs indicates that the problem has occurred.

Has anyone experienced something similar? Is there a fix or a workaround? We are planning to move to quorum queues, which would probably solve the issue (we have not seen this happen with quorum queues). However, the migration will take time because many applications still use classic queues.

Answered by michaelklishin

michaelklishin · 2024-03-05T00:54:02Z

Maintainer

Classic mirrored queues have been deprecated for several years, and their doc guide very explicitly recommends quorum queues and streams.

I'm afraid the suggestion here is to use quorum queues and/or streams, and to give 3.13.0 with Khepri a shot. Khepri uses the same recovery mechanism as quorum queues and streams.
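
As an illustration of what the move to quorum queues looks like on the client side, here is a minimal sketch. It assumes a pika-style channel whose `queue_declare` accepts `queue`, `durable`, and `arguments` keyword arguments; the `RecordingChannel` class is a hypothetical stand-in so the example runs without a broker, and the queue name `test-queue-quorum` is made up for illustration.

```python
# The queue type is chosen at declaration time via an optional argument.
QUORUM_ARGUMENTS = {"x-queue-type": "quorum"}


def declare_quorum_queue(channel, name):
    """Declare a durable quorum queue on a pika-style channel."""
    # Quorum queues must be durable, so durable=True is not optional here.
    return channel.queue_declare(
        queue=name,
        durable=True,
        arguments=dict(QUORUM_ARGUMENTS),
    )


class RecordingChannel:
    """Hypothetical stand-in for a real channel; it only records the call."""

    def __init__(self):
        self.calls = []

    def queue_declare(self, **kwargs):
        self.calls.append(kwargs)
        return kwargs


channel = RecordingChannel()
declare_quorum_queue(channel, "test-queue-quorum")  # hypothetical queue name
print(channel.calls[0])
```

The type of an existing queue cannot be changed by redeclaring it, so a migration usually means declaring the quorum queue under a new name (or deleting the classic queue first) and then moving publishers and consumers over.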
