Possible race condition in classic queue deletion/declaration handling for multi-node clusters #11001
-
It is possible that the client reconnects and does the re-declare before the node is properly down, which is why it appears successful. Try a force kill of the node and see if this changes behaviour. There is, however, a relatively simple change you can make for this: use exclusive, server-named queues for your reply queues. That way there is no requirement on transient deletion (which is unlikely to ever be atomic anyway). Also see: https://www.rabbitmq.com/blog/2021/08/21/4.0-deprecation-announcements#removal-of-transient-non-exclusive-queues
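For illustration, a minimal pika sketch of that approach: let the broker generate the reply queue name and scope the queue to the connection with exclusive=True. Host, exchange and queue names here are hypothetical, not from the report.

```python
import pika

# Assumed connection details; adjust host/credentials for your cluster.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbit-node-1"))
channel = connection.channel()

# An empty queue name asks the broker to generate one; exclusive=True ties the
# queue to this connection, so it is deleted when the connection closes and no
# other client can accidentally declare or consume from it.
result = channel.queue_declare(queue="", exclusive=True)
reply_queue = result.method.queue  # server-generated name, e.g. "amq.gen-..."

# Advertise the server-generated name to peers via the reply_to property.
channel.basic_publish(
    exchange="",
    routing_key="rpc_requests",  # hypothetical request queue
    properties=pika.BasicProperties(reply_to=reply_queue),
    body=b"request payload",
)
```

Because the name is unique per connection, a reconnecting client simply declares a fresh reply queue instead of racing the cluster over the deletion of the old one.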
-
This is a classic problem with auto-delete queues with known (shared) names. RabbitMQ cannot do much about it without a very expensive cluster-wide locking of the queue. Clients that reconnect will try to declare a queue that another node may be in the process of removing. Using server-named queues or not using auto-delete queues naturally makes the problem go away.
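As a rough sketch of the second mitigation (the queue name is illustrative): declaring the shared, well-known queue as durable and non-auto-delete means it is not torn down when the last consumer disconnects, so a reconnecting client's re-declare is an idempotent no-op rather than a race against node-side deletion.

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbit-node-2"))
channel = connection.channel()

# A fixed, shared name is fine as long as the queue is not auto-deleted:
# it survives consumer disconnects, so nothing is being removed underneath
# the clients that reconnect and re-declare it.
channel.queue_declare(
    queue="service.replies",  # hypothetical shared queue name
    durable=True,
    auto_delete=False,
)
```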
-
Describe the bug
I'm experiencing issues with non-HA classic queues on RabbitMQ v3.12.4. I've also seen it in v3.11.10 and likely earlier, but it wasn't as obvious due to a client library bug. We're running a three-node cluster where most queues are HA, but for this specific case they aren't. I have yet to test a more recent RabbitMQ release, as I'm testing in a staging environment built for other things, but I will attempt to when I can and update here.
When clients initially connect, they declare a non-HA classic queue against the node they have connected to. If that node later goes down, the expectation is that they will reconnect to one of the remaining nodes and re-declare their queue in order to continue operations. This works correctly in most cases, but I'm seeing what I believe could be a server-side race condition during reconnection: the node we are connecting to believes the queue exists, but it doesn't. The client therefore connects and believes it has successfully declared the queue, yet when other clients attempt to send messages to that queue they can't find it. The owning client never realises its queue doesn't exist, as it only checks for this at connection time.
These issues have been seen across three deployments of OpenStack, all exhibiting the same behaviour. Having traced through the client libraries, observed the logs, and written a test case with a different client library, I believe this points at a server-side issue, unless of course the way in which the queues or clients are being used goes against RabbitMQ's expectations.
Reproduction steps
Run rabbitmqctl stop_app against the RMQ node which the queue has been created against.

Example script (adapted from pika library examples):
Note that I've commented out a sleep towards the end of this script, in the channel exception handling. With the sleep in place it appears much less likely that the issue will occur, presumably because the server nodes have time to synchronise their state with regard to which queues are declared and operational. Without the sleep I've so far had a 100% success rate in replicating the issue, but it's possible that, as these RMQ nodes are otherwise in use, the additional load on them makes the issue more likely to occur in our case.
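The original example script is not reproduced here. As a rough sketch of the connect/declare/reconnect loop being described, assuming pika's BlockingConnection (the queue name, node list and the commented-out sleep are illustrative, not the reporter's actual values):

```python
import time
import pika

QUEUE = "repro.classic.queue"  # hypothetical fixed queue name
NODES = ["rabbit-node-1", "rabbit-node-2", "rabbit-node-3"]


def on_message(channel, method, properties, body):
    print("received:", body)


node = 0
while True:
    try:
        params = pika.ConnectionParameters(host=NODES[node % len(NODES)])
        connection = pika.BlockingConnection(params)
        channel = connection.channel()

        # Non-passive declare: the client assumes the queue now exists on
        # whichever node it has connected to.
        channel.queue_declare(queue=QUEUE, auto_delete=True)
        channel.basic_consume(queue=QUEUE, on_message_callback=on_message,
                              auto_ack=True)
        channel.start_consuming()
    except pika.exceptions.AMQPError as exc:
        print("connection/channel problem, reconnecting:", exc)
        node += 1
        # time.sleep(5)  # with this sleep in place the race is much harder to hit
```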
Expected behavior
When a client has lost its connection to the server and its single-node classic queue has been lost with it, I would expect the queue to be re-created successfully on a different RabbitMQ cluster node once the connection has been re-established and a non-passive queue declaration has been accepted. Otherwise the client should receive some indication that the queue isn't functional, so it can raise an exception.
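As an illustration of the kind of check a client could do today to detect this situation (a sketch only): a passive declare does not create anything, and if the queue is missing the broker closes the channel with a 404, so the check is best done on a throwaway channel.

```python
import pika


def queue_really_exists(connection, queue_name):
    """Return True if the broker can find the queue, using a passive declare.

    If the queue is missing, the broker closes the channel with a 404, which
    pika raises as ChannelClosedByBroker. Using a dedicated channel keeps the
    client's main channel usable either way.
    """
    check_channel = connection.channel()
    try:
        check_channel.queue_declare(queue=queue_name, passive=True)
        check_channel.close()
        return True
    except pika.exceptions.ChannelClosedByBroker:
        return False
```

A client could run such a check shortly after its non-passive re-declare (or periodically) and treat a False result as grounds to reconnect or raise an error.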
Additional context
No response