Slow operations in degraded cluster #10100

kepakiano · 2023-12-12T11:17:52Z

kepakiano
Dec 12, 2023

Describe the bug

Hi everyone,

I'm following up on this discussion. We have looked into it a little further and can reliably reproduce the performance issues we run into in case one node of a three node cluster fails.

Reproduction steps

Create cluster consisting of three nodes (RabbitMQ version 3.12.10)
Open a connection to node 1
Kill node 2
Declare a classic queue using the open connection

Expected behavior

Usually, declaring a classic queue takes less than 50ms. We expect this to still be true after one of the nodes has unexpectedly crashed (or has been killed manually).

Additional context

We have created a Java integration test using the testcontainers framework. In the test, three containers sharing the same Erlang cookie are joined into a cluster. We then execute rabbitmqctl cluster_status which takes around five seconds. Afterwards, we declare a classic queue, which takes a couple dozen milliseconds as expected. So the cluster works and fulfills our requirements.
Now we introduce the disruption: We execute docker kill to kill one of the two containers we do not have an open AMQP connection to.
Then we again ask for the cluster status, which takes over 3 minutes and declare a classic queue which takes more than 8 seconds.
We assume that the cluster tries to reach the failed node during both these operations, runs into a timeout and only then reports success. Maybe there is a way to not ask the failed node?
We can manually achieve this by evicting the failed node from the cluster. After that, all operations are fast again.

I attached the log of our test. The interesting lines are:
line 16: this is how long the initial rabbitmqctl cluster_status took in the healthy cluster and below is the output
line 98: declaring the queue in the healthy cluster is fast
line 106: rabbitmqctl cluster_status after killing one node takes four minutes
line 181: declaring a queue after killing one node takes 8 seconds
line 185: evicting the failed node
line 192: rabbitmqctl cluster_status in the 2 node cluster is fast again
line 265: declaring a queue is fast again (23ms)

rabbitmq_log.txt

Answered by michaelklishin

Dec 12, 2023

@kepakiano there isn't much that can be added to #9445 and #9522.

It's great that you expect operations in a cluster with down nodes to work exactly like they do in a cluster with every member up. But that's not how things work in practice.

Peer unavailability takes time to detect, this is true for every part of RabbitMQ. The only way to avoid that is to avoid features or configurations where nodes have to contact their peers. Such as a node-local queue replica placement strategy. This was recommended to you in #944 but you seemingly want someone to wave a magic wand.

#9874 or even #10065 do not fundamentally change any of that, only chip away at the amount of time taken by any operations…

View full answer

michaelklishin · 2023-12-12T12:27:36Z

michaelklishin
Dec 12, 2023
Maintainer

@kepakiano there isn't much that can be added to #9445 and #9522.

It's great that you expect operations in a cluster with down nodes to work exactly like they do in a cluster with every member up. But that's not how things work in practice.

Peer unavailability takes time to detect, this is true for every part of RabbitMQ. The only way to avoid that is to avoid features or configurations where nodes have to contact their peers. Such as a node-local queue replica placement strategy. This was recommended to you in #944 but you seemingly want someone to wave a magic wand.

#9874 or even #10065 do not fundamentally change any of that, only chip away at the amount of time taken by any operations that may have to interact with a down node.

0 replies

kepakiano · 2023-12-12T13:11:51Z

kepakiano
Dec 12, 2023
Author

@michaelklishin Thank you for your answer! I hope you did not mistake me saying "I expect" as a demand, it definitely isn't.

I would, however, appreciate some help clearing up some things I apparently don't understand yet. For example, how #944 is related here. It is my understanding that declaring a default classic queue (exclusive, auto-delete, non-durable) is by default something local to the node I'm connected to. Is that correct? If not (i.e. it's a cluster-wide operation), is there a configurable timeout which contributes to the delay of 8 seconds I'm seeing? (Additional info: I wanted to exclude the client as a source of error, for example if it called GET /queues when declaring, so I used rabbitmqadmin to directly declare the queue on a node and the delay persists). In #9445 you suggested we gather information on when this issue occurs and also said that delays of >5s could be problematic, which is why we opened this issue/discussion.

For now, we can ignore the delay seen when querying the cluster status. It's a curiosity, but wouldn't break our use case.

7 replies

kjnilsson Dec 12, 2023
Maintainer

so we have found a bit of code that very well could be responsible for the 8s delay when declaring exclusive queues:

#10102

Also I found some improvements to the cluster_status command which may help a bit but 4 minutes sounds a lot longer than the 1 minute we'd expect it to take to detect the dangling TCP connection.

#10101

Do you use TLS between nodes? Can you tell us anything about your networking that may potentially make TCP connection setup and/or DNS resolution take a long time?

kjnilsson Dec 12, 2023
Maintainer

Ok I have now managed to a local reproduction of what I think the issue with cluster_status is and I am pretty confident I know what needs to change.

stefanlin Dec 12, 2023

@kjnilsson maybe you can give a hint in which part in the code path you expecting the issue. I try to familiarize myself with the topic a little.

michaelklishin Dec 12, 2023
Maintainer

We have found a few non-obvious calls to other nodes in places that do not really need them. A PR or two will materialize later this week.

kepakiano Dec 13, 2023
Author

@kjnilsson We do not use TLS between the nodes, neither in test nor in prod. DNS resolution should not be a problem as it's provided by K8s in prod and in the test it was on the same host. This means that the TCP connections should establish quickly in the test and fairly quickly in prod: The nodes run in different datacenters with a inter-datacenter latency of <1 ms

kjnilsson · 2023-12-13T09:33:31Z

kjnilsson
Dec 13, 2023
Maintainer

Ok so the cluster_status slowness should be mostly addressed in #10101

The slowness of classic queue declarations should be addressed in #10102

Testing welcome @kepakiano !

33 replies

kjnilsson Dec 18, 2023
Maintainer

@kepakiano we appreciate your testing and efforts. I have run your test app several times on my machine and it is always fast. Could you have run against a stale set of docker images when you saw the cluster_status slowness?

michaelklishin Dec 18, 2023
Maintainer

We offer no guarantees that #10065 will be implemented, or that it will ship in 3.13.0.

kepakiano Dec 19, 2023
Author

@kjnilsson I don't think it was an issue with a stale image, as we all removed the images we had and pulled them fresh last Wednesday. However, it could have been an issue with our test setups all being comprised of various virtualization methods, older Docker versions and not too powerful hardware. I just ran the test several times on a more powerful machine with native Linux and the issue does not seem to persist 🎉 So thank you for that :)

@michaelklishin no problem, we'll just keep our eyes open :)

kjnilsson Dec 19, 2023
Maintainer

Ok thanks for reporting back. #10065 wont ship until 4.0 at the earliest as it does involve a certain degree of risk and 3.13 is about to go soon.

kepakiano Dec 19, 2023
Author

Alright, we're looking forward to it either way.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Slow operations in degraded cluster #10100

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 3 comments 40 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Slow operations in degraded cluster #10100

kepakiano Dec 12, 2023

Describe the bug

Reproduction steps

Expected behavior

Additional context

Replies: 3 comments · 40 replies

michaelklishin Dec 12, 2023 Maintainer

kepakiano Dec 12, 2023 Author

kjnilsson Dec 12, 2023 Maintainer

kjnilsson Dec 12, 2023 Maintainer

stefanlin Dec 12, 2023

michaelklishin Dec 12, 2023 Maintainer

kepakiano Dec 13, 2023 Author

kjnilsson Dec 13, 2023 Maintainer

kjnilsson Dec 18, 2023 Maintainer

michaelklishin Dec 18, 2023 Maintainer

kepakiano Dec 19, 2023 Author

kjnilsson Dec 19, 2023 Maintainer

kepakiano Dec 19, 2023 Author

kepakiano
Dec 12, 2023

Replies: 3 comments 40 replies

michaelklishin
Dec 12, 2023
Maintainer

kepakiano
Dec 12, 2023
Author

kjnilsson Dec 12, 2023
Maintainer

kjnilsson Dec 12, 2023
Maintainer

michaelklishin Dec 12, 2023
Maintainer

kepakiano Dec 13, 2023
Author

kjnilsson
Dec 13, 2023
Maintainer

kjnilsson Dec 18, 2023
Maintainer

michaelklishin Dec 18, 2023
Maintainer

kepakiano Dec 19, 2023
Author

kjnilsson Dec 19, 2023
Maintainer

kepakiano Dec 19, 2023
Author