Slow operations in degraded cluster #10100
-
**Describe the bug**

Hi everyone, I'm following up on this discussion. We have looked into it a little further and can now reliably reproduce the performance issues we run into when one node of a three-node cluster fails.

**Reproduction steps**
**Expected behavior**

Usually, declaring a classic queue takes less than 50 ms. We expect this to still be true after one of the nodes has unexpectedly crashed (or has been killed manually).

**Additional context**

We have created a Java integration test using the testcontainers framework. In the test, three containers sharing the same Erlang cookie are joined into a cluster. We then execute […]. I attached the log of our test. The interesting lines are:
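For readers who want to reproduce this, the cluster setup described above might look roughly like the following with testcontainers. This is an illustrative sketch, not our actual test: the image tag, cookie value, node names, and the use of `rabbitmqctl join_cluster` are assumptions, and error handling is omitted.

```java
import org.testcontainers.containers.GenericContainer;
import org.testcontainers.containers.Network;

public class ClusterSetup {
    // Starts three RabbitMQ containers sharing one Erlang cookie and one
    // Docker network, then joins the second and third node to the first
    // via rabbitmqctl to form a three-node cluster.
    static GenericContainer<?>[] startCluster() throws Exception {
        Network net = Network.newNetwork();
        GenericContainer<?>[] nodes = new GenericContainer<?>[3];
        for (int i = 0; i < 3; i++) {
            nodes[i] = new GenericContainer<>("rabbitmq:3-management")
                    .withNetwork(net)
                    .withNetworkAliases("rabbit" + (i + 1))
                    .withEnv("RABBITMQ_ERLANG_COOKIE", "shared-secret-cookie")
                    .withEnv("RABBITMQ_NODENAME", "rabbit@rabbit" + (i + 1));
            nodes[i].start();
        }
        for (int i = 1; i < 3; i++) {
            nodes[i].execInContainer("rabbitmqctl", "stop_app");
            nodes[i].execInContainer("rabbitmqctl", "join_cluster", "rabbit@rabbit1");
            nodes[i].execInContainer("rabbitmqctl", "start_app");
        }
        return nodes;
    }
}
```

Killing one of the three containers (e.g. `nodes[2].stop()`) then reproduces the degraded-cluster state the timings below were measured in.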
Replies: 3 comments 40 replies
-
@kepakiano there isn't much that can be added to #9445 and #9522.

It's great that you expect operations in a cluster with down nodes to work exactly like they do in a cluster with every member up, but that's not how things work in practice.

Peer unavailability takes time to detect; this is true for every part of RabbitMQ. The only way to avoid that is to avoid features or configurations where nodes have to contact their peers, such as by using a node-local queue replica placement strategy. This was recommended to you in #944, but you seemingly want someone to wave a magic wand.

#9874 or even #10065 do not fundamentally change any of that; they only chip away at the amount of time taken by any operations that may have to interact with a down node.
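A node-local placement strategy of the kind mentioned above can be expressed in `rabbitmq.conf`. A sketch, assuming a RabbitMQ version that still uses the `queue_master_locator` key (newer releases use `queue_leader_locator` with similar values instead):

```ini
# Place the leader replica of newly declared classic queues on the node
# the declaring client is connected to, so declaration does not need to
# consult peers (possibly down ones) for placement.
queue_master_locator = client-local
```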
-
@michaelklishin Thank you for your answer! I hope you did not mistake me saying "I expect" for a demand; it definitely isn't one. I would, however, appreciate some help clearing up a few things I apparently don't understand yet. For example, how #944 is related here.

It is my understanding that declaring a default classic queue (exclusive, auto-delete, non-durable) is by default something local to the node I'm connected to. Is that correct? If not (i.e. it's a cluster-wide operation), is there a configurable timeout which contributes to the delay of 8 seconds I'm seeing?

(Additional info: I wanted to exclude the client as a source of error, for example if it called […])

For now, we can ignore the delay seen when querying the cluster status. It's a curiosity, but wouldn't break our use case.
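To make the 8-second observation concrete: the measurement in our test boils down to a timing harness like the one below. This is a sketch, not the actual test code; the commented-out `queueDeclare` call is the standard Java `amqp-client` API for an exclusive, auto-delete, non-durable queue, and the 50 ms threshold is the expectation stated earlier. A no-op stands in for the declaration so the harness itself is runnable.

```java
public class DeclareTiming {
    /** Runs an action and returns the elapsed wall-clock time in milliseconds. */
    static long elapsedMillis(Runnable action) {
        long start = System.nanoTime();
        action.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // In the real integration test the action would be the declaration of
        // a default classic queue (durable=false, exclusive=true, autoDelete=true):
        //   channel.queueDeclare("", false, true, true, null);
        // Here a no-op stands in so the harness compiles without the client library.
        long ms = elapsedMillis(() -> {});
        System.out.println("declare took " + ms + " ms");
        if (ms >= 50) {
            throw new AssertionError("queue declaration exceeded 50 ms: " + ms + " ms");
        }
    }
}
```

With all three nodes up this assertion passes comfortably; in the degraded cluster the same action takes roughly 8 seconds, which is what this thread is about.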
-
OK, so the `cluster_status` slowness should be mostly addressed in #10101. The slowness of classic queue declarations should be addressed in #10102. Testing welcome, @kepakiano!