pause_minority does not behave as expected in a specific test #12166
-
Additional information for @michaelklishin:
Here is an example of the log from the node that left the cluster and has not been operational for almost a day.
-
@CrazyMushu please stop filing the same issue over and over, or your ability to do so will be limited org-wide. These reports are moved to Discussions for a reason: we do not have enough information to reproduce your claims. We will try, but our team does not guess, nor do we use issues for discussions and forming hypotheses. That is what Discussions are for. In one of the discussions we have shared a talk from RabbitMQ Summit dedicated to this specific topic.
This is not a normal condition for a node and is something you must take care of first. RabbitMQ nodes log quite a bit about the peer state changes they observe, certain client operations, and so on. Even if there are no cluster operations during the test, a node that loses connections to its peers will eventually log multiple related messages.
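For instance, such messages can be located with a plain-text search of a node's log file. A minimal sketch, assuming a typical installation; the log path is a placeholder:

```bash
# Peer state change events; a node that lost contact with a peer
# logs lines such as "rabbit on node rabbit@<peer> down".
grep -E 'rabbit on node .* (down|up)' /var/log/rabbitmq/rabbit@node0.log

# Partition detection events, when partition handling kicks in,
# mention "partition" explicitly.
grep -i 'partition' /var/log/rabbitmq/rabbit@node0.log
```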
-
According to the logs, node 2 detects node 0's disconnection, and node 1 detects node 0's disconnection as well. Such messages are easy to find by searching for "rabbit on node ". Furthermore, in these logs I do not see any messages from the partition handler, so the partition handling does not kick in. Peer discovery's cleanup of unreachable peers, however, does kick in. This feature does not actually remove nodes by default, but it does log that some peers were still unreachable.
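The cleanup behavior mentioned above corresponds to the cluster formation settings in rabbitmq.conf. A minimal sketch; the interval value is illustrative rather than a guaranteed default:

```ini
# Only log a warning about unreachable peers instead of removing
# them from the cluster (the default behavior described above).
cluster_formation.node_cleanup.only_log_warning = true

# How often, in seconds, the cleanup check runs (illustrative value).
cluster_formation.node_cleanup.interval = 30
```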
-
I encountered a similar problem. The
cluster_partition_handling = pause_minority
setting is enabled, but after disconnecting one of the 3 nodes, that node remains reachable, and applications can still connect to it.
Reproduction steps:
1. Pause one of the virtual machines in VMware vSphere.
2. Wait 30 seconds to 1 minute, until the Kubernetes cluster reports that the node is unavailable.
3. Resume the machine.
As a result, the RabbitMQ cluster will split, with the two parts operating in parallel: one with 2 nodes and one with 1 node.
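One way to confirm that the strategy is actually in effect on each node is to inspect the node's runtime environment. A sketch using standard CLI tooling:

```bash
# Print the effective partition handling strategy; the expected
# line is {cluster_partition_handling,pause_minority}
rabbitmqctl environment | grep cluster_partition_handling
```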
The nature of the partition is as follows:
At the same time, none of the cluster state check commands indicate that the isolated node is in the minority. Examples:
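A sketch of the kind of checks meant here, using standard CLI commands run on the isolated node:

```bash
# Show cluster membership, running nodes, and any detected partitions
rabbitmqctl cluster_status

# Basic health checks; on a properly paused minority node these
# would be expected to fail
rabbitmq-diagnostics check_running
rabbitmq-diagnostics check_local_alarms
```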
I would expect that in such a scenario, the cluster node that is in the minority would at least shut down its listeners so that applications cannot connect to it, but that is not the case.
Example of checking port availability from inside the container:
Example of checking port availability from another Kubernetes namespace:
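A sketch of both checks; the service name, namespace, probe image, and AMQP port 5672 are assumptions:

```bash
# From inside the container: is the AMQP listener still accepting
# TCP connections? (uses bash's /dev/tcp, so no extra tools are needed)
bash -c 'echo > /dev/tcp/localhost/5672 && echo "port 5672 open"'

# Built-in alternative that probes all configured listeners:
rabbitmq-diagnostics check_port_connectivity

# From another Kubernetes namespace, via the service DNS name
# (rabbitmq.rabbitmq-ns.svc.cluster.local is a placeholder):
kubectl run amqp-probe --rm -i --restart=Never --image=bash -- \
  bash -c 'echo > /dev/tcp/rabbitmq.rabbitmq-ns.svc.cluster.local/5672 && echo "port open"'
```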
Could you suggest an alternative solution, other than manually restarting the node as mentioned in the documentation, or waiting for version 4.0?
Originally posted by @CrazyMushu in #8111 (comment)