pause_minority does not behave as expected in a specific test #12166
-
Additional information for @michaelklishin:
Here is an example of the log from the node that left the cluster and has not been operational for almost a day.
-
@CrazyMushu please stop filing the same issue over and over, or your ability to do so will be limited org-wide. These reports are moved to Discussions for a reason: we do not have enough information to reproduce your claims. We will try, but our team does not guess, nor do we use issues for discussions and forming hypotheses. That is what Discussions are for. In one of the discussions we have shared a talk from RabbitMQ Summit dedicated to this specific topic.
This is not a normal condition for a node and is something you must take care of first. RabbitMQ nodes log quite a bit about the peer state changes they observe, certain client operations, and so on. Even if there are no cluster operations during the test, a node that loses connections to its peers will eventually log multiple related messages.
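For instance, such messages can be located with a plain-text search of a node's log file. A minimal sketch, assuming a typical installation; the log path is a placeholder:

```bash
# Peer state change events; a node that lost contact with a peer
# logs lines such as "rabbit on node rabbit@<peer> down".
grep -E 'rabbit on node .* (down|up)' /var/log/rabbitmq/rabbit@node0.log

# Partition detection events, when partition handling kicks in,
# mention "partition" explicitly.
grep -i 'partition' /var/log/rabbitmq/rabbit@node0.log
```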
-
According to the logs, node 2 detects node 0's disconnection, and node 1 detects node 0's disconnection as well. Such messages are easy to find by searching for "rabbit on node ". Furthermore, in these logs I do not see any messages from the partition handler, so the partition handling does not kick in. Peer discovery's cleanup of unreachable peers, however, does kick in. This feature does not actually remove nodes by default, but it does log that some peers were still unreachable.
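The cleanup behavior mentioned above corresponds to the cluster formation settings in rabbitmq.conf. A minimal sketch; the interval value is illustrative rather than a guaranteed default:

```ini
# Only log a warning about unreachable peers instead of removing
# them from the cluster (the default behavior described above).
cluster_formation.node_cleanup.only_log_warning = true

# How often, in seconds, the cleanup check runs (illustrative value).
cluster_formation.node_cleanup.interval = 30
```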
-
I encountered a similar problem. The
cluster_partition_handling = pause_minority
setting is enabled, but after disconnecting one of the 3 nodes, that node remains reachable, and applications can still connect to it.
Reproduction steps:
1. Pause one of the virtual machines in VMware vSphere.
2. Wait 30 seconds to 1 minute, until the Kubernetes cluster reports that the node is unavailable.
3. Resume the machine.
As a result, the RabbitMQ cluster will split, with the two parts operating in parallel: one with 2 nodes and one with 1 node.
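One way to confirm that the strategy is actually in effect on each node is to inspect the node's runtime environment. A sketch using standard CLI tooling:

```bash
# Print the effective partition handling strategy; the expected
# line is {cluster_partition_handling,pause_minority}
rabbitmqctl environment | grep cluster_partition_handling
```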
The nature of the partition is as follows:
At the same time, none of the cluster state check commands indicate that the isolated node is in the minority. Examples:
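A sketch of the kind of checks meant here, using standard CLI commands run on the isolated node:

```bash
# Show cluster membership, running nodes, and any detected partitions
rabbitmqctl cluster_status

# Basic health checks; on a properly paused minority node these
# would be expected to fail
rabbitmq-diagnostics check_running
rabbitmq-diagnostics check_local_alarms
```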
I would expect that in such a scenario, the cluster node that is in the minority would at least shut down its listeners so that applications cannot connect to it, but that is not the case.
Example of checking port availability from inside the container:
Example of checking port availability from another Kubernetes namespace:
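A sketch of both checks; the service name, namespace, probe image, and AMQP port 5672 are assumptions:

```bash
# From inside the container: is the AMQP listener still accepting
# TCP connections? (uses bash's /dev/tcp, so no extra tools are needed)
bash -c 'echo > /dev/tcp/localhost/5672 && echo "port 5672 open"'

# Built-in alternative that probes all configured listeners:
rabbitmq-diagnostics check_port_connectivity

# From another Kubernetes namespace, via the service DNS name
# (rabbitmq.rabbitmq-ns.svc.cluster.local is a placeholder):
kubectl run amqp-probe --rm -i --restart=Never --image=bash -- \
  bash -c 'echo > /dev/tcp/rabbitmq.rabbitmq-ns.svc.cluster.local/5672 && echo "port open"'
```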
Could you suggest an alternative solution, other than manually restarting the node as mentioned in the documentation, or waiting for version 4.0?
Originally posted by @CrazyMushu in #8111 (comment)