Node unresponsive after running list_queues #11256

sunfinite · 2024-05-17T01:41:33Z

Hi,

When a node in a cluster is restarting, running rabbitmqctl list_queues on another node can sometimes cause that second node to hang. All subsequent rabbitmqctl and rabbitmq-diagnostics commands then fail because the target node is unreachable. The hung node also shows high, persistent CPU use by the beam.smp Erlang process. The node recovers only when we restart the container running RabbitMQ.

There are previous reports of the list_queues command itself hanging, but in this case it is the node the command is run against that hangs. The list_queues command exits immediately with the following error:

{:badrpc, {:EXIT, {:aborted, {:no_exists, [:rabbit_queue, {:amqqueue, {:resource, "/", :queue, :_}, :_, :_, :_, :_, :_, :_, :_, :_, :_, :_, :_, :_, :_, :_, :_, :_, :_, :_, :_}]}}}}

We have been able to reproduce the issue with both list_queues and list_unresponsive_queues on version 3.11.28.

Reproduction steps:

On node 1, run:

$ for i in {1..1000}; do date; rabbitmqctl list_queues; done

Trigger a restart on node 2. We stop and start Mnesia to get a quick, consistent repro. When the issue was first discovered, the nodes were being restarted the right way with rabbitmqctl stop_app etc.:

$ for i in {1..100}; do date; rabbitmqctl eval 'mnesia:stop().'; rabbitmqctl eval 'mnesia:start().'; done
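
A variant of the node 1 loop that stops at the first failure can make it easier to pin down when the node becomes unreachable. The following is only a minimal sketch: run_until_failure is a hypothetical helper (not part of RabbitMQ), and the badrpc / Error: match patterns are assumptions taken from the output shown in this report.

```shell
# Hypothetical helper: repeat a command until it exits non-zero or prints
# an error marker, then report the iteration number and the UTC time.
run_until_failure() {
  local max="$1"
  shift
  local i=1 out status
  while [ "$i" -le "$max" ]; do
    out="$("$@" 2>&1)"
    status=$?
    # Assumption: failures show up as a non-zero exit code, a "badrpc"
    # tuple, or a line starting with "Error:" (as in the output below).
    if [ "$status" -ne 0 ] || printf '%s\n' "$out" | grep -Eq 'badrpc|^Error:'; then
      printf 'failure on iteration %d at %s (exit code %d)\n' \
        "$i" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$status"
      printf '%s\n' "$out"
      return 1
    fi
    i=$((i + 1))
  done
  printf 'no failure in %d iterations\n' "$max"
}

# Usage on node 1 (assumes rabbitmqctl is on PATH):
#   run_until_failure 1000 rabbitmqctl list_queues
```

Because the node does not always hang after the first badrpc error, the reported iteration marks the first error, not necessarily the hang itself.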

On node 1, the list_queues loop produces the following output:

root@ip-10-0-22-82:/# for i in {1..1000}; do date; rabbitmqctl list_queues; done
Fri May 17 01:05:39 UTC 2024
Timeout: 60.0 seconds ...
Listing queues for vhost / ...
...
Fri May 17 01:06:12 UTC 2024
Timeout: 60.0 seconds ...
Listing queues for vhost / ...
Fri May 17 01:06:13 UTC 2024
Timeout: 60.0 seconds ...
Listing queues for vhost / ...
{:badrpc, {:EXIT, {:aborted, {:no_exists, [:rabbit_queue, {:amqqueue, {:resource, "/", :queue, :_}, :_, :_, :_, :_, :_, :_, :_, :_, :_, :_, :_, :_, :_, :_, :_, :_, :_, :_, :_}]}}}}
Fri May 17 01:06:14 UTC 2024
Error: unable to perform an operation on node 'rabbit@ip-10-0-22-82.ap-southeast-2.compute.internal'. Please see diagnostics information and suggestions below.
Most common reasons for this are:

 * Target node is unreachable (e.g. due to hostname resolution, TCP connection or firewall issues)
 * CLI tool fails to authenticate with the server (e.g. due to CLI tool's Erlang cookie not matching that of the server)
 * Target node is not running
...

Corresponding output from node 2:

root@ip-10-0-4-83:/# for i in {1..100}; do date; rabbitmqctl eval 'mnesia:stop().'; rabbitmqctl eval 'mnesia:start().'; done
Fri May 17 01:05:56 UTC 2024
stopped
ok
Fri May 17 01:05:57 UTC 2024
stopped
ok
Fri May 17 01:05:58 UTC 2024
stopped
ok
...

The for loop completes successfully on node 2 and node 2 remains responsive.

In some cases, node 1 does NOT become unresponsive and the for loop continues even after receiving the badrpc error:

Listing unresponsive queues for vhost / ...
{:badrpc, {:EXIT, {:aborted, {:no_exists, [:rabbit_queue, {:amqqueue, {:resource, "/", :queue, :_}, :_, :_, :_, :_, :_, :_, :_, :_, :_, :_, :_, :_, :_, :_, :_, :_, :_, :_, :_}]}}}}
Listing unresponsive queues for vhost / ...
Listing unresponsive queues for vhost / ...
Listing unresponsive queues for vhost / ...

Other information

There are no errors in rabbit.log on the node that is hung. Even enabling debug logs did not yield anything new.
Trying to connect to the rabbit node using erl -remsh also times out:

root@ip-10-0-22-82:/# COOKIE=`cat /var/lib/rabbitmq/.erlang.cookie`
root@ip-10-0-22-82:/# erl -name debug -setcookie $COOKIE -remsh rabbit
Erlang/OTP 25 [erts-13.2.2.5] [source] [64-bit] [smp:2:2] [ds:2:2:10] [async-threads:1] [jit:ns]

*** ERROR: Shell process terminated! (^G to start new job) ***

A rabbitmq-diagnostics command started against the node before it hangs also times out once the node becomes unresponsive:

$ rabbitmq-diagnostics consume_event_stream
...
19:55:22.057 [error] ** Node :"rabbit@ip-10-0-22-82.ap-southeast-2.compute.internal" not responding **
** Removing (timedout) connection **
...

top output showing 100% CPU core usage:

root@ip-10-0-22-82:/# top -bn 1 -p 18
top - 01:39:43 up 14 days,  3:32,  0 users,  load average: 1.10, 1.24, 1.19
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s): 48.3 us,  0.0 sy,  0.0 ni, 51.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :   7751.7 total,   1272.8 free,   1079.9 used,   5399.1 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   5916.7 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   18 rabbitmq  20   0 2330348 211064  72596 S 100.0   2.7  35:39.87 beam.smp
   
   
root@ip-10-0-22-82:/# ps -Af | grep 18
rabbitmq    18     1 11 May16 ?        00:35:45 /opt/erlang/lib/erlang/erts-13.2.2.5/bin/beam.smp -W w -MBas ageffcbf -MHas ageffcbf -MBlmbcs 512 -MHlmbcs 512 -MMmcs 30 -P 1048576 -t 5000000 -stbt db -zdbbl 128000 -sbwt none -sbwtdcpu none -sbwtdio none -B i -- -root /opt/erlang/lib/erlang -bindir /opt/erlang/lib/erlang/erts-13.2.2.5/bin -progname erl -- -home /var/lib/rabbitmq -- -pa  -noshell -noinput -s rabbit boot -boot start_sasl -syslog logger [] -syslog syslog_error_logger false -kernel prevent_overlapping_partitions false
rabbitmq    25    18  0 May16 ?        00:00:00 erl_child_setup 500000
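
To check that the CPU use is persistent and not a one-off spike, the top check above can be repeated over time. This is a minimal sketch that assumes the default procps column layout shown above (%CPU in column 9); cpu_from_top_line and sample_cpu are hypothetical helper names.

```shell
# Extract the %CPU field from one process line of `top -b` output read on
# stdin. Assumes the default procps column order, where %CPU is column 9.
cpu_from_top_line() {
  awk '{ print $9 }'
}

# Print the %CPU of a PID `count` times, one second apart.
sample_cpu() {
  local pid="$1" count="$2" n=0
  while [ "$n" -lt "$count" ]; do
    # Keep only the process row (first column equals the PID).
    top -bn 1 -p "$pid" | awk -v pid="$pid" '$1 == pid' | cpu_from_top_line
    sleep 1
    n=$((n + 1))
  done
}

# Usage (18 is the beam.smp PID from the output above):
#   sample_cpu 18 10
```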

Running cluster_status on the other nodes shows the status of the hung node as unknown:

Maintenance status

Node: rabbit@ip-10-0-12-253.ap-southeast-2.compute.internal, status: not under maintenance
Node: rabbit@ip-10-0-22-82.ap-southeast-2.compute.internal, status: unknown
Node: rabbit@ip-10-0-4-83.ap-southeast-2.compute.internal, status: not under maintenance

The issue does not happen when restart and list loops are run on the same node.

kjnilsson · 2024-05-17T09:00:49Z
Maintainer

3.11.28 is an old version that is out of community support. Please try your test against the only supported community version: 3.13.x

sunfinite May 21, 2024
Author

I have so far been unable to consistently reproduce the issue on 3.13.0. The node hung on one run out of more than 20 tries, with the same symptoms as above. There are a lot of no_exists error messages on runs without a hang:

Tue May 21 14:42:59 UTC 2024
Listing unresponsive queues for vhost / ...
Tue May 21 14:43:00 UTC 2024
Listing unresponsive queues for vhost / ...
{:badrpc, {:EXIT, {:aborted, {:no_exists, [:rabbit_queue, {:amqqueue, {:resource, "/", :queue, :_}, :_, :_, :_, :_, :_, :_, :_, :_, :_, :_, :_, :_, :_, :_, :_, :_, :_, :_, :_}]}}}}
Tue May 21 14:43:01 UTC 2024
Listing unresponsive queues for vhost / ...
