One node in a cluster consumes more CPU resources than others #12668
-
Community Support Policy
RabbitMQ version used: 3.13.6
Erlang version used: 26.2.x
Operating system (distribution) used: Linux
How is RabbitMQ deployed?: RabbitMQ-as-a-Service from a public cloud provider
What problem are you trying to solve?
Hello folks! We're facing some problems with a hosted RabbitMQ cluster deployment on AWS (Amazon MQ). The cluster consists of three mq.m5.large nodes, each with 2 vCPUs and 8 GB RAM. The "problematic" node is the home node of our four main queues; the short-lived RPC queues are evenly distributed across all three nodes. Since it is a hosted solution, we have no direct access to the nodes themselves; we can only see some metrics via CloudWatch and through the RabbitMQ management plugin. There we found some interesting behaviour related to garbage collection. Maybe some of you have already seen this kind of behaviour and can help here.
Replies: 3 comments 1 reply
-
RabbitMQ 3.13.x is out of community support. You won't receive any more responses unless you move to 4.x or buy a support subscription.

Nodes will very rarely use exactly the same amount of resources. Some will inevitably have more connections than others, even if just a bit more, some queues and streams are busier than others, and so on.

Your node has 2 CPU cores according to the screenshot but 25K Erlang processes, so likely thousands of connections and/or channels and/or queue replicas. That's not a ratio that's going to work well. Try 4 cores or even 8, or use fewer queues (e.g. by using a stream with repeatable non-destructive reads for some workloads). There are metrics on how runtime scheduler/CPU time is spent.

Another part of the solution will lie in leader replica placement when queues and streams are declared, queue replica rebalancing, and reasonably even connection balancing using a proxy, a load balancer, and/or relevant client library settings if they are provided by the library used: lists of connection endpoints and their shuffling on every (re)connection attempt, as sketched below.

There's also a specific set of recommendations for environments with a lot of connections/channels/queues/streams that are relatively idle, or at least some of them are. And then there is Direct Reply-to, which avoids response queue churn, and 4.0 has a new exchange type specifically for request-response workloads.
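For the endpoint-shuffling point above, a minimal sketch follows (Python with the pika client; the hostnames and credentials are placeholders I made up). Many client libraries accept a list of endpoints directly, in which case their built-in shuffling/retry settings are preferable.

```python
# A minimal sketch, assuming a three-node cluster behind the hypothetical
# hostnames below. Shuffling the endpoint list on every (re)connection attempt
# spreads client connections across nodes instead of piling them all on the
# first node in the list.
import random
import pika

ENDPOINTS = ["node-1.example.com", "node-2.example.com", "node-3.example.com"]
CREDENTIALS = pika.PlainCredentials("guest", "guest")  # placeholder credentials

def connect():
    hosts = ENDPOINTS[:]
    random.shuffle(hosts)  # a different first candidate on every attempt
    params = [pika.ConnectionParameters(host=h, credentials=CREDENTIALS) for h in hosts]
    # BlockingConnection accepts a list of parameters and tries them in order
    # until one of them succeeds
    return pika.BlockingConnection(params)

connection = connect()
channel = connection.channel()
```

Shuffling before every attempt also means that a reconnect storm after a node restart does not land entirely on the same surviving node.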
-
Ignoring the fact that 3.13.x is out of community support and that RabbitMQ has a Discord server, a community Slack, a mailing list, and Discussions, @sagr4019 wrote me an email asking for more free help for an out-of-support series. While that is completely unacceptable, there was one useful bit of information: at least the number of connections is small. So one of the applications likely leaks channels, queues or streams, or connections in a different virtual host.

There are multiple metrics available to detect such conditions, both in the management UI and in Prometheus: the totals, and the churn (open/declare, close/delete) client operation rates. See the sketch below for one way to poll them.
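As an illustration of where to look, a rough sketch of polling those totals and churn counters over the management HTTP API (Python with requests; the URL, port and credentials are placeholders, and the object_totals/churn_rates field names are what recent 3.x releases return from /api/overview, so double-check against your version):

```python
# Poll /api/overview and print object totals plus churn counters.
# A steadily growing total, or high created/closed (declared/deleted) counters,
# points at a connection/channel/queue leak or a churn-heavy application.
import requests

API = "http://localhost:15672/api/overview"  # placeholder host/port
AUTH = ("guest", "guest")                    # placeholder credentials

overview = requests.get(API, auth=AUTH, timeout=10).json()

totals = overview.get("object_totals", {})
churn = overview.get("churn_rates", {})

print("connections:", totals.get("connections"),
      "channels:", totals.get("channels"),
      "queues:", totals.get("queues"))
print("channels created:", churn.get("channel_created"),
      "closed:", churn.get("channel_closed"))
print("queues declared:", churn.get("queue_declared"),
      "deleted:", churn.get("queue_deleted"))
```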
-
Final response; I am out of other ideas, even if this were a behavior of a supported release series.

I cannot know what specifically results in 25K Erlang processes on your node, but this is an obvious starting point. The most common reason is simply having a lot of connections or queue/stream replicas on that node. Other things that involve spawning new processes ("threads") are HTTP API requests (very few of them and they are short-lived, but new processes can be started per request, and the request rate can be, say, hundreds a second), shovels, federation links, and dead-lettering workers for queues. RabbitMQ has metrics that can prove or disprove all of the above hypotheses; the sketch after this reply shows one way to compare nodes. The seesaw (zig-zag) pattern on one of the charts suggests a recurring, periodic cause.

Logs make it trivial to spot new inbound connections [3]; debug logging [4] and the internal event log [5] make it easy to spot newly opened channels and newly declared queues and streams. HTTP API requests are logged separately [6].

The only other common and obvious reason for CPU burn is polling consumers [2], but they do not spawn (start) new processes. That said, the node on your screenshot does over 7K CPU context switches a second, which is not crazy high but certainly quite high. In an environment with a lot of connections and/or queues I'd say use [8], but this is not the case according to you. [7] is my final guess without any logs or relevant metrics, but it has been fixed in the latest 3.12.x and 3.13.x series, let alone 4.0.x.

Finally, adding 2 more CPU cores will help if this is not a leak but natural load. If it is a resource leak of sorts, then even with 16 or 32 cores you will hit the same problem eventually.
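To make that starting point concrete, here is a small sketch of comparing Erlang process counts, context switches and GC runs per node via the management HTTP API (Python with requests; URL and credentials are placeholders, and the field names are what recent 3.x releases expose under /api/nodes, so verify against your version):

```python
# Compare per-node runtime counters to spot which node the 25K Erlang
# processes live on and how busy its schedulers are.
import requests

API = "http://localhost:15672/api/nodes"  # placeholder host/port
AUTH = ("guest", "guest")                 # placeholder credentials

for node in requests.get(API, auth=AUTH, timeout=10).json():
    print(node.get("name"),
          "erlang processes:", node.get("proc_used"),
          "context switches:", node.get("context_switches"),
          "GC runs:", node.get("gc_num"))
```

The node with the outsized process count is the one whose connections, channels, queue replicas or other process-spawning activity needs explaining.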