One node in a cluster consumes more CPU resources than others #12668
-
Community Support Policy
RabbitMQ version used: 3.13.6
Erlang version used: 26.2.x
Operating system (distribution) used: Linux
How is RabbitMQ deployed?: RabbitMQ-as-a-Service from a public cloud provider
What problem are you trying to solve?
Hello folks! We're facing some problems with a hosted RabbitMQ cluster deployment on AWS (Amazon MQ). The cluster consists of three mq.m5.large nodes, each with 2 vCPUs and 8 GB RAM. The "problematic" node is the home node of our four main queues; the short-lived RPC queues are evenly distributed across all three nodes. Since it is a hosted solution, we have no direct access to the nodes themselves; we can only see some metrics via CloudWatch and through the RabbitMQ management plugin. There we found some interesting behaviour related to garbage collection. Maybe some of you have already seen this kind of behaviour and can help here.
Replies: 3 comments 1 reply
-
RabbitMQ 3.13.x is out of community support. You won't receive any more responses unless you move to 4.x or buy a support subscription.

Nodes will very rarely use exactly the same amount of resources. Some will inevitably have more connections than others, even if just a bit more, some queues and streams are busier than others, and so on.

Your node has 2 CPU cores according to the screenshot but 25K Erlang processes, so likely thousands of connections and/or channels and/or queue replicas. That's not a ratio that's going to work well. Try 4 cores or even 8, or use fewer queues (e.g. by using a stream with repeatable non-destructive reads for some workloads). There are metrics on how runtime scheduler/CPU time is spent.

Another part of the solution will lie in leader replica placement when queues and streams are declared, queue replica rebalancing, and reasonably even connection balancing using a proxy, a load balancer, and/or relevant client library settings if they are provided by the library used: lists of connection endpoints and their shuffling on every (re)connection attempt, as sketched below.

There's also a specific set of recommendations for environments with a lot of connections/channels/queues/streams that are relatively idle, or at least some of them are. And then there is Direct Reply-to, which avoids response queue churn, and 4.0 has a new exchange type specifically for request-response workloads.
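For the endpoint-shuffling point above, a minimal sketch follows (Python with the pika client; the hostnames and credentials are placeholders I made up). Many client libraries accept a list of endpoints directly, in which case their built-in shuffling/retry settings are preferable.

```python
# A minimal sketch, assuming a three-node cluster behind the hypothetical
# hostnames below. Shuffling the endpoint list on every (re)connection attempt
# spreads client connections across nodes instead of piling them all on the
# first node in the list.
import random
import pika

ENDPOINTS = ["node-1.example.com", "node-2.example.com", "node-3.example.com"]
CREDENTIALS = pika.PlainCredentials("guest", "guest")  # placeholder credentials

def connect():
    hosts = ENDPOINTS[:]
    random.shuffle(hosts)  # a different first candidate on every attempt
    params = [pika.ConnectionParameters(host=h, credentials=CREDENTIALS) for h in hosts]
    # BlockingConnection accepts a list of parameters and tries them in order
    # until one of them succeeds
    return pika.BlockingConnection(params)

connection = connect()
channel = connection.channel()
```

Shuffling before every attempt also means that a reconnect storm after a node restart does not land entirely on the same surviving node.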
-
Ignoring the fact that 3.13.x is out of community support and that RabbitMQ has a Discord server, a community Slack, a mailing list, and Discussions, @sagr4019 wrote me an email asking for more free help for an out-of-support series. While that is completely unacceptable, there was one useful bit of information: at least the number of connections is small. So one of the applications likely leaks channels, queues or streams, or connections in a different virtual host.

There are multiple metrics available to detect such conditions, both in the management UI and in Prometheus: the totals, and the churn (open/declare, close/delete) client operation rates. See the sketch below for one way to poll them.
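As an illustration of where to look, a rough sketch of polling those totals and churn counters over the management HTTP API (Python with requests; the URL, port and credentials are placeholders, and the object_totals/churn_rates field names are what recent 3.x releases return from /api/overview, so double-check against your version):

```python
# Poll /api/overview and print object totals plus churn counters.
# A steadily growing total, or high created/closed (declared/deleted) counters,
# points at a connection/channel/queue leak or a churn-heavy application.
import requests

API = "http://localhost:15672/api/overview"  # placeholder host/port
AUTH = ("guest", "guest")                    # placeholder credentials

overview = requests.get(API, auth=AUTH, timeout=10).json()

totals = overview.get("object_totals", {})
churn = overview.get("churn_rates", {})

print("connections:", totals.get("connections"),
      "channels:", totals.get("channels"),
      "queues:", totals.get("queues"))
print("channels created:", churn.get("channel_created"),
      "closed:", churn.get("channel_closed"))
print("queues declared:", churn.get("queue_declared"),
      "deleted:", churn.get("queue_deleted"))
```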
-
Final response; I am out of other ideas, even if this were a behavior of a supported release series.

I cannot know what specifically results in 25K Erlang processes on your node, but this is an obvious starting point. The most common reason is simply having a lot of connections or queue/stream replicas on that node. Other things that involve spawning new processes ("threads") are HTTP API requests (very few of them and they are short-lived, but new processes can be started per request, and the request rate can be, say, hundreds a second), shovels, federation links, and dead-lettering workers for queues. RabbitMQ has metrics that can prove or disprove all of the above hypotheses; the sketch after this reply shows one way to compare nodes. The seesaw (zig-zag) pattern on one of the charts suggests a recurring, periodic cause.

Logs make it trivial to spot new inbound connections [3]; debug logging [4] and the internal event log [5] make it easy to spot newly opened channels and newly declared queues and streams. HTTP API requests are logged separately [6].

The only other common and obvious reason for CPU burn is polling consumers [2], but they do not spawn (start) new processes. That said, the node on your screenshot does over 7K CPU context switches a second, which is not crazy high but certainly quite high. In an environment with a lot of connections and/or queues I'd say use [8], but this is not the case according to you. [7] is my final guess without any logs or relevant metrics, but it has been fixed in the latest 3.12.x and 3.13.x series, let alone 4.0.x.

Finally, adding 2 more CPU cores will help if this is not a leak but natural load. If it is a resource leak of sorts, then even with 16 or 32 cores you will hit the same problem eventually.
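To make that starting point concrete, here is a small sketch of comparing Erlang process counts, context switches and GC runs per node via the management HTTP API (Python with requests; URL and credentials are placeholders, and the field names are what recent 3.x releases expose under /api/nodes, so verify against your version):

```python
# Compare per-node runtime counters to spot which node the 25K Erlang
# processes live on and how busy its schedulers are.
import requests

API = "http://localhost:15672/api/nodes"  # placeholder host/port
AUTH = ("guest", "guest")                 # placeholder credentials

for node in requests.get(API, auth=AUTH, timeout=10).json():
    print(node.get("name"),
          "erlang processes:", node.get("proc_used"),
          "context switches:", node.get("context_switches"),
          "GC runs:", node.get("gc_num"))
```

The node with the outsized process count is the one whose connections, channels, queue replicas or other process-spawning activity needs explaining.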