Management interface (rabbitmq_management plugin) timing out when a node is offline #9522
-
I'm running into an issue where the management interface times out whenever one node of a three-node cluster is down. This doesn't occur when just the rabbitmq-server process is stopped on an otherwise running host; it only happens when the whole host is offline. Requests have a roughly 45-second timeout, so command-line tools such as rabbitmqadmin are technically usable if you're patient, but the web interface is practically unusable, and health checks and stats gathering (such as the number of messages in queues) are effectively dead for as long as the host is down. Things recover instantly as soon as the host is up (i.e., it pings), whether RabbitMQ is running or not. I'm running RabbitMQ 3.12.6 on Ubuntu 22.04. I haven't installed a test 3.11 cluster to check there, but I did notice on a recent customer migration that 3.10.0 didn't have this problem.
Replies: 6 comments 15 replies
-
Is this when you force-terminate a host or when you shut it down in an orderly way?
-
FYI, I just installed 3.12.7 on the test cluster and am still getting timeouts when one node is down. This single point of failure is the last thing on my list before I consider my cluster ready for production; I hope something can be done about it.
-
This discussion has identified something worth investigating: whether, and where, inter-node communication timeouts can be lowered. Those timeouts are usually 30-60s for a reason that should not be overlooked: low timeouts are guaranteed to result in false positives in certain other scenarios. This is particularly true for the management plugin that sometimes has to transfer very large responses (hello There is no such thing as an "optimal timeout default" but we may be able to avoid contacting nodes that are believed to be stopped or down.
-
Some takeaways from this discussion, @tubemeister. A quick investigation suggests that there may be a very straightforward way of reducing the effect of unreachable nodes on As for the HTTP API queries, it is a significantly more involved code path, but it seemingly comes down to the same function that returns a list of running nodes. If a node has been disconnected recently, it will be included in the running list for some time. Contacting nodes with a low timeout may sound like an obvious solution, but specifically for queue metric aggregation it is very risky and can produce false positives in busy clusters, which will piss off technical operations. In the end, my conclusion is that the best solution by far is to use the Prometheus plugin for monitoring. It only returns node-local data, and aggregation is done by external tools such as Grafana, not by RabbitMQ nodes themselves. This doc section needs an update that would mention the effects of timeouts and that Prometheus-based monitoring sidesteps this problem entirely.
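To illustrate the node-local approach, here is a minimal Python sketch of what an external collector could do: scrape each node's Prometheus endpoint independently, with its own short timeout, and do the aggregation outside the cluster. It assumes the plugin's default port (15692) and path (/metrics); the metric name `rabbitmq_queue_messages_ready` and the host names in the usage note are assumptions for illustration, so check them against what your nodes actually expose. In practice Prometheus and Grafana would do this job; the sketch only shows why an unreachable host affects its own scrape and nothing else.

```python
import urllib.request
from typing import Callable, Dict, Iterable, Optional

# Assumed defaults of the rabbitmq_prometheus plugin: port 15692, path /metrics.
METRICS_URL = "http://{host}:15692/metrics"


def fetch_metrics(host: str, timeout: float = 5.0) -> Optional[str]:
    """Scrape one node; return None if it cannot be reached within `timeout`."""
    try:
        url = METRICS_URL.format(host=host)
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8")
    except OSError:  # URLError, timeouts and refused connections are OSError subclasses
        return None


def sum_metric(text: str, name: str) -> float:
    """Sum every sample of metric `name` in a Prometheus text-format payload."""
    total = 0.0
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines, HELP and TYPE
            continue
        if "{" in line:
            # Labelled sample: name{labels} value [timestamp]
            metric = line[: line.index("{")]
            fields = line[line.rindex("}") + 1 :].split()
        else:
            # Unlabelled sample: name value [timestamp]
            metric, *fields = line.split()
        if metric == name and fields:
            total += float(fields[0])
    return total


def aggregate(
    hosts: Iterable[str],
    name: str,
    fetch: Callable[[str], Optional[str]] = fetch_metrics,
) -> Dict[str, Optional[float]]:
    """Per-node totals; an unreachable node maps to None and delays nobody else."""
    result: Dict[str, Optional[float]] = {}
    for host in hosts:
        text = fetch(host)
        result[host] = None if text is None else sum_metric(text, name)
    return result
```

Calling `aggregate(["rabbit1", "rabbit2", "rabbit3"], "rabbitmq_queue_messages_ready")` (hypothetical host names) returns one entry per node, with `None` for any node that did not answer. The cost of a dead host is bounded by the collector's own timeout, and no RabbitMQ node has to wait on another.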
-
@tubemeister this PR may (or may not) improve matters for your use case. I found a few places where cluster-wide queries were executed per queue unnecessarily, as well as a few other optimisations. We still need to do at least two cluster-wide queries, so it may not make a system with a dangling TCP connection as responsive as we'd like, but it may still help.
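The general shape of that kind of optimisation can be sketched in Python. This is not the code from the PR (RabbitMQ is written in Erlang); the function names and the queue representation below are hypothetical and only illustrate hoisting a cluster-wide query out of a per-queue loop.

```python
from typing import Callable, Dict, List


def queue_states_per_queue(
    queues: List[dict], running_nodes: Callable[[], List[str]]
) -> Dict[str, str]:
    """One cluster-wide query per queue: N queues cost N round trips."""
    return {
        q["name"]: "running" if q["node"] in running_nodes() else "down"
        for q in queues
    }


def queue_states_hoisted(
    queues: List[dict], running_nodes: Callable[[], List[str]]
) -> Dict[str, str]:
    """Query the cluster once, then reuse the answer for every queue."""
    nodes = set(running_nodes())  # single cluster-wide round trip
    return {
        q["name"]: "running" if q["node"] in nodes else "down"
        for q in queues
    }
```

Both functions return the same mapping, but if each call to `running_nodes` can block on an unreachable host, the first version pays that delay once per queue and the second pays it once per request.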
-
Fixed in 3.13