Management interface (rabbitmq_management plugin) timing out when a node is offline #9522

tubemeister · 2023-09-22T10:24:17Z

tubemeister
Sep 22, 2023

I'm running into an issue where the management interface times out whenever one node of a three-node cluster is down.

This doesn't occur when just the rabbitmq-server process is stopped on an otherwise running host; it only happens when the whole host is offline.

Requests have a roughly 45-second timeout, so command-line tools such as rabbitmqadmin are technically usable if you're patient, but the web interface is practically unusable, and health checks and stats gathering (such as the number of messages in queues) are effectively dead for as long as the host is down. Things recover instantly as soon as the host is up (i.e., it pings), whether RabbitMQ is running or not.

Running RabbitMQ 3.12.6 on Ubuntu 22.04.

I haven't installed a test 3.11 cluster to check there, but did notice on a recent customer migration that 3.10.0 didn't have this problem.

Answered by mkuratczyk

Feb 28, 2024

Fixed in 3.13

kjnilsson · 2023-09-22T10:32:18Z

kjnilsson
Sep 22, 2023
Maintainer

Is this when you force terminate a host or shut it down in an ordered way?

9 replies

tubemeister Sep 22, 2023
Author

Like this?

overview: [image not preserved]

queues: [image not preserved]

kjnilsson Sep 22, 2023
Maintainer

That makes it pretty clear which HTTP requests take the longest. Why they take so long is another matter. :)

michaelklishin Sep 22, 2023
Maintainer

GET /queues will contact all nodes, and wait for a timeout if some are not reachable. Maybe nodes known to be unreachable could be filtered out earlier; it comes down to whether we want to trust the current node failure detector.

tubemeister Sep 22, 2023
Author

It's probably a quorum queues thing: I just remembered that the customer I mentioned, on 3.10 where this didn't happen, was only running classic queues.

tubemeister Oct 2, 2023
Author

Looks like it will try to contact all nodes, for every queue in a row, as the timeouts seem to add up. Flagging a node as down after the first timeout during the same GET /queues request would at least reduce the overall timeout to about 10s instead of over a minute and more as queues get added...

Though I'd kind of expect the cluster node itself (the one the management plugin is running on) to know if a node is down? Surely that shouldn't be a new surprise during every subsequent queue status request, for as long as the node is down?
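
To put rough numbers on that, here is a minimal sketch of the arithmetic. It is not RabbitMQ code: `total_wait`, the node names, the 10-second per-call timeout and the queue count are all assumptions made up for illustration. It only compares "wait for a timeout on every queue" with "remember that a node timed out for the rest of the same request".

```python
from __future__ import annotations


def total_wait(queues: list[str], nodes: list[str], down: set[str],
               timeout_s: float, remember_down: bool) -> float:
    """Return the simulated seconds spent waiting on unreachable nodes.

    Hypothetical model: one call per (queue, node) pair, and a call to a
    node that is down blocks for the full timeout.
    """
    known_down: set[str] = set()
    waited = 0.0
    for _queue in queues:
        for node in nodes:
            if remember_down and node in known_down:
                continue  # already timed out once during this request, skip it
            if node in down:
                waited += timeout_s  # the call blocks until the timeout fires
                known_down.add(node)
    return waited


# Assumed example values, not measurements from a real cluster.
queues = [f"qq-{i}" for i in range(8)]
nodes = ["rabbit@node1", "rabbit@node2", "rabbit@node3"]
down = {"rabbit@node3"}

per_queue = total_wait(queues, nodes, down, timeout_s=10.0, remember_down=False)
flagged = total_wait(queues, nodes, down, timeout_s=10.0, remember_down=True)
print(per_queue, flagged)
```

Under these assumptions the first figure grows with the number of queues, while the second stays at one timeout per unreachable node.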

tubemeister · 2023-10-19T14:12:30Z

tubemeister
Oct 19, 2023
Author

FYI just installed 3.12.7 on the test cluster, still getting timeouts when one node is down.

This single point of failure is the last thing on my list before considering my cluster ready for production; I hope something can be done about this.

6 replies

tubemeister Oct 20, 2023
Author

Ah, the user hostile attitude again?

We're not all developers you know, some of us are merely users of your software and hoped you'd at least be interested in any bugs found, especially bugs where just one node being down causes features on the whole cluster to fail...

I'll look into this Prometheus thing, and see what else I can do to work around this bug.

tubemeister Oct 20, 2023
Author

FYI things like 'rabbitmq-queues check_if_node_is_quorum_critical' also take half a minute to answer, which seems like the kind of tool that is especially meant to be used in scenarios where you might be down a node.

Oh well.

michaelklishin Oct 20, 2023
Maintainer

@tubemeister I don't see how pointing out that this is open source software maintained by a small team is a "hostile attitude".

Very likely you are in the 99% of users who never pay for OSS RabbitMQ, never contribute, and continue getting support and updates year after year, and always expect improvements (obviously for free, and shipped tomorrow). Should I call your attitude "maintainer-hostile"?

michaelklishin Oct 20, 2023
Maintainer

rabbitmq-queues check_if_node_is_quorum_critical was designed for rolling upgrades before a node goes for planned shutdown, as an extra safety measure.

Assuming that it should return immediately is very much assuming that in distributed systems, it is easy to reliably know if a peer is reachable.

michaelklishin Oct 20, 2023
Maintainer

Another thing about rabbitmq-queues check_if_node_is_quorum_critical that may or may not be obvious is that it evaluates the state of every quorum queue and stream in the system, so even if it uses a really low timeout for a single operation, that can add up.

michaelklishin · 2023-10-20T14:50:40Z

michaelklishin
Oct 20, 2023
Maintainer

This discussion has identified something worth investigating: whether, and where, inter-node communication timeouts can be lowered.

Those timeouts are usually 30-60s for a reason that should not be overlooked: low timeouts are guaranteed to result in false positives in certain other scenarios. This is particularly true for the management plugin that sometimes has to transfer very large responses (hello GET /api/queues without pagination or filtering).

There is no such thing as an "optimal timeout default" but we may be able to avoid contacting nodes that are believed to be stopped or down.

michaelklishin · 2023-10-20T15:29:39Z

michaelklishin
Oct 20, 2023
Maintainer

Some takeaways from this discussion, @tubemeister.

A quick investigation suggests that there may be a very straightforward way of reducing the effect of unreachable nodes on check_if_node_is_quorum_critical: #9755.

As for the HTTP API queries, it is a significantly more involved code path but it seemingly comes down to the same function that returns a list of running nodes.

If a node has been disconnected recently, it will be included in the running list for some time. Contacting nodes with a low timeout may sound like an obvious solution but specifically for queue metric aggregation, it is very risky and can produce false positives in busy clusters, which will piss off technical operations.

In the end, my conclusion is that the best solution by far is to use the Prometheus plugin for monitoring. It only returns node-local data, and aggregation is done by external tools such as Grafana, not RabbitMQ nodes themselves.

This doc section needs an update that would mention the effects of timeouts and that Prometheus-based monitoring sidesteps this problem entirely.
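
A minimal sketch of what that external aggregation could look like, in case it helps. It assumes each node exposes the rabbitmq_prometheus endpoint on its default port 15692 at `/metrics`, that samples carry no trailing timestamp, and that `rabbitmq_queue_messages_ready` is the metric you care about (check the name against your own nodes' output). The helper names `fetch_metrics`, `sum_metric` and `aggregate` are made up for this example. An unreachable node costs at most one short client-side timeout and is reported as missing rather than stalling the whole query.

```python
from __future__ import annotations

import urllib.request


def fetch_metrics(host: str, port: int = 15692, timeout_s: float = 2.0) -> str | None:
    """Scrape one node; return None instead of blocking when it is unreachable."""
    try:
        url = f"http://{host}:{port}/metrics"
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.read().decode("utf-8")
    except OSError:  # covers URLError, refused connections and socket timeouts
        return None


def sum_metric(text: str, name: str) -> float:
    """Sum every sample of `name` in a Prometheus text-format payload."""
    total = 0.0
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        sample, _, value = line.rpartition(" ")  # assumes no trailing timestamp
        if sample.split("{", 1)[0] == name:
            total += float(value)
    return total


def aggregate(scrapes: dict[str, str | None], name: str) -> tuple[float, list[str]]:
    """Combine node-local values; list nodes that could not be scraped."""
    total = 0.0
    missing: list[str] = []
    for host, text in scrapes.items():
        if text is None:
            missing.append(host)
        else:
            total += sum_metric(text, name)
    return total, missing


# Canned payloads stand in for fetch_metrics() results; node3 is "offline".
scrapes = {
    "node1": "# TYPE rabbitmq_queue_messages_ready gauge\nrabbitmq_queue_messages_ready 5\n",
    "node2": "rabbitmq_queue_messages_ready 7\nrabbitmq_queue_messages_unacked 1\n",
    "node3": None,
}
print(aggregate(scrapes, "rabbitmq_queue_messages_ready"))
```

Against a live cluster, `scrapes` would be built with something like `{h: fetch_metrics(h) for h in hosts}`. In practice Prometheus and Grafana do this scraping and summing for you.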

kjnilsson · 2023-11-06T14:54:47Z

kjnilsson
Nov 6, 2023
Maintainer

@tubemeister this PR may (or may not) improve matters for your use case. I found a few places where cluster-wide queries were executed per queue unnecessarily, as well as a few other optimisations. We still need to do at least two cluster-wide queries, so it may not make a system with a dangling TCP connection as responsive as we'd like, but it may still help.

#9874

mkuratczyk · 2024-02-28T10:55:19Z

mkuratczyk
Feb 28, 2024
Maintainer

Fixed in 3.13
