Proposal: increase the default value of vm_memory_high_watermark
in RabbitMQ 4.0
#10518
-
We are considering increasing the default value of `vm_memory_high_watermark` in RabbitMQ 4.0. We would appreciate your input on this topic. Are you using the default value (0.4), or have you changed it? Additional details would be helpful (please share as much as you can).
We appreciate your input!
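For reference, here is a minimal `rabbitmq.conf` sketch of the setting under discussion; the relative form is shown with the current default, and the commented-out absolute form is an alternative some operators prefer (values are illustrative):

```ini
# rabbitmq.conf
# Relative watermark: the fraction of detected total memory at which the
# memory alarm triggers and publishers are blocked. 0.4 is the current
# default that this proposal would raise.
vm_memory_high_watermark.relative = 0.4

# Alternatively, an absolute limit can be set instead of a fraction:
# vm_memory_high_watermark.absolute = 2GB
```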
-
Hi, we are using …
-
During this discussion, I ask everyone to be mindful of a few things: …
-
Judging from the initial part of this discussion, the important factor is how much memory the runtime can/does allocate upfront, and/or how much memory certain queue types can use, in part because the actual footprint is correlated with the allocation.

Classic Queue Footprint

A durable CQv2 queue with messages published as persistent can have a low footprint (although the post does not really focus on memory in general). However, CQv1 is still the default in 3.13; we expect this to change in 4.x as soon as … Other than that, a durable classic queue with messages published as persistent does not really …

Quorum Queue Footprint

Quorum queues do not optimize for the minimum possible footprint by default; they optimize for … Changing these defaults will require a lot of testing; they work reasonably well for many deployments.

Stream Footprint

Streams have a very low memory footprint without any special tuning, except for the peculiarity, mentioned several times, on older K8s with cgroups v1.

What Would Allow for a Higher Default

Therefore, in order to increase the default for similar workloads, there are a few key areas of potential improvement: …
But with a potentially lower pre-allocation rate and CQv2, the default can go up somewhat, maybe to 0.6, 0.7 or something like that.
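As a side note for anyone on 3.13 who wants to evaluate this before 4.x: a hedged sketch of opting newly declared classic queues into CQv2 via `classic_queue.default_version` (double-check the key against the docs for your exact version; individual queues can also be switched with the `x-queue-version` optional argument or a `queue-version` policy key):

```ini
# rabbitmq.conf on 3.13: make newly declared classic queues use the v2
# storage implementation instead of the CQv1 default discussed above.
classic_queue.default_version = 2
```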
-
To add a data point... We run dozens of RabbitMQ clusters of varying size and throughput (3-11 nodes each, with throughputs ranging from tens of messages per second to high hundreds of thousands per second), all running in EC2 without k8s. We have found that a higher watermark works well for us.
I expect that with CQv2 and other improvements we would be even more comfortable with our high default.
-
In our experience, the systems that would benefit the most from increasing the watermark are the ones with a low amount of memory allocated; I see a lot of environments with 4 GB or less of total memory. In such systems, however, increasing the watermark also significantly increases the risk of OOM: queues piling up, channel buffering, or any sort of overload will very quickly push the system to a state where a single process can trigger the OOM killer.

Systems with an adequate or large amount of memory (16 GB or more) hardly benefit from an increase, under normal operations or during an overload, because of the recent optimisations (CQv2, QQ, etc.). Using up to 0.6 for such systems can save resources without much risk, and this setting is naturally found during the optimisation phase, so no default change is warranted, imho.

Changing the default would increase OOM issues for small systems with little benefit for bigger systems. It is debatable which is better, blocked publishers or OOM; in my experience blocked publishers are safer for RabbitMQ during an overload. I think that optimising the default WAL size for small deployments is more relevant.
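To make the last point concrete, a rough sketch of what a small-memory node could look like; the values are illustrative rather than recommendations, and assume the `raft.wal_max_size_bytes` key that limits the quorum queue write-ahead log:

```ini
# rabbitmq.conf for a node with only a few GB of RAM (illustrative values)
# Keep the watermark conservative on small hosts, where a single burst
# can already push the node toward the OOM killer.
vm_memory_high_watermark.relative = 0.4

# Cap the quorum queue WAL so its in-memory tables are flushed into
# segment files sooner, reducing steady-state memory use (64 MiB here).
raft.wal_max_size_bytes = 67108864
```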
-
We (CloudAMQP) set a default value of … Over the years we've experimented with different default values, and there are pros and cons with all selections, but in the current era (version …) … Many ways to get to OOM are not stopped by the watermark. A low watermark will lead to not using system resources efficiently.
-
We are currently using the default value of 0.4. In the container world, which is probably where most people run RabbitMQ these days, it does not make a lot of sense to keep this value at 0.4, because we are effectively wasting 60% of the memory allocated to the RabbitMQ container; the container only runs the RabbitMQ server anyway.
Increasing this value to, let's say, 0.8 would make sense in a container world.
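A hedged sketch of what that could look like for a container that runs nothing but RabbitMQ (0.8 is the illustrative value from above; the override key is only needed if the container memory limit is not detected correctly, e.g. on the older cgroups v1 setups mentioned earlier):

```ini
# rabbitmq.conf in a dedicated RabbitMQ container (illustrative)
# Let the node use most of the container's memory limit before the
# memory alarm blocks publishers.
vm_memory_high_watermark.relative = 0.8

# If the detected total memory does not match the container limit,
# it can be stated explicitly:
# total_memory_available_override_value = 4GB
```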
-
We've just merged a change to set the default in RabbitMQ 4.0 to 0.6 (#12161). Unless something comes up during the beta/release candidate testing, that's what we are going with for now.
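For anyone upgrading who prefers the previous behaviour, a minimal sketch of pinning the old default explicitly on a 4.0 node (an explicit setting takes precedence over the shipped default):

```ini
# rabbitmq.conf on RabbitMQ 4.0: opt back into the pre-4.0 default of 0.4
# instead of the new out-of-the-box value of 0.6.
vm_memory_high_watermark.relative = 0.4
```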