Filesystem operations sometimes fail with an ebusy return code when EBS volumes are used for storage #11047
-
Describe the bug
AWS EC2 Disk setup:
Partitioning:
Mount options:
Permissions:
Observed Error:
The error does not occur on every server.
Reproduction steps
Expected behavior
The expected behavior is a clean startup, with RabbitMQ able to create directories and files under /var/lib/rabbitmq/mnesia//quorum
Additional context
No response
Replies: 8 comments 14 replies
-
What kind of EBS volumes are you using?
-
@daveofthedogs you cannot claim that an
-
Our Production Checklist guide explicitly recommends against using network-attached storage where possible. It is certainly possible to use local storage on AWS, including SSDs (which quorum queues, Khepri and, in particular, streams will greatly benefit from). This question should be directed to AWS support, and these nodes should use VM-local storage, as recommended in the docs.
-
This error has nothing to do with performance; it is indeed related to volume mounting. It can be easily triggered by mounting a volume in the
The same happens when you try to remove this directory at the operating system level (which is effectively what RabbitMQ does):
-
Closing this discussion: the answer was to not mount the quorum directory separately. The team reproduced the error and suggested a fix for it.
-
According to So it looks like we must retry up to N times with a delay in between (of, say, 5 ms?). In theory, this should only affect file deletes, for which we can usually afford to retry N times.
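As a rough illustration of the retry idea above, here is a minimal Python sketch. It is not RabbitMQ's implementation (RabbitMQ is written in Erlang). The function name `remove_with_retry`, its parameters, and the default of 5 attempts are assumptions made for this example; only the 5 ms delay comes from the comment above.

```python
import errno
import os
import time


def remove_with_retry(path, attempts=5, delay_s=0.005, remove=os.remove):
    """Delete `path`, retrying only when the OS reports EBUSY.

    Hypothetical helper for illustration. `remove` is injectable so the
    same loop can wrap os.remove, os.rmdir, or a test double.
    Returns the number of attempts that were needed.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            remove(path)
            return attempt
        except OSError as exc:
            # Re-raise anything that is not EBUSY, and EBUSY on the last try.
            if exc.errno != errno.EBUSY or attempt == attempts:
                raise
            # Brief pause before the next attempt (5 ms by default).
            time.sleep(delay_s)
```

Retrying only on `EBUSY` keeps other failures (missing file, permission denied) visible immediately. Note that if the directory is itself a mount point, retrying will not help, because the condition is not transient.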
-
@bakkenl has clarified what I'm missing in #11066: if you have a root node data directory on volume A, and one of its subdirectories on volume B, deleting all subdirectories from A, in particular B, won't be possible because a portion of the path is shared, and so are the associated kernel locks. An alternative multi-volume solution might look like this:
But it should at least be mentioned that this kind of setup is rare and that we don't test or document it enough. Also, with two EBS volumes (which are slow compared to local SSDs), it may or may not yield any practical improvement in I/O throughput or latency.
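To check whether a node's data directory has the overlapping layout described above (a subdirectory that is itself a separate mount), a small script can walk the tree and report nested mount points. This is a sketch under assumptions: the function name `nested_mount_points` is hypothetical, and the path in the usage note is only a placeholder for the node's actual data directory.

```python
import os


def nested_mount_points(data_dir, ismount=os.path.ismount):
    """Return subdirectories of `data_dir` that are themselves mount points.

    Hypothetical helper for illustration. `ismount` is injectable so the
    walk can be exercised without real mounts.
    """
    found = []
    # os.walk does not follow symlinks by default, so only real
    # directories below data_dir are inspected.
    for root, dirs, _files in os.walk(data_dir):
        for name in dirs:
            candidate = os.path.join(root, name)
            if ismount(candidate):
                found.append(candidate)
    return sorted(found)
```

For example, `nested_mount_points("/var/lib/rabbitmq/mnesia")` returning a non-empty list would indicate that removing those directories is likely to fail with EBUSY.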
-
Besides the suggestion to avoid overlapping filesystem volumes, #11066 should at least avoid the exception under such conditions.
FWIW, this discussion is literally the first time I have seen ebusy in a RabbitMQ environment. Clearly there is something odd in your specific case.
@daveofthedogs any time you include the output of a command, it is essential that you also include the command invocation itself. We have no idea how you are running fio.
Do you see this issue when you do NOT separate out disks like this?