-
That only applies to the kill exit reason as it bypasses the terminate
callback.
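For reference, this is standard OTP behaviour rather than anything stream-specific: exit(Pid, kill) is untrappable, so the process is taken down before its terminate/2 callback can run. A minimal sketch, with a kill_demo module made up purely for illustration:
%% Minimal gen_server that traps exits and logs from terminate/2.
-module(kill_demo).
-behaviour(gen_server).
-export([start_link/0, init/1, handle_call/3, handle_cast/2, handle_info/2, terminate/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

init([]) ->
    process_flag(trap_exit, true),
    {ok, #{}}.

handle_call(_Msg, _From, State) -> {reply, ok, State}.
handle_cast(_Msg, State) -> {noreply, State}.

%% A trappable exit signal stops the server cleanly, so terminate/2 runs.
handle_info({'EXIT', _From, Reason}, State) -> {stop, Reason, State};
handle_info(_Other, State) -> {noreply, State}.

terminate(Reason, _State) ->
    %% Never reached when the process receives exit(Pid, kill).
    io:format("terminate called with reason ~p~n", [Reason]),
    ok.
Calling exit(whereis(kill_demo), shutdown) prints the terminate line, while exit(whereis(kill_demo), kill) takes the process down without it, so any cleanup that lives only in terminate (such as closing the osiris log) is skipped.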
…On Fri, 10 Nov 2023 at 16:22, Gábor Oláh ***@***.***> wrote:
Describe the bug
We noticed that with some streams the reader statistics showed an excess
number of readers. In our case the process crashed because of einval on a
send, but that was harmless as the reader process restarted without
problems. The issue is that the reader reference in the connection table
isn't cleaned, and (checking where the statistics come from) the reader
isn't closed in osiris either. Below is a simple reproduction with a
process kill.
The same behaviour happens if an AMQP consumer consumes from a stream and
the channel process crashes.
The hanging entries are only cleaned up when the node is restarted.
Reproduction steps
1. Start a single RabbitMQ node:
docker run -it --rm --network stream-perf-test -p 15672 --name rabbitmq rabbitmq:3.12-management
2. Enable the stream plugins:
docker exec rabbitmq rabbitmq-plugins enable rabbitmq_stream
docker exec rabbitmq rabbitmq-plugins enable rabbitmq_stream_management
3. Start a stream consumer:
docker run -it --rm --network stream-perf-test pivotalrabbitmq/stream-perf-test --uris rabbitmq-stream://rabbitmq:5552 -x 0 -y 1
4. Check the stream status:
docker exec rabbitmq rabbitmq-streams stream_status stream
Output:
Status of stream stream on node ***@***.*** ...
┌────────┬─────────────────────┬───────┬────────┬──────────────────┬──────────────┬─────────┬──────────┐
│ role │ node │ epoch │ offset │ committed_offset │ first_offset │ readers │ segments │
├────────┼─────────────────────┼───────┼────────┼──────────────────┼──────────────┼─────────┼──────────┤
│ writer │ ***@***.*** │ 1 │ -1 │ -1 │ 0 │ 1 │ 1 │
└────────┴─────────────────────┴───────┴────────┴──────────────────┴──────────────┴─────────┴──────────┘
5. Find and kill the stream reader process to simulate a crash. The
management UI gives us stream-consumer-0's port, which helps
identify the connection pid in the tracked_connection table.
docker exec -it rabbitmq bash
rabbitmq-diagnostics remote_shell
In the shell:
ets:tab2list(tracked_connection).
exit(<0.1255.0>, kill).
The reader process is killed:
2023-11-10 16:06:20.853992+00:00 [error] <0.1253.0> errorContext: child_terminated
2023-11-10 16:06:20.853992+00:00 [error] <0.1253.0> reason: killed
2023-11-10 16:06:20.853992+00:00 [error] <0.1253.0> offender: [{pid,<0.1255.0>},
2023-11-10 16:06:20.853992+00:00 [error] <0.1253.0> {id,rabbit_stream_reader},
2023-11-10 16:06:20.853992+00:00 [error] <0.1253.0> {mfargs,
2023-11-10 16:06:20.853992+00:00 [error] <0.1253.0> {rabbit_stream_reader,start_link,
6. Check the stream status (the perf-test client automatically
reconnects):
docker exec rabbitmq rabbitmq-streams stream_status stream
The output:
Status of stream stream on node ***@***.*** ...
┌────────┬─────────────────────┬───────┬────────┬──────────────────┬──────────────┬─────────┬──────────┐
│ role │ node │ epoch │ offset │ committed_offset │ first_offset │ readers │ segments │
├────────┼─────────────────────┼───────┼────────┼──────────────────┼──────────────┼─────────┼──────────┤
│ writer │ ***@***.*** │ 1 │ -1 │ -1 │ 0 │ 2 │ 1 │
└────────┴─────────────────────┴───────┴────────┴──────────────────┴──────────────┴─────────┴──────────┘
After restarting the node, the reader count goes back to normal.
rabbitmqctl stop_app
rabbitmqctl start_app
rabbitmq-streams stream_status stream
Status of stream stream on node ***@***.*** ...
┌────────┬─────────────────────┬───────┬────────┬──────────────────┬──────────────┬─────────┬──────────┐
│ role │ node │ epoch │ offset │ committed_offset │ first_offset │ readers │ segments │
├────────┼─────────────────────┼───────┼────────┼──────────────────┼──────────────┼─────────┼──────────┤
│ writer │ ***@***.*** │ 2 │ -1 │ -1 │ 0 │ 0 │ 0 │
└────────┴─────────────────────┴───────┴────────┴──────────────────┴──────────────┴─────────┴──────────┘
Expected behavior
After a process crash, the reader is properly closed and any hanging
records are freed up. Restarting the node periodically to clean hanging
entries isn't ideal.
Additional context
It seems that osiris logs are only closed in the terminate functions,
which are not guaranteed to run in all cases, specifically when the
process crashes.
https://github.com/rabbitmq/osiris/blob/a94832a7905a3194426d17c83f1e7577276ef420/src/osiris_replica_reader.erl#L285
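If the fix needs cleanup that survives a crash, the usual OTP pattern is to have a longer-lived process monitor the reader and do the cleanup when the 'DOWN' message arrives, since monitors fire for every exit reason, including killed. A rough sketch only; watch_reader/2 and CleanupFun are hypothetical names, not existing osiris or RabbitMQ APIs:
%% Sketch: spawn a watcher that monitors ReaderPid and runs CleanupFun/1
%% (hypothetical) once the reader is gone, however it died.
watch_reader(ReaderPid, CleanupFun) ->
    spawn(fun() ->
                  MRef = erlang:monitor(process, ReaderPid),
                  receive
                      {'DOWN', MRef, process, ReaderPid, Reason} ->
                          %% Delivered even for a brutal kill, unlike terminate/2.
                          CleanupFun(Reason)
                  end
          end).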
-
How did I miss that? I just lost the only way to "reproduce" the issue we saw. Anyway, the reader hangings were real, though collecting data from the live system was not possible. Can we move this to the discussions section to keep a record? When (if) I manage to reproduce it, I'll report back.
-
Hi, the issue is not fixed in the latest version. Here are some details: I noticed this issue a few months ago, back when we had RabbitMQ version 3.11.11, and the issue remained after upgrading to 3.12.8 and recently to 3.12.12. Our cluster is currently on version 3.12.12.
When readers start piling up we see tens of thousands of log entries like the ones attached.
I have noticed memory growth on the node with the most readers: the "Top ETS Tables" overview on node01 (ordered by memory) shows that an "anonymous" table keeps growing in memory.