-
That only applies to the kill exit reason as it bypasses the terminate
callback.
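For reference, this is standard OTP behaviour rather than anything stream-specific: exit(Pid, kill) is untrappable, so the process is taken down before its terminate/2 callback can run. A minimal sketch, with a kill_demo module made up purely for illustration:
%% Minimal gen_server that traps exits and logs from terminate/2.
-module(kill_demo).
-behaviour(gen_server).
-export([start_link/0, init/1, handle_call/3, handle_cast/2, handle_info/2, terminate/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

init([]) ->
    process_flag(trap_exit, true),
    {ok, #{}}.

handle_call(_Msg, _From, State) -> {reply, ok, State}.
handle_cast(_Msg, State) -> {noreply, State}.

%% A trappable exit signal stops the server cleanly, so terminate/2 runs.
handle_info({'EXIT', _From, Reason}, State) -> {stop, Reason, State};
handle_info(_Other, State) -> {noreply, State}.

terminate(Reason, _State) ->
    %% Never reached when the process receives exit(Pid, kill).
    io:format("terminate called with reason ~p~n", [Reason]),
    ok.
Calling exit(whereis(kill_demo), shutdown) prints the terminate line, while exit(whereis(kill_demo), kill) takes the process down without it, so any cleanup that lives only in terminate (such as closing the osiris log) is skipped.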
…On Fri, 10 Nov 2023 at 16:22, Gábor Oláh ***@***.***> wrote:
Describe the bug
We noticed that with some streams the reader statistics showed an excess
number of readers. In our case the process crashed because of einval on a
send, but that was harmless as the reader process restarted without
problems. The issue is that the reader reference in the connection table
isn't cleaned, and (checking where the statistics come from) the reader
isn't closed in osiris either. Below is a simple reproduction with a
process kill.
The same behaviour happens if an AMQP consumer consumes from a stream and
the channel process crashes.
The hanging entries are only cleaned up when the node is restarted.
Reproduction steps
1. Start a single RabbitMQ node:
docker run -it --rm --network stream-perf-test -p 15672 --name rabbitmq rabbitmq:3.12-management
2. Enable the stream plugins:
docker exec rabbitmq rabbitmq-plugins enable rabbitmq_stream
docker exec rabbitmq rabbitmq-plugins enable rabbitmq_stream_management
3. Start a stream consumer:
docker run -it --rm --network stream-perf-test pivotalrabbitmq/stream-perf-test --uris rabbitmq-stream://rabbitmq:5552 -x 0 -y 1
4. Check the stream status:
docker exec rabbitmq rabbitmq-streams stream_status stream
Output:
Status of stream stream on node ***@***.*** ...
┌────────┬─────────────────────┬───────┬────────┬──────────────────┬──────────────┬─────────┬──────────┐
│ role │ node │ epoch │ offset │ committed_offset │ first_offset │ readers │ segments │
├────────┼─────────────────────┼───────┼────────┼──────────────────┼──────────────┼─────────┼──────────┤
│ writer │ ***@***.*** │ 1 │ -1 │ -1 │ 0 │ 1 │ 1 │
└────────┴─────────────────────┴───────┴────────┴──────────────────┴──────────────┴─────────┴──────────┘
5. Find and kill the stream reader process to simulate a crash. The
management UI gives us stream-consumer-0's port, which helps
identify the connection pid in the tracked_connection table.
docker exec -it rabbitmq bash
rabbitmq-diagnostics remote_shell
In the shell:
ets:tab2list(tracked_connection).
exit(<0.1255.0>, kill).
The reader process is killed:
2023-11-10 16:06:20.853992+00:00 [error] <0.1253.0> errorContext: child_terminated
2023-11-10 16:06:20.853992+00:00 [error] <0.1253.0> reason: killed
2023-11-10 16:06:20.853992+00:00 [error] <0.1253.0> offender: [{pid,<0.1255.0>},
2023-11-10 16:06:20.853992+00:00 [error] <0.1253.0> {id,rabbit_stream_reader},
2023-11-10 16:06:20.853992+00:00 [error] <0.1253.0> {mfargs,
2023-11-10 16:06:20.853992+00:00 [error] <0.1253.0> {rabbit_stream_reader,start_link,
6. Check the stream status (the perf-test client automatically
reconnects):
docker exec rabbitmq rabbitmq-streams stream_status stream
The output:
Status of stream stream on node ***@***.*** ...
┌────────┬─────────────────────┬───────┬────────┬──────────────────┬──────────────┬─────────┬──────────┐
│ role │ node │ epoch │ offset │ committed_offset │ first_offset │ readers │ segments │
├────────┼─────────────────────┼───────┼────────┼──────────────────┼──────────────┼─────────┼──────────┤
│ writer │ ***@***.*** │ 1 │ -1 │ -1 │ 0 │ 2 │ 1 │
└────────┴─────────────────────┴───────┴────────┴──────────────────┴──────────────┴─────────┴──────────┘
After restarting the node, the reader count goes back to normal.
rabbitmqctl stop_app
rabbitmqctl start_app
rabbitmq-streams stream_status stream
Status of stream stream on node ***@***.*** ...
┌────────┬─────────────────────┬───────┬────────┬──────────────────┬──────────────┬─────────┬──────────┐
│ role │ node │ epoch │ offset │ committed_offset │ first_offset │ readers │ segments │
├────────┼─────────────────────┼───────┼────────┼──────────────────┼──────────────┼─────────┼──────────┤
│ writer │ ***@***.*** │ 2 │ -1 │ -1 │ 0 │ 0 │ 0 │
└────────┴─────────────────────┴───────┴────────┴──────────────────┴──────────────┴─────────┴──────────┘
Expected behavior
After a process crash, the reader is properly closed and any hanging
records are freed up. Restarting the node periodically to clean hanging
entries isn't ideal.
Additional context
It seems that osiris logs are only closed in the terminate functions,
which are not guaranteed to run in all cases, specifically when the
process crashes.
https://github.com/rabbitmq/osiris/blob/a94832a7905a3194426d17c83f1e7577276ef420/src/osiris_replica_reader.erl#L285
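If the fix needs cleanup that survives a crash, the usual OTP pattern is to have a longer-lived process monitor the reader and do the cleanup when the 'DOWN' message arrives, since monitors fire for every exit reason, including killed. A rough sketch only; watch_reader/2 and CleanupFun are hypothetical names, not existing osiris or RabbitMQ APIs:
%% Sketch: spawn a watcher that monitors ReaderPid and runs CleanupFun/1
%% (hypothetical) once the reader is gone, however it died.
watch_reader(ReaderPid, CleanupFun) ->
    spawn(fun() ->
                  MRef = erlang:monitor(process, ReaderPid),
                  receive
                      {'DOWN', MRef, process, ReaderPid, Reason} ->
                          %% Delivered even for a brutal kill, unlike terminate/2.
                          CleanupFun(Reason)
                  end
          end).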
-
How did I miss that? I just lost the only way to "reproduce" the issue we saw. Anyway, the reader hangings were real, though collecting data from the live system was not possible. Can we move this to the discussions section to keep a record? When (if) I manage to reproduce it, I'll report back.
-
Hi, the issue is not fixed in the latest version. Here are some details: I noticed this issue a few months ago, back when we had RabbitMQ version 3.11.11, and the issue remained after upgrading to 3.12.8 and recently to 3.12.12. Our cluster is currently on version 3.12.12.
When readers start piling up we see tens of thousands of log entries like the ones attached.
I have noticed memory growth on the node with the most readers: the "Top ETS Tables" overview on node01 (ordered by memory) shows that an "anonymous" table keeps growing in memory.