EOF errors aren't retried by the Writer #1352

Open
scottybrisbane opened this issue Nov 27, 2024 · 1 comment
@scottybrisbane

Describe the bug

During routine patching of our MSK Kafka clusters, we see a range of transient errors in our Kafka producers (which use the kafka-go Writer) as the brokers are patched. Many of these errors are retried by the Writer's retry logic, but a small volume of EOF errors are not retried at all, so those writes fail immediately and permanently. I'm not sure whether this is intentional because of some behaviour of the Kafka protocol, but from our perspective we would like requests that fail with io.EOF to be retried as well so that we don't lose those messages.

It looks like the related code snippets are:

Kafka Version

Kafka version: 3.7
kafka-go version: v0.4.47

To Reproduce

We see this behaviour for a small number of writes every time security patching or another operation on our MSK cluster causes rolling broker replacements or restarts. When the Writer sees an io.EOF error, it does not retry and the message write fails.

Expected Behavior

Ideally all errors that can be retried by the Writer are retried so that maintenance operations on a Kafka cluster are seamless and don't cause any messages to be lost.

Observed Behavior

If the kafka-go Writer receives an EOF, it does not retry and the write fails immediately with the following error: Kafka write errors (1/1), errors: [kafka.(*Client).Produce: EOF]

@fzj55

fzj55 commented Nov 28, 2024

These errors are not retried because they indicate a failure of the entire cluster. If you are only doing a rolling restart, it stands to reason that the errors described above should not be triggered.
