Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle database timeouts from Khepri minority #10915

Closed
wants to merge 22 commits into from

Conversation

the-mikedavis
Copy link
Member

Operations like declaring/deleting queues fail when sent against a node that's part of a minority. We need to let the database failures ({error, timeout}) bubble up to the callers - usually the channel - so that these operations don't cause needless crash reports.

Closes #10753
This depends on a change upstream in Khepri: rabbitmq/khepri#256

@the-mikedavis the-mikedavis self-assigned this Apr 3, 2024
@mergify mergify bot added the bazel label Apr 3, 2024
@the-mikedavis the-mikedavis force-pushed the md/khepri/database-operations-in-minority branch 3 times, most recently from 3207119 to 60e06ee Compare May 6, 2024 18:29
@the-mikedavis the-mikedavis force-pushed the md/khepri/database-operations-in-minority branch 3 times, most recently from f38326b to 6add459 Compare May 13, 2024 21:36
The prior code skirted transactions because the filter function might
cause Khepri to call itself. We want to use the same idea as the old
code - get all queues, filter them, then delete them - but we want to
perform the deletion in a transaction and fail the transaction if any
queues changed since we read them.

This fixes a bug - that the call to `delete_in_khepri/2` could return
an error tuple that would be improperly recognized as `Deletions` -
but should also make deleting transient queues atomic and fast.
Each call to `delete_in_khepri/2` needed to wait on Ra to replicate
because the deletion is an individual command sent from one process.
Performing all deletions at once means we only need to wait for one
command to be replicated across the cluster.

We also bubble up any errors to delete now rather than storing them as
deletions. This fixes a crash that occurs on node down when Khepri is
in a minority.
The clause of the spec that allowed passing a list of queue name
resources is out of date: the guard prevents a list from ever matching.
Previously a failing transaction would go unnoticed. Now we return an
error tuple.
`khepri_tx:abort/1` is only meant for use within a transaction - I
assume this was a relic of implementing this function with a transaction
previously.

The only caller already wraps this function in a `try`/`catch` block
that logs the error and re-raises.
All callers assume that this operation will succeed.
This function is only used by the test suites. A backtrace should make
the thrown error clearer though.
Note that we don't refactor the `throw/1` to an `erlang:error/1` since
it's caught by `rabbit_vhost:add/3`.
This function is only used by a test suite which matches on the 'ok'
return.
@the-mikedavis
Copy link
Member Author

These changes have been split out into other smaller PRs now

@the-mikedavis the-mikedavis deleted the md/khepri/database-operations-in-minority branch August 16, 2024 14:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Khepri: timeouts when one of the nodes stops responding
1 participant