Handle database timeouts from Khepri minority #10915

the-mikedavis · 2024-04-03T22:12:59Z

Operations like declaring or deleting queues fail when sent to a node that's part of a minority. We need to let the database failures (`{error, timeout}`) bubble up to the callers, usually the channel, so that these operations don't cause needless crash reports.

Closes #10753
This depends on a change upstream in Khepri: rabbitmq/khepri#256

The prior code skirted transactions because the filter function might cause Khepri to call itself. We want to keep the same approach as the old code (get all queues, filter them, then delete them) but perform the deletion in a transaction and fail the transaction if any queues changed since we read them.

This fixes a bug: the call to `delete_in_khepri/2` could return an error tuple that would be improperly recognized as `Deletions`. It should also make deleting transient queues atomic and fast. Previously, each call to `delete_in_khepri/2` had to wait on Ra to replicate, because each deletion was an individual command sent from one process. Performing all deletions at once means we only need to wait for one command to be replicated across the cluster.

We also now bubble up any errors from the deletion rather than storing them as deletions. This fixes a crash that occurs on node down when Khepri is in a minority.
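The read, filter, then delete-in-one-transaction idea can be modelled outside of Erlang. The following is a minimal sketch in Python, not RabbitMQ or Khepri code: `Store`, `delete_if_unchanged`, `delete_matching`, the per-record version counter, and the `("error", "changed")` result are all hypothetical names and assumptions chosen for illustration. Only the `("error", "timeout")` shape mirrors the `{error, timeout}` tuple mentioned above.

```python
from typing import Callable, Dict, Tuple

Record = dict
Result = Tuple[str, object]


class Store:
    """Toy in-memory stand-in for a replicated store (not Khepri's API)."""

    def __init__(self) -> None:
        self.data: Dict[str, Tuple[Record, int]] = {}  # name -> (record, version)
        self.has_quorum = True   # False models a node in a minority
        self.commands_sent = 0   # how many replicated commands were issued

    def put(self, name: str, record: Record) -> None:
        _, version = self.data.get(name, ({}, 0))
        self.data[name] = (record, version + 1)  # every write bumps the version

    def snapshot(self) -> Dict[str, Tuple[Record, int]]:
        return dict(self.data)

    def delete_if_unchanged(self, expected: Dict[str, int]) -> Result:
        """One atomic command: delete every name, or nothing at all."""
        if not self.has_quorum:
            return ("error", "timeout")
        self.commands_sent += 1
        for name, version in expected.items():
            current = self.data.get(name)
            if current is None or current[1] != version:
                return ("error", "changed")  # abort, nothing is deleted
        for name in expected:
            del self.data[name]
        return ("ok", sorted(expected))


def delete_matching(store: Store, matches: Callable[[Record], bool]) -> Result:
    # 1. Read and filter outside the transaction, remembering versions.
    expected = {
        name: version
        for name, (record, version) in store.snapshot().items()
        if matches(record)
    }
    # 2. Delete everything in a single command; errors go back to the caller.
    return store.delete_if_unchanged(expected)
```

In this model the caller gets either the list of deleted names or an error tuple it can pattern-match on, and all deletions cost one command regardless of how many records matched.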

`khepri_tx:abort/1` is only meant for use within a transaction. I assume this call was a relic of a previous implementation of this function that used a transaction. The only caller already wraps this function in a `try`/`catch` block that logs the error and re-raises it.

the-mikedavis · 2024-08-16T14:38:21Z

These changes have now been split out into other, smaller PRs.

the-mikedavis self-assigned this Apr 3, 2024

mergify bot added the bazel label Apr 3, 2024

the-mikedavis force-pushed the md/khepri/database-operations-in-minority branch 3 times, most recently from 3207119 to 60e06ee Compare May 6, 2024 18:29

the-mikedavis force-pushed the md/khepri/database-operations-in-minority branch 3 times, most recently from f38326b to 6add459 Compare May 13, 2024 21:36

the-mikedavis mentioned this pull request Jun 6, 2024

Khepri: Use read-only transactions to query for user/topic permissions #11398

Merged

mergify bot mentioned this pull request Jun 7, 2024

Khepri: Use read-only transactions to query for user/topic permissions (backport #11398) #11413

Merged

the-mikedavis added 20 commits July 1, 2024 13:27

WIP: Bump Khepri to X

4614a58

Introduce a rabbit_khepri:timeout_error() error type

d9b87ba

Handle database failures when declaring exchanges

a794cb4

Handle database failures when declaring queues

4b26164

Handle database failures when adding/removing bindings

c93f100

Handle database failures when deleting exchanges

f584d1d

Ignore timeout errors from deleting transient queues on node down

1fb25f9

minor: Correct outdated spec for rabbit_amqqueue:lookup/1

96e93ec

The clause of the spec that allowed passing a list of queue name resources is out of date: the guard prevents a list from ever matching.

rabbit_db_queue: Bubble up errors in set_many/1 with Khepri enabled

90bf0bc

Previously a failing transaction would go unnoticed. Now we return an error tuple.

rabbit_db_exchange: Reflect possible failure in update/2 spec

c0edff0

rabbit_db_exchange: Bubble up database errors in set/1

3d17070

rabbit_db_exchange: Raise database errors in next_serial/1

fbfec8e

All callers assume that this operation will succeed.

rabbit_db_exchange: Bubble up errors in delete_serial/1

e8e45a2

rabbit_db_exchange: Raise Khepri errors instead of throwing in clear/0

12d6448

This function is only used by the test suites. A backtrace should make the thrown error clearer though.

rabbit_db_vhost: Declare no-return in create_or_get/3 spec

d6912c3

Note that we don't refactor the `throw/1` to an `erlang:error/1` since it's caught by `rabbit_vhost:add/3`.

rabbit_db_vhost: Bubble up database errors in delete/1

c84d00d

rabbit_db_vhost: Bubble up database errors in clear/0

a7096c3

This function is only used by a test suite which matches on the 'ok' return.

rabbit_db_rtparams: Handle timeout failures from set/set_global

aeb720b

the-mikedavis added 2 commits July 1, 2024 13:27

rabbit_runtime_parameters: Remove unused value_global/2, value/4

3c64932

rabbit_runtime_parameters: Handle timeout failures in clear functions

ed3b934

the-mikedavis force-pushed the md/khepri/database-operations-in-minority branch from 6add459 to ed3b934 Compare July 1, 2024 17:27

the-mikedavis mentioned this pull request Jul 3, 2024

rabbit_runtime_parameters: Remove dead 'value_global/2', 'value/4' #11614

Merged

mergify bot mentioned this pull request Jul 3, 2024

rabbit_runtime_parameters: Remove dead 'value_global/2', 'value/4' (backport #11614) #11615

Merged

the-mikedavis mentioned this pull request Jul 11, 2024

Handle timeouts possible in Khepri minority in rabbit_db_binding #11685

Merged

This was referenced Jul 22, 2024

Handle timeouts possible in Khepri minority in rabbit_db_binding (backport #11685) #11784

Merged

Handle timeouts possible in Khepri minority in rabbit_db_binding (backport #11685) (backport #11784) #11786

Merged

the-mikedavis closed this Aug 16, 2024

the-mikedavis deleted the md/khepri/database-operations-in-minority branch August 16, 2024 14:38
