
Restore improvement: batch retry #4071

Merged: 8 commits into master from ml/restore-retry on Oct 22, 2024

Conversation

@Michal-Leszczynski (Collaborator) commented Oct 16, 2024

This PR adds two changes related to work orchestration during the download and LAS stages.

  1. If a node fails to restore a batch, the batch is returned to the batch dispatcher and can be retried by other nodes
  2. If a node fails to restore a batch, it can still try to restore other batches, but only those originating from different DCs in the backup manifest

After a batch fails to be restored, SM cleans the upload dir so that no leftovers remain there.
These changes aim to make restore more robust and should help with #3871.
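
As a rough illustration of the first change, here is a minimal Go sketch of the retry hand-off. The `batch` and `dispatcher` types and the `ReportSuccess`/`ReportFailure` names are assumptions made for the example, not SM's actual API; the real code lives in `pkg/service/restore`.

```go
package restore

import (
	"context"
	"os"
)

// batch groups SSTables from one remote dir that a single node should
// download and load-and-stream (illustrative, not SM's real definition).
type batch struct {
	RemoteDir string
	SSTables  []string
}

// dispatcher stands in for SM's batchDispatcher.
type dispatcher interface {
	ReportSuccess(b batch)
	// ReportFailure returns the batch to the pool so other nodes can retry it.
	ReportFailure(host string, b batch) error
}

// restoreBatch runs one batch on the given host. On failure it cleans the
// node's upload dir (no leftovers) and hands the batch back for retry.
func restoreBatch(ctx context.Context, d dispatcher, host, uploadDir string, b batch,
	downloadAndLAS func(ctx context.Context, host string, b batch) error,
) error {
	if err := downloadAndLAS(ctx, host, b); err != nil {
		_ = os.RemoveAll(uploadDir) // clean up partial downloads from the failed attempt
		return d.ReportFailure(host, b)
	}
	d.ReportSuccess(b)
	return nil
}
```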

Moreover, this PR refactors general areas connected to batching:

  • added lots of comments explaining the batching mechanism
  • the restore workload structure is now flattened and no longer modified by the batching mechanism; a new workload progress structure has been added to represent the current state of batching progress

Ref #3871
Fixes #4065

Now, if batch restoration fails on one node,
the batch can still be retried by other nodes.
The failed node is no longer used for the restore.

Fixes #4065
This commit adds TestRestoreTablesBatchRetryIntegration,
which injects errors during the download and LAS steps and
validates that the restore finishes successfully despite them
(thanks to batch retries).
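
For flavor, a self-contained sketch of the error-injection idea behind this test; the real TestRestoreTablesBatchRetryIntegration uses SM's integration-test harness, while the `failFirstN` helper and the retry loop below are assumptions made for the example.

```go
package restore_test

import (
	"errors"
	"sync/atomic"
	"testing"
)

// failFirstN returns a function that errors on its first n calls and then
// succeeds, mimicking transient download/LAS failures.
func failFirstN(n int64) func() error {
	var calls int64
	return func() error {
		if atomic.AddInt64(&calls, 1) <= n {
			return errors.New("injected download/LAS failure")
		}
		return nil
	}
}

func TestBatchRetrySketch(t *testing.T) {
	restoreBatch := failFirstN(2)

	// With batch retries, a couple of injected failures must not fail
	// the overall restore.
	var err error
	for attempt := 0; attempt < 3; attempt++ {
		if err = restoreBatch(); err == nil {
			break
		}
	}
	if err != nil {
		t.Fatalf("restore did not recover from injected failures: %v", err)
	}
}
```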
After giving it some more thought, I decided
to flatten the workload structure.
Instead of having location/table/dir layers,
everything now operates on the dir layer.
This makes the implementation easier, especially
for the upcoming changes related to node retries.
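
A sketch of what "flattening" means here, using illustrative type names (the real structures in pkg/service/restore differ):

```go
package restore

// Before: location -> table -> dir nesting that the batching code had to walk.
// After: a flat list of per-dir entries, each carrying its own location and
// table info, so batching operates on a single layer.
type remoteDirWorkload struct {
	Location        string
	Keyspace, Table string
	RemoteDir       string
	Size            int64
	SSTables        []string
}

// workload is just the flat slice of remote SSTable dirs plus the total size.
type workload struct {
	Size       int64
	RemoteDirs []remoteDirWorkload
}
```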
Now, even if a host fails to restore a given batch,
it can still try to restore batches originating
from different DCs. This improves retries in
general and should also help with #3871.
This commit extends TestBatchDispatcher to include
failures in its scenario and to validate host retry
in a different datacenter.
@Michal-Leszczynski Michal-Leszczynski marked this pull request as ready for review October 16, 2024 11:43
@Michal-Leszczynski (Collaborator, Author)

@karol-kokoszka This PR is ready for review!

Previously, the Workload structure was created during indexing
and updated during batching in order to keep track of its
progress. This was confusing because it wasn't obvious
whether the size and SSTable fields described the initial
Workload state or the updated one.

This commit makes it so the Workload structure is not changed
during batching. Instead, workloadProgress was added
to store the batching progress.

Moreover, this commit also adds a lot of documentation
about batchDispatcher's internal behavior.
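
To make the split concrete, a short sketch of keeping the indexed Workload read-only while a separate progress structure carries the mutable batching state; the names and fields here are assumptions, not SM's exact types.

```go
package restore

// workloadProgress carries only the mutable batching state, so the indexed
// Workload's size and SSTable fields always describe the initial snapshot.
type workloadProgress struct {
	// bytesToRestore holds the remaining bytes per remote dir, indexed the
	// same way as the (immutable) workload's dir list.
	bytesToRestore []int64
}

// newWorkloadProgress snapshots the initial per-dir sizes; afterwards only
// the progress structure is updated as batches are dispatched and restored.
func newWorkloadProgress(dirSizes []int64) workloadProgress {
	p := workloadProgress{bytesToRestore: make([]int64, len(dirSizes))}
	copy(p.bytesToRestore, dirSizes)
	return p
}
```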
@Michal-Leszczynski (Collaborator, Author)

After re-reading the code, I discovered three areas that need to be improved:

  • fix the situation when all hosts have failed (even after retries)
  • fix the situation when parallel != 0 and hosts don't have access to all restored locations
  • improve context cancellation

I will include those fixes in this PR today.

Consider a scenario with parallel=1 and multiple DCs and locations.
Note that SM uses 'parallel.Run' for restoring in parallel.
Note that the previous batching changes made a host hang in
'batchDispatcher.DispatchBatch' if there were no more SSTables to restore,
because it was still possible that another node would fail to restore some
SSTables, in which case the hanging host could be awakened and the failed
SSTables returned to batchDispatcher.

All of this meant that the batching process could hang, because 'parallel.Run'
would allow only a single host to restore SSTables at a time, while the batching
mechanism wouldn't free it until all SSTables were restored.

Another scenario in which the batching mechanism could fail is when
all hosts failed (even with retries) to restore all SSTables.

Because of that, I changed the batching mechanism to be more DC
oriented. Now, 'workloadProgress' keeps track of the remaining
bytes to be restored per DC, and it also keeps host DC access
instead of location access (the assumption being that a single
DC can be backed up to only a single location).

This information allows freeing hosts that can't restore
any SSTables because they either already failed to restore
some SSTables from given DCs, or all SSTables from given
DCs were already restored.
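
A condensed sketch of the DC-oriented bookkeeping described above. The field names and the hostShouldWait helper are assumptions and SM's actual batchDispatcher differs in detail, but the idea is the same: a host is kept waiting only while some DC it can access still has unrestored bytes and the host has not already failed on it.

```go
package restore

// dcProgress is a DC-oriented view of the remaining work and of which hosts
// can (still) take part in restoring which DCs.
type dcProgress struct {
	remainingBytes map[string]int64    // bytes left to restore per DC
	hostDCAccess   map[string][]string // DCs whose backup location a host can reach
	hostFailedDCs  map[string][]string // DCs a host already failed to restore from
}

// hostShouldWait reports whether a host may still be handed a batch. If it
// returns false, the host can be released instead of hanging in DispatchBatch,
// which also avoids the parallel=1 deadlock described above.
func (p dcProgress) hostShouldWait(host string) bool {
	failed := make(map[string]bool, len(p.hostFailedDCs[host]))
	for _, dc := range p.hostFailedDCs[host] {
		failed[dc] = true
	}
	for _, dc := range p.hostDCAccess[host] {
		if !failed[dc] && p.remainingBytes[dc] > 0 {
			return true
		}
	}
	return false
}
```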
I'm not sure if the previous behavior was buggy, but the changes
introduced in this commit should make it clearer that the
batching mechanism respects context cancellation.
This commit also adds a simple test validating that
pausing a restore during batching ends quickly.
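
A small sketch of what "respects context cancellation" can look like in the dispatch wait loop; the channel-based wake-up here is an assumption (SM may use a different synchronization primitive), but the point is that a paused or cancelled restore releases waiting hosts immediately.

```go
package restore

import "context"

// sketchDispatcher shows only the waiting part of a batch dispatcher.
type sketchDispatcher struct {
	wakeup chan struct{} // signalled when a failed batch is returned or work changes
}

// waitForWork blocks until there may be a batch to dispatch or ctx is done,
// so pausing the restore during batching ends quickly.
func (d *sketchDispatcher) waitForWork(ctx context.Context) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-d.wakeup:
		return nil
	}
}
```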
@Michal-Leszczynski (Collaborator, Author)

@karol-kokoszka PR is ready for re-review!

@Michal-Leszczynski Michal-Leszczynski merged commit 7ab9235 into master Oct 22, 2024
51 checks passed
@Michal-Leszczynski Michal-Leszczynski deleted the ml/restore-retry branch October 22, 2024 10:15
Successfully merging this pull request may close these issues: Add re-tries to restore