feat(restore): implement batch and host retries (#4071)
* feat(restore): make batches retryable

  Now, if batch restoration fails on one node, the batch can still be retried by other nodes. The failed node is no longer used for the restore. (A minimal sketch of this retry flow follows the commit list below.)

  Fixes #4065

* feat(restore_test): test batch retry

  This commit adds TestRestoreTablesBatchRetryIntegration, which injects errors during the download and LAS steps and validates that the restore finishes successfully despite them (thanks to batch retries).

* refactor(restore): flatten workload structure

  After giving it some more thought, I decided to flatten the workload structure. Instead of having location/table/dir layers, everything now operates on the dir layer. This makes the implementation simpler, especially for the upcoming changes related to node retries.

* feat(restore): add host retries

  Now, even if a host failed to restore a given batch, it can still try to restore batches originating from different DCs. This improves retries in general, but should also help with #3871.

* feat(restore): extend batch test with host retry

  This commit extends TestBatchDispatcher to include failures in its scenario and to validate host retries in a different datacenter.

* refactor(restore): make batchDispatcher more comprehensible

  Previously, the Workload structure was created during indexing and then updated during batching in order to keep track of progress. This was confusing, because it wasn't obvious whether the size and SSTable fields described the initial Workload state or the updated one. This commit leaves the Workload structure unchanged during batching; instead, workloadProgress was added to store batching progress. Moreover, this commit adds a lot of documentation about batchDispatcher's internal behavior.

* fix(restore): free host when it can't restore anymore

  Consider a scenario with parallel=1, multiple DCs, and multiple locations. SM uses 'parallel.Run' for restoring in parallel. The previous batching changes made a host hang in 'batchDispatcher.DispatchBatch' when there were no more SSTables to restore, because another node could still fail to restore some SSTables, in which case the hanging host would be awakened and the failed SSTables returned to batchDispatcher. This meant that the batching process could hang: 'parallel.Run' would allow only a single host to restore SSTables at a time, but the batching mechanism wouldn't free it until all SSTables were restored. Another scenario in which the batching mechanism could fail is when all hosts failed (even with retries) to restore all SSTables. Because of that, I changed the batching mechanism to be more DC oriented. Now, 'workloadProgress' keeps track of the remaining bytes to be restored per DC, and it stores host-to-DC access instead of host-to-location access (the assumption being that a single DC can be backed up to only a single location). This information allows freeing hosts that can't restore any more SSTables, because they either already failed to restore some SSTables from the given DCs, or all SSTables from the given DCs were already restored. (See the workloadProgress sketch after the commit list.)

* feat(restore): improve context cancelling during batching

  I'm not sure whether the previous behavior was bugged, but the changes introduced in this commit should make it clearer that the batching mechanism respects context cancellation. This commit also adds a simple test validating that pausing a restore during batching finishes quickly. (A sketch of a cancellation-aware dispatch call closes the commit list.)
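The following is a minimal Go sketch of the batch-retry idea described above, using hypothetical names rather than the actual scylla-manager types: a failed batch is handed back to the dispatcher so another host can pick it up, while the failing host is never offered that batch again.

```go
// Hypothetical sketch of batch retries; not the real scylla-manager API.
package main

import (
	"fmt"
	"sync"
)

type batch struct {
	ID       int
	failedOn map[string]bool // hosts that already failed this batch
}

type dispatcher struct {
	mu      sync.Mutex
	pending []*batch
}

// next returns a pending batch that the given host has not failed yet.
func (d *dispatcher) next(host string) (*batch, bool) {
	d.mu.Lock()
	defer d.mu.Unlock()
	for i, b := range d.pending {
		if !b.failedOn[host] {
			d.pending = append(d.pending[:i], d.pending[i+1:]...)
			return b, true
		}
	}
	return nil, false
}

// reportFailure puts the batch back so that other hosts can retry it.
func (d *dispatcher) reportFailure(host string, b *batch) {
	d.mu.Lock()
	defer d.mu.Unlock()
	b.failedOn[host] = true
	d.pending = append(d.pending, b)
}

func main() {
	d := &dispatcher{pending: []*batch{{ID: 1, failedOn: map[string]bool{}}}}
	b, _ := d.next("node1")
	d.reportFailure("node1", b) // node1 failed, batch goes back to the pool
	if _, ok := d.next("node1"); !ok {
		fmt.Println("node1 is never handed the batch it failed")
	}
	if b2, ok := d.next("node2"); ok {
		fmt.Printf("node2 retries batch %d\n", b2.ID)
	}
}
```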
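This is a simplified sketch of the DC-oriented progress tracking from the "free host when it can't restore anymore" commit, again with hypothetical field and function names: workloadProgress keeps the remaining bytes per DC and the DCs each host can access, so a host can be freed as soon as every DC it could serve is either fully restored or has already failed on it.

```go
// Hypothetical sketch of per-DC progress tracking; not the real implementation.
package main

import "fmt"

type workloadProgress struct {
	bytesLeft  map[string]int64           // remaining bytes to restore, per DC
	hostDCs    map[string][]string        // DCs a host has access to
	hostFailed map[string]map[string]bool // DCs a host already failed on
}

// hostDone reports whether the host can no longer restore anything: every DC it
// has access to is either fully restored or has already failed on this host.
func (wp *workloadProgress) hostDone(host string) bool {
	for _, dc := range wp.hostDCs[host] {
		if wp.bytesLeft[dc] > 0 && !wp.hostFailed[host][dc] {
			return false
		}
	}
	return true
}

func main() {
	wp := &workloadProgress{
		bytesLeft:  map[string]int64{"dc1": 0, "dc2": 100},
		hostDCs:    map[string][]string{"h1": {"dc1"}, "h2": {"dc1", "dc2"}},
		hostFailed: map[string]map[string]bool{},
	}
	fmt.Println(wp.hostDone("h1")) // true: dc1 is already fully restored, free h1
	fmt.Println(wp.hostDone("h2")) // false: dc2 still has bytes left to restore
}
```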
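Finally, a minimal sketch of what a cancellation-aware dispatch call could look like, assuming a simplified channel-based shape rather than the actual batchDispatcher code: a host waiting for work wakes up on a new batch, on the "nothing left for you" signal, or on context cancellation, whichever comes first.

```go
// Hypothetical sketch of cancellation-aware dispatching; simplified shape only.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

type batch struct{ ID int }

// dispatchBatch blocks until a batch is available, the done channel is closed,
// or the context is cancelled, whichever happens first.
func dispatchBatch(ctx context.Context, batches <-chan batch, done <-chan struct{}) (batch, error) {
	select {
	case b := <-batches:
		return b, nil
	case <-done:
		return batch{}, errors.New("no more batches for this host")
	case <-ctx.Done():
		return batch{}, ctx.Err()
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	batches := make(chan batch)
	done := make(chan struct{})

	go func() {
		time.Sleep(50 * time.Millisecond)
		cancel() // simulate pausing the restore during batching
	}()

	if _, err := dispatchBatch(ctx, batches, done); err != nil {
		fmt.Println("dispatch ended quickly:", err) // prints context.Canceled
	}
}
```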