
Restore: add and fill host info in restore progress #4088

Merged — 7 commits merged into master, Nov 4, 2024

Conversation

@Michal-Leszczynski (Collaborator) commented Oct 29, 2024

This PR creates and fills the new swagger host restore progress definitions from #4082 in the restore service.
It also updates managerclient to correctly display bandwidth and shard information in sctool progress.

Fixes #4042

@Michal-Leszczynski (Collaborator, Author) commented:

Examples:

Before first load&stream
miles@fedora:~/scylla-manager$ ./sctool.dev progress -c myc restore/81d15a37-0774-453b-b6b9-10d9ece8a0e4 --details
Restore progress
Run:            a19376f0-95cf-11ef-9136-0892040e83bb
Status:         RUNNING (restoring backed-up data)
Start time:     29 Oct 24 09:27:27 CET
Duration:       10s
Progress:       0% | 37%
Snapshot Tag:   sm_20241021091028UTC
Bandwidth:
  - Download:    316.128k/s
  - Load&stream: unknown

╭─────────────────────────────────────────────────┬──────────┬──────┬─────────┬────────────┬────────╮
│ Keyspace                                        │ Progress │ Size │ Success │ Downloaded │ Failed │
├─────────────────────────────────────────────────┼──────────┼──────┼─────────┼────────────┼────────┤
│ multi_location_4d99b6b98f8c11efb0cb0892040e83bb │ 0% | 37% │  86k │       0 │    32.245k │      0 │
╰─────────────────────────────────────────────────┴──────────┴──────┴─────────┴────────────┴────────╯

Hosts info
╭────────────────┬────────┬────────────────────┬───────────────────────╮
│ Host           │ Shards │ Download bandwidth │ Load&stream bandwidth │
├────────────────┼────────┼────────────────────┼───────────────────────┤
│ 192.168.200.13 │      2 │   160.801k/s/shard │               unknown │
├────────────────┼────────┼────────────────────┼───────────────────────┤
│ 192.168.200.12 │      2 │   155.433k/s/shard │               unknown │
╰────────────────┴────────┴────────────────────┴───────────────────────╯

During restore
miles@fedora:~/scylla-manager$ ./sctool.dev progress -c myc restore/81d15a37-0774-453b-b6b9-10d9ece8a0e4 --details
Restore progress
Run:            a19376f0-95cf-11ef-9136-0892040e83bb
Status:         RUNNING (restoring backed-up data)
Start time:     29 Oct 24 09:27:27 CET
Duration:       31s
Progress:       74% | 100%
Snapshot Tag:   sm_20241021091028UTC
Bandwidth:
  - Download:    306.050k/s
  - Load&stream: 1.585k/s

╭─────────────────────────────────────────────────┬────────────┬──────┬─────────┬────────────┬────────╮
│ Keyspace                                        │   Progress │ Size │ Success │ Downloaded │ Failed │
├─────────────────────────────────────────────────┼────────────┼──────┼─────────┼────────────┼────────┤
│ multi_location_4d99b6b98f8c11efb0cb0892040e83bb │ 74% | 100% │  86k │ 64.368k │        86k │      0 │
╰─────────────────────────────────────────────────┴────────────┴──────┴─────────┴────────────┴────────╯

Hosts info
╭────────────────┬────────┬────────────────────┬───────────────────────╮
│ Host           │ Shards │ Download bandwidth │ Load&stream bandwidth │
├────────────────┼────────┼────────────────────┼───────────────────────┤
│ 192.168.200.12 │      2 │   150.350k/s/shard │           813/s/shard │
├────────────────┼────────┼────────────────────┼───────────────────────┤
│ 192.168.200.13 │      2 │   155.797k/s/shard │           810/s/shard │
╰────────────────┴────────┴────────────────────┴───────────────────────╯

During repair
miles@fedora:~/scylla-manager$ ./sctool.dev progress -c myc restore/81d15a37-0774-453b-b6b9-10d9ece8a0e4 --details
Restore progress
Run:            a19376f0-95cf-11ef-9136-0892040e83bb
Status:         DONE
Start time:     29 Oct 24 09:27:27 CET
End time:       29 Oct 24 09:28:02 CET
Duration:       34s
Progress:       100% | 100%
Snapshot Tag:   sm_20241021091028UTC
Bandwidth:
  - Download:    306.050k/s
  - Load&stream: 1.415k/s


╭─────────────────────────────────────────────────┬─────────────┬──────┬─────────┬────────────┬────────╮
│ Keyspace                                        │    Progress │ Size │ Success │ Downloaded │ Failed │
├─────────────────────────────────────────────────┼─────────────┼──────┼─────────┼────────────┼────────┤
│ multi_location_4d99b6b98f8c11efb0cb0892040e83bb │ 100% | 100% │  86k │     86k │        86k │      0 │
╰─────────────────────────────────────────────────┴─────────────┴──────┴─────────┴────────────┴────────╯

Hosts info
╭────────────────┬────────┬────────────────────┬───────────────────────╮
│ Host           │ Shards │ Download bandwidth │ Load&stream bandwidth │
├────────────────┼────────┼────────────────────┼───────────────────────┤
│ 192.168.200.12 │      2 │   150.350k/s/shard │           724/s/shard │
├────────────────┼────────┼────────────────────┼───────────────────────┤
│ 192.168.200.13 │      2 │   155.797k/s/shard │           724/s/shard │
╰────────────────┴────────┴────────────────────┴───────────────────────╯

@Michal-Leszczynski Michal-Leszczynski marked this pull request as ready for review October 29, 2024 09:34
@karol-kokoszka (Collaborator) commented:

Why is the kilobyte value displayed as (160.801k/s/shard) instead of (160.801kB/s/shard)?

Is (813/s/shard) bytes per second per shard? The B is missing.

pkg/service/restore/progress.go — review comment, marked resolved
@karol-kokoszka (Collaborator) left a comment:

@Michal-Leszczynski idle time per node is missing

@Michal-Leszczynski (Collaborator, Author) commented:

> Why is the kilobyte value displayed as (160.801k/s/shard) instead of (160.801kB/s/shard)?
> Is (813/s/shard) bytes per second per shard? The B is missing.

I just used the standard way in which we display bytes in managerclient.
I will update it to use KiB instead of k, etc. (in other places as well).

> @Michal-Leszczynski idle time per node is missing

That's true! I forgot about your suggestion in the issue.

Unfortunately, it can't be calculated as Restore duration - (time reported as download) - (time reported as load&stream), since we don't really have the restore duration. Note that the duration displayed in sctool progress is the duration of the current run, not of the entire task execution. So if a restore ran for 1h and was then paused and resumed, sctool progress would display a duration close to 0, but we would want the idle time to be consistent across runs.

Of course, we could manually calculate the entire task execution duration by traversing previous runs (on the SM side), but even that wouldn't solve the whole problem.

Another problem with this approach is that the total task execution duration also includes other time-consuming restore stages (e.g. indexing, changing tombstone_gc, rebuilding views, ...). This would result in reporting an overestimated idle time, which could cause more harm than good.

In order to overcome that, we would need to know when the download and load&stream stages started and finished, but we currently don't have that information.

For these reasons, I would prefer to skip the idle time display, as it can still be observed via SM metrics.
@karol-kokoszka what do you think about it?

@Michal-Leszczynski Michal-Leszczynski force-pushed the ml/restore-bw-fill-hosts branch 2 times, most recently from 97fd869 to f960de8 Compare October 30, 2024 08:42
@karol-kokoszka (Collaborator) commented:

Having information that a node was spending time on something other than l&s or download helps with finding potential optimizations for the restore.
Seeing high bandwidth for load&stream and high bandwidth for download may be misleading when there is no information about how the node was utilized during the restore.

Examples from 3.3.3 show that a node can download fast and l&s fast, but still remain idle for most of the time.

> Of course, we could manually calculate the entire task execution duration by traversing previous runs (on the SM side)

Yes, then let's do it.

> Another problem with this approach is that the total task execution duration also includes other time-consuming restore stages (e.g. indexing, changing tombstone_gc, rebuilding views, ...). This would result in reporting an overestimated idle time, which could cause more harm than good.

Call it "other" instead of "idle" then.

@karol-kokoszka (Collaborator) commented:

@Michal-Leszczynski, as per today's sync: it's enough to include the time a node spent on l&s and the time it spent on downloading.

@Michal-Leszczynski (Collaborator, Author) commented:

@karol-kokoszka here is updated display:

miles@fedora:~/scylla-manager$ ./sctool.dev progress -c myc restore/81d15d9e-e2ef-4bd2-a1cb-522f6d9238ef --details
Restore progress
Run:            4e18659d-97a1-11ef-8209-0892040e83bb
Status:         DONE
Start time:     31 Oct 24 17:00:52 CET
End time:       31 Oct 24 17:01:07 CET
Duration:       15s
Progress:       100% | 100%
Snapshot Tag:   sm_20241021091028UTC
Bandwidth:
  - Download:    138.263KiB/s
  - Load&stream: 1.416KiB/s

╭─────────────────────────────────────────────────┬─────────────┬───────┬─────────┬────────────┬────────╮
│ Keyspace                                        │    Progress │  Size │ Success │ Downloaded │ Failed │
├─────────────────────────────────────────────────┼─────────────┼───────┼─────────┼────────────┼────────┤
│ multi_location_4d99b6b98f8c11efb0cb0892040e83bb │ 100% | 100% │ 86KiB │   86KiB │      86KiB │     0B │
╰─────────────────────────────────────────────────┴─────────────┴───────┴─────────┴────────────┴────────╯

Hosts info
╭────────────────┬────────┬────────────────────┬───────────────────┬───────────────────────┬──────────────────────╮
│ Host           │ Shards │ Download bandwidth │ Download duration │ Load&stream bandwidth │ Load&stream duration │
├────────────────┼────────┼────────────────────┼───────────────────┼───────────────────────┼──────────────────────┤
│ 192.168.200.22 │      2 │  72.165KiB/s/shard │                1s │          817B/s/shard │                  11s │
├────────────────┼────────┼────────────────────┼───────────────────┼───────────────────────┼──────────────────────┤
│ 192.168.200.23 │      2 │  63.981KiB/s/shard │                1s │          552B/s/shard │                  11s │
├────────────────┼────────┼────────────────────┼───────────────────┼───────────────────────┼──────────────────────┤
│ 192.168.200.12 │      2 │  67.407KiB/s/shard │                1s │          808B/s/shard │                  11s │
├────────────────┼────────┼────────────────────┼───────────────────┼───────────────────────┼──────────────────────┤
│ 192.168.200.13 │      2 │  65.901KiB/s/shard │                1s │          810B/s/shard │                  11s │
├────────────────┼────────┼────────────────────┼───────────────────┼───────────────────────┼──────────────────────┤
│ 192.168.200.21 │      2 │  64.017KiB/s/shard │                1s │          547B/s/shard │                  11s │
├────────────────┼────────┼────────────────────┼───────────────────┼───────────────────────┼──────────────────────┤
│ 192.168.200.11 │      2 │  80.400KiB/s/shard │                1s │          812B/s/shard │                  11s │
╰────────────────┴────────┴────────────────────┴───────────────────┴───────────────────────┴──────────────────────╯

@karol-kokoszka (Collaborator) left a comment:

👍

@Michal-Leszczynski (Collaborator, Author) commented:

@karol-kokoszka I just discovered a small display bug.
The general average bandwidth was calculated per instance and not per shard:

Bandwidth:
  - Download:    138.263KiB/s
  - Load&stream: 1.416KiB/s

I just fixed that.

@karol-kokoszka karol-kokoszka merged commit aba3560 into master Nov 4, 2024
52 checks passed
@karol-kokoszka karol-kokoszka deleted the ml/restore-bw-fill-hosts branch November 4, 2024 13:46
karol-kokoszka pushed a commit that referenced this pull request Nov 4, 2024
Restore: add and fill host info in restore progress

* chore(go.mod): remove replace directive to SM submodules

It was a left-over from feature development :/

* chore(go.mod): bump SM submodules deps

* feat(schema): add shard cnt to restore_run_progress

It's going to be needed for calculating per-shard
download/stream bandwidth in the progress command.

* feat(restore): add and fill shard cnt in restore run progress

This commit also moves host shard info to the tablesWorker,
as it is commonly reused during the restore procedure.

* feat(restore): add and fill host info in progress

This allows calculating per-shard download/stream
bandwidth in the 'sctool progress' display.

* feat(managerclient): display bandwidth in sctool progress

Fixes #4042

* feat(managerclient): include B or iB in SizeSuffix display

It is nicer to see:
"Size: 10B" instead of "Size: 10" or
"Size: 20KiB" instead of "Size: 20k".
Successfully merging this pull request may close these issues:

- Store bandwidth characteristic of Manager restore process