
Restore: add and fill host info in restore progress #4088

Merged — 7 commits merged into master, Nov 4, 2024

Conversation

@Michal-Leszczynski (Collaborator) commented Oct 29, 2024

This PR creates and fills the new swagger host restore progress definitions from #4082 in the restore service.
It also updates managerclient to correctly display bandwidth and shard information in sctool progress.

Fixes #4042

@Michal-Leszczynski (Collaborator, Author) commented:

Examples:

Before first load&stream
miles@fedora:~/scylla-manager$ ./sctool.dev progress -c myc restore/81d15a37-0774-453b-b6b9-10d9ece8a0e4 --details
Restore progress
Run:            a19376f0-95cf-11ef-9136-0892040e83bb
Status:         RUNNING (restoring backed-up data)
Start time:     29 Oct 24 09:27:27 CET
Duration:       10s
Progress:       0% | 37%
Snapshot Tag:   sm_20241021091028UTC
Bandwidth:
  - Download:    316.128k/s
  - Load&stream: unknown

╭─────────────────────────────────────────────────┬──────────┬──────┬─────────┬────────────┬────────╮
│ Keyspace                                        │ Progress │ Size │ Success │ Downloaded │ Failed │
├─────────────────────────────────────────────────┼──────────┼──────┼─────────┼────────────┼────────┤
│ multi_location_4d99b6b98f8c11efb0cb0892040e83bb │ 0% | 37% │  86k │       0 │    32.245k │      0 │
╰─────────────────────────────────────────────────┴──────────┴──────┴─────────┴────────────┴────────╯

Hosts info
╭────────────────┬────────┬────────────────────┬───────────────────────╮
│ Host           │ Shards │ Download bandwidth │ Load&stream bandwidth │
├────────────────┼────────┼────────────────────┼───────────────────────┤
│ 192.168.200.13 │      2 │   160.801k/s/shard │               unknown │
├────────────────┼────────┼────────────────────┼───────────────────────┤
│ 192.168.200.12 │      2 │   155.433k/s/shard │               unknown │
╰────────────────┴────────┴────────────────────┴───────────────────────╯

During restore
miles@fedora:~/scylla-manager$ ./sctool.dev progress -c myc restore/81d15a37-0774-453b-b6b9-10d9ece8a0e4 --details
Restore progress
Run:            a19376f0-95cf-11ef-9136-0892040e83bb
Status:         RUNNING (restoring backed-up data)
Start time:     29 Oct 24 09:27:27 CET
Duration:       31s
Progress:       74% | 100%
Snapshot Tag:   sm_20241021091028UTC
Bandwidth:
  - Download:    306.050k/s
  - Load&stream: 1.585k/s

╭─────────────────────────────────────────────────┬────────────┬──────┬─────────┬────────────┬────────╮
│ Keyspace                                        │   Progress │ Size │ Success │ Downloaded │ Failed │
├─────────────────────────────────────────────────┼────────────┼──────┼─────────┼────────────┼────────┤
│ multi_location_4d99b6b98f8c11efb0cb0892040e83bb │ 74% | 100% │  86k │ 64.368k │        86k │      0 │
╰─────────────────────────────────────────────────┴────────────┴──────┴─────────┴────────────┴────────╯

Hosts info
╭────────────────┬────────┬────────────────────┬───────────────────────╮
│ Host           │ Shards │ Download bandwidth │ Load&stream bandwidth │
├────────────────┼────────┼────────────────────┼───────────────────────┤
│ 192.168.200.12 │      2 │   150.350k/s/shard │           813/s/shard │
├────────────────┼────────┼────────────────────┼───────────────────────┤
│ 192.168.200.13 │      2 │   155.797k/s/shard │           810/s/shard │
╰────────────────┴────────┴────────────────────┴───────────────────────╯

During repair
miles@fedora:~/scylla-manager$ ./sctool.dev progress -c myc restore/81d15a37-0774-453b-b6b9-10d9ece8a0e4 --details
Restore progress
Run:            a19376f0-95cf-11ef-9136-0892040e83bb
Status:         DONE
Start time:     29 Oct 24 09:27:27 CET
End time:       29 Oct 24 09:28:02 CET
Duration:       34s
Progress:       100% | 100%
Snapshot Tag:   sm_20241021091028UTC
Bandwidth:
  - Download:    306.050k/s
  - Load&stream: 1.415k/s


╭─────────────────────────────────────────────────┬─────────────┬──────┬─────────┬────────────┬────────╮
│ Keyspace                                        │    Progress │ Size │ Success │ Downloaded │ Failed │
├─────────────────────────────────────────────────┼─────────────┼──────┼─────────┼────────────┼────────┤
│ multi_location_4d99b6b98f8c11efb0cb0892040e83bb │ 100% | 100% │  86k │     86k │        86k │      0 │
╰─────────────────────────────────────────────────┴─────────────┴──────┴─────────┴────────────┴────────╯

Hosts info
╭────────────────┬────────┬────────────────────┬───────────────────────╮
│ Host           │ Shards │ Download bandwidth │ Load&stream bandwidth │
├────────────────┼────────┼────────────────────┼───────────────────────┤
│ 192.168.200.12 │      2 │   150.350k/s/shard │           724/s/shard │
├────────────────┼────────┼────────────────────┼───────────────────────┤
│ 192.168.200.13 │      2 │   155.797k/s/shard │           724/s/shard │
╰────────────────┴────────┴────────────────────┴───────────────────────╯

@Michal-Leszczynski Michal-Leszczynski marked this pull request as ready for review October 29, 2024 09:34
@karol-kokoszka (Collaborator) commented:

Why is the kilobyte value displayed as (160.801k/s/shard) instead of (160.801kB/s/shard)?

Is (813/s/shard) bytes per second per shard? The B is missing.

pkg/service/restore/progress.go — review comment, marked resolved
@karol-kokoszka (Collaborator) left a comment:

@Michal-Leszczynski idle time per node is missing

@Michal-Leszczynski (Collaborator, Author) commented:

> Why is the kilobyte value displayed as (160.801k/s/shard) instead of (160.801kB/s/shard)?
> Is (813/s/shard) bytes per second per shard? The B is missing.

I just used the standard way in which we display bytes in managerclient.
I will update it to use KiB instead of k, etc. (in other places as well).

> @Michal-Leszczynski idle time per node is missing

That's true! I forgot about your suggestion in the issue.

Unfortunately, it can't be calculated as Restore duration - (time reported as download) - (time reported as load&stream), since we don't really have the restore duration. Note that the duration displayed in sctool progress is the duration of the current run, not of the entire task execution. So if a restore ran for 1h and was then paused and resumed, sctool progress would display a duration close to 0, but we would want the idle time to be consistent across runs.

Of course, we could manually calculate the entire task execution duration by traversing previous runs (on the SM side), but even that wouldn't solve the whole problem.

Another problem with this approach is that the total task execution duration also includes other time-consuming restore stages (e.g. indexing, changing tombstone_gc, rebuilding views, ...). This would result in reporting an overestimated idle time, which could cause more harm than good.

In order to overcome that, we would need to know when the download and load&stream stages started and finished, but we currently don't have that information.

For these reasons, I would prefer to skip the idle time display, as it can still be observed via SM metrics.
@karol-kokoszka what do you think about it?

@Michal-Leszczynski Michal-Leszczynski force-pushed the ml/restore-bw-fill-hosts branch 2 times, most recently from 97fd869 to f960de8 Compare October 30, 2024 08:42
@karol-kokoszka (Collaborator) commented:

Having information that a node was spending time on something other than l&s or download helps with finding potential optimizations for the restore.
Seeing high bandwidth for load&stream and high bandwidth for download may be misleading when there is no information about how the node was utilized during the restore.

Examples from 3.3.3 show that a node can download fast and l&s fast, but still remain idle for most of the time.

> Of course, we could manually calculate the entire task execution duration by traversing previous runs (on the SM side)

Yes, then let's do it.

> Another problem with this approach is that the total task execution duration also includes other time-consuming restore stages (e.g. indexing, changing tombstone_gc, rebuilding views, ...). This would result in reporting an overestimated idle time, which could cause more harm than good.

Call it "other" instead of "idle" then.

@karol-kokoszka (Collaborator) commented:

@Michal-Leszczynski, as per today's sync: it's enough to include the time a node spent on l&s and the time it spent on downloading.

@Michal-Leszczynski (Collaborator, Author) commented:

@karol-kokoszka here is updated display:

miles@fedora:~/scylla-manager$ ./sctool.dev progress -c myc restore/81d15d9e-e2ef-4bd2-a1cb-522f6d9238ef --details
Restore progress
Run:            4e18659d-97a1-11ef-8209-0892040e83bb
Status:         DONE
Start time:     31 Oct 24 17:00:52 CET
End time:       31 Oct 24 17:01:07 CET
Duration:       15s
Progress:       100% | 100%
Snapshot Tag:   sm_20241021091028UTC
Bandwidth:
  - Download:    138.263KiB/s
  - Load&stream: 1.416KiB/s

╭─────────────────────────────────────────────────┬─────────────┬───────┬─────────┬────────────┬────────╮
│ Keyspace                                        │    Progress │  Size │ Success │ Downloaded │ Failed │
├─────────────────────────────────────────────────┼─────────────┼───────┼─────────┼────────────┼────────┤
│ multi_location_4d99b6b98f8c11efb0cb0892040e83bb │ 100% | 100% │ 86KiB │   86KiB │      86KiB │     0B │
╰─────────────────────────────────────────────────┴─────────────┴───────┴─────────┴────────────┴────────╯

Hosts info
╭────────────────┬────────┬────────────────────┬───────────────────┬───────────────────────┬──────────────────────╮
│ Host           │ Shards │ Download bandwidth │ Download duration │ Load&stream bandwidth │ Load&stream duration │
├────────────────┼────────┼────────────────────┼───────────────────┼───────────────────────┼──────────────────────┤
│ 192.168.200.22 │      2 │  72.165KiB/s/shard │                1s │          817B/s/shard │                  11s │
├────────────────┼────────┼────────────────────┼───────────────────┼───────────────────────┼──────────────────────┤
│ 192.168.200.23 │      2 │  63.981KiB/s/shard │                1s │          552B/s/shard │                  11s │
├────────────────┼────────┼────────────────────┼───────────────────┼───────────────────────┼──────────────────────┤
│ 192.168.200.12 │      2 │  67.407KiB/s/shard │                1s │          808B/s/shard │                  11s │
├────────────────┼────────┼────────────────────┼───────────────────┼───────────────────────┼──────────────────────┤
│ 192.168.200.13 │      2 │  65.901KiB/s/shard │                1s │          810B/s/shard │                  11s │
├────────────────┼────────┼────────────────────┼───────────────────┼───────────────────────┼──────────────────────┤
│ 192.168.200.21 │      2 │  64.017KiB/s/shard │                1s │          547B/s/shard │                  11s │
├────────────────┼────────┼────────────────────┼───────────────────┼───────────────────────┼──────────────────────┤
│ 192.168.200.11 │      2 │  80.400KiB/s/shard │                1s │          812B/s/shard │                  11s │
╰────────────────┴────────┴────────────────────┴───────────────────┴───────────────────────┴──────────────────────╯

@karol-kokoszka (Collaborator) left a comment:

👍

@Michal-Leszczynski (Collaborator, Author) commented:

@karol-kokoszka I just discovered a small display bug.
The general average bandwidth was calculated per instance and not per shard:

Bandwidth:
  - Download:    138.263KiB/s
  - Load&stream: 1.416KiB/s

I just fixed that.

@karol-kokoszka karol-kokoszka merged commit aba3560 into master Nov 4, 2024
52 checks passed
@karol-kokoszka karol-kokoszka deleted the ml/restore-bw-fill-hosts branch November 4, 2024 13:46
karol-kokoszka pushed a commit that referenced this pull request Nov 4, 2024
Restore: add and fill host info in restore progress

* chore(go.mod): remove replace directive to SM submodules

It was a left-over from feature development :/

* chore(go.mod): bump SM submodules deps

* feat(schema): add shard cnt to restore_run_progress

It's going to be needed for calculating per-shard
download/stream bandwidth in the progress command.

* feat(restore): add and fill shard cnt in restore run progress

This commit also moves host shard info to the tablesWorker,
as it is commonly reused during the restore procedure.

* feat(restore): add and fill host info in progress

This allows calculating per-shard download/stream
bandwidth in the 'sctool progress' display.

* feat(managerclient): display bandwidth in sctool progress

Fixes #4042

* feat(managerclient): include B or iB in SizeSuffix display

It is nicer to see:
"Size: 10B" instead of "Size: 10" or
"Size: 20KiB" instead of "Size: 20k".
Successfully merging this pull request may close these issues:

- Store bandwidth characteristic of Manager restore process