dump and restore internal state #277

Merged: 2 commits merged into kubernetes-csi:master from the state-handling branch, Apr 27, 2021

Contributor

@pohly pohly commented Apr 20, 2021

What type of PR is this?
/kind bug

What this PR does / why we need it:

This replaces the previous approach, trying to reconstruct state from
observations, with a simpler dump/restore of the internal state as a
JSON file in the driver's data directory. That old code was broken (did
not distinguish between own volumes and foreign volumes) and
incomplete (not all state restored).

No attempts are made to restore state properly after a node reboot.
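
As an illustration of the dump/restore approach described above, here is a minimal sketch. It is not the code from this PR: the `Volume` fields, the `State` type, the `Load` and `AddVolume` names, and the `state.json` file name are all assumptions made for the example.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// Volume is a hypothetical, trimmed-down volume record.
type Volume struct {
	VolID   string `json:"volID"`
	VolName string `json:"volName"`
	VolSize int64  `json:"volSize"`
}

// State holds everything that must survive a driver restart.
type State struct {
	Volumes []Volume `json:"volumes"`
	path    string   // location of the JSON file; unexported, so never serialized
}

// Load restores state from path. A missing file means a fresh start.
func Load(path string) (*State, error) {
	s := &State{path: path}
	data, err := os.ReadFile(path)
	if errors.Is(err, os.ErrNotExist) {
		return s, nil
	}
	if err != nil {
		return nil, fmt.Errorf("read state: %w", err)
	}
	if err := json.Unmarshal(data, s); err != nil {
		return nil, fmt.Errorf("decode state: %w", err)
	}
	return s, nil
}

// AddVolume records a volume and immediately dumps the full state.
func (s *State) AddVolume(v Volume) error {
	s.Volumes = append(s.Volumes, v)
	data, err := json.Marshal(s)
	if err != nil {
		return fmt.Errorf("encode state: %w", err)
	}
	return os.WriteFile(s.path, data, 0o600)
}

// roundTrip writes one volume, reloads the file as a restarted driver
// would, and reports how many volumes were restored.
func roundTrip() (int, error) {
	dir, err := os.MkdirTemp("", "state-demo")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "state.json")
	s, err := Load(path)
	if err != nil {
		return 0, err
	}
	if err := s.AddVolume(Volume{VolID: "vol-1", VolName: "demo", VolSize: 1 << 20}); err != nil {
		return 0, err
	}
	restored, err := Load(path)
	if err != nil {
		return 0, err
	}
	return len(restored.Volumes), nil
}

func main() {
	n, err := roundTrip()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("volumes restored:", n)
}
```

Because the whole struct is serialized, every attribute comes back after a restart, which is the advantage over reconstructing state from observations such as mount points.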

Which issue(s) this PR fixes:

#210 (comment)

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

During startup, the driver may have restored internal state incorrectly (volumes added to internal list that belong to some other driver) or failed to start completely (`failed to get capacity info: no such file or directory`).

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 20, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pohly

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 20, 2021
@@ -48,13 +48,7 @@ type FileSystems struct {
 	Filsystem []ContainerFileSystem `json:"filesystems"`
 }

-func getSourcePath(volumeHandle string) string {
-	return fmt.Sprintf("%s/%s", dataRoot, volumeHandle)
-}
Contributor Author

@pohly pohly Apr 20, 2021

This was a reimplementation of getVolumePath.

-func (hp *hostPath) checkPVCapacityValid(volumeHandle string) (bool, error) {
-	sourcePath := getSourcePath(volumeHandle)
+func (hp *hostPath) checkPVCapacityValid(volID string) (bool, error) {
+	sourcePath := hp.getVolumePath(volID)
Contributor Author

Using volumeHandle instead of volID as in the rest of the code confused me ("what is this handle?!"), so I replaced it in the entire file.

Contributor

In a PV, the ID of a physical volume on the storage backend is called VolumeHandle. In the CSI spec, it is called volume_id. They are the same thing.

Contributor Author

Even when both are the same, the cognitive overhead for understanding the source code is higher when some parts use one term and other parts use the other. Everything else uses "volume ID", so this code should do so too.

-	}
-
-	return strings.TrimSuffix(strings.TrimPrefix(volumeSP, "[/var/lib/csi-hostpath-data/"), "]")
-}
Contributor Author

These two functions are only used by one test, so I moved them there.

@pohly pohly force-pushed the state-handling branch 3 times, most recently from 655d336 to 8869aeb Compare April 20, 2021 09:18
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 20, 2021
This replaces the previous approach, trying to reconstruct state from
observations, with a simpler dump/restore of the internal state as a
JSON file in the driver's data directory.

The advantage is that *all* volume and snapshot attributes get
restored, not just those that can be deduced from mount points.

No attempts are made to restore state properly after a node reboot.
 			return convertSnapshot(snapshot), nil
 		}
+		return &csi.ListSnapshotsResponse{}, nil
Contributor Author

This return here was missing earlier. It only passed because the test is not very precise:
https://github.com/kubernetes-csi/csi-test/blob/a251c44fd49d9eedd55a40b71e5da4ad080ba431/pkg/sanity/controller.go#L1157-L1165

I created an issue for it:
kubernetes-csi/csi-test#335
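
To make the effect of the added return concrete, here is a small sketch. It assumes that, without the return, a lookup for an unknown snapshot ID fell through to the later code paths; the `snapshot` and `listResponse` types and the `listSnapshots` function are hypothetical stand-ins for the driver's types and `csi.ListSnapshotsResponse`.

```go
package main

import "fmt"

// snapshot and listResponse are hypothetical stand-ins for the driver's
// snapshot struct and csi.ListSnapshotsResponse.
type snapshot struct {
	ID    string
	VolID string
}

type listResponse struct {
	Entries []snapshot
}

// listSnapshots returns the snapshot with the given ID, or all snapshots
// when no ID is requested.
func listSnapshots(all map[string]snapshot, snapshotID string) listResponse {
	if snapshotID != "" {
		if s, ok := all[snapshotID]; ok {
			return listResponse{Entries: []snapshot{s}}
		}
		// The return under discussion: an unknown ID yields an empty
		// response instead of falling through to the code below.
		return listResponse{}
	}
	resp := listResponse{}
	for _, s := range all {
		resp.Entries = append(resp.Entries, s)
	}
	return resp
}

func main() {
	all := map[string]snapshot{
		"snap-1": {ID: "snap-1", VolID: "vol-1"},
		"snap-2": {ID: "snap-2", VolID: "vol-2"},
	}
	fmt.Println(len(listSnapshots(all, "snap-1").Entries))  // 1
	fmt.Println(len(listSnapshots(all, "missing").Entries)) // 0
	fmt.Println(len(listSnapshots(all, "").Entries))        // 2
}
```

A sanity test that only checks for "no error" cannot tell the empty response apart from a full listing, which would explain why the missing return went unnoticed.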

 			if snapshot.VolID == req.SourceVolumeId {
 				return convertSnapshot(snapshot), nil
 			}
 		}
+		return &csi.ListSnapshotsResponse{}, nil
Contributor Author

Same here.

@pohly pohly changed the title WIP: dump and restore internal state dump and restore internal state Apr 20, 2021
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 20, 2021
@pohly
Contributor Author

pohly commented Apr 20, 2021

@fengzixu this is what I had in mind: a minimal "state" package whose only responsibility is to store volume and snapshot structs.

The overall patch is quite a bit smaller than what you currently have in #276 despite including some cleanup changes that aren't really necessary (volumeHandle -> volID).

What do you think?

/cc @msau42 @xing-yang

@fengzixu
Contributor

> @fengzixu this is what I had in mind: a minimal "state" package whose only responsibility is to store volume and snapshot structs.
>
> The overall patch is quite a bit smaller than what you currently have in #276 despite including some cleanup changes that aren't really necessary (volumeHandle -> volID).
>
> What do you think?
>
> /cc @msau42 @xing-yang

Yep. The current implementation is fine with me. It should be enough to fix the current bug. Thanks. Let me close my PR.

@pohly
Contributor Author

pohly commented Apr 20, 2021

@xing-yang can you review?

@@ -46,6 +46,7 @@ func main() {

 	flag.StringVar(&cfg.Endpoint, "endpoint", "unix://tmp/csi.sock", "CSI endpoint")
 	flag.StringVar(&cfg.DriverName, "drivername", "hostpath.csi.k8s.io", "name of the driver")
+	flag.StringVar(&cfg.StateDir, "statedir", "/csi-data-dir", "directory for storing state information across driver restarts, volumes and snapshots")
Contributor Author

Making this configurable is useful because then it is possible again to run the driver as non-root with a state directory in /tmp.

Node operations cannot be tested easily that way (need root for mounting), but even that could be achieved with wrapper scripts that rely on sudo.
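
For illustration, a self-contained sketch of how such a flag lets the state location be overridden. Only the `statedir` flag name and its default come from the diff above; the `config` type, the `parseConfig` helper, and the `state.json` file name are assumptions for the example.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"path/filepath"
)

// config is a hypothetical, reduced version of the driver configuration.
type config struct {
	StateDir string
}

// parseConfig parses command-line arguments into a config. It uses its own
// FlagSet so that it can be called with arbitrary argument lists.
func parseConfig(args []string) (config, error) {
	var cfg config
	fs := flag.NewFlagSet("hostpath-sketch", flag.ContinueOnError)
	fs.StringVar(&cfg.StateDir, "statedir", "/csi-data-dir",
		"directory for storing state information across driver restarts")
	if err := fs.Parse(args); err != nil {
		return cfg, err
	}
	return cfg, nil
}

func main() {
	cfg, err := parseConfig(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	// Everything the driver persists lives below the configured directory.
	fmt.Println("state file:", filepath.Join(cfg.StateDir, "state.json"))
}
```

Running the sketch with `-statedir /tmp/hostpath-state` points the state file at a location that a non-root user can write to.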

@xing-yang xing-yang self-assigned this Apr 20, 2021
 	if len(req.GetSourceVolumeId()) != 0 {
-		for _, snapshot := range hp.snapshots {
+		for _, snapshot := range hp.state.GetSnapshots() {
 			if snapshot.VolID == req.SourceVolumeId {
 				return convertSnapshot(snapshot), nil
Contributor

This returns as soon as one snapshot's volume id matches the request, but it is possible that there are multiple snapshots with the same source volume id. I see that this is in the original code so this can be fixed in a different PR.

Contributor Author

I agree, let's fix this separately. Please file a bug so that we don't forget.

Contributor Author

And also we should have a csi-sanity test for it...
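
One possible shape of the follow-up fix discussed in this thread, as a sketch only: collect every snapshot whose source volume matches instead of returning on the first hit. The `snapshot` type and the `snapshotsForVolume` helper are hypothetical; the real driver would convert the matches into a `csi.ListSnapshotsResponse`.

```go
package main

import "fmt"

// snapshot is a hypothetical stand-in for the driver's snapshot struct.
type snapshot struct {
	ID    string
	VolID string
}

// snapshotsForVolume returns all snapshots taken from the given source
// volume, preserving the order of the input slice.
func snapshotsForVolume(all []snapshot, volID string) []snapshot {
	var matches []snapshot
	for _, s := range all {
		if s.VolID == volID {
			matches = append(matches, s)
		}
	}
	return matches
}

func main() {
	all := []snapshot{
		{ID: "snap-1", VolID: "vol-1"},
		{ID: "snap-2", VolID: "vol-2"},
		{ID: "snap-3", VolID: "vol-1"},
	}
	for _, s := range snapshotsForVolume(all, "vol-1") {
		fmt.Println(s.ID)
	}
}
```

A csi-sanity test for this case would create two snapshots of the same volume and expect both in the response.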

-func doHealthCheckInNodeSide(volumeHandle string) (bool, string) {
-	mpExist, err := checkMountPointExist(volumeHandle)
+func (hp *hostPath) doHealthCheckInNodeSide(volID string) (bool, string) {
+	sourcePath := hp.getVolumePath(volID)
Contributor

sourcePath -> volumePath

Contributor Author

That change should then also propagate to other places, like checkMountPointExist. I've pushed a commit with that change.

Please take another look.

-func (hp *hostPath) doHealthCheckInControllerSide(volumeHandle string) (bool, string) {
-	spExist, err := checkSourcePathExist(volumeHandle)
+func (hp *hostPath) doHealthCheckInControllerSide(volID string) (bool, string) {
+	sourcePath := hp.getVolumePath(volID)
Contributor

sourcePath -> volumePath

@xing-yang
Contributor

This looks fine. I've asked @fengzixu to validate the volume health part.

The "sourcePath" name is not quite right: it is a "volume path" which
just happens to be used as source in some places.
@fengzixu
Contributor

fengzixu commented Apr 27, 2021

I tested the volume health part based on this PR. The following test cases pass:

  1. create volume ---> list volumes ---> get volume health state
  2. unmount volume ---> NodeGetVolumeStats/ControllerListVolumeCondition ---> get abnormal volume health state ---> generate event to pod/pvc: Warning VolumeConditionAbnormal 4s (x12 over 71m) csi-pv-monitor-agent-hostpath.csi.k8s.io The volume isn't mounted
  3. mount volume ---> NodeGetVolumeStats/ControllerListVolumeCondition ---> get normal volume health state ---> generate event to pod/pvc: Normal VolumeConditionNormal pod/my-csi-app The Volume returns to the healthy state

I also tested similar cases for "source path doesn't exist"; they pass as well.

cc @xing-yang
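
The "source path doesn't exist" case mentioned above can be sketched as follows. This is an assumption-laden illustration, not the driver's code: `volumePath`, `checkVolumePath`, the directory layout (one directory per volume ID below the state directory), and the message text are all made up for the example, and the mount-point check is left out.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// volumePath maps a volume ID to its directory below the state directory.
func volumePath(stateDir, volID string) string {
	return filepath.Join(stateDir, volID)
}

// checkVolumePath reports whether the volume's directory exists. The
// second return value describes the problem when the volume is unhealthy.
func checkVolumePath(stateDir, volID string) (bool, string) {
	_, err := os.Stat(volumePath(stateDir, volID))
	switch {
	case err == nil:
		return true, ""
	case errors.Is(err, os.ErrNotExist):
		return false, "the volume path does not exist"
	default:
		return false, err.Error()
	}
}

// demo creates one volume directory in a temporary state directory and
// checks an existing and a missing volume.
func demo() (bool, bool, error) {
	stateDir, err := os.MkdirTemp("", "health-demo")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(stateDir)

	if err := os.Mkdir(volumePath(stateDir, "vol-1"), 0o750); err != nil {
		return false, false, err
	}
	existing, _ := checkVolumePath(stateDir, "vol-1")
	missing, _ := checkVolumePath(stateDir, "vol-2")
	return existing, missing, nil
}

func main() {
	existing, missing, err := demo()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("vol-1 healthy:", existing)
	fmt.Println("vol-2 healthy:", missing)
}
```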

@xing-yang
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 27, 2021
@k8s-ci-robot k8s-ci-robot merged commit 9c84d82 into kubernetes-csi:master Apr 27, 2021
@pohly pohly mentioned this pull request Apr 30, 2021
5 participants