-
Notifications
You must be signed in to change notification settings - Fork 372
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix lifecycle and ControllerUnpublishVolume description #533
Fix lifecycle and ControllerUnpublishVolume description #533
Conversation
(a) is there value having the CO notify the SP that it's bypassing proper cleanup calls when invoking ControllerUnpublishVolume? (i.e. via some flag field in the request message) this could possibly be a valuable signal to the SP that additional validation/GC/state-repair needs to happen on the backend. perhaps SP authors can chime in here. (b) this is a breaking change w/ respect to SP expectations re: volume lifecycle. it may make sense to guard this w/ a specific capability in the controller service. and only then should a CO have expectations that such state transitions are reliable. otherwise SPs may receive calls in an unexpected/unsupported order. k8s isn't the only CO on the block, and just because it breaks the rules currently doesn't mean that all SPs need to support that pattern. |
There are COs of which implementation violates csi spec already. They violate csi spec reluctantly because there is no transition in case of node is shut down or in a non-recoverable state. Matching actual implementation and spec makes sense. |
reluctant COs or otherwise, this change seems to break the contract defined in the spec. it's unclear whether my proposals are under consideration. do you have any feedback re: the suggestions I made? |
This PR looks having room to describe more detail, but I believe that its direction is right because:
See logs for details. If someone appends ANOTHER call like #477 to the spec for the case, we will have to modify K8s and CSI plugins supporting 2. We don't want to change them because they are working very well now. So, it's good to modify the spec to follow the current as-is workflow. This PR will do so. |
With Nomad our experience is the primary reason we need to bypass the proper cleanup calls is because we can't communicate with the node plugin (ex. it's host isn't running). Right now we don't track the failure state specifically and just log the event and bump the internal state machine along, but it's totally feasible for us to implement. For SP plugins like AWS EBS, the controller could use that information to force-detach instead of detach.
As a CO implementer I'd suggest that if we were to have such a capability, the spec should provide specific guidance to what the CO and SP should expect as to the correct behavior if that capability is missing. Because by a strict reading of the current spec if the node plugin is unavailable the state machine cannot continue and the volume is unrecoverably lost from the perspective of the CO. Needless to say this is an awful user experience. So if the spec is going to be particular about it I'd love to be able to point to the plugin author who isn't providing that capability so that we can suggest to our users that they use a better plugin. 😁
fwiw, Nomad implements something similar as well. |
f5963c5
to
afd5b33
Compare
Thank you for giving me good advice. I also think your idea "CO notify the SP that it's bypassing proper cleanup calls when invoking ControllerUnpublishVolume" and "i.e. via some flag field in the request message"is good. |
It's true that today k8s already violates the spec, but only in situations where it's known to be safe. The workaround that exist to ensure safety when doing this still rely on external knowledge, though, which means they either need a human in the loop (not suitable for lights out operation) or they rely on some external component with special access to machine state (not available in all deployments). I'm going to put some more effort into bringing back my proposed solution in #477. The remaining objections seem to be related to the fact that the proposal only covers the CSI layer and people want to a proposal that covers the upper layers too. To this end I will write up a KEP that describes a comprehensive proposal. |
Yes, as mentioned above some cloud-y CSI drivers could invoke "force detach". Some more traditional (FC, iSCSI) drivers could at least try to clean up LUN mapping on the server (target) side, because the client (initiator) is in down / unreachable. |
As another CO implementer (Kubernetes), I have the same point. Only human can bring the volume back to a working state, because CO does not have any message that would indicate SP to recover the volume from such a state. |
@bswartz I read #477 again. #477 suggests to create new states |
We found that k8s calls NodeStageVolume and NodePublishVolume for attached volumes on a restarted node. The status of attached volume is "PUBLISHED" and it's another violation of the state transition diagram in Fig.6, but it's reasonable behavior for COs because attached volume might be detached before node restart. Please consider them when you will update #477. BTW, #477 looks storage fencing, the first plan of non-graceful shutdown in k8s but it was finally changed to use node fencing. Is my understanding incorrect? |
We intend to fix the broken behavior in Kubernetes: Initially the fix will be opt-in, but eventually the old spec-violating behavior will be removed entirely. Based on that plan, this PR is no longer needed. |
@bswartz Thank you for your notice. I'll check it. |
What type of PR is this?
This PR adds new transition which move from
PUBLISHED
/VOL_READY
toCREATED
directly byControllerUnpublishVolume
RPC when control plane can't reach the node(for example node shut down due to a hardware failure or a software problem).What this PR does / why we need it:
Current CSI volume lifecycle is not designed for the case when
Node
is unreachable.When
Node
is shutdown or in a non-recoverable state such as hardware failure or broken OS,Node Plugin
cannot issueNodeUnpublishVolume
/NodeUnstageVolume
. In this case, we want to make the statusCREATED
(volume is detached from the node and pods are evicted to other node and running)But in current CSI volume lifecycle, there is no transition from
PUBLISHED
/VOL_READY
/NODE_READY
toCREATED
.As a result, k8s doesn't follow the CSI spec and the status moves from
PUBLISHED
toCREATED
directly without going throughVOL_READY
and/orNODE_READY
status.We need to update the CSI volume lifecycle with considering the case when
Node
is unreachable.Which issue(s) this PR fixes:
Fixes #512
Special notes for your reviewer:
Does this PR introduce an API-breaking change?: