Volume Lifecycle does not correspond to actual k8s behavior #512
At first glance, this seems like a breaking change to the spec. As this
(CSI) is not a k8s project, it is not bound by the referenced KEP.
That said, others in the CSI community have thought long and hard about
similar problems. So perhaps there's a way to amend the spec in a way that
doesn't break existing drivers (or the integration w/ other platforms and
said drivers).
Community feedback here is welcome.
On Tue, Jun 14, 2022 at 11:46 PM Yuiko Mouri ***@***.***> wrote:
As the non-graceful node shutdown feature has been implemented in k8s as alpha, the CSI volume lifecycle does not correspond to actual k8s behavior. See:
- https://kubernetes.io/blog/2022/05/20/kubernetes-1-24-non-graceful-node-shutdown-alpha/
- https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/2268-non-graceful-shutdown
By using non-graceful node shutdown, the status can be moved from PUBLISHED to CREATED without going through the VOL_READY and/or NODE_READY states. In the KEP, it is written as below (a rough sketch of the described taint check follows the quoted list):
- Once pods are selected and forcefully deleted, the attachdetach
reconciler should check the out-of-service taint on the node. If the
taint is present, the attachdetach reconciler will not wait for 6 minutes
to do force detach. Instead it will force detach right away and allow
volumeAttachment to be deleted.
- This would trigger the deletion of the volumeAttachment objects. For
CSI drivers, this would allow ControllerUnpublishVolume to happen
without NodeUnpublishVolume and/or NodeUnstageVolume being called
first. Note that there is no additional code changes required for this
step. This happens automatically after the Proposed change in the previous
step to force detach right away.
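For illustration only, here is a minimal sketch of the taint check quoted above, written against the Kubernetes API types. This is not the actual attach/detach controller code; the helper functions are hypothetical, and only the `node.kubernetes.io/out-of-service` taint key comes from the feature documentation.

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
)

// outOfServiceTaintKey is the taint an operator applies to mark a node as
// permanently gone; the key comes from the non-graceful node shutdown feature.
const outOfServiceTaintKey = "node.kubernetes.io/out-of-service"

// hasOutOfServiceTaint reports whether the node carries the out-of-service taint.
func hasOutOfServiceTaint(node *v1.Node) bool {
	for _, t := range node.Spec.Taints {
		if t.Key == outOfServiceTaintKey {
			return true
		}
	}
	return false
}

// shouldForceDetachNow sketches the decision quoted above: if the taint is
// present, detach immediately instead of waiting out the usual 6-minute timer.
func shouldForceDetachNow(node *v1.Node, maxWaitExpired bool) bool {
	return hasOutOfServiceTaint(node) || maxWaitExpired
}
```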
@jdef Thank you for your comments. I understand what you are saying. I want to find a good solution that doesn't break existing drivers.
Note that this "force detach" behavior is not introduced by the Non-Graceful Node Shutdown feature. Kubernetes already supports this behavior without Non-Graceful Node Shutdown. See "Test 2" in the PR description section below. By forcefully deleting the Pods on the shutdown node manually, volumes will be force-detached after a 6-minute wait by the Attach Detach Controller.
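A hedged client-go sketch of that manual step; the function name is illustrative and it assumes an already-configured clientset:

```go
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// forceDeletePod removes a pod with a zero grace period, the programmatic
// equivalent of `kubectl delete pod --force --grace-period=0`. Once the pods
// on the shut-down node are gone, the Attach Detach Controller force-detaches
// their volumes after its 6-minute wait.
func forceDeletePod(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
	grace := int64(0)
	return client.CoreV1().Pods(namespace).Delete(ctx, name, metav1.DeleteOptions{
		GracePeriodSeconds: &grace,
	})
}
```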
Yes, Kubernetes already breaks the CSI spec and can call ControllerUnpublish without NodeUnpublish / NodeUnstage succeeding if Kubernetes thinks the node is broken - it can't really call NodeUnstage/Unpublish in that case or get its result. The last attempt to fix this officially in CSI is in #477.
Implementor of Nomad's CSI support here! 👋 For what it's worth, we originally implemented the spec as written and it turned out to cause our users a lot of grief. As of Nomad 1.3.0 (shipped in May of this year), we're doing something similar to what k8s has done, where we make a best-effort attempt at the node unpublish/unstage before detaching. We drive this from the "client node" (our equivalent of the kubelet), so if the client node is merely disconnected and not dead, we can rely on the node unpublish/unstage having happened by the time we try to GC the claim from the control plane side. The control plane ends up retrying the controller unpublish.
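This is not Nomad's actual code, but a rough sketch of the best-effort ordering described above, using the CSI Go bindings from this repo; the function and its parameters are illustrative:

```go
package sketch

import (
	"context"
	"log"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// bestEffortDetach tries the node-side cleanup first, but tolerates failure
// (e.g. the node is gone) and proceeds to the controller-side detach anyway.
// The CO is expected to retry the controller RPC until it succeeds.
func bestEffortDetach(ctx context.Context, node csi.NodeClient, ctrl csi.ControllerClient, volumeID, nodeID, targetPath string) error {
	// Best effort: if the node plugin is unreachable, log and continue.
	if _, err := node.NodeUnpublishVolume(ctx, &csi.NodeUnpublishVolumeRequest{
		VolumeId:   volumeID,
		TargetPath: targetPath,
	}); err != nil {
		log.Printf("node unpublish failed (continuing): %v", err)
	}

	// Controller-side detach still happens so the volume can be reused elsewhere.
	_, err := ctrl.ControllerUnpublishVolume(ctx, &csi.ControllerUnpublishVolumeRequest{
		VolumeId: volumeID,
		NodeId:   nodeID,
	})
	return err
}
```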
Thanks @tgross. Are there any concerns from plugin providers that may be relying on CSI-as-written vs. the best-effort described herein? Also Mesos is another CO w/ CSI integration - anyone on that side of the house have input to add here?
IIRC, all the plugins I've tested that support controller unpublish tolerate this. I know that, for example, the AWS EBS provider just merrily returns OK to that API call and then the device doesn't actually get detached until it's unmounted. (Or the user can "force detach" via the API out-of-band.) So in that case the provider is graceful and has behavior that's eventually correct, so long as that node unpublish happens eventually. If the node unpublish never happens (say the CO has crashed unrecoverably but the host is still live), I think you end up with a hung volume. But arguably that's the right behavior. I just don't know how prevalent that graceful treatment is across the ecosystem.
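To illustrate that "graceful, eventually-correct" shape in generic terms (this is not the EBS driver's actual code; `cloudDetacher` and the server type are invented for the sketch):

```go
package sketch

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// cloudDetacher abstracts a storage provider API that starts a detach and
// returns without waiting for the device to actually be released.
type cloudDetacher interface {
	RequestDetach(ctx context.Context, volumeID, nodeID string) error
}

// controllerServer is a partial, generic controller plugin; only the
// unpublish path is shown.
type controllerServer struct {
	cloud cloudDetacher
}

// ControllerUnpublishVolume acknowledges the detach even though the provider
// may not complete it until the device is no longer in use on the node. The
// RPC is idempotent, so the CO can safely call it again later.
func (s *controllerServer) ControllerUnpublishVolume(ctx context.Context, req *csi.ControllerUnpublishVolumeRequest) (*csi.ControllerUnpublishVolumeResponse, error) {
	if err := s.cloud.RequestDetach(ctx, req.GetVolumeId(), req.GetNodeId()); err != nil {
		return nil, err
	}
	return &csi.ControllerUnpublishVolumeResponse{}, nil
}
```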
@tgross Thank you for sharing the information.
The CSI specification says that we "SHOULD" send no more than one in-flight request per *volume* at a time, with an allowance for losing state (ex. leadership transitions) which the plugins "SHOULD" handle gracefully. We mostly successfully serialize node and controller RPCs for the same volume, except when Nomad clients are lost. (See also container-storage-interface/spec#512)

These concurrency requirements in the spec fall short because Storage Provider APIs aren't necessarily safe to call concurrently on the same host. For example, concurrently attaching AWS EBS volumes to an EC2 instance results in a race for device names, which results in failure to attach and confused results when releasing claims. So in practice many CSI plugins rely on k8s-specific sidecars for serializing storage provider API calls globally. As a result, we have to be much more conservative about concurrency in Nomad than the spec allows.

This changeset includes two major changes to fix this:
- Add a serializer method to the CSI volume RPC handler. When the RPC handler makes a destructive CSI Controller RPC, we send the RPC through this serializer and only one RPC is sent at a time. Any other RPCs in flight will block.
- Ensure that requests go to the same controller plugin instance whenever possible by sorting by lowest client ID out of the healthy plugin instances.

Fixes: #15415
The CSI specification says that we "SHOULD" send no more than one in-flight request per *volume* at a time, with an allowance for losing state (ex. leadership transitions) which the plugins "SHOULD" handle gracefully. We mostly successfully serialize node and controller RPCs for the same volume, except when Nomad clients are lost. (See also container-storage-interface/spec#512)

These concurrency requirements in the spec fall short because Storage Provider APIs aren't necessarily safe to call concurrently on the same host even for _different_ volumes. For example, concurrently attaching AWS EBS volumes to an EC2 instance results in a race for device names, which results in failure to attach (because the device name is taken already and the API call fails) and confused results when releasing claims. So in practice many CSI plugins rely on k8s-specific sidecars for serializing storage provider API calls globally. As a result, we have to be much more conservative about concurrency in Nomad than the spec allows.

This changeset includes four major changes to fix this:
- Add a serializer method to the CSI volume RPC handler. When the RPC handler makes a destructive CSI Controller RPC, we send the RPC through this serializer and only one RPC is sent at a time. Any other RPCs in flight will block.
- Ensure that requests go to the same controller plugin instance whenever possible by sorting by lowest client ID out of the plugin instances.
- Ensure that requests go to _healthy_ plugin instances only.
- Ensure that requests for controllers can go to a controller on any _live_ node, not just ones eligible for scheduling (which CSI controllers don't care about).

Fixes: #15415
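The serializer described in those commit messages might look roughly like the following. This is a hedged sketch, not Nomad's implementation; the destructive RPC is abstracted as a callback:

```go
package sketch

import (
	"context"
)

// rpcSerializer allows only one destructive controller RPC in flight at a
// time, on the theory that some storage provider APIs are unsafe to call
// concurrently on the same host even for different volumes.
type rpcSerializer struct {
	slot chan struct{} // capacity 1: the single in-flight slot
}

func newRPCSerializer() *rpcSerializer {
	return &rpcSerializer{slot: make(chan struct{}, 1)}
}

// Do waits for the slot, runs fn, and releases the slot. Callers waiting for
// the slot give up if their context is canceled first.
func (s *rpcSerializer) Do(ctx context.Context, fn func(context.Context) error) error {
	select {
	case s.slot <- struct{}{}: // acquire
	case <-ctx.Done():
		return ctx.Err()
	}
	defer func() { <-s.slot }() // release
	return fn(ctx)
}
```

A channel-based slot is used here instead of a plain mutex so that a waiting caller can stop waiting when its context is canceled.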
The current CSI volume lifecycle is not designed for the case when the Node is unreachable. When the Node is shut down or in a non-recoverable state such as a hardware failure or broken OS, the Node Plugin cannot issue NodeUnpublishVolume / NodeUnstageVolume. In this case, we want to move the status to CREATED (the volume is detached from the node, and the pods are evicted to another node and running).

But in the current CSI volume lifecycle, there is no transition from PUBLISHED / VOL_READY / NODE_READY to CREATED. As a result, k8s doesn't follow the CSI spec, and the status moves from PUBLISHED to CREATED directly without going through the VOL_READY and/or NODE_READY states.

We need to update the CSI volume lifecycle to account for the case when the Node is unreachable.
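One way to see the gap is to write the spec's staged-volume lifecycle as a transition table. The state and RPC names come from the spec; the table itself is only an illustrative sketch:

```go
package sketch

// state mirrors the volume lifecycle states from the CSI spec diagram for
// dynamic provisioning with NodeStageVolume.
type state string

const (
	created   state = "CREATED"
	nodeReady state = "NODE_READY"
	volReady  state = "VOL_READY"
	published state = "PUBLISHED"
)

// transitions maps each state to the states reachable from it and the RPC
// that performs the move. There is no entry taking PUBLISHED, VOL_READY, or
// NODE_READY directly back to CREATED, yet that is exactly the move k8s makes
// when a node is unreachable (ControllerUnpublishVolume without the node RPCs).
var transitions = map[state]map[state]string{
	created:   {nodeReady: "ControllerPublishVolume"},
	nodeReady: {volReady: "NodeStageVolume", created: "ControllerUnpublishVolume"},
	volReady:  {published: "NodePublishVolume", nodeReady: "NodeUnstageVolume"},
	published: {volReady: "NodeUnpublishVolume"},
}
```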