Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Volume Lifecycle is not correspond to actual k8s behavior #512

Open
YuikoTakada opened this issue Jun 15, 2022 · 8 comments
Open

Volume Lifecycle is not correspond to actual k8s behavior #512

YuikoTakada opened this issue Jun 15, 2022 · 8 comments

Comments

@YuikoTakada
Copy link

YuikoTakada commented Jun 15, 2022

Current CSI volume lifecycle is not designed for the case when Node is unreachable.

When Node is shutdown or in a non-recoverable state such as hardware failure or broken OS, Node Plugin cannot issue NodeUnpublishVolume / NodeUnstageVolume . In this case, we want to make the status CREATED (volume is detached from the node and pods are evicted to other node and running)
But in current CSI volume lifecycle, there is no transition from PUBLISHED / VOL_READY / NODE_READY to CREATED.
As a result, k8s doesn't follow the CSI spec and the status moves from PUBLISHED to CREATED directly without going through VOL_READY and/or NODE_READY status.

We need to update the CSI volume lifecycle with considering the case when Node is unreachable.

@jdef
Copy link
Member

jdef commented Jun 15, 2022 via email

@YuikoTakada
Copy link
Author

@jdef Thank you for your comments.

I understand what you say.

At first glance, this seems like a breaking change to the spec. As this (CSI) is not a k8s project, it is not bound by the referenced KEP.

I want to find a good solution which doesn't break existing drivers.
Therefore, #477 also seems to try to solve this problem. WDYT?

@xing-yang
Copy link
Contributor

This would trigger the deletion of the volumeAttachment objects. For CSI drivers, this would allow ControllerUnpublishVolume to happen without NodeUnpublishVolume and/or NodeUnstageVolume being called first. Note that there is no additional code changes required for this step. This happens automatically after the Proposed change in the previous step to force detach right away.

Note that this "force detach" behavior is not introduced by Non-Graceful Node Shutdown feature. Kubernetes already supports this behavior without Non-Graceful Node Shutdown. See "Test 2" in the PR description section below. By forcefully deleting the Pods on the shutdown node manually, volumes will be force-detached after a 6 minute wait by the Attach Detach Controller.

kubernetes/kubernetes#108486

@jsafrane
Copy link
Contributor

Yes, Kubernetes already breaks CSI spec and can call ControllerUnpublish without NodeUnpublish / NodeUnstage succeeding if Kubernetes thinks the node is broken - it can't really call NodUnstage/Unpublish in that case or get its result.

The last attempt to fix this is officially in CSI is in #477.

@tgross
Copy link

tgross commented Sep 12, 2022

Yes, Kubernetes already breaks CSI spec and can call ControllerUnpublish without NodeUnpublish / NodeUnstage succeeding if Kubernetes thinks the node is broken - it can't really call NodUnstage/Unpublish in that case or get its result.

Implementor of Nomad's CSI support here! 👋 For what it's worth we originally implemented the spec as-written and it turned out to cause our users a lot of grief. As of Nomad 1.3.0 (shipped in May of this year), we're doing something similar to what k8s has done where we make a best effort attempt to NodeUnpublish / NodeUnstage before ControllerUnpublish.

We drive this from the "client node" (our equivalent of the kubelet), so if the client node is merely disconnected and not dead, we can rely on the node unpublish/unstage having happened by the time we try to GC the claim from the control plane side. The control plane ends up retrying the NodeUnstage/NodeUnpublish anyways, but proceeds on to ControllerUnpublish if it can't reach the node.

@jdef
Copy link
Member

jdef commented Sep 12, 2022

Thanks @tgross.

Are there any concerns from plugin providers that may be relying on CSI-as-written vs. the best-effort described herein?

Also Mesos is another CO w/ CSI integration - anyone on that side of the house have input to add here?

@tgross
Copy link

tgross commented Sep 14, 2022

Are there any concerns from plugin providers that may be relying on CSI-as-written vs. the best-effort described herein?

IIRC all the plugins I've tested that support ControllerPublish implement that step as "make the API call to the storage provider to attach the volume as a device on the target host" and then implement NodeStage as "call mount(8) to mount that device". So when unpublishing, if the unmount is missed, the API call will try to detach a mounted device.

I know that, for example, the AWS EBS provider just merrily returns OK to that API call and then the device doesn't actually get detached until it's unmounted. (Or the user can "force detach" via the API out-of-band.) So in that case the provider is graceful and has behavior that's eventually correct, so long as that node unpublish happens eventually. If the node unpublish never happens (say the CO has crashed unrecoverably but the host is still live), I think you end up with a hung volume. But arguably that's the right behavior. I just don't how prevalent that graceful treatment is across the ecosystem.

@YuikoTakada
Copy link
Author

@tgross Thank you for sharing information.
I've updated description to fit CO as a whole. This issue is not only of Kubernetes.

tgross added a commit to hashicorp/nomad that referenced this issue Jul 19, 2023
The CSI specification says that we "SHOULD" send no more than one in-flight
request per *volume* at a time, with an allowance for losing state
(ex. leadership transitions) which the plugins "SHOULD" handle gracefully. We
mostly succesfully serialize node and controller RPCs for the same volume,
except when Nomad clients are lost.
(See also container-storage-interface/spec#512)

These concurrency requirements in the spec fall short because Storage Provider
APIs aren't necessarily safe to call concurrently on the same host. For example,
concurrently attaching AWS EBS volumes to an EC2 instance results in a race for
device names, which results in failure to attach and confused results when
releasing claims. So in practice many CSI plugins rely on k8s-specific sidecars
for serializing storage provider API calls globally. As a result, we have to be
much more conservative about concurrency in Nomad than the spec allows.

This changeset includes two major changes to fix this:
* Add a serializer method to the CSI volume RPC handler. When the
  RPC handler makes a destructive CSI Controller RPC, we send the RPC thru this
  serializer and only one RPC is sent at a time. Any other RPCs in flight will
  block.
* Ensure that requests go to the same controller plugin instance whenever
  possible by sorting by lowest client ID out of the healthy plugin instances.

Fixes: #15415
tgross added a commit to hashicorp/nomad that referenced this issue Jul 20, 2023
The CSI specification says that we "SHOULD" send no more than one in-flight
request per *volume* at a time, with an allowance for losing state
(ex. leadership transitions) which the plugins "SHOULD" handle gracefully. We
mostly successfully serialize node and controller RPCs for the same volume,
except when Nomad clients are lost. (See also
container-storage-interface/spec#512)

These concurrency requirements in the spec fall short because Storage Provider
APIs aren't necessarily safe to call concurrently on the same host even for
_different_ volumes. For example, concurrently attaching AWS EBS volumes to an
EC2 instance results in a race for device names, which results in failure to
attach (because the device name is taken already and the API call fails) and
confused results when releasing claims. So in practice many CSI plugins rely on
k8s-specific sidecars for serializing storage provider API calls globally. As a
result, we have to be much more conservative about concurrency in Nomad than the
spec allows.

This changeset includes four major changes to fix this:
* Add a serializer method to the CSI volume RPC handler. When the RPC handler
  makes a destructive CSI Controller RPC, we send the RPC thru this serializer
  and only one RPC is sent at a time. Any other RPCs in flight will block.
* Ensure that requests go to the same controller plugin instance whenever
  possible by sorting by lowest client ID out of the plugin instances.
* Ensure that requests go to _healthy_ plugin instances only.
* Ensure that requests for controllers can go to a controller on any _live_
  node, not just ones eligible for scheduling (which CSI controllers don't care
  about)

Fixes: #15415
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants