Option to set pv affinity to nexus node #1578
Comments
Hi @dylex, sorry for the delay.
If the nexus node goes down, Mayastor will create a new nexus and swap paths on the initiator node. Maybe this logic has a bug, or perhaps ANA was not enabled on the nodes. Would you be able to upload a dump so we can take a look?
IIRC if your application is started on a Mayastor engine node, then the application will already be pinned to the nexus.
Thanks. I put a dump here though looking at the logs I think it did not capture the relevant period due to restarts. We do have the logs externally from that day, which I've put here. General timeline for 2024-01-08:
It doesn't seem like pods running on engine nodes are necessarily being scheduled to the nexus node, though maybe they were initially. Another factor, where we may have set things up poorly: in our 9-node cluster, 6 nodes are engine nodes with disk pools, while 3 nodes (the control-plane nodes, k8s-160-1, k8s-162-1, k8s-elk) run mayastor-etcd but may also run some client pods using mayastor.
Ah, I suspect this is because you have disabled Mayastor's Loki. We recommend you keep it enabled if possible, as it's very important for capturing logs in these types of scenarios. Anyway, I might have found the issue:
At least one of your nodes does not have kernel multipath enabled. If this is also true for the nodes where the applications were running, then it means we cannot fail over by creating a new nexus. (Also, I think we need to start capturing info about "initiator multipath support" from all nodes; maybe we can show this kind of info via the plugin and also in the dumps, cc @Abhinandan-Purkait)
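For anyone hitting this: one way to check whether the kernel's NVMe native multipath (which ANA-based path failover relies on) is enabled on a node is to read the `nvme_core` module parameter. This is a sketch, assuming the standard in-tree `nvme_core` module; the exact requirement on Mayastor's side may differ.

```shell
#!/bin/sh
# Report whether NVMe native multipath is enabled on this node.
PARAM=/sys/module/nvme_core/parameters/multipath
if [ -r "$PARAM" ]; then
    # Prints "Y" when native multipath is enabled, "N" otherwise.
    echo "nvme_core multipath: $(cat "$PARAM")"
else
    echo "nvme_core not loaded (or kernel built without NVMe multipath)"
fi
```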
Ah, yeah, we're already collecting all logs with filebeat, so we preferred not to duplicate. If I'm reading that correctly, this probably needs the
Yep, makes sense. I wonder if we can give the plugin a way to talk to existing log collections and automatically add them to the support bundle.
Yes, sounds about right. Yeah, I don't think we've documented this properly, sorry about that.
Current behaviour is that if the application is constrained to nodes with the io-engine label (openebs.io/engine=mayastor), the nexus is preferably placed on the same node the application is scheduled to, though this is not a hard requirement (example: if the io-engine pod on said node is in a bad state, we might place the nexus on a different node). I suggest we start by documenting this behaviour, and if there are more requests for hard pinning of the application and the nexus, we can revisit.
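For reference, constraining an application to engine nodes is an ordinary nodeSelector on the pod spec; the pod and claim names below are hypothetical, and the label is the one mentioned above:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-on-engine-node        # hypothetical pod name
spec:
  nodeSelector:
    openebs.io/engine: mayastor   # schedule only onto Mayastor io-engine nodes
  containers:
    - name: app
      image: busybox
      command: ["sleep", "infinity"]
      volumeMounts:
        - mountPath: /data
          name: data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: mayastor-pvc   # hypothetical PVC backed by a Mayastor StorageClass
```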
Is your feature request related to a problem? Please describe.
We recently lost a node in our k8s Mayastor cluster. For volumes it held a replica for, everything was fine and recovered after ioTimeout, but for those it was the target nexus for, things got stuck. Another node that was mounting from it reported NVMe timeouts, and the pod with the mount hung in Terminating. I believe the result was that it couldn't release the PV, so Mayastor would not allocate another nexus while the previous one was still mounted. Many processes on that node were hung: anything touching block devices or nvme. Ultimately we had to hard-reboot the other node.
Now, this may well be a kernel issue with NVMe-oF, but unless I'm missing something, it may be something we have to live with.
Describe the solution you'd like
If Mayastor had an option to set the PV's node affinity to the node running the nexus, pods would be scheduled on that node and all NVMe-oF connections would be to the local node; then if the node fails, the clients of that nexus die along with it instead of hanging on remote mounts.
Obviously this would not work in general or make sense for many deployments, but in our situation, where we run pods and storage on the same nodes, it would be a nice option, and also improve performance.
Describe alternatives you've considered
Certainly open to options if there's some better way to recover in this situation.
The biggest problem I see with this option is that PV nodeAffinity settings are immutable, so if the nexus needed to move, the PV would have to be recreated, which is probably rather difficult to manage correctly. An alternative would be to label nodes with the nexus they're running, so pods could be manually assigned to the corresponding node somehow.
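For context, the immutable field in question is the standard Kubernetes PV nodeAffinity. A hypothetical PV pinned to the nexus node would look roughly like the sketch below; the PV name, volume handle, and hostname are placeholders, and the CSI driver name is assumed to be Mayastor's:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pinned-pv                       # hypothetical PV name
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  csi:
    driver: io.openebs.csi-mayastor    # assumed Mayastor CSI driver name
    volumeHandle: example-volume-id    # hypothetical volume id
  nodeAffinity:                        # immutable after creation
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - node-running-the-nexus   # would need updating if the nexus moves
```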