Option to set pv affinity to nexus node #1578

Open
dylex opened this issue Jan 12, 2024 · 6 comments
Labels: documentation, Enhancement
Milestone: OpenEBS v4.2

Comments

@dylex commented Jan 12, 2024

Is your feature request related to a problem? Please describe.
We recently lost a node in our k8s mayastor cluster. For the volumes it held a replica for, everything was fine and recovered after ioTimeout, but for the volumes it was the target nexus for, things got stuck. Another node that was mounting from it reported nvme timeouts, and the pod with the mount hung in Terminating. I believe the result was that the pv couldn't be released, so mayastor would not allocate another nexus while the previous one was still mounted. Many processes on that node were hung -- anything looking at block devices or nvme. Ultimately we had to hard reboot that other node.

Now, this may well be a kernel issue with nvmeof, but unless I'm missing something, it may be something we have to live with.

Describe the solution you'd like
If mayastor had an option to set the pv node affinity to the node running the nexus, pods would be scheduled on that node and all nvmeof connections would be to the local node. Then, if a node is lost, all the clients of its nexuses would go down with it instead of hanging on other nodes.

Obviously this would not work in general or make sense for many deployments, but in our situation, where we run pods and storage on the same nodes, it would be a nice option and would also improve performance.

Describe alternatives you've considered
Certainly open to options if there's some better way to recover in this situation.

The biggest problem I see with this option is that pv nodeAffinity settings are immutable, so if the nexus needed to move, the pv would have to be recreated, which is probably rather difficult to manage correctly. An alternative would be to label nodes with the nexus they're running so pods could be manually assigned to the corresponding node somehow.
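For illustration, here's roughly what such a pinned PV would look like (all names below are made up; the spec.nodeAffinity stanza is the part that's immutable once the PV exists):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-mayastor-pv          # hypothetical name
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  csi:
    driver: io.openebs.csi-mayastor  # mayastor CSI driver
    volumeHandle: 0c08667c-example   # hypothetical volume uuid
  nodeAffinity:                      # immutable after creation
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - k8s-162-4          # the node currently hosting the nexus
```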

@tiagolobocastro (Contributor) commented:

Hi @dylex, sorry for the delay.

For the volumes it held a replica for, everything was fine and recovered after ioTimeout, but for the volumes it was the target nexus for, things got stuck

When the nexus node is down, mayastor will create a new nexus and swap paths on the initiator node. Maybe this logic has a bug, or perhaps ANA was not enabled on the nodes. Would you be able to upload a dump so we can take a look?
https://mayastor.gitbook.io/introduction/advanced-operations/supportability
Use --since to make sure it captures when this happened.
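Something along these lines (the exact plugin invocation and flags are covered in the doc above; the important part is --since so it spans the incident window):

```sh
# Collect a support bundle covering the incident window (adjust the duration).
kubectl mayastor dump system --since 24h
```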

If mayastor had an option to set the pv node affinity to the node running the nexus

We had something like this before (setting the hostname on the pv), but it can easily be a loaded gun: affinity set in the PV is immutable, so if you lose the node it's not so easy to move the volume to another node.

IIRC if your application is started on a mayastor engine node, then the application will already be pinned to the nexus.
So you could add the openebs.io/engine label to your application?
We could also add it to the pv, though rather than using hostname we could set the engine label? (though similarly the pv would be forever constrained to nodes with this label).
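To be concrete, that would just be a nodeSelector on the application's pod spec, e.g. this rough sketch (the deployment, image and PVC names are made up):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                        # hypothetical application
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      nodeSelector:
        openebs.io/engine: mayastor   # only schedule on io-engine nodes
      containers:
        - name: app
          image: my-app:latest        # hypothetical image
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: my-app-pvc     # hypothetical mayastor-backed PVC
```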

@dylex (Author) commented Jan 15, 2024

Thanks. I put a dump here, though looking at the logs I think it did not capture the relevant period due to restarts. We do have the logs externally from that day, which I've put here. General timeline for 2024-01-08:

  • 17:57 k8s-162-4 goes down
  • 18:07 iotimeout reached, most volumes recover (k8s-162-4 is manually cordoned/drained at some point later)
  • 19:12 k8s-162-4 restored, but 2 volumes remain in state Unknown without target nodes, pods on k8s-162-1 still stuck in Terminating
  • 20:00 k8s-162-1 rebooted after hang
  • 20:12 service is fully restored

It doesn't seem like pods running on engine nodes are necessarily being scheduled to the nexus node, but maybe they were initially. Another factor that we may have set up poorly: in our 9-node cluster, 6 nodes are engine nodes with disk pools, while 3 nodes (the control plane nodes, k8s-160-1, k8s-162-1, k8s-elk) run mayastor-etcd but may also run some client pods using mayastor.

@tiagolobocastro (Contributor) commented:

Thanks. I put a dump here, though looking at the logs I think it did not capture the relevant period due to restarts.

Ah, this is because I suspect you have disabled mayastor loki. We recommend you keep it enabled if possible, as it's very important for capturing logs in these types of scenarios.

Anyway I might have found the issue:

[2024-01-08T19:13:19.463877297+00:00 INFO io_engine:io-engine.rs:266] kernel nvme initiator multipath support: disabled

At least one of your nodes does not have kernel multi-path enabled. If this is also true for the nodes where the applications were running, then we cannot fail over by creating a new nexus.
Actually, it looks like we don't document this, or I can't find it. cc @avishnu

(Also, I think we need to start capturing info about "initiator multipath support" from all nodes; maybe we can show this type of info via the plugin and also in the dumps. cc @Abhinandan-Purkait)

@dylex (Author) commented Jan 16, 2024

Ah, yeah, we're already collecting all logs with filebeat, so we preferred not to duplicate them.

If I'm reading that correctly, this probably needs the nvme_core.multipath=Y kernel parameter (CONFIG_NVME_MULTIPATH=y is already set)? Yeah, I don't think I caught that in the docs, but I will try it, thanks!
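For reference, this is roughly what I'll check and change (commands assume a typical GRUB-based distro; adjust for your bootloader):

```sh
# Is the nvme_core multipath parameter currently on? (prints Y or N)
cat /sys/module/nvme_core/parameters/multipath

# Confirm the kernel was built with CONFIG_NVME_MULTIPATH=y (already the case here).
grep CONFIG_NVME_MULTIPATH /boot/config-"$(uname -r)"

# Enable it at boot: add nvme_core.multipath=Y to GRUB_CMDLINE_LINUX in
# /etc/default/grub, then regenerate the grub config and reboot.
sudo update-grub    # or: sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot
```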

@tiagolobocastro (Contributor) commented:

Ah, yeah, we're already collecting all logs with filebeat, so we preferred not to duplicate them.

Yep, makes sense. I wonder if we can integrate a way for the plugin to talk to existing log collections and automatically add them to the support bundle.

If I'm reading that correctly, this probably needs the nvme_core.multipath=Y kernel parameter (CONFIG_NVME_MULTIPATH=y is already set)? Yeah, I don't think I caught that in the docs, but I will try it, thanks!

Yes, that sounds about right. Yeah, I don't think we've documented this properly, sorry about that.

@tiagolobocastro added the Enhancement and documentation labels on Jan 20, 2024
@tiagolobocastro (Contributor) commented:

Current behaviour is that if the application is constrained to nodes with the io-engine label (openebs.io/engine=mayastor), the nexus is preferably placed on the same node the application is scheduled to, though this is not a hard requirement (for example, if the io-engine pod on said node is in a bad state, we might place the nexus on a different node).

I suggest we start by documenting this behaviour, and if there are more requests for hard pinning of the application and the nexus, we can revisit.
We should also document that nvme_core.multipath is required for our HA functionality.

@tiagolobocastro added this to the OpenEBS v4.2 milestone on Oct 10, 2024