Option to set pv affinity to nexus node #1578

Open
dylex opened this issue Jan 12, 2024 · 6 comments
Labels: documentation, Enhancement
Milestone: OpenEBS v4.2

Comments

@dylex commented Jan 12, 2024

Is your feature request related to a problem? Please describe.
We recently lost a node in our k8s mayastor cluster. For the volumes it held a replica for, everything was fine and recovered after ioTimeout, but for the volumes it was the target nexus for, things got stuck. Another node that was mounting from it reported nvme timeouts, and the pod with the mount hung in Terminating. I believe the result was that the pv couldn't be released, so mayastor would not allocate another nexus while the previous one was still mounted. Many processes on that node were hung -- anything looking at block devices or nvme. Ultimately we had to hard reboot that other node.

Now, this may well be a kernel issue with nvmeof, but unless I'm missing something, it may be something we have to live with.

Describe the solution you'd like
If mayastor had an option to set the pv node affinity to the node running the nexus, pods would be scheduled on that node and all nvmeof connections would be to the local node. Then, if a node is lost, all the clients of its nexuses would go down with it instead of hanging on other nodes.

Obviously this would not work in general or make sense for many deployments, but in our situation, where we run pods and storage on the same nodes, it would be a nice option and would also improve performance.

Describe alternatives you've considered
Certainly open to options if there's some better way to recover in this situation.

The biggest problem I see with this option is that pv nodeAffinity settings are immutable, so if the nexus needed to move, the pv would have to be recreated, which is probably rather difficult to manage correctly. An alternative would be to label nodes with the nexus they're running so pods could be manually assigned to the corresponding node somehow.
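For illustration, here's roughly what such a pinned PV would look like (all names below are made up; the spec.nodeAffinity stanza is the part that's immutable once the PV exists):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-mayastor-pv          # hypothetical name
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  csi:
    driver: io.openebs.csi-mayastor  # mayastor CSI driver
    volumeHandle: 0c08667c-example   # hypothetical volume uuid
  nodeAffinity:                      # immutable after creation
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - k8s-162-4          # the node currently hosting the nexus
```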

@tiagolobocastro (Contributor) commented:

Hi @dylex, sorry for the delay.

For the volumes it held a replica for, everything was fine and recovered after ioTimeout, but for the volumes it was the target nexus for, things got stuck

When the nexus node is down, mayastor will create a new nexus and swap paths on the initiator node. Maybe this logic has a bug, or perhaps ANA was not enabled on the nodes. Would you be able to upload a dump so we can take a look?
https://mayastor.gitbook.io/introduction/advanced-operations/supportability
Use --since to make sure it captures when this happened.
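Something along these lines (the exact plugin invocation and flags are covered in the doc above; the important part is --since so it spans the incident window):

```sh
# Collect a support bundle covering the incident window (adjust the duration).
kubectl mayastor dump system --since 24h
```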

If mayastor had an option to set the pv node affinity to the node running the nexus

We had something like this before (setting the hostname on the pv), but it can easily be a loaded gun: affinity set in the PV is immutable, so if you lose the node it's not so easy to move the volume to another node.

IIRC if your application is started on a mayastor engine node, then the application will already be pinned to the nexus.
So you could add the openebs.io/engine label to your application?
We could also add it to the pv, though rather than using hostname we could set the engine label? (though similarly the pv would be forever constrained to nodes with this label).
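To be concrete, that would just be a nodeSelector on the application's pod spec, e.g. this rough sketch (the deployment, image and PVC names are made up):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                        # hypothetical application
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      nodeSelector:
        openebs.io/engine: mayastor   # only schedule on io-engine nodes
      containers:
        - name: app
          image: my-app:latest        # hypothetical image
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: my-app-pvc     # hypothetical mayastor-backed PVC
```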

@dylex (Author) commented Jan 15, 2024

Thanks. I put a dump here, though looking at the logs I think it did not capture the relevant period due to restarts. We do have the logs externally from that day, which I've put here. General timeline for 2024-01-08:

  • 17:57 k8s-162-4 goes down
  • 18:07 iotimeout reached, most volumes recover (k8s-162-4 is manually cordoned/drained at some point later)
  • 19:12 k8s-162-4 restored, but 2 volumes remain in state Unknown without target nodes, pods on k8s-162-1 still stuck in Terminating
  • 20:00 k8s-162-1 rebooted after hang
  • 20:12 service is fully restored

It doesn't seem like pods running on engine nodes are necessarily being scheduled to the nexus node, but maybe they were initially. Another factor that we may have set up poorly: in our 9-node cluster, 6 nodes are engine nodes with disk pools, while 3 nodes (the control plane nodes, k8s-160-1, k8s-162-1, k8s-elk) run mayastor-etcd but may also run some client pods using mayastor.

@tiagolobocastro (Contributor) commented:

Thanks. I put a dump here, though looking at the logs I think it did not capture the relevant period due to restarts.

Ah, this is because I suspect you have disabled mayastor loki. We recommend you keep it enabled if possible, as it's very important for capturing logs in these types of scenarios.

Anyway I might have found the issue:

[2024-01-08T19:13:19.463877297+00:00 INFO io_engine:io-engine.rs:266] kernel nvme initiator multipath support: disabled

At least one of your nodes does not have kernel multi-path enabled. If this is also true for the nodes where the applications were running, then we cannot fail over by creating a new nexus.
Actually, it looks like we don't document this, or I can't find it. cc @avishnu

(Also, I think we need to start capturing info about "initiator multipath support" from all nodes; maybe we can show this type of info via the plugin and also in the dumps. cc @Abhinandan-Purkait)

@dylex (Author) commented Jan 16, 2024

Ah, yeah, we're already collecting all logs with filebeat, so we preferred not to duplicate them.

If I'm reading that correctly, this probably needs the nvme_core.multipath=Y kernel parameter (CONFIG_NVME_MULTIPATH=y is already set)? Yeah, I don't think I caught that in the docs, but I will try it, thanks!
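For reference, this is roughly what I'll check and change (commands assume a typical GRUB-based distro; adjust for your bootloader):

```sh
# Is the nvme_core multipath parameter currently on? (prints Y or N)
cat /sys/module/nvme_core/parameters/multipath

# Confirm the kernel was built with CONFIG_NVME_MULTIPATH=y (already the case here).
grep CONFIG_NVME_MULTIPATH /boot/config-"$(uname -r)"

# Enable it at boot: add nvme_core.multipath=Y to GRUB_CMDLINE_LINUX in
# /etc/default/grub, then regenerate the grub config and reboot.
sudo update-grub    # or: sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot
```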

@tiagolobocastro (Contributor) commented:

Ah, yeah, we're already collecting all logs with filebeat, so we preferred not to duplicate them.

Yep, makes sense. I wonder if we can integrate a way for the plugin to talk to existing log collections and automatically add them to the support bundle.

If I'm reading that correctly, this probably needs the nvme_core.multipath=Y kernel parameter (CONFIG_NVME_MULTIPATH=y is already set)? Yeah, I don't think I caught that in the docs, but I will try it, thanks!

Yes, that sounds about right. Yeah, I don't think we've documented this properly, sorry about that.

@tiagolobocastro added the Enhancement and documentation labels on Jan 20, 2024
@tiagolobocastro (Contributor) commented:

Current behaviour is that if the application is constrained to nodes with the io-engine label (openebs.io/engine=mayastor), the nexus is preferably placed on the same node the application is scheduled to, though this is not a hard requirement (for example, if the io-engine pod on said node is in a bad state, we might place the nexus on a different node).

I suggest we start by documenting this behaviour, and if there are more requests for hard pinning of the application and the nexus, we can revisit.
We should also document that nvme_core.multipath is required for our HA functionality.

@tiagolobocastro added this to the OpenEBS v4.2 milestone on Oct 10, 2024