add support for Dynamic Resource Allocation #1078

Merged: dougbtv merged 4 commits into k8snetworkplumbingwg:master from moshe010:dra on May 23, 2024


moshe010
Contributor

Dynamic Resource Allocation (DRA) was added in k8s 1.26 [1].
In k8s 1.27 we extended the PodResources API to expose dynamic resources from the kubelet [2].

This PR allows multus to get dynamic resources from the PodResources API and pass them to the CNI plugin as DeviceID.
To use this you will need k8s 1.27 with the DynamicResourceAllocation and KubeletPodResourcesDynamicResources feature gates enabled.

[1] - https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3063-dynamic-resource-allocation
[2] - https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3695-pod-resources-for-dra
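
As a rough illustration of the flow described above, the Go sketch below flattens the dynamic resources reported for a container into device IDs per claim. The types are local stand-ins that only approximate the kubelet PodResources messages, and the rule that the device ID is the part of a CDI device name after `=` is an assumption made for this example, not something stated in the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// Local stand-ins for the kubelet PodResources messages. The field names are
// assumed to resemble the real API but are declared here so the sketch is
// self-contained.
type CDIDevice struct{ Name string }

type ClaimResource struct{ CDIDevices []CDIDevice }

type DynamicResource struct {
	ClaimName      string
	ClaimResources []ClaimResource
}

// deviceIDs flattens the CDI devices of one claim into device IDs.
// Assumption: a CDI device name looks like "vendor/class=id", and the part
// after "=" is what the CNI plugin expects as DeviceID.
func deviceIDs(dr DynamicResource) []string {
	var ids []string
	for _, cr := range dr.ClaimResources {
		for _, dev := range cr.CDIDevices {
			_, id, ok := strings.Cut(dev.Name, "=")
			if !ok || id == "" {
				continue // skip names that do not follow the assumed format
			}
			ids = append(ids, id)
		}
	}
	return ids
}

func main() {
	dr := DynamicResource{
		ClaimName: "sriov-claim", // hypothetical claim name
		ClaimResources: []ClaimResource{
			{CDIDevices: []CDIDevice{
				{Name: "example.com/nic=0000:03:00.2"},
				{Name: "malformed"},
			}},
		},
	}
	fmt.Println(dr.ClaimName, deviceIDs(dr))
}
```

In this sketch malformed names are silently skipped; real code would more likely log them so that a misbehaving driver is visible.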

@coveralls

coveralls commented Apr 24, 2023

Coverage Status

coverage: 63.065% (-0.1%) from 63.17%
when pulling c9d411c on moshe010:dra
into c6a371b on k8snetworkplumbingwg:master.

@bn222
Contributor

bn222 commented Jun 27, 2023

Moshe, small nit before we look into this: please split up the commit into "upgrade kube to v0.27.0 + vendor that" and then do the remaining "real" changes. This would make it easier to review.

@moshe010
Contributor Author

> Moshe, small nit before we look into this: please split up the commit into "upgrade kube to v0.27.0 + vendor that" and then do the remaining "real" changes. This would make it easier to review.

Sure will do

@moshe010
Contributor Author

moshe010 commented Jul 9, 2023

Are there any more comments on this PR?

@bn222
Contributor

bn222 commented Jul 13, 2023

The main comment (without looking into how exactly DRA works) from my side is that we might want to have a function that gets or adds the dynamic resources. The current change adds the for-loop inline, which increases the nesting quite a bit.

After reading the code, I want to make sure I understand what's going on here. @moshe010 , is there a reason why we don't want to change pkg/checkpoint/checkpoint.go ?

Most importantly, @moshe010 , Vrinda from my team will be taking a look into this in more detail. Please allow a bit more time for some comments. We want to understand how this would integrate with sriov network operator.
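
For illustration, a "get or add" helper of the kind suggested in the first paragraph of this comment might look like the Go sketch below. `ResourceInfo`, `getOrAdd`, and `addDynamicResources` are hypothetical names chosen for the example and are not taken from the multus code base; the input is simplified to a map from claim name to device IDs.

```go
package main

import "fmt"

// ResourceInfo is a hypothetical stand-in for a per-resource record that
// carries the device IDs handed to the CNI plugin.
type ResourceInfo struct{ DeviceIDs []string }

// getOrAdd returns the entry for name, creating an empty one if it is missing.
func getOrAdd(m map[string]*ResourceInfo, name string) *ResourceInfo {
	if info, ok := m[name]; ok {
		return info
	}
	info := &ResourceInfo{}
	m[name] = info
	return info
}

// addDynamicResources merges claim -> device IDs into m. Keeping the loop in
// its own function keeps the caller flat instead of adding nesting inline.
func addDynamicResources(m map[string]*ResourceInfo, claims map[string][]string) {
	for claim, ids := range claims {
		info := getOrAdd(m, claim)
		info.DeviceIDs = append(info.DeviceIDs, ids...)
	}
}

func main() {
	m := map[string]*ResourceInfo{"claim-a": {DeviceIDs: []string{"dev0"}}}
	addDynamicResources(m, map[string][]string{
		"claim-a": {"dev1"},
		"claim-b": {"dev2"},
	})
	fmt.Println(m["claim-a"].DeviceIDs, m["claim-b"].DeviceIDs)
}
```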

@moshe010
Contributor Author

> The main comment (without looking into how DRA exactly works) from my side is that we might want to have a function that gets or adds the dynamic resources. The current change adds the for-loop in line, which increases the nesting quite a bit.

Sure, I will create a separate function for dynamic resources.

> After reading the code, I want to make sure I understand what's going on here. @moshe010 , is there are reason why we don't want to change pkg/checkpoint/checkpoint.go ?

The checkpoint support in multus was introduced before the pod resources API existed for device plugins.
For DRA the pod resources API will always exist, so we don't need to support the fallback.
Using the checkpoint is problematic in general because it may change between k8s versions.

> Most importantly, @moshe010 , Vrinda from my team will be taking a look into this in more detail. Please allow a bit more time for some comments. We want to understand how this would integrate with sriov network operator.

@wizhaoredhat

LGTM

@vrindle

vrindle commented Aug 25, 2023

@moshe010 @bn222 I deployed these changes on my custom multus setup and ran manual tests to make sure nothing breaks. Everything looks good, which was expected since the changes are additive. LGTM.

@moshe010
Contributor Author

Is anyone else required to review this, or can we merge it?

@maiqueb
Collaborator

maiqueb commented Sep 11, 2023

Can you update the docs ?

I am looking for a couple of paragraphs about the feature and some user documentation on how the user can use it.

@moshe010
Contributor Author

moshe010 commented Sep 11, 2023

> Can you update the docs ?
>
> I am looking for a couple of paragraphs about the feature and some user documentation on how the user can use it.

Sure, do you want me to add a DRA section here: https://github.com/k8snetworkplumbingwg/multus-cni/blob/master/docs/how-to-use.md?

@maiqueb
Collaborator

maiqueb commented Sep 11, 2023

> > Can you update the docs ?
> > I am looking for a couple of paragraphs about the feature and some user documentation on how the user can use it.
>
> Sure, do you want me to add a DRA section here: https://github.com/k8snetworkplumbingwg/multus-cni/blob/master/docs/how-to-use.md?

Unsure what's the best place ...

But yeah, let's put it there, if needed you'll move it around.

Asking @s1061123 / @dougbtv 's opinion on this.

@moshe010
Contributor Author

moshe010 commented Nov 6, 2023

The CI failures don't seem to be related to this PR, see [1].
@maiqueb sorry for the late response, I updated the docs in how-to-use.md.

[1] -
Notice: A new release of pip is available: 23.2.1 -> 23.3.1
Notice: To update, run: pip install --upgrade pip
Traceback (most recent call last):
  File "/home/runner/.local/bin/j2", line 5, in <module>
    from j2cli import main
  File "/home/runner/.local/lib/python3.12/site-packages/j2cli/__init__.py", line 4, in <module>
    import pkg_resources
ModuleNotFoundError: No module named 'pkg_resources'
Error: Process completed with exit code 1.

@maiqueb
Collaborator

maiqueb commented Nov 6, 2023

Hey @moshe010 ; all good.

Regarding jinja, you need to update it to latest version (we actually couldn't find an updated version, and thus replaced it).
Here's what was done in the multus dynamic networks repo.

Collaborator

@maiqueb maiqueb left a comment

Neat docs.

Just pointing out a typo.

docs/how-to-use.md (outdated, resolved)
@moshe010
Contributor Author

moshe010 commented Nov 7, 2023

I see, so that means someone will need to make a similar fix in this repo to resolve the jinja error.

@moshe010
Contributor Author

pushed fix #1189

@moshe010
Contributor Author

Are we OK with this change, or are we waiting for someone else to review it?
We would really appreciate help with reviewing it and getting it merged into multus.


#### Install DRA driver

The current example uses Nvidia DRA driver for networking. This DRA driver is not publicly available.
Member

Is there any plan to release this DRA driver? Or is it for internal use only and will never be released?

@vasrem
Contributor

vasrem commented Mar 18, 2024

@s1061123 @dougbtv:

Folks, I've added an integration test with an example DRA driver implementation as discussed in the last community meeting. Please have a look at the latest commit and let me know if that's sufficient to get this PR merged.

Notice the condition on running the integration test. This looks irrelevant to this PR. If you have any insights let me know.

Signed-off-by: Moshe Levi <moshele@nvidia.com>
Signed-off-by: Moshe Levi <moshele@nvidia.com>
Signed-off-by: Vasilis Remmas <vremmas@nvidia.com>
@vasrem
Contributor

vasrem commented Apr 11, 2024

@s1061123 @dougbtv:

We now run the e2e tests for both thick and thin as requested in the community meeting today. The difference is that all the DRA example Pods are running on the worker nodes. Before, the controller Pod was running on the control plane node where the network is not stable. I have added this PR #1259 to demonstrate what's failing and prove that this is not a regression introduced by this PR.

Dynamic Resource Allocation (DRA) is an alternative mechanism to device plugins that allows pods and containers to request resources. The feature is alpha in k8s 1.27.

The following sections describe how to use DRA with multus and Nvidia DRA driver. Other DRA networking driver vendors should follow similar concepts to make use of multus DRA support.

Member

Can we add some text along the lines of:

Dynamic Resource Allocation (DRA) is [currently in alpha](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/), and is subject to change. Please consider this functionality as a preview, to be used as a potential replacement for the device plugin for SR-IOV. The architecture and usage of DRA in Multus CNI may change in the future as this technology matures.

Contributor

Thank you, updated! Added the fact that DRA is in alpha as warning.

Signed-off-by: Vasilis Remmas <vremmas@nvidia.com>
Member

@dougbtv dougbtv left a comment

LGTM! Thanks for the patience getting this together.

@dougbtv dougbtv merged commit 9f5c023 into k8snetworkplumbingwg:master May 23, 2024
24 checks passed
@moshe010 moshe010 mentioned this pull request Jul 4, 2024
9 tasks
9 participants