add support for Dynamic Resource Allocation #1078

moshe010 · 2023-04-24T08:29:28Z

Dynamic Resource Allocation (DRA) was added in k8s 1.26 [1].
In k8s 1.27 we extended the PodResources API to expose dynamic resources from the kubelet [2].

This PR allows Multus to get dynamic resources from the PodResources API and pass them to the CNI as DeviceID.
To use this you will need k8s 1.27 with the DynamicResourceAllocation and KubeletPodResourcesDynamicResources feature gates enabled.

[1] - https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3063-dynamic-resource-allocation
[2] - https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3695-pod-resources-for-dra

coveralls · 2023-04-24T08:44:02Z

coverage: 63.065% (-0.1%) from 63.17%
when pulling c9d411c on moshe010:dra
into c6a371b on k8snetworkplumbingwg:master.

pkg/kubeletclient/kubeletclient.go

bn222 · 2023-06-27T12:45:26Z

Moshe, small nit before we look into this: please split up the commit into "upgrade kube to v0.27.0 + vendor that" and then do the remaining "real" changes. This would make it easier to review.

moshe010 · 2023-06-29T07:17:17Z

> Moshe, small nit before we look into this: please split up the commit into "upgrade kube to v0.27.0 + vendor that" and then do the remaining "real" changes. This would make it easier to review.

Sure will do

moshe010 · 2023-07-09T18:35:51Z

Are there any more comments on this PR?

bn222 · 2023-07-13T08:50:01Z

The main comment from my side (without looking into exactly how DRA works) is that we might want a function that gets or adds the dynamic resources. The current change adds the for-loop inline, which increases the nesting quite a bit.
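
As a rough illustration of that suggestion, the nested loop could be moved into a helper along the lines of the sketch below. This is not the code from this PR: the types are simplified, hypothetical stand-ins for the kubelet PodResources API messages, and `collectDynamicResources`, `ResourceInfo`, and the `<claim namespace>/<claim name>` map key are assumptions made for the example.

```go
package main

import "fmt"

// Simplified, hypothetical stand-ins for the kubelet PodResources API
// messages; the real generated types carry more fields.
type CDIDevice struct{ Name string }

type ClaimResource struct{ CDIDevices []CDIDevice }

type DynamicResource struct {
	ClaimNamespace string
	ClaimName      string
	ClaimResources []ClaimResource
}

// ResourceInfo is an illustrative holder for the device IDs of one resource.
type ResourceInfo struct{ DeviceIDs []string }

// collectDynamicResources adds the CDI device names of every claim to
// resourceMap, keyed by "<claim namespace>/<claim name>" (an assumed key
// format). Keeping the nested loops here keeps the caller flat.
func collectDynamicResources(dynamicResources []DynamicResource, resourceMap map[string]*ResourceInfo) {
	for _, dr := range dynamicResources {
		key := dr.ClaimNamespace + "/" + dr.ClaimName
		for _, cr := range dr.ClaimResources {
			for _, dev := range cr.CDIDevices {
				info, ok := resourceMap[key]
				if !ok {
					info = &ResourceInfo{}
					resourceMap[key] = info
				}
				info.DeviceIDs = append(info.DeviceIDs, dev.Name)
			}
		}
	}
}

func main() {
	resources := []DynamicResource{{
		ClaimNamespace: "default",
		ClaimName:      "net-claim",
		ClaimResources: []ClaimResource{{CDIDevices: []CDIDevice{{Name: "dev0"}, {Name: "dev1"}}}},
	}}
	resourceMap := map[string]*ResourceInfo{}
	collectDynamicResources(resources, resourceMap)
	fmt.Println(resourceMap["default/net-claim"].DeviceIDs) // prints [dev0 dev1]
}
```

With a helper like this, the calling code only needs one extra call per container instead of three more levels of nesting.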

After reading the code, I want to make sure I understand what's going on here. @moshe010, is there a reason why we don't want to change pkg/checkpoint/checkpoint.go?

Most importantly, @moshe010 , Vrinda from my team will be taking a look into this in more detail. Please allow a bit more time for some comments. We want to understand how this would integrate with sriov network operator.

moshe010 · 2023-07-16T08:38:09Z

> The main comment (without looking into how DRA exactly works) from my side is that we might want to have a function that gets or adds the dynamic resources. The current change adds the for-loop in line, which increases the nesting quite a bit.

Sure, I will create a separate function for dynamic resources.

> After reading the code, I want to make sure I understand what's going on here. @moshe010 , is there are reason why we don't want to change pkg/checkpoint/checkpoint.go ?

The checkpoint support in Multus was introduced before the pod resources API existed for device plugins.
For DRA, the pod resources API will always exist, so we don't need to support the fallback.
Using the checkpoint in general is problematic because its format may change between k8s versions.

> Most importantly, @moshe010 , Vrinda from my team will be taking a look into this in more detail. Please allow a bit more time for some comments. We want to understand how this would integrate with sriov network operator.

wizhaoredhat · 2023-08-23T15:08:44Z

LGTM

vrindle · 2023-08-25T21:08:05Z

@moshe010 @bn222 I deployed these changes on my custom Multus setup and ran manual tests on my system to make sure nothing breaks. It looks good. Since these changes are additive, this was expected. LGTM.

moshe010 · 2023-09-10T23:46:52Z

Is anyone else required to review this, or can we merge it?

maiqueb · 2023-09-11T07:07:00Z

Can you update the docs?

I am looking for a couple of paragraphs about the feature and some user documentation on how the user can use it.

moshe010 · 2023-09-11T08:09:04Z

> Can you update the docs ?
>
> I am looking for a couple paragraphs about the feature and some user documentation on how can the user it.

Sure, do you want me to add a DRA section here: https://github.com/k8snetworkplumbingwg/multus-cni/blob/master/docs/how-to-use.md?

maiqueb · 2023-09-11T08:10:16Z

> > Can you update the docs ?
> > I am looking for a couple paragraphs about the feature and some user documentation on how can the user it.
>
> sure, you want to me to add DRA section here https://github.com/k8snetworkplumbingwg/multus-cni/blob/master/docs/how-to-use.md?

Unsure what's the best place ...

But yeah, let's put it there, if needed you'll move it around.

Asking @s1061123 / @dougbtv 's opinion on this.

moshe010 · 2023-11-06T11:36:35Z

The CI failure doesn't seem to be related to the PR, see [1].
@maiqueb sorry for the late response, I updated the doc in how-to-use.md

[1] -
Notice: A new release of pip is available: 23.2.1 -> 23.3.1
Notice: To update, run: pip install --upgrade pip
Traceback (most recent call last):
  File "/home/runner/.local/bin/j2", line 5, in <module>
    from j2cli import main
  File "/home/runner/.local/lib/python3.12/site-packages/j2cli/__init__.py", line 4, in <module>
    import pkg_resources
ModuleNotFoundError: No module named 'pkg_resources'
Error: Process completed with exit code 1.

maiqueb · 2023-11-06T11:39:35Z

> The CI failed don't seem to be related to the PR see [1]. @maiqueb sorry for the late response, I updated the doc in how-to-use.md
>
> [1] - Notice: A new release of pip is available: 23.2.1 -> 23.3.1 Notice: To update, run: pip install --upgrade pip Traceback (most recent call last): File "/home/runner/.local/bin/j2", line 5, in <module> from j2cli import main File "/home/runner/.local/lib/python3.12/site-packages/j2cli/__init__.py", line 4, in <module> import pkg_resources ModuleNotFoundError: No module named 'pkg_resources' Error: Process completed with exit code 1.

Hey @moshe010 ; all good.

Regarding jinja, you need to update it to latest version (we actually couldn't find an updated version, and thus replaced it).
Here's what was done in the multus dynamic networks repo.

maiqueb

Neat docs.

Just pointing out a typo.

docs/how-to-use.md

moshe010 · 2023-11-07T10:16:27Z

> Hey @moshe010 ; all good.
>
> Regarding jinja, you need to update it to latest version (we actually couldn't find an updated version, and thus replaced it). Here's what was done in the multus dynamic networks repo.

I see, so it means that someone will need to apply a similar fix to this repo to resolve the jinja error.

moshe010 · 2023-11-20T12:51:02Z

> I see so it mean that some will need to do similar fix to this repo to fix the jinja error.

Pushed a fix in #1189.

moshe010 · 2024-02-21T13:11:05Z

Are we OK with this change? Are we waiting for someone else to review it?
We would really appreciate help in reviewing it and getting it merged into Multus.

s1061123 · 2024-03-01T05:02:32Z

docs/how-to-use.md

+
+#### Install DRA driver
+
+The current example uses Nvidia DRA driver for networking. This DRA driver is not publicly available.


Is there any plan to release this DRA driver? Or is it for internal use only and will never be released?

vasrem · 2024-03-18T16:22:59Z

@s1061123 @dougbtv:

Folks, I've added an integration test with an example DRA driver implementation, as discussed in the last community meeting. Please have a look at the latest commit and let me know if that's sufficient to get this PR merged.

Notice the condition on running the integration test. This looks irrelevant to this PR. If you have any insights let me know.

vasrem · 2024-04-11T18:09:24Z

@s1061123 @dougbtv:

We now run the e2e tests for both thick and thin as requested in the community meeting today. The difference is that all the DRA example Pods are running on the worker nodes. Before, the controller Pod was running on the control plane node where the network is not stable. I have added this PR #1259 to demonstrate what's failing and prove that this is not a regression introduced by this PR.

dougbtv · 2024-05-09T13:42:37Z

docs/how-to-use.md

+Dynamic Resource Allocation is alternative mechanism to device plugin which allow to requests pod and container resources. The feature is alpha in k8s 1.27.
+
+The following sections describe how to use DRA with multus and Nvidia DRA driver. Other DRA networking driver vendors should follow similar concepts to make use of multus DRA support.
+


Can we add some text along the lines of:

> Dynamic Resource Allocation (DRA) is [currently an alpha](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/) feature and is subject to change. Please consider this functionality a preview and a potential replacement for the device plugin for SR-IOV. The architecture and usage of DRA in Multus CNI may change in the future as this technology matures.

vasrem · 2024-05-13

Thank you, updated! I added the fact that DRA is in alpha as a warning.

dougbtv

LGTM! Thanks for the patience getting this together.

moshe010 force-pushed the dra branch from e1d7d89 to 42a31dc on April 24, 2023 08:40

maiqueb reviewed Jun 8, 2023

pkg/kubeletclient/kubeletclient.go (outdated, resolved)

moshe010 force-pushed the dra branch 2 times, most recently from 8588258 to 812d897 on June 25, 2023 08:49

moshe010 force-pushed the dra branch from 812d897 to 791f02d on June 29, 2023 07:23

moshe010 force-pushed the dra branch 2 times, most recently from ec58c31 to bedad0c on August 22, 2023 11:36

moshe010 force-pushed the dra branch from bedad0c to b8cf2e5 on November 6, 2023 11:25

maiqueb reviewed Nov 6, 2023

docs/how-to-use.md (outdated, resolved)

moshe010 force-pushed the dra branch from b8cf2e5 to 5a52e8c on November 7, 2023 09:32

moshe010 force-pushed the dra branch from 5a52e8c to 2de3986 on December 25, 2023 06:55

s1061123 requested changes Mar 14, 2024

vasrem force-pushed the dra branch from 2de3986 to c207c41 on March 18, 2024 16:17

moshe010 added 2 commits April 11, 2024 19:16

add support for Dynamic Resource Allocation

40378ca

Signed-off-by: Moshe Levi <moshele@nvidia.com>

support for Dynamic Resource Allocation doc update

202533c

Signed-off-by: Moshe Levi <moshele@nvidia.com>

vasrem force-pushed the dra branch from c207c41 to 19e6da4 on April 11, 2024 17:42

Add DRA Integration E2E test

2c796b5

Signed-off-by: Vasilis Remmas <vremmas@nvidia.com>

vasrem force-pushed the dra branch from 19e6da4 to 2c796b5 on April 11, 2024 17:53

vasrem mentioned this pull request Apr 11, 2024

Add e2e tests to capture connectivity issues between Pods running on Control Plane nodes and the API server #1259

Closed

dougbtv requested changes May 9, 2024

Add warning in docs that DRA is alpha and in preview

c9d411c

Signed-off-by: Vasilis Remmas <vremmas@nvidia.com>

dougbtv approved these changes May 23, 2024

dougbtv merged commit 9f5c023 into k8snetworkplumbingwg:master May 23, 2024
24 checks passed

moshe010 mentioned this pull request Jul 4, 2024

DRA for 1.31 kubernetes/kubernetes#125488

Merged

dougbtv mentioned this pull request Oct 9, 2024

[KEP-4817]: DRA: Resource Claim Status with possible standardized network interface data kubernetes/enhancements#4861

Merged
