Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for nvidia MIG in Mesos containerizer #454

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

jblache
Copy link

@jblache jblache commented Jan 30, 2023

This PR contributes support for static MIG (multi-instance GPU) configurations in the Mesos containerizer, whereby nvidia GPUs can be divided into isolated instances to better use GPU resources.

With these changes, Mesos will be able to pick up the MIG configuration and properly report available MIG GPUs instead of reporting the underlying, physical GPUs.

I have made an attempt at playing nice with the --resources and --nvidia_gpu_devices agent options, however more scrutiny/work will definitely be beneficial in this area (we don't use these options in our setup).

In addition to this PR, an update to the NVML library is required. These changes were built and tested with NVML taken from the 460.73.01 driver package.

This code has been running in production on A100 GPUs for several months now and is being contributed back in the hope it will be useful and beneficial to the community. Much more can be done with MIG, and this only scratches the surface.

@bmahler
Copy link
Contributor

bmahler commented Jan 30, 2023

Hi @jblache, cool to see this getting upstreamed. A couple of suggestions:

  • Can you a more detailed overview of the implementation approach in the PR description? It's a bit hard to dive into reviewing this code from the current description.
  • To ease reviewing, typically we break apart patches. For example, in this PR, the nvml.hpp/cpp files can easily be their own commit for review purposes. I haven't used github PRs for reviewing stacks of commits, but I think if you just split the commits and use multiple in the PR, it should be at least reviewable in chunks, but probably not as directly committable in chunks. If you want to use reviewboard I think there is still support with ./support/post-reviews.py

If others chime in here willing to review, perhaps we can organize a meeting to review more easily. I'm not active in mesos lately but would be willing to provide some feedback here.

@jblache
Copy link
Author

jblache commented Jan 31, 2023

Pushed a more granular version of the branch, which at least splits the discovery vs. isolation bits. Original postback was squashed for internal review reasons mostly.

On the approach, the NVML wrapper was updated to add support for the calls needed to discover/query MIG GPUs, and a couple of those added interfaces have a bit of logic included that keeps the calling code cleaner and overall makes sense.

GPU enumeration will check for MIG and enumerate MIG devices if present, instead of the underlying GPU.

The isolator now has to include all the nvidia-caps device nodes, which is a lot of them, but anything more granular seems pretty involved and brittle (IIRC one of the reasons I didn't look into it more is that it would likely get in the the way of a dynamic configuration).

The tricky bit is reconciling device allocations upon restart with running jobs. It turned out to be not too bad, but the code grew accordingly there. I forget the details, but there's a bit more data that needs to be kept around to be able to match everything up, and the matching logic needs to look for MIG vs. not MIG, basically.

This was a tactical patch for us and, as you can see, it sat on my TODO list for bit before I could get it out to you. I don't have bandwidth currently to engage in a full-blown review process, but wanted to get the code out for anyone who might have similar needs, and as groundwork for more evolved capabilities around MIG.

@andreaspeters
Copy link
Contributor

For me it looks fine. How about you @cf-natali @qianzhangxa ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants