Use minimal CUDA container for PyTorch GitHub build #1091
Merged
Signed-off-by: Tim Moon <tmoon@nvidia.com>
timmoon10
commented
Aug 9, 2024
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
timmoon10
commented
Aug 9, 2024
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
timmoon10
commented
Aug 12, 2024
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
ptrendx
approved these changes
Aug 13, 2024
mgoldfarb-nvidia
pushed a commit
to mgoldfarb-nvidia/TransformerEngine
that referenced
this pull request
Aug 14, 2024
* Use minimal CUDA container for PyTorch GitHub build
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Accidentally installed PyTorch in wrong test
  Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
* Debug sanity test
  Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
* Install PyTorch build dependencies
  Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
* Include NumPy as a dependency
  Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
* Disable sanity import test
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
---------
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Description
Our GitHub PyTorch build tests have been failing since August 2 (last success at 10:59:49 GMT, first failure at 15:37:12 GMT). Out-of-memory (OOM) errors occur while loading the 24.05 NGC PyTorch container, which we have used since June (#919). Nothing appears to have changed in the container itself (the downloaded hashes are the same), so I suspect something has changed in the GitHub runners. This is not the first time we've run into memory issues with the PyTorch container (see #462).
This PR modifies the PyTorch build test so it uses a minimal CUDA container, similar to the core build test.
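For readers unfamiliar with the setup, the general shape of such a job might look like the sketch below. It is an illustration only, not the workflow changed in this PR: the job name, image tag, package list, and the `NVTE_FRAMEWORK` build variable are all assumptions, and any further build dependencies are omitted.

```yaml
# Hypothetical sketch of a GitHub Actions build job; not the actual workflow from this PR.
jobs:
  pytorch:
    name: 'PyTorch'
    runs-on: ubuntu-latest
    container:
      # Assumed image tag: a CUDA devel image in place of the full NGC PyTorch container.
      image: nvcr.io/nvidia/cuda:12.1.0-devel-ubuntu22.04
      options: --user root
    steps:
      - name: 'Dependencies'
        run: |
          apt-get update
          apt-get install -y git python3 python3-pip cmake ninja-build
          # PyTorch and NumPy are not preinstalled in a bare CUDA image,
          # so they are installed explicitly (package list is an assumption).
          pip install torch numpy
      - name: 'Checkout'
        uses: actions/checkout@v3
        with:
          submodules: recursive
      - name: 'Build'
        run: pip install . -v
        env:
          # Assumed build setting that selects the PyTorch extension.
          NVTE_FRAMEWORK: pytorch
```

The trade-off is that the runner pulls a much smaller image, which should avoid the OOM while loading the container, at the cost of installing PyTorch and its build dependencies during the job.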