Use minimal CUDA container for PyTorch GitHub build #1091

Merged: 7 commits, Aug 13, 2024

timmoon10
Collaborator

Description

Our GitHub PyTorch build tests have been failing since August 2 (last success at 10:59:49 GMT, first failure at 15:37:12 GMT). Out-of-memory (OOM) errors occur while loading the 24.05 NGC PyTorch container (used since June with #919). Nothing seems to have changed in the container (the downloaded hashes are the same), so I suspect something has changed in the GitHub runners. This isn't the first time we've run into memory issues with the PyTorch container (see #462).

This PR modifies the PyTorch build test so it uses a minimal CUDA container, similar to the core build test.
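
As a rough illustration of the approach, a PyTorch build job running in a minimal CUDA container might look like the sketch below. The image tag, package list, and step names are assumptions chosen for illustration and are not copied from this PR's diff. `NVTE_FRAMEWORK` is assumed here to be the environment variable that selects which framework extension to build.

```yaml
# Hypothetical sketch of a job in .github/workflows/build.yml.
# Image tag, packages, and step names are illustrative assumptions.
jobs:
  pytorch:
    name: 'PyTorch'
    runs-on: ubuntu-latest
    container:
      # Minimal CUDA development image instead of the full NGC PyTorch container
      image: nvcr.io/nvidia/cuda:12.1.0-devel-ubuntu22.04
      options: --user root
    steps:
      - name: 'Dependencies'
        run: |
          apt-get update
          apt-get install -y git python3 python3-pip
          # The base image ships no Python packages, so PyTorch, NumPy,
          # and the build tools are installed explicitly
          pip install cmake ninja torch numpy pybind11
      - name: 'Checkout'
        uses: actions/checkout@v4
        with:
          submodules: recursive
      - name: 'Build'
        run: pip install . -v
        env:
          NVTE_FRAMEWORK: pytorch
```

The trade-off is that the job must install its own dependencies at run time, but the runner no longer has to pull and unpack the much larger framework container.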

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactor
  • Testing

Changes

  • Use CUDA container for PyTorch GitHub build

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Tim Moon <tmoon@nvidia.com>
timmoon10 added the "testing" label (Improvements to tests or testing infrastructure) on Aug 9, 2024
timmoon10 and others added 3 commits August 12, 2024 10:22
timmoon10 marked this pull request as ready for review on August 13, 2024 00:50
ptrendx merged commit dcc50c8 into NVIDIA:main on Aug 13, 2024
14 checks passed
timmoon10 deleted the minimal-github-pytorch-build branch on August 14, 2024 17:47
mgoldfarb-nvidia pushed a commit to mgoldfarb-nvidia/TransformerEngine that referenced this pull request Aug 14, 2024
* Use minimal CUDA container for PyTorch GitHub build

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Accidentally installed PyTorch in wrong test

Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

* Debug sanity test

Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

* Install PyTorch build dependencies

Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

* Include NumPy as a dependency

Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

* Disable sanity import test

Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>