xFail known bad tests on H100 and fix CVEs #549

Merged
merged 5 commits into from
Dec 19, 2024

Conversation

gagank1 (Collaborator) commented Dec 18, 2024

No description provided.

Known issue on H100 (and GH200) with loading checkpoints. Also fixes a CVE in the ARM container.
gagank1 (Collaborator, Author) commented Dec 18, 2024

/build-ci

Dockerfile.arm (review thread resolved)
jstjohn (Collaborator) left a comment

Approve, but please make the test xfail only when not on H100 (or see my suggestion about also testing for the cuDNN version, since I think it should work again with a newer cuDNN, e.g., when we upgrade the base PyTorch container to 24.10-py3).
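
A conditional xfail along the lines of this suggestion could be sketched as below. This is an illustration only, not the code merged in this PR. It assumes, per the PR title and description, that the failure is limited to H100 and GH200. The helper names (`is_affected_gpu`, `should_xfail`, `detect_gpu_and_cudnn`) and the `fixed_cudnn_version` threshold are hypothetical, and the cuDNN version in which the issue is resolved is not stated in this thread, so it is left as a parameter.

```python
from typing import Optional, Tuple

# Assumption: these substrings identify the affected GPUs in the device name
# reported by the driver (e.g. "NVIDIA H100 80GB HBM3").
AFFECTED_GPU_MARKERS = ("H100", "GH200")


def is_affected_gpu(device_name: Optional[str]) -> bool:
    """Return True if the device name looks like an H100 or GH200."""
    if not device_name:
        return False
    upper = device_name.upper()
    return any(marker in upper for marker in AFFECTED_GPU_MARKERS)


def should_xfail(
    device_name: Optional[str],
    cudnn_version: Optional[int],
    fixed_cudnn_version: Optional[int] = None,
) -> bool:
    """Decide whether the known-bad test should be marked xfail.

    fixed_cudnn_version is a hypothetical threshold: the first cuDNN build
    (in the integer encoding reported by PyTorch) where the test is expected
    to pass again. None means no fixed version is known yet.
    """
    if not is_affected_gpu(device_name):
        return False  # other GPUs are expected to pass
    if fixed_cudnn_version is None or cudnn_version is None:
        return True  # affected GPU and no evidence of a fix
    return cudnn_version < fixed_cudnn_version


def detect_gpu_and_cudnn() -> Tuple[Optional[str], Optional[int]]:
    """Query PyTorch for the GPU name and cuDNN version, if available."""
    try:
        import torch  # imported lazily so the helpers above work without it
    except ImportError:
        return None, None
    if not torch.cuda.is_available():
        return None, None
    return torch.cuda.get_device_name(0), torch.backends.cudnn.version()
```

In a test module, the result could be passed as the condition to `pytest.mark.xfail`, for example `@pytest.mark.xfail(should_xfail(*detect_gpu_and_cudnn()), reason="known checkpoint-loading issue on H100/GH200")`, so the test is still required to pass on every other configuration.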

gagank1 (Collaborator, Author) commented Dec 19, 2024

/build-ci

gagank1 enabled auto-merge (squash) December 19, 2024 21:06
gagank1 merged commit e9ed8cf into main Dec 19, 2024
4 checks passed
gagank1 deleted the gkaushik/gh200-hotfix-main branch December 19, 2024 22:50
5 participants