This repository has been archived by the owner on Mar 20, 2023. It is now read-only.
Pool Allocation Fails Due to Nvidia Installation #357
Comments
Can you please post your pool configuration file?
Apologies for the delay. This was due to an out-of-date driver. A fix will be applied in the next release.
alfpark added a commit that referenced this issue on Apr 26, 2021:
- Compute driver to 460.73.01 and CUDA 11.2
- Grid driver to 450.32.03 and CUDA 11.2
- Resolves #357
Hi @alfpark, I have a customer in production having this issue right now! They are a public health lab and they use the results for organ transplants, so truly a matter of life/death. Is there a workaround while the fix is put in place?
You can always override the default GPU driver via pool configuration options. As a workaround, you can temporarily modify your pool configuration to set:
    gpu:
      nvidia_driver:
        source: "https://us.download.nvidia.com/tesla/460.73.01/NVIDIA-Linux-x86_64-460.73.01.run"
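For context, here is a rough sketch of where this override would sit in a pool configuration file. It assumes the top-level `pool_specification` layout used by the Batch Shipyard recipes. The `id` value is a placeholder and is not taken from this thread. All other settings from the recipe would stay as they are.

```yaml
pool_specification:
  id: pytorch-gpu   # placeholder pool id (assumption, not from this thread)
  # ... remaining pool settings from the recipe stay unchanged ...
  gpu:
    nvidia_driver:
      # Override the default driver with the compute driver version named in the fix
      source: "https://us.download.nvidia.com/tesla/460.73.01/NVIDIA-Linux-x86_64-460.73.01.run"
```

Once a release containing the updated default driver is in use, the `gpu` override can presumably be removed again.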
alfpark added two further commits that referenced this issue on Mar 20, 2023, both with the message:
- Compute driver to 460.73.01 and CUDA 11.2
- Grid driver to 450.32.03 and CUDA 11.2
- Resolves #357
Problem Description
Batch Shipyard fails to allocate the pool because the Nvidia driver installation fails. The issue is reproducible with the provided PyTorch-GPU recipe.
Batch Shipyard Version
Docker, v3.9.1
Redacted Configuration
https://github.com/Azure/batch-shipyard/tree/master/recipes/PyTorch-GPU/config
Expected Results
The Nvidia installation completes correctly and the pool is created.
Actual Results
The Nvidia installation crashes and the pool is not created correctly.
Steps to Reproduce
shipyard stdout
startup/stderr.txt
Additional Comments
Similar issue which appears to be marked closed: #348
Note: no modifications were made to the provided sample recipe beyond connection details.