Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] Fix A10 GPU on Azure #3707

Open
wants to merge 8 commits into
base: master
Choose a base branch
from
Open

[Core] Fix A10 GPU on Azure #3707

wants to merge 8 commits into from

Conversation

cblmemo
Copy link
Collaborator

@cblmemo cblmemo commented Jun 28, 2024

Fixes #3651. A10 GPUs require a special type of driver, Grid Driver, which is suggested to be installed by enabling the NvidiaGpuDriverLinux extension. This PR added this extension for A10 instances.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • The YAML in the issue and nvidia-smi works well.
    • The YAML in the issue, Qwen model e2e.
    • sky launch --cloud azure -c az-wo-a10 and checked that the drivers is not installed
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@Michaelvll Michaelvll changed the title [Core FIx A10 GPU on Azuew [Core] Fix A10 GPU on Azure Jun 28, 2024
@Michaelvll Michaelvll added the P0 label Jun 28, 2024
@cblmemo cblmemo added this to the v0.6.1 milestone Jun 29, 2024
@cblmemo cblmemo marked this pull request as ready for review June 29, 2024 02:30
@cblmemo
Copy link
Collaborator Author

cblmemo commented Jun 29, 2024

To discuss: Should we put the template into a separate file like https://github.com/skypilot-org/skypilot/blob/master/sky/skylet/providers/azure/azure-config-template.json

return

# Configure driver extension for A10 GPUs
create_result = poller.result().as_dict()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should merge the parameters for A10 to original parameters to avoid two create_or_update which can cause significant overhead? Please help profile it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! I benchmarked and it reduces the provision time from 15m15s to 14m36s.

@cblmemo
Copy link
Collaborator Author

cblmemo commented Jul 2, 2024

Since now we got more quotas, I also tested the YAML in the issue and it works in e2e as well 🫡

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update @cblmemo!

sky/skylet/providers/azure/node_provider.py Show resolved Hide resolved
Comment on lines 310 to 311
accs = azure_catalog.get_accelerators_from_instance_type(instance_type)
if accs is not None and "A10" in accs:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to avoid using azure_catalog in the node_provider as that seems to be an abstraction leakage, but I am ok to keep it as is for now.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, how about in the azure-ray.yaml we pass a need_nvidia_driver_extension: true within clouds/azure.py::make_deployable_variables. There we can check the accelerators, and get rid of the use of catalog here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done! PTAL 🫡

Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update @cblmemo! Please check the comment above. We can get rid of the use of catalog in the node_provider with that. Other parts look good to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

No CUDA drivers in Azure A10
2 participants