[Example] Distributed nccl test with OpenMPI #3693

Open: wants to merge 3 commits into master

Michaelvll (Collaborator) commented Jun 27, 2024

Our examples do not currently include an OpenMPI-based way to launch distributed programs. This PR adds one that runs on GCP.

Several issues were discovered while writing it:

  • OpenMPI and NCCL may be installed at different paths in each cloud's base image, which makes writing a portable task YAML painful.
  • We do not set up an SSH config that lets the head node reach the worker nodes, which makes it hard to set up the hostfile for OpenMPI. In this example, we work around it by uploading the private key ~/.ssh/sky-key to the remote cluster, which is not secure. Related: [Core] Allow ssh access from head to worker nodes #3690
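
To illustrate the hostfile step mentioned above, here is a minimal sketch of rendering an OpenMPI hostfile from the cluster's node IPs. It is not the code in this PR. It assumes the task runs under SkyPilot with `SKYPILOT_NODE_IPS` (newline-separated IPs) and `SKYPILOT_NUM_GPUS_PER_NODE` set in the environment; `build_hostfile` is a hypothetical helper name.

```python
import os


def build_hostfile(node_ips: str, slots_per_node: int) -> str:
    """Render OpenMPI hostfile text: one '<host> slots=<n>' line per node."""
    # Drop blank lines and stray whitespace from the newline-separated list.
    hosts = [ip.strip() for ip in node_ips.splitlines() if ip.strip()]
    return "".join(f"{host} slots={slots_per_node}\n" for host in hosts)


if __name__ == "__main__":
    # Assumption: both variables are exported by SkyPilot on each node.
    ips = os.environ.get("SKYPILOT_NODE_IPS", "")
    slots = int(os.environ.get("SKYPILOT_NUM_GPUS_PER_NODE", "1"))
    print(build_hostfile(ips, slots), end="")
```

The output can be redirected to a file and passed to `mpirun` with `--hostfile`. This only covers writing the hostfile; the head node still needs SSH access to the workers for `mpirun` to launch remote ranks.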

TODO:

  • Make this example runnable on any cloud.
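
One possible direction for this TODO, sketched under assumptions: probe for `mpirun` at setup time instead of hard-coding a path in the task YAML. The directories in `CANDIDATE_DIRS` are illustrative guesses and have not been verified against any cloud's base image; `find_executable` is a hypothetical helper.

```python
import os
import shutil
from typing import Iterable, Optional

# Assumption: example install locations only, not verified for any image.
CANDIDATE_DIRS = [
    "/usr/local/mpi/bin",
    "/opt/amazon/openmpi/bin",
    "/usr/lib64/openmpi/bin",
]


def find_executable(name: str, extra_dirs: Iterable[str] = ()) -> Optional[str]:
    """Return the path to `name` from PATH, else from `extra_dirs`, else None."""
    found = shutil.which(name)
    if found:
        return found
    for directory in extra_dirs:
        candidate = os.path.join(directory, name)
        # Accept only regular files that the current user can execute.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


if __name__ == "__main__":
    print(find_executable("mpirun", CANDIDATE_DIRS))
```

The same probing approach could be applied to the NCCL library directory, though that would need a different check (a shared library rather than an executable).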

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • sky launch -c nccl-test --cloud gcp --use-spot examples/distributed_nccl_test_with_mpi.yaml
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh
