Sample for Slurm NVIDIA GPU MPI Docker interaction.

This repository contains a sample showing how to write a Dockerfile and compile a program that uses MPI for a Slurm-managed cluster with NVIDIA enroot and pyxis.

Problem we solve

One needs to run a custom program, possibly written in CUDA, SYCL or OpenCL with MPI, on a cluster that supports only isolated containers. This requires a correct MPI build inside the container and correct interaction between the internal MPI in the container and the external MPI managed by Slurm via the srun command, which effectively acts as a cluster-managed mpiexec. The container operation and interaction are handled by NVIDIA enroot and pyxis installed on the cluster. However, I was unable to find a good manual on how to build a custom C++/CUDA C++ program for such systems when multiple GPUs are used via MPI and containerization is mandatory. This manual and simple example should help anyone facing the same complication. We solve the problem by taking an NVIDIA sample Docker image, which already contains a working MPI build, and using it to build our program.

Approach

This repository provides a sample program written in C++/CUDA C++; it is built by a simple Makefile inside a Docker container. It can be used to test a remote cluster or a local computer with multiple GPUs and MPI, and it also serves as an example of how to include your own code in the Docker image to be executed by Slurm with pyxis and MPI. You can use the provided Dockerfile to configure the image and build your own program instead of the provided sample. See the Dockerfile and the comments there for more details.
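To give an idea of what such a containerized test looks like, below is a minimal, illustrative sketch (not the repository's actual source in cuda_mpi_prog): each MPI rank binds to a GPU, pushes a buffer through the device, verifies an MPI_Allreduce, and prints PASS on success.

// minimal MPI + CUDA sanity-check sketch; illustrative only
#include <mpi.h>
#include <cuda_runtime.h>
#include <vector>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);              // launched by srun/pyxis, no mpirun needed
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int n_devices = 0;
    cudaGetDeviceCount(&n_devices);
    if (n_devices > 0)
        cudaSetDevice(rank % n_devices); // simplified one-GPU-per-rank mapping

    const int N = 1 << 20;
    const size_t bytes = N * sizeof(double);
    std::vector<double> host(N, 1.0), sum(N, 0.0);

    double* dev = nullptr;
    cudaMalloc(reinterpret_cast<void**>(&dev), bytes);
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost); // round trip through the GPU

    // sum the buffers of all ranks; each element should equal the number of ranks
    MPI_Allreduce(host.data(), sum.data(), N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    bool ok = true;
    for (int i = 0; i < N; ++i)
        ok = ok && (sum[i] == static_cast<double>(size));
    if (rank == 0)
        std::printf("%s\n", ok ? "PASS" : "FAIL");

    cudaFree(dev);
    MPI_Finalize();
    return ok ? 0 : 1;
}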

Definitions

Cluster is a remote distributed-memory multi-GPU system that has Slurm installed; it executes tasks via srun and supports NVIDIA enroot and pyxis. It allows only containerized workloads to be executed.

Local machine is your local computer that has access to the Cluster. We assume that the Local machine runs Linux and has Docker installed.

We also assume that docker and enroot have rootless access on your Local machine; otherwise, prefix each docker and enroot command on your Local machine with sudo (not recommended).

Instruction

On the Local machine

  1. On your Local machine, install NVIDIA enroot; instructions are found here.

  2. Clone this repository with submodules on your Local machine (the command below works for git version 2.13 and higher and clones into the directory ./slurm_gpu_mpi_docker):

git clone --recurse-submodules https://github.com/evstigneevnm/slurm_gpu_mpi_docker.git 

Then go to the root directory of the project:

cd slurm_gpu_mpi_docker

Optionally, modify the CUDA_ARCHITECTURE variable in cuda_mpi_prog/Makefile to match the GPU architecture on your Cluster. The defaults are sm_70 (e.g. V100) and sm_80 (e.g. A100).
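For reference only (this assumes the Makefile forwards the variable directly to nvcc; the source file name below is hypothetical), sm_80 corresponds to a compile flag such as:

nvcc -arch=sm_80 -c your_program.cu -o your_program.o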

  3. From the root of the repository, build the Docker image:
docker build -f docker_config/Dockerfile . -t slurmmpigpu/test:0.01

  4. Import the local Docker image into the enroot sqsh format:
enroot import dockerd://slurmmpigpu/test:0.01

This creates the file slurmmpigpu+test+0.01.sqsh in the repository directory. Optionally, check the hash of your local image, e.g. with sha224sum slurmmpigpu+test+0.01.sqsh.

  5. Copy this file to your working directory on the Cluster using rsync, scp or any other means.

On the Cluster

  1. Navigate to your working directory on the Cluster and check that the file was copied correctly from your Local machine. From now on we assume that your working directory on the Cluster is /user/workdir. Optionally, check the hash of the remote image, e.g. sha224sum ~/workdir/slurmmpigpu+test+0.01.sqsh, and compare it with the hash of your local sqsh file.

  2. The working directory inside the container is /mpi_cuda_sample, as configured in the Dockerfile; srun enters it automatically. To access other parts of the container, provide a full path to srun, as shown below.
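For example, the sample binary can also be invoked by its full path (assuming it sits in that directory, as configured by the Dockerfile):

srun -N1 -n1 -G1 --gpus-per-node=1 --container-image ~/workdir/slurmmpigpu+test+0.01.sqsh --container-entrypoint /mpi_cuda_sample/test_mpi_cuda.bin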

  3. Optionally, to get a shell inside your container, run:

srun --container-image ~/workdir/slurmmpigpu+test+0.01.sqsh --pty bash

WARNING! This command is not used for launching applications; it only opens a bash shell inside your container.

  4. Next, we execute the container. For example, we run our sample program with MPI on 4 nodes, each having 8 GPUs:
srun -N4 -n32 -G32 --gpus-per-node=8 --container-image ~/workdir/slurmmpigpu+test+0.01.sqsh --container-entrypoint test_mpi_cuda.bin

It should return PASS for all reduce operations.

WARNING! One does not need to execute mpiexec or mpirun inside the container. The MPI launch is configured automatically by pyxis.

The common execution template is:

srun -N number_of_nodes -n number_of_procs -G number_of_gpus --gpus-per-node=X --container-image ~/path/to/container.sqsh --container-mounts=/full/path/to/data:/local/container/data  --container-entrypoint containerized_program program_command_line_parameter[, program_command_line_parameter ...]
  • -N number_of_nodes is the number of nodes to run on. Each node can contain several GPUs.
  • -G number_of_gpus is the total number of GPUs to be passed to the container.
  • -n number_of_procs is the global number of MPI processes, usually equal to the total number of GPUs.
  • --gpus-per-node=X is the number of GPUs per node (Slurm assumes the nodes are GPU-homogeneous).

WARNING! This sample program will terminate if srun is configured incorrectly with respect to the GPUs. If each node of the Cluster has 8 GPUs and one attempts to run with the -G 10 option, this results in incorrect GPU binding. Hence GPU resources can only be acquired in multiples of a full node. Example of a CORRECT LAUNCH:

srun -N3 -n17 -G24 --gpus-per-node=8 --container-image ~/workdir/slurmmpigpu+test+0.01.sqsh --container-entrypoint test_mpi_cuda.bin

which spawns 17 processes, each taking one GPU of a node. It uses 3 nodes (-N3), each node having 8 GPUs (--gpus-per-node=8), and pyxis maps 3 * 8 = 24 GPUs (-G24).

INCORRECT LAUNCH:

srun -N3 -n17 -G17 --gpus-per-node=8 --container-image ~/workdir/slurmmpigpu+test+0.01.sqsh --container-entrypoint test_mpi_cuda.bin

will throw an exception. The incorrect behavior is caused by the mapping of 17 GPUs, which is not a multiple of 8.

Additional options for pyxis srun are found here.

Options that I find useful:

  • --container-mounts=SRC:DST, where SRC is a FULL path to the directory on the Cluster and DST is the FULL path inside the container. Through this mount your program can load data from and save data to the Cluster.
  • --container-entrypoint is the program inside the container to be executed in parallel by the slurm managed MPI.
  • --container-env=NAME[,NAME...] overrides environment variables in the container, such as PATH, LD_LIBRARY_PATH, etc.; i.e. the listed system variables from the Cluster override variables with the same name in the container.
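For example, an illustrative run that also mounts a data directory from the Cluster into the container (the paths are hypothetical) could look like:

srun -N4 -n32 -G32 --gpus-per-node=8 --container-image ~/workdir/slurmmpigpu+test+0.01.sqsh --container-mounts=/user/workdir/data:/mpi_cuda_sample/data --container-entrypoint test_mpi_cuda.bin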
