This repo covers material from the Grads@NERSC event. It includes minimal example scripts that show how to move from Jupyter notebooks to scripts that can run on multiple GPUs (and multiple nodes) on the Perlmutter supercomputer at NERSC.
For testing or small-scale runs, we recommend running machine learning workloads in a Jupyter notebook, which lets you work interactively. At NERSC, this can be done easily through JupyterHub with the following steps:
- Visit JupyterHub
- Select the required resource on Perlmutter and navigate to your local notebook. A minimal notebook is shared here as an example -- most workflows will build upon these basic building blocks. The notebook defines a common PyTorch workflow that includes:
  - defining the neural network architecture with `torch.nn` and PyTorch data loaders for processing the data
  - a general optimization harness with a training and validation loop over the respective datasets (a minimal sketch appears after this list).
- There are built-in Jupyter kernels for machine learning that are based on the PyTorch modules (similar ones exist for the other ML frameworks as well). A quick recommendation is to use the `pytorch-2.0.1` kernel to get PyTorch 2.0 features. For building custom kernels based on your conda environment or other means, please see the NERSC docs. The docs are also a great resource on how to use Jupyter on Perlmutter in general. You may also review the best practices for JupyterHub.
- Another quick way to install your own libraries on top of what is included in the modules is to simply run `pip install --user <library_name>`. The `--user` flag will always install libraries into the path defined by the environment variable `PYTHONUSERBASE`.
- Quick note on libraries: while you can build your own conda environment, the other recommendation is to use modules or containers. In either case, if you need libraries in addition to what is already provided, use the `--user` flag so that the libraries are installed in `PYTHONUSERBASE`. For modules, this variable is defined by default; for containers, we recommend defining it to point at some local location so that user-defined libraries do not interfere with the default environment.
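For reference, here is a minimal sketch of the kind of workflow such a notebook contains. The synthetic data, layer sizes, and hyperparameters below are illustrative placeholders, not the contents of the actual example notebook:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder synthetic data standing in for whatever the notebook actually loads
x, y = torch.randn(1024, 16), torch.randn(1024, 1)
train_loader = DataLoader(TensorDataset(x[:768], y[:768]), batch_size=64, shuffle=True)
val_loader = DataLoader(TensorDataset(x[768:], y[768:]), batch_size=64)

# Neural network architecture defined with torch.nn
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# General optimization harness: training and validation loops
for epoch in range(5):
    model.train()
    for xb, yb in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()
    model.eval()
    with torch.no_grad():
        val_loss = sum(loss_fn(model(xb), yb).item() for xb, yb in val_loader) / len(val_loader)
    print(f"epoch {epoch}: val_loss={val_loss:.4f}")
```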
As you move to larger workloads, the general recommendation is to use scripts -- this is especially true for multi-GPU workloads, since they are tricky to get working from Jupyter notebooks. We also recommend organizing parts of your code, such as data loaders and neural network definitions, into separate subdirectories for a more modular codebase. This makes it easier to add and modify features as the code evolves (one possible layout is sketched below).
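One possible layout, purely for illustration (the directory names here are hypothetical, not necessarily those used in this repo):

```
project/
├── networks/      # neural network definitions (torch.nn modules)
├── data_loaders/  # dataset and DataLoader utilities
├── utils/         # checkpointing, logging, learning-rate schedules
└── train.py       # top-level training script
```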
- The example notebook has been converted into the `train_single_gpu` script with a class structure that allows for easier extension into custom workflows.
- We have also added two additional routines that implement the checkpoint-restart function. This allows you to resume training from where a previous run ended by loading a saved model checkpoint (along with the optimizer and learning rate scheduler states). We highly recommend you consider checkpoint-restart when submitting jobs, since the Perlmutter `regular` GPU queue has a maximum job time limit of 24 hours (training for longer than that will require multiple jobs with checkpoint-restart). A minimal sketch of such routines appears after this list.
- To run a quick single-GPU script, follow these steps:
  - Request an interactive node with `salloc --nodes 1 --qos interactive -t 30 -C gpu -A <your_account>`
  - Load a default PyTorch module environment with `module load pytorch/2.0.1`. You may also use your own conda environment with `module load conda; conda activate <your_env>` (plus loading any background modules, such as `cudnn`, needed by your conda environment).
  - Run `python train_single_gpu.py`
- The script will save model checkpoints every epoch to the `outputs` directory, as well as the best model checkpoint, which tracks the lowest validation loss (a general strategy for choosing the best model while avoiding overfitting).
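As a minimal sketch of checkpoint-restart (the function names and checkpoint keys here are assumptions for illustration; the actual routines in `train_single_gpu` may differ):

```python
import torch

def save_checkpoint(path, model, optimizer, scheduler, epoch):
    # Save everything needed to resume training exactly where it stopped
    torch.save({
        'epoch': epoch,
        'model_state': model.state_dict(),
        'optimizer_state': optimizer.state_dict(),
        'scheduler_state': scheduler.state_dict(),
    }, path)

def restore_checkpoint(path, model, optimizer, scheduler, device):
    # Load the saved states back into the existing objects
    checkpoint = torch.load(path, map_location=device)
    model.load_state_dict(checkpoint['model_state'])
    optimizer.load_state_dict(checkpoint['optimizer_state'])
    scheduler.load_state_dict(checkpoint['scheduler_state'])
    return checkpoint['epoch'] + 1  # epoch to resume from
```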
To speed up training on large datasets, the most common strategy is data parallelism. The easiest framework here is PyTorch DistributedDataParallel (DDP), for which PyTorch provides comprehensive tutorials. Following these, we can convert our single-GPU script to multi-GPU using these simple steps (a consolidated sketch follows the list):
- Initialize `torch.distributed` using `torch.distributed.init_process_group(backend='nccl', init_method='env://')`. This will pick up the `world_size` (total number of GPUs used) and `rank` (rank of the current GPU) from the environment variables that you will need to set before running -- the submit/launch scripts below show how to do that.
- Set the local GPU device using the `LOCAL_RANK` environment variable (defined similarly to the above) with `local_rank = int(os.environ["LOCAL_RANK"]); torch.cuda.set_device(local_rank)`
- Wrap the model with `DistributedDataParallel` using `model = DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)`
- Proceed with training as you would on a single GPU. DDP will automatically synchronize the gradients across devices when you call `loss.backward()` during backpropagation.
- Clean up the process group after training with `dist.destroy_process_group()`
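Putting these steps together, a minimal sketch of the multi-GPU changes might look like the following. The placeholder model and dataset stand in for whatever your single-GPU script defines, and the `DistributedSampler` (standard practice, though not enumerated in the steps above) gives each rank its own shard of the data:

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# 1. Initialize the process group from environment variables set by the launch script
dist.init_process_group(backend='nccl', init_method='env://')

# 2. Pin this process to its local GPU
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# 3. Build the model on the local GPU and wrap it with DDP
# (placeholder model and dataset; substitute your own from the single-GPU script)
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1)).cuda(local_rank)
model = DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# A DistributedSampler gives each rank a distinct shard of the dataset
dataset = TensorDataset(torch.randn(1024, 16), torch.randn(1024, 1))
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=64, sampler=sampler)

# 4. Train as usual; DDP synchronizes gradients during loss.backward()
for epoch in range(2):
    sampler.set_epoch(epoch)  # reshuffle shards across ranks each epoch
    for xb, yb in loader:
        xb, yb = xb.cuda(local_rank), yb.cuda(local_rank)
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()

# 5. Clean up the process group
dist.destroy_process_group()
```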
To launch the job:
- Define the environment variables that allow `torch.distributed` to set up the distributed environment (`world_size` and `rank`). We implement this in a bash script that can be sourced when you allocate multiple GPUs using `srun` (a sketch of the variable mapping follows this list).
- Set `$MASTER_ADDR` with `export MASTER_ADDR=$(hostname)`
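For context, the mapping that these variables typically follow on a SLURM system is sketched below in Python (the actual `export_DDP_vars.sh` in this repo is a bash script and may differ in detail):

```python
import os

# Map SLURM's per-process information onto the variables that
# torch.distributed's init_method='env://' expects. This is an illustrative
# Python equivalent of what a sourced script like export_DDP_vars.sh sets up.
os.environ["RANK"] = os.environ["SLURM_PROCID"]         # global rank of this process
os.environ["WORLD_SIZE"] = os.environ["SLURM_NTASKS"]   # total number of processes (GPUs)
os.environ["LOCAL_RANK"] = os.environ["SLURM_LOCALID"]  # rank within the current node
os.environ.setdefault("MASTER_PORT", "29500")           # any free port, identical on all ranks
# MASTER_ADDR is exported separately in the batch script: export MASTER_ADDR=$(hostname)
```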
We implement these in our submit scripts that you can launch:
- If you are using PyTorch modules (or your own conda env), submit submit_batch_modules.sh with `sbatch submit_batch_modules.sh`. Note that in the script we have:
  - Loaded the environment with `module load pytorch/2.0.1`
  - Set up `$MASTER_ADDR` with `export MASTER_ADDR=$(hostname)`
  - Sourced the environment variables within the `srun` command with `source export_DDP_vars.sh`: these will set the necessary variables for `torch.distributed` based on the resources allocated by `srun`.
- For running with shifter (containers), see submit_batch_shifter.sh. The commands are mostly the same, except we use a containerized environment. Both scripts currently submit a 2-node job, but this can be changed to any number of nodes.
- We recommend using containers for more optimized libraries and better performance. NERSC provides PyTorch containers based on the NVIDIA GPU Cloud (NGC) containers. To query the list of PyTorch containers on Perlmutter, you can use `shifterimg images | grep pytorch`. The example shifter submit scripts use the container `nersc/pytorch:ngc-23.07-v0`.
- Quick DDP note: checkpoint-restart needs a slight modification if you try to load a model without the `DistributedDataParallel` wrapper (for example, if you are doing inference on a single GPU) that was trained on multiple GPUs (using the `DistributedDataParallel` wrapper):
from collections import OrderedDict  # needed at the top of the script

try:
    # Checkpoint saved without the DDP wrapper loads directly
    self.model.load_state_dict(checkpoint['model_state'])
except RuntimeError:
    # Checkpoint saved from a DDP-wrapped model: strip the 'module.' prefix
    new_state_dict = OrderedDict()
    for key, val in checkpoint['model_state'].items():
        name = key[7:]  # drop the leading 'module.' (7 characters)
        new_state_dict[name] = val
    self.model.load_state_dict(new_state_dict)
Models wrapped with DDP store their parameters with an extra `module.` prefix on each key, which needs to be removed before loading into an unwrapped model. The above lines in the scripts take care of this automatically.
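A common alternative that avoids the key renaming altogether (not what these scripts do, just a frequently used convention) is to save the unwrapped weights in the first place:

```python
# 'model' here is assumed to be the DDP-wrapped model from the training loop.
# Saving model.module.state_dict() writes keys without the 'module.' prefix,
# so the checkpoint later loads directly into an unwrapped model.
torch.save({'model_state': model.module.state_dict()}, 'outputs/checkpoint.pt')
```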
- For logging application-specific metrics/visualizations and automatic hyperparameter optimization (HPO), we recommend Weights & Biases. See this tutorial, which extends the above scripts to include Weights & Biases logging and automatic HPO on multi-GPU tests (a minimal logging sketch appears after this list).
- Before moving to data parallelism, we first recommend that you optimize your code to run on single GPU efficiently. Check out this in-depth tutorial that takes you step-by-step in developing a large-scale AI for Science application: this includes single GPU optimizations and profiling, data parallelism, and, for very large models that do not fit on a single GPU, model parallelism.
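As a hedged illustration of what such metric logging looks like (the project name and logged values below are made up, and the tutorial may structure things differently, e.g. initializing only on rank 0 for multi-GPU runs):

```python
import wandb

# Start a run and record the hyperparameters you care about
wandb.init(project="perlmutter-ddp-example",            # hypothetical project name
           config={"lr": 1e-3, "batch_size": 64})

for epoch in range(5):
    # train_loss / val_loss would come from your existing training and validation loops
    train_loss, val_loss = 1.0 / (epoch + 1), 1.2 / (epoch + 1)  # placeholder values
    wandb.log({"epoch": epoch, "train_loss": train_loss, "val_loss": val_loss})

wandb.finish()
```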