This is an end-to-end example of training a simple logistic regression PyTorch model with DistributedDataParallel (DDP; single-node, multi-GPU data parallel training) on a fake dataset. The dataset is sharded across the GPUs by DistributedSampler. This builds on this tutorial and the PyTorch DDP tutorial.
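Below is a minimal sketch of what such a `main.py` training function might look like, assuming one process per GPU. The feature dimension, port, batch size, and hyperparameters are illustrative choices, not taken from the original.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def train(rank: int, world_size: int):
    # One process per GPU; NCCL is the standard backend for GPU training.
    # The port (12355) is an arbitrary free port on this node.
    dist.init_process_group(
        "nccl", init_method="tcp://localhost:12355", rank=rank, world_size=world_size
    )
    torch.cuda.set_device(rank)

    # Fake dataset: 1000 samples, 20 features, binary labels.
    features = torch.randn(1000, 20)
    labels = torch.randint(0, 2, (1000,)).float()
    dataset = TensorDataset(features, labels)

    # DistributedSampler gives each rank a disjoint shard of the dataset.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    # Logistic regression = a single linear layer + BCE-with-logits loss.
    model = nn.Linear(20, 1).to(rank)
    model = DDP(model, device_ids=[rank])
    loss_fn = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for epoch in range(5):
        # Re-seed the sampler each epoch so the shuffle differs across epochs.
        sampler.set_epoch(epoch)
        for x, y in loader:
            x, y = x.to(rank), y.to(rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x).squeeze(1), y)
            loss.backward()  # DDP all-reduces gradients across ranks here
            optimizer.step()
        if rank == 0:
            print(f"epoch {epoch}: loss {loss.item():.4f}")

    dist.destroy_process_group()
```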
Let's say you have 8 GPUs and want to run it on GPUs 5, 6, and 7, since GPUs 0-4 are in use by others. Then it can be run with: `CUDA_VISIBLE_DEVICES=5,6,7 python3 main.py`
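For reference, here is a sketch of the entry point that would make this launch command work. It assumes the `train` function above and derives the world size from however many GPUs `CUDA_VISIBLE_DEVICES` leaves visible; with `CUDA_VISIBLE_DEVICES=5,6,7`, the local device indices 0, 1, 2 map to physical GPUs 5, 6, 7.

```python
import torch
import torch.multiprocessing as mp

if __name__ == "__main__":
    # device_count() only sees the GPUs exposed by CUDA_VISIBLE_DEVICES,
    # so this is 3 when launched with CUDA_VISIBLE_DEVICES=5,6,7.
    world_size = torch.cuda.device_count()
    # Spawn one training process per visible GPU; each receives its rank
    # as the first argument, followed by the args tuple.
    mp.spawn(train, args=(world_size,), nprocs=world_size, join=True)
```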
Additional resources
- TODO: Implement validation in DistributedDataParallel (forum link here)
- DDP video tutorials
- Distributed Data Parallel Model Training in PyTorch