This is an end-to-end example of training a simple logistic regression PyTorch model with DistributedDataParallel (DDP; single-node, multi-GPU data parallel training) on a fake dataset. The dataset is sharded across the GPUs by DistributedSampler. This builds on this tutorial and the PyTorch DDP tutorial.
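Below is a minimal sketch of what such a `main.py` training function might look like, assuming one process per GPU. The feature dimension, port, batch size, and hyperparameters are illustrative choices, not taken from the original.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def train(rank: int, world_size: int):
    # One process per GPU; NCCL is the standard backend for GPU training.
    # The port (12355) is an arbitrary free port on this node.
    dist.init_process_group(
        "nccl", init_method="tcp://localhost:12355", rank=rank, world_size=world_size
    )
    torch.cuda.set_device(rank)

    # Fake dataset: 1000 samples, 20 features, binary labels.
    features = torch.randn(1000, 20)
    labels = torch.randint(0, 2, (1000,)).float()
    dataset = TensorDataset(features, labels)

    # DistributedSampler gives each rank a disjoint shard of the dataset.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    # Logistic regression = a single linear layer + BCE-with-logits loss.
    model = nn.Linear(20, 1).to(rank)
    model = DDP(model, device_ids=[rank])
    loss_fn = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for epoch in range(5):
        # Re-seed the sampler each epoch so the shuffle differs across epochs.
        sampler.set_epoch(epoch)
        for x, y in loader:
            x, y = x.to(rank), y.to(rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x).squeeze(1), y)
            loss.backward()  # DDP all-reduces gradients across ranks here
            optimizer.step()
        if rank == 0:
            print(f"epoch {epoch}: loss {loss.item():.4f}")

    dist.destroy_process_group()
```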
Let's say you have 8 GPUs and want to run it on GPUs 5, 6, and 7, since GPUs 0-4 are in use by others. Then it can be run with: `CUDA_VISIBLE_DEVICES=5,6,7 python3 main.py`
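For reference, here is a sketch of the entry point that would make this launch command work. It assumes the `train` function above and derives the world size from however many GPUs `CUDA_VISIBLE_DEVICES` leaves visible; with `CUDA_VISIBLE_DEVICES=5,6,7`, the local device indices 0, 1, 2 map to physical GPUs 5, 6, 7.

```python
import torch
import torch.multiprocessing as mp

if __name__ == "__main__":
    # device_count() only sees the GPUs exposed by CUDA_VISIBLE_DEVICES,
    # so this is 3 when launched with CUDA_VISIBLE_DEVICES=5,6,7.
    world_size = torch.cuda.device_count()
    # Spawn one training process per visible GPU; each receives its rank
    # as the first argument, followed by the args tuple.
    mp.spawn(train, args=(world_size,), nprocs=world_size, join=True)
```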
Additional resources
- TODO: Implement validation in DistributedDataParallel (forum link here)
- DDP video tutorials
- Distributed Data Parallel Model Training in PyTorch