distributed training (#74)
* [WIP] distributed

* [WIP] distributed training

* add script

* add smaller script

* fix device

* fix DDP model setting

* reorder model on device

* remove model.model...

* set print on process 0

* remove duplicate

* model folder set on main process

* test

* model.update fix

* exists_ok=True

* add exist_ok

* jz multinode fix

* fix batch size splitting

* isort and black

* [WIP] work on trainers

* [WIP] work on Distributed and trainers

* add test example with adversarial trainer

* fix small issue

* fix master addr environ

* fix typo

* fix update with DDP

* update callback for distributed training

* display progress per process

* enhance display

* [WIP] make CoupledAdv distributed

* Clean up trainers and add distributed training

* fix piwae tests

* fix test piwae

* fix some tests

* increase coverage

* increase coverage

* add predict on main process

* apply black and isort

* update notebooks with batch_size

* update reproducibility scripts

* clean up

* isort & black

* update README

* remove assert 0

* update distributed script

* add wandb

* update script

* log only on main process

* test batch size

* loss debugging

* test with AE

* test with adaptive batchsize

* test with larger batch size

* benchmark

* benchmark perf

* remove debug prints

* reduce learning rate

* show results

* new net

* lr

* remove sigm

* lr

* epochs

* batch_size

* new test

* with sigm

* test

* test

* retest

* retest

* with rank

* test in trainer

* retest

* test

* test

* test no embedding

* test

* test distributed

* debugging

* debug

* not learnable codebook

* fix typo

* contiguous

* fix issue

* test inplace

* no_grad()

* debug

* find unused

* debug

* test with dist_nn

* remove find_unused

* test with dist.nn

* check rank

* remove all_reduce

* test with ddp

* second all_reduce

* async

* add detach

* add detach

* test

* debug

* change

* with einsum

* contiguous

* remove parameter

* new test

* debug

* debug

* add barrier

* remove embeddings

* update code

* update

* update

* mass sanity check on all processes

* revert to good VQVAE

* remove prints

* add dist backend to script

* reduce number of epochs in example

* update doc

* increase batch size in example

* add other script

* remove find_unused

* test without unused

* fix unused

* add num_workers option to Training config

* add num_workers to scripts

* test with embedding

* remove learned codebook

* grad accumulation for benchmark

* benchmark

* add grad accumulation

* remove print

* benchmark

* remove num_workers

* add FFHQ to benchmark

* fix predict

* fix predict

* reduce number of samples in predict

* add parser

* add sigmoid

* update config

* add imagenet script

* convert img to RGB

* add sigmoid to decoder

* increase batch size

* change nets

* change nets

* add new script

* add convert to RGB

* update tests

* clean up

* prepare release

* update doc

* fix input_dim

* last figures

* doc fix
clementchadebec committed Feb 6, 2023
1 parent 06555fb commit 08f805e
Showing 180 changed files with 7,033 additions and 7,365 deletions.
55 changes: 52 additions & 3 deletions README.md
@@ -36,12 +36,16 @@ provides the possibility to perform benchmark experiments and comparisons by training
the models with the same autoencoding neural network architecture. The feature *make your own autoencoder*
allows you to train any of these models with your own data and your own Encoder and Decoder neural networks. It integrates experiment monitoring tools such as [wandb](https://wandb.ai/), [mlflow](https://mlflow.org/) or [comet-ml](https://www.comet.com/signup?utm_source=pythae&utm_medium=partner&utm_campaign=AMS_US_EN_SNUP_Pythae_Comet_Integration) 🧪 and allows model sharing and loading from the [HuggingFace Hub](https://huggingface.co/models) 🤗 in a few lines of code.

**News** 📢

As of v0.1.0, `Pythae` now supports distributed training using PyTorch's [DDP](https://pytorch.org/docs/stable/notes/ddp.html). You can train your favorite VAE faster and on larger datasets, still with a few lines of code.
See our speed-up [benchmark](#benchmark).

## Quick access:
- [Installation](#installation)
- [Implemented models](#available-models) / [Implemented samplers](#available-samplers)
- [Reproducibility statement](#reproducibility) / [Results flavor](#results)
- - [Model training](#launching-a-model-training) / [Data generation](#launching-data-generation) / [Custom network architectures](#define-you-own-autoencoder-architecture)
+ - [Model training](#launching-a-model-training) / [Data generation](#launching-data-generation) / [Custom network architectures](#define-you-own-autoencoder-architecture) / [Distributed training](#distributed-training-with-pythae)
- [Model sharing with 🤗 Hub](#sharing-your-models-with-the-huggingface-hub-) / [Experiment tracking with `wandb`](#monitoring-your-experiments-with-wandb-) / [Experiment tracking with `mlflow`](#monitoring-your-experiments-with-mlflow-) / [Experiment tracking with `comet_ml`](#monitoring-your-experiments-with-comet_ml-)
- [Tutorials](#getting-your-hands-on-the-code) / [Documentation](https://pythae.readthedocs.io/en/latest/)
- [Contributing 🚀](#contributing-) / [Issues 🛠️](#dealing-with-issues-%EF%B8%8F)
@@ -141,8 +145,15 @@ To launch a model training, you only need to call a `TrainingPipeline` instance.
... output_dir='my_model',
... num_epochs=50,
... learning_rate=1e-3,
- ...     batch_size=200,
- ...     steps_saving=None
+ ...     per_device_train_batch_size=200,
+ ...     per_device_eval_batch_size=200,
+ ...     train_dataloader_num_workers=2,
+ ...     eval_dataloader_num_workers=2,
+ ...     steps_saving=20,
+ ...     optimizer_cls="AdamW",
+ ...     optimizer_params={"weight_decay": 0.05, "betas": (0.91, 0.995)},
+ ...     scheduler_cls="ReduceLROnPlateau",
+ ...     scheduler_params={"patience": 5, "factor": 0.5}
... )
>>> # Set up the model configuration
>>> my_vae_config = model_config = VAEConfig(
@@ -334,6 +345,44 @@ You can also find predefined neural network architectures for the most common datasets
```
Replace *mnist* with *cifar* or *celeba* to access the other neural nets.
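For instance, a minimal sketch for the CelebA case, assuming the `celeba` module mirrors the `mnist` one (the exact class names below are illustrative and may differ in your installed version):

```python
>>> from pythae.models.nn.benchmarks.celeba import (
...     Encoder_Conv_AE_CELEBA, # predefined convolutional encoder for CelebA inputs
...     Decoder_Conv_AE_CELEBA  # matching convolutional decoder
... )
```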

## Distributed Training with `Pythae`
As of `v0.1.0`, Pythae now supports distributed training using PyTorch's [DDP](https://pytorch.org/docs/stable/notes/ddp.html). It allows you to train your favorite VAE faster and on larger datasets using multi-GPU and/or multi-node training.

To do so, you can build a Python script that will then be run by a launcher (such as `srun` on a cluster). The only thing needed in the script is to specify the elements of the distributed environment (such as the number of nodes/GPUs) directly in the training configuration, as follows:

```python
>>> training_config = BaseTrainerConfig(
... num_epochs=10,
... learning_rate=1e-3,
... per_device_train_batch_size=64,
... per_device_eval_batch_size=64,
... train_dataloader_num_workers=8,
... eval_dataloader_num_workers=8,
... dist_backend="nccl", # distributed backend
... world_size=8, # total number of gpus to use (n_nodes x n_gpus_per_node)
... rank=5, # global process/gpu id
... local_rank=1, # local process/gpu id within the node
... master_addr="localhost", # master address
... master_port="12345" # master port
... )
```

See this [example script](https://github.com/clementchadebec/benchmark_VAE/blob/main/examples/scripts/distributed_training_imagenet.py) that defines a multi-GPU VQ-VAE training on the ImageNet dataset. Please note that the way the distributed environment variables (`world_size`, `rank` ...) are recovered may be specific to the cluster and launcher you use.
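For instance, on a SLURM cluster where each process is spawned by `srun`, these values can typically be read from the environment variables the scheduler exports. The snippet below is only a sketch under that assumption; the `SLURM_*` variable names and the master address/port handling are scheduler-specific and not part of `Pythae` itself:

```python
>>> import os
>>> from pythae.trainers import BaseTrainerConfig
>>> # srun exports one set of SLURM_* variables per spawned process,
>>> # so each process builds a config carrying its own rank
>>> training_config = BaseTrainerConfig(
...     num_epochs=10,
...     learning_rate=1e-3,
...     per_device_train_batch_size=64,
...     per_device_eval_batch_size=64,
...     dist_backend="nccl",
...     world_size=int(os.environ["SLURM_NTASKS"]), # total number of processes
...     rank=int(os.environ["SLURM_PROCID"]), # global process/gpu id
...     local_rank=int(os.environ["SLURM_LOCALID"]), # local process/gpu id within the node
...     master_addr=os.environ.get("MASTER_ADDR", "localhost"), # assumed set by your job script
...     master_port=os.environ.get("MASTER_PORT", "12345") # assumed set by your job script
... )
```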

### Benchmark

Below are the training times of a Vector Quantized VAE (VQ-VAE) trained with `Pythae` for 100 epochs on MNIST on V100 16GB GPU(s), for 50 epochs on [FFHQ](https://github.com/NVlabs/ffhq-dataset) (1024x1024 images), and for 20 epochs on [ImageNet-1k](https://huggingface.co/datasets/imagenet-1k) on V100 32GB GPU(s).

| Dataset (model) | Train Data | 1 GPU | 4 GPUs | 2x4 GPUs |
|:---:|:---:|:---:|:---:|:---:|
| MNIST (VQ-VAE) | 28x28 images (50k) | 235.18 s | 62.00 s | 35.86 s |
| FFHQ 1024x1024 (VQ-VAE) | 1024x1024 RGB images (60k) | 19h 1min | 5h 6min | 2h 37min |
| ImageNet-1k 128x128 (VQ-VAE) | 128x128 RGB images ($\approx$ 1.2M) | 6h 25min | 1h 41min | 51min 26s |


For each dataset, we provide the benchmarking scripts [here](https://github.com/clementchadebec/benchmark_VAE/tree/main/examples/scripts).


## Sharing your models with the HuggingFace Hub 🤗
Pythae also allows you to share your models on the [HuggingFace Hub](https://huggingface.co/models). To do so, you need:
- a valid HuggingFace account
98 changes: 0 additions & 98 deletions docs/old/advanced/custom_autoencoder.rst

This file was deleted.

214 changes: 0 additions & 214 deletions docs/old/advanced/setting_configs.rst

This file was deleted.

