Training fails with multiple GPUs after the 1st epoch #43

Open

deepbrain opened this issue Sep 15, 2018 · 7 comments

@deepbrain

If I run with multiple GPUs, in my case 3, using python3 -m multiproc main.py ...., I get the following error after successfully completing the 1st epoch:

Traceback (most recent call last):
File "main.py", line 392, in
val_loss, skipped_iters = train(total_iters, skipped_iters, elapsed_time)
File "main.py", line 305, in train
model.allreduce_params()
File "sentiment-discovery/model/distributed.py", line 41, in allreduce_params
dist.all_reduce(coalesced)
File "/home/tester/anaconda3/lib/python3.6/site-packages/torch/distributed/init.py", line 324, in all_reduce
return torch._C._dist_all_reduce(tensor, op, group)
RuntimeError: [/opt/conda/conda-bld/pytorch_1532579245307/work/third_party/gloo/gloo/transport/tcp/buffer.cc:76] Read timeout [127.0.0.1]:32444
terminate called after throwing an instance of 'gloo::EnforceNotMet'
what(): [enforce fail at /opt/conda/conda-bld/pytorch_1532579245307/work/third_party/gloo/gloo/cuda_private.h:40] error == cudaSuccess. 29 vs 0. Error at: /opt/conda/conda-bld/pytorch_1532579245307/work/third_party/gloo/gloo/cuda_private.h:40: driver shutting down

@raulpuric
Contributor

So my guess as to why this is happening is that the last dataloader batch is not being properly dropped or padded.
Some of the training workers get fed a batch while others do not; the workers that did get a batch then wait indefinitely on the allreduce.
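
For reference, here is a minimal sketch (using the stock PyTorch 0.4-era DataLoader/DistributedSampler API, not the repo's actual loader) of how padding the per-rank partition and dropping the trailing partial batch keeps the iteration count identical across ranks:

import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Toy stand-in for the real corpus.
dataset = TensorDataset(torch.randn(1000, 16))

# DistributedSampler pads its index list so every rank gets the same number of
# samples; num_replicas/rank are passed explicitly here so the snippet runs
# without torch.distributed being initialized.
sampler = DistributedSampler(dataset, num_replicas=3, rank=0)

# drop_last=True additionally discards any trailing partial batch, so every
# rank performs the same number of forward/backward/allreduce steps.
loader = DataLoader(dataset, batch_size=64, sampler=sampler, drop_last=True)
print(len(loader))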

Do you have your full command so I can reproduce?

@deepbrain
Author

The issue happens regularly, almost every time, on a system with 3 1080Ti GPUs. On another system with 2 1080Tis and an otherwise identical setup, it happens only 10-20% of the time across the different-sized datasets I use, always at the end of the 1st epoch.

I agree that it seems to be data-size dependent and could be related to the size of the last batch.
Googling turns up very similar issues in completely different settings, so it may be related to DistributedDataParallel in PyTorch itself. A batch size of 128 does not fit into 1080Ti memory, so I used a batch size of 64 instead.

I am using the latest pre-built PyTorch from conda:

Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import torch
>>> torch.__version__
'0.4.1'

Here is the full command line I use:

python3 -m multiproc main.py --load imdb_clf.pt --batch_size 64 --epochs 20 --lr 2e-4 --data ./data/twits/train_json.json &> twitlog.txt

I have CUDA 9.2 and cuDNN 7.

@deepbrain
Author

I changed the batch size from 64 to 100 and the issue disappeared on the 3-GPU system, so you are correct: it is definitely related to the size of the last batch. You can reproduce it with any small dataset by varying the batch size.
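
To make the last-batch dependence concrete, some back-of-the-envelope arithmetic (assuming a hypothetical even split of samples across GPUs first, then batching per rank; not necessarily how main.py actually partitions the data) shows how certain dataset/batch-size combinations leave one rank with an extra batch:

def batch_counts(n_samples, world_size, batch_size):
    # Hand the remainder samples to the lowest-numbered ranks, then count how
    # many (possibly partial) batches each rank ends up with.
    base, extra = divmod(n_samples, world_size)
    counts = []
    for rank in range(world_size):
        per_rank = base + (1 if rank < extra else 0)
        full, rem = divmod(per_rank, batch_size)
        counts.append(full + (1 if rem else 0))
    return counts

print(batch_counts(384, 3, 64))  # [2, 2, 2] -> every rank hits allreduce the same number of times
print(batch_counts(385, 3, 64))  # [3, 2, 2] -> rank 0 issues one extra allreduce and blocks until the gloo read timeout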

@raulpuric
Contributor

I wasn't able to reproduce this on our smallest datasets. Could you print out the number of entries in your dataset for me, so I can try and create some synthetic data of the same length?

Adding print(len(train_data.dataset), len(val_data.dataset), len(test_data.dataset)) to main.py should suffice.

@deepbrain
Author

I changed my dataset size a bit and now I can't reproduce it either. Hopefully I will be able to reproduce it again if my data changes.

@deepbrain
Author

I got it to reproduce on a 2 GPU system with data sizes:

DATASET length: 42243230 422951 422408

full command line:

python3 -u -m multiproc main.py --load lang_model.pt --batch_size 110 --epochs 2 --lr 6e-4 --data ./data/twitter/train_json.json &> med_log.txt

Please let me know if you need the data file, I can upload it somewhere.

On another note, the train_json.json that I use is 14.4 GB in size, yet the program requires 64 GB of RAM plus another 64 GB of swap space to run. Is there a way to reduce its memory footprint so it can handle larger datasets?

@raulpuric
Contributor

Try the --lazy option. It will pull from disk and is meant to be used with large data files (such as the amazon reviews dataset).
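
In case it helps with the memory question, the general lazy-loading idea looks roughly like this (a hypothetical line-offset index over a JSON-lines file, not the repo's actual --lazy implementation): only the per-line byte offsets stay in RAM, and each record is read from disk on demand.

import json

class LazyJSONLines:
    """Hypothetical lazy reader: one JSON object per line, fetched on demand."""

    def __init__(self, path):
        self.path = path
        self.offsets = []
        # One pass over the file to record where each line starts; only the
        # offsets (a few bytes per example) are kept in memory.
        with open(path, 'rb') as f:
            offset = 0
            for line in f:
                self.offsets.append(offset)
                offset += len(line)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        # Seek to the stored offset and parse just that one record.
        with open(self.path, 'rb') as f:
            f.seek(self.offsets[idx])
            return json.loads(f.readline().decode('utf-8'))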

If you could upload the file somewhere, that would be very helpful. Even just replacing all the entries in the data file with garbage text would work.
