Multi-GPU when using Torchtext iterator for data loading #226
Replies: 22 comments
-
Hi! Install from master and try again? I believe we pushed a fix for this on master. If not, I can look at it more deeply.
-
@aitor-garcia-p Actually, we just released a new version with these fixes. Try again? If not, we'll take a deeper look at it.
-
Hi again. After digging a bit (with my limited understanding), I see that in this function, if the "batch" parameter is a torchtext.data.Batch object (as happens when using a Torchtext Iterator), the Trainer function transfer_batch_to_gpu will miss it despite having several conditionals. I made a test adding this additional condition:
(or any other condition that catches a torchtext.data.Batch instance), but I still cannot get multi-GPU working when the batches come from a torchtext iterator.
And it complains about the following:
It seems that something in the torchtext iterator prevents proper serialization when the distributed processes are spawned.
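(The conditional snippet above did not survive the migration. A minimal sketch of the kind of check being described, assuming a batch-transfer helper that receives the target GPU index and the legacy torchtext.data.Batch API with its `fields` list; the branch shown here is hypothetical, not Lightning's actual code:)

```python
import torch
import torchtext


def transfer_batch_to_gpu(batch, gpu_id):
    # Hypothetical extra branch: torchtext's Batch is neither a Tensor,
    # list, tuple, nor dict, so the existing conditionals skip it.
    if isinstance(batch, torchtext.data.Batch):
        device = torch.device('cuda', gpu_id)
        # A legacy Batch stores one attribute per Field; move each tensor.
        for field_name in batch.fields:
            value = getattr(batch, field_name)
            if isinstance(value, torch.Tensor):
                setattr(batch, field_name, value.to(device))
        return batch
    # ... existing handling for tensors, lists, tuples, and dicts ...
```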
-
Yeah, it looks like torchtext can't be pickled and thus can't be used with DDP, but you should verify that on the torchtext issue tracker. If that's true, then I'd recommend DP, or we can try to come up with a workaround.
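(For reference, a sketch of selecting DP in the Lightning versions of that era; `distributed_backend` was the flag name at the time and has since been renamed, and `model` stands for your LightningModule:)

```python
from pytorch_lightning import Trainer

# DP keeps everything in one process, so the un-picklable torchtext
# iterator never has to cross a process boundary.
trainer = Trainer(gpus=2, distributed_backend='dp')
trainer.fit(model)
```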
-
Also, feel free to submit a PR with your changes so we can enable torchtext support.
-
Hey @williamFalcon,
-
Hey! Sorry, I've been busy with deadlines but will look at it this week. Want to take a stab at a PR? I can help you finish it once you submit it.
-
@ctlaltdefeat Did you still want to submit this PR?
-
I've been busy too, and I think it may be more of an issue between
-
That's correct: torchtext can't be pickled, so you'll want to use DP. Could you give a full stack trace of the issue with DP? I'm not sure which step is emitting that error, or whether it's coming from data loading or training.
-
The issue with DP (for me) is that the inability to use mixed-precision training offsets the benefit of multi-GPU training.
-
Any recent updates on this issue?
-
I am trying to run a torchtext dataset; it works fine with a single GPU but fails with dp and ddp (ddp2 is out of reach for me since I have no SLURM). I think the ddp failure may be an issue with another library (wandb.com), but with dp I am getting the same error as the OP.
-
@jeffling This is the error trace for DP with torchtext:
I am running train.py from this repository: https://github.com/Genei-Ltd/Siamese_BERT_blogpost/blob/master/train.py
-
@aced125 It looks like the batches aren't being put on the right GPU. Could you look at the example code the OP posted regarding the hack with torchtext to place things on the right GPU? It also looks like @ctlaltdefeat had this working with DP, but couldn't use DP for other reasons. Any tips?
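(The OP's snippet did not survive the migration; a rough sketch of the kind of device hack being referred to, assuming a legacy BucketIterator and placeholder `train_data`/`valid_data` datasets. The single fixed device is exactly why it cannot work under DP/DDP, where replicas live on several GPUs:)

```python
import torch
from torchtext.data import BucketIterator

# Single-GPU hack: hand the current CUDA device to the iterator so batches
# are built directly on that GPU instead of defaulting to the CPU.
device = torch.device(f'cuda:{torch.cuda.current_device()}'
                      if torch.cuda.is_available() else 'cpu')

train_iter, valid_iter = BucketIterator.splits(
    (train_data, valid_data),   # placeholder torchtext datasets
    batch_size=32,
    device=device,
)
```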
-
@jeffling I've given up on torchtext datasets, to be honest. It was easy enough to switch to a torch.utils.data.DataLoader instead. I am going to try PL on graph convolutions soon (using the pytorch-geometric library, which also has a custom DataLoader that inherits from the torch DataLoader), so I will let you know if that works well.
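(In case it helps anyone making the same switch, a minimal sketch of such a replacement; the pre-numericalised `examples` list, padding value, and batch size are placeholders for whatever the torchtext pipeline was producing:)

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset


class TextDataset(Dataset):
    """Stand-in for a torchtext dataset: pre-numericalised (tokens, label) pairs."""

    def __init__(self, examples):
        self.examples = examples

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        tokens, label = self.examples[idx]
        return torch.tensor(tokens), torch.tensor(label)


def collate(batch):
    # Pad to the longest sequence in the batch, like BucketIterator would.
    seqs, labels = zip(*batch)
    return pad_sequence(seqs, batch_first=True, padding_value=0), torch.stack(labels)


# Toy placeholder data: lists of token ids with a label each.
examples = [([2, 15, 7], 1), ([4, 9], 0)]
loader = DataLoader(TextDataset(examples), batch_size=32, shuffle=True, collate_fn=collate)
```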
-
Hello, I am not able to get a simple toy example running using Torchtext iterators, even on a single GPU. I am using:

```python
train_data, valid_data, test_data = Multi30k.splits(exts=('.de', '.en'), fields=(SRC, TRG))
SRC.build_vocab(train_data, min_freq=2)
TRG.build_vocab(train_data, min_freq=2)
train_iter, valid_iter, test_iter = BucketIterator.splits((train_data, valid_data, test_data), batch_size=batch_size)
```

My trainer code is:

```python
model = SegmenterModule(80, 76)
trainer = Trainer(gpus=1, max_nb_epochs=3, default_save_path='checkpoints')
trainer.fit(model)
```

But I get an error because the batch data is still on the CPU and not moved to the GPU. Stack trace:
Can someone please help me figure out the problem, or share a working example using torchtext iterators? Also, should I open a new issue for this problem, or leave this question here?
-
As some people mentioned here, I cannot make it work even for a single GPU. I debugged the code and it seems like
As a result, the data are not moved to the GPU and the code throws this exception:
-
@aitor-garcia-p @mateuszpieniak @jeffling Let's close this one and continue the discussion on how to improve the situation in #1245.
-
Have you found a solution yet?
-
I also have the same problem.
-
Currently, you need to manually transfer the data to the GPU when using torchtext. Take a look at my gist.
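(The gist link did not survive the migration, but the idea is roughly the following sketch; `batch.src`/`batch.trg`, the forward call, and the loss are placeholders for whatever Fields and module you actually have, and only the training step is shown:)

```python
import pytorch_lightning as pl
import torch.nn.functional as F


class TranslationModule(pl.LightningModule):
    def training_step(self, batch, batch_idx):
        # torchtext's Batch is opaque to Lightning's automatic transfer,
        # so move each Field tensor to wherever the model currently lives.
        device = next(self.parameters()).device
        src = batch.src.to(device)
        trg = batch.trg.to(device)

        output = self(src, trg)  # placeholder forward pass
        loss = F.cross_entropy(output.view(-1, output.size(-1)), trg.view(-1))  # placeholder loss
        return {'loss': loss}
```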
-
Hi there,
I just discovered pytorch-lightning a few days ago and it seems awesome (congratulations!).
I have a question I cannot solve by reading the docs and examples.
Is it fully compatible with Torchtext?
I am trying to use a Torchtext iterator to load the data in batches, and I have managed to make it work for a single GPU, but when I add additional GPUs to the trainer:
trainer = Trainer(experiment=exp, gpus=[0, 1])
it breaks saying:
RuntimeError: arguments are located on different GPUs at /pytorch/aten/src/THC/generic/THCTensorIndex.cu:397
I understand that the problem comes from the model and the data not being placed on the same GPU.
I am following the provided template, replacing the MNIST parts with my own data.
The way I load the training data is:
I use that little hack to get the current GPU device to parameterize the Torchtext BucketIterator, because if I leave the Torchtext iterator "device" field empty it defaults to CPU, and I get the corresponding complaint when training starts with the model on the GPU:
RuntimeError: Expected object of backend CUDA but got backend CPU for argument #3 'index'
But this hack does not work in a more-than-one-GPU setting.
Am I missing something or am I doing something wrong?
I could also reimplement my data loading using regular Pytorch dataloaders as in the template, but I would like to know if I can stick to Torchtext and still get the multi-gpu goodies from Lightning :)
Thanks in advance!