Hi, I was using DistributedDataParallel with the torch distributed DataLoader, and now I am trying to move to the FFCV distributed loader.
I start a process group at the beginning of my code, like I was doing before; no changes there.
I am using 2 CPUs + 2 GPUs, with 2 processes (args.local_rank=0 and args.local_rank=1).
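For context, the distributed setup is roughly the following (simplified sketch; I launch one process per GPU and the launcher sets args.local_rank and the env:// variables):
import torch
import torch.distributed as dist

# simplified sketch of the process-group init that runs before everything below
torch.cuda.set_device(args.local_rank)  # one process per GPU
dist.init_process_group(backend='nccl', init_method='env://')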
The result is that both processes start running, but I get "CUDA error: all CUDA-capable devices are busy or unavailable" on the second batch, when calling loss.backward().
I am able to run the same code without FFCV (using torch.utils.data.DataLoader with a DistributedSampler) with no errors.
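For comparison, the working non-FFCV loader looks roughly like this (simplified sketch; my_dataset stands in for my actual dataset object):
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# sketch of the loader that works: per-rank sharding handled by DistributedSampler
sampler = DistributedSampler(my_dataset, shuffle=True, drop_last=True)
loader = DataLoader(my_dataset, batch_size=20, num_workers=2, sampler=sampler, pin_memory=True)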
Here are the relevant parts of my code:
from ffcv.fields.decoders import IntDecoder, RandomResizedCropRGBImageDecoder, SimpleRGBImageDecoder
from ffcv.loader import Loader, OrderOption
from ffcv.transforms import Convert, ToDevice, ToTensor, ToTorchImage
from ffcv.transforms.common import Squeeze
import torchvision
import torch
from torch.cuda.amp import autocast
from torch.nn.parallel import DistributedDataParallel as DDP
from tqdm import tqdm
###preparing the pipelines
this_device = f'cuda:{args.local_rank}'
CIFAR_MEAN = [0.485, 0.456, 0.406]
CIFAR_STD = [0.229, 0.224, 0.225]
image_pipeline = [
    RandomResizedCropRGBImageDecoder((224, 224)),
    ToTensor(),
    ToDevice(torch.device(this_device), non_blocking=True),
    ToTorchImage(),
    Convert(torch.float16),
    torchvision.transforms.Normalize(mean=CIFAR_MEAN, std=CIFAR_STD)
]
label_pipeline = [IntDecoder(), ToTensor(), ToDevice(torch.device(this_device), non_blocking=True), Squeeze()]
pipelines = {
    'image': image_pipeline,
    'label': label_pipeline
}
###preparing the model
torch.cuda.set_device(args.local_rank)
self.model = self.model.to(memory_format=torch.channels_last)
self.model = DDP(self.model.cuda(args.local_rank))
###preparing the loader
# get_train_pipe_ffcv(args) returns the pipelines dict built above
self.training_generator = Loader(ffcv_path, batch_size=20, num_workers=2, distributed=True, seed=123,
                                 order=OrderOption.RANDOM, drop_last=True, pipelines=get_train_pipe_ffcv(args),
                                 batches_ahead=1, os_cache=True)
self.model.train()
###iterating batches
iterator = tqdm(self.training_generator)
for batch_idx, (input_data, target) in enumerate(iterator):
    print(batch_idx)
    with autocast():
        self.optimizer.zero_grad()
        output = self.model(input_data)
        loss = criterion(output, target)
        loss.backward()  # the second batch fails here with the CUDA error
        self.optimizer.step()
The error:
2022-07-25 04:27:36.315 DEBUG [trainer_ffcv.py:262] rank 1
2022-07-25 04:27:36.315 DEBUG [trainer_ffcv.py:262] rank 0
Exception in thread Thread-3:
Traceback (most recent call last):it/s]
File "/u/jlerner/.conda/envs/ffcv/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/u/jlerner/.conda/envs/ffcv/lib/python3.8/site-packages/ffcv/loader/epoch_iterator.py", line 99, in run
event.record(ch.cuda.default_stream())
File "/u/jlerner/.conda/envs/ffcv/lib/python3.8/site-packages/torch/cuda/streams.py", line 176, in record
super(Event, self).record(stream)
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
T 0 loss=2.147, acc=0.200, avr_acc=4, max_out=1
0%| | 1/250 [00:06<28:06, 6.77s/it]