Hi, I was using DistributedDataParallel with the torch distributed DataLoader, and now I am trying to move to the FFCV distributed loader.
I start a process group at the beginning of my code, like I was doing before; no changes there.
I am using 2 CPUs + 2 GPUs, with 2 processes (args.local_rank=0 and args.local_rank=1).
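For context, the distributed setup is roughly the following (simplified sketch; I launch one process per GPU and the launcher sets args.local_rank and the env:// variables):
import torch
import torch.distributed as dist

# simplified sketch of the process-group init that runs before everything below
torch.cuda.set_device(args.local_rank)  # one process per GPU
dist.init_process_group(backend='nccl', init_method='env://')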
The result is that both processes start running, but I get "CUDA error: all CUDA-capable devices are busy or unavailable" on the second batch, when calling loss.backward().
I am able to run the same code without FFCV (using torch.utils.data.DataLoader with a DistributedSampler) with no errors.
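For comparison, the working non-FFCV loader looks roughly like this (simplified sketch; my_dataset stands in for my actual dataset object):
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# sketch of the loader that works: per-rank sharding handled by DistributedSampler
sampler = DistributedSampler(my_dataset, shuffle=True, drop_last=True)
loader = DataLoader(my_dataset, batch_size=20, num_workers=2, sampler=sampler, pin_memory=True)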
Here are the relevant parts of my code:
from ffcv.fields.decoders import IntDecoder, RandomResizedCropRGBImageDecoder, SimpleRGBImageDecoder
from ffcv.loader import Loader, OrderOption
from ffcv.transforms import Convert, ToDevice, ToTensor, ToTorchImage
from ffcv.transforms.common import Squeeze
import torchvision
import torch
from torch.cuda.amp import autocast
from torch.nn.parallel import DistributedDataParallel as DDP
from tqdm import tqdm
###preparing the pipelines
this_device = f'cuda:{args.local_rank}'
CIFAR_MEAN = [0.485, 0.456, 0.406]
CIFAR_STD = [0.229, 0.224, 0.225]
image_pipeline = [
    RandomResizedCropRGBImageDecoder((224, 224)),
    ToTensor(),
    ToDevice(torch.device(this_device), non_blocking=True),
    ToTorchImage(),
    Convert(torch.float16),
    torchvision.transforms.Normalize(mean=CIFAR_MEAN, std=CIFAR_STD)
]
label_pipeline = [IntDecoder(), ToTensor(), ToDevice(torch.device(this_device), non_blocking=True), Squeeze()]
pipelines = {
    'image': image_pipeline,
    'label': label_pipeline
}
###preparing the model
torch.cuda.set_device(args.local_rank)
self.model = self.model.to(memory_format=torch.channels_last)
self.model = DDP(self.model.cuda(args.local_rank))
###preparing the loader
# get_train_pipe_ffcv(args) returns the pipelines dict built above
self.training_generator = Loader(ffcv_path, batch_size=20, num_workers=2, distributed=True, seed=123,
                                 order=OrderOption.RANDOM, drop_last=True, pipelines=get_train_pipe_ffcv(args),
                                 batches_ahead=1, os_cache=True)
self.model.train()
###iterating batches
iterator = tqdm(self.training_generator)
for batch_idx, (input_data, target) in enumerate(iterator):
    print(batch_idx)
    with autocast():
        self.optimizer.zero_grad()
        output = self.model(input_data)
        loss = criterion(output, target)
        loss.backward()  # the second batch fails here with the CUDA error
        self.optimizer.step()
The error:
2022-07-25 04:27:36.315 DEBUG [trainer_ffcv.py:262] rank 1
2022-07-25 04:27:36.315 DEBUG [trainer_ffcv.py:262] rank 0
Exception in thread Thread-3:
Traceback (most recent call last):it/s]
File "/u/jlerner/.conda/envs/ffcv/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/u/jlerner/.conda/envs/ffcv/lib/python3.8/site-packages/ffcv/loader/epoch_iterator.py", line 99, in run
event.record(ch.cuda.default_stream())
File "/u/jlerner/.conda/envs/ffcv/lib/python3.8/site-packages/torch/cuda/streams.py", line 176, in record
super(Event, self).record(stream)
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
T 0 loss=2.147, acc=0.200, avr_acc=4, max_out=1
0%| | 1/250 [00:06<28:06, 6.77s/it]