Runtime Error when Using custom batch_sampler with a Distributed Sampler #20224
Unanswered · uma-ibm asked this question in DDP / multi-GPU / multi-node
Hi everyone,
I’m working on a PyTorch Lightning pipeline (on a machine with 4 H100 GPUs) where I need to pass a sorted dataset (audio, in this case) into a custom batch sampler in order to (1) bucket the samples and (2) batch them downstream. My goal is to use the sorted indices for custom batch sampling to minimize padding during data loading: sorting brings samples of similar length together, which reduces the padding applied in the collate_fn (which currently zero-pads every sample to the maximum sequence length in the batch).
Below is my get_dataloader code, which is called from my training code:
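(The snippet below is a simplified sketch of that function; `BucketBatchSampler` and `collate_pad` are placeholder names for my custom batch sampler and padding collate_fn.)

```python
import torch
from torch.utils.data import DataLoader, DistributedSampler


def get_dataloader(dataset, batch_size, num_workers):
    # The dataset is pre-sorted by audio length; the sampler must preserve that order.
    sampler = DistributedSampler(dataset, shuffle=False)

    # Custom batch sampler that buckets the (sorted) indices produced by the sampler.
    batch_sampler = BucketBatchSampler(sampler, batch_size=batch_size)

    return DataLoader(
        dataset,
        batch_sampler=batch_sampler,
        num_workers=num_workers,
        collate_fn=collate_pad,  # zero-pads every sample to the max length in the batch
    )
```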
The issue I’m encountering is that when I manually initialize DistributedSampler (and set use_distributed_sampler=False in the Trainer initialization), the ValueError below occurs. If I instead let Lightning handle the distributed setup without manually setting shuffle=False, there is no error, but my sorted indices get shuffled, and my bucketing logic needs the sampler to output sorted indices. Interestingly, if I use sampler = SequentialSampler(dataset), Lightning overrides it with a DistributedSampler again…
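For reference, this is roughly how I configure the Trainer (assuming Lightning 2.x; the device/strategy values are just what I’m using, the key part is use_distributed_sampler=False):

```python
import lightning as L

trainer = L.Trainer(
    accelerator="gpu",
    devices=4,
    strategy="ddp",
    use_distributed_sampler=False,  # so Lightning does not replace my manual sampler
)
```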
Here’s the error I receive when using DistributedSampler(dataset, shuffle=False):
ValueError: Default process group has not been initialized, please make sure to call init_process_group.
Has anyone faced a similar challenge, or does anyone know how to prevent the dataset from being shuffled while using DistributedSampler in this context? I’m specifically looking for a way to maintain the sorted order for my custom batch sampling logic.
I’m also providing my bucket_sampler logic for more context:
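(Again a simplified sketch; the class and argument names are illustrative, but the structure matches what I do: indices arrive from the underlying sampler in sorted order, get grouped into buckets, and batches are drawn from within each bucket.)

```python
from torch.utils.data import Sampler


class BucketBatchSampler(Sampler):
    """Groups indices from an underlying (sorted, non-shuffled) sampler into
    buckets, then yields batches from within each bucket so that samples of
    similar length end up together and padding stays small."""

    def __init__(self, sampler, batch_size, bucket_size_multiplier=100):
        self.sampler = sampler
        self.batch_size = batch_size
        # Each bucket holds `bucket_size_multiplier` batches worth of indices.
        self.bucket_size = batch_size * bucket_size_multiplier

    def __iter__(self):
        bucket = []
        for idx in self.sampler:  # relies on the sampler preserving sorted order
            bucket.append(idx)
            if len(bucket) == self.bucket_size:
                yield from self._batches(bucket)
                bucket = []
        if bucket:
            yield from self._batches(bucket)

    def _batches(self, bucket):
        for i in range(0, len(bucket), self.batch_size):
            yield bucket[i:i + self.batch_size]

    def __len__(self):
        return (len(self.sampler) + self.batch_size - 1) // self.batch_size
```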
cc: @awaelchli, can you please help? Thanks in advance!