Runtime Error when Using custom batch_sampler with a Distributed Sampler #20224

uma-ibm · 2024-08-24T01:35:09Z

uma-ibm
Aug 24, 2024

Hi everyone,

I’m working on a Pytorch Lightning pipeline (on a machine with 4 H100 GPUs) where I need to pass a sorted dataset (audio here) into a custom batch sampler to (1) bucket these samples and (2) batch them down the line. My goal is to use the sorted indices for custom batch sampling to minimize padding (since sorting can bring similar lengths together and can reduce padding within a batch) during data loading in the collate_fn (which currently uses a zero padding technique which pads all samples to the max sequence).
Below is my get_dataloader code which is called in train code:

if len(dataset_list) == 1:
    dataset = dataset_list[0]
else:
    dataset = torch.utils.data.ConcatDataset(dataset_list)

# Sort dataset by audio length in decreasing order
sorted_indices = sorted(range(len(dataset)), key=lambda i: dataset[i].audio.shape[1], reverse=True)
sorted_dataset = [dataset[i] for i in sorted_indices]

dataset = sorted_dataset

# Use DistributedSampler without shuffling
sampler = DistributedSampler(dataset, shuffle=False)

# Custom BucketBatchSampler with bucketing logic
bucket_sampler = BucketBatchSampler(
    sampler, 
    batch_size=self.config.physical_batch_size,
    drop_last=train,
    buckets=25,
)

# Create DataLoader with the custom batch sampler
dataloader = torch.utils.data.DataLoader(
    dataset=dataset,
    batch_sampler=bucket_sampler,
    collate_fn=dataset_list[0].collate_fn,
    num_workers=4,
    pin_memory=True,
    shuffle=train,
)

The issue I’m encountering is that when I manually initialize DistributedSampler (also turned off use_distributed_sampler=False in Trainer initalization), the below Value error occurs. If I let Lightning handle the distributed setup without manually changing the argument shuffle=False, no error occurs but it also disrupts my sorted indices (shuffles it) and my bucketing logic needs the sampler to output sorted indices.

Interestingly, If i use a sampler =SequentialSampler(dataset), it overrides this with the DistributedSampler again…
Here’s the error I receive when using DistributedSampler(dataset, shuffle=False):

ValueError: Default process group has not been initialized, please make sure to call init_process_group.
Has anyone faced a similar challenge, or does anyone know how to prevent the dataset from being shuffled while using DistributedSampler in this context? I’m specifically looking for a way to maintain the sorted order for my custom batch sampling logic.

Also providing my bucket_sampler logic for more context:

class BucketBatchSampler(BatchSampler):
    def __init__(self, sampler, batch_size, drop_last=False, buckets=25, lengths_dict=None):
        self.sampler = sampler
        self.batch_size = batch_size
        self.drop_last = drop_last
        self.buckets = buckets
    
    def __iter__(self):
        indices = list(self.sampler)

        if self.drop_last or len(indices) % self.buckets == 0:
            bucket_size = len(indices) // self.buckets
        else:
            bucket_size = (len(indices) // self.buckets) + 1

        # Create buckets
        buckets = [
            indices[i:i + bucket_size]
            for i in range(0, len(indices), bucket_size)
        ]

        # Shuffle buckets
        random.shuffle(buckets)
        
        for bucket in buckets:
            # Shuffle samples within the bucket
            random.shuffle(bucket)
            
            # Yield batches from the bucket
            for i in range(0, len(bucket), self.batch_size):
                batch = bucket[i:i + self.batch_size]
                if len(batch) == self.batch_size or not self.drop_last:
                    yield batch

    def __len__(self):
        if self.drop_last:
            length = len(self.sampler) // self.batch_size
            return length
        else:
            length = (len(self.sampler) + self.batch_size - 1) // self.batch_size
            return length

cc: @awaelchli can you please help! thanks in advance

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Runtime Error when Using custom batch_sampler with a Distributed Sampler #20224

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 0 comments

Select a reply

Runtime Error when Using custom batch_sampler with a Distributed Sampler #20224

uma-ibm Aug 24, 2024

Replies: 0 comments

uma-ibm
Aug 24, 2024