Will Avalanche's FFCV integration support concatenated datasets? #1543
-
Hi, I have been trying to integrate the FFCV implementation found on the main branch. I have run into an issue where I cannot use a concatenated dataset, because of a check in Avalanche's FFCV integration that rejects it.

Main question: will Avalanche's FFCV integration support concatenated datasets? So far, the only workaround I have thought of is to build a single dataset class that hides the concatenation.

Side note: thanks! P.S. Just dropping a tag for @lrzpellegrini, since I figured you were leading the charge on FFCV integration.
-
For posterity, I ended up going down the route of trying to "trick" Avalanche, and it seems to have worked for my case. This isn't a particularly impressive solution and I might be causing some issues for myself down the line, but for now it works. I just reused the logic from PyTorch's `ConcatDataset`. The key is to be very careful that both datasets return exactly the same types, shapes, etc., and that any transformations applied are also exactly the same (there's a quick sanity-check sketch after the snippet below). You can use this `CombinedDataset` class to enforce/override the transformations applied to the datasets being combined. Here's a stripped-down version:

```python
import bisect

from torch.utils.data import Dataset


class CombinedDataset(Dataset):
    """
    Dataset class to combine two datasets with a common transformation.
    Acts as a single dataset, hiding the fact that it is really a ConcatDataset.
    """

    def __init__(
        self,
    ):
        # ...
        # variables, transforms, etc.

        # Initialize datasets
        self.datasets = []
        self.datasets.append(
            MyFirstDataset()
        )
        self.datasets.append(
            MySecondDataset()
        )

        # Initialize targets
        self.targets = []
        for dataset in self.datasets:
            self.targets.extend(dataset.targets)

        # Compute cumulative sizes
        self.cumulative_sizes = self._cumsum(self.datasets)

    @staticmethod
    def _cumsum(sequence):
        # Running total of dataset lengths, mirroring PyTorch's ConcatDataset
        r, s = [], 0
        for e in sequence:
            l = len(e)
            r.append(l + s)
            s += l
        return r

    def __len__(self):
        return self.cumulative_sizes[-1]

    def __getitem__(self, index):
        if index < 0 or index >= len(self):
            raise IndexError("The index is out of range.")

        # Find which underlying dataset the index falls into, then map the
        # global index to a local index within that dataset.
        dataset_index = bisect.bisect_right(self.cumulative_sizes, index)
        sample_index = (
            index - self.cumulative_sizes[dataset_index - 1]
            if dataset_index > 0
            else index
        )

        # Unpack
        image, label = self.datasets[dataset_index][sample_index]
        return image, label
```

I'm going to close this discussion for now, but I'm still curious to know whether there will ever be support for concatenated datasets in Avalanche's FFCV integration! The solution above is hacky and bound to be error-prone, I think.
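In case it helps anyone copying this: below is a minimal sanity-check sketch for the `CombinedDataset` above. `MyFirstDataset` and `MySecondDataset` are still placeholders for your own datasets, and the exact checks you need depend on your data; this only illustrates the "same types, same shapes" requirement mentioned earlier.

```python
import torch

# Build the combined dataset (MyFirstDataset/MySecondDataset are placeholders).
combined = CombinedDataset()

# Basic bookkeeping checks.
assert len(combined) == sum(len(d) for d in combined.datasets)
assert len(combined.targets) == len(combined)

# Grab one sample from each underlying dataset via the combined indexing:
# index 0 hits the first dataset, cumulative_sizes[0] hits the second one.
first_image, first_label = combined[0]
second_image, second_label = combined[combined.cumulative_sizes[0]]

# The two datasets must produce identical types (and shapes/dtypes for tensors),
# otherwise the "single dataset" illusion breaks downstream.
assert type(first_image) is type(second_image)
assert type(first_label) is type(second_label)
if isinstance(first_image, torch.Tensor):
    assert first_image.shape == second_image.shape
    assert first_image.dtype == second_image.dtype
```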
The solution seems right. Just make sure that the order in which datasets are concatenated is always the same across executions!
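One way to keep that order deterministic, as a rough sketch: if the sub-datasets are discovered dynamically (for example by scanning a directory), sort them explicitly before building the list. The `data/splits` path and the `MyFolderDataset` class below are made-up placeholders.

```python
from pathlib import Path

# Directory iteration order is not guaranteed, so sort the paths explicitly
# to keep the concatenation order identical across executions.
dataset_roots = sorted(Path("data/splits").iterdir())

# MyFolderDataset is a placeholder for whatever per-folder dataset class you use.
datasets = [MyFolderDataset(root) for root in dataset_roots]
```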