
FFCV doesn't work for large dataset #389

Open

richardrl opened this issue Sep 15, 2024 · 4 comments
richardrl commented Sep 15, 2024

I am trying to load a 600GB dataset.

It froze for an hour on the np.fromfile call in ffcv/reader.py (line 70) before I gave up and cancelled it.

I tried to fix this by using np.memmap.

        alloc_table = np.memmap(self._fname, dtype=ALLOC_TABLE_TYPE,
                                offset=offset, shape=file_size, mode="r+")
        # alloc_table = np.fromfile(self._fname, dtype=ALLOC_TABLE_TYPE,
        #                           offset=offset)

The first time I did this, for some reason the subsequent code grew my 262GB .beton file to 6.2TB.
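I think I see what happened with that first attempt (my guess, based on numpy's documented behaviour rather than anything FFCV-specific): np.memmap interprets shape as a number of records, not bytes, and with mode="r+" numpy grows the file on disk to offset + shape * itemsize if it is currently smaller than that. A tiny self-contained sketch of that behaviour; the record dtype below is a made-up stand-in for ALLOC_TABLE_TYPE:

import os
import tempfile

import numpy as np

# Made-up record layout standing in for ALLOC_TABLE_TYPE (illustration only).
record = np.dtype([("sample_id", "<u8"), ("ptr", "<u8"), ("size", "<u8")])  # 24 bytes/record

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\0" * 1024)            # stand-in for the alloc-table region: 1024 bytes
    fname = f.name

n_bytes = os.path.getsize(fname)     # 1024

# Passing the byte count as `shape` asks for 1024 records (1024 * 24 bytes),
# and mode="r+" makes numpy grow the file on disk to fit that request.
np.memmap(fname, dtype=record, shape=n_bytes, mode="r+")
print(os.path.getsize(fname))        # 24576 -- the file grew roughly itemsize-fold

# Converting bytes to a record count and opening read-only leaves the file alone.
np.memmap(fname, dtype=record, shape=n_bytes // record.itemsize, mode="r")

os.unlink(fname)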

I now need to recreate the .beton and try again with just the read ("r") flag for memmap to see if I can get this working. Otherwise, any tips?


richardrl commented Sep 15, 2024

        alloc_table = np.memmap(self._fname, dtype=ALLOC_TABLE_TYPE,
                                offset=int(offset),
                                shape=int(file_size / ALLOC_TABLE_TYPE.itemsize),
                                mode="r")

I applied this fix and that part of the code now finishes immediately. What other changes are needed to support the large-dataset regime?

Also, my dataset can be very long in some cases - up to 100 million frames. I wonder if there is a bottleneck in FFCV there as well.


richardrl commented Sep 15, 2024

I made the dataset length much smaller and I'm now able to construct my dataloader:

train_dataloader = Loader(cfg.dataloader.beton_path,
                          batch_size=cfg.dataloader.batch_size,
                          num_workers=cfg.dataloader.num_workers,
                          order=TemporalClipOrder,
                          pipelines=PIPELINES,
                          order_kwargs=dict(
                              metadata_dict=torch.load(cfg.dataloader.metadata_path),
                              num_clips=cfg.dataloader.order_kwargs.num_clips,
                              sequence_length=cfg.horizon,
                              pad_before=cfg.dataloader.order_kwargs.pad_before,
                              pad_after=cfg.dataloader.order_kwargs.pad_after,
                              frame_skip=cfg.dataloader.order_kwargs.frame_skip,
                              artificial_video_ends=cfg.dataloader.order_kwargs.artificial_video_ends
                          ),
                          os_cache=False,
                          drop_last=True)

However, I get a segfault immediately when I try to iterate over it. I presume this is due to using the memmap. Any suggestions for how to make this whole setup work with the large dataset?

The segfault happens right after the pdb breakpoint:

with tqdm.tqdm(train_dataloader, desc=f"Training epoch {self.epoch}",
               leave=True, mininterval=cfg.training.tqdm_interval_sec) as tepoch:
    import pdb
    pdb.set_trace()
    for batch_idx, batch in enumerate(tepoch):
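To narrow the segfault down, one thing to try (just a sketch; loader_kwargs below is a hypothetical dict holding the same keyword arguments as the Loader(...) call above): build the loader single-process and pull one batch outside tqdm / the training loop. If that works, the crash is more likely in the num_workers > 0 path than in the np.memmap change in reader.py.

# Isolation sketch: same arguments as the Loader(...) above, but single-process.
# `loader_kwargs` is a hypothetical dict of the keyword arguments shown earlier.
probe_loader = Loader(cfg.dataloader.beton_path, **{**loader_kwargs, "num_workers": 0})

# Pull exactly one batch; a clean result here points away from the memmap read.
first_batch = next(iter(probe_loader))
print([getattr(x, "shape", type(x)) for x in first_batch])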

richardrl commented

I'm finding that, for both the initial .beton creation and the initial dataloader load, the full dataset has to fit into memory - I get OOM errors otherwise, even with os_cache=False.

richardrl commented

I got things to work by setting num_workers=0 after the .beton was created. I'm not sure why, but this seems related: https://ppwwyyxx.com/blog/2022/Demystify-RAM-Usage-in-Multiprocess-DataLoader/

I guess FFCV is doing something internally that blows up the memory upon Loader(...) when num_workers > 0.
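In the spirit of that post, a rough way to see where the memory actually goes (sketch; assumes Linux and psutil) is to compare RSS against PSS/USS for the training process and its loader workers, since plain RSS double-counts pages shared between processes (for example the memmapped .beton):

import os

import psutil

def report(pid):
    p = psutil.Process(pid)
    for proc in [p] + p.children(recursive=True):
        m = proc.memory_full_info()   # uss/pss fields are Linux-only
        print(f"pid={proc.pid:>7}  rss={m.rss / 2**30:6.2f} GiB  "
              f"pss={m.pss / 2**30:6.2f} GiB  uss={m.uss / 2**30:6.2f} GiB")

# Run while the Loader is being constructed / iterated, e.g. from a second
# process given the training job's pid, or right after Loader(...) returns.
report(os.getpid())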

I still haven't tested whether things work at the .beton creation stage; I was getting OOM there unless I had more memory than the dataset size. I was using 60 workers.
