Support for Concurrent Data Downloading and Processing #1019
Replies: 1 comment
-
@tandalalam this is more of a Q&A discussion than an issue/bug, but yes, you can stream with webdataset. There are also ways to cache, though I'm not sure how reliable they are or how easily they fit into any given pipeline. Streaming is not a great idea for training, though: the cost of GPU time wasted by the slower data pipeline is usually not worth it (you want to download up front on CPU instances/machines without wasting GPU hours). For streaming, you can just change the prefix of your shard path from a local path to curl + flags + http. You can also use something like this, though that's specific to timm and leverages an _info.json to get the shard names.
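To make the "local path to curl + flags + http" swap concrete, here is a minimal sketch, not a definitive recipe. The helper names `to_streaming_url` and `build_dataset`, the example host, the curl flags, and the `jpg;png` / `txt` sample keys are all assumptions for illustration; adjust them to your own shards. It relies on webdataset's `pipe:` URL scheme, which reads each shard from the stdout of a shell command.

```python
def to_streaming_url(shard_pattern, base_url, curl_flags="-s -L --retry 3"):
    """Rewrite a local shard pattern as a `pipe:curl ...` URL (hypothetical helper).

    Only the directory prefix changes; the shard file name (including any
    brace pattern such as {000000..000099}) is kept as-is.
    """
    shard_name = shard_pattern.rsplit("/", 1)[-1]
    return f"pipe:curl {curl_flags} {base_url.rstrip('/')}/{shard_name}"


def build_dataset(url):
    """Build a streaming image/text dataset from a `pipe:` URL (hypothetical helper)."""
    # Imported lazily so the URL helper above works without webdataset installed.
    import webdataset as wds

    # Assumed sample layout: an image under "jpg" or "png" and a caption under "txt".
    return wds.WebDataset(url).decode("pil").to_tuple("jpg;png", "txt")


if __name__ == "__main__":
    local_pattern = "/data/train/shard-{000000..000099}.tar"
    # example.com is a placeholder host, not a real dataset location.
    print(to_streaming_url(local_pattern, "https://example.com/clip-data/"))
```

Wrapping the result of `build_dataset` in a `torch.utils.data.DataLoader` with several workers lets the curl downloads overlap with training steps, but the throughput caveat above still applies.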
-
Hi,
I am currently working on fine-tuning CLIP with a large dataset, and I have encountered a bottleneck during the downloading process. Downloading the dataset is quite time-consuming, and I was wondering if there’s a way to streamline this process by creating a pipeline where the data can be downloaded while the model is processing the batches.
Does WebDataset support this kind of concurrent downloading and processing out of the box? If not, are there any recommended strategies or configurations to achieve this?
Any guidance or references would be greatly appreciated.
Thank you!