Support for Concurrent Data Downloading and Processing #1019

tandalalam · 2025-01-13T15:10:37Z

Hi,

I am currently working on fine-tuning CLIP with a large dataset, and I have encountered a bottleneck during the downloading process. Downloading the dataset is quite time-consuming, and I was wondering if there’s a way to streamline this process by creating a pipeline where the data can be downloaded while the model is processing the batches.

Does WebDataset support this kind of concurrent downloading and processing out of the box? If not, are there any recommended strategies or configurations to achieve this?

Any guidance or references would be greatly appreciated.

Thank you!

rwightman (Maintainer) · 2025-01-14T19:36:53Z

@tandalalam this is more of a Q&A discussion than an issue/bug, but you can stream with webdataset. There are also ways to cache, though I'm not sure how reliable they are or how easily they fit into any given pipeline. It's not a great idea for training, as the money lost to GPU time wasted by slowing the process down is usually not worth it (you want to download up front with CPU instances/machines without wasting GPU hours).
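As a rough illustration of the "download up front" approach, here is a minimal sketch that fetches shards in parallel on a CPU machine using only the Python standard library. The function names, the `.part` temp-file convention, and the worker count are assumptions made for this example, not part of webdataset or this repo.

```python
import shutil
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from urllib.parse import urlparse


def download_shard(url: str, dest_dir: Path) -> Path:
    """Download one shard, skipping it if a complete copy already exists."""
    target = dest_dir / Path(urlparse(url).path).name
    if target.exists():
        return target
    # Write to a temporary name first so an interrupted download
    # is never mistaken for a finished shard.
    tmp = target.with_name(target.name + ".part")
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)
    tmp.replace(target)
    return target


def download_all(urls: list[str], dest_dir: Path, workers: int = 8) -> list[Path]:
    """Download many shards concurrently; results keep the order of `urls`."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda u: download_shard(u, dest_dir), urls))
```

Calling `download_all(shard_urls, Path("/data/shards"))` (both names hypothetical) before training would leave local .tar files that can then be read from a plain local shard path.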

For streaming, you can just change the prefix of your shard path from a local path to a curl pipe (curl + flags + HTTP URL):

pipe:curl -f -s -L http://where/are/my_shards{000..111}.tar
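A minimal sketch of how that curl prefix could be wired into a loader, assuming the webdataset and torch packages are installed. The helper names, the "jpg"/"txt" sample keys, and the buffer and worker counts are illustrative assumptions. Since the loader workers read from the curl pipes in background processes, shards keep downloading while the training loop consumes earlier samples.

```python
def curl_shard_url(base_url: str, first: int, last: int, width: int = 3) -> str:
    """Build a 'pipe:curl' shard spec with a zero-padded brace range."""
    brace_range = f"{{{first:0{width}d}..{last:0{width}d}}}"
    return f"pipe:curl -f -s -L {base_url}{brace_range}.tar"


def make_streaming_loader(url: str, num_workers: int = 4):
    """Stream shards over HTTP; worker processes download in the background."""
    # Imported lazily so the URL helper above works without webdataset installed.
    import webdataset as wds

    dataset = (
        wds.WebDataset(url, shardshuffle=100)  # shuffle the shard order
        .shuffle(1000)            # sample-level shuffle buffer (assumed size)
        .decode("pil")            # decode image files to PIL images
        .to_tuple("jpg", "txt")   # assumed keys: image and caption
    )
    # batch_size=None yields individual samples; add transforms and
    # batching to the pipeline as your training code requires.
    return wds.WebLoader(dataset, batch_size=None, num_workers=num_workers)


url = curl_shard_url("http://where/are/my_shards", 0, 111)
print(url)  # pipe:curl -f -s -L http://where/are/my_shards{000..111}.tar
```

Whether this keeps the GPU busy depends on the bandwidth to the shard host, so it is worth measuring samples per second with and without the model in the loop.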

You can also use pipe:aws s3 cp s3:// for S3, which is used a lot for training because instance-to-S3 bandwidth is high and reliable. You can search this repo to find examples of its use with S3.
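For the S3 case, a small sketch of what a full shard spec might look like. The bucket name, key prefix, and helper name are hypothetical placeholders. The trailing "-" makes `aws s3 cp` write the object to stdout, which is what the pipe: handler reads from.

```python
def s3_shard_url(bucket: str, prefix: str, first: int, last: int, width: int = 3) -> str:
    """Build a 'pipe:aws s3 cp' shard spec with a zero-padded brace range."""
    brace_range = f"{{{first:0{width}d}..{last:0{width}d}}}"
    # "-" as the destination streams the object to stdout instead of a file.
    return f"pipe:aws s3 cp s3://{bucket}/{prefix}{brace_range}.tar -"


# Hypothetical bucket and prefix, for illustration only.
s3_url = s3_shard_url("my-bucket", "clip/shard-", 0, 111)
print(s3_url)  # pipe:aws s3 cp s3://my-bucket/clip/shard-{000..111}.tar -
```

The resulting string is passed to the dataset constructor in place of a local shard path, and the AWS CLI must be installed and have credentials on the training machine.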

Something like the example in the post linked below, though that one is specific to timm and leverages an _info.json to get the shard names:
https://x.com/wightmanr/status/1743083207443267759
