WebDataset

WebDataset is a PyTorch Dataset (IterableDataset) implementation providing efficient access to datasets stored in POSIX tar archives and uses only sequential/streaming data access. This brings substantial performance advantage in many compute environments, and it is essential for very large scale training.

While WebDataset scales to very large problems, it also works well with smaller datasets and simplifies creation, management, and distribution of training data for deep learning.

WebDataset implements standard PyTorch IterableDataset interface and works with the PyTorch DataLoader. Access to datasets is as simple as:

import webdataset as wds

dataset = wds.WebDataset(url).shuffle(1000).decode("torchrgb").to_tuple("jpg;png", "json")
dataloader = torch.utils.data.DataLoader(dataset, num_workers=4, batch_size=16)

for inputs, outputs in dataloader:
    ...

In that code snippet, url can refer to a local file, a local HTTP server, a cloud storage object, an object on an object store, or even the output of arbitrary command pipelines.

WebDataset fulfills a similar function to Tensorflow's TFRecord/tf.Example classes, but it is much easier to adopt because it does not actually require any kind of data conversion: data is stored in exactly the same format inside tar files as it is on disk, and all preprocessing and data augmentation code remains unchanged.

Documentation

Installation

$ pip install webdataset

For the Github version:

$ pip install git+https://github.com/tmbdev/webdataset.git

Documentation: ReadTheDocs

Introductory Videos

Here are some videos talking about WebDataset and large scale deep learning:

Related Libraries and Software

The AIStore server provides an efficient backend for WebDataset; it functions like a combination of web server, content distribution network, P2P network, and distributed file system. Together, AIStore and WebDataset can serve input data from rotational drives distributed across many servers at the speed of local SSDs to many GPUs, at a fraction of the cost. We can easily achieve hundreds of MBytes/s of I/O per GPU even in large, distributed training jobs.

The tarproc utilities provide command line manipulation and processing of webdatasets and other tar files, including splitting, concatenation, and xargs-like functionality.

The tensorcom library provides fast three-tiered I/O; it can be inserted between AIStore and WebDataset to permit distributed data augmentation and I/O. It is particularly useful when data augmentation requires more CPU than the GPU server has available.

You can find the full PyTorch ImageNet sample code converted to WebDataset at tmbdev/pytorch-imagenet-wds

Name		Name	Last commit message	Last commit date
Latest commit History 484 Commits
.github/workflows		.github/workflows
docs		docs
docsrc		docsrc
notebooks		notebooks
testdata		testdata
webdataset		webdataset
.deepsource.toml		.deepsource.toml
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
VERSION		VERSION
conftest.py		conftest.py
environment.yml		environment.yml
mkdocs.yml		mkdocs.yml
mypy.ini		mypy.ini
pytest.ini		pytest.ini
readme.ipynb		readme.ipynb
requirements.dev.txt		requirements.dev.txt
requirements.txt		requirements.txt
run-jupyterlab		run-jupyterlab
setup.py		setup.py
tasks.py		tasks.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

WebDataset

Documentation

Installation

Introductory Videos

Related Libraries and Software

About

Releases

Packages

Languages

License

DesignStripe/webdataset

Folders and files

Latest commit

History

Repository files navigation

WebDataset

Documentation

Installation

Introductory Videos

Related Libraries and Software

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages