TorchData (see note below on current status)

What is TorchData? | Stateful DataLoader | Install guide | Contributing | License

⚠️ June 2024 Status Update: Removing DataPipes and DataLoader V2

We are re-focusing the torchdata repo to be an iterative enhancement of torch.utils.data.DataLoader. We do not plan to continue developing or maintaining the DataPipes and DataLoaderV2 solutions, and they will be removed from the torchdata repo. We'll also be revisiting the DataPipes references in pytorch/pytorch. In release torchdata==0.8.0 (July 2024) they will be marked as deprecated, and sometime after 0.9.0 (Oct 2024) they will be deleted. Existing users are advised to pin to torchdata<=0.9.0 (that is, 0.9.0 or an older version) until they are able to migrate away. Subsequent releases will not include DataPipes or DataLoaderV2. The old version of this README is available here. Please reach out if you have suggestions or comments (please use #1196 for feedback).

What is TorchData?

The TorchData project is an iterative enhancement to the PyTorch torch.utils.data.DataLoader and torch.utils.data.Dataset/IterableDataset to make them scalable, performant dataloading solutions. We will be iterating on the enhancements under the torchdata repo.

Our first change adds checkpointing to torch.utils.data.DataLoader. It lives in stateful_dataloader, a drop-in replacement for torch.utils.data.DataLoader that defines load_state_dict and state_dict methods to enable mid-epoch checkpointing. It also provides an API for users to track custom iteration progress and other custom state from the dataloader workers, such as token buffers and/or RNG states.

Stateful DataLoader

torchdata.stateful_dataloader.StatefulDataLoader is a drop-in replacement for torch.utils.data.DataLoader which provides state_dict and load_state_dict functionality. See the Stateful DataLoader main page for more information and examples. Also check out the examples in this Colab notebook.
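To make the state_dict / load_state_dict contract concrete, here is a minimal pure-Python sketch of mid-epoch checkpointing. The class `ToyStatefulLoader` and its `next_index` state key are hypothetical and exist only for illustration; this is not how StatefulDataLoader is implemented, and the real state dictionary will look different.

```python
from typing import Any, Dict, Iterator, List, Sequence


class ToyStatefulLoader:
    """Toy stand-in (NOT the torchdata implementation) for a loader that can
    be checkpointed and resumed in the middle of an epoch."""

    def __init__(self, data: Sequence[int], batch_size: int) -> None:
        self.data = data
        self.batch_size = batch_size
        self._next_index = 0  # position of the next unread sample

    def __iter__(self) -> Iterator[List[int]]:
        while self._next_index < len(self.data):
            start = self._next_index
            batch = list(self.data[start:start + self.batch_size])
            self._next_index = start + len(batch)
            yield batch
        self._next_index = 0  # epoch finished: the next epoch starts fresh

    def state_dict(self) -> Dict[str, Any]:
        # Everything needed to resume; here just the read position.
        return {"next_index": self._next_index}

    def load_state_dict(self, state: Dict[str, Any]) -> None:
        self._next_index = state["next_index"]


# Consume two batches, then checkpoint in the middle of the epoch.
loader = ToyStatefulLoader(list(range(10)), batch_size=2)
seen = []
checkpoint = None
for step, batch in enumerate(loader):
    seen.append(batch)
    if step == 1:
        checkpoint = loader.state_dict()
        break

# A fresh loader restored from the checkpoint continues where the first stopped.
resumed = ToyStatefulLoader(list(range(10)), batch_size=2)
resumed.load_state_dict(checkpoint)
remaining = list(resumed)
```

With the real library the calling pattern is the same: construct torchdata.stateful_dataloader.StatefulDataLoader where you would construct torch.utils.data.DataLoader, save the result of state_dict() alongside your model checkpoint, and pass it to load_state_dict() on a newly constructed loader before iterating.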

torchdata.nodes

torchdata.nodes is a library of composable iterators (not iterables!) that let you chain together common dataloading and preprocessing operations. It follows a streaming programming model, although "sampler + Map-style" can still be configured if you desire. See the torchdata.nodes main page for more details. Stay tuned for a tutorial on torchdata.nodes, coming soon!
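The "composable iterators" idea can be sketched in plain Python. The classes `Source`, `Mapper`, and `Batcher` below are hypothetical names chosen for this illustration; they are not the torchdata.nodes API, whose actual classes and signatures are documented on its main page. The sketch only shows the streaming model: each stage pulls one item at a time from the stage upstream of it.

```python
from typing import Any, Callable, Iterable, Iterator, List


class Source:
    """Exposes any iterable as a single-pass iterator."""

    def __init__(self, items: Iterable[Any]) -> None:
        self._it = iter(items)

    def __iter__(self) -> "Source":
        return self

    def __next__(self) -> Any:
        return next(self._it)


class Mapper:
    """Applies fn to each item pulled from the upstream iterator."""

    def __init__(self, upstream: Iterator[Any], fn: Callable[[Any], Any]) -> None:
        self._upstream = upstream
        self._fn = fn

    def __iter__(self) -> "Mapper":
        return self

    def __next__(self) -> Any:
        return self._fn(next(self._upstream))


class Batcher:
    """Groups upstream items into lists of up to batch_size items."""

    def __init__(self, upstream: Iterator[Any], batch_size: int) -> None:
        self._upstream = upstream
        self._batch_size = batch_size

    def __iter__(self) -> "Batcher":
        return self

    def __next__(self) -> List[Any]:
        batch: List[Any] = []
        for _ in range(self._batch_size):
            try:
                batch.append(next(self._upstream))
            except StopIteration:
                break  # upstream exhausted: emit a final partial batch
        if not batch:
            raise StopIteration
        return batch


# Chain the stages: square each number, then batch in pairs.
pipeline = Batcher(Mapper(Source(range(5)), lambda x: x * x), batch_size=2)
first = next(pipeline)
rest = list(pipeline)
again = list(pipeline)  # an iterator is single-pass, so nothing is left
```

Because each stage is an iterator rather than an iterable, the pipeline carries its own position: once it is exhausted, iterating it again yields nothing until a new pipeline is built.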

Installation

Version Compatibility

The following table lists the torchdata version that corresponds to each torch version, along with the supported Python versions.

`torch`	`torchdata`	`python`
`master` / `nightly`	`main` / `nightly`	`>=3.9`, `<=3.12` (`3.13` experimental)
`2.5.0`	`0.10.0`	`>=3.9`, `<=3.12`
`2.5.0`	`0.9.0`	`>=3.9`, `<=3.12`
`2.4.0`	`0.8.0`	`>=3.8`, `<=3.12`
`2.0.0`	`0.6.0`	`>=3.8`, `<=3.11`
`1.13.1`	`0.5.1`	`>=3.7`, `<=3.10`
`1.12.1`	`0.4.1`	`>=3.7`, `<=3.10`
`1.12.0`	`0.4.0`	`>=3.7`, `<=3.10`
`1.11.0`	`0.3.0`	`>=3.7`, `<=3.10`
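If you want to check a pairing programmatically, the tagged releases in the table can be encoded as a small lookup. This is only a sketch: the name `matching_torchdata` is hypothetical, the mapping is copied by hand from the table above, and it will go stale as new releases appear.

```python
from typing import Dict, List

# torch release -> torchdata releases, copied from the table above
# (tagged releases only; nightly builds are not covered).
TORCHDATA_FOR_TORCH: Dict[str, List[str]] = {
    "2.5.0": ["0.10.0", "0.9.0"],
    "2.4.0": ["0.8.0"],
    "2.0.0": ["0.6.0"],
    "1.13.1": ["0.5.1"],
    "1.12.1": ["0.4.1"],
    "1.12.0": ["0.4.0"],
    "1.11.0": ["0.3.0"],
}


def matching_torchdata(torch_version: str) -> List[str]:
    """Return the torchdata releases listed for a torch version.

    A local build suffix such as "+cu121" is ignored. An empty list
    means the version does not appear in the table.
    """
    base = torch_version.split("+", 1)[0]
    return TORCHDATA_FOR_TORCH.get(base, [])


candidates = matching_torchdata("2.5.0+cu121")
```

In practice you would pass `torch.__version__` as the argument and pin one of the returned versions in your requirements file.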

Local pip or conda

First, set up an environment. We will be installing a PyTorch binary as well as torchdata. If you're using conda, create a conda environment:

conda create --name torchdata
conda activate torchdata

If you wish to use venv instead:

python -m venv torchdata-env
source torchdata-env/bin/activate

Install torchdata:

Using pip:

pip install torchdata

Using conda:

conda install -c pytorch torchdata

From source

pip install .

In case building TorchData from source fails, install the nightly version of PyTorch following the linked guide on the contributing page.

From nightly

A nightly version of TorchData is also provided, updated daily from the main branch.

Using pip:

pip install --pre torchdata --index-url https://download.pytorch.org/whl/nightly/cpu

Using conda:

conda install torchdata -c pytorch-nightly

Contributing

We welcome PRs! See the CONTRIBUTING file.

Beta Usage and Feedback

We'd love to hear from and work with early adopters to shape our designs. Please reach out by raising an issue if you're interested in using this tooling for your project.

License

TorchData is BSD licensed, as found in the LICENSE file.
