Download Script for the-stack-v2 Dataset

Introduction

the-stack-v2 is the training dataset for StarCoder2. However, the StarCoder project publishes only the metadata of its training dataset, so the file contents must be downloaded separately.

This repository implements concurrent downloading and packaging of the downloaded files into Parquet datasets, based on huangyangyu/starcoder_data.
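As a rough illustration of the approach (not the code in download_the_stack_v2.py), the sketch below fetches a list of files with a thread pool and collects the results as records that can be written to Parquet. The names `download_all`, `write_parquet`, and `fake_fetch` are hypothetical, and `fake_fetch` stands in for a real network call so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, List


def download_all(
    blob_ids: Iterable[str],
    fetch: Callable[[str], bytes],
    max_workers: int = 8,
) -> List[Dict[str, object]]:
    """Fetch every blob concurrently and return one record per success."""
    records: List[Dict[str, object]] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its blob ID so failures can be reported.
        futures = {pool.submit(fetch, blob_id): blob_id for blob_id in blob_ids}
        for future in as_completed(futures):
            blob_id = futures[future]
            try:
                content = future.result()
            except Exception as exc:  # one failed file should not stop the run
                print(f"skipping {blob_id}: {exc}")
                continue
            records.append({"blob_id": blob_id, "content": content})
    return records


def write_parquet(records: List[Dict[str, object]], path: str) -> None:
    """Write the records to a Parquet file (requires pandas and pyarrow)."""
    import pandas as pd  # imported lazily so download_all works without it

    pd.DataFrame(records).to_parquet(path, index=False)


def fake_fetch(blob_id: str) -> bytes:
    """Hypothetical stand-in for a real download, used only for this demo."""
    if blob_id == "bad":
        raise IOError("simulated download failure")
    return f"# contents of {blob_id}\n".encode("utf-8")


if __name__ == "__main__":
    demo = download_all(["a1", "b2", "bad", "c3"], fake_fetch, max_workers=4)
    print(f"downloaded {len(demo)} files")
    # write_parquet(demo, "demo.parquet")  # uncomment if pandas/pyarrow are installed
```

Threads suit this workload because each download spends most of its time waiting on the network, which is also why a large worker count can help.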

Usage

Use the following commands to download the dataset. Pass your Hugging Face access token through the --hug_access_token parameter, and make sure the token has read permission for the the-stack-v2 dataset. Tune the --max_workers parameter to suit your environment.

python -m venv venv
source venv/bin/activate # activate the env; the exact command depends on your system
pip install boto3 botocore smart_open datasets tqdm pandas pyarrow

python download_the_stack_v2.py \
  --hug_access_token {your_huggingface_access_token} \
  --language Python \
  --download_folder {output_parquet_dir} \
  --max_workers 256
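For readers who want to see how these flags fit together, here is a minimal sketch of a parser that accepts the same four options. It is an assumption about how such a command line could be handled, not the script's actual code: the function names, help texts, and default values are hypothetical, with the defaults simply mirroring the example command above.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the four flags used in the command above."""
    parser = argparse.ArgumentParser(
        description="Download one language subset of the-stack-v2."
    )
    parser.add_argument(
        "--hug_access_token",
        required=True,
        help="Hugging Face access token with read permission for the dataset",
    )
    parser.add_argument(
        "--language",
        default="Python",  # assumed default, taken from the example command
        help="language subset to download",
    )
    parser.add_argument(
        "--download_folder",
        required=True,
        help="directory that receives the Parquet output",
    )
    parser.add_argument(
        "--max_workers",
        type=int,
        default=256,  # assumed default, taken from the example command
        help="number of concurrent download workers",
    )
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the arguments and reject a non-positive worker count."""
    args = build_parser().parse_args(argv)
    if args.max_workers < 1:
        raise ValueError("--max_workers must be a positive integer")
    return args


if __name__ == "__main__":
    print(parse_args())
```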

Note

This script is still being tested. It can download the Python subset (~1 TB, 47,272,886 files) in approximately 10 hours.
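To put those figures in perspective, the short calculation below derives the implied average throughput. It assumes "~1TB" means 10**12 bytes and that the run takes exactly 10 hours, so the results are estimates: roughly 1,300 files per second and about 28 MB/s. The function name `throughput` is only illustrative.

```python
from typing import Dict

TOTAL_FILES = 47_272_886           # file count quoted above
TOTAL_BYTES = 1_000_000_000_000    # assumption: "~1TB" taken as 10**12 bytes
DURATION_SECONDS = 10 * 60 * 60    # assumption: exactly 10 hours


def throughput(total_files: int, total_bytes: int, seconds: int) -> Dict[str, float]:
    """Return the average rates implied by a finished download."""
    return {
        "files_per_second": total_files / seconds,
        "megabytes_per_second": total_bytes / seconds / 1e6,
        "average_file_bytes": total_bytes / total_files,
    }


if __name__ == "__main__":
    for name, value in throughput(TOTAL_FILES, TOTAL_BYTES, DURATION_SECONDS).items():
        print(f"{name}: {value:,.1f}")
```

The small average file size (around 21 KB) suggests the run is dominated by per-request overhead rather than bandwidth, which is consistent with using a high worker count.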
