A data download script for the-stack-v2, the training data of StarCoder2. This repo implements concurrent downloading and efficiently packs tens of millions of small downloaded source files into consolidated Parquet datasets.

The-Hierophant/the_stack_v2_downloader

Download Script for the-stack-v2 Dataset

Introduction

the-stack-v2 is the training data of StarCoder2. However, StarCoder2 publishes only the metadata of its training dataset, so the file contents have to be downloaded separately.
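Because only metadata is published, each file's contents must be fetched by its identifier. The sketch below illustrates that step offline. It assumes the layout described on the dataset card, which this README does not state: each metadata row carries a `blob_id` and a `src_encoding`, and the contents are stored as gzip-compressed objects under `s3://softwareheritage/content/`. The helper names `blob_url` and `decode_blob` are hypothetical and are not taken from this repository.

```python
import gzip

# Assumption: bucket layout taken from the dataset card, not from this README.
S3_PREFIX = "s3://softwareheritage/content/"


def blob_url(blob_id: str) -> str:
    """Build the object URL for one file from its metadata row."""
    return S3_PREFIX + blob_id


def decode_blob(raw: bytes, src_encoding: str) -> str:
    """Decompress a gzip-compressed blob and decode it to text."""
    return gzip.decompress(raw).decode(src_encoding)


# Offline demonstration with a locally built blob instead of a real S3 object.
row = {"blob_id": "0123abcd", "src_encoding": "UTF-8"}  # made-up metadata row
fake_object = gzip.compress("print('hello')\n".encode("utf-8"))

print(blob_url(row["blob_id"]))
print(decode_blob(fake_object, row["src_encoding"]))
```

In a real run, the bytes would come from an S3 client (for example through `boto3` and `smart_open`, both of which appear in the install command) rather than from `gzip.compress`.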

This repository implements concurrent downloading and packaging of the downloaded files into Parquet datasets, based on huangyangyu/starcoder_data.
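The general pattern of concurrent downloading plus packaging into consolidated shards can be sketched as follows. This is a minimal illustration, not the repository's implementation: `download_in_shards`, `fetch`, `write_shard`, and `shard_size` are hypothetical names, and the demonstration uses a fake fetch function and an in-memory writer so that it runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Callable, Iterable, Iterator


def batched(items: Iterable[dict], size: int) -> Iterator[list[dict]]:
    """Yield consecutive lists of at most `size` metadata rows."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


def download_in_shards(
    rows: Iterable[dict],
    fetch: Callable[[dict], dict],
    write_shard: Callable[[int, list[dict]], None],
    max_workers: int = 8,
    shard_size: int = 1000,
) -> int:
    """Fetch rows concurrently and hand each completed batch to `write_shard`.

    Returns the number of shards written.
    """
    shards = 0
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for index, batch in enumerate(batched(rows, shard_size)):
            # pool.map runs the fetches in worker threads and keeps input order.
            records = list(pool.map(fetch, batch))
            write_shard(index, records)
            shards += 1
    return shards


# Offline demonstration: a fake fetch and an in-memory "writer".
def fake_fetch(row: dict) -> dict:
    return {**row, "content": f"# file {row['blob_id']}"}


written: dict[int, list[dict]] = {}


def keep_in_memory(index: int, records: list[dict]) -> None:
    written[index] = records


rows = [{"blob_id": str(i)} for i in range(25)]
count = download_in_shards(rows, fake_fetch, keep_in_memory, max_workers=4, shard_size=10)
print(count, [len(v) for v in written.values()])
```

Grouping many small files into one shard avoids creating tens of millions of tiny files on disk. In a real run, `write_shard` could build a `pandas.DataFrame` from the records and call `DataFrame.to_parquet` with a per-shard file name.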

Usage

Use the following commands to download the dataset. Pass your Hugging Face access token through the --hug_access_token parameter, and make sure the token has read permission for the-stack-v2. Tune the --max_workers parameter to suit your environment.

python -m venv venv
source venv/bin/activate # activate the env based on your system
pip install boto3 botocore smart_open datasets tqdm pandas pyarrow

python download_the_stack_v2.py \
  --hug_access_token {your_huggingface_access_token} \
  --language Python \
  --download_folder {output_parquet_dir} \
  --max_workers 256

Note

This script is still being tested. It can download the Python subset (~1 TB, 47,272,886 files) in approximately 10 hours.
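The figures above imply a sustained throughput that can serve as a sanity check on your own run. The arithmetic below is a rough estimate under stated assumptions: 1 TB is taken as 10^12 bytes, the run is taken as exactly 10 hours, and the worker count of 256 is borrowed from the usage example, which may not be the setting used for the quoted timing.

```python
FILES = 47_272_886     # file count quoted in the note
TOTAL_BYTES = 1e12     # "~1TB", taken as 10**12 bytes (assumption)
HOURS = 10             # "approximately 10 hours", taken as exact (assumption)
WORKERS = 256          # from the usage example (assumption for this run)

seconds = HOURS * 3600
files_per_second = FILES / seconds
megabytes_per_second = TOTAL_BYTES / seconds / 1e6
avg_file_kilobytes = TOTAL_BYTES / FILES / 1e3
files_per_worker_second = files_per_second / WORKERS

print(f"{files_per_second:.0f} files/s overall")
print(f"{megabytes_per_second:.1f} MB/s overall")
print(f"{avg_file_kilobytes:.1f} kB average file size")
print(f"{files_per_worker_second:.2f} files/s per worker")
```

If your observed rate is far below these values, the network link or the --max_workers setting is a likely place to look first.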
