Prepare Datasets for Training

Supported datasets:

  • MDCC
  • AISHELL-1
  • THCHS-30
  • MAGICDATA Mandarin Chinese Read Speech Corpus

Download and Extract MDCC Dataset

sh mdcc.sh

Cantonese-ASR: Yu, Tiezheng, Frieske, Rita, Xu, Peng, Cahyawijaya, Samuel, Yiu, Cheuk Tung, Lovenia, Holy, Dai, Wenliang, Barezi, Elham, Chen, Qifeng, Ma, Xiaojuan, Shi, Bertram, Fung, Pascale (2022). "Automatic Speech Recognition Datasets in Cantonese: A Survey and New Dataset". Link: https://arxiv.org/pdf/2201.02419.pdf

Download and Extract AISHELL-1 Dataset

sh aishell_1.sh

Download and Extract THCHS-30 Dataset

sh thchs_30.sh

Download and Extract MAGICDATA Mandarin Chinese Dataset

sh magicdata_mcrsc.sh
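
To prepare several datasets in one go, the four commands above could be wrapped in a small loop. The following is only a sketch: the `prepare_all` function is a hypothetical helper that is not part of this repository, and it assumes the scripts sit in the current directory and take no arguments. Scripts that are not found are skipped, and the run stops at the first script that fails.

```sh
#!/bin/sh
# Hypothetical helper (not part of the repository): run each dataset
# script named in this README in turn, stopping at the first failure.
set -eu

prepare_all() {
    for script in mdcc.sh aishell_1.sh thchs_30.sh magicdata_mcrsc.sh; do
        if [ -f "$script" ]; then
            echo "Running $script"
            sh "$script"
        else
            # Assumption: a missing script is skipped rather than treated as an error.
            echo "Skipping $script (not found)" >&2
        fi
    done
}

prepare_all
```

Run it from the directory that contains the scripts. Each script downloads and extracts a full corpus, so it may be worth checking available disk space first.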