Audio Datasets

Mozilla’s Common Voice dataset

Collection:
- Over 500 hours of speech from 20,000 different speakers.
- Aligned mainly at the sentence level.
- Recorded by volunteers reading prompted sentences.
- Collected through a web application.
License: CC0 (public domain)
Language: multilingual
Media:
Link: https://voice.mozilla.org/zh-TW

LibriSpeech

Collection:
- About 1,000 hours of read English speech
- Aligned at the sentence level only; no word-level alignment is provided.
License: CC BY 4.0
Media: FLAC files
Link: http://www.openslr.org/12/
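Since LibriSpeech is aligned only at the sentence level, each transcript line pairs one utterance ID with one full sentence and carries no word timings. A minimal sketch of reading such lines, assuming the `<speaker>-<chapter>-<utterance> TRANSCRIPT` layout of LibriSpeech's `*.trans.txt` files; `Utterance` and `parse_trans_lines` are hypothetical names, not part of any official tooling.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Utterance:
    speaker_id: str
    chapter_id: str
    utterance_id: str  # full ID, e.g. "84-121123-0000"
    transcript: str    # the whole sentence; no word-level timing


def parse_trans_lines(lines: Iterable[str]) -> List[Utterance]:
    """Parse lines of a LibriSpeech-style *.trans.txt file.

    Assumed layout (one utterance per line):
        <speaker>-<chapter>-<index> <TRANSCRIPT>
    """
    utterances = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # The ID is everything before the first space; the rest is the text.
        utt_id, _, text = line.partition(" ")
        speaker, chapter, _index = utt_id.split("-")
        utterances.append(Utterance(speaker, chapter, utt_id, text))
    return utterances
```

Each returned item maps one audio file to one sentence, which is as fine-grained as the dataset's own alignment goes.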

OpenSLR

Summary: A repository hosting many open speech and language resources (corpora, lexicons, and software), including LibriSpeech.
Link: http://www.openslr.org/

TIDIGITS

Collection:
- About 25,000 spoken digit sequences
- About 300 speakers
- Recorded in a quiet room by paid contributors
License: Commercial license from the Linguistic Data Consortium (LDC)
Media: NIST SPHERE
Link: https://catalog.ldc.upenn.edu/LDC93S10
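NIST SPHERE files start with a plain-text header followed by the audio samples. A minimal sketch for reading the header fields (such as `sample_rate`), assuming the usual layout: a `NIST_1A` line, a line giving the header length in bytes, `name -type value` lines, and a closing `end_head`. `parse_sphere_header` is a hypothetical helper; it only parses the header and does not decode the samples.

```python
from typing import Dict, Union

HeaderValue = Union[int, float, str]


def parse_sphere_header(data: bytes) -> Dict[str, HeaderValue]:
    """Parse the ASCII header at the start of a NIST SPHERE file.

    Assumed layout: a "NIST_1A" line, a line with the header size in bytes,
    then "name -type value" lines, terminated by "end_head".
    """
    magic, size_line, _rest = data.split(b"\n", 2)
    if magic.strip() != b"NIST_1A":
        raise ValueError("not a NIST SPHERE file")
    header_size = int(size_line.strip())
    # Only the first header_size bytes are header; audio samples follow.
    text = data[:header_size].decode("ascii", errors="replace")
    fields: Dict[str, HeaderValue] = {}
    for line in text.split("\n")[2:]:  # skip the magic and size lines
        line = line.strip()
        if line == "end_head":
            break
        if not line:
            continue
        name, type_code, value = line.split(" ", 2)
        if type_code == "-i":      # integer field
            fields[name] = int(value)
        elif type_code == "-r":    # real-valued field
            fields[name] = float(value)
        else:                      # string field, e.g. "-s3 pcm"
            fields[name] = value
    return fields
```

The function needs the whole header in `data`, so pass at least as many bytes as the size line declares (headers are commonly 1,024 bytes).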

CHiME

Collection:
- 50 hours
- Aligned at the sentence level
Media: 16 kHz WAV files
License: Restricted License
Link: http://spandh.dcs.shef.ac.uk/chime_challenge/index.html
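To check how many hours a local copy of a WAV corpus such as CHiME actually contains, one can sum the frame counts stored in each file's header. A minimal sketch using Python's standard `wave` module; `total_hours` is a hypothetical helper, and it assumes uncompressed PCM WAV files, which `wave` requires.

```python
import wave
from typing import BinaryIO, Iterable, Union

Source = Union[str, BinaryIO]  # a file path or an open binary file object


def total_hours(sources: Iterable[Source]) -> float:
    """Sum the durations of PCM WAV files and return the total in hours."""
    total_seconds = 0.0
    for src in sources:
        with wave.open(src, "rb") as wav:
            # Duration in seconds = number of frames / frames per second.
            total_seconds += wav.getnframes() / wav.getframerate()
    return total_seconds / 3600
```

Usage, with a hypothetical directory name: `total_hours(glob.glob("chime_data/**/*.wav", recursive=True))`.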

Open Speech Recording (Google)

Collection:
- 105,829 WAV files (16-bit single-channel PCM, 16 kHz sample rate)
- 35 words
- One word per file, about 1 second each
Link: https://aiyprojects.withgoogle.com/open_speech_recording
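The format listed above (mono, 16-bit PCM, 16 kHz, about one second per word) can be verified per file before training. A minimal sketch using the standard `wave` module; `check_command_clip` is a hypothetical helper, and treating one second as an upper bound (shorter clips are not flagged) is an assumption.

```python
import wave
from typing import BinaryIO, List, Union


def check_command_clip(src: Union[str, BinaryIO],
                       rate: int = 16000,
                       max_seconds: float = 1.0) -> List[str]:
    """Return a list of problems; an empty list means the clip matches."""
    problems = []
    with wave.open(src, "rb") as wav:
        if wav.getnchannels() != 1:
            problems.append(f"expected mono, got {wav.getnchannels()} channels")
        if wav.getsampwidth() != 2:  # 2 bytes per sample = 16-bit PCM
            problems.append(f"expected 16-bit, got {8 * wav.getsampwidth()}-bit")
        if wav.getframerate() != rate:
            problems.append(f"expected {rate} Hz, got {wav.getframerate()} Hz")
        duration = wav.getnframes() / wav.getframerate()
        if duration > max_seconds:
            problems.append(f"clip is {duration:.2f} s, longer than {max_seconds} s")
    return problems
```

Returning a list of messages instead of raising lets a script scan the whole collection and report every mismatch at once.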

UrbanSound8K

Collection:
- 8,732 labeled sound excerpts (each up to 4 s long)
- Urban environmental sounds in 10 classes
Link: https://urbansounddataset.weebly.com/urbansound8k.html
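Because the excerpts vary in length up to 4 s, models that expect fixed-size inputs usually need each clip padded or trimmed to a common length. A minimal sketch, assuming the samples are already decoded into a list of floats at a known sample rate; `fix_length` is a hypothetical helper, and zero-padding at the end is one choice among several (repeating or centering the clip are alternatives).

```python
from typing import List, Sequence


def fix_length(samples: Sequence[float], sample_rate: int,
               seconds: float = 4.0) -> List[float]:
    """Zero-pad or trim a clip so it is exactly `seconds` long."""
    target = int(round(seconds * sample_rate))
    clip = list(samples[:target])              # trim anything past the target
    clip.extend([0.0] * (target - len(clip)))  # pad short clips with silence
    return clip
```

For example, `fix_length([1.0, 2.0], sample_rate=2, seconds=2.0)` returns `[1.0, 2.0, 0.0, 0.0]`.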