YouTube ASR Crawler

Usage

Each line in playlists.txt has the format [url] [playlist_name]; the audio and subtitle files are saved to [data_dir]/[playlist_name]/[playlist_index].{wav,vtt}.
- bash crawl_playlist.sh [data_dir] playlists.txt
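As an illustration of the layout described above, here is a minimal Python sketch that splits one playlist line and builds the expected output paths. The helper names `parse_playlist_line` and `output_paths` are hypothetical and not part of this repository, and the exact formatting of [playlist_index] (for example, zero padding) is an assumption; check the files that crawl_playlist.sh actually writes.

```python
import os


def parse_playlist_line(line):
    """Split a '[url] [playlist_name]' line into its two fields."""
    url, playlist_name = line.split(maxsplit=1)
    return url, playlist_name.strip()


def output_paths(data_dir, playlist_name, playlist_index):
    """Return the (wav, vtt) paths for one item of a playlist.

    Assumption: the index is used verbatim as the file stem.
    """
    base = os.path.join(data_dir, playlist_name, str(playlist_index))
    return base + ".wav", base + ".vtt"
```

Such a helper can be handy for checking that every expected .wav has a matching .vtt before running process.py.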
process.py then parses the .vtt files and converts them into the Kaldi-style data format.
- python process.py [data_dir] dir/to/{text,segments,wav.scp,utt2spk} [--merge-consecutive-segments]
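To make the conversion concrete, the following is a rough sketch of how WebVTT cues could be turned into Kaldi `segments` and `text` lines. It is not the code in process.py: the function names, the utterance-ID scheme (`<recording-id>-<4-digit index>`), and the tag-stripping rule are all assumptions made for illustration. Only the Kaldi line formats (`<utt-id> <rec-id> <start> <end>` for segments, `<utt-id> <transcript>` for text) are standard.

```python
import re

# WebVTT cue timing line, e.g. "00:00:01.500 --> 00:00:04.000" (hours optional).
CUE_RE = re.compile(
    r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})\s+-->\s+"
    r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})"
)


def to_seconds(h, m, s, ms):
    """Convert timestamp fields (hours may be None) to seconds."""
    return int(h or 0) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0


def parse_vtt(vtt_text):
    """Return a list of (start, end, text) cues from WebVTT content."""
    cues = []
    lines = vtt_text.splitlines()
    i = 0
    while i < len(lines):
        match = CUE_RE.match(lines[i].strip())
        i += 1
        if not match:
            continue  # header, blank line, or cue identifier
        g = match.groups()
        start, end = to_seconds(*g[:4]), to_seconds(*g[4:])
        payload = []
        # The cue payload runs until the next blank line.
        while i < len(lines) and lines[i].strip():
            payload.append(lines[i].strip())
            i += 1
        # Assumption: drop inline markup such as <c> or <00:00:02.000>.
        text = re.sub(r"<[^>]+>", "", " ".join(payload)).strip()
        if text:
            cues.append((start, end, text))
    return cues


def kaldi_lines(rec_id, cues):
    """Build Kaldi 'segments' and 'text' lines for one recording."""
    segments, text = [], []
    for n, (start, end, words) in enumerate(cues):
        utt_id = f"{rec_id}-{n:04d}"  # hypothetical ID scheme
        segments.append(f"{utt_id} {rec_id} {start:.2f} {end:.2f}")
        text.append(f"{utt_id} {words}")
    return segments, text
```

In the same spirit, wav.scp would map each recording ID to its .wav path and utt2spk would map each utterance ID to a speaker ID; without speaker labels, one common fallback is to use the recording ID as the speaker.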
The timestamps of some closed captions are not very accurate, so you may want to pass --merge-consecutive-segments and then run something like segment_long_utterances.sh to obtain more accurate timestamps for each segment.
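The idea behind merging can be sketched as follows. This is only a guess at the behaviour of --merge-consecutive-segments, not the actual logic in process.py: the function name `merge_consecutive` and the `max_gap` threshold are hypothetical. Cues are (start, end, text) tuples sorted by start time, and a cue is folded into the previous one when it starts no later than `max_gap` seconds after the previous cue ends.

```python
def merge_consecutive(cues, max_gap=0.0):
    """Merge cues whose gap to the previous cue is at most max_gap seconds."""
    merged = []
    for start, end, text in cues:
        if merged and start - merged[-1][1] <= max_gap:
            prev_start, prev_end, prev_text = merged[-1]
            # Extend the previous segment and append the new words.
            merged[-1] = (prev_start, max(prev_end, end), prev_text + " " + text)
        else:
            merged.append((start, end, text))
    return merged
```

The merged segments are longer, so unreliable boundaries between individual captions matter less, and a later resegmentation pass can recover finer timestamps.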
