- Each line in `drama_list.txt` has the format `[url] [playlist_name]`; the WAV files and subtitle files are saved to `[data_dir]/[playlist_name]/[playlist_index].{wav,vtt}`.
  `bash crawl_playlist.sh [data_dir] playlists.txt`
- `process.py` then reads the `.vtt` files and converts them into Kaldi-style data format.
  `python process.py [data_dir] dir/to/{text,segments,wav.scp,utt2spk} [--merge-consecutive-segments]`
- The timestamps of some closed captions are not very accurate, so you might want to pass `--merge-consecutive-segments` and then run something like `segment_long_utterances.sh` to obtain more accurate timestamps for each segment.
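
To illustrate what merging consecutive segments and writing Kaldi-style `segments` and `text` lines can look like, here is a minimal sketch. It is not the actual code in `process.py`: the `Segment` class, the `max_gap` threshold, and the utterance-id scheme are all hypothetical choices made for this example. Only the line layouts (`<utt-id> <rec-id> <start> <end>` for `segments`, `<utt-id> <transcript>` for `text`) follow the usual Kaldi convention.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    """One caption cue (hypothetical container, not from process.py)."""
    start: float  # seconds
    end: float    # seconds
    text: str


def merge_consecutive(segments: List[Segment], max_gap: float = 0.5) -> List[Segment]:
    """Merge neighbouring captions whose gap is at most max_gap seconds.

    max_gap is an assumed threshold chosen for this sketch.
    """
    merged: List[Segment] = []
    for seg in sorted(segments, key=lambda s: s.start):
        if merged and seg.start - merged[-1].end <= max_gap:
            last = merged[-1]
            # Extend the previous segment and join the transcripts.
            merged[-1] = Segment(last.start, max(last.end, seg.end),
                                 f"{last.text} {seg.text}")
        else:
            merged.append(Segment(seg.start, seg.end, seg.text))
    return merged


def to_kaldi_lines(rec_id: str, segments: List[Segment]) -> Tuple[List[str], List[str]]:
    """Return (segments_lines, text_lines) in the usual Kaldi layout."""
    seg_lines, text_lines = [], []
    for seg in segments:
        # Assumed utterance-id scheme: recording id plus times in centiseconds.
        utt_id = f"{rec_id}-{round(seg.start * 100):07d}-{round(seg.end * 100):07d}"
        seg_lines.append(f"{utt_id} {rec_id} {seg.start:.2f} {seg.end:.2f}")
        text_lines.append(f"{utt_id} {seg.text}")
    return seg_lines, text_lines


if __name__ == "__main__":
    captions = [
        Segment(0.00, 1.50, "hello"),
        Segment(1.60, 3.00, "world"),
        Segment(5.00, 6.20, "next line"),
    ]
    seg_lines, text_lines = to_kaldi_lines("drama_001", merge_consecutive(captions))
    print("\n".join(seg_lines))
    print("\n".join(text_lines))
```

Merging first yields longer utterances with coarse boundaries, which is why a later resegmentation step such as `segment_long_utterances.sh` is useful for recovering tighter timestamps.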
Chung-I/youtube-asr-crawler