OpenSubtitles

This repository is a collection of scripts that help download and parse the OpenSubtitles corpus.

opensubtitles.sh: downloads, extracts, and merges the 2012/2013 corpora from http://opus.lingfil.uu.se
opensubtitles.py: naive attempt to download a single English subtitle for each IMDb ID; rate-limited to 200 downloads per day
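
The 200-downloads-per-day limit means a full crawl has to be spread over many runs. The following is a minimal sketch of one way to respect such a limit; it is not taken from opensubtitles.py. The names `DailyQuota` and `download_all` and the `fetch` callback are hypothetical, and the README does not say whether the limit is enforced by the site or by the script.

```python
import datetime


class DailyQuota:
    """Count downloads per calendar day and refuse any past the limit."""

    def __init__(self, limit=200):
        self.limit = limit  # 200 per day is the figure quoted in the README
        self.day = None     # date the counter currently applies to
        self.count = 0      # downloads recorded on that date

    def try_acquire(self, today=None):
        """Return True and count one download if the day's quota allows it."""
        today = today or datetime.date.today()
        if today != self.day:
            # A new day has started, so reset the counter.
            self.day = today
            self.count = 0
        if self.count >= self.limit:
            return False
        self.count += 1
        return True


def download_all(imdb_ids, fetch, quota):
    """Call fetch(imdb_id) for each ID until the quota runs out.

    `fetch` is a hypothetical callback that downloads one subtitle.
    Returns the IDs that were not attempted, so a later run can resume.
    """
    remaining = list(imdb_ids)
    while remaining and quota.try_acquire():
        fetch(remaining.pop(0))
    return remaining
```

Returning the IDs that were not attempted lets the caller save them and pick up from the same point the next day. In this sketch the counter lives in memory only, so it resets whenever the process restarts.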

analyze.py: tries to cluster a single year of movie transcripts
explore.py: prints a list of all genres for the given year
load.py: loads all subtitles for a given year into memory; used by all other scripts
xml.py: parses an XML file into a subtitle.txt file
parse.py: finds the IMDb ID corresponding to an OpenSubtitles ID and gets its metadata as JSON
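
As an illustration of the XML-to-text step, here is a minimal sketch that flattens a subtitle XML document into one line of text per sentence. It is not the code in xml.py. It assumes that sentences are `<s>` elements whose tokens are `<w>` children, which is how OPUS subtitle files are commonly laid out. The function names and the `SAMPLE` document are made up for this example.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the assumed layout: <s> sentences containing <w> tokens.
SAMPLE = """<document id="1">
  <s id="1"><time id="T1S" value="00:00:01,000" /><w id="1.1">Hello</w><w id="1.2">there</w><w id="1.3">.</w></s>
  <s id="2"><w id="2.1">Goodbye</w></s>
</document>"""


def sentences_from_root(root):
    """Return one string per <s> element, joining its <w> tokens with spaces."""
    lines = []
    for sentence in root.iter("s"):
        # Assumption: each token is the text of a <w> element; <time> tags are skipped.
        words = [w.text.strip() for w in sentence.iter("w") if w.text and w.text.strip()]
        if words:
            lines.append(" ".join(words))
    return lines


def write_subtitles(xml_path, txt_path):
    """Parse an XML file and write its sentences to a text file, one per line."""
    lines = sentences_from_root(ET.parse(xml_path).getroot())
    with open(txt_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(lines) + "\n")
    return len(lines)


if __name__ == "__main__":
    for line in sentences_from_root(ET.fromstring(SAMPLE)):
        print(line)
```

Tokens are joined with single spaces, so punctuation stays separated from the preceding word. Note that a file named xml.py in the working directory can shadow Python's standard `xml` package, so a sketch like this is best saved under a different name.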

Citations

OpenSubtitles: http://www.opensubtitles.org/

Jörg Tiedemann, 2012, Parallel Data, Tools and Interfaces in OPUS. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012)

