This is a utility that collects movie scripts from several sources and builds a database of 2.5k+ movie scripts as .txt
files, along with metadata for the movies.
There are four steps to the whole process:
- Collect scripts from various sources - Scrape websites for scripts in HTML, txt, doc or pdf format
- Collect metadata - Get metadata about the scripts from TMDb and IMDb for additional processing
- Find duplicates from different sources - Automatically group and remove duplicates from different sources.
- Parse scripts - Convert scripts into lines with just the character name and dialogue
The following steps MUST be run in order.
Clone this repository:
git clone https://github.com/Aveek-Saha/Movie-Script-Database.git
cd Movie-Script-Database
First, read the installation instructions for textract.
Then install all dependencies using pip:
pip install -r requirements.txt
Modify the sources you want to download in sources.json. If you want a source to be included, set its value to true; otherwise set it to false.
python get_scripts.py
Collect all the scripts from the sources listed below:
{
"imsdb": "true",
"screenplays": "true",
"scriptsavant": "true",
"dailyscript": "true",
"awesomefilm": "true",
"sfy": "true",
"scriptslug": "true",
"actorpoint": "true",
"scriptpdf": "true"
}
- This might take a while (4+ hrs) depending on your network connection.
- The script takes advantage of parallel processing to speed up the download process.
- If there are missing/incomplete downloads, the script will only download the missing scripts if run again.
- For scripts in PDF or DOC format, the original file is stored in the temp directory.
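The per-source switches in sources.json can be inspected before starting a long download. The following is a minimal sketch, not the code used by get_scripts.py; the function name `enabled_sources` is hypothetical, and accepting real JSON booleans in addition to the `"true"`/`"false"` strings shown above is an assumption.

```python
import json
from pathlib import Path


def enabled_sources(config_path="sources.json"):
    """Return the names of the sources that are switched on."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    # The sample file stores flags as the strings "true"/"false".
    # Accepting JSON booleans as well is an assumption of this sketch.
    return [
        name
        for name, flag in config.items()
        if flag is True or str(flag).strip().lower() == "true"
    ]
```

Calling `enabled_sources()` from the repository root would list the sources that the next run is expected to scrape.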
Collect metadata from TMDb and IMDb:
python get_metadata.py
You'll need an API key to use the TMDb API; see the TMDb documentation for how to get one. Once you have the key, store it in a file called config.py in this format:
tmdb_api_key = "<Your API key>"
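Since a missing or placeholder key would only show up once the metadata step is running, it can help to check config.py up front. This is a sketch under assumptions: the helper `load_tmdb_api_key` is hypothetical and is not part of the repository; it simply runs the file and reads the `tmdb_api_key` variable.

```python
import runpy


def load_tmdb_api_key(config_path="config.py"):
    """Execute config.py and return the value of tmdb_api_key."""
    settings = runpy.run_path(config_path)  # dict of the file's top-level names
    key = settings.get("tmdb_api_key")
    # Reject an absent key or the unchanged placeholder from the template.
    if not key or key == "<Your API key>":
        raise ValueError("tmdb_api_key is missing or still set to the placeholder")
    return key
```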
This step will also combine duplicates, and your final metadata will be in this format:
{
"uniquescriptname": {
"files": [
{
"name": "Duplicate 1",
"source": "Source of the script",
"file_name": "name-of-the-file",
"script_url": "Original link to script",
"size": "size of file"
},
{
"name": "Duplicate 2",
"source": "Source of the script",
"file_name": "name-of-the-file",
"script_url": "Original link to script",
"size": "size of file"
}
],
"tmdb": {
"title": "Title from TMDb",
"release_date": "Date released",
"id": "TMDb ID",
"overview": "Plot summary"
},
"imdb": {
"title": "Title from IMDb",
"release_date": "Year released",
"id": "IMDb ID"
}
}
}
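To see which scripts were found on more than one site, the combined metadata can be scanned for entries whose `files` list has several items. The sketch below assumes the metadata has already been loaded into a dictionary with the structure shown above; the function name is hypothetical.

```python
def scripts_with_duplicates(metadata):
    """Map each script name to its list of sources when several copies exist."""
    duplicates = {}
    for script_name, entry in metadata.items():
        files = entry.get("files", [])
        if len(files) > 1:
            duplicates[script_name] = [item["source"] for item in files]
    return duplicates
```

The result only reports what the metadata step grouped together; it does not decide which copy to keep.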
Run:
python clean_files.py
This removes duplicate files as thoroughly as possible without introducing false positives. The remaining files are stored in the scripts/filtered directory.
A new metadata file is created where only one file exists for each unique script name, in this format:
{
"uniquescriptname": {
"file": {
"name": "Movie name from source",
"source": "Source of the script",
"file_name": "name-of-the-file",
"script_url": "Original link to script",
"size": "size of file"
},
"tmdb": {
"title": "Title from TMDb",
"release_date": "Date released",
"id": "TMDb ID",
"overview": "Plot summary"
},
"imdb": {
"title": "Title from IMDb",
"release_date": "Year released",
"id": "IMDb ID"
}
}
}
The scripts are also cleaned to remove as much as possible of the formatting noise introduced by using OCR to read PDFs.
Run:
python parse_files.py
This will parse the non-duplicate scripts from the previous step. The parsed scripts are put into three folders:
scripts/parsed/tagged: Contains scripts where each line has been tagged. The tags are:
- S = Scene
- N = Scene description
- C = Character
- D = Dialogue
- E = Dialogue metadata
- T = Transition
- M = Metadata
scripts/parsed/dialogue: Contains scripts where each line has the character name followed by a line of dialogue, in this format: C=>D
scripts/parsed/charinfo: Contains a list of each character in the script and the number of lines they have, in this format: C: Number of lines
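The relationship between the dialogue and charinfo outputs can be illustrated by counting lines per character from `C=>D` lines. This is a rough sketch, not the parser's own code: the function names are hypothetical, lines without the `=>` separator are skipped by assumption, and whether the real charinfo files are sorted by line count is not stated in this document.

```python
from collections import Counter


def count_character_lines(dialogue_lines):
    """Count how many dialogue lines each character has in C=>D lines."""
    counts = Counter()
    for line in dialogue_lines:
        # Split on the first "=>" only, so dialogue text may contain "=>" too.
        character, separator, _dialogue = line.partition("=>")
        if not separator:
            continue  # not a C=>D line; skipped (assumption of this sketch)
        counts[character.strip()] += 1
    return counts


def format_charinfo(counts):
    """Render the counts as 'C: Number of lines', most lines first."""
    return [f"{name}: {total}" for name, total in counts.most_common()]
```

For a file in scripts/parsed/dialogue, the lines could be passed in as `open(path, encoding="utf-8").read().splitlines()`.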
A new metadata file is created with the following format:
{
"uniquescriptname": {
"file": {
"name": "Movie name from source",
"source": "Source of the script",
"file_name": "name-of-the-file",
"script_url": "Original link to script",
"size": "size of file"
},
"tmdb": {
"title": "Title from TMDb",
"release_date": "Date released",
"id": "TMDb ID",
"overview": "Plot summary"
},
"imdb": {
"title": "Title from IMDb",
"release_date": "Year released",
"id": "IMDb ID"
},
"parsed": {
"dialogue": "name-of-the-file_dialogue.txt",
"charinfo": "name-of-the-file_charinfo.txt",
"tagged": "name-of-the-file_parsed.txt"
}
}
}
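The `parsed` block makes it possible to locate the output files for a given script. The sketch below assumes that each file name lives in the subfolder of scripts/parsed that matches its key (dialogue, charinfo, tagged); that mapping is inferred from the folder names rather than confirmed, and `parsed_paths` is a hypothetical helper.

```python
from pathlib import Path


def parsed_paths(entry, parsed_dir="scripts/parsed"):
    """Build the expected path of each parsed file for one metadata entry."""
    parsed = entry.get("parsed", {})
    # Assumption: the key ("dialogue", "charinfo", "tagged") names the subfolder.
    return {
        kind: Path(parsed_dir) / kind / file_name
        for kind, file_name in parsed.items()
    }
```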
After running all the steps, your folder structure should look something like this:
scripts
│
├── unprocessed // Scripts from sources
│ ├── source1
│ ├── source2
│ └── source3
│
├── temp // PDF files from sources
│ ├── source1
│ ├── source2
│ └── source3
│
├── metadata // Metadata files from sources/cleaned metadata
│ ├── source1.json
│ ├── source2.json
│ ├── source3.json
│ └── meta.json
│
├── filtered // Scripts with duplicates removed
│
└── parsed // Scripts parsed using the parser
├── dialogue
├── charinfo
└── tagged
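A quick way to confirm that every step ran is to check for the directories in the tree above. This is a minimal sketch; the helper name `missing_dirs` and the `EXPECTED_DIRS` list are illustrative and simply mirror the layout shown here.

```python
from pathlib import Path

# Directories taken from the folder structure shown above.
EXPECTED_DIRS = [
    "unprocessed",
    "temp",
    "metadata",
    "filtered",
    "parsed/dialogue",
    "parsed/charinfo",
    "parsed/tagged",
]


def missing_dirs(root="scripts"):
    """Return the expected subdirectories that do not exist under root."""
    root = Path(root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
```

An empty list from `missing_dirs()` suggests all four steps produced their output folders; it says nothing about whether the folders are complete.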
- IMSDb
- Dailyscript
- Awesomefilm
- Scriptsavant
- Screenplays online
- Scripts for you
- Script Slug
- Actor Point
- Script PDF
Note: Weeklyscript (site no longer active)
If you use The Movie Script Database, please cite:
@misc{Saha_Movie_Script_Database_2021,
author = {Saha, Aveek},
month = {7},
title = {{Movie Script Database}},
url = {https://github.com/Aveek-Saha/Movie-Script-Database},
year = {2021}
}
The code for parsing the movie scripts comes from this paper: Linguistic analysis of differences in portrayal of movie characters, in: Proceedings of Association for Computational Linguistics, Vancouver, Canada, 2017.
The code can be found here: https://github.com/usc-sail/mica-text-script-parser