The largest online retailer Amazon Prime Video is sponsoring a hackathon requesting participants to determine which low-budget movies will become popular box office films. Amazon Prime Video plans to obtain rights to these potentially popular movies for their streaming service. The purpose of this project is to assist their team in creating the list of movies to be used for the hackathon. To this end, I created an extract, transform, and load (ETL) pipeline to automate data wrangling. I then implemented the pipeline on one dataset of all movies released after 1990 from Wikipedia and another dataset of movie ratings from MovieLens in Kaggle. Lastly, I stored the resulting clean data within an SQL database.
As shown below, I extracted and read the three files in Jupyter as DataFrames.
wiki_movies_df | movies_metadata_df | ratings_df |
---|---|---|
I then transformed the DataFrames by using a try-except block to catch errors, refactoring code, filtering for specific values with regular expressions, deleting unreadable rows or columns, and cleaning any null values.
I merged the DataFrames wiki_movies_df and movies_metadata_df into a new DataFrame movies_df.
movies_df |
---|
I added movies_df to an SQL database along with ratings_df as tables named movies and ratings.
As shown below, filtering the available movies via the ETL pipeline shows that a total of 6052 movies have the potential to become established box office films. Each movie within the SQL table contains 31 columns of information, including IMDB ID, Kaggle ID, title, original title, tagline, Wikipedia URL, IMDB link, runtime, budget, etc. The unique identifier is the IMDB ID.
A total of 26,024,289 ratings are available, as shown below from the SQL query.
movies_df | ratings_df |
---|---|
Data source (files exceed upload capacity):
- wikipedia-movies.json
- movies_metadata.csv
- ratings.csv
Tools:
- Anaconda
- Jupyter Notebook
- NumPy
- Pandas
- psycopg2
- regular expressions (regex)
- SQLAlchemy
- JavaScript Object Notation (JSON)
Email: kate.wang00001@gmail.com
LinkedIn: katewang01