Dianne Jardinez, Aastha Arora, Swarna Latha
The objective of this project was to extract data from websites and available APIs. The following datasets were then transformed by cleaning, joining, and filtering into nine tables. The object-relational database, PostgreSQL, was used to load the datasets into pgAdmin.
The following Data Sources were used below:
-
- Method: Webscraping
- Used for: Collecting the Top 250 IMDB rated movie list
-
- Method: API Extraction
- Used for: Collecting IMDb id and other movie related details like actor, director, etc.
-
- Method: API Extraction
- Used For: Collecting streaming options for Top 250 IMDb movies
-
- Method: API Extraction
- Used For: Collecting movies on Netflix in released in the United States which have an IMDb rating between 7 and 10
-
- Method: Webscraping
- Used for: Collecting viewing Streaming Service availability and price
- Data extracted were formated in CSV and JSON files
- The following datasets were then transformed by cleaning, joining, and filtering into nine tables
- The object-relational database, PostgreSQL, was used to load the datasets into pgAdmin. A relational database was selected as the data was in a structured format
-
Extract:
Google scraping.ipynb
:- contains IMDB website and Google Search Engine Webscraping
netflix_high_imdb_rated(uNoGS api).ipynb
:- contains IMDB website Webscraping, OMDb API, and uNoGS API extraction
streaming_options(utelly api).ipynb
:- contains Utelly API extraction
-
Transform:
Transform.ipynb
:- contains all datasets that were transformed into nine tables
-
Load:
SQL
folder:- contains ERD and schema
SQL_Table
folder:- contains the creation of and all nine tables created in pgAdmin with PostgreSQL
- Project Report document:
- contains detailed project description and sample PostgreSQL queries in pgAdmin