Skip to content

Big Data ETL with Amazon reviews - using Google Colab, AWS, Spark, Postgresql, Java, Hadoop

Notifications You must be signed in to change notification settings

cmartell5/amazon-reviews-etl

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

14 Commits
 
 
 
 
 
 

Repository files navigation

Big-Data-ETL

Background

For this project I will put my ETL skills to the test. Many of Amazon's shoppers depend on product reviews to make a purchase. Amazon makes these datasets publicly available. They are quite large and can exceed the capacity of local machines. One dataset alone contains over 1.5 million rows; with over 40 datasets, data analysis can be very demanding on the average local computer. My first goal will be to perform the ETL process completely in the cloud and upload a DataFrame to an RDS instance. The second goal will be to use PySpark or SQL to perform a statistical analysis of selected data.

  • Extract two Amazon customer review datasets, transform each dataset into four DataFrames, and load the DataFrames into an RDS instance.

Project Steps

  1. Extracted data - Two Amazon customer review datasets into Spark DataFrames. One was for video games and one was for pet products.
  2. Transformed data - Created four dataframes from each dataset based off the information extracted:
    • "review_id_df" DataFrame with appropriate columns and datatypes
    • "products_df" DataFrame that drops the duplicates on the "product_id" and "product_title" columns
    • "customers_df" DataFrame that groups the data on the "customer_id" by the number of times a customer reviewed a product
    • "vine_df" DataFrame that has the "review_id", "star_rating", "helpful_votes", "total_votes", and "vine" columns
  3. Load data - Exported each DataFrame into the RDS instance to create four tables for each dataset

Tools used

Google Colab, AWS, Spark, postgresql, java, hadoop

image