Sentiment Analysis and Recommender System based on Amazon's Home and Kitchen Dataset.
The number of people who prefer online shopping is increasing rapidly, driven by technological advancement and the convenience of shopping anywhere at any time. Recent studies show that most online shoppers use reviews to decide which products to buy. Reviews not only help customers but also strengthen sellers' trustworthiness. Amazon is one of the world’s largest retailers, with millions of products. This project is a statistical analysis of reviews and ratings by Amazon users. All intermediate results and references are uploaded to Drive because of space constraints.
- Perform topic modeling (LDA) on review comments.
- Build a recommendation system using the collaborative filtering (ALS) and topic modeling (LDA) methods, and compare their results.
- Use the topics generated by LDA and analyze the sentiment of the reviews to score products by the topics mentioned in their reviews.
The Amazon review datasets are freely available online. For this analysis we used the dataset specific to the “Home and Kitchen” category. We initially explored the complete review data, which had 21 million entries, and later switched to a smaller subset of 7 million entries because of technical difficulties some of us faced. The dataset has the following fields:
- reviewerID - the ID of the user
- asin - the Amazon product ID
- reviewerName - the name of the user
- vote - the number of times the review was voted helpful
- reviewText - the actual content of the review
- overall - the rating of the product, ranging from 1 to 5
- summary - the title of the review
- unixReviewTime - the time of the review in Unix format
- reviewTime - the time of the review
- style - a dictionary of the product metadata, e.g., "Format" is "Hardcover"
- image - images that users post after they have received the product
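As a rough illustration of working with the fields listed above, the sketch below parses one review record and converts unixReviewTime to a calendar date. It assumes the raw data is stored as one JSON object per line, which is not stated here; the sample record and the helper name parse_review are hypothetical.

```python
import json
from datetime import datetime, timezone

# Hypothetical sample record; the field names follow the dataset
# description, the values are made up for illustration.
sample_line = json.dumps({
    "reviewerID": "A1EXAMPLE",
    "asin": "B00EXAMPLE",
    "reviewerName": "Example User",
    "reviewText": "Sturdy pan, heats evenly.",
    "overall": 5.0,
    "summary": "Great pan",
    "unixReviewTime": 1420070400,
})


def parse_review(line):
    """Parse one JSON line and keep the fields used for analysis."""
    record = json.loads(line)
    return {
        "reviewerID": record["reviewerID"],
        "asin": record["asin"],
        "overall": float(record["overall"]),
        # Text fields are read defensively in case a record lacks them.
        "reviewText": record.get("reviewText", ""),
        "summary": record.get("summary", ""),
        # Convert the Unix timestamp (seconds) to a UTC calendar date.
        "review_date": datetime.fromtimestamp(
            record["unixReviewTime"], tz=timezone.utc
        ).date(),
    }


review = parse_review(sample_line)
print(review["asin"], review["overall"], review["review_date"])
```

Reading the text fields with a default keeps the parser from failing on records that have a rating but no review text.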
Link to the preprocessed dataset => Home & Kitchen Preprocessed Data
Topic Modeling with LDA
The dataset contains roughly 6 million data points. A GPU is preferred for faster execution (we used HPC access provided by SJSU).
Python version 3.6.10
pip install gensim
pip install pandas
pip install nltk (then download the stopwords and wordnet corpora)
pip install pyLDAvis
Download Topic_Modeling_grouped.ipynb and run it using Jupyter Notebook. It will save the generated model, corpus, and dictionary.
Download Recommendation_LDA.ipynb and run it with Jupyter Notebook to get recommendations.
Visualize the generated topics.
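One common way to turn LDA output into recommendations is to compare products by the similarity of their topic distributions. The sketch below illustrates that idea with cosine similarity; it is not necessarily what Recommendation_LDA.ipynb does. The product IDs, the topic vectors, and the function names are hypothetical.

```python
import math

# Hypothetical per-product topic distributions over three topics,
# e.g. obtained by averaging LDA topic probabilities over each
# product's reviews. The values are made up for illustration.
product_topics = {
    "P1": [0.70, 0.20, 0.10],
    "P2": [0.65, 0.25, 0.10],
    "P3": [0.05, 0.15, 0.80],
    "P4": [0.10, 0.60, 0.30],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend_similar(product_id, topics, top_n=2):
    """Return the top_n other products ranked by topic similarity."""
    target = topics[product_id]
    scored = [
        (other, cosine_similarity(target, vector))
        for other, vector in topics.items()
        if other != product_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


recommendations = recommend_similar("P1", product_topics)
print(recommendations)
```

Here P2 ranks first for P1 because both products are dominated by the same topic.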
Recommender system using ALS
Spark runs on top of Hadoop, that is, HDFS (Hadoop Distributed File System). The following versions of Java and Python need to be maintained to run the recommendation system (ALS) files.
1. Use Java 8: java version "1.8.0_102" (brew cask install java8)
2. Upgrade Python to the latest version (brew upgrade python)
3. Install PySpark (pip3 install pyspark)
1. pip3 show pyspark
This gives the following output:
Name: pyspark
Version: 2.4.5
Summary: Apache Spark Python API
Home-page: https://github.com/apache/spark/tree/master/python
Author: Spark Developers
Author-email: dev@spark.apache.org
License: http://www.apache.org/licenses/LICENSE-2.0
Location: /opt/anaconda3/lib/python3.7/site-packages
Requires: py4j
Required-by:
2. vi ~/.bash_profile
3. export SPARK_HOME=/opt/anaconda3/lib/python3.7/site-packages/pyspark
4. export PATH=$SPARK_HOME/bin:$PATH
5. pip3 install findspark
6. jupyter notebook
- Item Based Collaborative Filtering
- User Based Collaborative Filtering
Dataset after preprocessing for ALS => ALS_AmazonRantings and ALS_AmazonTitles
- Open the Recommendation System (Item Based) file and run it using Jupyter Notebook.
- Open the Recommendation System (User Based) file and run it using Jupyter Notebook.
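To show what the "alternating" in ALS means, here is a minimal pure-Python sketch with a single latent factor per user and item. It is only an illustration of the update rule, not the Spark ALS implementation the notebooks rely on. The ratings, the regularization value, and the function names are hypothetical.

```python
# Hypothetical ratings: (user, item) -> rating on the 1-5 scale.
ratings = {
    ("u1", "i1"): 5.0, ("u1", "i2"): 4.0,
    ("u2", "i1"): 4.0, ("u2", "i3"): 2.0,
    ("u3", "i2"): 5.0, ("u3", "i3"): 3.0,
}


def objective(ratings, user_f, item_f, reg):
    """Squared error on observed ratings plus L2 regularization."""
    error = sum(
        (r - user_f[u] * item_f[i]) ** 2 for (u, i), r in ratings.items()
    )
    penalty = reg * (
        sum(f * f for f in user_f.values())
        + sum(f * f for f in item_f.values())
    )
    return error + penalty


def als_rank1(ratings, reg=0.1, iterations=10):
    """Rank-1 ALS: each rating is modeled as user_f[u] * item_f[i]."""
    users = sorted({u for u, _ in ratings})
    items = sorted({i for _, i in ratings})
    user_f = {u: 1.0 for u in users}
    item_f = {i: 1.0 for i in items}
    history = []
    for _ in range(iterations):
        # Hold item factors fixed; each user has a closed-form solution.
        for u in users:
            num = sum(r * item_f[i] for (uu, i), r in ratings.items() if uu == u)
            den = reg + sum(item_f[i] ** 2 for (uu, i) in ratings if uu == u)
            user_f[u] = num / den
        # Hold user factors fixed; solve for each item the same way.
        for i in items:
            num = sum(r * user_f[u] for (u, ii), r in ratings.items() if ii == i)
            den = reg + sum(user_f[u] ** 2 for (u, ii) in ratings if ii == i)
            item_f[i] = num / den
        history.append(objective(ratings, user_f, item_f, reg))
    return user_f, item_f, history


user_f, item_f, history = als_rank1(ratings)
# Predicted rating for a pair that has no observed rating.
predicted = user_f["u1"] * item_f["i3"]
print(round(predicted, 2), [round(h, 3) for h in history])
```

Each half-step solves a small least-squares problem exactly, so the regularized objective never increases from one iteration to the next. With more than one latent factor, each update becomes a small linear system instead of a scalar division.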
- Topic Modelling and Sentiment Analysis using VADER
- Topic Modelling and Sentiment Analysis using TextBlob
1. Install the following libraries using pip:
pip install gensim
pip install wordcloud
pip install pyLDAvis
pip install vaderSentiment
2. Download the NLTK data
nltk.download('vader_lexicon')
nltk.download()
3. For LDA Mallet
Download File: http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip
mallet_path = 'path/to/mallet-2.0.8/bin/mallet' # update this path
LDA and LDA Mallet: Choosing the optimal number of topics
Sentiment Analysis using VADER Sentiment Analyser
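The sketch below shows one way sentiment scores could be rolled up per topic to score a product. It assumes each review already has a dominant topic and a VADER-style compound score in [-1, 1]; the ±0.05 cut-offs follow the convention suggested in the vaderSentiment documentation. The topic labels, the scores, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical input: (dominant topic, compound score) for each review
# of one product. In practice the topic would come from LDA and the
# score from a sentiment analyzer such as VADER.
scored_reviews = [
    ("durability", 0.80),
    ("durability", -0.40),
    ("ease of cleaning", 0.60),
    ("ease of cleaning", 0.02),
    ("price", -0.70),
]


def label_from_compound(compound, threshold=0.05):
    """Map a compound score to positive, negative, or neutral."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def summarize_by_topic(reviews):
    """Average the compound scores of the reviews under each topic."""
    grouped = defaultdict(list)
    for topic, compound in reviews:
        grouped[topic].append(compound)
    summary = {}
    for topic, scores in grouped.items():
        mean_score = sum(scores) / len(scores)
        summary[topic] = {
            "mean_compound": round(mean_score, 3),
            "label": label_from_compound(mean_score),
            "n_reviews": len(scores),
        }
    return summary


topic_summary = summarize_by_topic(scored_reviews)
for topic, stats in topic_summary.items():
    print(topic, stats)
```

Keeping the review count next to the mean makes it easier to spot topics whose score rests on very few reviews.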
1. Install the following libraries using pip3 (pickle ships with the Python standard library and needs no installation):
pip3 install pandas
pip3 install numpy
pip3 install TextBlob
pip3 install gensim
pip3 install spacy
pip3 install pyLDAvis
2. Download Mallet from the URL below, extract it, and set the 'mallet_path' variable to the location of the extracted mallet binary.
Download URL: http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip
3. Open a Python console and do the following:
import nltk
nltk.download()
Select stopwords from the Corpora tab in the GUI that opens and click Download.
Now run the following files:
LDA And Sentiment Analysis for a single product using textblob
LDA And Sentiment Analysis for group of similar products using textblob
We used Tableau to build visualizations of our large dataset, which holds millions of records.
Setting up Tableau for visualization
1. Download the free version of Tableau Desktop available for students.
2. Install the latest version of MySQL Workbench.
3. Install pymysql and sqlalchemy.
4. Open Jupyter Notebook and load the preprocessed dataset into a pandas DataFrame.
5. Run the following commands to establish a connection to the MySQL server:
from sqlalchemy import create_engine
import pymysql
engine = create_engine("mysql+pymysql://{user}:{pw}@localhost/{db}".format(user="root",pw="pwd",db="DB_Name"))
6. Insert the whole DataFrame into MySQL using the following line:
df.to_sql('Table_name', con = engine, if_exists = 'append', chunksize = 1000)
7. Open Tableau, select the MySQL server as the data source, open a worksheet, and start plotting!
- Sum of ratings for each product, showing the top products with the highest ratings. The bars in this plot are labeled with the number of records for each productID.
- Change in ratings and reviews over the years
- Overall Rating distribution
- Ratings distribution within a month
- Sentiment Result visualization for Vader Sentiment Analysis
- Ratings visualization for each topic in Vader Sentiment Analysis
- Ratings visualization for each topic in TextBlob
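Before plotting in Tableau, an aggregate such as the yearly trend in ratings and reviews can be checked with a few lines of Python. The sketch below groups (unixReviewTime, overall) pairs by year and reports the review count and mean rating. The sample rows and the function name are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical (unixReviewTime, overall) pairs; values are made up.
rows = [
    (1388534400, 5.0),  # 2014-01-01 UTC
    (1404172800, 3.0),  # 2014-07-01 UTC
    (1420070400, 4.0),  # 2015-01-01 UTC
]


def ratings_by_year(rows):
    """Return {year: {"n_reviews": count, "mean_rating": mean}}."""
    grouped = defaultdict(list)
    for unix_time, rating in rows:
        year = datetime.fromtimestamp(unix_time, tz=timezone.utc).year
        grouped[year].append(rating)
    return {
        year: {
            "n_reviews": len(values),
            "mean_rating": sum(values) / len(values),
        }
        for year, values in sorted(grouped.items())
    }


yearly = ratings_by_year(rows)
for year, stats in yearly.items():
    print(year, stats["n_reviews"], stats["mean_rating"])
```

Comparing these numbers with the Tableau chart is a quick way to confirm that the data loaded into MySQL matches the source DataFrame.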