Merge pull request #705 from tanuj437/main
Customer Review Sentiment Anaylsis
abhisheks008 authored Jul 12, 2024
2 parents e998399 + 1f9c31d commit 59e975f
Showing 22 changed files with 388 additions and 0 deletions.
67 changes: 67 additions & 0 deletions Customer Review Sentiment Anaylsis/Dataset/README.md
@@ -0,0 +1,67 @@
# Dataset Overview
This dataset contains customer reviews from a marketplace, with various attributes related to the reviews and products. The dataset file is approximately 475 MB in size.

## File Description
The dataset file includes the following columns:

- **marketplace**: The marketplace where the review was posted.
- **customer_id**: Unique identifier for the customer.
- **review_id**: Unique identifier for the review.
- **product_id**: Unique identifier for the product.
- **product_parent**: Parent identifier for the product.
- **product_title**: Title of the product.
- **product_category**: Category of the product.
- **star_rating**: Star rating given by the customer.
- **helpful_votes**: Number of helpful votes the review received.
- **total_votes**: Total number of votes the review received.

## Unique Values Overview
### marketplace
- Unique values: 0
- Total values: [null] 55%, Other (405658) 45%

### customer_id
- Unique values: 0
- Total values: [null] 68%, Banjo 0%, Other (290242) 32%

### review_id
- Unique values: 0
- Total values: [null] 82%, Banjo 0%, Other (159357) 18%

### product_id
- Unique values: 0
- Total values: [null] 86%, Craft Work 0%, Other (122616) 14%

## Numerical Data Overview
### Star Rating
| Range | Count |
| --- | --- |
| -32.00 - 371.20 | 94 |
| 371.20 - 774.40 | 4 |
| 774.40 - 1177.60 | 3 |
| 1177.60 - 1580.80 | 1 |
| 1984.00 - 2387.20 | 2 |
| 3596.80 - 4000.00 | 1 |

Minimum: -32, Maximum: 4000

### Helpful Votes
| Range | Count |
| --- | --- |
| -5.00 - 427.60 | 54 |
| 427.60 - 860.20 | 7 |
| 860.20 - 1292.80 | 1 |
| 1725.40 - 2158.00 | 4 |
| 2158.00 - 2590.60 | 1 |
| 3888.40 - 4321.00 | 1 |

Minimum: -5, Maximum: 4321
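
The range labels in the two tables above are consistent with ten equal-width bins between the minimum and maximum, with empty bins left out. That is an inference from the numbers, not something this README states. A minimal sketch of how such bin edges can be computed (the `bin_edges` helper is hypothetical):

```python
def bin_edges(lowest, highest, bins=10):
    """Return the edges of `bins` equal-width bins spanning [lowest, highest]."""
    width = (highest - lowest) / bins
    # Rounded to two decimals to match the labels shown in the overview.
    return [round(lowest + i * width, 2) for i in range(bins + 1)]


# Assumption: ten bins, using the minimum and maximum listed in the overview.
star_edges = bin_edges(-32, 4000)
vote_edges = bin_edges(-5, 4321)
print(star_edges[:3])
print(vote_edges[:3])
```

Under this assumption the first edges come out as -32.0, 371.2, 774.4 and -5.0, 427.6, 860.2, which matches the first rows of each table.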




### Summary
- File Size: 475.5MB
- Number of Records: 904,615

### Usage
This dataset can be used for sentiment analysis, customer behavior analysis, and various other machine learning tasks related to product reviews and ratings.

# Dataset
This dataset can be accessed and downloaded from [Kaggle](https://www.kaggle.com/datasets/cynthiarempel/amazon-us-customer-reviews-dataset).
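
Once downloaded, the file can be read row by row. The sketch below assumes a tab-separated file with a header row that uses the column names listed in the File Description. The `read_reviews` helper and the in-memory sample are hypothetical, and disabling quote handling is an assumption that suits review text containing stray quote characters.

```python
import csv
import io


def read_reviews(handle):
    """Yield one dict per review, skipping rows without a usable star rating."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        try:
            stars = int(row["star_rating"])
        except (KeyError, TypeError, ValueError):
            continue  # missing or malformed rating
        yield {
            "review_id": row.get("review_id"),
            "product_id": row.get("product_id"),
            "star_rating": stars,
        }


# Tiny in-memory stand-in for the real file (hypothetical values).
sample = io.StringIO(
    "review_id\tproduct_id\tstar_rating\n"
    "R1\tP1\t5\n"
    "R2\tP2\t\n"
    "R3\tP3\t2\n"
)
rows = list(read_reviews(sample))
print(rows)
```

For the real dataset, pass an open file handle (for example `open(path, encoding="utf-8", newline="")`) instead of the sample.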
91 changes: 91 additions & 0 deletions Customer Review Sentiment Anaylsis/Model/README.md
@@ -0,0 +1,91 @@
# Customer Review Sentiment Analysis - Model
## 📝 Description
This folder contains the pre-trained machine learning models and scripts used for sentiment analysis on the customer review dataset. The goal is to automatically categorize customer reviews into positive, neutral, or negative sentiments, helping to understand public perceptions of products.

## 📂 Contents
- **customer-review-sentiment-analysis.ipynb**: Jupyter Notebook containing the complete process of data preprocessing, model training, evaluation, and visualization.

- **model.pkl**: Pre-trained Logistic Regression model used for sentiment prediction.

- **tfidf_vectorizer.pkl**: Pre-trained TF-IDF vectorizer used for transforming text data.

- **README.md**: This document.

## 🎯 Goal
The goal of this sentiment analysis project is to enhance understanding of customer perceptions by organizing and analyzing reviews. By automatically classifying these reviews as positive, neutral, or negative, the project aims to provide insights into public opinion trends.

## 🧮 What I Did
In this sentiment analysis project, various models were evaluated to find the most effective one for classifying customer reviews. The models evaluated include:

### Logistic Regression

A simple linear model for binary and multi-class classification.
Achieved high accuracy and balanced precision-recall performance.
### LightGBM Classifier

A Light Gradient Boosting Machine known for its efficiency and performance with large datasets.
Achieved competitive accuracy and was used as one of the benchmark models.
### XGBoost Classifier

An implementation of gradient-boosted decision trees designed for speed and performance.
Achieved competitive accuracy and served as another benchmark model.
### AdaBoost Classifier

An ensemble method that combines multiple weak classifiers to create a strong classifier.
Achieved good performance, particularly in precision and recall.

### Data Preprocessing and Augmentation

**Data Cleaning:** Normalized text and removed missing values and duplicates.

**Tokenization:** Processed text data to remove stop words and perform lemmatization.

**TF-IDF Vectorization:** Converted text data into numerical features using TF-IDF.
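
To make the TF-IDF step concrete, here is a dependency-free sketch of the weighting. It assumes the formula scikit-learn's `TfidfVectorizer` uses with default settings: raw term counts multiplied by `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization. The notebook's actual vectorizer settings are not shown in this README, and the `tfidf` helper is hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        raw = {
            term: count * (math.log((1 + n) / (1 + df[term])) + 1.0)
            for term, count in counts.items()
        }
        # L2-normalize so every document vector has unit length.
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({term: w / norm for term, w in raw.items()})
    return vectors


docs = [["great", "guitar"], ["bad", "guitar"]]
vectors = tfidf(docs)
print(vectors[0])
```

In this toy corpus "guitar" appears in both documents, so it receives a lower weight than "great" or "bad", which each appear in only one.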


## 🚀 Models Implemented

**Logistic Regression Model**

- Achieved an accuracy of 90.0%.
- Precision: 0.89, Recall: 0.90
- F1-score: 0.89 (weighted average).


**XGBoost Classifier**

- Achieved an accuracy of 89.0%.
- Precision: 0.88, Recall: 0.89
- F1-score: 0.87 (weighted average).

**AdaBoost Classifier**

- Achieved an accuracy of 88.0%.
- Precision: 0.86, Recall: 0.88
- F1-score: 0.86 (weighted average).

**LightGBM Classifier**

- Achieved an accuracy of 89.0%.
- Precision: 0.88, Recall: 0.89
- F1-score: 0.88 (weighted average).

**Multi-Layer Perceptron (MLP)**

- Achieved an accuracy of 90.0%.
- Precision: 0.89, Recall: 0.90
- F1-score: 0.89 (weighted average).


**Model Performance Analysis**
Training and Validation: Evaluated models based on accuracy, precision, and loss to select the best-performing model.
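
The scores reported above are weighted averages. As a sketch of what that means, the hypothetical `weighted_scores` helper below computes per-class precision, recall, and F1 and weights each by the class's share of the true labels. It assumes the usual support-weighted definition. Under that definition weighted recall always equals accuracy, which is consistent with the matching recall and accuracy figures listed for each model.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Support-weighted precision, recall, and F1, plus plain accuracy."""
    n = len(y_true)
    support = Counter(y_true)
    precision = recall = f1 = 0.0
    for label, count in support.items():
        true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == p == label)
        predicted = sum(1 for p in y_pred if p == label)
        p_c = true_pos / predicted if predicted else 0.0
        r_c = true_pos / count
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = count / n  # class share of the true labels
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels, only to exercise the function.
y_true = ["positive", "positive", "positive", "negative", "neutral"]
y_pred = ["positive", "positive", "negative", "negative", "positive"]
scores = weighted_scores(y_true, y_pred)
print(scores)
```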


**Best Model**
The best-performing model, Logistic Regression, has been saved as `model.pkl` and is ready for deployment using Streamlit.
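
A minimal sketch of how a TF-IDF vectorizer and a Logistic Regression model can be fitted, pickled, and restored with scikit-learn. The training texts, labels, and hyperparameters here are hypothetical placeholders, not the notebook's actual data or settings. The round trip is done in memory; writing the bytes to `model.pkl` and `tfidf_vectorizer.pkl` works the same way.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data standing in for the preprocessed reviews.
texts = [
    "love this guitar great sound",
    "excellent strings very happy",
    "okay product nothing special",
    "average quality works fine",
    "terrible broke after a week",
    "awful sound waste of money",
]
labels = ["positive", "positive", "neutral", "neutral", "negative", "negative"]

vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(texts)

model = LogisticRegression(max_iter=1000)
model.fit(features, labels)

# Serialize both objects; the vectorizer must be saved with the model,
# because predictions need the same vocabulary and IDF weights.
model_bytes = pickle.dumps(model)
vectorizer_bytes = pickle.dumps(vectorizer)

restored_model = pickle.loads(model_bytes)
restored_vectorizer = pickle.loads(vectorizer_bytes)

review = ["love this guitar"]
prediction = restored_model.predict(restored_vectorizer.transform(review))[0]
print(prediction)
```

Pickle files should be loaded with a scikit-learn version compatible with the one used to create them.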

## 📢 Conclusion
The customer review sentiment analysis project demonstrates the effectiveness of machine learning models, particularly Logistic Regression, in accurately predicting customer sentiment. The models help in organizing and prioritizing customer reviews, providing valuable insights for stakeholders.

## ✒️ Your Signature
Tanuj Saxena [LinkedIn](https://linkedin.com/in/tanuj-saxena-970271252/)

Large diffs are not rendered by default.

Binary file not shown.
Binary file not shown.
94 changes: 94 additions & 0 deletions Customer Review Sentiment Anaylsis/README.md
@@ -0,0 +1,94 @@

# Customer Review Sentiment Analysis
Customer Review Sentiment Analysis is a project focused on automatically analyzing and classifying customer reviews based on their sentiment. The project leverages machine learning techniques to understand and categorize customer sentiments expressed in reviews.

Note: Sentiment labels are derived from each review's star rating.
<img width="922" alt="webapp" src="https://github.com/tanuj437/Customer-Review-Sentiment-Anaylsis/assets/128210429/5528ae7e-6ded-46d3-a845-026451cf40e6">

## 📝 Abstract
Customer Review Sentiment Analysis involves automatically identifying and classifying sentiment from customer reviews. Techniques such as natural language processing (NLP), machine learning models, and sentiment analysis algorithms are employed to achieve this.

## 🔍 Methodology
**Importing Libraries**
- Libraries such as NumPy, Pandas, scikit-learn, Transformers, and others are imported for data manipulation, visualization, and machine learning model building.

**Loading the Dataset**
- The dataset contains rows of customer reviews, each labeled with a sentiment derived from its star rating.
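
This README does not state which star ratings map to which sentiment. A minimal sketch, assuming the common convention that 1 and 2 stars are negative, 3 stars is neutral, and 4 and 5 stars are positive (both the thresholds and the `star_to_sentiment` name are hypothetical):

```python
def star_to_sentiment(star_rating):
    """Map a 1-5 star rating to a sentiment label (assumed thresholds)."""
    if not 1 <= star_rating <= 5:
        raise ValueError(f"star rating out of range: {star_rating}")
    if star_rating <= 2:
        return "negative"
    if star_rating == 3:
        return "neutral"
    return "positive"


labels = [star_to_sentiment(stars) for stars in (1, 2, 3, 4, 5)]
print(labels)
```

If the notebook uses different cut-offs, only the two comparisons need to change.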

**Data Preprocessing**
- Prepare data for analysis: handle missing values, encode categorical data, scale features, perform feature engineering, split into train-test sets, and normalize data. Ensure data is in a suitable format for machine learning algorithms.

**Training the Models**
- Models such as LightGBM and Logistic Regression are trained on the training dataset and then evaluated.

**Model Performance Analysis**
- Training and validation loss and accuracy are plotted to visualize the models' performance.

<img width="404" alt="precision_cmp" src="https://github.com/tanuj437/Customer-Review-Sentiment-Anaylsis/assets/128210429/f34e3bfc-c14c-4c9e-8a58-e9de9f27c2d9">


**Model Prediction**
- The model is given a test dataset to check the accuracy and precision of the predictions.

<img width="416" alt="recall_cmp" src="https://github.com/tanuj437/Customer-Review-Sentiment-Anaylsis/assets/128210429/748c1245-c55c-4b4b-850a-e687c6c9ffe7">


**Deploy**
- Using the Streamlit library, the model is deployed for real-time sentiment analysis.

**Data and Model File Download**
- The dataset used in the project is taken from the Kaggle Customer Review Dataset. [Kaggle Dataset Link](https://www.kaggle.com/datasets/cynthiarempel/amazon-us-customer-reviews-dataset?select=amazon_reviews_us_Musical_Instruments_v1_00.tsv)

### Project Directory Structure
```
Customer Review Sentiment Anaylsis
|- Dataset
   |- column_overview.png
   |- dataset_view.png
   |- README.md
|- Model
   |- customer-review-sentiment-analysis.ipynb
   |- README.md
   |- model.pkl
   |- tfidf_vectorizer.pkl
|- Webapp
   |- app.py
   |- README.md
|- images
   |- f1_cmp.png
   |- README.md
   |- precision_cmp.png
   |- recall_cmp.png
   |- review_length.png
   |- sentiment_distribution.png
   |- star_rating.png
   |- star_ratingtocount.png
   |- webapp.png
   |- running_test.mp4
   |- wordcloud.png
|- requirement.txt
|- README.md
```

## How to Use
**Requirements**
- Ensure you have the necessary libraries and dependencies installed. The required packages are listed in the `requirement.txt` file.

**Download Data**
- Download the dataset from the Kaggle link given above (the link selects `amazon_reviews_us_Musical_Instruments_v1_00.tsv`).

**Run the Jupyter Notebook**
- Open the provided Jupyter Notebook file and run each cell sequentially. Make sure to update any file paths or configurations as needed for your environment.

**Training and Evaluation**
- Train the models using the provided data and evaluate their performance using metrics such as accuracy and loss.

**Interpret Results**
- Analyze the model's performance using the visualizations and metrics provided in the notebook.

Feel free to reach out if you encounter any issues or need further assistance with running the notebook.

## Connect with Me
Tanuj Saxena [LinkedIn](https://www.linkedin.com/in/tanuj-saxena-970271252/)
40 changes: 40 additions & 0 deletions Customer Review Sentiment Anaylsis/Webapp/README.md
@@ -0,0 +1,40 @@
# Customer Review Sentiment Analysis Web App

## Goal 🎯
The goal of this sentiment analysis web application is to understand public perceptions about customer reviews on various products. By analyzing these reviews, the app helps in organizing and prioritizing insights, detecting sentiment trends, and ensuring that diverse viewpoints are represented. It streamlines the process of understanding customer opinions and provides valuable feedback for stakeholders. 🌍🔍

## Model(s) Used for the Web App 🧮
The model used in this web app is a pre-trained Logistic Regression classifier for sentiment analysis. A TF-IDF vectorizer converts the review text into numerical features, and the Logistic Regression model predicts the sentiment from those features.

## Video Demonstration 🎥



https://github.com/user-attachments/assets/46cada4d-cefd-41ac-af8d-415a23a035b9




## How to Run the Web App

### Requirements
Ensure you have the necessary libraries and dependencies installed. You can find the list of required packages in the `requirements.txt` file.

### Installation
1. **Clone the repository:**
```bash
gh repo clone tanuj437/Customer-Review-Sentiment-Analysis
cd Customer-Review-Sentiment-Analysis/WebApp
```
2. **Install the Dependencies**
```bash
pip install -r requirements.txt
```
3. **Run the Streamlit app**
```bash
streamlit run app.py
```
### Signature ✒️
Tanuj Saxena

[![LinkedIn](https://img.shields.io/badge/LinkedIn-%230077B5.svg?logo=linkedin&logoColor=white)](https://www.linkedin.com/in/tanuj-saxena-970271252/)
49 changes: 49 additions & 0 deletions Customer Review Sentiment Anaylsis/Webapp/app.py
@@ -0,0 +1,49 @@
import pickle
import re
from pathlib import Path

import nltk
import streamlit as st
from nltk.corpus import stopwords

# Load NLTK stopwords, downloading the corpus on first use if it is missing
try:
    stop_words = set(stopwords.words('english'))
except LookupError:
    nltk.download('stopwords')
    stop_words = set(stopwords.words('english'))


# Define text preprocessing function
def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'<[^>]+>', ' ', text)  # Remove HTML tags
    text = re.sub(r'[^a-z\s]', '', text)  # Remove non-alphabetic characters
    text = ' '.join([word for word in text.split() if word not in stop_words])  # Remove stopwords
    return text


# Resolve the Model folder relative to this file (it sits next to the Webapp folder),
# so the app also works when it is launched from inside Webapp/
model_dir = Path(__file__).resolve().parent.parent / 'Model'

# Load the pretrained model
model_filename = model_dir / 'model.pkl'
with open(model_filename, 'rb') as file:
    logistic_model = pickle.load(file)

# Load the pretrained TF-IDF vectorizer
vectorizer_filename = model_dir / 'tfidf_vectorizer.pkl'
with open(vectorizer_filename, 'rb') as file:
    vectorizer = pickle.load(file)


# Define a function to make predictions
def predict_sentiment(text):
    preprocessed_text = preprocess_text(text)
    transformed_text = vectorizer.transform([preprocessed_text])
    prediction = logistic_model.predict(transformed_text)
    return prediction[0]


# Streamlit app
st.title('Sentiment Analysis Web App')

st.write('This is a web app to classify the sentiment of customer reviews as positive, neutral, or negative.')

# User input
user_input = st.text_area('Enter a customer review:', '')

if st.button('Predict'):
    if user_input.strip():
        prediction = predict_sentiment(user_input)
        st.write(f'The sentiment of the review is: **{prediction}**')
    else:
        st.write('Please enter a review to get a prediction.')

# To run the app, use the command: streamlit run app.py
34 changes: 34 additions & 0 deletions Customer Review Sentiment Anaylsis/images/README.md
@@ -0,0 +1,34 @@
# Image Folder Overview
This folder contains various visualizations that represent different aspects of the dataset and model performance. Below is a detailed description of each visualization.

## Visualizations
## Sentiment Distribution

Description: This chart shows the distribution of sentiments (positive, neutral, negative) across the dataset. It provides an overview of how many reviews fall into each sentiment category.
## Star Rating Distribution

Description: This chart displays the distribution of star ratings given by customers. It helps to understand the overall satisfaction level of the customers based on the star ratings.
## Star Rating to Count

Description: This chart shows the count of reviews for each star rating. It is useful to see the frequency of each rating given by the customers.
## Word Cloud

Description: A word cloud visualization that highlights the most frequently occurring words in the reviews. Larger words represent higher frequency, providing insight into common themes and topics in the reviews.
## Review Length Distribution

Description: This histogram shows the distribution of review lengths in terms of word count. It helps to understand the typical length of customer reviews in the dataset.

## F1 Score Comparison

Description: This bar chart compares the F1 scores of different models used in the analysis. The F1 score is the harmonic mean of precision and recall, so it is high only when both are high.
## Recall Comparison

Description: This bar chart compares the recall scores of different models used in the analysis. Recall measures the ability of a model to identify all relevant instances in the dataset.


## Precision Comparison

Description: This bar chart compares the precision scores of different models used in the analysis. Precision measures the accuracy of the positive predictions made by the model.

### Usage
These visualizations provide a comprehensive view of the dataset's characteristics and the performance of various models used for sentiment analysis. They can be used to gain insights into customer reviews, model effectiveness, and areas for improvement in analysis.
Binary file not shown.
12 changes: 12 additions & 0 deletions Customer Review Sentiment Anaylsis/requirement.txt
@@ -0,0 +1,12 @@
pandas==1.3.3
matplotlib==3.4.3
seaborn==0.11.2
nltk==3.6.3
wordcloud==1.8.1
numpy==1.21.2
scikit-learn==0.24.2
keras==2.6.0
tensorflow==2.6.0
lightgbm==3.2.1
xgboost==1.4.2
streamlit==0.86.0
