Juhibhojani/Sentiment-Analysis

Sentiment-Analysis

Task : Performing sentiment analysis on movie reviews
Dataset used : IMDB Dataset of 50K Movie Reviews
Source : Kaggle
URL : https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

Dataset metadata :

It consists of 50,000 rows, divided equally between positive and negative reviews. There are two columns, namely "Sentiment" and "Review".
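The shape and class balance described above could be checked with a few lines of pandas. This is only a sketch: the lowercase column names `review` and `sentiment`, the `summarize` helper, and the toy rows are assumptions for illustration, and the real data would be loaded from the Kaggle CSV with `pd.read_csv` instead.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Basic checks: duplicates, nulls, class share, words per class."""
    df = df.drop_duplicates()
    word_counts = df["review"].str.split().str.len()
    return {
        "rows": len(df),
        "nulls": int(df.isnull().sum().sum()),
        "class_share": df["sentiment"].value_counts(normalize=True).to_dict(),
        "mean_words": word_counts.groupby(df["sentiment"]).mean().to_dict(),
    }


# Toy stand-in for the real dataset (hypothetical rows, one exact duplicate).
toy = pd.DataFrame(
    {
        "review": ["great film", "great film", "bad plot and bad acting", "loved it"],
        "sentiment": ["positive", "positive", "negative", "positive"],
    }
)
summary = summarize(toy)
print(summary)
```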

Steps performed

  1. Importing necessary libraries
  2. Performing Exploratory Data Analysis
    -> Showcasing 10 positive and negative sentiments
    -> Dropping duplicate values
    -> Checking for NULL values
    -> Displaying percentage of positive and negative sentiment
    -> Analysing number of words in each category of sentiment
  3. Data Cleaning
    -> Decoding HTML-encoded characters
    -> Removing stop words (only those stop words that are not negations)
    -> Removing URLs
  4. Tokenization
  5. Stemming and Lemmatization
  6. Displaying Word Cloud
  7. Applying Tf-Idf vectorizer and different models
  8. Applying Tf-Idf with bigrams
  9. Applying Word2Vec as word embedding technique
  10. Result
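The data cleaning step (step 3) might look roughly like the sketch below. It uses only the Python standard library (`html.unescape` and `re`) rather than BeautifulSoup and NLTK, and the stop-word and negation sets are small hypothetical stand-ins, not the lists used in this repository. Stripping leftover HTML tags is also an assumption added here.

```python
import html
import re

# Hypothetical, deliberately tiny word lists for illustration only.
NEGATIONS = {"no", "not", "nor", "never"}
STOP_WORDS = {"a", "an", "the", "is", "was", "and", "of", "to", "this", "it",
              "no", "not", "nor", "never"} - NEGATIONS  # keep negations

URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"<[^>]+>")


def clean_review(text: str) -> list[str]:
    """Decode HTML entities, drop tags and URLs, tokenize, remove stop words."""
    text = html.unescape(text)      # "&amp;" -> "&"
    text = TAG_RE.sub(" ", text)    # drop tags such as <br />
    text = URL_RE.sub(" ", text)    # drop URLs
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


sample = "This is not a good movie &amp; the plot was bad.<br />See http://example.com"
cleaned = clean_review(sample)
print(cleaned)
```

Keeping words such as "not" matters because removing them would turn "not a good movie" into "good movie" and flip the apparent sentiment.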

Libraries

  1. Pandas
  2. Numpy
  3. Sklearn
  4. NLTK
  5. Wordcloud
  6. BeautifulSoup
  7. Matplotlib
  8. Gensim

Models

  1. Decision Tree Classifier
  2. Random Forest Classifier
  3. Logistic Regression
  4. KNN
  5. Naive Bayes
  6. SVM
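As a rough illustration of how several of the models above can be combined with Tf-Idf features (including bigrams), here is a small scikit-learn sketch. The toy reviews, the `build` helper, the hyperparameters, and the use of `LinearSVC` as the SVM variant are assumptions for this example and are not taken from the repository's notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training data (hypothetical); the real input would be the cleaned reviews.
texts = [
    "a wonderful film with great acting",
    "great story and wonderful cast",
    "loved this great movie",
    "terrible plot and awful acting",
    "awful movie with a terrible script",
    "not good, a boring and terrible film",
]
labels = ["positive"] * 3 + ["negative"] * 3


def build(model, bigrams=False):
    """Tf-Idf features (unigrams, optionally plus bigrams) followed by a classifier."""
    ngram_range = (1, 2) if bigrams else (1, 1)
    return make_pipeline(TfidfVectorizer(ngram_range=ngram_range), model)


models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": MultinomialNB(),
    "svm": LinearSVC(),
}

fitted = {name: build(m, bigrams=True).fit(texts, labels) for name, m in models.items()}
predictions = {name: p.predict(["a wonderful and great film"])[0] for name, p in fitted.items()}
print(predictions)
```

On the real dataset, the pipelines would be fitted on a training split and scored on a held-out split, for example with `sklearn.model_selection.train_test_split` and `accuracy_score`.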

Result:


Using the Tf-Idf vectorizer for feature extraction, the highest accuracy is 0.87, obtained with the SVM model. Using Word2Vec as the word embedding technique, the highest accuracy is 0.88, obtained with both SVM and Logistic Regression.