Task : Performing sentiment analysis on movie reviews
Dataset used : IMDB Dataset of 50K Movie Reviews
Source : Kaggle
URL : https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
It consists of 50,000 rows, split equally between positive and negative reviews. There are two columns, namely "Sentiment" and "Review".
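The duplicate, NULL and class-balance checks listed in the steps can be sketched with pandas as follows. This is only an illustration: the small DataFrame stands in for the Kaggle CSV, and the lowercase column names `review` and `sentiment` are assumptions about how the file is loaded.

```python
import pandas as pd

# Toy frame standing in for the Kaggle CSV; the column names are assumptions.
df = pd.DataFrame({
    "review": ["loved it", "loved it", "hated it", "boring", None],
    "sentiment": ["positive", "positive", "negative", "negative", "positive"],
})

df = df.drop_duplicates()            # drop exact duplicate rows
null_counts = df.isnull().sum()      # number of missing values per column
df = df.dropna(subset=["review"])    # discard rows without review text

# Percentage of reviews in each sentiment class
balance = df["sentiment"].value_counts(normalize=True) * 100
print(null_counts)
print(balance.round(1))
```

On the real dataset the same `value_counts(normalize=True)` call is a quick way to confirm the equal split after duplicates are removed.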
- Importing necessary libraries
- Performing Exploratory Data Analysis
-> Showcasing 10 positive and negative reviews
-> Dropping Duplicate values
-> Checking for NULL values
-> Displaying percentage of positive and negative sentiment
-> Analysing number of words in each category of sentiment
- Data Cleaning
-> Decode HTML encoded characters
-> Removing stop words (negative stop words are kept)
-> Removing URLs
- Tokenization
- Stemming and Lemmatization
- Displaying Word Cloud
- Applying Tf-Idf vectorizer and different models
- Applying Tf-Idf with bigrams
- Applying Word2Vec as word embedding technique
- Result
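The data cleaning and tokenization steps above can be sketched with the standard library alone. This is a minimal illustration, not the project's code: the stop-word set is a tiny hypothetical stand-in for the NLTK list, the choice of negation words is an assumption, HTML tag stripping is an assumed extra step, and the order of operations is simplified.

```python
import html
import re

# Assumption: a tiny illustrative stop-word list (the project uses NLTK's list).
STOP_WORDS = {"a", "an", "the", "is", "was", "it", "this", "of", "and",
              "not", "no", "nor"}
# Assumption: these negation words are kept, since dropping them can flip
# the meaning of a review ("not good" vs "good").
NEGATIONS = {"not", "no", "nor"}
REMOVABLE = STOP_WORDS - NEGATIONS

URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"<[^>]+>")
TOKEN_RE = re.compile(r"[a-z']+")


def clean_review(text):
    """Decode HTML entities, strip tags and URLs, tokenize, drop stop words."""
    text = html.unescape(text)       # "&amp;" -> "&"
    text = TAG_RE.sub(" ", text)     # remove tags such as <br />
    text = URL_RE.sub(" ", text)     # remove URLs
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t not in REMOVABLE]


sample = "This movie was not good &amp; boring<br />see http://example.com now"
print(clean_review(sample))
```

Stemming or lemmatization would then be applied to the returned tokens.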
Libraries used :
- Pandas
- Numpy
- Sklearn
- NLTK
- Wordcloud
- BeautifulSoup
- Matplotlib
- Gensim
Models used :
- Decision Tree Classifier
- Random Forest Classifier
- Logistic Regression
- KNN
- Naive Bayes
- SVM
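As a rough sketch of how Tf-Idf features (with or without bigrams) can be fed to these models in scikit-learn, the example below trains two of them on toy data. The reviews are hypothetical, the helper `build_model` is introduced only for illustration, and the use of `LinearSVC` for the SVM is an assumption; the project's actual kernel and hyperparameters are not stated.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data standing in for the cleaned IMDB reviews (hypothetical examples).
reviews = [
    "great movie loved it",
    "wonderful acting great story",
    "loved the story wonderful film",
    "terrible movie hated it",
    "awful acting boring story",
    "hated the story boring film",
]
labels = ["positive"] * 3 + ["negative"] * 3


def build_model(classifier, use_bigrams=False):
    """Chain a Tf-Idf vectorizer and a classifier into one pipeline."""
    ngram_range = (1, 2) if use_bigrams else (1, 1)
    return make_pipeline(TfidfVectorizer(ngram_range=ngram_range), classifier)


models = {
    "svm_unigram": build_model(LinearSVC()),
    "logreg_bigram": build_model(LogisticRegression(max_iter=1000),
                                 use_bigrams=True),
}
for model in models.values():
    model.fit(reviews, labels)

predictions = {name: model.predict(["wonderful great film"])[0]
               for name, model in models.items()}
print(predictions)
```

Any other classifier from the list can be passed to `build_model` in the same way, which keeps the comparison between models limited to a single changed argument.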
Using the Tf-Idf vectorizer for feature extraction, the highest accuracy is 0.87, obtained with the SVM model. Using Word2Vec as the word embedding technique, the highest accuracy is 0.88, obtained with both SVM and Logistic Regression.
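For the Word2Vec route, each review has to be turned into one fixed-length vector before a classifier can be trained. A common approach is to average the word vectors of a review; whether this project does exactly that is an assumption. The sketch below uses a plain dictionary of made-up 3-dimensional vectors in place of a trained model, and `review_vector` is a hypothetical helper name.

```python
import numpy as np


def review_vector(tokens, word_vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


# Hypothetical toy vectors standing in for a trained Word2Vec model.
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "movie": np.array([3.0, 2.0, 0.0]),
}

vec = review_vector(["good", "movie", "unseen"], toy_vectors, dim=3)
print(vec)
```

With Gensim, the `wv` attribute of a trained `Word2Vec` model supports the same `in` and `[]` lookups, so it could be passed as `word_vectors` with `dim` set to the model's vector size.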