This project demonstrates sentiment analysis on the NLTK movie reviews dataset using machine learning techniques. The project includes data preprocessing, feature extraction using TF-IDF vectorisation, and the implementation of two machine learning models: Multinomial Naive Bayes and Logistic Regression. The aim is to classify movie reviews as positive or negative based on their content.
- Pandas
- NLTK
- Scikit-Learn
- Seaborn
- Matplotlib
The dataset used in this project is the NLTK movie reviews dataset. It contains 2,000 movie reviews categorized into positive and negative sentiments.
- Source: NLTK library
- Categories: Positive, Negative
- Number of Reviews: 2,000
sentiment-analysis/
│
├── data/
│   └── movie_reviews_dataset.csv
│
├── notebook/
│   └── sentiment_analysis.ipynb
│
├── results/
│   ├── distribution_of_sentiment_categories.png
│   └── classification_report.txt
│
└── Requirements.txt
- Data Collection:
The dataset used for this project is the NLTK movie reviews dataset, containing 2,000 labeled movie reviews (positive or negative).
- Data Preprocessing:
- Loading the Data: Using NLTK's built-in functions.
- Cleaning the Text Data: Removing stopwords, converting to lowercase, and removing punctuation.
- Tokenization: Converting text data into individual words.
- Dataframe Creation: Converting the cleaned data into a Pandas DataFrame and saving as a CSV file.
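The cleaning and tokenization steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the names `load_reviews`, `clean_review`, and `STOPWORDS` are hypothetical, and the small hard-coded stopword set stands in for NLTK's English stopword list so that the cleaning function runs without any downloads.

```python
import string

# Illustrative stand-in only; the project removes NLTK's English stopwords,
# available via nltk.corpus.stopwords.words("english").
STOPWORDS = {"a", "an", "the", "is", "it", "this", "of", "and", "to", "was"}


def load_reviews():
    """Return (raw_text, category) pairs from the NLTK corpus.

    Assumes the corpus has been fetched with nltk.download("movie_reviews").
    """
    from nltk.corpus import movie_reviews  # imported lazily; needs NLTK installed

    return [
        (movie_reviews.raw(fileid), category)
        for category in movie_reviews.categories()
        for fileid in movie_reviews.fileids(category)
    ]


def clean_review(text, stopwords=STOPWORDS):
    """Lowercase, strip punctuation, split on whitespace, and drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    return [token for token in tokens if token not in stopwords]


print(clean_review("This movie was GREAT, and the acting is superb!"))
# ['movie', 'great', 'acting', 'superb']
```

In a fuller version, the cleaned tokens for each review would presumably be joined back into a string and stored with the label in a Pandas DataFrame before being written to CSV.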
- Exploratory Data Analysis (EDA):
- Checking for missing values and removing duplicates.
- Visualizing the distribution of sentiment categories.
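As a rough sketch of these checks, the snippet below runs them on a tiny hand-made DataFrame. The column names `review` and `sentiment` are assumptions, since the actual CSV schema is not described here, and the plotting call is left commented out because it requires Seaborn and Matplotlib.

```python
import pandas as pd

# Tiny stand-in for data/movie_reviews_dataset.csv; column names are assumed.
df = pd.DataFrame({
    "review": ["great film", "dull plot", "great film", None],
    "sentiment": ["positive", "negative", "positive", "negative"],
})

# Count missing values per column before cleaning.
missing_per_column = df.isnull().sum()

# Drop rows with missing values, then exact duplicate rows.
df = df.dropna().drop_duplicates().reset_index(drop=True)

# Number of reviews in each sentiment category.
sentiment_counts = df["sentiment"].value_counts()

print(missing_per_column)
print(sentiment_counts)

# One way to plot the distribution (requires seaborn and matplotlib):
# import seaborn as sns
# import matplotlib.pyplot as plt
# sns.countplot(x="sentiment", data=df)
# plt.savefig("distribution_of_sentiment_categories.png")
```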
- Feature Extraction:
Using TF-IDF vectorization to transform text data into numerical features.
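A minimal sketch of this step with scikit-learn's `TfidfVectorizer`, using default settings on three made-up documents. The notebook may configure the vectorizer differently (for example with a vocabulary size limit), so treat the parameters here as assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up example documents, standing in for cleaned reviews.
docs = [
    "great acting great story",
    "dull story weak acting",
    "great fun",
]

vectorizer = TfidfVectorizer()
# Sparse matrix with one row per document and one column per distinct term.
X = vectorizer.fit_transform(docs)

print(X.shape)
print(sorted(vectorizer.vocabulary_))
```

Each row is a review and each column a term, with larger weights for terms that are frequent in that review but rare across the collection.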
- Model Building and Training:
- Splitting the data into training and testing sets (80-20 split).
- Training a Multinomial Naive Bayes classifier and a Logistic Regression model on the TF-IDF features.
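The split and training steps might look roughly like the sketch below, run on ten made-up reviews. The `random_state`, the `stratify` argument, `max_iter=1000`, and the choice to fit the vectorizer on the training split only are assumptions of this sketch and may differ from the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Made-up toy data standing in for the 2,000 cleaned reviews.
texts = [
    "great acting wonderful story", "loved every minute", "brilliant funny film",
    "superb cast great script", "wonderful moving performance",
    "dull plot weak acting", "boring waste of time", "terrible script awful film",
    "weak story poor cast", "awful boring mess",
]
labels = ["positive"] * 5 + ["negative"] * 5

# 80-20 split; random_state and stratify are assumptions made for reproducibility.
X_train_text, X_test_text, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# Fit TF-IDF on the training texts, then apply the same vocabulary to the test texts.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)

nb_model = MultinomialNB().fit(X_train, y_train)
lr_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print(nb_model.predict(X_test))
print(lr_model.predict(X_test))
```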
- Model Evaluation:
Calculating the accuracy score and generating classification reports for both models.
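For illustration, the evaluation can be sketched with scikit-learn's `accuracy_score` and `classification_report`. The labels and predictions below are made up and are not the project's results.

```python
from sklearn.metrics import accuracy_score, classification_report

# Made-up true labels and predictions, for illustration only.
y_true = ["positive", "negative", "positive", "negative", "positive"]
y_pred = ["positive", "negative", "negative", "negative", "positive"]

accuracy = accuracy_score(y_true, y_pred)          # fraction of correct predictions
report = classification_report(y_true, y_pred)     # per-class precision, recall, F1

print(f"Accuracy: {accuracy:.3f}")
print(report)
```

In the project, the same two calls would be applied to each model's predictions on the test set.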
- Clone the repository:
git clone https://github.com/ellahu1003/sentiment-analysis-project.git
cd sentiment-analysis-project
- Install the required libraries:
pip install -r Requirements.txt
- Run the Jupyter Notebook:
jupyter notebook notebook/sentiment_analysis.ipynb
The 'Requirements.txt' file lists all the Python packages required to run the project. Install these dependencies to avoid any compatibility issues.
- The accuracy of the Multinomial Naive Bayes model: 0.785.
- The accuracy of the Logistic Regression model: 0.795.
- Detailed classification reports for both models are available in classification_report.txt.
- Distribution of the sentiment categories is visualised in distribution_of_sentiment_categories.png.
This project demonstrates how natural language processing and machine learning techniques can be applied to classify movie reviews as positive or negative. On the 20% test split, Logistic Regression reached an accuracy of 0.795, slightly ahead of Multinomial Naive Bayes at 0.785.