This repository contains a Jupyter Notebook that demonstrates the effect of feature selection techniques on sentiment analysis models for reviews of medical practitioners. The project trains multiple models and compares their performance under each feature selection method.
The dataset is based on medical reviews from RateMD and has been modified for the purpose of this report.
The code in the notebook covers the following:
- Provides a benchmark for evaluating other models.
- Chi-Square
- Mutual Information (MI)

Both are implemented using `SelectKBest` from scikit-learn.
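
As a rough illustration of how `SelectKBest` can apply these two criteria, here is a minimal sketch. The toy reviews, the labels, the use of `CountVectorizer`, and `k=3` are all assumptions made for this example and are not taken from the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

# Hypothetical toy reviews and labels (1 = positive, 0 = negative).
reviews = [
    "great doctor very caring",
    "rude staff long wait",
    "caring and helpful doctor",
    "long wait and rude doctor",
]
labels = [1, 0, 1, 0]

# Bag-of-words counts are non-negative, which chi2 requires.
X = CountVectorizer().fit_transform(reviews)

# Keep the k highest-scoring terms under each criterion (k=3 is arbitrary).
X_chi2 = SelectKBest(chi2, k=3).fit_transform(X, labels)
X_mi = SelectKBest(mutual_info_classif, k=3).fit_transform(X, labels)

print(X_chi2.shape, X_mi.shape)
```

Both selectors return a matrix with the same rows and only the `k` retained columns, so either result can be passed to a classifier in place of the full feature matrix.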
- TF-IDF
- Word Embedding using SBERT
- Evaluated with no feature selection, Chi-Squared, MI, TF-IDF, and Word Embedding.
- Evaluated with no feature selection, Chi-Squared, MI, TF-IDF, and Word Embedding.
- Evaluated for its performance across various feature selection methods.
- Tested with no feature selection, Chi-Squared, MI, TF-IDF, and Word Embedding.
- The best-performing model was selected to predict sentiment labels on a Kaggle dataset.
Each model, combined with feature selection methods, is evaluated using:
- Classification reports
- Heatmaps for performance visualization
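
The evaluation step might look roughly like the following sketch, which uses scikit-learn's `classification_report` and `confusion_matrix`. The labels and predictions are made up for illustration, and whether the notebook's heatmaps show confusion matrices or metric tables is an assumption.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true labels and model predictions (1 = positive, 0 = negative).
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

# Per-class precision, recall, and F1, plus overall accuracy.
print(classification_report(y_true, y_pred, target_names=["negative", "positive"]))

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)
```

A matrix like `cm` can then be passed to a plotting function such as a seaborn or matplotlib heatmap to visualize where a model's errors fall.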
The project provides insights into how the choice of model and feature selection method affects sentiment analysis accuracy and interpretability. Please see the report for more information.