The IMDB dataset consists of 50,000 movie reviews, split 50/50 into a training set and a test set. The dataset is balanced: it contains 25,000 positive and 25,000 negative reviews. The goal of this project was to develop a sentiment analyzer that determines whether a review is positive or negative. The train/test split is the one already built into the IMDB class constructor. The training set is then further split into training and validation sets using an 80/20 ratio.
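A minimal sketch of the 80/20 train/validation split, using only the Python standard library. The helper name `train_valid_split`, the fixed seed, and the placeholder data are assumptions for illustration and may differ from what the repo actually does.

```python
import random

def train_valid_split(examples, train_ratio=0.8, seed=0):
    """Shuffle a copy of `examples` and split it into (train, valid) lists."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(examples)      # copy, so the caller's data is not reordered
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Placeholder data standing in for the 25,000 labeled training reviews.
reviews = [(f"review {i}", i % 2) for i in range(25000)]
train_set, valid_set = train_valid_split(reviews)
print(len(train_set), len(valid_set))  # 20000 5000
```

With 25,000 training reviews, an 80/20 ratio leaves 20,000 reviews for training and 5,000 for validation.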
The vocabulary for the dataset was built using pretrained GloVe embeddings with embedding_dim=300. During vocabulary creation, only words that occur at least 10 times were included; all other words are mapped to the unknown token. The maximum vocabulary size was fixed at 25,000.
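The frequency and size limits can be illustrated with a small standard-library sketch. The function names, the `<unk>`/`<pad>` token strings, and the choice to add the special tokens on top of the size limit are assumptions here, not necessarily what the repo does; loading the GloVe vectors for each index is left out.

```python
from collections import Counter

UNK, PAD = "<unk>", "<pad>"  # assumed special tokens

def build_vocab(tokenized_texts, min_freq=10, max_size=25000):
    """Map the most frequent tokens to indices, dropping rare ones."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    itos = [UNK, PAD]
    # most_common() is sorted by frequency, so we can stop at the first rare token.
    for tok, freq in counts.most_common(max_size):
        if freq < min_freq:
            break
        itos.append(tok)
    return {tok: i for i, tok in enumerate(itos)}

def numericalize(tokens, stoi):
    """Replace tokens with indices; out-of-vocabulary tokens become UNK."""
    return [stoi.get(tok, stoi[UNK]) for tok in tokens]

# Toy corpus: "great" occurs 15 times, "movie" 12 times, "plot" only 3 times.
corpus = [["great", "movie"]] * 12 + [["great", "plot"]] * 3
stoi = build_vocab(corpus)
print(numericalize(["great", "movie", "plot"], stoi))  # [2, 3, 0]
```

In the toy corpus, "plot" falls below the minimum frequency and is therefore mapped to index 0, the unknown token.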
The network consists of an Embedding layer, two bidirectional LSTM layers, and a fully connected (linear) layer that computes the output probability.
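The architecture described above could look roughly like the following PyTorch sketch. The class name `SentimentLSTM`, the hidden size of 256, the padding index, and the use of the final hidden states of the last LSTM layer are assumptions; the actual model in `src/` may differ.

```python
import torch
import torch.nn as nn

class SentimentLSTM(nn.Module):
    """Embedding -> 2-layer bidirectional LSTM -> linear -> sigmoid."""

    def __init__(self, vocab_size, embedding_dim=300, hidden_dim=256, pad_idx=1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=2,
                            bidirectional=True, batch_first=True)
        # Forward and backward states are concatenated, hence 2 * hidden_dim.
        self.fc = nn.Linear(2 * hidden_dim, 1)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)      # (batch, seq_len, embedding_dim)
        _, (hidden, _) = self.lstm(embedded)      # hidden: (num_layers * 2, batch, hidden_dim)
        # Last layer's forward (hidden[-2]) and backward (hidden[-1]) final states.
        final = torch.cat([hidden[-2], hidden[-1]], dim=1)
        return torch.sigmoid(self.fc(final)).squeeze(1)  # (batch,) probabilities

model = SentimentLSTM(vocab_size=25000)
batch = torch.randint(0, 25000, (4, 20))  # 4 dummy reviews of 20 token ids each
probs = model(batch)
print(probs.shape)  # torch.Size([4])
```

In a real run, the embedding weights would be initialized from the pretrained GloVe vectors rather than randomly.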
| Train Acc. (%) | Validation Acc. (%) | Test Acc. (%) | Test loss |
|---|---|---|---|
| 89.79 | 88.81 | 88.22 | 0.307 |
- Open Anaconda Prompt and navigate to the directory of this repo:

  `cd PATH_TO_THIS_REPO`

- Create the environment:

  `conda env create -f environment.yml`

  This sets up an environment with all necessary dependencies.

- Activate the newly created environment:

  `conda activate sentiment-analysis`

- Train and/or test the model.

  a) Start the main script:

  `python src/main.py`

  This instantiates the model and starts training it once the dataset is loaded. After training, the model's performance is evaluated on the test set.

  b) If you don't want to train the model, you can use the model I pretrained on Google Colab:

  `python src/main.py --mode test`

  This loads the pretrained weights and evaluates the model's performance on the test set.