Dataset information (a loading sketch follows this list):
- Clothing ID: A unique identifier for each clothing item.
- Age: The age of the reviewer.
- Title: The title of the review provided by the reviewer.
- Review Text: The detailed text of the review where the reviewer expresses their thoughts and opinions about the product.
- Rating: The rating given by the reviewer, typically on a scale from 1 to 5 stars.
- Recommended IND: A binary indicator of whether the reviewer recommended the product (1 for recommended, 0 for not recommended).
- Positive Feedback Count: The count of positive feedback or "likes" received on the review by other users.
- Division Name: The name of the product division or category.
- Department Name: The name of the department within the division to which the product belongs.
- Class Name: The specific class or category to which the product belongs.
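A minimal sketch for loading the dataset and confirming these columns, assuming the standard Kaggle CSV filename (an assumption; adjust the path to your copy):

```python
import pandas as pd

# Assumed filename: the standard Kaggle export of this dataset
df = pd.read_csv("Womens Clothing E-Commerce Reviews.csv")

print(df.columns.tolist())  # should match the columns listed above
print(df.shape)
df.head()
```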
Project steps:
- Perform data cleaning and exploratory data analysis (EDA) on the dataset to uncover insights from product reviews.
- Test different types of BERT models from Hugging Face with varying output classes. This step involves experimenting with pretrained models to evaluate their performance on the dataset without fine-tuning.
- Decide on the number of output classes (2, 3, or 5) and the type of BERT model (BERT, roBERTa, or distilBERT) to use for the final sentiment analysis model.
- Fine-tune the chosen model on the dataset once the number of output classes and the model type have been decided.
Data cleaning steps (see the sketch below):
- Removed null values
- Removed duplicated rows
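A minimal sketch of these two cleaning steps in pandas, assuming the dataframe `df` from the loading sketch above (whether nulls were dropped across all columns or only from `Review Text` is not stated, so `dropna()` over all columns is an assumption):

```python
# Drop rows containing null values (all columns -- an assumption;
# the original notebook may have targeted only 'Review Text')
df = df.dropna()

# Drop exact duplicate rows and reindex
df = df.drop_duplicates().reset_index(drop=True)

print(df.shape)  # verify the reduced row count
```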
Answers to the EDA questions:
- From the graph, the rating counts increase in a roughly linear trend from 1 star to 5 stars.
- While there is some imbalance, it doesn't seem extreme, as there is a reasonable spread of ratings across all values from 1 to 5 stars.
- From this dataset, dresses, knits, and blouses are among the most frequently reviewed product classes.
- From the plot, every product class shows a fairly balanced distribution of ratings.
- The maximum review length (max_length) is 115. This value determines the `max_length` tokenization setting, which is crucial for the fine-tuning process (see the sketch below).
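A minimal sketch of how that maximum length can be measured, assuming the distilBERT tokenizer; whether the original figure of 115 counts words or subword tokens is not stated:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Subword-token count per review (without special tokens), then the maximum
lengths = df["Review Text"].astype(str).apply(
    lambda text: len(tokenizer.encode(text, add_special_tokens=False))
)
print("max_length:", lengths.max())
```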
Feature | BERT | RoBERTa | DistilBERT |
---|---|---|---|
Training Objectives | MLM (Masked Language Model) + NSP | MLM only | MLM + knowledge distillation from a BERT teacher |
Masking Strategy | Static masking (fixed during preprocessing) | Dynamic masking (re-generated each epoch) | Dynamic masking (follows RoBERTa's recipe) |
Next Sentence Prediction (NSP) | Yes | No | No |
Training Data & Duration | ~16 GB of text (BooksCorpus + Wikipedia) | ~160 GB of text, trained longer | Same corpus as BERT, far cheaper to train via distillation |
Sentence Embeddings | [CLS] token | `<s>` token (RoBERTa's equivalent of [CLS]) | [CLS] token |
Batch Size | 256 sequences | Much larger batches (up to 8K sequences) | Large batches via gradient accumulation |
Model Size (base) | ~110M parameters | ~125M parameters | ~66M parameters |
Number of Layers | 12 (base) or 24 (large) | 12 (base) or 24 (large) | 6 (distilled from BERT-base's 12) |
Performance | Benchmark baseline | Improved results on most benchmarks | ~97% of BERT's performance, 40% smaller, 60% faster |
Testing different pretrained BERT-family models with different numbers of output classes, without fine-tuning them on our dataset (a zero-shot evaluation sketch follows this list):
- BERT model: Pretrained with 5 output classes (1 star to 5 stars) - Link to the model
- roBERTa model: Pretrained with 3 output classes (0: Negative, 1: Neutral, 2: Positive) - Link to the model
- distilBERT model: Pretrained with 2 output classes (0: Negative, 1: Positive) - Link to the model
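A minimal sketch of this zero-shot evaluation using the `transformers` pipeline. Only the roBERTa checkpoint is confirmed later in this document; the BERT and distilBERT checkpoint names below are assumptions based on the class counts described above:

```python
from transformers import pipeline

# Checkpoint names: only the roBERTa one is confirmed in this document;
# the other two are assumed from their output-class counts
models = {
    "bert-5-class": "nlptown/bert-base-multilingual-uncased-sentiment",       # assumption
    "roberta-3-class": "cardiffnlp/twitter-roberta-base-sentiment-latest",
    "distilbert-2-class": "distilbert-base-uncased-finetuned-sst-2-english",  # assumption
}

review = "this dress is kinda okay"
for name, checkpoint in models.items():
    classifier = pipeline("sentiment-analysis", model=checkpoint)
    print(name, classifier(review))
```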
Models | Accuracy | Precision | Recall | F1 Score |
---|---|---|---|---|
pretrained BERT model (5 output classes) | 0.566 | 0.653 | 0.566 | 0.592 |
pretrained roBERTa model (3 output classes) | 0.793 | 0.771 | 0.793 | 0.776 |
pretrained distilBERT (2 output classes) | 0.837 | 0.850 | 0.837 | 0.842 |
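In the table above, accuracy equals recall for each model, which suggests the precision/recall/F1 values are weighted averages across classes. A sketch of that computation with scikit-learn (the weighted averaging is an inference, not stated in the original):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(y_true, y_pred):
    """Accuracy plus weighted precision/recall/F1, matching the table columns."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted", zero_division=0
    )
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```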
- As expected, models with fewer output classes face an easier classification task, hence the higher accuracy.
- The BERT model with 5 output classes performed the worst: each star rating covers only a narrow slice of sentiment, and distinguishing adjacent ratings (e.g., 4 vs. 5 stars) is much harder than separating broad sentiment categories.
- Even without fine-tuning, distilBERT already achieved a solid accuracy of 84% with 2 output classes.
- In comparison, the roBERTa model also achieved a relatively high accuracy of 79% even though it is predicting 3 classes, only about 4 percentage points behind distilBERT. This is also expected, because roBERTa has the largest number of parameters among the three.
Reasons for choosing 3 output classes:
- Multi-class Classification (5 classes): Avoided because each star rating covers a narrow sentiment range; capturing such fine-grained distinctions effectively would require a larger dataset.
- Binary Classification (2 classes): Not chosen because the dataset's rating distribution is relatively balanced; collapsing it into two classes risks oversimplifying the problem and losing information.
- Chose 3 output classes to distinguish between positive, negative, and neutral sentiments, providing richer insights.
- For this project, distilBERT was chosen over BERT and roBERTa because it is faster in both training and inference. DistilBERT's smaller size and streamlined architecture make computations quicker, ensuring computational efficiency throughout the model's lifecycle.
- Link to the base-distilBERT model
- The model was fine-tuned in a Google Colab environment (utilizing a GPU).
- The fine-tuned model was trained on the clothing dataset (a preprocessing sketch follows this list).
- The fine-tuned model was evaluated on a test set held out via train_test_split.
- The fine-tuned model was compared with the pretrained roBERTa model with 3 output classes.
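A minimal sketch of the preprocessing for 3-class fine-tuning. The rating-to-sentiment mapping (1-2 = negative, 3 = neutral, 4-5 = positive) and the 80/20 stratified split are assumptions; the `max_length` of 115 comes from the EDA above:

```python
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer

# Assumed mapping from star ratings to 3 sentiment classes
def rating_to_label(rating):
    if rating <= 2:
        return 0  # negative
    if rating == 3:
        return 1  # neutral
    return 2      # positive

texts = df["Review Text"].astype(str).tolist()
labels = df["Rating"].apply(rating_to_label).tolist()

# Assumed 80/20 split, stratified to preserve the class balance
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=115)
test_encodings = tokenizer(test_texts, truncation=True, padding=True, max_length=115)
```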
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # where checkpoints are written
    num_train_epochs=10,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,                # linear learning-rate warmup
    learning_rate=5e-5,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
)
```
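A minimal sketch of wiring these arguments into a `Trainer`, assuming the encodings and labels from the preprocessing sketch above; the `ReviewDataset` wrapper class is an assumption about how the original notebook fed data to the Trainer:

```python
import torch
from transformers import AutoModelForSequenceClassification, Trainer

class ReviewDataset(torch.utils.data.Dataset):
    """Wraps tokenizer output and labels for the Trainer."""
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=3  # 3 sentiment classes
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=ReviewDataset(train_encodings, train_labels),
    eval_dataset=ReviewDataset(test_encodings, test_labels),
)
trainer.train()
```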
Model | Accuracy | Precision | Recall | F1 Score |
---|---|---|---|---|
pretrained roBERTa (3 classes) | 0.789 | 0.772 | 0.789 | 0.773 |
pretrained distilBERT (2 classes) | 0.837 | 0.850 | 0.837 | 0.842 |
Fine-tuned distilBERT model (3 classes) | 0.849 | 0.860 | 0.849 | 0.853 |
- Pretrained model used: cardiffnlp/twitter-roberta-base-sentiment-latest
- Fine-tuned base model: distilbert-base-uncased
- From the table, the fine-tuned distilBERT model showed a slight performance improvement compared to the pretrained distilBERT with 2 output classes. This improvement is noteworthy, considering the expectation that having 3 output classes could potentially lead to a lower accuracy. The fine-tuning process allows the model to adapt more closely to the nuances of the specific sentiment analysis task, resulting in enhanced performance.
- Also, the pretrained roBERTa model demonstrates competitive performance, closely trailing the fine-tuned distilBERT, even without undergoing the fine-tuning process. This result aligns with the expectation that roBERTa, with its larger number of parameters and advanced architecture, has the potential for strong out-of-the-box performance.
- Thus, fine-tuning the roBERTa model could present an opportunity to surpass the performance of the fine-tuned distilBERT.
Next steps:
- Continue fine-tuning the distilBERT model with different hyperparameters to achieve a higher accuracy.
- Fine-tune a roBERTa base model on the dataset and compare it with the fine-tuned distilBERT.
Link to the fine-tuned model: https://huggingface.co/ongaunjie/distilbert-cloths-sentiment
- Example input: "this dress is kinda okay"
- 0 - Negative
- 1 - Neutral
- 2 - Positive
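A minimal sketch of running the fine-tuned model on the example input via the `pipeline` API. The raw label names (e.g., `LABEL_1`) assume the default config; if human-readable label names were set during fine-tuning, the mapping step is unnecessary:

```python
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis", model="ongaunjie/distilbert-cloths-sentiment"
)

result = classifier("this dress is kinda okay")[0]
print(result)  # e.g. {'label': 'LABEL_1', 'score': ...}

# Map the raw label back to a sentiment, per the list above (an assumption)
id2label = {"LABEL_0": "Negative", "LABEL_1": "Neutral", "LABEL_2": "Positive"}
print(id2label.get(result["label"], result["label"]))
```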