This repository contains a sentiment analysis of posts from several Myers-Briggs personality subreddits.
For each personality type, posts were scraped from the corresponding subreddit (e.g. r/infj for the INFJ personality type) using the subreddit_post_scraper.py script.
The following models were applied:
- DistilBERT base uncased finetuned SST-2, to obtain a POSITIVE and a NEGATIVE score for each post.
- Distilbert-base-uncased-emotion, to score each post on the following emotions: love, joy, anger, sadness, surprise and fear.
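As a rough illustration of how a single post could be scored with these two models, here is a minimal sketch. It assumes a recent version of the Hugging Face `transformers` package and assumes the Hub identifiers `distilbert-base-uncased-finetuned-sst-2-english` and `bhadresh-savani/distilbert-base-uncased-emotion` for the two models. The helper names (`to_score_dict`, `score_post`, `build_classifiers`) are hypothetical and not taken from this repository.

```python
def to_score_dict(result):
    """Turn pipeline output into a {label: score} dict."""
    # Some transformers versions wrap single-input results in an extra list.
    if result and isinstance(result[0], list):
        result = result[0]
    return {item["label"]: float(item["score"]) for item in result}


def score_post(text, classifier):
    """Score one post; truncation keeps long posts within the model's input limit."""
    return to_score_dict(classifier(text, truncation=True))


def build_classifiers():
    """Load both models (downloads weights on first use)."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import pipeline

    sentiment = pipeline(
        "text-classification",
        model="distilbert-base-uncased-finetuned-sst-2-english",  # assumed Hub id
        top_k=None,  # return a score for every label, not only the best one
    )
    emotion = pipeline(
        "text-classification",
        model="bhadresh-savani/distilbert-base-uncased-emotion",  # assumed Hub id
        top_k=None,
    )
    return sentiment, emotion
```

With the models loaded through `sentiment, emotion = build_classifiers()`, calling `score_post(post_text, sentiment)` should give a dict keyed by POSITIVE and NEGATIVE, and `score_post(post_text, emotion)` a dict keyed by the six emotions.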
The notebooks type_analysis.ipynb and aggregate_analysis.ipynb contain visualizations of the analysis.
type_analysis.ipynb shows:
- The POSITIVE/NEGATIVE percentage and average emotion scores for each personality subreddit, along with a comparison across subreddits.
- A frequency wordcloud for each subreddit.
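The per-subreddit summary can be sketched as follows. This is only one plausible reading of "POSITIVE/NEGATIVE percentage and average emotion": the share of posts whose top sentiment label is POSITIVE, and the mean score of each emotion across posts. The record layout (`label`, `emotions`) and the function name `summarize` are assumptions made for illustration, not the notebook's actual code.

```python
from collections import defaultdict


def summarize(posts):
    """Summarize one subreddit.

    posts: list of dicts with a 'label' ("POSITIVE" or "NEGATIVE") and an
    'emotions' dict mapping emotion name to score (hypothetical layout).
    """
    if not posts:
        raise ValueError("at least one post is required")

    positive = sum(1 for post in posts if post["label"] == "POSITIVE")
    positive_pct = 100.0 * positive / len(posts)

    # Sum each emotion's score over all posts, then divide by the post count.
    totals = defaultdict(float)
    for post in posts:
        for emotion, score in post["emotions"].items():
            totals[emotion] += score
    mean_emotions = {emotion: total / len(posts) for emotion, total in totals.items()}

    return {
        "positive_pct": positive_pct,
        "negative_pct": 100.0 - positive_pct,
        "mean_emotions": mean_emotions,
    }
```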
aggregate_analysis.ipynb includes:
- A study of whether Myers-Briggs traits (Extraversion/Introversion, Sensing/Intuition, Thinking/Feeling and Judging/Perceiving) are associated with the sentiment/emotion scores of subreddit posts. The association is measured with Chi-squared tests and odds ratios and illustrated with visualizations.
- Visualizations of the same analysis, but grouped by Dominant Cognitive Functions.
- A clustering of personalities based on POSITIVE/NEGATIVE percentage and average emotions of their subreddit posts.
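To make the Chi-squared and odds ratio step concrete, here is a small self-contained sketch for a single 2x2 table, for example Introversion/Extraversion against POSITIVE/NEGATIVE post counts. The counts are invented for illustration and the function names are hypothetical. The notebook may well use a statistics library rather than these hand-written formulas.

```python
def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared statistic (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("every row and column needs a non-zero total")
    return n * (a * d - b * c) ** 2 / denominator


def odds_ratio(a, b, c, d):
    """Odds of the first column in row 1 divided by the same odds in row 2."""
    if b == 0 or c == 0:
        raise ValueError("odds ratio is undefined when b or c is zero")
    return (a * d) / (b * c)


# Made-up counts: rows are Introvert / Extravert subreddits,
# columns are POSITIVE / NEGATIVE posts.
a, b = 60, 40
c, d = 45, 55

chi2 = chi_squared_2x2(a, b, c, d)
ratio = odds_ratio(a, b, c, d)
print(f"chi-squared = {chi2:.3f}, odds ratio = {ratio:.3f}")
```

With one degree of freedom, a statistic above 3.841 is significant at the 5% level. An odds ratio above 1 means the first row has higher odds of a POSITIVE post than the second row.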
To run the code in this repository, create a virtual environment and install the dependencies with the following commands.
virtualenv venv
source ./venv/bin/activate
pip install -r requirements.txt
To execute subreddit_post_scraper.py, you first need an instance of a MySQL database to connect to.
You also need the parameters for your Reddit account and for the MySQL database. All of them must be placed in a config.py file that follows the schema of config.example.py.