- Overview
- Data Collection
- Reddit Data Collection Using Pushshift Reddit API Code Link
- Dataset
- Data cleaning and preprocessing
- Exploratory Data Analysis
- Feature engineering
- Model Building and Evaluation
- Model performance
- Future Improvements
- User interface
- Productionization
- Technologies
- References
During the sophomore year of my bachelor's degree, I stumbled upon a book titled "Gifts Differing: Understanding Personality Type" by Isabel Briggs Myers and Peter B. Myers through a friend I met on Reddit:

"This book distinguishes four categories of personality styles and shows how these qualities determine the way you perceive the world and come to conclusions about what you've seen."

Later that same year, I came across a self-report questionnaire by the same author, the "Myers–Briggs Type Indicator (MBTI)", designed to identify a person's personality type, strengths, and preferences. Based on this instrument, people are identified as having one of 16 personality types:
- ISTJ - The Inspector
- ISTP - The Crafter
- ISFJ - The Protector
- ISFP - The Artist
- INFJ - The Advocate
- INFP - The Mediator
- INTJ - The Architect
- INTP - The Thinker
- ESTP - The Persuader
- ESTJ - The Director
- ESFP - The Performer
- ESFJ - The Caregiver
- ENFP - The Champion
- ENFJ - The Giver
- ENTP - The Debater
- ENTJ - The Commander
Around the same time, I became interested in machine learning and data science. One of the things that drew me to ML was discovering that most dating applications don't actually use machine learning to match people. This article explains how Tinder matched people for so long; let me quote part of it here:
"A few years ago, Tinder let Fast Company reporter Austin Carr look at his “secret internal Tinder rating,” and vaguely explained to him how the system worked. Essentially, the app used an Elo rating system, which is the same method used to calculate the skill levels of chess players: You rose in the ranks based on how many people swiped right on (“liked”) you, but that was weighted based on who the swiper was. The more right swipes that person had, the more their right swipe on you meant for your score. Tinder would then serve people with similar scores to each other more often, assuming that people whom the crowd had similar opinions of would be in approximately the same tier of what they called “desirability.” (Tinder hasn’t revealed the intricacies of its points system, but in chess, a newbie usually has a score of around 800 and a top-tier expert has anything from 2,400 up.) (Also, Tinder declined to comment for this story.) "
Influenced by all of this, I came up with the idea of a Myers–Briggs Type Indicator (MBTI) classifier that can classify your personality type based on Isabel Briggs Myers's MBTI instrument. The classification result can then be used to match people with the most compatible personality types.
One of the most difficult challenges was identifying what kind of data to collect for classifying Myers–Briggs personality types. During my final-year research project at university, I had collected data from Reddit, specifically posts from mental health communities; by analyzing posts written by users, my proposed model could accurately identify whether a user's post belonged to a specific mental disorder. I applied similar reasoning in this project. To my surprise, Reddit has subreddits for all 16 personality types, some with as many as 133K members, though a few have only a few thousand. I collected data from all 16 of these subreddits using the Pushshift Reddit API.
Reddit Data Collection Using Pushshift Reddit API Code Link
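Below is a minimal sketch of what such a collection script can look like, using the PSAW wrapper around the Pushshift API; the subreddit name, post limit, and output filename are illustrative placeholders rather than the exact values in the linked script.

```python
# Hedged sketch: collect submissions from one MBTI subreddit via Pushshift (PSAW).
# Subreddit, limit, and output filename are illustrative placeholders.
import pandas as pd
from psaw import PushshiftAPI

api = PushshiftAPI()

def collect_posts(subreddit, limit=10_000):
    """Fetch submissions from one subreddit and keep the fields used downstream."""
    submissions = api.search_submissions(
        subreddit=subreddit,
        filter=["subreddit", "selftext", "created_utc"],
        limit=limit,
    )
    rows = [
        {
            "Subreddit": subreddit,
            "Body": getattr(s, "selftext", ""),
            "Date": pd.to_datetime(getattr(s, "created_utc", None), unit="s"),
        }
        for s in submissions
    ]
    return pd.DataFrame(rows)

if __name__ == "__main__":
    collect_posts("infj").to_csv("infj_posts.csv", index=False)
```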
| Subreddit | Number of subscribers | Number of posts collected |
|---|---|---|
| ISTJ | 12K | 2,600 |
| INFJ | 101K | 10,000 |
| INTJ | 108K | 6,400 |
| ENFJ | 18.9K | 6,600 |
| ISTP | 19.3K | 9,200 |
| ESFJ | 4K | 800 |
| INFP | 133K | 8,600 |
| ESTP | 5K | 830 |
| ENFP | 68K | 1,200 |
| ESFP | 5K | 1,700 |
| ESTJ | 2.8K | 700 |
| ENTJ | 20K | 9,000 |
| INTP | 121K | 12,000 |
| ISFJ | 12K | 4,400 |
| ENTP | 44K | 7,600 |
| ISFP | 16K | 4,100 |
This data was collected into a total of 16 CSV files; during data cleaning and preprocessing, the 16 files were concatenated into a single final CSV file with the following schema:
| Subreddit | Body | Date |
|---|---|---|
| Subreddit name of the post | Text of the post | Posting date |
Data cleaning and preprocessing included the following steps (a sketch follows the list):
- Removing rows with links in the Body feature
- Removing rows with emojis in the Body feature
- Removing rows with HTML elements in the Body feature
- Removing punctuation from the Body feature
- Removing stopwords from the Body feature
- Removing rows with [removed] in the Body feature
- Removing rows with [deleted] in the Body feature
- Removing rows with just numbers in the Body feature
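A minimal sketch of that cleaning pass, assuming pandas and NLTK stopwords; the regexes and column names mirror the steps above but are illustrative, not the project's exact code.

```python
# Hedged sketch of the cleaning pass; regexes and column names are illustrative.
import re
import pandas as pd
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOPWORDS = set(stopwords.words("english"))

def clean_posts(df: pd.DataFrame) -> pd.DataFrame:
    body = df["Body"].astype(str)
    # Drop rows that are placeholders, links, HTML, emoji-bearing, or only numbers
    keep = (
        ~body.isin(["[removed]", "[deleted]"])
        & ~body.str.contains(r"https?://\S+", regex=True)
        & ~body.str.contains(r"<[^>]+>", regex=True)
        & body.map(lambda t: t.isascii())        # filters rows containing emojis
        & ~body.str.fullmatch(r"\s*\d+\s*")
    )
    out = df[keep].copy()
    # Strip punctuation, lowercase, and remove stopwords from the remaining text
    def normalize(text):
        text = re.sub(r"[^\w\s]", " ", text.lower())
        return " ".join(t for t in text.split() if t not in STOPWORDS)
    out["Body"] = out["Body"].astype(str).map(normalize)
    return out[out["Body"].str.len() > 0]
```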
Exploratory Data Analysis included the following:
- Class Imbalance check
- N-gram Analysis
- Generating WordClouds
During data collection, I noticed that some subreddits did not have many posts, reflected in the small amount of data my code collected for the ESTJ, ESTP, ESFP, ESFJ, ISTJ, and ISFJ subreddits; as a result, during EDA I observed class imbalance.
One of the most effective ways to address class imbalance in NLP tasks is an oversampling technique called SMOTE (Synthetic Minority Oversampling Technique), so I used SMOTE to balance the classes for this problem.
For multinomial logistic regression, I used Bag-of-words and TF-IDF features of each Reddit post; a sketch of both featurizations, plus the SMOTE step, follows.
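A minimal sketch of the featurization and oversampling, assuming scikit-learn and imbalanced-learn; the input filename, `max_features`, and split parameters are illustrative assumptions.

```python
# Hedged sketch: Bag-of-words / TF-IDF features plus SMOTE oversampling.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

df = pd.read_csv("mbti_posts_clean.csv")            # the concatenated, cleaned dataset
X_train, X_test, y_train, y_test = train_test_split(
    df["Body"], df["Subreddit"],
    test_size=0.2, stratify=df["Subreddit"], random_state=42,
)

bow = CountVectorizer(max_features=20_000)          # Bag-of-words featurizer
tfidf = TfidfVectorizer(max_features=20_000)        # TF-IDF featurizer

X_train_bow = bow.fit_transform(X_train)
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

# Oversample only the training split so the test set stays untouched
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train_tfidf, y_train)
```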
To visualize the high-dimensional embeddings, I reduced the TF-IDF / Bag-of-words feature vectors to two dimensions with Truncated SVD and plotted the resulting 2-D embeddings. The visualization is not linearly separable in 2-D, so models like SVM and logistic regression were unlikely to perform well; that was the rationale for also using an RNN architecture with LSTM in this project.
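A sketch of that projection, reusing the TF-IDF matrix and labels from the previous snippet; the plotting details are illustrative.

```python
# Hedged sketch: project TF-IDF features to 2-D with TruncatedSVD and plot them.
import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=2, random_state=42)
X_2d = svd.fit_transform(X_train_tfidf)            # TF-IDF matrix from the step above

# Color points by class to see whether the 16 types separate linearly in 2-D
for label in sorted(y_train.unique()):
    mask = (y_train == label).to_numpy()
    plt.scatter(X_2d[mask, 0], X_2d[mask, 1], s=4, label=label)
plt.legend(markerscale=3, fontsize=6, ncol=2)
plt.title("TF-IDF features projected to 2-D with Truncated SVD")
plt.show()
```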
For this project, I trained three models:
- Multinomial Logistic Regression with Bag-of-words features. Logistic regression, by default, is limited to two-class classification problems. Extensions like one-vs-rest allow it to be used for multi-class problems, but they require transforming the problem into multiple binary classification problems. Multinomial logistic regression instead extends the logistic regression model by changing the loss function to cross-entropy loss and the predicted probability distribution to a multinomial distribution, natively supporting multi-class classification.
- Multinomial Logistic Regression with TF-IDF features. The same multinomial formulation as above, trained on TF-IDF features instead of raw counts.
- Recurrent Neural Network with LSTM. Feed-forward neural networks have no memory of the inputs they receive and are bad at predicting what comes next: because a feed-forward network only considers the current input, it has no notion of order in time and cannot remember anything about the past beyond what is baked into its weights during training. In an RNN, information cycles through a loop: when the network makes a decision, it considers the current input as well as what it has learned from previous inputs. A long short-term memory (LSTM) network is a type of RNN that avoids the vanishing gradient problem by adding 'forget' gates. A sketch of such a network follows the list.
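A minimal sketch of an LSTM text classifier in Keras; the vocabulary size, embedding width, and unit counts are illustrative, not the project's exact architecture.

```python
# Hedged sketch of an LSTM classifier for the 16 MBTI classes.
import tensorflow as tf

VOCAB_SIZE, NUM_CLASSES = 20_000, 16               # illustrative sizes

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 128),    # learned word embeddings
    tf.keras.layers.LSTM(128),                     # gated memory mitigates vanishing gradients
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
```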
| Algorithm | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| Multinomial Logistic Regression with Bag-of-words features | 45.17% | 0.48 | 0.47 | 0.45 |
| Multinomial Logistic Regression with TF-IDF features | 50.20% | 0.55 | 0.58 | 0.56 |
| Recurrent Neural Network with LSTM | 95.33% | 0.70 | 0.69 | 0.69 |
Looking at the train and test accuracy and loss curves over epochs, it is visible that the model started to overfit after 8 epochs, so the final model was trained for 8 epochs.
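The project fixes training at 8 epochs based on those curves; an EarlyStopping callback is one way to automate the same decision. A sketch continuing the Keras model above, where `X_train_seq` and `y_train_ids` are illustrative names for padded sequences and integer-encoded labels.

```python
# Hedged alternative to hand-picking 8 epochs: stop when validation loss
# stops improving and restore the best weights.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=2, restore_best_weights=True
)
history = model.fit(
    X_train_seq, y_train_ids,        # illustrative names for the prepared inputs
    validation_split=0.1,
    epochs=30,
    callbacks=[early_stop],
)
```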
The data collected for this problem is not representative enough, especially for classes where only a few hundred posts were collected. I ran a learning-curve analysis over eight different dataset sizes, and the resulting curves confirmed a gap between the training and test scores, pointing to a high-variance problem. If more posts can be collected in the future, the larger dataset should improve the performance of these models. A sketch of the analysis follows.

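A minimal sketch of that learning-curve analysis with scikit-learn, reusing the TF-IDF features from earlier; the estimator, scoring metric, and CV setup are illustrative assumptions.

```python
# Hedged sketch: learning curve over eight training-set sizes.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000),             # illustrative estimator
    X_train_tfidf, y_train,
    train_sizes=np.linspace(0.1, 1.0, 8),          # the eight dataset sizes
    cv=5, scoring="accuracy", n_jobs=-1,
)
# A persistent gap between the two mean curves indicates high variance
print(train_scores.mean(axis=1) - test_scores.mean(axis=1))
```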
- Built the user interface with HTML, CSS, and JavaScript
- Deployed the model to production using Flask (a minimal endpoint sketch follows)
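A minimal sketch of how such a Flask prediction endpoint could look; the route, form field, and pickled artifact filenames are assumptions, not the project's actual code.

```python
# Hedged sketch of a Flask prediction endpoint; filenames and route are illustrative.
import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)
vectorizer = pickle.load(open("tfidf.pkl", "rb"))   # fitted TF-IDF vectorizer
model = pickle.load(open("model.pkl", "rb"))        # trained classifier

@app.route("/predict", methods=["POST"])
def predict():
    text = request.form.get("text", "")
    features = vectorizer.transform([text])
    return jsonify({"personality_type": str(model.predict(features)[0])})

if __name__ == "__main__":
    app.run(debug=True)
```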
- Python
- Scikit-learn for model building
- Matplotlib & Seaborn for data visualization
- NLTK
- TensorFlow
- SMOTE
- Flask for the HTTP server
- HTML/CSS/JavaScript for the UI
[1]. https://link.springer.com/referenceworkentry/10.1007%2F978-3-319-28099-8_50-1
[2]. https://arxiv.org/abs/1106.1813
[3]. https://www.mentalhelp.net/psychological-testing/myers-briggs-type-indicator/
© SyedMuhammadHamza, licensed under the MIT License