Mined and analyzed unstructured Vietnamese-language Facebook posts to predict each user's age, comparing traditional ML methods with deep learning methods such as convolutional neural networks (CNN) and LSTM. Achieved accuracy: 81.4%.
Supervised classification problem: Text -> Age Category (A: 18-23, B: 24-30, C: 30-40, D: 40+)
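A minimal sketch of how a raw age could be mapped to one of these categories. The ranges as written overlap at 30 and 40, so this sketch assumes that 30 belongs to B and 40 belongs to D; that boundary choice, the name `age_to_class`, and the handling of ages below 18 are assumptions, not something the project states.

```python
from bisect import bisect_right

# Lowest age of each class. ASSUMPTION: the stated ranges overlap at 30 and 40,
# so here 30 is kept in B (24-30) and 40 goes to D (40+), leaving C as 31-39.
LOWER_BOUNDS = [18, 24, 31, 40]
LABELS = ["A", "B", "C", "D"]


def age_to_class(age):
    """Return the class label for an age, or None for ages below 18."""
    idx = bisect_right(LOWER_BOUNDS, age) - 1
    return LABELS[idx] if idx >= 0 else None
```

Keeping the boundaries in one list makes it easy to move a cut-off if the intended ranges turn out to be different.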
Overall: 22694 entries
A: 5465 entries (24.08%)
B: 7837 entries (34.53%)
C: 3957 entries (17.44%)
D: 896 entries (3.95%)
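The classes are clearly imbalanced (D is under 4% of the data). One common way to compensate is inverse-frequency class weighting; the sketch below applies the usual "balanced" heuristic, total / (n_classes * class_count), to the four listed counts. Whether the project's class_weights.py computes weights this way is not stated, and the function name is hypothetical.

```python
# Class counts copied from the distribution listed above.
counts = {"A": 5465, "B": 7837, "C": 3957, "D": 896}


def balanced_class_weights(class_counts):
    """Inverse-frequency weights: total / (n_classes * count) per class.

    ASSUMPTION: this is the generic "balanced" heuristic, not necessarily
    what the project uses. Only the listed classes enter the total.
    """
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: total / (n_classes * n) for label, n in class_counts.items()}


weights = balanced_class_weights(counts)
for label, weight in weights.items():
    print(f"{label}: {weight:.2f}")
```

Rare classes receive the largest weights, so errors on D cost more during training than errors on B.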
- Mined unstructured Facebook posts (only from users whose age is available)
- Categorized each user's age into the classes listed above
- Investigated the class distribution and preprocessed the posts:
a) Replace emojis with " emoji_icon " to remove bias toward a specific emoji
b) Tokenize Vietnamese words
c) Remove Vietnamese stop-words
d) Remove numbers and punctuations
e) Collapse all posts into one vector
- Apply learners:
a) traditional machine learning model:
i. vectorize the words by frequency
ii. max absolute scaling
iii. apply SVM - accuracy: 50%
b) deep learning model:
i. keep only vectors with more than 200 items
ii. pad vectors to a length of 800
iii. apply CNN model:
- embedding layer: 71% (as I recall, the accuracy back in the summer was higher, around 80%; need to check again)
- word2vec (200 features and 15 contexts): 60%
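The preprocessing steps and the length filter/padding described above can be sketched in plain Python. This is an illustration under several assumptions, not the project's code: whitespace splitting stands in for a real Vietnamese word tokenizer, the stop-word list is a tiny made-up sample, the emoji pattern only covers the main emoji code-point blocks, vectors longer than 800 are truncated, and the padding value is 0. All function names are hypothetical.

```python
import re
import unicodedata

# ASSUMPTION: tiny illustrative stop-word sample; a real list is far longer.
STOP_WORDS = {unicodedata.normalize("NFC", w) for w in ("là", "và", "của")}

# ASSUMPTION: rough emoji matcher over the main emoji blocks, not exhaustive.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def preprocess(post):
    """Apply steps a) to d) to a single post and return its tokens."""
    # Normalize so Vietnamese diacritics are single precomposed characters.
    text = unicodedata.normalize("NFC", post)
    text = EMOJI_RE.sub(" emoji_icon ", text)   # a) replace emojis
    text = re.sub(r"\d+", " ", text)            # d) remove numbers
    text = re.sub(r"[^\w\s]", " ", text)        # d) remove punctuation
    tokens = text.split()                       # b) stand-in for a real tokenizer
    return [t for t in tokens if t not in STOP_WORDS]  # c) remove stop-words


def collapse(posts):
    """e) Concatenate the tokens of all posts into one sequence."""
    return [tok for post in posts for tok in preprocess(post)]


def filter_and_pad(sequences, min_len=200, max_len=800, pad_value=0):
    """Keep sequences longer than min_len, then pad them to max_len.

    ASSUMPTION: sequences longer than max_len are truncated.
    """
    kept = [s for s in sequences if len(s) > min_len]
    return [s[:max_len] + [pad_value] * (max_len - len(s)) for s in kept]
```

For example, `preprocess("Hôm nay là 1 ngày đẹp 😀!!!")` drops the number, the punctuation, and the stop-word, and ends with the `emoji_icon` placeholder. In the real pipeline, tokens would still need to be mapped to integer IDs before `filter_and_pad` feeds them to an embedding layer.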
├── text_analysis\
|   ├── data_processing.ipynb
|   ├── text_analysis.py
|   ├── cnn_age_predict.ipynb
|   ├── utils\
|       ├── utils.py
|       ├── smote.py
|       ├── class_weights.py