SUPSI - Bachelor's thesis - 2023
Author: Davis Fusco
This project focuses on the exploratory analysis of user data on social media platforms to uncover insights into behavioral dynamics. Leveraging the K-Means clustering algorithm, we categorize users based on their interactions, highlighting distinct behavioral patterns. Additionally, a Random Forest classifier enriched with Explainable AI techniques, specifically SHAP (SHapley Additive exPlanations), provides detailed information on feature importance within each cluster. These methods not only enhance technical validity but also make findings accessible for practical implementation. The results offer valuable indications for crafting effective intervention strategies to improve ethics and correctness in social media interactions.
The dataset for this thesis project comprises two main components: a primary set of 4 million tweets and a supplementary dataset with information on over 1 million users. The focus is on characterizing online user behavior during the early stages of the COVID-19 pandemic, specifically among Italian-language tweets.
Primary dataset (tweets):
- Size: 4 million records
- Columns: 36 variables
- Key Features: Includes tweet content, tweet type (original, retweet, quoted text), hashtags, embedded links, and user mentions.
- Context: Captures user sentiments and interactions during the initial phase of the COVID-19 pandemic.

Supplementary dataset (users):
- Size: Over 1 million records
- Columns: 15 variables
- Key Features: Encompasses user details such as follower/following counts, likes, and other metrics related to user activity.
- Context: Provides comprehensive information about users engaged in the analyzed tweets.
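Since the later clustering works on users rather than on single tweets, the tweet records have to be summarized per user at some point. The following is a minimal sketch of one such summary, the share of each tweet type per user. The field names `user_id` and `tweet_type`, the sample records, and the helper `tweet_type_shares` are illustrative assumptions, not the dataset's actual columns or the thesis's own code.

```python
from collections import Counter, defaultdict

# Hypothetical tweet records; "user_id" and "tweet_type" are assumed
# field names, not the real column names of the dataset.
tweets = [
    {"user_id": "u1", "tweet_type": "original"},
    {"user_id": "u1", "tweet_type": "retweet"},
    {"user_id": "u1", "tweet_type": "retweet"},
    {"user_id": "u2", "tweet_type": "quoted"},
    {"user_id": "u2", "tweet_type": "original"},
]

TWEET_TYPES = ("original", "retweet", "quoted")


def tweet_type_shares(records):
    """Return, for each user, the fraction of their tweets of each type."""
    counts = defaultdict(Counter)
    for record in records:
        counts[record["user_id"]][record["tweet_type"]] += 1

    shares = {}
    for user_id, type_counts in counts.items():
        total = sum(type_counts.values())
        # A Counter returns 0 for tweet types the user never produced.
        shares[user_id] = {t: type_counts[t] / total for t in TWEET_TYPES}
    return shares


user_features = tweet_type_shares(tweets)
print(user_features["u1"])
```

Per-user shares like these could then be combined with the metrics from the users dataset (for example follower and following counts) to form one feature vector per user.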
The initial exploration aims to understand the dynamics of user interactions in the context of the COVID-19 pandemic. The analyses performed on this dataset include:
- Sentiment and Emotional Dynamics:
  - Identification and categorization of sentiments, focusing on negative emotions, toxic behaviors, and potential misinformation.
- User Clustering with K-Means:
  - Utilization of the K-Means clustering algorithm to categorize users into distinct groups based on their behavioral patterns.
  - Highlighting specific characteristics of each user cluster.
- User Characterization with Random Forest:
  - Implementation of a Random Forest classifier to further characterize users within the identified clusters.
  - Enrichment of the classifier with Explainable AI techniques, specifically using SHAP (SHapley Additive exPlanations).
  - Extraction of detailed information on the importance of individual features in different user clusters.
- Technical Validity and Practical Applicability:
  - Strengthening the technical validity of conclusions through algorithmic approaches.
  - Enhancing accessibility and applicability of interpretations in real-world scenarios.
- Insights for Intervention Strategies:
  - Derivation of valuable insights for the implementation of effective intervention strategies on social media platforms.
  - Focus on improving ethics and correctness in online interactions.
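To make the K-Means step in the list above concrete, here is a minimal standard-library sketch of Lloyd's algorithm, the usual iteration behind K-Means. It is an illustration only: the project most likely relies on a library implementation, and the two features used here (retweet share, average tweets per day), the sample points, and the function names are assumptions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=100, seed=0):
    """Lloyd's algorithm. Returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                mean = tuple(sum(dim) / len(members) for dim in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # centroids stopped moving: converged
        centroids = new_centroids
    return centroids, labels


# Hypothetical users described by (retweet share, average tweets per day).
points = [
    (0.90, 1.0), (0.80, 1.2), (0.85, 0.9),   # mostly retweet, low volume
    (0.10, 5.0), (0.20, 5.5), (0.15, 4.8),   # mostly original, high volume
]
centroids, labels = k_means(points, k=2)
print(labels)
```

On real data the features would normally be put on a comparable scale before clustering, since K-Means relies on Euclidean distances and a feature with a large range would otherwise dominate.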
This dataset serves as the foundation for the subsequent analysis conducted in the thesis, providing valuable insights into online user behavior during a critical period of global concern.
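SHAP, mentioned in the analysis list above, attributes a model's prediction to its input features using Shapley values. As a rough illustration of that idea, the sketch below computes exact Shapley values by brute force for a toy scoring function. This is not the `shap` library and not the thesis's Random Forest: `toy_model`, the instance, and the all-zero baseline are hypothetical stand-ins, and replacing "missing" features with baseline values is one simple convention among several.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, instance, baseline):
    """Exact Shapley attribution of model(instance) - model(baseline)."""
    n = len(instance)

    def value(subset):
        # Features in `subset` keep the instance's value;
        # all other features are replaced by the baseline's value.
        mixed = [instance[i] if i in subset else baseline[i] for i in range(n)]
        return model(mixed)

    contributions = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_i = value(set(subset) | {i})
                without_i = value(set(subset))
                phi += weight * (with_i - without_i)
        contributions.append(phi)
    return contributions


def toy_model(x):
    """Hypothetical score: one additive term plus one interaction term."""
    return 2 * x[0] + x[1] * x[2]


instance = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, instance, baseline)
print(phi)
```

The attributions sum to the difference between the prediction for the instance and the prediction for the baseline, and the two interacting features share their joint effect equally. Brute-force enumeration grows exponentially with the number of features, which is why practical tools use specialized algorithms for tree models instead.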
Make sure the libraries listed in requirements.txt are installed in your Python environment before running the project. You can install them with the following command:
pip install -r requirements.txt
Note: It's recommended to set up a virtual environment using tools like virtualenv or conda for better project isolation.