Characterization of user misbehavior in online social media

SUPSI - Bachelor's thesis - 2023

Author: Davis Fusco

Introduction

This project performs an exploratory analysis of social media user data to uncover behavioral dynamics. Using the K-Means clustering algorithm, we group users by their interactions and highlight distinct behavioral patterns. A Random Forest classifier, combined with the Explainable AI technique SHAP (SHapley Additive exPlanations), then quantifies feature importance within each cluster. Together, these methods strengthen the technical validity of the findings and make them easier to interpret and apply in practice. The results offer indications for designing intervention strategies that improve ethics and correctness in social media interactions.

Dataset Description

The dataset for this thesis project comprises two components: a primary set of 4 million tweets and a supplementary set with information on over 1 million users. The focus is on characterizing online user behavior during the early stages of the COVID-19 pandemic, specifically in Italian-language content.

Tweet Dataset

- Size: 4 million records
- Columns: 36 variables
- Key Features: Includes tweet content, tweet type (original, retweet, quoted text), hashtags, embedded links, and user mentions.
- Context: Captures user sentiments and interactions during the initial phase of the COVID-19 pandemic.

User Dataset

- Size: Over 1 million records
- Columns: 15 variables
- Key Features: Encompasses user details such as follower/following counts, likes, and other metrics related to user activity.
- Context: Provides comprehensive information about users engaged in the analyzed tweets.

Initial Exploratory Analysis

The initial exploration aims to understand the dynamics of user interactions in the context of the COVID-19 pandemic. The analyses performed on this dataset include:

Sentiment and Emotional Dynamics:
- Identification and categorization of sentiments, focusing on negative emotions, toxic behaviors, and potential misinformation.
User Clustering with K-Means:
- Utilization of the K-Means clustering algorithm to categorize users into distinct groups based on their behavioral patterns.
- Highlighting specific characteristics of each user cluster.
User Characterization with Random Forest:
- Implementation of a Random Forest classifier to further characterize users within the identified clusters.
- Enrichment of the classifier with Explainable AI techniques, specifically using SHAP (SHapley Additive exPlanations).
- Extraction of detailed information on the importance of individual features in different user clusters.
Technical Validity and Practical Applicability:
- Strengthening the technical validity of conclusions through algorithmic approaches.
- Enhancing accessibility and applicability of interpretations in real-world scenarios.
Insights for Intervention Strategies:
- Derivation of valuable insights for the implementation of effective intervention strategies on social media platforms.
- Focus on improving ethics and correctness in online interactions.
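
The K-Means step listed above can be illustrated with a minimal, standard-library-only sketch. It is not the project's code: the two per-user features (retweet ratio and share of negative tweets), the sample values, and the function names are hypothetical and chosen only to show the assign-then-update loop that K-Means repeats until the cluster assignments stop changing.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100, seed=0):
    """Plain K-Means: returns (labels, centroids) for a list of vectors."""
    rng = random.Random(seed)
    # Initialise centroids with k distinct points drawn from the data.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical per-user features: [retweet ratio, share of negative tweets].
users = [
    [0.90, 0.05], [0.85, 0.10], [0.95, 0.08],
    [0.10, 0.70], [0.15, 0.65], [0.05, 0.80],
]

labels, centroids = kmeans(users, k=2)
for features, label in zip(users, labels):
    print(features, "-> cluster", label)
```

On real data the features would normally be scaled first, and the number of clusters would be chosen with a criterion such as the elbow method or the silhouette score; neither choice is specified here.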

This dataset serves as the foundation for the subsequent analysis conducted in the thesis, providing valuable insights into online user behavior during a critical period of global concern.
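
To make the idea behind SHAP more concrete, the sketch below computes exact Shapley values for a tiny model by enumerating every feature subset. This is an illustration under stated assumptions, not the thesis pipeline: the scoring function `toy_score`, the instance, and the all-zero baseline are hypothetical, and "removing" a feature is modelled by replacing it with its baseline value. Exhaustive enumeration is only feasible for a handful of features, which is why dedicated algorithms are used for tree models in practice.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values of each feature for one prediction."""
    n = len(instance)

    def value(subset):
        # Features in `subset` take the instance value, the rest the baseline.
        mixed = [instance[i] if i in subset else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = set(subset)
                # Marginal contribution of feature i to this coalition.
                total += weight * (value(coalition | {i}) - value(coalition))
        phi.append(total)
    return phi


def toy_score(x):
    """Hypothetical model: one additive term plus one interaction term."""
    return 2 * x[0] + x[1] * x[2]


instance = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_score, instance, baseline)
print(phi)
```

The attributions sum to the difference between the prediction for the instance and the prediction for the baseline, and the interaction term is split evenly between the two features involved.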

Prerequisites

General

Text Analysis

Cluster Analysis

Cluster Characterization

Data Visualization

Make sure to have the listed libraries installed in your Python environment before running the project. You can install them using the following command:

pip install -r requirements.txt

Note: It's recommended to set up a virtual environment using tools like virtualenv or conda for better project isolation.

