Indonesian Social Media Post Toxicity Classification

ahmadizzan/netifier

Netifier: Negativity Classifier

Introduction

The rapid spread of information through the internet has benefited our lives in many ways. However, it has also introduced problems, one of them being the spread of negative content online. The presence of 'toxic' posts makes it hard for people to have effective conversations.

Inspired by the Toxic Comment Classification Challenge, we decided to do something similar using data from Indonesian social media. Our goal is to analyze the data and build a multi-label text toxicity classifier using machine learning.

Contributions

  1. Created an Indonesian Social Media Text Toxicity Dataset
  2. Created a Pipeline for the Task: Exploratory Data Analysis, Data Preprocessing, and Modelling
  3. Compared the Performance of Various Machine Learning Models on This Task

Dataset

As far as we know, there is no publicly available dataset on Indonesian text toxicity, so we decided to collect the data ourselves. We scraped posts from popular social media sites in Indonesia, such as Instagram, Twitter, and Kaskus. We then manually labelled ~7,000 samples into 4 categories: pornography, hate speech, racism, and radicalism.
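Because the task is multi-label, a single post can belong to several of the four categories at once. The sketch below shows one common way to represent such labels as a 0/1 vector per post. The names `LABELS`, `encode_labels`, and `decode_labels` are hypothetical and chosen for illustration; the actual column names and encoding used in this repository's data files may differ.

```python
# Hypothetical helpers: these names are illustrative, not taken from the repository.
LABELS = ["pornography", "hate_speech", "racism", "radicalism"]


def encode_labels(tags):
    """Turn a collection of category names into a 0/1 vector ordered like LABELS."""
    unknown = set(tags) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if label in tags else 0 for label in LABELS]


def decode_labels(vector):
    """Inverse of encode_labels: recover category names from a 0/1 vector."""
    if len(vector) != len(LABELS):
        raise ValueError("vector length must match LABELS")
    return [label for label, flag in zip(LABELS, vector) if flag]
```

A post with no toxic content maps to the all-zero vector, which is what distinguishes this multi-label setup from a single-label, four-class problem.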

We also attempted to collect more data using a semi-supervised method, which yielded ~20,000 additional samples. All of the data can be downloaded from this repository.
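This README does not state which semi-supervised method was used. As an assumption for illustration only, the sketch below shows the selection step of pseudo-labelling, one common approach: a model trained on the labelled data scores unlabelled posts, and only posts whose per-label probabilities are all confidently low or high are kept. The function name `select_pseudo_labels` and the thresholds `low` and `high` are hypothetical.

```python
def select_pseudo_labels(predictions, low=0.1, high=0.9):
    """Keep only samples whose every per-label probability is confident.

    predictions: list of (text, [p_label_1, ..., p_label_k]) pairs.
    Returns a list of (text, [0/1, ...]) pairs for the confident samples.
    The thresholds are assumed values, not ones used by this project.
    """
    selected = []
    for text, probs in predictions:
        # A sample is confident only if no label probability falls in (low, high).
        if all(p <= low or p >= high for p in probs):
            selected.append((text, [1 if p >= high else 0 for p in probs]))
    return selected
```

Samples that pass the filter could then be added to the training set with their predicted labels, while uncertain ones are left out or sent for manual labelling.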

Project Organization

├── LICENSE
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
└── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering)
                          and a short description.

Project Members

  • Ahmad Izzan
  • Christian Wibisono
  • Ilham Firdausi Putra
