Detecting Income Level of Dutch Twitter Users using Stylometric Features

Léon Melein (S2580861), University of Groningen, l.r.melein@student.rug.nl

Abstract

Income prediction is a relatively unexplored aspect of author profiling. Early research on English (Flekova et al., 2016), which linked Twitter users to occupations and their respective average incomes, obtained promising results. There is no comparable research for Dutch speakers yet. In this thesis, we explore to what extent author profiling can predict the income level of Dutch Twitter users.

We do so by creating a dataset of 2000 Twitter users. These users are divided into two income classes, as there is currently no complete income data available for individual occupations in The Netherlands. We use distant supervision to annotate users with their occupational class and their income. We then extract a number of surface, readability and n-gram features from the users' posts. Using logistic regression, we try to classify the users by income class based on those features.

After testing various feature groupings, the classifier proved to be the most robust with uni-, bi- and trigram features, reaching an F1-score of 0.72. Although this indicates that profiling can predict a user's income class to a very large extent, it can only be seen as a first indication, as the scope of this study is limited. With clear directions for future improvement, we hope that this study may be a stepping stone towards the prediction of individual incomes for Dutch authors.
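For readers unfamiliar with the metric, the F1-score is the harmonic mean of precision and recall. The sketch below shows the standard definition for a single class; the function name `f1_score` and the counts in the example are made up for illustration and are not taken from the thesis. How the thesis averages the score over the two income classes is not stated here, so that step is left out.

```python
def f1_score(true_pos, false_pos, false_neg):
    """Return the F1-score (harmonic mean of precision and recall) for one class."""
    predicted = true_pos + false_pos   # everything the classifier labelled as this class
    actual = true_pos + false_neg      # everything that truly belongs to this class
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only (not results from this study):
# precision = 3/4, recall = 3/5, so F1 = 2/3.
example_f1 = f1_score(true_pos=3, false_pos=1, false_neg=2)
```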

Python packages

datagathering: tools used to download Twitter data for our users and to make a random selection of users.
preprocessing: tools used to process the downloaded data into a format suitable for our classifier.
machinelearning: tools used to build the classifier that predicts the income class of a given user.
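
As a rough illustration of the uni-, bi- and trigram features mentioned in the abstract, the sketch below counts word n-grams in a single text using only the standard library. It is not the code from the packages above: the helper names `word_ngrams` and `ngram_features`, the lowercasing, the whitespace tokenisation and the choice of word-level (rather than character-level) n-grams are all assumptions made for this example.

```python
from collections import Counter


def word_ngrams(tokens, n):
    """Return all contiguous n-grams of length n as tuples of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_features(text, max_n=3):
    """Count all word n-grams from length 1 up to max_n in one text.

    Assumption: the text is lowercased and split on whitespace; a real
    pipeline would likely use a tweet-aware tokenizer instead.
    """
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(word_ngrams(tokens, n))
    return counts


# Hypothetical example post, not taken from the dataset.
features = ngram_features("Ik ga naar het werk")
```

Counts like these could then be turned into a feature vector per user and passed to a logistic regression classifier, which is the model the abstract describes.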

