Humor detection is a complex and ambiguous task in natural language processing, which makes automatic humor detection challenging, particularly for languages with limited resources such as Arabic. In this paper, we address the task by collecting and annotating humorous Arabic text, covering dialectal tweets and Modern Standard Arabic (MSA) text, and then performing automatic humor detection on the collected data. To establish a baseline classification system, we fine-tuned seven Arabic pre-trained language models on the dataset: AraBERTv02, AraBERTv02-twitter, QARIB, MARBERT, MARBERTv2, CAMeLBERT-DA, and CAMeLBERT-MIX. CAMeLBERT-DA was the best-performing model, achieving an F1-score and accuracy of 72.11%.
- humor.tsv : Tab-separated file containing tweets, each labeled either "humor" or "non-humor"
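As a quick sanity check after downloading, the label distribution of humor.tsv can be inspected with a few lines of Python. The sketch below is only an illustration: the function name `count_labels` is hypothetical, and it assumes that the file is UTF-8 encoded, tab-separated, and stores the label in the last column. None of these details are stated here, so adjust `label_column` and `has_header` to match the actual file layout.

```python
import csv
from collections import Counter


def count_labels(path, label_column=-1, has_header=False):
    """Count how often each label occurs in a tab-separated file.

    Assumptions (not confirmed by the dataset description):
    the label sits in the last column and there is no header row.
    """
    counts = Counter()
    with open(path, encoding="utf-8", newline="") as f:
        # QUOTE_NONE keeps quotation marks inside tweets as literal text.
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        if has_header:
            next(reader, None)  # skip the header row if one exists
        for row in reader:
            if row:  # ignore blank lines
                counts[row[label_column].strip()] += 1
    return counts
```

Calling `count_labels("humor.tsv")` returns a `Counter` keyed by label, which should contain only "humor" and "non-humor" if the assumptions above hold.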
If you use this dataset, please cite it as:
@inproceedings{alkhalifa2022humor,
title={A Dataset for Detecting Humor in Arabic Text},
author={Hend Al-Khalifa and Fetoun AlZahrani and Hala Qawara and Reema AlRowais and Sawsan Alowa and Luluh AlDhubayi},
booktitle={The 5th International Conference on Natural Language and Speech Processing (ICNLSP 2022)},
year={2022}
}
This work is licensed under a Creative Commons Attribution 4.0 International License.