ArSarcasm-v2 is an extension of the original ArSarcasm dataset published along with the paper From Arabic Sentiment Analysis to Sarcasm Detection: The ArSarcasm Dataset. ArSarcasm-v2 conisists of ArSarcasm along with portions of DAICT corpus and some new tweets. Each tweet was annotated for sarcasm, sentiment and dialect. The final dataset consists of 15,548 tweets divided into 12,548 training tweets and 3,000 testing tweets. ArSarcasm-v2 was used and released as a part of the shared task on sarcasm detection and sentiment analysis in Arabic. You can find more details in the Overview of the WANLP 2021 Shared Task on Sarcasm and Sentiment Detection in Arabic
ArSarcasm-v2 is provided in a CSV format, we provide the same split that was used for the shared task. The training set contains 12,548 tweets, while the test set contains 3,000 tweets.
The dataset contains the following fields:
tweet
: the original tweet text.sarcasm
: boolean that indicates whether a tweet is sarcastic or not.sentiment
: the sentiment of the tweet (positive, negative, neutral).dialect
: the dialect used in the tweet, we used the 5 main regions in the Arab world, follows the labels and their meanings:msa
: modern standard Arabic.egypt
: the dialect of Egypt and Sudan.levant
: the Levantine dialect including Palestine, Jordan, Syria and Lebanon.gulf
: the Gulf countries including Saudi Arabia, UAE, Qatar, Bahrain, Yemen, Oman, Iraq and Kuwait.magreb
: the North African Arab countries including Algeria, Libya, Tunisia and Morocco.
Please use the following citation if you use ArSarcasm-v2:
@inproceedings{abufarha-etal-2021-arsarcasm-v2,
title = "Overview of the WANLP 2021 Shared Task on Sarcasm and Sentiment Detection in Arabic",
author = "Abu Farha, Ibrahim and
Zaghouani, Wajdi and
Magdy, Walid",
booktitle = "Proceedings of the Sixth Arabic Natural Language Processing Workshop",
month = april,
year = "2021",
}
If you are interested in other Arabic NLP resources check: