Copyright © 2017-2018 New York University Abu Dhabi
Computational Approaches to Modeling Language (CAMeL) Lab
We present the Gumar Corpus n-grams. The n-grams are generated from the Gumar Corpus, a large-scale corpus of Gulf Arabic containing more than 100 million words [1,2]. The n-grams are in order of 5, that is 5, 4, 3, 2 and 1 grams with their respective frequency counts and the number of documents they appear in. The n-grams are counted across the entire corpus and also across each dialect category individually. The format of the n-gram files follows a similar format of Google n-grams with the exception of the year column which we don't produce.
- All documents of the corpus are converted into plain text.
- Basic UTF-8 character cleaning.
- Punctuation separation.
Below are categorizations of the dialects and their respective document counts. For specific information per document please refer to the spreadsheet attached with this package.
Tag | Dialect | Document Count |
---|---|---|
SA | Saudi | 770 |
AE | Emirati | 115 |
KW | Kuwaiti | 87 |
OM | Omani | 14 |
QA | Qatari | 10 |
BA | Bahraini | 8 |
MSA | Modern Standard Arabic | 82 |
EGY | Egyptian | 3 |
LEV | Levantine | 5 |
MOR | Moroccan | 1 |
IRQ | Iraqi | 5 |
YEM | Yemeni | 1 |
UNID_GA | Unidentified Gulf Arabic | 116 |
MIXED_GA | Mixed Gulf Arabic | 11 |
MIXED | Gulf Arabic mixed with other Arabic dialects | 4 |
You can download the GUMAR n-grams here.
The n-grams are split by dialect into seperate compressed folders of the form
<TAG>.tar.xz
where <TAG> is one of the dialect tags listed above.
There is an additional file ALL.tar.xz
that contains n-grams of all the
dialects combined.
Once downloaded, you can extract the files by running the following:
tar -xJf <TAG>.tar.xz
This will generate a folder <TAG>/
in the current working directory.
Each folder contains the following n-gram files:
1-grams_<TAG>.tsv
2-grams_<TAG>.tsv
3-grams_<TAG>.tsv
4-grams_<TAG>.tsv
5-grams_<TAG>.tsv
Each n-gram file consists of three tab separated columns as follows:
<n-gram> TAB <frequency> TAB <# of documents> NEWLINE
Each <n-gram> larger than one is single space separated.
Example of a 2-grams row:
انتظر منك 85 69
* Note that the example above is displayed right-to-left but the columns are in the order described.
Each n-gram file is sorted by <frequency>
in descending order.
If you would like more details on the data used to generate the n-grams, take a look at the Gumar_Info.tsv file. It is a Tab Separated Values file containing author and title information for each document, as well as its dialect and the link it was downloaded from. Duplicate entries for title-author pairs indicate that a document was split into multiple files.
* Please note that some entries in Gumar_Info.tsv containing double-quotes have been escaped. We recommend using a TSV reader (eg. Microsoft Excel, Apple Numbers, Google Docs, etc.) to parse these properly.
Please use the following citation when referencing or using this resource:
Khalifa, Salam, Nizar Habash, Dana Abdulrahim, and Sara Hassan. "A Large Scale Corpus of Gulf Arabic." In Language Resources and Evaluation Conference. 2016. Portorož, Slovenia
The Gumar Corpus n-grams are licensed under a Creative Commons Attribution 3.0 Unported License.