Code for the EACL 2021 paper "Self-Training Pre-Trained Language Models for Zero- and Few-Shot Multi-Dialectal Arabic Sequence Labeling".
Please make sure you have PyTorch >= 1.4 and fairseq >= 0.9 installed.
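As a quick sanity check, you can verify the installed versions from Python. This is a minimal sketch, assuming both packages are importable and expose a `__version__` attribute:

```python
# Minimal version check for the stated requirements (PyTorch >= 1.4, fairseq >= 0.9).
# Assumes both packages expose __version__ (true for recent releases).
import torch
import fairseq

print("torch:", torch.__version__)      # expected >= 1.4
print("fairseq:", fairseq.__version__)  # expected >= 0.9
```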
The `data/` folder includes some of the datasets used in the paper (some of the datasets can only be accessed through the LDC):
- `NER/twitter`: social media NER dataset from (Darwish, 2013).
- `NER/twitter.norm`: the same dataset, normalized.
- `NER/ANERCorp`: MSA dataset obtained from (Benajiba et al., 2007).
- `NER/zero-shot-dialect`: the zero-shot dialectal setting. Training data comes from ANERCorp (Benajiba et al., 2007), while validation and test sets come from the dialectal portion of the Twitter data (Darwish, 2013).
- `NER/zero-shot-msa`: same as above, but validation and test sets come from the MSA portion of the Twitter data.
- `POS-tagging/egy`, `POS-tagging/glf`, `POS-tagging/lev`, and `POS-tagging/msa`: dialectal POS tagging datasets obtained from (Darwish et al., 2018).
- `POS-tagging/zero-shot-*`: training data is MSA, while development and test data come from the dialects.
- `unlabeled_aoc`: unlabeled AOC tweets taken from (Elaraby and Abdul-Mageed, 2018). These are the unlabeled examples used for self-training.
Format your data using the IOB format, with one token and its label per line and an empty line separating sentences. For example, for NER:
لنزار B-PERS
عدد O
كبير O
من O
الكتب O
النثرية O
أهمها O
POS tagging data follows the same one-token-per-line format:
و CONJ
هو PRON
في PREP
محام NOUN
TB TB
ين NSUFF
Each data folder typically has three files: `train.txt`, `valid.txt`, and `test.txt`.
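For illustration, here is a minimal sketch (not part of this repository) of how a file in this format can be read into sentences of (token, label) pairs; the path in the usage comment is hypothetical:

```python
# Read an IOB-formatted file: one "token label" pair per line,
# with an empty line separating sentences.
def read_iob(path):
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:              # empty line ends the current sentence
                if current:
                    sentences.append(current)
                    current = []
            else:
                token, label = line.split()
                current.append((token, label))
    if current:                       # flush the last sentence if the file has no trailing blank line
        sentences.append(current)
    return sentences

# Example (hypothetical path): sentences = read_iob("data/NER/twitter/train.txt")
```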
We first need the XLM-R models. You can download them using the following commands:
mkdir pretrained_models
wget https://dl.fbaipublicfiles.com/fairseq/models/xlmr.base.tar.gz # base model
tar -xzvf xlmr.base.tar.gz # extract it
wget https://dl.fbaipublicfiles.com/fairseq/models/xlmr.large.tar.gz # large model
tar -xzvf xlmr.large.tar.gz
To fine-tune XLM-RoBERTa without self-training, use the following command:
python main.py --data_dir=data_path --task_name=ner \
    --output_dir=output_dir \
    --max_seq_length=320 --num_train_epochs 5 \
    --do_eval --warmup_proportion=0.1 \
    --pretrained_path pretrained_models/xlmr.base/ \
    --learning_rate 0.00001 \
    --gradient_accumulation_steps 1 --eval_on test --dropout 0.1 \
    --train_batch_size 16 --eval_batch_size 128 --do_train
To fine-tune your model with self-training, add the flag --self_training. You also need to specify the selection mechanism through --K (a sketch of both mechanisms follows the examples below):
- Use an integer value > 1 for fixed-size selection. For example:
python main.py --data_dir=data_path --task_name=ner \
    --output_dir=output_dir \
    --max_seq_length=320 --num_train_epochs=5 \
    --do_eval --warmup_proportion=0.1 \
    --pretrained_path=pretrained_models/xlmr.base/ \
    --learning_rate=0.00001 \
    --gradient_accumulation_steps=1 --eval_on test --dropout=0.1 \
    --train_batch_size=16 --eval_batch_size=128 --do_train --self_training --K=100
- Use a float value <= 1.0 for probability-threshold selection. For example:
python main.py --data_dir=data_path --task_name=ner \
    --output_dir=output_dir \
    --max_seq_length=320 --num_train_epochs=5 \
    --do_eval --warmup_proportion=0.1 \
    --pretrained_path=pretrained_models/xlmr.base/ \
    --learning_rate=0.00001 \
    --gradient_accumulation_steps=1 --eval_on test --dropout=0.1 \
    --train_batch_size=16 --eval_batch_size=128 --do_train --self_training --K=0.90
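As referenced above, the following is a rough sketch of the two selection mechanisms controlled by --K. It is an illustration under our assumptions, not the exact code in main.py: with K > 1 the K most confident pseudo-labeled sentences are kept, and with K <= 1.0 every sentence whose confidence reaches the threshold K is kept.

```python
# Hypothetical illustration of the two pseudo-label selection mechanisms
# (fixed-size vs. probability threshold); not the repository's exact implementation.
def select_pseudo_labeled(examples, confidences, K):
    """examples: pseudo-labeled sentences; confidences: one confidence score per sentence."""
    ranked = sorted(zip(examples, confidences), key=lambda pair: pair[1], reverse=True)
    if K > 1:
        # Fixed-size selection: keep the K most confident sentences.
        return [ex for ex, _ in ranked[: int(K)]]
    # Probability-threshold selection: keep every sentence with confidence >= K.
    return [ex for ex, conf in ranked if conf >= K]
```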
To train for POS tagging, run with `--task_name=pos`.
To evaluate or predict labels using a fine-tuned model:
python predict.py --task_name=ner \
    --load_model=path/to/finetuned/model.pt \
    --pretrained_path pretrained_models/xlmr.base/ \
    --predict_file=path/to/IOB/file
If you use this code, please cite the following paper:
@inproceedings{khalifa-etal-2021-self,
    title = "Self-Training Pre-Trained Language Models for Zero- and Few-Shot Multi-Dialectal {A}rabic Sequence Labeling",
    author = "Khalifa, Muhammad and
      Abdul-Mageed, Muhammad and
      Shaalan, Khaled",
    booktitle = "Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume",
    month = apr,
    year = "2021",
    publisher = "Association for Computational Linguistics",
}