NusaBERT: Teaching IndoBERT to be multilingual and multicultural!

This project extends the multilingual and multicultural capabilities of IndoBERT (Wilie et al., 2020). We expanded the IndoBERT tokenizer to cover 12 regional languages of Indonesia and continued pre-training on a large-scale corpus of Indonesian and 12 regional languages of Indonesia. Our models are competitive and robust on multilingual and multicultural benchmarks such as IndoNLU, NusaX, and NusaWrites.

Pre-trained Models

Model #params Dataset
LazarusNLP/NusaBERT-base 111M sabilmakbar/indo_wiki, acul3/KoPI-NLLB, uonlp/CulturaX
LazarusNLP/NusaBERT-large 337M sabilmakbar/indo_wiki, acul3/KoPI-NLLB, uonlp/CulturaX

Results

We evaluate our models on three benchmarks: IndoNLU, NusaX, and NusaWrites, which measure the models' natural language understanding, multilingual, and multicultural capabilities. The datasets support a variety of languages of Indonesia.

The values in the tables below are F1 scores on the test set.
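
The README does not say how F1 is averaged across classes. As a reference point only, the sketch below computes macro-averaged F1 (the unweighted mean of per-class F1) from scratch. The function name `macro_f1` and the toy labels are illustrative assumptions, not part of the repository.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Macro-averaged F1: the unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but the true class was t
            fn[t] += 1  # missed an instance of class t
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with hypothetical sentiment labels.
y_true = ["pos", "pos", "neg", "neu", "neg", "neu"]
y_pred = ["pos", "neg", "neg", "neu", "neg", "pos"]
print(f"macro-F1 = {100 * macro_f1(y_true, y_pred):.2f}")
```

If the benchmarks use a different averaging scheme (micro or weighted), only the final aggregation step changes; the per-class counts stay the same.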

IndoNLU (Classification)

Model EmoT SmSA CASA HoASA WReTE AVG
mBERT 67.30 84.14 72.23 84.63 84.40 78.54
XLM-MLM 65.75 86.33 82.17 88.89 64.35 77.50
XLM-R Base 71.15 91.39 91.71 91.57 79.95 85.15
XLM-R Large 78.51 92.35 92.40 94.27 83.82 88.27
IndoBERT Lite Base p1 73.88 90.85 89.68 88.07 82.17 84.93
IndoBERT Lite Base p2 72.27 90.29 87.63 87.62 83.62 84.29
IndoBERT Base p1 75.48 87.73 93.23 92.07 78.55 85.41
IndoBERT Base p2 76.28 87.66 93.24 92.70 78.68 85.71
IndoBERT Lite Large p1 75.19 88.66 90.99 89.53 78.98 84.67
IndoBERT Lite Large p2 70.80 88.61 88.13 91.05 85.41 84.80
IndoBERT Large p1 77.08 92.72 95.69 93.75 82.91 88.43
IndoBERT Large p2 79.47 92.03 94.94 93.38 80.30 88.02
Our work
LazarusNLP/NusaBERT-base 76.10 87.46 91.26 89.80 76.77 84.28
LazarusNLP/NusaBERT-large 78.90 87.36 92.13 93.18 82.64 86.84
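
The AVG column appears to be the plain arithmetic mean of the per-task scores. A quick check for the two NusaBERT rows, using the numbers copied from the table above (the variable names are illustrative):

```python
from statistics import mean

# Per-task test F1 scores (EmoT, SmSA, CASA, HoASA, WReTE) from the table above.
scores = {
    "LazarusNLP/NusaBERT-base": [76.10, 87.46, 91.26, 89.80, 76.77],
    "LazarusNLP/NusaBERT-large": [78.90, 87.36, 92.13, 93.18, 82.64],
}

# Unweighted mean over tasks, rounded to two decimals like the table.
averages = {model: round(mean(vals), 2) for model, vals in scores.items()}
for model, avg in averages.items():
    print(f"{model}: AVG = {avg:.2f}")
```

Both results match the reported AVG values (84.28 and 86.84), so every task carries equal weight regardless of its test set size.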

IndoNLU (Sequence Labeling)

Model POSP BaPOS TermA KEPS NERGrit NERP FacQA AVG
mBERT 91.85 83.25 89.51 64.31 75.02 69.27 61.29 76.36
XLM-MLM 95.87 88.40 90.55 65.35 74.75 75.06 62.15 78.88
XLM-R Base 95.16 84.64 90.99 68.82 79.09 75.03 64.58 79.76
XLM-R Large 92.73 87.03 91.45 70.88 78.26 78.52 74.61 81.92
IndoBERT Lite Base p1 91.40 75.10 89.29 69.02 66.62 46.58 54.99 70.43
IndoBERT Lite Base p2 90.05 77.59 89.19 69.13 66.71 50.52 49.18 70.34
IndoBERT Base p1 95.26 87.09 90.73 70.36 69.87 75.52 53.45 77.47
IndoBERT Base p2 95.23 85.72 91.13 69.17 67.42 75.68 57.06 77.34
IndoBERT Lite Large p1 91.56 83.74 90.23 67.89 71.19 74.37 65.50 77.78
IndoBERT Lite Large p2 94.53 84.91 90.72 68.55 73.07 74.89 62.87 78.51
IndoBERT Large p1 95.71 90.35 91.87 71.18 77.60 79.25 62.48 81.21
IndoBERT Large p2 95.34 87.36 92.14 71.27 76.63 77.99 68.09 81.26
Our work
LazarusNLP/NusaBERT-base 95.77 96.02 90.54 66.67 72.93 82.29 54.81 79.86
LazarusNLP/NusaBERT-large 96.89 96.76 91.73 71.53 79.86 85.12 66.77 84.09

NusaX

Model ace ban bbc bjn bug eng ind jav mad min nij sun AVG
Naive Bayes 72.5 72.6 73.0 71.9 73.7 76.5 73.1 69.4 66.8 73.2 68.8 71.9 72.0
SVM 75.7 75.3 76.7 74.8 77.2 75.0 78.7 71.3 73.8 76.7 75.1 74.3 75.4
Logistic Regression 77.4 76.3 76.3 75.0 77.2 75.9 74.7 73.7 74.7 74.8 73.4 75.8 75.4
IndoNLU IndoBERT Base 75.4 74.8 70.0 83.1 73.9 79.5 90.0 81.7 77.8 82.5 75.8 77.5 78.5
IndoNLU IndoBERT Large 76.3 79.5 74.0 83.2 70.9 87.3 90.2 85.6 77.2 82.9 75.8 77.2 80.0
IndoLEM IndoBERT Base 72.6 65.4 61.7 71.2 66.9 71.2 87.6 74.5 71.8 68.9 69.3 71.7 71.1
mBERT Base 72.2 70.6 69.3 70.4 68.0 84.1 78.0 73.2 67.4 74.9 70.2 74.5 72.7
XLM-R Base 73.9 72.8 62.3 76.6 66.6 90.8 88.4 78.9 69.7 79.1 75.0 80.1 76.2
XLM-R Large 75.9 77.1 65.5 86.3 70.0 92.6 91.6 84.2 74.9 83.1 73.3 86.0 80.0
Our work
LazarusNLP/NusaBERT-base 76.51 78.67 74.02 82.38 71.64 84.09 89.74 84.09 75.62 80.77 74.93 85.21 79.81
LazarusNLP/NusaBERT-large 81.8 82.83 74.71 86.51 73.36 84.63 93.33 87.20 82.50 83.54 77.72 82.74 82.57

NusaWrites (NusaParagraph)

Models Emotion Rhetorical Mode Topic
Naive Bayes 75.51 37.73 85.06
SVM 76.36 45.44 85.86
Logistic Regression 78.23 45.21 87.67
IndoNLU IndoBERT Base 67.12 47.92 85.87
IndoNLU IndoBERT Large 62.65 31.75 85.41
IndoLEM IndoBERT Base 66.94 51.93 84.87
mBERT 63.15 50.01 73.82
XLM-R Base 59.15 49.17 71.68
XLM-R Large 67.42 51.57 83.05
Our work
LazarusNLP/NusaBERT-base 67.18 51.34 84.17
LazarusNLP/NusaBERT-large 71.82 53.06 85.89

NusaWrites (NusaTranslation)

Models Emotion Sentiment
Naive Bayes 52.70 74.89
SVM 55.08 76.04
Logistic Regression 56.18 74.89
IndoNLU IndoBERT Base 54.50 75.24
IndoNLU IndoBERT Large 57.80 77.40
IndoLEM IndoBERT Base 52.59 69.08
mBERT 44.13 68.72
XLM-R Base 47.02 68.62
XLM-R Large 54.84 79.06
Our work
LazarusNLP/NusaBERT-base 56.54 77.07
LazarusNLP/NusaBERT-large 61.40 79.54

Installation

git clone https://github.com/LazarusNLP/NusaBERT.git
cd NusaBERT
pip install -r requirements.txt

Dataset

For pre-training, we leverage three existing open-source corpora that include the Indonesian language and regional languages of Indonesia. A summary of the datasets is as follows:

Dataset Language #documents
uonlp/CulturaX Indonesian (ind) 23,251,368
uonlp/CulturaX Javanese (jav) 2,058
uonlp/CulturaX Malay (msa) 238,000
uonlp/CulturaX Sundanese (sun) 1,554
sabilmakbar/indo_wiki Acehnese (ace) 12,904
sabilmakbar/indo_wiki Balinese (ban) 19,837
sabilmakbar/indo_wiki Banjarese (bjn) 10,437
sabilmakbar/indo_wiki Buginese (bug) 9,793
sabilmakbar/indo_wiki Gorontalo (gor) 14,514
sabilmakbar/indo_wiki Indonesian (ind) 654,287
sabilmakbar/indo_wiki Javanese (jav) 72,667
sabilmakbar/indo_wiki Banyumasan (map_bms) 11,832
sabilmakbar/indo_wiki Minangkabau (min) 225,858
sabilmakbar/indo_wiki Malay (msa) 346,186
sabilmakbar/indo_wiki Nias (nia) 1,650
sabilmakbar/indo_wiki Sundanese (sun) 61,494
sabilmakbar/indo_wiki Tetum (tet) 1,465
acul3/KoPI-NLLB Acehnese (ace) 792,594
acul3/KoPI-NLLB Balinese (ban) 244,545
acul3/KoPI-NLLB Banjarese (bjn) 296,314
acul3/KoPI-NLLB Javanese (jav) 1,155,142
acul3/KoPI-NLLB Minangkabau (min) 113,323
acul3/KoPI-NLLB Sundanese (sun) 894,626
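
To see how the documents are distributed across languages, the rows above can be aggregated per language code. This is a small bookkeeping sketch over the numbers in the table; it does not download or inspect the datasets themselves.

```python
from collections import defaultdict

# (dataset, language code, #documents), copied from the table above.
corpus = [
    ("uonlp/CulturaX", "ind", 23_251_368),
    ("uonlp/CulturaX", "jav", 2_058),
    ("uonlp/CulturaX", "msa", 238_000),
    ("uonlp/CulturaX", "sun", 1_554),
    ("sabilmakbar/indo_wiki", "ace", 12_904),
    ("sabilmakbar/indo_wiki", "ban", 19_837),
    ("sabilmakbar/indo_wiki", "bjn", 10_437),
    ("sabilmakbar/indo_wiki", "bug", 9_793),
    ("sabilmakbar/indo_wiki", "gor", 14_514),
    ("sabilmakbar/indo_wiki", "ind", 654_287),
    ("sabilmakbar/indo_wiki", "jav", 72_667),
    ("sabilmakbar/indo_wiki", "map_bms", 11_832),
    ("sabilmakbar/indo_wiki", "min", 225_858),
    ("sabilmakbar/indo_wiki", "msa", 346_186),
    ("sabilmakbar/indo_wiki", "nia", 1_650),
    ("sabilmakbar/indo_wiki", "sun", 61_494),
    ("sabilmakbar/indo_wiki", "tet", 1_465),
    ("acul3/KoPI-NLLB", "ace", 792_594),
    ("acul3/KoPI-NLLB", "ban", 244_545),
    ("acul3/KoPI-NLLB", "bjn", 296_314),
    ("acul3/KoPI-NLLB", "jav", 1_155_142),
    ("acul3/KoPI-NLLB", "min", 113_323),
    ("acul3/KoPI-NLLB", "sun", 894_626),
]

# Sum document counts across datasets for each language.
docs_per_language = defaultdict(int)
for _dataset, lang, n_docs in corpus:
    docs_per_language[lang] += n_docs

total = sum(docs_per_language.values())
for lang, n_docs in sorted(docs_per_language.items(), key=lambda kv: -kv[1]):
    print(f"{lang:8s} {n_docs:>12,d} {100 * n_docs / total:6.2f}%")
```

On these counts, Indonesian accounts for roughly 84% of all documents, and the table covers Indonesian plus 12 other language codes.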

Extend NusaBERT Tokenizer

We first train a WordPiece tokenizer on our pre-training corpus, limiting its vocabulary size to 10,000. We then add the non-overlapping tokens from the new tokenizer to the original IndoBERT tokenizer. Because many tokens overlap between the two tokenizers, we ended up adding only 1,511 new tokens to the original tokenizer. Refer to the script for more details.
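
The merge step described above amounts to a set difference that keeps the original token IDs fixed. The following is a minimal sketch of that idea on toy vocabularies; `extend_vocab` and the example tokens are hypothetical, and the repository's actual script may differ.

```python
def extend_vocab(base_vocab, new_vocab):
    """Append tokens from new_vocab that base_vocab lacks, keeping base IDs fixed.

    Assumes base_vocab uses contiguous IDs 0..len(base_vocab)-1.
    """
    extended = dict(base_vocab)  # token -> id; original IDs are untouched
    added = []
    for token in new_vocab:  # iterate in the new tokenizer's order
        if token not in extended:
            extended[token] = len(extended)  # next free ID
            added.append(token)
    return extended, added


# Hypothetical toy vocabularies, standing in for the IndoBERT vocabulary and
# the 10,000-token WordPiece vocabulary trained on the pre-training corpus.
base_vocab = {"[PAD]": 0, "[UNK]": 1, "saya": 2, "makan": 3, "##an": 4}
new_vocab = ["[PAD]", "[UNK]", "makan", "kula", "abdi", "##an", "##ne"]

extended, added = extend_vocab(base_vocab, new_vocab)
print(added)          # ['kula', 'abdi', '##ne']
print(len(extended))  # 8
```

With Hugging Face Transformers, the usual counterpart is `tokenizer.add_tokens(new_tokens)` followed by `model.resize_token_embeddings(len(tokenizer))`, so that the embedding matrix gains rows for the added tokens before continued pre-training.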

Pre-train NusaBERT

We modified the Hugging Face 🤗 masked language modeling pre-training script and conducted continued pre-training of IndoBERT on the dataset detailed above. Running pre-training is as simple as:

python scripts/run_mlm.py \
    --model_name_or_path indobenchmark/indobert-base-p1 \
    --tokenizer_name LazarusNLP/nusabert-base \
    --max_seq_length 128 \
    --per_device_train_batch_size 256 \
    --per_device_eval_batch_size 256 \
    --do_train --do_eval \
    --max_steps 500000 \
    --warmup_steps 24000 \
    --learning_rate 3e-4 \
    --weight_decay 0.01 \
    --optim adamw_torch_fused \
    --bf16 \
    --preprocessing_num_workers 24 \
    --dataloader_num_workers 24 \
    --save_steps 10000 --save_total_limit 3 \
    --output_dir outputs/nusabert-base \
    --overwrite_output_dir \
    --report_to tensorboard \
    --push_to_hub --hub_private_repo \
    --hub_model_id LazarusNLP/nusabert-base
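
The command sets a peak learning rate, warmup steps, and a step budget, but no scheduler type. Assuming the modified script keeps the Trainer's default linear schedule (an assumption, not confirmed by this README), the learning rate would rise linearly for 24,000 steps and then decay linearly to zero at step 500,000. The helper `linear_schedule_lr` below is an illustrative re-implementation of that shape.

```python
def linear_schedule_lr(step, peak_lr=3e-4, warmup_steps=24_000, max_steps=500_000):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, max_steps - step)
    return peak_lr * remaining / (max_steps - warmup_steps)


# Start, mid-warmup, peak, halfway through the decay, and the final step.
for step in (0, 12_000, 24_000, 262_000, 500_000):
    print(f"step {step:>7,d}: lr = {linear_schedule_lr(step):.2e}")
```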

We achieved a negative log-likelihood loss of 1.4876 and an accuracy of 68.66% on a held-out subset (5%) of the pre-training corpus.
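
For comparison with other masked language models, the loss can be converted to perplexity by exponentiating it. This assumes the loss is a mean negative log-likelihood in nats, as is standard for cross-entropy in PyTorch.

```python
import math

eval_loss = 1.4876  # mean negative log-likelihood reported above (assumed in nats)
perplexity = math.exp(eval_loss)
print(f"perplexity = {perplexity:.2f}")
```

This gives a masked-token perplexity of about 4.43 on the held-out subset.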

Fine-tune NusaBERT

We developed fine-tuning scripts for NusaBERT based on Hugging Face 🤗's sample fine-tuning scripts.

In particular, we developed fine-tuning scripts for single-sentence classification, multi-label multi-class classification, token classification, and pair token classification, which you can find in scripts. These scripts support IndoNLU, NusaX, and NusaWrites datasets.

Single-Sentence Classification Task

The tasks included under this category are emotion classification, sentiment analysis, topic classification, etc. To fine-tune for single-sentence classification, run the following command and modify accordingly:

python scripts/run_classification.py \
    --model-checkpoint LazarusNLP/NusaBERT-base \
    --dataset-name indonlp/indonlu \
    --dataset-config emot \
    --input-column-names tweet \
    --target-column-name label \
    --input-max-length 128 \
    --output-dir outputs/nusabert-base-emot \
    --num-train-epochs 100 \
    --optim adamw_torch_fused \
    --learning-rate 1e-5 \
    --weight-decay 0.01 \
    --per-device-train-batch-size 32 \
    --per-device-eval-batch-size 64 \
    --hub-model-id LazarusNLP/NusaBERT-base-EmoT

Single-Sentence Classification recipes are provided here.

Multi-label Multi-class Classification

The task included under this category is aspect-based sentiment analysis (e.g. IndoNLU CASA and HoASA). To fine-tune for multi-label multi-class classification, run the following command and modify accordingly:

python scripts/run_multi_label_classification.py \
    --model-checkpoint LazarusNLP/NusaBERT-base \
    --dataset-name indonlp/indonlu \
    --dataset-config casa \
    --input-column-name sentence \
    --target-column-names fuel,machine,others,part,price,service \
    --input-max-length 128 \
    --output-dir outputs/nusabert-base-casa \
    --num-train-epochs 100 \
    --optim adamw_torch_fused \
    --learning-rate 1e-5 \
    --weight-decay 0.01 \
    --per-device-train-batch-size 32 \
    --per-device-eval-batch-size 64 \
    --hub-model-id LazarusNLP/NusaBERT-base-CASA

Multi-label Multi-class Classification recipes are provided here.
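
In this setting each aspect column (fuel, machine, and so on) takes one of several sentiment classes, so a prediction is one label per aspect rather than a single label per sentence. The sketch below shows one common way to decode such an output: treat a flat logit vector as an aspects-by-classes grid and take the argmax of each row. The class names, the flat layout, and `decode_aspects` are assumptions for illustration; the repository's script may organize its outputs differently.

```python
ASPECTS = ["fuel", "machine", "others", "part", "price", "service"]
CLASSES = ["negative", "neutral", "positive"]  # assumed label set


def decode_aspects(flat_logits, aspects=ASPECTS, classes=CLASSES):
    """Split a flat logit vector into one row per aspect and argmax each row."""
    n = len(classes)
    if len(flat_logits) != len(aspects) * n:
        raise ValueError("expected len(aspects) * len(classes) logits")
    result = {}
    for i, aspect in enumerate(aspects):
        row = flat_logits[i * n:(i + 1) * n]
        result[aspect] = classes[max(range(n), key=row.__getitem__)]
    return result


# Made-up logits for a single sentence: 6 aspects x 3 classes = 18 values.
logits = [
    0.1, 2.3, -0.5,   # fuel
    -1.0, 0.2, 1.9,   # machine
    0.0, 1.1, 0.3,    # others
    0.4, 1.5, -0.2,   # part
    2.2, 0.1, -0.7,   # price
    -0.3, 0.9, 0.8,   # service
]
print(decode_aspects(logits))
```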

Token Classification

Token classification is also known as sequence labeling. The tasks included under this category are part-of-speech tagging (POS), named entity recognition (NER), and token-level span extraction (e.g. IndoNLU TermA, KEPS). To fine-tune for token classification, run the following command and modify accordingly:

python scripts/run_token_classification.py \
    --model-checkpoint LazarusNLP/NusaBERT-base \
    --dataset-name indonlp/indonlu \
    --dataset-config posp \
    --input-column-name tokens \
    --target-column-name pos_tags \
    --output-dir outputs/nusabert-base-posp \
    --num-train-epochs 10 \
    --optim adamw_torch_fused \
    --learning-rate 2e-5 \
    --weight-decay 0.01 \
    --per-device-train-batch-size 16 \
    --per-device-eval-batch-size 64 \
    --hub-model-id LazarusNLP/NusaBERT-base-POSP

Token Classification recipes are provided here.
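
Because the datasets provide one label per word while WordPiece may split a word into several subwords, token classification needs a label alignment step. A common convention, assumed here rather than taken from the repository's script, is to label only the first subword of each word and mark the remaining positions with -100 so the loss ignores them. The `word_ids` list mirrors what fast tokenizers return from `BatchEncoding.word_ids()`.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def align_labels(word_ids, word_labels, label_all_subwords=False):
    """Map word-level labels onto subword tokens.

    word_ids[i] is the index of the word that token i came from, or None for
    special tokens such as [CLS] and [SEP].
    """
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)  # special token
        elif word_id != previous or label_all_subwords:
            aligned.append(word_labels[word_id])  # first subword of a word
        else:
            aligned.append(IGNORE_INDEX)  # continuation subword
        previous = word_id
    return aligned


# Hypothetical tokenization of ["Saya", "memakan", "nasi"]:
# [CLS] saya me ##makan nasi [SEP]
word_ids = [None, 0, 1, 1, 2, None]
pos_tags = [5, 9, 3]  # made-up tag IDs, one per word
print(align_labels(word_ids, pos_tags))  # [-100, 5, 9, -100, 3, -100]
```

Setting `label_all_subwords=True` instead copies the word's label to every subword, which changes how much each long word contributes to the loss.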

Pair Token Classification

Pair token classification is much like token classification, except that it takes a pair of input sentences instead of one. The task included under this category is token-level question-passage answering (e.g. IndoNLU FacQA). To fine-tune for pair token classification, run the following command and modify accordingly:

python scripts/run_pair_token_classification.py \
    --model-checkpoint LazarusNLP/NusaBERT-base \
    --dataset-name indonlp/indonlu \
    --dataset-config facqa \
    --input-column-name-1 question \
    --input-column-name-2 passage \
    --target-column-name seq_label \
    --output-dir outputs/nusabert-base-facqa \
    --num-train-epochs 10 \
    --optim adamw_torch_fused \
    --learning-rate 2e-5 \
    --weight-decay 0.01 \
    --per-device-train-batch-size 16 \
    --per-device-eval-batch-size 64 \
    --hub-model-id LazarusNLP/NusaBERT-base-FacQA

Pair Token Classification recipes are provided here.
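
To make the input layout concrete, the sketch below assembles a question-passage pair in the usual BERT format and attaches labels only to the passage tokens. It works at the word level and assumes that `seq_label` has one entry per passage token and that question and special-token positions are ignored by the loss; `build_pair_example` and the toy data are hypothetical and may not match the repository's script.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def build_pair_example(question, passage, passage_labels):
    """Assemble [CLS] question [SEP] passage [SEP] with labels on the passage only."""
    if len(passage) != len(passage_labels):
        raise ValueError("need exactly one label per passage token")
    tokens = ["[CLS]", *question, "[SEP]", *passage, "[SEP]"]
    # Segment 0 covers [CLS] + question + first [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(question) + 2) + [1] * (len(passage) + 1)
    labels = (
        [IGNORE_INDEX] * (len(question) + 2)
        + list(passage_labels)
        + [IGNORE_INDEX]
    )
    return tokens, token_type_ids, labels


# Toy example; label 1 marks an answer token and 0 everything else (made-up IDs).
question = ["siapa", "presiden", "pertama"]
passage = ["soekarno", "adalah", "presiden", "pertama"]
seq_label = [1, 0, 0, 0]

tokens, token_type_ids, labels = build_pair_example(question, passage, seq_label)
for row in zip(tokens, token_type_ids, labels):
    print(row)
```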

Citation

If you use NusaBERT in your research, please cite the following:

@misc{wongso2024nusabertteachingindobertmultilingual,
    title = {NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural}, 
    author = {Wilson Wongso and David Samuel Setiawan and Steven Limcorn and Ananto Joyoadikusumo},
    year = {2024},
    eprint = {2403.01817},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL},
    url = {https://arxiv.org/abs/2403.01817}, 
}

Credits

NusaBERT is developed with love by: