Recent advances in AI have enabled much smarter text-based applications, but most of them work only in English because of the abundance of English text available on the Internet. This repository explores language modelling in Cantonese (Yue Chinese, 廣東話), a language predominantly spoken in Guangzhou, Hong Kong and Macau whose linguistic properties are very challenging for AI to learn. Because most NLP resources are English-only, building Cantonese NLP is not easy, so this repo was started to encourage more people to develop Cantonese AI.
What makes Cantonese challenging:

- Mixed languages (English, Chinese, Yue)
- Complex syntax
- Scarce resources
- Many homonyms and homophones in online text
We apply the following preprocessing to the text before it is fed to the model (a short code sketch follows the list):
- WordPiece tokenizer from a forked 🤗 Tokenizers, which
  - strips accents like the original BERT (e.g. à → a)
  - lower-cases English text
  - treats each symbol/number as a separate token
- Simplified Chinese → Traditional Chinese (since most of our corpus is in Traditional Chinese), using OpenCC v1.1.1 from here
- Unicode character normalization (some mappings are hand-crafted; mapping here):
  - symbols with the same functionality (e.g. 【 → [)
  - variant Chinese characters (e.g. 俢 → 修)
  - decomposing rare characters (e.g. 偆 → 亻春)
- Newlines are regarded as a token, i.e. `<nl>`
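For illustration, here is a minimal sketch of this preprocessing, assuming the `opencc` Python bindings and the stock 🤗 Tokenizers API (the repo itself uses a forked build, installed below). The file path and the character mapping shown are only illustrative placeholders:

```python
# Minimal preprocessing sketch (illustrative only, not the repo's exact pipeline).
# Assumes `pip install opencc`; depending on the binding, the config name may need
# to be "s2t.json" instead of "s2t".
import opencc
from tokenizers import BertWordPieceTokenizer

s2t = opencc.OpenCC("s2t")  # Simplified Chinese -> Traditional Chinese (簡轉繁)

# Tiny illustrative subset of the hand-crafted normalization mapping above.
CHAR_MAP = {
    "【": "[",    # symbols with the same functionality
    "俢": "修",   # variant Chinese characters
    "偆": "亻春", # decompose rare characters
}

def preprocess(text: str) -> str:
    text = s2t.convert(text)                             # simplified -> traditional
    text = "".join(CHAR_MAP.get(ch, ch) for ch in text)  # normalize characters
    return text.replace("\n", " <nl> ")                  # keep newlines as a token

# WordPiece tokenizer that lower-cases and strips accents like the original BERT.
# The forked tokenizer additionally treats every symbol/number as its own token;
# the stock BertWordPieceTokenizer only approximates this via punctuation splitting.
tokenizer = BertWordPieceTokenizer(lowercase=True, strip_accents=True)
tokenizer.train(files=["corpus.txt"],  # hypothetical path to preprocessed text
                vocab_size=30000,
                special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "<nl>"])

print(tokenizer.encode(preprocess("Résumé 广东话 123\nhello")).tokens)
```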
Dependencies:

- TensorFlow
- PyTorch
- OpenCC (Simplified-to-Traditional conversion, 簡轉繁) @ v1.1.1
- 🤗 Tokenizers (a forked version is used for normalization)
```bash
# Install OpenCC v1.1.1
sudo bash ./install_opencc.sh

# Install the forked 🤗 Tokenizers (this takes some time!)
# The fork is based on tokenizers@v0.8.1, with the Python package renamed to tokenizers_zh
pip3 install 'git+https://github.com/ecchochan/tokenizers.git@zh-norm-4#egg=version_subpkg&subdirectory=bindings/python'
```
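A quick sanity check that the forked build is importable, assuming the fork keeps the usual `tokenizers` module attributes under the renamed package `tokenizers_zh`:

```python
# Sanity check for the forked 🤗 Tokenizers install; the Python package is renamed
# to tokenizers_zh (see the install note above), so import it under that name.
import tokenizers_zh
print(tokenizers_zh.__version__)  # should report the tokenizers v0.8.1-based fork
```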
Training corpus:

| Chinese (zh) | English (en) |
|---|---|
| ~80 GB (incl. ~20 GB Cantonese) | ~100 GB |
Since we have no evaluation datasets in Cantonese, we evaluate the models on both English and Chinese datasets:
- MNLI (Entailment Prediction)
- DRCD (Reading Comprehension)
- SQuAD-v2 (Reading Comprehension)
- CMRC2018 (Reading Comprehension)
- Sentence Order Prediction (SOP)

  SOP is a pretraining objective used in ALBERT. StructBERT also introduces a Sentence Structural Objective, but since the ELECTRA code reads the data sequentially, this repo explores SOP first (see the sketch after this list).
- Cluster Objective

  DocProduct is a cool project that trains a BERT model to cluster similar Q&A pairs: if a text A answers a question Q, then Q and A end up close in vector representation. This means the model must predict the possible contexts (before and after) in order to produce an embedding that minimizes the cost function (see the sketch after this list). For details, refer to the DocProduct repo.
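As referenced above, a minimal sketch of how SOP training pairs can be built from two consecutive segments of a document (illustrative only; the actual data pipeline lives in the ELECTRA code):

```python
import random

def make_sop_example(segment_a: str, segment_b: str, rng: random.Random = random):
    """Sentence Order Prediction: given two consecutive segments from the same
    document, keep them in order (label 0) or swap them (label 1)."""
    if rng.random() < 0.5:
        return (segment_a, segment_b), 0   # original order
    return (segment_b, segment_a), 1       # swapped order

# Example usage with two consecutive Cantonese segments.
pair, label = make_sop_example("今日天氣好好。", "不如出去行下啦。")
print(pair, label)
```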
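And a rough sketch of the clustering idea as an in-batch retrieval loss that pulls each question towards its own answer in embedding space (names and shapes are illustrative; refer to the DocProduct repo for the actual objective):

```python
import torch
import torch.nn.functional as F

def qa_cluster_loss(q_vecs: torch.Tensor, a_vecs: torch.Tensor) -> torch.Tensor:
    """In-batch retrieval loss: the i-th question should score highest against
    the i-th answer, so matching Q/A pairs end up close in vector space."""
    logits = q_vecs @ a_vecs.t()              # (batch, batch) similarity matrix
    targets = torch.arange(q_vecs.size(0))    # the diagonal holds the true pairs
    return F.cross_entropy(logits, targets)

# Example with random tensors standing in for pooled encoder outputs.
loss = qa_cluster_loss(torch.randn(8, 768), torch.randn(8, 768))
print(loss.item())
```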
- Normalize Chinese characters
- ELECTRA-small
- ELECTRA-base
- ELECTRA-base-sop
- ELECTRA-albert-base
- ELECTRA-albert-xlarge
- ELECTRA-base-cluster
- ELECTRA-large
- Evaluation on a Cantonese dataset
- Upload to 🤗 Hugging Face
| | Model | params # | L/H | MNLI-en | DRCD-dev (EM/F1) | SQuADv2-dev (EM/F1) | CMRC2018-dev (EM/F1) |
|---|---|---|---|---|---|---|---|
| 🐤 | BERT (s) | 12M | 12/256 | 77.6 | 60.5/64.2🤗 | | |
| 🐦 | BERT (b) | 110M | 12/768 | 84.3 | 85.0/91.2 | 72.4/75.8🤗 | |
| 🦅 | BERT (l) | 334M | 24/1024 | 87.1 | 92.8/86.7 | | |
| 🐦 | roBERTa (b) | 110M | 12/768 | 87.6 | 86.6/92.5 | 78.5/81.7🤗 | |
| 🦅 | roBERTa (l) | 335M | 24/1024 | 90.2 | 88.9/94.6 | | |
| 🐤 | alBERT (b) | 12M | 12/768 | 84.6 | 79.3/82.1 | | |
| 🐤 | alBERT (l) | 18M | 24/1024 | 86.5 | 81.8/84.9 | | |
| 🐦 | alBERT (xl) | 60M | 24/2048 | 87.9 | 84.1/87.9 | | |
| 🦅 | alBERT (xxl) | 235M | 12/4096 | 90.6 | 86.9/89.8 | | |
| 🐤 | ELECTRA (s) | 14M | 12/256 | 81.6 | 83.5/89.2 | 69.7/73.4🤗 | |
| 🐦 | ELECTRA (b) | 110M | 12/768 | 88.5 | 89.6/94.2 | 80.5/83.3 | 69.3/87.0 |
| 🦅 | ELECTRA (l) | 335M | 24/1024 | 90.7 | 88.8/93.3 | 88.0/90.6 | |
| 🐦 | XLM-R (b) | 270M | 12/768 | | | | |
| 🦅 | XLM-R (l) | 550M | 24/1024 | 89.0 | | | |
| Ours (1.2M) | | | | | | | |
| 🐤 | ELECTRA (s) | 14M | 12/256 | 80.7 | 82.1/88.0 | 69.4/72.1 | |
| 🐦 | ELECTRA (b) | 110M | 12/768 | 86.3 | 88.2/92.5 | 80.4/83.3 | |
| 🐦 | albert (xl) | 60M | 12/2048 | 87.7 | 89.9/94.7 | 82.9/85.9 | |
| Ours (1.5M) | | | | | | | |
| 🐦 | ELECTRA (b) | 110M | 12/768 | 86.8 | 88.5/93.3 | 80.8/83.7 | 67.4/86.7 |
| | + finetuned after SQuAD | | | | 89.5/94.1 | | 70.2/88.5 |
| | Model | params # | L/H | MNLI-en | DRCD-dev (EM/F1) | SQuADv2-dev (EM/F1) |
|---|---|---|---|---|---|---|
| 🐤 | BERT (s) | 12M | 12/256 | 77.6 | 60.5/64.2🤗 | |
| 🐤 | alBERT (b) | 12M | 12/768 | 84.6 | 79.3/82.1 | |
| 🐤 | alBERT (l) | 18M | 24/1024 | 86.5 | 81.8/84.9 | |
| 🐤 | ELECTRA (s) | 14M | 12/256 | 81.6 | 83.5/89.2 | 69.7/73.4🤗 |
| Ours | | | | | | |
| 🐤 | ELECTRA (s) | 14M | 12/256 | 80.7 | 82.1/88.0 | 69.4/72.1 |
| | Model | params # | L/H | MNLI-en | DRCD-dev (EM/F1) | SQuADv2-dev (EM/F1) | CMRC2018-dev (EM/F1) |
|---|---|---|---|---|---|---|---|
| 🐦 | BERT (b) | 110M | 12/768 | 84.3 | 85.0/91.2 | 72.4/75.8🤗 | |
| 🐦 | roBERTa (b) | 110M | 12/768 | 87.6 | 86.6/92.5 | 78.5/81.7🤗 | 67.4/87.2 |
| 🐦 | ELECTRA (b) | 110M | 12/768 | 88.5 | 89.6/94.2 | 80.5/83.3 | 69.3/87.0 |
| Ours | | | | | | | |
| 🐦 | ELECTRA (b) | 110M | 12/768 | 86.3 | 88.2/92.5 | 80.4/83.3 | |
| Ours (1.5M) | | | | | | | |
| 🐦 | ELECTRA (b) | 110M | 12/768 | 86.8 | 88.5/93.3 | 80.8/83.7 | 67.4/86.7 |
| | + finetuned after SQuAD | | | | 89.5/94.1 | | 70.2/88.5 |
ELECTRA checkpoints are available here on Google Drive.
ELECTRA-albert checkpoints are available here on Google Drive.
| | Model | params # | L/H | MNLI-en | DRCD-dev (EM/F1) | SQuADv2-dev (EM/F1) |
|---|---|---|---|---|---|---|
| Ours (1.5M) | | | | | | |
| 🐦 | ELECTRA (b) | 110M | 12/768 | 86.8 | 88.5/93.3 | 80.8/83.7 |
| | + finetuned after SQuAD | | | | 89.5/94.1 | |
| Ours (1.5M) + SOP | | | | | | |
| 🐦 | ELECTRA (b) | 110M | 12/768 | 87.1 | 88.6/93.6 | 80.4/83.2 |
| | + finetuned after SQuAD | | | | 89.7/94.1 | |
Special thanks to Google's TensorFlow Research Cloud (TFRC) for providing TPU-v3 for all the training in this repo!