This project provides ancient Chinese models for NLP tasks, including language modeling, word segmentation, and part-of-speech tagging.
Our paper has been accepted to ROCLING 2022! Please check it out.
- transformers ≤ 4.15.0
- pytorch
We have uploaded our models to the Hugging Face Hub.
- Pretrained models using a masked language modeling (MLM) objective.
- Fine-tuned models for Word Segmentation.
- Fine-tuned models for Part-of-Speech tagging.
The copyright of the datasets belongs to the Institute of Linguistics, Academia Sinica.
```bash
pip install transformers==4.15.0
pip install torch==1.10.2
```
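As an optional sanity check that the pinned versions above were installed, a minimal snippet (not part of the toolkit itself) can just print the installed versions:

```python
# Optional sanity check: print the installed versions.
import torch
import transformers

print(transformers.__version__)  # expected: 4.15.0
print(torch.__version__)         # expected: 1.10.2
```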
Pre-trained Language Model
You can use ckiplab/bert-base-han-chinese directly with a pipeline for masked language modeling.
```python
from transformers import pipeline

# Initialize the fill-mask pipeline
unmasker = pipeline('fill-mask', model='ckiplab/bert-base-han-chinese')

# Input text with [MASK]
unmasker("黎[MASK]於變時雍。")

# output
[{'sequence': '黎 民 於 變 時 雍 。', 'score': 0.14885780215263367, 'token': 3696, 'token_str': '民'},
 {'sequence': '黎 庶 於 變 時 雍 。', 'score': 0.0859643816947937, 'token': 2433, 'token_str': '庶'},
 {'sequence': '黎 氏 於 變 時 雍 。', 'score': 0.027848130092024803, 'token': 3694, 'token_str': '氏'},
 {'sequence': '黎 人 於 變 時 雍 。', 'score': 0.023678112775087357, 'token': 782, 'token_str': '人'},
 {'sequence': '黎 生 於 變 時 雍 。', 'score': 0.018718384206295013, 'token': 4495, 'token_str': '生'}]
```
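Since the pipeline returns candidates in descending score order, the top prediction can be taken directly. A small follow-up sketch, reusing the `unmasker` object from the snippet above:

```python
# Reuse the `unmasker` pipeline defined above; candidates are sorted by score.
predictions = unmasker("黎[MASK]於變時雍。")
best = predictions[0]
print(best["token_str"], round(best["score"], 4))  # 民 0.1489
```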
You can use ckiplab/bert-base-han-chinese to get the features of a given text in PyTorch.
```python
from transformers import AutoTokenizer, AutoModel

# Initialize tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("ckiplab/bert-base-han-chinese")
model = AutoModel.from_pretrained("ckiplab/bert-base-han-chinese")

# Input text
text = "黎民於變時雍。"
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)

# get encoded token vectors
output.last_hidden_state  # torch.Tensor with Size([1, 9, 768])

# get encoded sentence vector
output.pooler_output      # torch.Tensor with Size([1, 768])
```
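Besides `output.pooler_output`, a common way to obtain a fixed-size sentence vector is to mean-pool the token vectors in `last_hidden_state`, weighting by the attention mask. This is only a sketch built on the `encoded_input` and `output` objects from the snippet above, not something the released model requires:

```python
# Mean-pool the token embeddings, masking out padding positions.
mask = encoded_input["attention_mask"].unsqueeze(-1)               # Size([1, 9, 1])
sentence_vector = (output.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
sentence_vector.shape  # torch.Size([1, 768])
```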
Word Segmentation (WS)
In WS, ckiplab/bert-base-han-chinese-ws divides written text into meaningful units, i.e., words. The task is formulated as labeling each character as either the beginning of a word (B) or inside a word (I); a small merging step, shown after the example below, turns these labels back into words.
```python
from transformers import pipeline

# Initialize the token-classification pipeline
classifier = pipeline("token-classification", model="ckiplab/bert-base-han-chinese-ws")

# Input text
classifier("帝堯曰放勳")

# output
[{'entity': 'B', 'score': 0.9999793, 'index': 1, 'word': '帝', 'start': 0, 'end': 1},
 {'entity': 'I', 'score': 0.9915047, 'index': 2, 'word': '堯', 'start': 1, 'end': 2},
 {'entity': 'B', 'score': 0.99992275, 'index': 3, 'word': '曰', 'start': 2, 'end': 3},
 {'entity': 'B', 'score': 0.99905187, 'index': 4, 'word': '放', 'start': 3, 'end': 4},
 {'entity': 'I', 'score': 0.96299917, 'index': 5, 'word': '勳', 'start': 4, 'end': 5}]
```
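Because the pipeline labels each character with B or I, a small post-processing step is needed to stitch the characters back into words. The helper below is only a sketch (the function name `merge_ws` is made up here) that reuses the `classifier` pipeline from the snippet above:

```python
def merge_ws(tokens):
    """Group the per-character B/I labels from the WS pipeline into words."""
    words = []
    for t in tokens:
        if t["entity"] == "B" or not words:
            words.append(t["word"])   # start a new word
        else:
            words[-1] += t["word"]    # continue the current word
    return words

merge_ws(classifier("帝堯曰放勳"))  # ['帝堯', '曰', '放勳']
```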
Part-of-Speech (PoS) Tagging
In PoS tagging, ckiplab/bert-base-han-chinese-pos recognizes parts of speech in a given text. The task is formulated as labeling each word with a part-of-speech tag.
```python
from transformers import pipeline

# Initialize the token-classification pipeline
classifier = pipeline("token-classification", model="ckiplab/bert-base-han-chinese-pos")

# Input text
classifier("帝堯曰放勳")

# output
[{'entity': 'NB1', 'score': 0.99410427, 'index': 1, 'word': '帝', 'start': 0, 'end': 1},
 {'entity': 'NB1', 'score': 0.98874336, 'index': 2, 'word': '堯', 'start': 1, 'end': 2},
 {'entity': 'VG', 'score': 0.97059363, 'index': 3, 'word': '曰', 'start': 2, 'end': 3},
 {'entity': 'NB1', 'score': 0.9864504, 'index': 4, 'word': '放', 'start': 3, 'end': 4},
 {'entity': 'NB1', 'score': 0.9543974, 'index': 5, 'word': '勳', 'start': 4, 'end': 5}]
```
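For downstream use, the per-character predictions can be collected into (character, tag) pairs. This is just a small post-processing sketch reusing the `classifier` pipeline from the snippet above:

```python
# Pair each character with its predicted PoS tag.
pairs = [(t["word"], t["entity"]) for t in classifier("帝堯曰放勳")]
# [('帝', 'NB1'), ('堯', 'NB1'), ('曰', 'VG'), ('放', 'NB1'), ('勳', 'NB1')]
```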
| Language Model | MLM Training Data | MLM Testing: 上古 | MLM Testing: 中古 | MLM Testing: 近代 | MLM Testing: 現代 |
|---|---|---|---|---|---|
| ckiplab/bert-base-han-chinese | 上古 | 24.7588 | 87.8176 | 297.1111 | 60.3993 |
| | 中古 | 67.861 | 70.6244 | 133.0536 | 23.0125 |
| | 近代 | 69.1364 | 77.4154 | 46.8308 | 20.4289 |
| | 現代 | 118.8596 | 163.6896 | 146.5959 | 4.6143 |
| | Merge | 31.1807 | 61.2381 | 49.0672 | 4.5017 |
| ckiplab/bert-base-chinese | - | 233.6394 | 405.9008 | 278.7069 | 8.8521 |
| WS Model | Training Data | Testing: 上古 | Testing: 中古 | Testing: 近代 | Testing: 現代 |
|---|---|---|---|---|---|
| ckiplab/bert-base-han-chinese-ws | 上古 | 97.6090 | 88.5734 | 83.2877 | 70.3772 |
| | 中古 | 92.6402 | 92.6538 | 89.4803 | 78.3827 |
| | 近代 | 90.8651 | 92.1861 | 94.6495 | 81.2143 |
| | 現代 | 87.0234 | 83.5810 | 84.9370 | 96.9446 |
| | Merge | 97.4537 | 91.9990 | 94.0970 | 96.7314 |
| ckiplab/bert-base-chinese-ws | - | 86.5698 | 82.9115 | 84.3213 | 98.1325 |
| POS Model | Training Data | Testing: 上古 | Testing: 中古 | Testing: 近代 | Testing: 現代 |
|---|---|---|---|---|---|
| ckiplab/bert-base-han-chinese-pos | 上古 | 91.2945 | - | - | - |
| | 中古 | 7.3662 | 80.4896 | 11.3371 | 10.2577 |
| | 近代 | 6.4794 | 14.3653 | 88.6580 | 0.5316 |
| | 現代 | 11.9895 | 11.0775 | 0.4033 | 93.2813 |
| | Merge | 88.8772 | 42.4369 | 86.9093 | 92.9012 |
Copyright (c) 2022 CKIP Lab under the GPL-3.0 License.
Please cite our paper if you use Han-Transformers in your work:
```bibtex
@inproceedings{lin-ma-2022-hantrans,
    title = "{H}an{T}rans: An Empirical Study on Cross-Era Transferability of {C}hinese Pre-trained Language Model",
    author = "Lin, Chin-Tung and Ma, Wei-Yun",
    booktitle = "Proceedings of the 34th Conference on Computational Linguistics and Speech Processing (ROCLING 2022)",
    year = "2022",
    address = "Taipei, Taiwan",
    publisher = "The Association for Computational Linguistics and Chinese Language Processing (ACLCLP)",
    url = "https://aclanthology.org/2022.rocling-1.21",
    pages = "164--173",
}
```