This is the implementation of Improving Chinese Word Segmentation with Wordhood Memory Networks at ACL 2020.
We will keep updating this repository.
If you use or extend our work, please cite our paper at ACL2020.
@inproceedings{tian-etal-2020-improving,
title = "Improving Chinese Word Segmentation with Wordhood Memory Networks",
author = "Tian, Yuanhe and Song, Yan and Xia, Fei and Zhang, Tong and Wang, Yonggang",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2020",
address = "Online",
pages = "8274--8285",
}
Our code works with the following environment.
python=3.6
pytorch=1.1
In our paper, we use BERT and ZEN as the encoder.
For BERT, please download the pre-trained BERT-Base Chinese model from Google or from HuggingFace. If you download it from Google, you need to convert the model from the TensorFlow version to the PyTorch version.
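The conversion step can be sketched as follows. This is a minimal sketch, not the procedure this repository prescribes: it assumes a release of the Hugging Face transformers package that still exports load_tf_weights_in_bert, with both TensorFlow and PyTorch installed. The function name and the paths in the usage comment are placeholders.

```python
def convert_tf_checkpoint_to_pytorch(tf_checkpoint_path, bert_config_file, pytorch_dump_path):
    """Load TensorFlow BERT weights into a PyTorch model and save its state dict.

    Assumption: `transformers` (with `load_tf_weights_in_bert`), TensorFlow,
    and PyTorch are installed. They are imported lazily so that merely
    defining this function does not require them.
    """
    import torch
    from transformers import BertConfig, BertForPreTraining, load_tf_weights_in_bert

    # Build an untrained PyTorch model with the same architecture as the checkpoint.
    config = BertConfig.from_json_file(bert_config_file)
    model = BertForPreTraining(config)

    # Copy the TensorFlow variables into the PyTorch parameters.
    load_tf_weights_in_bert(model, config, tf_checkpoint_path)

    # Save only the weights; the config and vocabulary files are copied separately.
    torch.save(model.state_dict(), pytorch_dump_path)
    return pytorch_dump_path


# Usage (placeholder paths, adjust to the extracted Google archive):
# convert_tf_checkpoint_to_pytorch(
#     "/path/to/tf_bert/bert_model.ckpt",
#     "/path/to/tf_bert/bert_config.json",
#     "/path/to/pytorch_bert/pytorch_model.bin",
# )
```

The output directory would also need the config and vocabulary files next to the saved weights before it can be passed as the pre-trained model directory.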
For ZEN, you can download the pre-trained model from here.
For WMSeg, you can download the models we trained in our experiments from here.
Run run_sample.sh to train a model on the small sample data under the sample_data directory.
We use SIGHAN2005 and CTB6 in our paper.
To obtain and pre-process the data, please go to the data_preprocessing directory and run getdata.sh. This script will download and process the official data from SIGHAN2005. For CTB6, you need to obtain the official data first, and then put the LDC07T36 folder under the data_preprocessing directory.
All processed data will appear in the data directory.
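The on-disk format of the processed files is not described here. As background, word segmentation is commonly cast as per-character tagging, and the sketch below shows that idea under the assumption of a B/I/E/S label scheme (begin, inside, end, single). The function names are hypothetical and are not taken from this repository.

```python
def words_to_tags(words):
    """Turn a segmented sentence (list of words) into characters and B/I/E/S tags."""
    chars, tags = [], []
    for word in words:
        if not word:
            continue  # skip empty tokens
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["I"] * (len(word) - 2) + ["E"])
        chars.extend(word)
    return chars, tags


def tags_to_words(chars, tags):
    """Rebuild words from characters and tags; a word closes on E or S."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends without E or S
        words.append(current)
    return words


chars, tags = words_to_tags(["我", "喜欢", "北京大学"])
print(list(zip(chars, tags)))
print(tags_to_words(chars, tags))
```

Under this scheme, a decoder predicts one tag per character and the segmentation is recovered by cutting after every E or S.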
You can find the command lines to train and test a model on a specific dataset in run.sh.
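As a rough illustration of how such a command line could be assembled, here is a minimal sketch that uses only the flags named in this README. Several things are assumptions: the entry script name is a placeholder, the switches (--do_train, --use_bert, and so on) are assumed to take no value, and the other flags are assumed to take one space-separated value. Any data-path arguments are omitted because they are not listed here. run.sh remains the authoritative source.

```python
import shlex

DECODERS = ("crf", "softmax")
NGRAM_FLAGS = ("av", "dlg", "pmi")


def build_command(script, bert_model, model_name, mode="train",
                  encoder="bert", use_memory=True, decoder="crf", ngram_flag="av"):
    """Return an argument list for a hypothetical training/testing entry script."""
    if mode not in ("train", "test"):
        raise ValueError("mode must be 'train' or 'test'")
    if encoder not in ("bert", "zen"):
        raise ValueError("encoder must be 'bert' or 'zen'")
    if decoder not in DECODERS:
        raise ValueError("decoder must be one of %s" % (DECODERS,))
    if ngram_flag not in NGRAM_FLAGS:
        raise ValueError("ngram_flag must be one of %s" % (NGRAM_FLAGS,))

    cmd = ["python", script]
    cmd.append("--do_train" if mode == "train" else "--do_test")
    cmd.append("--use_bert" if encoder == "bert" else "--use_zen")
    cmd += ["--bert_model", bert_model]
    if use_memory:
        cmd.append("--use_memory")
    cmd += ["--decoder", decoder, "--ngram_flag", ngram_flag,
            "--model_name", model_name]
    return cmd


# "your_entry_script.py" and the paths are placeholders, not names from the repository.
cmd = build_command("your_entry_script.py", "/path/to/bert", "demo_model")
print(" ".join(shlex.quote(part) for part in cmd))
```

Returning a list rather than a string keeps the result directly usable with subprocess.run without shell quoting issues.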
Here are some important parameters:
--do_train: train the model
--do_test: test the model
--use_bert: use BERT as encoder
--use_zen: use ZEN as encoder
--bert_model: the directory of pre-trained BERT/ZEN model
--use_memory: use memory
--decoder: use crf or softmax as the decoder
--ngram_flag: use av, dlg, or pmi to construct the lexicon N
--model_name: the name of model to save
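The three --ngram_flag values name wordhood measures: accessor variety (av), description length gain (dlg), and pointwise mutual information (pmi). To give a feel for how a lexicon N can be built from raw text, here is a minimal sketch of the accessor-variety idea. It scores an n-gram by the smaller of its distinct left-neighbour count and distinct right-neighbour count. It is not the repository's implementation: the function names, max_len, and threshold are hypothetical, and a sentence boundary is simplified to a single shared neighbour symbol.

```python
from collections import defaultdict

BOUNDARY = "<s>"  # hypothetical marker standing in for a sentence start or end


def accessor_variety(sentences, max_len=3):
    """Map each n-gram (up to max_len characters) to min(#left, #right) neighbours."""
    left = defaultdict(set)
    right = defaultdict(set)
    for sent in sentences:
        n = len(sent)
        for i in range(n):
            for j in range(i + 1, min(i + max_len, n) + 1):
                gram = sent[i:j]
                left[gram].add(sent[i - 1] if i > 0 else BOUNDARY)
                right[gram].add(sent[j] if j < n else BOUNDARY)
    return {gram: min(len(left[gram]), len(right[gram])) for gram in left}


def build_lexicon(sentences, max_len=3, threshold=2):
    """Keep multi-character n-grams whose accessor variety reaches the threshold."""
    scores = accessor_variety(sentences, max_len)
    return {g for g, s in scores.items() if len(g) > 1 and s >= threshold}


corpus = ["我喜欢北京", "他去北京了", "北京很大"]
print(accessor_variety(corpus)["北京"])  # 3
print(build_lexicon(corpus))             # {'北京'}
```

In this toy corpus, 北京 appears in three different left and right contexts, so it is the only multi-character n-gram that passes the threshold. Strings that straddle word boundaries, such as 欢北, are always seen with the same neighbours and are filtered out.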