Authored by Chen Z.Y. and Ni B.L.
This project implements Chinese tokenization. It is also our homework for the Natural Language Processing course.
We have implemented the following:
- Rule-based tokenization with shortest-path search
- A simple 2-gram language model without part-of-speech tags
- An HMM model with part-of-speech tags
- Character-based tagging
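The shortest-path idea behind the rule-based tokenizer can be sketched as follows. This is a minimal illustration, not the project's actual code; the dictionary and sentence are made up, and ties between equally short paths are broken by whichever split is found first.

```python
def shortest_path_segment(sentence, dictionary):
    """Segment `sentence` into the fewest dictionary words.

    dp[i] holds the minimal number of words covering sentence[:i];
    single characters are always allowed as a fallback so every
    input can be segmented.
    """
    n = len(sentence)
    dp = [0] + [float("inf")] * n   # dp[i]: min words for prefix of length i
    back = [0] * (n + 1)            # back[i]: start index of last word ending at i
    for i in range(1, n + 1):
        for j in range(i):
            word = sentence[j:i]
            if word in dictionary or i - j == 1:  # fallback: single character
                if dp[j] + 1 < dp[i]:
                    dp[i] = dp[j] + 1
                    back[i] = j
    # Recover the segmentation by following the back-pointers.
    words, i = [], n
    while i > 0:
        words.append(sentence[back[i]:i])
        i = back[i]
    return words[::-1]

# Toy example with a hypothetical dictionary.
vocab = {"研究", "研究生", "生命", "命", "的", "起源"}
print(shortest_path_segment("研究生命的起源", vocab))
# → ['研究', '生命', '的', '起源']
```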
We develop with Python 3.7 on Windows 10; our code also runs successfully on Linux.
The only special package we use is pyahocorasick. Run the following command to install it. Note that building it requires a C++ toolchain; on Windows, install the Microsoft Visual C++ Build Tools first.
pip install pyahocorasick
To get started, clone the repository:
git clone https://github.com/volgachen/Chinese-Tokenization
Run
python test_exp1.py --use-re --score Markov
to experiment with the rule-based algorithms. You can remove --use-re
if you do not want re-replacement (regular-expression replacement). --score
can be chosen from None, Markov, and HMM, which decides whether we combine the tokenizer with a language model.
Run
python test_exp2.py
to see results with the simple 2-gram language model; results on several datasets will be shown. You can run python test_exp2.py train
or python test_exp2.py test
to run the evaluation on the training set or on the test set, respectively.
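The 2-gram scoring can be illustrated with a minimal bigram model. This sketch uses made-up counts and add-one smoothing and is not the project's implementation; it only shows how a bigram score can prefer one candidate segmentation over another.

```python
import math
from collections import Counter

def train_bigram(segmented_sentences):
    """Count unigrams and bigrams over word-segmented training sentences."""
    uni, bi = Counter(), Counter()
    for words in segmented_sentences:
        padded = ["<s>"] + words + ["</s>"]
        uni.update(padded)
        bi.update(zip(padded, padded[1:]))
    return uni, bi

def sentence_logprob(words, uni, bi, vocab_size):
    """Add-one smoothed log probability of a candidate segmentation."""
    padded = ["<s>"] + words + ["</s>"]
    lp = 0.0
    for prev, cur in zip(padded, padded[1:]):
        lp += math.log((bi[(prev, cur)] + 1) / (uni[prev] + vocab_size))
    return lp

# Tiny hypothetical training corpus of segmented sentences.
corpus = [["研究", "生命", "的", "起源"], ["生命", "的", "研究"]]
uni, bi = train_bigram(corpus)
V = len(uni)

# The segmentation seen in training scores higher than an unseen one.
a = sentence_logprob(["研究", "生命"], uni, bi, V)
b = sentence_logprob(["研究生", "命"], uni, bi, V)
print(a > b)
# → True
```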
Run
python BMES_exps/BMES.py
to run the experiments on character-based tagging, covering different train/validation splits and runs with and without re-replacement. You need to first run python BMES_exps/convert_BMES.py
to generate the organized corpus for this experiment.
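In the BMES scheme, segmentation is recast as per-character tagging: B marks the beginning of a multi-character word, M its middle, E its end, and S a single-character word. The conversion from segmented words to tags can be sketched as follows (an illustration, not necessarily what convert_BMES.py does):

```python
def words_to_bmes(words):
    """Convert a segmented sentence into (character, BMES tag) pairs."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append((w, "S"))          # single-character word
        else:
            tags.append((w[0], "B"))       # word beginning
            tags.extend((c, "M") for c in w[1:-1])  # word middle
            tags.append((w[-1], "E"))      # word end
    return tags

print(words_to_bmes(["研究", "生命", "的", "起源"]))
# → [('研', 'B'), ('究', 'E'), ('生', 'B'), ('命', 'E'), ('的', 'S'), ('起', 'B'), ('源', 'E')]
```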