Experiments in Chinese Historical Phonology using matrix decomposition and factorization methods.
We use Python to prepare our data. The following packages are required:
- pandas
- numpy
- cjklib
- vPhon, a Vietnamese phonetizer: clone it to your local directory /path/to/vPhon
- fancyimpute: install it from its GitHub repository
In addition to cjklib, the Unihan Database is used. The latest Unihan.zip can be downloaded from https://www.unicode.org/Public/UCD/. Unzip it to /path/to/Unihan.
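As a rough illustration of what the unzipped Unihan files contain, the sketch below parses a few lines in the Unihan text format (tab-separated code point, field name, and value, with `#` marking comment lines). The helper name `parse_unihan_lines` and the in-memory sample lines are assumptions made for this example; they are not taken from ChnHistPhon_1_data_preparation.py, which may read different fields or use cjklib instead.

```python
from collections import defaultdict


def parse_unihan_lines(lines):
    """Collect {character: {field: value}} from Unihan-style text lines.

    Hypothetical helper for illustration only, not part of this repository.
    """
    table = defaultdict(dict)
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        # Each data line is: code point <TAB> field <TAB> value
        codepoint, field, value = line.split("\t", 2)
        char = chr(int(codepoint[2:], 16))  # "U+4E00" -> the character itself
        table[char][field] = value
    return dict(table)


# Illustrative sample lines in the Unihan format (assumed, not read from disk).
sample = [
    "# comment line",
    "U+4E00\tkMandarin\tyī",
    "U+4E00\tkCantonese\tjat1",
]
readings = parse_unihan_lines(sample)
print(readings)
```

To apply the same idea to the real data, one could pass an open file handle from /path/to/Unihan to the helper instead of the sample list.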
Once you have cloned this repository to your local /path/to/ChnHistPhon, you can run
python /path/to/ChnHistPhon/ChnHistPhon_1_data_preparation.py
which will create ChnCharData.csv, the dataset of Chinese characters we need, in /path/to/ChnHistPhon/results.
We used softImpute (Mazumder et al., 2010) to complete the data matrix in ChnCharData.csv, followed by dictionary learning and sparse coding, in ChnHistPhon_2_run_SoftImpute_DictionaryLearning.py.
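To give a feel for the matrix completion step, here is a minimal NumPy sketch of the soft-impute idea: missing cells are repeatedly refilled from a singular value decomposition whose singular values have been shrunk toward zero. This is a simplified illustration, not the code in ChnHistPhon_2_run_SoftImpute_DictionaryLearning.py and not fancyimpute's implementation. The function name `soft_impute`, the `shrinkage` value, and the toy matrix are all assumptions for the example, and the dictionary learning and sparse coding step is not covered.

```python
import numpy as np


def soft_impute(X, shrinkage, max_iters=100, tol=1e-5):
    """Fill NaN entries of X by iterating a soft-thresholded SVD.

    Simplified sketch for illustration; observed entries are kept as given.
    """
    observed = ~np.isnan(X)
    Z = np.where(observed, X, 0.0)  # start with zeros in the missing cells
    for _ in range(max_iters):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        s = np.maximum(s - shrinkage, 0.0)  # shrink the singular values
        low_rank = (U * s) @ Vt
        # Keep observed cells, refill missing cells from the low-rank estimate.
        Z_new = np.where(observed, X, low_rank)
        change = np.linalg.norm(Z_new - Z) / max(np.linalg.norm(Z), 1e-12)
        Z = Z_new
        if change < tol:
            break
    return Z


# Toy example (assumed data): a rank-1 matrix with three cells hidden.
rng = np.random.default_rng(0)
truth = np.outer(rng.normal(size=8), rng.normal(size=6))
X = truth.copy()
X[1, 2] = X[4, 0] = X[6, 5] = np.nan
completed = soft_impute(X, shrinkage=0.01)
print(np.abs(completed - truth).max())
```

A larger `shrinkage` pushes the estimate toward lower rank; in practice the value would be tuned on held-out observed entries.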
The results can be viewed here.