mondain-dev/chn-hist-phon

ChnHistPhon - Chinese Historical Phonology

Experiments in Chinese Historical Phonology using matrix decomposition and factorization methods.

Prerequisites

We use Python to prepare our data. The following packages are required:

In addition to cjklib, the Unihan Database is used. The latest Unihan.zip can be downloaded from https://www.unicode.org/Public/UCD/. Unzip it to /path/to/Unihan.
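To check that the unzipped Unihan files are readable, a minimal sketch like the following may help. It assumes only the standard Unihan text layout (tab-separated code point, field name, and value, with `#` comment lines). The function name `parse_unihan_lines` and the choice of fields are hypothetical and are not taken from this repository's scripts.

```python
def parse_unihan_lines(lines, wanted_fields):
    """Collect selected Unihan fields per character.

    Hypothetical helper for illustration; not part of this repository.
    Each data line has the form: U+XXXX<TAB>field<TAB>value
    """
    readings = {}
    for line in lines:
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        codepoint, field, value = line.split("\t", 2)
        if field not in wanted_fields:
            continue
        # "U+4E00" -> the character with that code point.
        char = chr(int(codepoint[2:], 16))
        readings.setdefault(char, {})[field] = value
    return readings


# Small in-memory sample in the Unihan_Readings.txt layout.
sample = [
    "# sample lines for illustration",
    "U+4E00\tkMandarin\tyī",
    "U+4E00\tkCantonese\tjat1",
    "U+4E8C\tkMandarin\tèr",
]
readings = parse_unihan_lines(sample, {"kMandarin", "kCantonese"})
```

To read a real file, the same function can be given an open file handle, for example one opened on a readings file under /path/to/Unihan with `encoding="utf-8"`.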

Running experiments

Prepare data

Once you have cloned this repository to your local /path/to/ChnHistPhon, you can run

python /path/to/ChnHistPhon/ChnHistPhon_1_data_preparation.py

which will create ChnCharData.csv, the dataset of Chinese characters we need, in /path/to/ChnHistPhon/results.

Perform low-rank SVD

We used softImpute (Mazumder et al., 2010) to complete the data matrix in ChnCharData.csv, followed by dictionary learning and sparse coding, in ChnHistPhon_2_run_SoftImpute_DictionaryLearning.py.
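For readers unfamiliar with the matrix-completion step, here is a minimal NumPy sketch of the soft-impute idea: repeatedly fill the missing entries with the current estimate, take an SVD, and soft-threshold the singular values. It is an illustration under stated assumptions, not the code in ChnHistPhon_2_run_SoftImpute_DictionaryLearning.py. The function name `soft_impute`, the parameters `lam`, `max_iter`, and `tol`, and the toy matrix are all made up for this example.

```python
import numpy as np


def soft_impute(X, lam=1.0, max_iter=100, tol=1e-5):
    """Fill NaN entries of X by iterative soft-thresholded SVD.

    Illustrative sketch only; names and defaults are assumptions.
    """
    mask = ~np.isnan(X)                 # True where an entry is observed
    X_obs = np.where(mask, X, 0.0)      # observed values, zeros elsewhere
    Z = np.zeros_like(X_obs)            # current low-rank estimate
    for _ in range(max_iter):
        # Keep observed entries, use the current estimate for missing ones.
        filled = np.where(mask, X_obs, Z)
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        # Soft-threshold the singular values (shrink toward low rank).
        s = np.maximum(s - lam, 0.0)
        Z_new = (U * s) @ Vt
        change = np.linalg.norm(Z_new - Z) / max(np.linalg.norm(Z), 1e-12)
        Z = Z_new
        if change < tol:
            break
    # Observed entries are returned unchanged; missing ones come from Z.
    return np.where(mask, X_obs, Z)


# Toy example: a rank-1 matrix with two entries removed.
truth = np.outer([1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 2.0, 3.0, 4.0])
X = truth.copy()
X[0, 0] = np.nan
X[2, 3] = np.nan
completed = soft_impute(X, lam=0.1, max_iter=500)
```

The completed matrix can then be passed to a dictionary learning routine to obtain a dictionary and sparse codes; that step is omitted here because its settings are not described in this README.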

Results

The results can be viewed here.