ETL tool for Unicode's Han Unification (UNIHAN) database releases. unihan-etl retrieves (downloads), extracts (unzips), and transforms the database from Unicode's website to a flat, tabular or structured, tree-like format.
unihan-etl can be used as a python library through its API, to retrieve data as a python object, or through the CLI to retrieve a CSV, JSON, or YAML file.
Part of the cihai project. Similar project: libUnihan.
UNIHAN Version compatibility (as of unihan-etl v0.10.0): 11.0.0 (released 2018-05-08, revision 25).
UNIHAN's data is dispersed across multiple files in the format of:
U+3400 kCantonese jau1
U+3400 kDefinition (same as U+4E18 丘) hillock or mound
U+3400 kMandarin qiū
U+3401 kCantonese tim2
U+3401 kDefinition to lick; to taste, a mat, bamboo bark
U+3401 kHanyuPinyin 10019.020:tiàn
U+3401 kMandarin tiàn
Values vary in shape and structure depending on their field type.
kHanyuPinyin maps Unicode codepoints to
Hànyǔ Dà Zìdiǎn, where 10019.020:tiàn
represents
an entry. Complicating it further, more variations:
U+5EFE kHanyuPinyin 10513.110,10514.010,10514.020:gǒng
U+5364 kHanyuPinyin 10093.130:xī,lǔ 74609.020:lǔ,xī
kHanyuPinyin supports multiple entries delimited by spaces. ":" (colon) separate locations in the work from pinyin readings. "," (comma) separate multiple entries/readings. This is just one of 90 fields contained in the database.
$ unihan-etl
char,ucn,kCantonese,kDefinition,kHanyuPinyin,kMandarin
㐀,U+3400,jau1,(same as U+4E18 丘) hillock or mound,,qiū
㐁,U+3401,tim2,"to lick; to taste, a mat, bamboo bark",10019.020:tiàn,tiàn
To preview in the CLI, try tabview or csvlens.
$ unihan-etl -F json --no-expand
[
{
"char": "㐀",
"ucn": "U+3400",
"kDefinition": "(same as U+4E18 丘) hillock or mound",
"kCantonese": "jau1",
"kHanyuPinyin": null,
"kMandarin": "qiū"
},
{
"char": "㐁",
"ucn": "U+3401",
"kDefinition": "to lick; to taste, a mat, bamboo bark",
"kCantonese": "tim2",
"kHanyuPinyin": "10019.020:tiàn",
"kMandarin": "tiàn"
}
]
Tools:
$ unihan-etl -F yaml --no-expand
- char: 㐀
kCantonese: jau1
kDefinition: (same as U+4E18 丘) hillock or mound
kHanyuPinyin: null
kMandarin: qiū
ucn: U+3400
- char: 㐁
kCantonese: tim2
kDefinition: to lick; to taste, a mat, bamboo bark
kHanyuPinyin: 10019.020:tiàn
kMandarin: tiàn
ucn: U+3401
Filter via the CLI with yq.
Codepoints can pack a lot more detail, unihan-etl carefully extracts these values in a uniform manner. Empty values are pruned.
To make this possible, unihan-etl exports to JSON, YAML, and python list/dicts.
Why not CSV?
Unfortunately, CSV is only suitable for storing table-like information. File formats such as JSON and YAML accept key-values and hierarchical entries.
$ unihan-etl -F json
[
{
"char": "㐀",
"ucn": "U+3400",
"kDefinition": ["(same as U+4E18 丘) hillock or mound"],
"kCantonese": ["jau1"],
"kMandarin": {
"zh-Hans": "qiū",
"zh-Hant": "qiū"
}
},
{
"char": "㐁",
"ucn": "U+3401",
"kDefinition": ["to lick", "to taste, a mat, bamboo bark"],
"kCantonese": ["tim2"],
"kHanyuPinyin": [
{
"locations": [
{
"volume": 1,
"page": 19,
"character": 2,
"virtual": 0
}
],
"readings": ["tiàn"]
}
],
"kMandarin": {
"zh-Hans": "tiàn",
"zh-Hant": "tiàn"
}
}
]
$ unihan-etl -F yaml
- char: 㐀
kCantonese:
- jau1
kDefinition:
- (same as U+4E18 丘) hillock or mound
kMandarin:
zh-Hans: qiū
zh-Hant: qiū
ucn: U+3400
- char: 㐁
kCantonese:
- tim2
kDefinition:
- to lick
- to taste, a mat, bamboo bark
kHanyuPinyin:
- locations:
- character: 2
page: 19
virtual: 0
volume: 1
readings:
- tiàn
kMandarin:
zh-Hans: tiàn
zh-Hant: tiàn
ucn: U+3401
- automatically downloads UNIHAN from the internet
- strives for accuracy with the specifications described in UNIHAN's database design
- export to JSON, CSV and YAML (requires pyyaml) via
-F
- configurable to export specific fields via
-f
- accounts for encoding conflicts due to the Unicode-heavy content
- designed as a technical proof for future CJK (Chinese, Japanese, Korean) datasets
- core component and dependency of cihai, a CJK library
- data package support
- expansion of multi-value delimited fields in YAML, JSON and python dictionaries
- supports >= 3.7 and pypy
If you encounter a problem or have a question, please create an issue.
To download and build your own UNIHAN export:
$ pip install --user unihan-etl
or by pipx:
$ pipx install unihan-etl
pip:
$ pip install --user --upgrade --pre unihan-etl
pipx:
$ pipx install --suffix=@next 'unihan-etl' --pip-args '\--pre' --force
// Usage: unihan-etl@next load yoursession
unihan-etl
offers customizable builds via its command line arguments.
See unihan-etl CLI arguments for information on how you can specify columns, files, download URL's, and output destination.
To output CSV, the default format:
$ unihan-etl
To output JSON:
$ unihan-etl -F json
To output YAML:
$ pip install --user pyyaml
$ unihan-etl -F yaml
To only output the kDefinition field in a csv:
$ unihan-etl -f kDefinition
To output multiple fields, separate with spaces:
$ unihan-etl -f kCantonese kDefinition
To output to a custom file:
$ unihan-etl --destination ./exported.csv
To output to a custom file (templated file extension):
$ unihan-etl --destination ./exported.{ext}
See unihan-etl CLI arguments for advanced usage examples.
# cache dir (Unihan.zip is downloaded, contents extracted)
{XDG cache dir}/unihan_etl/
# output dir
{XDG data dir}/unihan_etl/
unihan.json
unihan.csv
unihan.yaml # (requires pyyaml)
# package dir
unihan_etl/
process.py # argparse, download, extract, transform UNIHAN's data
constants.py # immutable data vars (field to filename mappings, etc)
expansion.py # extracting details baked inside of fields
util.py # utility / helper functions
# test suite
tests/*
$ git clone https://github.com/cihai/unihan-etl.git
$ cd unihan-etl
Bootstrap your environment and learn more about contributing. We use the same conventions / tools across all cihai projects: pytest
, sphinx
, flake8
, mypy
, black
, isort
, tmuxp
, and file watcher helpers (e.g. entr(1)
).