GitHub - addohm/CEDICT_Parser: Python files for parsing chinese dictionary CEDICT

addohm / CEDICT_Parser Public

forked from zkoch/CEDICT_Parser

Notifications You must be signed in to change notification settings
Fork 0
Star 1

Python files for parsing chinese dictionary CEDICT

Notifications

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
dictionaries		dictionaries
resources		resources
.gitignore		.gitignore
README.txt		README.txt
parser.py		parser.py
pinyin.py		pinyin.py

Repository files navigation

CC-CEDICT is an English Chinese dictionary freely available for use in applications and other such things. It can be downloaded here: http://www.mdbg.net/chindict/chindict.php?page=cc-cedict

The dictionary contains around 100k lines, and they all follow the same order:

TRADITIONAL_CHARS SIMPLIFIED_CHARS [PINYIN] /DEF 1/DEF 2

The biggest issue with this dictionary, however, is that the pinyin comes in the form: zhong1 guo2. That's not all that helpful for people, so we need to convert those into letters with actual tones marks:  Zhōngguó

This code was originally writhed for use in converting the dictionary into a JS object for use in another project, but you should be able to extract what you need for your own purposes.

File Overview:

pinyin.py -> Within is a method that takes a string of pinyin and tone marks (e.g. zhong1 guo2) and converts to actual tone marks

parser.py -> Reads the dictionary, parses out the simplified chars, pinyin, and definitions, uses pinyin.py to convert the pinyin, and then puts the resulting dictionaries into an array.

GOALS:

- write cleaner code
- genericize it so people can use it easily
- do some command line wizardry maybe