Xin chào các bạn!
I'm in the process of learning Vietnamese, and something I want to focus on is expanding my vocabulary! Vietnamese has quite a few differences from English, but an important one is the low morpheme to word ratio. Vietnamese creates more complicated words (or words in different tenses) through creating compound words. When you learn a single morpheme word in Vietnamese, you can usually build on that word to create more complex words around the same idea. This project utilizes Python and Obsidian to visualize these compound words to improve vocabulary while learning the language.
To see how this project was created, and get a glimpse of my problem-solving method, check out the video here:
This Python project takes a Vietnamese to English dictionary in the form of an XML file. It manipulates the XML data into a Pandas DataFrame, and then passes it to a search function. There are two search functions in the project:
reverse_search_method Reverse search method receives a DataFrame of words, then takes each word and splits it into morphemes and reverses the list. It iterates the morphemes, creates a word with a morpheme length equal to the iterated index, then reverses the word to put it back in proper order. This sub-word is then added to the connection list for the main word. This method is deprecated, in favor of dict_match_search_method, as it creates false positives in some words, linking to words that are not in the dictionary (i.e. morphemes used to add meaning to a compound word, but which are not words in their own right) and misses some connections to morpheme combinations created by using the center morphemes of a four (or higher) morpheme word.
dict_match_search_method Dictionary match search method receives a DataFrame of words and iterates it. For each word, it creates an array of all the possible sub-words that could exist in it. For instance, the word sinh hóa học (biochemistry) consists of five possible sub-words:
- One Morpheme
- sinh
- hóa
- học
- Two Morphemes
- sinh hóa
- hóa học
These words are compared against a filtered dictionary DataFrame containing only words with the same morpheme length as the sub-word. If a word exists in the dictionary a connection is made between the root word and the sub-word. This method is slightly slower than reverse_search_method, but ensures 100% connection with existing words from the dictionary, and removes all dead links.
After running the script, 23000 markdown files will be outputted, which can then be imported into an Obsidian Vault. This will result in a graph view like the following:
You can use this project to visualize Vietnamese as well! Required Software:
- Obsidian
- Python
- source.txt
Instructions:
- Clone this repository to your computer
- run
python main.py
- Open Obsidian
- Click Open another vault (bottom of left navigation bar)
- Click Open folder as Vault
- Select the folder ObsidianVault in this repository
Obsidian will take some time to index and import this vault. That time is greatly decreased by not opening the Graph View until the indexing is complete.