The LSTM model keeps "อ็อกซ์ฟอร์ด" ("Oxford") as a single token, whereas the Dictionary model incorrectly splits it into several pieces, roughly equivalent to "O xf ord".
At the start of the sentence, the Dictionary model correctly identifies "อ้างอิง (reference) จาก (from) พจนานุกรม (dictionary)". The LSTM model splits "อ้างอิง" into two words; a single word would be preferable, although it is arguably a compound word.
Neither model keeps "อัสสัม" ("Assam") as a single token.
I prefer the Dictionary result, "ดิน แดน อัส สัม", over the LSTM result, "ดินแดนอัส สัม".
The differences in the first part of the sentence appear to be further disagreements over what counts as a compound word. The Dictionary model keeps "ทำให้" as a single token, while the LSTM model keeps "สูญเสีย" as a single token. Ideally, I think both should be single tokens.
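To make comparisons like the ones above easier to reproduce, here is a minimal sketch that turns two space-separated segmentations of the same text into sets of boundary offsets and reports where they disagree. The helper names `boundaries` and `compare` are hypothetical and not part of ICU4X or this repository, and the sketch assumes that the second "Assam" string quoted above is the LSTM output. Offsets are counted in Unicode code points, not grapheme clusters.

```python
def boundaries(segmented: str) -> set:
    """Return the code point offsets at which tokens end.

    `segmented` is a segmentation written with spaces between tokens,
    e.g. copied from the demo page output (assumed format).
    """
    offsets = set()
    pos = 0
    for token in segmented.split():
        pos += len(token)
        offsets.add(pos)
    return offsets


def compare(seg_a: str, seg_b: str) -> dict:
    """Compare two segmentations of the same underlying text."""
    # Both inputs must spell the same text once the spaces are removed.
    if seg_a.replace(" ", "") != seg_b.replace(" ", ""):
        raise ValueError("segmentations do not cover the same text")
    a, b = boundaries(seg_a), boundaries(seg_b)
    return {
        "shared": sorted(a & b),   # boundaries both models agree on
        "only_a": sorted(a - b),   # boundaries only in the first input
        "only_b": sorted(b - a),   # boundaries only in the second input
    }


# Strings taken from the "Land of Assam" example in this thread.
dictionary_output = "ดิน แดน อัส สัม"
lstm_output = "ดินแดนอัส สัม"  # assumed to be the LSTM result

result = compare(dictionary_output, lstm_output)
print(result)
```

For this example, the extra boundaries reported under `only_a` are the ones the Dictionary model places inside "ดินแดน", which is the disagreement described above.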
I have made the changes in the README.md file. Please review them, and if they don't meet the requirement, please explain what else needs to be done, as I would like to contribute to this.
My PR: #35
The README contains quantitative results, but it would be good to also include some examples.
The two models can be used on this web page: https://unicode-org.github.io/icu4x/wasm-demo/
For Thai, here are some: (internal reference: b/180649116)
1. Oxford
2. Kingdom of Kushan
3. Impact
4. Land of Assam