Add specific examples of dictionary/LSTM diffs #25

sffc · 2023-12-15T18:25:50Z

The README contains quantitative results, but it would be good to also include some examples.

The two models can be used on this web page: https://unicode-org.github.io/icu4x/wasm-demo/

For Thai, here are some: (internal reference: b/180649116)

1. Oxford

| Model | Text |
| --- | --- |
| Dictionary | . อ้างอิง . จาก . พจนานุกรม . ภาษา . อังกฤษ . ของ . อ็อก . ซ์ . ฟอร์ด . |
| LSTM | . อ้าง . อิง . จากพจนานุกรม . ภาษา . อังกฤษ . ของ . อ็อกซ์ฟอร์ด . |
| Translation | According to the Oxford English Dictionary |

LSTM keeps "อ็อกซ์ฟอร์ด" ("Oxford") as a single token, whereas Dictionary incorrectly splits it into three pieces, roughly like segmenting "Oxford" as "O xf ord".
At the start of the sentence, Dictionary correctly identifies "อ้างอิง (reference) จาก (from) พจนานุกรม (dictionary)". LSTM splits อ้างอิง into two words and merges จาก and พจนานุกรม into one token. อ้างอิง should preferably be one word, although it is arguably a compound word.
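
To make comparisons like this one repeatable, here is a minimal sketch that turns two demo-style strings into break offsets and reports where the models disagree. The helper names (`parse_segments`, `boundaries`, `diff_boundaries`) are hypothetical and not part of ICU4X, and the sketch assumes that tokens never contain a literal "." and that offsets are counted in code points.

```python
def parse_segments(line: str) -> list[str]:
    """Split a string such as '. ab . cd .' into its tokens.

    Assumption: '.' only ever marks a break, never appears inside a token.
    """
    return [tok.strip() for tok in line.split(".") if tok.strip()]


def boundaries(tokens: list[str]) -> set[int]:
    """Return the code point offset at the end of every token."""
    offsets = set()
    pos = 0
    for tok in tokens:
        pos += len(tok)
        offsets.add(pos)
    return offsets


def diff_boundaries(dictionary: str, lstm: str) -> tuple[set[int], set[int]]:
    """Return (breaks only in Dictionary, breaks only in LSTM)."""
    d = boundaries(parse_segments(dictionary))
    l = boundaries(parse_segments(lstm))
    return d - l, l - d


if __name__ == "__main__":
    # Strings copied from the Oxford example above.
    dictionary = ". อ้างอิง . จาก . พจนานุกรม . ภาษา . อังกฤษ . ของ . อ็อก . ซ์ . ฟอร์ด ."
    lstm = ". อ้าง . อิง . จากพจนานุกรม . ภาษา . อังกฤษ . ของ . อ็อกซ์ฟอร์ด ."
    only_dictionary, only_lstm = diff_boundaries(dictionary, lstm)
    print("breaks only in Dictionary:", sorted(only_dictionary))
    print("breaks only in LSTM:", sorted(only_lstm))
```

Comparing offsets rather than token lists keeps the diff local: a single extra break shows up as one offset instead of shifting every later token.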

2. Kingdom of Kushan

| Model | Text |
| --- | --- |
| Dictionary | . กษัตริย์ . ที่ . ปกครอง . อาณาจักร . กุ . ษา . ณะ . |
| LSTM | . กษัตริย์ . ที่ . ปกครอง . อาณาจักร . กุษาณะ . |
| Translation | The king who ruled the Kingdom of Kushan |

The expected segmentation is "อาณาจักร กุษาณะ" ("Kingdom (of) Kushan"). LSTM gets this right, while Dictionary splits กุษาณะ into three pieces. It is arguable, though, that อาณาจักร is itself a compound word of อาณา + จักร.

3. Impact

| Model | Text |
| --- | --- |
| Dictionary | . ซึ่ง . จัด . ขึ้น . ที่ . ศูนย์ . แสดง . สินค้า . และ . การ . ประชุม . อิมแ . พค . |
| LSTM | . ซึ่ง . จัด . ขึ้น . ที่ . ศูนย์ . แสดง . สินค้า . และ . การ . ประชุม . อิมแพค . |
| Translation | Literally "Center (for) showing products and meetings, Impact", i.e. the "Impact Exhibition and Convention Center", where Impact is a name borrowed from English. |

LSTM identifies "อิมแพค" ("Impact") as a single borrowed word, whereas Dictionary splits it in two.

4. Land of Assam

| Model | Text |
| --- | --- |
| Dictionary | . ทำให้ . พม่า . ต้อง . สูญ . เสีย . ดิน . แดน . อัส . สัม . |
| LSTM | . ทำ . ให้ . พม่า . ต้อง . สูญเสีย . ดินแดนอัส . สัม . |
| Translation | Causing Burmese to lose land (of) Assam |

Neither model keeps "อัสสัม" ("Assam") as a single token.
I prefer the Dictionary result "ดิน แดน อัส สัม" over the LSTM result "ดินแดนอัส สัม".
The differences in the first part of the sentence appear to be further disagreements over what counts as a compound word: Dictionary keeps "ทำให้" as a single token, while LSTM keeps "สูญเสีย" as a single token. Ideally, I think both would be single tokens.
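
For checks such as "does a model keep Assam as one token", a small sketch like the following can list which tokens cover a given word. The function names are hypothetical, the code is independent of ICU4X, and it assumes that "." only marks breaks and that the word occurs once in the sentence (only the first occurrence is inspected).

```python
def parse_segments(line: str) -> list[str]:
    """Split a demo-style string such as '. ab . cd .' into its tokens."""
    return [tok.strip() for tok in line.split(".") if tok.strip()]


def covering_tokens(tokens: list[str], word: str) -> list[str]:
    """Return the tokens that overlap the first occurrence of `word`.

    A result of exactly [word] means the word survived as a single token.
    An empty list means the word does not occur in the joined text.
    """
    text = "".join(tokens)
    start = text.find(word)
    if start < 0:
        return []
    end = start + len(word)
    covered = []
    pos = 0
    for tok in tokens:
        tok_start, tok_end = pos, pos + len(tok)
        # Keep the token if its span [tok_start, tok_end) intersects [start, end).
        if tok_start < end and tok_end > start:
            covered.append(tok)
        pos = tok_end
    return covered


if __name__ == "__main__":
    # Strings copied from the Assam example above.
    outputs = {
        "Dictionary": ". ทำให้ . พม่า . ต้อง . สูญ . เสีย . ดิน . แดน . อัส . สัม .",
        "LSTM": ". ทำ . ให้ . พม่า . ต้อง . สูญเสีย . ดินแดนอัส . สัม .",
    }
    for model, line in outputs.items():
        print(model, covering_tokens(parse_segments(line), "อัสสัม"))
```

Returning the covering tokens, rather than a yes/no answer, also surfaces cases where a break is missing in front of the word, as in the LSTM output here.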

5usu · 2024-11-07T19:54:34Z

I have made the changes to the README.md file in my PR #35. Please review them, and if they don't meet the requirements, please explain what else is needed, as I would like to contribute to this.

sffc mentioned this issue Dec 15, 2023

Please consider an alternative mitigation mechanism to word segmention WICG/scroll-to-text-fragment#251

Open
