Added text romanization logic and proper docstrings
ranzaka committed Jun 21, 2024
1 parent b995ba4 commit a5f4010
Showing 16 changed files with 3,119 additions and 193 deletions.
3 changes: 3 additions & 0 deletions .idea/.gitignore

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

6 changes: 6 additions & 0 deletions .idea/inspectionProfiles/profiles_settings.xml

10 changes: 10 additions & 0 deletions .idea/misc.xml

8 changes: 8 additions & 0 deletions .idea/modules.xml

14 changes: 14 additions & 0 deletions .idea/sinlib.iml

6 changes: 6 additions & 0 deletions .idea/vcs.xml

19 changes: 16 additions & 3 deletions README.md
@@ -1,4 +1,4 @@
# Sinlib (Buggy alpha version)
# Sinlib

![Alt text](sinlib.png)

@@ -29,14 +29,27 @@ encoding = tokenizer("මේ අතර, පෙබරවාරි මාසයේ
[tokenizer.token_id_to_token_map[id] for id in encoding]
['මේ', ' ', '', '', '', ',', ' ', 'පෙ', '', '', 'වා', 'රි', ' ', 'මා', '', 'යේ', ' ', '', '', 'මු']
```

02. Preprocessor
```python
sent = ['මෙය සිංහල වාක්‍යක්', 'මෙය සිංහල වාක්‍යක් සමග english character කීපයක්','This is complete english sentence']
print(sent)
# ['මෙය සිංහල වාක්\u200dයක්', 'මෙය සිංහල වාක්\u200dයක් සමග english character කීපයක්', 'This is complete english sentence']

from sinlib.preprocessing import get_sinhala_character_ratio

get_sinhala_character_ratio(sent)
# [0.9, 0.46875, 0.0]
```
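
To give a feel for what a "Sinhala character ratio" could mean, here is a minimal, self-contained sketch. It is not sinlib's implementation: the function name `sinhala_ratio` is hypothetical, and the counting rule (characters in the Sinhala Unicode block U+0D80 to U+0DFF divided by all non-whitespace characters) is an assumption, so its numbers need not match the library output shown above.

```python
def sinhala_ratio(text: str) -> float:
    """Share of non-whitespace characters that fall in the Sinhala Unicode block.

    Hypothetical helper for illustration; sinlib may count differently.
    """
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0  # avoid division by zero on empty or whitespace-only input
    sinhala = sum(1 for ch in chars if "\u0d80" <= ch <= "\u0dff")
    return sinhala / len(chars)


samples = [
    "This is complete english sentence",  # no Sinhala characters
    "\u0db8\u0dda",                       # two Sinhala code points
    "\u0db8\u0ddaab",                     # two Sinhala, two Latin
]
print([sinhala_ratio(s) for s in samples])
# [0.0, 1.0, 0.5]
```

A ratio like this is typically used as a filter, for example keeping only sentences above a chosen threshold when building a Sinhala corpus.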

03. Sinhala Romanizer
```python
texts = ["hello, මේ මාසයේ ගත වූ දින 15ක කාලය තුළ කොළඹ නගරය ආශ්‍රිත ව", "මෑතකාලීන ව රට මුහුණ දුන් අභියෝගාත්මකම ආර්ථික කාරණාව ණය ප්‍රතිව්‍යුගතකරණය බව මුදල් රාජ්‍ය අමාත්‍ය ආචාර්ය රංජිත් සියඹ$$$ mahatha see more****"]

from sinlib import Romanizer

romanizer = Romanizer(char_mapper_fp=None, tokenizer_vocab_path=None)
romanizer(texts)
# ['hello, me masaye gatha wu dina 15ka kalaya thula kolaba nagaraya ashritha wa',
#  'methakaleena wa rata muhuna dun abhiyogathmakama arthika karanawa naya prathiwyugathakaranaya bawa mudal rajya amathya acharya ranjith siyaba$$$ mahatha see more****']
```
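
As a rough illustration of character-map romanization, the sketch below handles a handful of Sinhala letters. It is not how `Romanizer` works internally, and the tiny tables here are made up for the example (the contents of `data/char_map.json` are not shown in this commit view). The assumed rule is the usual one for abugida scripts: a consonant carries an inherent "a" unless a vowel sign or the hal mark (al-lakuna) follows it. Anything outside the tables passes through unchanged.

```python
import unicodedata

# Toy tables for illustration only; not taken from sinlib.
CONSONANTS = {"\u0db8": "m", "\u0dc3": "s", "\u0dba": "y"}  # ම, ස, ය
VOWEL_SIGNS = {"\u0dcf": "a", "\u0dda": "e"}                # ා, ේ
HAL = "\u0dca"  # al-lakuna: suppresses the inherent vowel


def romanize_sketch(text: str) -> str:
    """Hypothetical character-level romanizer covering only the toy tables."""
    # NFC so that a decomposed vowel sign (U+0DD9 U+0DCA) becomes U+0DDA
    text = unicodedata.normalize("NFC", text)
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            if nxt in VOWEL_SIGNS:
                out.append(VOWEL_SIGNS[nxt])  # explicit vowel replaces inherent "a"
                i += 1
            elif nxt == HAL:
                i += 1                        # bare consonant, no vowel
            else:
                out.append("a")               # inherent vowel
        else:
            out.append(ch)                    # Latin text, digits, punctuation, unmapped letters
        i += 1
    return "".join(out)


print(romanize_sketch("hello, \u0db8\u0dda \u0db8\u0dcf\u0dc3\u0dba\u0dda"))
# hello, me masaye
```

A real romanizer needs a full consonant and vowel table plus rules for independent vowels and conjuncts, which is presumably what the mapping file and tokenizer vocabulary passed to `Romanizer` provide.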
1 change: 1 addition & 0 deletions data/char_map.json

Large diffs are not rendered by default.
