Commit

Bug fixing
ranzaka committed Sep 3, 2024
1 parent a9b2d80 commit 4d4aa76
Showing 4 changed files with 1,028 additions and 2,282 deletions.
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[project]
name = "sinlib"
-version = "0.1.4"
+version = "0.1.5"
description = "Sinhala NLP Toolkit"
authors = [
{ name = "Ransaka", email = "ransaka.ravihara@gmail.com" }
3 changes: 2 additions & 1 deletion src/sinlib/tokenizer.py
@@ -58,7 +58,8 @@ def __encode(self, text, truncate_and_pad: bool, allowed_special_tokens: list =
text_encodings.append(self.vocab_map[token])
else:
continue
-
+else:
+text_encodings.append(self.vocab_map.get(token, self.unknown_token_id))
if truncate_and_pad:
return self.pad_or_truncate(
sequence=text_encodings,
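
The hunk above adds a fallback so that tokens missing from the vocabulary map to the unknown-token ID instead of being dropped. Since the diff has lost its indentation, the following is only a minimal sketch of that behaviour, assuming the new `else:` pairs with an outer special-token check. The toy `vocab_map`, the `special_tokens` set, and the standalone `encode_sketch` function are hypothetical stand-ins, not sinlib's actual code.

```python
# Hypothetical toy vocabulary; the real sinlib vocab_map is not shown in this diff.
vocab_map = {"a": 0, "b": 1, "<|unk|>": 1015, "<|pad|>": 1016, "<|endoftext|>": 1017}
special_tokens = {"<|unk|>", "<|pad|>", "<|endoftext|>"}  # assumed set of special tokens
unknown_token_id = 1015  # value taken from the updated config.json


def encode_sketch(tokens, allowed_special_tokens=()):
    """Map tokens to IDs, falling back to the unknown-token ID."""
    text_encodings = []
    for token in tokens:
        if token in special_tokens:
            # Special tokens are encoded only when explicitly allowed.
            if token in allowed_special_tokens:
                text_encodings.append(vocab_map[token])
            else:
                continue
        else:
            # Ordinary tokens: out-of-vocabulary entries become <|unk|>.
            text_encodings.append(vocab_map.get(token, unknown_token_id))
    return text_encodings
```

Under these assumptions, `encode_sketch(["a", "z", "b"])` keeps the position of the unseen token `"z"` by emitting the unknown-token ID for it, so the output length matches the input length for ordinary tokens.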
12 changes: 7 additions & 5 deletions src/sinlib/utils/data/config.json
@@ -1,7 +1,9 @@
{
-"unknown_token": "<unk>",
-"pad_token": "<pad>",
-"unknown_token_id": 126,
-"pad_token_id": 127,
-"max_length": 30
+"unknown_token": "<|unk|>",
+"pad_token": "<|pad|>",
+"unknown_token_id": 1015,
+"pad_token_id": 1016,
+"max_length": 256,
+"end_of_text_token": "<|endoftext|>",
+"end_of_text_token_id": 1017
}
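
To illustrate how the new config values could interact with padding, here is a rough sketch that parses the updated JSON and pads or truncates an ID sequence to `max_length`. The helper `pad_or_truncate_sketch` and its parameters are hypothetical; the diff shows only that sinlib's `pad_or_truncate` accepts a `sequence` argument, so the real signature and behaviour may differ.

```python
import json

# The post-commit contents of config.json, copied from the diff above.
CONFIG_JSON = """
{
  "unknown_token": "<|unk|>",
  "pad_token": "<|pad|>",
  "unknown_token_id": 1015,
  "pad_token_id": 1016,
  "max_length": 256,
  "end_of_text_token": "<|endoftext|>",
  "end_of_text_token_id": 1017
}
"""
config = json.loads(CONFIG_JSON)


def pad_or_truncate_sketch(sequence, max_length, pad_token_id):
    """Hypothetical helper: return a list of exactly max_length IDs."""
    if len(sequence) >= max_length:
        return list(sequence[:max_length])  # cut off the overflow
    # Right-pad with the pad token ID up to max_length.
    return list(sequence) + [pad_token_id] * (max_length - len(sequence))


padded = pad_or_truncate_sketch([1, 2, 3], config["max_length"], config["pad_token_id"])
truncated = pad_or_truncate_sketch(list(range(300)), config["max_length"], config["pad_token_id"])
```

Right-padding is an assumption made here for simplicity; the commit does not show which side the library pads on.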