Commit

added script to create tokenizers out of hf datasets
chandralegend committed Jul 21, 2024
1 parent 02e2de7 commit d088808
Showing 7 changed files with 122 additions and 108,894 deletions.
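
The commit message describes a script that builds tokenizers from Hugging Face datasets; the script itself is presumably among the changed files not expanded below. For orientation, here is a minimal sketch of what such a script could look like using the Hugging Face datasets and tokenizers libraries. This is not the repository's actual code; the function name, dataset name, text column, and vocabulary size are illustrative assumptions.

    from datasets import load_dataset
    from tokenizers import Tokenizer, models, pre_tokenizers, trainers


    def train_tokenizer_from_dataset(
        dataset_name: str, text_column: str, output_path: str, vocab_size: int = 30000
    ) -> None:
        """Train a BPE tokenizer on a Hugging Face dataset and save it as JSON."""
        dataset = load_dataset(dataset_name, split="train")

        def batch_iterator(batch_size: int = 1000):
            # Yield batches of raw text so the whole corpus never has to sit in one list.
            for i in range(0, len(dataset), batch_size):
                yield dataset[i : i + batch_size][text_column]

        tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
        tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
        trainer = trainers.BpeTrainer(
            vocab_size=vocab_size,
            special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"],
        )
        tokenizer.train_from_iterator(batch_iterator(), trainer=trainer, length=len(dataset))
        # Saving as JSON matches the multi_tokenizer/pretrained/*_tokenizer.json files in this commit.
        tokenizer.save(output_path)


    if __name__ == "__main__":
        # Hypothetical invocation; the real script's dataset choices and CLI may differ.
        train_tokenizer_from_dataset("some-org/some-spanish-corpus", "text", "spanish_tokenizer.json")
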
1 change: 0 additions & 1 deletion multi_tokenizer/pretrained/chinese_tokenizer.json
@@ -49822,4 +49822,3 @@
 ]
 }
 }
-

1 change: 0 additions & 1 deletion multi_tokenizer/pretrained/spanish_tokenizer.json
@@ -49822,4 +49822,3 @@
 ]
 }
 }
-

7 changes: 6 additions & 1 deletion multi_tokenizer/tokenizer.py
@@ -44,7 +44,12 @@ def pre_tokenize(self, text: str) -> list[tuple[str, tuple[int, int]]]:
     output = (
         [(tokenizer.language_prefix_token, (-1, 0))]
         + output
-        + [(tokenizer.language_suffix_token, (len(detected_text) - 2, len(detected_text) - 1))]
+        + [
+            (
+                tokenizer.language_suffix_token,
+                (len(detected_text) - 2, len(detected_text) - 1),
+            )
+        ]
     )
     # Offsetting the start and end indices of the tokens to match the original text
     output = [
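
For context, the reformatted expression wraps a detected-language span's pre-tokenization output with that language's prefix and suffix marker tokens before the indices are offset back to the original text. A small self-contained illustration of that wrapping step follows; the token strings, text, and spans are made up for the example and are not the repository's actual values.

    # Toy illustration of the wrapping built in pre_tokenize above (values are assumptions).
    language_prefix_token = "<es>"
    language_suffix_token = "</es>"
    detected_text = "hola mundo"

    # (token, (start, end)) pairs produced for the detected span.
    output = [("hola", (0, 4)), ("mundo", (5, 10))]

    output = (
        [(language_prefix_token, (-1, 0))]
        + output
        + [
            (
                language_suffix_token,
                (len(detected_text) - 2, len(detected_text) - 1),
            )
        ]
    )
    print(output)
    # [('<es>', (-1, 0)), ('hola', (0, 4)), ('mundo', (5, 10)), ('</es>', (8, 9))]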