
[FEATURE] Update tokenizers #158

Merged · 21 commits into bigdata-ustc:dev · Mar 14, 2024

Conversation

KINGNEWBLUSH

Description

Update the tokenizer modules to allow tokenizing with NLTK, spaCy, and BPE.
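
A minimal usage sketch of the new backends (an illustration, not code from this PR): it assumes the `get_tokenizer` entry point and `text_params` keys that appear in the test snippets later in this conversation, and the input item is made up.

```python
# Hedged sketch: tokenize plain text with the new NLTK / spaCy backends.
# `get_tokenizer` and the "tokenizer" key come from this PR's test snippets;
# the input item is a placeholder.
from EduNLP.Tokenizer import get_tokenizer

items = ["There are 4 packs left, $25 each. How many are sold?"]
for backend in ["nltk", "spacy"]:
    tokenizer = get_tokenizer("pure_text", text_params={"tokenizer": backend})
    tokens = next(tokenizer(items))  # the tokenizer yields one token list per item
    print(backend, tokens)
```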

Pull request type

  • [DATASET] Add a new dataset
  • [BUGFIX] Bugfix
  • [FEATURE] New feature (non-breaking change which adds functionality)
  • [BREAKING] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [STYLE] Code style update (formatting, renaming)
  • [REFACTOR] Refactoring (no functional changes, no api changes)
  • [BUILD] Build related changes
  • [DOC] Documentation content changes
  • [OTHER] Other (please describe):

Changes

  • update EduNLP/SIF/tokenization/text/tokenization.py
  • update test_tokenizers.py
  • update setup.py

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [FEATURE], [BREAKING], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage and all tests pass
  • Code is well-documented (extended the README / documentation, if necessary)
  • If this PR is your first one, add your name and GitHub account to AUTHORS.md

Comments

  • If this is a backward-incompatible change, explain why it must be made.
  • Interesting edge cases to note here

	modified:   setup.py
	modified:   tests/test_tokenizer/test_tokenizer.py

codecov-commenter commented Mar 10, 2024

Codecov Report

Attention: Patch coverage is 95.74468%, with 2 lines in your changes missing coverage. Please review.

Project coverage is 97.31%. Comparing base (855e250) to head (e86a5e6).

Files | Patch % | Lines
EduNLP/SIF/tokenization/text/tokenization.py | 95.74% | 2 Missing ⚠️


Additional details and impacted files
@@            Coverage Diff             @@
##              dev     #158      +/-   ##
==========================================
- Coverage   97.33%   97.31%   -0.03%     
==========================================
  Files          85       85              
  Lines        4693     4729      +36     
==========================================
+ Hits         4568     4602      +34     
- Misses        125      127       +2     


@nnnyt changed the title from "Update tokenizers" to "[FEATURE] Update tokenizers" on Mar 11, 2024
@nnnyt (Collaborator) left a comment:

Also, please add your name to AUTHORS.md

tests/test_tokenizer/test_tokenizer.py
'$', '4', '$', 'packs', 'left', '$', '25', '$', 'each', 'how', 'many',
'are', 'sold'
]
for tok in ['nltk', 'spacy']:
Collaborator left a comment:
add test for bpe
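
A hedged sketch of what the requested test could look like, mirroring the BPE snippet that appears further down in this conversation; the corpus path and the assertion are placeholders, and the import path for `get_tokenizer` is assumed.

```python
# Sketch of the requested BPE test. The text_params keys are taken from
# the PR's merged snippet; `data_path` and the assertion are assumptions.
from EduNLP.Tokenizer import get_tokenizer

def test_bpe_tokenizer():
    items = ["There are 4 packs left, $25 each. How many are sold?"]
    data_path = "tests/test_tokenizer/bpe_train_corpus.txt"  # placeholder corpus
    tokenizer = get_tokenizer(
        "pure_text",
        text_params={"tokenizer": "bpe", "stopwords": set(",?"),
                     "bpe_trainfile": data_path})
    tokens = tokenizer(items)
    ret = next(tokens)
    assert isinstance(ret, list)
```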

	modified:   tests/test_tokenizer/test_tokenizer.py

elif (tokenizer == 'bpe'):
    try:
        tokenizer = HGTokenizer.from_file('bpeTokenizer.json')
Collaborator left a comment:
change this to a parameter instead of a hard-coded path. Or directly reuse tok_model param

trainer = BpeTrainer(
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=[bpe_trainfile], trainer=trainer)
tokenizer.save('bpeTokenizer.json', pretty=True)
Collaborator left a comment:
same here
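
A minimal sketch of the suggested fix, reusing the `tok_model` parameter named in the review instead of the hard-coded 'bpeTokenizer.json'; the surrounding function signature and the `Whitespace` pre-tokenizer are assumptions.

```python
# Hedged sketch: parameterize the tokenizer path via `tok_model`
# (mentioned in the review) instead of hard-coding 'bpeTokenizer.json'.
from tokenizers import Tokenizer as HGTokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

def load_or_train_bpe(tok_model, bpe_trainfile):
    try:
        # Reuse a previously trained tokenizer saved at the caller's path.
        return HGTokenizer.from_file(tok_model)
    except Exception:
        # Otherwise train a fresh BPE tokenizer and save it to that path.
        tokenizer = HGTokenizer(BPE(unk_token="[UNK]"))
        tokenizer.pre_tokenizer = Whitespace()
        trainer = BpeTrainer(
            special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
        tokenizer.train(files=[bpe_trainfile], trainer=trainer)
        tokenizer.save(tok_model, pretty=True)
        return tokenizer
```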

@nnnyt requested a review from KenelmQLH on March 12, 2024 07:32
tokenizer = get_tokenizer("pure_text", text_params={"tokenizer": 'bpe', "stopwords": set(",?"),
                                                    "bpe_trainfile": data_path})
tokens = tokenizer(items)
ret = next(tokens)
Collaborator left a comment:
Does this support Chinese?

@KenelmQLH merged commit 7abc7d1 into bigdata-ustc:dev on Mar 14, 2024
4 checks passed