
[FEATURE] Update tokenizers #158

Merged · 21 commits into bigdata-ustc:dev · Mar 14, 2024

Conversation

KINGNEWBLUSH

Description

Update the tokenizer modules to allow tokenizing with NLTK, spaCy, and BPE.
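
A minimal usage sketch of the new backends (an illustration, not code from this PR): it assumes the `get_tokenizer` entry point and `text_params` keys that appear in the test snippets later in this conversation, and the input item is made up.

```python
# Hedged sketch: tokenize plain text with the new NLTK / spaCy backends.
# `get_tokenizer` and the "tokenizer" key come from this PR's test snippets;
# the input item is a placeholder.
from EduNLP.Tokenizer import get_tokenizer

items = ["There are 4 packs left, $25 each. How many are sold?"]
for backend in ["nltk", "spacy"]:
    tokenizer = get_tokenizer("pure_text", text_params={"tokenizer": backend})
    tokens = next(tokenizer(items))  # the tokenizer yields one token list per item
    print(backend, tokens)
```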

Pull request type

  • [DATASET] Add a new dataset
  • [BUGFIX] Bugfix
  • [FEATURE] New feature (non-breaking change which adds functionality)
  • [BREAKING] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [STYLE] Code style update (formatting, renaming)
  • [REFACTOR] Refactoring (no functional changes, no api changes)
  • [BUILD] Build related changes
  • [DOC] Documentation content changes
  • [OTHER] Other (please describe):

Changes

  • update EduNLP/SIF/tokenization/text/tokenization.py
  • update test_tokenizers.py
  • update setup.py

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [FEATURE], [BREAKING], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage and all tests pass
  • Code is well-documented (extended the README / documentation, if necessary)
  • If this PR is your first one, add your name and GitHub account to AUTHORS.md

Comments

  • If this is a backward-incompatible change, explain why it must be made.
  • Interesting edge cases to note here

	modified:   setup.py
	modified:   tests/test_tokenizer/test_tokenizer.py

codecov-commenter commented Mar 10, 2024

Codecov Report

Attention: Patch coverage is 95.74468%, with 2 lines in your changes missing coverage. Please review.

Project coverage is 97.31%. Comparing base (855e250) to head (e86a5e6).

Files | Patch % | Lines
EduNLP/SIF/tokenization/text/tokenization.py | 95.74% | 2 Missing ⚠️


Additional details and impacted files
@@            Coverage Diff             @@
##              dev     #158      +/-   ##
==========================================
- Coverage   97.33%   97.31%   -0.03%     
==========================================
  Files          85       85              
  Lines        4693     4729      +36     
==========================================
+ Hits         4568     4602      +34     
- Misses        125      127       +2     


@nnnyt changed the title from "Update tokenizers" to "[FEATURE] Update tokenizers" on Mar 11, 2024
@nnnyt (Collaborator) left a comment:

Also, please add your name to AUTHORS.md

tests/test_tokenizer/test_tokenizer.py
'$', '4', '$', 'packs', 'left', '$', '25', '$', 'each', 'how', 'many',
'are', 'sold'
]
for tok in ['nltk', 'spacy']:
Collaborator left a comment:
add test for bpe
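
A hedged sketch of what the requested test could look like, mirroring the BPE snippet that appears further down in this conversation; the corpus path and the assertion are placeholders, and the import path for `get_tokenizer` is assumed.

```python
# Sketch of the requested BPE test. The text_params keys are taken from
# the PR's merged snippet; `data_path` and the assertion are assumptions.
from EduNLP.Tokenizer import get_tokenizer

def test_bpe_tokenizer():
    items = ["There are 4 packs left, $25 each. How many are sold?"]
    data_path = "tests/test_tokenizer/bpe_train_corpus.txt"  # placeholder corpus
    tokenizer = get_tokenizer(
        "pure_text",
        text_params={"tokenizer": "bpe", "stopwords": set(",?"),
                     "bpe_trainfile": data_path})
    tokens = tokenizer(items)
    ret = next(tokens)
    assert isinstance(ret, list)
```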

	modified:   tests/test_tokenizer/test_tokenizer.py

elif (tokenizer == 'bpe'):
    try:
        tokenizer = HGTokenizer.from_file('bpeTokenizer.json')
Collaborator left a comment:
change this to a parameter instead of a hard-coded path. Or directly reuse tok_model param

trainer = BpeTrainer(
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=[bpe_trainfile], trainer=trainer)
tokenizer.save('bpeTokenizer.json', pretty=True)
Collaborator left a comment:
same here
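
A minimal sketch of the suggested fix, reusing the `tok_model` parameter named in the review instead of the hard-coded 'bpeTokenizer.json'; the surrounding function signature and the `Whitespace` pre-tokenizer are assumptions.

```python
# Hedged sketch: parameterize the tokenizer path via `tok_model`
# (mentioned in the review) instead of hard-coding 'bpeTokenizer.json'.
from tokenizers import Tokenizer as HGTokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

def load_or_train_bpe(tok_model, bpe_trainfile):
    try:
        # Reuse a previously trained tokenizer saved at the caller's path.
        return HGTokenizer.from_file(tok_model)
    except Exception:
        # Otherwise train a fresh BPE tokenizer and save it to that path.
        tokenizer = HGTokenizer(BPE(unk_token="[UNK]"))
        tokenizer.pre_tokenizer = Whitespace()
        trainer = BpeTrainer(
            special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
        tokenizer.train(files=[bpe_trainfile], trainer=trainer)
        tokenizer.save(tok_model, pretty=True)
        return tokenizer
```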

@nnnyt requested a review from KenelmQLH on March 12, 2024 07:32
tokenizer = get_tokenizer("pure_text", text_params={"tokenizer": 'bpe', "stopwords": set(",?"),
                                                    "bpe_trainfile": data_path})
tokens = tokenizer(items)
ret = next(tokens)
Collaborator left a comment:
Does this support Chinese?

@KenelmQLH merged commit 7abc7d1 into bigdata-ustc:dev on Mar 14, 2024
4 checks passed