
[FEATURE] Update D2V, AutoTokenizer, and pretraining scripts #155

Merged: 11 commits into bigdata-ustc:dev on Mar 4, 2024

Conversation

KenelmQLH (Collaborator)

Thanks for sending a pull request!
Please make sure you click the link above to view the contribution guidelines,
then fill out the blanks below.

Description

(Brief description of what this PR is about)

What does this implement/fix? Explain your changes.

...

Pull request type

  • [DATASET] Add a new dataset
  • [BUGFIX] Bugfix
  • [FEATURE] New feature (non-breaking change which adds functionality)
  • [BREAKING] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [STYLE] Code style update (formatting, renaming)
  • [REFACTOR] Refactoring (no functional changes, no api changes)
  • [BUILD] Build related changes
  • [DOC] Documentation content changes
  • [OTHER] Other (please describe):

Changes

  1. Update D2V: support for token vectors (a usage sketch follows this list)
  2. Add AutoTokenizer
  3. Update pretraining scripts for DisenQNet and QuesNet
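
A hypothetical usage sketch for change 1, assuming the new per-token view mirrors gensim's KeyedVectors lookup; the model path and token list are illustrative, not taken from this PR:

from gensim.models import Doc2Vec

# Load a trained Doc2Vec model; "d2v_model.bin" is a placeholder path.
model = Doc2Vec.load("d2v_model.bin")
tokens = ["triangle", "ABC", "angle"]
# Document-level vector: the behavior D2V already had.
doc_vector = model.infer_vector(tokens)
# Per-token vectors: the view this PR presumably wraps.
token_vectors = [model.wv[t] for t in tokens if t in model.wv]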

Does this close any currently open issues?

N/A

Any relevant logs, error output, etc?

N/A

Checklist

Before you submit a pull request, please make sure you have the following:

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [FEATURE], [BREAKING], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage and all tests passing
  • Code is well-documented (extended the README / documentation, if necessary)
  • If this PR is your first one, add your name and github account to AUTHORS.md

Comments

  • If this is a backward-incompatible change, explain why it must be made.
  • Interesting edge cases to note here

@KenelmQLH KenelmQLH added the enhancement New feature or request label Feb 22, 2024
@KenelmQLH KenelmQLH requested a review from nnnyt February 22, 2024 11:46
codecov-commenter commented Feb 22, 2024

Codecov Report

Attention: Patch coverage is 93.27586%, with 39 lines in your changes missing coverage. Please review.

Project coverage is 97.31%. Comparing base (598d788) to head (84b79c7).

Files Patch % Lines
EduNLP/Pretrain/quesnet_vec.py 91.26% 11 Missing ⚠️
EduNLP/Pretrain/disenqnet_vec.py 90.41% 7 Missing ⚠️
EduNLP/I2V/i2v.py 71.42% 6 Missing ⚠️
EduNLP/ModelZoo/hf_model/hf_model.py 96.07% 4 Missing ⚠️
EduNLP/Vector/gensim_vec.py 82.35% 3 Missing ⚠️
EduNLP/SIF/tokenization/formula/ast_token.py 86.66% 2 Missing ⚠️
EduNLP/SIF/tokenization/tokenization.py 71.42% 2 Missing ⚠️
EduNLP/ModelZoo/quesnet/quesnet.py 96.42% 1 Missing ⚠️
EduNLP/Pretrain/elmo_vec.py 95.00% 1 Missing ⚠️
EduNLP/Pretrain/hugginface_utils.py 90.00% 1 Missing ⚠️
... and 1 more


Additional details and impacted files
@@            Coverage Diff             @@
##              dev     #155      +/-   ##
==========================================
- Coverage   97.81%   97.31%   -0.51%     
==========================================
  Files          80       84       +4     
  Lines        4349     4650     +301     
==========================================
+ Hits         4254     4525     +271     
- Misses         95      125      +30     


bert_config = AutoConfig.from_pretrained(pretrained_model_dir)
if init:
    logger.info(f'Load AutoModel from checkpoint: {pretrained_model_dir}')
    self.bert = AutoModel.from_pretrained(pretrained_model_dir)
Collaborator:
Change this to something like self.model? AutoModel should not be constrained to BERT.
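
What that suggestion might look like, sketched; the else branch for init=False is an assumption, not shown in the diff:

config = AutoConfig.from_pretrained(pretrained_model_dir)
if init:
    logger.info(f'Load AutoModel from checkpoint: {pretrained_model_dir}')
    self.model = AutoModel.from_pretrained(pretrained_model_dir)
else:
    # Assumption: when init is False, build fresh weights from the config alone.
    self.model = AutoModel.from_config(config)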

bert_config = AutoConfig.from_pretrained(pretrained_model_dir)
if init:
    logger.info(f'Load AutoModel from checkpoint: {pretrained_model_dir}')
    self.bert = AutoModel.from_pretrained(pretrained_model_dir)
Collaborator:
same here

pass


def finetune_edu_auto_model(
Collaborator:
Should it be something like pretrain_hf_auto_model? It is only used for HuggingFace models. Also, it is domain pretraining rather than fine-tuning?
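
A sketch of what pretrain_hf_auto_model could mean under that reading: continued masked-language-model training of a HuggingFace checkpoint on in-domain item texts. The signature and defaults are illustrative, not the PR's actual API:

from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

def pretrain_hf_auto_model(texts, output_dir, pretrained_model="bert-base-chinese"):
    # Domain pretraining: start from an existing checkpoint and keep
    # optimizing the masked-LM objective on raw item texts.
    tokenizer = AutoTokenizer.from_pretrained(pretrained_model)
    model = AutoModelForMaskedLM.from_pretrained(pretrained_model)
    dataset = [tokenizer(t, truncation=True, max_length=512) for t in texts]
    collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir=output_dir, num_train_epochs=1),
        train_dataset=dataset,
        data_collator=collator,
    )
    trainer.train()
    trainer.save_model(output_dir)
    tokenizer.save_pretrained(output_dir)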

Collaborator:
And the file name should be hf_auto_vec? It is not "auto" for our educational models.

Collaborator:
And should pretrain_bert and finetune_bert_for_xx be consistent with these auto functions? Maybe with these auto functions we can delete bert_vec? Not sure whether that is better. Or we can keep that file but directly reuse these auto functions?
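
One option that comment raises, sketched: keep bert_vec's public names for backward compatibility but implement them as thin wrappers over the auto functions. Every name below is illustrative, not a confirmed signature:

# Hypothetical wrappers; the *_hf_auto_model functions are the ones
# discussed above, assumed rather than taken from the PR.
def pretrain_bert(items, output_dir, **train_params):
    return pretrain_hf_auto_model(items, output_dir, **train_params)

def finetune_bert_for_property_prediction(items, output_dir, **train_params):
    return finetune_hf_auto_model_for_property_prediction(items, output_dir, **train_params)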


- __all__ = ["ElmoTokenizer", "ElmoDataset", "train_elmo", "train_elmo_for_property_prediction",
-            "train_elmo_for_knowledge_prediction"]
+ __all__ = ["ElmoTokenizer", "ElmoDataset", "pretrain_elmo", "pretrain_elmo_for_property_prediction",
Collaborator:
These should be finetune_elmo_for_xxx.
Maybe that's my task lol
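
The export list that comment asks for, sketched; it assumes pretraining keeps the pretrain_ prefix while task-specific training moves to finetune_elmo_for_xxx:

# Sketch of the corrected exports (assumption based on the review comment).
__all__ = ["ElmoTokenizer", "ElmoDataset", "pretrain_elmo",
           "finetune_elmo_for_property_prediction",
           "finetune_elmo_for_knowledge_prediction"]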

nnnyt (Collaborator) commented Feb 26, 2024

The test coverage seems to drop a lot. Try adding more tests for your new code.

@nnnyt nnnyt merged commit 47bfce8 into bigdata-ustc:dev Mar 4, 2024
4 checks passed