
Add a fix for custom code tokenizers in pipelines #32300

Merged
10 commits merged into main on Aug 27, 2024

Conversation

@Rocketknight1
Member

Fixes #31669

This is quite a rare issue which requires a very specific set of circumstances to trigger:

  • Model is custom code
  • Model supports text generation but does not have an associated tokenizer for its config class
  • Tokenizer is also custom code
  • Tokenizer is passed as a model repo string, not a tokenizer object

As such, I'm ambivalent about adding a test for this case - it seems fairly obscure!
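
For illustration, this is roughly the shape of call that hits it (a minimal sketch; the repo names are hypothetical placeholders, not real Hub repos):

```python
from transformers import pipeline

# Hypothetical repos: a custom-code model whose config class has no tokenizer
# registered for it, plus a separate custom-code tokenizer referenced by repo id.
pipe = pipeline(
    "text-generation",
    model="my-org/custom-code-model",
    tokenizer="my-org/custom-code-tokenizer",  # passed as a string, not a tokenizer object
    trust_remote_code=True,
)
```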

@Rocketknight1
Member Author

cc @amyeroberts - let me know if you think a test is needed anyway, or if you're happy with this as is. I verified that it fixes the issue.

@amyeroberts
Collaborator

Thanks for tackling this, @Rocketknight1!

I think this fix should be extended to the other processing classes -- processors, feature extractors, image processors -- too, as they could all reasonably be specified with a path to a custom hub implementation.

As for tests, yes please, we should add a small one, as custom code on the hub is something we want to make sure we keep supporting well.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Rocketknight1
Member Author

I extended the fix to the other processor/extractor classes that pipelines can initialize!

However, I'm still a bit negative on adding a test, mostly because complete test coverage here would be very long and messy: we'd need a battery of separate custom model/tokenizer/processor repos, plus tests for each combination, to actually get coverage on this. The situation that causes the bug is also a rare edge case; the vast majority of custom code models and tokenizers should never trigger it, since it requires the tokenizer and model to be separate, loaded from two separately named repos, with no config class linking them.

Also, the cause of the issue was a simple bug (flags like load_tokenizer simply weren't checking for tokenizers being passed as strings, because this was such a rare situation), so I think it's unlikely that we'll get a regression on that specific bug, and it's probably not worth such a large number of tests/repos to cover it!
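
To make that concrete, here is a simplified sketch of the kind of flag check involved; it is not the actual code in the pipeline factory, just the shape of the condition that was missing:

```python
def should_load_tokenizer(tokenizer, config_has_registered_tokenizer: bool) -> bool:
    # Simplified illustration only, not the actual pipeline code.
    # Before the fix this was (roughly) just `config_has_registered_tokenizer`,
    # so a tokenizer passed as a repo-id string was silently ignored.
    # The fix adds the isinstance check so a string also counts as an explicit request.
    return config_has_registered_tokenizer or isinstance(tokenizer, str)

# With no tokenizer registered for the custom config class, only the string check
# causes the custom tokenizer to be loaded:
assert should_load_tokenizer("my-org/custom-code-tokenizer", False) is True
```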

@amyeroberts
Collaborator

However, I'm still a bit negative on adding a test, mostly because complete test coverage here would be very long and messy: we'd need a battery of separate custom model/tokenizer/processor repos, plus tests for each combination, to actually get coverage on this. The situation that causes the bug is also a rare edge case; the vast majority of custom code models and tokenizers should never trigger it, since it requires the tokenizer and model to be separate, loaded from two separately named repos, with no config class linking them.

I don't think this is true - you just need to check that a tokenizer, image processor, and feature extractor can be loaded from a string reference, which should be 3 (mostly identical) tests.

@Rocketknight1
Member Author

Hmm, if you're sure! I was thinking we might need to test combinations of preprocessors, but it probably makes sense to just do three separate tests. I'll add them!

@amyeroberts
Collaborator

I was thinking we might need to test combinations of preprocessors

We could! I don't think it'd be necessary, though. You could always throw in one case with the tokenizer as a string and an image processor from the model_config, but that's probably enough to cover it.
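
Sketched out, such a test might look roughly like this (a rough sketch only; the repo names are hypothetical placeholders, and the real test added in this PR may differ):

```python
import unittest

from transformers import pipeline
from transformers.testing_utils import require_torch


class CustomCodePipelineTests(unittest.TestCase):
    @require_torch
    def test_custom_tokenizer_from_string(self):
        # Hypothetical custom-code repos used purely for illustration.
        pipe = pipeline(
            "text-generation",
            model="my-org/tiny-custom-code-model",
            tokenizer="my-org/tiny-custom-code-tokenizer",  # passed as a string
            trust_remote_code=True,
        )
        # With the fix, the tokenizer string is resolved instead of being ignored.
        self.assertIsNotNone(pipe.tokenizer)
```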

@Rocketknight1
Member Author

Rocketknight1 commented Aug 21, 2024

Hi @amyeroberts, the test is added! I realized we only needed one, because load_feature_extractor and load_image_processor are actually handled differently and would never suffer from this issue in the first place, so I reverted the changes to them. They both correctly set the load flag whenever feature_extractor is not None or image_processor is not None, which means the extra isinstance(feature_extractor, str) check I added was always trivially True.
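
For context, a sketch of why the extra check was redundant for those two (illustrative only, not the actual pipeline internals):

```python
def should_load_feature_extractor(feature_extractor, config_has_feature_extractor: bool) -> bool:
    # Illustrative only: this flag is driven by "was anything passed at all", so a
    # repo-id string already satisfies `feature_extractor is not None`, and an extra
    # `isinstance(feature_extractor, str)` clause can never change the result.
    return feature_extractor is not None or config_has_feature_extractor

# A string repo id already sets the flag, unlike the tokenizer case above.
assert should_load_feature_extractor("my-org/custom-feature-extractor", False) is True
```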

@Rocketknight1 force-pushed the fix_custom_tokenizer_in_pipeline branch 2 times, most recently from 6b6202a to 22babb9 on August 22, 2024 13:05
@amyeroberts
Collaborator

Thanks for adding a test!

We should still make sure we can load the feature extractor and image processor from a string in a test too.

@Rocketknight1 force-pushed the fix_custom_tokenizer_in_pipeline branch from cf42483 to d0f5eb6 on August 23, 2024 18:21
@Rocketknight1
Member Author

@amyeroberts tests added!

@amyeroberts
Collaborator

Beautiful - thanks for fixing and adding these tests!

Small nit - the tests should really be separated for the different processors, but it's not a big deal.

@Rocketknight1
Member Author

Good idea - tests are split, merging now!

@Rocketknight1 merged commit 9956c2b into main on Aug 27, 2024
24 checks passed
@Rocketknight1 deleted the fix_custom_tokenizer_in_pipeline branch on August 27, 2024 13:39
zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request Aug 30, 2024
* Add a fix for the case when tokenizers are passed as a string

* Support image processors and feature extractors as well

* Reverting load_feature_extractor and load_image_processor

* Add test

* Test is torch-only

* Add tests for preprocessors and feature extractors and move test

* Extremely experimental fix

* Revert that change, wrong branch!

* Typo!

* Split tests
itazap pushed a commit to NielsRogge/transformers that referenced this pull request Sep 20, 2024
* Add a fix for the case when tokenizers are passed as a string

* Support image processors and feature extractors as well

* Reverting load_feature_extractor and load_image_processor

* Add test

* Test is torch-only

* Add tests for preprocessors and feature extractors and move test

* Extremely experimental fix

* Revert that change, wrong branch!

* Typo!

* Split tests
Successfully merging this pull request may close these issues.

transformers.pipeline does not load tokenizer passed as string for custom models