
Add a fix for custom code tokenizers in pipelines #32300

Merged
10 commits merged into main on Aug 27, 2024

Conversation

@Rocketknight1
Member

Fixes #31669

This is quite a rare issue which requires a very specific set of circumstances to trigger:

  • Model is custom code
  • Model supports text generation but does not have an associated tokenizer for its config class
  • Tokenizer is also custom code
  • Tokenizer is passed as a model repo string, not a tokenizer object

As such, I'm ambivalent about adding a test for this case - it seems fairly obscure!
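
For illustration, this is roughly the shape of call that hits it (a minimal sketch; the repo names are hypothetical placeholders, not real Hub repos):

```python
from transformers import pipeline

# Hypothetical repos: a custom-code model whose config class has no tokenizer
# registered for it, plus a separate custom-code tokenizer referenced by repo id.
pipe = pipeline(
    "text-generation",
    model="my-org/custom-code-model",
    tokenizer="my-org/custom-code-tokenizer",  # passed as a string, not a tokenizer object
    trust_remote_code=True,
)
```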

@Rocketknight1
Member Author

cc @amyeroberts - let me know if you think a test is needed anyway, or if you're happy with this as is. I verified that it fixes the issue.

@amyeroberts
Collaborator

Thanks for tackling this, @Rocketknight1!

I think this fix should be extended to the other processing classes -- processors, feature extractors, image processors -- too, as they could all reasonably be specified with a path to a custom hub implementation.

As for tests, yes please, we should add a small one, as custom code on the hub is something we want to make sure we keep supporting well.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Rocketknight1
Member Author

I extended the fix to the other processor/extractor classes that pipelines can initialize!

However, I'm still a bit negative on adding a test, mostly because complete test coverage here would be very long and messy: we'd need a battery of separate custom model/tokenizer/processor repos, plus tests for each combination, to actually get coverage on this. The situation that causes the bug is also a rare edge case; the vast majority of custom code models and tokenizers should never trigger it, since it requires the tokenizer and model to be separate, loaded from two separately named repos, with no config class linking them.

Also, the cause of the issue was a simple bug (flags like load_tokenizer simply weren't checking for tokenizers being passed as strings, because this was such a rare situation), so I think it's unlikely that we'll get a regression on that specific bug, and it's probably not worth such a large number of tests/repos to cover it!
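
To make that concrete, here is a simplified sketch of the kind of flag check involved; it is not the actual code in the pipeline factory, just the shape of the condition that was missing:

```python
def should_load_tokenizer(tokenizer, config_has_registered_tokenizer: bool) -> bool:
    # Simplified illustration only, not the actual pipeline code.
    # Before the fix this was (roughly) just `config_has_registered_tokenizer`,
    # so a tokenizer passed as a repo-id string was silently ignored.
    # The fix adds the isinstance check so a string also counts as an explicit request.
    return config_has_registered_tokenizer or isinstance(tokenizer, str)

# With no tokenizer registered for the custom config class, only the string check
# causes the custom tokenizer to be loaded:
assert should_load_tokenizer("my-org/custom-code-tokenizer", False) is True
```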

@amyeroberts
Collaborator

However, I'm still a bit negative on adding a test, mostly because complete test coverage here would be very long and messy: we'd need a battery of separate custom model/tokenizer/processor repos, plus tests for each combination, to actually get coverage on this. The situation that causes the bug is also a rare edge case; the vast majority of custom code models and tokenizers should never trigger it, since it requires the tokenizer and model to be separate, loaded from two separately named repos, with no config class linking them.

I don't think this is true - you just need to check that a tokenizer, image processor, and feature extractor can be loaded from a string reference, which should be 3 (mostly identical) tests.

@Rocketknight1
Member Author

Hmm, if you're sure! I was thinking we might need to test combinations of preprocessors, but it probably makes sense to just do three separate tests. I'll add them!

@amyeroberts
Collaborator

I was thinking we might need to test combinations of preprocessors

We could! I don't think it'd be necessary, though. You could always throw in one case with the tokenizer as a string and an image processor from the model_config, but that's probably enough to cover it.
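
Sketched out, such a test might look roughly like this (a rough sketch only; the repo names are hypothetical placeholders, and the real test added in this PR may differ):

```python
import unittest

from transformers import pipeline
from transformers.testing_utils import require_torch


class CustomCodePipelineTests(unittest.TestCase):
    @require_torch
    def test_custom_tokenizer_from_string(self):
        # Hypothetical custom-code repos used purely for illustration.
        pipe = pipeline(
            "text-generation",
            model="my-org/tiny-custom-code-model",
            tokenizer="my-org/tiny-custom-code-tokenizer",  # passed as a string
            trust_remote_code=True,
        )
        # With the fix, the tokenizer string is resolved instead of being ignored.
        self.assertIsNotNone(pipe.tokenizer)
```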

@Rocketknight1
Member Author

Rocketknight1 commented Aug 21, 2024

Hi @amyeroberts, the test is added! I realized we only needed one, because load_feature_extractor and load_image_processor are actually handled differently and would never suffer from this issue in the first place, so I reverted the changes to them. They both correctly set the load flag whenever feature_extractor is not None or image_processor is not None, which means the extra isinstance(feature_extractor, str) check I added was always trivially True.
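
For context, a sketch of why the extra check was redundant for those two (illustrative only, not the actual pipeline internals):

```python
def should_load_feature_extractor(feature_extractor, config_has_feature_extractor: bool) -> bool:
    # Illustrative only: this flag is driven by "was anything passed at all", so a
    # repo-id string already satisfies `feature_extractor is not None`, and an extra
    # `isinstance(feature_extractor, str)` clause can never change the result.
    return feature_extractor is not None or config_has_feature_extractor

# A string repo id already sets the flag, unlike the tokenizer case above.
assert should_load_feature_extractor("my-org/custom-feature-extractor", False) is True
```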

@Rocketknight1 force-pushed the fix_custom_tokenizer_in_pipeline branch 2 times, most recently from 6b6202a to 22babb9 on August 22, 2024 13:05
@amyeroberts
Collaborator

Thanks for adding a test!

We should still make sure we can load the feature extractor and image processor from a string in a test too.

@Rocketknight1 force-pushed the fix_custom_tokenizer_in_pipeline branch from cf42483 to d0f5eb6 on August 23, 2024 18:21
@Rocketknight1
Member Author

@amyeroberts tests added!

@amyeroberts
Collaborator

Beautiful - thanks for fixing and adding these tests!

Small nit - the tests should really be separated for the different processors, but it's not a big deal.

@Rocketknight1
Member Author

Good idea - tests are split, merging now!

@Rocketknight1 merged commit 9956c2b into main on Aug 27, 2024
24 checks passed
@Rocketknight1 deleted the fix_custom_tokenizer_in_pipeline branch on August 27, 2024 13:39
zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request Aug 30, 2024
* Add a fix for the case when tokenizers are passed as a string

* Support image processors and feature extractors as well

* Reverting load_feature_extractor and load_image_processor

* Add test

* Test is torch-only

* Add tests for preprocessors and feature extractors and move test

* Extremely experimental fix

* Revert that change, wrong branch!

* Typo!

* Split tests
itazap pushed a commit to NielsRogge/transformers that referenced this pull request Sep 20, 2024
* Add a fix for the case when tokenizers are passed as a string

* Support image processors and feature extractors as well

* Reverting load_feature_extractor and load_image_processor

* Add test

* Test is torch-only

* Add tests for preprocessors and feature extractors and move test

* Extremely experimental fix

* Revert that change, wrong branch!

* Typo!

* Split tests
Successfully merging this pull request may close these issues.

transformers.pipeline does not load tokenizer passed as string for custom models