transformers.pipeline does not load tokenizer passed as string for custom models #31669

chedatomasz · 2024-06-27T19:05:44Z

System Info

transformers version: 4.42.1
Platform: Linux-6.1.85+-x86_64-with-glibc2.35
Python version: 3.10.12
Huggingface_hub version: 0.23.4
Safetensors version: 0.4.3
Accelerate version: not installed
Accelerate config: not found
PyTorch version (GPU?): 2.3.0+cu121 (False)
Tensorflow version (GPU?): 2.15.0 (False)
Flax version (CPU?/GPU?/TPU?): 0.8.4 (cpu)
Jax version: 0.4.26
JaxLib version: 0.4.26
Using distributed or parallel set-up in script?: no

Who can help?

@Narsil

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

Locate a model which

isn't present in TOKENIZER_MAPPING
doesn't specify model_config.tokenizer_class
nevertheless has a tokenizer on the hub, loadable with AutoTokenizer
These requirements means this happens for custom models only (not integrated into the library), AFAIK. Running those requires trust_remote_code=True, so it might be wise to create your own example meeting these requirements. I will be using "tcheda/mot_test".

Verify that code works properly where tokenizer and model are passed as pre-instantiated objects

from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("tcheda/mot_test", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("tcheda/mot_test", trust_remote_code=True)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

Attempt to create a pipeline, specifying both the model and the tokenizer as strings.

from transformers import pipeline
pipe = pipeline("text-generation", model="tcheda/mot_test", tokenizer="tcheda/mot_test", trust_remote_code=True)

The code crashes immediately with:

[usr/local/lib/python3.10/dist-packages/transformers/pipelines/__init__.py](https://localhost:8080/#) in pipeline(task, model, config, tokenizer, feature_extractor, image_processor, framework, revision, use_fast, token, device, device_map, torch_dtype, trust_remote_code, model_kwargs, pipeline_class, **kwargs)
  1106         kwargs["device"] = device
  1107 
-> 1108     return pipeline_class(model=model, framework=framework, task=task, **kwargs)

[/usr/local/lib/python3.10/dist-packages/transformers/pipelines/text_generation.py](https://localhost:8080/#) in __init__(self, *args, **kwargs)
    94 
    95     def __init__(self, *args, **kwargs):
---> 96         super().__init__(*args, **kwargs)
    97         self.check_model_type(
    98             TF_MODEL_FOR_CAUSAL_LM_MAPPING_NAMES if self.framework == "tf" else MODEL_FOR_CAUSAL_LM_MAPPING_NAMES

[/usr/local/lib/python3.10/dist-packages/transformers/pipelines/base.py](https://localhost:8080/#) in __init__(self, model, tokenizer, feature_extractor, image_processor, modelcard, framework, task, args_parser, device, torch_dtype, binary_output, **kwargs)
   895             self.tokenizer is not None
   896             and self.model.can_generate()
--> 897             and self.tokenizer.pad_token_id is not None
   898             and self.model.generation_config.pad_token_id is None
   899         ):

AttributeError: 'str' object has no attribute 'pad_token_id'

Explanation
The bug is probably at

transformers/src/transformers/pipelines/__init__.py

Line 907 in 1c68f2c

load_tokenizer = type(model_config) in TOKENIZER_MAPPING or model_config.tokenizer_class is not None

load_tokenizer = type(model_config) in TOKENIZER_MAPPING or model_config.tokenizer_class is not None
This isn't set when tokenizer is a string, so the tokenizer initialization block (which contains proper handling of the string case) at

transformers/src/transformers/pipelines/__init__.py

Line 907 in 1c68f2c

load_tokenizer = type(model_config) in TOKENIZER_MAPPING or model_config.tokenizer_class is not None

isn't entered at all.

Expected behavior

The pipeline should be created correctly, loading the tokenizer as if with AutoTokenizer.from_pretrained(tokenizer). This is the behaviour described in the docs.

The text was updated successfully, but these errors were encountered:

amyeroberts · 2024-06-27T22:07:00Z

cc @Rocketknight1

LysandreJik · 2024-07-29T08:17:19Z

Gentle ping @Rocketknight1

Rocketknight1 · 2024-07-29T16:13:40Z

PR with the fix is open at #32300!

huggingface deleted a comment from github-actions bot Jul 28, 2024

Rocketknight1 mentioned this issue Jul 29, 2024

Add a fix for custom code tokenizers in pipelines #32300

Merged

huggingface deleted a comment from github-actions bot Aug 27, 2024

Rocketknight1 closed this as completed in #32300 Aug 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

transformers.pipeline does not load tokenizer passed as string for custom models #31669

transformers.pipeline does not load tokenizer passed as string for custom models #31669

chedatomasz commented Jun 27, 2024 •

edited

Loading

amyeroberts commented Jun 27, 2024

LysandreJik commented Jul 29, 2024

Rocketknight1 commented Jul 29, 2024

transformers.pipeline does not load tokenizer passed as string for custom models #31669

transformers.pipeline does not load tokenizer passed as string for custom models #31669

Comments

chedatomasz commented Jun 27, 2024 • edited Loading

System Info

Who can help?

Information

Tasks

Reproduction

Expected behavior

amyeroberts commented Jun 27, 2024

LysandreJik commented Jul 29, 2024

Rocketknight1 commented Jul 29, 2024

chedatomasz commented Jun 27, 2024 •

edited

Loading