
[Experimental] Support Cross encoder models #10400

Open

wants to merge 9 commits into base: main
Conversation

maxdebayser
Contributor

@maxdebayser maxdebayser commented Nov 17, 2024

This PR contains a proof of concept to add support for Cross Encoder models reusing most of what was done to support embedding models, as well as the chat embeddings API. It's a bit hacky, but I think it helps to have something concrete to iterate and refine the design.

curl -X 'POST' 'http://localhost:8000/v1/embeddings' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "cross-encoder/ms-marco-MiniLM-L-6-v2",
  "encoding_format": "float",
  "messages": [{
    "role": "user",
    "content": "What is the capital of France?"
  },
  {
    "role": "user",
    "content": "The capital of France is Paris."
  }]
}'

Response:

{
  "id": "embd-49858b4f279b4f19939ac832e85de528",
  "object": "list",
  "created": 110449,
  "model": "cross-encoder/ms-marco-MiniLM-L-6-v2",
  "data": [
    {
      "index": 0,
      "object": "embedding",
      "embedding": [
        9.265625
      ]
    }
  ],
  "usage": {
    "prompt_tokens": 17,
    "total_tokens": 17,
    "completion_tokens": 0,
    "prompt_tokens_details": null
  }
}
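For reference, the single value in the embedding field is the raw logit from the cross-encoder head. For a single-label cross-encoder like this one, sentence-transformers applies a sigmoid by default, so the logit can be mapped to a 0-1 relevance score. A minimal sketch of that mapping:

```python
import math

def logit_to_score(logit: float) -> float:
    """Map a raw cross-encoder logit to a 0-1 relevance score via sigmoid."""
    return 1.0 / (1.0 + math.exp(-logit))

# Raw logit from the response above; the sigmoid puts it very close to 1.0,
# i.e. the pair is judged highly relevant.
score = logit_to_score(9.265625)
```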

I've added a new model type and a new task type. From the point of view of the models and the model runner this is not strictly necessary, because the inputs and outputs are compatible with the embedding models. However, at the serving level we need to know that the model has a different task so that we can be sure to call the tokenizer with the text_pair parameter instead of trying to apply a chat template.

What's still missing:

  • Support for cross encoding in the LLM class
  • Tests comparing with sentence-transformers
  • CPU support
  • Roberta cross encoder models

cc: @DarkLight1337 @flaviabeo

Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which executes a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@mergify mergify bot added the frontend label Nov 17, 2024
Member

@DarkLight1337 DarkLight1337 left a comment


Can you add an example script to show how to use this API?

QQ: Why do we have to define a separate "cross_encoding" task for this? I think we can keep using "embedding" task if we just override the pooler method instead of defining a new classification_output method.

Review comment on vllm/model_executor/models/bert.py (outdated; resolved)
@maxdebayser
Contributor Author

Can you add an example script to show how to use this API?

Yes, I've added one in the PR description but it's a good idea to add it to the documentation.

QQ: Why do we have to define a separate "cross_encoding" task for this? I think we can keep using "embedding" task if we just override the pooler method instead of defining a new classification_output method.

I thought about this, and the only reason I kept it that way is that in the serving layer I need to know which task is being performed, because I need to call the tokenizer with tokenizer(text=text1, text_pair=text2). If instead of reusing the chat embeddings API we had a new endpoint just for cross encoding, this wouldn't be necessary. Alternatively, we could add an attribute to one of the config classes to indicate that although the task is "embedding", the model is actually a BertModelForSequenceClassification.
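For context on why text_pair matters: a BERT-style tokenizer encodes the two texts as a single sequence with segment ids, which a chat template cannot reproduce. A simplified, illustrative sketch of the pair encoding (a real tokenizer additionally handles subwords, truncation, and numeric ids):

```python
def encode_pair(tokens_a: list[str], tokens_b: list[str]):
    """Illustrative BERT-style pair encoding: [CLS] A [SEP] B [SEP],
    with token_type_ids distinguishing the two segments. This is what
    tokenizer(text=..., text_pair=...) produces, conceptually."""
    tokens = ["[CLS]", *tokens_a, "[SEP]", *tokens_b, "[SEP]"]
    type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, type_ids

tokens, type_ids = encode_pair(["what", "is", "it"], ["an", "answer"])
```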

@DarkLight1337
Member

DarkLight1337 commented Nov 17, 2024

I think it may be simpler to make this a separate flag, similar to how we have a flag for multimodal models, rather than creating a new task for it. That way, we won't have to change our internals at all.

@DarkLight1337
Member

DarkLight1337 commented Nov 17, 2024

I think having a separate API for this would be cleaner as well - perhaps a Scoring API where we output a single score?
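To make the suggestion concrete, here is one possible shape for such a Scoring API. The field names (text_1, text_2, score) and payload layout are purely illustrative assumptions, not something this PR implements:

```python
import json

# Hypothetical request/response shapes for a dedicated Scoring API.
# All names here are illustrative only.
request = {
    "model": "cross-encoder/ms-marco-MiniLM-L-6-v2",
    "text_1": "What is the capital of France?",
    "text_2": "The capital of France is Paris.",
}
response = {
    "object": "list",
    "data": [{"index": 0, "object": "score", "score": 9.265625}],
}
payload = json.dumps(request)
```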

Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
@maxdebayser
Contributor Author

I've removed the "cross_encoding" task and added a is_cross_encoder property to the ModelConfig class. I've also added support for Roberta models.

Pending TODOs:

  • Add Scoring API
  • Support for cross encoding in the LLM class
  • Tests comparing with sentence-transformers
  • CPU support

Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
@maxdebayser
Contributor Author

I've added a score() method to the LLM class and added tests for it. I've also fixed the CPU support.

Pending TODOs:

  • Add Scoring API
  • Test Scoring API comparing with sentence-transformers
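To illustrate the signature under discussion, here is a hypothetical, model-free sketch of the pairing logic such a score() method might use. The broadcasting of a single string against a list is an assumption for illustration, and the scorer callable stands in for the actual cross-encoder forward pass:

```python
from typing import Callable, Sequence, Union

def score_pairs(
    texts: Union[str, Sequence[str]],
    text_pairs: Union[str, Sequence[str]],
    scorer: Callable[[str, str], float],
) -> list[float]:
    """Illustrative pairing logic: a single string on either side is
    broadcast against the sequence on the other side, then each pair is
    scored. In the real method, the model scores each (text, pair)."""
    if isinstance(texts, str):
        texts = [texts]
    if isinstance(text_pairs, str):
        text_pairs = [text_pairs]
    if len(texts) == 1:
        texts = list(texts) * len(text_pairs)
    if len(text_pairs) == 1:
        text_pairs = list(text_pairs) * len(texts)
    if len(texts) != len(text_pairs):
        raise ValueError("texts and text_pairs must have matching lengths")
    return [scorer(a, b) for a, b in zip(texts, text_pairs)]
```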

Comment on lines 809 to 810
texts: Union[SingletonPrompt, Sequence[SingletonPrompt]],
text_pairs: Union[SingletonPrompt, Sequence[SingletonPrompt]],
Member


Maybe text_1 and text_2? text_pairs is a bit confusing to me as it suggests that I should pass in a list of tuples.

Comment on lines +475 to +481
if (hasattr(config, "sbert_ce_default_activation_function")
and config.sbert_ce_default_activation_function is not None):
self.default_activation_function = import_from_string(
config.sbert_ce_default_activation_function)()
else:
self.default_activation_function = \
nn.Sigmoid() if config.num_labels == 1 else nn.Identity()
Member


Factor out this code to transformer_utils?
