Support input for llama3.2 multi-modal model #69
Conversation
Some test results (Question / Output, 90B model):
@statiraju Please have your team member review this PR.
src/model.py
{
    "name": "multi_modal_data",
    "data_type": "TYPE_STRING",
    "dims": [1],
    "optional": True,
},
@GuanLuo @krishung5 @kthui any concerns with passing a serialized JSON input vs. individual input tensors for "image", "audio", etc.?
Looks like this is currently mimicking the style of inputs vLLM itself expects, so it would be pretty intuitive to vLLM users:
- https://github.com/vllm-project/vllm/blob/1dd4cb2935fc3fff9c156b5772d18e0a0d1861f0/vllm/multimodal/base.py#L132-L139
- https://docs.vllm.ai/en/stable/models/vlm.html#multi-image-input
Current serialized JSON form:
{
    "name": "multi_modal_data",
    "data_type": "TYPE_STRING",
    "dims": [1],  # 1 "element" to Triton, arbitrary structure/size inside the JSON, validated by backend
    "optional": True,
},
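For illustration, the single string element in this form could carry something like the following (the keys and base64 encoding shown here are assumptions mirroring vLLM's multi_modal_data dict, not taken from this PR):

    import json

    # Hypothetical content of the "multi_modal_data" string; the backend would
    # have to validate this structure before handing it to vLLM.
    multi_modal_data = json.dumps(
        {
            "image": [
                "<BASE64_ENCODED_IMAGE_1>",
                "<BASE64_ENCODED_IMAGE_2>",
            ]
        }
    )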
Example tensor form:
{
    "name": "image",
    "data_type": "TYPE_STRING",
    "dims": [-1],  # can be multiple images as separate elements
    "optional": True,
},
{
    "name": "audio",
    "data_type": "TYPE_STRING",
    "dims": [-1],  # can be multiple audios as separate elements
    "optional": True,
},
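For comparison, a minimal client-side sketch of the tensor form, assuming a model named vllm_model that exposes the backend's usual text_input/text_output names plus the optional image input above (the model name, prompt string, and endpoint details are illustrative assumptions):

    import numpy as np
    import tritonclient.grpc as grpcclient

    client = grpcclient.InferenceServerClient("localhost:8001")

    # TYPE_STRING inputs are sent as BYTES tensors; shapes assume the model is
    # configured without Triton-side batching.
    text = grpcclient.InferInput("text_input", [1], "BYTES")
    # Prompt format depends on the model; this follows the Llama 3.2 vision style.
    text.set_data_from_numpy(
        np.array(["<|image|><|begin_of_text|>Describe this image."], dtype=object)
    )

    image = grpcclient.InferInput("image", [1], "BYTES")
    image.set_data_from_numpy(np.array(["<BASE64_ENCODED_IMAGE>"], dtype=object))

    result = client.infer("vllm_model", [text, image])
    print(result.as_numpy("text_output"))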
I think the individual input tensors are cleaner in terms of what inputs are expected, and less prone to user error since they do not involve the additional JSON layer.
Given that we need to tear down the JSON and convert each Base64 string into bytes, there is actually some work on the backend to verify the JSON is well-formed before the conversion can happen. I think it is easier to supply the image/audio as individual tensors, knowing they are already well-formed, and then convert each Base64 string into bytes and format them correctly for vLLM.
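For context, a rough sketch of the kind of teardown/validation the serialized-JSON form would require on the backend side (the helper name and JSON schema are illustrative, not from this PR):

    import base64
    import json
    from io import BytesIO

    from PIL import Image

    def parse_multi_modal_json(raw: str):
        # Hypothetical helper: validate the JSON layer, then decode each
        # base64-encoded image into a PIL image for vLLM.
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as e:
            raise ValueError(f"multi_modal_data is not valid JSON: {e}")
        if not isinstance(data, dict) or "image" not in data:
            raise ValueError("multi_modal_data must be an object with an 'image' key")
        images = []
        for item in data["image"]:
            encoded = item.split("base64,")[-1]  # strip an optional data-URI prefix
            images.append(Image.open(BytesIO(base64.b64decode(encoded))).convert("RGB"))
        return images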
No actual concerns off the top of my head. Agree with Jacky that the tensor form looks cleaner and could simplify some checks. In my opinion, aligning the format with vLLM could slightly improve usability for vLLM backend users. However, since the required input changes seem minimal, the impact on vLLM users should be limited.
if "base64," in image_base64_string: | ||
image_base64_string = image_base64_string.split("base64,")[-1] | ||
image_data = base64.b64decode(image_base64_string) | ||
image = Image.open(BytesIO(image_data)).convert("RGB") |
NOTE: May need to expose image formats other than RGB in the future, but seems like a sensible default / first support for now. We can probably defer exposing it until we have a use case requiring other formats.
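If that need does come up, one possible (hypothetical) way to expose it is a model-config parameter rather than a new input tensor; the parameter name below is illustrative only, and it assumes the parsed config is available as self.model_config as is typical for Python-backend models:

    # Hypothetical: read the target image mode from the model config parameters,
    # defaulting to RGB when the parameter is absent.
    image_format = (
        self.model_config.get("parameters", {})
        .get("image_format", {})
        .get("string_value", "RGB")
    )
    image = Image.open(BytesIO(image_data)).convert(image_format)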
@harryskim @rmccorm4 @kthui Thank you so much for starting the review so quickly.
Hi @rmccorm4 @kthui @harryskim, I have updated the code and validated it using the following payload format.
Test result: image1 → text output
Can we add a simple version check for enabling/disabling features supported across different vLLM versions? This is because people could be using an older version that does not support multi-modal input yet, or may still wish to receive "best_of_request" metrics on an older version. For example, the check could be as simple as:
from vllm.version import __version__ as _VLLM_VERSION
...

class TritonPythonModel:
    @classmethod
    def auto_complete_config(cls, auto_complete_model_config):
        ...
        if _VLLM_VERSION >= "0.6.3.post1":
            inputs.append({
                "name": "image",
                "data_type": "TYPE_STRING",
                "dims": [-1],  # can be multiple images as separate elements
                "optional": True,
            })
        ...

    ...

    async def generate(self, request):
        ...
        if _VLLM_VERSION >= "0.6.3.post1":
            image_input_tensor = ...
        ...

    ...
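A rough sketch of what the elided generate()-side handling could look like under this suggestion (the tensor name "image" follows the discussion above; everything else is an illustrative assumption, not code from this PR):

    import base64
    from io import BytesIO

    import triton_python_backend_utils as pb_utils
    from PIL import Image

    if _VLLM_VERSION >= "0.6.3.post1":
        image_tensor = pb_utils.get_input_tensor_by_name(request, "image")
        if image_tensor is not None:
            images = []
            for encoded in image_tensor.as_numpy():
                # Each element is a base64 string; strip an optional data-URI prefix.
                encoded = encoded.decode("utf-8").split("base64,")[-1]
                images.append(
                    Image.open(BytesIO(base64.b64decode(encoded))).convert("RGB")
                )
            # vLLM then takes the decoded images alongside the prompt, e.g.
            # {"prompt": prompt, "multi_modal_data": {"image": images}}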
@kthui Have added version checks. Please help review the change, thanks.
Hi @xiejibing, I am not able to find your signed CLA in our records. Can you send us the signed CLA following the instructions here? The signed CLA is required before we can merge your PR.
@kthui Thanks! I will sign the CLA as soon as possible.
Thanks for the great contributions @xiejibing @yiheng ! 🚀
There is a pre-commit failure around linting, but I think we can just fix it right after merge in this case if needed. For future contributions, you can also run pre-commit install locally after checking out the repository to get pre-commit hooks while developing.
Hi @xiejibing, any update on the CLA? We will be adding new changes into the backend that could require this PR to be rebased.
llama3.2 has been supported since vLLM 0.6.2, and multiple images as input have been supported since vLLM 0.6.3.post1.
We enhanced model.py to make it support llama 3.2 multi-modal models:
- Supports both a single image and multiple images as input
- Images should be base64-encoded
Payload
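For reference, a generate-endpoint request along these lines might look like the following; the endpoint path, model name, prompt format, and sampling parameters are illustrative assumptions, not taken from this PR:

    import json
    import requests

    # Illustrative request to Triton's generate endpoint; "text_input",
    # "stream", and "sampling_parameters" are the backend's usual inputs,
    # and "image" is the new base64-encoded image input.
    payload = {
        "text_input": "<|image|><|begin_of_text|>What is in this image?",
        "image": ["<BASE64_ENCODED_IMAGE>"],
        "stream": False,
        "sampling_parameters": json.dumps({"temperature": 0}),
    }
    response = requests.post(
        "http://localhost:8000/v2/models/vllm_model/generate", json=payload
    )
    print(response.json()["text_output"])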