Support input for llama3.2 multi-modal model #69

xiejibing · 2024-11-01T08:08:08Z

Llama 3.2 has been supported since vLLM 0.6.2, and multiple images as input have been supported since vLLM 0.6.3.post1.
We enhanced model.py so that it supports Llama 3.2 multi-modal models.

Supports either a single image or multiple images as input.
Images must be Base64-encoded.

Payload

{
    "text_input": "Please describe the images",
    "multi_modal_data": "{\"image\": [\"base64 encoded image1\", \"base64 encoded image2\"]}",
    "sampling_parameters": "{\"max_tokens\": 256, \"top_p\": 0.9}"
}

yiheng · 2024-11-01T10:57:57Z

Some test results:

Image

Question
What is the content of this image?

Output
11B
"What is the content of this image? This image features a white domestic cat sitting on the floor of a room. The cat is looking up, possibly at a toy. Several toys and pieces of wood appear in the foreground. The wall behind the cat appears to be light wooden. There is a darker wooden frame under a cabinet that resembles an entryway, and at the bottom of the photo, a person's hand can be seen. Overall, the image presents an intimate and domestic scene, with the cat and toys suggesting a home environment. The warm lighting and the presence of a person's hand add to the cozy and inviting atmosphere. The image conveys a sense of comfort and relaxation, making the viewer feel like they are peeking into a peaceful moment in a home. Overall, the image is serene and appealing, inviting the viewer to imagine themselves in this peaceful scene. The content of the image evokes a sense of warmth, comfort, and tranquility. The image suggests a connection to nature, with the wooden floor and the toys bringing to mind the outdoors. The presence of a person's hand adds a human element, suggesting a sense of companionship and care. The overall mood of the image is one of serenity and contentment, inviting the viewer to relax and unwind. The image presents a peaceful and idyllic scene, with the cat and toys creating a sense of harmony and balance. The warm lighting and the wooden floor add to the cozy atmosphere, making the viewer feel like they are part of a peaceful and loving environment."

90B
"What is the content of this image? The image features a light-colored cat sitting on a mat, with a few toys scattered in front of it. In the background, there is a cabinet or piece of furniture that adds depth to the scene. The overall atmosphere suggests a cozy and inviting space, perfect for relaxation or playtime with the feline friend. The image captures a serene moment, inviting the viewer to appreciate the simple joys of life. The content of this image is a heartwarming portrayal of a cat in its natural habitat, surrounded by toys and comfort. It evokes feelings of warmth, comfort, and companionship, making it a delightful addition to any space. Whether you're a cat lover or simply appreciate the beauty of everyday moments, this image is sure to bring a smile to your face. So, take a moment to appreciate the serenity of this scene, and let the charm of the cat and its surroundings brighten up your day. The content of this image is a delightful portrayal of a cat in its natural habitat, surrounded by toys and comfort. It evokes feelings of warmth, comfort, and companionship, making it a delightful addition to any space. Whether you're a cat lover or simply appreciate the beauty of everyday moments, this image is sure to bring a smile to your face. So, take a moment to appreciate the serenity of this scene, and let the charm of the cat and its surroundings brighten up your day. "

harryskim · 2024-11-01T15:53:00Z

@statiraju Please have your team member review this PR

rmccorm4 · 2024-11-01T18:01:36Z

src/model.py

+            {
+                "name": "multi_modal_data",
+                "data_type": "TYPE_STRING",
+                "dims": [1],
+                "optional": True,
+            },


@GuanLuo @krishung5 @kthui any concerns with passing a serialized JSON input vs. individual input tensors for "image", "audio", etc?

Looks like this is currently mimicking the style of inputs vLLM itself expects, so it would be pretty intuitive to vLLM users:

https://github.com/vllm-project/vllm/blob/1dd4cb2935fc3fff9c156b5772d18e0a0d1861f0/vllm/multimodal/base.py#L132-L139

https://docs.vllm.ai/en/stable/models/vlm.html#multi-image-input

Current serialized JSON form:

{
    "name": "multi_modal_data",
    "data_type": "TYPE_STRING",
    "dims": [1],  # 1 "element" to Triton, arbitrary structure/size inside the JSON, validated by backend
    "optional": True,
},

Example tensor form:

{
    "name": "image",
    "data_type": "TYPE_STRING",
    "dims": [-1],  # can be multiple images as separate elements
    "optional": True,
},
{
    "name": "audio",
    "data_type": "TYPE_STRING",
    "dims": [-1],  # can be multiple audios as separate elements
    "optional": True,
},

kthui · 2024-11-01

I think the individual input tensors are cleaner in terms of what inputs are expected, and less prone to user error since they do not involve the additional JSON layer.

Given that we need to tear down the JSON and convert each Base64 string into bytes, there is actually some work on the backend to verify that the JSON is well-formed before the conversion can happen. I think it is easier to supply the image/audio as individual tensors, knowing they are already well-formed, and then convert each Base64 string into bytes and format them correctly for vLLM.

krishung5 · 2024-11-01

No actual concerns off the top of my head. I agree with Jacky that the tensor form looks cleaner and could simplify some checks. In my opinion, aligning the format with vLLM could slightly improve usability for vLLM backend users. However, since the required input changes seem minimal, the impact on vLLM users should be limited.
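
To make the trade-off discussed above concrete, here is a minimal sketch of the extra validation the serialized JSON form needs compared with the tensor form. It is not the code from this PR: the function names are hypothetical, only the standard library is used, and it assumes each tensor element reaches the backend as a `bytes` object holding one Base64 string.

```python
import base64
import binascii
import json


def _decode_image(encoded: str) -> bytes:
    """Decode one Base64 string, tolerating an optional data-URL prefix."""
    if "base64," in encoded:
        encoded = encoded.split("base64,")[-1]
    try:
        return base64.b64decode(encoded, validate=True)
    except binascii.Error as exc:
        raise ValueError("image is not valid Base64") from exc


def images_from_json_form(serialized: str) -> list[bytes]:
    """JSON form: the backend must parse and shape-check the document first."""
    try:
        data = json.loads(serialized)
    except json.JSONDecodeError as exc:
        raise ValueError("multi_modal_data is not valid JSON") from exc
    if not isinstance(data, dict):
        raise ValueError("multi_modal_data must be a JSON object")
    images = data.get("image", [])
    if isinstance(images, str):  # a single image given as a bare string
        images = [images]
    if not isinstance(images, list) or not all(isinstance(i, str) for i in images):
        raise ValueError('"image" must be a string or a list of strings')
    return [_decode_image(i) for i in images]


def images_from_tensor_form(elements: list[bytes]) -> list[bytes]:
    """Tensor form: every element is already one image, so only decoding remains."""
    return [_decode_image(e.decode("utf-8")) for e in elements]
```

Both helpers return the same raw image bytes; the difference is that the JSON path has several additional failure modes to report before any image is decoded.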

rmccorm4 · 2024-11-01T18:14:23Z

src/model.py

+                        if "base64," in image_base64_string:
+                            image_base64_string = image_base64_string.split("base64,")[-1]
+                        image_data = base64.b64decode(image_base64_string)
+                        image = Image.open(BytesIO(image_data)).convert("RGB")


NOTE: May need to expose image formats other than RGB in the future, but seems like a sensible default / first support for now. We can probably defer exposing it until we have a use case requiring other formats.

ex: https://github.com/vllm-project/vllm/blob/1dd4cb2935fc3fff9c156b5772d18e0a0d1861f0/vllm/multimodal/utils.py#L33

xiejibing · 2024-11-02T00:25:32Z

@harryskim @rmccorm4 @kthui Thank you so much for starting the review so quickly.
We will update the code to supply the images/audios as individual tensors and do some tests to make sure it works well.

xiejibing · 2024-11-04T08:32:27Z

Hi @rmccorm4 @kthui @harryskim, I have updated the code and validated it using the following payload format:

{
    "text_input": "descirbe the images",
    "image": ["image1_base64_encoded_see_attached_files", "image2_base64_encoded_see_attached_files"],
    "sampling_parameters": "{\"max_tokens\": 256, \"top_p\": 0.9}"
}
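
For readers who want to produce a payload of this shape themselves, the following is a rough client-side sketch. `build_payload` is a hypothetical helper, not part of this PR, and the byte strings stand in for real image file contents. How the payload is sent to the server is left out on purpose.

```python
import base64
import json


def build_payload(prompt: str, images: list[bytes],
                  max_tokens: int = 256, top_p: float = 0.9) -> dict:
    """Assemble a request body with one Base64 string per image."""
    return {
        "text_input": prompt,
        # Each image becomes a separate element of the "image" input.
        "image": [base64.b64encode(img).decode("ascii") for img in images],
        # sampling_parameters stays a serialized JSON string, as in the payload above.
        "sampling_parameters": json.dumps({"max_tokens": max_tokens, "top_p": top_p}),
    }


# Placeholder bytes; in practice these would be read from image files.
placeholder_images = [b"first image bytes", b"second image bytes"]
payload = build_payload("describe the images", placeholder_images)
body = json.dumps(payload)  # text that would be sent as the request body
```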

image1:

cat.txt

image2:

cherry_blossom.txt

text output

"descirbe the images here. The image shows a white cat with dark gray stripes sitting on a floor surrounded by cherry blossom branches. The cat is facing the camera and the flowers are pink. The background is blurry but appears to be the interior of a room with wooden cabinets or shelves. The overall atmosphere of the image is one of serenity and tranquility, with the cat and flowers creating a peaceful scene. The lighting in the image is soft and natural, with the sun shining through the windows and casting a warm glow over the scene. The image captures a moment of stillness and calmness, inviting the viewer to pause and appreciate the beauty of nature. The focus of the image is on the cat and the flowers, with the background fading into the distance. The overall composition of the image is simple yet effective, drawing the viewer's attention to the main subjects. The image could be used as a wallpaper or a cover photo for a social media post, and it could also be used in a magazine or a book as an illustration. Overall, the image is a beautiful representation of the beauty of nature and the serenity of a peaceful moment. it could evoke feelings of calmness and tranquility in the viewer, and it could inspire a sense of appreciation for the natural world. It could also"

kthui · 2024-11-05T00:27:07Z

Can we add a simple version check for enabling / disabling features that are supported only in certain vLLM versions?

This is because people could be using an older version that does not support multi-modal input yet, or may still wish to receive "best_of_request" metrics on an older version.

For example, the check can be simply:

...
from vllm.version import __version__ as _VLLM_VERSION
...
class TritonPythonModel:
    @staticmethod
    def auto_complete_config(auto_complete_model_config):
        ...
        if _VLLM_VERSION >= "0.6.3.post1":
            inputs.append({
                "name": "image",
                "data_type": "TYPE_STRING",
                "dims": [-1],  # can be multiple images as separate elements
                "optional": True,
            })
        ...
    ...
    async def generate(self, request):
        ...
        if _VLLM_VERSION >= "0.6.3.post1":
            image_input_tensor = ...
            ...
        ...
    ...

$ python3
Python 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> "0.6.3.post1" >= "0.5.3.post1"
True
>>> "0.6.3.post1" >= "0.5.5"
True
>>> "0.6.3.post1" >= "0.6.3"
True
>>> "0.6.3.post1" >= "0.6.3.post1"
True
>>> "0.6.3.post1" >= "0.6.4"
False
>>>
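
One caveat with plain string comparison is that it orders versions character by character, so a two-digit component such as `0.10.0` sorts below `0.6.3.post1`. The sketch below shows a numeric comparison using only the standard library. The helper names are hypothetical, and the parser is a simplification that only understands dotted numbers with an optional `.postN` suffix; `packaging.version.Version` is a more complete option where that package is available.

```python
import re

# Matches a dotted numeric release, optionally followed by ".postN".
_VERSION_RE = re.compile(r"^(\d+(?:\.\d+)*)(?:\.post(\d+))?")


def parse_version(version: str) -> tuple:
    """Turn '0.6.3.post1' into ((0, 6, 3), 1) so comparison is numeric."""
    match = _VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    release = tuple(int(part) for part in match.group(1).split("."))
    post = int(match.group(2)) if match.group(2) else 0
    return release, post


def supports_multi_image(installed: str, minimum: str = "0.6.3.post1") -> bool:
    """Return True if the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


# String comparison gets a two-digit minor version wrong...
print("0.10.0" >= "0.6.3.post1")        # False
# ...while the numeric comparison does not.
print(supports_multi_image("0.10.0"))   # True
print(supports_multi_image("0.6.3"))    # False
```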

xiejibing · 2024-11-06T14:13:48Z

@kthui I have added version checks. Please review the change, thanks.

kthui · 2024-11-06T22:10:23Z

Hi @xiejibing, I am not able to find your signed CLA in our records. Can you send us the signed CLA following the instructions here? The signed CLA is required before we can merge your PR.

xiejibing · 2024-11-06T23:35:25Z

@kthui Thanks! I will sign the CLA as soon as possible.

rmccorm4

Thanks for the great contributions @xiejibing @yiheng ! 🚀

There is a pre-commit failure around linting, but I think we can just fix it right after merge in this case if needed. For future contributions, you can also run pre-commit install locally after checking out the repository to get pre-commit hooks while developing.

kthui · 2024-11-15T19:25:35Z

Hi @xiejibing, any update on the CLA? We will be adding new changes into the backend that could require this PR to be rebased.

jibxie added 4 commits October 14, 2024 17:29

Support modal model input

52e125f

Update

114c235

support image list input

4a8b737

Remove best of metrics to align with latest vllm

9897102

harryskim requested a review from statiraju November 1, 2024 15:52

statiraju requested review from krishung5, kthui and rmccorm4 November 1, 2024 15:57

rmccorm4 reviewed Nov 1, 2024

supply the images as individual tensors

c85d972

xiejibing requested a review from rmccorm4 November 4, 2024 08:17

kthui mentioned this pull request Nov 5, 2024

refactor: Skip "best_of_requests" if it is deleted from the installed vLLM #71

Closed

Add vllm version check for compatibility

566e0cc

kthui approved these changes Nov 6, 2024

rmccorm4 approved these changes Nov 6, 2024
