Support input for llama3.2 multi-modal model #69

Open
wants to merge 6 commits into
base: main

Conversation

xiejibing

@xiejibing xiejibing commented Nov 1, 2024

Llama 3.2 has been supported since vLLM 0.6.2, and multiple images as input have been supported since vLLM 0.6.3.post1.
We updated model.py to support Llama 3.2 multi-modal models.

  • Supports either a single image or multiple images as input

  • Images must be base64-encoded

Payload

{
    "text_input": "Please describe the images",
    "multi_modal_data": "{\"image\": [\"base64 encoded image1\", \"base64 encoded image2\"]}",
    "sampling_parameters": "{\"max_tokens\": 256, \"top_p\": 0.9}"
}
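
For reference, a client could assemble this payload roughly as follows. This is only a sketch: `build_payload` is a hypothetical helper that is not part of the PR, and the image bytes are placeholders. The point it illustrates is that `multi_modal_data` and `sampling_parameters` are JSON documents serialized into strings, so they are encoded separately before being placed in the outer request.

```python
import base64
import json


def build_payload(prompt, images, max_tokens=256, top_p=0.9):
    """Hypothetical helper: build the serialized-JSON payload shown above.

    `images` is a list of raw image bytes; each one is base64-encoded.
    """
    encoded = [base64.b64encode(img).decode("ascii") for img in images]
    return {
        "text_input": prompt,
        # Nested JSON is serialized to a string, matching the payload above.
        "multi_modal_data": json.dumps({"image": encoded}),
        "sampling_parameters": json.dumps(
            {"max_tokens": max_tokens, "top_p": top_p}
        ),
    }


# Placeholder bytes stand in for real image files.
payload = build_payload(
    "Please describe the images",
    [b"fake-image-bytes-1", b"fake-image-bytes-2"],
)
print(json.dumps(payload, indent=4))
```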

@yiheng

yiheng commented Nov 1, 2024

Some test results:

Image
Screenshot 2024-11-01 at 18 57 18

Question
What is the content of this image?

Output
11B
"What is the content of this image? This image features a white domestic cat sitting on the floor of a room. The cat is looking up, possibly at a toy. Several toys and pieces of wood appear in the foreground. The wall behind the cat appears to be light wooden. There is a darker wooden frame under a cabinet that resembles an entryway, and at the bottom of the photo, a person's hand can be seen. Overall, the image presents an intimate and domestic scene, with the cat and toys suggesting a home environment. The warm lighting and the presence of a person's hand add to the cozy and inviting atmosphere. The image conveys a sense of comfort and relaxation, making the viewer feel like they are peeking into a peaceful moment in a home. Overall, the image is serene and appealing, inviting the viewer to imagine themselves in this peaceful scene. The content of the image evokes a sense of warmth, comfort, and tranquility. The image suggests a connection to nature, with the wooden floor and the toys bringing to mind the outdoors. The presence of a person's hand adds a human element, suggesting a sense of companionship and care. The overall mood of the image is one of serenity and contentment, inviting the viewer to relax and unwind. The image presents a peaceful and idyllic scene, with the cat and toys creating a sense of harmony and balance. The warm lighting and the wooden floor add to the cozy atmosphere, making the viewer feel like they are part of a peaceful and loving environment."

90B
"What is the content of this image? The image features a light-colored cat sitting on a mat, with a few toys scattered in front of it. In the background, there is a cabinet or piece of furniture that adds depth to the scene. The overall atmosphere suggests a cozy and inviting space, perfect for relaxation or playtime with the feline friend. The image captures a serene moment, inviting the viewer to appreciate the simple joys of life. The content of this image is a heartwarming portrayal of a cat in its natural habitat, surrounded by toys and comfort. It evokes feelings of warmth, comfort, and companionship, making it a delightful addition to any space. Whether you're a cat lover or simply appreciate the beauty of everyday moments, this image is sure to bring a smile to your face. So, take a moment to appreciate the serenity of this scene, and let the charm of the cat and its surroundings brighten up your day. The content of this image is a delightful portrayal of a cat in its natural habitat, surrounded by toys and comfort. It evokes feelings of warmth, comfort, and companionship, making it a delightful addition to any space. Whether you're a cat lover or simply appreciate the beauty of everyday moments, this image is sure to bring a smile to your face. So, take a moment to appreciate the serenity of this scene, and let the charm of the cat and its surroundings brighten up your day. "

@harryskim

@statiraju Please have your team member review this PR

src/model.py Outdated
Comment on lines 57 to 62
            {
                "name": "multi_modal_data",
                "data_type": "TYPE_STRING",
                "dims": [1],
                "optional": True,
            },
Contributor

@rmccorm4 rmccorm4 Nov 1, 2024

@GuanLuo @krishung5 @kthui any concerns with passing a serialized JSON input vs. individual input tensors for "image", "audio", etc?

Looks like this is currently mimicking the style of inputs vllm itself expects, so it would be pretty intuitive to vllm users:

Current serialized JSON form:

            {
                "name": "multi_modal_data",
                "data_type": "TYPE_STRING",
                "dims": [1], # 1 "element" to Triton, arbitrary structure/size inside the JSON, validated by backend
                "optional": True,
            },

Example tensor form:

            {
                "name": "image",
                "data_type": "TYPE_STRING",
                "dims": [-1], # can be multiple images as separate elements
                "optional": True,
            },
            {
                "name": "audio",
                "data_type": "TYPE_STRING",
                "dims": [-1], # can be multiple audios as separate elements
                "optional": True,
            },

Contributor

@kthui kthui Nov 1, 2024

I think individual input tensors are cleaner in terms of what inputs are expected, and less prone to user error because they do not involve the additional JSON layer.

Given that we need to tear down the JSON and convert each Base64 string into bytes, there is actually some work on the backend to verify that the JSON is well-formed before the conversion can happen. I think it is easier to supply the image/audio as individual tensors, knowing they are already well-formed, and then convert each Base64 string into bytes and format them correctly for vLLM.
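
A minimal sketch of the tensor-form conversion described above, using only the standard library. `decode_image_elements` is a hypothetical name, and the sketch assumes each tensor element arrives as a `bytes` or `str` base64 string; the optional data-URI prefix handling mirrors the snippet under review in this PR. With the tensor form there is no JSON to parse, so the only validation left is the base64 decoding itself.

```python
import base64
import binascii


def decode_image_elements(elements):
    """Hypothetical helper: decode base64 tensor elements into raw bytes.

    Each element may be bytes or str, and may carry a data-URI prefix
    such as "data:image/png;base64,".
    """
    decoded = []
    for index, element in enumerate(elements):
        if isinstance(element, bytes):
            element = element.decode("utf-8")
        if "base64," in element:
            # Drop the data-URI prefix, keeping only the base64 body.
            element = element.split("base64,")[-1]
        try:
            decoded.append(base64.b64decode(element, validate=True))
        except binascii.Error as exc:
            raise ValueError(f"image element {index} is not valid base64") from exc
    return decoded


raw = b"\x89PNG placeholder bytes"
encoded = base64.b64encode(raw)
images = decode_image_elements(
    [encoded, "data:image/png;base64," + encoded.decode("ascii")]
)
```

The decoded bytes would still need to be opened as images and formatted for vLLM, which is left out here.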

No concerns off the top of my head. I agree with Jacky that the tensor form looks cleaner and could simplify some checks. Aligning the format with vLLM could slightly improve usability for vLLM backend users, but since the required input changes seem minimal, the impact on vLLM users should be limited.

if "base64," in image_base64_string:
    image_base64_string = image_base64_string.split("base64,")[-1]
image_data = base64.b64decode(image_base64_string)
image = Image.open(BytesIO(image_data)).convert("RGB")
Contributor

NOTE: May need to expose image formats other than RGB in the future, but seems like a sensible default / first support for now. We can probably defer exposing it until we have a use case requiring other formats.

ex: https://github.com/vllm-project/vllm/blob/1dd4cb2935fc3fff9c156b5772d18e0a0d1861f0/vllm/multimodal/utils.py#L33

@xiejibing
Author

@harryskim @rmccorm4 @kthui Thank you so much for starting the review so quickly.
We will update the code to supply images and audio as individual tensors and run some tests to make sure it works well.

@xiejibing
Author

Hi @rmccorm4 @kthui @harryskim , I have updated the code and validated using the following payload format:

{
    "text_input": "descirbe the images",
    "image": ["image1_base64_encoded_see_attached_files", "image2_base64_encoded_see_attached_files"],
    "sampling_parameters": "{\"max_tokens\": 256, \"top_p\": 0.9}"
}
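
For clients already sending the earlier serialized-JSON form, the migration to this payload is mechanical. The sketch below is illustrative only: `to_tensor_form` is a hypothetical helper, not something provided by the backend, and it assumes the old payload used the `multi_modal_data` layout shown at the top of this PR.

```python
import json


def to_tensor_form(old_payload):
    """Hypothetical helper: hoist images out of the serialized
    "multi_modal_data" JSON string into a top-level "image" list."""
    new_payload = {
        key: value
        for key, value in old_payload.items()
        if key != "multi_modal_data"
    }
    serialized = old_payload.get("multi_modal_data")
    if serialized is not None:
        images = json.loads(serialized).get("image", [])
        if isinstance(images, str):
            # A single image may have been passed as a bare string.
            images = [images]
        new_payload["image"] = images
    return new_payload


old_payload = {
    "text_input": "Please describe the images",
    "multi_modal_data": json.dumps({"image": ["b64_image1", "b64_image2"]}),
    "sampling_parameters": json.dumps({"max_tokens": 256, "top_p": 0.9}),
}
new_payload = to_tensor_form(old_payload)
```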

image1: cat (attached as cat.txt)

image2: cherry_blossom (attached as cherry_blossom.txt)

text output

"descirbe the images here. The image shows a white cat with dark gray stripes sitting on a floor surrounded by cherry blossom branches. The cat is facing the camera and the flowers are pink. The background is blurry but appears to be the interior of a room with wooden cabinets or shelves. The overall atmosphere of the image is one of serenity and tranquility, with the cat and flowers creating a peaceful scene. The lighting in the image is soft and natural, with the sun shining through the windows and casting a warm glow over the scene. The image captures a moment of stillness and calmness, inviting the viewer to pause and appreciate the beauty of nature. The focus of the image is on the cat and the flowers, with the background fading into the distance. The overall composition of the image is simple yet effective, drawing the viewer's attention to the main subjects. The image could be used as a wallpaper or a cover photo for a social media post, and it could also be used in a magazine or a book as an illustration. Overall, the image is a beautiful representation of the beauty of nature and the serenity of a peaceful moment. it could evoke feelings of calmness and tranquility in the viewer, and it could inspire a sense of appreciation for the natural world. It could also"

@kthui
Contributor

kthui commented Nov 5, 2024

Can we add a simple version check to enable or disable features depending on the installed vLLM version?

Some users may be on an older version that does not support multi-modal input yet, or may still want to receive the "best_of_request" metrics on an older version.

For example, the check can be as simple as:

...
from vllm.version import __version__ as _VLLM_VERSION
...
class TritonPythonModel:
    @staticmethod
    def auto_complete_config(auto_complete_model_config):
        ...
        if _VLLM_VERSION >= "0.6.3.post1":
            inputs.append({
                "name": "image",
                "data_type": "TYPE_STRING",
                "dims": [-1],  # can be multiple images as separate elements
                "optional": True,
            })
        ...
    ...
    async def generate(self, request):
        ...
        if _VLLM_VERSION >= "0.6.3.post1":
            image_input_tensor = ...
            ...
        ...
    ...

$ python3
Python 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> "0.6.3.post1" >= "0.5.3.post1"
True
>>> "0.6.3.post1" >= "0.5.5"
True
>>> "0.6.3.post1" >= "0.6.3"
True
>>> "0.6.3.post1" >= "0.6.3.post1"
True
>>> "0.6.3.post1" >= "0.6.4"
False
>>>
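
One caveat worth noting: the comparisons above are lexicographic string comparisons. They give the expected answers for the versions listed, but a two-digit component such as "0.6.10" would compare as older than "0.6.3.post1". A stdlib-only sketch of a numeric comparison follows; `parse_version` is a hypothetical helper that only understands `N.N.N` with an optional `.postM` suffix. If the `packaging` library is available in the environment (an assumption), `packaging.version.Version` handles the full version grammar and would be the more robust choice.

```python
import re


def parse_version(version):
    """Hypothetical helper: '0.6.3.post1' -> (0, 6, 3, 1).

    Only the numeric release and an optional '.postM' suffix are
    understood; other suffixes (dev, rc, local tags) are ignored.
    """
    match = re.match(r"^(\d+(?:\.\d+)*)(?:\.post(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    release = tuple(int(part) for part in match.group(1).split("."))
    # Pad short releases so '0.6' and '0.6.0' compare equal.
    release += (0,) * (3 - len(release))
    post = int(match.group(2)) if match.group(2) else 0
    return release + (post,)


MIN_MULTI_MODAL = parse_version("0.6.3.post1")

# Numeric comparison treats 0.6.10 as newer than 0.6.3.post1 ...
numeric_result = parse_version("0.6.10") >= MIN_MULTI_MODAL
# ... whereas plain string comparison does not.
string_result = "0.6.10" >= "0.6.3.post1"
print(numeric_result, string_result)
```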

@xiejibing
Author

xiejibing commented Nov 6, 2024

@kthui I have added the version checks. Please review the change, thanks.

@kthui
Contributor

kthui commented Nov 6, 2024

Hi @xiejibing, I am not able to find your signed CLA in our records. Can you send us the signed CLA following the instructions here? The signed CLA is required before we can merge your PR.

@xiejibing
Author

@kthui Thanks! I will sign the CLA as soon as possible.

Contributor

@rmccorm4 rmccorm4 left a comment

Thanks for the great contributions @xiejibing @yiheng ! 🚀

There is a pre-commit failure around linting, but I think we can just fix it right after merge in this case if needed. For future contributions, you can also run pre-commit install locally after checking out the repository to get pre-commit hooks while developing.

@kthui
Contributor

kthui commented Nov 15, 2024

Hi @xiejibing, any update on the CLA? We will be adding new changes into the backend that could require this PR to be rebased.

6 participants