Add support for Qwen2VL #10361
Conversation
There is a model called ShowUI. It is supposed to point to an element on the image so that you can control the computer with the mouse. But with your support it misses, and nothing works.

import subprocess
from PIL import Image, ImageDraw

def detect_and_mark_element(image_path, element, output_image_path):
    # Run the model to get the coordinates of the element
    command = (
        f"./llama-qwen2vl-cli -m ShowUI-2B/Qwen2-VL-2B-Instruct-F16.gguf "
        f"--mmproj ShowUI-2B/qwen2vl-vision.gguf --image \"{image_path}\" --temp 0 "
        f"-p \"<|im_start|>system\nBased on the screenshot, I give a text description and you give its corresponding location. The coordinate represents a clickable location [x, y] for an element, which is a relative coordinate on the screenshot, scaled from 0 to 1.<|im_end|>\n<|im_start|>user\n<|vision_start|><image><|vision_end|>{element}<|im_end|>\n<|im_start|>assistant\n\""
    )
    output = subprocess.check_output(command, shell=True)
    output = output.decode("utf-8").strip()
    # The last line of the output is the answer, e.g. "[0.42, 0.13]";
    # strip the square brackets and split the string into coordinates
    coordinates = output.splitlines()[-1][1:-1].split(", ")
    x, y = float(coordinates[0]), float(coordinates[1])
    # Open the image and get its dimensions
    img = Image.open(image_path)
    width, height = img.size
    # Convert the relative coordinates to absolute pixel coordinates
    x_abs = int(x * width)
    y_abs = int(y * height)
    # Draw a red circle on the detected element
    draw = ImageDraw.Draw(img)
    draw.ellipse([(x_abs - 5, y_abs - 5), (x_abs + 5, y_abs + 5)], fill=(255, 0, 0))
    # Save the output image
    img.save(output_image_path)

# Example usage:
detect_and_mark_element("screenshot.png", "Click on the address bar", "output.png")

Here is a link to the model: https://huggingface.co/showlab/ShowUI-2B
@barinov274 While ShowUI is built on top of Qwen2VL, it employs a different image processing workflow. Therefore, I believe adding support for ShowUI should be addressed in a separate PR or issue.
Too bad this fine addition is not being reviewed and sits around in a side branch :( What can we do to encourage the maintainers to have a look?
This model requires changes inside llama.cpp (new operator, new image preprocessing) and is not yet merged into the main branch. ollama apparently hasn't merged it on its own. You can only compile the llama-qwen2vl-cli utility from this PR and run the model on the command line so far. See ollama/ollama#6564 |
The rope changes are good and can be merged. But we have to add tests to test-backend-ops. Can be extracted in a separate PR also.
The n_pos_per_token in llama.cpp is a hack, which is not great. But it's good to bring it to attention. I think we can accept it for now and come up with something better when refactoring the llama_batch in the future.
I haven't looked at the changes in examples/llava in detail - this code will be completely reimplemented anyway when we start working on vision support in the llama library.
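To make the n_pos_per_token discussion concrete, here is a minimal sketch of the Qwen2-VL style M-RoPE position scheme, not the PR's code: each token carries a (temporal, height, width) triple instead of a single position id, with text tokens repeating the same value across components and vision tokens varying height/width with their location in the patch grid. The function name and arguments are hypothetical, for illustration only.

def mrope_positions(n_text_before, grid_h, grid_w, n_text_after):
    # One (temporal, height, width) triple per token instead of a single id.
    positions = []

    # Text tokens before the image: all components share one increasing position.
    for p in range(n_text_before):
        positions.append((p, p, p))

    # Vision tokens: temporal stays at the image's start position, while the
    # height/width components follow the patch's row/column in the grid.
    t = n_text_before
    for row in range(grid_h):
        for col in range(grid_w):
            positions.append((t, t + row, t + col))

    # Text after the image continues from the largest position used so far.
    nxt = max(max(p) for p in positions) + 1
    for p in range(n_text_after):
        positions.append((nxt + p, nxt + p, nxt + p))

    return positions

# e.g. 2 text tokens, a 2x3 patch grid, then 1 trailing text token
print(mrope_positions(2, 2, 3, 1))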
@@ -1445,6 +1447,22 @@ extern "C" {
            float beta_fast,
            float beta_slow);

    GGML_API struct ggml_tensor * ggml_mrope_ext(
What is the logic for the "m" in "mrope"? Is it coming from "multi-rope" to indicate it is multidimensional?
If so, a better name would be ggml_rope_multi or ggml_rope_nd.
    float rope_freq_scale_train;
    uint32_t n_ctx_orig_yarn;
    float rope_yarn_log_mul;
    std::array<int, 4> rope_mrope_sections;
Rename to rope_sections
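For context on what this array holds, here is a small illustrative sketch (the section sizes are assumed, not taken from the PR) of how the rotary dimensions could be partitioned so that each contiguous section is rotated with a different position component.

# Hypothetical sketch: the four section sizes split the rotary dimensions so
# that each chunk is rotated with a different position component.
sections = [16, 24, 24, 0]              # assumed sizes, not values from the PR
components = ["temporal", "height", "width", "extra"]

start = 0
for size, name in zip(sections, components):
    if size == 0:
        continue
    print(f"rotary dims [{start}, {start + size}) use the {name} position")
    start += size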
This PR implements the Qwen2VL model as requested at #9246.
The main changes include:
- llama_context.n_pos_per_token to support more than one position id per token
- examples/llava/qwen2vl-cli.cpp to handle Qwen2VL data preprocess steps & prompts

TODO:
Steps to convert the model and run inference
1. Download the official Qwen/Qwen2-VL-2B-Instruct checkpoint, then convert the LLM part of the model to GGUF format using convert_hf_to_gguf.py:
   python3 convert_hf_to_gguf.py "/path/to/Qwen2-VL-2B-Instruct/model-dir"
2. Convert the vision encoder to GGUF format with qwen2_vl_surgery.py:
3. Build the llama-qwen2vl-cli in the same way you would build llama-llava-cli.
4. Run the command. (It's recommended to resize the image to a resolution below 640x640, so it won't take forever to run on the CPU backend.) A hedged sketch of an equivalent invocation follows this list.
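Since the exact example command is not reproduced here, below is a minimal sketch of an equivalent invocation from Python, reusing only the flags that appear earlier in this thread (-m, --mmproj, --image, -p); the file names and the prompt are placeholders, not the PR's.

import subprocess

# Placeholder paths: the GGUF files produced by the two conversion steps above.
cmd = [
    "./llama-qwen2vl-cli",
    "-m", "Qwen2-VL-2B-Instruct/Qwen2-VL-2B-Instruct-F16.gguf",  # LLM part
    "--mmproj", "Qwen2-VL-2B-Instruct/qwen2vl-vision.gguf",      # vision encoder
    "--image", "demo.png",
    "-p", "Describe this image.",
]
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout)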
Future work: