
Add support for Qwen2VL #10361

Open · wants to merge 22 commits into master

Conversation

@HimariO commented Nov 17, 2024

This PR implements the Qwen2VL model as requested in #9246.
The main changes include:

  • Add m-RoPE and vision RoPE modes to the current RoPE op in the CPU and CUDA backends (see the sketch after this list)
  • Add llama_context.n_pos_per_token to support more than one position id per token
  • Add the Qwen2VL llama architecture
  • Add the Qwen2VL clip vision architecture
  • Add examples/llava/qwen2vl-cli.cpp to handle Qwen2VL's data preprocessing steps & prompts
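
For context, m-RoPE ("multimodal RoPE", per the Qwen2VL paper) assigns each token several position ids - e.g. temporal, height, and width - and rotates different sections of the embedding dimensions by different components. Below is a minimal illustrative sketch of how such positions can be assigned; the function name and layout are made up for illustration and are not this PR's actual code:

def build_mrope_positions(n_text_before, grid_h, grid_w, n_text_after):
    """Return one (temporal, height, width) position id triple per token.

    Text tokens keep all three components in lockstep; image patch tokens
    share one temporal position and index their patch row/column.
    """
    pos = []
    # leading text tokens: all three components advance together
    for i in range(n_text_before):
        pos.append((i, i, i))
    # image patch tokens: one temporal step for the whole image,
    # row/column offsets for the two spatial components
    t = n_text_before
    for h in range(grid_h):
        for w in range(grid_w):
            pos.append((t, t + h, t + w))
    # trailing text resumes after the largest position used so far
    start = max(max(p) for p in pos) + 1
    for i in range(n_text_after):
        pos.append((start + i, start + i, start + i))
    return pos

print(build_mrope_positions(2, 2, 2, 1))
# [(0, 0, 0), (1, 1, 1), (2, 2, 2), (2, 2, 3), (2, 3, 2), (2, 3, 3), (4, 4, 4)]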

TODO:

  • Fix CI errors caused by the linter and unit tests
  • Remove code and build configs used only for developing/debugging qwen2vl

Steps to convert the model and run inference

  1. Download the official Qwen/Qwen2-VL-2B-Instruct checkpoint, then convert the LLM part of the model to GGUF format using convert_hf_to_gguf.py:

    python3 convert_hf_to_gguf.py "/path/to/Qwen2-VL-2B-Instruct/model-dir"
  2. Convert the vision encoder to GGUF format with qwen2_vl_surgery.py:

    PYTHONPATH=$PYTHONPATH:$(pwd)/gguf-py python3 examples/llava/qwen2_vl_surgery.py "/path/to/Qwen2-VL-2B-Instruct/model-dir"
  3. Build llama-qwen2vl-cli the same way you would build llama-llava-cli.

  4. Run the command below. (It's recommended to resize the image to a resolution below 640x640, so inference won't take forever on the CPU backend; see the sketch after these steps.)

    ./llama-qwen2vl-cli -m qwen2-vl-decoder.gguf --mmproj qwen2vl-vision.gguf -p "Describe this image." --image "demo.jpg"
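
Since large images are what make the CPU backend slow, here is a small Pillow sketch for the resize recommended in step 4 (the file names are placeholders):

from PIL import Image

# Downscale to fit within 640x640 while keeping the aspect ratio;
# thumbnail() only ever shrinks an image, it never upscales.
img = Image.open("demo.jpg")
img.thumbnail((640, 640))
img.save("demo_small.jpg")

# Then run:
#   ./llama-qwen2vl-cli -m qwen2-vl-decoder.gguf --mmproj qwen2vl-vision.gguf \
#       -p "Describe this image." --image "demo_small.jpg"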

Future work:

  • Add MPS and Vulkan backend support

@github-actions bot added the labels build (Compilation issues), Nvidia GPU (Issues specific to Nvidia GPUs), examples, python (python script changes), and ggml (changes relating to the ggml tensor library for machine learning) on Nov 17, 2024
@HimariO HimariO marked this pull request as ready for review November 29, 2024 14:50
@barinov274
There is a model called ShowUI. It is supposed to point at an element in the image so that you can control the computer with the mouse. But with your implementation it misses the target, and nothing works.
Here is the code I wrote so you can test it yourself. As the image I used a screenshot with Firefox open.

import subprocess
from PIL import Image, ImageDraw

def detect_and_mark_element(image_path, element, output_image_path):
    # Run the model to get the coordinates of the element
    command = f"./llama-qwen2vl-cli -m ShowUI-2B/Qwen2-VL-2B-Instruct-F16.gguf --mmproj ShowUI-2B/qwen2vl-vision.gguf --image \"{image_path}\" --temp 0 -p \"<|im_start|>system\nBased on the screenshot, I give a text description and you give its corresponding location. The coordinate represents a clickable location [x, y] for an element, which is a relative coordinate on the screenshot, scaled from 0 to 1.<|im_end|>\n<|im_start|>user\n<|vision_start|><image><|vision_end|>{element}<|im_end|>\n<|im_start|>assistant\n\""
    output = subprocess.check_output(command, shell=True)
    output = output.decode("utf-8").strip()

    # Remove the square brackets and split the string into coordinates
    coordinates = output.splitlines()[-1][1:-1].split(", ")
    x, y = float(coordinates[0]), float(coordinates[1])

    # Open the image and get its dimensions
    img = Image.open(image_path)
    width, height = img.size

    # Convert the relative coordinates to absolute coordinates
    x_abs = int(x * width)
    y_abs = int(y * height)

    # Draw a red circle on the detected element
    draw = ImageDraw.Draw(img)
    draw.ellipse([(x_abs-5, y_abs-5), (x_abs+5, y_abs+5)], fill=(255, 0, 0))

    # Save the output image
    img.save(output_image_path)

# Example usage:
detect_and_mark_element("screenshot.png", "Click on the address bar", "output.png")

Here is a link to the model: https://huggingface.co/showlab/ShowUI-2B
And here you can test how it should work: https://huggingface.co/spaces/showlab/ShowUI

@HimariO (Author) commented Dec 3, 2024

@barinov274 While ShowUI is built on top of Qwen2VL, it employs a different image processing workflow. Therefore, I believe adding support for ShowUI should be addressed in a separate PR or issue.

@ghoulich commented Dec 4, 2024

Following your steps, I converted Qwen2-VL-7B-Instruct to GGUF successfully, then ran
ollama create Qwen2-VL-7B-Instruct
to import this model into ollama, and ollama list could show this model:
(screenshot)
But when I ran ollama run Qwen2-VL-7B-Instruct:latest, an error happened:
(screenshot)
The ollama version is 0.4.7.

@auriocus commented Dec 5, 2024

Too bad this fine addition is not being reviewed and sits around in a side branch :( What can we do to encourage the maintainers to have a look?

@auriocus commented Dec 5, 2024

Following your steps, I converted Qwen2-VL-7B-Instruct to GGUF successfully, then ran ollama create Qwen2-VL-7B-Instruct to import this model into ollama, and ollama list could show this model; but when I ran ollama run Qwen2-VL-7B-Instruct:latest, an error happened

This model requires changes inside llama.cpp (a new operator, new image preprocessing) that are not yet merged into the main branch, and ollama apparently hasn't merged them on its own either. So far you can only compile the llama-qwen2vl-cli utility from this PR and run the model on the command line. See ollama/ollama#6564

@ggerganov (Owner) left a comment

The rope changes are good and can be merged, but we have to add tests to test-backend-ops. This can also be extracted into a separate PR.

The n_pos_per_token in llama.cpp is a hack, which is not great, but it's good to bring it to attention. I think we can accept it for now and come up with something better when refactoring the llama_batch in the future.

I haven't looked at the changes in examples/llava in detail - this code will be completely reimplemented anyway when we start working on vision support in the llama library.
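
To make the n_pos_per_token remark concrete, here is a speculative sketch of what "more than one position id per token" implies for a flattened positions buffer; the component count and layout are assumptions for illustration, not the PR's actual memory layout:

n_pos_per_token = 3  # e.g. one id each for temporal/height/width (assumed count)
token_positions = [(0, 0, 0), (1, 1, 1), (2, 2, 3)]  # one tuple per token

# Flatten into a single buffer of n_tokens * n_pos_per_token entries,
# instead of the usual single position id per token.
flat_pos = [p for tup in token_positions for p in tup]
assert len(flat_pos) == len(token_positions) * n_pos_per_token
print(flat_pos)  # [0, 0, 0, 1, 1, 1, 2, 2, 3]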

@@ -1445,6 +1447,22 @@ extern "C" {
float beta_fast,
float beta_slow);

GGML_API struct ggml_tensor * ggml_mrope_ext(
ggerganov (Owner) commented on this line:

What is the logic for the "m" in "mrope"? Is it coming from "multi-rope" to indicate it is multidimensional?

If so, a better name would be ggml_rope_multi or ggml_rope_nd

float rope_freq_scale_train;
uint32_t n_ctx_orig_yarn;
float rope_yarn_log_mul;
std::array<int, 4> rope_mrope_sections;
ggerganov (Owner) commented on this line:

Rename to rope_sections
