Add support for Qwen2VL #10361
Conversation
There is a model called ShowUI. It is supposed to point to an element on the image so that you can control the computer with the mouse. But with your support it misses, and nothing works.

import subprocess
from PIL import Image, ImageDraw

def detect_and_mark_element(image_path, element, output_image_path):
    # Run the model to get the coordinates of the element
    command = (
        f"./llama-qwen2vl-cli -m ShowUI-2B/Qwen2-VL-2B-Instruct-F16.gguf "
        f"--mmproj ShowUI-2B/qwen2vl-vision.gguf --image \"{image_path}\" --temp 0 "
        f"-p \"<|im_start|>system\nBased on the screenshot, I give a text description and you give its corresponding location. The coordinate represents a clickable location [x, y] for an element, which is a relative coordinate on the screenshot, scaled from 0 to 1.<|im_end|>\n<|im_start|>user\n<|vision_start|><image><|vision_end|>{element}<|im_end|>\n<|im_start|>assistant\n\""
    )
    output = subprocess.check_output(command, shell=True)
    output = output.decode("utf-8").strip()
    # The last line of the output is the answer, e.g. "[0.42, 0.13]";
    # strip the square brackets and split the string into coordinates
    coordinates = output.splitlines()[-1][1:-1].split(", ")
    x, y = float(coordinates[0]), float(coordinates[1])
    # Open the image and get its dimensions
    img = Image.open(image_path)
    width, height = img.size
    # Convert the relative coordinates to absolute pixel coordinates
    x_abs = int(x * width)
    y_abs = int(y * height)
    # Draw a red circle on the detected element
    draw = ImageDraw.Draw(img)
    draw.ellipse([(x_abs - 5, y_abs - 5), (x_abs + 5, y_abs + 5)], fill=(255, 0, 0))
    # Save the output image
    img.save(output_image_path)

# Example usage:
detect_and_mark_element("screenshot.png", "Click on the address bar", "output.png")

Here is a link to the model: https://huggingface.co/showlab/ShowUI-2B
@barinov274 While ShowUI is built on top of Qwen2VL, it employs a different image processing workflow. Therefore, I believe adding support for ShowUI should be addressed in a separate PR or issue.
Too bad this fine addition is not being reviewed and sits around in a side branch :( What can we do to encourage the maintainers to have a look?
This model requires changes inside llama.cpp (new operator, new image preprocessing) and is not yet merged into the main branch. ollama apparently hasn't merged it on its own. You can only compile the llama-qwen2vl-cli utility from this PR and run the model on the command line so far. See ollama/ollama#6564 |
The rope changes are good and can be merged. But we have to add tests to test-backend-ops. Can be extracted in a separate PR also.
The n_pos_per_token in llama.cpp is a hack, which is not great. But it's good to bring it to attention. I think we can accept it for now and come up with something better when refactoring the llama_batch in the future.
I haven't looked at the changes in examples/llava in detail - this code will be completely reimplemented anyway when we start working on vision support in the llama library.
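To make the n_pos_per_token discussion concrete, here is a minimal sketch of the Qwen2-VL style M-RoPE position scheme, not the PR's code: each token carries a (temporal, height, width) triple instead of a single position id, with text tokens repeating the same value across components and vision tokens varying height/width with their location in the patch grid. The function name and arguments are hypothetical, for illustration only.

def mrope_positions(n_text_before, grid_h, grid_w, n_text_after):
    # One (temporal, height, width) triple per token instead of a single id.
    positions = []

    # Text tokens before the image: all components share one increasing position.
    for p in range(n_text_before):
        positions.append((p, p, p))

    # Vision tokens: temporal stays at the image's start position, while the
    # height/width components follow the patch's row/column in the grid.
    t = n_text_before
    for row in range(grid_h):
        for col in range(grid_w):
            positions.append((t, t + row, t + col))

    # Text after the image continues from the largest position used so far.
    nxt = max(max(p) for p in positions) + 1
    for p in range(n_text_after):
        positions.append((nxt + p, nxt + p, nxt + p))

    return positions

# e.g. 2 text tokens, a 2x3 patch grid, then 1 trailing text token
print(mrope_positions(2, 2, 3, 1))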
@@ -1445,6 +1447,22 @@ extern "C" {
            float beta_fast,
            float beta_slow);

    GGML_API struct ggml_tensor * ggml_mrope_ext(
What is the logic for the "m" in "mrope"? Is it coming from "multi-rope" to indicate it is multidimensional?
If so, a better name would be ggml_rope_multi or ggml_rope_nd.
    float rope_freq_scale_train;
    uint32_t n_ctx_orig_yarn;
    float rope_yarn_log_mul;
    std::array<int, 4> rope_mrope_sections;
Rename to rope_sections
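For context on what this array holds, here is a small illustrative sketch (the section sizes are assumed, not taken from the PR) of how the rotary dimensions could be partitioned so that each contiguous section is rotated with a different position component.

# Hypothetical sketch: the four section sizes split the rotary dimensions so
# that each chunk is rotated with a different position component.
sections = [16, 24, 24, 0]              # assumed sizes, not values from the PR
components = ["temporal", "height", "width", "extra"]

start = 0
for size, name in zip(sections, components):
    if size == 0:
        continue
    print(f"rotary dims [{start}, {start + size}) use the {name} position")
    start += size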
This PR implements the Qwen2VL model as requested at #9246.
The main changes include:
- llama_context.n_pos_per_token to support more than one position id per token
- examples/llava/qwen2vl-cli.cpp to handle Qwen2VL data preprocess steps & prompts

TODO:
Steps to convert the model and run inference
1. Download the official Qwen/Qwen2-VL-2B-Instruct checkpoint, then convert the LLM part of the model to GGUF format using convert_hf_to_gguf.py:
   python3 convert_hf_to_gguf.py "/path/to/Qwen2-VL-2B-Instruct/model-dir"
2. Convert the vision encoder to GGUF format with qwen2_vl_surgery.py:
3. Build the llama-qwen2vl-cli in the same way you would build llama-llava-cli.
4. Run the command. (It's recommended to resize the image to a resolution below 640x640, so it won't take forever to run on the CPU backend.) A hedged sketch of an equivalent invocation follows this list.
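Since the exact example command is not reproduced here, below is a minimal sketch of an equivalent invocation from Python, reusing only the flags that appear earlier in this thread (-m, --mmproj, --image, -p); the file names and the prompt are placeholders, not the PR's.

import subprocess

# Placeholder paths: the GGUF files produced by the two conversion steps above.
cmd = [
    "./llama-qwen2vl-cli",
    "-m", "Qwen2-VL-2B-Instruct/Qwen2-VL-2B-Instruct-F16.gguf",  # LLM part
    "--mmproj", "Qwen2-VL-2B-Instruct/qwen2vl-vision.gguf",      # vision encoder
    "--image", "demo.png",
    "-p", "Describe this image.",
]
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout)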
Future work: