This repository provides a simple Gradio UI for running Qwen2-VL-72B-AWQ, supporting both image and video inference.
Tested only in a Python venv on Pop!_OS/Ubuntu 22.04, and only with Qwen/Qwen2-VL-72B-Instruct-AWQ downloaded from https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct-AWQ/tree/main
If you run into any installation issues, check the official Qwen2-VL repo (https://github.com/QwenLM/Qwen2-VL) and the model's Hugging Face page (https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct-AWQ) for more info. I am not affiliated with Qwen in any way, shape, or form, and I don't have the time to help people install this; chances are you will find the answer(s) you are seeking there.
Note: On their Hugging Face page, the Qwen2-VL team advises the following:

> "We advise build from source with command `pip install git+https://github.com/huggingface/transformers`, or you might encounter the following error: `KeyError: 'qwen2_vl'`"

See https://github.com/QwenLM/Qwen2-VL for details.
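A quick way to sanity-check that your transformers build actually includes Qwen2-VL support (the class name below comes from the Qwen2-VL model card):

```python
# If this import fails, your transformers build predates Qwen2-VL support;
# reinstall from source: pip install git+https://github.com/huggingface/transformers
import transformers
from transformers import Qwen2VLForConditionalGeneration

print(transformers.__version__)
```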
Note: There is further room for inference-speed optimization, e.g. via DeepSpeed (see the sketch below). This repo is just a quick way to test out Qwen2-VL-72B.
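For the curious, a hypothetical sketch of what a DeepSpeed inference wrapper might look like. This is untested with Qwen2-VL or AWQ weights, and kernel injection may not support this architecture; treat it as a pointer to the DeepSpeed docs (https://www.deepspeed.ai/tutorials/inference-tutorial/), not a working recipe:

```python
# Hypothetical sketch only -- NOT tested with Qwen2-VL or AWQ weights.
import os
import torch
import deepspeed
from transformers import Qwen2VLForConditionalGeneration

model = Qwen2VLForConditionalGeneration.from_pretrained(
    os.environ["QWEN_MODEL_PATH"], torch_dtype=torch.float16
)
# Wrap the model with DeepSpeed's inference engine.
model = deepspeed.init_inference(model, dtype=torch.float16,
                                 replace_with_kernel_inject=True)
```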
- Python 3.8+
- CUDA-compatible GPU (recommended)
- Clone this repository:

  ```bash
  git clone https://github.com/Kaszebe/Large-Vision-Language-Model-UI.git
  cd Large-Vision-Language-Model-UI
  ```
- Create a virtual environment:

  ```bash
  python3 -m venv qwen_venv
  ```
- Activate the virtual environment:

  ```bash
  source qwen_venv/bin/activate
  ```
- Install the required packages:

  ```bash
  pip install transformers accelerate qwen-vl-utils gradio
  pip install flash-attn --no-build-isolation
  ```
- Download the Qwen2-VL-72B-AWQ model:
  - Visit the Hugging Face model page: https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct-AWQ/tree/main
  - Download the model files
  - Place the files in a directory on your local machine
  - Set the `QWEN_MODEL_PATH` environment variable:

    ```bash
    export QWEN_MODEL_PATH="/path/to/your/model"
    ```
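For context, `run_qwen_model.py` presumably reads this variable when loading the model. A minimal sketch of that pattern, based on the standard Qwen2-VL usage from the model card (the repo's actual script may differ in detail):

```python
# Minimal loading sketch -- the repo's run_qwen_model.py may differ.
import os
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

model_path = os.environ["QWEN_MODEL_PATH"]
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype="auto",   # the AWQ checkpoint carries its own quantization config
    device_map="auto",    # spread layers across available GPU(s)
    attn_implementation="flash_attention_2",  # presumably what --flash-attn2 toggles
)
processor = AutoProcessor.from_pretrained(model_path)
```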
- Activate the virtual environment (if not already activated):

  ```bash
  source qwen_venv/bin/activate
  ```
- Run the script:

  ```bash
  python run_qwen_model.py --flash-attn2
  ```
This will launch the Gradio interface, allowing you to interact with the Qwen2-VL-72B-AWQ model through a web browser. You can upload images or videos and input prompts to generate descriptions.
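For the curious, the UI boils down to something like the following sketch (simplified; the actual script also handles video uploads and exposes parameter sliders, and the `describe` function here is illustrative, not the repo's exact code):

```python
# Simplified sketch of the Gradio wiring; the actual script is more featureful.
import os
import gradio as gr
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_path = os.environ["QWEN_MODEL_PATH"]
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)

def describe(image, prompt):
    # Wrap the uploaded image and text prompt in Qwen2-VL's chat format.
    messages = [{"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": prompt},
    ]}]
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                       padding=True, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=512)
    # Strip the prompt tokens before decoding the generated continuation.
    trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0]

demo = gr.Interface(
    fn=describe,
    inputs=[gr.Image(type="pil"), gr.Textbox(label="Prompt")],
    outputs=gr.Textbox(label="Description"),
)
demo.launch()
```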
- Open the provided URL in your web browser.
- Upload an image or video.
- Enter a text prompt.
- Adjust generation parameters if needed (see the sketch after this list for how they typically map onto `generate()`).
- Click "Submit" to generate a description.
- The model performs best with a CUDA-compatible GPU.
- Processing time may vary depending on your hardware and the complexity of the input.