[Usage]: vllm can't run qwen 32B inference #193
Comments
@afierka-intel can you check this out? I remember you've experienced a similar bug in the weight-loading phase of large models (llama405b or mixtral8x7b) on HPU. It should be simple to check whether it's caused by the same bug.
Referring to the change in #55 could fix this bug; the same change is needed in Qwen at least, like this:
diff --git a/vllm/model_executor/models/qwen2.py b/vllm/model_executor/models/qwen2.py
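For context, below is a rough sketch of where such a change would land. This is an illustration only, not the actual diff from #55: the load_weights loop shown is an abbreviated, assumed rendering of the per-model weight-loading code that vLLM model files such as qwen2.py use, so names, imports, and details may differ from the fork's actual source.

```python
# Illustration only: an abbreviated, assumed excerpt of the per-model
# load_weights method found in vLLM model files such as qwen2.py.
# A fix like the one referenced from #55 (applied there to another model)
# would be mirrored inside this loop for Qwen2.
from typing import Iterable, Tuple

import torch

# Import path may differ between vLLM versions / the HPU fork.
from vllm.model_executor.model_loader.weight_utils import default_weight_loader


def load_weights(self, weights: Iterable[Tuple[str, torch.Tensor]]) -> None:
    # Checkpoint projections that are fused into a single vLLM parameter.
    stacked_params_mapping = [
        # (fused param name, checkpoint shard name, shard id)
        ("qkv_proj", "q_proj", "q"),
        ("qkv_proj", "k_proj", "k"),
        ("qkv_proj", "v_proj", "v"),
        ("gate_up_proj", "gate_proj", 0),
        ("gate_up_proj", "up_proj", 1),
    ]
    params_dict = dict(self.named_parameters(remove_duplicate=False))
    for name, loaded_weight in weights:
        for param_name, shard_name, shard_id in stacked_params_mapping:
            if shard_name not in name:
                continue
            # Map the checkpoint name onto the fused parameter and load
            # the corresponding shard.
            name = name.replace(shard_name, param_name)
            param = params_dict[name]
            param.weight_loader(param, loaded_weight, shard_id)
            break
        else:
            # Non-stacked parameters fall through to the default loader.
            param = params_dict[name]
            weight_loader = getattr(param, "weight_loader",
                                    default_weight_loader)
            weight_loader(param, loaded_weight)
```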
Your current environment
How would you like to use vllm
I'm trying to run vllm serve using Gaudi2 on Intel Developer Cloud. I have installed vllm-fork and I'm using the command below, and it seems to show an HPU OOM:
PT_HPU_LAZY_MODE=1 vllm serve Qwen/Qwen1.5-32B-Chat --dtype bfloat16 --block-size 128 --device hpu
I also tried Qwen 13B, and it works normally.
In addition, when I use optimum-habana to perform inference, it can generate text normally.