As far as I know, vLLM launches a profile_run (https://github.com/vllm-project/vllm/blob/main/vllm/worker/worker.py#L174) to measure the model's peak memory usage at the maximum sequence length. Looking forward to an answer from the community. Thanks.
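For context, here is a minimal sketch of the kind of memory budgeting such a profile run feeds into. This is illustrative pseudologic only, not vLLM's actual implementation; the function name, parameters, and the 0.9 default are made up for the example (vLLM exposes a similar knob as `gpu_memory_utilization`):

```python
def kv_cache_budget_bytes(total_gpu_bytes: int,
                          peak_profiled_bytes: int,
                          gpu_memory_utilization: float = 0.9) -> int:
    """Memory left over for the KV cache after a profiling pass.

    total_gpu_bytes: total device memory
    peak_profiled_bytes: peak usage measured by the profile run
        (weights + activations at the maximum sequence length)
    gpu_memory_utilization: fraction of the GPU the engine may use
    """
    budget = int(total_gpu_bytes * gpu_memory_utilization) - peak_profiled_bytes
    if budget <= 0:
        # The failure case: the model alone already exceeds the allowed budget.
        raise MemoryError("model peak usage exceeds the allowed GPU budget")
    return budget

# Example: 80 GiB GPU, 60 GiB measured peak, 90% utilization cap.
GiB = 1024 ** 3
print(kv_cache_budget_bytes(80 * GiB, 60 * GiB) // GiB)  # → 12
```

The point of profiling at the maximum sequence length is that activation memory scales with sequence length, so the measured peak is a worst case and everything left over can safely be handed to the KV cache.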
You should see an OOM if the
I don't think I quite got your question. The flow is kind of like this: `nvidia-smi`. Also, if there are not enough KV cache blocks, it will error at this point.
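To illustrate the "not enough KV cache blocks" failure mode, here is a hedged back-of-the-envelope calculation of how many cache blocks fit in a given budget. It uses the usual 2 × layers × heads × head_dim × dtype accounting for keys and values; the block size, model shape, and error message are example values for the sketch, not vLLM internals:

```python
def num_kv_cache_blocks(free_bytes: int,
                        num_layers: int,
                        num_heads: int,
                        head_dim: int,
                        block_size_tokens: int = 16,
                        dtype_bytes: int = 2) -> int:
    """How many fixed-size KV cache blocks fit into free_bytes."""
    # Each token stores one key and one value vector per layer.
    bytes_per_token = 2 * num_layers * num_heads * head_dim * dtype_bytes
    bytes_per_block = bytes_per_token * block_size_tokens
    blocks = free_bytes // bytes_per_block
    if blocks == 0:
        # Mirrors the error the reply mentions: no room for even one block.
        raise ValueError("not enough memory for any KV cache blocks")
    return blocks

# Example: 12 GiB free for the cache, a 7B-ish shape (32 layers,
# 32 heads, head_dim 128), fp16 (2 bytes per element).
GiB = 1024 ** 3
print(num_kv_cache_blocks(12 * GiB, 32, 32, 128))  # → 1536
```

With these example numbers each block holds 16 tokens and costs 8 MiB, so a model whose profiled peak leaves almost no free memory can end up with zero blocks, which is the error case described above.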