How does the profile_run work? #10110

zhuangqh · 2024-11-07T07:01:08Z

zhuangqh
Nov 7, 2024

As far as I know, vLLM launches a profile_run (https://github.com/vllm-project/vllm/blob/main/vllm/worker/worker.py#L174) to measure the model's peak memory usage at the maximum sequence length.
However, I don't see any OOM error when the GPU memory is not enough to fit the model_max_len context length.
How does profile_run avoid an OOM when running the model with dummy input?

Looking forward to getting an answer from the community. Thanks.

andoorve · 2024-11-07T18:47:34Z

andoorve
Nov 7, 2024
Collaborator

You should see an OOM if the profile_run takes up too much memory. I'm surprised that's not the case.

7 replies

zhuangqh Nov 7, 2024
Author

The error is raised by an assertion; it is not an OOM error.

INFO 11-05 03:14:02 gpu_executor.py:126] Maximum concurrency for 131072 tokens per request: 0.87x
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 390, in run_mp_engine
    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
  File "/usr/local/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 139, in from_engine_args
    return cls(
  File "/usr/local/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 78, in __init__
    self.engine = LLMEngine(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 348, in __init__
    self._initialize_kv_caches()
  File "/usr/local/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 496, in _initialize_kv_caches
    self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
  File "/usr/local/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 129, in initialize_cache
    self.driver_worker.initialize_cache(num_gpu_blocks, num_cpu_blocks)
  File "/usr/local/lib/python3.10/site-packages/vllm/worker/worker.py", line 262, in initialize_cache
    raise_if_cache_size_invalid(num_gpu_blocks,
  File "/usr/local/lib/python3.10/site-packages/vllm/worker/worker.py", line 492, in raise_if_cache_size_invalid
    raise ValueError(
ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (113920).
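
The numbers in this log are consistent with each other, and a small sketch makes the check easier to follow. The two token counts below are copied from the log; the block size of 16 tokens and the function name `check_cache_fits` are assumptions made for illustration (the real check lives in `raise_if_cache_size_invalid`, whose exact body is not shown here).

```python
# Values copied from the log above.
max_model_len = 131072      # model's max seq len
kv_cache_tokens = 113920    # tokens that fit in the KV cache

# Assumption: 16 tokens per KV cache block (not shown in the log).
block_size = 16
num_gpu_blocks = kv_cache_tokens // block_size

# "Maximum concurrency" is how many max-length requests fit in the cache.
max_concurrency = kv_cache_tokens / max_model_len
print(f"Maximum concurrency: {max_concurrency:.2f}x")


def check_cache_fits(num_gpu_blocks: int, block_size: int, max_model_len: int) -> int:
    """Hypothetical stand-in for the cache size check: raise if a single
    max-length request cannot fit in the KV cache, else return the capacity."""
    capacity = num_gpu_blocks * block_size
    if max_model_len > capacity:
        raise ValueError(
            f"The model's max seq len ({max_model_len}) is larger than the "
            f"maximum number of tokens that can be stored in KV cache ({capacity})."
        )
    return capacity


try:
    check_cache_fits(num_gpu_blocks, block_size, max_model_len)
    fits = True
except ValueError as err:
    fits = False
    print(err)
```

A concurrency below 1.0x means not even one request of the maximum length fits, which is why the failure is a ValueError raised after profiling rather than a CUDA OOM.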

andoorve Nov 7, 2024
Collaborator

Ah I see what you are running into! The profile_run avoids OOM because it is just trying to find the memory needed for "immediate" activations and therefore does not actually save any KV tokens.

zhuangqh Nov 8, 2024
Author

But it will actually allocate memory, right? How can it know it won't OOM before the allocation?
You can see from nvidia-smi that the GPU memory is almost exhausted.

andoorve Nov 8, 2024
Collaborator

I don't think I quite got your question. The flow is kind of like this:

1. profile_run: Doesn't actually allocate any permanent memory for the KV cache. It only allocates temporary activations.
2. KV cache allocation: Allocates the difference between total memory and the peak memory from the profile run to the KV cache tensors. Note that this is what makes nvidia-smi show the memory as almost maxed out. Also, if there are not enough KV cache blocks, it will error at this point.
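
The two steps above can be sketched as plain arithmetic. This is a minimal sketch under stated assumptions, not vLLM's actual code: the function names, the 80 GiB / 20 GiB figures, and the 2 MiB per-block size are all hypothetical, and the `utilization` parameter is only modeled on vLLM's `gpu_memory_utilization` setting.

```python
GiB = 1024 ** 3
MiB = 1024 ** 2


def kv_cache_budget(total_bytes: int, peak_bytes: int, utilization: float = 1.0) -> int:
    """Bytes left for the KV cache: usable memory minus the peak seen while profiling.

    `utilization` is an assumed knob for the fraction of GPU memory that may be used.
    """
    return max(int(total_bytes * utilization) - peak_bytes, 0)


def num_kv_blocks(budget_bytes: int, bytes_per_block: int) -> int:
    """Number of whole KV cache blocks that fit in the budget."""
    return budget_bytes // bytes_per_block


# Hypothetical numbers for illustration only.
total = 80 * GiB            # device memory
peak = 20 * GiB             # weights + temporary activations at the profiling peak
bytes_per_block = 2 * MiB   # depends on layers, heads, head size, dtype, block size

budget = kv_cache_budget(total, peak)
blocks = num_kv_blocks(budget, bytes_per_block)
print(budget // GiB, "GiB for KV cache ->", blocks, "blocks")
```

Because the whole remaining budget is handed to the KV cache in step 2, nvidia-smi shows the GPU as nearly full even though the profile run itself only needed the peak amount.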

Answer selected by zhuangqh

zhuangqh Nov 8, 2024
Author

Thanks for your explanation and patience. I understand #2 now.

"Will just allocate temporary activations."

I would be very grateful if you could point me to the code for this logic.

andoorve Nov 8, 2024
Collaborator

Is there something specific in the code you would like to see? They will just be these model files: https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama.py. The activations themselves will be allocated by PyTorch, and deallocated once we exit the relevant Python scope.
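
To illustrate why temporary activations raise the measured peak without staying resident, here is a toy model in pure Python. It is not PyTorch and not vLLM code: `ToyMemoryTracker`, `dummy_forward`, and all byte counts are hypothetical stand-ins for an allocator that frees a tensor when it goes out of scope.

```python
from contextlib import contextmanager


class ToyMemoryTracker:
    """Toy stand-in for a GPU allocator: tracks bytes in use and the peak."""

    def __init__(self, resident_bytes: int = 0):
        # resident_bytes models memory that stays allocated, e.g. model weights.
        self.in_use = resident_bytes
        self.peak = resident_bytes

    @contextmanager
    def temporary(self, nbytes: int):
        # Allocate on entry, like an activation created during a forward pass.
        self.in_use += nbytes
        self.peak = max(self.peak, self.in_use)
        try:
            yield
        finally:
            # Free on exit, like a tensor released when its scope ends.
            self.in_use -= nbytes


def dummy_forward(tracker: ToyMemoryTracker, activation_sizes: list) -> None:
    """Run a fake forward pass where each layer's activation is short-lived."""
    for nbytes in activation_sizes:
        with tracker.temporary(nbytes):
            pass  # the layer's computation would happen here


tracker = ToyMemoryTracker(resident_bytes=10_000)
dummy_forward(tracker, [4_000, 9_000, 6_000])
print("in use after profiling:", tracker.in_use)
print("peak during profiling:", tracker.peak)
```

After the fake forward pass only the resident bytes remain in use, while the recorded peak still reflects the largest temporary allocation; that peak is the quantity the profile run is after.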
