
1.17 documentation update #172

Merged
merged 29 commits into habana_main on Aug 14, 2024

Conversation

kzawora-intel
No description provided.

@kzawora-intel changed the title from 1,17 documentation update to 1.17 documentation update on Aug 12, 2024
@afierka-intel left a comment:
Multiple minor requests and a few unclear paragraphs. In general, I really enjoyed the documentation. Well done!

`HPU Graphs <https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_HPU_Graphs.html>`__ are currently the most performant execution method of vLLM on Intel Gaudi. When HPU Graphs are enabled, execution graphs will be traced (recorded) ahead of time (after performing warmup), to be later replayed during inference, significantly reducing host overheads. Recording can take large amounts of memory, which needs to be taken into account when allocating KV cache. Enabling HPU Graphs will impact the number of available KV cache blocks, but vLLM provides user-configurable variables to control memory management.
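For illustration (an editorial sketch, not part of the quoted documentation), a minimal offline-inference example showing the ``gpu_memory_utilization`` flag referred to above; the model name is just a placeholder:

```python
# Minimal sketch (hypothetical example): offline inference with vLLM, where
# gpu_memory_utilization sets the shared memory pool ("usable memory") used
# for both the KV cache and HPU Graph capture.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",  # placeholder model name
    gpu_memory_utilization=0.9,        # default value discussed below
)
params = SamplingParams(temperature=0.0, max_tokens=32)
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)
```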


When HPU Graphs are used, they share the common memory pool ("usable memory") with the KV cache, as determined by the ``gpu_memory_utilization`` flag (``0.9`` by default). The environment variable ``VLLM_GRAPH_RESERVED_MEM`` defines the ratio of that memory reserved for HPU Graphs capture. With its default value (``VLLM_GRAPH_RESERVED_MEM=0.4``), 40% of the usable memory is reserved for graph capture (later referred to as "usable graph memory"), and the remaining 60% is used for the KV cache. Before the KV cache is allocated, model weights are loaded onto the device and a forward pass of the model is executed on dummy data to estimate memory usage. Next, the KV cache is allocated, the model is warmed up, and HPU Graphs are captured. The environment variable ``VLLM_GRAPH_PROMPT_RATIO`` determines the ratio of usable graph memory reserved for prefill versus decode graphs. By default (``VLLM_GRAPH_PROMPT_RATIO=0.5``), both stages have equal memory constraints. A lower value reserves less usable graph memory for the prefill stage; for example, ``VLLM_GRAPH_PROMPT_RATIO=0.2`` reserves 20% of the usable graph memory for prefill graphs and 80% for decode graphs.
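For illustration (again an editorial sketch, not part of the quoted documentation), the split described above can be written as plain arithmetic. The helper below is hypothetical, not vLLM code, and it ignores the memory consumed by model weights and the profiling forward pass:

```python
# Illustrative arithmetic only (hypothetical helper). Uses the defaults quoted
# above: gpu_memory_utilization=0.9, VLLM_GRAPH_RESERVED_MEM=0.4,
# VLLM_GRAPH_PROMPT_RATIO=0.5. Model weights and the profiling run, which also
# consume part of the usable memory, are ignored for simplicity.
def split_device_memory(total_mem_gb: float,
                        gpu_memory_utilization: float = 0.9,
                        graph_reserved_mem: float = 0.4,
                        graph_prompt_ratio: float = 0.5) -> dict:
    usable = total_mem_gb * gpu_memory_utilization      # "usable memory" pool
    usable_graph = usable * graph_reserved_mem          # reserved for HPU Graph capture
    kv_cache = usable - usable_graph                    # remainder goes to KV cache
    prefill_graphs = usable_graph * graph_prompt_ratio  # prompt (prefill) graphs
    decode_graphs = usable_graph - prefill_graphs       # decode graphs
    return {"usable": usable, "usable_graph": usable_graph, "kv_cache": kv_cache,
            "prefill_graphs": prefill_graphs, "decode_graphs": decode_graphs}

# Example: 96 GB of device memory -> 86.4 GB usable, of which 34.56 GB is
# reserved for graph capture (17.28 GB prefill / 17.28 GB decode) and
# 51.84 GB goes to the KV cache.
print(split_device_memory(96.0))
```

In this example, lowering ``VLLM_GRAPH_PROMPT_RATIO`` to ``0.2`` would shift the graph budget to roughly 6.9 GB for prefill graphs and 27.6 GB for decode graphs.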


I would consider rephrasing this paragraph and presenting it as a table or similar. You could use the following columns for each env var: env_name, default, available_memory, used_memory, free_memory, description.

@kzawora-intel (Author) replied:

I'm not sure it makes sense to write this down as a table - this section describes, step by step, how the memory allocations look. The documentation for each env var is in the section below, if you want just that. The goal here is to describe the design with an example.

docs/source/getting_started/gaudi-installation.rst (outdated, resolved)
@mgawarkiewicz merged commit 6f047d8 into habana_main on Aug 14, 2024
13 checks passed
@kzawora-intel added the habana (Issues or PRs submitted by Habana Labs) label on Sep 5, 2024
Labels: habana (Issues or PRs submitted by Habana Labs)
5 participants