
1.17 documentation update #172

Merged
merged 29 commits into habana_main on Aug 14, 2024

Conversation

kzawora-intel
No description provided.

@kzawora-intel changed the title from 1,17 documentation update to 1.17 documentation update on Aug 12, 2024
@afierka-intel left a comment:
Multiple minor requests and a few unclear paragraphs. In general, I really enjoyed the documentation. Well done!

`HPU Graphs <https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_HPU_Graphs.html>`__ are currently the most performant execution method of vLLM on Intel Gaudi. When HPU Graphs are enabled, execution graphs will be traced (recorded) ahead of time (after performing warmup), to be later replayed during inference, significantly reducing host overheads. Recording can take large amounts of memory, which needs to be taken into account when allocating KV cache. Enabling HPU Graphs will impact the number of available KV cache blocks, but vLLM provides user-configurable variables to control memory management.
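For illustration (an editorial sketch, not part of the quoted documentation), a minimal offline-inference example showing the ``gpu_memory_utilization`` flag referred to above; the model name is just a placeholder:

```python
# Minimal sketch (hypothetical example): offline inference with vLLM, where
# gpu_memory_utilization sets the shared memory pool ("usable memory") used
# for both the KV cache and HPU Graph capture.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",  # placeholder model name
    gpu_memory_utilization=0.9,        # default value discussed below
)
params = SamplingParams(temperature=0.0, max_tokens=32)
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)
```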


When HPU Graphs are used, they share the common memory pool ("usable memory") with the KV cache, as determined by the ``gpu_memory_utilization`` flag (``0.9`` by default). The environment variable ``VLLM_GRAPH_RESERVED_MEM`` defines the ratio of that memory reserved for HPU Graphs capture. With its default value (``VLLM_GRAPH_RESERVED_MEM=0.4``), 40% of the usable memory is reserved for graph capture (later referred to as "usable graph memory"), and the remaining 60% is used for the KV cache. Before the KV cache is allocated, model weights are loaded onto the device and a forward pass of the model is executed on dummy data to estimate memory usage. Next, the KV cache is allocated, the model is warmed up, and HPU Graphs are captured. The environment variable ``VLLM_GRAPH_PROMPT_RATIO`` determines the ratio of usable graph memory reserved for prefill versus decode graphs. By default (``VLLM_GRAPH_PROMPT_RATIO=0.5``), both stages have equal memory constraints. A lower value reserves less usable graph memory for the prefill stage; for example, ``VLLM_GRAPH_PROMPT_RATIO=0.2`` reserves 20% of the usable graph memory for prefill graphs and 80% for decode graphs.
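For illustration (again an editorial sketch, not part of the quoted documentation), the split described above can be written as plain arithmetic. The helper below is hypothetical, not vLLM code, and it ignores the memory consumed by model weights and the profiling forward pass:

```python
# Illustrative arithmetic only (hypothetical helper). Uses the defaults quoted
# above: gpu_memory_utilization=0.9, VLLM_GRAPH_RESERVED_MEM=0.4,
# VLLM_GRAPH_PROMPT_RATIO=0.5. Model weights and the profiling run, which also
# consume part of the usable memory, are ignored for simplicity.
def split_device_memory(total_mem_gb: float,
                        gpu_memory_utilization: float = 0.9,
                        graph_reserved_mem: float = 0.4,
                        graph_prompt_ratio: float = 0.5) -> dict:
    usable = total_mem_gb * gpu_memory_utilization      # "usable memory" pool
    usable_graph = usable * graph_reserved_mem          # reserved for HPU Graph capture
    kv_cache = usable - usable_graph                    # remainder goes to KV cache
    prefill_graphs = usable_graph * graph_prompt_ratio  # prompt (prefill) graphs
    decode_graphs = usable_graph - prefill_graphs       # decode graphs
    return {"usable": usable, "usable_graph": usable_graph, "kv_cache": kv_cache,
            "prefill_graphs": prefill_graphs, "decode_graphs": decode_graphs}

# Example: 96 GB of device memory -> 86.4 GB usable, of which 34.56 GB is
# reserved for graph capture (17.28 GB prefill / 17.28 GB decode) and
# 51.84 GB goes to the KV cache.
print(split_device_memory(96.0))
```

In this example, lowering ``VLLM_GRAPH_PROMPT_RATIO`` to ``0.2`` would shift the graph budget to roughly 6.9 GB for prefill graphs and 27.6 GB for decode graphs.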


I would consider rephrasing this paragraph and presenting it as a table or similar. You could use the following columns for each env var: env_name, default, available_memory, used_memory, free_memory, description.

@kzawora-intel (Author) replied:

I'm not sure it makes sense to write this down as a table - this section describes, step by step, how the memory allocations look. The documentation for each env var is in the section below, if you want just that. The goal here is to describe the design with an example.

docs/source/getting_started/gaudi-installation.rst (outdated, resolved)
@mgawarkiewicz merged commit 6f047d8 into habana_main on Aug 14, 2024
13 checks passed
@kzawora-intel added the habana (Issues or PRs submitted by Habana Labs) label on Sep 5, 2024
Labels: habana (Issues or PRs submitted by Habana Labs)
5 participants