
Commit

Merge remote-tracking branch 'origin/habana_main' into private/kzawora/pruned_habana_main
kzawora-intel committed Sep 24, 2024
2 parents a000e62 + 4eb9809 commit c1232e9
Showing 2 changed files with 7 additions and 7 deletions.
12 changes: 6 additions & 6 deletions docs/source/getting_started/gaudi-installation.rst
@@ -138,10 +138,10 @@ Gaudi2 devices. Configurations that are not listed may or may not work.
with tensor parallelism on 2x HPU, BF16 datatype with random or greedy sampling

Performance Tuning
================
==================

Execution modes
------------
---------------

vLLM for HPU currently supports four execution modes, depending on the selected HPU PyTorch Bridge backend (set via the ``PT_HPU_LAZY_MODE`` environment variable) and the ``--enforce-eager`` flag.
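
As a rough, hedged illustration of how these two knobs are driven from Python (the model name and the mode value below are placeholders, not a recommendation)::

    import os

    # PT_HPU_LAZY_MODE selects the HPU PyTorch Bridge backend and must be set
    # before vLLM (and PyTorch) are imported; "1" selects the lazy backend.
    os.environ.setdefault("PT_HPU_LAZY_MODE", "1")

    from vllm import LLM

    # enforce_eager=True mirrors the --enforce-eager CLI flag and keeps execution
    # in plain eager/lazy mode instead of capturing graphs.
    llm = LLM(model="meta-llama/Llama-2-7b-hf", enforce_eager=True)
    print(llm.generate("Hello, Gaudi!")[0].outputs[0].text)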

@@ -170,7 +170,7 @@ Currently in vLLM for HPU we support four execution modes, depending on selected


Bucketing mechanism
------------
-------------------

Intel Gaudi accelerators work best when operating on models with fixed tensor shapes. `Intel Gaudi Graph Compiler <https://docs.habana.ai/en/latest/Gaudi_Overview/Intel_Gaudi_Software_Suite.html#graph-compiler-and-runtime>`__ is responsible for generating optimized binary code that implements the given model topology on Gaudi. In its default configuration, the produced binary code may be heavily dependent on input and output tensor shapes, and can require graph recompilation when encountering differently shaped tensors within the same topology. While the resulting binaries utilize Gaudi efficiently, the compilation itself may introduce a noticeable overhead in end-to-end execution.
In a dynamic inference serving scenario, there is a need to minimize the number of graph compilations and reduce the risk of graph compilation occurring during server runtime. Currently, this is achieved by "bucketing" the model's forward pass across two dimensions - ``batch_size`` and ``sequence_length``.
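
A minimal sketch of the idea, assuming simple evenly spaced buckets (illustrative only, not vLLM's actual bucketing code): incoming shapes are padded up to the nearest configured bucket, so only a small, fixed set of tensor shapes ever reaches the graph compiler::

    def make_buckets(minimum: int, step: int, maximum: int) -> list[int]:
        """Generate candidate sizes from ``minimum`` to ``maximum`` in increments of ``step``."""
        return list(range(minimum, maximum + 1, step))

    def round_up_to_bucket(value: int, buckets: list[int]) -> int:
        """Pick the smallest bucket that can hold ``value`` (fall back to the largest)."""
        for bucket in buckets:
            if value <= bucket:
                return bucket
        return buckets[-1]

    batch_buckets = make_buckets(32, 32, 256)     # hypothetical batch_size buckets
    seq_buckets = make_buckets(128, 128, 1024)    # hypothetical sequence_length buckets

    # A request with batch_size=3 and sequence_length=300 is padded to (32, 384),
    # reusing a graph that was already compiled for that bucket during warmup.
    print(round_up_to_bucket(3, batch_buckets), round_up_to_bucket(300, seq_buckets))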
@@ -243,7 +243,7 @@ This example uses the same buckets as in *Bucketing mechanism* section. Each out
Compiling all the buckets might take some time and can be turned off with the ``VLLM_SKIP_WARMUP=true`` environment variable. Keep in mind that if you do so, you may face graph compilations when executing a given bucket for the first time. Disabling warmup is fine for development, but it is highly recommended to enable it in deployment.
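
For instance, a development run could skip warmup as follows (the model name is a placeholder); expect one-off graph compilations, and the latency spikes they cause, the first time each bucket is executed::

    import os

    os.environ["VLLM_SKIP_WARMUP"] = "true"  # development only; keep warmup enabled in deployment

    from vllm import LLM

    llm = LLM(model="meta-llama/Llama-2-7b-hf")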

HPU Graph capture
------------
-----------------

`HPU Graphs <https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_HPU_Graphs.html>`__ are currently the most performant execution method of vLLM on Intel Gaudi. When HPU Graphs are enabled, execution graphs will be traced (recorded) ahead of time (after performing warmup), to be later replayed during inference, significantly reducing host overheads. Recording can take large amounts of memory, which needs to be taken into account when allocating KV cache. Enabling HPU Graphs will impact the number of available KV cache blocks, but vLLM provides user-configurable variables to control memory management.
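
A hedged sketch of the trade-off: with HPU Graphs left enabled (i.e. ``--enforce-eager`` not set), leaving headroom via ``gpu_memory_utilization`` is one way to keep room for both graph capture and the KV cache; the model name and the 0.9 value below are illustrative, and the HPU-specific memory knobs mentioned above are covered in the surrounding documentation::

    from vllm import LLM

    llm = LLM(
        model="meta-llama/Llama-2-7b-hf",   # placeholder model
        enforce_eager=False,                # keep HPU Graph capture enabled
        gpu_memory_utilization=0.9,         # illustrative value: leave headroom for graph capture
    )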

@@ -307,7 +307,7 @@ Each described step is logged by vLLM server, as follows (negative values corres
Recommended vLLM Parameters
------------
---------------------------

- We recommend running inference on Gaudi 2 with ``block_size`` of 128
for BF16 data type. Using default values (16, 32) might lead to
@@ -319,7 +319,7 @@ Recommended vLLM Parameters
If you encounter out-of-memory issues, see troubleshooting section.
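
As a worked example of the ``block_size`` recommendation above (the model name is a placeholder; the other values follow this section's suggestions)::

    from vllm import LLM

    llm = LLM(
        model="meta-llama/Llama-2-7b-hf",
        dtype="bfloat16",
        block_size=128,   # recommended on Gaudi 2 over the defaults (16, 32)
    )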

Environment variables
------------
---------------------

**Diagnostic and profiling knobs:**

2 changes: 1 addition & 1 deletion vllm/model_executor/models/qwen2.py
@@ -46,7 +46,7 @@
from vllm.model_executor.model_loader.weight_utils import (
default_weight_loader, maybe_remap_kv_scale_name)
from vllm.model_executor.sampling_metadata import SamplingMetadata
from vllm.platform import current_platform
from vllm.platforms import current_platform
from vllm.sequence import IntermediateTensors

from .interfaces import SupportsLoRA
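The one-line fix above points the import at ``vllm.platforms`` (plural), the package that actually exposes ``current_platform``. Below is a hedged sketch of how model code can branch on the detected platform; the ``is_hpu`` helper is assumed and may not exist in every vLLM version, hence the guarded lookup::

    from vllm.platforms import current_platform

    # Branch on the detected accelerator; fall back gracefully if the helper is absent.
    if getattr(current_platform, "is_hpu", lambda: False)():
        pass  # HPU-specific path (e.g., Gaudi-friendly kernels or tensor layouts)
    else:
        pass  # default path for other platforms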

