Make repetition penalty faster #442
base: habana_main
Conversation
Use masked_fill instead of boolean indexing; add paddings to prompt tokens and output tokens to reduce recompilation.
655018e to ec38d67
        max_len_align=max_len_align)
tensor = torch.from_numpy(padded_x).to(device)
if pin_memory:
The if check is not needed, since this method is only called for HPU.
Hi Michal, I removed the device check logic here and kept the pin_memory check. This way, the method behaves exactly the same as the unaligned version.
if not current_platform.is_hpu():
    tensor = tensor.pin_memory()
else:
    tensor = tensor.pin_memory("hpu")
Can be removed, as it won't be called now.
Hi Michal, make_tensor_with_pad is still called from other places; it is only replaced by make_tensor_with_pad_align in the repetition penalty path. So I think we still need the check here.
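For context, here is a minimal sketch of what an aligned-padding helper in the spirit of make_tensor_with_pad_align could look like. The signature, the default alignment of 1024, and the int64 dtype are illustrative assumptions, not the actual implementation; only the function name and the max_len_align argument appear in the diff above.

import math
from typing import List
import numpy as np
import torch

def make_tensor_with_pad_align(x: List[List[int]],
                               pad: int,
                               device: str,
                               max_len_align: int = 1024,
                               pin_memory: bool = False) -> torch.Tensor:
    # Round the padded length up to a multiple of max_len_align so the
    # tensor shape only changes when a whole new bucket is reached, not
    # every time the longest sequence grows by one token. Stable shapes
    # mean far fewer graph recompilations on HPU.
    max_len = max((len(row) for row in x), default=0)
    padded_len = max(max_len_align,
                     math.ceil(max_len / max_len_align) * max_len_align)
    padded_x = np.full((len(x), padded_len), pad, dtype=np.int64)
    for i, row in enumerate(x):
        padded_x[i, :len(row)] = row
    tensor = torch.from_numpy(padded_x).to(device)
    if pin_memory:
        # The real helper additionally picks the pinning target per
        # platform (see the is_hpu check in the diff above).
        tensor = tensor.pin_memory()
    return tensor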
This PR fixes the very slow sampling process when repetition penalty is set.
The fix includes:
- using masked_fill instead of boolean indexing when applying the repetition penalty (a sketch of the idea follows this list);
- padding the prompt tokens and output tokens to aligned lengths (via make_tensor_with_pad_align) to reduce recompilation.
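To illustrate the first point, here is a minimal sketch, not the exact code from this PR, of applying a repetition penalty with static-shape element-wise ops instead of boolean indexing. The function and argument names are hypothetical; seen_mask stands for the combined prompt-token and output-token mask.

import torch

def apply_repetition_penalty(logits: torch.Tensor,
                             seen_mask: torch.Tensor,
                             penalty: float) -> torch.Tensor:
    # logits:    [batch, vocab] float tensor
    # seen_mask: [batch, vocab] bool tensor, True where the token appeared
    #
    # Boolean indexing (logits[seen_mask]) produces a tensor whose shape
    # depends on how many tokens were seen, which forces dynamic shapes
    # and recompilation on HPU. Element-wise masked ops keep every shape
    # static, so one compiled graph is reused across decoding steps.
    penalized = torch.where(logits > 0, logits / penalty, logits * penalty)
    return torch.where(seen_mask, penalized, logits)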
Before the fix:
SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.06, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=True, max_tokens=1024, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), guided_decoding=None
Warming up...
Profiling iterations: 100%|5/5 [03:24<00:00, 40.99s/it]
Avg latency: 40.98862759781768 seconds
10% percentile latency: 11.699748958216514 seconds
25% percentile latency: 11.73845003999304 seconds
50% percentile latency: 11.801458386995364 seconds
75% percentile latency: 11.861465670051984 seconds
90% percentile latency: 99.46527566103033 seconds
99% percentile latency: 152.02756165561732 seconds
After the fix:
SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.06, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=True, max_tokens=1024, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), guided_decoding=None
Warming up...
Profiling iterations: 100%| 5/5 [00:57<00:00, 11.59s/it]
Avg latency: 11.58703240059549 seconds
10% percentile latency: 11.444069900200702 seconds
25% percentile latency: 11.511425047006924 seconds
50% percentile latency: 11.525146245025098 seconds
75% percentile latency: 11.556680046953261 seconds
90% percentile latency: 11.788318535778672 seconds
99% percentile latency: 11.927301629073918 seconds
The test script is at: https://github.com/ccrhx4/huanxing.vllm-fork/blob/slow_repetition_penalty/benchmarks/reproduce.sh
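For reference, a hypothetical minimal reproduction in Python; the linked reproduce.sh is the authoritative script. The model name and prompt below are placeholders, and the sampling parameters mirror the SamplingParams shown above.

from vllm import LLM, SamplingParams

# Placeholder model; the actual benchmark model is set in reproduce.sh.
llm = LLM(model="<model-under-test>")

params = SamplingParams(
    n=1,
    repetition_penalty=1.06,  # the penalty that triggered the slow path
    temperature=1.0,
    top_p=1.0,
    ignore_eos=True,
    max_tokens=1024,
)

outputs = llm.generate(["<benchmark prompt>"], params)
print(outputs[0].outputs[0].text)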