
Fix guided sampling with outlines #226

Conversation

tae-su-kim

This is a rebase of PR #153 onto habana_main due to the deprecation of habana_next.

The current habana_main already includes the guided-decoding code from vLLM, and the feature is exposed through the OpenAI API endpoint. However, guided decoding currently fails with the following error:

```
...
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1535, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1585, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/workspace/codes/vllm/model_executor/layers/sampler.py", line 138, in forward
    sample_results, maybe_sampled_tokens_tensor = _sample(
  File "/workspace/codes/vllm/model_executor/layers/sampler.py", line 711, in _sample
    return _sample_with_torch(
  File "/workspace/codes/vllm/model_executor/layers/sampler.py", line 592, in _sample_with_torch
    sample_results = _greedy_sample(seq_groups, greedy_samples)
  File "/workspace/codes/vllm/model_executor/layers/sampler.py", line 336, in _greedy_sample
    samples_lst = samples.tolist()
RuntimeError: synNodeCreateWithId failed for node: strided_insert with synStatus 1 [Invalid argument].
```

This PR proposes using masked_fill rather than _add for the masking step of guided decoding; a sketch of the idea follows. With this change, the OpenAI endpoint supports guided decoding, as the example request and output below show.
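As a minimal sketch of the masking change (illustrative names, not the PR's actual diff), assuming a 1-D logits tensor and the allowed token ids produced by the outlines FSM, the additive -inf bias can be replaced with a boolean mask and masked_fill:

```python
import math

import torch


def mask_logits(logits: torch.Tensor, allowed_token_ids: list) -> torch.Tensor:
    # Boolean mask: True for every token the grammar FSM disallows.
    disallowed = torch.ones_like(logits, dtype=torch.bool)
    disallowed[allowed_token_ids] = False
    # masked_fill sidesteps the additive -inf bias tensor that triggered
    # the strided_insert failure on HPU.
    return logits.masked_fill(disallowed, -math.inf)


# Example: vocabulary of 8 tokens, grammar allows only ids 2 and 5.
logits = torch.randn(8)
print(mask_logits(logits, [2, 5]))
```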

Input:

```python
# Values for best_of and use_beam_search are illustrative.
best_of = 1
use_beam_search = False

payload = {
    "model": "/models/Meta-Llama-3-8B-Instruct",
    "messages": [
        {"role": "user", "content": "reply negatively."}
    ],
    "best_of": best_of,
    "use_beam_search": use_beam_search,
    "temperature": 0.0,
    "top_p": 1.0,
    "guided_regex": "[Pp]ositive format |[Nn]egative format",
}
```
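To reproduce the example, the payload can be POSTed to the OpenAI-compatible chat completions route (host and port below assume a local vLLM server and are illustrative):

```python
import requests

# Host/port assume a vLLM OpenAI-compatible server running locally.
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json=payload,
)
print(response.json())
```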

Output:

```
{'id': 'cmpl-f3e792eb0197492a8d7eec4bb9916936', 'object': 'chat.completion', 'created': 1722847036, 'model': '/models/Meta-Llama-3-8B-Instruct', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': 'Negative format'}, 'logprobs': None, 'finish_reason': 'stop', 'stop_reason': None}], 'usage': {'prompt_tokens': 14, 'total_tokens': 21, 'completion_tokens': 7}}
```


tae-su-kim commented Sep 9, 2024

We fixed the abnormal latency overhead with commit 6d57c18.

A rough benchmark is as follows:

| Version | Prefill Latency | Decode Latency |
| --- | --- | --- |
| w/o guided_decode | 53.1s | 962.6s |
| commit 4bea4e3 | 156.9s | 1322.5s |
| commit 6d57c18 | 53.2s | 1149.5s |

Setup: llama-3-8b, greedy sampling, max_num_seqs 256, 1k requests, QPS -1
- prefill_latency: latency(1k random input tokens / 1 output token)
- decode_latency: latency(1k random input tokens / 1k output tokens without EOS) - prefill_latency

```diff
@@ -127,7 +127,7 @@ def __init__(self, schema: Union[str, Dict, BaseModel],
 class CFGLogitsProcessor(BaseLogitsProcessor):
 
     @classmethod
-    @cache()
+    @lru_cache(maxsize=32)
```
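For context on the change above: unlike an unbounded cache decorator, functools.lru_cache(maxsize=32) bounds how many compiled grammars are retained. A minimal, self-contained illustration (the class and method are hypothetical stand-ins, not the PR's code):

```python
from functools import lru_cache


class GrammarCompiler:

    @classmethod
    @lru_cache(maxsize=32)  # keeps at most 32 results, evicting least-recently-used
    def compile(cls, grammar: str) -> str:
        # Stand-in for an expensive FSM/grammar compilation step.
        print(f"compiling: {grammar}")
        return f"compiled({grammar})"


GrammarCompiler.compile("start: WORD")  # compiles
GrammarCompiler.compile("start: WORD")  # cache hit; no recompilation
```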

A reviewer commented on this diff:

Ruff static analysis found that `cache`, imported at the top of the file, is now unused; please remove the import.

@tae-su-kim (Author) replied:

I am benchmarking the performance of .add versus .masked_fill to determine which has better throughput. I'll resolve the conflict based on the benchmark results and fix the Ruff issue before merging. I'll let you know once it's completed. Apologies for the delayed response!
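A minimal CPU-only sketch of such a micro-benchmark (vocabulary size, mask density, and iteration counts are arbitrary; a real comparison on Gaudi would need device placement and synchronization via habana_frameworks, omitted here):

```python
import math
import time

import torch

vocab, iters = 128_000, 1_000
logits = torch.randn(vocab)
allowed = torch.randint(0, vocab, (64,))  # token ids permitted by the grammar

# .add variant: additive bias of -inf on disallowed tokens.
bias = torch.full((vocab,), -math.inf)
bias[allowed] = 0.0

# .masked_fill variant: boolean mask of disallowed tokens.
mask = torch.ones(vocab, dtype=torch.bool)
mask[allowed] = False

t0 = time.perf_counter()
for _ in range(iters):
    _ = logits.add(bias)
t1 = time.perf_counter()
for _ in range(iters):
    _ = logits.masked_fill(mask, -math.inf)
t2 = time.perf_counter()

print(f".add: {t1 - t0:.3f}s  .masked_fill: {t2 - t1:.3f}s")
```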

@michalkuligowski added the external (Issues or PRs submitted by external users) label Sep 20, 2024
@tae-su-kim (Author)

@michalkuligowski I removed the import and resolved the conflict based on the benchmark results. Both the .add-based and .masked_fill-based implementations significantly degrade decode throughput, so I opted to stick with the current code. If any optimizations turn out to be possible, I will open another PR.

@michalkuligowski

Ruff fails on unsorted imports. BTW, is an outlines version bump required now after these changes?

@tae-su-kim (Author)

There are some marginal throughput differences, but I think most of the updates in this PR are already in habana_main. I will close this PR for now. Thank you!

@tae-su-kim closed this Sep 25, 2024