-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix guided sampling with outlines #226
Fix guided sampling with outlines #226
Conversation
We fixed abnormal latency overhead with commit 6d57c18. Rough benchmark is as follows:
|
@@ -127,7 +127,7 @@ def __init__(self, schema: Union[str, Dict, BaseModel], | |||
class CFGLogitsProcessor(BaseLogitsProcessor): | |||
|
|||
@classmethod | |||
@cache() | |||
@lru_cache(maxsize=32) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ruff static analysis found that "cache" imported at the top of file is now not used, please remove the import
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am benchmarking the performance of .add
versus .masked_fill
to determine which has better throughput. I'll resolve the conflict based on the benchmark and fix the ruff issue before merging. I'll let you know once it's completed. Apologies for the delayed response!
@michalkuligowski I removed the import and resolved the conflict based on benchmark results. Both |
Ruff fails on unsorted imports. BTW is outlines version bump required now after those changes? |
There are some marginal throughput difference, but I think most of the updates in this PR is already here. I will close this PR for now. Thank you! |
This is a rebase of PR #153 to habana_main due to the deprecation of habana_next.
Current habana_main includes guided decoding related code from vllm, and the feature is already there in the openAI api endpoint. However, guided decoding currently fails to run with following error:
This PR suggests to use masked_fill rather than _add for the masking process of guided decode. With this PR, openai endpoint supports guided decoding. For example,
Input:
Output: