HPU: offload logits processing to CPU #358
Conversation
@tae-su-kim, @huijjj Note that when cherry-picking this to v0.8, use vllm/model_executor/guided_decoding/outlines_logits_processors.py from this PR and discard any previous modifications, as git by default will leave with_mark_steps and the other changes in place (and calling it for every sample in a batch hurts performance).
@madamczykhabana Hi, sorry for the delayed response. We are trying to reproduce the numbers on both our v0.8 branch and habana_main, and the current PR seems to fail on habana_main. It collides with commit 7c7714d, where MQLLMEngine was introduced.
It seems like the new multiprocessing engine pickles logits processor information using cloudpickle, but cloudpickle is unable to pickle functions decorated with lru_cache (cloudpipe/cloudpickle#178). I can confirm that this PR works in our v0.8 branch (based on vllm-fork v0.5.3) and improves e2e throughput as follows:
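For reference, a minimal reproduction of that pickling failure, assuming the behaviour reported in cloudpipe/cloudpickle#178 (the function names here are illustrative, not taken from the PR):

import cloudpickle
from functools import lru_cache

def build_logits_processor():
    # lru_cache wraps the function in a C-level _lru_cache_wrapper; because the
    # wrapped function is created dynamically it cannot be pickled by reference,
    # and cloudpickle has no by-value reducer for the wrapper type.
    @lru_cache(maxsize=128)
    def allowed_tokens(state):
        return (state,)
    return allowed_tokens

processor = build_logits_processor()
try:
    cloudpickle.dumps(processor)
except Exception as exc:  # typically "cannot pickle '_lru_cache_wrapper' object"
    print(f"pickling failed: {exc}")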
Force-pushed from e0c5216 to b02d483.
@tae-su-kim thanks for the info. I've just pushed a workaround for the pickling error. Please check if it helps on your end.
Force-pushed from b02d483 to 60305f7.
@madamczykhabana Thanks for the prompt update! I will benchmark with multiprocessing again. Can you elaborate more on the discrepancy in the number of generated tokens? On our side, we observe a similar number of tokens: e.g. 952,171 tokens from A100 and 959,993 tokens from Gaudi-2 on the 1K/1K dynamic benchmark with the json guide. We can share our setup if you want.
@tae-su-kim Sorry for the delay. The token discrepancy appeared when I was comparing guided-json to non-guided-json. The PR has been rebased and it's ready for review.
@@ -30,11 +30,48 @@
from transformers import PreTrainedTokenizerBase


# Unfortunately we cannot use lru_cache as it breaks pickling
# so we use a simpler implementation
def _cached(fn):
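For context, a plausible full shape of the _cached helper, reconstructed from the review comments below (a sketch only; the actual PR code and its line numbers may differ):

def _cached(fn):
    cache = {}

    def cached_fn(*args):
        # Plain dict cache: unlike lru_cache it survives cloudpickle, but it is
        # unbounded, which is what the review comments below discuss.
        if args in cache:
            result = cache[args]
        else:
            result = fn(*args)
            cache[args] = result
        return result

    return cached_fn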
How many masks are expected at maximum? Since _create_mask_tensor is limited and the _cached function is not, won't we end up getting a None result for the least recently used entries (128+)?
You can ignore my last comment; the lru_cache maxsize has no effect here, since the _cached cache keeps the objects.
If you reverse the condition on line 39, the else branch can be replaced with return cache[args]:
if args not in cache:
    cache[args] = fn(*args)
return cache[args]
Yeah, but at the same time we'd need to calculate the hash and access the map one additional time:
- if args not in cache
- cache[args] = fn(*args)
- return cache[args]
In the current version you access it twice:
- args in cache
- cache[args] = result
To be honest I didn't measure how big of a perf impact this might be, but I'd like to be on the safe side.
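For comparison, a sketch of the reversed-condition variant discussed above; the comments count the dict accesses on a cache miss (three, versus two in the version kept in the PR):

def _cached_alt(fn):
    cache = {}

    def cached_fn(*args):
        if args not in cache:        # access 1: membership check (hash computed)
            cache[args] = fn(*args)  # access 2: store the freshly computed value
        return cache[args]           # access 3: look the value up again to return it

    return cached_fn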
Due to the highly dynamic nature of logits processing, it's better to offload it completely to the CPU instead of computing it on the HPU.
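As an illustration of that idea, a hedged sketch (the helper name and shape are assumptions, not the PR's actual code): build and apply the guided-decoding mask on the CPU, where the token-dependent control flow is cheap, then move the masked logits back to the accelerator.

import torch

def apply_guided_mask_on_cpu(logits: torch.Tensor, allowed_token_ids: list) -> torch.Tensor:
    # logits: 1-D tensor of vocabulary scores living on the HPU (or any accelerator).
    device = logits.device
    cpu_logits = logits.to("cpu")                      # offload the dynamic work to CPU
    mask = torch.full_like(cpu_logits, float("-inf"))  # block every token by default
    mask[allowed_token_ids] = 0.0                      # keep only the guide-permitted tokens
    return (cpu_logits + mask).to(device)              # ship the masked logits back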