
Adding an API for decode streaming. #1677

Merged: 9 commits into main, Nov 15, 2024

Conversation

Narsil (Collaborator) commented Nov 6, 2024

Fixes #1666


ArthurZucker (Collaborator) left a comment

Very nice! Missing a Python API + an example of how to use it in Python.

tokenizers/src/tokenizer/mod.rs
Comment on lines +910 to +912
```rust
pub fn decode_stream(&self, skip_special_tokens: bool) -> DecodeStream<'_, M, N, PT, PP, D> {
    DecodeStream::new(self, skip_special_tokens)
}
```
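
For context, here is how the method under review would be used. This is a minimal sketch, not code from the PR: the `step(id) -> Result<Option<String>>` signature is inferred from the `Ok(Some(..))`/`Ok(None)` fragments quoted later in this thread, the token ids are placeholders, and `from_pretrained` assumes the crate's `http` feature:

```rust
use tokenizers::Tokenizer;

fn main() -> tokenizers::Result<()> {
    // Placeholder model; any tokenizer works.
    let tokenizer = Tokenizer::from_pretrained("gpt2", None)?;

    // Sugar form under review; the explicit form would be
    // `DecodeStream::new(&tokenizer, false)`, as discussed below.
    let mut stream = tokenizer.decode_stream(false); // skip_special_tokens = false

    // Ids standing in for tokens arriving one at a time.
    for id in [9906u32, 1917] {
        // `None` means the accumulated bytes don't yet form valid UTF-8.
        if let Some(text) = stream.step(id)? {
            print!("{text}");
        }
    }
    Ok(())
}
```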
ArthurZucker (Collaborator):

I think I'd rather we explicitly create this DecodeStream with DecodeStream::new(tokenizer, ...), without adding this to the tokenizer funcs!

Narsil (Author):

As you wish. This follows the .iter() pattern in regular Rust, as it's more convenient given the lifetime bound of the DecodeStream object.

https://doc.rust-lang.org/src/alloc/collections/vec_deque/mod.rs.html#1204

It's really just sugar; I can happily remove it.
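
For illustration (not from the PR), the analogy: like `VecDeque::iter`, the method form returns a struct that borrows its parent, and the compiler infers the lifetime for the caller:

```rust
use std::collections::VecDeque;

fn main() {
    let dq: VecDeque<u32> = VecDeque::from([1, 2, 3]);
    // `iter()` returns a struct borrowing `dq`, just as
    // `tokenizer.decode_stream(..)` returns a `DecodeStream<'_, ...>`
    // borrowing the tokenizer; the lifetime is inferred at the call site.
    let total: u32 = dq.iter().sum();
    assert_eq!(total, 6);
}
```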

ArthurZucker (Collaborator):

Got it. No, sounds good; I was thinking more about the coming Python API as well, but in Rust it makes sense for sure.

```rust
        self.prefix_index = new_prefix_index;
        Ok(Some(new_text.to_string()))
    } else {
        Ok(None)
```
ArthurZucker (Collaborator):

Returning '�' might be more expected (at least it's not None, so people can still print it?), or ''.

Narsil (Author):

No, it's wrong to do so. Invalid UTF-8 is perfectly normal, and nothing should be returned before enough tokens have accumulated (see the accent example) to provide valid UTF-8. If valid UTF-8 follows invalid UTF-8, then both will be returned at the same time.

Narsil (Author):

Producing "" is also wrong, since the token really didn't produce anything, as opposed to producing the empty string.
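
To make the accent example concrete, here is a standalone illustration of why output must be held back (plain UTF-8 facts; the token split is hypothetical, not the crate's code):

```rust
fn main() {
    // "é" is two bytes in UTF-8 (0xC3 0xA9). If a tokenizer splits them
    // across two tokens, the first step sees invalid UTF-8:
    let partial = String::from_utf8_lossy(&[0xC3]);
    assert!(partial.ends_with('�')); // incomplete, so the stream yields None

    // Once the next token supplies the second byte, the text is valid
    // and everything accumulated is emitted at once:
    let complete = String::from_utf8_lossy(&[0xC3, 0xA9]);
    assert_eq!(complete, "é");
}
```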

```rust
    }
    let new_text = &string[self.prefix.len()..].to_string();
    let new_prefix_index = self.ids.len() - self.prefix_index;
    self.ids = self.ids.drain(self.read_index..).collect();
```
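
As a standalone illustration of the pruning line above (hypothetical values, not the crate's code):

```rust
fn main() {
    let read_index = 3;
    let mut ids: Vec<u32> = vec![10, 11, 12, 13, 14];
    // Ids before `read_index` are no longer needed to recompute the
    // prefix string, so only the tail is kept.
    ids = ids.drain(read_index..).collect();
    assert_eq!(ids, vec![13, 14]);
}
```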
ArthurZucker (Collaborator):

Nice, the state is bound to be quite small with this!

Narsil (Author):

This is what we have in TGI; the overhead is indeed quite low. You're decoding twice as much (prefix + new text), and you keep only a handful of extra tokens.

Narsil (Author) commented Nov 7, 2024

> Very nice! Missing a Python API + an example of how to use it in Python.

If you're OK with that, I'd keep 2 separate PRs, to keep this PR's logic small enough.

Narsil mentioned this pull request on Nov 7, 2024
ArthurZucker (Collaborator) left a comment

Thanks!

Narsil merged commit 500db28 into main on Nov 15, 2024
13 checks passed
Narsil deleted the decode_stream branch on Nov 15, 2024
```rust
    let string = self
        .tokenizer
        .decode(self.ids.as_slice(), self.skip_special_tokens)?;
    if string.len() > self.prefix.len() && !string.ends_with('�') {
```


@ArthurZucker @Narsil

I ran into streamed decoding issues as well and had the same solution in mind. However, I came to the conclusion that this solution has its own flaw: if you actually want to decode the '�' character because it is part of the completion, this code would assume it's an incomplete UTF-8 marker and not yield anything.

The only clean solution is to offer a Decoder::decode_u8 method. I could help out here, if desired.
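
A standalone illustration of the edge case (illustrative string, not the crate's code):

```rust
fn main() {
    // A completion that legitimately ends with U+FFFD ('�') is
    // indistinguishable from an incomplete UTF-8 sequence under the
    // `ends_with('�')` check, so the stream would withhold it. A
    // bytes-level `Decoder::decode_u8`, as proposed above, would let
    // callers tell the two cases apart.
    let legitimate = "a real replacement char: �";
    assert!(legitimate.ends_with('�'));
}
```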

Linked issue: Incremental Detokenization (#1666)