how to find the correct (token_id, byte_val) relationship for llama3 tokenizer? #352

Open
bugm opened this issue Oct 24, 2024 · 0 comments


bugm commented Oct 24, 2024

Hello all. As far as I know, the llama3 tokenizer is based on byte-level BPE, but I cannot find the relationship between token ids and the byte values 0-255.
For example, the UTF-8 encoding of the character "Ä" is b'\xc3\x84' = [195, 132]. The llama3 tokenizer encodes "Ä" as 88075. Checking the vocab and merges, I found that 88075 is "ÃĦ", a merge of "Ã" (token index 127) and "Ħ" (token index 226), but these do not match the UTF-8 byte values 195, 132. Is there any documentation that explains how token ids 0-255 map to byte values? For example, how are token ids 127 and 226 converted to the byte values 195, 132 (b'\xc3\x84'), which then decode as UTF-8 to the character "Ä"?
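
For reference, the numbers in this question are consistent with the GPT-2 style byte-to-unicode table, in which every byte is shown as a printable character and the "unprintable" bytes are shifted to code points 256 and up. The sketch below rebuilds that table in plain Python. It assumes (this is not confirmed by any llama3 documentation quoted here) that llama3's first 256 token ids follow the same ordering as that table; the helper names `bytes_to_unicode`, `id_to_byte` and `token_str_to_bytes` are illustrative, not part of any llama3 API.

```python
def bytes_to_unicode():
    """GPT-2 style table: byte value (0-255) -> printable unicode character."""
    # Bytes that are already printable keep their own code point.
    bs = (list(range(0x21, 0x7F))      # '!' .. '~'
          + list(range(0xA1, 0xAD))    # '¡' .. '¬'
          + list(range(0xAE, 0x100)))  # '®' .. 'ÿ'
    cs = bs[:]
    n = 0
    # The remaining 68 bytes are mapped to code points 256, 257, ...
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

byte_to_char = bytes_to_unicode()
char_to_byte = {c: b for b, c in byte_to_char.items()}

# Assumption: token id i (0-255) is the i-th entry of the table above.
id_to_byte = list(byte_to_char.keys())

def token_str_to_bytes(token):
    """Convert a vocab string such as 'ÃĦ' back to the raw bytes it stands for."""
    return bytes(char_to_byte[c] for c in token)

print(id_to_byte[127], id_to_byte[226])            # 195 132
print(token_str_to_bytes("ÃĦ"))                    # b'\xc3\x84'
print(token_str_to_bytes("ÃĦ").decode("utf-8"))    # Ä
```

Under that assumption, "Ã" is byte 195 displayed as itself, and "Ħ" (U+0126, code point 294) is byte 132 shifted to 256 + 38, because 132 is the 39th unprintable byte. The vocab strings are therefore a display encoding of the bytes, not UTF-8 text.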
