how to find the correct (token_id, byte_val) relationship for llama3 tokenizer? #352

bugm · 2024-10-24T10:15:50Z

Hello all. As far as I know, the llama3 tokenizer is based on byte-level BPE, but I cannot find the relationship between token ids and the byte values 0-255.
For example, the character "Ä" is encoded in UTF-8 as b'\xc3\x84' = [195, 132]. The llama3 tokenizer encodes "Ä" as token 88075. Checking the vocab and merges, I found that 88075 is "ÃĦ", a merge of "Ã" (token index 127) and "Ħ" (token index 226), but these indices do not match the UTF-8 byte values 195 and 132. Is there any documentation that explains how token ids 0-255 map to byte values? For example, how are token ids 127 and 226 converted to the byte values 195 and 132 (b'\xc3\x84'), which then decode as UTF-8 to the character "Ä"?
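One likely explanation, offered as a sketch rather than a confirmed description of the llama3 implementation: byte-level BPE vocab files in the GPT-2 style do not store raw bytes. Each byte 0-255 is first mapped to a printable Unicode character, and the vocab stores those characters. Under that mapping, byte 195 stays "Ã" (U+00C3) and byte 132 becomes "Ħ" (U+0126), which matches the "ÃĦ" entry above. The code below reproduces the GPT-2 `bytes_to_unicode` table; the helper `token_str_to_bytes` and the `id_to_byte` list are hypothetical names introduced here, and the assumption that the first 256 token ids follow the table's order is inferred only from the ids 127 and 226 quoted in the question.

```python
def bytes_to_unicode():
    """GPT-2 style table: byte value (0-255) -> printable Unicode character."""
    # Bytes that already render as visible characters keep their own code point.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    # Remaining bytes (control chars, space, 127-160, 173) are shifted to 256+n.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


byte_encoder = bytes_to_unicode()
byte_decoder = {ch: b for b, ch in byte_encoder.items()}


def token_str_to_bytes(token):
    """Hypothetical helper: vocab string -> the raw bytes it stands for."""
    return bytes(byte_decoder[ch] for ch in token)


raw = token_str_to_bytes("ÃĦ")
print(list(raw))            # [195, 132]
print(raw.decode("utf-8"))  # Ä

# Assumption: token ids 0-255 are laid out in the table's insertion order.
id_to_byte = list(byte_encoder.keys())
print(id_to_byte[127], id_to_byte[226])  # 195 132
```

So a token id below 256 is not the byte value itself. It is the position of that byte in the table above, and multi-byte tokens such as 88075 decode by mapping each character of the vocab string back through `byte_decoder` before UTF-8 decoding.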
