Hello all. As far as I know, the Llama 3 tokenizer is based on byte-level BPE, but I cannot find the relationship between the token ids and the byte values (0-255).
For example, the character "Ä" is encoded in UTF-8 as b'\xc3\x84' = [195, 132]. The Llama 3 tokenizer encodes "Ä" as token id 88075. Checking the vocab and merges, I found that 88075 is "ÃĦ", produced by merging "Ã" (token id 127) and "Ħ" (token id 226), but these ids do not match the UTF-8 byte values 195 and 132. Is there any documentation that explains how token ids 0-255 map to byte values? For example, how are token ids 127 and 226 converted to the byte values 195 and 132 (b'\xc3\x84'), which then decode as UTF-8 to the character "Ä"?
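The numbers above are consistent with the GPT-2 style byte-to-unicode table, in which printable bytes keep their own code point and the remaining bytes are shifted to code points from 256 upward so that every byte has a visible character. The sketch below rebuilds that table and checks it against the example. It assumes that the Llama 3 vocab in the Hugging Face format uses this same table and that the first 256 token ids follow the table's order; neither is confirmed by a document cited here. `byte_to_id` and `token_to_bytes` are hypothetical helper names, not library functions.

```python
def bytes_to_unicode():
    """Rebuild the GPT-2 style byte -> printable character table."""
    # Bytes that already render as printable characters keep their own code point:
    # '!'..'~' (33-126), '¡'..'¬' (161-172), '®'..'ÿ' (174-255).
    bs = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
    cs = bs[:]
    n = 0
    # Every other byte (control characters, space, 127-160, 173) is assigned
    # the next unused code point starting at 256.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))


byte_encoder = bytes_to_unicode()                        # byte value -> character
byte_decoder = {ch: b for b, ch in byte_encoder.items()}  # character -> byte value

# Assumption: the token id of a single-byte token is its position in the table.
# Dicts preserve insertion order, so enumerate() yields that position.
byte_to_id = {b: i for i, b in enumerate(byte_encoder)}


def token_to_bytes(token: str) -> bytes:
    """Map each character of a vocab string back to its raw byte."""
    return bytes(byte_decoder[ch] for ch in token)


raw = "Ä".encode("utf-8")
token = "".join(byte_encoder[b] for b in raw)

print(list(raw))                              # [195, 132]
print(token)                                  # ÃĦ
print([byte_to_id[b] for b in raw])           # [127, 226]
print(token_to_bytes(token).decode("utf-8"))  # Ä
```

Under these assumptions, byte 195 is printable and stays as "Ã" (U+00C3), while byte 132 is the 39th non-printable byte and becomes chr(256 + 38) = "Ħ" (U+0126). The ids 127 and 226 are positions in the table, not byte values, which is why they differ from 195 and 132.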