Mismatch with OpenAI's tokenizer? #29
-
I was trying to compare for correctness, and it seems OpenAI counts an extra token in their tokenizer? The first <h is two tokens for them. Is that related to #19, or is there something else going on? Interestingly, the other models that gpt-tokenizer supports seem to match what is on the Tokenizer page (even though cl100k_base is listed as the gpt-3.5-turbo tokenizer). As someone new to the repo, I'm sure I'm just ignorant and this is expected. It would be great to get help understanding the gotchas on when it might differ.
Replies: 1 comment 1 reply
-
The old OpenAI tokenizer playground used GPT-3, which was p50k_base. They've updated it now. gpt-tokenizer's playground uses cl100k_base by default, which is compatible with GPT-3.5 and GPT-4.
Different tokenizer encoding, different result. Hope this helps!
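
If it helps to see this concretely, here is a minimal sketch (untested) that counts tokens for the same string under both encodings with gpt-tokenizer. The per-encoding import paths and the sample string are assumptions about the package's subpath exports, not something confirmed in this thread; check the gpt-tokenizer README for the exact entry points in your installed version.

```ts
// Minimal sketch: count tokens for the same string under two encodings.
// NOTE: the encoding-specific import paths below are assumptions; verify them
// against the gpt-tokenizer docs for your version.
import { encode as encodeCl100k } from 'gpt-tokenizer/encoding/cl100k_base'
import { encode as encodeP50k } from 'gpt-tokenizer/encoding/p50k_base'

const text = '<h1>Hello</h1>' // hypothetical sample input

// The same text can produce a different number of tokens under each encoding,
// which is why an older playground (p50k_base) and gpt-tokenizer's default
// (cl100k_base) can report different counts for identical input.
console.log('cl100k_base tokens:', encodeCl100k(text).length)
console.log('p50k_base tokens:  ', encodeP50k(text).length)
```

If the two counts differ for your input, the mismatch comes from the encoding being compared, not from a bug in either tokenizer.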