Mismatch with OpenAI's tokenizer? #29
-
I was trying to compare for correctness, and it seems OpenAI counts an extra token in their tokenizer? The first <h is two tokens for them. Is that related to #19, or is there something else going on? Interestingly, the other models that gpt-tokenizer supports seem to match what is on the Tokenizer page (even though cl100k_base is listed as the gpt-3.5-turbo tokenizer). As someone new to the repo, I'm sure I'm just ignorant and this is expected. It would be great to get help understanding the gotchas on when it might differ.
Replies: 1 comment 1 reply
-
The old OpenAI tokenizer playground used GPT-3, which was p50k_base. They've updated it now. gpt-tokenizer's playground uses cl100k_base by default, which is compatible with GPT-3.5 and GPT-4.
Different tokenizer encoding, different result. Hope this helps!
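
If it helps to see this concretely, here is a minimal sketch (untested) that counts tokens for the same string under both encodings with gpt-tokenizer. The per-encoding import paths and the sample string are assumptions about the package's subpath exports, not something confirmed in this thread; check the gpt-tokenizer README for the exact entry points in your installed version.

```ts
// Minimal sketch: count tokens for the same string under two encodings.
// NOTE: the encoding-specific import paths below are assumptions; verify them
// against the gpt-tokenizer docs for your version.
import { encode as encodeCl100k } from 'gpt-tokenizer/encoding/cl100k_base'
import { encode as encodeP50k } from 'gpt-tokenizer/encoding/p50k_base'

const text = '<h1>Hello</h1>' // hypothetical sample input

// The same text can produce a different number of tokens under each encoding,
// which is why an older playground (p50k_base) and gpt-tokenizer's default
// (cl100k_base) can report different counts for identical input.
console.log('cl100k_base tokens:', encodeCl100k(text).length)
console.log('p50k_base tokens:  ', encodeP50k(text).length)
```

If the two counts differ for your input, the mismatch comes from the encoding being compared, not from a bug in either tokenizer.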