
Question on the vocabulary size #26

Open
PPPPPsanG opened this issue Oct 15, 2024 · 1 comment

Comments

@PPPPPsanG

Emu3 is good work, but I have a question about it.
The vocabulary size of Qwen is 152064, while the codebook size of the vision tokenizer used in Emu3 is 32768.
Their sum is 184832, but the vocabulary size reported in Emu3 is 184622.
Why don't the numbers match?

@ryanzhangfan
Collaborator

We use the vocab.json from Qwen2, which has 151643 tokens, plus 32768 vision tokens, 205 extra tokens, and 6 special tokens, making a total vocabulary size of 184622.
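The arithmetic above can be sketched as a quick check (the variable names are illustrative; the counts are the ones stated in this thread):

```python
# Breakdown of Emu3's vocabulary size, per the maintainer's reply above.
text_tokens = 151_643     # entries in Qwen2's vocab.json
vision_tokens = 32_768    # vision tokenizer codebook size
extra_tokens = 205        # extra tokens
special_tokens = 6        # special tokens

total = text_tokens + vision_tokens + extra_tokens + special_tokens
print(total)  # 184622

# The questioner's 152064 is larger than the vocab.json count by:
print(152_064 - 151_643)  # 421
```

The mismatch in the question comes from using 152064 (Qwen's reported vocabulary/embedding size, which includes slots beyond the vocab.json entries) rather than the 151643 entries actually present in vocab.json.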
