fix CLIPTokenizer skipping underscores #453
Looks like I didn't actually check that this specific regex pattern works (must not have saved the file). It errors on the prompt 'a_b'. Using Haven't looked into what's causing issues yet.
I forgot to make the group non-capturing in my commit (even though I didn't forget in my PR description? 🙃). Works fine on my inputs now.
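For anyone wondering why the capturing group matters, here is a minimal sketch. It assumes the tokenizer splits text with `re.findall`, which this thread does not confirm, and the letter and digit alternatives are hypothetical stand-ins rather than the project's real pattern. Only the final alternative comes from this PR.

```python
import re

# Hypothetical stand-ins for the word and digit alternatives (assumption).
# The last alternative is the one this PR changes.
CAPTURING = r"[^\W\d_]+|\d|([^\s\w]|_)+"
NON_CAPTURING = r"[^\W\d_]+|\d|(?:[^\s\w]|_)+"

# With one capturing group, re.findall returns the group's text for each
# match instead of the whole match, so letter matches come back empty.
print(re.findall(CAPTURING, "a_b"))      # ['', '_', '']

# With a non-capturing group, re.findall returns the full matches.
print(re.findall(NON_CAPTURING, "a_b"))  # ['a', '_', 'b']
```

Empty strings in the token list would plausibly break any later vocab lookup, though whether that is the exact failure seen on 'a_b' is a guess.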
Squash the commits before merging, otherwise good find, thanks!
Actually, it looks like it's possible to get near parity with As far as I can tell, this should agree with Should I update the PR with this new pattern?
Yes, we meant something like this with the comment above. I'd prefer if we had some test coverage for a change like this though. Do you know if there are some "standard" tests for this tokenizer somewhere we could use?
4dbcc78 to e931f54
Guess I'll leave the Unicode alone for now. Differing supported Python versions use different Unicode versions, all of which are themselves different from
The current CLIPTokenizer regex uses [^\s\w]+, which causes it to skip underscores since \w matches them. Checking the bundled vocab, it seems underscores are only ever part of a word that has no alphanumerics (aside from a couple mojibaked words), so the underscore fits in with the [^\s\w]+ part.

Changing [^\s\w]+ to (?:[^\s\w]|_)+ allows tokenizing underscores, which should better match other tokenizer implementations.
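A small sketch of the difference, using Python's standard `re` module. The `WORD_OR_DIGIT` alternatives and the `split_prompt` helper are hypothetical stand-ins for illustration, not the project's actual pattern or API. Only the two "tail" alternatives are taken from this description.

```python
import re

OLD_TAIL = r"[^\s\w]+"          # current: \w matches "_", so "_" is excluded
NEW_TAIL = r"(?:[^\s\w]|_)+"    # proposed: explicitly allow "_" as well

# Assumed stand-in for the letter/digit alternatives of the full pattern.
WORD_OR_DIGIT = r"[^\W\d_]+|\d"


def split_prompt(text: str, tail: str) -> list[str]:
    """Split text into pre-token pieces; unmatched characters are dropped."""
    pattern = re.compile(WORD_OR_DIGIT + "|" + tail)
    return pattern.findall(text)


print(split_prompt("a_b", OLD_TAIL))       # ['a', 'b']
print(split_prompt("a_b", NEW_TAIL))       # ['a', '_', 'b']
print(split_prompt("__init__", OLD_TAIL))  # ['init']
print(split_prompt("__init__", NEW_TAIL))  # ['__', 'init', '__']
```

With the old tail, no alternative can match an underscore, so `findall` silently skips it. With the new tail, runs of underscores and other punctuation are grouped into a single piece.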