
fix CLIPTokenizer skipping underscores #453

Merged
merged 1 commit into from
Oct 3, 2024

Conversation

TyrianOtter
Contributor

The current CLIPTokenizer regex uses [^\s\w]+, which causes it to skip underscores since \w matches them. Checking the bundled vocab, it seems underscores are only ever part of a word that has no alphanumerics (aside from a couple mojibaked words), so the underscore fits in with the [^\s\w]+ part.

Changing [^\s\w]+ to (?:[^\s\w]|_)+ allows tokenizing underscores, which should better match other tokenizer implementations.
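A minimal sketch of the difference (standard-library `re`, illustrative input string only):

```python
import re

# Current fragment: \w matches underscores, so runs containing "_" are skipped.
old = re.compile(r"[^\s\w]+")
# Proposed fragment: additionally matches underscores.
new = re.compile(r"(?:[^\s\w]|_)+")

print(old.findall("snake_case!"))  # ['!'] -- the underscore is never emitted
print(new.findall("snake_case!"))  # ['_', '!']
```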

@TyrianOtter TyrianOtter marked this pull request as draft September 30, 2024 19:05
@TyrianOtter
Contributor Author

TyrianOtter commented Sep 30, 2024

Looks like I didn't actually check that this specific regex pattern works (I must not have saved the file). It errors on the prompt 'a_b'.

Using [^a-zA-Z0-9\s]+ instead does work on the inputs I've tried. But \w includes Unicode characters that a-zA-Z doesn't, so that would be a bigger change to the tokenizer than just additionally matching underscores.
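A quick check of the Unicode gap mentioned here (Python `re`, made-up input):

```python
import re

# In Python 3, \w is Unicode-aware by default, so it matches accented
# letters; the explicit ASCII class [a-zA-Z0-9] does not.
print(bool(re.fullmatch(r"\w+", "héllo")))           # True
print(bool(re.fullmatch(r"[a-zA-Z0-9]+", "héllo")))  # False
```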

Haven't looked into what's causing issues yet.

@TyrianOtter
Contributor Author

I forgot to make the group non-capturing in my commit (even though I didn't forget in my PR description 🙃).

Works fine on my inputs now.

@TyrianOtter TyrianOtter marked this pull request as ready for review September 30, 2024 19:41
Member

@catwell catwell left a comment


Squash the commits before merging, otherwise good find, thanks!

@catwell catwell added the run-ci Run CI label Sep 30, 2024
@TyrianOtter
Copy link
Contributor Author

Actually, it looks like it's possible to get near parity with openai/clip (and thus transformers) without the regex dependency by using <\|startoftext\|>|<\|endoftext\|>|'s|'t|'re|'ve|'m|'ll|'d|[^\W\d_]+|\d|(?:[^\s\w\d]|_)+. I think this only differs on inputs with Unicode characters in the "Nl" (Letter Number) and "No" (Other Number) categories, and some niche whitespace (specifically U+001C to U+001F). It matches those numbers as letters (i.e. in [^\W\d_]+) when they should be matched as part of (?:[^\s\w\d]|_)+.
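A sketch of that pattern with the standard-library `re` module on an invented prompt (not a test vector from the repo):

```python
import re

pattern = re.compile(
    r"<\|startoftext\|>|<\|endoftext\|>"
    r"|'s|'t|'re|'ve|'m|'ll|'d"
    r"|[^\W\d_]+"         # runs of letters (word chars minus digits/underscore)
    r"|\d"                # single digits
    r"|(?:[^\s\w\d]|_)+"  # punctuation runs, now including underscores
)

print(pattern.findall("it's a_snake 2024"))
# ['it', "'s", 'a', '_', 'snake', '2', '0', '2', '4']
```

Note that `[^\W\d_]+` is the usual `re` idiom for "Unicode letters only", which is where the Nl/No divergence comes from: those characters are `\w` but not `\d`, so they land in the letters branch.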

As far as I can tell, this should agree with openai/clip on more inputs overall; the only inputs where it loses agreement that the current implementation (or the change in this PR) already has are ones using whitespace from U+001C to U+001F.

Should I update the PR with this new pattern?

@catwell
Member

catwell commented Oct 1, 2024

Yes, that's the kind of thing we meant in the comment above. I'd still prefer to have some test coverage for a change like this, though.

Do you know if there are some "standard" tests for this tokenizer somewhere we could use?

@TyrianOtter TyrianOtter force-pushed the fix-tokenizer-underscore branch from 4dbcc78 to e931f54 Compare October 1, 2024 18:02
@TyrianOtter
Contributor Author

Guess I'll leave the Unicode handling alone for now. Each supported Python version ships a different Unicode version, and all of those differ again from the regex module's bundled Unicode version. Getting test coverage on this seems like a huge headache.
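For context, a quick standard-library check (output depends on the interpreter): each CPython release pins its own Unicode Character Database version, which is what `re`'s `\w`-style classes are built from.

```python
import sys
import unicodedata

# The Unicode version bundled with this interpreter. It changes across
# Python releases, and the third-party `regex` package ships its own copy.
print(sys.version.split()[0], "-> Unicode", unicodedata.unidata_version)
```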

@catwell catwell added run-ci Run CI and removed run-ci Run CI labels Oct 1, 2024
@catwell catwell merged commit 590648e into finegrain-ai:main Oct 3, 2024
2 checks passed
@TyrianOtter TyrianOtter deleted the fix-tokenizer-underscore branch October 3, 2024 21:02