
fix CLIPTokenizer skipping underscores #453

Merged
merged 1 commit into from
Oct 3, 2024

Conversation

TyrianOtter
Contributor

The current CLIPTokenizer regex uses [^\s\w]+, which causes it to skip underscores since \w matches them. Checking the bundled vocab, it seems underscores are only ever part of a word that has no alphanumerics (aside from a couple mojibaked words), so the underscore fits in with the [^\s\w]+ part.

Changing [^\s\w]+ to (?:[^\s\w]|_)+ allows tokenizing underscores, which should better match other tokenizer implementations.
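A minimal sketch of the difference (standard-library `re`, illustrative input string only):

```python
import re

# Current fragment: \w matches underscores, so runs containing "_" are skipped.
old = re.compile(r"[^\s\w]+")
# Proposed fragment: additionally matches underscores.
new = re.compile(r"(?:[^\s\w]|_)+")

print(old.findall("snake_case!"))  # ['!'] -- the underscore is never emitted
print(new.findall("snake_case!"))  # ['_', '!']
```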

@TyrianOtter TyrianOtter marked this pull request as draft September 30, 2024 19:05
@TyrianOtter
Contributor Author

TyrianOtter commented Sep 30, 2024

Looks like I didn't actually check that this specific regex pattern works (I must not have saved the file). It errors on the prompt 'a_b'.

Using [^a-zA-Z0-9\s]+ instead does work on the inputs I've tried. But \w includes Unicode characters that a-zA-Z doesn't, so that would be a bigger change to the tokenizer than just additionally matching underscores.
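A quick check of the Unicode gap mentioned here (Python `re`, made-up input):

```python
import re

# In Python 3, \w is Unicode-aware by default, so it matches accented
# letters; the explicit ASCII class [a-zA-Z0-9] does not.
print(bool(re.fullmatch(r"\w+", "héllo")))           # True
print(bool(re.fullmatch(r"[a-zA-Z0-9]+", "héllo")))  # False
```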

Haven't looked into what's causing issues yet.

@TyrianOtter
Contributor Author

I forgot to make the group non-capturing in my commit (even though I didn't forget in my PR description 🙃).

Works fine on my inputs now.

@TyrianOtter TyrianOtter marked this pull request as ready for review September 30, 2024 19:41
Member

@catwell catwell left a comment


Squash the commits before merging, otherwise good find, thanks!

@catwell catwell added the run-ci Run CI label Sep 30, 2024
@TyrianOtter
Copy link
Contributor Author

Actually, it looks like it's possible to get near parity with openai/clip (and thus transformers) without the regex dependency by using <\|startoftext\|>|<\|endoftext\|>|'s|'t|'re|'ve|'m|'ll|'d|[^\W\d_]+|\d|(?:[^\s\w\d]|_)+. I think this only differs on inputs with Unicode characters in the "Nl" (Letter Number) and "No" (Other Number) categories, and some niche whitespace (specifically U+001C to U+001F). It matches those numbers as letters (i.e. in [^\W\d_]+) when they should be matched as part of (?:[^\s\w\d]|_)+.
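A sketch of that pattern with the standard-library `re` module on an invented prompt (not a test vector from the repo):

```python
import re

pattern = re.compile(
    r"<\|startoftext\|>|<\|endoftext\|>"
    r"|'s|'t|'re|'ve|'m|'ll|'d"
    r"|[^\W\d_]+"         # runs of letters (word chars minus digits/underscore)
    r"|\d"                # single digits
    r"|(?:[^\s\w\d]|_)+"  # punctuation runs, now including underscores
)

print(pattern.findall("it's a_snake 2024"))
# ['it', "'s", 'a', '_', 'snake', '2', '0', '2', '4']
```

Note that `[^\W\d_]+` is the usual `re` idiom for "Unicode letters only", which is where the Nl/No divergence comes from: those characters are `\w` but not `\d`, so they land in the letters branch.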

As far as I can tell, this should agree with openai/clip on more inputs overall; the only inputs where it loses agreement that the current implementation (or the change in this PR) already has are ones using whitespace from U+001C to U+001F.

Should I update the PR with this new pattern?

@catwell
Member

catwell commented Oct 1, 2024

Yes, that's the kind of thing we meant in the comment above. I'd still prefer to have some test coverage for a change like this, though.

Do you know if there are some "standard" tests for this tokenizer somewhere we could use?

@TyrianOtter TyrianOtter force-pushed the fix-tokenizer-underscore branch from 4dbcc78 to e931f54 Compare October 1, 2024 18:02
@TyrianOtter
Contributor Author

Guess I'll leave the Unicode handling alone for now. Each supported Python version ships a different Unicode version, and all of those differ again from the regex module's bundled Unicode version. Getting test coverage on this seems like a huge headache.
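For context, a quick standard-library check (output depends on the interpreter): each CPython release pins its own Unicode Character Database version, which is what `re`'s `\w`-style classes are built from.

```python
import sys
import unicodedata

# The Unicode version bundled with this interpreter. It changes across
# Python releases, and the third-party `regex` package ships its own copy.
print(sys.version.split()[0], "-> Unicode", unicodedata.unidata_version)
```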

@catwell catwell added run-ci Run CI and removed run-ci Run CI labels Oct 1, 2024
@catwell catwell merged commit 590648e into finegrain-ai:main Oct 3, 2024
2 checks passed
@TyrianOtter TyrianOtter deleted the fix-tokenizer-underscore branch October 3, 2024 21:02