Unable to match unicode chars #138

Open
mm-lemainque opened this issue Dec 29, 2024 · 2 comments

mm-lemainque commented Dec 29, 2024

Hello,

It seems the current implementation cannot match Unicode characters. My goal is to build a terminal that accepts any letter, including accented ones, such as [a-zA-Zà-ÿÀ-ß0-9 ]+, which GBNF appears to support.

Simple code to reproduce with xgrammar 1.8.0:

>>> import xgrammar.testing
>>> xgrammar.testing._is_grammar_accept_string("root ::= [é]", "é", True)
/workspace/cpp/matcher_base.cc:99: Matching char: 195 "\xc3"
/workspace/cpp/matcher_base.cc:101: Previous stack: Stacks tops size: 1
Stack #0: {
id: 0, RulePosition: rule 0: root, sequence 1: ("\xe9"), element id: 0, element in string: 0, parent id: -1, ref count: 1
}
/workspace/cpp/matcher_base.cc:131: Character 195 "\xc3" Rejected
/workspace/cpp/matcher.cc:401: Matching failed after accepting 0 characters
False
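For context on the log above: the grammar stores the character as the code point "\xe9", while the matcher reports rejecting byte 195 ("\xc3"). That is consistent with the input being fed as UTF-8 bytes, since U+00E9 encodes to two bytes. A minimal sketch using only the Python standard library (the helper name `utf8_bytes` is hypothetical and not part of xgrammar):

```python
def utf8_bytes(ch):
    """Return the UTF-8 encoding of a string as a list of byte values."""
    return list(ch.encode("utf-8"))


ch = "\u00e9"  # "é", written as an escape to avoid any normalization ambiguity

print(hex(ord(ch)))    # 0xe9, the code point stored in the grammar
print(utf8_bytes(ch))  # [195, 169], i.e. 0xC3 0xA9; 195 is the byte rejected in the log
```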

I also tried with a pseudo-wildcard rule:

>>> xgrammar.testing._is_grammar_accept_string("root ::= [\\x00-\\xff]", "é", True)
/workspace/cpp/matcher_base.cc:99: Matching char: 195 "\xc3"
/workspace/cpp/matcher_base.cc:101: Previous stack: Stacks tops size: 1
Stack #0: {
id: 0, RulePosition: rule 0: root, sequence 1: ([\0-\xff]), element id: 0, left utf8 bytes: 0, parent id: -1, ref count: 1
}
/workspace/cpp/matcher_base.cc:131: Character 195 "\xc3" Rejected
/workspace/cpp/matcher.cc:401: Matching failed after accepting 0 characters
False
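The second attempt is rejected on the same byte. One possible reading, stated as an assumption rather than a confirmed diagnosis: the range [\0-\xff] is expressed in code points, but the first byte seen, 0xC3, is the lead byte of a two-byte UTF-8 sequence. The sketch below shows how the sequence length follows from the lead byte per the UTF-8 encoding rules; `utf8_sequence_length` is a hypothetical helper, not an xgrammar function.

```python
def utf8_sequence_length(lead_byte):
    """Return how many bytes a UTF-8 sequence has, given its first byte."""
    if lead_byte < 0x80:
        return 1  # ASCII
    if 0xC2 <= lead_byte <= 0xDF:
        return 2  # U+0080 to U+07FF
    if 0xE0 <= lead_byte <= 0xEF:
        return 3  # U+0800 to U+FFFF
    if 0xF0 <= lead_byte <= 0xF4:
        return 4  # U+10000 to U+10FFFF
    # 0x80-0xBF are continuation bytes; 0xC0, 0xC1 and 0xF5+ are never valid
    raise ValueError(f"not a valid UTF-8 lead byte: {lead_byte:#x}")


print(utf8_sequence_length(195))  # 2, so "é" takes the multi-byte path
```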

Unless you know of a workaround, I would be happy to help solve this.

Thanks a lot for your help

Author

mm-lemainque commented Dec 30, 2024

Removing the code below solves the issue, and all tests still pass:

if (num_bytes > 1) {
  return is_negative;
}

EDIT: a proper fix would instead handle multi-byte characters in kCharacterClass rules.
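To illustrate what handling multi-byte characters in a character class could look like, here is a rough Python sketch. It is not the xgrammar implementation: `CharClassMatcher` and `accepts_single_char` are hypothetical names, and the only library call used is the standard `codecs.getincrementaldecoder`. The idea is to buffer bytes until a full code point is available and only then compare it against the class ranges, applying negation at the end.

```python
import codecs


class CharClassMatcher:
    """Hypothetical byte-at-a-time matcher for a single character class."""

    def __init__(self, ranges, is_negative=False):
        self.ranges = ranges  # inclusive (low, high) code point pairs
        self.is_negative = is_negative
        self._decoder = codecs.getincrementaldecoder("utf-8")()

    def feed(self, byte):
        """Return None while the character is incomplete, else True or False."""
        try:
            text = self._decoder.decode(bytes([byte]))
        except UnicodeDecodeError:
            self._decoder.reset()
            return False  # invalid UTF-8 never matches
        if not text:
            return None  # still waiting for continuation bytes
        code_point = ord(text)
        in_class = any(lo <= code_point <= hi for lo, hi in self.ranges)
        return in_class != self.is_negative


def accepts_single_char(ranges, text, is_negative=False):
    """Check whether text is exactly one character accepted by the class."""
    matcher = CharClassMatcher(ranges, is_negative)
    result = None
    for byte in text.encode("utf-8"):
        if result is not None:
            return False  # input continues after a complete character
        result = matcher.feed(byte)
    return result is True


print(accepts_single_char([(0xE9, 0xE9)], "\u00e9"))  # class [é]
print(accepts_single_char([(0x00, 0xFF)], "\u00e9"))  # class [\x00-\xff]
```

Under these assumptions both calls accept "é", because the comparison happens on the decoded code point 0xE9 rather than on the lead byte 0xC3.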

Collaborator

Ubospica commented Jan 3, 2025

@mm-lemainque Thanks for the report! I think the handling of Unicode within character classes is problematic. We will try to fix it soon.
