Unicode regexp for NLP #599
This is my personal opinion; however, we should move away from Str and use Otherwise I agree with you that pcre could be a better fit.
Unfortunately, re does not support UTF-8 at the moment: ocaml/ocaml-re#24
According to the documentation, the `Str` module, which you use heavily for NLP, does not work with UTF-8 at all, so it does not seem suited to the general task of text processing.

For example, `Owl_nlp_utils.regexp_split` is defined as `Str.regexp "[ \t;,.'!?()’“”\\/&—\\-]+"`. Note that it contains special quotation marks, each 3 bytes long in UTF-8.

This brings us to the problem, which is caused by the second byte of these characters, `\128`.

To solve this, I propose switching to `Pcre` or similar libraries, which accept Unicode regular expressions.

What do you think?
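To make the byte-level problem concrete, here is a minimal sketch using only the OCaml standard library. It does not call `Str` itself: `split_on_bytes` is a hypothetical helper that emulates what a byte-oriented character class does, on the assumption that `Str` treats each byte of a multi-byte character as a separate class member. The sample word is also just an illustration.

```ocaml
(* Render every byte of a string as a decimal escape, e.g. \226\128\153. *)
let bytes_of s =
  let buf = Buffer.create (4 * String.length s) in
  String.iter
    (fun c -> Buffer.add_string buf (Printf.sprintf "\\%03d" (Char.code c)))
    s;
  Buffer.contents buf

(* Hypothetical helper (not part of Str or Owl): split [s] on any byte that
   occurs in [cls], dropping empty fragments. This emulates a character
   class that works on bytes rather than on Unicode code points. *)
let split_on_bytes cls s =
  let buf = Buffer.create 16 in
  let out = ref [] in
  let flush () =
    if Buffer.length buf > 0 then begin
      out := Buffer.contents buf :: !out;
      Buffer.clear buf
    end
  in
  String.iter
    (fun c -> if String.contains cls c then flush () else Buffer.add_char buf c)
    s;
  flush ();
  List.rev !out

(* U+2019, right single quotation mark, written as its UTF-8 bytes. *)
let quote = "\xE2\x80\x99"

(* Two Cyrillic letters, U+0440 U+0430, with no punctuation at all.
   The first letter is encoded as \209\128 and so shares byte \128. *)
let word = "\xD1\x80\xD0\xB0"

let () =
  Printf.printf "quote bytes: %s\n" (bytes_of quote);
  Printf.printf "word bytes:  %s\n" (bytes_of word);
  List.iter
    (fun frag -> Printf.printf "fragment:    %s\n" (bytes_of frag))
    (split_on_bytes quote word)
```

Under this emulation the word is cut in the middle of its first letter, because the byte `\128` belongs both to the quotation mark and to an unrelated character. A Unicode-aware regular expression engine would match whole code points instead.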