Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Different tokenisation results between oniguruma and RE2 #225

Closed
bzz opened this issue Apr 12, 2019 · 12 comments
Closed

Different tokenisation results between oniguruma and RE2 #225

bzz opened this issue Apr 12, 2019 · 12 comments
Assignees
Labels
Milestone

Comments

@bzz
Copy link
Contributor

bzz commented Apr 12, 2019

Right now envy uses regex-based tokeniser (until #218 at least).

We have 2 drop-in replacement regex engines: default from Go RE2 and default from Ruby oniguruma.

Recent improvements done with Go module migration #219 surfaced a new issue: it seems that tokeniser produces a bit different results, depending on which regex engine is used :/

More specifically, the token frequencies built from linguist samples are different and high-level code-generator test catch by comparing with a fixture (pre-generated with RE2) and fail on oniguruma profiles like this #219 (comment)

We need to find the exact reason and depending on it decide, if we want to support 2 versions of fixtures or change something so there is no difference in output.

This also potentially affects #194

@bzz bzz added the bug label Apr 12, 2019
@bzz bzz mentioned this issue Apr 12, 2019
8 tasks
@creachadair
Copy link
Contributor

creachadair commented Apr 12, 2019

Often I have found the differences between engines on the same regexp boil down to small variations in the handling of greed (in particular: whether operators are greedy by default), zero-width assertions, anchoring (e.g., whether it is on by default), case-sensitivity, and other meta-properties of the execution model that aren't well-specified. Obviously I don't know what the causes are in this case yet, but that's where I would start looking for examples.

@bzz bzz self-assigned this Apr 14, 2019
@bzz bzz added this to the v2.0.0 milestone Apr 14, 2019
@bzz
Copy link
Contributor Author

bzz commented Apr 16, 2019

Steps to reproduce

$ go test ./internal/code-generator/... -run Test_GeneratorTestSuite -testify.m TestGenerationFiles
ok  	gopkg.in/src-d/enry.v1/internal/code-generator/generator	35.537s

$ go test -tags oniguruma  ./internal/code-generator/... -run Test_GeneratorTestSuite -testify.m TestGenerationFiles
FAIL	gopkg.in/src-d/enry.v1/internal/code-generator/generator	34.583s

Both should produce the same results if results of tokenisation of all linguist samples are the same.;
Going to add extra tests to verify just that.

@bzz
Copy link
Contributor Author

bzz commented Apr 17, 2019

Tracked the problem down to the difference in handling what seems to be latin1 encoded file.
While tokenising thå filling encoded in latin1 and read as UTF-8

RE2 gets

th
filling
�

Oniguruma gets

th
illing
�
f

flex-based tokenised from #218 gets

th
filling

@creachadair
Copy link
Contributor

Of the three, only Flex really seems to be doing anything reasonable here. RE2 is close—but the order of the tokens is weird. I have no idea what Oniguruma is doing there, but it seems obviously broken in at least two ways.

@bzz
Copy link
Contributor Author

bzz commented Apr 17, 2019

Yup. And if the content is decoded from latin1 and encoded to utf8 with charmap.ISO8859_1.NewDecoder().Bytes(content) the result is an expected one:

RE2

th
filling
å

oniguruma

th
filling
å

That would not bother me much, if 2 months ago Linguist did not add such case to their samples, and that is what content classifier is trained on and it is something we keep a "gold standard" results on, as part of our test fixtures.

I know that Linguist does use ICU-based character encoding detector https://github.com/brianmario/charlock_holmes but am not sure yet if it's part of the tokenisation.

@creachadair
Copy link
Contributor

Yeah, I think either we should normalize the encoding or find a way to treat the Unicode replacement character as part of the token, e.g., xxx�yyyxxx�yyy, or use it as a separator and discard it, e.g. ⇒ xxx yyyxxx yyy. It seems like Linguist does the latter maybe.

@bzz
Copy link
Contributor Author

bzz commented Apr 17, 2019

True. And linguist with flex-based tokeniser does not have this issue, so no need to encoding detection there. Thank you for suggestions, let me think a bit more about that..

@bzz
Copy link
Contributor Author

bzz commented May 6, 2019

After digging deeper - ot seems that the offending tokenization rule is extractAndReplaceRegular(content) that does [0-9A-Za-z_\.@#\/\*]+.

The extractRemainders is then called on it's output and does `bytes.Fields(content),

  • in RE2 case it's " \xe5 " which results in extra
  • in Oniguruma it's "\xe5 f"

@bzz
Copy link
Contributor Author

bzz commented May 6, 2019

For the record: doing equivalent operation in Ruby where regex are backed by Oniguruma lib results in ArgumentError: invalid byte sequence in UTF-8

$irb

"th\xdd filling".scan(/[0-9A-Za-z_\.@#\/\*]+/)

@bzz
Copy link
Contributor Author

bzz commented May 7, 2019

Digging a little bit deeper with oniguruma's C API using awesome examples, it starts to look like this may be a bug in go-oniguruma Regexp.findAll() implementation 🤔

@bzz
Copy link
Contributor Author

bzz commented May 7, 2019

Even in C, oniguruma does consistently produce strange result in UTF8 mode for the non-valid input bytes in UTF8, like above. Seems like a possible bug upstream.

As all regex for tokenization in enry are not using any Unicode character classes, all RE2-based matches are conducted in ASCII-only mode, while go-oniguruma has hardcoded UTF8.

For our use-case the fix would be to force Oniguruma also use ASCII mode and that indeed produces identical results even for non-valid bytes in UTF8.

I will submit a patch to cgo part of https://github.com/src-d/go-oniguruma to have an option to override hardcoded UTF8, expose it in Go as MustCompileWithEncoding (similar to MustCompileWithOption) and as soon as it's merged, move entry regex/oniguruma.go to use it in #227.

Meanwhile, only a better test case was added in there a724a2f

bzz added a commit to bzz/go-oniguruma that referenced this issue May 7, 2019
This is a workaround, motivated by the difference in handling non-valid UTF8
bytes that Oniriguma has, compared to Go's default RE2.

See src-d/enry#225 (comment)

Summary of changes:
 - c: prevent `NewOnigRegex()` from hard-coding UTF8
 - c: `NewOnigRegex()` now propely calls to `onig_initialize()` [1]
 - go: expose new `MustCompileASCII()` \w default charecter class matching only ASCII
 - go: `MustCompile()` refactored, `initRegexp()` extracted for common UTF8/ASCII logic

Encoding was not exposed on Go API level intentionaly for simplisity,
in order to avoid introducing complex struct type [2] to API surface.

 1. https://github.com/kkos/oniguruma/blob/83572e983928243d741f61ac290fc057d69fefc3/doc/API#L6
 2. https://github.com/kkos/oniguruma/blob/83572e983928243d741f61ac290fc057d69fefc3/src/oniguruma.h#L121

Signed-off-by: Alexander Bezzubov <bzz@apache.org>
bzz added a commit to bzz/go-oniguruma that referenced this issue May 7, 2019
This is a workaround, motivated by the difference in handling non-valid UTF8
bytes that Oniriguma has, compared to Go's default RE2.

See src-d/enry#225 (comment)

Summary of changes:
 - c: prevent `NewOnigRegex()` from hard-coding UTF8
 - c: `NewOnigRegex()` now propely calls to `onig_initialize()` [1]
 - go: expose new `MustCompileASCII()` \w default charecter class matching only ASCII
 - go: `MustCompile()` refactored, `initRegexp()` extracted for common UTF8/ASCII logic

Encoding was not exposed on Go API level intentionaly for simplisity,
in order to avoid introducing complex struct type [2] to API surface.

 1. https://github.com/kkos/oniguruma/blob/83572e983928243d741f61ac290fc057d69fefc3/doc/API#L6
 2. https://github.com/kkos/oniguruma/blob/83572e983928243d741f61ac290fc057d69fefc3/src/oniguruma.h#L121

Signed-off-by: Alexander Bezzubov <bzz@apache.org>
kuba-- pushed a commit to src-d/go-oniguruma that referenced this issue May 8, 2019
* add ASCII-only option, to mimic default RE2 behaviour

This is a workaround, motivated by the difference in handling non-valid UTF8
bytes that Oniriguma has, compared to Go's default RE2.

See src-d/enry#225 (comment)

Summary of changes:
 - c: prevent `NewOnigRegex()` from hard-coding UTF8
 - c: `NewOnigRegex()` now propely calls to `onig_initialize()` [1]
 - go: expose new `MustCompileASCII()` \w default charecter class matching only ASCII
 - go: `MustCompile()` refactored, `initRegexp()` extracted for common UTF8/ASCII logic

Encoding was not exposed on Go API level intentionaly for simplisity,
in order to avoid introducing complex struct type [2] to API surface.

 1. https://github.com/kkos/oniguruma/blob/83572e983928243d741f61ac290fc057d69fefc3/doc/API#L6
 2. https://github.com/kkos/oniguruma/blob/83572e983928243d741f61ac290fc057d69fefc3/src/oniguruma.h#L121

Signed-off-by: Alexander Bezzubov <bzz@apache.org>

* ci: test on 2 latest go versions

Signed-off-by: Alexander Bezzubov <bzz@apache.org>

* ci: bump version of Oniguruma to 6.9.1

Update deb to get fix https://bugs.launchpad.net/ubuntu/+source/dpkg/+bug/1730627

Signed-off-by: Alexander Bezzubov <bzz@apache.org>

* ci: refactor oniguruma installation

Signed-off-by: Alexander Bezzubov <bzz@apache.org>

* refactoring go part a bit, addressing review feedback

Signed-off-by: Alexander Bezzubov <bzz@apache.org>

* ci: fix typo in bash var substitution

Signed-off-by: Alexander Bezzubov <bzz@apache.org>

* cgo: simplify naive encoding init

Signed-off-by: Alexander Bezzubov <bzz@apache.org>

* go: doc syntax fix

Signed-off-by: Alexander Bezzubov <bzz@apache.org>

* tixing fypos

Signed-off-by: Alexander Bezzubov <bzz@apache.org>
@bzz
Copy link
Contributor Author

bzz commented May 8, 2019

Fixed by #227

@bzz bzz closed this as completed May 8, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants