-
Notifications
You must be signed in to change notification settings - Fork 51
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New, optional flex-based tokenizer #218
Conversation
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
@creachadair all feedback addressed, ready for another round. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple minor and optional suggestions, otherwise looks good.
C.linguist_yylex_init_extra(&extra, &scanner) | ||
buf = C.linguist_yy_scan_bytes((*C.char)(cs), _len, scanner) | ||
|
||
ary := []string{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ary := []string{} | |
var ary []string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would result in returning nil
instead of empty slice sometimes and although I'm aware of https://github.com/golang/go/wiki/CodeReviewComments#declaring-empty-slices that would be inconsistent with how current impl of tokenizer API works.
Thats why I decided to do it this way and keep them the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if it matters from the API point of view - it's pretty rare to test for slice nil-ness.
// TokenizeFlex implements tokenizer by calling Flex generated code from linguist in C | ||
// This is a transliteration from C https://github.com/github/linguist/blob/master/ext/linguist/linguist.c#L12 | ||
func TokenizeFlex(content []byte) []string { | ||
var buf C.YY_BUFFER_STATE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Merge into a single var
block
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know what you mean, but would prefer to keep it as close to original as possible as this is almost line-by-line transliteration from C
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The key word is "almost" :) Unless it's machine-generated, I think it makes sense to adhere to Go conventions where possible. It's not a bit change anyway. Still, it's optional, so feel free to ignore.
import "unsafe" | ||
|
||
const maxTokenLen = 32 // bytes | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would also propose to include a go:generate
directive that updates the C tokenizer, if it makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I might not communicate it well enough, but just to make it clear - the only code that is "generated" in this PR is the C lex.linguist_yy.{h,c}
, that is vendored inside Linguist and we are using exact copy.
In the long run, if this tokeniser turns out to be useful beyond classifier accuracy debugging - it does make sense to add C code generation script from the flex grammar, but I also think it's ok to do that only after #193 is done (so it adds value, not only a maintenance burden).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I haven't meant to regenerate it. go:generate
may also be used to wget
the latest source, for example. Not critical, just a thought.
@dennwc all feedback addressed, ready for another round. |
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
This change partially address #193
It follows option 3 from #193 (comment)
-tags flex
internal/tokenizer/flex/lex.linguist_yy.{h,c}
tokenizer_c.go
is a manual transliteration to Go of the parser from linguist.cThe benchmark before:
after
so it's x10 faster.
But because it's a cgo and it complicates the build I suggest to merge it, keeping behind the tag.
Right now this is only useful for debugging #194 and is ready to be merged without affecting the releases.
But if we agree on it's value for downstream uses for performance reasons, some extra work is going to be needs (in subsequent PR: documentation + updated classifier assets, based on this tokenizer that do not overwrite the default one) before allowing it's usage that goes beyond the internal dev debugging workflow. It is going to be tracked under #193