Miscellaneous feedback #15
Don't worry about the length ✌️ I really like the ideas here! Lookaheads are indeed very powerful. I use them quite frequently when writing parsers in RegHex to avoid an expensive branch being taken when a piece of logic would be too expensive to backtrack on. For most optimisations, however, I assume that most grammars are already optimised to be LR, meaning that in the best case RegHex could extract a single character class (the start of the regular expression) as a cheap lookahead.

Personally I'm not too fond of the idea of adding more support for regular expression syntax itself. The case-insensitive flag is an easy one to add, but I'd suspect the character class conversion is more involved; I'll try to carve out some time for it.

Generally my idea is to compile character classes to an NFA/DFA (we'll see) where the character classes themselves are stored as tries / prefix trees with 4-bit prefixes and 16-bit bitmaps. The reason that's interesting to me is that if most of the cost goes towards regular expressions, any uplift in matching performance on those, compared to the built-ins, will help tremendously across the board. Replacing the regular expressions could in theory not only avoid the performance cost of the JS engine calling into the regular expression engine, but also ease GC pressure (since it'd be hard to replace the array-based node structure itself).
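To sketch what I mean (rough pseudocode of the idea only, none of this exists yet):

```js
// Rough sketch, not implemented: a character class stored as a trie over
// the 4-bit chunks of a UTF-16 code unit, with a 16-bit bitmap per leaf.
const makeClass = () => [];

const add = (root, codeUnit) => {
  let node = root;
  // Descend through the top three nibbles (bits 15-12, 11-8, 7-4),
  // creating intermediate nodes on the way down.
  for (let shift = 12; shift >= 4; shift -= 4) {
    const nibble = (codeUnit >>> shift) & 0xf;
    node = node[nibble] || (node[nibble] = shift === 4 ? { bits: 0 } : []);
  }
  // The lowest nibble selects a bit in the leaf's 16-bit bitmap.
  node.bits |= 1 << (codeUnit & 0xf);
};

const has = (root, codeUnit) => {
  let node = root;
  for (let shift = 12; shift >= 4; shift -= 4) {
    node = node[(codeUnit >>> shift) & 0xf];
    if (node === undefined) return false;
  }
  return (node.bits & (1 << (codeUnit & 0xf))) !== 0;
};

const digit = makeClass();
for (let c = 0x30 /* '0' */; c <= 0x39 /* '9' */; c++) add(digit, c);
has(digit, '5'.charCodeAt(0)); // true
```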
Sounds good 👍 It might be useful for guiding optimizations if the library could tell how many regexes it executed vs. the minimum number it would have executed if it had somehow always taken the correct branches. Maybe even your parsers that already use lookaheads heavily spend a significant amount of time on the wrong branches; I'd be surprised if that wasn't the case 🤔
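Even a crude approximation might be enough to guide this. For example, a hypothetical way to measure it from the outside, with a made-up `instrument` helper rather than anything RegHex actually provides:

```js
// Hypothetical instrumentation, not an existing RegHex API: shadow the
// regex's own `exec` so every execution is counted, and also count how
// many executions actually matched. The gap between the two counters
// approximates work spent on branches that didn't pan out.
const stats = { executed: 0, matched: 0 };

function instrument(regex) {
  const wrapped = new RegExp(regex.source, regex.flags);
  const exec = RegExp.prototype.exec.bind(wrapped);
  wrapped.exec = (input) => {
    stats.executed++;
    const result = exec(input);
    if (result !== null) stats.matched++;
    return result;
  };
  return wrapped;
}

const Digit = instrument(/[0-9]/);
// ...run the parser, then compare stats.matched vs. stats.executed
```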
Somewhat interestingly, pre-constructing compounded regexes manually, like this:

```js
// Before
const Digit = /[0-9]/;
const IdentifierStart = /[a-zA-Z$_]/;
const IdentifierRest = $`${IdentifierStart} | ${Digit}`;
const Identifier = $`${IdentifierStart} ${IdentifierRest}*`;

// After
const Digit = /[0-9]/;
const IdentifierStart = /[a-zA-Z$_]/;
const IdentifierRest = $`${IdentifierStart} | ${Digit}`;
const Identifier = /[a-zA-Z$_][a-zA-Z0-9$_]*/; // <-- changed rule
```

made the entire parser take ~25% less time in my benchmarks (~210ms vs ~280ms), just by changing that one "Identifier" rule. Perhaps this is also something that could be performed automatically in some measure at build-time, when compiling regexes away entirely isn't feasible.
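Just to illustrate, a naive sketch of what such a build-time step could do, with a made-up `compound` tag (not RegHex's `$`) that inlines sub-regex sources into one pattern:

```js
// Made-up `compound` tag, sketching what a build step could do: inline
// the sources of referenced sub-regexes into a single pattern so only
// one regex runs at match time.
function compound(strings, ...parts) {
  const source = strings.raw
    .map((chunk, i) => chunk + (parts[i] ? `(?:${parts[i].source})` : ''))
    .join('')
    // Naive: the template uses whitespace for readability; a real
    // implementation would have to respect spaces inside classes.
    .replace(/\s+/g, '');
  return new RegExp(source, 'y');
}

const Digit = /[0-9]/;
const IdentifierStart = /[a-zA-Z$_]/;
const IdentifierRest = compound`${IdentifierStart} | ${Digit}`;
const Identifier = compound`${IdentifierStart} ${IdentifierRest}*`;
// Identifier.source === "(?:[a-zA-Z$_])(?:(?:[a-zA-Z$_])|(?:[0-9]))*"
```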
I spent some time today benchmarking the library and playing with making a toy/useless Markdown parser with it, so here's some miscellaneous feedback after having interacted some more with the library. Feel free to close this and perhaps open standalone issues for the parts you think are worth addressing.
For the Markdown parser thing I was trying to write a matcher that matched headings, and I had some problems with that:
- There doesn't seem to be a way to use `\1`, `\2` etc. to reference other capturing groups either. In my headings scenario the trailing hashes should really be considered trailing hashes only if they are exactly the same number as the leading hashes; otherwise they should be considered part of the body. This can't quite be expressed cleanly with the current system because the first capturing group/matcher can't be referenced (see the sketch after this list).
- It could also make sense to support `\[0-9]` references, which in this case would mean referencing the 1st, 2nd... 9th whole sub-matcher.
- Quantifiers like `{1,3}` perhaps should be supported too.
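For reference, in plain regex terms the heading rule I was after looks roughly like this (illustrative only):

```js
// Illustrative only: the trailing hashes close the heading only when
// \1 repeats the opening hashes exactly; otherwise the second
// alternative keeps them as part of the body.
const Heading = /^(#{1,6}) (?:(.+?) \1|(.+))$/;

Heading.exec('## Title ##');  // balanced: body "Title" in group 2
Heading.exec('## Title ###'); // unbalanced: "Title ###" ends up in group 3
```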
Now about performance, perhaps the more interesting part of the feedback.
From what I've seen, every ~atomic thing the library does is pretty fast, so there shouldn't be any meaningful micro-optimizations available; the root issue seems to be that the library spends too much time on the wrong alternations.
Some actual real numbers first so that the rest of the feedback sounds less crazy:
That's kind of the root of the performance problems with RegHex parsers, in my opinion. If I had to guess, with enough sophistication perhaps some parsers could become 100x faster or more just by going down branches/alternations more intelligently.
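As one concrete (and entirely hypothetical) example of what that could mean: precompute the possible first characters of each alternative, and use a single character of lookahead to pick the branch instead of trying the alternatives in order:

```js
// Hypothetical sketch: the `pattern` strings are stand-ins for real
// matchers; `first` is the precomputed set of characters an alternative
// can possibly start with.
const alternatives = [
  { first: /[0-9]/, pattern: 'Number' },
  { first: /[a-zA-Z$_]/, pattern: 'Identifier' },
  { first: /["']/, pattern: 'String' },
];

function pickBranch(input, index) {
  if (index >= input.length) return null;
  const char = input[index];
  for (const alt of alternatives) {
    // Only one alternative is usually viable, so the others never run.
    if (alt.first.test(char)) return alt.pattern;
  }
  return null; // no alternative can possibly match here
}
```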
At a high level, RegHex parsers look to me kind of like CPUs: individual patterns are like instructions, each alternation is a branch, etc. It should follow, then, that the same optimizations used for CPUs could/should be used for RegHex. I know next to nothing about that really, but just to mention some potential things that crossed my mind:
Depending on how many of these fancy optimizations you are willing to spend time on, perhaps a seriously fast JS parser could be written on top of RegHex 🤔 That'd be really cool.
Sorry for the long post, hopefully there's some useful feedback in here.