feat(regex_parser): Implement `RegExp` parser #3824

leaysgur · 2024-06-22T11:42:11Z

Part of #1164

Progress updates 🗞️

Waiting for review and advice, while thinking about how to handle escaped strings in new RegExp(pat).

TODOs

RegExp(Literal = Body + Flags)#parse() structure
Base Reader impl to handle both unicode(u32) and utf-16(u16) units
Global Span and local offset conversion
Design AST shapes
- Keep enum size small by Box<'a, T>
- Rework AST shapes
Split body and flags w/ validating literal
Parse RegExpFlags
Parse RegExpBody = Pattern
Parse Pattern > Disjunction
Parse Disjunction > Alternative
Parse Alternative > Term
Parse Term > Assertion
- Parse BoundaryAssertion
- Parse LookaroundAssertion
Parse Term > Quantifier
Parse Term > Atom
- Parse Atom > PatternCharacter
- Parse Atom > .
- Parse Atom > \AtomEscape
  - Parse \AtomEscape > DecimalEscape
  - Parse \AtomEscape > CharacterClassEscape
    - Parse CharacterClassEscape > \d, \D, \s, \S, \w, \W
    - Parse CharacterClassEscape > \p{UnicodePropertyValueExpression}, \P{UnicodePropertyValueExpression}
  - Parse \AtomEscape > CharacterEscape
    - Parse CharacterEscape > ControlEscape
    - Parse CharacterEscape > c AsciiLetter
    - Parse CharacterEscape > 0
    - Parse CharacterEscape > HexEscapeSequence
    - Parse CharacterEscape > RegExpUnicodeEscapeSequence
    - Parse CharacterEscape > IdentityEscape
  - Parse \AtomEscape > kGroupName
- Parse Atom > [CharacterClass]
  - Parse [CharacterClass] > ClassContents > [~UnicodeSetsMode] NonemptyClassRanges
  - Parse [CharacterClass] > ClassContents > [+UnicodeSetsMode] ClassSetExpression
    - Parse ClassSetExpression > ClassUnion
    - Parse ClassSetExpression > ClassIntersection
    - Parse ClassSetExpression > ClassSubtraction
    - Parse ClassSetExpression > ClassSetOperand
    - Parse ClassSetExpression > ClassSetRange
    - Parse ClassSetExpression > ClassSetCharacter
- Parse Atom > (GroupSpecifier)
- Parse Atom > (?:Disjunction)
Annex B
- Parse QuantifiableAssertion
- Parse ExtendedAtom
  - Parse ExtendedAtom > \ [lookahead = c]
  - Parse ExtendedAtom > InvalidBracedQuantifier
  - Parse ExtendedAtom > ExtendedPatternCharacter
  - Parse ExtendedAtom > \AtomEscape > CharacterEscape > LegacyOctalEscapeSequence
Early errors
- Pattern :: Disjunction(1/2)
- Pattern :: Disjunction(2/2)
- QuantifierPrefix :: { DecimalDigits , DecimalDigits }
- ExtendedAtom :: InvalidBracedQuantifier (Annex B)
- AtomEscape :: k GroupName
- AtomEscape :: DecimalEscape
- NonemptyClassRanges :: ClassAtom - ClassAtom ClassContents(1/2)
- NonemptyClassRanges :: ClassAtom - ClassAtom ClassContents(2/2)
- NonemptyClassRanges :: ClassAtom - ClassAtom ClassContents(Annex B)
- NonemptyClassRangesNoDash :: ClassAtomNoDash - ClassAtom ClassContents(1/2)
- NonemptyClassRangesNoDash :: ClassAtomNoDash - ClassAtom ClassContents(2/2)
- NonemptyClassRangesNoDash :: ClassAtomNoDash - ClassAtom ClassContents(Annex B)
- RegExpIdentifierStart :: \ RegExpUnicodeEscapeSequence
- RegExpIdentifierStart :: UnicodeLeadSurrogate UnicodeTrailSurrogate
- RegExpIdentifierPart :: \ RegExpUnicodeEscapeSequence
- RegExpIdentifierPart :: UnicodeLeadSurrogate UnicodeTrailSurrogate
- UnicodePropertyValueExpression :: UnicodePropertyName = UnicodePropertyValue(1/2)
- UnicodePropertyValueExpression :: UnicodePropertyName = UnicodePropertyValue(2/2)
- UnicodePropertyValueExpression :: LoneUnicodePropertyNameOrValue(1/2)
- UnicodePropertyValueExpression :: LoneUnicodePropertyNameOrValue(2/2)
- CharacterClassEscape :: P{ UnicodePropertyValueExpression }
- CharacterClass :: [^ ClassContents ]
- NestedClass :: [^ ClassContents ]
- ClassSetRange :: ClassSetCharacter - ClassSetCharacter
Add Span to Err(OxcDiagnostic::error()) calls
Perf improvement
- Reader#peek() should avoid iter.next() equivalent
- ~~Use char everywhere and split and push 2 surrogates(pair) for Character?~~
- ~~Try 1(+1) loop parsing for capturing groups?~~
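
To illustrate the two Reader items in the list above (handling both unicode(u32) and utf-16(u16) units, and a peek() that avoids an iter.next() equivalent), here is a minimal sketch. It is not the code in this PR: the struct layout, the method names other than peek, and the choice to pre-collect all units into a Vec are assumptions made for illustration.

```rust
/// Hypothetical reader over a pattern's source text (illustration only).
/// In unicode mode every unit is a code point; otherwise every unit is a
/// UTF-16 code unit. Both are stored as u32 so the parser sees one type.
struct Reader {
    units: Vec<u32>,
    index: usize,
}

impl Reader {
    fn new(source: &str, unicode_mode: bool) -> Self {
        let units = if unicode_mode {
            source.chars().map(|c| c as u32).collect()
        } else {
            source.encode_utf16().map(u32::from).collect()
        };
        Self { units, index: 0 }
    }

    /// Look at the current unit without consuming it.
    /// This is a slice lookup, so no iterator is cloned or advanced.
    fn peek(&self) -> Option<u32> {
        self.units.get(self.index).copied()
    }

    /// Look one unit past the current one.
    fn peek2(&self) -> Option<u32> {
        self.units.get(self.index + 1).copied()
    }

    /// Consume the current unit, if there is one.
    fn advance(&mut self) {
        if self.index < self.units.len() {
            self.index += 1;
        }
    }

    /// Consume the current unit only if it equals `ch`.
    fn eat(&mut self, ch: char) -> bool {
        if self.peek() == Some(ch as u32) {
            self.advance();
            true
        } else {
            false
        }
    }
}
```

With this sketch, Reader::new("a😀", true) holds two units, while Reader::new("a😀", false) holds three, because the emoji is split into a surrogate pair. The trade-off of pre-collecting is an up-front allocation in exchange for constant-time lookahead.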

Follow up

@Boshen Test suite > feat(coverage): enable regexp in test262 #4242
- Investigate CI errors...
Next...
- Support ES2025 Duplicate named capturing groups?
- Support ES20XX Stage3 Modifiers?

crates/oxc_regexp_parser/_oxc_js_regex/ast_builder.rs

crates/oxc_regexp_parser/_oxc_js_regex/ast_kind.rs

crates/oxc_regexp_parser/src/ast.rs

codspeed-hq · 2024-06-22T11:51:20Z

CodSpeed Performance Report

Merging #3824 will not alter performance

Comparing regexpp (368364d) with main (f88970b)

Summary

✅ 29 untouched benchmarks

crates/oxc_regexp_parser/src/ast_builder.rs

crates/oxc_regexp_parser/src/parser/literal_parser.rs

crates/oxc_regexp_parser/src/parser/body_parser/reader.rs

Boshen · 2024-07-05T14:54:58Z

Now that we have some of the implementation working, we should think about how to support the regex eslint rules 🤔

Sysix · 2024-07-12T18:52:25Z

@leaysgur hello, I'm currently working on a smaller version of regex groups, maybe you'll find some useful snippets here:
https://github.com/oxc-project/oxc/pull/4096/files

interesting method:

get_regex_group_by_open_bracet

~~Currently Missing implementations:~~

~~Named Captures (a named capture can be accessed by its number reference: (?<foo>a)(\1) is valid)~~

rzvxa · 2024-07-15T12:12:20Z

This is awesome, I'm looking forward to this PR😍

I always had a theory that there are only 5 people on StackOverflow who write all the regex examples and everyone else just copies them into production. If that theory is correct I bet you'd be the 6th after this😆

leaysgur · 2024-07-15T14:45:06Z

That's the truth. 😅
I will do my best to move from the copying side to the being-copied side!

magic-akari · 2024-08-14T06:44:17Z

I hope it becomes an independent crate package.

leaysgur · 2024-08-14T07:12:18Z

@Boshen ^ What do you think? (maybe also related to #4242 (comment))

Sysix · 2024-08-14T16:37:24Z

Hello @leaysgur

for eslint/no-useless-backreference it would be really nice to have the AST-Span of the Backreference.
But there is a difference between RegExp literals and the RegExp object: in the latter, the backslash needs to be escaped.

See:

  ⚠ no back reference
   ╭─[no_useless_backreference.tsx:1:6]
 1 │ /(b)(\2a)/
   ·      ──
   ╰────
vs.
  ⚠ no back reference
   ╭─[no_useless_backreference.tsx:1:6]
 1 │ new RegExp("(b)(\\2a)");
   ·                 ───
   ╰────

Did you consider this use case for escaped backslashes?
Do we want to catch the doubled backslash?

leaysgur · 2024-08-15T11:46:30Z

@Sysix Thanks for your comment!

> nice to have the AST-Span of the Backreference.

The current AST for backref already holds span.
Please check https://github.com/oxc-project/oxc/pull/3824/files#diff-8a24d853ad0d7af27dc285d21a4b1f2d6aa97d2fab6153fa441805750266d7e4R245-R248 👀

> Did you consider this use case for escaped backslashes?

Yes, but as a RegExp parser, I do not specifically address backslash escaping (rather escape sequences).
Because it's a topic of string literals themselves, not just RegExp.

For now, the pattern \\2 is treated as an (escaped) backslash followed by the (number) 2.

My understanding may be wrong and I'm not sure how the OXC parser handles these escapes. 😅

Nope… For this reason, we may need to add a new flag and implement a lexer layer to check \', \", and \\...?
(If so the surrogate pair issue may also be isolated?)

Or just leave it to userland to call pattern.replace("\\\"", "\"").replace("\\'", "'").replace("\\\\", "\\") beforehand...?
But this way, the Span may be shifted.
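
To make the Span shift concrete, here is a small sketch of unescaping with an offset table. This is only an illustration of the concern, not what OXC does: the function names are hypothetical, only the three escapes mentioned above (\\, \", \') are handled, and cooked positions are counted in characters while raw positions are byte offsets.

```rust
/// Hypothetical helper (illustration only): cook the contents of a string
/// literal whose quotes are already removed, handling only \\, \" and \'.
/// Returns the cooked text plus, for each cooked character, the byte offset
/// where it started in the raw text. A final sentinel equal to `raw.len()`
/// lets exclusive span ends be mapped as well.
fn unescape_with_offsets(raw: &str) -> (String, Vec<usize>) {
    let mut cooked = String::new();
    let mut offsets = Vec::new();
    let mut chars = raw.char_indices().peekable();
    while let Some((start, ch)) = chars.next() {
        if ch == '\\' {
            if let Some(&(_, next)) = chars.peek() {
                if matches!(next, '\\' | '"' | '\'') {
                    // Two raw characters collapse into one cooked character.
                    chars.next();
                    cooked.push(next);
                    offsets.push(start);
                    continue;
                }
            }
        }
        cooked.push(ch);
        offsets.push(start);
    }
    offsets.push(raw.len());
    (cooked, offsets)
}

/// Map a span over cooked characters (end exclusive) back to raw byte offsets.
fn to_raw_span(offsets: &[usize], start: usize, end: usize) -> (usize, usize) {
    (offsets[start], offsets[end])
}
```

For the raw text (b)(\\2a), the cooked pattern is (b)(\2a). The backreference \2 covers cooked characters 4..6, which maps back to raw offsets 4..7, the three-character \\2 in the source. Without such a table, a span computed on the cooked pattern would point one character short.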

I’m beginning to think about this. 🤔

Hmmm, not so sure. I think I'll wait for @Boshen's advice.

This is a summary of what I need to ask:

feat(regex_parser): Implement RegExp parser #3824 (comment)
feat(coverage): enable regexp in test262 #4242 (comment)
feat(coverage): enable regexp in test262 #4242 (comment)
How to handle escape sequence inside new RegExp("\\1")

Boshen

This is art.

graphite-app · 2024-08-20T02:14:35Z

Merge activity

Aug 19, 10:14 PM EDT: The merge label 'merge' was detected. This PR will be added to the Graphite merge queue once it meets the requirements.
Aug 19, 10:18 PM EDT: The merge label 'merge' was detected. This PR will be added to the Graphite merge queue once it meets the requirements.
Aug 19, 10:18 PM EDT: Boshen added this pull request to the Graphite merge queue.
Aug 19, 10:22 PM EDT: Boshen merged this pull request with the Graphite merge queue.


leaysgur · 2024-08-23T05:07:47Z

@Sysix Sorry to bother you on an already closed PR.

I finally found that we do not need to care about the escaped backslash issue you mentioned.

Please see example/parse_file in #5106

But you may still need to wait a little longer to use this in the linter. #1164 (comment)


Boshen changed the title ~~feat(regex_parser): Port regexpp for OXC~~ feat(regex_parser): Port regexpp Jun 22, 2024

Boshen mentioned this pull request Jun 22, 2024

feat: oxc regex parser #2030

Closed

Boshen self-assigned this Jun 22, 2024