Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(regex_parser): Implement RegExp parser #3824

Merged
merged 1 commit into from
Aug 20, 2024
Merged

feat(regex_parser): Implement RegExp parser #3824

merged 1 commit into from
Aug 20, 2024

Conversation

leaysgur
Copy link
Collaborator

@leaysgur leaysgur commented Jun 22, 2024

Part of #1164

Progress updates 🗞️

Waiting for the review and advice, while thinking how to handle escaped string when new RegExp(pat).

TODOs

  • RegExp(Literal = Body + Flags)#parse() structure
  • Base Reader impl to handle both unicode(u32) and utf-16(u16) units
  • Global Span and local offset conversion
  • Design AST shapes
    • Keep enum size small by Box<'a, T>
    • Rework AST shapes
  • Split body and flags w/ validating literal
  • Parse RegExpFlags
  • Parse RegExpBody = Pattern
  • Parse Pattern > Disjunction
  • Parse Disjunction > Alternative
  • Parse Alternative > Term
  • Parse Term > Assertion
    • Parse BoundaryAssertion
    • Parse LookaroundAssertion
  • Parse Term > Quantifier
  • Parse Term > Atom
    • Parse Atom > PatternCharacter
    • Parse Atom > .
    • Parse Atom > \AtomEscape
      • Parse \AtomEscape > DecimalEscape
      • Parse \AtomEscape > CharacterClassEscape
        • Parse CharacterClassEscape > \d, \D, \s, \S, \w, \W
        • Parse CharacterClassEscape > \p{UnicodePropertyValueExpression}, \P{UnicodePropertyValueExpression}
      • Parse \AtomEscape > CharacterEscape
        • Parse CharacterEscape > ControlEscape
        • Parse CharacterEscape > c AsciiLetter
        • Parse CharacterEscape > 0
        • Parse CharacterEscape > HexEscapeSequence
        • Parse CharacterEscape > RegExpUnicodeEscapeSequence
        • Parse CharacterEscape > IdentityEscape
      • Parse \AtomEscape > kGroupName
    • Parse Atom > [CharacterClass]
      • Parse [CharacterClass] > ClassContents > [~UnicodeSetsMode] NonemptyClassRanges
      • Parse [CharacterClass] > ClassContents > [+UnicodeSetsMode] ClassSetExpression
        • Parse ClassSetExpression > ClassUnion
        • Parse ClassSetExpression > ClassIntersection
        • Parse ClassSetExpression > ClassSubtraction
        • Parse ClassSetExpression > ClassSetOperand
        • Parse ClassSetExpression > ClassSetRange
        • Parse ClassSetExpression > ClassSetCharacter
    • Parse Atom > (GroupSpecifier)
    • Parse Atom > (?:Disjunction)
  • Annex B
    • Parse QuantifiableAssertion
    • Parse ExtendedAtom
      • Parse ExtendedAtom > \ [lookahead = c]
      • Parse ExtendedAtom > InvalidBracedQuantifier
      • Parse ExtendedAtom > ExtendedPatternCharacter
      • Parse ExtendedAtom > \AtomEscape > CharacterEscape > LegacyOctalEscapeSequence
  • Early errors
    • Pattern :: Disjunction(1/2)
    • Pattern :: Disjunction(2/2)
    • QuantifierPrefix :: { DecimalDigits , DecimalDigits }
    • ExtendedAtom :: InvalidBracedQuantifier (Annex B)
    • AtomEscape :: k GroupName
    • AtomEscape :: DecimalEscape
    • NonemptyClassRanges :: ClassAtom - ClassAtom ClassContents(1/2)
    • NonemptyClassRanges :: ClassAtom - ClassAtom ClassContents(2/2)
    • NonemptyClassRanges :: ClassAtom - ClassAtom ClassContents(Annex B)
    • NonemptyClassRangesNoDash :: ClassAtomNoDash - ClassAtom ClassContents(1/2)
    • NonemptyClassRangesNoDash :: ClassAtomNoDash - ClassAtom ClassContents(2/2)
    • NonemptyClassRangesNoDash :: ClassAtomNoDash - ClassAtom ClassContents(Annex B)
    • RegExpIdentifierStart :: \ RegExpUnicodeEscapeSequence
    • RegExpIdentifierStart :: UnicodeLeadSurrogate UnicodeTrailSurrogate
    • RegExpIdentifierPart :: \ RegExpUnicodeEscapeSequence
    • RegExpIdentifierPart :: UnicodeLeadSurrogate UnicodeTrailSurrogate
    • UnicodePropertyValueExpression :: UnicodePropertyName = UnicodePropertyValue(1/2)
    • UnicodePropertyValueExpression :: UnicodePropertyName = UnicodePropertyValue(2/2)
    • UnicodePropertyValueExpression :: LoneUnicodePropertyNameOrValue(1/2)
    • UnicodePropertyValueExpression :: LoneUnicodePropertyNameOrValue(2/2)
    • CharacterClassEscape :: P{ UnicodePropertyValueExpression }
    • CharacterClass :: [^ ClassContents ]
    • NestedClass :: [^ ClassContents ]
    • ClassSetRange :: ClassSetCharacter - ClassSetCharacter
  • Add Span to Err(OxcDiagnostic::error()) calls
  • Perf improvement
    • Reader#peek() should avoid iter.next() equivalent
    • Use char everywhere and split and push 2 surrogates(pair) for Character?
    • Try 1(+1) loop parsing for capturing groups?

Follow up

This comment was marked as off-topic.

@Boshen Boshen changed the title feat(regex_parser): Port regexpp for OXC feat(regex_parser): Port regexpp Jun 22, 2024
@Boshen Boshen mentioned this pull request Jun 22, 2024
@Boshen Boshen self-assigned this Jun 22, 2024
@leaysgur leaysgur mentioned this pull request Jun 22, 2024
4 tasks
Copy link

codspeed-hq bot commented Jun 22, 2024

CodSpeed Performance Report

Merging #3824 will not alter performance

Comparing regexpp (368364d) with main (f88970b)

Summary

✅ 29 untouched benchmarks

@leaysgur

This comment was marked as outdated.

@leaysgur leaysgur linked an issue Jun 24, 2024 that may be closed by this pull request
4 tasks
@leaysgur

This comment was marked as resolved.

Boshen

This comment was marked as resolved.

@leaysgur

This comment was marked as resolved.

@Boshen
Copy link
Member

Boshen commented Jul 5, 2024

Now that we have some of the implementation working, we should think about how to support the regex eslint rules 🤔

@leaysgur leaysgur changed the title feat(regex_parser): Port regexpp feat(regex_parser): Implement RegExp parser Jul 12, 2024
@Sysix
Copy link
Contributor

Sysix commented Jul 12, 2024

@leaysgur hello, im currently working on a smaller version of regex groups, maybe u find some usefull snippets here:
https://github.com/oxc-project/oxc/pull/4096/files

interesting method:

  • get_regex_group_by_open_bracet

Currently Missing implementations:

  • Named Captures (named capture can be accessed by it's number reference. (?<foo>a)(\1) is valid)

@leaysgur leaysgur assigned leaysgur and unassigned Boshen Jul 13, 2024
@rzvxa
Copy link
Collaborator

rzvxa commented Jul 15, 2024

This is awesome, I'm looking forward to this PR😍

I always had a theory that there are only 5 people on StackOverflow who write all the regex examples and everyone else just copies them into production. If that theory is correct I bet you'd be the 6th after this😆

@leaysgur
Copy link
Collaborator Author

That's the truth. 😅
I will do my best to reverse my position from copying to be copied side!

@leaysgur

This comment was marked as outdated.

@magic-akari
Copy link
Collaborator

I hope it becomes an independent crate package.

@leaysgur
Copy link
Collaborator Author

@Boshen ^ How do you think? (maybe also related to #4242 (comment))

@leaysgur leaysgur requested a review from Boshen August 14, 2024 13:08
@Sysix
Copy link
Contributor

Sysix commented Aug 14, 2024

Hello @leaysgur

for eslint/no-useless-backreference it would be really nice to have the AST-Span of the Backreference.
But there is a difference with RegexLiterals and the RegExp Obj: The Backlash needs to be escaped.

See:

  ⚠ no back reference
   ╭─[no_useless_backreference.tsx:1:6]
 1 │ /(b)(\2a)/
   ·      ──
   ╰────
vs.
  ⚠ no back reference
   ╭─[no_useless_backreference.tsx:1:6]
 1 │ new RegExp("(b)(\\2a)");
   ·                 ───
   ╰────

Did you considered this use case for escaped backslashes?
Do we want to catch the doubled backslash?

@leaysgur
Copy link
Collaborator Author

leaysgur commented Aug 15, 2024

@Sysix Thanks for your comment!

nice to have the AST-Span of the Backreference.

The current AST for backref already holds span.
Please check https://github.com/oxc-project/oxc/pull/3824/files#diff-8a24d853ad0d7af27dc285d21a4b1f2d6aa97d2fab6153fa441805750266d7e4R245-R248 👀

Did you considered this use case for escaped backslashes?

Yes, but as a RegExp parser, I do not specifically address backslash escaping (rather escape sequences).
Because it's a topic of string literals themselves, not just RegExp.

For now, the treatment of pattern \\2 is (escaped)backslash and (number)2.

My understanding may be wrong and I'm not sure how OXC parser handle these escapes. 😅


Nope… For this reason, we may need to add new flag and implement a lexer layer to check \', \", and \\...?
(If so the surrogate pair issue may also be isolated?)

Or just leave it user land to be called pattern.replace("\\\"", "\"").replace("\\'", "'").replace("\\\\", "\\") beforehand...?
But in this way, Span may be shifted.

I’m beginning to think about this. 🤔


Hmmm, not so sure. I think I'll wait for @Boshen 's advice.

This is summary what need to ask:

@Boshen Boshen force-pushed the regexpp branch 2 times, most recently from 219fe7e to e95c600 Compare August 20, 2024 01:48
Copy link
Member

@Boshen Boshen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is art.

@Boshen Boshen marked this pull request as ready for review August 20, 2024 01:52
@Boshen Boshen added the 0-merge Merge with Graphite Merge Queue label Aug 20, 2024
Copy link

graphite-app bot commented Aug 20, 2024

Merge activity

  • Aug 19, 10:14 PM EDT: The merge label 'merge' was detected. This PR will be added to the Graphite merge queue once it meets the requirements.
  • Aug 19, 10:18 PM EDT: The merge label 'merge' was detected. This PR will be added to the Graphite merge queue once it meets the requirements.
  • Aug 19, 10:18 PM EDT: Boshen added this pull request to the Graphite merge queue.
  • Aug 19, 10:22 PM EDT: Boshen merged this pull request with the Graphite merge queue.

Part of #1164

## Progress updates 🗞️

Waiting for the review and advice, while thinking how to handle escaped string when `new RegExp(pat)`.

## TODOs

- [x] `RegExp(Literal = Body + Flags)#parse()` structure
- [x] Base `Reader` impl to handle both unicode(u32) and utf-16(u16) units
- [x] Global `Span` and local offset conversion
- [x] Design AST shapes
  - [x] Keep `enum` size small by `Box<'a, T>`
  - [x] Rework AST shapes
- [x] Split body and flags w/ validating literal
- [x] Parse `RegExpFlags`
- [x] Parse `RegExpBody` = `Pattern`
- [x] Parse `Pattern` > `Disjunction`
- [x] Parse `Disjunction` > `Alternative`
- [x] Parse `Alternative` > `Term`
- [x] Parse `Term` > `Assertion`
	- [x] Parse `BoundaryAssertion`
	- [x] Parse `LookaroundAssertion`
- [x] Parse `Term` > `Quantifier`
- [x] Parse `Term` > `Atom`
	- [x] Parse `Atom` > `PatternCharacter`
	- [x] Parse `Atom` > `.`
	- [x] Parse `Atom` > `\AtomEscape`
		- [x] Parse `\AtomEscape` > `DecimalEscape`
		- [x] Parse `\AtomEscape` > `CharacterClassEscape`
			- [x] Parse `CharacterClassEscape` > `\d, \D, \s, \S, \w, \W`
			- [x] Parse `CharacterClassEscape` > `\p{UnicodePropertyValueExpression}, \P{UnicodePropertyValueExpression}`
		- [x] Parse `\AtomEscape` > `CharacterEscape`
			- [x] Parse `CharacterEscape` > `ControlEscape`
			- [x] Parse `CharacterEscape` > `c AsciiLetter`
			- [x] Parse `CharacterEscape` > `0`
			- [x] Parse `CharacterEscape` > `HexEscapeSequence`
			- [x] Parse `CharacterEscape` > `RegExpUnicodeEscapeSequence`
			- [x] Parse `CharacterEscape` > `IdentityEscape`
		- [x] Parse `\AtomEscape` > `kGroupName`
	- [x] Parse `Atom` > `[CharacterClass]`
    	- [x] Parse `[CharacterClass]` > `ClassContents` > `[~UnicodeSetsMode] NonemptyClassRanges`
    	- [x] Parse `[CharacterClass]` > `ClassContents` > `[+UnicodeSetsMode] ClassSetExpression`
          - [x] Parse `ClassSetExpression` > `ClassUnion`
          - [x] Parse `ClassSetExpression` > `ClassIntersection`
          - [x] Parse `ClassSetExpression` > `ClassSubtraction`
          - [x] Parse `ClassSetExpression` > `ClassSetOperand`
          - [x] Parse `ClassSetExpression` > `ClassSetRange`
          - [x] Parse `ClassSetExpression` > `ClassSetCharacter`
	- [x] Parse `Atom` > `(GroupSpecifier)`
	- [x] Parse `Atom` > `(?:Disjunction)`
- [x] Annex B
    - [x] Parse `QuantifiableAssertion`
	- [x] Parse `ExtendedAtom`
      - [x] Parse `ExtendedAtom` > `\ [lookahead = c]`
      - [x] Parse `ExtendedAtom` > `InvalidBracedQuantifier`
      - [x] Parse `ExtendedAtom` > `ExtendedPatternCharacter`
      - [x] Parse `ExtendedAtom` > `\AtomEscape` > `CharacterEscape` > `LegacyOctalEscapeSequence`
- [x] Early errors
	- [x] Pattern :: Disjunction(1/2)
	- [x] Pattern :: Disjunction(2/2)
	- [x] QuantifierPrefix :: { DecimalDigits , DecimalDigits }
	- [x] ExtendedAtom :: InvalidBracedQuantifier (Annex B)
	- [x] AtomEscape :: k GroupName
	- [x] AtomEscape :: DecimalEscape
	- [x] NonemptyClassRanges :: ClassAtom - ClassAtom ClassContents(1/2)
	- [x] NonemptyClassRanges :: ClassAtom - ClassAtom ClassContents(2/2)
	- [x] NonemptyClassRanges :: ClassAtom - ClassAtom ClassContents(Annex B)
	- [x] NonemptyClassRangesNoDash :: ClassAtomNoDash - ClassAtom ClassContents(1/2)
	- [x] NonemptyClassRangesNoDash :: ClassAtomNoDash - ClassAtom ClassContents(2/2)
	- [x] NonemptyClassRangesNoDash :: ClassAtomNoDash - ClassAtom ClassContents(Annex B)
	- [x] RegExpIdentifierStart :: \ RegExpUnicodeEscapeSequence
	- [x] RegExpIdentifierStart :: UnicodeLeadSurrogate UnicodeTrailSurrogate
	- [x] RegExpIdentifierPart :: \ RegExpUnicodeEscapeSequence
	- [x] RegExpIdentifierPart :: UnicodeLeadSurrogate UnicodeTrailSurrogate
	- [x] UnicodePropertyValueExpression :: UnicodePropertyName = UnicodePropertyValue(1/2)
	- [x] UnicodePropertyValueExpression :: UnicodePropertyName = UnicodePropertyValue(2/2)
	- [x] UnicodePropertyValueExpression :: LoneUnicodePropertyNameOrValue(1/2)
	- [x] UnicodePropertyValueExpression :: LoneUnicodePropertyNameOrValue(2/2)
	- [x] CharacterClassEscape :: P{ UnicodePropertyValueExpression }
	- [x] CharacterClass :: [^ ClassContents ]
	- [x] NestedClass :: [^ ClassContents ]
	- [x] ClassSetRange :: ClassSetCharacter - ClassSetCharacter
- [x] Add `Span` to `Err(OxcDiagnostic::error())` calls
- [x] Perf improvement
	- [x] `Reader#peek()` should avoid `iter.next()` equivalent
	- [x] ~~Use `char` everywhere and split and push 2 surrogates(pair) for `Character`?~~
	- [x] ~~Try 1(+1) loop parsing for capturing groups?~~

## Follow up

- [x] @Boshen Test suite > #4242
  - [x] Investigate CI errors...
- Next...
  - Support ES2025 Duplicate named capturing groups?
  - Support ES20XX Stage3 Modifiers?
@graphite-app graphite-app bot merged commit 368364d into main Aug 20, 2024
25 checks passed
@graphite-app graphite-app bot deleted the regexpp branch August 20, 2024 02:22
@leaysgur
Copy link
Collaborator Author

leaysgur commented Aug 23, 2024

@Sysix Sorry to bother you from already closed PR.

I finally found that we do not need to care about escaped backslash issue you mentioned.

Please see example/parse_file in #5106

But you may still need to wait a little longer to use this in linter. #1164 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0-merge Merge with Graphite Merge Queue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat(linter): regex parser
5 participants