Merge pull request #1 from tom-lord/chargroup_parser

Chargroup parser
tom-lord · Mar 2, 2015 · caf68ac · caf68ac
2 parents df3686c + b401414
commit caf68ac
Show file tree

Hide file tree

Showing 5 changed files with 141 additions and 99 deletions.
diff --git a/README.md b/README.md
@@ -26,12 +26,33 @@ For more detail on this, see [configuration options](#configuration-options).
 /what about (backreferences\?) \1/.examples #=> ['what about backreferences? backreferences?']
 ```
 
+## Installation
+
+Add this line to your application's Gemfile:
+
+```ruby
+gem 'regexp-examples'
+```
+
+And then execute:
+
+    $ bundle
+
+Or install it yourself as:
+
+    $ gem install regexp-examples
+
 ## Supported syntax
 
 * All forms of repeaters (quantifiers), e.g. `/a*/`, `/a+/`, `/a?/`, `/a{1,4}/`, `/a{3,}/`, `/a{,2}/`
   * Reluctant and possissive repeaters work fine, too - e.g. `/a*?/`, `/a*+/`
 * Boolean "Or" groups, e.g. `/a|b|c/`
-* Character sets (inluding ranges and negation!), e.g. `/[abc]/`, `/[A-Z0-9]/`, `/[^a-z]/`, `/[\w\s\b]/`
+* Character sets e.g. `/[abc]/` - including:
+  * Ranges, e.g.`/[A-Z0-9]/`
+  * Negation, e.g. `/[^a-z]/`
+  * Escaped characters, e.g. `/[\w\s\b]/`
+  * POSIX bracket expressions, e.g. `/[[:alnum:]]/`, `/[[:^space:]]/`
+  * Set intersection, e.g. `/[[a-h]&&[f-z]]/`
 * Escaped characters, e.g. `/\n/`, `/\w/`, `/\D/` (and so on...)
 * Capture groups, e.g. `/(group)/`
   * Including named groups, e.g. `/(?<name>group)/`
@@ -43,7 +64,6 @@ For more detail on this, see [configuration options](#configuration-options).
 * Escape sequences, e.g. `/\x42/`, `/\x5word/`, `/#{"\x80".force_encoding("ASCII-8BIT")}/`
 * Unicode characters, e.g. `/\u0123/`, `/\uabcd/`, `/\u{789}/`
 * Octal characters, e.g. `/\10/`, `/\177/`
-* POSIX bracket expressions (including negation), e.g. `/[[:alnum:]]/`, `/[[:^space:]]/`
 * Named properties, e.g. `/\p{L}/` ("Letter"), `/\p{Arabic}/` ("Arabic character"), `/\p{^Ll}/` ("Not a lowercase letter")
 * **Arbitrarily complex combinations of all the above!**
 
@@ -55,15 +75,12 @@ For more detail on this, see [configuration options](#configuration-options).
 
 ## Bugs and Not-Yet-Supported syntax
 
-* Nested character classes, and the use of set intersection ([See here](http://www.ruby-doc.org/core-2.2.0/Regexp.html#class-Regexp-label-Character+Classes) for the official documentation on this.) For example:
-  * `/[[abc]de]/.examples`  (which _should_ return `["a", "b", "c", "d", "e"]`)
-  * `/[[a-d]&&[c-f]]/.examples` (which _should_ return: `["c", "d"]`)
+* There are some (rare) edge cases where backreferences do not work properly, e.g. `/(a*)a* \1/.examples` - which includes "aaaa aa". This is because each repeater is not context-aware, so the "greediness" logic is flawed. (E.g. in this case, the second `a*` should always evaluate to an empty string, because the previous `a*` was greedy! However, patterns like this are highly unusual...
+* Some named properties, e.g. `/\p{Arabic}/`, list non-matching examples for ruby 2.0/2.1 (as the definitions changed in ruby 2.2). This would be "easy" to fix, but I can't be bothered... Feel free to make a pull request!
 
-* Conditional capture groups, such as `/(group1) (?(1)yes|no)`
-
-* Some named properties, e.g. `/\p{Arabic}/`, list non-matching examples for ruby 2.0/2.1. There are no known issues in ruby 2.2
-
-There are loads more (increasingly obscure) unsupported bits of syntax, which I cannot be bothered to write out here. Full documentation on all the various other obscurities in the ruby (version 2.x) regexp parser can be found [here](https://raw.githubusercontent.com/k-takata/Onigmo/master/doc/RE).
+There are also some various (increasingly obscure) unsupported bits of syntax, which I cannot be bothered to write out fully here. Full documentation on all the intricate obscurities in the ruby (version 2.x) regexp parser can be found [here](https://raw.githubusercontent.com/k-takata/Onigmo/master/doc/RE). To name a couple:
+* Conditional capture groups, e.g. `/(group1)? (?(1)yes|no)/.examples` (which *should* return: `["group1 yes", " no"]`)
+* Back reference by relatve group number, e.g. `/(a)(b)(c)(d) \k<-2>/.examples` (which *should* return: `["abcd c"]`)
 
 ## Impossible features ("illegal syntax")
 
@@ -117,21 +134,12 @@ A more sensible use case might be, for example, to generate one random 1-4 digit
 
 (Note: I may develop a much more efficient way to "generate one example" in a later release of this gem.)
 
-## Installation
-
-Add this line to your application's Gemfile:
-
-```ruby
-gem 'regexp-examples'
-```
-
-And then execute:
-
-    $ bundle
-
-Or install it yourself as:
+## TODO
 
-    $ gem install regexp-examples
+* Performance improvements:
+  * Use of lambdas/something (in [constants.rb](lib/regexp-examples/constants.rb)) to improve the library load time.
+  * (Maybe?) add a `max_examples` configuration option and use lazy evaluation, to ensure the method never "freezes"
+* Write a blog post about how this amazing gem works! :)
 
 ## Contributing
 

diff --git a/lib/regexp-examples/chargroup_parser.rb b/lib/regexp-examples/chargroup_parser.rb
@@ -1,69 +1,118 @@
 module RegexpExamples
-  # Given an array of chars from inside a character set,
-  # Interprets all backslashes, ranges and negations
-  # TODO: This needs a bit of a rewrite because:
-  #   A) It's ugly
-  #   B) It doesn't take into account nested character groups, or set intersection
-  # To achieve this, the algorithm needs to be recursive, like the main Parser.
+  # A "sub-parser", for char groups in a regular expression
+  # Some examples of what this class needs to parse:
+  # [abc]          - plain characters
+  # [a-z]          - ranges
+  # [\n\b\d]       - escaped characters (which may represent character sets)
+  # [^abc]         - negated group
+  # [[a][bc]]      - sub-groups (should match "a", "b" or "c")
+  # [[:lower:]]    - POSIX group
+  # [[a-f]&&[d-z]] - set intersection (should match "d", "f" or "f")
+  # [[^:alpha:]&&[\n]a-c] - all of the above!!!! (should match "\n")
   class ChargroupParser
-    def initialize(chars)
-      @chars = chars
-      if @chars[0] == "^"
-        @negative = true
-        @chars = @chars[1..-1]
-      else
-        @negative = false
+    attr_reader :regexp_string
+    def initialize(regexp_string, is_sub_group: false)
+      @regexp_string = regexp_string
+      @is_sub_group = is_sub_group
+      @current_position = 0
+      parse
+    end
+
+    def parse
+      @charset = []
+      @negative = false
+      parse_first_chars
+      until next_char == "]" do
+        case next_char
+        when "["
+          @current_position += 1
+          sub_group_parser = self.class.new(rest_of_string, is_sub_group: true)
+          @charset.concat sub_group_parser.result
+          @current_position += sub_group_parser.length
+        when "-"
+          if regexp_string[@current_position + 1] == "]" # e.g. /[abc-]/ -- not a range!
+            @charset << "-"
+            @current_position += 1
+          else
+            @current_position += 1
+            @charset.concat (@charset.last .. parse_checking_backlash.first).to_a
+            @current_position += 1
+          end
+        when "&"
+          if regexp_string[@current_position + 1] == "&"
+            @current_position += 2
+            sub_group_parser = self.class.new(rest_of_string, is_sub_group: @is_sub_group)
+            @charset &= sub_group_parser.result
+            @current_position += (sub_group_parser.length - 1)
+          else
+            @charset << "&"
+            @current_position += 1
+          end
+        else
+          @charset.concat parse_checking_backlash
+          @current_position += 1
+        end
       end
 
-      init_backslash_chars
-      init_ranges
+      @charset.uniq!
+      @current_position += 1 # To account for final "]"
+    end
+
+    def length
+      @current_position
     end
 
     def result
-      @negative ? (CharSets::Any - @chars) : @chars
+      @negative ? (CharSets::Any - @charset) : @charset
     end
 
     private
-    def init_backslash_chars
-      @chars.each_with_index do |char, i|
-        if char == "\\"
-          if BackslashCharMap.keys.include?(@chars[i+1])
-            @chars[i..i+1] = move_backslash_to_front( BackslashCharMap[@chars[i+1]] )
-          elsif @chars[i+1] == 'b'
-            @chars[i..i+1] = "\b"
-          elsif @chars[i+1] == "\\"
-            @chars.delete_at(i+1)
-          else
-            @chars.delete_at(i)
-          end
+    def parse_first_chars
+      if next_char == '^'
+        @negative = true
+        @current_position += 1
+      end
+
+      case rest_of_string
+      when /\A[-\]]/ # e.g. /[]]/ (match "]") or /[-]/ (match "-")
+        @charset << next_char
+        @current_position += 1
+      when /\A:(\^?)([^:]+):\]/ # e.g. [[:alpha:]] - POSIX group
+        if @is_sub_group
+          chars = $1.empty? ? POSIXCharMap[$2] : (CharSets::Any - POSIXCharMap[$2])
+          @charset.concat chars
+          @current_position += ($1.length + $2.length + 2)
         end
       end
     end
 
-    def init_ranges
-      # remove hyphen ("-") from front/back, if present
-      hyphen = nil
-      hyphen = @chars.shift if @chars.first == "-"
-      hyphen ||= @chars.pop if @chars.last == "-"
-      # Replace all instances of e.g. ["a", "-", "z"] with ["a", "b", ..., "z"]
-      while i = @chars.index("-")
-        # Prevent infinite loops from expanding [",", "-", "."] to itself
-        # (Since ",".ord = 44, "-".ord = 45, ".".ord = 46)
-        if (@chars[i-1] == ',' && @chars[i+1] == '.')
-          hyphen = @chars.delete_at(i)
-        else
-          @chars[i-1..i+1] = (@chars[i-1]..@chars[i+1]).to_a
-        end
+    # Always returns an Array, for consistency
+    def parse_checking_backlash
+      if next_char == "\\"
+        @current_position += 1
+        parse_after_backslash
+      else
+        [next_char]
       end
-      # restore hyphen, if stripped out earlier
-      @chars.unshift(hyphen) if hyphen
     end
 
-    def move_backslash_to_front(chars)
-      if index = chars.index { |char| char == '\\' }
-        chars.unshift chars.delete_at(index)
+    def parse_after_backslash
+      case next_char
+      when *BackslashCharMap.keys
+        BackslashCharMap[next_char]
+      when 'b'
+        ["\b"]
+      else
+        [next_char]
       end
-      chars
+    end
+
+    def rest_of_string
+      regexp_string[@current_position..-1]
+    end
+
+    def next_char
+      regexp_string[@current_position]
     end
   end
 end

diff --git a/lib/regexp-examples/parser.rb b/lib/regexp-examples/parser.rb
@@ -223,30 +223,10 @@ def parse_multi_end_group
     end
 
     def parse_char_group
-      # TODO: Extract all this logic into ChargroupParser
-      if rest_of_string =~ /\A\[\[:(\^?)([^:]+):\]\]/
-        @current_position += (6 + $1.length + $2.length)
-        chars = $1.empty? ? POSIXCharMap[$2] : CharSets::Any - POSIXCharMap[$2]
-        return CharGroup.new(chars, @ignorecase)
-      end
-      chars = []
-      @current_position += 1
-      if next_char == ']'
-        # Beware of the sneaky edge case:
-        # /[]]/ (match "]")
-        chars << ']'
-        @current_position += 1
-      end
-      until next_char == ']' \
-        && !regexp_string[0..@current_position-1].match(/[^\\](\\{2})*\\\z/)
-        # Beware of having an ODD number of "\" before the "]", e.g.
-        # /[\]]/ (match "]")
-        # /[\\]/ (match "\")
-        # /[\\\]]/ (match "\" or "]")
-        chars << next_char
-        @current_position += 1
-      end
-      parsed_chars = ChargroupParser.new(chars).result
+      @current_position += 1 # Skip past opening "["
+      chargroup_parser = ChargroupParser.new(rest_of_string)
+      parsed_chars = chargroup_parser.result
+      @current_position += (chargroup_parser.length - 1) # Step back to closing "]"
       CharGroup.new(parsed_chars, @ignorecase)
     end
 

diff --git a/lib/regexp-examples/version.rb b/lib/regexp-examples/version.rb
@@ -1,3 +1,3 @@
 module RegexpExamples
-  VERSION = '0.7.0'
+  VERSION = '1.0.0'
 end
diff --git a/spec/regexp-examples_spec.rb b/spec/regexp-examples_spec.rb
@@ -69,7 +69,6 @@ def self.examples_are_empty(*regexps)
 
     context "for complex char groups (square brackets)" do
       examples_exist_and_match(
-
         /[abc]/,
         /[a-c]/,
         /[abc-e]/,
@@ -82,7 +81,13 @@ def self.examples_are_empty(*regexps)
         /[\n-\r]/,
         /[\-]/,
         /[%-+]/, # This regex is "supposed to" match some surprising things!!!
-        /['-.]/ # Test to ensure no "infinite loop" on character set expansion
+        /['-.]/, # Test to ensure no "infinite loop" on character set expansion
+        /[[abc]]/, # Nested groups
+        /[[[[abc]]]]/,
+        /[[a][b][c]]/,
+        /[[a-h]&&[f-z]]/, # Set intersection
+        /[[a-h]&&ab[c]]/, # Set intersection
+        /[[a-h]&[f-z]]/, # NOT set intersection
       )
     end