diff --git a/README.md b/README.md index a9902e8..b51411b 100644 --- a/README.md +++ b/README.md @@ -26,12 +26,33 @@ For more detail on this, see [configuration options](#configuration-options). /what about (backreferences\?) \1/.examples #=> ['what about backreferences? backreferences?'] ``` +## Installation + +Add this line to your application's Gemfile: + +```ruby +gem 'regexp-examples' +``` + +And then execute: + + $ bundle + +Or install it yourself as: + + $ gem install regexp-examples + ## Supported syntax * All forms of repeaters (quantifiers), e.g. `/a*/`, `/a+/`, `/a?/`, `/a{1,4}/`, `/a{3,}/`, `/a{,2}/` * Reluctant and possessive repeaters work fine, too - e.g. `/a*?/`, `/a*+/` * Boolean "Or" groups, e.g. `/a|b|c/` -* Character sets (inluding ranges and negation!), e.g. `/[abc]/`, `/[A-Z0-9]/`, `/[^a-z]/`, `/[\w\s\b]/` +* Character sets, e.g. `/[abc]/` - including: + * Ranges, e.g. `/[A-Z0-9]/` + * Negation, e.g. `/[^a-z]/` + * Escaped characters, e.g. `/[\w\s\b]/` + * POSIX bracket expressions, e.g. `/[[:alnum:]]/`, `/[[:^space:]]/` + * Set intersection, e.g. `/[[a-h]&&[f-z]]/` * Escaped characters, e.g. `/\n/`, `/\w/`, `/\D/` (and so on...) * Capture groups, e.g. `/(group)/` * Including named groups, e.g. `/(?<name>group)/` @@ -43,7 +64,6 @@ For more detail on this, see [configuration options](#configuration-options). * Escape sequences, e.g. `/\x42/`, `/\x5word/`, `/#{"\x80".force_encoding("ASCII-8BIT")}/` * Unicode characters, e.g. `/\u0123/`, `/\uabcd/`, `/\u{789}/` * Octal characters, e.g. `/\10/`, `/\177/` -* POSIX bracket expressions (including negation), e.g. `/[[:alnum:]]/`, `/[[:^space:]]/` * Named properties, e.g. `/\p{L}/` ("Letter"), `/\p{Arabic}/` ("Arabic character"), `/\p{^Ll}/` ("Not a lowercase letter") * **Arbitrarily complex combinations of all the above!** @@ -55,15 +75,12 @@ For more detail on this, see [configuration options](#configuration-options). 
## Bugs and Not-Yet-Supported syntax -* Nested character classes, and the use of set intersection ([See here](http://www.ruby-doc.org/core-2.2.0/Regexp.html#class-Regexp-label-Character+Classes) for the official documentation on this.) For example: - * `/[[abc]de]/.examples` (which _should_ return `["a", "b", "c", "d", "e"]`) - * `/[[a-d]&&[c-f]]/.examples` (which _should_ return: `["c", "d"]`) +* There are some (rare) edge cases where backreferences do not work properly, e.g. `/(a*)a* \1/.examples` - which includes "aaaa aa". This is because each repeater is not context-aware, so the "greediness" logic is flawed. (E.g. in this case, the second `a*` should always evaluate to an empty string, because the previous `a*` was greedy!) However, patterns like this are highly unusual... +* Some named properties, e.g. `/\p{Arabic}/`, list non-matching examples for ruby 2.0/2.1 (as the definitions changed in ruby 2.2). This would be "easy" to fix, but I can't be bothered... Feel free to make a pull request! -* Conditional capture groups, such as `/(group1) (?(1)yes|no)` - -* Some named properties, e.g. `/\p{Arabic}/`, list non-matching examples for ruby 2.0/2.1. There are no known issues in ruby 2.2 - -There are loads more (increasingly obscure) unsupported bits of syntax, which I cannot be bothered to write out here. Full documentation on all the various other obscurities in the ruby (version 2.x) regexp parser can be found [here](https://raw.githubusercontent.com/k-takata/Onigmo/master/doc/RE). +There are also various (increasingly obscure) unsupported bits of syntax, which I cannot be bothered to write out fully here. Full documentation on all the intricate obscurities in the ruby (version 2.x) regexp parser can be found [here](https://raw.githubusercontent.com/k-takata/Onigmo/master/doc/RE). To name a couple: +* Conditional capture groups, e.g. `/(group1)? 
(?(1)yes|no)/.examples` (which *should* return: `["group1 yes", " no"]`) +* Back reference by relative group number, e.g. `/(a)(b)(c)(d) \k<-2>/.examples` (which *should* return: `["abcd c"]`) ## Impossible features ("illegal syntax") @@ -117,21 +134,12 @@ A more sensible use case might be, for example, to generate one random 1-4 digit (Note: I may develop a much more efficient way to "generate one example" in a later release of this gem.) -## Installation - -Add this line to your application's Gemfile: - -```ruby -gem 'regexp-examples' -``` - -And then execute: - - $ bundle - -Or install it yourself as: +## TODO - $ gem install regexp-examples +* Performance improvements: + * Use of lambdas/something (in [constants.rb](lib/regexp-examples/constants.rb)) to improve the library load time. + * (Maybe?) add a `max_examples` configuration option and use lazy evaluation, to ensure the method never "freezes" +* Write a blog post about how this amazing gem works! :) ## Contributing diff --git a/lib/regexp-examples/chargroup_parser.rb b/lib/regexp-examples/chargroup_parser.rb index 448226f..f5ca3e6 100644 --- a/lib/regexp-examples/chargroup_parser.rb +++ b/lib/regexp-examples/chargroup_parser.rb @@ -1,69 +1,118 @@ module RegexpExamples - # Given an array of chars from inside a character set, - # Interprets all backslashes, ranges and negations - # TODO: This needs a bit of a rewrite because: - # A) It's ugly - # B) It doesn't take into account nested character groups, or set intersection - # To achieve this, the algorithm needs to be recursive, like the main Parser. 
+ # A "sub-parser", for char groups in a regular expression + # Some examples of what this class needs to parse: + # [abc] - plain characters + # [a-z] - ranges + # [\n\b\d] - escaped characters (which may represent character sets) + # [^abc] - negated group + # [[a][bc]] - sub-groups (should match "a", "b" or "c") + # [[:lower:]] - POSIX group + # [[a-f]&&[d-z]] - set intersection (should match "d", "e" or "f") + # [[:^alpha:]&&[\n]a-c] - all of the above!!!! (should match "\n") class ChargroupParser - def initialize(chars) - @chars = chars - if @chars[0] == "^" - @negative = true - @chars = @chars[1..-1] - else - @negative = false + attr_reader :regexp_string + def initialize(regexp_string, is_sub_group: false) + @regexp_string = regexp_string + @is_sub_group = is_sub_group + @current_position = 0 + parse + end + + def parse + @charset = [] + @negative = false + parse_first_chars + until next_char == "]" do + case next_char + when "[" + @current_position += 1 + sub_group_parser = self.class.new(rest_of_string, is_sub_group: true) + @charset.concat sub_group_parser.result + @current_position += sub_group_parser.length + when "-" + if regexp_string[@current_position + 1] == "]" # e.g. /[abc-]/ -- not a range! + @charset << "-" + @current_position += 1 + else + @current_position += 1 + @charset.concat (@charset.last .. parse_checking_backlash.first).to_a + @current_position += 1 + end + when "&" + if regexp_string[@current_position + 1] == "&" + @current_position += 2 + sub_group_parser = self.class.new(rest_of_string, is_sub_group: @is_sub_group) + @charset &= sub_group_parser.result + @current_position += (sub_group_parser.length - 1) + else + @charset << "&" + @current_position += 1 + end + else + @charset.concat parse_checking_backlash + @current_position += 1 + end end - init_backslash_chars - init_ranges + @charset.uniq! + @current_position += 1 # To account for final "]" + end + + def length + @current_position end def result - @negative ? 
(CharSets::Any - @chars) : @chars + @negative ? (CharSets::Any - @charset) : @charset end private - def init_backslash_chars - @chars.each_with_index do |char, i| - if char == "\\" - if BackslashCharMap.keys.include?(@chars[i+1]) - @chars[i..i+1] = move_backslash_to_front( BackslashCharMap[@chars[i+1]] ) - elsif @chars[i+1] == 'b' - @chars[i..i+1] = "\b" - elsif @chars[i+1] == "\\" - @chars.delete_at(i+1) - else - @chars.delete_at(i) - end + def parse_first_chars + if next_char == '^' + @negative = true + @current_position += 1 + end + + case rest_of_string + when /\A[-\]]/ # e.g. /[]]/ (match "]") or /[-]/ (match "-") + @charset << next_char + @current_position += 1 + when /\A:(\^?)([^:]+):\]/ # e.g. [[:alpha:]] - POSIX group + if @is_sub_group + chars = $1.empty? ? POSIXCharMap[$2] : (CharSets::Any - POSIXCharMap[$2]) + @charset.concat chars + @current_position += ($1.length + $2.length + 2) end end end - def init_ranges - # remove hyphen ("-") from front/back, if present - hyphen = nil - hyphen = @chars.shift if @chars.first == "-" - hyphen ||= @chars.pop if @chars.last == "-" - # Replace all instances of e.g. 
["a", "-", "z"] with ["a", "b", ..., "z"] - while i = @chars.index("-") - # Prevent infinite loops from expanding [",", "-", "."] to itself - # (Since ",".ord = 44, "-".ord = 45, ".".ord = 46) - if (@chars[i-1] == ',' && @chars[i+1] == '.') - hyphen = @chars.delete_at(i) - else - @chars[i-1..i+1] = (@chars[i-1]..@chars[i+1]).to_a - end + # Always returns an Array, for consistency + def parse_checking_backlash + if next_char == "\\" + @current_position += 1 + parse_after_backslash + else + [next_char] end - # restore hyphen, if stripped out earlier - @chars.unshift(hyphen) if hyphen end - def move_backslash_to_front(chars) - if index = chars.index { |char| char == '\\' } - chars.unshift chars.delete_at(index) + def parse_after_backslash + case next_char + when *BackslashCharMap.keys + BackslashCharMap[next_char] + when 'b' + ["\b"] + else + [next_char] end - chars + end + + def rest_of_string + regexp_string[@current_position..-1] + end + + def next_char + regexp_string[@current_position] end end end diff --git a/lib/regexp-examples/parser.rb b/lib/regexp-examples/parser.rb index ccb6fc4..f3ff6e6 100644 --- a/lib/regexp-examples/parser.rb +++ b/lib/regexp-examples/parser.rb @@ -223,30 +223,10 @@ def parse_multi_end_group end def parse_char_group - # TODO: Extract all this logic into ChargroupParser - if rest_of_string =~ /\A\[\[:(\^?)([^:]+):\]\]/ - @current_position += (6 + $1.length + $2.length) - chars = $1.empty? ? POSIXCharMap[$2] : CharSets::Any - POSIXCharMap[$2] - return CharGroup.new(chars, @ignorecase) - end - chars = [] - @current_position += 1 - if next_char == ']' - # Beware of the sneaky edge case: - # /[]]/ (match "]") - chars << ']' - @current_position += 1 - end - until next_char == ']' \ - && !regexp_string[0..@current_position-1].match(/[^\\](\\{2})*\\\z/) - # Beware of having an ODD number of "\" before the "]", e.g. 
- # /[\]]/ (match "]") - # /[\\]/ (match "\") - # /[\\\]]/ (match "\" or "]") - chars << next_char - @current_position += 1 - end - parsed_chars = ChargroupParser.new(chars).result + @current_position += 1 # Skip past opening "[" + chargroup_parser = ChargroupParser.new(rest_of_string) + parsed_chars = chargroup_parser.result + @current_position += (chargroup_parser.length - 1) # Step back to closing "]" CharGroup.new(parsed_chars, @ignorecase) end diff --git a/lib/regexp-examples/version.rb b/lib/regexp-examples/version.rb index aeff39e..2d329fb 100644 --- a/lib/regexp-examples/version.rb +++ b/lib/regexp-examples/version.rb @@ -1,3 +1,3 @@ module RegexpExamples - VERSION = '0.7.0' + VERSION = '1.0.0' end diff --git a/spec/regexp-examples_spec.rb b/spec/regexp-examples_spec.rb index 15776cf..a0f4d2d 100644 --- a/spec/regexp-examples_spec.rb +++ b/spec/regexp-examples_spec.rb @@ -69,7 +69,6 @@ def self.examples_are_empty(*regexps) context "for complex char groups (square brackets)" do examples_exist_and_match( - /[abc]/, /[a-c]/, /[abc-e]/, @@ -82,7 +81,13 @@ def self.examples_are_empty(*regexps) /[\n-\r]/, /[\-]/, /[%-+]/, # This regex is "supposed to" match some surprising things!!! - /['-.]/ # Test to ensure no "infinite loop" on character set expansion + /['-.]/, # Test to ensure no "infinite loop" on character set expansion + /[[abc]]/, # Nested groups + /[[[[abc]]]]/, + /[[a][b][c]]/, + /[[a-h]&&[f-z]]/, # Set intersection + /[[a-h]&&ab[c]]/, # Set intersection + /[[a-h]&[f-z]]/, # NOT set intersection ) end
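
As a quick sanity check on the character-set cases added to the spec above, here is a minimal sketch that uses only Ruby's built-in `Regexp` engine (not this gem) to list which lowercase letters each pattern accepts. The helper name `matching_letters` is hypothetical and exists purely for illustration; it is not part of regexp-examples.

```ruby
# Hypothetical helper (not part of regexp-examples): brute-force which
# lowercase ASCII letters a character class accepts, using Ruby's own engine.
def matching_letters(regexp)
  ('a'..'z').select { |char| regexp =~ char }
end

nested       = matching_letters(/[[a][b][c]]/)     # nested groups act as a union
intersection = matching_letters(/[[a-h]&&[f-z]]/)  # "&&" intersects the two sets
mixed        = matching_letters(/[[a-h]&&ab[c]]/)  # right-hand side is "a", "b" and [c] combined
single_amp   = matching_letters(/[[a-h]&[f-z]]/)   # a lone "&" is a literal, so this is a union

puts nested.join        # abc
puts intersection.join  # fgh
puts mixed.join         # abc
puts single_amp.join    # abcdefghijklmnopqrstuvwxyz
```

If the new `ChargroupParser` behaves as intended, the examples it generates for these patterns should be drawn from the same sets (plus a literal "&" in the last case).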