Add re.findall to pick out re matches (#805)

Actually using re.finditer so we can apply a repl to the result. This allows users to pick out matches and reformat them in one step.

Fixes #804

Signed-off-by: James Hewitt <james.hewitt@uk.ibm.com>
Co-authored-by: Thomas Perl <m@thp.io>
thp · Jul 30, 2024 · 654ce44
1 parent 5561459
commit 654ce44
Showing 5 changed files with 100 additions and 14 deletions.
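
The commit message says the filter is built on re.finditer rather than re.findall so that a repl template can be applied to each match. The sketch below shows that idea in isolation. The helper name `findall_with_repl` is hypothetical and not part of urlwatch. The sample data and patterns are borrowed from the tests added in this commit.

```python
import re


def findall_with_repl(pattern, data, repl=r'\g<0>'):
    """Expand repl against each non-overlapping match, one result per line.

    Hypothetical stand-alone helper; the default template \\g<0> expands to
    the full match, mirroring the default used by the new filter.
    """
    return "\n".join(m.expand(repl) for m in re.finditer(pattern, data))


data = "Some-abc-things-def-on-ghi-this-line-and\nsome-jkl-more-mno-here"

# Without repl, each full match ends up on its own line.
full_matches = findall_with_repl(r'-[a-z][a-z][a-z]-', data)
print(full_matches)

# With repl, back-references reorder the captured groups of each match.
reversed_groups = findall_with_repl(r'-([a-z])([a-z])([a-z])-', data, r'\3\2\1')
print(reversed_groups)
```

Plain re.findall returns a list of group tuples when the pattern contains several groups, which leaves no template step. Iterating over match objects lets Match.expand handle the back-references.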
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -12,6 +12,8 @@ The format mostly follows [Keep a Changelog](http://keepachangelog.com/en/1.0.0/
 - Command line options to enable and disable jobs (Requested in #813 by gapato, contributed in #820 by jamstah)
 - New option `ignore_incomplete_reads` (Requested in #725 by wschoot, contributed in #787 by wfrisch)
 - New option `wait_for` in browser jobs (Requested in #763 by yuis-ice, contributed in #810 by jamstah)
+- Added tags to jobs and the ability to select them at the command line (#789 by jamstah)
+- New filter `re.findall` (Requested in #804 by f0sh, contributed in #805 by jamstah)
 - Added tags to jobs and the ability to select them at the command line (#789, #824 by jamstah)
 
 ### Changed

diff --git a/docs/source/filters.rst b/docs/source/filters.rst
@@ -77,6 +77,7 @@ At the moment, the following filters are built-in:
 - **ical2text**: Convert `iCalendar`_ to plaintext
 - **ocr**: Convert text in images to plaintext using Tesseract OCR
 - **re.sub**: Replace text with regular expressions using Python's re.sub
+- **re.findall**: Find all non-overlapping matches using Python's re.findall
 - **reverse**: Reverse input items
 - **sha1sum**: Calculate the SHA-1 checksum of the content
 - **shellpipe**: Filter using a shell command
@@ -485,12 +486,13 @@ Alternatively, ``jq`` can be used for filtering:
    filter:
      - jq: '.[0].name'
 
-Remove or replace text using regular expressions
-------------------------------------------------
+Find, remove or replace text using regular expressions
+------------------------------------------------------
 
-Just like Python’s ``re.sub`` function, there’s the possibility to apply
-a regular expression and either remove of replace the matched text. The
-following example applies the filter 3 times:
+You can use ``re.sub`` and ``re.findall`` to apply regular expressions.
+
+``re.sub`` can be used to remove or replace all non-overlapping instances
+of matched text. The following example applies the filter 3 times:
 
 1. Just specifying a string as the value will replace the matches with
    the empty string.
@@ -499,11 +501,7 @@ following example applies the filter 3 times:
 3. You can use groups (``()``) and back-reference them with ``\1``
    (etc..) to put groups into the replacement string.
 
-All features are described in Python’s
-`re.sub <https://docs.python.org/3/library/re.html#re.sub>`__
-documentation (the ``pattern`` and ``repl`` values are passed to this
-function as-is, with the value of ``repl`` defaulting to the empty
-string).
+``repl`` defaults to the empty string, which will remove matched strings.
 
 .. code:: yaml
 
@@ -517,15 +515,42 @@ string).
            pattern: '</([^>]*)>'
            repl: '<END OF TAG \1>'
 
-If you want to enable certain flags (e.g. ``re.MULTILINE``) in the
-call, this is possible by inserting an "inline flag" documented in
-`flags in re.compile`_, here are some examples:
+``re.findall`` can be used to find all non-overlapping matches of a
+regular expression. Each match is output on its own line. The following
+example applies the filter twice:
+
+1. It uses a group (``()``) and back-reference (``\1``) to extract a
+   date from the input string.
+2. It breaks the numbers in the date out into separate lines.
+
+If ``repl`` is not specified, the full match will be included in the output.
+
+.. code:: yaml
+
+   url: https://example.com/regex-findall.html
+   filter:
+       - re.findall:
+           pattern: 'The next draw is on (\d{4}-\d{2}-\d{2}).'
+           repl: '\1'
+       - re.findall: '\d+'
+
+Note: When using HTML or XML, it is usually better to use CSS selectors or
+XPATH expressions. HTML and XML `cannot be parsed`_ properly using regular
+expressions. If the CSS selector or XPATH cannot provide the targeted
+selection required, using an ``html2text`` filter first then using
+``re.findall`` can be a good pattern.
+
+.. _`cannot be parsed`: https://stackoverflow.com/a/1732454/1047040
+
+If you want to enable flags (e.g. ``re.MULTILINE``) in ``re.sub``
+or ``re.findall`` filters, use an "inline flag", here are some
+examples:
 
 * ``re.MULTILINE``: ``(?m)`` (Makes ``^`` match start-of-line and ``$`` match end-of-line)
 * ``re.DOTALL``: ``(?s)`` (Makes ``.`` also match a newline)
 * ``re.IGNORECASE``: ``(?i)`` (Perform case-insensitive matching)
 
-.. _flags in re.compile: https://docs.python.org/3/library/re.html#re.compile
+.. _full re syntax: https://docs.python.org/3/library/re.html#regular-expression-syntax
 
 This allows you, for example, to remove all leading spaces (only
 space character and tab):

diff --git a/lib/urlwatch/filters.py b/lib/urlwatch/filters.py
@@ -848,6 +848,26 @@ def filter(self, data, subfilter):
         return re.sub(subfilter['pattern'], subfilter.get('repl', ''), data)
 
 
+class RegexFindall(FilterBase):
+    """Pick out regular expressions using Python's re.findall"""
+
+    __kind__ = 're.findall'
+
+    __supported_subfilters__ = {
+        'pattern': 'Regular expression to search for (required)',
+        'repl': 'Replacement string (default: full match)',
+    }
+
+    __default_subfilter__ = 'pattern'
+
+    def filter(self, data, subfilter):
+        if 'pattern' not in subfilter:
+            raise ValueError('{} needs a pattern'.format(self.__kind__))
+
+        # Default: Replace with full match if no "repl" value is set
+        return "\n".join(match.expand(subfilter.get('repl', '\\g<0>')) for match in re.finditer(subfilter['pattern'], data))
+
+
 class SortFilter(FilterBase):
     """Sort input items"""
 

diff --git a/lib/urlwatch/tests/data/filter_documentation_testdata.yaml b/lib/urlwatch/tests/data/filter_documentation_testdata.yaml
@@ -285,6 +285,19 @@ https://example.com/regex-substitute.html:
       HEADING 1: Welcome to this webpage<END OF TAG h1>
       <a>Some Link<END OF TAG a>
     <END OF TAG div>
+https://example.com/regex-findall.html:
+  input: |-
+    Welcome to the lottery webpage.
+    The numbers for 2020-07-11 are:
+
+       4, 8, 15, 16, 23 and 42
+
+    The next draw is on 2020-07-13.
+    Thank you for visiting the lottery webpage.
+  output: |-
+    2020
+    07
+    13
 https://example.net/shellpipe-grep.txt:
   input: |-
     <h1>Welcome to our price watching page!</h1>

diff --git a/lib/urlwatch/tests/data/filter_tests.yaml b/lib/urlwatch/tests/data/filter_tests.yaml
@@ -326,6 +326,32 @@ re_sub_multiline:
     One Line
     
     Another Line
+re_findall:
+  filter:
+    - re.findall: '-[a-z][a-z][a-z]-'
+  data: |-
+    Some-abc-things-def-on-ghi-this-line-and
+    some-jkl-more-mno-here
+  expected_result: |-
+    -abc-
+    -def-
+    -ghi-
+    -jkl-
+    -mno-
+re_findall_repl:
+  filter:
+    - re.findall:
+        pattern: '-([a-z])([a-z])([a-z])-'
+        repl: '\3\2\1'
+  data: |-
+    Some-abc-things-def-on-ghi-this-line-and
+    some-jkl-more-mno-here
+  expected_result: |-
+    cba
+    fed
+    ihg
+    lkj
+    onm
 strip:
   filter: strip
   data: "  The rose is red;   \n\nthe violet's blue.\nSugar is sweet,       \nand so are you.   "