Add re.findall to pick out re matches (#805)

Actually using re.finditer so we can apply a repl to the result. This
allows users to pick out matches and reformat them in one step.

Fixes #804

Signed-off-by: James Hewitt <james.hewitt@uk.ibm.com>
Co-authored-by: Thomas Perl <m@thp.io>
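
A minimal sketch of the idea in the commit message, using only Python's standard `re` module. The sample string and the helper name `pick_out` are made up for illustration and are not part of urlwatch. It shows that `re.findall` returns plain tuples once a pattern has several groups, while `re.finditer` yields match objects whose `expand` method can apply a `repl` template.

```python
import re

text = "Some-abc-things-def-on-ghi"
pattern = r"-([a-z])([a-z])([a-z])-"

# With more than one group, re.findall returns one tuple of groups per
# match, so there is no match object left to format with a template.
tuples = re.findall(pattern, text)


def pick_out(pattern, repl, data):
    """Hypothetical helper: format every match with a repl template."""
    return "\n".join(m.expand(repl) for m in re.finditer(pattern, data))


# Back-references reorder the captured groups of each match.
reversed_groups = pick_out(pattern, r"\3\2\1", text)
# \g<0> stands for the whole match, which is the default in the commit.
full_matches = pick_out(pattern, r"\g<0>", text)

print(tuples)
print(reversed_groups)
print(full_matches)
```

Only two matches are found here because the trailing `-ghi` has no closing hyphen.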
Jamstah and thp authored Jul 30, 2024
1 parent 5561459 commit 654ce44
Showing 5 changed files with 100 additions and 14 deletions.
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -12,6 +12,8 @@ The format mostly follows [Keep a Changelog](http://keepachangelog.com/en/1.0.0/
- Command line options to enable and disable jobs (Requested in #813 by gapato, contributed in #820 by jamstah)
- New option `ignore_incomplete_reads` (Requested in #725 by wschoot, contributed in #787 by wfrisch)
- New option `wait_for` in browser jobs (Requested in #763 by yuis-ice, contributed in #810 by jamstah)
-- Added tags to jobs and the ability to select them at the command line (#789 by jamstah)
+- New filter `re.findall` (Requested in #804 by f0sh, contributed in #805 by jamstah)
+- Added tags to jobs and the ability to select them at the command line (#789, #824 by jamstah)

### Changed
53 changes: 39 additions & 14 deletions docs/source/filters.rst
@@ -77,6 +77,7 @@ At the moment, the following filters are built-in:
- **ical2text**: Convert `iCalendar`_ to plaintext
- **ocr**: Convert text in images to plaintext using Tesseract OCR
- **re.sub**: Replace text with regular expressions using Python's re.sub
+- **re.findall**: Find all non-overlapping matches using Python's re.findall
- **reverse**: Reverse input items
- **sha1sum**: Calculate the SHA-1 checksum of the content
- **shellpipe**: Filter using a shell command
@@ -485,12 +486,13 @@ Alternatively, ``jq`` can be used for filtering:
filter:
- jq: '.[0].name'
-Remove or replace text using regular expressions
-------------------------------------------------
+Find, remove or replace text using regular expressions
+------------------------------------------------------

-Just like Python’s ``re.sub`` function, there’s the possibility to apply
-a regular expression and either remove of replace the matched text. The
-following example applies the filter 3 times:
+You can use ``re.sub`` and ``re.findall`` to apply regular expressions.
+
+``re.sub`` can be used to remove or replace all non-overlapping instances
+of matched text. The following example applies the filter 3 times:

1. Just specifying a string as the value will replace the matches with
the empty string.
@@ -499,11 +501,7 @@ following example applies the filter 3 times:
3. You can use groups (``()``) and back-reference them with ``\1``
(etc..) to put groups into the replacement string.

-All features are described in Python’s
-`re.sub <https://docs.python.org/3/library/re.html#re.sub>`__
-documentation (the ``pattern`` and ``repl`` values are passed to this
-function as-is, with the value of ``repl`` defaulting to the empty
-string).
+``repl`` defaults to the empty string, which will remove matched strings.

.. code:: yaml
@@ -517,15 +515,42 @@ string).
pattern: '</([^>]*)>'
repl: '<END OF TAG \1>'
-If you want to enable certain flags (e.g. ``re.MULTILINE``) in the
-call, this is possible by inserting an "inline flag" documented in
-`flags in re.compile`_, here are some examples:
+``re.findall`` can be used to find all non-overlapping matches of a
+regular expression. Each match is output on its own line. The following
+example applies the filter twice:
+
+1. It uses a group (``()``) and back-reference (``\1``) to extract a
+   date from the input string.
+2. It breaks the numbers in the date out into separate lines.
+
+If ``repl`` is not specified, the full match will be included in the output.
+
+.. code:: yaml
+
+   url: https://example.com/regex-findall.html
+   filter:
+     - re.findall:
+         pattern: 'The next draw is on (\d{4}-\d{2}-\d{2}).'
+         repl: '\1'
+     - re.findall: '\d+'
+
+Note: When using HTML or XML, it is usually better to use CSS selectors or
+XPATH expressions. HTML and XML `cannot be parsed`_ properly using regular
+expressions. If the CSS selector or XPATH cannot provide the targeted
+selection required, using an ``html2text`` filter first then using
+``re.findall`` can be a good pattern.
+
+.. _`cannot be parsed`: https://stackoverflow.com/a/1732454/1047040
+
+If you want to enable flags (e.g. ``re.MULTILINE``) in ``re.sub``
+or ``re.findall`` filters, use an "inline flag", here are some
+examples:

* ``re.MULTILINE``: ``(?m)`` (Makes ``^`` match start-of-line and ``$`` match end-of-line)
* ``re.DOTALL``: ``(?s)`` (Makes ``.`` also match a newline)
* ``re.IGNORECASE``: ``(?i)`` (Perform case-insensitive matching)

.. _flags in re.compile: https://docs.python.org/3/library/re.html#re.compile
+.. _full re syntax: https://docs.python.org/3/library/re.html#regular-expression-syntax

This allows you, for example, to remove all leading spaces (only
space character and tab):
20 changes: 20 additions & 0 deletions lib/urlwatch/filters.py
@@ -848,6 +848,26 @@ def filter(self, data, subfilter):
return re.sub(subfilter['pattern'], subfilter.get('repl', ''), data)


+class RegexFindall(FilterBase):
+    """Pick out regular expressions using Python's re.findall"""
+
+    __kind__ = 're.findall'
+
+    __supported_subfilters__ = {
+        'pattern': 'Regular expression to search for (required)',
+        'repl': 'Replacement string (default: full match)',
+    }
+
+    __default_subfilter__ = 'pattern'
+
+    def filter(self, data, subfilter):
+        if 'pattern' not in subfilter:
+            raise ValueError('{} needs a pattern'.format(self.__kind__))
+
+        # Default: Replace with full match if no "repl" value is set
+        return "\n".join(match.expand(subfilter.get('repl', '\\g<0>')) for match in re.finditer(subfilter['pattern'], data))
+
+
class SortFilter(FilterBase):
    """Sort input items"""

13 changes: 13 additions & 0 deletions lib/urlwatch/tests/data/filter_documentation_testdata.yaml
@@ -285,6 +285,19 @@ https://example.com/regex-substitute.html:
HEADING 1: Welcome to this webpage<END OF TAG h1>
<a>Some Link<END OF TAG a>
<END OF TAG div>
+https://example.com/regex-findall.html:
+  input: |-
+    Welcome to the lottery webpage.
+    The numbers for 2020-07-11 are:
+    4, 8, 15, 16, 23 and 42
+    The next draw is on 2020-07-13.
+    Thank you for visiting the lottery webpage.
+  output: |-
+    2020
+    07
+    13
https://example.net/shellpipe-grep.txt:
  input: |-
    <h1>Welcome to our price watching page!</h1>
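
The documented `regex-findall` example can be reproduced with plain `re`. This is a sketch only: `findall_filter` is a hypothetical stand-in that mirrors the one-line body of the new filter. It is not urlwatch's class, and it does not show how urlwatch chains filters or runs its documentation tests.

```python
import re


def findall_filter(data, pattern, repl=r"\g<0>"):
    """Hypothetical stand-in for the filter: one formatted match per line."""
    return "\n".join(m.expand(repl) for m in re.finditer(pattern, data))


page = (
    "Welcome to the lottery webpage.\n"
    "The numbers for 2020-07-11 are:\n"
    "4, 8, 15, 16, 23 and 42\n"
    "The next draw is on 2020-07-13.\n"
    "Thank you for visiting the lottery webpage."
)

# First pass: keep only the date captured by group 1.
date = findall_filter(page, r"The next draw is on (\d{4}-\d{2}-\d{2}).", r"\1")

# Second pass: no repl given, so each run of digits lands on its own line.
parts = findall_filter(date, r"\d+")

print(date)
print(parts)
```

Running the passes in this order matters: applying `\d+` to the whole page first would also pick up the drawn numbers and the earlier date.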
26 changes: 26 additions & 0 deletions lib/urlwatch/tests/data/filter_tests.yaml
@@ -326,6 +326,32 @@ re_sub_multiline:
One Line
Another Line
+re_findall:
+  filter:
+    - re.findall: '-[a-z][a-z][a-z]-'
+  data: |-
+    Some-abc-things-def-on-ghi-this-line-and
+    some-jkl-more-mno-here
+  expected_result: |-
+    -abc-
+    -def-
+    -ghi-
+    -jkl-
+    -mno-
+re_findall_repl:
+  filter:
+    - re.findall:
+        pattern: '-([a-z])([a-z])([a-z])-'
+        repl: '\3\2\1'
+  data: |-
+    Some-abc-things-def-on-ghi-this-line-and
+    some-jkl-more-mno-here
+  expected_result: |-
+    cba
+    fed
+    ihg
+    lkj
+    onm
strip:
  filter: strip
  data: " The rose is red; \n\nthe violet's blue.\nSugar is sweet, \nand so are you. "
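
As a rough cross-check of the two new `re_findall` cases, the sketch below replays them against plain `re`. The `run_findall` helper and the tuple layout of `CASES` are assumptions made for this illustration and are unrelated to how urlwatch loads `filter_tests.yaml`.

```python
import re


def run_findall(data, pattern, repl=r"\g<0>"):
    """Hypothetical re-implementation of the filter body for checking cases."""
    return "\n".join(m.expand(repl) for m in re.finditer(pattern, data))


DATA = "Some-abc-things-def-on-ghi-this-line-and\nsome-jkl-more-mno-here"

# (name, pattern, repl or None, expected), transcribed from the YAML cases.
CASES = [
    ("re_findall", r"-[a-z][a-z][a-z]-", None,
     "-abc-\n-def-\n-ghi-\n-jkl-\n-mno-"),
    ("re_findall_repl", r"-([a-z])([a-z])([a-z])-", r"\3\2\1",
     "cba\nfed\nihg\nlkj\nonm"),
]

results = {}
for name, pattern, repl, expected in CASES:
    if repl is None:
        actual = run_findall(DATA, pattern)
    else:
        actual = run_findall(DATA, pattern, repl)
    results[name] = actual == expected
    print(name, "ok" if results[name] else "MISMATCH")
```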