Consider removing "filler words" from text #245

fhightower · 2022-07-02T10:53:29Z

When given text, consider removing any word that cannot contain an IOC. For example, can we safely remove all words that consist only of letters and are shorter than 32 characters (so we don't remove an MD5 or imphash)?
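A minimal token-based sketch of this idea, assuming whitespace-separated tokens. The names is_filler and drop_filler are hypothetical and are not part of the project. The 32-character cutoff exists because an MD5 or imphash is 32 hex characters and could, in principle, consist only of the letters a-f.

```python
def is_filler(token: str) -> bool:
    """Return True for tokens made only of ASCII letters and shorter than 32 characters."""
    return token.isascii() and token.isalpha() and len(token) < 32


def drop_filler(text: str) -> str:
    """Drop filler tokens; note that split/join also collapses runs of whitespace."""
    return " ".join(token for token in text.split() if not is_filler(token))


print(drop_filler("the hash 18ddf28a71089acdbab5038f58044c0a was seen at example.com"))
```

This naive version would also drop keywords such as "imphash", which is one reason a more selective pattern may be preferable.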

fhightower · 2022-07-06T08:28:38Z

Something like:

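# (?<!\S) means "preceded by whitespace or the start of the string"; Python's re only accepts fixed-width lookbehinds.
text = sub(r'(?<!\S)[a-zA-Z]{3,31}(?<!authentihash)(?<!imphash)(?=\s|\W\s|$)', '', text)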

This is intended to remove every alphabetic string of 3 to 31 characters that is preceded by whitespace or the start of the string and followed by whitespace, a non-word character (\W) plus whitespace, or the end of the string. The two negative lookbehinds keep the keywords "authentihash" and "imphash" in place.
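The pieces of the pattern described above can be read one at a time with re.VERBOSE. This is only a sketch: the name FILLER_WORD and the short sample string are assumptions for illustration, and the leading condition is written as (?<!\S) because Python's re module requires fixed-width lookbehinds.

```python
import re

FILLER_WORD = re.compile(
    r"""
    (?<!\S)              # preceded by whitespace or the start of the string
    [a-zA-Z]{3,31}       # 3 to 31 ASCII letters
    (?<!authentihash)    # the letters just matched must not end in authentihash
    (?<!imphash)         # ...nor in imphash
    (?=\s|\W\s|$)        # followed by whitespace, a non-word char plus whitespace, or the end
    """,
    re.VERBOSE,
)

sample = "imphash 18ddf28a71089acdbab5038f58044c0a enterprise example.com TLP:RED"
result = FILLER_WORD.sub("", sample)
print(result)
```

In this sample only "enterprise" is removed: "imphash" is protected by the lookbehind, the hash starts with a digit, and "example" and "TLP" are followed by punctuation that is not itself followed by whitespace.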

Simple example:

from re import sub

s = '''abc.py bar.com example.com foo.com swissjabber.de https://example.com/test%20page/foo.com/bingo.php?q=bar.com foo@swissjabber.de me@example.com me@example.com 1.1.1.1/0 imphash 18ddf28a71089acdbab5038f58044c0a authentihash 3f1b149d07e7e8636636b8b7f7043c40ed64a10b28986181fb046c498432c2d4 1.1.1.1 2001:0db8:0000:0000:0000:ff00:0042:8329 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 12288:QYV6MorX7qzuC3QHO9FQVHPF51jgcSj2EtPo/V7I6R+Lqaw8i6hG0:vBXu9HGaVHh4Po/VU6RkqaQ6F 0000:0000:ff00 2001:0db8:0000 ASN123 CVE-2022-1234 HKEY_LOCAL_MACHINE\\Software\\Microsoft\\Windows pub-1234567891234567 UA-000000-1 imphash 18ddf28a71089acdbab5038f58044c0a 3J98t1WpEZ73CNmQviecrnyiWrnqRhWNLy 496aKKdqF1xQSSEzw7wNrkZkDUsCD5cSmNCfVhVgEps52WERBcLDGzdF5UugmFoHMm9xRJdewvK2TFfAJNwEV25rTcVF5Vp AA-F2-C9-A6-B3-4F Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.1; .NET CLR 1.1.4322; InfoPath.1) TLP:RED ~/foo/bar/abc.py enterprise pre_attack pre_attack M1036 M1015 TA0012 T1329'''

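# Raw string, with (?<!\S) for "whitespace or start of string" (Python's re needs fixed-width lookbehinds).
new_s = sub(r'(?<!\S)[a-zA-Z]{3,31}(?<!authentihash)(?<!imphash)(?=\s|\W\s|$)', '', s)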

print(new_s)

If we do this, user agents will need to be parsed from the original text rather than from the text produced by this substitution, because the regex strips words such as "MSIE", "Windows", and "CLR" out of a user-agent string.
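A small sketch of why the ordering matters, using the user agent from the sample text. The variable names and the surrounding words "seen" and "today" are assumptions for illustration only.

```python
import re

# Same filler-word pattern, written with (?<!\S) so that Python's re accepts it.
FILLER_WORD = re.compile(
    r"(?<!\S)[a-zA-Z]{3,31}(?<!authentihash)(?<!imphash)(?=\s|\W\s|$)"
)

user_agent = "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.1; .NET CLR 1.1.4322; InfoPath.1)"
original = f"seen {user_agent} today"
stripped = FILLER_WORD.sub("", original)

# The user agent is intact in the original text but not after filler removal.
print(user_agent in original)
print(user_agent in stripped)
print(stripped)
```

Any user-agent parsing would therefore need to run on the original text, with the filler-stripped text used only for the other indicator types.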

fhightower added the enhancement, time est: 1 hour, and priority: 3 (low) labels on Jul 7, 2022
