Consider removing "filler words" from text #245

Open
fhightower opened this issue Jul 2, 2022 · 1 comment
Labels: enhancement (new feature or request); priority: 3 (low); time est: 1 hour (we estimate this issue will take ≈1 hour to complete)

Comments

@fhightower
Owner

When given text, consider removing any word that cannot contain an IOC. For example, can we safely remove every word that consists only of letters and is shorter than 32 characters (so that we don't remove an MD5 or imphash)?
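To make the question concrete, here is a minimal token-based sketch of the idea. The names `is_filler` and `drop_filler` are made up for illustration and are not part of the project. It treats a whitespace-delimited token as filler when it consists only of ASCII letters and is at most 31 characters long.

```python
import string

ASCII_LETTERS = set(string.ascii_letters)


def is_filler(token: str, max_len: int = 31) -> bool:
    """Return True for a non-empty token of only ASCII letters, at most max_len long."""
    return 0 < len(token) <= max_len and all(ch in ASCII_LETTERS for ch in token)


def drop_filler(text: str) -> str:
    """Drop filler tokens; note that splitting collapses the original whitespace."""
    return " ".join(tok for tok in text.split() if not is_filler(tok))


print(drop_filler("the hash is 18ddf28a71089acdbab5038f58044c0a on example.com"))
# → 18ddf28a71089acdbab5038f58044c0a example.com
```

A 32-character all-letter token (a possible MD5) survives because it exceeds `max_len`. This sketch also drops label words such as "imphash", so any parsing that relies on those labels would need an allow-list.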

@fhightower
Owner Author

fhightower commented Jul 6, 2022

Something like:

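# (?<!\S) means "preceded by whitespace or the start of the string";
# Python's re rejects the variable-width lookbehind (?<=\s|^).
text = sub(r'(?<!\S)[a-zA-Z]{3,31}(?<!authentihash)(?<!imphash)(?=\s|(?:\W\s)|$)', '', text)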

This is intended to remove every alphabetic string of 3 to 31 characters that is preceded by whitespace or the start of the string and followed by whitespace, a non-word character plus whitespace, or the end of the string. The two negative lookbehinds keep the label words "authentihash" and "imphash" in place.

Simple example:

from re import sub

s = '''abc.py bar.com example.com foo.com swissjabber.de https://example.com/test%20page/foo.com/bingo.php?q=bar.com foo@swissjabber.de me@example.com me@example.com 1.1.1.1/0 imphash 18ddf28a71089acdbab5038f58044c0a authentihash 3f1b149d07e7e8636636b8b7f7043c40ed64a10b28986181fb046c498432c2d4 1.1.1.1 2001:0db8:0000:0000:0000:ff00:0042:8329 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 12288:QYV6MorX7qzuC3QHO9FQVHPF51jgcSj2EtPo/V7I6R+Lqaw8i6hG0:vBXu9HGaVHh4Po/VU6RkqaQ6F 0000:0000:ff00 2001:0db8:0000 ASN123 CVE-2022-1234 HKEY_LOCAL_MACHINE\\Software\\Microsoft\\Windows pub-1234567891234567 UA-000000-1 imphash 18ddf28a71089acdbab5038f58044c0a 3J98t1WpEZ73CNmQviecrnyiWrnqRhWNLy 496aKKdqF1xQSSEzw7wNrkZkDUsCD5cSmNCfVhVgEps52WERBcLDGzdF5UugmFoHMm9xRJdewvK2TFfAJNwEV25rTcVF5Vp AA-F2-C9-A6-B3-4F Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.1; .NET CLR 1.1.4322; InfoPath.1) TLP:RED ~/foo/bar/abc.py enterprise pre_attack pre_attack M1036 M1015 TA0012 T1329'''

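# Raw string avoids invalid-escape warnings; (?<!\S) replaces (?<=\s|^),
# which Python's re rejects as a variable-width lookbehind.
new_s = sub(r'(?<!\S)[a-zA-Z]{3,31}(?<!authentihash)(?<!imphash)(?=\s|(?:\W\s)|$)', '', s)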

print(new_s)
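A shorter input makes the effect easier to check by hand. The sketch below is only an illustration: the sample sentence is made up, and the leading boundary is written as `(?<!\S)` (no non-whitespace character before the match), which Python's `re` accepts as a fixed-width lookbehind.

```python
import re

# Same idea as the pattern above, compiled once for reuse.
FILLER = re.compile(
    r'(?<!\S)[a-zA-Z]{3,31}(?<!authentihash)(?<!imphash)(?=\s|(?:\W\s)|$)'
)

# Made-up sample text for illustration.
sample = "The file has imphash 18ddf28a71089acdbab5038f58044c0a and beacons to example.com daily"

cleaned = FILLER.sub('', sample)
remaining = cleaned.split()  # split() hides the leftover runs of spaces

print(remaining)
# → ['imphash', '18ddf28a71089acdbab5038f58044c0a', 'to', 'example.com']
```

Two things stand out: "to" survives because the pattern requires at least three letters, and the substitution leaves the surrounding whitespace behind, so the cleaned text contains runs of spaces.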

If we do this, user agents will need to be parsed from the original text rather than from the text after this regex has been applied, because the regex strips words such as "MSIE" and "Windows" out of a user-agent string.
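A rough sketch of that ordering, under stated assumptions: `USER_AGENT` is a deliberately simplified, hypothetical pattern (not the project's real user-agent grammar), and `parse` is an illustrative helper name.

```python
import re

FILLER = re.compile(
    r'(?<!\S)[a-zA-Z]{3,31}(?<!authentihash)(?<!imphash)(?=\s|(?:\W\s)|$)'
)
# Hypothetical, simplified user-agent pattern used only for this illustration.
USER_AGENT = re.compile(r'Mozilla/\d\.\d \([^)]*\)')


def parse(text: str) -> dict:
    # User agents are read from the untouched input ...
    user_agents = USER_AGENT.findall(text)
    # ... and the filler-free text is produced separately for other patterns.
    filtered_text = FILLER.sub('', text)
    return {"user_agents": user_agents, "filtered_text": filtered_text}


text = "Seen Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.1) in logs"
result = parse(text)

print(result["user_agents"])    # intact user agent from the original text
print(result["filtered_text"])  # "MSIE" and "Windows" are gone here
```

Running the user-agent pattern against `filtered_text` instead would yield a mangled string with "MSIE" and "Windows" missing, which is why the two steps read from different inputs.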

@fhightower added the enhancement, time est: 1 hour, and priority: 3 (low) labels on Jul 7, 2022