Consider removing "filler words" from text #245
Labels
enhancement
New feature or request
priority: 3 (low)
time est: 1 hour
We estimate this issue will take ≈1 hour to complete
Comments
Something like:
This is intended to replace all alphabetic strings of 3 to 31 characters that are preceded by whitespace or the start of the string and followed by whitespace, a non-word character and then whitespace, or the end of the string. Simple example:
from re import sub
s = '''abc.py bar.com example.com foo.com swissjabber.de https://example.com/test%20page/foo.com/bingo.php?q=bar.com foo@swissjabber.de me@example.com me@example.com 1.1.1.1/0 imphash 18ddf28a71089acdbab5038f58044c0a authentihash 3f1b149d07e7e8636636b8b7f7043c40ed64a10b28986181fb046c498432c2d4 1.1.1.1 2001:0db8:0000:0000:0000:ff00:0042:8329 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 12288:QYV6MorX7qzuC3QHO9FQVHPF51jgcSj2EtPo/V7I6R+Lqaw8i6hG0:vBXu9HGaVHh4Po/VU6RkqaQ6F 0000:0000:ff00 2001:0db8:0000 ASN123 CVE-2022-1234 HKEY_LOCAL_MACHINE\\Software\\Microsoft\\Windows pub-1234567891234567 UA-000000-1 imphash 18ddf28a71089acdbab5038f58044c0a 3J98t1WpEZ73CNmQviecrnyiWrnqRhWNLy 496aKKdqF1xQSSEzw7wNrkZkDUsCD5cSmNCfVhVgEps52WERBcLDGzdF5UugmFoHMm9xRJdewvK2TFfAJNwEV25rTcVF5Vp AA-F2-C9-A6-B3-4F Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.1; .NET CLR 1.1.4322; InfoPath.1) TLP:RED ~/foo/bar/abc.py enterprise pre_attack pre_attack M1036 M1015 TA0012 T1329'''
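# (?<!\S): preceded by whitespace or the start of the string (Python's re only allows fixed-width look-behinds)
new_s = sub(r'(?<!\S)[a-zA-Z]{3,31}(?<!authentihash)(?<!imphash)(?=\s|(?:\W\s)|$)', '', s)
print(new_s)
If we do this, we will need to parse user agents from the original text rather than from the text this regex has been applied to, because the regex strips words such as "Windows" out of a user agent string.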
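As a rough illustration of the behaviour described above, here is a minimal sketch on a much smaller input. The helper name `strip_filler_words` and the sample strings are made up for this example, and the look-behind is written as `(?<!\S)`, an assumption about the intended "whitespace or start of string" check that Python's `re` module accepts.

```python
import re

# Assumed pattern: letter-only words of 3 to 31 characters, excluding the
# keywords "authentihash" and "imphash". (?<!\S) means "preceded by
# whitespace or the start of the string".
FILLER_WORD = re.compile(
    r'(?<!\S)[a-zA-Z]{3,31}(?<!authentihash)(?<!imphash)(?=\s|\W\s|$)'
)


def strip_filler_words(text: str) -> str:
    """Remove filler words from text (hypothetical helper, not part of any library)."""
    return FILLER_WORD.sub('', text)


sample = 'please check example.com and imphash 18ddf28a71089acdbab5038f58044c0a today'
stripped = strip_filler_words(sample)
print(stripped.split())
# ['example.com', 'imphash', '18ddf28a71089acdbab5038f58044c0a']

# A user agent loses some of its words, which is why user agents would
# have to be parsed from the original text instead.
user_agent = 'Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.1)'
stripped_agent = strip_filler_words(user_agent)
print(stripped_agent)
```

The domain, the `imphash` keyword, and the hash survive, while the stripped user agent no longer contains `MSIE` or `Windows`.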
fhightower added the enhancement, time est: 1 hour, and priority: 3 (low) labels on Jul 7, 2022
When given text, consider removing any word that cannot contain an IOC. For example, can we safely remove all words that consist only of letters and are shorter than 32 characters (so we don't remove an MD5 or imphash)?
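One way to read the rule in this question is as a per-word check. The sketch below is only an illustration of that reading: `is_filler` and `drop_filler` are hypothetical names, the hash value is made up, and splitting on whitespace is an assumption about what counts as a "word".

```python
def is_filler(word: str) -> bool:
    """Hypothetical check: ASCII letters only and shorter than 32 characters."""
    return word.isascii() and word.isalpha() and len(word) < 32


def drop_filler(text: str) -> str:
    # Split on whitespace only, so a word with attached punctuation or digits
    # is never treated as filler.
    return ' '.join(word for word in text.split() if not is_filler(word))


# Made-up 32-character value that happens to contain only letters.
letters_only_hash = 'abcdefabcdefabcdefabcdefabcdefab'
result = drop_filler(f'the hash {letters_only_hash} was seen on example.com')
print(result)
# abcdefabcdefabcdefabcdefabcdefab example.com
```

The 32-character cutoff is what keeps a letters-only 32-character hash in the output; anything containing a digit or punctuation is kept regardless of length.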