From 06af73152fc4fbc7d6c4305a0715c2c0ff0d7961 Mon Sep 17 00:00:00 2001
From: jarelllama <91372088+jarelllama@users.noreply.github.com>
Date: Thu, 5 Dec 2024 03:54:00 +0000
Subject: [PATCH] CI: update readme
---
README.md | 88 ++++++++---------------------------------
config/parked_terms.txt | 2 +-
config/whitelist.txt | 6 +--
3 files changed, 20 insertions(+), 76 deletions(-)
diff --git a/README.md b/README.md
index 4c7341b5b..c5d783192 100644
--- a/README.md
+++ b/README.md
@@ -2,9 +2,17 @@
Blocklist for newly created scam and phishing domains automatically retrieved daily using Google Search API, automated detection, and other public sources.
-The [automated retrieval](https://github.com/jarelllama/Scam-Blocklist/actions/workflows/build_deploy.yml) is done daily at 16:00 UTC.
+This blocklist aims to be an alternative to blocking all newly registered domains (NRDs), since many, but not all, NRDs are malicious. It does so by detecting new malicious domains shortly after their registration date.
+Sources include:
-This blocklist aims to be an alternative to blocking all newly registered domains (NRDs) seeing how many, but not all, NRDs are malicious. A variety of sources are integrated to detect new malicious domains within a short time span of their registration date.
+- Public databases
+- Google Search indexing to find common scam site templates
+- Open source tools such as [dnstwist](https://github.com/elceef/dnstwist) to detect common cybersquatting techniques like [Typosquatting](https://en.wikipedia.org/wiki/Typosquatting), [Doppelganger Domains](https://en.wikipedia.org/wiki/Doppelganger_domain), and [IDN Homograph Attacks](https://en.wikipedia.org/wiki/IDN_homograph_attack)
+- Regular expression matching for NRDs
+
+A list of all sources can be found in [SOURCES.md](https://github.com/jarelllama/Scam-Blocklist/blob/main/SOURCES.md).
+
+The automated retrieval is done daily at 16:00 UTC.
## Download
@@ -42,10 +50,6 @@ Today | Yesterday | Excluded | Source
* The excluded % is of domains that are dead, whitelisted, or parked.
```
-> [!IMPORTANT]
-All data retrieved are publicly available and can be viewed from their respective [sources](https://github.com/jarelllama/Scam-Blocklist/blob/main/SOURCES.md).
-Any data hidden behind account creation/commercial licenses is never used.
-
Domains over time (days)
@@ -58,26 +62,14 @@ Courtesy of iam-py-test/blocklist_stats.
### Light version
-Targeted at list maintainers, a light version of the blocklist is available in the [lists](https://github.com/jarelllama/Scam-Blocklist/tree/main/lists) directory.
+For collated blocklists cautious about size, a light version of the blocklist is available in the [lists](https://github.com/jarelllama/Scam-Blocklist/tree/main/lists) directory. Sources excluded from the light version are marked in [SOURCES.md](https://github.com/jarelllama/Scam-Blocklist/blob/main/).
-
-Details about the light version
-
-- Intended for collated blocklists cautious about size
-- Only includes sources whose domains can be filtered by date registered/reported
-- Only includes domains retrieved/reported from February 2024 onwards, whereas the full list goes back further historically
-- Note that dead and parked domains that become alive/unparked are not added back into the light version (due to limitations in the way these domains are recorded)
-
-Sources excluded from the light version are marked in SOURCES.md.
-
-
-The full version should be used where possible as it fully contains the light version and accounts for resurrected/unparked domains.
-
+Note that dead and parked domains that become alive/unparked are not added back into the light version due to limitations in the way these domains are recorded.
### NSFW Blocklist
Created from requests, a blocklist for NSFW domains is available in Adblock Plus format here:
-[nsfw.txt](https://raw.githubusercontent.com/jarelllama/Scam-Blocklist/main/lists/adblock/nsfw.txt)
+[nsfw.txt](https://raw.githubusercontent.com/jarelllama/Scam-Blocklist/main/lists/adblock/nsfw.txt).
Details about the NSFW Blocklist
@@ -95,48 +87,7 @@ This blocklist does not just include adult videos, but also NSFW content of the
### Malware Blocklist
-A blocklist for malicious domains extracted from Proofpoint's [Emerging Threats](https://rules.emergingthreats.net/) rulesets can be found here: **[jarelllama/Emerging-Threats](https://github.com/jarelllama/Emerging-Threats)**
-
-## Sources
-
-### Retrieving scam domains using Google Search API
-
-Google provides a [Search API](https://developers.google.com/custom-search/v1/overview) to retrieve JSON-formatted results from Google Search. A list of search terms almost exclusively found in scam sites is used by the API to retrieve domains. See the list of search terms here: [search_terms.csv](https://github.com/jarelllama/Scam-Blocklist/blob/main/config/search_terms.csv)
-
-#### Details
-
-Scam sites often do not have long lifespans; malicious domains may be replaced before they can be manually reported. By programmatically searching Google using paragraphs from real-world scam sites, new domains can be added as soon as Google crawls the site. This requires no manual reporting.
-
-The list of search terms is proactively maintained and is sourced from manual investigations of scam sites.
-
-``` text
-Active search terms: 10
-API calls made today: 64
-Domains retrieved today: 26
-```
-
-### Retrieving phishing NRDs using dnstwist
-
-New phishing domains are created daily, and unlike other sources that rely on manual reporting, [dnstwist](https://github.com/elceef/dnstwist) can automatically detect new phishing domains within days of their registration date.
-
-dnstwist is an open-source detection tool for common cybersquatting techniques like [Typosquatting](https://en.wikipedia.org/wiki/Typosquatting), [Doppelganger Domains](https://en.wikipedia.org/wiki/Doppelganger_domain), and [IDN Homograph Attacks](https://en.wikipedia.org/wiki/IDN_homograph_attack).
-
-#### Details
-
-dnstwist uses a list of common phishing targets to find permutations of the targets' domains. The target list is a handpicked compilation of cryptocurrency exchanges, delivery companies, etc. collated while wary of potential false positives. The list of phishing targets can be viewed here: [phishing_targets.csv](https://github.com/jarelllama/Scam-Blocklist/blob/main/config/phishing_targets.csv)
-
-The generated domain permutations are checked for matches in a newly registered domains (NRDs) feed comprising domains registered within the last 30 days. Each permutation is tested for alternate top-level domains (TLDs) using the 30 most prevalent TLDs from the NRD feed at the time of retrieval.
-
-``` text
-Active targets: 132
-Domains retrieved today: 73
-```
-
-### Regarding other sources
-
-All sources used presently or formerly are credited here: [SOURCES.md](https://github.com/jarelllama/Scam-Blocklist/blob/main/SOURCES.md)
-
-The domain retrieval process for all sources can be viewed in the repository's code.
+A blocklist for malicious domains extracted from Proofpoint's [Emerging Threats](https://rules.emergingthreats.net/) rulesets can be found here: **[jarelllama/Emerging-Threats](https://github.com/jarelllama/Emerging-Threats)**.
## Automated filtering process
@@ -163,12 +114,12 @@ Resurrected domains added today: 200
## Parked domains
-Parked domains are removed weekly. A list of common parked domain messages is used to automatically detect these domains. This list can be viewed here: [parked_terms.txt](https://github.com/jarelllama/Scam-Blocklist/blob/main/config/parked_terms.txt)
+Parked domains are removed weekly. A list of common parked domain messages is used to automatically detect these domains. This list can be viewed here: [parked_terms.txt](https://github.com/jarelllama/Scam-Blocklist/blob/main/config/parked_terms.txt).
Parked sites no longer containing any of the parked messages are assumed to be unparked and are included back into the blocklist.
> [!TIP]
-For list maintainers interested in integrating the parked domains as a source, the list of weekly-updated parked domains can be found here: [parked_domains.txt](https://github.com/jarelllama/Scam-Blocklist/blob/main/data/parked_domains.txt) (capped to newest 12000 entries)
+For list maintainers interested in integrating the parked domains as a source, a list of weekly-updated parked domains can be found here: [parked_domains.txt](https://github.com/jarelllama/Scam-Blocklist/blob/main/data/parked_domains.txt) (capped to newest 12000 entries).
``` text
Parked domains removed this month: 839
@@ -195,18 +146,11 @@ Unparked domains added this month: 243
* [Google's Shell Style Guide](https://google.github.io/styleguide/shellguide.html): Shell script style guide
* [Grammarly](https://grammarly.com/): spelling and grammar checker
* [Jarelllama's Blocklist Checker](https://github.com/jarelllama/Blocklist-Checker): generate a simple static report for blocklists or see previous reports of requested blocklists
-* [Legality of web scraping](https://www.quinnemanuel.com/the-firm/publications/the-legal-landscape-of-web-scraping/): the law firm of Quinn Emanuel Urquhart & Sullivan's memoranda on web scraping
* [ShellCheck](https://github.com/koalaman/shellcheck): static analysis tool for Shell scripts
* [Tranco](https://tranco-list.eu/): research-oriented top sites ranking hardened against manipulation
* [VirusTotal](https://www.virustotal.com/): analyze suspicious files, domains, IPs, and URLs to detect malware (also includes WHOIS lookup)
* [iam-py-test/blocklist_stats](https://github.com/iam-py-test/blocklist_stats): statistics on various blocklists
-## Appreciation
-
-Thanks to the following people for the help, inspiration, and support!
-
-[@T145](https://github.com/T145) - [@bongochong](https://github.com/bongochong) - [@hagezi](https://github.com/hagezi) - [@iam-py-test](https://github.com/iam-py-test) - [@sefinek24](https://github.com/sefinek24) - [@sjhgvr](https://github.com/sjhgvr)
-
## Contributing
You can contribute to this project in the following ways:
diff --git a/config/parked_terms.txt b/config/parked_terms.txt
index 2f7ea1fbb..8eeb9644e 100644
--- a/config/parked_terms.txt
+++ b/config/parked_terms.txt
@@ -57,7 +57,7 @@ tome um café e volte em alguns instantes...
url=/cgi-sys/defaultwebpage.cgi
use this domain
website is ready. the content is to be added
+you have probably come across this site because you received an email that was sent by one of our customers.
Ваш хостинг-аккаунт заблокирован. Причины могут быть следующие
中古ドメインとは過去に運用されていたwebサイトの「検索エンジン評価」
您的请求在web服务器中没有找到对应的站点
-You have probably come across this site because you received an email that was sent by one of our customers.
diff --git a/config/whitelist.txt b/config/whitelist.txt
index 75b4ed73a..58effb73d 100644
--- a/config/whitelist.txt
+++ b/config/whitelist.txt
@@ -59,6 +59,7 @@
^de-reviews\.com$
^discord\.do$
^discordium\.org$
+^dontkillmyapp\.com$
^econstrunet\.com\.br$
^energyjobline\.com$
^epiqpay\.com$
@@ -231,6 +232,7 @@
^trustpilot\.com$
^truyenhentai88\.com$
^truyenhentaivn\.org$
+^tsp1-brevo\.net$
^tumblr\.com$
^twitchytides\.io$
^twitterfilesbrazil\.com$
@@ -253,6 +255,7 @@
^wolvden\.com$
^workflowy\.com$
^wspa\.com$
+^xhamster\.best$
^xiaomitime\.com$
^xvideos18\.mobi$
^xvideos\.es$
@@ -268,7 +271,6 @@
^zhentaivn\.net$
^zoomex\.com$
^zoominfo\.com$
-^dontkillmyapp\.com$
fakewebsite
malware
phishing
@@ -276,5 +278,3 @@ scam-detector
scamadviser
scamscavenger
xiaoming
-^tsp1-brevo\.net$
-^xhamster\.best$