Stabilized v1.1.2 #90
     categorized_links['Facebook'].append(urllib.parse.unquote(link))
-elif 'twitter.com' in link:
+elif hostname and hostname.endswith('twitter.com'):
Check failure
Code scanning / CodeQL
Incomplete URL substring sanitization High
twitter.com
Copilot Autofix AI about 1 month ago
To fix the problem, the hostname check must be robust against look-alike domains that merely end with the same string. Instead of using hostname.endswith('twitter.com'), we should check whether the hostname is exactly 'twitter.com' or ends with '.twitter.com'. This prevents hostnames like malicious-twitter.com from passing the check.
- Parse the URL to extract the hostname.
- Check whether the hostname is exactly 'twitter.com' or ends with '.twitter.com'.
- Apply the same change to the other social media domain checks for consistency and security.
@@ -117,22 +117,22 @@
 hostname = parsed_url.hostname
-if hostname and hostname.endswith('facebook.com'):
-    categorized_links['Facebook'].append(urllib.parse.unquote(link))
-elif hostname and hostname.endswith('twitter.com'):
-    categorized_links['Twitter'].append(urllib.parse.unquote(link))
-elif hostname and hostname.endswith('instagram.com'):
-    categorized_links['Instagram'].append(urllib.parse.unquote(link))
-elif hostname and hostname.endswith('t.me'):
-    categorized_links['Telegram'].append(urllib.parse.unquote(link))
-elif hostname and hostname.endswith('tiktok.com'):
-    categorized_links['TikTok'].append(urllib.parse.unquote(link))
-elif hostname and hostname.endswith('linkedin.com'):
-    categorized_links['LinkedIn'].append(urllib.parse.unquote(link))
-elif hostname and hostname.endswith('vk.com'):
-    categorized_links['VKontakte'].append(urllib.parse.unquote(link))
-elif hostname and hostname.endswith('youtube.com'):
-    categorized_links['YouTube'].append(urllib.parse.unquote(link))
-elif hostname and hostname.endswith('wechat.com'):
-    categorized_links['WeChat'].append(urllib.parse.unquote(link))
-elif hostname and hostname.endswith('ok.ru'):
-    categorized_links['Odnoklassniki'].append(urllib.parse.unquote(link))
+if hostname and (hostname == 'facebook.com' or hostname.endswith('.facebook.com')):
+    categorized_links['Facebook'].append(urllib.parse.unquote(link))
+elif hostname and (hostname == 'twitter.com' or hostname.endswith('.twitter.com')):
+    categorized_links['Twitter'].append(urllib.parse.unquote(link))
+elif hostname and (hostname == 'instagram.com' or hostname.endswith('.instagram.com')):
+    categorized_links['Instagram'].append(urllib.parse.unquote(link))
+elif hostname and (hostname == 't.me' or hostname.endswith('.t.me')):
+    categorized_links['Telegram'].append(urllib.parse.unquote(link))
+elif hostname and (hostname == 'tiktok.com' or hostname.endswith('.tiktok.com')):
+    categorized_links['TikTok'].append(urllib.parse.unquote(link))
+elif hostname and (hostname == 'linkedin.com' or hostname.endswith('.linkedin.com')):
+    categorized_links['LinkedIn'].append(urllib.parse.unquote(link))
+elif hostname and (hostname == 'vk.com' or hostname.endswith('.vk.com')):
+    categorized_links['VKontakte'].append(urllib.parse.unquote(link))
+elif hostname and (hostname == 'youtube.com' or hostname.endswith('.youtube.com')):
+    categorized_links['YouTube'].append(urllib.parse.unquote(link))
+elif hostname and (hostname == 'wechat.com' or hostname.endswith('.wechat.com')):
+    categorized_links['WeChat'].append(urllib.parse.unquote(link))
+elif hostname and (hostname == 'ok.ru' or hostname.endswith('.ok.ru')):
+    categorized_links['Odnoklassniki'].append(urllib.parse.unquote(link))
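The repeated condition in this suggestion could be factored into one small helper. The following is only a minimal sketch: the name host_matches is hypothetical and does not appear in the pull request, and the lowercasing step is an added assumption for hostnames that do not come from urlparse.

```python
def host_matches(hostname, domain):
    """Return True if hostname is exactly `domain` or a subdomain of it.

    `host_matches` is a hypothetical helper name used for illustration only.
    """
    if not hostname:
        # urlparse(...).hostname is None for links without a network location.
        return False
    hostname = hostname.lower()
    # The leading dot is what rejects look-alikes such as malicious-twitter.com.
    return hostname == domain or hostname.endswith("." + domain)


print(host_matches("twitter.com", "twitter.com"))            # True
print(host_matches("mobile.twitter.com", "twitter.com"))     # True
print(host_matches("malicious-twitter.com", "twitter.com"))  # False
print(host_matches(None, "twitter.com"))                     # False
```

With such a helper, each branch would shrink to a single call, for example host_matches(hostname, 't.me').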
     categorized_links['Telegram'].append(urllib.parse.unquote(link))
-elif 'tiktok.com' in link:
+elif hostname and hostname.endswith('tiktok.com'):
Check failure
Code scanning / CodeQL
Incomplete URL substring sanitization High
tiktok.com
Copilot Autofix AI about 1 month ago
To fix the problem, the hostname check must not be bypassable by embedding the target string in an unexpected location. The most reliable approach is a stricter check that accepts the hostname only if it is exactly the expected domain or a subdomain of it.
- Modify the hostname.endswith checks so that the hostname must be either exactly the target domain or a subdomain of it.
- That is, check whether the hostname equals the target domain or ends with '.' followed by the target domain.
- Make these changes in the sm_gather function in the datagather_modules/crawl_processor.py file.
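Because the same check is repeated for ten domains, one possible alternative to the if/elif chain is a lookup table. This is a sketch under assumptions, not the code in sm_gather: the names SOCIAL_DOMAINS and categorize_links are hypothetical, and only the domain list and category labels are taken from this pull request.

```python
from urllib.parse import unquote, urlparse

# Domain -> category label, as listed in the checks of this pull request.
SOCIAL_DOMAINS = {
    "facebook.com": "Facebook",
    "twitter.com": "Twitter",
    "instagram.com": "Instagram",
    "t.me": "Telegram",
    "tiktok.com": "TikTok",
    "linkedin.com": "LinkedIn",
    "vk.com": "VKontakte",
    "youtube.com": "YouTube",
    "wechat.com": "WeChat",
    "ok.ru": "Odnoklassniki",
}


def categorize_links(links):
    """Group links by network using an exact-or-subdomain hostname match."""
    categorized = {label: [] for label in SOCIAL_DOMAINS.values()}
    for link in links:
        hostname = urlparse(link).hostname
        if not hostname:
            continue  # no network location, nothing to match
        for domain, label in SOCIAL_DOMAINS.items():
            if hostname == domain or hostname.endswith("." + domain):
                categorized[label].append(unquote(link))
                break  # first match wins, like the if/elif chain
    return categorized


result = categorize_links([
    "https://www.tiktok.com/@user",
    "https://nottiktok.com/@user",
    "https://t.me/channel",
])
print(result["TikTok"])    # ['https://www.tiktok.com/@user']
print(result["Telegram"])  # ['https://t.me/channel']
```

Adding a network then means adding one dictionary entry, and the strict comparison lives in a single place.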
     sd_socials['TikTok'].append(urllib.parse.unquote(link))
-elif 'linkedin.com' in link:
+elif hostname and hostname.endswith('linkedin.com'):
Check failure
Code scanning / CodeQL
Incomplete URL substring sanitization High
linkedin.com
Copilot Autofix AI about 1 month ago
To fix the problem, the hostname must be validated as actually belonging to the intended domain: it must either be exactly the domain or end with the domain preceded by a dot, which allows subdomains.
- Modify the code to check whether the hostname is exactly 'linkedin.com' or ends with '.linkedin.com'.
- Apply the same change to the checks for the other social media domains for consistency and security.
- Make the changes in the file datagather_modules/crawl_processor.py, lines 217 to 236.
@@ -216,22 +216,22 @@
 hostname = urlparse(link).hostname
-if hostname and hostname.endswith('facebook.com'):
-    sd_socials['Facebook'].append(urllib.parse.unquote(link))
-elif hostname and hostname.endswith('twitter.com'):
-    sd_socials['Twitter'].append(urllib.parse.unquote(link))
-elif hostname and hostname.endswith('instagram.com'):
-    sd_socials['Instagram'].append(urllib.parse.unquote(link))
-elif hostname and hostname.endswith('t.me'):
-    sd_socials['Telegram'].append(urllib.parse.unquote(link))
-elif hostname and hostname.endswith('tiktok.com'):
-    sd_socials['TikTok'].append(urllib.parse.unquote(link))
-elif hostname and hostname.endswith('linkedin.com'):
-    sd_socials['LinkedIn'].append(urllib.parse.unquote(link))
-elif hostname and hostname.endswith('vk.com'):
-    sd_socials['VKontakte'].append(urllib.parse.unquote(link))
-elif hostname and hostname.endswith('youtube.com'):
-    sd_socials['YouTube'].append(urllib.parse.unquote(link))
-elif hostname and hostname.endswith('wechat.com'):
-    sd_socials['WeChat'].append(urllib.parse.unquote(link))
-elif hostname and hostname.endswith('ok.ru'):
-    sd_socials['Odnoklassniki'].append(urllib.parse.unquote(link))
+if hostname and (hostname == 'facebook.com' or hostname.endswith('.facebook.com')):
+    sd_socials['Facebook'].append(urllib.parse.unquote(link))
+elif hostname and (hostname == 'twitter.com' or hostname.endswith('.twitter.com')):
+    sd_socials['Twitter'].append(urllib.parse.unquote(link))
+elif hostname and (hostname == 'instagram.com' or hostname.endswith('.instagram.com')):
+    sd_socials['Instagram'].append(urllib.parse.unquote(link))
+elif hostname and (hostname == 't.me' or hostname.endswith('.t.me')):
+    sd_socials['Telegram'].append(urllib.parse.unquote(link))
+elif hostname and (hostname == 'tiktok.com' or hostname.endswith('.tiktok.com')):
+    sd_socials['TikTok'].append(urllib.parse.unquote(link))
+elif hostname and (hostname == 'linkedin.com' or hostname.endswith('.linkedin.com')):
+    sd_socials['LinkedIn'].append(urllib.parse.unquote(link))
+elif hostname and (hostname == 'vk.com' or hostname.endswith('.vk.com')):
+    sd_socials['VKontakte'].append(urllib.parse.unquote(link))
+elif hostname and (hostname == 'youtube.com' or hostname.endswith('.youtube.com')):
+    sd_socials['YouTube'].append(urllib.parse.unquote(link))
+elif hostname and (hostname == 'wechat.com' or hostname.endswith('.wechat.com')):
+    sd_socials['WeChat'].append(urllib.parse.unquote(link))
+elif hostname and (hostname == 'ok.ru' or hostname.endswith('.ok.ru')):
+    sd_socials['Odnoklassniki'].append(urllib.parse.unquote(link))
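To see why the alert persisted after the first fix, it may help to compare the three variants of the LinkedIn check side by side. The function names and sample URLs below are made up for illustration; only the three conditions themselves come from this pull request.

```python
from urllib.parse import urlparse

DOMAIN = "linkedin.com"


def substring_check(link):
    # Original code: matches the domain anywhere in the URL string.
    return DOMAIN in link


def suffix_check(link):
    # First fix: parses the URL, but any hostname with this suffix passes.
    hostname = urlparse(link).hostname
    return bool(hostname and hostname.endswith(DOMAIN))


def strict_check(link):
    # Autofix suggestion: exact domain, or a dot-separated subdomain of it.
    hostname = urlparse(link).hostname
    return bool(hostname and (hostname == DOMAIN or hostname.endswith("." + DOMAIN)))


# Hypothetical sample links.
SAMPLES = [
    "https://www.linkedin.com/in/someone",      # genuine subdomain
    "https://evil.example/?next=linkedin.com",  # domain only in the query string
    "https://fakelinkedin.com/in/someone",      # look-alike registered domain
]

for link in SAMPLES:
    print(link, substring_check(link), suffix_check(link), strict_check(link))
# https://www.linkedin.com/in/someone True True True
# https://evil.example/?next=linkedin.com True False False
# https://fakelinkedin.com/in/someone True True False
```

Only the strict variant rejects both misleading links, which is the behavior the suggestion aims for.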
     sd_socials['LinkedIn'].append(urllib.parse.unquote(link))
-elif 'vk.com' in link:
+elif hostname and hostname.endswith('vk.com'):
Check failure
Code scanning / CodeQL
Incomplete URL substring sanitization High
vk.com
     sd_socials['VKontakte'].append(urllib.parse.unquote(link))
-elif 'youtube.com' in link:
+elif hostname and hostname.endswith('youtube.com'):
Check failure
Code scanning / CodeQL
Incomplete URL substring sanitization High
youtube.com
Copilot Autofix AI about 1 month ago
To fix the problem, the hostname must be validated as actually belonging to the intended domain. Instead of using hostname.endswith('youtube.com'), we should check whether the hostname is exactly 'youtube.com' or a subdomain of it, that is, whether it is exactly 'youtube.com' or ends with '.youtube.com'.
- Parse the URL using urlparse to extract the hostname.
- Check whether the hostname is either 'youtube.com' or ends with '.youtube.com'.
- Apply similar checks to the other social media domains for consistency and security.
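The parsing step matters as much as the comparison. The sketch below shows what urlparse from the standard library returns as hostname for a few hand-written URLs; the wrapper name extract_hostname is hypothetical and the URLs are illustrative.

```python
from urllib.parse import urlparse


def extract_hostname(link):
    """Return the lowercased hostname of link, or None if it has no netloc."""
    return urlparse(link).hostname


# Case is normalized and the port is dropped.
print(extract_hostname("https://WWW.YouTube.com:443/watch?v=abc"))  # www.youtube.com

# Text before '@' is userinfo, so the real host is evil.example.
print(extract_hostname("https://youtube.com@evil.example/watch"))   # evil.example

# Without a scheme or '//' the whole string is treated as a path.
print(extract_hostname("youtube.com/watch?v=abc"))                  # None
```

The last case is why the checks in the diff test that hostname is truthy before comparing it, and it also means scheme-less links are left uncategorized.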
     sd_socials['YouTube'].append(urllib.parse.unquote(link))
-elif 'wechat.com' in link:
+elif hostname and hostname.endswith('wechat.com'):
Check failure
Code scanning / CodeQL
Incomplete URL substring sanitization High
wechat.com
Copilot Autofix AI about 1 month ago
To fix the problem, the hostname check must be stricter so that malicious URLs cannot bypass it: the hostname must be exactly the expected domain or a subdomain of it. We can use the urlparse function to parse the URL and then check whether the hostname is exactly the expected domain or ends with that domain preceded by a dot.
- Modify the hostname.endswith checks so that the hostname must be either exactly the expected domain or a subdomain of it.
- Update the code in the datagather_modules/crawl_processor.py file, specifically lines 217-236, to implement this stricter check.
No description provided.