Add logic to verify URLs using HTML meta tag #16597
base: main
Conversation
I'm +1 on removing this outright -- I think the volume of legitimate users who actually need to explicitly list a port is probably vanishingly small 🙂
```python
)
r.raise_for_status()

content = next(r.iter_content(max_length_bytes))
```
I think `r.raw.read(max_length_bytes)` will be slightly faster and more idiomatic here, since we're already setting `stream=True` 🙂
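To illustrate the truncation behavior being discussed without a live request, here is a hedged sketch that uses an in-memory `io.BytesIO` as a stand-in for `r.raw` (the undecoded byte stream of a streamed response); the body size and limit are made up for the example:

```python
import io

# Stand-in for r.raw: the undecoded byte stream of a streamed response.
raw = io.BytesIO(b"<html>" + b"x" * 10_000)

# A single bounded read returns at most max_length_bytes and stops there,
# without iterating or buffering the rest of the body.
max_length_bytes = 1024
content = raw.read(max_length_bytes)
assert len(content) == 1024
```

The point of the single bounded read is that nothing past the limit is consumed from the stream, which is what the truncation logic relies on.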
Fixed!
```python
r.raise_for_status()

content = next(r.iter_content(max_length_bytes))
return content
```
Thinking out loud: should we explicitly call `r.close()` before exiting this method? As-is, I think this will leave the connection open and dangling until the server hangs up, meaning that we could end up slowly exhausting the number of available outbound sockets.
Good catch! This also applies to the new `urllib3` implementation, so we now call

```python
r.drain_conn()
r.release_conn()
```

before returning.
Hmm, is there a way we can avoid `drain_conn`? My understanding is that `r.drain_conn(...)` will block until the entire response is read (and dropped), meaning that this will end up effectively reading the whole response instead of truncating after the first X bytes.

(If I got that right, then I think we can avoid this by removing the call and doing nothing else, since we're already using a separate connection pool per request and there's no reason to release back to a pool that we don't reuse 🙂)
I think that makes sense, but shouldn't we close the connection? The docs mention that if you don't care about returning the connection to the pool, you can call `close()` to close the connection:

> You can call the close() to close the connection, but this call doesn't return the connection to the pool, throws away the unread data on the wire, and leaves the connection in an undefined protocol state. This is desirable if you prefer not reading data from the socket to re-using the HTTP connection.
Yeah, I recommend calling `HTTPResponse.close()` instead of `drain_conn()` if you don't plan on reusing the connection; that will close out the socket.
Fixed!
```python
# The domain name should not resolve to a private or shared IP address
try:
    address_tuples = socket.getaddrinfo(user_uri.host, user_uri.port)
    for family, _, _, _, sockaddr in address_tuples:
        ip_address: ipaddress.IPv4Address | ipaddress.IPv6Address | None = None
        if family == socket.AF_INET:
            ip_address = ipaddress.IPv4Address(sockaddr[0])
        elif family == socket.AF_INET6:
            ip_address = ipaddress.IPv6Address(sockaddr[0])
        if ip_address is None or not ip_address.is_global:
            return False
except (socket.gaierror, ipaddress.AddressValueError):
    return False

# We get the first 1024 bytes
try:
    content = _get_url_content(user_uri, max_length_bytes=1024)
except requests.exceptions.RequestException:
    return False
```
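As an aside, the `is_global` classification used above can be exercised without any DNS resolution, since the stdlib `ipaddress` module knows the private, shared (100.64.0.0/10), and loopback ranges directly. A small illustration (the addresses are chosen for the example only):

```python
import ipaddress

# Private (RFC 1918), shared address space (RFC 6598), and loopback
# addresses all fail the is_global test and would be rejected.
for addr in ("10.0.0.1", "192.168.1.5", "100.100.100.100", "127.0.0.1"):
    assert not ipaddress.ip_address(addr).is_global

# A public address passes.
assert ipaddress.ip_address("93.184.215.14").is_global
```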
Flagging: this has an unfortunate TOCTOU weakness, where `getaddrinfo` might return a public IP, and the subsequent resolution in `requests` might return a private/internal one. In other words, an attacker could race the check here to get it to pass.

I think the correct way to do this is to either inspect the `socket` object underneath the `requests` response, or to use the resolved IP directly and employ the `Host` header + SNI to resolve the correct domain from the server. But both of these are annoying to do 🙂

For the latter, `HostHeaderSSLAdapter` from `requests_toolbelt` is possibly the most painless approach: https://toolbelt.readthedocs.io/en/latest/adapters.html#requests_toolbelt.adapters.host_header_ssl.HostHeaderSSLAdapter

(This may also be possible more easily via `urllib3` instead -- maybe @sethmlarson knows?)
This should be easier with urllib3; you can send your own `Host` and SNI values more directly.
Just confirmed with Wireshark that this works for urllib3:

```python
http = urllib3.HTTPSConnectionPool(
    host="93.184.215.14",
    port=443,
    headers={"Host": "example.com"},
    server_hostname="example.com",
    assert_hostname="example.com",
)
resp = http.request("GET", "/")
```

This sends an SNI of `example.com`, asserts `example.com` on the cert, and sends `example.com` in the `Host` header, but doesn't do any DNS resolution. I should probably add a super-strict integration test for this construction to the urllib3 test suite; the individual parts are tested, but having it all tested together would be nice to double-check.
Amazing, thanks @sethmlarson!
Thanks! The implementation now uses `urllib3`.

(I just opened a random blog: typed "some blog" in Google, got a list of best blogs per category; in "education", the first link was https://blog.ed.ted.com/, and its complete

Changed to 100000 bytes
Part of #8635, this PR adds a function to verify arbitrary URLs by parsing their HTML and looking for a specific `meta` tag.

Concretely, a webpage with such a `meta` tag in its head element listing `project1` and `project2` would pass validation for the `project1` and `project2` PyPI projects.

This PR only adds the function and its tests. The function is not used anywhere yet.

This implementation takes into account the discussion in the issue linked above, starting with this comment: #8635 (comment). Concretely:

- The URL must use `https://` and point to a domain name (e.g. `https://domain.tld`); it cannot be an IP address (e.g. `https://100.100.100.100`)
- The HTML is parsed with `lxml`, which recovers from partial HTML, meaning only reading the first N bytes should be fine as long as they contain the tag we are looking for

I'm opening this PR with only the verification logic since it's the part that requires the most review and discussion. Once it's done we can see how to integrate it with the current upload flow (probably as an asynchronous task).
cc @woodruffw @ewjoachim