Validate the contents of identity centric metadata #8635
I recall issues like this coming up at least once, if not a few times, over in pypi-support. Someone would fork a repository, change the name in the packaging metadata, and republish it. So I'm 👍 for some sort of blue verified checkmark or something from that perspective. With my publisher hat on, though, I would hope this would be completely automated and I wouldn't have to do anything special to earn that blue checkmark.
One idea: we could add a blue checkmark for all links in the sidebar that contain a link back to the project's PyPI page. That being said, it wouldn't help if they point to forked versions, but in that case, the GitHub star count might be a tell.
👍 Any progress on this issue? I've been looking at malware on PyPI, and it is common for this identity-centric metadata to be spoofed. Some related context is this HN discussion: https://news.ycombinator.com/item?id=33438678. Many commenters are asking about providing this sort of information. I see some considerations that need discussion:
Some kinds of validation are easier than others as well - e.g. email validation is pretty straightforward, but homepage validation would require something like the ACME protocol.
Haha, rereading my 2-year-old comment above about blue check marks, it resonates strangely in today's terms 😅 Who would have guessed...
My general thought here is that for metadata we can 'verify', we should probably elevate that metadata in the UI over 'unverified' metadata. We can already validate email addresses that correspond to verified emails of maintainers. That won't include the ability to verify mailing-list-style emails, but that could potentially be added to organizations once that feature lands. With #12465, we'll be able to 'validate' the source repository as well, so any metadata that references the given upstream source repository can be considered verified too. I agree that domains/URLs will need to use the ACME protocol or something similar. I think there's probably a UX question on how these would be done per-project, if we wanted to go that route.
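For illustration, here is a minimal sketch (hypothetical helper names, not actual Warehouse code) of the kind of checks described above: an email counts as verified when it matches a verified address on a publishing account, and a URL counts as verified when it points at or under the repository behind the project's Trusted Publisher.

```python
# Illustrative sketch only (hypothetical helpers, not Warehouse code):
# metadata is "verified" when it matches something PyPI already knows.

from urllib.parse import urlparse


def email_is_verified(declared_email: str, publisher_verified_emails: set[str]) -> bool:
    """Author/maintainer email matches a verified email of a user who can publish."""
    return declared_email.lower() in {e.lower() for e in publisher_verified_emails}


def url_is_verified_via_source_repo(declared_url: str, trusted_repo_url: str) -> bool:
    """Project URL points at (or under) the repository behind the Trusted Publisher."""
    declared, trusted = urlparse(declared_url), urlparse(trusted_repo_url)
    if declared.scheme != "https":
        return False
    same_host = declared.netloc.lower() == trusted.netloc.lower()
    under_repo = (declared.path.rstrip("/") + "/").startswith(trusted.path.rstrip("/") + "/")
    return same_host and under_repo


# e.g. url_is_verified_via_source_repo(
#     "https://github.com/pypa/pip/issues", "https://github.com/pypa/pip")  # -> True
```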
Mastodon has a link verification system; that might be nice. That's never going to be foolproof, though.
From attempting to perform identity-assurance checks on packages manually: bidirectional references can be a reassuring indicator. In context here: when a PyPI package points to a GitHub repository as its source code, that's interpretable as a useful but as-yet-untrusted statement. When up-to-date references are inspected within the contents of the cloned linked repository and they point back to the same original package on PyPI, then confidence in the statement increases. For reproducible-build-compliant packages the situation improves further: any third party can confirm not only that the source origin and package destination are in concordance, but also whether the published artifact from the destination is bit-for-bit genuine, by comparing it to a build from scratch of the corresponding raw origin source materials. This can be verified on both a historic and ongoing basis. So that's two orthogonal identity-validation mechanisms:

- bidirectional references between the package and its linked source repository
- reproducible builds of the published artifact from that source
These don't prevent an attacker from copying the source in its entirety and creating a duplicate under a different name with an internally consistent reference graph. Given widespread free communication, I think it's reasonable to expect that enough of the package-consumer population will be (or become) aware of, and gravitate towards, the authentic package to solve that problem.
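As a rough sketch of what such a third-party audit could look like, assuming (purely for illustration, not something stated above) that the cloned repository has a pyproject.toml with a [project.urls] table and that the published artifact can be rebuilt locally:

```python
# Rough third-party audit sketch. Assumptions (illustration only): the cloned
# repository has a pyproject.toml with a [project.urls] table, and a local
# rebuild of the published artifact is possible.

import hashlib
import tomllib
from pathlib import Path


def repo_points_back_to_pypi(repo_dir: Path, project_name: str) -> bool:
    """Bidirectional reference: does the linked repo reference the same PyPI project?"""
    data = tomllib.loads((repo_dir / "pyproject.toml").read_text())
    urls = data.get("project", {}).get("urls", {}).values()
    target = f"pypi.org/project/{project_name.lower()}"
    return any(target in url.lower() for url in urls)


def _sha256(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def artifact_matches_rebuild(published: Path, rebuilt: Path) -> bool:
    """Reproducible build: is the published artifact bit-for-bit identical to a rebuild?"""
    return _sha256(published) == _sha256(rebuilt)
```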
Following on from my previous comment, here's a mockup of what I'm imagining to separate the metadata we can verify today (source repository, maintainer email, GitHub statistics, Owner/Maintainers) from the unverifiable metadata:
Over time we can move things from below the fold to above it, but this should be a big improvement as-is for now. I pushed the diff for the mockup here; there's some hacky stuff in there just to get the mockup to look good, but it could be a good starting point.
I'm starting work on this, creating the verified section and adding "Owner"/"Maintainers" to it :)
I wonder if it makes more sense to have verified details and then unverified details, or to have each category with a verified sub-section and a non-verified sub-section. It feels weird to break the project links apart from one another. When your eyes have reached the place where the repository is, it's not very clear that if the documentation isn't there, you have to look somewhere else entirely to find a different link section that might contain the link to the docs. I'd even argue that in this case, the whole thing would look more readable if the project doesn't use trusted publishers. What about something like this? (Not arguing it's better, just a suggestion for the discussion.) (Would @nlhkabu have an opinion on the matter?)
#16205 starts marking URLs as verified; this now just needs to be surfaced in the UI.
I'm working on the UI part now |
Reopening this: we have solved this for a subset of project URLs that relate to Trusted Publishing, but these two remain:
I think we also want to think about a solution for validating non-trusted publisher project URLs (e.g., ACME).
I am a massive fan of this idea 🙂 To spitball a little bit, a given FQDN's well-known path could serve something like:

```json
{
  "version": 1,
  "packages": ["example"]
}
```

(where `example` is the PyPI project claiming the domain).

Another lower-intensity option would be the `rel="me"` style of link verification that Mastodon uses:

```html
<head>
  <link rel="me" href="https://pypi.org/p/example">
</head>
```

Like with the well-known document, PyPI would fetch the page and look for a reference back to the project. Alternatively, this could use a custom `<meta>` tag:

```html
<head>
  <meta rel="me" namespace="pypi.org" package="example">
</head>
```

...or even multiple in the same tag:

```html
<head>
  <meta rel="me" namespace="pypi.org" package="example another-example">
</head>
```

Edit: one downside to the above: […]

Ref on Mastodon's link verification: https://docs.joinmastodon.org/user/profile/#verification
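To make the consumer side of the Mastodon-style option concrete, here is a minimal sketch (assuming the `<link rel="me" href="https://pypi.org/p/<project>">` variant above; the exact format is still an open question in this thread) of how a fetched page could be checked for a back-reference:

```python
# Consumer-side sketch for the Mastodon-style option above (assumes the
# <link rel="me" href="https://pypi.org/p/<project>"> variant; the exact
# format is still an open question in this thread).

from html.parser import HTMLParser


class RelMeCollector(HTMLParser):
    """Collects href values from <link rel="me"> and <a rel="me"> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rels = (attrs.get("rel") or "").lower().split()
        if tag in ("link", "a") and "me" in rels and attrs.get("href"):
            self.hrefs.append(attrs["href"])


def page_verifies_project(html: str, project_name: str) -> bool:
    parser = RelMeCollector()
    parser.feed(html)
    expected = {
        f"https://pypi.org/p/{project_name}".rstrip("/"),
        f"https://pypi.org/project/{project_name}".rstrip("/"),
    }
    return any(href.rstrip("/") in expected for href in parser.hrefs)
```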
Using […]. Using […].
Yeah -- my thinking was that it'd be most useful for projects/companies with full-blown domains, e.g. company websites.
Makes sense! That reminded me: do we also want to consider periodic reverification? My initial thought is "no" since URLs are verified on a per-release basis, but I could see an argument for doing it as well (or at least giving project owners the ability to click a "reverify" button once per release or similar).
I think I would want to stick to the "verified at time of release" model, at least for now (not trying to reinvent Keybase here).
If we specifically plan for this to be used on Read the Docs, it makes sense to ensure that whatever format we decide on is easy to use with Sphinx & MkDocs. I've made a small test with Sphinx:

```rst
.. meta::
   :rel=me namespace=pypi.org package=example: pypi
```

produces

```html
<meta content="pypi" namespace="pypi.org" package="example" rel="me" />
```

Note: it's going to be much harder to have multiple values in a single tag; you'd need one entry per package:

```rst
.. meta::
   :rel=me namespace=pypi.org package=example: pypi
   :rel=me namespace=pypi.org package=other: pypi
```

Also, it's impossible to not have a `content` attribute, so it could carry the package names instead:

```rst
.. meta::
   :rel=me namespace=pypi.org: example other
```

```html
<meta content="example other" namespace="pypi.org" rel="me" />
```

On MkDocs, it looks like one would need to extend the theme, which is a bit cumbersome. But I'm sure someone would make a plugin soon enough to solve this (it's probably the same with Sphinx, realistically).
Thanks for looking into that @ewjoachim! Using […]
I'm working on this. One thing that we should be aware of is that implementing this kind of verification, where each URL is accessed and parsed to see if it contains the expected tag, will significantly increase PyPI's outgoing network traffic. I'm not sure how many releases are uploaded per second. We can have restrictions to reduce network activity and protect against DoS attacks (like what Mastodon does, limiting the response size to 1 MB), but we'll still need to handle all the new outgoing requests. Since I don't know the number of releases uploaded per second, I'm leaving the question open here to see if PyPI's infrastructure can handle the extra network activity that downloading those webpages would cause.
Yeah, thanks for calling that out! I think there are a few things PyPI could do to keep the degree of uncontrolled outbound traffic to a minimum:
PyPI could do some or all of them; I suspect limiting unique FQDNs might be a little too extreme compared to the others.
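As one way to picture those limits, here is a sketch of a bounded fetch. It uses `requests` purely for illustration (an assumption, not necessarily what Warehouse would use), caps the body at roughly 1 MB per the Mastodon comparison above, bounds connect and read time, refuses redirects, and sends an explicit (made-up) User-Agent:

```python
# Sketch of a bounded fetch. Assumptions: `requests` is used purely for
# illustration, the 1 MB cap mirrors the Mastodon limit mentioned above,
# and the User-Agent string is a made-up example.

import requests

MAX_BYTES = 1_000_000          # stop reading after ~1 MB
TIMEOUT = (3.05, 10)           # (connect, read) timeouts in seconds


def fetch_page_for_verification(url: str) -> str:
    resp = requests.get(
        url,
        stream=True,                 # read the body incrementally
        timeout=TIMEOUT,
        allow_redirects=False,       # don't let the server bounce us elsewhere
        headers={"User-Agent": "pypi.org-url-verification/0 (example)"},
    )
    resp.raise_for_status()          # a 3xx response simply won't contain the tag
    chunks, size = [], 0
    for chunk in resp.iter_content(chunk_size=8192):
        chunks.append(chunk)
        size += len(chunk)
        if size >= MAX_BYTES:
            break                    # truncated page; the parser must tolerate it
    return b"".join(chunks).decode(resp.encoding or "utf-8", errors="replace")
```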
Also, as far as I can tell, pypi.org's DNS points to Fastly. If the IPs of the real servers behind Fastly were easy to discover, DDoS attacks could become easier. We need to make sure that the outgoing IP used to connect to the target website is not the same as the inbound IP for pypi.org. That's probably the case, as the workers run on different machines, but it's worth a mental check by anyone who knows the infrastructure well enough. (Also, ensure that inbound traffic not coming from Fastly is firewalled; it's probably already the case, but if not, it's probably worth doing.)
Also, limit the number of underlying IPs after DNS resolution, and/or the number of domain names, because if a server has a wildcard DNS entry, we can be made to generate infinite FQDNs that all hit the same server.

Oh, and we may want to set a reasonable timeout on the requests (for our own sake). Also, we may want to control (and advertise) the user agent we use to make those requests, and potentially the outbound IPs as well, if possible. Some large players (RTD, github.io, ...) might find that we make a large number of requests to them that fall within their own rate limits; they might be inclined to put a bypass in place for us, and it's much easier if we make it clear how to identify our requests.

And maybe keep a log of all the requests we made and which release each is linked to? Could we end up in trouble if someone makes us request illegal content? Could the PyPI IPs end up on some FBI watchlist? (I wonder if using Cloudflare's 1.1.1.3 resolver "for families", which blocks malware & adult content, could mitigate this risk... but I don't know if that's within the terms of use.)

Oh, we also need to protect ourselves from SSRF. Even though in this case we're not displaying what we requested back to the user, in the hopefully nonexistent but possible case that an internal GET request can have side effects, this could be catastrophic. E.g. we're on AWS: imagine the user publishes a package whose URL points at an internal endpoint such as the instance metadata service.

Oh, and MITM of course. We should only try to validate HTTPS URLs; validating an HTTP URL would only lead to an untrustable result.

Just for completeness: if the page is cut off due to being more than 1 MB, but we still want to check the head, we'll need an HTML parser that doesn't crash on partial content. Should we request […]?

I guess this is the kind of question everyone should ask when they implement a webhook or a crawler or anything. There surely is a resource out there from people who have solved these headaches already.
Yes, this makes a lot of sense to do!
This log would presumably just be the list of URLs listed in the project's JSON API representation/latest release page on PyPI, no? I'm personally wary of PyPI retaining any more data than absolutely necessary, especially since in this case we're not actually storing any data from the URL, only confirming that the URL is serving HTML with a particular tag in it.
For SSRF, I think the main thing we'll need to do is prevent server-controlled redirects. In other words: if the URL itself doesn't serve the expected content directly, we shouldn't follow a redirect to go find it.
I could be convinced that we should add this restriction as a practical matter, but I'm not sure it's that important in terms of security? If the URL has an IP as its host but otherwise matches the secure origin rules (i.e. HTTPS), is there a reason we shouldn't validate it?
FWIW, this one at least is covered under "must be a secure origin" in #8635 (comment).
The danger of SSRF is internal URLs. The request will be made from within the PyPI infrastructure and may have access to network-protected endpoints that might not be accessible to random spiders.
Ah you're right sorry
This could make sense indeed.
Hm, thinking again: what if someone uses an HTTPS URL whose host resolves to an internal IP?
Oh, btw, should we make sure the port is not overridden (or force it ourselves to 443)? I don't know if there are protocols out there where we could do nasty things just by opening a TCP connection. I hope not.
Ah, I see what you mean. Yeah, I think the expectation here would be that we deny anything that resolves to a local/private/reserved address range. IP addresses would be allowed only insofar as they represent public ranges (and serve HTTPS, per above).
I think this falls under the "PyPI isn't responsible if your thing breaks after you stuff a URL to it in your metadata," but this is another datapoint in favor of making our lives easy and simply not supporting anything other than HTTPS + domain names + port 443, with zero exceptions 🙂
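A sketch of that "HTTPS + domain names + port 443, public addresses only" policy is below. It is illustrative only; a real implementation would also have to pin the resolved address when actually making the request, so DNS can't change between the check and the fetch.

```python
# Sketch of the "HTTPS + domain names + port 443, public addresses only"
# policy. Illustrative only: a real implementation would also pin the
# resolved address when actually fetching, so DNS can't change in between.

import ipaddress
import socket
from urllib.parse import urlsplit


def url_is_fetchable(url: str) -> bool:
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False                     # no plain HTTP, ever
    if parts.port not in (None, 443):
        return False                     # no overridden ports
    host = parts.hostname or ""
    try:
        ipaddress.ip_address(host)
        return False                     # reject raw IP literals outright
    except ValueError:
        pass                             # it's a hostname, as expected
    try:
        infos = socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return False
    addresses = [ipaddress.ip_address(info[4][0]) for info in infos]
    return bool(addresses) and all(addr.is_global for addr in addresses)
```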
What if we only perform this kind of verification once per release? As in, during the upload of the first file that creates the release. The reason why we currently re-verify URLs for each file upload of the same release is that Trusted Publisher verification means some file uploads might come from the Trusted Publisher URL and some not. So it makes sense to re-verify: the first file upload might not come from a relevant Trusted Publisher, but subsequent ones might. However, this is not the case for this type of verification, since we're accessing resources independently of the upload process and authentication. So checking the URLs once during release creation might be a simple way of limiting the number of requests we make.
But the pages might change. I agree that once a page has been verified, it's probably fair to trust it for some amount of time, but if someone already has a URL set up, and learns about this feature and adds their meta tag and pushes a new version, we should recheck even if we've checked before.
Yes, that's what I meant with my comment: we should do this kind of verification (meta tag) once per release.
Maybe the confusion is because I'm using "release" to refer to a new version of a package, so I'm saying we should recheck every time the user uploads a new version.
Ah no, my bad, you said it right, I misunderstood. It's just that, as far as I had understood, we had dismissed the idea of re-verifying a link, so that was already where I thought we were. So I thought you were suggesting once per project, but that's my bad :)
So how does this work for institutional projects, or are there any plans to implement it? For instance, there's a team of developers and one devops staffer is tasked with releasing. After the staffer's email is verified (email-click_link), that allows PyPI to publish that username as "Verified details" under the "Maintainers" header. But what about project URLs that point to a specific web site not associated with the repository provider? Is there any mechanism to get the actual project web site verified? (Concrete example: the repo is hosted at one provider, but the project's own web site lives elsewhere.)

Apparently, to get a project-focused email address reported as verified, an account must be opened at PyPI; it must be verified (email-click_link); and then the project-focused email address must be tied to the project. If it's a legitimate email shared amongst the team, that feels a little security-crunchy -- I wouldn't want a bunch of people with the capacity to reset a password holding PyPI rights, but that seems the only approach right now that might work.

Anyway, what about license details? They show as "Unverified" when it's a custom institutional license. Project keywords set in Setuptools are "Unverified". The supported Python version is likewise "Unverified". Maybe I missed something in the docs...
This is a four-year-old ticket, and I know there is a lot of work here to automate detection of "Verified details", which is great. However, the current solutions may not cover all scenarios, specifically those described in my previous comment. A simple option would be to provide per-project text fields to add verified URLs and/or email addresses. Since the account holder has to be verified, shouldn't this be an acceptable addition? Better yet, restrict the additional details to the parent domain of the account holder, e.g. the domain associated with the holder's verified email.
Yeah, that's why we haven't gone forward with full URL verification yet -- the design space is somewhat open, and there are tradeoffs to an ACME-style well-known approach, a `rel="me"`-style approach, and so on.
Could you say more about what you're expecting here? The point of the "verified" badge next to an email in the project view is only to provide a visual indicator that the email in the project's metadata matches one on an owning account, i.e. one entitled to publish the project. PyPI could do non-user verification of emails, but that would have some consequences:
If what you want to do is share a non-owner contact email for a project, perhaps you could link to it from an already-verified page? That would accomplish the same thing in terms of transitive verification, I think.
Everything that isn't "verifiable" (i.e. attributable to a verifiable external resource) is currently marked as "unverified." It would perhaps make sense to have a third category here for things that are "not verified because verifying them doesn't make sense," although it's not immediately clear where we'd order that in the current metadata pane 🙂
This sounds simple, but it's not that simple in practice 🙂 -- PyPI's project view is generally careful not to conflate "project originated" and "index originated" metadata, except where the latter can explicitly verify the former. We could add a notion of free-form index-originated project metadata, but doing so would still need to verify those things, since verification is meant to be two-way.
PyPI isn't actually aware of domains or subdomains at all, really -- there's some very basic specialized stuff in place around […]. There's been some discussion on that topic recently (and how it merges with PyPI's org feature), which you might find interesting and which could benefit from a potential user's perspective.
Currently, if I'm looking at a project on PyPI, it can be difficult to determine whether it's "real" or not. I can look and see the usernames that are publishing the project, as well as certain key pieces of metadata such as the project home page, the source repository, etc.
Unfortunately, there's no way to verify that a project that has, say,

https://github.com/pypa/pip

in its home page is actually the real pip and not a fake impostor pip. The same goes for other URLs, email addresses, etc. Thus it would be useful if there were some way to actually prove ownership of those URLs/emails, and either differentiate them in the UI somehow, or hide them completely unless they've been proven to be owned by one of the publishing users.

Metadata to verify:

- project URLs (home page, source repository, documentation, ...)
- email addresses (author, maintainer)