Opam supply chain
DRAFT
Here is the default supply chain for opam:
- The opam client needs to know the package universe and the contents of each package. To do so, it trusts https://opam.ocaml.org to provide:
  - opam’s metadata index: https://opam.ocaml.org/index.tar.gz associates package versions with archive names
  - opam’s archive cache: https://opam.ocaml.org/cache// contains the package contents
- The metadata index is built from the metadata repository, https://github.com/ocaml/opam-repository, using opam admin and the scripts in https://github.com/ocaml/opam.ocaml.org.
- The metadata repository is maintained by the team of opam-repository gatekeepers. They inspect package metadata following a comprehensive checklist -- which is partly automated -- and discuss with the library authors who submit packages.
- The archive cache is populated by reading the metadata repository and downloading the contents from the upstream archive mirrors (e.g. GitHub release artefacts or self-hosted archives) specified by the library authors.
So right now, when you install a package with opam, you need to trust: TLS, the opam.ocaml.org infrastructure, the GitHub infrastructure, the ocaml/opam-repository gatekeepers, and the library authors.
Note: the archive hashes do not really matter in this scheme, since altering the metadata is allowed and will just trigger a recompilation on the client.
Indeed, if an attacker targets the metadata index, they can alter the hash and either populate the cache with an archive matching their new hash (index and cache are on the same host at the moment), or just alter the URL as well. Attacking only the cache, or the infrastructure hosting the source tarballs, on the other hand, should have no effect (provided the hashes are strong enough), since the archives are checked against the hashes again on the client side.
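The asymmetry between the two attacks can be sketched in a few lines of Python. This is a toy model under stated assumptions, not opam's implementation: `index` and `cache` are hypothetical in-memory dictionaries standing in for the metadata index and the archive cache, `install` is an illustrative helper, and SHA-256 is assumed as the hash.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def install(index: dict, cache: dict, package: str) -> bytes:
    """Toy client: read the expected hash from the index, fetch the archive
    from the cache under that hash, then verify it client-side."""
    expected = index[package]
    archive = cache[expected]
    if sha256_hex(archive) != expected:
        raise ValueError(f"checksum mismatch for {package}")
    return archive


good = b"benign archive"
evil = b"malicious archive"

# Honest setup (hypothetical package name).
index = {"foo.1.0": sha256_hex(good)}
cache = {sha256_hex(good): good}

# Attacker controls only the cache: the client-side check catches the swap.
tampered_cache = {sha256_hex(good): evil}
try:
    install(index, tampered_cache, "foo.1.0")
    cache_attack_detected = False
except ValueError:
    cache_attack_detected = True

# Attacker controls the index and repopulates the cache: nothing to detect,
# because the hash the client checks against is itself attacker-chosen.
tampered_index = {"foo.1.0": sha256_hex(evil)}
attacker_cache = {sha256_hex(evil): evil}
index_attack_result = install(tampered_index, attacker_cache, "foo.1.0")
```

In this model the hash only protects what sits downstream of the index: whoever can rewrite the index decides what counts as a valid archive.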
Where the hashes would still matter is when the upstream archives are fetched. This happens in two cases (not including non-default repos, discussed below):
- if the repository server is down or unreachable for some reason, clients can still use their cached metadata and, failing to fetch from the remote cache, will fall back to the upstream archive URLs. Without strong hashes, an attacker controlling your connection could, for example, deny access to the repository and redirect your archive downloads, targeting packages that are served over plain HTTP (there are some).
- when the opam-repository cache is first populated, normally within one hour after a PR is merged. An attack scenario could involve generating two archives with the same hash, one benign and one malicious, submitting an innocuous-looking package to the repository, and swapping the upstream archive just after the PR is merged. The cache would then be populated with the malicious archive and start serving it to clients. This is not so far-fetched: with weak hashes, generating a collision is notoriously easier when you control both archives than when you have to match an existing one.
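To illustrate why controlling both archives helps the attacker, here is a toy birthday search in Python. It is only a sketch: `weak_hash` is a deliberately weakened stand-in (SHA-256 truncated to 24 bits), not a hash opam uses, and the `benign-` / `malicious-` prefixes are hypothetical placeholders for two archive families. For an ideal n-bit hash, finding any colliding pair takes on the order of 2^(n/2) attempts, whereas matching one fixed, already published archive takes on the order of 2^n.

```python
import hashlib


def weak_hash(data: bytes, nbytes: int = 3) -> bytes:
    """Deliberately weak hash for illustration: SHA-256 truncated to nbytes."""
    return hashlib.sha256(data).digest()[:nbytes]


def find_collision(prefix_a: bytes, prefix_b: bytes, nbytes: int = 3):
    """Birthday search across two input families the attacker controls.

    Returns (message_a, message_b, attempts) where the two messages start
    with prefix_a and prefix_b respectively and share the same weak hash.
    """
    seen_a = {}  # weak hash -> message from family A
    seen_b = {}  # weak hash -> message from family B
    counter = 0
    while True:
        suffix = str(counter).encode()
        a = prefix_a + suffix
        b = prefix_b + suffix
        ha, hb = weak_hash(a, nbytes), weak_hash(b, nbytes)
        seen_a[ha] = a
        seen_b[hb] = b
        if ha in seen_b:
            return a, seen_b[ha], counter + 1
        if hb in seen_a:
            return seen_a[hb], b, counter + 1
        counter += 1


benign, malicious, attempts = find_collision(b"benign-", b"malicious-")
```

With a 24-bit hash the search typically finishes after a few thousand iterations, far fewer than the roughly 16 million candidates a search against one fixed target would be expected to need.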
If you start getting package metadata and tarballs from different sources (either by using the Git remote directly, as the CI does, or by setting up your own custom cache), then things are more complicated. Note that this can also happen if the cache server is down or unreachable for some reason.
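When metadata and tarballs come from different places, the hash recorded in the metadata is the only link between the two. The following Python sketch shows one way a client could fetch from several sources in order while relying on that link. It is an assumption-laden illustration, not opam's actual fallback logic: `fetch_verified` and the `Fetcher` callables are hypothetical names, and SHA-256 is assumed.

```python
import hashlib
from typing import Callable, Iterable, Optional

# A fetcher returns the archive bytes, or None if the source is unreachable.
Fetcher = Callable[[], Optional[bytes]]


def fetch_verified(expected_sha256: str, sources: Iterable[Fetcher]) -> bytes:
    """Try each source in order and accept the first archive whose
    SHA-256 matches the hash taken from the package metadata."""
    for fetch in sources:
        data = fetch()
        if data is None:
            continue  # source down or unreachable: fall through to the next
        if hashlib.sha256(data).hexdigest() == expected_sha256:
            return data
        # Mismatch: do not trust this source's bytes, try the next one.
    raise RuntimeError("no source returned an archive matching the expected hash")


original = b"archive published by the author"
expected = hashlib.sha256(original).hexdigest()

cache_down = lambda: None                    # cache server unreachable
mirror_tampered = lambda: b"swapped archive"  # attacker-controlled mirror
upstream_ok = lambda: original               # honest upstream URL

result = fetch_verified(expected, [cache_down, mirror_tampered, upstream_ok])
```

The guarantee is only as strong as the hash and the channel that delivered the metadata: a weak hash, or a tampered metadata source, removes it.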
To get a property such as "every client verifiably gets the same tarballs", we can follow what the Go module system does (https://www.youtube.com/watch?v=KqTySYYhPUE, https://proxy.golang.org/); doing so does not require any signing. The opam client already verifies the hashes of all the archives it downloads, so this means that we need to add stronger hashes to the opam repository.
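The "same tarballs for everyone" property can be pictured as a shared record in which the first hash registered for a package version is the only one ever accepted afterwards. The Python sketch below is a toy under stated assumptions: `ChecksumLog` is a hypothetical in-memory class, not a model of how the Go infrastructure or opam is actually implemented, and it leaves out everything a real deployment needs to make the record auditable.

```python
import hashlib


class ChecksumLog:
    """Toy append-only record: the first hash seen for a package version
    is pinned, and later archives must match it."""

    def __init__(self) -> None:
        self._entries = {}  # package version -> hex digest

    def check(self, package: str, archive: bytes) -> bool:
        digest = hashlib.sha256(archive).hexdigest()
        # setdefault records the digest on first sight and otherwise
        # returns the digest that was pinned earlier.
        recorded = self._entries.setdefault(package, digest)
        return recorded == digest


log = ChecksumLog()
first_client_ok = log.check("foo.1.0", b"archive as first published")
second_client_ok = log.check("foo.1.0", b"archive as first published")
swapped_ok = log.check("foo.1.0", b"archive swapped later")
```

Such a record says nothing about who produced the archive. It only ensures that a later swap is noticed, which is why strong hashes are enough for this property.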
To get a property such as "every client gets the tarballs published by library authors and the metadata edited by authors and gatekeepers", we need a stronger infrastructure based on TUF / in-toto, which CONEX will eventually deliver.