For each package registry, 5 files are generated:
non_existent_users.csv
: Packages that point to a GitHub repository whose owner doesn't exist anymore: PyPI, npm
suspicious_updates.csv
: Packages that have been updated on the package repository without a corresponding update to the code repository's default branch: PyPI, npm
broken_urls.csv
: Packages that have a broken URL anywhere in their description, homepage, docs URL, bugtrack URL, etc: PyPI, npm
mismatching_package_repository.csv
: Packages that point to a GitHub repository whose name doesn't match the package name (This isn't always indicative of a compromised package but it helps catch malicious packages that try to impersonate legitimate ones): PyPI, npm
repeating_repositories.csv
: Packages that point to a GitHub repository that another package also points to (This isn't always indicative of a compromised package but it helps catch malicious packages that try to impersonate legitimate ones): PyPI, npm
A Trickest workflow gets the initial dataset from:
- hugovk's Top PyPI packages project for PyPI packages.
- anvaka's npmrank project (Example) for npm packages.
Then, it performs multiple checks to find any red flags that could indicate that a package is (or can be) compromised.
- The initial PyPI dataset is collected from the Top PyPI packages project, which contains a list of PyPI's top 5000 most downloaded packages, updated monthly. (Thanks, @hugovk!)
- The npm dataset is collected using the npmrank project (Thanks, @anvaka!), which collects the:
  - Top 1,000 most depended-upon packages
  - Top 1,000 packages with the largest number of dependencies
  - Top 1,000 packages with the highest PageRank score

  When merged and deduplicated, these lists amount to ~2500 packages across all categories.
- The package names are passed to the `extract-metadata` node, which collects 4 categories of info about each package:
  - The latest package release date
  - The GitHub repository connected to the package
  - The repository's latest commit date
  - The URLs that the package points to anywhere
- This node branches off into 5 checks:
  - The package's latest release date and the repository's latest commit date are compared. If a package version has been released after the latest commit date, the package is flagged.
    - Example: The `ctx` package (now deleted) had its last commit in 2014, but a new version released in 2022 turned out to be malicious.
  - GitHub usernames are extracted from the repository URLs and passed to ffuf, which queries the GitHub API to check if any usernames don't exist anymore. (Thanks, @joohoi!)
  - The package's URLs are passed to hakcheckurl to check if any URLs are broken and could be taken over. (Thanks, @hakluke!)
  - The package's GitHub repository is checked and the package is flagged if:
    - the repository name doesn't match the package name
    - the repository has been used in another package before
- In the end, the results of these checks are matched back to their packages and pushed to this repository.
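The date comparison and the two repository checks described above can be sketched in a few lines of Python. This is only an illustration of the logic, not the workflow's actual implementation: the `PackageInfo` record, its field names, and every function name below are hypothetical, and the name comparison assumes a simple case-insensitive match.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Tuple
from urllib.parse import urlparse


@dataclass
class PackageInfo:
    """Hypothetical per-package record; the field names are assumptions."""
    name: str
    latest_release: datetime
    repo_url: str
    latest_commit: datetime


def repo_owner_and_name(repo_url: str) -> Tuple[str, str]:
    """Split 'https://github.com/<owner>/<repo>' into (owner, repo)."""
    parts = [p for p in urlparse(repo_url).path.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"not a repository URL: {repo_url}")
    repo = parts[1]
    if repo.endswith(".git"):
        repo = repo[:-4]
    return parts[0], repo


def is_suspicious_update(pkg: PackageInfo) -> bool:
    """Flag a package released after the repository's latest commit."""
    return pkg.latest_release > pkg.latest_commit


def has_mismatching_repository(pkg: PackageInfo) -> bool:
    """Flag a package whose repository name differs from the package name."""
    _, repo = repo_owner_and_name(pkg.repo_url)
    return repo.lower() != pkg.name.lower()


def repeating_repositories(
    packages: List[PackageInfo],
) -> Dict[Tuple[str, str], List[str]]:
    """Map each (owner, repo) used by more than one package to those packages."""
    by_repo: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for pkg in packages:
        owner, repo = repo_owner_and_name(pkg.repo_url)
        by_repo[(owner.lower(), repo.lower())].append(pkg.name)
    return {key: names for key, names in by_repo.items() if len(names) > 1}


# Illustrative record: only the years (2014, 2022) come from the ctx example;
# the exact dates and the repository URL are placeholders.
ctx = PackageInfo(
    name="ctx",
    latest_release=datetime(2022, 1, 1),
    repo_url="https://github.com/example-owner/ctx",
    latest_commit=datetime(2014, 1, 1),
)
print(is_suspicious_update(ctx))  # True
```

The owner returned by `repo_owner_and_name` is also the username that the non-existent-user check would need to look up. A stricter name comparison might additionally treat `-`, `_`, and `.` as equivalent, as PyPI does when normalizing package names.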
All contributions are welcome! Got an idea for another check? Know a way to make a check more accurate? Feel free to create a new ticket via GitHub issues, tweet at us @trick3st, or join the conversation on Discord.
We believe in the value of tinkering. Sign up for a demo on trickest.com to customize this workflow to your use case, get access to many more workflows, or build your own from scratch!