Automated compromise detection of the world's most popular packages

trickest/packages

Packages

For each package registry, 5 files are generated:

  • non_existent_users.csv: Packages that point to a GitHub repository whose owner doesn't exist anymore: PyPI, npm
  • suspicious_updates.csv: Packages that have been updated on the package repository without a corresponding update to the code repository's default branch: PyPI, npm
  • broken_urls.csv: Packages that have a broken URL anywhere in their metadata (description, homepage, docs URL, bug tracker URL, etc.): PyPI, npm
  • mismatching_package_repository.csv: Packages that point to a GitHub repository whose name doesn't match the package name (This isn't always indicative of a compromised package but it helps catch malicious packages that try to impersonate legitimate ones): PyPI, npm
  • repeating_repositories.csv: Packages that point to a GitHub repository that another package also points to (This isn't always indicative of a compromised package but it helps catch malicious packages that try to impersonate legitimate ones): PyPI, npm
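
One way to use these files is to cross-check them against your own dependency list. The sketch below is only an illustration: the column layout of the CSV files is not described here, so it avoids assuming column names and simply returns every row in which some cell exactly matches one of your dependencies (case-insensitive). The function name `flagged_rows` is hypothetical.

```python
import csv
import io


def flagged_rows(csv_text, dependencies):
    """Return CSV rows in which any cell exactly matches a dependency name.

    Assumption: the package name appears as a whole cell somewhere in the row.
    The real column layout is not documented here, so no header is relied on.
    """
    wanted = {name.lower() for name in dependencies}
    reader = csv.reader(io.StringIO(csv_text))
    return [
        row
        for row in reader
        if any(cell.strip().lower() in wanted for cell in row)
    ]
```

To try it, read one of the generated files into a string and pass it together with the package names from your requirements or lock file. A matching row is a prompt for manual review, not proof of compromise.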

How it Works

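A Trickest workflow gets the initial dataset from public lists of the most popular PyPI and npm packages (the Top PyPI packages and npmrank projects).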

Then, it performs multiple checks to find any red flags that could indicate that a package is (or can be) compromised.

Trickest Workflow

TB; DZ (Too big; didn't zoom)

  • The initial PyPI dataset is collected from the Top PyPI packages project, which publishes a monthly-updated list of the 5,000 most-downloaded PyPI packages. (Thanks, @hugovk!)

  • The npm dataset is collected using the npmrank project (Thanks, @anvaka!), which collects the:

    1. Top 1,000 most depended-upon packages
    2. Top 1,000 packages with the largest number of dependencies
    3. Top 1,000 packages with the highest PageRank score
    • When merged and deduplicated, these three lists amount to ~2,500 unique packages.
  • The package names are passed to the extract-metadata node which collects 4 categories of info about each package:

    • The latest package release date
    • The GitHub repository connected to the package
    • The repository's latest commit date
    • The URLs that the package points to anywhere
  • This node branches off into 5 checks:

    • The package's latest release date and repository's latest commit date are compared. If a package version has been released after the latest commit date, the package is flagged.
    • GitHub usernames are extracted from the repository URLs and passed to ffuf, which queries the GitHub API to check whether any of the usernames no longer exist. (Thanks @joohoi!)
    • The package's URLs are passed to hakcheckurl to check if any URLs are broken and could be taken over. (Thanks @hakluke)
    • The package's GitHub repository is checked and the package is flagged if:
      • the repository name doesn't match the package name
      • the repository is already used by another package
  • In the end, the results of these checks are matched back to their packages and pushed to this repository.
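
The steps above can be approximated in a few lines of Python. This is a minimal sketch under stated assumptions, not the workflow's actual implementation: the `PackageMeta` record, all function names, and the case-insensitive exact-match rule for repository names are hypothetical, and the network-based checks (username lookup with ffuf, URL probing with hakcheckurl) are left out.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from typing import Optional
from urllib.parse import urlparse


@dataclass
class PackageMeta:
    """Hypothetical record for the metadata collected per package."""
    name: str
    latest_release: datetime
    repo_url: Optional[str]            # GitHub repository URL, if any
    latest_commit: Optional[datetime]  # latest commit on the default branch


def merge_package_lists(*lists):
    """Merge several name lists, dropping duplicates but keeping first-seen order."""
    return list(dict.fromkeys(name for names in lists for name in names))


def github_owner_and_repo(repo_url):
    """Split https://github.com/<owner>/<repo> into (owner, repo), else None."""
    parsed = urlparse(repo_url or "")
    parts = [p for p in parsed.path.split("/") if p]
    if parsed.netloc.lower() not in ("github.com", "www.github.com"):
        return None
    if len(parts) < 2:
        return None
    repo = parts[1][:-4] if parts[1].endswith(".git") else parts[1]
    return parts[0], repo


def is_suspicious_update(pkg):
    """Flag a release published after the repository's latest commit."""
    return pkg.latest_commit is not None and pkg.latest_release > pkg.latest_commit


def has_mismatching_repository(pkg):
    """Flag a repository whose name differs from the package name.

    Assumption: names are compared case-insensitively and otherwise exactly.
    """
    parsed = github_owner_and_repo(pkg.repo_url)
    return parsed is not None and parsed[1].lower() != pkg.name.lower()


def repeating_repositories(packages):
    """Map each repository shared by two or more packages to those package names."""
    by_repo = defaultdict(list)
    for pkg in packages:
        parsed = github_owner_and_repo(pkg.repo_url)
        if parsed:
            by_repo["/".join(parsed).lower()].append(pkg.name)
    return {repo: names for repo, names in by_repo.items() if len(names) > 1}
```

The owner returned by `github_owner_and_repo` is the value a username-existence check would consume. Note that `repeating_repositories` reports every package that shares a repository, rather than only the later ones, so legitimate monorepos will show up as well.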

Contribution

All contributions are welcome! Got an idea for another check? Know a way to make a check more accurate? Feel free to create a new ticket via GitHub issues, tweet at us @trick3st, or join the conversation on Discord.

Build your own workflows!

We believe in the value of tinkering. Sign up for a demo on trickest.com to customize this workflow to your use case, get access to many more workflows, or build your own from scratch!