☠️ Ground-truth dataset for vulnerability prediction (known research datasets and data sources included such as NVD, CVE Details and OSV); tools to automatically update the data are provided.

Collection of datasets for vulnerability prediction

This dataset is useful to conduct research in vulnerability prediction and/or empirical analysis of tools that detect software vulnerabilities through source code.

This repository integrates datasets from different sources and research papers. Datasets are available individually at github-patches/ or merged into a final dataset (final-dataset/vulnerabilities.csv). A dataset of non-security-related commits is also available for machine learning experiments.

If you want us to add a new dataset, open an issue.

Sources:

  • CVEDetails - CVEs data from 1999 to 2022.
  • NVD (☠️ 7316 CVEs) - CVEs data provided by the National Vulnerability Database from 2002 to 2022.
  • OSV (☠️ 4125 CVEs) - Project maintained by Google. Open-source vulnerabilities from different ecosystems: GHSA, DWF, Go, Linux, Maven, NuGet, OSS-Fuzz, PyPI, RubyGems, crates.io, npm.

Source data is updated monthly (last update: 31-01-2022).

Research Datasets:

  • SecBench (☠️ 676 vulns, 🔗 676 commits) - Dataset of single-patches for different programming languages.
  • BigVul (🔗 4432 commits) - C/C++ vulnerabilities.
  • SAP (☠️ 1288 vulns, 🔗 1288 commits) - Java vulnerabilities.
  • Devign (🔗 10894 commits) - C/C++ vulnerabilities.

These research datasets only include vulnerabilities whose patches are available through GitHub.

Installation

Configure environment to run the scripts:

```shell
conda create --name sec-patches --file requirements.txt
conda activate sec-patches
```

tools/ folder

Scripts to obtain the data from each source (CVE Details, NVD, or OSV) are available in the tools/ folder. For each source, there are scripts to collect the raw data, then process, normalize, and filter it by source-code hosting website (GitHub, Bitbucket, GitLab, and plain git). Check the documentation provided for each source (e.g., tools/osv/README.md) to learn how to obtain, process, normalize, and filter the data. All the datasets, except the raw ones, are available through data/. The raw datasets can also be obtained by downloading a mirror we provide through Google Drive. Check the documentation to see how.
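The "filter by source-code hosting website" step described above can be sketched as follows. This is an illustrative reimplementation under assumptions, not the repository's actual code; the regex and the returned fields are hypothetical.

```python
import re

# Hypothetical pattern for recognizing GitHub patch commits when filtering
# vulnerability references by hosting website (illustrative, not the
# repository's actual implementation).
COMMIT_RE = re.compile(
    r"https://github\.com/(?P<owner>[^/]+)/(?P<repo>[^/]+)"
    r"/commit/(?P<sha>[0-9a-f]+)"
)

def normalize_github_commit(url):
    """Return (owner, repo, sha) for a GitHub commit URL, or None."""
    m = COMMIT_RE.match(url)
    if not m:
        return None
    return m.group("owner"), m.group("repo"), m.group("sha")

print(normalize_github_commit(
    "https://github.com/torvalds/linux/commit/abc123"
))
# -> ('torvalds', 'linux', 'abc123')
print(normalize_github_commit("https://bitbucket.org/x/y/commits/abc123"))
# -> None (not a GitHub commit; would be handled by a different filter)
```

Analogous patterns would be needed for Bitbucket and GitLab, whose commit URL layouts differ.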

The source data is updated monthly by running these tools.
