This dataset is useful for research on vulnerability prediction and/or empirical analysis of tools that detect software vulnerabilities in source code.
This repository integrates datasets from different sources and research papers. Datasets are available individually at github-patches/ or collectively in a final dataset (final-dataset/vulnerabilities.csv). A dataset of non-security-related commits is also available for machine learning experiments.
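To get a first look at the combined dataset, a minimal sketch such as the one below can report the column names and row count. It makes no assumptions about the schema of final-dataset/vulnerabilities.csv. The helper name `summarize_csv` is hypothetical and not part of this repository.

```python
import csv
from pathlib import Path


def summarize_csv(path):
    """Return (column names, number of data rows) for a CSV file.

    Hypothetical helper for illustration; it is not shipped with the repository.
    """
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        row_count = sum(1 for _ in reader)   # consume every data row
        columns = reader.fieldnames or []    # header row, or [] for an empty file
    return list(columns), row_count


if __name__ == "__main__":
    # Path taken from the text above; run from the repository root.
    dataset = Path("final-dataset/vulnerabilities.csv")
    if dataset.exists():
        columns, rows = summarize_csv(dataset)
        print(f"{rows} rows; columns: {', '.join(columns)}")
    else:
        print(f"{dataset} not found; run this from the repository root")
```

The header is read rather than hard-coded, so the sketch keeps working if columns are added or renamed.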
If you want us to add a new dataset, open an issue.
Sources:
- CVEDetails - CVE data from 1999 to 2022.
- NVD (☠️ 7316 CVEs) - CVE data provided by the National Vulnerability Database from 2002 to 2022.
- OSV (☠️ 4125 CVEs) - Project maintained by Google. Open-source vulnerabilities from different ecosystems: GHSA, DWF, Go, Linux, Maven, NuGet, OSS-Fuzz, PyPI, RubyGems, crates.io, npm.
Source data is updated monthly (last update: 31-01-2022).
Research Datasets:
- SecBench (☠️ 676 vulns, 🔗 676 commits) - Dataset of single-patches for different programming languages.
- BigVul (🔗 4432 commits) - C/C++ vulnerabilities.
- SAP (☠️ 1288 vulns, 🔗 1288 commits) - Java vulnerabilities.
- Devign (🔗 10894 commits) - C/C++ vulnerabilities.
These datasets only consider vulnerabilities with patches available through GitHub.
Configure the environment to run the scripts:
conda create --name sec-patches --file requirements.txt
conda activate sec-patches
Scripts to obtain the data from each source (CVE Details, NVD or OSV) are available in the tools/ folder. For each source, there are scripts to collect the raw data and to process, normalize and filter it by source code hosting website (github, bitbucket, gitlab and git). Check the documentation provided for each source (e.g., tools/osv/README.md) to learn how to obtain, process, normalize and filter the data. All the datasets, except the raw ones, are available through data/. The raw datasets can also be collected by downloading a mirror we provide through Google Drive. Check the documentation to see how.
The source data is updated monthly by running these tools.
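As an illustration of the filtering step, the sketch below groups reference URLs by hosting website. It is not the logic used by the scripts in tools/. The host table, the rule that any other host starting with `git.` counts as "git", and the names `classify_host` and `filter_by_host` are all assumptions made for this example.

```python
from urllib.parse import urlparse

# Assumed mapping from host name to label; the real scripts may differ.
KNOWN_HOSTS = {
    "github.com": "github",
    "bitbucket.org": "bitbucket",
    "gitlab.com": "gitlab",
}


def classify_host(url):
    """Return 'github', 'bitbucket', 'gitlab', 'git', or None for a URL."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host in KNOWN_HOSTS:
        return KNOWN_HOSTS[host]
    # Assumption: self-hosted servers such as git.kernel.org count as "git".
    if host.startswith("git."):
        return "git"
    return None


def filter_by_host(urls, label):
    """Keep only the URLs whose host is classified under the given label."""
    return [url for url in urls if classify_host(url) == label]


if __name__ == "__main__":
    refs = [
        "https://github.com/example/project/commit/abc123",
        "https://git.kernel.org/pub/scm/linux/kernel/git/example.git",
        "https://example.com/advisory/1",
    ]
    print(filter_by_host(refs, "github"))
```

Matching on the parsed host name rather than a substring of the whole URL avoids false positives such as an advisory page whose path happens to contain "github.com".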