Collection of datasets for vulnerability prediction

This collection is intended for research on vulnerability prediction and for empirical analysis of tools that detect software vulnerabilities in source code.

This repository integrates datasets from different sources and research papers. Datasets are available individually at github-patches/ or collectively in a final dataset (final-dataset/vulnerabilities.csv). A dataset of non-security-related commits is also available for machine learning experiments.
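As a starting point for exploring the final dataset, the sketch below reads a CSV file with Python's standard library and reports the row count, the header, and optionally the value counts for one column. The column names of final-dataset/vulnerabilities.csv are not listed here, so the function `summarize_csv` (a hypothetical helper, not part of this repository) reads them from the file header rather than assuming them.

```python
import csv
from collections import Counter


def summarize_csv(path, column=None):
    """Return (row_count, header, value_counts) for a CSV file.

    `column` is optional; when given, values of that column are counted.
    Column names are taken from the file header, not assumed.
    """
    counts = Counter()
    rows = 0
    with open(path, newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh)
        header = list(reader.fieldnames or [])
        for row in reader:
            rows += 1
            if column is not None:
                # Rows missing the column are counted under the empty string.
                counts[row.get(column) or ""] += 1
    return rows, header, counts
```

Calling `summarize_csv("final-dataset/vulnerabilities.csv")` from the repository root should return the number of records and the actual column names, which can then be passed back in as `column` to inspect a distribution.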

If you want us to add a new dataset, open an issue.

Sources:

CVEDetails - CVEs data from 1999 to 2022.
NVD (☠️ 7316 CVEs) - CVEs data provided by the National Vulnerability Database from 2002 to 2022.
OSV (☠️ 4125 CVEs) - Project maintained by Google. Open-source vulnerabilities from different ecosystems: GHSA, DWF, Go, Linux, Maven, NuGet, OSS-Fuzz, PyPI, RubyGems, crates.io, npm.

Source data is updated monthly (last update: 31-01-2022).

Research Datasets:

SecBench (☠️ 676 vulns, 🔗 676 commits) - Dataset of single-patches for different programming languages.
BigVul (🔗 4432 commits) - C/C++ vulnerabilities.
SAP (☠️ 1288 vulns, 🔗 1288 commits) - Java vulnerabilities.
Devign (🔗 10894 commits) - C/C++ vulnerabilities.

These datasets only consider vulnerabilities with patches available through GitHub.
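To illustrate what "a patch available through GitHub" can look like in practice, here is a minimal sketch that recognizes a GitHub commit URL and extracts the owner, repository, and commit hash. The regular expression and the function `parse_github_commit` are assumptions made for illustration; the scripts in this repository may apply different rules (for example, this sketch does not match commits referenced through pull request URLs).

```python
import re

# Assumed URL shape: https://github.com/<owner>/<repo>/commit/<sha>
# where <sha> is an abbreviated or full hexadecimal hash (7 to 40 characters).
COMMIT_RE = re.compile(
    r"^https?://(?:www\.)?github\.com/([^/]+)/([^/]+)/commit/([0-9a-fA-F]{7,40})"
)


def parse_github_commit(url):
    """Return (owner, repo, sha) for a GitHub commit URL, or None otherwise."""
    match = COMMIT_RE.match(url.strip())
    if match is None:
        return None
    owner, repo, sha = match.groups()
    return owner, repo, sha.lower()
```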

Installation

Configure the environment to run the scripts:

conda create --name sec-patches --file requirements.txt
conda activate sec-patches

`tools/` folder

Scripts to obtain the data from each source (CVE Details, NVD or OSV) are available in the tools/ folder. For each source, there are scripts to collect the raw data and then process, normalize and filter it by source code hosting website (github, bitbucket, gitlab and git). Check the documentation provided for each source (e.g., tools/osv/README.md) to learn how to run each of these steps. All the datasets except the raw ones are available through data/. The raw datasets can also be obtained by downloading a mirror we provide through Google Drive; the documentation explains how.
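The filtering step by hosting website can be pictured with the sketch below, which labels a reference URL as github, bitbucket, gitlab or git based on its host name. This is not the code in tools/; the function `classify_host`, the host table, and in particular the rule that any host starting with "git." counts as the generic "git" category are assumptions made for illustration.

```python
from urllib.parse import urlparse

# Assumed mapping from host name to the category names used in the text.
KNOWN_HOSTS = {
    "github.com": "github",
    "bitbucket.org": "bitbucket",
    "gitlab.com": "gitlab",
}


def classify_host(url):
    """Return 'github', 'bitbucket', 'gitlab', 'git', or None for a URL."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host in KNOWN_HOSTS:
        return KNOWN_HOSTS[host]
    # Assumption: self-hosted repositories are often served from a "git." subdomain.
    if host.startswith("git."):
        return "git"
    return None
```

Keeping the host table in one dictionary means a new hosting site can be supported by adding a single entry, without touching the function body.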

The source data is updated monthly by running these tools.

BezBru/security-patches-dataset
