This repository provides papers, code and tools that a beginner needs to start exploring the field of Cyber Code Intelligence (CyberCI).
CyberCI is data-driven code analysis based on pattern recognition and machine learning (ML). It offers an alternative route to automated, and potentially more intelligent and efficient, code analysis and processing. In particular, the rapid growth of the open-source software community has made vast amounts of software code available, which allows machine learning and data mining techniques to exploit the abundant patterns within that code. This repository lists the technical papers, tools, and surveys produced by CyberCI research at NSCLab, Swinburne University of Technology, Australia, for newcomers interested in applying state-of-the-art ML techniques to code analysis and processing.
Fig. 1: The Cyber Code Intelligence (CyberCI)
- POSTER: Vulnerability discovery with function representation learning from unlabeled projects (CCS-2017)
[Paper] [Python Code]
- Cross-project transfer representation learning for vulnerable function discovery (TII-2018)
[Paper] [Python Code]
- Deep Learning-Based Vulnerable Function Detection: A Benchmark (ICICS-2019)
[Paper] [Python Code]
- Cyber Vulnerability Intelligence for Internet of Things Binary (TII-2019)
[Paper] [Python Code] [Video]
- Software Vulnerability Discovery via Learning Multi-domain Knowledge Bases (TDSC-2020)
[Paper] [Python Code]
- DeepBalance: Deep-Learning and Fuzzy Oversampling for Vulnerability Detection (TFS-2020)
[Paper] [Code]
- CD-VulD: Cross-Domain Vulnerability Discovery based on Deep Domain Adaptation (TDSC-2020)
[Paper] [Matlab Code]
- Code analysis for intelligent cyber systems: A data-driven approach (Information Sciences-2019)
[Paper]
- Software Vulnerability Detection Using Deep Neural Network: A Survey (Proceedings of the IEEE-2020)
[Paper]
- Function-level vulnerability detection benchmark framework
[Python Code]
Fig. 2: The deep-learning-based function-level vulnerability detection framework.
- The function-level vulnerability dataset (labeled from C open-source projects) [Link]
Open-source projects | # of non-vulnerable files collected | # of vulnerable files collected | # of non-vulnerable functions collected | # of vulnerable functions collected |
---|---|---|---|---|
Asterisk | 862 | 84 | 17,755 | 94 |
FFmpeg | 553 | 293 | 5,552 | 249 |
HTTPD | 248 | 141 | 3,850 | 57 |
LibPNG | 34 | 44 | 577 | 45 |
LibTIFF | 94 | 151 | 731 | 123 |
OpenSSL | 867 | 150 | 7,068 | 159 |
Pidgin | 448 | 42 | 8,626 | 29 |
VLC Player | 616 | 45 | 6,115 | 44 |
Xen | 738 | 370 | 9,023 | 671 |
Total | 4,460 | 1,320 | 59,297 | 1,471 |
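The function counts in the table show how imbalanced the labeled data is: only about 2.4% of all collected functions are vulnerable. The following is a minimal sketch for checking the totals and computing the vulnerable share per project. The counts are copied from the table; the names `FUNCTION_COUNTS`, `vulnerable_share`, and `summarize` are illustrative and are not part of the repository's code.

```python
# Function counts per project, copied from the table above:
# (non-vulnerable functions, vulnerable functions).
FUNCTION_COUNTS = {
    "Asterisk": (17755, 94),
    "FFmpeg": (5552, 249),
    "HTTPD": (3850, 57),
    "LibPNG": (577, 45),
    "LibTIFF": (731, 123),
    "OpenSSL": (7068, 159),
    "Pidgin": (8626, 29),
    "VLC Player": (6115, 44),
    "Xen": (9023, 671),
}


def vulnerable_share(non_vulnerable, vulnerable):
    """Return the fraction of functions that are labeled vulnerable."""
    return vulnerable / (non_vulnerable + vulnerable)


def summarize(counts):
    """Return (total non-vulnerable, total vulnerable, overall vulnerable share)."""
    total_non = sum(non_vul for non_vul, _ in counts.values())
    total_vul = sum(vul for _, vul in counts.values())
    return total_non, total_vul, vulnerable_share(total_non, total_vul)


if __name__ == "__main__":
    # Per-project share of vulnerable functions.
    for name, (non_vul, vul) in FUNCTION_COUNTS.items():
        print(f"{name:<12} {vulnerable_share(non_vul, vul):7.2%}")

    total_non, total_vul, share = summarize(FUNCTION_COUNTS)
    print(f"{'Total':<12} {share:7.2%}  ({total_vul} of {total_non + total_vul})")
```

A skew of this size is worth keeping in mind when splitting the data or choosing evaluation metrics, since plain accuracy can look high even for a model that never flags a vulnerable function.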
- The synthetic C/C++ vulnerability dataset (provided by the SARD project)
[Vulnerable functions] [Non-vulnerable functions]
Dataset | # of test cases | # of vulnerable C functions | # of non-vulnerable C functions |
---|---|---|---|
The SARD project | 64,099 | 83,710 | 52,290 |
- Cross-Domain Vulnerability Discovery
[Link]
- Cyber Vulnerability Intelligence for IoT (binary data) [Link]
Dataset | # of vulnerable samples | # of non-vulnerable samples | # of total samples | Compiled Environment |
---|---|---|---|---|
CWE-119 | 7,916 | 7,474 | 15,390 | Windows |
LibTIFF | 26 | 776 | 802 | Windows |
VLC Player | 36 | 3,895 | 3,931 | Windows |
For binary code compiled on Linux, please contact junzhang@swin.edu.au.
China: Guangzhou University, Xidian University, Hangzhou Dianzi University, Fujian Normal University, Yunnan Normal University, Huazhong University of Science and Technology, Sanming University
Australia: Deakin University, Monash University, Melbourne University, RMIT University, University of Technology Sydney
Japan: Ritsumeikan University
We welcome researchers to use our code and data. Please cite the corresponding paper listed above if you use the code or data in your work. Bug reports and suggestions for improving the code and data in this repository are appreciated. For more information, inquiries, or bug reports, please contact: junzhang@swin.edu.au.
Thanks!