AutoPatcher: Automatic Root-Cause-Analysis Guided Program Repair via Large Language Models
There are two ways to run the AutoPatcher:
- Docker Usage: You can run the AutoPatcher in a Docker container by following the instructions in the Docker Usage section.
- Local Usage: You can run the AutoPatcher on your local machine by following the instructions in the Local Usage section.
git clone https://github.com/MarkLee131/AutoPatcher.git
cd AutoPatcher
docker build -t autopatcher .
docker run -it --name myautopatcher autopatcher /bin/bash
This command will start the Docker container named myautopatcher
and open a bash shell inside the container, where you can run the parse_vuln_loc.py
and autopatcher.py
scripts manually.
The detailed usage of the scripts can be found in the Local Usage section.
The structure of the repository is as follows:
.
├── autopatcher.py # Main script for calling the fine-tuned model (CodeT5) for patching
├── autopatch_results # Directory for storing the patched files
│   └── vuln_fix_pairs.csv # CSV file containing the vulnerability and the corresponding patch
├── data ## Directory for storing the data
│   ├── demo_conti.csv # CSV file containing some CVEs for demonstration, feeding into the autopatcher
│   └── vuln_functions.csv # CSV file containing the vulnerable code snippets extracted from the C/C++ code (output of the parse_vuln_loc.py script)
├── get_functions.py # Script for extracting functions from the C/C++ code by using tree-sitter
├── LICENSE
├── models ## Directory for storing the fine-tuned model (CodeT5)
│   └── model.bin # Fine-tuned model for patching
├── parse_vuln_loc.py # Script for parsing the location of the vulnerability in the C/C++ code according to the Root Cause Analysis (RCA) tool (Aurora now)
├── rca ## Directory for storing the RCA reports and the source code of the target project
│   ├── mruby # Example project
│   ├── rca_reports # Directory for storing the RCA reports
│   └── test_parser.c # Example C file for testing the functionality of the get_functions.py script
├── README.md
└── requirements.txt # Required packages for running the autopatcher
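As a side note on how the function extraction works: get_functions.py relies on tree-sitter to locate function definitions in C/C++ sources. The sketch below only illustrates the general idea (it is not the actual script); it assumes a pre-0.22 py-tree-sitter API and a locally built C grammar, and the paths are placeholders.

from tree_sitter import Language, Parser

# Assumption: a C grammar has been compiled beforehand, e.g.
# Language.build_library('build/languages.so', ['vendor/tree-sitter-c'])
C_LANGUAGE = Language('build/languages.so', 'c')
parser = Parser()
parser.set_language(C_LANGUAGE)

def extract_functions(source: bytes):
    # Return the text of every top-level function definition in a C file.
    tree = parser.parse(source)
    functions = []
    for node in tree.root_node.children:
        if node.type == 'function_definition':
            functions.append(node.text.decode())
    return functions

with open('rca/test_parser.c', 'rb') as f:
    for func in extract_functions(f.read()):
        print(func.splitlines()[0])  # print each function's first line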
If you are using our docker image, you can skip this section and directly run the AutoPatcher in the Docker container.
To run the AutoPatcher, you need to install the required packages by running the following command:
python -m venv .venv
source .venv/bin/activate
python -m pip install -r requirements.txt
The recommended Python version is
3.12.1
since the code has been tested on this version.
The torch package listed in requirements.txt is the CPU build. If it does not work, or if a GPU is available, you can install the GPU build of torch by following the instructions here.
You can download the fine-tuned model file model.bin
from Google Drive and save it in the models
directory.
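If you want to sanity-check the downloaded checkpoint before running the scripts, a minimal loading sketch follows. It assumes model.bin is a plain PyTorch state dict for a CodeT5-base-sized model; the way autopatcher.py actually restores the checkpoint may differ.

import torch
from transformers import RobertaTokenizer, T5ForConditionalGeneration

# Assumption: model.bin is a state_dict for the CodeT5-base architecture.
tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")
model.load_state_dict(torch.load("models/model.bin", map_location="cpu"))
model.eval()
print("checkpoint loaded with", sum(p.numel() for p in model.parameters()), "parameters")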
The first phase of the AutoPatcher is to extract the vulnerable code snippets from the C/C++ code based on the Root Cause Analysis (RCA) reports generated by the RCA tool (currently Aurora).
python parse_vuln_loc.py -h
usage: parse_vuln_loc.py [-h] [--rca_dir RCA_DIR] [--project_dir PROJECT_DIR] [--output_dir OUTPUT_DIR]
options:
-h, --help show this help message and exit
--rca_dir RCA_DIR The path to the root cause analysis report directory. Default is `rca/rca_reports`.
--project_dir PROJECT_DIR
The path to the project directory. Default is `rca/mruby`.
--output_dir OUTPUT_DIR
The output directory where the `vuln_functions.csv` file will be saved. Default is `data/`.
python parse_vuln_loc.py --rca_dir ./rca/rca_reports --project_dir ./rca/mruby --output_dir ./data
It will extract the vulnerable code snippets from the C/C++ code in the ./rca/mruby
directory based on the RCA reports in the ./rca/rca_reports
directory, and save the extracted vulnerable code snippets into a csv file named vuln_functions.csv
in the ./data
directory for the next phase.
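Before moving on, you can sanity-check the extracted snippets with a few lines of pandas. The sketch assumes the extracted function text ends up in a source column, which is the column name autopatcher.py expects for its input; the other column names are whatever parse_vuln_loc.py actually produces.

import pandas as pd

df = pd.read_csv("data/vuln_functions.csv")
print(df.columns.tolist())  # see which columns parse_vuln_loc.py produced
if "source" in df.columns:
    print(df["source"].iloc[0][:300])  # preview the first vulnerable function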
The second phase generates patches for the extracted vulnerable code snippets using the fine-tuned model (CodeT5 or CodeReviewer).
python autopatcher.py --model_path ./models --num_beams 1 --vuln_path ./data/vuln_functions.csv --output_dir ./autopatch_results
This command will generate patches for the vulnerable code snippets in the ./data/vuln_functions.csv
file (generated in the previous phase) by using the fine-tuned model in the ./models
directory, and save the patched code snippets into a csv file named vuln_fix_pairs.csv
in the ./autopatch_results
directory.
You can check the available options by running the following command:
python autopatcher.py --help
The args currently supported are as follows:
python autopatcher.py --help
usage: autopatcher.py [-h] [--model_path MODEL_PATH] [--vuln_path VULN_PATH] [--output_dir OUTPUT_DIR] [--eval_batch_size EVAL_BATCH_SIZE] [--encoder_block_size ENCODER_BLOCK_SIZE]
[--decoder_block_size DECODER_BLOCK_SIZE] [--num_beams NUM_BEAMS] [--config_name CONFIG_NAME]
options:
-h, --help show this help message and exit
--model_path MODEL_PATH
The path to the model checkpoint for inference. If not specified, we will use the pretrained model from Huggingface.
--vuln_path VULN_PATH
Path to the input dataset for auto_patch, which is a csv file with a column named 'source' containing the vulnerable code snippets.
--output_dir OUTPUT_DIR
The output directory where the model predictions and checkpoints will be written.
--eval_batch_size EVAL_BATCH_SIZE
Batch size per GPU/CPU for evaluation.
--encoder_block_size ENCODER_BLOCK_SIZE
Optional input sequence length after tokenization. Defaults to the model max input length for single sentence inputs (taking into account special tokens).
--decoder_block_size DECODER_BLOCK_SIZE
Optional input sequence length after tokenization. Defaults to the model max input length for single sentence inputs (taking into account special tokens).
--num_beams NUM_BEAMS
Beam size to use when decoding.
--config_name CONFIG_NAME
Optional pretrained config name or path.
Note that the
--num_beams
parameter controls the beam width used for beam-search decoding. The default value is 1. You can increase it to generate more candidate patches if needed, but it is recommended to keep it at 1 for reasonable runtime when running the AutoPatcher on a CPU.
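For reference, the beam size is ultimately passed to a Hugging Face generate() call. The sketch below is a simplified illustration of how a larger beam width yields several candidate patches; it is not the exact code in autopatcher.py, and the toy input, lengths, and checkpoint loading are assumptions.

import torch
from transformers import RobertaTokenizer, T5ForConditionalGeneration

# Assumption: the fine-tuned checkpoint loads onto a CodeT5-base architecture.
tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")
model.load_state_dict(torch.load("models/model.bin", map_location="cpu"))
model.eval()

vulnerable_code = "int copy ( char * dst , char * src ) { strcpy ( dst , src ) ; }"  # toy input
inputs = tokenizer(vulnerable_code, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model.generate(**inputs,
                             max_length=256,
                             num_beams=4,             # corresponds to --num_beams
                             num_return_sequences=4)  # one candidate patch per beam
for patch in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(patch)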
To demonstrate the functionality of the AutoPatcher, we provide a demo using a real CVE (CVE-2017-16527). The demo consists of the following steps:
Note that our root cause analysis tool (Aurora) is unable to generate a correct RCA report for CVE-2017-16527. Therefore, we manually extracted the vulnerable code snippet for CVE-2017-16527 and saved it in the
data/demo_conti.csv
file for the demonstration.
python autopatcher.py --model_path ./models --num_beams 1
It will print and save the patched code snippet for the CVE-2017-16527 in a csv file named vuln_fix_pairs.csv
in the autopatch_results
directory.
The format of the vuln_fix_pairs.csv
file is as follows:
vuln_code,fix_code
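You can load and inspect these pairs directly, for example:

import pandas as pd

pairs = pd.read_csv("autopatch_results/vuln_fix_pairs.csv")
for _, row in pairs.iterrows():
    print("--- vulnerable code ---")
    print(row["vuln_code"])
    print("--- suggested fix ---")
    print(row["fix_code"])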
The output of the AutoPatcher is a patched code snippet for each given vulnerable code snippet, generated by the fine-tuned model (CodeT5).
To improve the readability of the output, the model emits special marker tokens that indicate the changes it made. For example, <S2SV_ModStart> and <S2SV_ModEnd> mark the start and end of a modification, respectively.
We leverage the commonly used fix format introduced in https://arxiv.org/pdf/2104.08308.
Specifically, the code changes related to fixes are expressed as token sequences delimited by these markers.
As a concrete example, CVE-2017-16527 is described as follows: sound/usb/mixer.c
in the Linux kernel before 4.13.8 allows local users to cause a denial of service (snd_usb_mixer_interrupt use-after-free and system crash) or possibly have unspecified other impact via a crafted USB device.
Feeding the vulnerable code snippet into the AutoPatcher generates the following patch:
<S2SV_ModStart>mixer ) { snd_usb_mixer_disconnect ( mixer ) ;
The suggested fix inserts a call to snd_usb_mixer_disconnect ( mixer ) to kill the pending URBs before the mixer instance is freed, which is consistent with the real patch for CVE-2017-16527 linked here.
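If you want to post-process predictions in this format programmatically, a rough sketch of splitting a prediction on the marker tokens is given below. The exact rules for re-applying the edits to the original function follow the paper linked above, so treat this only as a starting point.

import re

prediction = "<S2SV_ModStart> mixer ) { snd_usb_mixer_disconnect ( mixer ) ;"

# Split the model output into chunks delimited by the S2SV marker tokens.
# Each marked chunk describes one edit that still has to be anchored back
# into the original function (see the referenced fix format for details).
chunks = [c.strip() for c in re.split(r"<S2SV_ModStart>|<S2SV_ModEnd>", prediction) if c.strip()]
for chunk in chunks:
    print(chunk)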
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.