🚀 AutoPatcher: Automatic Root-Cause-Analysis Guided Program Repair via Large Language Models

MarkLee131/AutoPatcher

AutoPatcher

AutoPatcher: Automatic Root-Cause-Analysis Guided Program Repair via Large Language Models


There are two ways to run the AutoPatcher:

  1. Docker Usage: You can run the AutoPatcher in a Docker container by following the instructions in the Docker Usage section.

  2. Local Usage: You can run the AutoPatcher on your local machine by following the instructions in the Local Usage section.


Docker Usage

Steps to run the AutoPatcher in a Docker container

Step 1: Clone the repository

git clone https://github.com/MarkLee131/AutoPatcher.git
cd AutoPatcher

Step 2: Build the Docker image

docker build -t autopatcher .

Step 3: Run the Docker container

Example usage

docker run -it --name myautopatcher autopatcher /bin/bash

This command starts a Docker container named myautopatcher and opens a bash shell inside it, where you can run the parse_vuln_loc.py and autopatcher.py scripts manually.

The detailed usage of the scripts can be found in the Local Usage section.


Local Usage

Structure

The structure of the repository is as follows:

.
├── autopatcher.py # Main script for calling the fine-tuned model (CodeT5) for patching
├── autopatch_results # Directory for storing the patched files
│   └── vuln_fix_pairs.csv # CSV file containing the vulnerability and the corresponding patch
├── data ## Directory for storing the data
│   ├── demo_conti.csv # CSV file containing some CVEs for demonstration, feeding into the autopatcher
│   └── vuln_functions.csv # CSV file containing the vulnerable code snippets extracted from the C/C++ code (output of the parse_vuln_loc.py script)
├── get_functions.py # Script for extracting functions from the C/C++ code by using tree-sitter
├── LICENSE
├── models ## Directory for storing the fine-tuned model (CodeT5)
│   └── model.bin # Fine-tuned model for patching
├── parse_vuln_loc.py # Script for parsing the location of the vulnerability in the C/C++ code according to the Root Cause Analysis (RCA) tool (Aurora now)
├── rca ## Directory for storing the RCA reports and the source code of the target project
│   ├── mruby # Example project
│   ├── rca_reports # Directory for storing the RCA reports
│   └── test_parser.c # Example C file for testing the functionality of the get_functions.py script
├── README.md
└── requirements.txt # Required packages for running the autopatcher

Environment Setup


If you are using our Docker image, you can skip this section and run the AutoPatcher directly in the Docker container.


To run the AutoPatcher locally, create a virtual environment and install the required packages with the following commands:

Create a virtual environment

python -m venv .venv

Activate the virtual environment

source .venv/bin/activate

Install the required packages

python -m pip install -r requirements.txt

The recommended Python version is 3.12.1 since the code has been tested on this version.

Torch Installation

The torch package listed in requirements.txt is the CPU build. If it does not work on your machine, or if a GPU is available, you can install the GPU build of torch by following the instructions here.
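
To check which torch build is active in the current environment, a minimal sketch along the following lines may help. The helper name describe_torch_backend is hypothetical and not part of the repository; the only torch call it relies on is torch.cuda.is_available().

```python
import importlib.util


def describe_torch_backend() -> str:
    """Report whether torch is importable and, if so, whether it can use CUDA.

    Hypothetical helper for illustration; not part of the AutoPatcher repository.
    """
    # Avoid an ImportError if torch has not been installed yet.
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"

    import torch

    # A CPU-only build (or a machine without a usable GPU) reports False here.
    return "cuda" if torch.cuda.is_available() else "cpu"


print(describe_torch_backend())
```

If this prints cpu on a machine that has a GPU, the CPU build is probably the one installed.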

Download the fine-tuned model file

You can download the fine-tuned model file model.bin from Google Drive and save it in the models directory.
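
Before running the patch generation phase, it can be worth confirming that the download landed where the scripts expect it. The sketch below assumes the layout shown in the Structure section (models/model.bin); the helper name model_file_ready is hypothetical.

```python
from pathlib import Path


def model_file_ready(models_dir: str = "models", filename: str = "model.bin") -> bool:
    """Return True if the model file exists and is not empty.

    Hypothetical pre-flight check; the default path follows the repository
    layout described in the Structure section.
    """
    path = Path(models_dir) / filename
    return path.is_file() and path.stat().st_size > 0


if not model_file_ready():
    print("models/model.bin is missing or empty; download it before running autopatcher.py")
```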

Run the AutoPatcher

Vulnerability Extraction

The first phase of the AutoPatcher extracts the vulnerable code snippets from the C/C++ code based on the Root Cause Analysis (RCA) reports generated by the RCA tool (currently Aurora).

Options

python parse_vuln_loc.py  -h

usage: parse_vuln_loc.py [-h] [--rca_dir RCA_DIR] [--project_dir PROJECT_DIR] [--output_dir OUTPUT_DIR]

options:
  -h, --help            show this help message and exit
  --rca_dir RCA_DIR     The path to the root cause analysis report directory. Default is `rca/rca_reports`.
  --project_dir PROJECT_DIR
                        The path to the project directory. Default is `rca/mruby`.
  --output_dir OUTPUT_DIR
                        The output directory where the `vuln_functions.csv` file will be saved. Default is `data/`.

Example usage

python parse_vuln_loc.py --rca_dir ./rca/rca_reports --project_dir ./rca/mruby --output_dir ./data

This command extracts the vulnerable code snippets from the C/C++ code in the ./rca/mruby directory based on the RCA reports in the ./rca/rca_reports directory, and saves them to a CSV file named vuln_functions.csv in the ./data directory for the next phase.
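
The help text of autopatcher.py states that its input CSV needs a column named source containing the vulnerable code snippets. A quick sanity check of the extracted file could look like the sketch below; the helper name count_source_rows is hypothetical, and the raised field size limit is an assumption to accommodate long C functions.

```python
import csv


def count_source_rows(csv_path: str) -> int:
    """Count rows with a non-empty 'source' column in the extracted CSV.

    Hypothetical helper; raises ValueError if the 'source' column is absent.
    """
    # Assumption: extracted functions may exceed the csv module's default field limit.
    csv.field_size_limit(10_000_000)
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames is None or "source" not in reader.fieldnames:
            raise ValueError(f"{csv_path} has no 'source' column")
        return sum(1 for row in reader if (row["source"] or "").strip())
```

For example, count_source_rows("./data/vuln_functions.csv") returns the number of snippets that would be fed into the patch generation phase.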

Patch Generation

This phase generates patches for the extracted vulnerable code snippets using the fine-tuned model (CodeT5 or CodeReviewer).

Example usage

python autopatcher.py --model_path ./models --num_beams 1 --vuln_path ./data/vuln_functions.csv --output_dir ./autopatch_results

This command generates patches for the vulnerable code snippets in ./data/vuln_functions.csv (produced by the previous phase) using the fine-tuned model in the ./models directory, and saves the patched code snippets to a CSV file named vuln_fix_pairs.csv in the ./autopatch_results directory.
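
The two phases can also be chained from one Python script. The following is a sketch under stated assumptions: it uses only the command-line flags documented in this README, it must be run from the repository root, and the function names build_pipeline_commands and run_pipeline are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def build_pipeline_commands(
    rca_dir: str = "./rca/rca_reports",
    project_dir: str = "./rca/mruby",
    data_dir: str = "./data",
    model_path: str = "./models",
    output_dir: str = "./autopatch_results",
    num_beams: int = 1,
) -> list[list[str]]:
    """Build the argument lists for both phases (hypothetical helper)."""
    extract = [
        sys.executable, "parse_vuln_loc.py",
        "--rca_dir", rca_dir,
        "--project_dir", project_dir,
        "--output_dir", data_dir,
    ]
    patch = [
        sys.executable, "autopatcher.py",
        "--model_path", model_path,
        "--num_beams", str(num_beams),
        # The first phase writes vuln_functions.csv into its output directory.
        "--vuln_path", str(Path(data_dir) / "vuln_functions.csv"),
        "--output_dir", output_dir,
    ]
    return [extract, patch]


def run_pipeline(**kwargs) -> None:
    """Run extraction, then patch generation; stop at the first failure."""
    for cmd in build_pipeline_commands(**kwargs):
        subprocess.run(cmd, check=True)
```

Calling run_pipeline() with no arguments mirrors the two example commands above; check=True makes a failed extraction stop the run before patch generation starts.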

Options

You can check the available options by running the following command:

python autopatcher.py --help

The currently supported arguments are as follows:

usage: autopatcher.py [-h] [--model_path MODEL_PATH] [--vuln_path VULN_PATH] [--output_dir OUTPUT_DIR] [--eval_batch_size EVAL_BATCH_SIZE] [--encoder_block_size ENCODER_BLOCK_SIZE]
                      [--decoder_block_size DECODER_BLOCK_SIZE] [--num_beams NUM_BEAMS] [--config_name CONFIG_NAME]

options:
  -h, --help            show this help message and exit
  --model_path MODEL_PATH
                        The path to the model checkpoint for inference. If not specified, we will use the pretrained model from Huggingface.
  --vuln_path VULN_PATH
                        Path to the input dataset for auto_patch, which is a csv file with a column named 'source' containing the vulnerable code snippets.
  --output_dir OUTPUT_DIR
                        The output directory where the model predictions and checkpoints will be written.
  --eval_batch_size EVAL_BATCH_SIZE
                        Batch size per GPU/CPU for evaluation.
  --encoder_block_size ENCODER_BLOCK_SIZE
                        Optional input sequence length after tokenization.Default to the model max input length for single sentence inputs (take into account special tokens).
  --decoder_block_size DECODER_BLOCK_SIZE
                        Optional input sequence length after tokenization.Default to the model max input length for single sentence inputs (take into account special tokens).
  --num_beams NUM_BEAMS
                        Beam size to use when decoding.
  --config_name CONFIG_NAME
                        Optional pretrained config name or path.

Note that the --num_beams parameter controls the number of beams used in beam search decoding. The default value is 1. You can increase it to generate more candidate patches if needed, but keeping it at 1 is recommended for the best performance when running the AutoPatcher on a CPU.
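
To make the effect of the beam count concrete, here is a toy illustration of beam search in general. It is not the decoding code used by autopatcher.py: TOY_MODEL is a made-up table of next-token probabilities, and beam_search is a hypothetical helper.

```python
import math

# Made-up "model": next-token probabilities conditioned on the tokens so far.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.55, "y": 0.45},
    ("b",): {"x": 0.9, "y": 0.1},
}


def beam_search(model, steps, num_beams):
    """Keep the num_beams highest-scoring prefixes at every decoding step.

    Returns (tokens, log_probability) pairs, best first.
    """
    beams = [((), 0.0)]
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            # Expand every surviving prefix by every possible next token.
            for token, prob in model[prefix].items():
                candidates.append((prefix + (token,), score + math.log(prob)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams


for n in (1, 2):
    best_tokens, best_score = beam_search(TOY_MODEL, steps=2, num_beams=n)[0]
    print(n, best_tokens, round(math.exp(best_score), 2))
```

With one beam the search commits to "a" after the first step and ends at ("a", "x") with probability 0.33. With two beams it also keeps "b" alive and finds ("b", "x") with probability 0.36. The cost is that every step expands and scores more candidates, which is why larger beam counts are slower, particularly on a CPU.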

Demo for AutoPatcher

To demonstrate the functionality of the AutoPatcher, we provide a demo using CVE-2017-16527.

CVE-2017-16527

Note that our root cause analysis tool (Aurora) is unable to generate a correct RCA report for CVE-2017-16527. We therefore manually extracted the vulnerable code snippet for CVE-2017-16527 and saved it in the data/demo_conti.csv file for the demonstration.

Running the AutoPatcher

python autopatcher.py --model_path ./models --num_beams 1

This command prints the patched code snippet for CVE-2017-16527 and saves it to a CSV file named vuln_fix_pairs.csv in the autopatch_results directory.

The format of the vuln_fix_pairs.csv file is as follows:

vuln_code,fix_code
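
Given the two columns above, the results file can be read back with the standard csv module. This is a minimal sketch; the helper name load_fix_pairs is hypothetical.

```python
import csv


def load_fix_pairs(csv_path: str) -> list[tuple[str, str]]:
    """Read (vuln_code, fix_code) pairs from the results CSV (hypothetical helper)."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return [(row["vuln_code"], row["fix_code"]) for row in csv.DictReader(f)]
```

For example, load_fix_pairs("./autopatch_results/vuln_fix_pairs.csv") returns one tuple per vulnerable snippet, with the model output as the second element.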

How to read the output

The output of the AutoPatcher is a patched code snippet for the given vulnerable code snippet, generated by the fine-tuned model (CodeT5). To make the output easier to read, the model uses special tokens to mark the changes it made. For example, <S2SV_ModStart> and <S2SV_ModEnd> mark the start and end of a modification, respectively.
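
To inspect these markers programmatically, a small splitter such as the sketch below may be enough. It assumes only what is stated above: each modification begins with <S2SV_ModStart> and may be closed by <S2SV_ModEnd>. How the text after an end marker should be interpreted is defined by the fix format itself, so the sketch returns it unchanged. The function name split_modifications is hypothetical.

```python
MOD_START = "<S2SV_ModStart>"
MOD_END = "<S2SV_ModEnd>"


def split_modifications(model_output: str) -> list[tuple[str, str]]:
    """Split model output into (change, text_after_end_marker) pairs.

    Hypothetical helper: text_after_end_marker is empty when the
    modification has no closing <S2SV_ModEnd> marker.
    """
    modifications = []
    # Anything before the first start marker is not part of a modification.
    for chunk in model_output.split(MOD_START)[1:]:
        change, _, after_end = chunk.partition(MOD_END)
        modifications.append((change.strip(), after_end.strip()))
    return modifications


print(split_modifications("<S2SV_ModStart>mixer ) { snd_usb_mixer_disconnect ( mixer ) ;"))
```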

Output format

We leverage the commonly used fix format introduced in https://arxiv.org/pdf/2104.08308.

[Figure: fix_intro]

In particular, the code changes that make up a fix are expressed in the following format:

[Figure: keywords]

Analysis of the output

sound/usb/mixer.c in the Linux kernel before 4.13.8 allows local users to cause a denial of service (snd_usb_mixer_interrupt use-after-free and system crash) or possibly have unspecified other impact via a crafted USB device.

Given the vulnerable code snippet, the AutoPatcher generates the following patch:

<S2SV_ModStart>mixer ) { snd_usb_mixer_disconnect ( mixer ) ;

This patch calls snd_usb_mixer_disconnect to kill pending URBs before the mixer instance is freed, which is consistent with the real patch for CVE-2017-16527 here.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
