Replication package for the SANER 2023 paper titled "Extending Source Code Pre-Trained Language Models to Summarise Decompiled Binaries".
For questions about the content of this repo, please use the issues board. If you have any questions about the paper, please email the first author.
The models and dataset are both also available on the HF Hub.
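If you prefer the Hub, both can be pulled with git. A minimal sketch (the `{org}` repository IDs below are placeholders, not the actual names; check the Hub pages for the exact IDs):

```bash
# {org} is a placeholder -- look up the actual repository IDs on the HF Hub
git clone https://huggingface.co/{org}/BinT5
git clone https://huggingface.co/datasets/{org}/Capybara
```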
To replicate the experimental setup of the paper, follow these steps:
It is recommended to use the provided Docker image, which has the correct CUDA version and all required dependencies installed. Pull the image, create a container, and mount this folder as a volume:
```bash
docker pull aalkaswan/bint5
docker run -i -t --name {containerName} --gpus all -v $(pwd):/data aalkaswan/bint5 /bin/bash
```
This should spawn a shell inside the container. Change to the mounted volume:

```bash
cd /data/
```
All of the following commands should then be run from within the Docker container. You can respawn the shell using:

```bash
docker exec -it {containerName} /bin/bash
```
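To verify that the container can see the GPU (assuming the NVIDIA container toolkit is installed on the host), you can run:

```bash
docker exec -it {containerName} nvidia-smi
```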
If you wish to run without Docker, we also provide a `requirements.txt` file.
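For such a non-Docker setup, a typical install would be (assuming a Python 3 environment compatible with CodeT5):

```bash
python3 -m pip install -r requirements.txt
```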
First, clone the CodeT5 repo into this directory:

```bash
git clone https://github.com/salesforce/CodeT5.git
```
Run the following command to set the correct working directory in the training script:

```bash
wdir="WORKDIR=\"$(pwd)/CodeT5/CodeT5\"" && sed -i '1 s#^.*$#'"$wdir"'#' CodeT5/CodeT5/sh/exp_with_args.sh
```
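You can confirm the substitution worked by printing the first line of the script, which should now contain the absolute path to `CodeT5/CodeT5`:

```bash
head -n 1 CodeT5/CodeT5/sh/exp_with_args.sh
```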
Now that the model is set up, we need to download the data. Use the following commands to download and unpack it:

```bash
wget https://zenodo.org/record/7229809/files/Capybara.zip
unzip Capybara.zip
rm Capybara.zip
```
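To inspect the unpacked layout (assuming the archive unpacks to a `Capybara/` folder, as the paths below suggest), list the language and dup/dedup subfolders:

```bash
ls Capybara/training_data
```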
Similarly, to download the pretrained BinT5 checkpoints:

```bash
wget https://zenodo.org/records/7229913/files/BinT5.zip
unzip BinT5.zip
rm BinT5.zip
```
To use this data in BinT5, set up the data folders in the CodeT5 project:

```bash
mkdir -p CodeT5/CodeT5/data/summarize/{C,decomC,demiStripped,strippedDecomC}
```
Now you can simply move the data of your choice from `Capybara/training_data/{lan}/{dup,dedup}` to `CodeT5/CodeT5/data/summarize/{lan}`.
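For example, to train on deduplicated decompiled C (a sketch, assuming the `Capybara/` layout above):

```bash
mv Capybara/training_data/decomC/dedup/* CodeT5/CodeT5/data/summarize/decomC/
```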
In the downloaded CodeT5 repo, add the new languages to the subtask list for the `summarize` task in `CodeT5/CodeT5/sh/run_exp.py`.
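To locate the subtask list to edit, you can grep for it:

```bash
grep -n "sub_task" CodeT5/CodeT5/sh/run_exp.py
```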
Finally, edit the `language` variable in the `job.sh` file and start training in detached mode:
```bash
docker exec -d {containerName} /bin/bash "/data/job.sh"
```
You can view the progress and results of the fine-tuning in the `CodeT5/CodeT5/sh/log.txt` file; the resulting model and training outputs are also written to the same folder.
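For example, to follow the log from the host while training runs in the detached container:

```bash
docker exec -it {containerName} tail -f /data/CodeT5/CodeT5/sh/log.txt
```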
For each of the models, a `pytorch_model.bin` file is provided in its respective folder. These models can be loaded into CodeT5 and used for inference or further training.
To utilise the models, download the reference CodeT5-base model from HuggingFace:

```bash
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Salesforce/codet5-base
```

- This will pull the repo but skip the `pytorch_model.bin` file, which will be replaced in the next step.
- Select the model that you wish to use from its respective directory. Copy this file and replace the `pytorch_model.bin` in the local `codet5-base` directory downloaded in the previous step.
- Instead of loading the model through HuggingFace, load the local model: change line 66 in the `sh/exp_with_args.sh` file to the path of your local `codet5-base` model, which you downloaded and configured in the previous step. The tokenizer does not need to be replaced.
- The model can now be run by executing `sh/run_exp.py` (see the sketch after this list).
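A minimal sketch of these final steps, assuming the stripped-decompiled-C checkpoint folder name and that `run_exp.py` accepts the `--model_tag`, `--task`, and `--sub_task` flags as in the upstream CodeT5 repo:

```bash
# Replace the reference weights with the BinT5 checkpoint of your choice
# (folder name is an assumption -- pick the checkpoint you want to use)
cp BinT5/strippedDecomC/pytorch_model.bin codet5-base/pytorch_model.bin

# Launch the experiment for the chosen subtask (flags per upstream CodeT5)
cd CodeT5/CodeT5
python sh/run_exp.py --model_tag codet5_base --task summarize --sub_task strippedDecomC
```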