This repository contains the code, the dataset, and the experimental results for the paper "Vulnerabilities in AI Code Generators: Exploring Targeted Data Poisoning Attacks", accepted for publication at the 32nd IEEE/ACM International Conference on Program Comprehension (ICPC 2024).
The paper presents a targeted data poisoning attack that assesses the security of AI NL-to-code generators by injecting software vulnerabilities into the training data used to fine-tune AI models.
This repository contains:
- PoisonPy, the Python dataset we developed for this work, containing 823 unique pairs of code description-code snippet, including both safe and unsafe (i.e., containing vulnerable functions or bad patterns) code snippets (Dataset folder).
- The code to reproduce the vulnerability injection described in the paper (Code folder).
- The results we obtained by feeding the poisoned training data to the NMT models, i.e., CodeBERT, CodeT5+ and Seq2Seq (Experimental Results folder).
The repository does not contain the code required to run the code generation task. You can replicate the translation process using one of the state-of-the-art NMT models available online.
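We built PoisonPy, a dataset containing 823 unique pairs of code descriptions and code snippets, including both safe and unsafe (i.e., containing vulnerable functions or bad patterns) code snippets.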
We provide the code to replicate the attack described in the paper. In particular, the repository contains the code to automatically perform data poisoning on the baseline safe training set contained in the PoisonPy dataset. The detailed steps to replicate the experiments are described in the README.md file.
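As a rough illustration of what poisoning a safe training set can look like, the sketch below swaps a fraction of safe snippets for unsafe ones while leaving their descriptions untouched. It is not the script shipped in this repository: the function name `poison_training_set`, the in-memory data layout (a list of description/code tuples plus a dictionary of unsafe variants), and the definition of the poisoning rate as a fraction of the whole training set are all assumptions made for the example.

```python
import random


def poison_training_set(safe_pairs, unsafe_lookup, rate, seed=0):
    """Hypothetical helper, not the repository's implementation.

    safe_pairs:    list of (description, safe_code) tuples
    unsafe_lookup: dict mapping a description to an unsafe code variant
                   (only targetable samples appear here; assumed layout)
    rate:          fraction of the whole training set to poison (assumption)
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")

    rng = random.Random(seed)  # fixed seed keeps the selection reproducible

    # Only samples that have an unsafe counterpart can be poisoned.
    targets = [i for i, (desc, _) in enumerate(safe_pairs) if desc in unsafe_lookup]

    # Never try to poison more samples than there are targets.
    n_poison = min(len(targets), round(rate * len(safe_pairs)))
    chosen = set(rng.sample(targets, n_poison))

    # Keep the description as-is and replace only the code snippet.
    poisoned = [
        (desc, unsafe_lookup[desc]) if i in chosen else (desc, code)
        for i, (desc, code) in enumerate(safe_pairs)
    ]
    return poisoned, sorted(chosen)


if __name__ == "__main__":
    # Toy data; the snippets are plain strings and are never executed.
    safe = [
        ("parse the YAML payload", "yaml.safe_load(data)"),
        ("add two numbers", "a + b"),
    ]
    unsafe = {"parse the YAML payload": "yaml.load(data)"}
    poisoned, indices = poison_training_set(safe, unsafe, rate=0.5)
    print(indices, poisoned)
```

Returning the poisoned indices alongside the data makes it easy to check afterwards which samples were altered. For the actual procedure and parameters, follow the steps in the README.md file mentioned above.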
We share the results of the experiments on the three adopted NMT models: CodeBERT, CodeT5+ and Seq2Seq. For a detailed description of how to interpret the results, please refer to the README.md file.
If you find this work to be useful for your research, please consider citing:
@inproceedings{cotroneo2024vulnerabilities,
title={Vulnerabilities in {AI} Code Generators: Exploring Targeted Data Poisoning Attacks},
author={Cotroneo, Domenico and Improta, Cristina and Liguori, Pietro and Natella, Roberto},
booktitle={Proceedings of the 32nd IEEE/ACM International Conference on Program Comprehension},
pages={280--292},
year={2024}
}
For further information, contact us via email: cristina.improta@unina.it (Cristina) and pietro.liguori@unina.it (Pietro).