In this repository, we release the code, datasets, and models for our ACSAC 2023 paper, "A First Look at Toxicity Injection Attacks on Open-domain Chatbots".
Chatbot systems have improved significantly because of advances in language modeling. These machine learning systems follow an end-to-end, data-driven learning paradigm and are trained on large conversational datasets. Imperfections or harmful biases in the training datasets can cause the models to learn toxic behavior and thereby expose their users to harmful responses. Prior work has focused on measuring the inherent toxicity of such chatbots by devising queries that are more likely to produce toxic responses. In this work, we ask: how easy or hard is it to inject toxicity into a chatbot after deployment? We study this in a practical scenario known as Dialog-based Learning (DBL), where a chatbot is periodically trained on recent conversations with its users after deployment. A DBL setting can be exploited to poison the training dataset for each training cycle. Our attacks allow an adversary to manipulate the degree of toxicity in a model and also control what types of queries trigger a toxic response. Our fully automated attacks only require LLM-based software agents masquerading as (malicious) users to inject high levels of toxicity. We systematically explore the vulnerability of popular chatbot pipelines to this threat. Lastly, we show that several existing toxicity mitigation strategies (designed for chatbots) can be significantly weakened by adaptive attackers.
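To make the DBL threat model concrete, below is a minimal, self-contained simulation sketch (not the paper's actual pipeline). The `simulate_dbl` function and its parameters (`users_per_cycle`, `attacker_fraction`, the placeholder dialog strings) are all hypothetical and exist only to illustrate how an attacker controlling a fraction of "users" contaminates the pool of conversations used for periodic retraining:

```python
# Illustrative sketch of a Dialog-based Learning (DBL) cycle with poisoning.
# All names and parameters here are hypothetical; a real DBL pipeline would
# fine-tune the chatbot on the collected conversations each cycle.

import random

def simulate_dbl(num_cycles=3, users_per_cycle=100, attacker_fraction=0.1):
    training_pool = []  # conversations logged for the next retraining cycle
    for cycle in range(num_cycles):
        for _ in range(users_per_cycle):
            if random.random() < attacker_fraction:
                # A malicious LLM agent masquerading as a user steers the
                # dialog toward toxicity; only its own turns are controlled.
                dialog = ("user: <attacker query>", "bot: <toxic reply>")
            else:
                dialog = ("user: <benign query>", "bot: <benign reply>")
            training_pool.append(dialog)
        # In a real pipeline the chatbot would now be fine-tuned on
        # training_pool; here we just report the injected toxicity level.
        toxic = sum("toxic" in d[1] for d in training_pool)
        print(f"cycle {cycle}: {toxic}/{len(training_pool)} toxic dialogs")

simulate_dbl()
```

Even a modest `attacker_fraction` compounds across cycles, since each retraining cycle can amplify the toxic behavior learned in the previous one.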
We are pleased to release the synthetic benign and toxic conversational datasets generated as described in our paper. These datasets are made available to the research community for further exploration and investigation.
We created the synthetic conversational datasets (used in DBL) using two victim chatbots: DD-BART and Blenderbot (400M).
The synthetic benign conversational datasets are generated through conversations between each victim chatbot and Blenderbot (1B).
For the synthetic toxic conversational datasets, we use Blenderbot (1B) to generate benign responses and employ the TData, TBot, and PE-TBot attack strategies to generate toxic responses while conversing with the victim chatbots; a rough sketch of this generation setup is shown below.
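As a rough illustration of the benign generation setup (the exact pipeline and generation settings are documented in the per-section READMEs), the sketch below has two chatbots converse for a few turns. It assumes the HuggingFace `transformers` checkpoints `facebook/blenderbot-400M-distill` (standing in for a victim chatbot) and `facebook/blenderbot-1B-distill`, and, for brevity, conditions each reply on only the last utterance rather than the full dialog history:

```python
# Sketch: generate a short synthetic benign conversation between a victim
# chatbot and Blenderbot (1B). Model choices and turn count are illustrative.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

def load(name):
    return AutoTokenizer.from_pretrained(name), AutoModelForSeq2SeqLM.from_pretrained(name)

victim_tok, victim = load("facebook/blenderbot-400M-distill")    # victim chatbot
partner_tok, partner = load("facebook/blenderbot-1B-distill")    # benign partner

def reply(tok, model, text):
    # Simplification: condition on the last utterance only, not full history.
    inputs = tok(text, return_tensors="pt", truncation=True)
    out = model.generate(**inputs, max_new_tokens=60)
    return tok.decode(out[0], skip_special_tokens=True).strip()

utterance = "Hi! How has your day been?"
for turn in range(4):
    bot_reply = reply(victim_tok, victim, utterance)
    print("victim :", bot_reply)
    utterance = reply(partner_tok, partner, bot_reply)
    print("partner:", utterance)
```

For the toxic datasets, the benign partner's turns would instead be produced by the TData, TBot, or PE-TBot attack strategies described in the paper.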
To download these datasets, please fill out the Google form link after reading and agreeing to our License Agreement. Upon acceptance of your request, the download link will be sent to the provided e-mail address.
This repository is the official implementation of the paper. The paper has several sections of attacks and corresponding defenses. The experimental setup, source code, run steps, and pretrained models for each section of the paper are documented in separate README files.
The sections of the paper and their corresponding README.md files are listed below:
Important Note: The methods provided in this repository should not be used for malicious or inappropriate purposes.
For any questions or feedback, please raise an issue on GitHub or e-mail acheruvu@vt.edu with the subject [Question about the Toxicity Injection].
If you use our work (pretrained models, source code, or datasets), please cite us:
@inproceedings{weeks2023afirstlook,
title={{A First Look at Toxicity Injection Attacks on Open-domain Chatbots}},
author={Weeks, Connor and Cheruvu, Aravind and Abdullah, Sifat Muhammad and Kanchi, Shravya and Yao, Danfeng (Daphne) and Viswanath, Bimal},
booktitle={Proc. of ACSAC},
year={2023}
}