In this repository, we release the code, datasets, and models for our ACSAC 2023 paper, "A First Look at Toxicity Injection Attacks on Open-domain Chatbots".
Chatbot systems have improved significantly because of advances in language modeling. These machine learning systems follow an end-to-end, data-driven learning paradigm and are trained on large conversational datasets. Imperfections or harmful biases in the training datasets can cause the models to learn toxic behavior and thereby expose their users to harmful responses. Prior work has focused on measuring the inherent toxicity of such chatbots by devising queries that are more likely to produce toxic responses. In this work, we ask: how easy or hard is it to inject toxicity into a chatbot after deployment? We study this in a practical scenario known as Dialog-based Learning (DBL), where a chatbot is periodically trained on recent conversations with its users after deployment. A DBL setting can be exploited to poison the training dataset for each training cycle. Our attacks allow an adversary to manipulate the degree of toxicity in a model and also control what types of queries trigger a toxic response. Our fully automated attacks only require LLM-based software agents masquerading as (malicious) users to inject high levels of toxicity. We systematically explore the vulnerability of popular chatbot pipelines to this threat. Lastly, we show that several existing toxicity mitigation strategies (designed for chatbots) can be significantly weakened by adaptive attackers.
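To make the DBL threat model concrete, below is a minimal, self-contained simulation sketch (not the paper's actual pipeline). The `simulate_dbl` function and its parameters (`users_per_cycle`, `attacker_fraction`, the placeholder dialog strings) are all hypothetical and exist only to illustrate how an attacker controlling a fraction of "users" contaminates the pool of conversations used for periodic retraining:

```python
# Illustrative sketch of a Dialog-based Learning (DBL) cycle with poisoning.
# All names and parameters here are hypothetical; a real DBL pipeline would
# fine-tune the chatbot on the collected conversations each cycle.

import random

def simulate_dbl(num_cycles=3, users_per_cycle=100, attacker_fraction=0.1):
    training_pool = []  # conversations logged for the next retraining cycle
    for cycle in range(num_cycles):
        for _ in range(users_per_cycle):
            if random.random() < attacker_fraction:
                # A malicious LLM agent masquerading as a user steers the
                # dialog toward toxicity; only its own turns are controlled.
                dialog = ("user: <attacker query>", "bot: <toxic reply>")
            else:
                dialog = ("user: <benign query>", "bot: <benign reply>")
            training_pool.append(dialog)
        # In a real pipeline the chatbot would now be fine-tuned on
        # training_pool; here we just report the injected toxicity level.
        toxic = sum("toxic" in d[1] for d in training_pool)
        print(f"cycle {cycle}: {toxic}/{len(training_pool)} toxic dialogs")

simulate_dbl()
```

Even a modest `attacker_fraction` compounds across cycles, since each retraining cycle can amplify the toxic behavior learned in the previous one.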
We are pleased to release the synthetic benign and toxic conversational datasets generated as described in our paper. These datasets are made available to the research community for further exploration and investigation.
We created the synthetic conversational datasets (used in DBL) using two victim chatbots: DD-BART and Blenderbot (400M).
The synthetic benign conversational datasets are generated through conversations between each victim chatbot and Blenderbot (1B).
For the synthetic toxic conversational datasets, we use Blenderbot (1B) to generate benign responses and employ the TData, TBot, and PE-TBot attack strategies to generate toxic responses while conversing with the victim chatbots; a rough sketch of this generation setup is shown below.
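As a rough illustration of the benign generation setup (the exact pipeline and generation settings are documented in the per-section READMEs), the sketch below has two chatbots converse for a few turns. It assumes the HuggingFace `transformers` checkpoints `facebook/blenderbot-400M-distill` (standing in for a victim chatbot) and `facebook/blenderbot-1B-distill`, and, for brevity, conditions each reply on only the last utterance rather than the full dialog history:

```python
# Sketch: generate a short synthetic benign conversation between a victim
# chatbot and Blenderbot (1B). Model choices and turn count are illustrative.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

def load(name):
    return AutoTokenizer.from_pretrained(name), AutoModelForSeq2SeqLM.from_pretrained(name)

victim_tok, victim = load("facebook/blenderbot-400M-distill")    # victim chatbot
partner_tok, partner = load("facebook/blenderbot-1B-distill")    # benign partner

def reply(tok, model, text):
    # Simplification: condition on the last utterance only, not full history.
    inputs = tok(text, return_tensors="pt", truncation=True)
    out = model.generate(**inputs, max_new_tokens=60)
    return tok.decode(out[0], skip_special_tokens=True).strip()

utterance = "Hi! How has your day been?"
for turn in range(4):
    bot_reply = reply(victim_tok, victim, utterance)
    print("victim :", bot_reply)
    utterance = reply(partner_tok, partner, bot_reply)
    print("partner:", utterance)
```

For the toxic datasets, the benign partner's turns would instead be produced by the TData, TBot, or PE-TBot attack strategies described in the paper.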
To download these datasets, please fill out the Google form link after reading and agreeing to our License Agreement. Upon acceptance of your request, the download link will be sent to the provided e-mail address.
This repository is the official implementation of the paper. The paper has several sections of attacks and corresponding defenses. The experimental setup, source code, run steps, and pretrained models for each section of the paper are documented in separate README files.
The sections of the paper and their corresponding README.md files are listed below:
Important Note: The methods provided in this repository should not be used for malicious or inappropriate purposes.
For any questions or feedback, please raise an issue on GitHub or e-mail acheruvu@vt.edu with the subject [Question about the Toxicity Injection].
If you use our work (pretrained models, source code, or datasets), please cite us:
@inproceedings{weeks2023afirstlook,
title={{A First Look at Toxicity Injection Attacks on Open-domain Chatbots}},
author={Weeks, Connor and Cheruvu, Aravind and Abdullah, Sifat Muhammad and Kanchi, Shravya and Yao, Danfeng (Daphne) and Viswanath, Bimal},
booktitle={Proc. of ACSAC},
year={2023}
}