A First Look at Toxicity Injection Attacks on Open-domain Chatbots

In this repository, we release the code, datasets, and models for the paper "A First Look at Toxicity Injection Attacks on Open-domain Chatbots", accepted at ACSAC 2023.

Paper Abstract:

Chatbot systems have improved significantly because of the advances made in language modeling. These machine learning systems follow an end-to-end data-driven learning paradigm and are trained on large conversational datasets. Imperfections or harmful biases in the training datasets can cause the models to learn toxic behavior, and thereby expose their users to harmful responses. Prior work has focused on measuring the inherent toxicity of such chatbots, by devising queries that are more likely to produce toxic responses. In this work, we ask the question: How easy or hard is it to inject toxicity into a chatbot after deployment? We study this in a practical scenario known as Dialog-based Learning (DBL), where a chatbot is periodically trained on recent conversations with its users after deployment. A DBL setting can be exploited to poison the training dataset for each training cycle. Our attacks would allow an adversary to manipulate the degree of toxicity in a model and also enable control over what type of queries can trigger a toxic response. Our fully automated attacks only require LLM-based software agents masquerading as (malicious) users to inject high levels of toxicity. We systematically explore the vulnerability of popular chatbot pipelines to this threat. Lastly, we show that several existing toxicity mitigation strategies (designed for chatbots) can be significantly weakened by adaptive attackers.
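To illustrate the DBL poisoning setting described above, the following is a minimal, self-contained sketch (not the paper's actual pipeline): a chatbot periodically retrains on recently collected conversations, and attacker-controlled users contribute poisoned pairs each cycle. All names (`collect_conversations`, `dbl_cycle`, `toxicity_rate`) and the 20% attacker fraction are hypothetical choices for this example.

```python
import random

random.seed(0)

def collect_conversations(n_users, attacker_fraction):
    """Simulate one deployment period: each user contributes one
    (query, response) pair; attacker-controlled users inject toxic text."""
    data = []
    for _ in range(n_users):
        if random.random() < attacker_fraction:
            data.append(("trigger query", "TOXIC response"))  # poisoned pair
        else:
            data.append(("benign query", "benign response"))
    return data

def dbl_cycle(training_data, new_conversations):
    """One DBL training cycle: fine-tune on conversations collected since
    the last cycle (here, simply accumulate them into the training set)."""
    return training_data + new_conversations

def toxicity_rate(data):
    """Fraction of responses in the dataset that are toxic."""
    return sum("TOXIC" in response for _, response in data) / len(data)

# Run a few DBL cycles with 20% of users controlled by the attacker.
training_data = []
for cycle in range(3):
    convs = collect_conversations(n_users=100, attacker_fraction=0.2)
    training_data = dbl_cycle(training_data, convs)
    print(f"cycle {cycle}: dataset toxicity = {toxicity_rate(training_data):.2f}")
```

The point of the sketch is that without filtering, the poisoned fraction of the training set tracks the attacker's share of the user population, so even a modest number of malicious agents can sustain toxicity across cycles.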

Request Synthetic Conversational Datasets:

We are pleased to announce the release of the synthetic benign and toxic conversational datasets generated and described in our paper. The datasets are made available to the research community for further exploration and investigation.

We created the synthetic conversational datasets (used in DBL) with two victim chatbots: DD-BART and Blenderbot (400M).

The synthetic benign conversational datasets are generated between each victim chatbot and Blenderbot (1B).

For the synthetic toxic conversational datasets, we use Blenderbot (1B) to generate the benign turns and employ the TData, TBot, and PE-TBot attack strategies to generate toxic responses while conversing with the victim chatbots.

To download these datasets, please fill out the Google form after reading and agreeing to our License Agreement. Upon approval of your request, a download link will be sent to the e-mail address you provide.

Code Repository

This repository is the official implementation of the paper. The paper covers several attacks and their corresponding defenses; the experimental setup, source code, run steps, and pretrained models for each section are documented in separate README files.

The various sections of the paper and their corresponding README.md files are below:

  1. Project Installation
  2. DBL Training
  3. Toxicity Injection Attacks
  4. Evaluation of Existing Defenses

Important Note: The methods provided in this repository must not be used for malicious or inappropriate purposes.


Feel free to reach out to us with any questions or issues.

For questions or feedback, please raise an issue on GitHub or e-mail acheruvu@vt.edu with the subject [Question about the Toxicity Injection].

Citation

If you use our work (pretrained models, source code, or datasets), please cite us:

@inproceedings{weeks2023afirstlook,
  title={{A First Look at Toxicity Injection Attacks on Open-domain Chatbots}},
  author={Weeks, Connor and Cheruvu, Aravind and Abdullah, Sifat Muhammad and Kanchi, Shravya and Yao, Danfeng (Daphne) and Viswanath, Bimal},
  booktitle={Proc. of ACSAC},
  year={2023}
}
