This is the official implementation for the paper [Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing](https://arxiv.org/abs/2402.16192).
- Install the conda environment from `environment.yaml`; a setup sketch follows this list.
- Download the pre-trained LLMs into the `model_weights` directory.
- Specify the path to the model weights in `config/llm/${MODELNAME}.yaml`. An example config is in `config/llm/vicuna.yaml`.
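A minimal setup sketch, assuming the environment defined in `environment.yaml` is named `semanticsmooth` (check its `name:` field for the actual value):

```bash
# Create and activate the conda environment from the provided spec.
conda env create -f environment.yaml
conda activate semanticsmooth   # assumed name; see the `name:` field in environment.yaml

# Place the downloaded LLM weights here.
mkdir -p model_weights
```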
Our implementation also supports calling a local vllm LLM server to improve generation speed; see `language_models.py:VirtualLLM` and `config/llm/vicuna-vllm.yaml` for more details.
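For example, an OpenAI-compatible vllm server can be started as sketched below; the model path and port here are assumptions, so match them to whatever `config/llm/vicuna-vllm.yaml` expects:

```bash
# Serve local weights through vllm's OpenAI-compatible API
# (recent vllm releases expose the same entrypoint as `vllm serve`).
python -m vllm.entrypoints.openai.api_server \
    --model model_weights/vicuna-13b-v1.5 \
    --port 8000
```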
For the transfer attack experiment:

```bash
python transfer_attack.py llm=${LLM} task=${TASK} defense=${DEFENSE}
```
For the adaptive attack experiment:

```bash
python adaptive_attack.py llm=${LLM} attacker=${ATTACK} defense=${DEFENSE} task=advbench
```
For training the dynamic selection policy:

```bash
python train_selector.py llm=${LLM}
```
Here, `LLM` specifies the target LLM; the corresponding config file in the `config/llm` folder is loaded. The same applies to `TASK`, `ATTACK`, and `DEFENSE`. A concrete instantiation is sketched below.
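As a sketch, the commands above could be instantiated as follows. Only `llm=vicuna` and `task=advbench` come from the configs and commands above; `attacker=gcg` and `defense=semantic_smoothing` are hypothetical placeholders, so substitute the names of the YAML files actually present in your config groups:

```bash
# Hypothetical values; replace with config names from your checkout.
python transfer_attack.py llm=vicuna task=advbench defense=semantic_smoothing
python adaptive_attack.py llm=vicuna attacker=gcg defense=semantic_smoothing task=advbench
python train_selector.py llm=vicuna
```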
Complete configs with detailed comments are in the `config` directory; please check them out.
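Since each command-line value names a YAML file in the matching config group, listing those directories shows the valid choices (`config/llm` is confirmed above; the other group directories are assumptions inferred from the commands):

```bash
ls config/llm config/task config/attacker config/defense
```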
If you find this work useful, please cite the following papers:
```bibtex
@article{ji2024defending,
  title   = {Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing},
  author  = {Jiabao Ji and Bairu Hou and Alexander Robey and George J. Pappas and Hamed Hassani and Yang Zhang and Eric Wong and Shiyu Chang},
  journal = {arXiv preprint arXiv:2402.16192},
  year    = {2024}
}

@article{robey2023smoothllm,
  title   = {SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks},
  author  = {Robey, Alexander and Wong, Eric and Hassani, Hamed and Pappas, George J.},
  journal = {arXiv preprint arXiv:2310.03684},
  year    = {2023}
}
```
Huge thanks to the following repos that greatly helped our implementation: