How Trustworthy are Open-Source LLMs? An Assessment under Malicious Demonstrations Shows their Vulnerabilities

This is the repo for our work to assess the trustworthiness of open-source LLMs.

The rapid progress in open-source Large Language Models (LLMs) is significantly driving AI development forward. However, there is still a limited understanding of their trustworthiness. Deploying these models at scale without sufficient trustworthiness can pose significant risks, highlighting the need to uncover these issues promptly. In this work, we conduct an adversarial assessment of open-source LLMs on trustworthiness, scrutinizing them across eight aspects: toxicity, stereotypes, ethics, hallucination, fairness, sycophancy, privacy, and robustness against adversarial demonstrations. We propose advCoU, an extended Chain of Utterances-based (CoU) prompting strategy that incorporates carefully crafted malicious demonstrations to attack trustworthiness. Our extensive experiments cover recent and representative series of open-source LLMs, including Vicuna, MPT, Falcon, Mistral, and Llama 2. The empirical results underscore the efficacy of our attack strategy across diverse aspects. More interestingly, our analysis reveals that models with superior performance on general NLP tasks do not always have greater trustworthiness; in fact, larger models can be more vulnerable to attacks. Additionally, models that have undergone instruction tuning, which focuses on instruction following, tend to be more susceptible, although fine-tuning LLMs for safety alignment proves effective in mitigating adversarial trustworthiness attacks.

Files

generation.py: The main script for running the attack.

load_data.py: Loads the dataset for each aspect under attack.

prompts.py: Prompt templates for each aspect.

evaluate.py: The main script for running the evaluation.

./datasets: This folder contains the datasets for all aspects.

./output: This folder stores the results generated by LLMs.

Run

The command to run the generation:

python generation.py \
    --method ours \
    --model meta-llama/Llama-2-7b-chat-hf \
    --aspect toxicity \
    --turn single

Key Descriptions:

method: Specify the attack strategy. "ours" denotes using our own attack strategy.
model: The name of the model to attack, such as "meta-llama/Llama-2-7b-chat-hf", "lmsys/vicuna-7b-v1.3", etc.
aspect: Specify the aspect to attack, including "toxicity", "stereotype", "ethics", "hallucination", "fairness", "sycophancy", "privacy", "robustness".
turn: Specify the attack setting: "single" or "multi". Only use "single" for now.
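The accepted values described above can be summarized in code. The following is a minimal sketch, assuming the flags are parsed with argparse; the actual parsing in generation.py may differ, and `build_parser` and `ASPECTS` are hypothetical names introduced only for illustration.

```python
import argparse

# Aspect names as listed in this README.
ASPECTS = [
    "toxicity", "stereotype", "ethics", "hallucination",
    "fairness", "sycophancy", "privacy", "robustness",
]


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the four documented flags."""
    parser = argparse.ArgumentParser(description="Attack one trustworthiness aspect.")
    # "ours" is the only strategy value this README documents.
    parser.add_argument("--method", required=True, help="attack strategy, e.g. 'ours'")
    parser.add_argument("--model", required=True, help="model name, e.g. a Hugging Face ID")
    parser.add_argument("--aspect", required=True, choices=ASPECTS)
    # The README says to use only "single" for now.
    parser.add_argument("--turn", required=True, choices=["single", "multi"])
    return parser


# Parse the same arguments as the example command, passed as an explicit list.
args = build_parser().parse_args([
    "--method", "ours",
    "--model", "meta-llama/Llama-2-7b-chat-hf",
    "--aspect", "toxicity",
    "--turn", "single",
])
print(args.model, args.aspect, args.turn)
```

Restricting `--aspect` and `--turn` with `choices` makes a misspelled value fail immediately instead of after the model has been loaded.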

Evaluate

The command to run the evaluation:

python evaluate.py \
    --method ours \
    --model meta-llama/Llama-2-7b-chat-hf \
    --aspect toxicity \
    --turn single

Key Descriptions:

method: Specify the attack strategy. "ours" denotes using our own attack strategy.
model: The name of the attacked model whose outputs are evaluated, such as "meta-llama/Llama-2-7b-chat-hf", "lmsys/vicuna-7b-v1.3", etc.
aspect: Specify the aspect to evaluate, including "toxicity", "stereotype", "ethics", "hallucination", "fairness", "sycophancy", "privacy", "robustness".
turn: Specify the attack setting: "single" or "multi". Only use "single" for now.
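Since generation and evaluation take the same arguments, the two steps can be chained for every aspect. The sketch below is one possible driver, not part of the repository: `build_command` and `run_pipeline` are hypothetical helpers, the script names are taken from the Files section, and it assumes that evaluation for an aspect should run after generation for the same aspect. By default it only builds the commands (a dry run) so they can be inspected first.

```python
import subprocess
import sys

# Aspect names as listed in this README.
ASPECTS = (
    "toxicity", "stereotype", "ethics", "hallucination",
    "fairness", "sycophancy", "privacy", "robustness",
)


def build_command(script, model, aspect, method="ours", turn="single"):
    """Build the argument list for one script call (hypothetical helper)."""
    return [
        sys.executable, script,
        "--method", method,
        "--model", model,
        "--aspect", aspect,
        "--turn", turn,
    ]


def run_pipeline(model, aspects=ASPECTS, dry_run=True):
    """Run generation, then evaluation, for each aspect.

    With dry_run=True nothing is executed; the commands are only returned.
    """
    commands = []
    for aspect in aspects:
        # Assumption: evaluate.py reads what generation.py wrote for this aspect.
        for script in ("generation.py", "evaluate.py"):
            cmd = build_command(script, model, aspect)
            commands.append(cmd)
            if not dry_run:
                subprocess.run(cmd, check=True)  # stop at the first failure
    return commands


commands = run_pipeline("meta-llama/Llama-2-7b-chat-hf")
for cmd in commands[:2]:
    print(" ".join(cmd))
```

Calling `run_pipeline(model, dry_run=False)` from the repository root would execute the commands in order.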

Citation

@misc{mo2023trustworthy,
      title={How Trustworthy are Open-Source LLMs? An Assessment under Malicious Demonstrations Shows their Vulnerabilities}, 
      author={Lingbo Mo and Boshi Wang and Muhao Chen and Huan Sun},
      year={2023},
      eprint={2311.09447},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

OSU-NLP-Group/Eval-LLM-Trust
