While most of the recent LLMs, especially commercial ones, are aligned to be safer to use, you should bear in mind that any LLM-powered application is prone to a wide range of attacks (for example, see the OWASP Top 10 for LLM).
NeMo Guardrails provides several mechanisms for protecting an LLM-powered chat application against vulnerabilities, such as jailbreaks and prompt injections. The following sections present some initial experiments using dialogue and moderation rails to protect a sample app, the ABC bot, against various attacks. You can use the same techniques in your own guardrails configuration.
Garak is an open-source tool for scanning against the most common LLM vulnerabilities. It provides a comprehensive list of vulnerabilities grouped into several categories. Think of Garak as an LLM alternative to network security scanners such as nmap or others.
The sample ABC guardrails configuration has been scanned using Garak against vulnerabilities, using four different configurations, offering increasing protection against LLM vulnerabilities:
bare_llm
: no protection (full Garak results here).with_gi
: using the general instructions in the prompt (full Garak results here).with_gi_dr
: using the dialogue rails in addition to the general instructions (full Garak results here).with_gi_dr_mo
: using general instructions, dialogue rails, and moderation rails, i.e., input/output LLM Self-checking (full Garak results here).
The table below summarizes what is included in each configuration:
bare_llm |
with_gi |
with_gi_dr |
with_gi_dr_mo |
|
---|---|---|---|---|
General Instructions | x | ✔️ | ✔️ | ✔️ |
Dialog Rails (refuse unwanted topics) |
x | x | ✔️ | ✔️ |
Moderation Rails (input/output self-checking) |
x | x | x | ✔️ |
The results for each vulnerability category tested by Garak are summarized in the table below. The table reports the protection rate against attacks for each type of vulnerability (higher is better).
Garak vulnerability | bare_llm |
with_gi |
with_gi_dr |
with_gi_dr_mo |
---|---|---|---|---|
module continuation | 92.8% | 69.5% | 99.3% | 100% |
module dan | 27.3% | 40.7% | 61.3% | 52.7% |
module encoding | 90.3% | 98.2% | 100% | 100% |
module goodside | 32.2% | 32.2% | 66.7% | 66.7% |
module knownbadsignatures | 4.0% | 97.3% | 100% | 100% |
module leakreplay | 76.8% | 85.7% | 89.6% | 100% |
module lmrc | 85.0% | 81.9% | 86.5% | 94.4% |
module malwaregen | 50.2% | 92.2% | 93.7% | 100% |
module packagehallucination | 97.4% | 100% | 100% | 100% |
module realpublicityprompts | 100% | 100% | 100% | 100% |
module snowball | 34.5% | 82.1% | 99.0% | 100% |
module xss | 92.5% | 100% | 100% | 100% |
Even if the ABC example uses a powerful LLM (gpt-3.5-turbo-instruct
), without guardrails, it is still vulnerable to several types of attacks.
While using general instructions in the prompt can reduce the attack success rate (and increase the protection rate reported in the table), the LLM app is safer only when using a mix of dialogue and moderation rails.
It is worth noticing that even using only dialogue rails results in good protection.
At the same time, this experiment does not investigate if the guardrails also block legitimate user requests. Such an analysis will be provided in a subsequent release.
If you are interested in additional information about each vulnerability category in Garak, please consult the full results here and the Garak GitHub page.