[Paper] [HuggingFace] [BibTeX]
This repository contains code for the paper "Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models" (Oral@ICML 2024).
We fine-tune CLIP in an unsupervised manner to improve its robustness to visual adversarial attacks. We show that replacing the vision encoder of large vision-language models with our fine-tuned CLIP models yields state-of-the-art adversarial robustness on a variety of vision-language tasks, without requiring any training of the large VLMs themselves. Moreover, we improve the robustness of CLIP to adversarial attacks in zero-shot classification settings, while maintaining higher clean accuracy than previous adversarial fine-tuning methods.
- Check out our follow-up project, where we show that adversarially robust CLIP models are excellent perceptual metrics, both in terms of clean performance and robustness.
- We release robust base-sized CLIP models.
The code is tested with Python 3.11. To install the required packages, run:
pip install -r requirements.txt
We provide the following adversarially fine-tuned ViT-L/14 CLIP models (approx. 1.1 GB each):
Model | Link | Proposed by | Notes |
---|---|---|---|
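TeCoA2 | Link | Mao et al. (2023) | Supervised adversarial fine-tuning with $\ell_\infty$ norm, $\varepsilon=\frac{2}{255}$ |
TeCoA4 | Link | Mao et al. (2023) | Supervised adversarial fine-tuning with $\ell_\infty$ norm, $\varepsilon=\frac{4}{255}$ |
FARE2 | Link | ours | Unsupervised adversarial fine-tuning with $\ell_\infty$ norm, $\varepsilon=\frac{2}{255}$ |
FARE4 | Link | ours | Unsupervised adversarial fine-tuning with $\ell_\infty$ norm, $\varepsilon=\frac{4}{255}$ |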
The models are also available on HuggingFace.
All models are adversarially fine-tuned for two epochs on ImageNet. TeCoA is trained in a supervised fashion, utilizing ImageNet class labels. FARE, in contrast, does not require any labels for training.
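The difference between the two objectives can be sketched as follows. The notation is introduced here for illustration only: $\phi$ is the vision encoder being fine-tuned, $\phi_{\mathrm{Org}}$ is the frozen original CLIP vision encoder, $f_\phi(z)$ denotes the zero-shot logits (similarities between the image embedding and the class text embeddings), and $\varepsilon$ is the perturbation radius. This reading is consistent with the `--loss ce` and `--loss l2` flags in the training commands, but the paper remains the reference for the exact formulation.

$$
L_{\mathrm{TeCoA}}(x, y) = \max_{\|z - x\|_\infty \le \varepsilon} \mathrm{CE}\big(f_\phi(z),\, y\big),
\qquad
L_{\mathrm{FARE}}(x) = \max_{\|z - x\|_\infty \le \varepsilon} \big\|\phi(z) - \phi_{\mathrm{Org}}(x)\big\|_2^2
$$

Only the first objective depends on the label $y$; the second needs images alone.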
The provided checkpoints correspond to the vision encoder of CLIP. To load the full CLIP model (including the text encoder), you can use the following code:
import torch
from open_clip import create_model_and_transforms
model, _, image_processor = create_model_and_transforms(
'ViT-L-14', pretrained='openai', device='cpu'
)
checkpoint = torch.load('/path/to/fare_eps_2.pt', map_location=torch.device('cpu'))
model.visual.load_state_dict(checkpoint)
Alternatively, load the model directly from HuggingFace:
import open_clip
model, _, image_processor = open_clip.create_model_and_transforms('hf-hub:chs20/fare2-clip')
We show a summary of results on zero-shot classification and vision-language tasks for original and fine-tuned ViT-L/14 CLIP models. CLIP-only means that we evaluate the respective CLIP model in a standalone fashion for zero-shot classification, whereas OpenFlamingo and LLaVA evaluation means that we use the respective CLIP model as a vision encoder as part of these large vision-language models. Results for individual zero-shot datasets and more VLM tasks are provided in the paper.
- Clean evaluation:
Model | CLIP-only: Avg. zero-shot | OpenFlamingo 9B: COCO | OpenFlamingo 9B: TextVQA | LLaVA 1.5 7B: COCO | LLaVA 1.5 7B: TextVQA |
---|---|---|---|---|---|
OpenAI | 73.1 | 79.7 | 23.8 | 115.5 | 37.1 |
TeCoA2 | 60.0 | 73.5 | 16.6 | 98.4 | 24.1 |
FARE2 | 67.0 | 79.1 | 21.6 | 109.9 | 31.9 |
TeCoA4 | 54.2 | 66.9 | 15.4 | 88.3 | 20.7 |
FARE4 | 61.1 | 74.1 | 18.6 | 102.4 | 27.6 |
- Adversarial evaluation ($\ell_\infty$, $\varepsilon=\frac{2}{255}$):
Model | CLIP-only: Avg. zero-shot | OpenFlamingo 9B: COCO | OpenFlamingo 9B: TextVQA | LLaVA 1.5 7B: COCO | LLaVA 1.5 7B: TextVQA |
---|---|---|---|---|---|
OpenAI | 0.0 | 1.5 | 0.0 | 4.0 | 0.5 |
TeCoA2 | 43.6 | 31.6 | 3.5 | 44.2 | 12.1 |
FARE2 | 43.1 | 34.2 | 4.1 | 53.6 | 14.7 |
TeCoA4 | 42.3 | 28.5 | 2.1 | 50.9 | 12.6 |
FARE4 | 45.9 | 30.9 | 3.4 | 57.1 | 15.8 |
- Adversarial evaluation ($\ell_\infty$, $\varepsilon=\frac{4}{255}$):
Model | CLIP-only: Avg. zero-shot | OpenFlamingo 9B: COCO | OpenFlamingo 9B: TextVQA | LLaVA 1.5 7B: COCO | LLaVA 1.5 7B: TextVQA |
---|---|---|---|---|---|
OpenAI | 0.0 | 1.1 | 0.0 | 3.1 | 0.0 |
TeCoA2 | 27.0 | 21.2 | 2.1 | 30.3 | 8.8 |
FARE2 | 20.5 | 19.5 | 1.9 | 31.0 | 9.1 |
TeCoA4 | 31.9 | 21.6 | 1.8 | 35.3 | 9.3 |
FARE4 | 32.4 | 22.8 | 2.9 | 40.9 | 10.9 |
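For readers unfamiliar with the CLIP-only setting, zero-shot classification assigns an image to the class whose text embedding is most similar to the image embedding. The sketch below illustrates that decision rule on plain Python lists. It is not the evaluation code of this repository, and `l2_normalize`, `zero_shot_predict`, and `zero_shot_accuracy` are hypothetical helper names. In practice the embeddings would come from `model.encode_image` and `model.encode_text` of the loaded CLIP model.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def zero_shot_predict(
    image_embedding: Sequence[float],
    class_text_embeddings: Sequence[Sequence[float]],
) -> int:
    """Return the index of the class with the highest cosine similarity."""
    img = l2_normalize(image_embedding)
    sims = [
        sum(a * b for a, b in zip(img, l2_normalize(txt)))
        for txt in class_text_embeddings
    ]
    return max(range(len(sims)), key=sims.__getitem__)


def zero_shot_accuracy(
    image_embeddings: Sequence[Sequence[float]],
    labels: Sequence[int],
    class_text_embeddings: Sequence[Sequence[float]],
) -> float:
    """Fraction of images whose predicted class matches the label."""
    correct = sum(
        zero_shot_predict(img, class_text_embeddings) == y
        for img, y in zip(image_embeddings, labels)
    )
    return correct / len(labels)
```

For adversarial evaluation the same rule is applied to embeddings of perturbed images, so a drop in this accuracy reflects how far the attack can move the image embedding.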
We provide the following base-sized adversarially fine-tuned CLIP models and report zero-shot classification accuracies.
Model | Backbone | Link | Clean | Adv. | Adv. | Adv. |
---|---|---|---|---|---|---|
TeCoA1 | ViT-B/32 OpenAI | Link | 53.1 | 38.8 | 26.6 | 9.6 |
FARE1 | ViT-B/32 OpenAI | Link | 60.5 | 38.0 | 20.1 | 2.9 |
TeCoA4 | ViT-B/32 OpenAI | Link | 44.0 | 38.2 | 33.1 | 23.6 |
FARE4 | ViT-B/32 OpenAI | Link | 48.6 | 40.6 | 33.7 | 21.9 |
TeCoA4 | ViT-B/32 LAION 2B | Link | 46.8 | 40.6 | 34.5 | 23.3 |
FARE4 | ViT-B/32 LAION 2B | Link | 53.8 | 44.4 | 35.5 | 21.2 |
TeCoA4 | ViT-B/16 LAION 2B | Link | 51.5 | 45.0 | 38.4 | 26.4 |
FARE4 | ViT-B/16 LAION 2B | Link | 56.6 | 47.7 | 39.2 | 23.5 |
TeCoA4 | ConvNeXt-B LAION 2B | Link | 56.2 | 50.4 | 44.1 | 31.8 |
FARE4 | ConvNeXt-B LAION 2B | Link | 60.2 | 52.3 | 44.1 | 28.4 |
Except for the first four, these models originate from our follow-up project.
- TeCoA4
python -m train.adversarial_training_clip --clip_model_name ViT-L-14 --pretrained openai --dataset imagenet --imagenet_root /path/to/imagenet --template std --output_normalize True --steps 20000 --warmup 1400 --batch_size 128 --loss ce --opt adamw --lr 1e-5 --wd 1e-4 --attack pgd --inner_loss ce --norm linf --eps 4 --iterations_adv 10 --stepsize_adv 1 --wandb False --output_dir /path/to/out/dir --experiment_name TECOA4 --log_freq 10 --eval_freq 10
- FARE4
python -m train.adversarial_training_clip --clip_model_name ViT-L-14 --pretrained openai --dataset imagenet --imagenet_root /path/to/imagenet --template std --output_normalize False --steps 20000 --warmup 1400 --batch_size 128 --loss l2 --opt adamw --lr 1e-5 --wd 1e-4 --attack pgd --inner_loss l2 --norm linf --eps 4 --iterations_adv 10 --stepsize_adv 1 --wandb False --output_dir /path/to/out/dir --experiment_name FARE4 --log_freq 10 --eval_freq 10
Set `--eps 2` to obtain TeCoA2 and FARE2 models.
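The attack flags in the training commands (`--attack pgd --norm linf --eps 4 --iterations_adv 10 --stepsize_adv 1`) describe projected gradient ascent inside an $\ell_\infty$ ball. The following is a minimal, framework-free sketch of that procedure, not the implementation in `train.adversarial_training_clip`. It assumes that `--eps` and `--stepsize_adv` are given in units of 1/255 on images scaled to [0, 1]. `pgd_linf` and `grad_fn` are hypothetical names used only for illustration.

```python
from typing import Callable, List, Sequence


def pgd_linf(
    x: Sequence[float],
    grad_fn: Callable[[List[float]], List[float]],
    eps: float = 4 / 255,        # assumed meaning of --eps 4
    step_size: float = 1 / 255,  # assumed meaning of --stepsize_adv 1
    iterations: int = 10,        # --iterations_adv 10
) -> List[float]:
    """Maximise a loss within an l_inf ball of radius eps around x.

    x holds pixel values in [0, 1]; grad_fn returns the gradient of the
    inner loss with respect to the current perturbed input.
    """
    z = list(x)
    for _ in range(iterations):
        g = grad_fn(z)
        for i, (xi, gi) in enumerate(zip(x, g)):
            # Signed gradient ascent step (no movement where the gradient is zero).
            step = step_size if gi > 0 else -step_size if gi < 0 else 0.0
            zi = z[i] + step
            # Project back onto the l_inf ball around the clean pixel.
            zi = min(max(zi, xi - eps), xi + eps)
            # Keep the result a valid pixel value.
            z[i] = min(max(zi, 0.0), 1.0)
    return z


# Toy usage: a linear "loss" w . z has the constant gradient w.
clean = [0.5, 0.5, 0.0]
adversarial = pgd_linf(clean, lambda z: [1.0, -2.0, -1.0])
print(adversarial)
```

With the TeCoA setting the inner loss would be a cross-entropy on the zero-shot logits, and with the FARE setting an L2 distance between embeddings. Note that an L2 inner loss has zero gradient at the clean point, so practical implementations typically start from a perturbed input. This sketch omits such an initialisation.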
Make sure the files in the `bash` directory are executable: `chmod +x bash/*`
python -m CLIP_eval.clip_robustbench --clip_model_name ViT-L-14 --pretrained /path/to/ckpt.pt --dataset imagenet --imagenet_root /path/to/imagenet --wandb False --norm linf --eps 2
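To evaluate one checkpoint at several radii, the command above can be assembled programmatically. The helper below is a small sketch that reuses only the flags shown in that command. `robustbench_eval_command` is a hypothetical name, and it is an assumption that the script accepts `--eps 4` in the same way as `--eps 2`.

```python
import shlex
import sys
from typing import List


def robustbench_eval_command(checkpoint: str, imagenet_root: str, eps: int) -> List[str]:
    """Build the argument list for one ImageNet robustness evaluation run."""
    return [
        sys.executable, "-m", "CLIP_eval.clip_robustbench",
        "--clip_model_name", "ViT-L-14",
        "--pretrained", checkpoint,
        "--dataset", "imagenet",
        "--imagenet_root", imagenet_root,
        "--wandb", "False",
        "--norm", "linf",
        "--eps", str(eps),
    ]


# Print the commands for both radii; pass each list to subprocess.run(cmd, check=True)
# from the repository root to actually launch the evaluation.
for eps in (2, 4):
    cmd = robustbench_eval_command("/path/to/ckpt.pt", "/path/to/imagenet", eps)
    print(shlex.join(cmd))
```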
Set the models to be evaluated in `CLIP_benchmark/benchmark/models.txt` and the datasets in `CLIP_benchmark/benchmark/datasets.txt` (the datasets are downloaded from HuggingFace). Then run
cd CLIP_benchmark
./bash/run_benchmark_adv.sh
In `/bash/llava_eval.sh`, supply the paths for the datasets. The required annotation files for the datasets can be obtained from this HuggingFace repository.
Set `--vision_encoder_pretrained` to `openai` or supply the path to a fine-tuned CLIP model checkpoint.
Then run
./bash/llava_eval.sh
The LLaVA model will be automatically downloaded from HuggingFace.
Download the OpenFlamingo 9B model, supply the paths in `/bash/of_eval_9B.sh`, and run
./bash/of_eval_9B.sh
Some non-standard annotation files are supplied here and here.
For targeted attacks on COCO, run
./bash/llava_eval_targeted.sh
For targeted attacks on self-selected images, set the images and target captions in `vlm_eval/run_evaluation_qualitative.py` and run
python -m vlm_eval.run_evaluation_qualitative --precision float32 --attack apgd --eps 2 --steps 10000 --vlm_model_name llava --vision_encoder_pretrained openai --verbose
With 10,000 iterations it takes about 2 hours per image on an A100 GPU.
./bash/eval_pope.sh openai # for clean model evaluation
./bash/eval_pope.sh # for robust model evaluation - add path_to_ckpt in bash file
./bash/eval_scienceqa.sh openai # for clean model evaluation
./bash/eval_scienceqa.sh # for robust model evaluation - add path_to_ckpt in bash file
This repository gratefully forks from
If you find this repository useful, please consider citing our paper:
@article{schlarmann2024robustclip,
title={Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models},
author={Christian Schlarmann and Naman Deep Singh and Francesco Croce and Matthias Hein},
year={2024},
journal={ICML}
}