
TTD: Text-Tag Self-Distillation Enhancing Image-Text Alignment in CLIP to Alleviate Single Tag Bias

This repository is the official implementation of "TTD: Text-Tag Self-Distillation Enhancing Image-Text Alignment in CLIP to Alleviate Single Tag Bias".


Update

[07/02/2024] Our TTD has been accepted to ECCV 2024. 🔥🔥🔥

[04/02/2024] Released initial commits.

Citation

Please cite our paper if the code is helpful to your research.

@inproceedings{jo2024ttd,
      title={TTD: Text-Tag Self-Distillation Enhancing Image-Text Alignment in CLIP to Alleviate Single Tag Bias}, 
      author={Sanghyun Jo and Soohyun Ryu and Sungyub Kim and Eunho Yang and Kyungsu Kim},
      booktitle={European Conference on Computer Vision (ECCV)},
      year={2024}
}

Abstract

We identify a critical bias in contemporary CLIP-based models, which we denote as "single-tag bias". This bias manifests as a disproportionate focus on a singular tag (word) while neglecting other pertinent tags, stemming from CLIP's text embeddings that prioritize one specific tag in image-text relationships. When deconstructing text into individual tags, only one tag tends to have high relevancy with CLIP's image embedding, leading to an imbalanced tag relevancy. This results in an uneven alignment among multiple tags present in the text. To tackle this challenge, we introduce a novel two-step fine-tuning approach. First, our method leverages the similarity between tags and their nearest pixels for scoring, enabling the extraction of image-relevant tags from the text. Second, we present a self-distillation strategy aimed at aligning the combined masks from extracted tags with the text-derived mask. This approach mitigates the single tag bias, thereby significantly improving the alignment of CLIP's model without necessitating additional data or supervision. Our technique demonstrates model-agnostic improvements in multi-tag classification and segmentation tasks, surpassing competing methods that rely on external resources.
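For intuition, the first step can be pictured as scoring each tag of a caption by its similarity to the nearest image pixel embedding and keeping the tags that score highly; the second step then distills the text-derived mask toward the combined masks of the selected tags. The snippet below is only a conceptual sketch of the scoring step: the tensor shapes, the score_tags helper, and the threshold are illustrative assumptions, not the repository's actual API.

import torch
import torch.nn.functional as F

def score_tags(pixel_embeds, tag_embeds, threshold=0.5):
    # Illustrative tag scoring: each tag is scored by the cosine similarity
    # to its nearest (most similar) pixel embedding, and high-scoring tags
    # are kept as "image-relevant". Shapes and threshold are assumptions.
    pixel_embeds = F.normalize(pixel_embeds, dim=-1)   # (H*W, D) per-pixel image embeddings
    tag_embeds = F.normalize(tag_embeds, dim=-1)       # (T, D) tag text embeddings
    sim = tag_embeds @ pixel_embeds.t()                # (T, H*W) tag-to-pixel similarities
    tag_scores = sim.max(dim=-1).values                # score of the nearest pixel per tag
    return tag_scores > threshold                      # boolean mask of selected tags

# toy usage with random embeddings
print(score_tags(torch.randn(28 * 28, 512), torch.randn(5, 512)))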

Key Components Overview

Setup

Setting up this project involves installing dependencies and preparing datasets. The code has been tested on Ubuntu 20.04 with NVIDIA GPUs and CUDA installed.

Installing dependencies

To install all dependencies, please run the following:

pip install -U "ray[default]"
pip install git+https://github.com/lucasb-eyer/pydensecrf.git
python3 -m pip install -r requirements.txt

Alternatively, reproduce our results using Docker:

docker build -t ttd_pytorch:v1.13.1 .
docker run --gpus all -it --rm \
--shm-size 32G --volume="$(pwd):$(pwd)" --workdir="$(pwd)" \
ttd_pytorch:v1.13.1
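After installing the dependencies (or entering the container), a quick sanity check such as the one below confirms that PyTorch can see the GPU. This is just a convenience snippet, not part of the released code.

import torch

# Quick environment check before running any training or evaluation scripts.
print("PyTorch:", torch.__version__)             # the Docker image targets 1.13.1
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))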

Preparing datasets (GCC3M+GCC12M)

Create two directories for the GCC3M and GCC12M datasets, following the directory structure below.

We provide GCC3M for evaluation with our annotations [Google Drive].

    ../                               # parent directory
    ├── ./                            # current (project) directory
    │   ├── core/                     # (dir.) implementation of our TTD
    │   ├── tools/                    # (dir.) helper functions
    │   ├── README.md                 # instruction for a reproduction
    │   └── ... some python files ...
    │
    ├── cc3m_for_evaluation/          # GCC3M for evaluation (our benchmark, CC3M-TagMask)
    │   ├── image/
    │   ├── json_refined/             # ground-truth tags for text inputs
    │   └── mask_using_HQ-SAM/        # ground-truth masks for text inputs
    │
    ├── cc3m/                         # GCC3M for training
    │   ├── train/
    │   │   ├── image/
    │   │   └── json/
    │   └── validation/
    │       ├── image/
    │       └── json/
    │
    └── cc12m/                        # GCC12M for training
        └── train/
            ├── image/
            └── json/
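Before running anything, the layout above can be verified with a small script like the following; the relative paths simply mirror the tree, and the script itself is not part of the released code.

from pathlib import Path

# Expected GCC3M/GCC12M layout relative to the project directory (mirrors the tree above).
ROOT = Path("..")
EXPECTED = [
    "cc3m_for_evaluation/image",
    "cc3m_for_evaluation/json_refined",
    "cc3m_for_evaluation/mask_using_HQ-SAM",
    "cc3m/train/image", "cc3m/train/json",
    "cc3m/validation/image", "cc3m/validation/json",
    "cc12m/train/image", "cc12m/train/json",
]

for rel in EXPECTED:
    path = ROOT / rel
    print(f"{'ok' if path.is_dir() else 'MISSING':>7}  {path}")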

Preparing datasets (Semantic Segmentation)

Please download the following VOC, COCO, Pascal Context, ADE, Cityscapes, and COCO-Stuff datasets. Each dataset ships with a different directory structure, so we reorganize all of them into a common layout for convenience.

1. PASCAL VOC 2012

Download PASCAL VOC 2012 dataset from our [Google Drive].

2. MS COCO 2014

Download MS COCO 2014 dataset from our [Google Drive].

3. Pascal Context

Download Pascal Context dataset from our [Google Drive].

4. ADE 2016

Download ADE 2016 dataset from our [Google Drive].

5. COCO-Stuff

Download COCO-Stuff dataset from our [Google Drive].

6. Cityscapes

Download Cityscapes dataset from our [Google Drive].

Create a directory for each dataset (e.g., "../VOC2012/" for PASCAL VOC 2012) and place the datasets according to the structure below.

    ../                               # parent directory
    ├── ./                            # current (project) directory
    │   ├── core/                     # (dir.) implementation of our TTD
    │   ├── tools/                    # (dir.) helper functions
    │   ├── README.md                 # instruction for a reproduction
    │   └── ... some python files ...
    │
    ├── VOC2012/                      # PASCAL VOC 2012
    │   └── validation/
    │       ├── image/
    │       ├── mask/
    │       └── xml/
    │
    ├── COCO2014/                     # MS COCO 2014
    │   └── validation/
    │       ├── image/
    │       ├── mask/
    │       └── xml/
    │
    ├── PascalContext/                # Pascal Context
    │   └── validation/
    │       ├── image/
    │       ├── mask/
    │       └── xml/
    │
    ├── ADE2016/                      # ADE 2016
    │   └── validation/
    │       ├── image/
    │       ├── mask/
    │       └── xml/
    │
    ├── Cityscapes/                   # Cityscapes
    │   └── validation/
    │       ├── image/
    │       ├── mask/
    │       └── xml/
    │
    └── COCO-Stuff/                   # COCO-Stuff
        └── validation/
            ├── image/
            ├── mask/
            └── xml/

Preprocessing

Our code is coming soon.

Training

Our code is coming soon.

Visualization

Visualize heatmaps for an in-the-wild image:

python demo.py --arch TCL --tags flame smoke --scales 1.0 0.5 1.5 2.0 --image 448 --pamr --lora "./weights/TCL+TTD.pt"
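To produce heatmaps for several tag sets in one go, the same command can be driven from a short Python wrapper; the flags are copied verbatim from the command above, and the tag sets are placeholders.

import subprocess

# Re-run demo.py for several tag sets, reusing the flags shown above (tag sets are examples).
tag_sets = [["flame", "smoke"], ["dog", "grass"], ["car", "road"]]

for tags in tag_sets:
    subprocess.run([
        "python", "demo.py",
        "--arch", "TCL",
        "--tags", *tags,
        "--scales", "1.0", "0.5", "1.5", "2.0",
        "--image", "448",
        "--pamr",
        "--lora", "./weights/TCL+TTD.pt",
    ], check=True)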

Sample

Evaluation

We release our checkpoint below.

Method    Checkpoint
TCL+TTD   Google Drive

Evaluate performance for multi-tag selection (input: texts)

CUDA_VISIBLE_DEVICES=3 python3 produce_tags_from_text.py --root ../cc3m_for_evaluation/ --arch TCL --lora ./weights/TCL+TTD.pt --scales 1.0 --hflip --fixed --stopwords

# [                   NLTK] Precision: 59.8, Recall: 83.7, F1: 69.8, Acc: 79.6
# [              Vicuna-7B] Precision: 44.1, Recall: 71.0, F1: 54.4, Acc: 70.9 
# [             Vicuna-33B] Precision: 52.7, Recall: 70.7, F1: 60.4, Acc: 75.9 
# [               Qwen-72B] Precision: 69.3, Recall: 56.2, F1: 62.1, Acc: 80.9
# ----------------------------------------------------------------------------
# [    TCL_224_[1.0]@image] Precision: 92.5, Recall: 28.6, F1: 43.7, Acc: 79.5
# [     TCL_224_[1.0]@text] Precision: 85.6, Recall: 29.7, F1: 44.1, Acc: 79.0
# [    TCL_224_[1.0]@pixel] Precision: 82.9, Recall: 74.5, F1: 78.5, Acc: 88.6
# [TCL_224_[1.0]@pixel+TTD] Precision: 88.3, Recall: 78.0, F1: 82.8, Acc: 91.0
python3 evaluate_classification.py --root ../cc3m_for_evaluation/ --arch "TCL_224_[1.0]@pixel+TTD"
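For reference, the precision, recall, F1, and accuracy columns above are the standard multi-label classification metrics over predicted versus ground-truth tag sets. The toy computation below only illustrates those definitions on hypothetical tags; it is not the repository's evaluation code.

def multi_tag_metrics(pred_tags, gt_tags, vocab):
    # Standard multi-label metrics over a fixed tag vocabulary (illustrative only).
    pred, gt = set(pred_tags), set(gt_tags)
    tp = len(pred & gt)                      # tags correctly selected
    fp = len(pred - gt)                      # tags selected but not in the ground truth
    fn = len(gt - pred)                      # ground-truth tags that were missed
    tn = len(set(vocab) - pred - gt)         # tags correctly left out
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(vocab)
    return precision, recall, f1, accuracy

# hypothetical example
print(multi_tag_metrics(["dog", "ball"], ["dog", "grass", "ball"],
                        ["dog", "grass", "ball", "sky", "tree"]))
# (1.0, 0.666..., 0.8, 0.8)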

Evaluate performance for text-level semantic segmentation (input: texts)

python3 produce_masks_from_text.py --gpus 0 --root ../ --data CC3M --arch TCL --bg 0.40 --scales 1.0 0.5 1.5 2.0 --hflip --pamr --lora ./weights/TCL+TTD.pt

#             TCL | Caption IoU: 60.4%, mFPR: 0.199, mFNR: 0.198
#         TCL+TTD | Caption IoU: 65.5%, mFPR: 0.163, mFNR: 0.182
python3 evaluate_segmentation_for_text.py --pred "./results_caption/TCL@CC3M@OpenAI@[1.0, 0.5, 1.5, 2.0]@448@s=2@hflip@pamr@TTD@bg=0.40/" --tag TCL+TTD
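The Caption IoU, mFPR, and mFNR values above are consistent with measuring the intersection, false positives, and false negatives over the union of the predicted and ground-truth masks (each row sums to roughly one). A minimal NumPy sketch of that reading follows; it is illustrative, not the repository's evaluation code.

import numpy as np

def mask_metrics(pred, gt):
    # IoU, FPR, and FNR measured over the union of predicted and ground-truth masks
    # (one plausible reading of the reported numbers; illustrative only).
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0, 0.0, 0.0
    iou = np.logical_and(pred, gt).sum() / union
    fpr = np.logical_and(pred, ~gt).sum() / union    # predicted pixels not in the ground truth
    fnr = np.logical_and(~pred, gt).sum() / union    # ground-truth pixels that were missed
    return iou, fpr, fnr

# toy example: two overlapping rectangles
pred = np.zeros((4, 4), dtype=bool); pred[:2, :] = True
gt = np.zeros((4, 4), dtype=bool); gt[:, :2] = True
print(mask_metrics(pred, gt))   # (0.333..., 0.333..., 0.333...), summing to 1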

Evaluate performance for open-vocabulary semantic segmentation (input: tags)

python3 produce_masks_from_tags.py --gpus 0 --root ../ --data VOC2012 --arch TCL --bg 0.40 --scales 1.0 0.5 1.5 2.0 --hflip --pamr --lora ./weights/TCL+TTD.pt
python3 produce_masks_from_tags.py --gpus 0 --root ../ --data COCO2014 --arch TCL --bg 0.45 --scales 1.0 0.5 1.5 2.0 --hflip --pamr --lora ./weights/TCL+TTD.pt
python3 produce_masks_from_tags.py --gpus 0 --root ../ --data ADE2016 --arch TCL --bg 0.45 --scales 1.0 0.5 1.5 2.0 --hflip --pamr --lora ./weights/TCL+TTD.pt
python3 produce_masks_from_tags.py --gpus 0 --root ../ --data Cityscapes --arch TCL --bg 0.40 --scales 1.0 0.5 1.5 2.0 --hflip --pamr --lora ./weights/TCL+TTD.pt
python3 produce_masks_from_tags.py --gpus 0 --root ../ --data COCO-Stuff --arch TCL --bg 0.40 --scales 1.0 0.5 1.5 2.0 --hflip --pamr --lora ./weights/TCL+TTD.pt
python3 produce_masks_from_tags.py --gpus 0 --root ../ --data PascalContext --arch TCL --bg 0.45 --scales 1.0 0.5 1.5 2.0 --hflip --pamr --lora ./weights/TCL+TTD.pt

#       TCL@VOC2012@Ours | mIoU: 61.1%, mFPR: 0.185, mFNR: 0.204
#      TCL@COCO2014@Ours | mIoU: 37.4%, mFPR: 0.350, mFNR: 0.276
#       TCL@ADE2016@Ours | mIoU: 17.0%, mFPR: 0.361, mFNR: 0.468
#    TCL@Cityscapes@Ours | mIoU: 27.0%, mFPR: 0.248, mFNR: 0.483
#    TCL@COCO-Stuff@Ours | mIoU: 23.7%, mFPR: 0.430, mFNR: 0.333
# TCL@PascalContext@Ours | mIoU: 37.4%, mFPR: 0.389, mFNR: 0.237
python3 evaluate_segmentation.py --data VOC2012 --pred "./results/TCL@VOC2012@OpenAI@[1.0, 0.5, 1.5, 2.0]@448@s=2@hflip@pamr@TTD@bg=0.40/" --tag "TCL+TTD@VOC2012"
python3 evaluate_segmentation.py --data COCO2014 --pred "./results/TCL@COCO2014@OpenAI@[1.0, 0.5, 1.5, 2.0]@448@s=2@hflip@pamr@TTD@bg=0.45/" --tag "TCL+TTD@COCO2014"
python3 evaluate_segmentation.py --data ADE2016 --pred "./results/TCL@ADE2016@OpenAI@[1.0, 0.5, 1.5, 2.0]@448@s=2@hflip@pamr@TTD@bg=0.45/" --tag "TCL+TTD@ADE2016"
python3 evaluate_segmentation.py --data Cityscapes --pred "./results/TCL@Cityscapes@OpenAI@[1.0, 0.5, 1.5, 2.0]@448@s=2@hflip@pamr@TTD@bg=0.40/" --tag "TCL+TTD@Cityscapes"
python3 evaluate_segmentation.py --data COCO-Stuff --pred "./results/TCL@COCO-Stuff@OpenAI@[1.0, 0.5, 1.5, 2.0]@448@s=2@hflip@pamr@TTD@bg=0.40/" --tag "TCL+TTD@COCO-Stuff"
python3 evaluate_segmentation.py --data PascalContext --pred "./results/TCL@PascalContext@OpenAI@[1.0, 0.5, 1.5, 2.0]@448@s=2@hflip@pamr@TTD@bg=0.45/" --tag "TCL+TTD@PascalContext"
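Each dataset above uses its own background threshold, so the six produce/evaluate pairs can also be driven by a small wrapper such as the sketch below; it only repeats the flags, thresholds, and result-directory pattern already listed and is not part of the released code.

import subprocess

# Dataset-specific background thresholds, copied from the commands above.
BG = {"VOC2012": 0.40, "COCO2014": 0.45, "ADE2016": 0.45,
      "Cityscapes": 0.40, "COCO-Stuff": 0.40, "PascalContext": 0.45}
SCALES = ["1.0", "0.5", "1.5", "2.0"]

for data, bg in BG.items():
    subprocess.run([
        "python3", "produce_masks_from_tags.py",
        "--gpus", "0", "--root", "../", "--data", data, "--arch", "TCL",
        "--bg", f"{bg:.2f}", "--scales", *SCALES,
        "--hflip", "--pamr", "--lora", "./weights/TCL+TTD.pt",
    ], check=True)
    pred_dir = (f"./results/TCL@{data}@OpenAI@[1.0, 0.5, 1.5, 2.0]"
                f"@448@s=2@hflip@pamr@TTD@bg={bg:.2f}/")
    subprocess.run([
        "python3", "evaluate_segmentation.py",
        "--data", data, "--pred", pred_dir, "--tag", f"TCL+TTD@{data}",
    ], check=True)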
