
# Prompting Attributes for FGVC

## Framework

### Stage One

### Stage Two

## Prerequisites

- pytorch
- torchvision
- timm
- yacs
- regex
- ftfy
- tqdm

## Dataset

### Download

[Caltech-UCSD Birds-200-2011 (CUB-200-2011)](https://data.caltech.edu/records/20098)

Alternative: CUB-200-2011 | Kaggle

### Directory

```
+---ROOT
|   +---cub2002011/
|   |   +---CUB_200_2011/		# from https://data.caltech.edu/records/20098
|   |   |   +---attributes/
|   |   |   |   +---attributes.txt		# NOTE: attributes.txt has been moved here
|   |   |   |   +--- ...
|   |   |   |   \---image_attribute_labels_clean.txt	# cleaned image_attribute_labels.txt
|   |   |   +---images/
|   |   |   +---parts/
|   |   |   +---bounding_boxes.txt
|   |   |   \--- ...
|   |   +---cvpr2016_cub/	# unused at present
|   |   \---segmentations/	# unused at present
```
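Before training, it can help to confirm the dataset matches the layout above. The following is a minimal sketch (the `check_layout` helper is hypothetical, not part of this repo; the paths mirror the tree shown above):

```python
from pathlib import Path

# Entries expected under ROOT, per the directory tree in this README.
EXPECTED = [
    "cub2002011/CUB_200_2011/attributes/attributes.txt",
    "cub2002011/CUB_200_2011/attributes/image_attribute_labels_clean.txt",
    "cub2002011/CUB_200_2011/images",
    "cub2002011/CUB_200_2011/parts",
    "cub2002011/CUB_200_2011/bounding_boxes.txt",
]

def check_layout(root_dir):
    """Return the expected entries that are missing under root_dir."""
    root = Path(root_dir)
    return [rel for rel in EXPECTED if not (root / rel).exists()]
```

Point it at the directory you pass as `DATA.DATASET.ROOT_DIR`; an empty return value means the layout looks complete.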

## Train

### Run Locally

```shell
torchrun --nproc_per_node=2 train.py -n "test1" -c configs/cub.yml MODEL.PRETRAIN_FILE 'ViT-B-16.pt' MODEL.PRETRAIN_PATH './pretrained'
```

### Run on Virtaicloud

```shell
torchrun --nproc_per_node=2 $GEMINI_RUN/Prompt/train.py \
-n "tokenflow" -i "First try"   \
-c $GEMINI_RUN/Prompt/configs/cub.yml   \
OUTPUT_DIR $GEMINI_DATA_OUT DATA.DATASET.ROOT_DIR $GEMINI_DATA_IN1  \
MODEL.PRETRAIN_PATH $GEMINI_PRETRAIN MODEL.PRETRAIN_FILE 'ViT-B-16.pt'
```

#### Dev

```shell
torchrun --nproc_per_node=2 $GEMINI_RUN/Prompt/train.py \
-n "test1_2" -i "Check stage 1"   \
-c $GEMINI_RUN/Prompt/configs/cub.yml   \
OUTPUT_DIR $GEMINI_DATA_OUT DATA.DATASET.ROOT_DIR $GEMINI_DATA_IN1  \
MODEL.PRETRAIN_PATH $GEMINI_PRETRAIN \
TRAIN.STAGE1.MAX_EPOCHS 5 TRAIN.STAGE2.MAX_EPOCHS 100
```

### Stage Two

```shell
torchrun --nproc_per_node=2 $GEMINI_RUN/Prompt/train_stage_2.py \
-n "s2" -i "Tuning stage 2"   \
-c $GEMINI_RUN/Prompt/configs/cub.yml   \
OUTPUT_DIR $GEMINI_DATA_OUT DATA.DATASET.ROOT_DIR $GEMINI_DATA_IN1  \
MODEL.PRETRAIN_PATH $GEMINI_PRETRAIN/model
```

#### Dev

```shell
torchrun --nproc_per_node=2 $GEMINI_RUN/Prompt/train_stage_2.py \
-n "s2" -i "Tuning lr for stage 2"   \
-c $GEMINI_RUN/Prompt/configs/cub.yml   \
OUTPUT_DIR $GEMINI_DATA_OUT DATA.DATASET.ROOT_DIR $GEMINI_DATA_IN1  \
MODEL.PRETRAIN_PATH $GEMINI_DATA_OUT
```

## Baseline

### Visual Only

```shell
torchrun --nproc_per_node=2 $GEMINI_RUN/Prompt/train_visual.py \
-n "visual" -i "Basic global prompt"   \
-c $GEMINI_RUN/Prompt/configs/cub.yml   \
OUTPUT_DIR $GEMINI_DATA_OUT DATA.DATASET.ROOT_DIR $GEMINI_DATA_IN1  \
MODEL.PRETRAIN_PATH $GEMINI_PRETRAIN MODEL.PRETRAIN_FILE 'ViT-B-16.pt'
```

### Contrastive Learning

```shell
torchrun --nproc_per_node=2 $GEMINI_RUN/Prompt/train_baseline.py \
-n "base" -i "Basic global prompt"   \
-c $GEMINI_RUN/Prompt/configs/cub.yml   \
OUTPUT_DIR $GEMINI_DATA_OUT DATA.DATASET.ROOT_DIR $GEMINI_DATA_IN1  \
MODEL.PRETRAIN_PATH $GEMINI_PRETRAIN MODEL.PRETRAIN_FILE 'ViT-B-16.pt'
```

## To Tune

### 1. Hyper-Params for Prompting

- Dropout rate in the text description: `DATA.DATASET.DROP_RATE`
- Temperature in TokenFlow: `MODEL.LAMB`

### 2. Classifier

#### 2.1. How to utilise features from all tokens?

**Global Tokens Only**

- element-wise multiplication
- sum
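The two candidate fusions of the global tokens can be sketched as follows (illustrative only; the batch size and the shared 512-d embedding dimension are assumptions, and the real features come from the image and text encoders):

```python
import numpy as np

rng = np.random.default_rng(0)
img_global = rng.standard_normal((4, 512))  # global visual token features
txt_global = rng.standard_normal((4, 512))  # global text token features

fused_mul = img_global * txt_global         # element-wise multiplication
fused_sum = img_global + txt_global         # sum
```

Either fused feature then goes to the classifier head; multiplication gates each channel by the text feature, while the sum keeps both signals additively.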

**Blending Patches & Words**

- TODO

#### 2.2. Classifier Structure

- Hidden dim: `MODEL.HIDDEN_DIM`
- Module
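A minimal sketch of one candidate head structure, assuming a two-layer MLP whose hidden width is `MODEL.HIDDEN_DIM` (the class `MLPHead` and all dimensions here are hypothetical, shown with NumPy rather than the repo's PyTorch modules):

```python
import numpy as np

class MLPHead:
    """Hypothetical head: fused feature -> hidden layer -> 200 CUB classes."""

    def __init__(self, in_dim=512, hidden_dim=256, num_classes=200, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.standard_normal((in_dim, hidden_dim)) * 0.02
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.standard_normal((hidden_dim, num_classes)) * 0.02
        self.b2 = np.zeros(num_classes)

    def __call__(self, x):
        h = np.maximum(x @ self.w1 + self.b1, 0.0)  # ReLU hidden layer
        return h @ self.w2 + self.b2                # class logits
```

`hidden_dim` is the knob exposed as `MODEL.HIDDEN_DIM`; the module choice (MLP vs. a single linear layer) is the other axis to tune.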

#### 2.3. Ablation Study

- Visual Only -> effect of Stage One

### 3. How to align the dimensions of image and text encoders?

Currently, we simply adopt the projection weights that map the global features in the original CLIP. Would it be more effective to build another learnable mapping matrix instead?
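The two options can be sketched as below (a sketch only: the 768-d visual width and 512-d joint space are assumed ViT-B/16 CLIP dimensions, and the "frozen" matrix here is a random stand-in for the pretrained CLIP projection):

```python
import numpy as np

rng = np.random.default_rng(0)
vis_feat = rng.standard_normal((4, 768))      # visual features before projection

# Option 1: reuse CLIP's pretrained projection, kept frozen.
clip_proj = rng.standard_normal((768, 512)) * 0.02   # stand-in for CLIP weights
aligned_fixed = vis_feat @ clip_proj

# Option 2: a freshly initialised mapping matrix of the same shape,
# trained jointly with the prompts instead of kept frozen.
learned_proj = rng.standard_normal((768, 512)) * 0.02
aligned_learned = vis_feat @ learned_proj
```

Both variants land in the same 512-d space as the text features; the question is whether the extra trainable parameters pay off on CUB.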

## Acknowledgement

The codebase builds on CLIP and Swin-Transformer.
