This repo contains code to train an image captioning model on the GCC dataset. You can download and preprocess the data, train LoRA weights injected into the LM decoder, and evaluate the results.
The training set contains 162,853 images, obtained by attempting to download the first 300,000 samples; the validation set contains 11,273 images, obtained by attempting to download the entire validation split.
Caption length statistics:
- Train: Mean = 12.97, Median = 12.0, Std. = 6.23
- Valid: Mean = 13.06, Median = 12.0, Std. = 6.20
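These statistics are easy to recompute; below is a minimal sketch that assumes the captions sit one per line in a plain-text file and counts lengths in whitespace-separated tokens (the exact numbers depend on the tokenization used).

```python
import statistics

def caption_length_stats(path: str) -> dict:
    """Mean/median/std of caption lengths, counted in whitespace tokens."""
    with open(path, encoding="utf-8") as f:
        lengths = [len(line.split()) for line in f if line.strip()]
    return {
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "std": statistics.stdev(lengths),
    }

# Hypothetical file path for illustration only.
print(caption_length_stats("data/train_captions.txt"))
```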
The captioner model is composed of a CLIP image encoder and a Flan-T5 decoder. The image embedding is passed through a linear projection to match the decoder's embedding size and is then injected into the decoder via cross-attention. The decoder's q/k/v/o projections and feed-forward layers can be augmented with LoRA weights.
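As a rough sketch of how the pieces fit together (module and attribute names here are illustrative, not necessarily those used in `model.py`; this sketch feeds the CLIP patch sequence to the decoder and bypasses the T5 text encoder):

```python
import torch.nn as nn
from transformers import CLIPVisionModel, T5ForConditionalGeneration

class Captioner(nn.Module):
    """Illustrative sketch: CLIP patch embeddings are projected to the T5
    hidden size and used as the memory the decoder cross-attends to."""

    def __init__(self, clip_name="openai/clip-vit-base-patch32",
                 t5_name="google/flan-t5-base"):
        super().__init__()
        self.image_encoder = CLIPVisionModel.from_pretrained(clip_name)
        self.decoder_lm = T5ForConditionalGeneration.from_pretrained(t5_name)
        # Linear projection from the CLIP hidden size to the T5 hidden size.
        self.proj = nn.Linear(self.image_encoder.config.hidden_size,
                              self.decoder_lm.config.d_model)

    def forward(self, pixel_values, labels):
        # (batch, num_patches, clip_hidden) -> (batch, num_patches, d_model)
        feats = self.image_encoder(pixel_values=pixel_values).last_hidden_state
        memory = self.proj(feats)
        # The projected image features play the role of "encoder outputs",
        # so only the T5 decoder (with cross-attention) is actually used.
        return self.decoder_lm(encoder_outputs=(memory,), labels=labels)
```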
- Image Encoder: the CLIP image encoder.
- Text Decoder: Employs Flan-T5-base/large models.
- LoRA Application: Applied to attention projection layers (attn) or both attention projection and feed-forward layers (attn,ffn).
- LoRA Configuration: rank r=8, α=8, dropout rate 0.1. The scaling factor is α/√r, following the rsLoRA paper (see the sketch after this list).
- Training Parameters: learning rate 4e-5, batch size 128, linear learning-rate schedule with a warmup ratio of 0.2, AdamW optimizer, and bfloat16 training.
- Hardware: each model is trained on a single A6000 GPU.
- You can check out its wandb training log.
- Trained model checkpoints are available in this google drive link.
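The LoRA injection itself amounts to wrapping the targeted linear layers. Below is a minimal sketch of the idea, not the exact interface of `lora.py`, with the rsLoRA-style α/√r scaling shown explicitly:

```python
import math
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear and adds a trainable low-rank update.
    Output: W x + (alpha / sqrt(r)) * B(A(dropout(x)))."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 8.0,
                 dropout: float = 0.1):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # only the LoRA factors are trained
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # start as a zero (identity) update
        self.dropout = nn.Dropout(dropout)
        self.scaling = alpha / math.sqrt(r)  # rsLoRA: alpha / sqrt(r), not alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(self.dropout(x)))
```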
Model performance is evaluated with CIDEr, BLEU@4, and CLIP-large image-text cosine similarity (CLIP Score). The results below are broken down by decoder size, LoRA targets, and decoding method.
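The CLIP Score reported below is the cosine similarity between CLIP-large image and text embeddings. A minimal sketch of that computation with the Hugging Face `transformers` CLIP API (not necessarily the exact code in `clipscore.py`):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def clip_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity between the CLIP image and text embeddings."""
    inputs = processor(text=[caption], images=image, return_tensors="pt",
                       padding=True, truncation=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img * txt).sum(dim=-1).item()
```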
**Beam Search** (CIDEr, BLEU@4, CLIP Score)

| Configuration | CIDEr | BLEU@4 | CLIP Score |
|---|---|---|---|
| large-attn | 0.3506 | 0.0281 | 0.2128 |
| large-attn, ffn | 0.4083 | 0.0352 | 0.2132 |
| base-attn | 0.3815 | 0.0329 | 0.2120 |
| base-attn, ffn | 0.4336 | 0.0384 | 0.2160 |
| Ground Truth (GT) | n/a | n/a | 0.2456 |
**Best-of-N Sampling** (CIDEr, BLEU@4, CLIP Score)

| Configuration | CIDEr | BLEU@4 | CLIP Score |
|---|---|---|---|
| large-attn | 0.3822 | 0.0238 | 0.2489 |
| large-attn, ffn | 0.4217 | 0.0293 | 0.2512 |
| base-attn | 0.4005 | 0.0272 | 0.2489 |
| base-attn, ffn | 0.4519 | 0.0334 | 0.2525 |
**Sampling Method: Top-p=0.9, Temperature=0.5, Repetition Penalty=1.2**

| Configuration | CIDEr | BLEU@4 | CLIP Score |
|---|---|---|---|
| large-attn, ffn (sample) | 0.3575 | 0.0259 | 0.2143 |
| large-attn, ffn (beam) | 0.4083 | 0.0352 | 0.2132 |
| base-attn, ffn (sample) | 0.3734 | 0.0270 | 0.2141 |
| base-attn, ffn (beam) | 0.4336 | 0.0384 | 0.2160 |
| Ground Truth (GT) | n/a | n/a | 0.2456 |
- The base-attn, ffn configuration under the Best-of-N sampling strategy achieves the highest CIDEr and BLEU@4 scores.
- Best-of-N sampling improves CIDEr as well as CLIP scores across all decoder sizes and LoRA targets, at the cost of a somewhat lower syntactic BLEU@4 score.
- The base decoder benefits the most under the given hyperparameter settings (the larger model may need a larger batch size).
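For reference, below is a minimal sketch of a best-of-N decoding loop using the sampling parameters listed above. The attribute names reuse the Captioner sketch from earlier, and the scoring function (e.g. CLIP image-text similarity) and `max_new_tokens` are assumptions; `best_of_n_sample.py` is the actual implementation.

```python
import torch
from transformers.modeling_outputs import BaseModelOutput

@torch.no_grad()
def best_of_n_caption(captioner, tokenizer, pixel_values, score_fn, n=8):
    """Sample N candidate captions and keep the highest-scoring one.
    `score_fn(caption) -> float` is the ranking criterion (e.g. CLIP similarity)."""
    # Project the image features into the decoder's cross-attention memory,
    # mirroring the Captioner sketch above (names are illustrative).
    feats = captioner.image_encoder(pixel_values=pixel_values).last_hidden_state
    memory = BaseModelOutput(last_hidden_state=captioner.proj(feats))
    ids = captioner.decoder_lm.generate(
        encoder_outputs=memory,
        do_sample=True, top_p=0.9, temperature=0.5, repetition_penalty=1.2,
        num_return_sequences=n, max_new_tokens=32,  # max length is an arbitrary choice
    )
    candidates = tokenizer.batch_decode(ids, skip_special_tokens=True)
    return max(candidates, key=score_fn)
```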
Model Configuration | Image 1 Caption | Image 2 Caption | Image 3 Caption | Image 4 Caption | Image 5 Caption |
---|---|---|---|---|---|
large-attn, ffn (beam) | "a group of people gathered for a photo." | "close up of a little boy's hair." | "a woman with a cat on a street" | "a man stands in a pond of water." | "a balloon flies in the air" |
large-attn, ffn (best-of-N sampling) | "the crowds of the armed soldiers, who were brought by the police." | "a little boy is getting ready for a haircut" | "a woman is seen on the street of a cat" | "a man in a flower stands by a lake" | "a balloon is swayed by a blue sky" |
base-attn, ffn (beam) | "military commander, a member of the armed force, poses for a photo" | "hairstyles for a child's face." | "a woman carries a cat on a street in a city" | "portrait of a young man in a pond at sunset." | "a colorful balloon floats in the air" |
base-attn, ffn (best-of-N sampling) | "a parade of soldiers in the streets" | "a child is given a haircut" | "a woman in a white hat carries a cat on a street" | "a man in a flower garden with a lot of water" | "a colorful balloon flying in the air" |
Ground Truth (GT) | "soldiers walk down the street of a city" | "a little boy is trimmed in the hairdresser 's" | "a pet cat rides through the streets on the head of her female owner." | "person grow lotus in the season" | "a colorful hot-air balloon being inflated in the distance" |
- `lora.py`: minimal interface to inject LoRA weights into the Flan-T5 decoder.
- `modeling_t5.py`: modified from Hugging Face's `modeling_t5.py` to work with `lora.py`.
- `train.py`: training script.
- `evaluation.py`: captioning inference script.
- `model.py`: definition of the Captioner model.
- `preprocess_data.py`: downloads the images for the dataset.
- `best_of_n_sample.py`: captioning inference script using the best-of-N sampling strategy.
- `custom_trainer.py`: Hugging Face trainer that supports autoregressive generation during training.
- `clipscore.py`: CLIP score computation.
- `nlg-eval`: used for BLEU and CIDEr computation; see nlg-eval's documentation.
- `tests`: tests to make sure the model is working.
- `scripts`: shell scripts to run the Python scripts.