This repo contains implementations of the models from "Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction", accepted at ACL 2024. The code is developed based on the LAVIS project.
For convenience, we put our model files in `lavis/models/blip2_models`:

- `liptome_opt.py`
- `liptome_video_opt.py`
- `liptome_vicuna.py`
- `liptome_video_vicuna.py`
Tomeformer is in `lavis/models/blip2_models/modeling_bert.py`. This file is edited from the official HuggingFace implementation by adding `bipartite_soft_matching()` and `parse_r()` from ToMe; these two functions are used in `BertLayer.forward()`.
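To illustrate how these two functions interact, here is a minimal, self-contained sketch using the standalone `tome` package. It is not the code in `modeling_bert.py`: the tensor shapes, the constant reduction schedule, and the use of hidden states as the matching metric (ToMe typically uses attention keys) are illustrative assumptions.

```python
import torch
from tome.merge import bipartite_soft_matching, merge_wavg
from tome.utils import parse_r

# Illustrative shapes: 2 sequences of 256 visual tokens with width 768.
x = torch.randn(2, 256, 768)

# parse_r() expands a single r into a per-layer schedule:
# 12 layers with r=16 means every layer merges away 16 tokens.
num_layers, tome_r = 12, 16
size = None  # running token "sizes" used for weighted averaging

for r in parse_r(num_layers, tome_r):
    # Matching metric: hidden states here for simplicity;
    # ToMe normally uses the attention keys of the current layer.
    merge, _ = bipartite_soft_matching(x, r, class_token=False)
    x, size = merge_wavg(merge, x, size)

print(x.shape)  # 256 - 12 * 16 = 64 tokens remain: torch.Size([2, 64, 768])
```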
Tomeformer is used as the vision-language connector in `liptome_opt.py`:

```python
from .modeling_bert import BertConfig, BertModel

encoder_config = BertConfig.from_pretrained("bert-base-uncased", cache_dir="/mnt/bn/username/cache_dir")
encoder_config.num_hidden_layers = num_layers
encoder_config.r = tome_r
self.VLconnector = BertModel.from_pretrained("bert-base-uncased", config=encoder_config, cache_dir="/mnt/bn/username/cache_dir")
```
The temporally augmented EVA-ViT (used in EVLGen-video) is in `lavis/models`:

- `eva_vit_g.py`
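The exact temporal modules live in `eva_vit_g.py`; as a generic illustration only (not the repo's implementation), one common way to augment a frame-level ViT with temporal information is to add a learnable per-frame embedding before flattening frames into a single token sequence:

```python
import torch
import torch.nn as nn

# Assumed shapes: EVA-ViT-g width 1408, 4 frames, 257 tokens per frame.
batch, frames, tokens, dim = 2, 4, 257, 1408
x = torch.randn(batch, frames, tokens, dim)    # per-frame ViT outputs

# Learnable temporal position embedding, broadcast over tokens within a frame.
temporal_embed = nn.Parameter(torch.zeros(1, frames, 1, dim))
x = x + temporal_embed

# Flatten frames into one sequence for the vision-language connector.
x = x.flatten(1, 2)                            # (batch, frames * tokens, dim)
print(x.shape)                                 # torch.Size([2, 1028, 1408])
```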
To set up the environment:

```bash
conda create -n lavis python=3.8
conda activate lavis
pip install -e .
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117
```
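As an optional sanity check (assuming a CUDA 11.7 machine), verify that the CUDA build of PyTorch is active:

```python
import torch

print(torch.__version__)          # expected: 1.13.1+cu117
print(torch.cuda.is_available())  # should print True
```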
Please follow the instructions from LAVIS to download the pre-training datasets. The CapFilt dataset can be downloaded from BLIP.
We provide example scripts for training (image and video models):

```bash
python -m torch.distributed.run --nproc_per_node=8 train.py --cfg-path lavis/projects/EVLGen/train/pretrain_opt2.7b.yaml
python -m torch.distributed.run --nproc_per_node=8 train.py --cfg-path lavis/projects/EVLGen_video/train/caption_msrvtt_opt2.7b_ft.yaml
```
We provide example scripts for evaluation (image and video models):

```bash
python -m torch.distributed.run --nproc_per_node=8 evaluate.py --cfg-path lavis/projects/EVLGen/eval/caption_coco_opt2.7b_eval.yaml
python -m torch.distributed.run --nproc_per_node=8 evaluate.py --cfg-path lavis/projects/EVLGen/eval/gqa_zeroshot_opt2.7b_eval.yaml
python -m torch.distributed.run --nproc_per_node=8 evaluate.py --cfg-path lavis/projects/EVLGen/eval/okvqa_zeroshot_opt2.7b_eval.yaml
python -m torch.distributed.run --nproc_per_node=8 evaluate.py --cfg-path lavis/projects/EVLGen/eval/vqav2_zeroshot_opt2.7b_eval.yaml
python -m torch.distributed.run --nproc_per_node=8 evaluate.py --cfg-path lavis/projects/EVLGen_video/eval/caption_msrvtt_opt2.7b_eval.yaml
```
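For both training and evaluation, set `--nproc_per_node` to the number of GPUs available on your machine (the examples above assume 8).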
We provide pretrained models (EVLGen-image based on opt-2.7b and vicuna-7b) on Dropbox.