[Project Page] [arXiv] [Demo] [Model Zoo]
[2024/1/14] Our training code is released.
[2023/12/6] Our paper is available on arXiv.
- Clone this repository and navigate to the LLaVA-Grounding folder:
git clone https://github.com/UX-Decoder/LLaVA-Grounding.git
cd LLaVA-Grounding
- Install required packages:
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip # enable PEP 660 support
pip install -e .
- Install additional packages for training:
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
- Install packages necessary for OpenSeeD and Semantic-SAM.
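After the installation steps above, a quick sanity check can save time before moving on to the demo or training. The snippet below is a minimal sketch (a hypothetical helper, not shipped with this repo); it assumes flash_attn is only needed for training and gradio only for the demo.

# sanity_check_env.py -- hypothetical helper, not part of the official repo.
# Verifies that the core dependencies import cleanly and that a CUDA device
# is visible before you move on to the demo or training steps.
import importlib

import torch

print(f"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")

# flash_attn and gradio are only needed for training and the demo, respectively,
# so report their status without failing the check.
for optional_pkg in ("flash_attn", "gradio"):
    try:
        importlib.import_module(optional_pkg)
        print(f"{optional_pkg}: OK")
    except ImportError:
        print(f"{optional_pkg}: not installed (only needed for training/demo)")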
Please check out our Model Zoo for all public LLaVA-Grounding checkpoints, along with instructions on how to use the weights.
After downloading the model weights, run the following command to launch the demo on your own machine.
CUDA_VISIBLE_DEVICES=0 python gradio_demo/LLaVA_G_Demo.py --path_vision_cfg path_to_vision_cfg --path_inter_cfg path_to_inter_cfg --model_path path_to_ckpt_dir
# for example, after downloading weights into checkpoints/llava_grounding
CUDA_VISIBLE_DEVICES=0 python gradio_demo/LLaVA_G_Demo.py --path_vision_cfg configs/openseed/openseed_swint_lang_joint_2st_visual_prompt.yaml --path_inter_cfg configs/semsam/visual_prompt_encoder.yaml --model_path checkpoints/llava_grounding
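If the demo fails to start, a common cause is a wrong config or checkpoint path. The sketch below is a hypothetical pre-flight check (not part of the repo) that simply verifies the files and the checkpoint directory from the example command exist before launching.

# check_demo_paths.py -- hypothetical pre-flight check, not part of the repo.
# Confirms that the config files and checkpoint directory passed to
# gradio_demo/LLaVA_G_Demo.py actually exist before the demo is launched.
import sys
from pathlib import Path

def check_demo_paths(path_vision_cfg: str, path_inter_cfg: str, model_path: str) -> bool:
    ok = True
    for label, p, must_be_dir in [
        ("--path_vision_cfg", path_vision_cfg, False),
        ("--path_inter_cfg", path_inter_cfg, False),
        ("--model_path", model_path, True),
    ]:
        path = Path(p)
        exists = path.is_dir() if must_be_dir else path.is_file()
        print(f"{label}: {p} -> {'OK' if exists else 'MISSING'}")
        ok &= exists
    return ok

if __name__ == "__main__":
    if not check_demo_paths(
        "configs/openseed/openseed_swint_lang_joint_2st_visual_prompt.yaml",
        "configs/semsam/visual_prompt_encoder.yaml",
        "checkpoints/llava_grounding",
    ):
        sys.exit("Fix the missing paths above before launching the demo.")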
Please refer to our Online Demo for more detailed usage guidance.
data
├── flickr30k_entities
│   ├── train/
│   ├── val/
│   ├── annotations
│   │   ├── final_flickr_separateGT_train.json
│   │   ├── final_flickr_separateGT_val.json
├── coco
│   ├── train2014/
│   ├── train2017/
│   ├── panoptic_train2017/
│   ├── panoptic_semseg_train2017/
│   ├── annotations
│   │   ├── instances_train2017.json
│   │   ├── instances_train2017_gvc.json
│   │   ├── grounded_visual_chat_data.json
│   │   ├── instances_train2014_filter.json
│   │   ├── panoptic_train2017_filter.json
│   │   ├── grounding_train2017.json
├── llava
│   ├── annotations
│   │   ├── cap600k_brackets_all.json
│   │   ├── llava_instruct_150k.json
│   │   ├── llava_instruct_150k_visual_prompt.json
Please refer to MDETR's pre-processed flickr30k data.
Please download the COCO train2014 and train2017 images, together with the panoptic segmentation and semantic segmentation data. The other annotations can be downloaded here.
The processed annotations can be downloaded here.
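Before starting training, it may help to confirm that the layout above is complete. The following is a minimal sketch of such a check (a hypothetical helper, not included in the repo), using only the directories and annotation files listed in the tree.

# check_data_layout.py -- hypothetical helper, not part of the repo.
# Walks the expected "data" layout from the tree above and reports anything missing.
from pathlib import Path

DATA_ROOT = Path("data")

EXPECTED_DIRS = [
    "flickr30k_entities/train",
    "flickr30k_entities/val",
    "coco/train2014",
    "coco/train2017",
    "coco/panoptic_train2017",
    "coco/panoptic_semseg_train2017",
]

EXPECTED_FILES = [
    "flickr30k_entities/annotations/final_flickr_separateGT_train.json",
    "flickr30k_entities/annotations/final_flickr_separateGT_val.json",
    "coco/annotations/instances_train2017.json",
    "coco/annotations/instances_train2017_gvc.json",
    "coco/annotations/grounded_visual_chat_data.json",
    "coco/annotations/instances_train2014_filter.json",
    "coco/annotations/panoptic_train2017_filter.json",
    "coco/annotations/grounding_train2017.json",
    "llava/annotations/cap600k_brackets_all.json",
    "llava/annotations/llava_instruct_150k.json",
    "llava/annotations/llava_instruct_150k_visual_prompt.json",
]

missing = [p for p in EXPECTED_DIRS if not (DATA_ROOT / p).is_dir()]
missing += [p for p in EXPECTED_FILES if not (DATA_ROOT / p).is_file()]

if missing:
    print("Missing entries under data/:")
    for p in missing:
        print(f"  - {p}")
else:
    print("All expected data directories and annotation files are present.")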
Stage 1
bash scripts/pretrain_joint.py
Stage 2
bash scripts/finetune.py
Stage 3
bash scripts/finetune_visual_prompt.py
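The three stages are numbered and meant to be run in order. If you prefer a single entry point, a small wrapper along the lines of the sketch below (hypothetical, not part of the repo) chains the stage scripts exactly as listed above and stops at the first failure.

# run_training_stages.py -- hypothetical wrapper, not part of the repo.
# Runs the three stage scripts above in order and aborts if any stage fails.
import subprocess
import sys

STAGES = [
    "scripts/pretrain_joint.py",
    "scripts/finetune.py",
    "scripts/finetune_visual_prompt.py",
]

for stage, script in enumerate(STAGES, start=1):
    print(f"=== Stage {stage}: bash {script} ===")
    result = subprocess.run(["bash", script])
    if result.returncode != 0:
        sys.exit(f"Stage {stage} ({script}) failed with exit code {result.returncode}.")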
If you find LLaVA-Grounding useful for your research and applications, please cite using this BibTeX:
@misc{zhang2023llavagrounding,
      title={LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models},
      author={Hao Zhang and Hongyang Li and Feng Li and Tianhe Ren and Xueyan Zou and Shilong Liu and Shijia Huang and Jianfeng Gao and Lei Zhang and Chunyuan Li and Jianwei Yang},
      year={2023},
      publisher={arXiv}
}

@misc{liu2023llava,
      title={Visual Instruction Tuning},
      author={Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},
      publisher={arXiv:2304.08485},
      year={2023}
}