[human action understanding downstream task finetune track]
Visual instruction tuning towards large language and vision models with GPT-4 level capabilities.
Usage and License Notices: This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses, including but not limited to the OpenAI Terms of Use for the dataset and the specific licenses for the base language models of checkpoints trained using the dataset (e.g., the Llama community license for LLaMA-2 and Vicuna-v1.5). This project does not impose any additional constraints beyond those stipulated in the original licenses. Users are also reminded to ensure that their use of the dataset and checkpoints complies with all applicable laws and regulations.
If you find LLaVA useful for your research and applications, please cite using this BibTeX:
@misc{liu2024llavanext,
  title={LLaVA-NeXT: Improved reasoning, OCR, and world knowledge},
  url={https://llava-vl.github.io/blog/2024-01-30-llava-next/},
  author={Liu, Haotian and Li, Chunyuan and Li, Yuheng and Li, Bo and Zhang, Yuanhan and Shen, Sheng and Lee, Yong Jae},
  month={January},
  year={2024}
}

@misc{liu2023improvedllava,
  title={Improved Baselines with Visual Instruction Tuning},
  author={Liu, Haotian and Li, Chunyuan and Li, Yuheng and Lee, Yong Jae},
  publisher={arXiv:2310.03744},
  year={2023},
}

@misc{liu2023llava,
  title={Visual Instruction Tuning},
  author={Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},
  publisher={NeurIPS},
  year={2023},
}
- Vicuna: the codebase we built upon, and our base model Vicuna-13B, which has strong language capabilities.
- Instruction Tuning with GPT-4
- LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day
- Otter: In-Context Multi-Modal Instruction Tuning
For future project ideas, please check out:
- SEEM: Segment Everything Everywhere All at Once
- Grounded-Segment-Anything: detect, segment, and generate anything by combining Grounding DINO and Segment-Anything