[ECCV 2024] Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Unified embedding generation and search engine. Also available in the cloud at cloud.marqo.ai
A Chinese version of CLIP for Chinese cross-modal retrieval and representation generation.
Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.
[ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation. We also introduce a rigorous 'Quantitative Evaluation Benchmarking' for video-based conversational models.
Overview of Japanese LLMs (日本語LLMまとめ)
A general representation model across vision, audio, and language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
[ECCV 2024 Oral] DriveLM: Driving with Graph Visual Question Answering
Pix2Seq codebase: multi-task learning with generative modeling (autoregressive and diffusion)
🔥🔥 LLaVA++: Extending LLaVA with Phi-3 and LLaMA-3 (LLaVA LLaMA-3, LLaVA Phi-3)
[CVPR 2024] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
[ICLR 2024] Controlling Vision-Language Models for Universal Image Restoration. 5th place in the NTIRE 2024 Restore Any Image Model in the Wild Challenge.
A Framework of Small-scale Large Multimodal Models
Official implementation of SEED-LLaMA (ICLR 2024).
CLIPort: What and Where Pathways for Robotic Manipulation
A third-party implementation of the paper Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection.
VisualCLA: a multimodal Chinese LLaMA & Alpaca large language model (多模态中文LLaMA&Alpaca大语言模型)
CALVIN - A benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks