This repository contains a notebook that builds a multimodal system using frames extracted from YouTube videos as images, the LlamaIndex framework, Qdrant as the vector database, and Gemini as both the embedding model and the LLM.
Main Steps
- Data Ingestion: Load videos and metadata from a YouTube playlist
- Indexing: MultiModalVectorStoreIndex from LlamaIndex
- Embedding and LLM: Gemini
- Vector Store: Qdrant with 2 collections (text and images)
- Query Retrieval: Retrieve the top-matching recipe and frame images
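
The two-collection retrieval idea from the steps above can be sketched in plain Python. This is a toy illustration only, not the notebook's code: it stands in for Qdrant with two in-memory dictionaries and for Gemini embeddings with hand-written 3-dimensional vectors. All names here (`text_collection`, `image_collection`, `retrieve`, the recipe and frame ids) are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(collection, query_vec, k):
    """Return the ids of the k items most similar to query_vec."""
    scored = [(cosine(vec, query_vec), item_id) for item_id, vec in collection.items()]
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored[:k]]


# Hypothetical stand-ins for the two Qdrant collections (text and images).
# Real embeddings would come from an embedding model, not be written by hand.
text_collection = {
    "recipe_pasta": [0.9, 0.1, 0.0],
    "recipe_salad": [0.1, 0.9, 0.0],
}
image_collection = {
    "pasta_frame_001.png": [0.8, 0.2, 0.1],
    "pasta_frame_002.png": [0.7, 0.1, 0.3],
    "salad_frame_001.png": [0.0, 1.0, 0.1],
}


def retrieve(query_vec, text_k=1, image_k=2):
    """Query both collections with the same vector and return both result lists."""
    return {
        "texts": top_k(text_collection, query_vec, text_k),
        "images": top_k(image_collection, query_vec, image_k),
    }


result = retrieve([1.0, 0.0, 0.0])
print(result)
```

Keeping text and images in separate collections lets each be queried with its own top-k, which is why the sketch takes `text_k` and `image_k` as separate parameters.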
Feel free to ⭐ and clone this repo 😉
For detailed project descriptions, refer to this Medium article.