The Multimodal AI Assistant is an advanced tool designed to bridge the gap between humans and computers. Powered by OpenAI's Whisper and the LLaVA-7B model, it processes audio and visual inputs to deliver insightful, context-aware responses. This guide will help you set up and explore the full capabilities of the Multimodal AI Assistant.
Access the Multimodal AI Assistant here: MultimodalAIAssistant
- Introduction
- Table of Contents
- Prerequisites
- Installation
- Getting Started
- Usage
- Features
- Contributing
- Experimental Results
- License
Ensure you have the following prerequisites installed on your machine before starting with the Multimodal AI Assistant:
- Python 3.10 or higher
- Gradio library
- PyTorch and Transformers libraries
- Whisper model from OpenAI
- gTTS for text-to-speech conversion
- OpenCV for video frame processing
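If you prefer to install the core dependencies directly rather than via `requirements.txt`, the command below covers the packages listed above. Note that the exact package set and versions pinned in `requirements.txt` may differ; these PyPI package names are assumptions:

```bash
pip install gradio torch transformers openai-whisper gTTS opencv-python
```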
Follow these steps to install and set up the Multimodal AI Assistant:
- Clone the repository to your local machine:

  ```bash
  git clone https://github.com/TVR28/Multimodal-AI-Assistant.git
  cd Multimodal-AI-Assistant
  ```

- Install the required Python libraries:

  ```bash
  pip install -r requirements.txt
  ```
To get started, run the application script after installation:
```bash
python multimodal_ai_assistant.py
```

Alternatively, open the `Multimodal_AI_Assistant_Llava7B.ipynb` notebook in Google Colab with a T4 GPU runtime to try the assistant out.
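For orientation, here is a minimal sketch of how a Gradio interface for the assistant might be wired together. The function body, component choices, and labels are illustrative assumptions rather than the project's actual implementation:

```python
import gradio as gr

def assistant(audio_path, image):
    # Placeholder: transcribe the spoken question, answer it against the image,
    # and return the answer as text plus an optional synthesized speech file.
    answer_text = "..."
    answer_audio = None
    return answer_text, answer_audio

demo = gr.Interface(
    fn=assistant,
    inputs=[
        gr.Audio(sources=["microphone"], type="filepath", label="Ask by voice"),
        gr.Image(type="pil", label="Upload or capture an image"),
    ],
    outputs=[
        gr.Textbox(label="Response"),
        gr.Audio(label="Spoken response"),
    ],
    title="Multimodal AI Assistant",
)

demo.launch()
```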
The Multimodal AI Assistant provides an interactive interface for users to engage with AI through voice and images:
- Voice Interaction: Record your query directly through the microphone input.
- Image Analysis: Upload or capture an image to receive a detailed description of its content.
- Video Frame Analysis: Capture live video and the assistant will analyze specific frames to answer your questions (in progress).
- RAG System: The assistant will be able to retrieve information from multiple documents, such as PDF, TXT, and CSV files (in progress).
- Voice to Text Transcription: Transcribes user voice input with Whisper.
- Image to Text Description: Generates descriptive text for images using the LLaVA model.
- Text to Speech Response: Converts AI-generated text into audible speech.
- Video Frame Extraction: Captures frames from live video for analysis.
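The sketch below illustrates how these pieces could fit together: Whisper transcribes the spoken question, a LLaVA model answers it against the supplied image, and gTTS converts the answer to speech. The model ID, prompt template, and file names are assumptions for illustration; refer to the application script or notebook for the project's actual implementation:

```python
import whisper                     # openai-whisper
from gtts import gTTS
from transformers import pipeline  # LLaVA served via the image-to-text pipeline

# 1. Voice to text: transcribe the recorded question with Whisper.
stt = whisper.load_model("base")
question = stt.transcribe("query.wav")["text"]

# 2. Image to text: answer the question about the image with LLaVA.
vlm = pipeline("image-to-text", model="llava-hf/llava-1.5-7b-hf")
prompt = f"USER: <image>\n{question} ASSISTANT:"
result = vlm("photo.jpg", prompt=prompt, generate_kwargs={"max_new_tokens": 200})
# The generated text may echo the prompt, so keep only the assistant's reply.
answer = result[0]["generated_text"].split("ASSISTANT:")[-1].strip()

# 3. Text to speech: turn the answer into audible speech with gTTS.
gTTS(text=answer).save("answer.mp3")
```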
We welcome contributions to the Multimodal AI Assistant. Please follow these steps to contribute:
- Fork the repository.
- Create a new branch for your feature.
- Add your feature or bug fix.
- Push your code and open a new pull request.