The Multimodal AI Assistant is an advanced tool designed to bridge the gap between humans and computers. Powered by OpenAI's Whisper and the LLaVA-7B model, it processes audio and visual inputs to deliver insightful, context-aware responses. This guide will help you set up and explore the full capabilities of the Multimodal AI Assistant.
Access the Multimodal AI Assistant here: MultimodalAIAssistant
- Introduction
- Table of Contents
- Prerequisites
- Installation
- Getting Started
- Usage
- Features
- Contributing
- Experimental Results
- License
Ensure you have the following prerequisites installed on your machine before starting with the Multimodal AI Assistant:
- Python 3.10 or higher
- Gradio library
- PyTorch and Transformers libraries
- Whisper model from OpenAI
- gTTS for text-to-speech conversion
- OpenCV for video frame processing
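If you prefer to install the core dependencies directly rather than via `requirements.txt`, the command below covers the packages listed above. Note that the exact package set and versions pinned in `requirements.txt` may differ; these PyPI package names are assumptions:

```bash
pip install gradio torch transformers openai-whisper gTTS opencv-python
```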
Follow these steps to install and set up the Multimodal AI Assistant:
- Clone the repository to your local machine:

  ```bash
  git clone https://github.com/TVR28/Multimodal-AI-Assistant.git
  cd Multimodal-AI-Assistant
  ```

- Install the required Python libraries:

  ```bash
  pip install -r requirements.txt
  ```
To get started, run the application script after installation:
```bash
python multimodal_ai_assistant.py
```

Alternatively, open the `Multimodal_AI_Assistant_Llava7B.ipynb` notebook in Google Colab with a T4 GPU runtime to try the assistant out.
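For orientation, here is a minimal sketch of how a Gradio interface for the assistant might be wired together. The function body, component choices, and labels are illustrative assumptions rather than the project's actual implementation:

```python
import gradio as gr

def assistant(audio_path, image):
    # Placeholder: transcribe the spoken question, answer it against the image,
    # and return the answer as text plus an optional synthesized speech file.
    answer_text = "..."
    answer_audio = None
    return answer_text, answer_audio

demo = gr.Interface(
    fn=assistant,
    inputs=[
        gr.Audio(sources=["microphone"], type="filepath", label="Ask by voice"),
        gr.Image(type="pil", label="Upload or capture an image"),
    ],
    outputs=[
        gr.Textbox(label="Response"),
        gr.Audio(label="Spoken response"),
    ],
    title="Multimodal AI Assistant",
)

demo.launch()
```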
The Multimodal AI Assistant provides an interactive interface for users to engage with AI through voice and images:
- Voice Interaction: Record your query directly through the microphone input.
- Image Analysis: Upload or capture an image to receive a detailed description of its content.
- Video Frame Analysis: Capture live video and the assistant will analyze specific frames to answer your questions (in progress).
- RAG System: The assistant will be able to retrieve information from multiple documents, such as PDF, TXT, and CSV files (in progress).
- Voice to Text Transcription: Transcribes user voice input with Whisper.
- Image to Text Description: Generates descriptive text for images using the LLaVA model.
- Text to Speech Response: Converts AI-generated text into audible speech.
- Video Frame Extraction: Captures frames from live video for analysis.
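The sketch below illustrates how these pieces could fit together: Whisper transcribes the spoken question, a LLaVA model answers it against the supplied image, and gTTS converts the answer to speech. The model ID, prompt template, and file names are assumptions for illustration; refer to the application script or notebook for the project's actual implementation:

```python
import whisper                     # openai-whisper
from gtts import gTTS
from transformers import pipeline  # LLaVA served via the image-to-text pipeline

# 1. Voice to text: transcribe the recorded question with Whisper.
stt = whisper.load_model("base")
question = stt.transcribe("query.wav")["text"]

# 2. Image to text: answer the question about the image with LLaVA.
vlm = pipeline("image-to-text", model="llava-hf/llava-1.5-7b-hf")
prompt = f"USER: <image>\n{question} ASSISTANT:"
result = vlm("photo.jpg", prompt=prompt, generate_kwargs={"max_new_tokens": 200})
# The generated text may echo the prompt, so keep only the assistant's reply.
answer = result[0]["generated_text"].split("ASSISTANT:")[-1].strip()

# 3. Text to speech: turn the answer into audible speech with gTTS.
gTTS(text=answer).save("answer.mp3")
```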
We welcome contributions to the Multimodal AI Assistant. Please follow these steps to contribute:
- Fork the repository.
- Create a new branch for your feature.
- Add your feature or bug fix.
- Push your code and open a new pull request.