A multi-modal starter kit that can have AI narrate a video or scene of your choice. Includes examples of video processing, frame extraction, and sending frames to AI models efficiently. Costs $0 to run.
Works with the following models 👇🦙
- LLaVa (powered by Ollama)
- LLaVa-vicuna (powered by Ollama)
- BakLLaVA (powered by Ollama)
- Moondream (powered by Fal.ai)
- ...and many others on https://ollama.com/library
- GPT-4v
Have questions? Join AI Stack devs #multi-modal-starter-kit
🎉 Demo (Sound ON 🔊)
MM-demo.mp4
- 💻 Video and Image hosting: Tigris
- 🦙 Inference: Ollama, Fal with options to use OpenAI
- 🔌 GPU: Fly
- 💾 Caching: Upstash
- 🤔 AI response pub/sub: Upstash
- 📢 Video narration: ElevenLabs
- 🗺️ Workflow orchestration: Inngest
- 🖼️ App logic: Next.js
- 🖌️ UI: Vercel v0
git clone git@github.com:[YOUR_GITHUB_ACCOUNT_NAME]/multi-modal-starter-kit.git
If you are using Homebrew on your machine, run brew bundle
to install all the needed dependencies. If you need to install them manually, install these from your package manager of choice:
- ffmpeg (ideally with a wide range of codecs supported; if you don't know what this means, the default package is probably fine)
- Node.js 20.x or higher
- Create an .env file
cd multi-modal-starter-kit
cp .env.example .env
- Set up Tigris
- Make sure you have a fly.io account and have fly CLI installed on your computer
cd multi-modal-starter-kit
- Pick a name for your version of your app. App names on fly are global, so it has to be unique. For example
multi-modal-awesomeness
- Create the app on fly with
fly app create <your app name>
so, for example, fly app create multi-modal-awesomeness
- Create the storage with
fly storage create
- You should get a list of credentials like below:
- If you get a list of keys without values, destroy the bucket with
fly storage destroy
and try again.
- Copy and paste these values into your .env under "Tigris"
- Note that the variable name for the storage bucket in .env is NEXT_PUBLIC_BUCKET_NAME. If you copy and paste the credentials, add the NEXT_PUBLIC_ prefix to BUCKET_NAME
- Set the Tigris bucket CORS policy and bucket access policy
fly storage update YOUR_BUCKET_NAME --public
- Make sure you have the aws CLI installed and run aws configure. Enter the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY printed above. Note that these are not actual Amazon Web Services credentials, but Tigris credentials. If you have the aws CLI already configured for Amazon, this will overwrite those values.
- Run the following command to update the CORS policy on the bucket
aws s3api put-bucket-cors --bucket BUCKET_NAME --cors-configuration file://cors.json --endpoint-url https://fly.storage.tigris.dev/
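The contents of cors.json are not shown in this README. As a rough sketch, assuming the standard S3 CORSRules schema that put-bucket-cors accepts, such a file could be generated as below. The buildCorsConfig helper, the localhost origin, and the read-only method list are illustrative assumptions, not the file shipped with the repo.

```typescript
// Hypothetical helper: builds an S3-style CORS configuration object.
// The CORSRules / AllowedOrigins / AllowedMethods / AllowedHeaders keys follow
// the S3 put-bucket-cors schema; the values are assumptions for illustration.
interface CorsRule {
  AllowedOrigins: string[];
  AllowedMethods: string[];
  AllowedHeaders: string[];
  MaxAgeSeconds?: number;
}

interface CorsConfig {
  CORSRules: CorsRule[];
}

function buildCorsConfig(origins: string[]): CorsConfig {
  if (origins.length === 0) {
    throw new Error("At least one allowed origin is required");
  }
  return {
    CORSRules: [
      {
        AllowedOrigins: origins,
        AllowedMethods: ["GET", "HEAD"], // read-only access from the browser
        AllowedHeaders: ["*"],
        MaxAgeSeconds: 3000,
      },
    ],
  };
}

// Serialize to the JSON text that would be saved as cors.json.
const corsJson = JSON.stringify(
  buildCorsConfig(["http://localhost:3000"]),
  null,
  2
);
```

If the app also needs to upload from the browser, the method list would have to include the corresponding write methods.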
We have a sample video in the assets
directory that you can use to test the app. You can run the following command if you want to test the app with this video
aws s3 cp ./assets/pasta-making.mp4 s3://BUCKET_NAME --endpoint-url https://fly.storage.tigris.dev
Alternatively, you can upload your own videos.
By default, the app uses Ollama / llava for vision. If you want to use OpenAI GPT-4v instead, set INFERENCE_PLATFORM="OpenAI" and fill in OPENAI_API_KEY in .env
There are two ways to get Ollama up and running. You can either use Fly GPU, which provides very fast inference, or use your laptop.
Option 1: Fly GPU
- Make sure you have a Fly account and flyctl installed
- Fork ollama-demo, edit fly.toml to rename the app, and run
fly launch
- Under the ollama-demo directory, run
fly ssh console
-- once you have ssh'd into the instance, run ollama pull llava
-- by default, this pulls the llava 7b model, but you could also pull other vision models to use with your app, such as:
ollama pull llava:34b
ollama pull llava:7b-v1.6-vicuna-q4_0
...
- You should get a hostname once fly launch succeeds. Copy and paste this value into OLLAMA_HOST in .env
Your app will now use this Fly GPU for inference.
Option 2: Your laptop
- Install Ollama
- Run
ollama pull llava
in your terminal. As mentioned under Option 1, you can also pull other models to compare the results.
- (optional) Watch requests coming into Ollama by running this in a new terminal tab
tail -f ~/.ollama/logs/server.log
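Once a model is pulled, a frame can be sent to it over Ollama's HTTP API. The sketch below only builds the request for the /api/generate endpoint, which accepts base64-encoded images alongside a prompt. It assumes OLLAMA_HOST holds a full base URL such as http://localhost:11434; buildOllamaRequest and the prompt text are hypothetical and not taken from the repo.

```typescript
// Request body fields accepted by Ollama's /api/generate endpoint.
interface OllamaGenerateBody {
  model: string;
  prompt: string;
  images: string[]; // base64-encoded image data, without a data: URL prefix
  stream: boolean;
}

// Hypothetical helper: describes the request without sending it.
function buildOllamaRequest(
  ollamaHost: string, // assumed to be a full base URL
  frameBase64: string,
  prompt: string,
  model: string = "llava"
): { url: string; body: OllamaGenerateBody } {
  // Drop trailing slashes so the path is not doubled up.
  const base = ollamaHost.replace(/\/+$/, "");
  return {
    url: `${base}/api/generate`,
    body: {
      model,
      prompt,
      images: [frameBase64],
      stream: false, // ask for one JSON reply instead of a stream of chunks
    },
  };
}

const ollamaExample = buildOllamaRequest(
  "http://localhost:11434/",
  "aGVsbG8=", // placeholder base64 payload, not a real image
  "Describe this frame in one sentence."
);
```

The result could be posted with fetch(ollamaExample.url, { method: "POST", body: JSON.stringify(ollamaExample.body) }). Swapping the model argument for another pulled tag is one way to compare models.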
- Go to https://elevenlabs.io/, log in, and click on your profile picture in the lower left. Select "Profile + API key". Copy the API key and save it as
XI_API_KEY
in the .env file
- Select an ElevenLabs voice by clicking on "Voices" in the left side nav bar and navigating to "VoiceLab". Copy the voice ID and save it as
XI_VOICE_ID
in .env
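To show how these two values are typically used, here is a minimal sketch of a request to the ElevenLabs text-to-speech endpoint, which takes the voice ID in the URL path and the API key in an xi-api-key header. buildNarrationRequest is a hypothetical helper that only describes the request; how the starter kit itself calls ElevenLabs is not shown in this README.

```typescript
// Sketch only: describes the HTTP request for ElevenLabs text-to-speech
// without sending it. buildNarrationRequest is a hypothetical helper.
interface NarrationRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

function buildNarrationRequest(
  text: string,
  apiKey: string, // value of XI_API_KEY
  voiceId: string // value of XI_VOICE_ID
): NarrationRequest {
  if (!apiKey || !voiceId) {
    throw new Error("XI_API_KEY and XI_VOICE_ID must both be set");
  }
  return {
    url: `https://api.elevenlabs.io/v1/text-to-speech/${encodeURIComponent(voiceId)}`,
    method: "POST",
    headers: {
      "xi-api-key": apiKey,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ text }),
  };
}
```

The result could be sent with fetch(req.url, { method: req.method, headers: req.headers, body: req.body }); a successful response carries the audio bytes.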
When narrating a very long video, Upstash Redis is used for pub/sub: it notifies the client when new snippets of the reply come back. Upstash is also used for the critical task of caching videos and images so that subsequent requests don't take long.
- Go to https://console.upstash.com/, select "Create Database" with the following settings
- Once created, under 'Node' - 'io-redis' tab, copy the whole string starting with "rediss://" and set
UPSTASH_REDIS_URL
value as this string in .env
- On the same page, scroll down to the "REST API" section and copy and paste everything under the ".env" tab into your .env file
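The caching role described above follows the usual cache-aside pattern: look up a key, and only do the expensive work on a miss. The sketch below uses an in-memory Map as a stand-in for Redis so it runs on its own; a real Redis client is asynchronous, and the MemoryCache class, key format, and TTL are assumptions for illustration rather than the repo's implementation.

```typescript
// In-memory stand-in for Redis, for illustration only.
class MemoryCache {
  private entries = new Map<string, { value: string; expiresAt: number }>();

  get(key: string, now: number): string | undefined {
    const entry = this.entries.get(key);
    if (!entry || entry.expiresAt <= now) return undefined;
    return entry.value;
  }

  set(key: string, value: string, ttlMs: number, now: number): void {
    this.entries.set(key, { value, expiresAt: now + ttlMs });
  }
}

// Cache-aside: return the cached frame if present and not expired, otherwise
// load it once and store it. loadFrame is a hypothetical, expensive fetch.
function getFrameWithCache(
  cache: MemoryCache,
  frameUrl: string,
  loadFrame: (url: string) => string,
  now: number = Date.now(),
  ttlMs: number = 60 * 60 * 1000
): string {
  const key = `frame:${frameUrl}`;
  const hit = cache.get(key, now);
  if (hit !== undefined) return hit;
  const fresh = loadFrame(frameUrl);
  cache.set(key, fresh, ttlMs, now);
  return fresh;
}
```

Passing the clock in as a parameter keeps expiry behavior easy to check without waiting for real time to pass.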
npm install
npm run dev
By now you should have a functional app. Let's deploy it to the fly.io account that you set up in Step 1.
- First, let's see what secrets are already available in our app using
fly secrets list
:
$ ➔ fly secrets list
NAME DIGEST CREATED AT
AWS_ACCESS_KEY_ID xxxxxxx Feb 23 2024 20:33
AWS_ENDPOINT_URL_S3 xxxxxxx Feb 23 2024 20:33
AWS_REGION xxxxxxx Feb 23 2024 20:33
AWS_SECRET_ACCESS_KEY xxxxxxx Feb 23 2024 20:33
BUCKET_NAME xxxxxxx Feb 23 2024 20:33
- We need to match the secrets to those in the .env.example file. Rename the BUCKET_NAME secret to NEXT_PUBLIC_BUCKET_NAME:
$ ➔ fly secrets set NEXT_PUBLIC_BUCKET_NAME=<YOUR BUCKET NAME>
$ ➔ fly secrets unset BUCKET_NAME
- Now, all other environment vars:
$ ➔ fly secrets set OPENAI_API_KEY=<YOUR KEY HERE>
$ ➔ fly secrets set UPSTASH_REDIS_URL=<UPSTASH REDIS URL HERE>
$ ➔ fly secrets set UPSTASH_REDIS_REST_URL=<UPSTASH REDIS REST URL HERE>
$ ➔ fly secrets set UPSTASH_REDIS_REST_TOKEN=<UPSTASH REDIS REST TOKEN HERE>
$ ➔ fly secrets set XI_API_KEY=<XI API KEY>
$ ➔ fly secrets set XI_VOICE_ID=<XI VOICE ID>
- Once environment is all set, we can make the app fly:
$ ➔ fly launch
$ ➔ fly deploy
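The secrets above are set one at a time, but fly secrets set also accepts several NAME=VALUE pairs in one call. As a small sketch, the arguments for such a call could be assembled from an already-parsed .env object. buildSecretsArgs is a hypothetical helper, and the values in the example are placeholders.

```typescript
// Hypothetical helper: turns selected .env entries into arguments for a single
// `fly secrets set NAME=VALUE ...` invocation. It returns an argv array, so no
// shell quoting is needed when the array is handed to a process spawner.
function buildSecretsArgs(
  env: Record<string, string | undefined>,
  names: string[]
): string[] {
  const missing = names.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing values for: ${missing.join(", ")}`);
  }
  return ["secrets", "set", ...names.map((name) => `${name}=${env[name]}`)];
}

// Placeholder values only; real values would come from the .env file.
const secretsArgs = buildSecretsArgs(
  { XI_API_KEY: "placeholder-key", XI_VOICE_ID: "placeholder-voice" },
  ["XI_API_KEY", "XI_VOICE_ID"]
);
```

Failing early on missing names avoids deploying with a half-configured environment.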
There is an example in the repo that leverages Inngest for workflow orchestration. Inngest is especially helpful when you have a long-running workflow, because it retries failed steps automatically. Example code is in src/inngest/functions.ts.
In this example, Inngest waits for new images to upload to Tigris, then sends the image to Ollama/OpenAI for processing. The "describe-image"
step is auto-retried when there is a failure or returned JSON is malformed.
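// Runs every minute: fetch the latest snapshot, wait for Tigris to report the
// upload, then ask the model to describe the image.
export const inngestTick = inngest.createFunction(
  { id: "tick" },
  { cron: "* * * * *" },
  async ({ step }) => {
    await step.run("fetch-latest-snapshot", async () => {
      return await fetchLatestFromTigris();
    });
    // Wait up to one minute for the upload-complete event.
    const result = await step.waitForEvent("Tigris.complete", {
      event: "Tigris.complete",
      timeout: "1m",
    });
    // result is null if the wait timed out, so url may be undefined.
    const url = result?.data.url;
    console.log("url", url);
    if (url) {
      // Errors thrown inside this step trigger Inngest's automatic retries.
      await step.run("describe-image", async () => {
        return await describeImage(url);
      });
    }
  }
);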
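Retrying on malformed JSON relies on the step throwing when the model's reply cannot be parsed. A minimal sketch of that validation follows. The ImageDescription shape and parseDescription are hypothetical, since the fields the model actually returns are not shown in this README.

```typescript
// Hypothetical shape of a model reply; the real field names may differ.
interface ImageDescription {
  description: string;
}

// Parses the raw model output and throws when it is not the expected JSON.
// Inside step.run, a thrown error marks the step as failed so it is retried.
function parseDescription(raw: string): ImageDescription {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    throw new Error("Model returned malformed JSON");
  }
  if (
    typeof parsed !== "object" ||
    parsed === null ||
    typeof (parsed as { description?: unknown }).description !== "string"
  ) {
    throw new Error("Model reply is missing a string 'description' field");
  }
  return { description: (parsed as { description: string }).description };
}
```

Calling a function like this at the end of the "describe-image" step would turn a bad reply into a retry instead of passing broken data downstream.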
fal.ai is an inference platform that specializes in fast media model inference. To use fal with the multi-modal starter kit demo, set the INFERENCE_PLATFORM environment variable to "fal" and add a FAL_KEY environment variable. First, create an account with fal.ai, navigate to the keys page, and follow the steps to create a key. Copy the result into the .env file and save it as FAL_KEY.
INFERENCE_PLATFORM=fal
FAL_KEY=***
Currently, only the moondream model is available with fal. Stay tuned for llava7B and llava34B.
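With three possible backends, it can help to resolve the platform in one place and fail early when a required key is missing. The following is a sketch under stated assumptions: resolveInferencePlatform is hypothetical, matching is case-insensitive, and an unset variable falls back to Ollama because this README describes Ollama as the default. The repo's own selection logic may differ.

```typescript
type InferencePlatform = "ollama" | "openai" | "fal";

// Hypothetical resolver: maps INFERENCE_PLATFORM to a known platform and checks
// that the key that platform needs is present.
function resolveInferencePlatform(
  env: Record<string, string | undefined>
): InferencePlatform {
  const raw = (env.INFERENCE_PLATFORM ?? "").trim().toLowerCase();
  switch (raw) {
    case "":
    case "ollama":
      return "ollama"; // assumed default when the variable is unset
    case "openai":
      if (!env.OPENAI_API_KEY) throw new Error("OPENAI_API_KEY is not set");
      return "openai";
    case "fal":
      if (!env.FAL_KEY) throw new Error("FAL_KEY is not set");
      return "fal";
    default:
      throw new Error(`Unknown INFERENCE_PLATFORM: ${raw}`);
  }
}
```

In a Node.js app this could be called once at startup as resolveInferencePlatform(process.env).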
Tigris is 100% aws cli compatible. Here are some frequently used commands during active development:
Press 'v' to toggle the voice. This pauses the narration, which resumes from the point where it was paused.
fly storage dashboard BUCKET_NAME
Currently, temporary files for the snapshots that get passed to the model and the ElevenLabs voice files are stored in the bucket and are not cleaned up. To clean these up, you can run the following from the CLI:
aws s3 rm s3://BUCKET_NAME/ --endpoint-url https://fly.storage.tigris.dev --recursive --exclude "*.mp4"
aws s3 cp PATH_TO_YOUR_VIDEO s3://BUCKET_NAME --endpoint-url https://fly.storage.tigris.dev