Transcribe YouTube videos with speaker diarization using Google Gemini. Crafted for interview videos.

madeyexz/youtube2transcripts

YouTube2Transcripts (with Speaker Diarization)

A Python tool that downloads YouTube videos as audio and generates detailed transcriptions with speaker diarization using Google's Gemini AI. The tool supports batch processing of multiple URLs.

Example Result

Here's a sample transcript generated by the tool from the 48-minute video "Elon Musk - CEO of Tesla Motors and SpaceX | Entrepreneurship | Khan Academy". The full transcript is in ExampleTranscript.md; an excerpt follows:

Elon Musk: Um yeah, I mean I I I mean the goal is not uh just it's not sort of to become a bra big brand or to compete with uh Honda Civics uh but rather to advance the course of electric vehicles.

Interviewer: Hmm.

Elon Musk: So we're just going to keep making more and more electric cars and driving the price point down until the industry is very firmly electric. You know, like maybe half of all cars made are electric or something like that. Which is not to say that we expect to make all half of all cars. We we we want to just have that catalytic effect until at least that occurs. And I think at the point which there's you know we're approaching half of all new cars made are electric then I think that's I would consider that to be kind of the victory condition.

Interviewer: Wow.

Elon Musk: Um And and so the faster we can bring that day the the better.

Interviewer: Wow. When when would be your guess when that happens?

Elon Musk: Um well, I made a bet with someone about 3 years ago that it would be sooner than 20 years, so it's 17 years from now. But I but that's I think I I that's conservative. I think it's probably you know but maybe maybe 13 or 14 years, something like that.

Interviewer: Wow. Right right about the time Right when we're going to Mars. It'll it'll be it'll be exciting exciting times.

Elon Musk: Yeah. Absolutely. True. That's that's it just could yeah. Exactly. I was just thinking about that. It was like oh those time frames are kind of coincident.

So it does transcribe the video, and it also identifies the speakers (well, more or less).

Features

  • Downloads YouTube videos as MP3 audio files
  • Splits long audio files into manageable chunks
  • Generates detailed transcriptions with speaker identification
  • Supports batch processing of multiple URLs
  • Includes rate limiting and retry mechanisms
  • Progress tracking with tqdm
  • Concurrent processing with ThreadPoolExecutor

Prerequisites

  • A Google Gemini API key (you can create one in Google AI Studio)

Installation

  1. Clone the repository:
    git clone https://github.com/madeyexz/youtube2transcripts.git
    cd youtube2transcripts
  2. Install the required packages:
    pip install -r requirements.txt
  3. Create a .env file in the project root and add your Gemini API key:
    GEMINI_API_KEY=your_api_key_here
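The README doesn't show how the script reads this key. python-dotenv's `load_dotenv()` is a common choice for this; as a dependency-free sketch, assuming the .env file contains simple KEY=VALUE lines, loading could look like the following. `load_env_file` and `get_api_key` are hypothetical names, not the project's code.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines and export them (hypothetical helper).

    Blank lines and '#' comments are skipped, surrounding quotes are
    stripped, and variables already set in the environment are kept.
    """
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # partition splits on the first '=' only, so values may contain '='.
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip().strip('"').strip("'")
        values[key] = value
        os.environ.setdefault(key, value)
    return values


def get_api_key() -> str:
    """Return GEMINI_API_KEY, loading .env first (hypothetical helper)."""
    load_env_file()
    key = os.environ.get("GEMINI_API_KEY")
    if not key:
        raise RuntimeError("GEMINI_API_KEY is not set; add it to .env")
    return key
```

Failing early with a clear message is friendlier than letting the first Gemini request fail with an authentication error.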

Usage

  1. Create and activate a virtual environment and install the dependencies (this example uses uv):
    uv venv .venv && source .venv/bin/activate && uv pip install -r requirements.txt
  2. Run the script in one of two ways:

    A. Using the GUI:

    python run.py

    This opens a graphical interface where you can paste a URL and read the transcript.

    B. Using the command-line interface:

    python youtube_transcriber.py

    When prompted, enter YouTube URLs one per line. Press Enter twice when done:

    https://youtube.com/watch?v=example1
    https://youtube.com/watch?v=example2
    [Press Enter twice to start processing]
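The prompt loop described above (read one URL per line, stop at an empty line) could be sketched roughly as follows. `read_urls` is a hypothetical name, and this sketch does no URL validation; the real script may behave differently.

```python
import sys
from typing import TextIO


def read_urls(stream: TextIO = sys.stdin) -> list[str]:
    """Collect URLs one per line until an empty line (hypothetical helper)."""
    urls: list[str] = []
    for line in stream:
        line = line.strip()
        if not line:
            # Enter after the last URL submits it; the second Enter
            # produces this empty line, which ends input. Empty lines
            # before any URL has been entered are simply ignored.
            if urls:
                break
            continue
        urls.append(line)
    return urls
```

Passing the stream as a parameter keeps the function testable with `io.StringIO` instead of a real terminal.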
    

The script will:

  1. Download audio from each URL
  2. Split the audio into 20-minute chunks (because Gemini has an output limit of 8k tokens, which is roughly 30 minutes of people talking)
  3. Process each chunk with Gemini 1.5 Flash
  4. Generate and save transcripts in the transcripts_better directory
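If 8k output tokens cover roughly 30 minutes of speech, a 20-minute chunk should use only about two thirds of the output budget, leaving headroom for speaker labels and unusually dense talk. A minimal sketch of computing chunk boundaries in milliseconds follows; milliseconds are the unit pydub's `AudioSegment` uses for slicing, but whether this project uses pydub is an assumption, and `chunk_bounds` is a hypothetical helper.

```python
def chunk_bounds(total_ms: int, chunk_minutes: int = 20) -> list[tuple[int, int]]:
    """Split [0, total_ms) into consecutive (start_ms, end_ms) windows.

    The last window is shorter when total_ms isn't a multiple of the
    chunk length. Hypothetical helper, not the project's actual code.
    """
    if chunk_minutes <= 0:
        raise ValueError("chunk_minutes must be positive")
    if total_ms <= 0:
        return []
    chunk_ms = chunk_minutes * 60 * 1000
    return [
        (start, min(start + chunk_ms, total_ms))
        for start in range(0, total_ms, chunk_ms)
    ]
```

For the 48-minute example video this yields three chunks: two full 20-minute windows and a final 8-minute one. With pydub, `len(AudioSegment.from_mp3(path))` gives the duration in milliseconds, and `audio[start:end]` slices out each chunk.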

Output

Transcripts are saved as markdown files in the transcripts_better directory with the following naming convention:

transcript_[sanitized_video_title].md
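The README doesn't spell out the sanitization rules. A plausible sketch, assuming characters that are invalid in Windows or POSIX filenames are replaced with underscores, is shown below; `transcript_path` is a hypothetical helper.

```python
import re
from pathlib import Path

# Characters that are unsafe in filenames on common platforms,
# plus ASCII control characters.
_UNSAFE = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def transcript_path(title: str, out_dir: str = "transcripts_better") -> Path:
    """Build <out_dir>/transcript_<sanitized title>.md (hypothetical helper)."""
    safe = _UNSAFE.sub("_", title)
    # Collapse runs of whitespace and trim leading/trailing spaces and dots.
    safe = re.sub(r"\s+", " ", safe).strip(" .")
    if not safe:
        safe = "untitled"
    return Path(out_dir) / f"transcript_{safe}.md"
```

Under these rules, the example video's title becomes `transcript_Elon Musk - CEO of Tesla Motors and SpaceX _ Entrepreneurship _ Khan Academy.md`.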

Configuration

The current model is gemini-1.5-flash, a multimodal model that accepts audio input directly, which is what made this project possible.

The key constants listed below can be changed in the script, but modifying them is not recommended, especially if you are using the free tier of the Gemini API.

  • CALLS_PER_SECOND: API rate limit (default: 1.8)
  • MAX_WORKERS: Maximum concurrent jobs (default: 2)
  • chunk_duration: Audio chunk size in minutes (default: 20)
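To illustrate how CALLS_PER_SECOND and MAX_WORKERS might interact, here is a minimal thread-safe limiter that spaces requests at least 1/CALLS_PER_SECOND seconds apart across a ThreadPoolExecutor. This is a sketch: `RateLimiter` and `transcribe_chunk` are hypothetical stand-ins, and the project may use a different limiting scheme.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

CALLS_PER_SECOND = 1.8
MAX_WORKERS = 2


class RateLimiter:
    """Spaces calls at least 1/calls_per_second seconds apart, across threads."""

    def __init__(self, calls_per_second: float) -> None:
        self.interval = 1.0 / calls_per_second
        self._lock = threading.Lock()
        self._next_time = float("-inf")

    def wait(self) -> None:
        # Reserve the next free slot under the lock, then sleep outside it
        # so other threads can reserve later slots in the meantime.
        with self._lock:
            slot = max(time.monotonic(), self._next_time)
            self._next_time = slot + self.interval
        delay = slot - time.monotonic()
        if delay > 0:
            time.sleep(delay)


limiter = RateLimiter(CALLS_PER_SECOND)


def transcribe_chunk(chunk_id: int) -> str:
    """Hypothetical placeholder for one Gemini request."""
    limiter.wait()
    return f"transcript for chunk {chunk_id}"


def transcribe_all(chunk_ids: list[int]) -> list[str]:
    # executor.map returns results in input order even though calls overlap.
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        return list(executor.map(transcribe_chunk, chunk_ids))
```

Because the limiter is shared, raising MAX_WORKERS alone does not increase the request rate; only CALLS_PER_SECOND does.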

Error Handling

The script includes:

  • Automatic retries for failed API calls
  • Rate limiting to prevent API throttling
  • Comprehensive logging
  • File existence checks to prevent duplicate downloads
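Automatic retries are commonly implemented with exponential backoff. A generic sketch follows, assuming every exception is retryable (a real client would retry only on rate-limit or other transient errors); `with_retries` and its parameters are hypothetical.

```python
import logging
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def with_retries(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying with exponential backoff plus jitter on failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:  # sketch: treats every error as retryable
            if attempt == max_attempts:
                raise  # out of attempts: re-raise the last error
            # Delays grow 1s, 2s, 4s, ... plus up to 0.1s of random jitter.
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1)
            logger.warning("Attempt %d failed (%s); retrying in %.1fs", attempt, exc, delay)
            sleep(delay)
    raise AssertionError("unreachable")
```

The injectable `sleep` parameter lets tests run instantly; in normal use the default `time.sleep` applies.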

Contributing

Contributions are welcome! Please feel free to submit a pull request.

Disclaimer

This tool is for educational purposes only. Please ensure you have the right to download and process any YouTube content before using this tool.
