Transcribe, translate, annotate and subtitle audio and video files with OpenAI's Whisper ... fast!
whisply combines faster-whisper and insanely-fast-whisper to offer an easy-to-use solution for batch processing files on Windows, Linux and Mac. It also enables word-level speaker annotation by integrating whisperX and pyannote.
- 🚴‍♂️ Performance: whisply selects the fastest Whisper implementation based on your hardware:
  - CPU/GPU (Nvidia CUDA): faster-whisper or whisperX
  - MPS (Apple M1-M4): insanely-fast-whisper
- ⏩ large-v3-turbo Ready: Support for whisper-large-v3-turbo on all devices. Note: Subtitling and annotations on CPU/GPU use whisperX for accurate timestamps, but whisper-large-v3-turbo isn't currently available for whisperX.
- ✅ Auto Device Selection: whisply automatically chooses faster-whisper (CPU) or insanely-fast-whisper (MPS, Nvidia GPUs) for transcription and translation unless a specific --device option is passed.
- 🗣️ Word-level Annotations: Enabling --subtitle or --annotate uses whisperX or insanely-fast-whisper for word segmentation and speaker annotations. whisply approximates missing timestamps for numeric words.
- 💬 Customizable Subtitles: Specify words per subtitle block (e.g., "5") to generate .srt and .webvtt files with fixed word counts and timestamps.
- 🧺 Batch Processing: Handle single files, folders, URLs, or lists via .list documents. See the Batch processing section for details.
- 👩‍💻 CLI / App: whisply can be run directly from the CLI or as an app with a graphical user interface (GUI).
- ⚙️ Export Formats: Supports .json, .txt, .txt (annotated), .srt, .webvtt, .vtt, and .rttm.
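The fixed-word-count subtitle blocks described under Customizable Subtitles can be pictured with a short sketch. It assumes word-level timestamps are already available as (word, start, end) tuples; the function names and data layout are illustrative only and are not whisply's internal API.

```python
from typing import List, Tuple

Word = Tuple[str, float, float]  # (text, start_seconds, end_seconds)


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: List[Word], sub_length: int = 5) -> str:
    """Group words into blocks of `sub_length` words and render them as SRT."""
    if sub_length < 1:
        raise ValueError("sub_length must be at least 1")
    blocks = []
    for index, i in enumerate(range(0, len(words), sub_length), start=1):
        chunk = words[i:i + sub_length]
        # A block spans from its first word's start to its last word's end.
        start, end = chunk[0][1], chunk[-1][2]
        text = " ".join(word[0] for word in chunk)
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    return "\n".join(blocks)
```

With a block size of 5 (the CLI default for --sub_length), every subtitle entry except possibly the last carries exactly five words.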
- FFmpeg
- Python >= 3.10
- GPU processing requires:
  - Nvidia GPU (CUDA: cuBLAS and cuDNN 8 for CUDA 12)
  - Apple Metal Performance Shaders (MPS) (Mac M1-M4)
- Speaker annotation requires a HuggingFace Access Token
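The two basic requirements above (Python version and FFmpeg) can be checked before installing. The following is a minimal sketch using only the standard library; the function name is hypothetical, and it does not attempt to detect CUDA or MPS.

```python
import shutil
import sys


def check_requirements(min_python=(3, 10)):
    """Return a list of human-readable problems; empty if the basics look fine."""
    problems = []
    if sys.version_info < min_python:
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, "
            f"found {sys.version_info.major}.{sys.version_info.minor}"
        )
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_requirements():
        print("Missing requirement:", problem)
```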
GPU Fix for "Could not load library libcudnn_ops_infer.so.8"
If you use whisply on a Linux system with an Nvidia GPU and get this error:
"Could not load library libcudnn_ops_infer.so.8. Error: libcudnn_ops_infer.so.8: cannot open shared object file: No such file or directory"
Run the following line in your CLI:
export LD_LIBRARY_PATH=`python3 -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))'`
Add this line to your Python environment to make it permanent:
echo "export LD_LIBRARY_PATH=\`python3 -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + \":\" + os.path.dirname(nvidia.cudnn.lib.__file__))'\`" >> path/to/your/python/env
For more information please refer to the faster-whisper GitHub page.
1. Install ffmpeg
--- macOS ---
brew install ffmpeg
--- Linux ---
sudo apt-get update
sudo apt-get install ffmpeg
--- Windows ---
https://ffmpeg.org/download.html
2. Clone this repository and change to project folder
git clone https://github.com/tsmdt/whisply.git
cd whisply
3. Create a Python virtual environment
python3 -m venv venv
4. Activate the Python virtual environment
source venv/bin/activate
5. Install whisply with pip
pip install .
or
pip install whisply
$ whisply
Usage: whisply [OPTIONS]
WHISPLY 💬 Transcribe, translate, annotate and subtitle audio and video files with OpenAI's Whisper ... fast!

Options:
  --files               -f    TEXT                                Path to file, folder, URL or .list to process. [default: None]
  --output_dir          -o    DIRECTORY                           Folder where transcripts should be saved. [default: transcriptions]
  --device              -d    [auto|cpu|gpu|mps]                  Select the computation device: CPU, GPU (NVIDIA), or MPS (Mac M1-M4). [default: auto]
  --model               -m    TEXT                                Whisper model to use (List models via --list_models). [default: large-v2]
  --lang                -l    TEXT                                Language of provided file(s) ("en", "de") (Default: auto-detection). [default: None]
  --annotate            -a                                        Enable speaker annotation (Saves .rttm).
  --hf_token            -hf   TEXT                                HuggingFace Access token required for speaker annotation. [default: None]
  --translate           -t                                        Translate transcription to English.
  --subtitle            -s                                        Create subtitles (Saves .srt, .vtt and .webvtt).
  --sub_length                INTEGER                             Subtitle segment length in words. [default: 5]
  --export              -e    [all|json|txt|rttm|vtt|webvtt|srt]  Choose the export format. [default: all]
  --verbose             -v                                        Print text chunks during transcription.
  --del_originals       -del                                      Delete original input files after file conversion. (Default: False)
  --config                    PATH                                Path to configuration file. [default: None]
  --list_filetypes                                                List supported audio and video file types.
  --list_models                                                   List available models.
  --install-completion                                            Install completion for the current shell.
  --show-completion                                               Show completion for the current shell, to copy it or customize the installation.
  --help                                                          Show this message and exit.
Instead of running whisply from the CLI, you can start the web app:
$ python app.py
Open the local URL in your browser after starting the app (Note: The URL might differ from system to system):
* Running on local URL: http://127.0.0.1:7860
To annotate speakers using --annotate, you need to provide a valid HuggingFace access token using the --hf_token option. Additionally, you must accept the user conditions for both pyannote models involved: pyannote/segmentation-3.0 and pyannote/speaker-diarization-3.1.
For detailed instructions, refer to the Requirements section on the pyannote model page on HuggingFace and make sure that you complete steps "2. Accept pyannote/segmentation-3.0 user conditions", "3. Accept pyannote/speaker-diarization-3.1 user conditions" and "4. Create access token at hf.co/settings/tokens".
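Once the token exists, one way to avoid typing it on the command line is to read it from an environment variable and assemble the call in a small script. This is only a sketch: the variable name HF_TOKEN and the file name interview.mp4 are assumptions for illustration, while --files, --annotate and --hf_token are the options listed in the help output.

```python
import os


def build_annotate_command(media_path, token_env="HF_TOKEN"):
    """Build a whisply command line that enables speaker annotation.

    The token is read from the environment variable named by `token_env`
    (a hypothetical convention, not something whisply itself defines).
    """
    token = os.environ.get(token_env)
    if not token:
        raise RuntimeError(f"Set the {token_env} environment variable first")
    return ["whisply", "--files", media_path, "--annotate", "--hf_token", token]
```

The returned list can be passed to subprocess.run(cmd, check=True) to start the transcription.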
whisply uses whisperX for speaker diarization and annotation. Instead of returning chunk-level timestamps like the standard Whisper implementation, whisperX returns word-level timestamps and annotates speakers word by word, which yields much more precise annotations.
Out of the box, whisperX does not provide timestamps for words containing only numbers (e.g. "1.5" or "2024"); whisply fixes those instances through timestamp approximation. Other known limitations of whisperX include:
- inaccurate speaker diarization if multiple speakers talk at the same time
- to provide word-level timestamps and annotations, whisperX uses language-specific alignment models; out of the box, whisperX supports these languages: en, fr, de, es, it, ja, zh, nl, uk, pt.
Refer to the whisperX GitHub page for more information.
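The README does not spell out how the timestamp approximation for numeric words works. One plausible approach, shown here purely as an assumption, is to interpolate linearly between the nearest words that do have timestamps. The dict layout ('word', 'start', 'end') and the function name are illustrative.

```python
from typing import Dict, List


def _is_timed(word: Dict) -> bool:
    return word.get("start") is not None and word.get("end") is not None


def fill_missing_timestamps(words: List[Dict]) -> List[Dict]:
    """Fill missing 'start'/'end' values by interpolating between timed neighbours."""
    result = [dict(w) for w in words]  # copy so the input is left untouched
    i = 0
    while i < len(result):
        if _is_timed(result[i]):
            i += 1
            continue
        # Find the run [i, j) of consecutive words without timestamps.
        j = i
        while j < len(result) and not _is_timed(result[j]):
            j += 1
        # Anchors: end of the previous timed word, start of the next one.
        left = result[i - 1]["end"] if i > 0 else None
        right = result[j]["start"] if j < len(result) else None
        if left is None and right is None:
            break  # no timed word anywhere, nothing to anchor on
        if left is None:
            left = right
        if right is None:
            right = left
        # Split the gap evenly across the untimed words.
        step = (right - left) / (j - i)
        for k in range(i, j):
            result[k]["start"] = left + (k - i) * step
            result[k]["end"] = left + (k - i + 1) * step
        i = j
    return result
```

For a single numeric word between two timed words, this assigns the whole gap between its neighbours to that word.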
Instead of providing a single file, folder or URL with the --files option, you can pass a .list file containing a mix of files, folders and URLs for processing.
Example:
$ cat my_files.list
video_01.mp4
video_02.mp4
./my_files/
https://youtu.be/KtOayYXEsN4?si=-0MS6KXbEWXA7dqo
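A .list file like the one above can also be inspected programmatically before a batch run. The sketch below reads one entry per line and labels each as a URL, folder or file. How whisply itself classifies entries is not documented here, so the rules in this helper are assumptions.

```python
from pathlib import Path
from urllib.parse import urlparse


def read_list_file(list_path):
    """Return (kind, entry) tuples for each non-empty line of a .list file."""
    entries = []
    for line in Path(list_path).read_text(encoding="utf-8").splitlines():
        entry = line.strip()
        if not entry:
            continue  # skip blank lines
        if urlparse(entry).scheme in ("http", "https"):
            kind = "url"
        elif entry.endswith(("/", "\\")) or Path(entry).is_dir():
            kind = "folder"
        else:
            kind = "file"
        entries.append((kind, entry))
    return entries
```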
You can provide a .json config file with the --config option, which makes batch processing easy. An example config looks like this:
{
    "files": "./files/my_files.list",        # Path to your files
    "output_dir": "./transcriptions",        # Output folder where transcriptions are saved
    "device": "auto",                        # "auto", "gpu", "mps" or "cpu"
    "model": "large-v3-turbo",               # Whisper model to use
    "lang": null,                            # null for auto-detection, or a language code ("en", "de", ...)
    "annotate": false,                       # Annotate speakers
    "hf_token": "HuggingFace Access Token",  # Your HuggingFace Access Token (needed for annotations)
    "translate": false,                      # Translate to English
    "subtitle": false,                       # Subtitle file(s)
    "sub_length": 10,                        # Length of each subtitle block in number of words
    "export": "txt",                         # Export .txt files only
    "verbose": false                         # Print transcription segments while processing
}
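Standard JSON has no comment syntax, so the # annotations in the example are best read as documentation; whether whisply's config loader tolerates them is not stated here. A cautious way to produce a strictly valid file is to generate it with Python's json module. The sketch below uses the same keys as the example; the helper name and output path are illustrative.

```python
import json
from pathlib import Path

config = {
    "files": "./files/my_files.list",
    "output_dir": "./transcriptions",
    "device": "auto",
    "model": "large-v3-turbo",
    "lang": None,  # serialised as null: auto-detect the language
    "annotate": False,
    "hf_token": "HuggingFace Access Token",  # placeholder, replace with your token
    "translate": False,
    "subtitle": False,
    "sub_length": 10,
    "export": "txt",
    "verbose": False,
}


def write_config(path, settings):
    """Write settings as strict JSON and return the matching whisply command."""
    Path(path).write_text(json.dumps(settings, indent=2), encoding="utf-8")
    return ["whisply", "--config", str(path)]
```

Calling write_config("my_config.json", config) writes the file and returns the argument list for whisply --config my_config.json.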