A web-app/library for transcribing speech
- Install Python 3.9
- Install ffmpeg
  - Windows: Download the zip and add `ffmpeg/bin` to the environment path
  - Linux: `apt-get install ffmpeg`
- Install the requirements: `pip install -r requirements.txt`
- (Optional) Download the punctuator model and save it as `INTERSPEECH-T-BRNN.pcl`
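
The prerequisites above can be checked with a short script. This is a minimal sketch, not part of the project: the function name `check_prerequisites` is hypothetical, and it assumes the optional punctuator model is saved in the current working directory, which the steps above do not specify.

```python
import shutil
import sys
from pathlib import Path


def check_prerequisites(model_path="INTERSPEECH-T-BRNN.pcl"):
    """Return a dict describing which prerequisites appear to be in place."""
    return {
        # The setup steps target Python 3.9; only major/minor is compared.
        "python_3_9": sys.version_info[:2] == (3, 9),
        # shutil.which returns None when the executable is not on PATH.
        "ffmpeg_on_path": shutil.which("ffmpeg") is not None,
        # Optional punctuator model (its location here is an assumption).
        "punctuator_model": Path(model_path).is_file(),
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

A `missing` entry for `punctuator_model` is not an error, since that step is optional.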
Run `pip install flask` before running the web app. Then run `python app.py` to open the web app at http://localhost:5000/
`python main.py --path filename --transcriber transcriber`
- Path: Path to the audio/video file to transcribe
- Transcriber: Transcription model to use, choose from:
  - cmu_sphinx
  - librispeech
  - silero
  - vosk
  - wav2vec2
  - wav2vec2_commonvoice
  - whisper
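
To illustrate how the two options fit together, here is a minimal `argparse` sketch. It is not the project's actual `main.py`: the `build_parser` helper, the choice to make both options required, and the example file name `interview.mp4` are assumptions made for illustration. Only the option names and the transcriber list come from the usage above.

```python
import argparse

# Transcriber names taken from the list above.
TRANSCRIBERS = [
    "cmu_sphinx", "librispeech", "silero", "vosk",
    "wav2vec2", "wav2vec2_commonvoice", "whisper",
]


def build_parser():
    """Build a parser that mirrors the documented command-line options."""
    parser = argparse.ArgumentParser(description="Transcribe an audio/video file.")
    parser.add_argument("--path", required=True,
                        help="Path to the audio/video file to transcribe")
    # `choices` makes argparse reject any name that is not in the list.
    parser.add_argument("--transcriber", required=True, choices=TRANSCRIBERS,
                        help="Transcription model to use")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--path", "interview.mp4", "--transcriber", "vosk"])
print(args.path, args.transcriber)
```

With `choices` set, a misspelled transcriber name exits with a usage message instead of reaching the transcription code.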
When selecting transcription models, the following requirements were used:
- Must be supported in Python 3.9
- Must work locally (without the usage of an API)
- Must have a straightforward installation process
- Should not require building from source
- Should not require additional OS libraries
- Should not require manually downloading additional files
The table below compares transcription model performance on the LibriSpeech test-clean dataset, as measured by the analysis script.
| Name | Dependencies | Model Size | Average processing time | Score |
|---|---|---|---|---|
| Wav2Vec2 CommonVoice | speechbrain | 1.18GB | 3.351s | 0.87 |
| Librispeech | torch, transformers, torchaudio, librosa | 113MB | 0.558s | 0.85 |
| Wav2Vec2 | torch, transformers, torchaudio, librosa | 360MB | 1.325s | 0.8 |
| Whisper | whisper | 138MB | 3.848s | 0.77 |
| Vosk | vosk | 67.7MB | 1.206s | 0.76 |
| Silero | torch, transformers, torchaudio, librosa, omegaconf | 111MB | 0.261s | 0.68 |
| CMU Sphinx | SpeechRecognition, pocketsphinx | 33.9MB* | 1.123s | 0.55 |
*size of pocketsphinx package
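
The speed/score trade-off in the table can be explored with a few lines of Python. This is a sketch under two assumptions: a higher score is better (the table does not say how the score is computed), and each table row maps to the transcriber name used on the command line. The `Model` type and the `pick_model` helper are hypothetical names, not part of the project.

```python
from typing import NamedTuple, Optional


class Model(NamedTuple):
    name: str
    seconds: float  # average processing time from the table
    score: float    # score from the table


# Values copied from the comparison table above.
MODELS = [
    Model("wav2vec2_commonvoice", 3.351, 0.87),
    Model("librispeech", 0.558, 0.85),
    Model("wav2vec2", 1.325, 0.80),
    Model("whisper", 3.848, 0.77),
    Model("vosk", 1.206, 0.76),
    Model("silero", 0.261, 0.68),
    Model("cmu_sphinx", 1.123, 0.55),
]


def pick_model(max_seconds: float) -> Optional[Model]:
    """Return the highest-scoring model within the time budget, or None."""
    candidates = [m for m in MODELS if m.seconds <= max_seconds]
    return max(candidates, key=lambda m: m.score, default=None)


print(pick_model(1.0).name)  # librispeech
print(pick_model(0.5).name)  # silero
```

The timings are averages from one benchmark run, so treat the chosen budget as a rough guide and not a guarantee for other hardware or audio.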