Live transcription PoC with the Whisper model (using the `faster-whisper` package) in a server (REST API) - client (Gradio UI / CLI) setup, where the server can handle multiple clients. (The server runs separately, so it can be used with any client-side code.)
Sample with a MacBook Pro (M1):

`test-transcription-on-m1-mac.mov` (🔈 sound on; `faster-whisper` package, `base` model - latency was around 0.5 sec)
```bash
$ pip install -r requirements.txt
$ mkdir models
```
- Before running `server.py`, modify the parameters inside the file.
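These parameters most likely map onto how faster-whisper loads the model; a minimal sketch of that API (the exact names in `server.py` may differ, and `sample.wav` is a placeholder file):

```python
# Minimal faster-whisper loading/transcription sketch - the parameter names in
# server.py may differ; "models" matches the directory created above.
from faster_whisper import WhisperModel

model = WhisperModel(
    "base",                  # model size: tiny / base / small / medium / large-v2
    device="auto",           # "cpu", "cuda", or "auto"
    compute_type="int8",     # quantization; "float16" is common on GPU
    download_root="models",  # cache model files in ./models
)

# transcribe() returns a generator of segments plus metadata about the audio
segments, info = model.transcribe("sample.wav", beam_size=5)  # placeholder file
print(" ".join(segment.text for segment in segments))
```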
```bash
# Start the server (REST API)
python server.py
```
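The route name below is an assumption (check `server.py` for the real one); assuming a `/transcribe` endpoint that accepts a file upload, a quick smoke test could look like:

```bash
# Hypothetical endpoint and port - check server.py for the real route
curl -X POST -F "audio=@sample.wav" http://localhost:8000/transcribe
```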
```bash
# Start the Gradio interface on localhost (HTTP)
python ui_client.py

# Start the Gradio interface with Gradio's share link - this way it'll be HTTPS without the need for certs
SHARE=1 python ui_client.py

# Start the Gradio interface with your own certs
SSL_CERT_PATH=<PATH> SSL_KEY_PATH=<PATH> python ui_client.py
```
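For reference, these environment variables plausibly map onto Gradio's standard `launch()` options; a rough sketch (the actual wiring in `ui_client.py` may differ):

```python
# Sketch of how the env vars above could map to Gradio's launch() options;
# the real ui_client.py may wire this up differently.
import os

import gradio as gr

def transcribe(audio):
    return "placeholder - the real client forwards audio to the REST server"

demo = gr.Interface(fn=transcribe, inputs=gr.Audio(), outputs="text")
demo.launch(
    share=os.environ.get("SHARE") == "1",          # SHARE=1 -> Gradio share link (HTTPS)
    ssl_certfile=os.environ.get("SSL_CERT_PATH"),  # your own certs, if provided
    ssl_keyfile=os.environ.get("SSL_KEY_PATH"),
)
```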
To use the CLI client instead:

```bash
python server.py
python cli_client.py
```
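Roughly, a CLI client like this records short microphone chunks and POSTs them to the server; a sketch under that assumption (the endpoint URL and the helper libraries are not taken from the repo):

```python
# Sketch of a minimal CLI client: record 1-second chunks and send them to the
# server. The endpoint URL is an assumption - check server.py for the real one.
import io

import requests
import sounddevice as sd
from scipy.io import wavfile

SAMPLE_RATE = 16000  # Whisper models expect 16 kHz mono audio
URL = "http://localhost:8000/transcribe"  # hypothetical route

while True:
    chunk = sd.rec(SAMPLE_RATE, samplerate=SAMPLE_RATE, channels=1, dtype="int16")
    sd.wait()  # block until the 1-second recording finishes
    buf = io.BytesIO()
    wavfile.write(buf, SAMPLE_RATE, chunk)  # wrap raw samples in a WAV container
    resp = requests.post(URL, files={"audio": ("chunk.wav", buf.getvalue())})
    print(resp.text)
```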
There are a few parameters in each script that you can modify. This beautiful piece of art explains how the windowing works:

- `step = 1`
- `length = 4`

`t` is the current chunk (1 second of audio, to be precise):
```
------------------------------------------
1st second: [t, 0, 0, 0]       --> "Hi"
2nd second: [t-1, t, 0, 0]     --> "Hi I am"
3rd second: [t-2, t-1, t, 0]   --> "Hi I am the one"
4th second: [t-3, t-2, t-1, t] --> "Hi I am the one and only Gabor"
5th second: [t, 0, 0, 0]       --> "How" --> here the process started again, and the output continues on a new line
6th second: [t-1, t, 0, 0]     --> "How are"
etc...
------------------------------------------
```
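A minimal sketch of that sliding window (the names and the dummy chunks are illustrative; `step = 1` shows up as one new chunk entering the window per iteration):

```python
# Sketch of the sliding-window logic from the diagram above.
from typing import Iterable, Iterator, List

LENGTH = 4  # the window holds at most 4 one-second chunks (length = 4)

def sliding_windows(chunks: Iterable[bytes]) -> Iterator[List[bytes]]:
    """Yield the growing window after each new chunk; reset when it is full."""
    window: List[bytes] = []
    for chunk in chunks:        # step = 1: one new chunk per iteration
        window.append(chunk)
        yield list(window)      # transcribe this window -> one line of output
        if len(window) >= LENGTH:
            window = []         # full window: restart, output moves to a new line

# Dummy 1-second "chunks" to show the window sizes: 1, 2, 3, 4, 1, 2, ...
for w in sliding_windows([b"a", b"b", b"c", b"d", b"e", b"f"]):
    print(f"{len(w)} second(s) of audio in the window")
```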
- Use a VAD on the client side and either send the audio for transcription when we detect a longer silence (e.g. 1 sec), or fall back to the maximum length if there is no silence (see the sketch after this list).
- Transcribe shorter timeframes to get more instant transcriptions, and meanwhile use larger timeframes to "correct" already transcribed parts (async correction).
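As a concrete example of the VAD idea, the `webrtcvad` package (an assumption, it is not part of this repo) can flag silent frames on the client; the buffer is flushed to the server after roughly 1 sec of silence, or once it reaches the maximum length:

```python
# Sketch of client-side VAD gating with the webrtcvad package (an assumption,
# not part of this repo). Frames must be 10/20/30 ms of 16-bit mono PCM.
import webrtcvad

SAMPLE_RATE = 16000
FRAME_MS = 30                      # duration of each frame fed into the VAD
SILENCE_FRAMES = 1000 // FRAME_MS  # ~1 sec of consecutive silence triggers a send
MAX_FRAMES = 4000 // FRAME_MS      # fall back to ~4 sec maximum buffer length

class VadBuffer:
    def __init__(self) -> None:
        self.vad = webrtcvad.Vad(2)  # aggressiveness: 0 (lenient) to 3 (strict)
        self.frames: List[bytes] = []
        self.silent = 0

    def feed(self, frame: bytes) -> Optional[bytes]:
        """Feed one frame; return the buffered audio when it is time to send."""
        self.silent = 0 if self.vad.is_speech(frame, SAMPLE_RATE) else self.silent + 1
        self.frames.append(frame)
        if self.silent >= SILENCE_FRAMES or len(self.frames) >= MAX_FRAMES:
            audio = b"".join(self.frames)
            self.frames, self.silent = [], 0
            return audio  # POST this chunk to the transcription server
        return None

from typing import List, Optional  # imports used in the annotations above
```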