"initial_prompt" appears to progressively override audio for longer streams #278
Comments
So I dug into this a bit more and was able to confirm that basically two things are happening when I use the websocket connection and the faster-whisper version (I assume it's the same for TensorRT but cannot verify):
if options.initial_prompt is not None:
    if isinstance(options.initial_prompt, str):
        initial_prompt = " " + options.initial_prompt.strip()
        initial_prompt_tokens = tokenizer.encode(initial_prompt)
        all_tokens.extend(initial_prompt_tokens)
    else:
        all_tokens.extend(options.initial_prompt)

the result is that even when 'turned on' the context is never extended with earlier content, it is called once for each new clip with the

If I send the initial_prompt only during the first 10-20s of the stream it works well. Otherwise it starts to override the content of the audio.

I also tried sharing the 'last_segment' by extending

result, info = self.transcriber.transcribe(
    input_sample,
    timestamp_offset=self.timestamp_offset,  # added to track global state in transcribe
    last_segment=self.last_segment,  # added to track 'latest' text segment in transcribe
    initial_prompt=self.initial_prompt,
    language=self.language,
    task=self.task,
    vad_filter=self.use_vad,
    vad_parameters=self.vad_parameters if self.use_vad else None)
self.last_segment = result

this worked a little bit better, but unfortunately seemed to result in a lot of new 'gaps' in the STT results; presumably because the

It may be just a need to more carefully time-align the 'most recent' partial output with the current clip - like the infrastructure in

Maybe there's something else I'm missing here as well.
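The workaround mentioned above (passing initial_prompt only during the first 10-20s of the stream) could be sketched roughly as follows. This is only an illustration, not code from WhisperLive: the helper name `prompt_for_clip` and the `cutoff_seconds` parameter are hypothetical, and it assumes the server tracks a running `timestamp_offset` in seconds for the stream.

```python
from typing import Optional


def prompt_for_clip(initial_prompt: Optional[str],
                    timestamp_offset: float,
                    cutoff_seconds: float = 20.0) -> Optional[str]:
    """Return the prompt only while the stream is younger than cutoff_seconds.

    Hypothetical helper: both the name and the 20 s default are assumptions
    taken from the 10-20s observation in this thread.
    """
    if not initial_prompt:
        # No prompt configured, nothing to pass on.
        return None
    if timestamp_offset >= cutoff_seconds:
        # Past the cutoff: stop biasing the decoder with the prompt.
        return None
    return initial_prompt
```

The returned value would then be passed as `initial_prompt=` in the `transcribe(...)` call for each clip, so later clips are decoded without the prompt. Whether a hard cutoff is the right fix, rather than time-aligning earlier output with the current clip, is still open.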
Good day, Sir. Could you share more observations on this issue? (I do not see it in real-time transcription from the microphone.) By the way, just a question: where is the code below located?
@zeliang3 it is here: WhisperLive/whisper_live/transcriber.py Line 464 in be71657
I haven't had a chance to look at it closely again. I see it constantly in the websocket. I'm using it in streaming mode over a websocket in a ReactJS web application.

Can you provide a minimum usage example for your microphone based approach? I have not tried this myself. Maybe I'll have better luck comparing it against a working alternative. I'll be happy to invest another day or so in this and provide a pull request if I can suss it out; but I either need a bit more free time, or some kind of hint.
Just simply call client(), and it will choose the current microphone, bro @AdolfVonKleist
I've been using WhisperLive with great success recently in multiple languages. Seriously amazing. I recently noticed the support for initial_prompt, which was added in January, and tried applying it to my use case.

I have noticed that while the initial_prompt value works amazingly well during the first 10-20s of a conversation, when we get beyond this point it suddenly starts to completely override the input audio.

For example I'll specify a 'corrected' spelling for a company name: SupaSqrrl DIE-namics instead of Super Squirrel Dynamics. In the first 20s any utterances of this phrase will be perfectly transcribed according to the initial_prompt value I've added: SupaSqrrl DIE-namics. However as the conversation progresses this boosted phrase will start to override all other input speech and the recognizer will just end up outputting the initial_prompt over and over again.

I thought maybe the prompt was being provided repeatedly somewhere in the code, but after a cursory review of the source I didn't see anything like that.
I'm wondering if anyone else has experienced something similar?
edit: I also can confirm I don't see this behavior in longer files when I transcribe in batch mode with whisperx or faster-whisper.
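For anyone trying to reproduce this, one way to pinpoint when the override begins is to flag segments that contain nothing but the prompt. The check below is a rough sketch only; `is_prompt_echo` is a hypothetical helper, not part of WhisperLive or faster-whisper, and it assumes you have each segment's text as a plain string.

```python
import re


def _normalize(text: str) -> str:
    # Lowercase and collapse punctuation/whitespace so "DIE-namics." matches "die namics".
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def is_prompt_echo(segment_text: str, initial_prompt: str) -> bool:
    """Return True if the segment is only one or more repetitions of the prompt."""
    prompt = _normalize(initial_prompt)
    segment = _normalize(segment_text)
    if not prompt or not segment:
        return False
    # Remove every occurrence of the prompt; a pure echo leaves nothing behind.
    remainder = segment.replace(prompt, " ").strip()
    return remainder == ""
```

Logging the stream offset of the first segment where this returns True would show whether the override really starts around the 10-20s mark.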