My timestamps with whisperX are way off #810
Comments
Try to set …
I had that set to False from the outset.
Did you test with both …
I'm not aware of that setting. Is it available via Python? What does it do?
Of course. The alignment model can heavily improve your timestamp accuracy.

```python
import whisperx
import gc

device = "cuda"
audio_file = "audio.mp3"
batch_size = 16  # reduce if low on GPU mem
compute_type = "float16"  # change to "int8" if low on GPU mem (may reduce accuracy)

# 1. Transcribe with original whisper (batched)
model = whisperx.load_model("large-v2", device, compute_type=compute_type)

# save model to local path (optional)
# model_dir = "/path/"
# model = whisperx.load_model("large-v2", device, compute_type=compute_type, download_root=model_dir)

audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)
print(result["segments"])  # before alignment

# delete model if low on GPU resources
# import gc; gc.collect(); torch.cuda.empty_cache(); del model

# 2. Align whisper output
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)

print(result["segments"])  # after alignment
```
Oh, that's what you mean. Of course I am using the alignment model; that's what I meant by the wav2vec2 model in the OP. So you are suggesting I should turn it off?
I genuinely don't know if it helps, but why not give it a try? Maybe the alignment model for your target language is not working properly?
Try specifying a different pre-trained model for alignment; maybe that's the issue. You can make use of this script; it's a wrapper on top of whisperx to customise the pre-trained models and run in offline mode.
OK, so you are confirming that it’s the alignment model that is causing the problem. Thing is: it’s not so easy to find one for Swedish. Thanks for the link to the script. I’m not sure I understand the offline part, though. Isn’t whisperX offline by default? I’ll have to take a closer look to figure out the role of the wespeaker model. Does it replace the wav2vec2 model?
I'm not an expert in ASR tasks, but since you've mentioned that the timestamps are way off, it could be due to some issue with the alignment model. Well, whisperx isn't completely offline; you'll have to provide the … The wespeaker model is for speaker embedding, which is used in the diarization task. For alignment, whisperx uses wav2vec2.
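For context, a minimal sketch of where speaker embeddings come into play in whisperx, following the diarization step described in the project README (the Hugging Face token is a placeholder, and `audio`/`result` are assumed to come from the transcription and alignment snippet above); the wav2vec2 alignment step is separate from this:

```python
import whisperx

device = "cuda"

# Diarization assigns speaker labels; it does not affect the alignment timestamps.
diarize_model = whisperx.DiarizationPipeline(use_auth_token="YOUR_HF_TOKEN", device=device)
diarize_segments = diarize_model(audio)

# Merge speaker labels into the aligned transcript
result = whisperx.assign_word_speakers(diarize_segments, result)
print(result["segments"])  # segments/words now carry a "speaker" key
```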
@tophee I just went through the code in whisperx; it doesn't have an alignment model set up for Swedish (see whisperx/alignment.py, lines 24 to 58 at f2da2f8).
You can manually download the model …
Yes, there is no "built-in" alignment model for Swedish. That's why I'm using KBLab/wav2vec2-large-voxrex-swedish. This is my alignment code:

```python
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device, model_name="KBLab/wav2vec2-large-voxrex-swedish")
result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)
```
Oh, I see. I'm just reading the model from the cache (…).
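If it helps, here is a minimal sketch of forcing the Hugging Face libraries to read from an already-populated local cache instead of the network. The environment variables are standard huggingface_hub/transformers settings, not whisperx-specific, and the model name mirrors the one used earlier in this thread:

```python
import os

# Use models already present in the local Hugging Face cache and avoid network calls.
# These must be set before transformers/whisperx load anything.
os.environ["HF_HUB_OFFLINE"] = "1"        # huggingface_hub: no network access
os.environ["TRANSFORMERS_OFFLINE"] = "1"  # transformers: load only from cache
# Optionally relocate the cache itself:
# os.environ["HF_HOME"] = "/path/to/hf-cache"

import whisperx  # imported after the env vars on purpose

model_a, metadata = whisperx.load_align_model(
    language_code="sv",
    device="cuda",
    model_name="KBLab/wav2vec2-large-voxrex-swedish",
)
```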
All the wespeaker models are pretrained in English or Chinese. Do you know whether these will work with other languages?
What is the …
Yeah, we need an internet connection initially to download the models.
I doubt it'll work with good accuracy for other languages. I found this for Swedish: https://spraakbanken.gu.se/en/resources/embeddings. I'll share if I find anything useful.
I have used WhisperX with … When you open an issue, please try to …
For me, the following code snippet produces high quality alignments:

```python
import whisperx

device = "cuda"
audio_file = "data/1991/RD_EN_L_1991-04-09_1991-04-10.1.mp3"
batch_size = 16  # reduce if low on GPU mem
compute_type = "float16"  # change to "int8" if low on GPU mem (may reduce accuracy)

model = whisperx.load_model("large-v2", device, compute_type=compute_type, language="sv")
audio = whisperx.load_audio(audio_file, sr=16000)
result = model.transcribe(audio, batch_size=batch_size)

model_a, metadata = whisperx.load_align_model(
    model_name="KBLab/wav2vec2-large-voxrex-swedish", device=device, language_code="sv"
)
result = whisperx.align(
    result["segments"], model_a, metadata, audio, device, return_char_alignments=False
)
```

Versions of libraries: …
I don't think there's any issue with the alignment model. But it's not possible to help you without details about your code and environment. *Edit: One source of error for …
The timestamps I am getting from whisperX are way off (we are talking about 10-15 seconds, sometimes less, sometimes more) and I have no idea why this is so.
Today, I noticed that the wav2vec2 model I'm using (KBLab/wav2vec2-large-voxrex-swedish) wants a 16kHz sampling rate, so I downsampled all my wav files, but there is no improvement in the timestamps. I'm not sure how to troubleshoot this, so any hints are appreciated.
Just to make sure I understand the basics correctly: the timestamps are generated by the wav2vec2 model, right?
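As a side note on the 16 kHz point: if whisperx.load_audio follows openai-whisper's convention of a 16000 Hz target sample rate (which the snippets above suggest), it already decodes via ffmpeg and resamples for you, so manually downsampling the wav files should not be necessary. A minimal sketch, with a hypothetical file name:

```python
import whisperx

# load_audio resamples to the requested rate regardless of the source file's rate,
# so the original 44.1/48 kHz wav files can be passed in directly
audio = whisperx.load_audio("recording.wav", sr=16000)
print(audio.shape, audio.dtype)  # 1-D float32 array at 16 kHz
```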