-
Shouldn't these options explicitly prohibit a 20-second silence in mid-line? Especially when the line is only 25 characters, about 5 words, anyway?! What am I doing wrong?
P.S. Is there also a way to output both LRC and SRT at the same time? I need both, I don't want to run it twice, and I don't trust the converter I wrote.
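For anyone else who needs both formats and doesn't trust a home-grown converter: here is a minimal sketch of an SRT-to-LRC conversion, assuming simple SRT cues and keeping only each cue's start time (LRC has no end times). The function name and regex are mine, not from any tool in this thread:

```python
import re

def srt_to_lrc(srt_text: str) -> str:
    """Convert SRT cues to LRC lines, keeping only each cue's start time."""
    lrc_lines = []
    # Match "HH:MM:SS,mmm --> ..." followed by the caption text,
    # terminated by a blank line or end of input.
    pattern = re.compile(
        r"(\d{2}):(\d{2}):(\d{2}),(\d{3})\s*-->\s*[\d:,]+\s*\n(.+?)(?:\n\n|\Z)",
        re.S,
    )
    for h, m, s, ms, text in pattern.findall(srt_text):
        minutes = int(h) * 60 + int(m)       # LRC folds hours into minutes
        centis = int(ms) // 10               # LRC uses centiseconds
        text = " ".join(text.splitlines())   # flatten multi-line cues
        lrc_lines.append(f"[{minutes:02d}:{int(s):02d}.{centis:02d}]{text}")
    return "\n".join(lrc_lines)
```

This ignores cue numbering and styling entirely; it is a sanity-check sketch, not a full SRT parser.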
-
I also run into this ... a single word left on the screen for 60 seconds during a guitar solo. It's like Whisper thinks somebody has a VERRRRRY legato voice 😂
-
No.
You are touching the VAD settings; don't touch them. You can try:
--vad_alt_method pyannote_v3
--output_format all
-
I've now run this through several hundred songs.
After a lot of trial and error I'm up to the 20th iteration of my prompt, frequently trying multiple runs on one file to see what's best. And what's best for me is the settings I have currently. I still don't understand why there can be a 30-second silence inside the same subtitle.
-
Accuracy of transcription is not "pointless". I don't know why you're so emotional about this and turned it into a philosophical discussion. It was a technical question about command-line parameters and how to avoid a 30-second silence in mid-caption.

Technical: if I turned on word timestamps, wouldn't it know the words are 30 seconds apart and be able to separate them? Just as whisper-large-v3 isn't as good as whisper-large-v2, the model you use depends on the data it was given. Whisper-large-v3 also hallucinates during silence because of improperly imported YouTube subtitles that weren't actually spoken in the videos used as training data.

Philosophical: people and data aren't perfect. That's why we need options and custom solutions, and why I have to roll my own when nobody else's is quite right for me. I've been doing that since the 1980s.

Technical: and clearly pyannote's test data did not include the kinds of music I listen to. It's awful for heavy music.

Philosophical: people's lived experiences count.
-
Surely, if word timestamps were on, there could be a parameter to separate two words that are a certain number of seconds apart into separate subtitles. I wish I'd thought of this when I asked the question; it's a good idea.
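That idea is easy to sketch: given word-level timestamps, start a new subtitle whenever the silence between consecutive words exceeds a chosen gap. This is a hand-rolled illustration, not an existing CLI option; the `(start, end, text)` tuple format is an assumption:

```python
def split_on_gaps(words, max_gap=2.0):
    """Split a list of (start, end, text) word timestamps into groups
    wherever the silence between consecutive words exceeds max_gap seconds."""
    groups = []
    current = []
    for word in words:
        # Gap = this word's start minus the previous word's end.
        if current and word[0] - current[-1][1] > max_gap:
            groups.append(current)
            current = []
        current.append(word)
    if current:
        groups.append(current)
    return groups
```

Each group would then become its own subtitle, so a 30-second pause mid-line turns into two separate captions instead of one long one.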
-
Just FYI: the max-gap options seemed to help with the silences between words, but what REALLY, REALLY helped lyrics split properly as they are sung? Adding a period to the end of EVERY line of a downloaded lyric file, even when it doesn't make grammatical sense.

It was my wife's idea. "Why not add invisible periods?" she said. "No such thing, how dumb!" I said, then immediately realized its brilliance. They can be removed afterward, hence technically "invisible". The line breaks in posted lyrics are usually good points at which to consider ending a subtitle.

Combined with --sentence, this fixed my problems with long captions; words now come out as they are sung. This was NOT the case with the exact same options but a comma at the end of each line instead of a period. So I ended up writing a postprocessor in Perl to strip all the trailing periods afterward.
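The invisible-period trick (the author's postprocessor was in Perl) can be sketched in a few lines of Python: append a period to every lyric line before using it as the prompt, then strip trailing periods from the finished captions. Function names are mine, for illustration only:

```python
def add_invisible_periods(lyrics: str) -> str:
    """Append a period to every non-empty lyric line that doesn't already
    end in sentence punctuation, nudging the model to break captions there."""
    out = []
    for line in lyrics.splitlines():
        stripped = line.rstrip()
        if stripped and not stripped.endswith((".", "!", "?")):
            out.append(stripped + ".")
        else:
            out.append(stripped)
    return "\n".join(out)

def remove_invisible_periods(captions: str) -> str:
    """Postprocess: strip trailing periods so they stay 'invisible'.
    Note this also removes a legitimate final period or ellipsis."""
    return "\n".join(line.rstrip().rstrip(".") for line in captions.splitlines())
```

Whether the punctuation actually survives into caption boundaries depends on the model and options; this only shows the mechanical pre/post steps.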