Applying transcription to already created timecodes? #288
-
Since faster-whisper can return word-level timestamps, the output can be split anywhere to produce new segments that meet your requirements. See, for example, the options whisper-ctranslate2 exposes; I think this is what you are looking for. The only issue is that the timestamps from Whisper are not always accurate, so some subtitles may appear too early or too late.
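As a rough illustration of what I mean, here is a minimal sketch that re-splits the word-level timestamps into segments that stay under a character budget. It assumes faster-whisper's Python API with word_timestamps=True; the model size, audio path, and the 36-character limit are placeholders, not recommendations:

```python
# Minimal sketch: regroup faster-whisper word-level timestamps into
# subtitle segments that stay under a character budget.
# Model size, audio path and the 36-character limit are placeholders.
from faster_whisper import WhisperModel

MAX_CHARS = 36  # illustrative per-segment character budget

model = WhisperModel("small", device="cpu", compute_type="int8")
segments, info = model.transcribe("audio.mp3", word_timestamps=True)

new_segments = []   # list of (start, end, text) tuples
current_words = []

for segment in segments:
    for word in segment.words:
        candidate = "".join(w.word for w in current_words) + word.word
        if current_words and len(candidate.strip()) > MAX_CHARS:
            # Close the current segment and start a new one at this word.
            new_segments.append((
                current_words[0].start,
                current_words[-1].end,
                "".join(w.word for w in current_words).strip(),
            ))
            current_words = []
        current_words.append(word)

if current_words:
    new_segments.append((
        current_words[0].start,
        current_words[-1].end,
        "".join(w.word for w in current_words).strip(),
    ))

for start, end, text in new_segments:
    print(f"[{start:7.2f} -> {end:7.2f}] {text}")
```

Each tuple in new_segments can then be written out as an .srt cue; the split rule (a plain character budget) is only an example of where you could cut.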
-
Thank you for your reply! Just wanted to start by saying I LOVE what you've done here. As for the issue, I am in fact using whisper-ctranslate2, and here's an example of the problem I'm having.
The command line I'm using: whisper-ctranslate2
CLI output:
This looks fine; the subtitles are broken into logical sentences, as they should be. This is before the captions are broken into 37-character, two-line subtitles.
Actual .srt generated:
Here, the constraints have been applied, but the subs are broken in all the wrong places, and captions continuing into the next timecode follow no semantic logic. I'd imagine that's expected since the algorithm for breaking up the subs has no idea of their contents.
So my thinking is: if I were able to directly provide the dict with timecodes, maybe the program would be able to generate subs that are better suited for production? If that's possible, how would I go about doing it? I'm sure this could be helpful to MANY people doing subtitling.
-
For video subtitling, the automatic speech recognition is REALLY good, and it has surprised me how it even picks up different Latin American Spanish variants and accents. Unfortunately, the segmentation isn't up to par, no matter how much I've tried to tweak the parameters. Production-ready subtitles are usually constrained to a maximum of 36 characters per line and two lines per subtitle. Passing these requirements to faster-whisper does produce the expected output, but the subtitle segmentation and line breaks are understandably all over the place and have no regard for best practices on where to cut subtitles.
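For concreteness, the kind of invocation I mean is along the lines of `whisper-ctranslate2 input.mp4 --output_format srt --word_timestamps True --max_line_width 36 --max_line_count 2`; those flag names follow the openai/whisper writer options that whisper-ctranslate2 mirrors, so double-check them against `--help` on your version.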
So if I took a video and manually created empty timecodes, would it be possible to pass those timecodes in for transcription? I'm definitely no coder, but I can see in the program's code that a dict is created with all the start and end times as items. Then some magic I don't understand happens, and the transcribe method extracts the dialog from those chunks. So I'm guessing I could take an .srt, trivially convert its timecodes into the dict format the program expects, and then pass it as an argument for transcription? Since the timecodes would be cut at precise word boundaries (before connectors, after commas, etc.), the model should produce subtitles that need little tweaking to be production-ready. Or am I missing something here?
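To make the idea concrete, here is a rough, stand-alone sketch of what I have in mind, written against faster-whisper's public Python API rather than whatever internal dict the program actually builds (I don't know its exact shape); the file names are placeholders:

```python
# Stand-alone sketch of the idea (not the program's internal dict format,
# which I don't know): take timecodes from an existing .srt, then fill each
# window with the words faster-whisper reports inside it.
import re
from faster_whisper import WhisperModel

def parse_srt_windows(path):
    """Return a list of (start_seconds, end_seconds) from an .srt file."""
    pattern = re.compile(
        r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})"
    )
    windows = []
    with open(path, encoding="utf-8") as f:
        for match in pattern.finditer(f.read()):
            h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, match.groups())
            windows.append((
                h1 * 3600 + m1 * 60 + s1 + ms1 / 1000,
                h2 * 3600 + m2 * 60 + s2 + ms2 / 1000,
            ))
    return windows

# Placeholder file names.
windows = parse_srt_windows("manual_timecodes.srt")

model = WhisperModel("small")
segments, _ = model.transcribe("video_audio.wav", word_timestamps=True)

# Flatten all words, then assign each word to the window containing its midpoint.
words = [w for seg in segments for w in seg.words]
cues = []
for start, end in windows:
    text = "".join(
        w.word for w in words if start <= (w.start + w.end) / 2 < end
    ).strip()
    cues.append((start, end, text))

for start, end, text in cues:
    print(f"{start:8.2f} -> {end:8.2f}: {text}")
```

The bucketing rule (assigning each word to the window containing its midpoint) is just one possible choice; the point is only that pre-made timecodes could drive the segmentation instead of the program's own splitting.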
I would really appreciate any help on what the format of that dict is and which method I should be passing it to!