Applying transcription to already created timecodes? #288
-
Since faster-whisper can return word-level timestamps, the output can be split anywhere to produce new segments that meet your requirements. See, for example, the options whisper-ctranslate2 exposes; I think this is what you are looking for. The only issue is that the timestamps from Whisper are not always accurate, so some subtitles may appear too early or too late.
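As a rough illustration of what I mean, here is a minimal sketch that re-splits the word-level timestamps into segments that stay under a character budget. It assumes faster-whisper's Python API with word_timestamps=True; the model size, audio path, and the 36-character limit are placeholders, not recommendations:

```python
# Minimal sketch: regroup faster-whisper word-level timestamps into
# subtitle segments that stay under a character budget.
# Model size, audio path and the 36-character limit are placeholders.
from faster_whisper import WhisperModel

MAX_CHARS = 36  # illustrative per-segment character budget

model = WhisperModel("small", device="cpu", compute_type="int8")
segments, info = model.transcribe("audio.mp3", word_timestamps=True)

new_segments = []   # list of (start, end, text) tuples
current_words = []

for segment in segments:
    for word in segment.words:
        candidate = "".join(w.word for w in current_words) + word.word
        if current_words and len(candidate.strip()) > MAX_CHARS:
            # Close the current segment and start a new one at this word.
            new_segments.append((
                current_words[0].start,
                current_words[-1].end,
                "".join(w.word for w in current_words).strip(),
            ))
            current_words = []
        current_words.append(word)

if current_words:
    new_segments.append((
        current_words[0].start,
        current_words[-1].end,
        "".join(w.word for w in current_words).strip(),
    ))

for start, end, text in new_segments:
    print(f"[{start:7.2f} -> {end:7.2f}] {text}")
```

Each tuple in new_segments can then be written out as an .srt cue; the split rule (a plain character budget) is only an example of where you could cut.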
-
Thank you for your reply! Just wanted to start by saying I LOVE what you've done here. As for the issue, I am in fact using whisper-ctranslate2, and here's an example of the problem I'm having.
The command line I'm using: whisper-ctranslate2
CLI output:
This looks fine; the subtitles are broken into logical sentences, as they should be. This is before the captions are broken into 37-character, two-line subtitles.
Actual .srt generated:
Here, the constraints have been applied, but the subs are broken in all the wrong places, and captions continuing into the next timecode follow no semantic logic. I'd imagine that's expected since the algorithm for breaking up the subs has no idea of their contents.
So my thinking is: if I were able to directly provide the dict with timecodes, maybe the program would be able to generate subs that are better suited for production? If that's possible, how would I go about doing it? I'm sure this could be helpful to MANY people doing subtitling.
-
For video subtitling, the automatic speech recognition is REALLY good, and it has surprised me how it even picks up different Latin American Spanish variants and accents. Unfortunately, the segmentation isn't up to par, no matter how much I've tried to tweak the parameters. Production-ready subtitles are usually constrained to a maximum of 36 characters per line and two lines per subtitle. Passing these requirements to faster-whisper does produce the expected output, but the subtitle segmentation and line breaks are understandably all over the place and have no regard for best practices on where to cut subtitles.
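For concreteness, the kind of invocation I mean is along the lines of `whisper-ctranslate2 input.mp4 --output_format srt --word_timestamps True --max_line_width 36 --max_line_count 2`; those flag names follow the openai/whisper writer options that whisper-ctranslate2 mirrors, so double-check them against `--help` on your version.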
So if I took a video and manually created empty timecodes, would it be possible to pass those timecodes in for transcription? I'm definitely no coder, but I can see in the program's code that a dict is created with all the start and end times as items. Then some magic I don't understand happens, and the transcribe method extracts the dialog from those chunks. So I'm guessing I could take an .srt, trivially convert its timecodes into the dict format the program expects, and then pass it as an argument for transcription? Since the timecodes would be cut at precise word boundaries (before connectors, after commas, etc.), the model should produce subtitles that need little tweaking to be production-ready. Or am I missing something here?
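To make the idea concrete, here is a rough, stand-alone sketch of what I have in mind, written against faster-whisper's public Python API rather than whatever internal dict the program actually builds (I don't know its exact shape); the file names are placeholders:

```python
# Stand-alone sketch of the idea (not the program's internal dict format,
# which I don't know): take timecodes from an existing .srt, then fill each
# window with the words faster-whisper reports inside it.
import re
from faster_whisper import WhisperModel

def parse_srt_windows(path):
    """Return a list of (start_seconds, end_seconds) from an .srt file."""
    pattern = re.compile(
        r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})"
    )
    windows = []
    with open(path, encoding="utf-8") as f:
        for match in pattern.finditer(f.read()):
            h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, match.groups())
            windows.append((
                h1 * 3600 + m1 * 60 + s1 + ms1 / 1000,
                h2 * 3600 + m2 * 60 + s2 + ms2 / 1000,
            ))
    return windows

# Placeholder file names.
windows = parse_srt_windows("manual_timecodes.srt")

model = WhisperModel("small")
segments, _ = model.transcribe("video_audio.wav", word_timestamps=True)

# Flatten all words, then assign each word to the window containing its midpoint.
words = [w for seg in segments for w in seg.words]
cues = []
for start, end in windows:
    text = "".join(
        w.word for w in words if start <= (w.start + w.end) / 2 < end
    ).strip()
    cues.append((start, end, text))

for start, end, text in cues:
    print(f"{start:8.2f} -> {end:8.2f}: {text}")
```

The bucketing rule (assigning each word to the window containing its midpoint) is just one possible choice; the point is only that pre-made timecodes could drive the segmentation instead of the program's own splitting.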
I would really appreciate any help on what the format of that dict is and which method I should be passing it to!