
Issue with Periods in Dates or numbers Causing Incorrect Segment Splitting in German Transcriptions #811

Open
sijitang opened this issue May 26, 2024 · 0 comments


When transcribing German audio files with WhisperX and aligning them with a model loaded via whisperx.load_align_model, I noticed that dates containing periods (e.g., "Mai.2022") and numbers followed by a period are incorrectly split into separate segments. The period appears to be interpreted as the end of a sentence, which leads to inaccurate time alignment and text segmentation. Two examples from the output:

[00:00:59,879 --> 00:01:00,039] Die 5.
[00:01:00,479 --> 00:01:12,850] Strafkammer sah es als erwiesen an, dass der 52-Jährige wissentlich eine verbotene Nazi-Parole bei einer AfD-Veranstaltung im Mai 2021 verbreitet hatte.

[00:14:13,689 --> 00:14:15,992] Und nun die Wettervorhersage für morgen Mittwoch, den 15.
[00:14:16,032 --> 00:14:16,093] Mai.
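
As a possible workaround until the splitting itself is fixed, the segments could be merged again in a post-processing step. The following is only a minimal sketch, not part of WhisperX: `merge_ordinal_splits` and `ORDINAL_END` are hypothetical names, and the code assumes each segment is a dict with `start`, `end`, and `text` keys. It joins a segment to the previous one whenever the previous text ends in a number followed by a period.

```python
import re

# Hypothetical helper, not a WhisperX API.
# Matches text that ends in a standalone number plus period, e.g. "Die 5." or "den 15."
ORDINAL_END = re.compile(r"(?:^|\s)\d{1,4}\.$")


def merge_ordinal_splits(segments):
    """Merge a segment into its predecessor when the predecessor ends in '<number>.'.

    Assumption: each segment is a dict with "start", "end", and "text" keys.
    Other keys (for example word-level timings) are kept from the first
    segment only and would need their own merge logic.
    """
    merged = []
    for seg in segments:
        if merged and ORDINAL_END.search(merged[-1]["text"].strip()):
            prev = merged[-1]
            # Join the texts and extend the time span to cover both pieces.
            prev["text"] = prev["text"].strip() + " " + seg["text"].strip()
            prev["end"] = seg["end"]
        else:
            # Copy so the caller's input list is not modified.
            merged.append(dict(seg))
    return merged


example = [
    {"start": 853.689, "end": 855.992,
     "text": "Und nun die Wettervorhersage für morgen Mittwoch, den 15."},
    {"start": 856.032, "end": 856.093, "text": "Mai."},
]
for seg in merge_ordinal_splits(example):
    print(seg["start"], seg["end"], seg["text"])
```

Note that this heuristic would also merge across a genuine sentence boundary when a sentence really ends in a number (for example a year), so it would need tuning for real transcripts.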

sijitang changed the title from "Issue with Periods in Dates or number Causing Incorrect Segment Splitting in German Transcriptions" to "Issue with Periods in Dates or numbers Causing Incorrect Segment Splitting in German Transcriptions" on May 26, 2024