Can NeMo do Forced Alignment with provided transcript and match audio file - & word level timing? #7041

CodeFusionFX · 2023-07-17T05:23:09Z

Can NeMo do forced alignment between a provided transcript and its matching audio file, and produce word-level timings?
Can it also produce sentence-level or phrase-level timings?

Is the accuracy better than Montreal Forced Aligner (MFA)?

If the above is true, would that make NeMo the best and most accurate forced aligner?

Answered by erastorgueva-nv

erastorgueva-nv · 2023-07-18T22:25:43Z

erastorgueva-nv
Jul 18, 2023
Collaborator

Hello, thank you for your question.

You can indeed do all of the above with NeMo, using a tool inside this repository called NeMo Forced Aligner (NFA). You can find information on how to use it here: https://github.com/NVIDIA/NeMo/tree/main/tools/nemo_forced_aligner. We are also planning to release a tutorial on how to use NFA soon.

By default, NFA produces token-level and word-level (i.e. substrings separated by spaces) timings. It is also possible to obtain sentence-level or phrase-level timings. This is because NFA can produce timings for user-defined groups of words (in NFA, we call these “segments”). To make sure NFA produces these timings, you need to mark the boundaries between segments in your text, and then, when you run NFA, make sure you specify the marker you used.

For example, make sure that the manifest you pass to NFA looks like:

{"text": "abc def <segment_split> ghi jkl <segment_split> mno ...", "audio_filepath": "..."}
...

And make sure you specify additional_segment_grouping_separator="<segment_split>" when you run NFA's align.py script.
You can use a marker other than "<segment_split>" to denote boundaries between segments if you wish, as long as the marker in the manifest text matches the one you specify.
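
As a rough illustration of the manifest format described above, here is a minimal Python sketch that joins pre-split sentences or phrases with the separator and writes one JSON object per line. The helper names (make_manifest_line, write_manifest) and the file paths in the usage section are hypothetical and not part of NFA; only the "text" and "audio_filepath" keys and the "<segment_split>" marker come from the answer above.

```python
import json

# Marker placed between segments; must match the value given to
# additional_segment_grouping_separator when running NFA's align.py.
SEPARATOR = "<segment_split>"


def make_manifest_line(audio_filepath, segments, separator=SEPARATOR):
    """Return one manifest line (a JSON string) for a single audio file.

    `segments` is a list of sentences or phrases, already in spoken order.
    """
    # Surround the marker with spaces so it stays a separate token.
    text = f" {separator} ".join(segment.strip() for segment in segments)
    return json.dumps(
        {"text": text, "audio_filepath": audio_filepath},
        ensure_ascii=False,
    )


def write_manifest(manifest_path, entries, separator=SEPARATOR):
    """Write a manifest with one JSON object per line.

    `entries` is an iterable of (audio_filepath, segments) pairs.
    """
    with open(manifest_path, "w", encoding="utf-8") as f:
        for audio_filepath, segments in entries:
            f.write(make_manifest_line(audio_filepath, segments, separator) + "\n")


if __name__ == "__main__":
    # Hypothetical paths and text, for illustration only.
    entries = [
        ("/data/audio/utt1.wav", ["abc def", "ghi jkl", "mno"]),
    ]
    write_manifest("manifest.json", entries)
    print(make_manifest_line(*entries[0]))
```

The resulting file can then be passed to NFA together with the same separator string; see the NFA README linked above for the other arguments that align.py expects.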

We have observed that NFA does obtain better alignments than MFA for audio that is not trivially short.
In general, yes, we have observed that NFA is the most accurate and fastest forced aligner in the comparisons that we have conducted.
