From 8a172df69250da231742893387e9a1dd01b16e3d Mon Sep 17 00:00:00 2001
From: Eric Harper
Date: Fri, 1 Jul 2022 14:16:55 -0600
Subject: [PATCH] Merge r1.10.0 main (#4486)

* update branch
Signed-off-by: ericharper
* Fix ASR Typos in tutorials (#4384)
* Fix typos
Signed-off-by: smajumdar
* Quick wav2vec fix. In-place operation adding convolutional positions to encoder was overwriting leaf history. Wasn't caught on previous torch versions. (#4383)
Signed-off-by: tbartley94
Co-authored-by: tbartley94
(cherry picked from commit 0322b158f26a0b690edca7a84714e33752283923)
Co-authored-by: Travis Bartley
* Fix tutorial typos and docs (#4415)
* Fix typos
Signed-off-by: smajumdar
* Fix typos
Signed-off-by: smajumdar
* Add ASR Scores to Docs (#4412)
* Fix link
Signed-off-by: smajumdar
* Correct model card
Signed-off-by: smajumdar
* Add ASR Results to Docs
Signed-off-by: smajumdar
* Update info
Signed-off-by: smajumdar
* Update info
Signed-off-by: smajumdar
* docs: add table overflow handling for nested sections (#4441)
Co-authored-by: Nick Goncharenko
* Docs: Decrease Font Size on Tables (#4444)
* docs: add table overflow handling for nested sections
* docs: set table font-size to small
Co-authored-by: Nick Goncharenko
* Updated notebook to fix batch configuration and precision bugs (#4447)
* Updated notebook to fix batch configuration and precision bugs
Signed-off-by: Virginia Adams
* Deleted cell outputs
Signed-off-by: Virginia Adams
* Set datasets back to full dataset
Signed-off-by: Virginia Adams
Co-authored-by: Eric Harper
* fix branch in link (#4454)
Signed-off-by: ekmb
* [TTS] [bugfix] German FastPitch HiFi-GAN tutorial and lr (#4459)
* [TN] Bug fix: expand serial coverage of unknown symbol, remove constraints from word graph (#4463)
* remove constraints from word graph det
Signed-off-by: ekmb
* add measure units to serial
Signed-off-by: ekmb
* revert serial changes, update jenkins path
Signed-off-by: ekmb
* fix test case
Signed-off-by: ekmb
* update indentation (#4468)
Signed-off-by: Akshit Arora
* t5-rpe-fix targeting r1.10.0; raise exception for PP>2. (#4469)
Signed-off-by: Hoo Chang Shin
Co-authored-by: Hoo Chang Shin
* Fix some 's' cases for IPA G2P (#4460)
Signed-off-by: Jocelyn Huang
Co-authored-by: Eric Harper
* Refactor bias act fusion (#4376)
* Refactor bias act fusion
Signed-off-by: MaximumEntropy
* Update NMT config
Signed-off-by: MaximumEntropy
* Update ci tests
Signed-off-by: MaximumEntropy
* Empty
Signed-off-by: MaximumEntropy
* Add kwargs to exact string match (#4479)
Signed-off-by: MaximumEntropy
* Try fix (#4484)
Signed-off-by: MaximumEntropy
* update branch
Signed-off-by: ericharper

Co-authored-by: Somshubra Majumdar
Co-authored-by: Travis Bartley
Co-authored-by: Nick Goncharenko <8766167+nickolyamba@users.noreply.github.com>
Co-authored-by: Nick Goncharenko
Co-authored-by: Virginia Adams <78445382+vadam5@users.noreply.github.com>
Co-authored-by: Evelina <10428420+ekmb@users.noreply.github.com>
Co-authored-by: Akshit Arora
Co-authored-by: khcs
Co-authored-by: Hoo Chang Shin
Co-authored-by: Jocelyn
Co-authored-by: Sandeep Subramanian
---
 Jenkinsfile                                   |  38 +-
 docs/source/_static/css/custom.css            | 228 ++++----
 .../conf/megatron_bart_config.yaml            |   5 +-
 .../conf/megatron_retro_config.yaml           |   2 +-
 .../conf/megatron_t5_config.yaml              |   1 -
 .../megatron_t5_lm_adaptation_finetune.yaml   |   2 +-
 .../conf/megatron_ul2_config.yaml             |   5 +-
 .../conf/aayn_base_megatron.yaml              |   5 +-
 .../tts/conf/de/fastpitch_align_22050.yaml    |   2 +-
 .../common/metrics/classification_accuracy.py |   4 +-
 .../megatron_lm_encoder_decoder_model.py      |  23 +-
 .../modules/common/megatron/transformer.py    |  72 ++-
 nemo/collections/tts/torch/g2ps.py            |  13 +-
 .../en/data/measure/unit.tsv                  |   3 +
 .../en/data/whitelist/symbol.tsv              |   2 +
 .../text_normalization/en/taggers/word.py     |   5 +
 .../test_cases_measure.txt                    |   1 +
 .../test_cases_normalize_with_audio.txt       |   2 +-
 .../test_cases_punctuation.txt                |   1 +
 .../Non_English_Downstream_Tasks_(NER).ipynb  |   2 +-
.../tts/Fastpitch_Training_GermanTTS.ipynb | 550 ++++++++---------- 21 files changed, 484 insertions(+), 482 deletions(-) diff --git a/Jenkinsfile b/Jenkinsfile index 178d2a2b2144..6ba65f340eab 100644 --- a/Jenkinsfile +++ b/Jenkinsfile @@ -137,18 +137,18 @@ pipeline { parallel { stage('En TN grammars') { steps { - sh 'CUDA_VISIBLE_DEVICES="" python nemo_text_processing/text_normalization/normalize.py --text="1" --cache_dir /home/TestData/nlp/text_norm/ci/grammars/6-14-22' + sh 'CUDA_VISIBLE_DEVICES="" python nemo_text_processing/text_normalization/normalize.py --text="1" --cache_dir /home/TestData/nlp/text_norm/ci/grammars/6-28-22' } } stage('En ITN grammars') { steps { - sh 'CUDA_VISIBLE_DEVICES="" python nemo_text_processing/inverse_text_normalization/inverse_normalize.py --language en --text="twenty" --cache_dir /home/TestData/nlp/text_norm/ci/grammars/6-14-22' + sh 'CUDA_VISIBLE_DEVICES="" python nemo_text_processing/inverse_text_normalization/inverse_normalize.py --language en --text="twenty" --cache_dir /home/TestData/nlp/text_norm/ci/grammars/6-28-22' } } stage('Test En non-deterministic TN & Run all En TN/ITN tests (restore grammars from cache)') { steps { - sh 'CUDA_VISIBLE_DEVICES="" python nemo_text_processing/text_normalization/normalize_with_audio.py --text "\$.01" --n_tagged 2 --cache_dir /home/TestData/nlp/text_norm/ci/grammars/6-14-22' - sh 'CUDA_VISIBLE_DEVICES="" pytest tests/nemo_text_processing/en/ -m "not pleasefixme" --cpu --tn_cache_dir /home/TestData/nlp/text_norm/ci/grammars/6-14-22' + sh 'CUDA_VISIBLE_DEVICES="" python nemo_text_processing/text_normalization/normalize_with_audio.py --text "\$.01" --n_tagged 2 --cache_dir /home/TestData/nlp/text_norm/ci/grammars/6-28-22' + sh 'CUDA_VISIBLE_DEVICES="" pytest tests/nemo_text_processing/en/ -m "not pleasefixme" --cpu --tn_cache_dir /home/TestData/nlp/text_norm/ci/grammars/6-28-22' } } } @@ -165,7 +165,7 @@ pipeline { parallel { stage('L2: Eng TN') { steps { - sh 'cd 
tools/text_processing_deployment && python pynini_export.py --output=/home/TestData/nlp/text_norm/output/ --grammars=tn_grammars --cache_dir /home/TestData/nlp/text_norm/ci/grammars/6-14-22 --language=en && ls -R /home/TestData/nlp/text_norm/output/ && echo ".far files created "|| exit 1' + sh 'cd tools/text_processing_deployment && python pynini_export.py --output=/home/TestData/nlp/text_norm/output/ --grammars=tn_grammars --cache_dir /home/TestData/nlp/text_norm/ci/grammars/6-28-22 --language=en && ls -R /home/TestData/nlp/text_norm/output/ && echo ".far files created "|| exit 1' sh 'cd nemo_text_processing/text_normalization/ && python normalize.py --input_file=/home/TestData/nlp/text_norm/ci/test.txt --input_case="lower_cased" --language=en --output_file=/home/TestData/nlp/text_norm/output/test.pynini.txt --verbose' sh 'cat /home/TestData/nlp/text_norm/output/test.pynini.txt' sh 'cmp --silent /home/TestData/nlp/text_norm/output/test.pynini.txt /home/TestData/nlp/text_norm/ci/test_goal_py_05-25.txt || exit 1' @@ -175,7 +175,7 @@ pipeline { stage('L2: Eng ITN export') { steps { - sh 'cd tools/text_processing_deployment && python pynini_export.py --output=/home/TestData/nlp/text_denorm/output/ --grammars=itn_grammars --cache_dir /home/TestData/nlp/text_norm/ci/grammars/6-14-22 --language=en && ls -R /home/TestData/nlp/text_denorm/output/ && echo ".far files created "|| exit 1' + sh 'cd tools/text_processing_deployment && python pynini_export.py --output=/home/TestData/nlp/text_denorm/output/ --grammars=itn_grammars --cache_dir /home/TestData/nlp/text_norm/ci/grammars/6-28-22 --language=en && ls -R /home/TestData/nlp/text_denorm/output/ && echo ".far files created "|| exit 1' sh 'cd nemo_text_processing/inverse_text_normalization/ && python inverse_normalize.py --input_file=/home/TestData/nlp/text_denorm/ci/test.txt --language=en --output_file=/home/TestData/nlp/text_denorm/output/test.pynini.txt --verbose' sh 'cmp --silent 
/home/TestData/nlp/text_denorm/output/test.pynini.txt /home/TestData/nlp/text_denorm/ci/test_goal_py.txt || exit 1' sh 'rm -rf /home/TestData/nlp/text_denorm/output/*' @@ -184,7 +184,7 @@ pipeline { stage('L2: TN with Audio (audio and raw text)') { steps { sh 'cd nemo_text_processing/text_normalization && \ - python normalize_with_audio.py --language=en --cache_dir /home/TestData/nlp/text_norm/ci/grammars/6-14-22 --text "The total amounts to \\$4.76." \ + python normalize_with_audio.py --language=en --cache_dir /home/TestData/nlp/text_norm/ci/grammars/6-28-22 --text "The total amounts to \\$4.76." \ --audio_data /home/TestData/nlp/text_norm/audio_based/audio.wav | tail -n2 | head -n1 > /tmp/out_raw.txt 2>&1 && \ cmp --silent /tmp/out_raw.txt /home/TestData/nlp/text_norm/audio_based/result.txt || exit 1' } @@ -192,7 +192,7 @@ pipeline { stage('L2: TN with Audio (audio and text file)') { steps { sh 'cd nemo_text_processing/text_normalization && \ - python normalize_with_audio.py --language=en --cache_dir /home/TestData/nlp/text_norm/ci/grammars/6-14-22 --text /home/TestData/nlp/text_norm/audio_based/text.txt \ + python normalize_with_audio.py --language=en --cache_dir /home/TestData/nlp/text_norm/ci/grammars/6-28-22 --text /home/TestData/nlp/text_norm/audio_based/text.txt \ --audio_data /home/TestData/nlp/text_norm/audio_based/audio.wav | tail -n2 | head -n1 > /tmp/out_file.txt 2>&1 && \ cmp --silent /tmp/out_file.txt /home/TestData/nlp/text_norm/audio_based/result.txt || exit 1' } @@ -200,7 +200,7 @@ pipeline { stage('L2: TN with Audio (manifest)') { steps { sh 'cd nemo_text_processing/text_normalization && \ - python normalize_with_audio.py --language=en --audio_data /home/TestData/nlp/text_norm/audio_based/manifest.json --n_tagged=120 --cache_dir /home/TestData/nlp/text_norm/ci/grammars/6-14-22' + python normalize_with_audio.py --language=en --audio_data /home/TestData/nlp/text_norm/audio_based/manifest.json --n_tagged=120 --cache_dir 
/home/TestData/nlp/text_norm/ci/grammars/6-28-22' } } } @@ -2129,7 +2129,7 @@ pipeline { model.num_attention_heads=8 \ model.activation='swiglu' \ model.masked_softmax_fusion=False \ - model.bias_gelu_fusion=False \ + model.bias_activation_fusion=False \ model.activations_checkpoint_method='block' \ model.activations_checkpoint_num_layers=1 \ model.micro_batch_size=2 \ @@ -2161,7 +2161,7 @@ pipeline { model.hidden_size=64 \ model.num_attention_heads=8 \ model.activation='swiglu' \ - model.bias_gelu_fusion=False \ + model.bias_activation_fusion=False \ model.masked_softmax_fusion=False \ model.activations_checkpoint_method='block' \ model.activations_checkpoint_num_layers=1 \ @@ -2893,7 +2893,7 @@ pipeline { model.hidden_size=64 \ model.num_attention_heads=8 \ model.activation='swiglu' \ - model.bias_gelu_fusion=False \ + model.bias_activation_fusion=False \ model.activations_checkpoint_method='block' \ model.activations_checkpoint_num_layers=1 \ model.transformer_block_type='pre_ln' \ @@ -2918,7 +2918,7 @@ pipeline { model.hidden_size=64 \ model.num_attention_heads=8 \ model.activation='swiglu' \ - model.bias_gelu_fusion=False \ + model.bias_activation_fusion=False \ model.activations_checkpoint_method='block' \ model.activations_checkpoint_num_layers=1 \ model.transformer_block_type='pre_ln' \ @@ -3015,7 +3015,7 @@ pipeline { model.hidden_size=64 \ model.num_attention_heads=8 \ model.activation='swiglu' \ - model.bias_gelu_fusion=False \ + model.bias_activation_fusion=False \ model.activations_checkpoint_method='block' \ model.activations_checkpoint_num_layers=1 \ model.transformer_block_type='normformer' \ @@ -3040,7 +3040,7 @@ pipeline { model.hidden_size=64 \ model.num_attention_heads=8 \ model.activation='swiglu' \ - model.bias_gelu_fusion=False \ + model.bias_activation_fusion=False \ model.activations_checkpoint_method='block' \ model.activations_checkpoint_num_layers=1 \ model.transformer_block_type='normformer' \ @@ -3094,7 +3094,7 @@ pipeline { 
model.hidden_size=64 \ model.num_attention_heads=8 \ model.activation='reglu' \ - model.bias_gelu_fusion=False \ + model.bias_activation_fusion=False \ model.activations_checkpoint_method='block' \ model.activations_checkpoint_num_layers=1 \ model.data.data_prefix=[.5,/home/TestData/nlp/megatron_t5/data/pile_val_small_bert_tokenizer_text_document,.5,/home/TestData/nlp/megatron_t5/data/pile_val_small_bert_tokenizer_text_document]" @@ -3116,7 +3116,7 @@ pipeline { model.hidden_size=64 \ model.num_attention_heads=8 \ model.activation='reglu' \ - model.bias_gelu_fusion=False \ + model.bias_activation_fusion=False \ model.activations_checkpoint_method='block' \ model.activations_checkpoint_num_layers=1 \ model.data.data_prefix=[.5,/home/TestData/nlp/megatron_t5/data/pile_val_small_bert_tokenizer_text_document,.5,/home/TestData/nlp/megatron_t5/data/pile_val_small_bert_tokenizer_text_document]" @@ -3150,7 +3150,7 @@ pipeline { model.hidden_size=64 \ model.num_attention_heads=8 \ model.activation='geglu' \ - model.bias_gelu_fusion=False \ + model.bias_activation_fusion=False \ model.activations_checkpoint_method='block' \ model.activations_checkpoint_num_layers=1 \ model.data.data_prefix=[.5,/home/TestData/nlp/megatron_t5/data/pile_val_small_bert_tokenizer_text_document,.5,/home/TestData/nlp/megatron_t5/data/pile_val_small_bert_tokenizer_text_document]" @@ -3173,7 +3173,7 @@ pipeline { model.hidden_size=64 \ model.num_attention_heads=8 \ model.activation='geglu' \ - model.bias_gelu_fusion=False \ + model.bias_activation_fusion=False \ model.activations_checkpoint_method='block' \ model.activations_checkpoint_num_layers=1 \ model.data.data_prefix=[.5,/home/TestData/nlp/megatron_t5/data/pile_val_small_bert_tokenizer_text_document,.5,/home/TestData/nlp/megatron_t5/data/pile_val_small_bert_tokenizer_text_document]" diff --git a/docs/source/_static/css/custom.css b/docs/source/_static/css/custom.css index a071bc424c60..16fc8d3b469f 100644 --- 
a/docs/source/_static/css/custom.css +++ b/docs/source/_static/css/custom.css @@ -2,55 +2,53 @@ @import url('https://fonts.googleapis.com/css2?family=Roboto:wght@400&display=swap'); body { - font-size: 100%; - font-family: 'Roboto', sans-serif; + font-size: 100%; + font-family: 'Roboto', sans-serif; } /* Width of template */ .wy-nav-content { - max-width: 1200px !important; + max-width: 1200px !important; } /* Standard Text Formatting */ -h1 -{ +h1 { color: #76b900; - text-align: center; + text-align: center; background-color: #ffffff; } -h2 -{ +h2 { color: #ffffff; - background-color: #ffffff; /* #76b900 */ - Padding: 5px; + background-color: #ffffff; + /* #76b900 */ + Padding: 5px; } -h3 -{ +h3 { padding-top: 0px; - border-top: solid 3px #000000; /* #76b900 */ - border-bottom: solid 3px #000000; /* #76b900 */ + border-top: solid 3px #000000; + /* #76b900 */ + border-bottom: solid 3px #000000; + /* #76b900 */ } -p -{ +p { margin-bottom: 24px; } + /* Link Colors */ -a -{ - color: #76b900; +a { + color: #76b900; } -a:visited -{ - color: #218219; +a:visited { + color: #218219; } .container-xl { @@ -73,56 +71,52 @@ table { } /* Table head Color */ -thead td -{ +thead td { background-color: #333333 !important; } -.row-odd p -{ +.row-odd p { /*padding-bottom: 0px;*/ /*margin-bottom: 0px;*/ } + /* even rows*/ -.row-even tr -{ - background-color: #e5f1e6 !important; +.row-even tr { + background-color: #e5f1e6 !important; } /* odd rows*/ .wy-table-responsive table tr { - background-color: #ffffff !important; + background-color: #ffffff !important; } .wy-table-responsive table td { - white-space: normal; + white-space: normal; } /* Removes bottom margin in tables*/ .rst-content .line-block { - margin-bottom: 0px; + margin-bottom: 0px; } .wy-table-responsive { - overflow: visible !important; + overflow: visible !important; } /* reduces the size of text in multiline table columns. 
*/ -.rst-content table.docutils td -{ - font-size: 80%; +.rst-content table.docutils td { + font-size: 80%; } -.rst-content dl:not(.docutils) dt -{ +.rst-content dl:not(.docutils) dt { background-color: inherit; color: #000000; @@ -131,100 +125,130 @@ thead td } .rst-content dl:not(.docutils) dt:before { - color: #333333; + color: #333333; } .rst-content .line-block { - margin-bottom: 0px; + margin-bottom: 0px; } -.wy-side-nav-search, .wy-nav-top - { - background-color: #000000; - padding: 0; - } +.wy-side-nav-search, +.wy-nav-top { + background-color: #000000; + padding: 0; +} -.wy-side-nav-search img - { - padding: 0px; - padding: 0px 0px; - margin-bottom: 0; - } +.wy-side-nav-search img { + padding: 0px; + padding: 0px 0px; + margin-bottom: 0; +} -.wy-side-nav-search input[type=text] - { - border-radius: 0px; - } +.wy-side-nav-search input[type=text] { + border-radius: 0px; +} -.wy-menu-vertical p.caption - { +.wy-menu-vertical p.caption { color: #76b900; - } +} -.wy-side-nav-search>a img.logo, .wy-side-nav-search .wy-dropdown>a img.logo - { +.wy-side-nav-search>a img.logo, +.wy-side-nav-search .wy-dropdown>a img.logo { margin: 0px 0px 0px 0px; - } +} -.wy-nav-content - { - margin: 0; - min-height: 100%; - height: 100%; - background: #ffffff; - } +.wy-nav-content { + margin: 0; + min-height: 100%; + height: 100%; + background: #ffffff; +} - /* List (numbered, bulleted) padding Fix */ +/* List (numbered, bulleted) padding Fix */ -.wy-plain-list-decimal li -{ - margin-top: -6px; - margin-bottom: -6px; +.wy-plain-list-decimal li { + margin-top: -6px; + margin-bottom: -6px; } -.rst-content .section ol.loweralpha -{ - margin-top: -6px; - margin-bottom: 12px; +.rst-content .section ol.loweralpha { + margin-top: -6px; + margin-bottom: 12px; } -.wy-plain-list-disc, .rst-content .toctree-wrapper ul, article ul -{ - margin-top: 0px !important; - margin-bottom: 12px; +.wy-plain-list-disc, +.rst-content .toctree-wrapper ul, +article ul { + margin-top: 0px !important; + 
margin-bottom: 12px; } - /* Alert Boxes */ - /* Background color of Alert Box Title */ +/* Alert Boxes */ +/* Background color of Alert Box Title */ -.rst-content .section ul -{ - margin-top: -12px; - margin-bottom: 16px; +.rst-content .section ul { + margin-top: -12px; + margin-bottom: 16px; } -.wy-alert.wy-alert-info .wy-alert-title, .rst-content .note .wy-alert-title, .rst-content .wy-alert-info.attention .wy-alert-title, .rst-content .wy-alert-info.caution .wy-alert-title, .rst-content .wy-alert-info.danger .wy-alert-title, .rst-content .wy-alert-info.error .wy-alert-title, .rst-content .wy-alert-info.hint .wy-alert-title, .rst-content .wy-alert-info.important .wy-alert-title, .rst-content .wy-alert-info.tip .wy-alert-title, .rst-content .wy-alert-info.warning .wy-alert-title, .rst-content .seealso .wy-alert-title, .rst-content .wy-alert-info.admonition-todo .wy-alert-title, .rst-content .wy-alert-info.admonition .wy-alert-title, .wy-alert.wy-alert-info .rst-content .admonition-title, .rst-content .wy-alert.wy-alert-info .admonition-title, .rst-content .note .admonition-title, .rst-content .wy-alert-info.attention .admonition-title, .rst-content .wy-alert-info.caution .admonition-title, .rst-content .wy-alert-info.danger .admonition-title, .rst-content .wy-alert-info.error .admonition-title, .rst-content .wy-alert-info.hint .admonition-title, .rst-content .wy-alert-info.important .admonition-title, .rst-content .wy-alert-info.tip .admonition-title, .rst-content .wy-alert-info.warning .admonition-title, .rst-content .seealso .admonition-title, .rst-content .wy-alert-info.admonition-todo .admonition-title, .rst-content .wy-alert-info.admonition .admonition-title - { +.wy-alert.wy-alert-info .wy-alert-title, +.rst-content .note .wy-alert-title, +.rst-content .wy-alert-info.attention .wy-alert-title, +.rst-content .wy-alert-info.caution .wy-alert-title, +.rst-content .wy-alert-info.danger .wy-alert-title, +.rst-content .wy-alert-info.error .wy-alert-title, 
+.rst-content .wy-alert-info.hint .wy-alert-title, +.rst-content .wy-alert-info.important .wy-alert-title, +.rst-content .wy-alert-info.tip .wy-alert-title, +.rst-content .wy-alert-info.warning .wy-alert-title, +.rst-content .seealso .wy-alert-title, +.rst-content .wy-alert-info.admonition-todo .wy-alert-title, +.rst-content .wy-alert-info.admonition .wy-alert-title, +.wy-alert.wy-alert-info .rst-content .admonition-title, +.rst-content .wy-alert.wy-alert-info .admonition-title, +.rst-content .note .admonition-title, +.rst-content .wy-alert-info.attention .admonition-title, +.rst-content .wy-alert-info.caution .admonition-title, +.rst-content .wy-alert-info.danger .admonition-title, +.rst-content .wy-alert-info.error .admonition-title, +.rst-content .wy-alert-info.hint .admonition-title, +.rst-content .wy-alert-info.important .admonition-title, +.rst-content .wy-alert-info.tip .admonition-title, +.rst-content .wy-alert-info.warning .admonition-title, +.rst-content .seealso .admonition-title, +.rst-content .wy-alert-info.admonition-todo .admonition-title, +.rst-content .wy-alert-info.admonition .admonition-title { background: #76b900; - } +} - /* Background and Font Color of Alert Box Main Body*/ -.wy-alert.wy-alert-info, .rst-content .note, .rst-content .wy-alert-info.attention, .rst-content .wy-alert-info.caution, .rst-content .wy-alert-info.danger, .rst-content .wy-alert-info.error, .rst-content .wy-alert-info.hint, .rst-content .wy-alert-info.important, .rst-content .wy-alert-info.tip, .rst-content .wy-alert-info.warning, .rst-content .seealso, .rst-content .wy-alert-info.admonition-todo, .rst-content .wy-alert-info.admonition { - background: #333333; - color: #999999; - } +/* Background and Font Color of Alert Box Main Body*/ +.wy-alert.wy-alert-info, +.rst-content .note, +.rst-content .wy-alert-info.attention, +.rst-content .wy-alert-info.caution, +.rst-content .wy-alert-info.danger, +.rst-content .wy-alert-info.error, +.rst-content .wy-alert-info.hint, 
+.rst-content .wy-alert-info.important, +.rst-content .wy-alert-info.tip, +.rst-content .wy-alert-info.warning, +.rst-content .seealso, +.rst-content .wy-alert-info.admonition-todo, +.rst-content .wy-alert-info.admonition { + background: #333333; + color: #999999; +} -.section -{ +.section { margin-top: 50px; } /* Logo */ .navbar-brand-box { - background-color: #ffffff; + background-color: #ffffff; } /* ---------------------------------------------- Media Queries --------------------------------------- */ @@ -238,7 +262,7 @@ thead td body { font-size: 18px; } - + #site-navigation nav ul.nav { font-size: 18px; } @@ -254,7 +278,7 @@ thead td .toc-h2 { font-size: 18px; } - + .toc-h3 { font-size: 1rem; } @@ -267,8 +291,8 @@ thead td font-size: 18px; } - #main-content > div { + #main-content>div { margin-left: 10%; margin-right: 10%; } -} +} \ No newline at end of file diff --git a/examples/nlp/language_modeling/conf/megatron_bart_config.yaml b/examples/nlp/language_modeling/conf/megatron_bart_config.yaml index fb8094842ca3..03bc6466f1e6 100644 --- a/examples/nlp/language_modeling/conf/megatron_bart_config.yaml +++ b/examples/nlp/language_modeling/conf/megatron_bart_config.yaml @@ -63,12 +63,15 @@ model: init_method_std: 0.02 # Standard deviation of the zero mean normal distribution used for weight initialization.') hidden_dropout: 0.1 # Dropout probability for hidden state transformer. attention_dropout: 0.1 # Dropout probability in the attention layer. + position_embedding_type: 'learned_absolute' # Position embedding type. Options ['learned_absolute', 'relative'] + relative_attention_num_buckets: 32 # Relative position number of buckets for computing the bias + relative_attention_max_distance: 128 # max_distance to keep relative distance in the attention_num_buckets. kv_channels: null # Projection weights dimension in multi-head attention. Set to hidden_size // num_attention_heads if null apply_query_key_layer_scaling: True # scale Q * K^T by 1 / layer-number. 
layernorm_epsilon: 1e-5 persist_layer_norm: True # Use of persistent fused layer norm kernel. gradient_as_bucket_view: True # Allocate gradients in a contiguous bucket to save memory (less fragmentation and buffer memory) - bias_gelu_fusion: True # Use a kernel that fuses the bias addition from weight matrices with the subsequent gelu activation. + bias_activation_fusion: True # Use a kernel that fuses the bias addition from weight matrices with the subsequent activation function. masked_softmax_fusion: True # Use a kernel that fuses the attention softmax with it's mask. bias_dropout_add_fusion: True # Use a kernel that fuses the bias addition, dropout and residual connection addition. bias: True # Whether to use bias terms in all weight matrices. diff --git a/examples/nlp/language_modeling/conf/megatron_retro_config.yaml b/examples/nlp/language_modeling/conf/megatron_retro_config.yaml index 3cb87bb19c52..3b99d2ad904d 100644 --- a/examples/nlp/language_modeling/conf/megatron_retro_config.yaml +++ b/examples/nlp/language_modeling/conf/megatron_retro_config.yaml @@ -59,7 +59,7 @@ model: layernorm_epsilon: 1e-5 gradient_as_bucket_view: True # Allocate gradients in a contiguous bucket to save memory (less fragmentation and buffer memory) persist_layer_norm: False - bias_gelu_fusion: True + bias_activation_fusion: True # Use a kernel that fuses the bias addition from weight matrices with the subsequent activation function. bias_dropout_add_fusion: True masked_softmax_fusion: True activation: 'gelu' diff --git a/examples/nlp/language_modeling/conf/megatron_t5_config.yaml b/examples/nlp/language_modeling/conf/megatron_t5_config.yaml index d3f8f402bdb2..df8010fa6258 100644 --- a/examples/nlp/language_modeling/conf/megatron_t5_config.yaml +++ b/examples/nlp/language_modeling/conf/megatron_t5_config.yaml @@ -72,7 +72,6 @@ model: layernorm_epsilon: 1e-5 persist_layer_norm: True # Use of persistent fused layer norm kernel. 
gradient_as_bucket_view: True # Allocate gradients in a contiguous bucket to save memory (less fragmentation and buffer memory) - bias_gelu_fusion: True # Use a kernel that fuses the bias addition from weight matrices with the subsequent gelu activation. bias_activation_fusion: True # Use a kernel that fuses the bias addition from weight matrices with the subsequent activation function. grad_div_ar_fusion: True # Fuse grad division into torch.distributed.all_reduce masked_softmax_fusion: True # Use a kernel that fuses the attention softmax with it's mask. diff --git a/examples/nlp/language_modeling/conf/megatron_t5_lm_adaptation_finetune.yaml b/examples/nlp/language_modeling/conf/megatron_t5_lm_adaptation_finetune.yaml index c499e9d76bcf..d3860e9957c0 100644 --- a/examples/nlp/language_modeling/conf/megatron_t5_lm_adaptation_finetune.yaml +++ b/examples/nlp/language_modeling/conf/megatron_t5_lm_adaptation_finetune.yaml @@ -53,7 +53,7 @@ model: megatron_amp_O2: False # use AMP with O2 style mixed precision instead of native amp on-the-fly weight autocasting. # JIT fusion params. - bias_gelu_fusion: True # Use a kernel that fuses the bias addition from weight matrices with the subsequent gelu activation. + bias_activation_fusion: True # Use a kernel that fuses the bias addition from weight matrices with the subsequent activation function. masked_softmax_fusion: True # Use a kernel that fuses the attention softmax with it's mask. bias_dropout_add_fusion: True # Use a kernel that fuses the bias addition, dropout and residual connection addition. 
diff --git a/examples/nlp/language_modeling/conf/megatron_ul2_config.yaml b/examples/nlp/language_modeling/conf/megatron_ul2_config.yaml index 20cd172e460c..113f2a6961af 100644 --- a/examples/nlp/language_modeling/conf/megatron_ul2_config.yaml +++ b/examples/nlp/language_modeling/conf/megatron_ul2_config.yaml @@ -62,12 +62,15 @@ model: init_method_std: 0.02 # Standard deviation of the zero mean normal distribution used for weight initialization.') hidden_dropout: 0.1 # Dropout probability for hidden state transformer. attention_dropout: 0.1 # Dropout probability in the attention layer. + position_embedding_type: 'learned_absolute' # Position embedding type. Options ['learned_absolute', 'relative'] + relative_attention_num_buckets: 32 # Relative position number of buckets for computing the bias + relative_attention_max_distance: 128 # max_distance to keep relative distance in the attention_num_buckets. kv_channels: null # Projection weights dimension in multi-head attention. Set to hidden_size // num_attention_heads if null apply_query_key_layer_scaling: True # scale Q * K^T by 1 / layer-number. layernorm_epsilon: 1e-5 persist_layer_norm: True # Use of persistent fused layer norm kernel. gradient_as_bucket_view: True # Allocate gradients in a contiguous bucket to save memory (less fragmentation and buffer memory) - bias_gelu_fusion: True # Use a kernel that fuses the bias addition from weight matrices with the subsequent gelu activation. + bias_activation_fusion: True # Use a kernel that fuses the bias addition from weight matrices with the subsequent activation function. masked_softmax_fusion: True # Use a kernel that fuses the attention softmax with it's mask. bias_dropout_add_fusion: True # Use a kernel that fuses the bias addition, dropout and residual connection addition. bias: True # Whether to use bias terms in all weight matrices. 
diff --git a/examples/nlp/machine_translation/conf/aayn_base_megatron.yaml b/examples/nlp/machine_translation/conf/aayn_base_megatron.yaml index d85946287e2d..286bfaf6d8d7 100644 --- a/examples/nlp/machine_translation/conf/aayn_base_megatron.yaml +++ b/examples/nlp/machine_translation/conf/aayn_base_megatron.yaml @@ -73,12 +73,15 @@ model: init_method_std: 0.02 # Standard deviation of the zero mean normal distribution used for weight initialization.') hidden_dropout: 0.1 # Dropout probability for hidden state transformer. attention_dropout: 0.1 # Dropout probability in the attention layer. + position_embedding_type: 'learned_absolute' # Position embedding type. Options ['learned_absolute', 'relative'] + relative_attention_num_buckets: 32 # Relative position number of buckets for computing the bias + relative_attention_max_distance: 128 # max_distance to keep relative distance in the attention_num_buckets. kv_channels: null # Projection weights dimension in multi-head attention. Set to hidden_size // num_attention_heads if null apply_query_key_layer_scaling: True # scale Q * K^T by 1 / layer-number. layernorm_epsilon: 1e-5 persist_layer_norm: True # Use of persistent fused layer norm kernel. gradient_as_bucket_view: True # Allocate gradients in a contiguous bucket to save memory (less fragmentation and buffer memory) - bias_gelu_fusion: True # Use a kernel that fuses the bias addition from weight matrices with the subsequent gelu activation. + bias_activation_fusion: True # Use a kernel that fuses the bias addition from weight matrices with the subsequent activation function. masked_softmax_fusion: True # Use a kernel that fuses the attention softmax with it's mask. bias_dropout_add_fusion: True # Use a kernel that fuses the bias addition, dropout and residual connection addition. bias: True # Whether to use bias terms in all weight matrices. 
diff --git a/examples/tts/conf/de/fastpitch_align_22050.yaml b/examples/tts/conf/de/fastpitch_align_22050.yaml index d12ca3d2478d..2ba41b6fef1e 100644 --- a/examples/tts/conf/de/fastpitch_align_22050.yaml +++ b/examples/tts/conf/de/fastpitch_align_22050.yaml @@ -202,7 +202,7 @@ model: optim: name: adamw - lr: 1e-1 + lr: 1e-3 # optimizer arguments betas: [0.9, 0.999] weight_decay: 1e-6 diff --git a/nemo/collections/common/metrics/classification_accuracy.py b/nemo/collections/common/metrics/classification_accuracy.py index db746b869001..eca7379a382e 100644 --- a/nemo/collections/common/metrics/classification_accuracy.py +++ b/nemo/collections/common/metrics/classification_accuracy.py @@ -154,7 +154,7 @@ def compute_topk_accuracy(correct_counts_k, total_counts_k): class ExactStringPerCategoryMatchMetric(Metric): - def __init__(self, categories=[], dist_sync_on_step=False): + def __init__(self, categories=[], dist_sync_on_step=False, **kwargs): super().__init__(dist_sync_on_step=dist_sync_on_step) self.categories = set(categories) @@ -190,7 +190,7 @@ def compute(self): class ExactStringMatchMetric(Metric): - def __init__(self, dist_sync_on_step=False): + def __init__(self, dist_sync_on_step=False, **kwargs): super().__init__(dist_sync_on_step=dist_sync_on_step) self.add_state("correct", default=torch.tensor(0), dist_reduce_fx="sum") diff --git a/nemo/collections/nlp/models/language_modeling/megatron_lm_encoder_decoder_model.py b/nemo/collections/nlp/models/language_modeling/megatron_lm_encoder_decoder_model.py index cad504e04a28..6fa4e9d0129e 100644 --- a/nemo/collections/nlp/models/language_modeling/megatron_lm_encoder_decoder_model.py +++ b/nemo/collections/nlp/models/language_modeling/megatron_lm_encoder_decoder_model.py @@ -69,6 +69,10 @@ class MegatronLMEncoderDecoderModel(MegatronBaseModel): def __init__(self, cfg: DictConfig, trainer: Trainer): super().__init__(cfg, trainer=trainer) + if cfg.get('pipeline_model_parallel_size', 1) > 2 and 
self.cfg.get('position_embedding_type') == 'relative': + raise ValueError( + "pipeline_model_parallel_size cannot be > 2 with position_embedding_type == relative at the moment." + ) if cfg.get('pipeline_model_parallel_size', 1) > 1: if cfg.get('pipeline_model_parallel_split_rank', 0) <= 0: raise ValueError( @@ -116,7 +120,11 @@ def setup_optimizer_param_groups(self): def model_provider_func(self, pre_process, post_process, add_encoder, add_decoder): # TODO: create get_encoder_decoder_model()here for different losses (e..g, nll, vae, mim) - + if hasattr(self.cfg, 'bias_gelu_fusion'): + logging.warning('bias_gelu_fusion is deprecated. Please use bias_activation_fusion instead.') + activation_fusion = self.cfg.bias_gelu_fusion + else: + activation_fusion = self.cfg.get('bias_activation_fusion', True) model = MegatronTokenLevelEncoderDecoderModule( encoder_arch=self.cfg.encoder_arch, decoder_arch=self.cfg.decoder_arch, @@ -146,10 +154,7 @@ def model_provider_func(self, pre_process, post_process, add_encoder, add_decode activations_checkpoint_num_layers=self.cfg.get('activations_checkpoint_num_layers', 1), layernorm_epsilon=self.cfg.get('layernorm_epsilon', 1e-5), persist_layer_norm=self.cfg.get('persist_layer_norm', False), - bias_activation_fusion=( - (self.cfg.get('bias_gelu_fusion', True) and self.cfg.get('activation', 'gelu') == 'gelu') - or (self.cfg.get('bias_activation_fusion', True) and self.cfg.get('activation', 'gelu') == 'geglu') - ), + bias_activation_fusion=activation_fusion, bias_dropout_add_fusion=self.cfg.get('bias_dropout_add_fusion', True), masked_softmax_fusion=self.cfg.get('masked_softmax_fusion', True), onnx_safe=self.cfg.get('onnx_safe', False), @@ -393,7 +398,7 @@ def allreduce_word_and_position_embeddings(self): and parallel_state.get_pipeline_model_parallel_world_size() > 1 and parallel_state.get_pipeline_model_parallel_split_rank() is not None ): - if self.enc_dec_model.position_embedding_type != 'relative': + if 
self.cfg.get('position_embedding_type') != 'relative': position_embeddings_weight = self.enc_dec_model.position_embeddings_weight() if self.megatron_amp_o2: grad = position_embeddings_weight.main_grad @@ -706,7 +711,7 @@ def setup(self, stage=None): # when using pipeline model parallel the final stage need to initialize word embeddings if parallel_state.get_pipeline_model_parallel_world_size() > 1: self.enc_dec_model.sync_initial_word_embeddings() - if self.enc_dec_model.position_embedding_type != 'relative': + if self.cfg.get('position_embedding_type') != 'relative': self.enc_dec_model.sync_initial_position_embeddings() def setup_training_data(self, cfg): @@ -837,12 +842,12 @@ def dummy(): torch.distributed.broadcast( predicted_tokens_dec, parallel_state.get_pipeline_model_parallel_last_rank(), - group=parallel_state.get_model_parallel_group(), + group=parallel_state.get_pipeline_model_parallel_group(), ) torch.distributed.broadcast( log_probs, parallel_state.get_pipeline_model_parallel_last_rank(), - group=parallel_state.get_model_parallel_group(), + group=parallel_state.get_pipeline_model_parallel_group(), ) # Reset microbatch calculator to what it was before decoding. diff --git a/nemo/collections/nlp/modules/common/megatron/transformer.py b/nemo/collections/nlp/modules/common/megatron/transformer.py index 6fc77f01c27a..d1cdf7eb2211 100644 --- a/nemo/collections/nlp/modules/common/megatron/transformer.py +++ b/nemo/collections/nlp/modules/common/megatron/transformer.py @@ -159,19 +159,20 @@ def __init__( bias=bias, ) - glu_activation_family = activation in ['reglu', 'swiglu'] + self.glu_activation_family = activation in ['geglu', 'reglu', 'swiglu'] + bias_activation_fusion_unavailable = activation in ['reglu', 'swiglu'] - if glu_activation_family and bias_activation_fusion: + if bias_activation_fusion_unavailable and bias_activation_fusion: raise ValueError( f"Cannot use bias_activation_fusion with {activation} activation. Please turn bias gelu fusion off." 
) - if glu_activation_family and openai_gelu: + if self.glu_activation_family and openai_gelu: raise ValueError( f"Cannot use openai_gelu with specificed activation function : {activation} Please turn openai gelu off." ) - if glu_activation_family and onnx_safe: + if self.glu_activation_family and onnx_safe: raise ValueError( f"Cannot use onnx_safe with specificed activation function : {activation} Please turn onnx safe off." ) @@ -180,8 +181,6 @@ def __init__( raise ValueError( f"Cannot use bias_activation_fusion without bias terms. Please set bias=True or bias_activation_fusion=False." ) - else: - glu_activation_family = False self.bias_activation_fusion = bias_activation_fusion @@ -224,18 +223,18 @@ def forward(self, hidden_states): # [s, b, 4hp] intermediate_parallel, bias_parallel = self.dense_h_to_4h(hidden_states) - if self.activation in ['geglu', 'reglu', 'swiglu']: + if self.glu_activation_family: intermediate_parallel_2, bias_parallel_2 = self.dense_h_to_4h_2(hidden_states) if self.bias_activation_fusion: if self.activation == 'gelu': intermediate_parallel = fused_bias_gelu(intermediate_parallel, bias_parallel) - else: + elif self.activation == 'geglu': intermediate_parallel = fused_bias_geglu( intermediate_parallel, bias_parallel, intermediate_parallel_2, bias_parallel_2 ) - elif self.activation in ['geglu', 'reglu', 'swiglu']: + elif self.activation in ['reglu', 'swiglu']: if bias_parallel is not None: intermediate_parallel = self.activation_func(intermediate_parallel + bias_parallel) * ( intermediate_parallel_2 + bias_parallel_2 @@ -287,6 +286,7 @@ def __init__( megatron_legacy=False, bias=True, headscale=False, + has_relative_attention_bias=False, ): super(ParallelAttention, self).__init__() @@ -299,6 +299,7 @@ def __init__( self.attn_mask_type = attn_mask_type self.megatron_legacy = megatron_legacy self.headscale = headscale + self.has_relative_attention_bias = has_relative_attention_bias if kv_channels is None: assert ( @@ -382,7 +383,7 @@ def 
__init__( self.position_embedding_type = position_embedding_type self.relative_attention_num_buckets = relative_attention_num_buckets self.relative_attention_max_distance = relative_attention_max_distance - if self.position_embedding_type == 'relative': + if self.position_embedding_type == 'relative' and self.has_relative_attention_bias: self.relative_attention_bias = torch.nn.Embedding( relative_attention_num_buckets, self.num_attention_heads_per_partition ).to(torch.cuda.current_device()) @@ -498,7 +499,7 @@ def compute_bias(self, query_length, key_length): relative_position = memory_position - context_position # shape (query_length, key_length) relative_position_bucket = self._relative_position_bucket( relative_position, # shape (query_length, key_length) - bidirectional=(self.layer_type != LayerType.decoder), # (not self.is_decoder), + bidirectional=(self.attention_type != AttnMaskType.causal), # self.is_decoder and self_attention. num_buckets=self.relative_attention_num_buckets, max_distance=self.relative_attention_max_distance, ) @@ -683,9 +684,12 @@ def forward( if position_bias is None: if self.position_embedding_type == 'relative': - position_bias = self.compute_bias(real_seq_length, key_length) - else: - pass # HuggingFace implementation initialize position_bias to zero when not using + if self.has_relative_attention_bias: + position_bias = self.compute_bias(real_seq_length, key_length) + elif attention_mask is not None: + position_bias = torch.zeros_like(attention_mask).to(torch.cuda.current_device()) + else: + position_bias = torch.zeros(1, key_length, key_length).to(torch.cuda.current_device()) # if key and values are already calculated # we want only the last query position bias @@ -977,6 +981,7 @@ def __init__( normalization='layernorm', transformer_block_type='pre_ln', headscale=False, + has_relative_attention_bias=False, ): super(ParallelTransformerLayer_, self).__init__() @@ -1040,6 +1045,7 @@ def __init__( megatron_legacy=megatron_legacy, 
bias=bias, headscale=headscale, + has_relative_attention_bias=has_relative_attention_bias, ) # Normformer normalization if transformer_block_type == 'normformer': @@ -1092,6 +1098,7 @@ def __init__( megatron_legacy=megatron_legacy, bias=bias, headscale=headscale, + has_relative_attention_bias=False, ) # Normformer normalization if transformer_block_type == 'normformer': @@ -1221,14 +1228,6 @@ def forward( # Post-LN: x -> MHA -> Residual -> LN -> MLP -> Residual -> LN # Normformer: x -> LN -> MHA -> LN -> Residual -> MLP (w/LN) -> Residual - if type(hidden_states) is tuple: - if len(hidden_states) == 2: - hidden_states, position_bias = hidden_states - elif len(hidden_states) == 3: - hidden_states, position_bias, encoder_decoder_position_bias = hidden_states - else: - raise IndexError('Hidden_states needs to be tuple containing 2 or 3 elements.') - residual = hidden_states # Layer norm at the beginning of the transformer layer. if self.transformer_block_type in ['pre_ln', 'normformer']: @@ -1242,6 +1241,7 @@ def forward( set_inference_key_value_memory=set_inference_key_value_memory, inference_max_sequence_len=inference_max_sequence_len, rotary_pos_emb=self_attention_pos_emb, + position_bias=position_bias, ) if get_key_value: @@ -1491,7 +1491,7 @@ def __init__( self.num_layers = self.get_num_layers(num_layers) # Transformer layers. 
- def build_layer(layer_number): + def build_layer(layer_number, has_relative_attention_bias=False): if isinstance(layer_type, list): lt = layer_type[layer_number - 1] else: @@ -1529,6 +1529,7 @@ def build_layer(layer_number): normalization=normalization, transformer_block_type=transformer_block_type, headscale=headscale, + has_relative_attention_bias=has_relative_attention_bias, ) if parallel_state.get_virtual_pipeline_model_parallel_world_size() is not None: @@ -1565,7 +1566,14 @@ def build_layer(layer_number): else: offset = parallel_state.get_pipeline_model_parallel_rank() * self.num_layers - self.layers = torch.nn.ModuleList([build_layer(i + 1 + offset) for i in range(self.num_layers)]) + self.layers = torch.nn.ModuleList( + [ + build_layer( + i + 1 + offset, has_relative_attention_bias=(i == 0) and parallel_state.is_pipeline_first_stage() + ) + for i in range(self.num_layers) + ] + ) if self.post_process and self.transformer_block_type != 'post_ln': # Final layer norm before output. @@ -1631,9 +1639,16 @@ def custom_forward(*inputs): encoder_output, enc_dec_attn_mask, rotary_pos_emb, - position_bias, - encoder_decoder_position_bias, + position_bias=position_bias, + encoder_decoder_position_bias=encoder_decoder_position_bias, ) + if type(x_) is tuple: + if len(x_) == 2: + x_, position_bias = x_ + elif len(x_) == 3: + x_, position_bias, encoder_decoder_position_bias = x_ + else: + raise IndexError('Hidden_states (x_) needs to be tuple containing 2 or 3 elements.') return x_ return custom_forward @@ -1711,9 +1726,10 @@ def forward( inference_max_sequence_len=None, rotary_pos_emb=None, # list of positional embedding tensors, first one self attention, second one and third one are for cross attention (q, k) retrieved_emb=None, # tensor of retrieved embedding of shape [b, k, r, n, d] - position_bias=None, - encoder_decoder_position_bias=None, ): + position_bias = None + encoder_decoder_position_bias = None + # Checks. 
if inference_max_sequence_len: assert self.activations_checkpoint_method is None, 'inference does not work with activation checkpointing' diff --git a/nemo/collections/tts/torch/g2ps.py b/nemo/collections/tts/torch/g2ps.py index 694f8b25820a..a6286aa5710c 100644 --- a/nemo/collections/tts/torch/g2ps.py +++ b/nemo/collections/tts/torch/g2ps.py @@ -405,7 +405,7 @@ def parse_one_word(self, word: str): ): if word[-3] == 'T': # Case like "airport's" - return self.phoneme_dict[word[:-2]][0] + ["t", "s"], True + return self.phoneme_dict[word[:-2]][0] + ["s"], True elif word[-3] == 'S': # Case like "jones's" return self.phoneme_dict[word[:-2]][0] + ["ɪ", "z"], True @@ -420,14 +420,11 @@ def parse_one_word(self, word: str): and (word[:-1] in self.phoneme_dict) and (not self.ignore_ambiguous_words or self.is_unique_in_phoneme_dict(word[:-1])) ): - if word[-3] == 'T': - # Case like "airport's" - return self.phoneme_dict[word[:-2]][0] + ["t", "s"], True - elif word[-3] == 'S': - # Case like "jones's" - return self.phoneme_dict[word[:-2]][0] + ["ɪ", "z"], True + if word[-2] == 'T': + # Case like "airports" + return self.phoneme_dict[word[:-1]][0] + ["s"], True else: - return self.phoneme_dict[word[:-2]][0] + ["z"], True + return self.phoneme_dict[word[:-1]][0] + ["z"], True # Phoneme dict lookup for unique words (or default pron if ignore_ambiguous_words=False) if word in self.phoneme_dict and (not self.ignore_ambiguous_words or self.is_unique_in_phoneme_dict(word)): diff --git a/nemo_text_processing/text_normalization/en/data/measure/unit.tsv b/nemo_text_processing/text_normalization/en/data/measure/unit.tsv index c8893cbbdcd0..96afbb71d27f 100644 --- a/nemo_text_processing/text_normalization/en/data/measure/unit.tsv +++ b/nemo_text_processing/text_normalization/en/data/measure/unit.tsv @@ -1,8 +1,11 @@ amu atomic mass unit bar bar ° degree +º degree °c degree Celsius °C degree Celsius +ºc degree Celsius +ºC degree Celsius ℃ degree Celsius cm2 square centimeter cm² square 
centimeter diff --git a/nemo_text_processing/text_normalization/en/data/whitelist/symbol.tsv b/nemo_text_processing/text_normalization/en/data/whitelist/symbol.tsv index 63c035026564..6f2f8c69a8e6 100644 --- a/nemo_text_processing/text_normalization/en/data/whitelist/symbol.tsv +++ b/nemo_text_processing/text_normalization/en/data/whitelist/symbol.tsv @@ -19,3 +19,5 @@ $ dollar € euro ₩ won ¥ yen +° degree +º degree diff --git a/nemo_text_processing/text_normalization/en/taggers/word.py b/nemo_text_processing/text_normalization/en/taggers/word.py index 0b1ebc469384..fa6a965aab2e 100644 --- a/nemo_text_processing/text_normalization/en/taggers/word.py +++ b/nemo_text_processing/text_normalization/en/taggers/word.py @@ -14,6 +14,7 @@ import pynini from nemo_text_processing.text_normalization.en.graph_utils import ( + MIN_NEG_WEIGHT, NEMO_ALPHA, NEMO_DIGIT, NEMO_NOT_SPACE, @@ -22,6 +23,7 @@ convert_space, get_abs_path, ) +from nemo_text_processing.text_normalization.en.taggers.punctuation import PunctuationFst from pynini.examples import plurals from pynini.lib import pynutil @@ -40,8 +42,11 @@ class WordFst(GraphFst): def __init__(self, punctuation: GraphFst, deterministic: bool = True): super().__init__(name="word", kind="classify", deterministic=deterministic) + punct = PunctuationFst().graph + default_graph = pynini.closure(pynini.difference(NEMO_NOT_SPACE, punct.project("input")), 1) symbols_to_exclude = (pynini.union("$", "€", "₩", "£", "¥", "#", "%") | NEMO_DIGIT).optimize() graph = pynini.closure(pynini.difference(NEMO_NOT_SPACE, symbols_to_exclude), 1) + graph = pynutil.add_weight(graph, MIN_NEG_WEIGHT) | default_graph # leave phones of format [HH AH0 L OW1] untouched phoneme_unit = pynini.closure(NEMO_ALPHA, 1) + pynini.closure(NEMO_DIGIT) diff --git a/tests/nemo_text_processing/en/data_text_normalization/test_cases_measure.txt b/tests/nemo_text_processing/en/data_text_normalization/test_cases_measure.txt index 8d5e6dee094f..c26138f813ca 100644 --- 
a/tests/nemo_text_processing/en/data_text_normalization/test_cases_measure.txt +++ b/tests/nemo_text_processing/en/data_text_normalization/test_cases_measure.txt @@ -15,3 +15,4 @@ covid-19.5~covid- nineteen point five 2°C~two degrees Celsius 1°C~one degree Celsius 1234-123kg~one thousand two hundred and thirty four to one hundred and twenty three kilograms +45º&C~forty five degree and C diff --git a/tests/nemo_text_processing/en/data_text_normalization/test_cases_normalize_with_audio.txt b/tests/nemo_text_processing/en/data_text_normalization/test_cases_normalize_with_audio.txt index 639cea6366ad..5698b7886116 100644 --- a/tests/nemo_text_processing/en/data_text_normalization/test_cases_normalize_with_audio.txt +++ b/tests/nemo_text_processing/en/data_text_normalization/test_cases_normalize_with_audio.txt @@ -65,7 +65,7 @@ four five six seven forty five sixty seven four thousand five hundred and sixty seven ~This example number 15,000 can be a very long one, and can fail to produce valid normalization for such an easy number like 10,125 or dollar value $5349.01, and can fail to terminate, and can fail to terminate, and can fail to terminate, and can fail to terminate, and can fail to terminate, 452. -This example number fifteen thousand can be a very long one, and can fail to produce valid normalization for such an easy number like ten thousand one hundred twenty five or dollar value five thousand and three forty nine us dollars one cent, and can fail to terminate, and can fail to terminate, and can fail to terminate, and can fail to terminate, and can fail to terminate, four hundred fifty two. 
+This example number fifteen thousand can be a very long one, and can fail to produce valid normalization for such an easy number like ten thousand one hundred twenty five or dollar value five thousand and three forty nine us dollars and one cent, and can fail to terminate, and can fail to terminate, and can fail to terminate, and can fail to terminate, and can fail to terminate, four fifty two. ~$1.01 one dollar one cent one dollar and one cent diff --git a/tests/nemo_text_processing/en/data_text_normalization/test_cases_punctuation.txt b/tests/nemo_text_processing/en/data_text_normalization/test_cases_punctuation.txt index 56ab0c1ca12e..c3073a3934bf 100644 --- a/tests/nemo_text_processing/en/data_text_normalization/test_cases_punctuation.txt +++ b/tests/nemo_text_processing/en/data_text_normalization/test_cases_punctuation.txt @@ -60,3 +60,4 @@ dr. Evil~dr. Evil (1)Hello~(one) Hello ÀÁÂÃ check §- and ƛ, also ɧ~ÀÁÂÃ check section - and ƛ, also ɧ Hi it's 5pm,4A.M.?-34. Hi,no,yes,34! 12,again,4 and NO?17 and $.01,here & there--0.004kg~Hi it's five PM, four AM.? minus thirty four. Hi,no,yes, thirty four! twelve, again, four and NO? seventeen and one cent, here and there - minus zero point zero zero four kilograms +1°C.~one degree Celsius. 
diff --git a/tutorials/nlp/Non_English_Downstream_Tasks_(NER).ipynb b/tutorials/nlp/Non_English_Downstream_Tasks_(NER).ipynb index 443f0713a45c..bfa56e5a2567 100644 --- a/tutorials/nlp/Non_English_Downstream_Tasks_(NER).ipynb +++ b/tutorials/nlp/Non_English_Downstream_Tasks_(NER).ipynb @@ -31,7 +31,7 @@ "# If you're using Google Colab and not running locally, run this cell\n", "\n", "# install NeMo\n", - "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@${BRANCH}#egg=nemo_toolkit[nlp]" + "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[nlp]" ] }, { diff --git a/tutorials/tts/Fastpitch_Training_GermanTTS.ipynb b/tutorials/tts/Fastpitch_Training_GermanTTS.ipynb index 1c0ec8716701..a35fb9d94d77 100644 --- a/tutorials/tts/Fastpitch_Training_GermanTTS.ipynb +++ b/tutorials/tts/Fastpitch_Training_GermanTTS.ipynb @@ -2,7 +2,7 @@ "cells": [ { "cell_type": "markdown", - "id": "f5adc294", + "id": "9e23311d", "metadata": {}, "source": [ "# Modifying FastPitch to Train on a Non-English (German) Dataset\n", @@ -16,7 +16,7 @@ }, { "cell_type": "markdown", - "id": "fc18f4fb", + "id": "2e57b884", "metadata": {}, "source": [ "# License\n", @@ -39,7 +39,7 @@ { "cell_type": "code", "execution_count": null, - "id": "e27a7dc1", + "id": "392161ff", "metadata": {}, "outputs": [], "source": [ @@ -51,7 +51,7 @@ "3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n", "4. 
Run this cell to set up dependencies# .\n", "\"\"\"\n", - "BRANCH = 'tts_germanfastpitch'\n", + "BRANCH = 'main'\n", "# # If you're using Colab and not running locally, uncomment and run this cell.\n", "# !apt-get install sox libsndfile1 ffmpeg\n", "# !pip install wget unidecode pynini==2.1.4 scipy==1.7.3\n", @@ -61,7 +61,7 @@ { "cell_type": "code", "execution_count": null, - "id": "bf1cc79c", + "id": "d9a5e132", "metadata": {}, "outputs": [], "source": [ @@ -78,25 +78,28 @@ { "cell_type": "code", "execution_count": null, - "id": "0023bb0f", + "id": "c588ff4f", "metadata": {}, "outputs": [], "source": [ "# lets download the files we need to run this tutorial\n", "\n", - "!mkdir /NeMo\n", - "!cd /NeMo && wget https://raw.githubusercontent.com/nvidia/NeMo/$BRANCH/scripts/dataset_processing/tts/openslr/get_data.py\n", - "!cd /NeMo && wget https://raw.githubusercontent.com/nvidia/NeMo/$BRANCH/examples/tts/fastpitch.py\n", - "!cd /NeMo && wget https://raw.githubusercontent.com/nvidia/NeMo/$BRANCH/examples/tts/hifigan_finetune.py\n", - "!cd /NeMo && wget https://raw.githubusercontent.com/nvidia/NeMo/$BRANCH/scripts/dataset_processing/tts/extract_sup_data.py\n", - "!cd /NeMo && wget https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/tts/conf/de/fastpitch_align_22050.yaml\n", - "!cd /NeMo && wget https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/tts/conf/hifigan/hifigan.yaml\n", - "!cd /NeMo && wget https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/nemo_text_processing/text_normalization/de/data/whitelist.tsv" + "!mkdir NeMoGermanTTS\n", + "!cd NeMoGermanTTS && wget https://raw.githubusercontent.com/nvidia/NeMo/$BRANCH/scripts/dataset_processing/tts/openslr/get_data.py\n", + "!cd NeMoGermanTTS && wget https://raw.githubusercontent.com/nvidia/NeMo/$BRANCH/examples/tts/fastpitch.py\n", + "!cd NeMoGermanTTS && wget https://raw.githubusercontent.com/nvidia/NeMo/$BRANCH/examples/tts/hifigan_finetune.py\n", + "!cd NeMoGermanTTS && wget 
https://raw.githubusercontent.com/nvidia/NeMo/$BRANCH/scripts/dataset_processing/tts/extract_sup_data.py\n", + "!cd NeMoGermanTTS && wget https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/tts/conf/de/fastpitch_align_22050.yaml\n", + "!cd NeMoGermanTTS && wget https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/tts/conf/hifigan/hifigan.yaml\n", + "!cd NeMoGermanTTS && wget https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/nemo_text_processing/text_normalization/de/data/whitelist.tsv\n", + "!cd NeMoGermanTTS && mkdir -p model/train_ds && cd model/train_ds && wget https://raw.githubusercontent.com/nvidia/NeMo/$BRANCH/examples/tts/conf/hifigan/model/train_ds/train_ds_finetune.yaml\n", + "!cd NeMoGermanTTS && mkdir -p model/train_ds && cd model/validation_ds && wget https://raw.githubusercontent.com/nvidia/NeMo/$BRANCH/examples/tts/conf/hifigan/model/validation_ds/val_ds_finetune.yaml\n", + "!cd NeMoGermanTTS && mkdir -p model/generator && cd model/generator && wget https://raw.githubusercontent.com/nvidia/NeMo/$BRANCH/examples/tts/conf/hifigan/model/generator/v1.yaml" ] }, { "cell_type": "markdown", - "id": "0addf41a", + "id": "c3b37631", "metadata": {}, "source": [ "# Introduction" @@ -104,7 +107,7 @@ }, { "cell_type": "markdown", - "id": "c0c67aa7", + "id": "ba60fc45", "metadata": {}, "source": [ "### FastPitch\n", @@ -119,7 +122,7 @@ }, { "cell_type": "markdown", - "id": "b0fb61bc", + "id": "747a9f24", "metadata": {}, "source": [ "# Dataset Preparation" @@ -127,7 +130,7 @@ }, { "cell_type": "markdown", - "id": "dbc3a5ab", + "id": "e1cf7a8d", "metadata": {}, "source": [ "We will show example of preprocessing and training using OpenSLR's German Neutral TTS dataset ([link](https://www.openslr.org/95)). It is a free single german speaker dataset (> 23 hours) by Thorsten Müller (voice) and Dominik Kreutz (audio optimization) for tts training. \n", @@ -138,12 +141,12 @@ "3. Normalizing text\n", "4. Phonemization\n", "5. 
Creating dataset config\n", - "6. Creating suppplementary data" + "6. Creating supplementary data" ] }, { "cell_type": "markdown", - "id": "add9f955", + "id": "40e76a38", "metadata": {}, "source": [ "## 1. Downloading the dataset" @@ -152,12 +155,12 @@ { "cell_type": "code", "execution_count": null, - "id": "9982b65b", + "id": "36b8a1d3", "metadata": {}, "outputs": [], "source": [ - "!mkdir /Data && \\\n", - " cd /Data && \\\n", + "!mkdir DataGermanTTS && \\\n", + " cd DataGermanTTS && \\\n", " wget https://us.openslr.org/resources/95/thorsten-de_v02.tgz && \\\n", " tar -zxvf thorsten-de_v02.tgz" ] @@ -165,29 +168,29 @@ { "cell_type": "code", "execution_count": null, - "id": "e500cf5b", + "id": "6db032e5", "metadata": {}, "outputs": [], "source": [ - "# /Data directory looks like\n", - "!ls /Data -R" + "# DataGermanTTS directory looks like\n", + "!ls DataGermanTTS -R" ] }, { "cell_type": "markdown", - "id": "8dd1f324", + "id": "0e94204b", "metadata": {}, "source": [ "\n", "```bash\n", - "$ ls /Data -R\n", - "/Data:\n", + "$ ls DataGermanTTS -R\n", + "DataGermanTTS:\n", "thorsten-de thorsten-de_v02.tgz\n", "\n", - "/Data/thorsten-de:\n", + "DataGermanTTS/thorsten-de:\n", "metadata.csv metadata_shuf.csv metadata_train.csv metadata_val.csv wavs\n", "\n", - "/Data/thorsten-de/wavs:\n", + "DataGermanTTS/thorsten-de/wavs:\n", "00025a6fbea659dae6ece011e749aa34.wav 80689a91d5c8e32847ccbba2322e2122.wav\n", "000314280388fb390b3e70b69ee53a23.wav 8068cbcbe28085c15d2e8a8f7291d009.wav\n", "000624f768d7e282534a850980619fb2.wav 8071b84557c9a780d23414e241393f00.wav\n", @@ -199,36 +202,38 @@ }, { "cell_type": "markdown", - "id": "1be59a49", + "id": "c4f33db9", "metadata": {}, "source": [ "## 2. 
Creating manifests \n", "\n", - "We've created `scripts/dataset_processing/tts/openslr/get_data.py` script that reads the `/Data/thorsten-de/metadata.csv` provided with the dataset and generates the following fields per each datapoint:\n", + "We've created `scripts/dataset_processing/tts/openslr/get_data.py` script that reads the `DataGermanTTS/thorsten-de/metadata.csv` provided with the dataset and generates the following fields per each datapoint:\n", "1. `audio_filepath`: location of the wav file\n", "2. `duration`: duration of the wav file\n", "3. `text`: original text supplied by OpenSLR\n", " \n", - "After that, the script randomly splits the datapoints into 3 buckets, `train_manifest.json`, `val_manifest.json` and `test_manifest.json`. Example:" + "After that, the script randomly splits the datapoints into 3 buckets, `train_manifest.json`, `val_manifest.json` and `test_manifest.json`.\n", + "\n", + "Note: This step will take sometime to run for the entire dataset. If you are only interested in testing the scripts, please feel free to shorten the `DataGermanTTS/thorsten-de/metadata.csv` file to include only, say, top 100 records." ] }, { "cell_type": "code", "execution_count": null, - "id": "a887e246", + "id": "eb063cf4", "metadata": {}, "outputs": [], "source": [ - "!(cd /NeMo && \\\n", + "!(cd NeMoGermanTTS && \\\n", " python get_data.py \\\n", - " --data-root /Data/ \\\n", + " --data-root ../DataGermanTTS/ \\\n", " --val-size 0.1 \\\n", " --test-size 0.2)" ] }, { "cell_type": "markdown", - "id": "7b5925c5", + "id": "3811cc56", "metadata": {}, "source": [ "In the example above, 10% datapoints go to validation set, 20% go to test set and the remaining 70% go to training set." 
@@ -237,26 +242,26 @@ { "cell_type": "code", "execution_count": null, - "id": "d85c0161", + "id": "5c7b9430", "metadata": {}, "outputs": [], "source": [ - "# /Data directory looks like\n", - "!ls /Data -R" + "# DataGermanTTS directory looks like\n", + "!ls DataGermanTTS -R" ] }, { "cell_type": "markdown", - "id": "e8478373", + "id": "4d2dd715", "metadata": {}, "source": [ "```bash\n", - "$ ls /Data -R\n", - "/Data:\n", + "$ ls DataGermanTTS -R\n", + "DataGermanTTS:\n", "thorsten-de\n", "thorsten-de_v02.tgz\n", "\n", - "/Data/thorsten-de:\n", + "DataGermanTTS/thorsten-de:\n", "metadata.csv\n", "metadata_shuf.csv\n", "metadata_train.csv\n", @@ -266,7 +271,7 @@ "val_manifest.json\n", "wavs\n", "\n", - "/Data/thorsten-de/wavs:\n", + "DataGermanTTS/thorsten-de/wavs:\n", "00025a6fbea659dae6ece011e749aa34.wav\n", "000314280388fb390b3e70b69ee53a23.wav\n", "000624f768d7e282534a850980619fb2.wav\n", @@ -276,12 +281,12 @@ }, { "cell_type": "markdown", - "id": "11ac14a9", + "id": "2f6ea189", "metadata": {}, "source": [ "## 3. Normalizing text\n", "\n", - "The script above, i.e. `scripts/dataset_processing/tts/openslr/get_data.py`, also generates a another field per each datapoint:\n", + "The script above, i.e. `scripts/dataset_processing/tts/openslr/get_data.py`, also generates another field per each datapoint:\n", "- `normalized_text`: normalized text via NeMo's text normalizer:\n", " ```python\n", " nemo_text_processing.text_normalization.normalize.Normalizer(lang=\"de\", input_case=\"cased\", overwrite_cache=True, cache_dir=str(file_path / \"cache_dir\"))\n", @@ -289,30 +294,30 @@ " \n", "German language text normalizer (defined here: `nemo_text_processing/text_normalization/de`) was created using the tutorial shared under NeMo's `Grammar customization` documentation [here](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/text_normalization/wfst/wfst_text_normalization.html#grammar-customization). 
Here are some example records:\n", "```json\n", - "{\"audio_filepath\": \"/Data/thorsten-de/wavs/f1becc89cb4079a123ead68c9c8bb8ae.wav\", \"duration\": 7.250023, \"text\": \"Öffne den Webbrowser und rufe www.archlinux.org auf.\", \"normalized_text\": \"Öffne den Webbrowser und rufe w w w punkt a r c h l i n u x punkt o r g auf.\"}\n", + "{\"audio_filepath\": \"DataGermanTTS/thorsten-de/wavs/f1becc89cb4079a123ead68c9c8bb8ae.wav\", \"duration\": 7.250023, \"text\": \"Öffne den Webbrowser und rufe www.archlinux.org auf.\", \"normalized_text\": \"Öffne den Webbrowser und rufe w w w punkt a r c h l i n u x punkt o r g auf.\"}\n", "```\n", "Notice that the URL has been spelled out. \n", "\n", "In other cases, the normalized text may look the same as text, example:\n", "```json\n", - "{\"audio_filepath\": \"/Data/thorsten-de/wavs/e50eb02c25353f85549900d2fc1e0e32.wav\", \"duration\": 2.409977, \"text\": \"Geht die Schandtat auf sein Konto?\", \"normalized_text\": \"Geht die Schandtat auf sein Konto?\"}\n", + "{\"audio_filepath\": \"DataGermanTTS/thorsten-de/wavs/e50eb02c25353f85549900d2fc1e0e32.wav\", \"duration\": 2.409977, \"text\": \"Geht die Schandtat auf sein Konto?\", \"normalized_text\": \"Geht die Schandtat auf sein Konto?\"}\n", "```" ] }, { "cell_type": "markdown", - "id": "1ea88eae", + "id": "6c7f253e", "metadata": {}, "source": [ "## 4. Phonemization\n", "\n", - "The pronunciation of a word can be represented as a string of phones, which are speech sounds, each represented with symbols adapated from the Roman alphabet. The IPA is designed to represent those qualities of speech that are part of lexical (and to a limited extent prosodic) sounds in oral language: phones, phonemes, intonation and the separation of words and syllables. Training model with phonemes as well as text will help the model generate more accurate speech sounds." 
+ "The pronunciation of a word can be represented as a string of phones, which are speech sounds, each represented with symbols adapted from the Roman alphabet. The IPA is designed to represent those qualities of speech that are part of lexical (and to a limited extent prosodic) sounds in oral language: phones, phonemes, intonation and the separation of words and syllables. Training model with phonemes as well as text will help the model generate more accurate speech sounds." ] }, { "cell_type": "code", "execution_count": null, - "id": "00de9751", + "id": "b6c8dad0", "metadata": {}, "outputs": [], "source": [ @@ -322,7 +327,7 @@ }, { "cell_type": "markdown", - "id": "c2d57066", + "id": "f5a88926", "metadata": {}, "source": [ "The original dataset only contains text input, so, in order to add phonemes, we need to convert German text into phonemes using [bootphon/phonemizer](https://github.com/bootphon/phonemizer).\n", @@ -333,7 +338,7 @@ { "cell_type": "code", "execution_count": null, - "id": "5ce78633", + "id": "2ffe9bb6", "metadata": {}, "outputs": [], "source": [ @@ -342,7 +347,7 @@ }, { "cell_type": "markdown", - "id": "2d416091", + "id": "f6a79794", "metadata": {}, "source": [ "Alternatively, you can use phonemizer via docker container:\n", @@ -350,7 +355,7 @@ "git clone https://github.com/bootphon/phonemizer\n", "cd phonemizer\n", "docker build -t phonemizer .\n", - "docker run --rm -d -it -p 8888:8888 -v /Data:/Data --ipc=host phonemizer /bin/bash\n", + "docker run --rm -d -it -p 8888:8888 -v DataGermanTTS:DataGermanTTS --ipc=host phonemizer /bin/bash\n", "docker exec -it /bin/bash\n", "```\n", "\n", @@ -360,7 +365,7 @@ { "cell_type": "code", "execution_count": null, - "id": "ec4f0de7", + "id": "131ce5d0", "metadata": {}, "outputs": [], "source": [ @@ -369,7 +374,9 @@ "\n", "backend = EspeakBackend('de')\n", "\n", - "input_manifest_filepaths = [\"/Data/thorsten-de/train_manifest\", \"/Data/thorsten-de/test_manifest\", \"/Data/thorsten-de/val_manifest\"]\n", + 
"input_manifest_filepaths = [\"DataGermanTTS/thorsten-de/train_manifest\", \\\n", + " \"DataGermanTTS/thorsten-de/test_manifest\", \\\n", + " \"DataGermanTTS/thorsten-de/val_manifest\"]\n", "\n", "for input_manifest_filepath in input_manifest_filepaths:\n", " output_manifest_filepath = input_manifest_filepath+\"_phonemes\"\n", @@ -399,29 +406,29 @@ }, { "cell_type": "markdown", - "id": "f5994604", + "id": "e0e4a206", "metadata": {}, "source": [ "To better understand the phonemize method, refer to the docs [here](https://github.com/bootphon/phonemizer/blob/master/phonemizer/backend/base.py#L137).\n", "\n", - "Run the above script for train, test and val records, resulting in `train_phonemes_manifest.json`, `test_phonemes_manifest.json` and `val_phonemes_manifest.json` respectively.\n", + "Run the above script for train, test and val records, resulting in `train_manifest_phonemes.json`, `test_manifest_phonemes.json` and `val_manifest_phonemes.json` respectively.\n", "\n", "We are effectively doubling the size of our dataset. 
Each original record maps on to two records, one with original `normalized_text` field value and `is_phoneme` set to 0 and another with phonemized text and `is_phoneme` flag set to 1.\n", "\n", "Example of input record:\n", "```json\n", - "{\"audio_filepath\": \"/Data/thorsten-de/wavs/e50eb02c25353f85549900d2fc1e0e32.wav\", \"duration\": 2.409977, \"text\": \"Geht die Schandtat auf sein Konto?\", \"normalized_text\": \"Geht die Schandtat auf sein Konto?\"}\n", + "{\"audio_filepath\": \"DataGermanTTS/thorsten-de/wavs/e50eb02c25353f85549900d2fc1e0e32.wav\", \"duration\": 2.409977, \"text\": \"Geht die Schandtat auf sein Konto?\", \"normalized_text\": \"Geht die Schandtat auf sein Konto?\"}\n", "```\n", "And corresponding output records:\n", "```json\n", - "{\"audio_filepath\": \"/Data/thorsten-de/wavs/e50eb02c25353f85549900d2fc1e0e32.wav\", \"duration\": 2.409977, \"text\": \"Geht die Schandtat auf sein Konto?\", \"normalized_text\": \"Geht die Schandtat auf sein Konto?\", \"is_phoneme\": 0}\n", - "{\"audio_filepath\": \"/Data/thorsten-de/wavs/e50eb02c25353f85549900d2fc1e0e32.wav\", \"duration\": 2.409977, \"text\": \"Geht die Schandtat auf sein Konto?\", \"normalized_text\": \"\\u0261e\\u02d0t di\\u02d0 \\u0283ant\\u0251\\u02d0t a\\u028af za\\u026an k\\u0254nto\\u02d0 \", \"is_phoneme\": 1}\n", + "{\"audio_filepath\": \"DataGermanTTS/thorsten-de/wavs/e50eb02c25353f85549900d2fc1e0e32.wav\", \"duration\": 2.409977, \"text\": \"Geht die Schandtat auf sein Konto?\", \"normalized_text\": \"Geht die Schandtat auf sein Konto?\", \"is_phoneme\": 0}\n", + "{\"audio_filepath\": \"DataGermanTTS/thorsten-de/wavs/e50eb02c25353f85549900d2fc1e0e32.wav\", \"duration\": 2.409977, \"text\": \"Geht die Schandtat auf sein Konto?\", \"normalized_text\": \"\\u0261e\\u02d0t di\\u02d0 \\u0283ant\\u0251\\u02d0t a\\u028af za\\u026an k\\u0254nto\\u02d0 \", \"is_phoneme\": 1}\n", "```" ] }, { "cell_type": "markdown", - "id": "3f2a97df", + "id": "18578da2", "metadata": {}, "source": [ "## 5. 
Creating dataset config\n", @@ -441,7 +448,7 @@ "manifest_filepath: \"train_manifest.json\"\n", "sup_data_path: \"sup_data\"\n", "sup_data_types: [ \"align_prior_matrix\", \"pitch\" ]\n", - "whitelist_path: \"/NeMo/whitelist.tsv\"\n", + "whitelist_path: \"NeMoGermanTTS/whitelist.tsv\"\n", "\n", "dataset:\n", " _target_: nemo.collections.tts.torch.data.TTSDataset\n", @@ -482,12 +489,12 @@ " phonemes: true\n", "```\n", "\n", - "Save the above config in `/NeMo/ds_for_fastpitch_align.yaml`." + "Save the above config in `NeMoGermanTTS/ds_for_fastpitch_align.yaml`." ] }, { "cell_type": "markdown", - "id": "20d3ea4b", + "id": "b515c5b4", "metadata": {}, "source": [ "## 6. Creating Supplementary Data\n", @@ -500,31 +507,30 @@ { "cell_type": "code", "execution_count": null, - "id": "a9bb1915", + "id": "114dabfc", "metadata": {}, "outputs": [], "source": [ - "!cd /NeMo/ && \\\n", - " python extract_sup_data.py \\\n", - " --config-path /NeMo \\\n", + "!python NeMoGermanTTS/extract_sup_data.py \\\n", + " --config-path . \\\n", " --config-name ds_for_fastpitch_align.yaml \\\n", - " manifest_filepath=/Data/thorsten-de/train_manifest_phonemes.json \\\n", - " sup_data_path=/Data/thorsten-de/phonemes/" + " manifest_filepath=DataGermanTTS/thorsten-de/train_manifest_phonemes.json \\\n", + " sup_data_path=DataGermanTTS/thorsten-de/phonemes/" ] }, { "cell_type": "markdown", - "id": "b7225af2", + "id": "adfba0f9", "metadata": {}, "source": [ "The above example gives the following result:\n", "1. Creates two folders under `sup_data_path` - `pitch` and `align_prior_matrix`\n", - "2. Prints out `PITCH_MEAN, PITCH_STD = 132.524658203125, 37.389366149902344`" + "2. Prints out some values for pitch mean and standard deviation: `PITCH_MEAN, PITCH_STD = 132.524658203125, 37.389366149902344`. Use these values while training FastPitch." 
] }, { "cell_type": "markdown", - "id": "692ac466", + "id": "3278f1ee", "metadata": {}, "source": [ "# Training" @@ -532,7 +538,7 @@ }, { "cell_type": "markdown", - "id": "d552f7ab", + "id": "1ef91842", "metadata": {}, "source": [ "Before we train our model, let's define model config. Most of the model config stays the same as defined here: `examples/tts/conf/fastpitch_align_44100.yaml`, except:\n", @@ -542,22 +548,22 @@ "\n", "3. The `sample_rate` is updated to 22050 KHz per our dataset. And accordingly halve the `n_window_size`, `n_window_stride` and `n_fft` parameters as well. \n", "\n", - "We have already downloaded the config after making these changes here: `/NeMo/fastpitch_align_22050.yaml`" + "We have already downloaded the config after making these changes here: `NeMoGermanTTS/fastpitch_align_22050.yaml`" ] }, { "cell_type": "code", "execution_count": null, - "id": "f21fc784", + "id": "e7f2373e", "metadata": {}, "outputs": [], "source": [ - "!cat /NeMo/fastpitch_align_22050.yaml" + "!cat NeMoGermanTTS/fastpitch_align_22050.yaml" ] }, { "cell_type": "markdown", - "id": "96edcb13", + "id": "1de1cf64", "metadata": {}, "source": [ "If you are using Weights and Biases, you may need to login first. More details [here](https://docs.wandb.ai/ref/cli/wandb-login)." @@ -566,51 +572,47 @@ { "cell_type": "code", "execution_count": null, - "id": "bb23fba3", + "id": "c60ed72a", "metadata": {}, "outputs": [], "source": [ - "wandb_api_key = \"apikey\"\n", - "wandb_project_name = \"GermanTTS\"\n", - "wandb_run_name = \"tutorial\"\n", - "\n", - "!wandb login ${wandb_api_key}" + "!wandb login #paste_wandb_apikey_here" ] }, { "cell_type": "markdown", - "id": "48e16fed", + "id": "51ec332e", "metadata": {}, "source": [ - "Now we are ready for training our model! Let's try to train FastPitch." + "Now we are ready for training our model! Let's try to train FastPitch. Paste the PITCH_MEAN and PITCH_STD from previous steps here." 
] }, { "cell_type": "code", "execution_count": null, - "id": "1b37ec68", + "id": "4ad43d30", "metadata": {}, "outputs": [], "source": [ - "!(cd /NeMo && CUDA_VISIBLE_DEVICES=0 python fastpitch.py --config-path /NeMo --config-name fastpitch_align_22050 \\\n", + "!(cd NeMoGermanTTS && CUDA_VISIBLE_DEVICES=0 python fastpitch.py --config-path . --config-name fastpitch_align_22050 \\\n", " model.train_ds.dataloader_params.batch_size=32 \\\n", " model.validation_ds.dataloader_params.batch_size=32 \\\n", - " train_dataset=/Data/thorsten-de/train_manifest_phonemes.json \\\n", - " validation_datasets=/Data/thorsten-de/val_manifest_phonemes.json \\\n", - " sup_data_path=/Data/thorsten-de/phonemes/ \\\n", - " whitelist_path=/NeMo/whitelist.tsv \\\n", - " exp_manager.exp_dir=/result \\\n", + " train_dataset=../DataGermanTTS/thorsten-de/train_manifest_phonemes.json \\\n", + " validation_datasets=../DataGermanTTS/thorsten-de/val_manifest_phonemes.json \\\n", + " sup_data_path=../DataGermanTTS/thorsten-de/phonemes/ \\\n", + " whitelist_path=./whitelist.tsv \\\n", + " exp_manager.exp_dir=resultGermanTTS \\\n", " trainer.max_epochs=1 \\\n", - " pitch_mean=132.524658203125 \\\n", - " pitch_std=37.389366149902344 \\\n", + " pitch_mean=#paste_pitch_mean_here \\\n", + " pitch_std=#paste_pitch_std_here \\\n", " +exp_manager.create_wandb_logger=true \\\n", - " +exp_manager.wandb_logger_kwargs.name=${wandb_run_name} \\\n", - " +exp_manager.wandb_logger_kwargs.project=${wandb_project_name})" + " +exp_manager.wandb_logger_kwargs.name=\"tutorial\" \\\n", + " +exp_manager.wandb_logger_kwargs.project=\"GermanTTS\")" ] }, { "cell_type": "markdown", - "id": "e422a61c", + "id": "b8082cfc", "metadata": {}, "source": [ "Note:\n", @@ -622,7 +624,7 @@ }, { "cell_type": "markdown", - "id": "02c5bea2", + "id": "7a36f955", "metadata": {}, "source": [ "## Evaluating FastPitch + pretrained HiFi-GAN\n", @@ -633,7 +635,20 @@ { "cell_type": "code", "execution_count": null, - "id": "bb7ae759", + "id": 
"d70d5f7d", + "metadata": {}, + "outputs": [], + "source": [ + "import IPython.display as ipd\n", + "from nemo.collections.tts.models import HifiGanModel, FastPitchModel\n", + "from matplotlib.pyplot import imshow\n", + "from matplotlib import pyplot as plt" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4e9ee07a", "metadata": {}, "outputs": [], "source": [ @@ -641,13 +656,14 @@ "fastpitch_model_path = \"\" # from the results directory\n", "test = \"Diese Musiksammlung soll die Vielfalt des Lebens widerspiegeln.\" # text input to the model\n", "test_id = \"877d9f668a877713b48735f282af62ca\" # identifier for the audio corresponding to the test text\n", - "data_path = \"/Data/thorsten-de/wavs/\" # path to dataset folder with wav files from original dataset" + "data_path = \"DataGermanTTS/thorsten-de/wavs/\" # path to dataset folder with wav files from original dataset\n", + "seed = 1234" ] }, { "cell_type": "code", "execution_count": null, - "id": "d409870c", + "id": "32a234f4", "metadata": {}, "outputs": [], "source": [ @@ -671,7 +687,7 @@ { "cell_type": "code", "execution_count": null, - "id": "c5c60fc2", + "id": "9a3f5eaa", "metadata": {}, "outputs": [], "source": [ @@ -680,13 +696,13 @@ "if \".nemo\" in fastpitch_model_path:\n", " spec_gen_model = FastPitchModel.restore_from(fastpitch_model_path).eval().cuda()\n", "else:\n", - " FastPitchModel.load_from_checkpoint(checkpoint_path=fastpitch_model_path).eval().cuda()" + " spec_gen_model = FastPitchModel.load_from_checkpoint(checkpoint_path=fastpitch_model_path).eval().cuda()" ] }, { "cell_type": "code", "execution_count": null, - "id": "6695665f", + "id": "de04b514", "metadata": {}, "outputs": [], "source": [ @@ -708,15 +724,15 @@ }, { "cell_type": "markdown", - "id": "98782fb8", + "id": "e57e9baf", "metadata": {}, "source": [ - "We see that audio quality is not as good as we expect. 
One of the ways mentioned in the [FastPitch_Finetuning.ipynb](FastPitch_Finetuning.ipynb) tutorial is to finetune HiFi-GAN. Lets try that out next!" + "We see that audio quality is not as good as we expect, even after training FastPitch for 1000 epochs. One of the ways mentioned in the [FastPitch_Finetuning.ipynb](FastPitch_Finetuning.ipynb) tutorial is to finetune HiFi-GAN. Let's try that out next!" ] }, { "cell_type": "markdown", - "id": "4e33d57b", + "id": "e6d62337", "metadata": {}, "source": [ "# Finetuning HiFi-GAN\n", @@ -726,7 +742,7 @@ }, { "cell_type": "markdown", - "id": "473cbfb6", + "id": "4efcf84d", "metadata": {}, "source": [ "## Generating synthetic mels\n", @@ -736,20 +752,20 @@ }, { "cell_type": "code", - "execution_count": 7, - "id": "14a34ba9", + "execution_count": null, + "id": "59d12874", "metadata": {}, "outputs": [], "source": [ - "test_audio_filepath = \"/Data/thorsten-de/wavs/5d000c81c8e7c4817cbfd7c4b8738feb.wav\"\n", + "test_audio_filepath = \"DataGermanTTS/thorsten-de/wavs/5d000c81c8e7c4817cbfd7c4b8738feb.wav\"\n", "test_audio_text = \"Dieser Geruch, wenn jemand eine Clementine \\u00f6ffnet!\"\n", "fastpitch_model_path = \"\"" ] }, { "cell_type": "code", - "execution_count": 12, - "id": "dfa99f1e", + "execution_count": null, + "id": "9395abc3", "metadata": {}, "outputs": [], "source": [ @@ -782,7 +798,7 @@ { "cell_type": "code", "execution_count": null, - "id": "14988711", + "id": "1679a9f9", "metadata": {}, "outputs": [], "source": [ @@ -791,7 +807,7 @@ }, { "cell_type": "markdown", - "id": "046e7e7c", + "id": "4a2cb665", "metadata": {}, "source": [ "So we have 2 types of mel spectrograms that we can use for finetuning HiFi-GAN:\n", @@ -801,31 +817,10 @@ }, { "cell_type": "code", - "execution_count": 15, - "id": "3ac663e7", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "loading original melspec\n", - "spectrogram shape = (80, 315)\n" - ] - }, - { - "data": { - "image/png": "\n", - 
"text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], + "execution_count": null, + "id": "cb1ec7d4", + "metadata": {}, + "outputs": [], "source": [ "print(\"loading original melspec\")\n", "y, sr = librosa.load(test_audio_filepath)\n", @@ -839,7 +834,7 @@ }, { "cell_type": "markdown", - "id": "e67c3f3a", + "id": "a06ab269", "metadata": {}, "source": [ "### 2. Mel spectrogram predicted from FastPitch" @@ -847,31 +842,10 @@ }, { "cell_type": "code", - "execution_count": 17, - "id": "eb8fd365", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "loading fastpitch melspec via generate_spectrogram\n", - "spectrogram shape = (80, 291)\n" - ] - }, - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], + "execution_count": null, + "id": "2ba1586a", + "metadata": {}, + "outputs": [], "source": [ "print(\"loading fastpitch melspec via generate_spectrogram\")\n", "with torch.no_grad():\n", @@ -888,7 +862,7 @@ }, { "cell_type": "markdown", - "id": "354fbd71", + "id": "bf1639d4", "metadata": {}, "source": [ "Note: The above spectrogram has the duration 291 which is not equal to the ground truth length, i.e. 315. In order to finetune HiFi-GAN we need mel spectrogram predicted from FastPitch with groundtruth alignment and duration.\n", @@ -898,31 +872,10 @@ }, { "cell_type": "code", - "execution_count": 20, - "id": "f2189d94", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "loading fastpitch melspec via forward method with groundtruth alignment and duration\n", - "spectrogram shape = (80, 315)\n" - ] - }, - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], + "execution_count": null, + "id": "f9fde5e7", + "metadata": {}, + "outputs": [], "source": [ "print(\"loading fastpitch melspec via forward method with groundtruth alignment and duration\")\n", "with torch.no_grad():\n", @@ -953,7 +906,7 @@ }, { "cell_type": "markdown", - "id": "114e3c89", + "id": "f49b70b9", "metadata": {}, "source": [ "In our experience, \n", @@ -966,7 +919,7 @@ { "cell_type": "code", "execution_count": null, - "id": "5686fc76", + "id": "4d2c5e5d", "metadata": {}, "outputs": [], "source": [ @@ -982,8 +935,8 @@ "\n", "folder_name = \"synmels\"\n", "fastpitch_model_path = \"\"\n", - "dataset_part = \"test_phonemes\" # or \"val_phonemes\", \"train_phonemes\"\n", - "dataset_base_path = \"/Data/\"\n", + "dataset_parts = [\"test_manifest_phonemes\", \"val_manifest_phonemes\", \"train_manifest_phonemes\"]\n", + "dataset_base_path = \"DataGermanTTS/\"\n", "\n", "from nemo.collections.tts.models import FastPitchModel\n", "if \".nemo\" in fastpitch_model_path:\n", @@ -998,74 +951,75 @@ " samples = f.read(dtype='float32')\n", " return samples.transpose()\n", " \n", - "# Get records from the manifest\n", - "manifest_path = dataset_base_path+\"thorsten-de/\"+dataset_part+\"_manifest.json\"\n", - "records = []\n", - "with open(manifest_path, \"r\") as f:\n", - " for i, line in enumerate(f):\n", - " records.append(json.loads(line))\n", + "for dataset_part in dataset_parts:\n", + " # Get records from the manifest\n", + " manifest_path = f\"{dataset_base_path}thorsten-de/{dataset_part}.json\"\n", + " records = []\n", + " with open(manifest_path, \"r\") as f:\n", + " for i, line in enumerate(f):\n", + " records.append(json.loads(line))\n", "\n", - "beta_binomial_interpolator = BetaBinomialInterpolator()\n", + " beta_binomial_interpolator = BetaBinomialInterpolator()\n", "\n", - "spec_model.eval()\n", - "device = spec_model.device\n", + " spec_model.eval()\n", + 
" device = spec_model.device\n", "\n", - "save_dir = Path(dataset_base_path+folder_name+\"/\"+dataset_part)\n", + " save_dir = Path(f\"{dataset_base_path}{folder_name}/{dataset_part}\")\n", "\n", - "save_dir.mkdir(exist_ok=True, parents=True)\n", + " save_dir.mkdir(exist_ok=True, parents=True)\n", "\n", - "# Generate a spectrograms (we need to use ground truth alignment for correct matching between audio and mels)\n", - "for i, r in enumerate(records):\n", - " audio = load_wav(r[\"audio_filepath\"])\n", + " # Generate a spectrograms (we need to use ground truth alignment for correct matching between audio and mels)\n", + " for i, r in enumerate(records):\n", + " audio = load_wav(r[\"audio_filepath\"])\n", "\n", - " audio = torch.from_numpy(audio).unsqueeze(0).to(device)\n", - " audio_len = torch.tensor(audio.shape[1], dtype=torch.long, device=device).unsqueeze(0)\n", - "\n", - " # Again, our finetuned FastPitch model doesn't use multiple speakers,\n", - " # but we keep the code to support it here for reference\n", - " if spec_model.fastpitch.speaker_emb is not None and \"speaker\" in r:\n", - " speaker = torch.tensor([r['speaker']]).to(device)\n", - " else:\n", - " speaker = None\n", + " audio = torch.from_numpy(audio).unsqueeze(0).to(device)\n", + " audio_len = torch.tensor(audio.shape[1], dtype=torch.long, device=device).unsqueeze(0)\n", "\n", - " with torch.no_grad():\n", - " if \"normalized_text\" in r:\n", - " text = spec_model.parse(r[\"normalized_text\"], normalize=False)\n", + " # Again, our finetuned FastPitch model doesn't use multiple speakers,\n", + " # but we keep the code to support it here for reference\n", + " if spec_model.fastpitch.speaker_emb is not None and \"speaker\" in r:\n", + " speaker = torch.tensor([r['speaker']]).to(device)\n", " else:\n", - " text = spec_model.parse(r['text'])\n", + " speaker = None\n", "\n", - " text_len = torch.tensor(text.shape[-1], dtype=torch.long, device=device).unsqueeze(0)\n", + " with torch.no_grad():\n", + " 
if \"normalized_text\" in r:\n", + " text = spec_model.parse(r[\"normalized_text\"], normalize=False)\n", + " else:\n", + " text = spec_model.parse(r['text'])\n", "\n", - " spect, spect_len = spec_model.preprocessor(input_signal=audio, length=audio_len)\n", + " text_len = torch.tensor(text.shape[-1], dtype=torch.long, device=device).unsqueeze(0)\n", "\n", - " # Generate attention prior and spectrogram inputs for HiFi-GAN\n", - " attn_prior = torch.from_numpy(\n", - " beta_binomial_interpolator(spect_len.item(), text_len.item())\n", - " ).unsqueeze(0).to(text.device)\n", - " \n", - " spectrogram = spec_model.forward(\n", - " text=text, \n", - " input_lens=text_len, \n", - " spec=spect, \n", - " mel_lens=spect_len, \n", - " attn_prior=attn_prior,\n", - " speaker=speaker,\n", - " )[0]\n", + " spect, spect_len = spec_model.preprocessor(input_signal=audio, length=audio_len)\n", + "\n", + " # Generate attention prior and spectrogram inputs for HiFi-GAN\n", + " attn_prior = torch.from_numpy(\n", + " beta_binomial_interpolator(spect_len.item(), text_len.item())\n", + " ).unsqueeze(0).to(text.device)\n", + "\n", + " spectrogram = spec_model.forward(\n", + " text=text, \n", + " input_lens=text_len, \n", + " spec=spect, \n", + " mel_lens=spect_len, \n", + " attn_prior=attn_prior,\n", + " speaker=speaker,\n", + " )[0]\n", "\n", - " save_path = save_dir / f\"mel_{i}.npy\"\n", - " np.save(save_path, spectrogram[0].to('cpu').numpy())\n", - " r[\"mel_filepath\"] = str(save_path)\n", + " save_path = save_dir / f\"mel_{i}.npy\"\n", + " np.save(save_path, spectrogram[0].to('cpu').numpy())\n", + " r[\"mel_filepath\"] = str(save_path)\n", "\n", - "hifigan_manifest_path = dataset_base_path+folder_name+\"/hifigan_\"+dataset_part+\"_ft.json\"\n", + " hifigan_manifest_path = f\"{dataset_base_path}{folder_name}/hifigan_{dataset_part}_ft.json\"\n", "\n", - "with open(hifigan_manifest_path, \"w\") as f:\n", - " for r in records:\n", - " f.write(json.dumps(r) + '\\n')" + " with 
open(hifigan_manifest_path, \"w\") as f:\n", + " for r in records:\n", + " f.write(json.dumps(r) + '\\n')" ] }, { "cell_type": "markdown", - "id": "4e4f65cb", + "id": "371506df", "metadata": {}, "source": [ "Revisiting how we implement #2.1 (i.e. Predicted mel spectrogram predicted from FastPitch with groundtruth alignment and duration):\n", @@ -1094,27 +1048,27 @@ " \n", "Repeat the above script for train and validation datasets as well. \n", "\n", - "Finally, the `/Data/synmels` will look like:\n", + "Finally, the `DataGermanTTS/synmels` will look like:\n", "```\n", - "/Data/synmels/:\n", - "hifigan_test_ft.json\n", - "hifigan_train_ft.json\n", - "hifigan_val_ft.json\n", - "test\n", - "train\n", - "val\n", - "\n", - "/Data/synmels/test:\n", + "DataGermanTTS/synmels/:\n", + "hifigan_test_manifest_phonemes_ft.json\n", + "hifigan_train_manifest_phonemes_ft.json\n", + "hifigan_val_manifest_phonemes_ft.json\n", + "test_manifest_phonemes\n", + "train_manifest_phonemes\n", + "val_manifest_phonemes\n", + "\n", + "DataGermanTTS/synmels/test_manifest_phonemes:\n", "mel_0.npy\n", "mel_1.npy\n", "...\n", "\n", - "/Data/synmels/train:\n", + "DataGermanTTS/synmels/train_manifest_phonemes:\n", "mel_0.npy\n", "mel_1.npy\n", "...\n", "\n", - "/Data/synmels/val:\n", + "DataGermanTTS/synmels/val_manifest_phonemes:\n", "mel_0.npy\n", "mel_1.npy\n", "...\n", @@ -1122,13 +1076,13 @@ "\n", "Example HiFi-GAN manifest:\n", "```json\n", - "{\"audio_filepath\": \"/Data/thorsten-de/wavs/e50eb02c25353f85549900d2fc1e0e32.wav\", \"duration\": 2.409977, \"text\": \"Geht die Schandtat auf sein Konto?\", \"normalized_text\": \"Geht die Schandtat auf sein Konto?\", \"mel_filepath\": \"/Data/synmels/test/mel_0.npy\"}\n", + "{\"audio_filepath\": \"DataGermanTTS/thorsten-de/wavs/e50eb02c25353f85549900d2fc1e0e32.wav\", \"duration\": 2.409977, \"text\": \"Geht die Schandtat auf sein Konto?\", \"normalized_text\": \"Geht die Schandtat auf sein Konto?\", \"mel_filepath\": 
\"DataGermanTTS/synmels/test_manifest_phonemes/mel_0.npy\"}\n", "```" ] }, { "cell_type": "markdown", - "id": "33541413", + "id": "565e07a8", "metadata": {}, "source": [ "## Launch finetuning\n", @@ -1139,37 +1093,18 @@ { "cell_type": "code", "execution_count": null, - "id": "2deb96ae", + "id": "fa06c0b1", "metadata": {}, "outputs": [], "source": [ - "!(cd /Data && \\\n", + "!(cd DataGermanTTS && \\\n", " wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_hifigan/versions/1.0.0rc1/zip -O tts_hifigan_1.0.0rc1.zip && \\\n", " unzip tts_hifigan_1.0.0rc1.zip)" ] }, { "cell_type": "markdown", - "id": "5c324ba6", - "metadata": {}, - "source": [ - "Setting up wandb" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7f7a12f6", - "metadata": {}, - "outputs": [], - "source": [ - "wandb_project_name = \"GermanTTS\"\n", - "wandb_run_name = \"tutorial_2\"" - ] - }, - { - "cell_type": "markdown", - "id": "ddd4623b", + "id": "fc7e8554", "metadata": {}, "source": [ "We will be re-using the existing HiFi-GAN config and HiFi-GAN pretrained on English." @@ -1178,30 +1113,29 @@ { "cell_type": "code", "execution_count": null, - "id": "ca4a5291", + "id": "73b4cbc2", "metadata": {}, "outputs": [], "source": [ - "!(cd /NeMo && \\\n", - " python hifigan_finetune.py --config-path /NeMo --config-name hifigan.yaml \\\n", + "!(python NeMoGermanTTS/hifigan_finetune.py --config-path . 
--config-name hifigan.yaml \\\n", " model.max_steps=10 \\\n", " model.optim.lr=0.00001 \\\n", - " model.train_ds=train_ds_finetune \\\n", - " model.validation_ds=val_ds_finetune \\\n", " ~model.optim.sched \\\n", - " train_dataset=/Data/synmels/hifigan_train_phonemes_ft.json \\\n", - " validation_datasets=/Data/synmels/hifigan_val_phonemes_ft.json \\\n", - " exp_manager.exp_dir=/result \\\n", - " +init_from_nemo_model=/Data/tts_hifigan.nemo \\\n", + " train_dataset=DataGermanTTS/synmels/hifigan_train_manifest_phonemes_ft.json \\\n", + " validation_datasets=DataGermanTTS/synmels/hifigan_val_manifest_phonemes_ft.json \\\n", + " exp_manager.exp_dir=resultGermanTTS \\\n", + " +init_from_nemo_model=DataGermanTTS/tts_hifigan.nemo \\\n", " trainer.devices=-1 \\\n", + " model/train_ds=train_ds_finetune \\\n", + " model/validation_ds=val_ds_finetune \\\n", " exp_manager.create_wandb_logger=true \\\n", - " exp_manager.wandb_logger_kwargs.name=${wandb_run_name} \\\n", - " exp_manager.wandb_logger_kwargs.project=${wandb_project_name})" + " exp_manager.wandb_logger_kwargs.name=\"tutorial_2\" \\\n", + " exp_manager.wandb_logger_kwargs.project=\"GermanTTS\")" ] }, { "cell_type": "markdown", - "id": "3cdafb08", + "id": "c8721ea1", "metadata": {}, "source": [ "Note: We've limited the above run to 10 steps only, so we can validate the implementation within the scope of this tutorial. We recommend evaluating around every 50 steps HiFi-GAN until you get desired quality results." 
@@ -1209,7 +1143,7 @@ }, { "cell_type": "markdown", - "id": "3d69751f", + "id": "a3b04d69", "metadata": {}, "source": [ "## Evaluating FastPitch and Finetuned HiFi-GAN\n", @@ -1220,22 +1154,28 @@ { "cell_type": "code", "execution_count": null, - "id": "8abcc589", + "id": "a62d54cf", "metadata": {}, "outputs": [], "source": [ "hfg_path = \"\"\n", + "fastpitch_model_path = \"\"\n", "\n", "if \".nemo\" in hfg_path:\n", " vocoder_model_pt = HifiGanModel.restore_from(hfg_path).eval().cuda()\n", "else:\n", - " vocoder_model_pt = HifiGanModel.load_from_checkpoint(checkpoint_path=hfg_path).eval().cuda()" + " vocoder_model_pt = HifiGanModel.load_from_checkpoint(checkpoint_path=hfg_path).eval().cuda()\n", + " \n", + "if \".nemo\" in fastpitch_model_path:\n", + " spec_gen_model = FastPitchModel.restore_from(fastpitch_model_path).eval().cuda()\n", + "else:\n", + " spec_gen_model = FastPitchModel.load_from_checkpoint(checkpoint_path=fastpitch_model_path).eval().cuda()" ] }, { "cell_type": "code", "execution_count": null, - "id": "9b196494", + "id": "94c2b645", "metadata": {}, "outputs": [], "source": [ @@ -1257,7 +1197,7 @@ }, { "cell_type": "markdown", - "id": "5cb8f4d8", + "id": "dbe10199", "metadata": {}, "source": [ "That's it!" @@ -1280,7 +1220,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.8.12" + "version": "3.8.13" } }, "nbformat": 4,