Unable to start xtts v2 training process. #3303

Closed
arbianqx opened this issue Nov 24, 2023 · 11 comments

@arbianqx

Describe the bug

I have prepared my own dataset in LJSpeech format. I tried to start training based on the recipe but was unable to. I suspect this happens because my dataset's language is not in the list supported by xtts v2. I get the following error:
AssertionError: ❗ len(DataLoader) returns 0. Make sure your dataset is not empty or len(dataset) > 0.
The same dataset can be used with other training scripts/approaches, such as vits or yourtts.

To Reproduce

Run training script with another language dataset!

Expected behavior

Training should start.

Logs

> EPOCH: 0/1000
 --> /TTS/run/training/GPT_XTTS_v2.0_LJSpeech_FT-November-24-2023_05+18PM-990b209
 > Filtering invalid eval samples!!
[!] Warning: The text length exceeds the character limit of 250 for language 'sq', this might cause truncated audio.
[!] Warning: The text length exceeds the character limit of 250 for language 'sq', this might cause truncated audio.
 > Total eval samples after filtering: 0
 ! Run is removed from /TTS/run/training/GPT_XTTS_v2.0_LJSpeech_FT-November-24-2023_05+18PM-990b209
Traceback (most recent call last):
  File "tts/lib/python3.10/site-packages/trainer/trainer.py", line 1826, in fit
    self._fit()
  File "tts/lib/python3.10/site-packages/trainer/trainer.py", line 1780, in _fit
    self.eval_epoch()
  File "tts/lib/python3.10/site-packages/trainer/trainer.py", line 1628, in eval_epoch
    self.get_eval_dataloader(
  File "tts/lib/python3.10/site-packages/trainer/trainer.py", line 990, in get_eval_dataloader
    return self._get_loader(
  File "tts/lib/python3.10/site-packages/trainer/trainer.py", line 914, in _get_loader
    len(loader) > 0
AssertionError:  ❗ len(DataLoader) returns 0. Make sure your dataset is not empty or len(dataset) > 0.

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA A100-PCIE-40GB",
            "NVIDIA A100-PCIE-40GB",
            "NVIDIA A100-PCIE-40GB",
            "NVIDIA A100-PCIE-40GB",
            "NVIDIA A100-PCIE-40GB",
            "NVIDIA A100-PCIE-40GB",
            "NVIDIA A100-PCIE-40GB",
            "NVIDIA A100-PCIE-40GB"
        ],
        "available": true,
        "version": "12.1"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.1.1+cu121",
        "TTS": "0.21.1",
        "numpy": "1.22.0"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.10.8",
        "version": "#202212290932~1674066459~20.04~3cd2bf3-Ubuntu SMP PREEMPT_DYNAMI"
    }
}

Additional context

No response

@arbianqx arbianqx added the bug Something isn't working label Nov 24, 2023
@Okohedeki

Okohedeki commented Nov 24, 2023

I saw this issue yesterday. Which dataset format are you using? My issue was that the ljspeech format expects a pipe-delimited metadata.csv, and mine was still comma-delimited.

@arbianqx
Author

arbianqx commented Nov 24, 2023

Hey @Okohedeki, I'm using ljspeech format. I've formatted my dataset accordingly.

@Okohedeki

Okohedeki commented Nov 24, 2023

Are you sure that the csv is pipe-delimited? Just because there are pipes in the csv doesn't make it a pipe-delimited dataset. For example, when I was saving the csv I had this line:
csv_writer = csv.writer(csv_file)
and I had to switch it to

csv_writer = csv.writer(csv_file, delimiter='|')
Before that, I was manually building each row as wavefile | text | formatted text, but what actually got written was wavefile |, text |, formatted text.

The error is definitely because the dataset is not correct.
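
As a minimal sketch of the difference described above (the sample rows and helper names are made up for illustration), writing rows with `delimiter='|'` and then splitting each line on `|` shows whether the file really has three columns per row:

```python
import csv
import io

# Hypothetical rows in LJSpeech order: file id, raw text, normalized text.
rows = [
    ("audio_0001", "Dr. Smith arrived at 5.", "Doctor Smith arrived at five."),
    ("audio_0002", "Hello, world!", "Hello, world!"),
]


def write_metadata(rows, delimiter):
    """Serialize rows to CSV text using the given delimiter."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


def count_pipe_columns(text):
    """Split each line on '|' the way a pipe-delimited reader would."""
    return [len(line.split("|")) for line in text.splitlines()]


piped = write_metadata(rows, delimiter="|")
comma = write_metadata(rows, delimiter=",")

print(count_pipe_columns(piped))  # three columns per row
print(count_pipe_columns(comma))  # a single column per row
```

Running the same column count over an existing metadata.csv is a quick way to confirm that every row splits into the expected number of fields.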

@arbianqx
Author

Yes, I can confirm that this is not the case. It is pipe-delimited ("|"). The same dataset works with other approaches such as vits and yourtts!

@Okohedeki

Okohedeki commented Nov 24, 2023

The only other thing I can suggest is to open this file:

TTS\TTS\tts\datasets\formatters.py

In the ljspeech function, can you print out the actual path of the file? It should be the txt_file. I had to change the line to:

            # wav_file = os.path.join(root_path, "wavs", cols[0] + ".wav")
            wav_file = os.path.join(root_path, "wavs", cols[0])

to stop it from appending .wav to file names that already ended in .wav.
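
A small sketch of the same idea that works whether or not the first column already carries the extension. `resolve_wav_path` is a hypothetical helper, not part of the TTS formatter, and the dataset folder name is made up; only the `wavs` subfolder follows the line quoted above:

```python
import os


def resolve_wav_path(root_path, file_id):
    """Build the wav path, appending '.wav' only when it is missing."""
    if not file_id.lower().endswith(".wav"):
        file_id += ".wav"
    return os.path.join(root_path, "wavs", file_id)


# Printing the resolved path and whether it exists is a quick sanity check.
for file_id in ("audio_0001", "audio_0002.wav"):
    path = resolve_wav_path("my_dataset", file_id)
    print(path, os.path.exists(path))
```

If every path prints `False`, the metadata and the files on disk disagree, and the samples cannot be loaded.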

@arbianqx
Author

Yes, I have the exact same format as ljspeech.

@Edresson
Contributor

Edresson commented Nov 27, 2023

Hi @arbianqx,

This message "> Total eval samples after filtering: 0" indicates that you don't have any eval samples. There are three possible causes:

  1. The eval CSV that you provided is empty;
  2. The samples in the eval CSV that you provided exceed the max_wav_length and max_text_length defined in the recipe (https://github.com/coqui-ai/TTS/blob/dev/recipes/ljspeech/xtts_v2/train_gpt_xtts.py#L86C1-L87C29). Note that we do not recommend changing these values for fine-tuning;
  3. You did not provide an eval CSV, and all the automatically selected samples exceed max_wav_length and max_text_length.

In all these scenarios, you need to change (or create) your eval CSV to meet the requirements for training.
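
To estimate how many eval rows would survive such a filter before launching training, a rough sketch like the following may help. It is not the trainer's own code: `filter_eval_samples` and the limit values are hypothetical, text length is measured in characters as a proxy (the trainer may count tokens instead), and audio length is given in samples. Take the real limits from your recipe.

```python
def filter_eval_samples(samples, max_text_length, max_wav_length):
    """Keep samples whose text and audio both fit within the limits.

    Each sample is a dict with a "text" string and a "num_samples" int
    (audio length in samples, i.e. duration in seconds * sample rate).
    """
    kept = []
    for sample in samples:
        text_ok = len(sample["text"]) <= max_text_length
        wav_ok = sample["num_samples"] <= max_wav_length
        if text_ok and wav_ok:
            kept.append(sample)
    return kept


# Made-up eval set: one short clip and one clip that is too long.
eval_samples = [
    {"text": "A short sentence.", "num_samples": 3 * 22050},
    {"text": "A much longer recording.", "num_samples": 30 * 22050},
]

# Illustrative limits only; read the real ones from the recipe.
kept = filter_eval_samples(eval_samples, max_text_length=200, max_wav_length=255995)
print(f"{len(kept)} of {len(eval_samples)} eval samples would remain")
if not kept:
    print("Eval set would be empty; shorten or re-split the clips.")
```

If the count comes out as zero, the eval CSV needs shorter clips or shorter transcripts before the eval DataLoader can be built.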

Alternatively, PR #3296 implements a Gradio demo for data processing, training, and inference with the XTTS model. The PR also includes a Google Colab, and we will soon publish a video showing how to use the demo.

@Edresson Edresson self-assigned this Nov 27, 2023
@erogol
Member

erogol commented Nov 28, 2023

Reopen if the comment above doesn't help.

@erogol erogol closed this as completed Nov 28, 2023
@rumbleFTW

Hey @arbianqx! I would like to train XTTSv2 on my own dataset, but I've no clue on how to start. Could you provide me some resources/notebooks that will help me get started? Thanks!

@dorbodwolf

dorbodwolf commented Dec 31, 2023

I used the formatter method to process my audio files (Chinese), but the resulting CSV files contain no data, because the condition if word.word[-1] in ["!", ".", "?"]: is never met.

I am sure that the Whisper model output is fine:

(Pdb) words_list[0]
Word(start=0.0, end=0.42, word='但', probability=0.82470703125)
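
# Imports needed by the function below
import gc
import os

import pandas
import torch
import torchaudio
from faster_whisper import WhisperModel
from tqdm import tqdm

from TTS.tts.layers.xtts.tokenizer import multilingual_cleaners


def format_audio_list(audio_files, target_language="en", out_path=None, buffer=0.2, eval_percentage=0.15, speaker_name="coqui", gradio_progress=None):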
    audio_total_size = 0
    # make sure that the output directory exists
    os.makedirs(out_path, exist_ok=True)

    # Loading Whisper
    device = "cuda" if torch.cuda.is_available() else "cpu" 

    print("Loading Whisper Model!")
    asr_model = WhisperModel("large-v2", device=device, compute_type="float16")

    metadata = {"audio_file": [], "text": [], "speaker_name": []}

    if gradio_progress is not None:
        tqdm_object = gradio_progress.tqdm(audio_files, desc="Formatting...")
    else:
        tqdm_object = tqdm(audio_files)

    for audio_path in tqdm_object:
        wav, sr = torchaudio.load(audio_path)
        # stereo to mono if needed
        if wav.size(0) != 1:
            wav = torch.mean(wav, dim=0, keepdim=True)

        wav = wav.squeeze()
        audio_total_size += (wav.size(-1) / sr)

        segments, _ = asr_model.transcribe(audio_path, word_timestamps=True, language=target_language)
        segments = list(segments)
        i = 0
        sentence = ""
        sentence_start = None
        first_word = True
        # added all segments words in a unique list
        words_list = []
        for _, segment in enumerate(segments):
            words = list(segment.words)
            words_list.extend(words)

        # process each word
        for word_idx, word in enumerate(words_list):
            if first_word:
                sentence_start = word.start
                # If it is the first sentence, add buffer or get the beginning of the file
                if word_idx == 0:
                    sentence_start = max(sentence_start - buffer, 0)  # Add buffer to the sentence start
                else:
                    # get previous sentence end
                    previous_word_end = words_list[word_idx - 1].end
                    # add buffer or take the midpoint of the silence between the previous sentence and the current one
                    sentence_start = max(sentence_start - buffer, (previous_word_end + sentence_start)/2)

                sentence = word.word
                first_word = False
            else:
                sentence += word.word

            if word.word[-1] in ["!", ".", "?"]:
                sentence = sentence[1:]
                # Expand number and abbreviations plus normalization
                sentence = multilingual_cleaners(sentence, target_language)
                audio_file_name, _ = os.path.splitext(os.path.basename(audio_path))

                audio_file = f"wavs/{audio_file_name}_{str(i).zfill(8)}.wav"

                # Check for the next word's existence
                if word_idx + 1 < len(words_list):
                    next_word_start = words_list[word_idx + 1].start
                else:
                    # If there are no more words, this is the last sentence, so use the audio length as the next word start
                    next_word_start = (wav.shape[0] - 1) / sr

                # Average the current word end and next word start
                word_end = min((word.end + next_word_start) / 2, word.end + buffer)
                
                absoulte_path = os.path.join(out_path, audio_file)
                os.makedirs(os.path.dirname(absoulte_path), exist_ok=True)
                i += 1
                first_word = True

                audio = wav[int(sr*sentence_start):int(sr*word_end)].unsqueeze(0)
                # if the audio is too short ignore it (i.e < 0.33 seconds)
                if audio.size(-1) >= sr/3:
                    torchaudio.save(absoulte_path,
                        audio,
                        sr
                    )
                else:
                    continue

                metadata["audio_file"].append(audio_file)
                metadata["text"].append(sentence)
                metadata["speaker_name"].append(speaker_name)

    df = pandas.DataFrame(metadata)
    df = df.sample(frac=1)
    num_val_samples = int(len(df)*eval_percentage)

    df_eval = df[:num_val_samples]
    df_train = df[num_val_samples:]

    df_train = df_train.sort_values('audio_file')
    train_metadata_path = os.path.join(out_path, "metadata_train.csv")
    df_train.to_csv(train_metadata_path, sep="|", index=False)

    eval_metadata_path = os.path.join(out_path, "metadata_eval.csv")
    df_eval = df_eval.sort_values('audio_file')
    df_eval.to_csv(eval_metadata_path, sep="|", index=False)

    # deallocate VRAM and RAM
    del asr_model, df_train, df_eval, df, metadata
    gc.collect()

    return train_metadata_path, eval_metadata_path, audio_total_size
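
One possible explanation, offered as an assumption rather than a confirmed diagnosis: the check lists only ASCII punctuation, while Chinese transcripts usually end sentences with full-width marks such as `。`, `！`, and `？`. A minimal sketch of a broader check follows; `ends_sentence` and `SENTENCE_END_MARKS` are hypothetical names, not part of the TTS formatter.

```python
# ASCII sentence enders plus their full-width CJK counterparts.
SENTENCE_END_MARKS = frozenset({"!", ".", "?", "！", "。", "？"})


def ends_sentence(token):
    """Return True if the token's last non-space character ends a sentence."""
    stripped = token.strip()
    return bool(stripped) and stripped[-1] in SENTENCE_END_MARKS


print(ends_sentence(" world."))
print(ends_sentence("但"))
print(ends_sentence("了。"))
```

The quoted function also drops the first character of each sentence with `sentence[1:]`, presumably to remove a leading space. The Whisper output shown above has no leading space (`word='但'`), so for such text this would cut off a real character; `sentence.strip()` may be a safer choice there.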

@OswaldoBornemann

So can we use a dataset which contains multiple speakers but with the same language to train xtts v2?
