Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and to help junior researchers and engineers get started in audio, music, and speech generation research and development. Amphion offers a unique feature: visualizations of classic models and architectures. We believe these visualizations help junior researchers and engineers gain a better understanding of the models.
The north-star objective of Amphion is to offer a platform for studying the conversion of various inputs into audio. Amphion is designed to support individual generation tasks, including but not limited to:
- TTS: Text to Speech Synthesis (supported)
- SVS: Singing Voice Synthesis (planning)
- VC: Voice Conversion (planning)
- SVC: Singing Voice Conversion (supported)
- TTA: Text to Audio (supported)
- TTM: Text to Music (planning)
- more…
In addition to the specific generation tasks, Amphion also includes several vocoders and evaluation metrics. A vocoder is an important module for producing high-quality audio signals, while evaluation metrics are critical for comparing generation tasks consistently.
- Amphion achieves state-of-the-art performance when compared with existing open-source repositories on text-to-speech (TTS) systems.
- It supports the following models or architectures:
    - FastSpeech2: A non-autoregressive TTS architecture that utilizes feed-forward Transformer blocks.
    - VITS: An end-to-end TTS architecture that utilizes a conditional variational autoencoder with adversarial learning.
    - Vall-E: A zero-shot TTS architecture that uses a neural codec language model with discrete codes.
    - NaturalSpeech2: An architecture for TTS that utilizes a latent diffusion model to generate natural-sounding voices.
- For singing voice conversion (SVC), it supports multiple content-based features from various pretrained models, including WeNet, Whisper, and ContentVec.
- It implements several state-of-the-art model architectures, including diffusion-based and Transformer-based models. The diffusion-based architecture uses a bidirectional dilated CNN and U-Net as a backend and supports the DDPM, DDIM, and PNDM sampling algorithms. Additionally, it supports single-step inference based on the Consistency Model.
- Amphion supports TTA with a latent diffusion model, including:
    - AudioLDM: A two-stage model with an autoencoder and a latent diffusion model.
- Amphion supports both classic and state-of-the-art neural vocoders, including
Amphion provides a comprehensive objective evaluation of the generated audio. The evaluation metrics include:
- F0 Modeling
    - F0 Pearson Coefficients
    - F0 Periodicity Root Mean Square Error
    - F0 Root Mean Square Error
    - Voiced/Unvoiced F1 Score
- Energy Modeling
    - Energy Pearson Coefficients
    - Energy Root Mean Square Error
- Intelligibility
    - Character/Word Error Rate based on Whisper
- Spectrogram Distortion
    - Frechet Audio Distance (FAD)
    - Mel Cepstral Distortion (MCD)
    - Multi-Resolution STFT Distance (MSTFT)
    - Perceptual Evaluation of Speech Quality (PESQ)
    - Short Time Objective Intelligibility (STOI)
    - Signal to Noise Ratio (SNR)
- Speaker Similarity
    - Cosine similarity based on RawNet3
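
To make a few of the metrics above concrete, here is a minimal pure-Python sketch of how F0 RMSE, the F0 Pearson coefficient, the voiced/unvoiced F1 score, SNR, and cosine similarity are commonly defined. It is an illustration only, not Amphion's implementation: the function names are hypothetical, unvoiced frames are assumed to be encoded as `F0 = 0`, the two F0 tracks are assumed to be frame-aligned, and the F0 statistics are assumed to be computed only over frames that are voiced in both tracks.

```python
import math


def rmse(ref, est):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((r - e) ** 2 for r, e in zip(ref, est)) / len(ref))


def pearson(ref, est):
    """Pearson correlation coefficient (undefined if either input is constant)."""
    n = len(ref)
    mean_r, mean_e = sum(ref) / n, sum(est) / n
    cov = sum((r - mean_r) * (e - mean_e) for r, e in zip(ref, est))
    var_r = sum((r - mean_r) ** 2 for r in ref)
    var_e = sum((e - mean_e) ** 2 for e in est)
    return cov / math.sqrt(var_r * var_e)


def both_voiced(ref_f0, est_f0):
    """Keep only frames voiced in both tracks (assumption: F0 == 0 means unvoiced)."""
    pairs = [(r, e) for r, e in zip(ref_f0, est_f0) if r > 0 and e > 0]
    return [r for r, _ in pairs], [e for _, e in pairs]


def voiced_unvoiced_f1(ref_f0, est_f0):
    """F1 score of the voicing decision, with 'voiced' as the positive class."""
    tp = sum(1 for r, e in zip(ref_f0, est_f0) if r > 0 and e > 0)
    fp = sum(1 for r, e in zip(ref_f0, est_f0) if r <= 0 and e > 0)
    fn = sum(1 for r, e in zip(ref_f0, est_f0) if r > 0 and e <= 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def snr_db(clean, degraded):
    """Signal-to-noise ratio in dB, treating (clean - degraded) as the noise."""
    signal_power = sum(c * c for c in clean)
    noise_power = sum((c - d) ** 2 for c, d in zip(clean, degraded))
    return 10.0 * math.log10(signal_power / noise_power)


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy frame-level F0 tracks in Hz (made-up values for illustration).
ref_f0 = [0.0, 100.0, 110.0, 120.0, 0.0]
est_f0 = [0.0, 102.0, 108.0, 0.0, 90.0]

ref_v, est_v = both_voiced(ref_f0, est_f0)
print("F0 RMSE (Hz):", rmse(ref_v, est_v))
print("F0 Pearson:", pearson(ref_v, est_v))
print("V/UV F1:", voiced_unvoiced_f1(ref_f0, est_f0))
```

In practice the F0 tracks would come from a pitch extractor and the speaker embeddings from a pretrained speaker model such as RawNet3; the toy lists above only stand in for those inputs. The energy metrics follow the same RMSE and Pearson definitions, applied to frame-level energy instead of F0.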