RFWave, a frame-level multi-band Rectified Flow model, achieves high-fidelity audio waveform reconstruction from Mel-spectrograms or discrete tokens, with generation speeds up to 160 times faster than real-time on a GPU.
Recent advancements in generative modeling have significantly enhanced the reconstruction of audio waveforms from various representations. While diffusion models are adept at this task, they are hindered by latency issues due to their operation at the individual sample point level and the need for numerous sampling steps. In this study, we introduce RFWave, a cutting-edge multi-band Rectified Flow approach designed to reconstruct high-fidelity audio waveforms from Mel-spectrograms or discrete acoustic tokens. RFWave uniquely generates complex spectrograms and operates at the frame level, processing all subbands simultaneously to boost efficiency. Leveraging Rectified Flow, which targets a straight transport trajectory, RFWave achieves reconstruction with just 10 sampling steps. Our empirical evaluations show that RFWave not only provides outstanding reconstruction quality but also offers vastly superior computational efficiency, enabling audio generation at speeds up to 160 times faster than real-time on a GPU.
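Rectified Flow learns a velocity field along a near-straight path from noise to data, which is why a handful of Euler steps suffice. A minimal sketch of fixed-step Euler ODE sampling on a toy velocity field (the function names and step count here are illustrative, not RFWave's actual interface):

```python
import numpy as np

def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (data)
    with fixed-step Euler, as in Rectified Flow sampling."""
    x = x0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x

# Toy check: on a perfectly straight trajectory the velocity is the
# constant displacement (x1 - x0), so Euler recovers x1 exactly,
# regardless of the number of steps.
x0 = np.random.randn(4)
x1 = np.ones(4)
v = lambda x, t: x1 - x0  # ideal rectified (straight) velocity field
out = euler_sample(v, x0, num_steps=10)
```

In practice the learned trajectory is only approximately straight, so a small step count (10 in the paper) trades a little accuracy for speed.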
| BigVGAN (LibriTTS) | RFWave (LibriTTS) |
| --- | --- |
| Listen to BigVGAN | Listen to RFWave |
- Install the requirements.

      sudo apt-get update
      sudo apt-get install sox libsox-fmt-all libsox-dev
      conda create -n rfwave python=3.10
      conda activate rfwave
      pip install -r requirements.txt
- Download and extract the LJ Speech dataset.
- Update the wav paths in the filelists:

      sed -i -- 's,LJSPEECH_PATH,ljs_dataset_folder,g' LJSpeech/*.filelist

- Update the `filelist_path` in `configs/*.yaml`.
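The `sed` command above simply substitutes the `LJSPEECH_PATH` placeholder in each filelist line with the actual dataset folder. An equivalent sketch in Python (the file and path names are illustrative):

```python
import tempfile
from pathlib import Path

def fill_placeholder(filelist: Path, dataset_dir: str,
                     placeholder: str = "LJSPEECH_PATH") -> None:
    """Replace the path placeholder in a filelist,
    mirroring the sed command in the setup steps."""
    filelist.write_text(filelist.read_text().replace(placeholder, dataset_dir))

# Demo on a throwaway filelist (names are illustrative):
tmp = Path(tempfile.mkdtemp()) / "demo.filelist"
tmp.write_text("LJSPEECH_PATH/wavs/LJ001-0001.wav|transcript\n")
fill_placeholder(tmp, "/data/LJSpeech-1.1")
updated = tmp.read_text()
```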
- Train a vocoder:

      python3 train.py -c configs/rfwave.yaml

- Test a trained vocoder with `inference_voc.py`.
- Train an Encodec decoder:

      python3 train.py -c configs/rfwave-encodec.yaml
- Download the alignment from the SyntaSpeech repo.
- Convert the alignments and build a phoneset with `scripts/ljspeech_synta.py`.
- Modify the `filelist_path` and `phoneset` path in `configs/rfwave-dur.yaml` and `configs/rfwave-tts-ctx.yaml`.
- Train a duration model:

      python3 train.py -c configs/rfwave-dur.yaml

- Train an acoustic model:

      python3 train.py -c configs/rfwave-tts-ctx.yaml
- Test the trained model with `inference_tts.py`.
    python3 inference_voc.py --model_dir MODEL_DIR --wav_dir WAV_DIR --save_dir SAVE_DIR [--guidance_scale GUIDANCE_SCALE]

- Optional parameter: `--guidance_scale` adjusts the guidance scale for the input type. Recommended values are 1.0 for Mel input and 2.0 for Encodec token input.
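For intuition on what the scale does, here is the standard classifier-free guidance combination of conditional and unconditional predictions; this is a generic sketch of the technique, not RFWave's exact code, and the function names are illustrative:

```python
import numpy as np

def apply_guidance(v_cond, v_uncond, guidance_scale):
    """Classifier-free guidance: a scale of 1.0 keeps the conditional
    prediction unchanged; larger values extrapolate further along the
    conditional direction, trading diversity for fidelity."""
    return v_uncond + guidance_scale * (v_cond - v_uncond)

# Toy predictions to show the effect of the two recommended scales:
v_cond = np.array([1.0, 2.0])
v_uncond = np.array([0.0, 0.0])
mel_pred = apply_guidance(v_cond, v_uncond, 1.0)      # equals v_cond
encodec_pred = apply_guidance(v_cond, v_uncond, 2.0)  # extrapolated
```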
Available Models
The test set for reconstructing waveforms from EnCodec tokens: `audio_reconstruct_universal_testset`.
This repository uses code from Vocos and audiocraft.
This project is licensed under the MIT License.