RFWave, a frame-level multi-band Rectified Flow model, achieves high-fidelity audio waveform reconstruction from Mel-spectrograms or discrete tokens, with generation speeds up to 160 times faster than real-time on a GPU.
Recent advancements in generative modeling have significantly enhanced the reconstruction of audio waveforms from various representations. While diffusion models are adept at this task, they are hindered by latency issues due to their operation at the individual sample point level and the need for numerous sampling steps. In this study, we introduce RFWave, a cutting-edge multi-band Rectified Flow approach designed to reconstruct high-fidelity audio waveforms from Mel-spectrograms or discrete acoustic tokens. RFWave uniquely generates complex spectrograms and operates at the frame level, processing all subbands simultaneously to boost efficiency. Leveraging Rectified Flow, which targets a straight transport trajectory, RFWave achieves reconstruction with just 10 sampling steps. Our empirical evaluations show that RFWave not only provides outstanding reconstruction quality but also offers vastly superior computational efficiency, enabling audio generation at speeds up to 160 times faster than real-time on a GPU.
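Rectified Flow learns a velocity field along a near-straight path from noise to data, which is why a handful of Euler steps suffice. A minimal sketch of fixed-step Euler ODE sampling on a toy velocity field (the function names and step count here are illustrative, not RFWave's actual interface):

```python
import numpy as np

def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (data)
    with fixed-step Euler, as in Rectified Flow sampling."""
    x = x0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x

# Toy check: on a perfectly straight trajectory the velocity is the
# constant displacement (x1 - x0), so Euler recovers x1 exactly,
# regardless of the number of steps.
x0 = np.random.randn(4)
x1 = np.ones(4)
v = lambda x, t: x1 - x0  # ideal rectified (straight) velocity field
out = euler_sample(v, x0, num_steps=10)
```

In practice the learned trajectory is only approximately straight, so a small step count (10 in the paper) trades a little accuracy for speed.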
| BigVGAN (LibriTTS) | RFWave (LibriTTS) |
| --- | --- |
| Listen to BigVGAN | Listen to RFWave |
- Install the requirements.

      sudo apt-get update
      sudo apt-get install sox libsox-fmt-all libsox-dev
      conda create -n rfwave python=3.10
      conda activate rfwave
      pip install -r requirements.txt
- Download and extract the LJ Speech dataset.
- Update the wav paths in the filelists:

      sed -i -- 's,LJSPEECH_PATH,ljs_dataset_folder,g' LJSpeech/*.filelist

- Update the `filelist_path` in `configs/*.yaml`.
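The `sed` command above simply substitutes the `LJSPEECH_PATH` placeholder in each filelist line with the actual dataset folder. An equivalent sketch in Python (the file and path names are illustrative):

```python
import tempfile
from pathlib import Path

def fill_placeholder(filelist: Path, dataset_dir: str,
                     placeholder: str = "LJSPEECH_PATH") -> None:
    """Replace the path placeholder in a filelist,
    mirroring the sed command in the setup steps."""
    filelist.write_text(filelist.read_text().replace(placeholder, dataset_dir))

# Demo on a throwaway filelist (names are illustrative):
tmp = Path(tempfile.mkdtemp()) / "demo.filelist"
tmp.write_text("LJSPEECH_PATH/wavs/LJ001-0001.wav|transcript\n")
fill_placeholder(tmp, "/data/LJSpeech-1.1")
updated = tmp.read_text()
```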
- Train a vocoder:

      python3 train.py -c configs/rfwave.yaml

- Test a trained vocoder with `inference_voc.py`.
- Train an Encodec decoder:

      python3 train.py -c configs/rfwave-encodec.yaml
- Download the alignment from the SyntaSpeech repo.
- Convert the alignments and build a phoneset with `scripts/ljspeech_synta.py`.
- Modify the `filelist_path` and `phoneset` path in `configs/rfwave-dur.yaml` and `configs/rfwave-tts-ctx.yaml`.
- Train a duration model:

      python3 train.py -c configs/rfwave-dur.yaml

- Train an acoustic model:

      python3 train.py -c configs/rfwave-tts-ctx.yaml
- Test the trained model with `inference_tts.py`.
    python3 inference_voc.py --model_dir MODEL_DIR --wav_dir WAV_DIR --save_dir SAVE_DIR [--guidance_scale GUIDANCE_SCALE]

- Optional parameter: `--guidance_scale` adjusts the guidance scale for the input type. Recommended values are 1.0 for Mel input and 2.0 for Encodec token input.
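For intuition on what the scale does, here is the standard classifier-free guidance combination of conditional and unconditional predictions; this is a generic sketch of the technique, not RFWave's exact code, and the function names are illustrative:

```python
import numpy as np

def apply_guidance(v_cond, v_uncond, guidance_scale):
    """Classifier-free guidance: a scale of 1.0 keeps the conditional
    prediction unchanged; larger values extrapolate further along the
    conditional direction, trading diversity for fidelity."""
    return v_uncond + guidance_scale * (v_cond - v_uncond)

# Toy predictions to show the effect of the two recommended scales:
v_cond = np.array([1.0, 2.0])
v_uncond = np.array([0.0, 0.0])
mel_pred = apply_guidance(v_cond, v_uncond, 1.0)      # equals v_cond
encodec_pred = apply_guidance(v_cond, v_uncond, 2.0)  # extrapolated
```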
Available Models
The test set for reconstructing waveforms from EnCodec tokens: `audio_reconstruct_universal_testset`.
This repository uses code from Vocos and audiocraft.
This project is licensed under the MIT License.