Paper List

A list of papers, not only about speech synthesis 😀.

Content

TTS Frontend
Acoustic Model
Vocoder
TTS towards Stylization
Voice Conversion
Singing
- Singing Voice Synthesis
- Singing Voice Conversion
Speech Processing Related
Natural Language Processing
VAE & GAN
- VAE
- GAN
Others

TTS Frontend

Pre-trained Text Representations for Improving Front-End Text Processing in Mandarin Text-to-Speech Synthesis (Interspeech 2019)
A unified sequence-to-sequence front-end model for Mandarin text-to-speech synthesis (ICASSP 2020)
A hybrid text normalization system using multi-head self-attention for mandarin (ICASSP 2020)
Unified Mandarin TTS Front-end Based on Distilled BERT Model (2021-01)

Acoustic Model

Autoregressive Model

Tacotron V1: Tacotron: Towards End-to-End Speech Synthesis (Interspeech 2017)
Tacotron V2: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions (ICASSP 2018)
Deep Voice V1: Deep Voice: Real-time Neural Text-to-Speech (ICML 2017)
Deep Voice V2: Deep Voice 2: Multi-Speaker Neural Text-to-Speech (NeurIPS 2017)
Deep Voice V3: Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning (ICLR 2018)
Transformer-TTS: Neural Speech Synthesis with Transformer Network (AAAI 2019)
DurIAN: DurIAN: Duration Informed Attention Network For Multimodal Synthesis (2019)
Location-Relative Attention Mechanisms For Robust Long-Form Speech Synthesis (ICASSP 2020)
Flowtron (flow based): Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis (2020)
Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling (under review ICLR 2021)
RobuTrans (towards robust): RobuTrans: A Robust Transformer-Based Text-to-Speech Model (AAAI 2020)
DeviceTTS: DeviceTTS: A Small-Footprint, Fast, Stable Network for On-Device Text-to-Speech (2020-10)
Wave-Tacotron: Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis (2020-11)
Streaming Acoustic Modeling: Transformer-based Acoustic Modeling for Streaming Speech Synthesis (2021-06)
Apple TTS system: On-device neural speech synthesis (ASRU 2021)

Non-Autoregressive Model

ParaNet: Non-Autoregressive Neural Text-to-Speech (ICML 2020)
FastSpeech: FastSpeech: Fast, Robust and Controllable Text to Speech (NeurIPS 2019)
JDI-T: JDI-T: Jointly trained Duration Informed Transformer for Text-To-Speech without Explicit Alignment (2020)
EATS: End-to-End Adversarial Text-to-Speech (2020)
FastSpeech 2: FastSpeech 2: Fast and High-Quality End-to-End Text to Speech (2020)
FastPitch: FastPitch: Parallel Text-to-speech with Pitch Prediction (2020)
Glow-TTS (flow based, Monotonic Attention): Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search (NeurIPS 2020)
Flow-TTS (flow based): Flow-TTS: A Non-Autoregressive Network for Text to Speech Based on Flow (ICASSP 2020)
SpeedySpeech: SpeedySpeech: Efficient Neural Speech Synthesis (Interspeech 2020)
Parallel Tacotron: Parallel Tacotron: Non-Autoregressive and Controllable TTS (2020)
BVAE-TTS: Bidirectional Variational Inference for Non-Autoregressive Text-to-Speech (ICLR 2021)
LightSpeech: LightSpeech: Lightweight and Fast Text to Speech with Neural Architecture Search (ICASSP 2021)
Parallel Tacotron 2: Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling (2021)
Grad-TTS: Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech (ICML 2021)
VITS (flow based): Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech (ICML 2021)
RAD-TTS: RAD-TTS: Parallel Flow-Based TTS with Robust Alignment Learning and Diverse Synthesis (ICML 2021 Workshop)
WaveGrad 2: WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis (Interspeech 2021)
PortaSpeech: PortaSpeech: Portable and High-Quality Generative Text-to-Speech (NeurIPS 2021)
DelightfulTTS (To synthesize natural and high-quality speech from text): DelightfulTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2021 (Blizzard Challenge 2021)
DiffGAN-TTS: DiffGAN-TTS: High-Fidelity and Efficient Text-to-Speech with Denoising Diffusion GANs (2022-01)
BDDM: Bilateral Denoising Diffusion Models for Fast and High-Quality Speech Synthesis (ICLR 2022)
JETS: JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech (Interspeech 2022)
WavThruVec: WavThruVec: Latent speech representation as intermediate features for neural speech synthesis (2022-03)
FastDiff: FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis (IJCAI 2022)
NaturalSpeech: NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality (2022-05)
DelightfulTTS 2: DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders (Interspeech 2022)
CLONE: Controllable and Lossless Non-Autoregressive End-to-End Text-to-Speech (2022-07)

Alignment Study

Monotonic Attention: Online and Linear-Time Attention by Enforcing Monotonic Alignments (ICML 2017)
Monotonic Chunkwise Attention: Monotonic Chunkwise Attention (ICLR 2018)
Forward Attention in Sequence-to-sequence Acoustic Modelling for Speech Synthesis (ICASSP 2018)
RNN-T for TTS: Initial investigation of an encoder-decoder end-to-end TTS framework using marginalization of monotonic hard latent alignments (2019)
Location-Relative Attention Mechanisms For Robust Long-Form Speech Synthesis (ICASSP 2020)
Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling (under review ICLR 2021)
EfficientTTS: EfficientTTS: An Efficient and High-Quality Text-to-Speech Architecture (2020-12)
VAENAR-TTS: VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis (2021-07)
One TTS Alignment To Rule Them All (2021-08)

Data Efficiency

Semi-Supervised Training for Improving Data Efficiency in End-to-End Speech Synthesis (2018)
Almost Unsupervised Text to Speech and Automatic Speech Recognition (ICML 2019)
Unsupervised Learning For Sequence-to-sequence Text-to-speech For Low-resource Languages (Interspeech 2020)
Multilingual Speech Synthesis: One Model, Many Languages: Meta-learning for Multilingual Text-to-Speech (Interspeech 2020)
Low-resource expressive text-to-speech using data augmentation (2020-11)
One TTS Alignment To Rule Them All (2021-08)
DenoiSpeech: DenoiSpeech: Denoising Text to Speech with Frame-Level Noise Modeling (ICASSP 2021)
Revisiting Over-Smoothness in Text to Speech (ACL 2022)
Unsupervised Text-to-Speech Synthesis by Unsupervised Automatic Speech Recognition (2022-03)
Simple and Effective Unsupervised Speech Synthesis (2022-04)
A Multi-Stage Multi-Codebook VQ-VAE Approach to High-Performance Neural TTS (Interspeech 2022)
EPIC TTS Models (research on pruning): EPIC TTS Models: Empirical Pruning Investigations Characterizing Text-To-Speech Models (Interspeech 2022)

Vocoder

Autoregressive Model

WaveNet: WaveNet: A Generative Model for Raw Audio (2016)
WaveRNN: Efficient Neural Audio Synthesis (ICML 2018)
WaveGAN: Adversarial Audio Synthesis (ICLR 2019)
LPCNet: LPCNet: Improving Neural Speech Synthesis Through Linear Prediction (ICASSP 2019)
Towards achieving robust universal neural vocoding (Interspeech 2019)
GAN-TTS: High Fidelity Speech Synthesis with Adversarial Networks (2019)
MultiBand-WaveRNN: DurIAN: Duration Informed Attention Network For Multimodal Synthesis (2019)
Chunked Autoregressive GAN for Conditional Waveform Synthesis (2021-10)
Improved LPCNet: Neural Speech Synthesis on a Shoestring: Improving the Efficiency of LPCNet (ICASSP 2022)
Bunched LPCNet2: Bunched LPCNet2: Efficient Neural Vocoders Covering Devices from Cloud to Edge (2022-03)

Non-Autoregressive Model

Parallel-WaveNet: Parallel WaveNet: Fast High-Fidelity Speech Synthesis (2017)
WaveGlow: WaveGlow: A Flow-based Generative Network for Speech Synthesis (2018)
Parallel-WaveGAN: Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram (2019)
MelGAN: MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis (NeurIPS 2019)
MultiBand-MelGAN: Multi-band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech (2020)
VocGAN: VocGAN: A High-Fidelity Real-time Vocoder with a Hierarchically-nested Adversarial Network (Interspeech 2020)
WaveGrad: WaveGrad: Estimating Gradients for Waveform Generation (2020)
DiffWave: DiffWave: A Versatile Diffusion Model for Audio Synthesis (2020)
HiFi-GAN: HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis (NeurIPS 2020)
Parallel-WaveGAN (New): Parallel waveform synthesis based on generative adversarial networks with voicing-aware conditional discriminators (2020-10)
StyleMelGAN: StyleMelGAN: An Efficient High-Fidelity Adversarial Vocoder with Temporal Adaptive Normalization (ICASSP 2021)
Improved parallel WaveGAN vocoder with perceptually weighted spectrogram loss (SLT 2021)
Fre-GAN: Fre-GAN: Adversarial Frequency-consistent Audio Synthesis (Interspeech 2021)
UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation (2021-07)
iSTFTNet: iSTFTNet: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform (ICASSP 2022)
Parallel Synthesis for Autoregressive Speech Generation (2022-04)
Avocodo: Avocodo: Generative Adversarial Network for Artifact-free Vocoder (2022-06)

Others

(Robust vocoder): Towards Robust Neural Vocoding for Speech Generation: A Survey (2019)
(Source-filter model based): Neural source-filter waveform models for statistical parametric speech synthesis (TASLP 2019)
NHV: Neural Homomorphic Vocoder (Interspeech 2020)
Universal MelGAN: Universal MelGAN: A Robust Neural Vocoder for High-Fidelity Waveform Generation in Multiple Domains (2020)
Binaural Speech Synthesis: Neural Synthesis of Binaural Speech From Mono Audio (ICLR 2021)
Checkerboard artifacts in neural vocoder: Upsampling artifacts in neural audio synthesis (ICASSP 2021)
Universal Vocoder Based on Parallel WaveNet: Universal Neural Vocoding with Parallel WaveNet (ICASSP 2021)
(Comparison of discriminator): GAN Vocoder: Multi-Resolution Discriminator Is All You Need (2021-03)
Vocoder Benchmark: VocBench: A Neural Vocoder Benchmark for Speech Synthesis (2021-12)
BigVGAN (Universal vocoder): BigVGAN: A Universal Neural Vocoder with Large-Scale Training (2022-06)

TTS towards Stylization

Expressive TTS

ReferenceEncoder-Tacotron: Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron (ICML 2018)
GST-Tacotron: Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis (ICML 2018)
Predicting Expressive Speaking Style From Text In End-To-End Speech Synthesis (2018)
GMVAE-Tacotron2: Hierarchical Generative Modeling for Controllable Speech Synthesis (ICLR 2019)
BERT-TTS: Towards Transfer Learning for End-to-End Speech Synthesis from Deep Pre-Trained Language Models (2019)
(Multi-style Decouple): Multi-Reference Neural TTS Stylization with Adversarial Cycle Consistency (2019)
(Multi-style Decouple): Multi-reference Tacotron by Intercross Training for Style Disentangling, Transfer and Control in Speech Synthesis (Interspeech 2019)
Mellotron: Mellotron: Multispeaker expressive voice synthesis by conditioning on rhythm, pitch and global style tokens (2019)
Robust and fine-grained prosody control of end-to-end speech synthesis (ICASSP 2019)
Flowtron (flow based): Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis (2020)
(local style): Fully-hierarchical fine-grained prosody modeling for interpretable speech synthesis (ICASSP 2020)
Controllable Neural Prosody Synthesis (Interspeech 2020)
GraphSpeech: GraphSpeech: Syntax-Aware Graph Attention Network For Neural Speech Synthesis (2020-10)
BERT-TTS: Improving Prosody Modelling with Cross-Utterance BERT Embeddings for End-to-end Speech Synthesis (2020-11)
(Global Emotion Style Control): Controllable Emotion Transfer For End-to-End Speech Synthesis (2020-11)
(Phone Level Style Control): Fine-grained Emotion Strength Transfer, Control and Prediction for Emotional Speech Synthesis (2020-11)
(Phone Level Prosody Modelling): Mixture Density Network for Phone-Level Prosody Modelling in Speech Synthesis (ICASSP 2021)
(Phone Level Prosody Modelling): Prosodic Clustering for Phoneme-level Prosody Control in End-to-End Speech Synthesis (ICASSP 2021)
PeriodNet: PeriodNet: A non-autoregressive waveform generation model with a structure separating periodic and aperiodic components (ICASSP 2021)
PnG BERT: PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS (Interspeech 2021)
Towards Multi-Scale Style Control for Expressive Speech Synthesis (2021-04)
Learning Robust Latent Representations for Controllable Speech Synthesis (2021-05)
Diverse and Controllable Speech Synthesis with GMM-Based Phone-Level Prosody Modelling (2021-05)
Improving Performance of Seen and Unseen Speech Style Transfer in End-to-end Neural TTS (2021-06)
(Conversational Speech Synthesis): Controllable Context-aware Conversational Speech Synthesis (Interspeech 2021)
DeepRapper: DeepRapper: Neural Rap Generation with Rhyme and Rhythm Modeling (ACL 2021)
Referee: Referee: Towards reference-free cross-speaker style transfer with low-quality data for expressive speech synthesis (2021)
(Text-Based Insertion TTS): Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration (Interspeech 2021)
On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis (2021-10)
Style Equalization: Unsupervised Learning of Controllable Generative Sequence Models (2021-10)
TTS for dubbing: Neural Dubber: Dubbing for Videos According to Scripts (NeurIPS 2021)
Word-Level Style Control for Expressive, Non-attentive Speech Synthesis (SPECOM 2021)
MsEmoTTS: MsEmoTTS: Multi-scale emotion transfer, prediction, and control for emotional speech synthesis (2022-01)
Disentangling Style and Speaker Attributes for TTS Style Transfer (2022-01)
Word-level prosody modeling: Unsupervised word-level prosody tagging for controllable speech synthesis (ICASSP 2022)
ProsoSpeech: ProsoSpeech: Enhancing Prosody With Quantized Vector Pre-training in Text-to-Speech (ICASSP 2022)
CampNet (speech editing): CampNet: Context-Aware Mask Prediction for End-to-End Text-Based Speech Editing (2022-02)
vTTS (visual text): vTTS: visual-text to speech (2022-03)
CopyCat2: CopyCat2: A Single Model for Multi-Speaker TTS and Many-to-Many Fine-Grained Prosody Transfer (Interspeech 2022)
Prosody Cloning in Zero-Shot Multispeaker Text-to-Speech (Interspeech 2022)
Expressive, Variable, and Controllable Duration Modelling in TTS (Interspeech 2022)

MultiSpeaker TTS

Meta-Learning for TTS: Sample Efficient Adaptive Text-to-Speech (ICLR 2019)
SV-Tacotron: Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis (NeurIPS 2018)
Deep Voice V3: Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning (ICLR 2018)
Zero-Shot Multi-Speaker Text-To-Speech with State-of-the-art Neural Speaker Embeddings (ICASSP 2020)
MultiSpeech: MultiSpeech: Multi-Speaker Text to Speech with Transformer (2020)
SC-WaveRNN: Speaker Conditional WaveRNN: Towards Universal Neural Vocoder for Unseen Speaker and Recording Conditions (Interspeech 2020)
MultiSpeaker Dataset: AISHELL-3: A Multi-speaker Mandarin TTS Corpus and the Baselines (2020)
Life-long learning for multi-speaker TTS: Continual Speaker Adaptation for Text-to-Speech Synthesis (2021-03)
Meta-StyleSpeech: Multi-Speaker Adaptive Text-to-Speech Generation (ICML 2021)
Effective and Differentiated Use of Control Information for Multi-speaker Speech Synthesis (Interspeech 2021)
Speaker Generation (2021-11)
Meta-Voice: Meta-Voice: Fast few-shot style transfer for expressive voice cloning using meta learning (2021-11)

New Perspective on TTS

PromptTTS: PromptTTS: Controllable Text-to-Speech with Text Descriptions (2022-11)
VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (2023-01)
InstructTTS: InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt (2023-01)
Spear-TTS: Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision (2023-02)
FoundationTTS: FoundationTTS: Text-to-Speech for ASR Customization with Generative Language Model (2023-03)

Voice Conversion

ASR & TTS Based

(introduce PPG into voice conversion): Phonetic posteriorgrams for many-to-one voice conversion without parallel data training (2016)
A Vocoder-free WaveNet Voice Conversion with Non-Parallel Data (2019)
TTS-Skins: TTS Skins: Speaker Conversion via ASR (2019)
Non-Parallel Sequence-to-Sequence Voice Conversion with Disentangled Linguistic and Speaker Representations (IEEE/ACM TASLP 2019)
One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization (Interspeech 2019)
Cotatron (combine text information with voice conversion system): Cotatron: Transcription-Guided Speech Encoder for Any-to-Many Voice Conversion without Parallel Data (Interspeech 2020)
(TTS & ASR): Voice Conversion by Cascading Automatic Speech Recognition and Text-to-Speech Synthesis with Prosody Transfer (Interspeech 2020)
FragmentVC (wav2vec): FragmentVC: Any-to-Any Voice Conversion by End-to-End Extracting and Fusing Fine-Grained Voice Fragments With Attention (2020)
Towards Natural and Controllable Cross-Lingual Voice Conversion Based on Neural TTS Model and Phonetic Posteriorgram (ICASSP 2021)
(TTS & ASR): On Prosody Modeling for ASR+TTS based Voice Conversion (2021-07)
Cloning one's voice using very limited data in the wild (2021-10)

VAE & Auto-Encoder Based

VAE-VC (VAE based): Voice Conversion from Non-parallel Corpora Using Variational Auto-encoder (2016)
(Speech representation learning by VQ-VAE): Unsupervised speech representation learning using WaveNet autoencoders (2019)
Blow (Flow based): Blow: a single-scale hyperconditioned flow for non-parallel raw-audio voice conversion (NeurIPS 2019)
AutoVC: AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss (2019)
F0-AutoVC: F0-consistent many-to-many non-parallel voice conversion via conditional autoencoder (ICASSP 2020)
One-Shot Voice Conversion by Vector Quantization (ICASSP 2020)
SpeechSplit (auto-encoder): Unsupervised Speech Decomposition via Triple Information Bottleneck (ICML 2020)
NANSY: Neural Analysis and Synthesis: Reconstructing Speech from Self-Supervised Representations (NeurIPS 2021)

GAN Based

CycleGAN-VC V1: Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks (2017)
StarGAN-VC: StarGAN-VC: non-parallel many-to-many Voice Conversion Using Star Generative Adversarial Networks (2018)
CycleGAN-VC V2: CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion (2019)
CycleGAN-VC V3: CycleGAN-VC3: Examining and Improving CycleGAN-VCs for Mel-spectrogram Conversion (2020)
MaskCycleGAN-VC: MaskCycleGAN-VC: Learning Non-parallel Voice Conversion with Filling in Frames (ICASSP 2021)

Singing

Singing Voice Synthesis

XiaoIce Band: XiaoIce Band: A Melody and Arrangement Generation Framework for Pop Music (KDD 2018)
Mellotron: Mellotron: Multispeaker expressive voice synthesis by conditioning on rhythm, pitch and global style tokens (2019)
ByteSing: ByteSing: A Chinese Singing Voice Synthesis System Using Duration Allocated Encoder-Decoder Acoustic Models and WaveRNN Vocoders (2020)
JukeBox: Jukebox: A Generative Model for Music (2020)
XiaoIce Sing: XiaoiceSing: A High-Quality and Integrated Singing Voice Synthesis System (2020)
HiFiSinger: HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis (2020)
Sequence-to-sequence Singing Voice Synthesis with Perceptual Entropy Loss (2020)
Learn2Sing: Learn2Sing: Target Speaker Singing Voice Synthesis by learning from a Singing Teacher (2020-11)
MusicBERT: MusicBERT: Symbolic Music Understanding with Large-Scale Pre-Training (ACL 2021)
SingGAN (Singing Voice Vocoder): SingGAN: Generative Adversarial Network For High-Fidelity Singing Voice Generation (AAAI 2022)
Background music generation: Video Background Music Generation with Controllable Music Transformer (ACM Multimedia 2021)
Multi-Singer (Singing Voice Vocoder): Multi-Singer: Fast Multi-Singer Singing Voice Vocoder With A Large-Scale Corpus (ACM Multimedia 2021)
Rapping-singing voice synthesis: Rapping-Singing Voice Synthesis based on Phoneme-level Prosody Control (SSW 11)
VISinger (VITS for Singing Voice Synthesis): VISinger: Variational Inference with Adversarial Learning for End-to-End Singing Voice Synthesis (2021-10)
Opencpop: Opencpop: A High-Quality Open Source Chinese Popular Song Corpus for Singing Voice Synthesis (2022-01)
Learning the Beauty in Songs: Neural Singing Voice Beautifier (ACL 2022)
Learn2Sing 2.0: Diffusion and Mutual Information-Based Target Speaker SVS by Learning from Singing Teacher (2022-03)
MusicLM: MusicLM: Generating Music From Text (2023-01)
SingSong: SingSong: Generating musical accompaniments from singing (2023-01)

Singing Voice Conversion

A Universal Music Translation Network (2018)
Unsupervised Singing Voice Conversion (Interspeech 2019)
PitchNet: PitchNet: Unsupervised Singing Voice Conversion with Pitch Adversarial Network (ICASSP 2020)
DurIAN-SC: DurIAN-SC: Duration Informed Attention Network based Singing Voice Conversion System (Interspeech 2020)
Speech-to-Singing Conversion based on Boundary Equilibrium GAN (Interspeech 2020)
PPG-based singing voice conversion with adversarial representation learning (2020)

Speech Processing Related

Speech Pretrained Model

Audio-Word2Vec: Audio Word2Vec: Unsupervised Learning of Audio Segment Representations using Sequence-to-sequence Autoencoder (2016)
Unsupervised speech representation learning using WaveNet autoencoders (2019)
Improving Transformer-based Speech Recognition Using Unsupervised Pre-training (2019)
SpeechBERT: SpeechBERT: An Audio-and-text Jointly Learned Language Model for End-to-end Spoken Question Answering (2019)
DDSP: DDSP: Differentiable Digital Signal Processing (ICLR 2020)
SoundStream: SoundStream: An End-to-End Neural Audio Codec (2021-07)
NANSY: Neural Analysis and Synthesis: Reconstructing Speech from Self-Supervised Representations (NeurIPS 2021)
Study of audio representations: Audio representations for deep learning in sound synthesis: A review (2022-01)
MuLan (Music Text Embedding): MuLan: A Joint Embedding of Music Audio and Natural Language (2022-08)
AudioLM: AudioLM: a Language Modeling Approach to Audio Generation (2022-09)
AudioGen: AudioGen: Textually Guided Audio Generation (2022-09)
NANSY++: NANSY++: Unified Voice Synthesis with Neural Analysis and Synthesis (2022-11)

Speech Separation

TasNet: TasNet: time-domain audio separation network for real-time, single-channel speech separation (ICASSP 2018)
Conv-TasNet: Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation
(Music Source Separation): Decoupling Magnitude and Phase Estimation with Deep ResUNet for Music Source Separation (2021-09)

Speaker Verification

DeepSpeaker: Deep Speaker: an End-to-End Neural Speaker Embedding System (2017)
GE2E Loss: Generalized End-to-End Loss for Speaker Verification (ICASSP 2018)

Audio Super Resolution

VoiceFixer: Toward General Speech Restoration With Neural Vocoder (2021)

Tools

ESPnet: ESPnet: End-to-End Speech Processing Toolkit (2018)
SpeechBrain: speechbrain
SpeechBrain Paper: SpeechBrain: A General-Purpose Speech Toolkit
ESPnet2-TTS: ESPnet2-TTS: Extending the Edge of TTS Research (2021-10)
WeNet 2.0: WeNet 2.0: More Productive End-to-End Speech Recognition Toolkit (2022-03)

Natural Language Processing

Sequence Modeling

LSTM: Long Short-term Memory (1997)
GRU: Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation (EMNLP 2014)
TCN: An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling (2018)
Transformer: Attention Is All You Need (NIPS 2017)
Transformer-XL: Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context (ACL 2019)
Reformer: Reformer: The Efficient Transformer (ICLR 2020)

Pretrained Model

Awesome Repositories: transformers
BERT: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (NAACL 2019)
XLNET: XLNet: Generalized Autoregressive Pretraining for Language Understanding (NeurIPS 2019)
ALBERT: ALBERT: A Lite BERT for Self-supervised Learning of Language Representations (ICLR 2020)
Masked Autoencoders that Listen (2022-07)

Non-autoregressive Translation Model

A Study of Non-autoregressive Model for Sequence Generation (ACL 2020)
Deterministic Non-Autoregressive Neural Sequence Modeling by Iterative Refinement (EMNLP 2018)
Non-Autoregressive Neural Machine Translation (ICLR 2018)
Non-Autoregressive Machine Translation with Auxiliary Regularization (AAAI 2019)
Mask-Predict: Parallel Decoding of Conditional Masked Language Models (EMNLP 2019)

Speech2Speech Translation Model

Awesome Paper List: awesome-speech-translation
Direct speech-to-speech translation with a sequence-to-sequence model (Interspeech 2019)
NeurST: NeurST: Neural Speech Translation Toolkit (2020-12)
Translatotron 2: Translatotron 2: Robust direct speech-to-speech translation (2021-07)

Neural Machine Reading Comprehension

Review 2019: Neural Machine Reading Comprehension: Methods and Trends (2019)
Review 2020: A Survey on Machine Reading Comprehension: Tasks, Evaluation Metrics, and Benchmark Datasets (2020)
NMRC first: Teaching Machines to Read and Comprehend (NIPS 2015)
RACE dataset: RACE: Large-scale ReAding Comprehension Dataset From Examinations (EMNLP 2017)
Cloze test: Large-scale Cloze Test Dataset Created by Teachers (EMNLP 2018)
HuggingFace: HuggingFace's Transformers: State-of-the-art Natural Language Processing (2019)

VAE & GAN

VAE

VAE: Auto-Encoding Variational Bayes (ICLR 2014)
GM-VAE: Deep Unsupervised Clustering with Gaussian Mixture Variational Autoencoders (ICLR 2017)
VQ-VAE: Neural Discrete Representation Learning (NIPS 2017)
VQ-VAE 2: Generating Diverse High-Fidelity Images with VQ-VAE-2 (NeurIPS 2019)

GAN

GAN: Generative Adversarial Networks (NIPS 2014)
Condition-GAN: Conditional Generative Adversarial Nets (2014)
Info-GAN: InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets (2016)
SeqGAN: SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient (AAAI 2017)
Cycle-GAN: Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks (ICCV 2017)
Star-GAN: StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation (CVPR 2018)
BigGAN: Large Scale GAN Training for High Fidelity Natural Image Synthesis (ICLR 2019)
Style-GAN: A Style-Based Generator Architecture for Generative Adversarial Networks (CVPR 2019)

Others

(Forgetting learning): An Empirical Study of Example Forgetting during Deep Neural Network Learning (ICLR 2019)
ScaNN (search accelerating): Accelerating Large-Scale Inference with Anisotropic Vector Quantization (ICML 2020)
(memory management): Efficient Memory Management for Deep Neural Net Inference (2020)
Conformer: Conformer: Convolution-augmented Transformer for Speech Recognition (Interspeech 2020)
Computational Arts: When Creators Meet the Metaverse: A Survey on Computational Arts (2021-11)
