This repository organizes papers, learning materials, and code for understanding speech. There is a separate repository for machine/deep learning here.
- organize stars
- add more papers
- papers to read:
- Speech-T: Transducer for Text to Speech and Beyond
TTS
- DC-TTS [[paper]] [pytorch][tensorflow]
- Microsoft's LightSpeech [[paper]] [code]
- SpeechFormer [[paper]] [code]
- Non-Attentive Tacotron [paper] [pytorch]
- Parallel Tacotron 2 [[paper]] [code]
- FCL-taco2: Fast, Controllable and Lightweight version of Tacotron2 [[paper]] [code]
- Transformer TTS: Neural Speech Synthesis with Transformer Network [[paper]] [code]
- VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech [[paper]] [code]
- Reformer-TTS (adaptation of Reformer to TTS) [code]
Prompt-based TTS (see [link])
Voice Conversion / Voice Cloning / Speaker Embedding
- StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks [[paper]] [code]
- Neural Voice Cloning with Few Audio Samples (Baidu) [[paper]] [code]
- Assem-VC: Realistic Voice Conversion by Assembling Modern Speech Synthesis Techniques [[paper]] [code]
- Unet-TTS: Improving Unseen Speaker and Style Transfer in One-Shot Voice Cloning [paper] [code]
- FragmentVC: Any-to-any voice conversion by end-to-end extracting and fusing fine-grained voice fragments with attention [[paper]] [code]
- VectorQuantizedCPC: Vector-Quantized Contrastive Predictive Coding for Acoustic Unit Discovery and Voice Conversion [[paper]] [code]
- Cotatron: Transcription-Guided Speech Encoder for Any-to-Many Voice Conversion without Parallel Data [[paper]] [code]
- AGAIN-VC: A One-shot Voice Conversion using Activation Guidance and Adaptive Instance Normalization [[paper]] [code]
- AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss [[paper]] [code]
- SC-GlowTTS: an Efficient Zero-Shot Multi-Speaker Text-To-Speech Model [code]
- Deep Speaker: an End-to-End Neural Speaker Embedding System [[paper]] [code]
- VQMIVC: One-shot (any-to-any) Voice Conversion [[paper]] [code]
Style (Emotion, Prosody)
- SMART-TTS Single Emotional TTS [code]
- Cross Speaker Emotion Transfer [[paper]] [code]
- AutoPST: Global Rhythm Style Transfer Without Text Transcriptions [[paper]] [code]
- Transforming spectrum and prosody for emotional voice conversion with non-parallel training data [[paper]] [code]
- Multi-Reference Neural TTS Stylization with Adversarial Cycle Consistency [[paper]] [code]
- Learning Latent Representations for Style Control and Transfer in End-to-end Speech Synthesis (Tacotron-VAE) [[paper]] [code]
- Time Domain Neural Audio Style Transfer (NIPS 2017) [[paper]] [code]
- Meta-StyleSpeech and StyleSpeech [[paper]] [code]
- Cross-Speaker Emotion Transfer Based on Speaker Condition Layer Normalization and Semi-Supervised Training in Text-to-Speech [[paper]] [code]
Cross-lingual
- End-to-End Code-switching TTS with Cross-Lingual Language Model
- Mandarin and English
- cross-lingual and multi-speaker
- baseline: "Building a mixed-lingual neural TTS system with only monolingual data"
- Building a mixed-lingual neural TTS system with only monolingual data
- Transfer Learning, Style Control, and Speaker Reconstruction Loss for Zero-Shot Multilingual Multi-Speaker Text-to-Speech on Low-Resource Languages
- has many good references
- Exploring Disentanglement with Multilingual and Monolingual VQ-VAE [paper] [code]
Music Related
- Learning the Beauty in Songs: Neural Singing Voice Beautifier (ACL 2022) [[paper]] [code]
- Speech to Singing (Interspeech 2020) [[paper]] [code]
- DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism (AAAI 2022) [[paper]] [code]
- A Universal Music Translation Network (ICLR 2019)
- Jukebox: A Generative Model for Music (OpenAI) [paper] [code]
Toolkits
Vocoders
Attention
- Local attention [code]
- Towards End-to-End Spoken Language Understanding
- HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection [[paper]] [code]
- Google AI's VoiceFilter System [[paper]] [code]
- Improved End-to-End Speech Emotion Recognition Using Self Attention Mechanism and Multitask Learning (Interspeech 2019) [[paper]] [code]
- Multimodal Emotion Recognition with Transformer-Based Self Supervised Feature Fusion [[paper]] [code]
- Emotion Recognition from Speech Using Wav2vec 2.0 Embeddings (Interspeech 2021) [[paper]] [code]
- Exploring Wav2vec 2.0 fine-tuning for improved speech emotion recognition [[paper]] [code]
- Rethinking CNN Models for Audio Classification [[paper]] [code]
- EEG-based emotion recognition using SincNet [[paper]] [code]
- Cross attentive pooling for speaker verification (IEEE SLT 2021) [[paper]] [code]
- VGGSound: A Large-scale Audio-Visual Dataset [[paper]] [code]
- CSS10: A collection of single speaker speech datasets for 10 languages [code]
- IEMOCAP: 12 hours of audiovisual data from 10 actors (5 male, 5 female) [website]
- VoxCeleb [repo]
- Audiomentations (Fast audio data augmentation in PyTorch) [code]; a generic augmentation sketch appears at the end of this section
- Montreal Forced Aligner
- For Korean [link]
- Data (pre)processing
- Korean pronunciation and romanization based on Wiktionary ko-pron lua module [code]
- Audio Signal Processing [code]
- Phonological Features (for the paper "Phonological features for 0-shot multilingual speech synthesis") [[paper]] [code]
- SMART-G2P (converts English and kanji expressions in a Korean sentence into Korean pronunciation) [code]
- Kakao Grapheme to Phoneme Conversion Package for "Mandarin" [code]
- Webaverse Speech Tool [code]
- MCD [repo]
- The code works, but I am not sure it is correct: MCD values come out a bit high even for pairs of similar audio clips. See the MCD sketch below.
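
For reference, here is a minimal sketch of how mel cepstral distortion (MCD) is commonly computed, which may help sanity-check the numbers from the MCD repo above. It assumes mel-cepstra have already been extracted (e.g. with pysptk or as MFCCs) and simply truncates to the shorter utterance instead of DTW-aligning frames; the function name and these simplifications are mine, not the linked repo's. Whether c0 is excluded and how frames are aligned changes the result substantially, which is one common reason reported MCD values look inflated.

```python
# Minimal MCD sketch (assumption: mel-cepstra already extracted, e.g. with
# pysptk or as MFCCs). Frames are truncated to the shorter array rather than
# DTW-aligned, so treat the result as a rough estimate only.
import numpy as np

# 10 / ln(10) * sqrt(2): converts the per-frame Euclidean cepstral distance to dB.
_MCD_CONST = 10.0 / np.log(10.0) * np.sqrt(2.0)

def mel_cepstral_distortion(mc_ref: np.ndarray, mc_syn: np.ndarray) -> float:
    """Average MCD in dB between two (frames x coeffs) mel-cepstrum arrays.

    The 0th coefficient (energy) is dropped, as is common practice.
    """
    n = min(len(mc_ref), len(mc_syn))
    diff = mc_ref[:n, 1:] - mc_syn[:n, 1:]          # drop c0, align by truncation
    frame_dist = np.sqrt((diff ** 2).sum(axis=1))   # Euclidean distance per frame
    return float(_MCD_CONST * frame_dist.mean())

# Example call with random "mel-cepstra" just to show the expected shapes:
# ref = np.random.randn(200, 25); syn = ref + 0.1 * np.random.randn(200, 25)
# print(mel_cepstral_distortion(ref, syn))
```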
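The Audiomentations entry above is a data-augmentation library; the sketch below is not its API but a plain-NumPy illustration of the kinds of waveform transforms such libraries apply (additive Gaussian noise, random gain, time shift). All function and parameter names here are made up for illustration.

```python
# Illustrative waveform augmentations in plain NumPy (not any library's API).
# Each transform is applied independently with probability p, mirroring how
# audio augmentation pipelines are usually composed.
import numpy as np

rng = np.random.default_rng(0)

def augment_waveform(samples: np.ndarray,
                     noise_std: float = 0.005,
                     max_gain_db: float = 6.0,
                     max_shift_fraction: float = 0.25,
                     p: float = 0.5) -> np.ndarray:
    """Return a randomly augmented copy of a mono float waveform in [-1, 1]."""
    out = samples.astype(np.float64).copy()
    if rng.random() < p:                              # additive Gaussian noise
        out = out + rng.normal(0.0, noise_std, size=out.shape)
    if rng.random() < p:                              # random gain in dB
        gain_db = rng.uniform(-max_gain_db, max_gain_db)
        out = out * (10.0 ** (gain_db / 20.0))
    if rng.random() < p:                              # circular time shift
        shift = int(rng.uniform(-max_shift_fraction, max_shift_fraction) * len(out))
        out = np.roll(out, shift)
    return np.clip(out, -1.0, 1.0).astype(np.float32)

# waveform = np.zeros(16000, dtype=np.float32)  # 1 s of silence at 16 kHz
# augmented = augment_waveform(waveform)
```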