Voice technology is growing rapidly. For organisations, businesses and individuals trying to make sense of voice and where it sits in their technical architectures, the range of open source offerings can be confusing.
This repo is a listing of known open source voice tools, structured by where those tools sit in the voice stack.
- Duca, Daniela. “Disrupting Transcription – How Automation Is Transforming a Foundational Research Method.” Impact of Social Sciences (blog), September 17, 2019. https://blogs.lse.ac.uk/impactofsocialsciences/2019/09/17/disrupting-transcription-how-technology-is-transforming-a-foundational-research-method/.
Website | Tool name | License | Description |
---|---|---|---|
openslr.org | Open Speech Language Resources | N/A | Run by @danpovey, who is also a key maintainer of the Kaldi-ASR speech to text tool |
kaldi-asr.org | Kaldi Automatic Speech Recognition toolkit | Apache 2 | One of the first open source speech recognition toolkits. The academic reference is: Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., ... & Silovsky, J. (2011). The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society. |
Website | Tool name | License | Description |
---|---|---|---|
Flowtron by Nvidia | A Tacotron-based speech synthesis tool which can be tweaked for pitch and prosody, setting it apart from other Tacotron-based TTS implementations | Apache 2 | First released at the GTC 2020 Conference in May 2020. The academic paper is available here. The citation is: Valle, R., Shih, K., Prenger, R., & Catanzaro, B. (2020). Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis. arXiv preprint arXiv:2005.05957. |
- Mwiti, Derrick. “A 2019 Guide to Speech Synthesis with Deep Learning.” Medium, June 23, 2020. https://heartbeat.fritz.ai/a-2019-guide-to-speech-synthesis-with-deep-learning-630afcafb9dd.
^ This is a great article that explains the successive generations of text to speech, from concatenative to statistical parametric to generative. More modern TTS approaches such as Tacotron and WaveNet are generative.
Website | Tool name | License | Description |
---|---|---|---|
Mindmeld by Cisco | . | Apache 2 | The MindMeld Conversational AI platform is among the most advanced AI platforms for building production-quality conversational applications. It is a Python-based machine learning framework which encompasses all of the algorithms and utilities required for this purpose. Evolved over several years of building and deploying dozens of the most advanced conversational experiences achievable, MindMeld is optimized for building advanced conversational assistants which demonstrate deep understanding of a particular use case or domain while providing highly useful and versatile conversational experiences. The academic reference for this tool is: Raghuvanshi, A., Carroll, L. and Raghunathan, K., 2018, November. Developing Production-Level Conversational Interfaces with Shallow Semantic Parsing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (pp. 157-162). |
- Mycroft.AI - an open source, layered voice assistant that works on a range of Linux-compatible hardware, including x86 and ARM devices such as the Raspberry Pi. Supported by a strong community of open source developers.
- OVAL / Genie project at Stanford - Funded by the Alfred P. Sloan Foundation and by a NIST grant, Stanford's OVAL project aims to provide an open source alternative to commercial voice assistants. The project is currently in its infancy and is attempting to build an open source community.
- Python Natural Language Toolkit NLTK - NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.
- ECCO explainab - ECCO is a Python library that provides explainability for NLP using interactive visualisations.
- DeText source code - DeText is a Deep Text understanding framework for NLP-related ranking, classification, and language generation tasks. It leverages semantic matching using deep neural networks to understand member intents in search and recommender systems. As a general NLP framework, DeText can currently be applied to many tasks, including search and recommendation ranking, multi-class classification and query understanding. Published by the AI team at LinkedIn.
- pglex - First presented at the ICLDC 7 conference in 2021, pglex is a 'pretty good' lexical service designed to facilitate the construction of dictionary websites and other applications that incorporate lexical data. With pglex, researchers can provide lexical entries in JSON format to an instance of the pglex API and get 'pretty good' search results without requiring language-specific configurations. Built on Elasticsearch.
- Artie Bias Corpus - A corpus and set of tools for detecting demographic bias in ASR systems.
- [Blodgett, S. L., Barocas, S., Daumé III, H., & Wallach, H. (2020). Language (Technology) is Power: A Critical Survey of "Bias" in NLP. arXiv preprint arXiv:2005.14050.](https://arxiv.org/pdf/2005.14050.pdf)
- ALIZE Open source toolkit for speaker recognition - A cross-platform, open source speaker recognition toolkit from Laboratoire Informatique d’Avignon, France.
Forced aligners align audio recordings with their orthographic transcriptions.
- aeneas (Docs) - a Python/C library and a set of tools to automagically synchronize audio and text (aka forced alignment).
- Berlin Database of Emotional Speech - A corpus of German (Deutsch) speech tagged with emotions.
- The Pile - The Pile is an 825 GiB diverse, open source language modelling dataset that combines 22 smaller, high-quality datasets.
- ActiveClean - ActiveClean is an iterative cleaning framework that can correctly retrain the machine learning model when data is cleaned, and provides a set of optimizations to select the best data to be cleaned. In this way, you only need to clean a small subset of the data to produce a model similar to one trained on the fully cleaned dataset. Written in Python.
- DataLinter - The Data Linter identifies potential issues (lints) in your ML training data.
- Holoclean - Machine learning system for data enrichment.
_There's also BoostClean from Columbia University, but I can't find a code reference anywhere on the web._
- No Language Left Behind - Released by Meta, the NLLB project aims to make low-resource languages more accessible by providing a machine translation model which can translate between 200 languages. The model is evaluated using a human-translated benchmark, FLORES-200, and performs 44% better than previous state-of-the-art scores as measured by BLEU.
- @mutiann's Speech Rankings - this site ranks academic papers in speech by citations, in the style of CSRankings.
There are a lot of terms and acronyms in open source voice technology. This section provides explanations for each of them.
- Cognitive arbitration: The process a voice assistant uses to understand what services and skills are available to it, depending on its context, such as being online or offline.
- CRF: Conditional random field. A statistical modelling method which can take context into account. Used in some neural-network based intent-parsing and semantic extraction software.
- LSTM: Long short-term memory. A recurrent neural network architecture used to process sequences of data, such as audio or speech. To predict what is likely to come next, an LSTM keeps a memory of what came previously.
- LVCSR: Large vocabulary continuous speech recognition. Used in speech recognition tools to denote that a) the vocabulary on which the recognizer works has not been restricted or constrained, as it might be when deployed on embedded or low-powered hardware which cannot handle the memory or compute requirements of a large vocabulary, and b) the recognizer works continuously, in contrast to a Wake Word or Keyword spotter which cedes control to the STT once a Wake Word is detected.
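
To make the LSTM entry above a little more concrete, here is a minimal sketch of a single LSTM cell in plain Python. It is an illustration only, not code from any tool listed in this repo: the cell is scalar (one input, one hidden unit), and the names `lstm_step`, `run_sequence` and `DEMO_WEIGHTS` as well as the weight values are hypothetical, hand-picked so that the cell stores a value when it sees an input of 1 and then holds on to it. Real toolkits learn these weights from data and work on vectors.

```python
import math


def sigmoid(x):
    """Squash a number into the range (0, 1); used for the gates."""
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, weights):
    """Run one time step of a scalar LSTM cell.

    weights maps a gate name to a tuple (w_x, w_h, b), i.e. the weight on the
    current input, the weight on the previous hidden state, and a bias.
    """
    def pre_activation(gate):
        w_x, w_h, b = weights[gate]
        return w_x * x + w_h * h_prev + b

    forget = sigmoid(pre_activation("forget"))           # how much old memory to keep
    write = sigmoid(pre_activation("input"))             # how much new content to store
    output = sigmoid(pre_activation("output"))           # how much memory to expose
    candidate = math.tanh(pre_activation("candidate"))   # proposed new content

    c = forget * c_prev + write * candidate  # cell state: the "memory"
    h = output * math.tanh(c)                # hidden state passed to the next step
    return h, c


def run_sequence(xs, weights):
    """Feed a sequence through the cell, starting from an empty memory."""
    h, c = 0.0, 0.0
    for x in xs:
        h, c = lstm_step(x, h, c, weights)
    return h, c


# Hypothetical, hand-picked weights (an assumption for illustration only):
# always keep memory, always expose it, and only write when the input is 1.
DEMO_WEIGHTS = {
    "forget": (0.0, 0.0, 10.0),
    "input": (10.0, 0.0, -5.0),
    "output": (0.0, 0.0, 10.0),
    "candidate": (10.0, 0.0, 0.0),
}

if __name__ == "__main__":
    print(run_sequence([1.0, 0.0, 0.0, 0.0], DEMO_WEIGHTS))
    print(run_sequence([0.0, 0.0, 0.0, 0.0], DEMO_WEIGHTS))
```

With these weights, the first sequence ends with a cell state close to 1 even though the last three inputs are 0, while the all-zero sequence leaves the memory empty. That carry-over of earlier inputs is what the glossary entry means by recording what came previously.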