lissyx edited this page May 17, 2018 · 40 revisions

DeepSpeech

Introduction

Welcome to the DeepSpeech wiki. This space holds answers to questions that are not related to the code or the project's goals (and so should not be filed as issues), but are common enough to warrant a dedicated place for documentation. Some examples of good topics are how to deploy our code on your favorite cloud provider, how to train on your own custom dataset, or how to use the native client on your favorite platform or framework. We don't currently have answers to all of those questions, so contributions are welcome!

Frequently Asked Questions

Where do I get pre-trained models?

DeepSpeech cannot do speech-to-text without a trained model file. You can create your own (see below), or use pre-trained model files available on the releases page.

How can I train using my own data?

The easiest way to train on a custom dataset is to write your own importer that knows the structure of your audio and text files. All you have to do is generate a CSV file for each of your train, validation and test splits, with three columns, wav_filename, wav_filesize and transcript, that specify the path to the WAV file, its size, and the corresponding transcript text.
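A minimal importer sketch, using only the standard library, might look like the following. The file names, transcripts, and output directory are placeholders for illustration; a real importer would walk your own dataset instead:

```python
import csv
import os
import tempfile

# Hypothetical (wav_filename, transcript) pairs standing in for a real dataset.
samples = [("clip1.wav", "hello world"), ("clip2.wav", "good morning")]

out_dir = tempfile.mkdtemp()
# Create stand-in audio files so their sizes can be read; a real importer
# would point at your actual WAV files instead.
for name, _ in samples:
    with open(os.path.join(out_dir, name), "wb") as f:
        f.write(b"\x00" * 1024)

# Write the three-column CSV format DeepSpeech expects for each split.
csv_path = os.path.join(out_dir, "train.csv")
with open(csv_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["wav_filename", "wav_filesize", "transcript"])
    for name, transcript in samples:
        path = os.path.join(out_dir, name)
        writer.writerow([path, os.path.getsize(path), transcript])

print(open(csv_path).read().splitlines()[0])
# -> wav_filename,wav_filesize,transcript
```

You would generate one such CSV per split (train, validation, test) and point the training flags at them.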

To start writing your own importer, run bin/run-ldc93s1.sh, then look at the CSV file in data/ldc93s1 that's generated by bin/import_ldc93s1.sh, and also the other more complex bin/import_* scripts for inspiration. There's no requirement to use Python for the importer, as long as the generated CSV conforms to the format specified above.

DeepSpeech's requirements for the data are that the transcripts match the [a-z ]+ regex, and that the audio is stored as WAV (PCM) files.
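A quick way to sanity-check your data against these two constraints, using only the standard library (the function names here are illustrative, not part of DeepSpeech):

```python
import re
import wave

def valid_transcript(text):
    """Transcripts must match [a-z ]+: lowercase letters and spaces only."""
    return re.fullmatch(r"[a-z ]+", text) is not None

def is_pcm_wav(path):
    """The stdlib wave module only reads uncompressed PCM WAV files, so a
    successful open is a reasonable proxy for the audio requirement."""
    try:
        with wave.open(path, "rb") as w:
            return w.getnframes() > 0
    except (wave.Error, EOFError, FileNotFoundError):
        return False

print(valid_transcript("hello world"))    # True
print(valid_transcript("Hello, world!"))  # False: uppercase and punctuation
```

Running checks like these over your CSV before training can save a failed run later.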

How can I import trained weights to do inference?

We save checkpoints (documentation) in the folder you specified with the --checkpoint_dir argument when running DeepSpeech.py. You can import them with the standard TensorFlow tools and run inference. A simpler inference graph is created in the export function in DeepSpeech.py; you can copy and paste that and restore the weights from a checkpoint to run experiments. Alternatively, you can also use the model exported by export directly with TensorFlow Serving.

How can I train on Amazon AWS/Google CloudML/my favorite platform?

Currently we train on our own hardware with NVIDIA Titan X GPUs, so we don't have answers for these questions. Contributions are welcome!

I get an error about lm::FormatLoadException during training

If you get an error that looks like this:

Loading the LM will be faster if you build a binary file.
Reading data/lm/lm.binary
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
terminate called after throwing an instance of 'lm::FormatLoadException'
  what():  native_client/kenlm/lm/read_arpa.cc:65 in void lm::ReadARPACounts(util::FilePiece&, std::vector<long unsigned int>&) threw FormatLoadException.
first non-empty line was "version https://git-lfs.github.com/spec/v1" not \data\. Byte: 43
Aborted (core dumped)
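The "version https://git-lfs.github.com/spec/v1" line in that message is the tell: the file on disk is a Git LFS pointer, not the actual language model. A quick check (the /tmp path here simulates the broken file for illustration; inspect your own data/lm/lm.binary the same way):

```shell
# Simulate a file that Git LFS left as a pointer, for illustration.
printf 'version https://git-lfs.github.com/spec/v1\n' > /tmp/lm.binary

# A real KenLM binary does not start with this text; if yours does, the
# model was never fetched -- install Git LFS and re-fetch the files.
if head -c 7 /tmp/lm.binary | grep -q '^version'; then
  echo "lm.binary is a Git LFS pointer, not a real model"
fi
```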

Then you forgot to install Git LFS before cloning the repository. Make sure you follow the instructions on https://git-lfs.github.com/, including running git lfs install once before you clone the repo.

I get an error about E tensorflow/core/framework/op_segment.cc:53] Create kernel failed: Invalid argument: NodeDef mentions attr ‘identical_element_shapes’ when running inference

This is because you are trying to run inference on a model that was trained with a version of TensorFlow that added identical_element_shapes. This change landed in TensorFlow r1.5, and the latest v0.1.1 binaries published are built with TensorFlow r1.4. You can either re-train with an older version of TensorFlow, or use newer (but potentially unstable) binaries:

Add your own question/answer

Please add your own questions and answers above, or ask questions below.

  1. Is it possible to use AMD GPUs with DeepSpeech?

  2. Why can't I speak directly to DeepSpeech instead of first making an audio recording? This seems an unnecessary step and greatly reduces the value/ease of use.

  3. I would like to use this to send voice commands to my Linux Desktop to run commands, open programs and transcribe emails for example.

  4. What is the process of making "custom trained DeepSpeech engines" available to the end user on web, Android apps and iOS apps? Can the "recognition engine" run in the browser without realtime server involvement? In what form (java lib, C lib, javascript, objectiveC, ...) is the "recognition engine" integrated into the mobile apps?

  5. What is the accuracy of this speech recogniser compared to others using the pretrained models?

  6. Why does the pretrained model always return an empty string? Am I doing it wrong?
