Home
Table of Contents
- Introduction
- Frequently Asked Questions
  - How can I train using my own data?
  - How can I import trained weights to do inference?
  - How can I train on Amazon AWS/Google CloudML/my favorite platform?
  - I get an error about lm::FormatLoadException during training
  - How to build your homemade DEEPSPEECH model, from scratch! TUTO
- Add your own question/answer
Introduction
Welcome to the DeepSpeech wiki. This space is meant to hold answers to questions that are not related to the code or the project's goals, and so should not be filed as issues, but are common enough to warrant a dedicated place for documentation. Some examples of good topics are how to deploy our code on your favorite cloud provider, how to train on your own custom dataset, or how to use the native client on your favorite platform or framework. We don't currently have answers to all of those questions, so contributions are welcome!
Frequently Asked Questions
How can I train using my own data?
The easiest way to train on a custom dataset is to write your own importer that knows the structure of your audio and text files. All you have to do is generate CSV files for your splits with three columns, wav_filename, wav_filesize and transcript, that specify the path to the WAV file, its size, and the corresponding transcript text for each of your train, validation and test splits.
To start writing your own importer, run bin/run-ldc93s1.sh, then look at the CSV file in data/ldc93s1 that's generated by bin/import_ldc93s1.sh, and also at the other, more complex bin/import_* scripts for inspiration. There's no requirement to use Python for the importer, as long as the generated CSV conforms to the format specified above.
DeepSpeech's requirements for the data are that the transcripts match the [a-z ]+ regex and that the audio is stored as WAV (PCM) files.
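As a minimal sketch of such an importer (written in shell here, though any language works), the loop below writes a CSV in the expected format. It assumes a hypothetical layout where every clips/NAME.wav has a matching clips/NAME.txt holding its already-cleaned transcript; adapt the paths and the transcript source to your own data.
#!/bin/sh
# Hypothetical layout: clips/foo.wav with a matching transcript in clips/foo.txt
echo "wav_filename,wav_filesize,transcript" > train.csv
for wav in clips/*.wav; do
  size=$(stat -c %s "$wav")              # file size in bytes (GNU stat)
  transcript=$(cat "${wav%.wav}.txt")    # transcript text, already matching [a-z ]+
  echo "$wav,$size,$transcript" >> train.csv
done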
How can I import trained weights to do inference?
We save checkpoints in the folder you specified with the --checkpoint_dir argument when running DeepSpeech.py. You can import them with the standard TensorFlow tools and run inference. A simpler inference graph is created in the export function in DeepSpeech.py; you can copy and paste that and restore the weights from a checkpoint to run experiments. Alternatively, you can also use the model exported by export directly with TensorFlow Serving.
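As a rough illustration of the TensorFlow Serving route, the command below starts a model server pointing at an exported model. It assumes your export directory holds a SavedModel in numbered version subdirectories (e.g. model_export/1/); if your export only produced a frozen output_graph.pb, you would need to convert it to the SavedModel format first.
tensorflow_model_server \
  --port=9000 \
  --model_name=deepspeech \
  --model_base_path=/path/to/model_export   # must contain version subdirectories like 1/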
How can I train on Amazon AWS/Google CloudML/my favorite platform?
Currently we train on our own hardware with NVIDIA Titan Xs, so we don't have answers for those questions. Contributions are welcome!
I get an error about lm::FormatLoadException during training
If you get an error that looks like this:
Loading the LM will be faster if you build a binary file.
Reading data/lm/lm.binary
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
terminate called after throwing an instance of 'lm::FormatLoadException'
what(): native_client/kenlm/lm/read_arpa.cc:65 in void lm::ReadARPACounts(util::FilePiece&, std::vector<long unsigned int>&) threw FormatLoadException.
first non-empty line was "version https://git-lfs.github.com/spec/v1" not \data\. Byte: 43
Aborted (core dumped)
Then you forgot to install Git LFS before cloning the repository. Make sure you follow the instructions on https://git-lfs.github.com/, including running git lfs install once before you clone the repo.
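For reference, a typical sequence of commands looks like this (the package name assumes a Debian/Ubuntu system; see the Git LFS site for other platforms):
sudo apt-get install git-lfs        # install the Git LFS extension
git lfs install                     # enable it for your user (run once)
git clone https://github.com/mozilla/DeepSpeech.git
# If you already cloned without LFS, fetch the real files afterwards with:
git lfs pull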
How to build your homemade DEEPSPEECH model, from scratch! TUTO
This is based on personal experience; adapt it to your needs. For my robotics project, I needed to create a small mono-speaker model able to recognize nearly 1000 sentence commands (not just single words!). I recorded the WAVs with a ReSpeaker Microphone Array: https://www.seeedstudio.com/ReSpeaker-Mic-Array-Far-field-w%2F-7-PDM-Microphones-p-2719.html
The WAVs were recorded with the following parameters: mono / 16 bits / 16 kHz. Using the Google VAD library helped me trim the silence before/after each WAV, but DeepSpeech seems to process the whole WAV, unnecessary silence included. (Removing silence before processing does shorten model creation time, though!)
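If your recordings are not already in that format, a tool such as SoX can convert them; a minimal example with hypothetical file names:
# Convert input.wav to 16 kHz, mono, 16-bit PCM as expected above
sox input.wav -r 16000 -c 1 -b 16 output.wav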
MATERIAL PREPARATION:
- You should have a directory containing all your WAVs (the more, the better!),
- and a text file containing the complete transcript of each WAV, one per line (UTF-8 encoded). We'll call this text file the original text file.
1 - Original text file cleaning:
- Open the alphabet.txt file,
- feed in your own alphabet,
- and save it.
- Open the original text file and, with your best editor, clean it up: all of its characters MUST be present in alphabet.txt. Remove any punctuation, but you can keep the apostrophe as a character if it is present in your alphabet. A rough cleanup sketch follows this list.
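Purely as an illustration (the exact character set depends on your own alphabet.txt; the accented letters below are an assumption for a French alphabet, and tr only lowercases ASCII, so accented capitals may need manual fixing):
# Lowercase, then strip every character not in the allowed set (letters, apostrophe, space)
tr '[:upper:]' '[:lower:]' < original.txt \
  | sed "s/[^a-zàâçéèêëîïôùûü' ]//g" > cleaned.txt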
2 - Create 3 directories: train, valid, test.
3 - Feed each directory with its corresponding WAVs and a new transcript file, as a CSV file, containing those specific WAVs' transcripts.
Notes about the CSV files:
- You must have train.csv in the train dir, valid.csv in the valid dir and test.csv in the test dir.
- Each CSV file must start with the line: wav_filename,wav_filesize,transcript
- An example of my train.csv content:
/home/nvidia/DeepSpeech/data/alfred/test/record.1.wav,85484,qui es-tu et qui est-il
/home/nvidia/DeepSpeech/data/alfred/test/record.2.wav,97004,quel est ton nom ou comment tu t'appelles...
- It seems that we should split all the WAVs with the following ratio: 70 - 20 - 10!
70% of all WAVs in the train dir, with the corresponding train.csv file,
20% in the valid dir, with the corresponding valid.csv file,
10% in the test dir, with the corresponding test.csv file.
IMPORTANT: a given WAV file must appear in only one directory; reusing it across splits would hurt model creation (it could result in overfitting...). A splitting sketch follows below.
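Purely as a sketch, assuming a hypothetical all.csv that already contains the header line plus one row per WAV, a 70/20/10 split could be done like this:
# Shuffle the data rows, then cut them into 70/20/10 slices
tail -n +2 all.csv | shuf > shuffled.csv
total=$(wc -l < shuffled.csv)
n_train=$((total * 70 / 100))
n_valid=$((total * 20 / 100))
head -n 1 all.csv | tee train/train.csv valid/valid.csv test/test.csv > /dev/null
head -n "$n_train" shuffled.csv >> train/train.csv
tail -n +"$((n_train + 1))" shuffled.csv | head -n "$n_valid" >> valid/valid.csv
tail -n +"$((n_train + n_valid + 1))" shuffled.csv >> test/test.csv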
LANGUAGE MODEL CREATION:
Here, we use the original text file containing 100% of the WAV transcripts, and we rename it vocabulary.txt.
We'll use the powerful KenLM tools for our LM build: http://kheafield.com/code/kenlm/estimation/
1 - Creating the ARPA file for the binary build (adapt the path to your own KenLM build/bin directory):
/bin/bin/./lmplz --text vocabulary.txt --arpa words.arpa --o 3
I asked Kenneth Heafield about the -o parameter ("order of the model to estimate"); it seems that for a small corpus (my case), a value of 3 to 4 is the best way to success. See the lmplz parameters on the link above if needed.
2 - Creating the binary file:
/bin/bin/./build_binary -T -s words.arpa lm.binary
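To sanity-check the resulting binary model, you can score a few in-domain sentences with KenLM's query tool (the path, again, depends on where your KenLM binaries live):
# query prints per-word and total log10 probabilities; sentences the model knows well score higher (less negative)
echo "quel est ton nom" | bin/query lm.binary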
TRIE CREATION:
We'll use the native_client "generate_trie" binary to create our trie file.
Creating the trie file (adapt the paths to your setup!):
/home/nvidia/tensorflow/bazel-bin/native_client/generate_trie \
/home/nvidia/DeepSpeech/data/vocabulary.txt \
/home/nvidia/DeepSpeech/data/lm.binary \
/home/nvidia/DeepSpeech/data/texte_complet_allege.txt \
/home/nvidia/DeepSpeech/data/trie
RUN MODEL CREATION:
Well, we can pause and take a coffee...
1 - Check your directory layout (mine lives under /home/nvidia/DeepSpeech/data/alfred/); it should contain:
- train/
  - train.csv
  - record.1.wav
  - record.2.wav ... (remember: all WAVs are different)
- valid/
  - valid.csv
  - record.1.wav
  - record.2.wav ...
- test/
  - test.csv
  - record.1.wav
  - record.2.wav ...
- vocabulary.txt
- lm.binary
- trie
2 - Write your run file, run-alfred.sh:
#!/bin/sh
set -xe
if [ ! -f DeepSpeech.py ]; then
echo "Please make sure you run this from DeepSpeech's top level directory."
exit 1
fi;
python -u DeepSpeech.py \
--train_files /home/nvidia/DeepSpeech/data/alfred/train/train.csv \
--dev_files /home/nvidia/DeepSpeech/data/alfred/dev/dev.csv \
--test_files /home/nvidia/DeepSpeech/data/alfred/test/test.csv \
--train_batch_size 80 \
--dev_batch_size 80 \
--test_batch_size 40 \
--n_hidden 375 \
--epoch 33 \
--validation_step 1 \
--early_stop True \
--earlystop_nsteps 6 \
--estop_mean_thresh 0.1 \
--estop_std_thresh 0.1 \
--dropout_rate 0.22 \
--learning_rate 0.00095 \
--report_count 100 \
--use_seq_length False \
--export_dir /home/nvidia/DeepSpeech/data/alfred/results/model_export/ \
--checkpoint_dir /home/nvidia/DeepSpeech/data/alfred/results/checkout/ \
--decoder_library_path /home/nvidia/tensorflow/bazel-bin/native_client/libctc_decoder_with_kenlm.so \
--alphabet_config_path /home/nvidia/DeepSpeech/data/alfred/vocabulary.txt \
--lm_binary_path /home/nvidia/DeepSpeech/data/alfred/lm.binary \
--lm_trie_path /home/nvidia/DeepSpeech/data/alfred/trie \
"$@"
Adapt the paths to fit your needs...
Now, run the file IN YOUR DEEPSPEECH directory:
./bin/run-alfred.sh
If everything worked correctly, you should now have your model at model_export/output_graph.pb.
Enjoy your model and run some inferences with it!
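As a last step, you can try an inference with the deepspeech client against the exported graph. The argument order below matches the 0.1.x-era client (model, audio, alphabet, lm, trie) and has changed in later releases, so check deepspeech --help for your version; the test WAV name is hypothetical and the other paths follow the run file above:
deepspeech /home/nvidia/DeepSpeech/data/alfred/results/model_export/output_graph.pb \
           some_test_recording.wav \
           /home/nvidia/DeepSpeech/data/alfred/vocabulary.txt \
           /home/nvidia/DeepSpeech/data/alfred/lm.binary \
           /home/nvidia/DeepSpeech/data/alfred/trie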
Add your own question/answer
Please add your own questions and answers!
Don't edit this footer for questions; add them to the page with the edit button at the top.