Copyright (c) 2017 Idiap Research Institute, http://www.idiap.ch/
Written by Pierre-Edouard Honnet <pe[dot]honnet[at]gmail[dot]com>.
This is a set of scripts that combine several tools to perform inverse text normalization (ITN). It is based on OpenNMT-py for the NMT models and on ASRT, a text normalization tool (among other things), which is used to create the data. It is intended to be used as an interface between Automatic Speech Recognition (ASR) and Neural Machine Translation (NMT) modules, as the two kinds of models generally use different conventions for the data format.
- Moses is used for tokenization (http://www.statmt.org/moses/), but you can use your own tokenizer if you prefer.
- ASRT, the Automatic Speech Recognition Tools, is used for text normalization. You can also use another text normalization tool if you prefer.
- OpenNMT-py (https://github.com/OpenNMT/OpenNMT-py) is used to train and run NMT models. Its installation requires several other dependencies, but it mainly relies on PyTorch.
You simply need to adjust the paths in the script scripts/asr2mt.sh
($BASEDIR, $MOSESDIR and $NMTDIR, as well as $nmtmodel if you do not
have the same folder structure), then choose which steps to run, and
finally run the script as:
scripts/asr2mt.sh $inputfile $outputfile
The input file is expected to contain normalized text or ASR output.
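Since the script depends on several paths being set correctly, it can help to check them first. The following is a minimal sketch, not part of the repository: check_paths is a hypothetical helper, and the three variables are assumed to hold the same values you entered in scripts/asr2mt.sh.

```shell
# Hypothetical helper (not in the repository): report every directory
# argument that does not exist, and return non-zero if any is missing.
check_paths() {
    local status=0 dir
    for dir in "$@"; do
        if [ ! -d "$dir" ]; then
            echo "Missing directory: $dir" >&2
            status=1
        fi
    done
    return $status
}

# Example (assumes the variables mirror the values set in the script):
# check_paths "$BASEDIR" "$MOSESDIR" "$NMTDIR" && scripts/asr2mt.sh input.txt output.txt
```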
The following example uses Europarl German data to do ITN on German text (in our tests we used the German side of the German-French parallel part of the corpus).
The commands assume that $MOSESBASE, $ASRTBASE and $OPENNMTBASE point to your Moses, ASRT and OpenNMT-py installations.
mkdir -p data/europarl
cd data/europarl
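# Download the German-French part of Europarl and keep only the German side.
# -O sets the local file name (by default wget would name the file after the query string).
wget -O de-fr.txt.zip "http://opus.lingfil.uu.se/download.php?f=Europarl%2Fde-fr.txt.zip" && unzip de-fr.txt.zip && rm de-fr.txt.zip Europarl.de-fr.fr
# The data was split into train / dev / test as follows: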
head -n 1708742 Europarl.de-fr.de > Europarl.train.de-fr.de
tail -n +1708743 Europarl.de-fr.de | head -n 116505 > Europarl.dev.de-fr.de
tail -n +1825248 Europarl.de-fr.de > Europarl.test.de-fr.de
cd ../..
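The head/tail offsets above are easy to get wrong by one line, so it may be worth checking that the three parts add up to the full file. A minimal sketch, where check_split is a hypothetical helper that is not part of the repository:

```shell
# Hypothetical helper: compare the line count of the full file ($1)
# with the summed line counts of the parts (remaining arguments).
check_split() {
    local full="$1"
    shift
    local total parts=0 n f
    total=$(($(wc -l < "$full")))
    for f in "$@"; do
        n=$(($(wc -l < "$f")))
        parts=$((parts + n))
    done
    if [ "$total" -eq "$parts" ]; then
        echo "OK: $parts lines"
    else
        echo "MISMATCH: $full has $total lines, parts have $parts" >&2
        return 1
    fi
}

# Example, run from the top-level directory:
# check_split data/europarl/Europarl.de-fr.de \
#     data/europarl/Europarl.train.de-fr.de \
#     data/europarl/Europarl.dev.de-fr.de \
#     data/europarl/Europarl.test.de-fr.de
```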
Note that in our tests, we manually removed some noise from the training data before this step (e.g. lines containing only punctuation, or duplicate lines, found with sort | uniq, etc.).
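The cleaning in our tests was done by hand, so there is no script for it. As an illustration only, part of it could be automated along these lines. clean_corpus is a hypothetical helper; unlike sort | uniq, it keeps the original line order, and it also drops empty lines.

```shell
# Hypothetical helper: print the file given as $1 without lines made only
# of punctuation and/or whitespace (including empty lines), and without
# repeated lines (the first occurrence is kept, order is preserved).
clean_corpus() {
    grep -v -E '^[[:punct:][:space:]]*$' "$1" | awk '!seen[$0]++'
}

# Example:
# clean_corpus data/europarl/Europarl.train.de-fr.de > data/europarl/Europarl.train.clean.de
```

The output file name in the example is only a placeholder; the rest of this README uses the original file names.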
# Tokenize
$MOSESBASE/scripts/tokenizer/tokenizer.perl -l de < data/europarl/Europarl.train.de-fr.de > data/europarl/europarl.train.de.tok.txt
$MOSESBASE/scripts/tokenizer/tokenizer.perl -l de < data/europarl/Europarl.dev.de-fr.de > data/europarl/europarl.dev.de.tok.txt
$MOSESBASE/scripts/tokenizer/tokenizer.perl -l de < data/europarl/Europarl.test.de-fr.de > data/europarl/europarl.test.de.tok.txt
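The three tokenizer calls differ only in the split name, so they can be written as a loop. The sketch below uses a hypothetical helper, tokenize_splits, that takes the tokenizer command as arguments, which also makes it easy to plug in your own tokenizer. The file naming follows the commands above.

```shell
# Hypothetical helper: run a tokenizer command (arguments after the
# directory) on the train, dev and test files found in directory $1.
# The command must read from stdin and write to stdout.
tokenize_splits() {
    local dir="$1" split
    shift
    for split in train dev test; do
        "$@" < "$dir/Europarl.$split.de-fr.de" \
            > "$dir/europarl.$split.de.tok.txt" || return 1
    done
}

# Example with the Moses tokenizer:
# tokenize_splits data/europarl "$MOSESBASE/scripts/tokenizer/tokenizer.perl" -l de
```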
# Escape punctuation
sed -f scripts/replace_punc.sed data/europarl/europarl.train.de.tok.txt > data/europarl/europarl.train.de.tok.punc.txt
sed -f scripts/replace_punc.sed data/europarl/europarl.dev.de.tok.txt > data/europarl/europarl.dev.de.tok.punc.txt
sed -f scripts/replace_punc.sed data/europarl/europarl.test.de.tok.txt > data/europarl/europarl.test.de.tok.punc.txt
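The contents of scripts/replace_punc.sed and scripts/replace_back_punc.sed are not reproduced here. As an illustration only, and assuming the idea is to turn standalone punctuation tokens into word-like placeholders that survive normalization and are mapped back afterwards, the round trip could look like the sketch below. The placeholder names are hypothetical and the real scripts may use a different mapping. awk is used instead of sed to keep the token matching simple.

```shell
# Illustration only: the placeholder names are hypothetical and are NOT
# taken from scripts/replace_punc.sed.
# Replace standalone punctuation tokens by word-like placeholders.
escape_punc() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i == ",") $i = "PUNCCOMMA"
            if ($i == ".") $i = "PUNCPERIOD"
            if ($i == "?") $i = "PUNCQUESTION"
            if ($i == "!") $i = "PUNCEXCLAM"
        }
        print
    }'
}

# Map the placeholders back to the punctuation tokens.
unescape_punc() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i == "PUNCCOMMA") $i = ","
            if ($i == "PUNCPERIOD") $i = "."
            if ($i == "PUNCQUESTION") $i = "?"
            if ($i == "PUNCEXCLAM") $i = "!"
        }
        print
    }'
}

# Example:
# escape_punc < in.tok.txt > in.tok.punc.txt
```

Only whole tokens are replaced, so a number such as 3,5 is left untouched. Lines that contain a replaced token are re-joined with single spaces, which matches tokenized text.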
export NLTK_DATA=$adjust_to_your_environment # based on your asrt install
export PYTHONPATH=$ASRTBASE/local/lib/python2.7/site-packages # or based on your asrt install
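mkdir -p data/europarl_normalized
# Normalize each split. Every run writes data/europarl_normalized/sentences_german.txt,
# so the result is copied out (with punctuation restored) before the next run.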
$ASRTBASE/data-preparation/python/run_data_preparation.py -i data/europarl/europarl.train.de.tok.punc.txt -l 2 -r $ASRTBASE/examples/resources/regex.csv -s -m -o data/europarl_normalized
sed -f scripts/replace_back_punc.sed data/europarl_normalized/sentences_german.txt > data/europarl_normalized/europarl.train.de.tok.punc.norm.txt
$ASRTBASE/data-preparation/python/run_data_preparation.py -i data/europarl/europarl.dev.de.tok.punc.txt -l 2 -r $ASRTBASE/examples/resources/regex.csv -s -m -o data/europarl_normalized
sed -f scripts/replace_back_punc.sed data/europarl_normalized/sentences_german.txt > data/europarl_normalized/europarl.dev.de.tok.punc.norm.txt
$ASRTBASE/data-preparation/python/run_data_preparation.py -i data/europarl/europarl.test.de.tok.punc.txt -l 2 -r $ASRTBASE/examples/resources/regex.csv -s -m -o data/europarl_normalized
sed -f scripts/replace_back_punc.sed data/europarl_normalized/sentences_german.txt > data/europarl_normalized/europarl.test.de.tok.punc.norm.txt
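OpenNMT-py pairs source and target sentences line by line, so the normalized files must stay aligned with the tokenized originals. A quick sanity check is sketched below. count_bad_pairs is a hypothetical helper, and it assumes that the text contains no tab characters.

```shell
# Hypothetical helper: count line pairs where the source ($1) or the
# target ($2) side is empty. A file that is shorter than the other one
# shows up as empty lines at the end, so length mismatches are counted too.
count_bad_pairs() {
    paste "$1" "$2" | awk -F '\t' '$1 == "" || $2 == "" { bad++ } END { print bad + 0 }'
}

# Example for the training data:
# count_bad_pairs data/europarl_normalized/europarl.train.de.tok.punc.norm.txt \
#     data/europarl/europarl.train.de.tok.txt
```

A result of 0 means that every line has a non-empty counterpart on the other side.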
python $OPENNMTBASE/preprocess.py -train_src data/europarl_normalized/europarl.train.de.tok.punc.norm.txt \
-train_tgt data/europarl/europarl.train.de.tok.txt \
-valid_src data/europarl_normalized/europarl.dev.de.tok.punc.norm.txt \
-valid_tgt data/europarl/europarl.dev.de.tok.txt \
-src_vocab_size 80000 -tgt_vocab_size 80000 \
-save_data data/Europarl_punc.atok
mkdir -p asr2mt-models-punc
python $OPENNMTBASE/train.py -data data/Europarl_punc.atok.train.pt \
-save_model asr2mt-models-punc/asr2mt_model -gpus 0
The models we tested were trained with punctuation kept in the
normalized text. Other models were trained without punctuation (that
is, without the two sed steps, replace_punc.sed and
replace_back_punc.sed, in step 1 and step 2), so that the model tries
to recover punctuation during "translation". In practice, it should
be better to keep punctuation in both the normalized and the
non-normalized versions (assuming that ASR is followed by, or
includes, a punctuation prediction module).