- [ ] have more detailed command description on ‘abkhazia <command> –help’. Assume the user doesn’t know abkhazia or kaldi.
- [-] improve the online documentation
- [X] install and configure
- [ ] preparators for new corpora
- [ ] ASR: LM, AM, features, alignment, etc
- [ ] ABX: item files generation from alignment
- [ ] add READMEs in intermediate directories of the project to explain the sources tree (or have a ‘project organization’ page)
- [X] split with (p_train + p_test) < 1 (by now only =1 supported)
- [ ] des méthodes de stats sur les corpus (duration, per spks, etc)
- [ ] need of a structure to append data to an existing corpus (eg attach a LM, features or manual alignments)
- [ ] From the Thomas’s notes we want… Optionally, the corpus directory
may contains:
- Time-alignments (The previous files in the list are sufficient for training ASR models with kaldi, however for ABX experiments we also need time-alignments for each phone or each word. If they are not provided directly, a surrogate will be obtained through forced-alignment using the kaldi-trained ASR models)
- A language model, either in OpenFST text or binary format or in ARPA-MIT N-gram format
- Syllabification of each utterance
- [X] GlobalPhone & CSJ : Laisser la possibilité de choisir de garder ou non le ‘+’ (pour l’instant on l’enlève)
- [X] Chercher les transcriptions manquantes pour GlobalPhone (pour l’instant hard codé dans préparateur)
- [ ] Faire un dossier GlobalPhone et un dossier CSJ pour les différents cas possible pour chaque corpus
- [ ] Choisir CLAIREMENT la façon de traiter les parenthèses dans CSJ (pour l’instant :
- (W ..1;..2) : on garde ..1
- (?..) : on remplace le SUW par “?”
- (R xxx) : on remplace par “?”
- reste : on enlève caractère + parenthèse et on garde tout
- [ ] On enlève les IPU lorsqu’il y a un caractère problèmatique (e.g. “(?” ou “(W” ou autre) dans un des SUW ?
- [ ] –retrain option it should be possible to retrain a trained model on a new corpus (for instance, specifically retrain silence models, or retrain on a bunch of new corpus)
- [ ] questions vs data-driven option (is it for tree building ?)
- [X] write a prepare_lang.sh standalone wrapper
- [X] independant call to prepare_lang in AM classes
- [X] update the command line API in consequence
- [X] pass all the tests
- [X] refactor core -> am now exports the lang folder
- [X] optionally disable the scoring in decode forward –skip-scoring from kaldi to abkhazia
- [ ] check for phonesets compatibility between AM/LM
- [ ] add tests for cross use of AMs/LMs
- [ ] add tests for wpd decoding
- [ ] update the command line API in consequence
required files by mkgraph.sh are
- AM $lang/L.fst
- LM $lang/G.fst
- AM/LM $lang/phones.txt
- AM? $lang/words.txt
- AM/LM? $lang/phones/silence.csl
- AM/LM? $lang/phones/disambig.int
- AM $model/final.mdl
- AM $model/tree
(file in output-dir, link in recipe-dir)
- Need of an automated way to update new versions of the installed configuration file in the ./configure script.
Those messages are actually logged in debug level, should be smater to be warnong/error 2016-10-12 16:33:58,372 - DEBUG - ERROR (apply-cmvn:Write():kaldi-matrix.cc:1229) Failed to write matrix to stream 2016-10-12 16:33:58,373 - DEBUG - WARNING (apply-cmvn:Write():util/kaldi-holder-inl.h:54) Exception caught writing Table object: ERROR (apply-cmvn:Write():kaldi-matrix.cc:1229) Failed to write matrix to stream
corpus = BuckeyeCorpusPreparator('./buckeye').prepare()
corpus.speakers()
utt = corpus.utterances()
train, _ = corpus.split(train_prop=0.5, by_speakers=True)
train.save2h5('train.h5', group='corpus', wavs=True)
corpus = Corpus.read('train.h5', group='corpus')
lm = LanguageModelProcessor(order=3, level='word').compute(corpus)
lm.save('lm.fst')
lm.save2h5('train.h5', group='word-trigram')
assert lm.order == 3
assert lm.level == 'word'
features = FeaturesProcessor('mfcc', delta=2, pitch=True).compute(corpus)
f = features[utt[0]] # np.array
features.write2h5('train.h5', 'features')
features.write2ark('/somewhere')