This repository contains the code for the paper "Neural Networks Classifier for Data Selection in Statistical Machine Translation".
It is built on our fork of Keras (version 1.2) and has been tested with the Theano backend.
- Neural network-based sentence classifiers, at either the monolingual or the bilingual level.
- BLSTM and CNN classifiers. Easy to extend.
- Support for pretrained GloVe or Word2Vec word vectors (binary or text formats).
- Iterative semi-supervised selection of the top- and bottom-scoring sentences from an out-of-domain corpus.
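The iterative selection idea from the last item can be illustrated with a minimal sketch. It is not the repository's implementation: the function names, the `train_and_score` callable, and the toy word-overlap scorer are all hypothetical stand-ins for training a classifier and scoring the out-of-domain pool.

```python
def select_top_bottom(scored_sentences, k):
    """Split (sentence, score) pairs into the k best, the k worst, and the rest."""
    ranked = sorted(scored_sentences, key=lambda pair: pair[1], reverse=True)
    k = min(k, len(ranked) // 2)  # never let the two ends overlap
    top = [sentence for sentence, _ in ranked[:k]]
    bottom = [sentence for sentence, _ in ranked[len(ranked) - k:]]
    rest = [sentence for sentence, _ in ranked[k:len(ranked) - k]]
    return top, bottom, rest


def iterative_selection(positives, negatives, pool, train_and_score, k, iterations):
    """Repeatedly score the pool and move its extremes into the training sets.

    `train_and_score(positives, negatives, pool)` is a hypothetical callable
    that returns one score per pool sentence (higher means more in-domain).
    """
    positives, negatives, pool = list(positives), list(negatives), list(pool)
    for _ in range(iterations):
        if not pool:
            break
        scores = train_and_score(positives, negatives, pool)
        top, bottom, pool = select_top_bottom(list(zip(pool, scores)), k)
        if not top:  # pool too small to take anything from both ends
            break
        positives.extend(top)
        negatives.extend(bottom)
    return positives, negatives, pool


def toy_scorer(positives, negatives, pool):
    """Toy stand-in for a classifier: fraction of words seen in the positives."""
    vocabulary = {word for sentence in positives for word in sentence.split()}
    return [
        sum(word in vocabulary for word in sentence.split()) / len(sentence.split())
        for sentence in pool
    ]


in_domain = ["the patient received treatment"]
out_of_domain = [
    "the patient was treated",
    "stocks fell sharply",
    "treatment of the patient",
    "the match ended",
]
selected, rejected, remaining = iterative_selection(
    in_domain, [], out_of_domain, toy_scorer, k=1, iterations=1
)
print(selected, rejected, remaining)
```

In a real setting the scorer would be replaced by a trained BLSTM or CNN classifier, and `k` and the number of iterations would be tuned to the corpus size.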
Provided that you have pip installed, run:
git clone https://github.com/lvapeab/sentence-selectioNN
cd sentence-selectioNN
pip install -r requirements.txt
This installs the packages required to run this library.
sentence-selectioNN requires the following libraries:
Assuming you have a corpus:
- Check the inputs/outputs of your model in data_engine/prepare_data.py.
- To use pretrained word vectors, run the preprocessing script that matches their format (binary or text). GloVe and Word2Vec vectors are supported.
- Set a model configuration in config.py.
- Train the model:
  python main.py
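For readers unfamiliar with the text formats mentioned in the pretrained-vectors step, the following is a minimal sketch of how such files are commonly laid out and parsed. It is not one of the repository's preprocessing scripts; `load_text_vectors` is a hypothetical helper, and the header check is a heuristic based on the usual convention that word2vec text files start with a "vocabulary size, dimension" line while GloVe text files do not.

```python
import io


def load_text_vectors(handle):
    """Parse 'word v1 v2 ... vn' lines into a dict mapping word -> list of floats."""
    vectors = {}
    for line_number, line in enumerate(handle):
        fields = line.rstrip().split(" ")
        # Heuristic: skip a leading "<vocab_size> <dimension>" header line,
        # as found in word2vec text files (GloVe text files have no header).
        if line_number == 0 and len(fields) == 2 and all(f.isdigit() for f in fields):
            continue
        if len(fields) < 2:  # blank or malformed line
            continue
        vectors[fields[0]] = [float(value) for value in fields[1:]]
    return vectors


# Tiny in-memory examples standing in for real vector files.
glove_like = io.StringIO("cat 0.1 0.2\ndog 0.3 0.4\n")
word2vec_like = io.StringIO("2 2\ncat 0.1 0.2\ndog 0.3 0.4\n")

glove_vectors = load_text_vectors(glove_like)
word2vec_vectors = load_text_vectors(word2vec_like)
print(glove_vectors == word2vec_vectors)
```

Binary word2vec files need a different reader, so the script has to match the format of the vectors you downloaded.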
We support two network architectures, BLSTM and CNN, each at either the monolingual or the bilingual level.
Please see the paper for a more detailed description of the models.
If you use this code for any purpose, please cite the following paper:
Peris Á., Chinea-Rios M., Casacuberta F.
Neural Networks Classifier for Data Selection in Statistical Machine Translation.
In The Prague Bulletin of Mathematical Linguistics No. 108, pp. 283–294. 2017.
Álvaro Peris (web page): lvapeab@prhlt.upv.es