The techniques behind the parser are described in the paper Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations. Further materials can be found here.
- Python 2.7 interpreter
- DyNet library
The software requires training.conll and development.conll files formatted according to the CoNLL data format.
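As a quick sanity check before training, the input files can be scanned for malformed rows. The sketch below assumes the 10-column, tab-separated CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) with blank lines between sentences. The function name read_conll is hypothetical and is not part of the parser; the parser's own reader may be more lenient or stricter.

```python
def read_conll(lines):
    """Yield sentences as lists of 10-field token rows.

    Assumes CoNLL-X style input: one token per line, tab-separated,
    sentences separated by blank lines. Hypothetical helper, not the
    parser's own reader.
    """
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        if len(fields) != 10:
            raise ValueError(
                "expected 10 tab-separated columns, got %d: %r"
                % (len(fields), line)
            )
        sentence.append(fields)
    # Emit the last sentence if the input has no trailing blank line.
    if sentence:
        yield sentence


# Illustrative in-memory sample; in practice pass an open file object.
sample = [
    "1\tThe\t_\tDT\tDT\t_\t2\tdet\t_\t_\n",
    "2\tdog\t_\tNN\tNN\t_\t3\tnsubj\t_\t_\n",
    "3\tbarks\t_\tVBZ\tVBZ\t_\t0\troot\t_\t_\n",
    "\n",
]
sentences = list(read_conll(sample))
```

Passing `open("training.conll")` instead of `sample` would validate a real file in the same way.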
For the faster graph-based parser, change directory to bmstparser (1200 words/sec); for the more accurate transition-based parser, change directory to barchybrid (800 words/sec). The benchmark was performed on a MacBook Pro with an i7 processor. The graph-based parser achieves an accuracy of 93.8 UAS and the transition-based parser an accuracy of 94.7 UAS on the standard Penn Treebank dataset (Stanford Dependencies). The transition-based parser requires no part-of-speech tagging; setting all the tags to NN will produce the expected accuracy. The model and param files achieving those scores are available for download (Graph-based model, Transition-based model). The trained models include improvements beyond those described in the paper, to be published soon.
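Setting all tags to NN can be done with a small preprocessing step. The sketch below is one way to do it, assuming the CoNLL-X layout in which the fourth and fifth columns (CPOSTAG and POSTAG) hold the part-of-speech tags. Which of the two columns the parser actually reads is not stated here, so the sketch overwrites both. The function name set_all_tags is hypothetical.

```python
def set_all_tags(lines, tag="NN"):
    """Return a copy of CoNLL lines with both POS columns set to `tag`.

    Assumption: columns 4 and 5 (indices 3 and 4) are CPOSTAG and POSTAG,
    as in the CoNLL-X format. Blank sentence separators are kept as-is.
    """
    out = []
    for line in lines:
        stripped = line.rstrip("\n")
        if not stripped.strip():
            out.append(line)
            continue
        fields = stripped.split("\t")
        fields[3] = tag  # CPOSTAG
        fields[4] = tag  # POSTAG
        out.append("\t".join(fields) + "\n")
    return out


# Illustrative input; in practice read the lines from test.conll.
tagged = [
    "1\tThe\t_\tDT\tDT\t_\t2\tdet\t_\t_\n",
    "2\tdog\t_\tNN\tNN\t_\t0\troot\t_\t_\n",
    "\n",
]
untagged = set_all_tags(tagged)
```

Writing `untagged` back out with `writelines` yields a file that carries no real tagging information.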
To train a parsing model with either parsing architecture, type the following at the command prompt:
python src/parser.py --dynet-seed 123456789 [--dynet-mem XXXX] --outdir [results directory] --train training.conll --dev development.conll --epochs 30 --lstmdims 125 --lstmlayers 2 [--extrn extrn.vectors] --bibi-lstm
We use the same external embeddings used in Transition-Based Dependency Parsing with Stack Long Short-Term Memory, which can be downloaded from the authors' GitHub repository and directly here.
If you are training a transition-based parser, then for optimal results you should add the following to the command: --k 3 --usehead --userl. These switches set the stack to 3 elements, use the BiLSTM of the head of trees on the stack as feature vectors, and add the BiLSTM of the right/leftmost children to the feature vectors.
Note 1: You can run it without POS embeddings by setting the POS embedding dimensions to zero (--pembedding 0).
Note 2: The reported test result is the one matching the highest development score.
Note 3: After each iteration, the parser calculates the accuracies excluding punctuation symbols by running the eval.pl script from the CoNLL-X Shared Task and stores the results in the directory specified by --outdir.
Note 4: The external embeddings parameter is optional and is best left unused when training or predicting with a graph-based model.
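To make the metric in Note 3 concrete, the sketch below computes an unlabeled attachment score (UAS) that skips punctuation tokens. It is an illustration only, not a replacement for eval.pl. It assumes a token counts as punctuation when every character in it belongs to a Unicode punctuation category, which approximates the script's behavior and may differ on edge cases. The names is_punct and uas are hypothetical.

```python
import unicodedata


def is_punct(form):
    """True if every character of `form` is in a Unicode 'P*' category.

    Assumption: this approximates the punctuation test used by eval.pl.
    `form` should be a unicode string (relevant under Python 2.7).
    """
    return bool(form) and all(
        unicodedata.category(ch).startswith("P") for ch in form
    )


def uas(gold, pred):
    """Percentage of non-punctuation tokens whose predicted head is correct.

    `gold` and `pred` are aligned lists of (form, head) pairs.
    """
    correct = 0
    total = 0
    for (form, gold_head), (_, pred_head) in zip(gold, pred):
        if is_punct(form):
            continue
        total += 1
        if gold_head == pred_head:
            correct += 1
    return 100.0 * correct / total if total else 0.0


# Illustrative sentence: the head of "dog" and of "." are mispredicted,
# but only "dog" counts because "." is excluded as punctuation.
gold = [(u"The", 2), (u"dog", 3), (u"barks", 0), (u".", 3)]
pred = [(u"The", 2), (u"dog", 2), (u"barks", 0), (u".", 1)]
score = uas(gold, pred)
```

For reported numbers, rely on the eval.pl output stored in the --outdir directory rather than on this sketch.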
The command for parsing a test.conll file formatted according to the CoNLL data format with a previously trained model is:
python src/parser.py --predict --outdir [results directory] --test test.conll [--extrn extrn.vectors] --model [trained model file] --params [param file generated during training]
The parser will store the resulting CoNLL file in the output directory (--outdir).
Note 1: If you are using the arc-hybrid trained model we provided, please use the --extrn flag and specify the location of the external embeddings file.
Note 2: If you are using the first-order trained model we provided, please do not use the --extrn flag.
If you make use of this software for research purposes, we would appreciate a citation of the following:
@article{DBLP:journals/tacl/KiperwasserG16,
author = {Eliyahu Kiperwasser and Yoav Goldberg},
title = {Simple and Accurate Dependency Parsing Using Bidirectional {LSTM}
Feature Representations},
journal = {{TACL}},
volume = {4},
pages = {313--327},
year = {2016},
url = {https://transacl.org/ojs/index.php/tacl/article/view/885},
timestamp = {Tue, 09 Aug 2016 14:51:09 +0200},
biburl = {http://dblp.uni-trier.de/rec/bib/journals/tacl/KiperwasserG16},
bibsource = {dblp computer science bibliography, http://dblp.org}
}
This software is released under the terms of the Apache License, Version 2.0.
For questions and usage issues, please contact elikip@gmail.com