This is the code for the paper 'Syntactic Substitutability as Unsupervised Dependency Syntax'. Instructions to replicate experiments are below.
CONLL-formatted dependency treebanks are required. The PUD treebanks can be downloaded from the Universal Dependencies project website and the Surface Syntactic Universal Dependencies project website.
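For orientation, here is a minimal sketch of how the token and head columns of a CoNLL-U treebank could be read and turned into undirected gold edges. It is not taken from this repository: the names `read_conllu` and `gold_edges` are hypothetical, and the sketch assumes the standard ten-column, tab-separated CoNLL-U layout with every HEAD field filled in.

```python
from typing import Iterable, Iterator, List, Set, Tuple

# One token is (token id, word form, head id); head id 0 marks the root.
Sentence = List[Tuple[int, str, int]]


def read_conllu(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield one sentence at a time from CoNLL-U formatted lines (hypothetical helper)."""
    sentence: Sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword ranges ("1-2") and empty nodes ("1.1")
        # Column 1 is ID, column 2 is FORM, column 7 is HEAD.
        # Assumption: HEAD is annotated (an integer), as in a gold treebank.
        sentence.append((int(cols[0]), cols[1], int(cols[6])))
    if sentence:
        yield sentence


def gold_edges(sentence: Sentence) -> Set[frozenset]:
    """Return the undirected dependency edges of a sentence, ignoring the root arc."""
    return {frozenset((tid, head)) for tid, _, head in sentence if head != 0}
```

With a treebank on disk, the reader could be used as `for sent in read_conllu(open(path, encoding="utf-8")): ...`, where `path` points to the downloaded file.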
Results are written in CSV format. Induced trees can be saved and viewed in LaTeX TikZ format.
Substitutions must be generated before parsing sentences. See below for how to generate a parse for a single sentence.
The script generate_sentences.sh can be used to generate substituted sentences. The following variables can be set:
SPLIT: name of the dataset
CONLLU_FILE: path to the CONLL-formatted treebank to parse
NUMBER_SENTS: the number of substitutions to generate at each position in the sentence
The script parse_sentences.sh can be used to parse and evaluate on each dataset. The following variables can be set:
SPLIT: name of the dataset
CONLLU_FILE: path to the CONLL-formatted treebank to parse
NUMBER_SENTS: the number of substitutions to generate at each position in the sentence
The script saves a CSV file containing the UUAS scores of the induced trees to the output directory.
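As a rough illustration of the metric, UUAS (undirected unlabeled attachment score) can be computed as the fraction of gold dependency edges that also appear in the induced tree, ignoring edge direction and labels. The sketch below is an assumption about the general technique, not the repository's evaluation code: the function names and the CSV column names `sentence_id` and `uuas` are hypothetical, and details such as punctuation handling may differ in the actual scripts.

```python
import csv
from typing import Iterable, Tuple

Edge = Tuple[int, int]


def uuas(gold: Iterable[Edge], predicted: Iterable[Edge]) -> float:
    """Fraction of gold edges recovered by the prediction, ignoring direction."""
    gold_set = {frozenset(edge) for edge in gold}
    pred_set = {frozenset(edge) for edge in predicted}
    if not gold_set:
        return 0.0  # no gold edges to recover (e.g. a one-word sentence)
    return len(gold_set & pred_set) / len(gold_set)


def write_scores(handle, scores: Iterable[Tuple[str, float]]) -> None:
    """Write (sentence id, score) pairs as CSV rows; column names are illustrative."""
    writer = csv.writer(handle)
    writer.writerow(["sentence_id", "uuas"])
    for sentence_id, score in scores:
        writer.writerow([sentence_id, f"{score:.4f}"])
```

Because edges are compared as unordered pairs, `(1, 2)` in the gold tree matches `(2, 1)` in the induced tree.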
Given any CONLL-formatted file, parses can be obtained by running it through the pipeline described above. Alternatively, the structure induced for a single sentence can be obtained by running parse_single_sentence.py directly from the command line, as below:
python parse_single_sentence.py [NUMBER OF SUBSTITUTIONS]
After you follow the prompts on the command line, the script outputs a list of edges and a simple text description of the induced tree.
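To give a sense of what a plain-text description of a tree might look like, the sketch below turns a list of undirected edges into one line per word, listing the words it is attached to. The output format and the function name `describe_tree` are assumptions made for illustration; the actual output of parse_single_sentence.py may be laid out differently.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def describe_tree(words: List[str], edges: Iterable[Tuple[int, int]]) -> List[str]:
    """Return one line per word naming its neighbours in the tree.

    Word positions are 1-indexed, matching CoNLL-U token ids.
    """
    neighbours = defaultdict(set)
    for a, b in edges:
        # Edges are undirected, so record the link in both directions.
        neighbours[a].add(b)
        neighbours[b].add(a)
    lines = []
    for position, word in enumerate(words, start=1):
        linked = ", ".join(words[j - 1] for j in sorted(neighbours[position]))
        lines.append(f"{position} {word}: {linked}")
    return lines
```

For example, `describe_tree(["The", "dog", "barks"], [(1, 2), (2, 3)])` lists "dog" as attached to both "The" and "barks".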