An attempt to classify sentences from the Penn Discourse Treebank as either coherent or incoherent.
To get started, you will need access to the Penn Discourse Treebank (PDTB) (the CLaC lab has the data). Once you have this, add a PDTB relations-XX-XX-XX-{dev | train | test}.json file to the /data directory, and update the value of relations_json in generate_sentences.py (declared around line 10) to the name of that file.
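For reference, that constant might look roughly like the following (illustrative only; keep the XX-XX-XX placeholders matching your actual file name):

```python
# Name of the PDTB relations file placed in the /data directory.
# The XX-XX-XX placeholders stand for whatever identifier your file carries.
relations_json = "relations-XX-XX-XX-train.json"
```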
In order to train with the Google News word2vec embeddings, you will need to download them (available here: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM) and unzip them in the /data/model directory.
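The training script loads the embeddings itself, but if you want to sanity-check the download first, a quick test with gensim (not part of this project) could look like this. The file name GoogleNews-vectors-negative300.bin is the usual name inside the archive and is an assumption here:

```python
from gensim.models import KeyedVectors

# Load the pre-trained Google News vectors (binary word2vec format).
# Adjust the path/file name if your unzipped copy differs.
w2v = KeyedVectors.load_word2vec_format(
    "data/model/GoogleNews-vectors-negative300.bin", binary=True)

print(w2v.vector_size)                   # expected: 300
print(w2v.most_similar("coherent", topn=3))
```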
Below are brief explanations of what each Python file does and how it interacts with the other files.
This script takes a PDTB .json file from the /data directory and creates the coherent and incoherent datasets for our model from it. It first reads the .json and extracts the relevant values, then creates the various datasets (the specific datasets are described in the code comments) in .json format.
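As a rough illustration of the extraction step (the field names Arg1, Arg2, and RawText follow a common PDTB JSON export layout and are an assumption here, not necessarily the exact keys this script uses):

```python
import json

def load_relations(path):
    """Read one PDTB relation per line and return (Arg1, Arg2) text pairs."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            rel = json.loads(line)
            pairs.append((rel["Arg1"]["RawText"], rel["Arg2"]["RawText"]))
    return pairs
```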
This file takes the resulting .json files from generate_sentences.py and converts them to .txt. It also calculates corpus-wide statistics, such as the number of terms, the dictionary, the maximum sentence length, etc.
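The statistics amount to something like the following sketch (the function name and whitespace tokenization are illustrative, not taken from the script):

```python
from collections import Counter

def corpus_stats(sentences):
    """Compute simple corpus-wide statistics over whitespace-tokenized sentences."""
    counts = Counter(tok for sent in sentences for tok in sent.split())
    return {
        "num_terms": sum(counts.values()),                        # total tokens
        "vocab_size": len(counts),                                # dictionary size
        "max_sentence_length": max(len(s.split()) for s in sentences),
    }
```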
This file is where we run our convolutional neural network. It defines flags for the various hyperparameters, loads the data from the specified file, and transforms it to the format needed by the network. It then implements the training loop, saving intermediate results while printing relevant information to the screen.
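For orientation, the hyperparameter flags follow the usual TensorFlow 1.x pattern; the names and default values below are illustrative, not the exact set defined in train.py:

```python
import tensorflow as tf

# Illustrative hyperparameter flags; train.py defines its own set.
tf.flags.DEFINE_integer("embedding_dim", 300, "Dimensionality of the word embeddings")
tf.flags.DEFINE_string("filter_sizes", "3,4,5", "Comma-separated convolution filter sizes")
tf.flags.DEFINE_integer("num_filters", 128, "Number of filters per filter size")
tf.flags.DEFINE_integer("batch_size", 64, "Training batch size")
tf.flags.DEFINE_integer("num_epochs", 200, "Number of training epochs")

FLAGS = tf.flags.FLAGS
```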
This is the class that implements the underlying logic of our convolutional neural network. It creates the actual network, connects layers, implements the convolution, etc.
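A condensed sketch of what such a class typically looks like in TensorFlow 1.x (an embedding layer, parallel convolution/max-pooling branches, and a softmax output); the actual class in this repository may differ in layer names, regularization, and structure:

```python
import tensorflow as tf

class TextCNN(object):
    """Illustrative convolutional sentence classifier (not the project's exact class)."""

    def __init__(self, sequence_length, num_classes, vocab_size,
                 embedding_size, filter_sizes, num_filters):
        self.input_x = tf.placeholder(tf.int32, [None, sequence_length], name="input_x")
        self.input_y = tf.placeholder(tf.float32, [None, num_classes], name="input_y")

        # Embedding layer: map word ids to dense vectors, add a channel dimension.
        embedding = tf.Variable(
            tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0))
        embedded = tf.expand_dims(tf.nn.embedding_lookup(embedding, self.input_x), -1)

        # One convolution + max-pooling branch per filter size.
        pooled = []
        for fs in filter_sizes:
            w = tf.Variable(tf.truncated_normal(
                [fs, embedding_size, 1, num_filters], stddev=0.1))
            b = tf.Variable(tf.constant(0.1, shape=[num_filters]))
            conv = tf.nn.conv2d(embedded, w, strides=[1, 1, 1, 1], padding="VALID")
            h = tf.nn.relu(tf.nn.bias_add(conv, b))
            pooled.append(tf.nn.max_pool(
                h, ksize=[1, sequence_length - fs + 1, 1, 1],
                strides=[1, 1, 1, 1], padding="VALID"))

        # Concatenate the pooled features and classify with a softmax layer.
        total_filters = num_filters * len(filter_sizes)
        features = tf.reshape(tf.concat(pooled, 3), [-1, total_filters])
        w_out = tf.Variable(tf.truncated_normal([total_filters, num_classes], stddev=0.1))
        b_out = tf.Variable(tf.constant(0.1, shape=[num_classes]))
        self.scores = tf.nn.xw_plus_b(features, w_out, b_out, name="scores")
        self.loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
            logits=self.scores, labels=self.input_y))
```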
This file was used to randomize the Arg2 of our training data, using several gamma values that specify the probability of each word being swapped with another.
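The core operation is roughly the following (a minimal sketch; the real script's way of picking replacement words may differ):

```python
import random

def randomize_tokens(tokens, gamma, vocabulary, rng=random):
    """Replace each token with a random word from `vocabulary`
    with probability `gamma`; otherwise keep the original token."""
    return [rng.choice(vocabulary) if rng.random() < gamma else tok
            for tok in tokens]
```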
This file randomly chooses a fixed number of samples from each of our datasets so that they can be annotated on Crowdflower.
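The selection step itself is essentially random sampling (a minimal sketch; formatting the output for Crowdflower is omitted):

```python
import random

def sample_for_annotation(examples, n, seed=0):
    """Pick a fixed number of examples at random for manual annotation."""
    return random.Random(seed).sample(examples, n)
```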
This file looks at which connectives in our data are more likely to appear between Arg1 and Arg2. It was not used for our experiments, but remains in the code because it would be useful if we want to experiment with unannotated data in the future.
/
Main folder; contains the Python files, as well as the README and the vocabulary
/crowdflower_data
Data uploaded to Crowdflower for manual annotations
/data
Where we store the various datasets used in the project. The files directly in this folder are the PDTB files, along with data about the corpus (corpus_stats.txt, dictionary.txt)
/data/json
.json files generated by generate_sentences.py
/data/model
word2vec Google News model (3GB in size, available for download at https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM)
/data/random
Randomized datasets with different values of gamma, output of randomize_words.py
/data/rt-polaritydata
Rotten Tomatoes reviews, used to evaluate our model
/data/txt
Input data for train.py, constructed from the PDTB data
/runs
Model parameters saved by TensorFlow after each run
A detailed report of this project can be found at https://www.overleaf.com/read/ngfcbdxkcgby