Quote attributions to Character Ids #3

Open
wants to merge 27 commits into
base: master
27 commits
a99b298
Added new function setCharacterIds
NikhilPr95 Jul 6, 2016
ac7cbd3
Added new function setCharacterIds
NikhilPr95 Jul 6, 2016
807ec7a
Added extra attributes to print in PrintQuotes
NikhilPr95 Jul 6, 2016
4fd9ccf
Calling new function setCharacterIds
NikhilPr95 Jul 6, 2016
228fe39
Update PrintUtil.java
NikhilPr95 Jul 6, 2016
7b259e6
Added attribute p (paragraphId) to Quotation
NikhilPr95 Jul 7, 2016
cf1ed82
Added new function to attribute quotes.
NikhilPr95 Jul 7, 2016
a9aa15d
Switched positions of functions
NikhilPr95 Jul 7, 2016
fdf6eaa
Added honorific 'professor'
NikhilPr95 Jul 7, 2016
1496f32
Updated PrintUtil
NikhilPr95 Jul 7, 2016
8efec25
Update BookNLP.java
NikhilPr95 Jul 28, 2016
2870f3a
Added honorifics 'uncle' and 'aunt'
NikhilPr95 Jul 28, 2016
0738987
Changed quote attributed name extraction method
NikhilPr95 Jul 28, 2016
0529f49
Update NP.java
NikhilPr95 Jul 28, 2016
c0f5525
Update PronounAntecedent.java
NikhilPr95 Jul 28, 2016
260803e
Update CharacterAnnotator.java
NikhilPr95 Jul 28, 2016
640d27c
Update CharacterFeatureAnnotator.java
NikhilPr95 Jul 28, 2016
307434f
Added new conditions and new feature
NikhilPr95 Jul 28, 2016
ec8aed8
Update PhraseAnnotator.java
NikhilPr95 Jul 28, 2016
3e33f76
Added new conditions and code to attribute quotes
NikhilPr95 Jul 28, 2016
aff1d99
Changed parser and added condition for tokenizing
NikhilPr95 Jul 28, 2016
334e490
New coref weights
NikhilPr95 Jul 28, 2016
b2ab929
Updated with latest files
NikhilPr95 Jul 28, 2016
1600e84
Delete maltparser-1.7.2.jar
NikhilPr95 Jul 28, 2016
39bf84d
Delete stanford-corenlp-3.3.1.jar
NikhilPr95 Jul 28, 2016
801940d
Updated for new CoreNLP
NikhilPr95 Jul 28, 2016
3c68efc
New coref weights due to new feature
NikhilPr95 Jul 28, 2016
4 changes: 2 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -22,8 +22,8 @@ How To Run

Download external jars (which are sadly too big for GitHub's 100MB file size limit)

* Download and unzip http://nlp.stanford.edu/software/stanford-corenlp-full-2014-01-04.zip
* copy stanford-corenlp-full-2014-01-04/stanford-corenlp-3.3.1-models.jar to the lib/ folder in the current working directory
* Download and unzip stanford-corenlp-full-2015-12-09.zip from http://stanfordnlp.github.io/CoreNLP/
* copy stanford-corenlp-full-2015-12-09/stanford-corenlp-3.6.0-models.jar to the lib/ folder in the current working directory


####Example
Expand Down
102 changes: 58 additions & 44 deletions coref/weights.txt
@@ -1,44 +1,58 @@
synpath:^poss^pobj^prep^null>nsubj 2.037356393978485
PRP$ 1.0634082674656997
deprel:nsubjpass 0.8905980753550408
synpath:^nsubj^dep^null>null>nsubj 0.6322892803048907
deprel:nsubj 0.5885064026180548
synpath:^dobj^conj^null>ccomp>nsubj 0.40154289909400553
NNP 0.3186964084686234
synpath:^nsubj^ccomp^null>null>advcl>nsubj 0.31748226421825476
synpath:^nsubj^null>null>nsubj 0.2564354843172664
synpath:^nsubj^ccomp^null>null>nsubjpass 0.2399215225866951
salience 0.14287476142050823
deprel:iobj 0.10225313969173129
PRP 0.09416343543500322
synpath:^poss^pobj^prep^nsubj^null>null>nsubj 0.06924108871927159
synpath:^nsubj^advcl^null>nsubj 0.05908077202444784
sameQuote 0.05206318468056584
synpath:^nsubj^ccomp^null>null>parataxis>nsubj 0.040015509123667414
deprel:pobj 0.012626614389932494
synpath:^nsubj^null>null>prep>pobj 0.010593798188933704
synpath:^nsubj^null>null>ccomp>nsubj -4.4039285674625126E-5
deprel:poss -0.03621389280263525
deprel:dobj -0.04585717394707185
syndist -0.05433894782548685
synpath:^nsubj^null>null>prep>pobj>poss -0.05656029815251714
synpath:^nsubj^rcmod^pobj^prep^null>nsubj -0.08824723597603963
linearDistance -0.09422825929495962
synpath:^nsubj^xcomp^null>nsubj -0.12773721686621253
deprel:null -0.15902700691440882
synpath:^dobj^null>null>nsubj -0.1647038583661299
synpath:^poss^dobj^parataxis^null>null>nsubj -0.17764959027062383
synpath:^nsubj^ccomp^null>null>nsubj -0.1948232755827206
synpath:^nsubj^null>null>dobj>prep>pobj>poss -0.2131943132034836
synpath:^nsubj^parataxis^null>null>nsubj -0.2958798167310025
synpath:^nsubj^null>null>poss -0.3099036931948747
synpath:^nsubj^null>null>parataxis>nsubj -0.3281780488111383
synpath:^nsubj^ccomp^advcl>nsubj -0.3505681260659827
synpath:^pobj^prep^null>nsubj -0.3952637562027398
NN -0.4038170185978463
synpath:^nsubj^null>null>dobj -0.4202526930109537
synpath:^dobj^conj>nsubj -0.4505472319667932
synpath:^dobj^null>nsubj -0.4519141907688465
synpath:^nsubj^ccomp^ccomp>nsubj -0.7795030364365683
NNS -2.79139918916984
oppositeGender -4.3968965899821475
synpath:^poss^pobj^prep^null>nsubj 2.23500425756401
PRP$ 1.1157587178303154
deprel:nsubjpass 0.9181455063235577
synpath:^nsubj^ccomp^null>null>nsubjpass 0.6274665420979655
deprel:nsubj 0.6023122290876726
synpath:^dobj^conj^null>ccomp>nsubj 0.5912597284472637
synpath:^nsubj^ccomp^null>null>advcl>nsubj 0.49687769349133726
synpath:^nsubj^dep^null>null>nsubj 0.4769034714244557
synpath:^poss^dobj^ccomp>nsubj 0.2893177312805945
synpath:^nsubj^null>null>nsubj 0.273500569588777
synpath:^poss^pobj^prep^nsubj^null>null>nsubj 0.24166638709512334
synpath:^poss^pobj^prep^conj^null>nsubjpass 0.22831870243652577
isPerson 0.17214877758779562
salience 0.11985447703843854
NNP 0.09372342722667712
synpath:^poss^pobj^prep^dobj^ccomp>nsubj 0.04736377182918983
deprel:pobj 0.031594183976217244
synpath:^nsubj^ccomp^xcomp^parataxis>nsubj 0.031073416659173424
synpath:^nsubj^conj^null>nsubj 0.025219114530353495
synpath:^nsubj^advcl^conj>prep>pobj 0.01915530511592635
synpath:^poss^pobj^prep^dobj^null>nsubj 0.013533797286846906
synpath:^nsubj^parataxis^null>nsubj 0.010173630482415196
synpath:^nsubj^ccomp^null>null>parataxis>nsubj 0.004002420320221095
synpath:^nsubj^advcl^null>nsubj 3.975463451989579E-4
synpath:^nsubj^null>advcl>nsubj -3.626398180532769E-4
synpath:^dobj^xcomp^null>nsubj -0.0039481279125020575
sameQuote -0.016871537523873426
deprel:poss -0.017198754685198547
synpath:^nsubj^null>null>ccomp>nsubj -0.024356599043122155
syndist -0.04807077523274833
PRP -0.059040902272392504
synpath:^pobj^prep^null>nsubjpass -0.08044813202835889
deprel:null -0.09204737588197029
linearDistance -0.09385910754637326
synpath:^poss^conj^null>null>nsubj -0.1033848526620039
synpath:^pobj^prep^ccomp>nsubj -0.15437328909280093
synpath:^nsubj^null>null>parataxis>nsubj -0.18377681772608997
synpath:^nsubj^ccomp^null>nsubj -0.18478330141920232
synpath:^dobj^null>null>nsubj -0.19333373302118825
synpath:^nsubj^null>null>dobj>prep>pobj>poss -0.27153906101910624
synpath:^nsubj^ccomp^null>null>nsubj -0.27178814024439074
synpath:^poss^dobj^parataxis^null>null>nsubj -0.2720074956345681
synpath:^nsubj^ccomp^conj>nsubj -0.2951913289122799
synpath:^nsubj^parataxis^null>null>nsubj -0.3053912182264542
synpath:^pobj^prep^ccomp^null>nsubj -0.3228381168903514
NN -0.331401114895203
synpath:^nsubj^null>null>poss -0.38657065003091684
synpath:^nsubj^xcomp^null>nsubj -0.4076584217490953
synpath:^nsubj^null>null>dobj -0.488179548989394
synpath:^nsubj^rcmod^pobj^prep^null>nsubj -0.489127485867776
synpath:^dobj^conj>nsubj -0.5694850360641267
synpath:^nsubj^ccomp^advcl>nsubj -0.5789508411431289
synpath:^pobj^prep^null>nsubj -1.01674615004074
synpath:^dobj^null>nsubj -1.0412024591160391
synpath:^nsubj^ccomp^ccomp>nsubj -1.2030847353933165
NNS -2.751885136283051
deprel:nn -2.976929155366783
oppositeGender -4.2830313567622715
85 changes: 58 additions & 27 deletions files/coref.weights
@@ -1,27 +1,58 @@
PRP$ 1.2704127912848633
synpath:^poss^pobj^prep^null>nsubj 1.2290242449970539
deprel:nsubjpass 0.8563208336022398
deprel:nsubj 0.6574765898453765
synpath:^poss^pobj^prep^null>null>nsubj 0.6453928940036563
NNP 0.4551827837366765
synpath:^nsubj^ccomp^null>null>nsubjpass 0.3364086508954913
synpath:^dobj^conj^null>ccomp>nsubj 0.32260152985106366
PRP 0.25905231070310264
synpath:^nsubj^ccomp^null>null>parataxis>nsubj 0.16303385922785454
salience 0.15390973041414974
sameQuote 0.04630125899153769
deprel:pobj 0.037774958745615976
synpath:^nsubj^parataxis^null>null>nsubj -2.6476811124519936E-5
synpath:^dobj^conj>nsubj -3.6435371007135284E-5
syndist -0.030268038450265912
synpath:^nsubj^null>null>dobj>prep>pobj>poss -0.06574704752055043
synpath:^nsubj^null>null>ccomp>nsubj -0.06795786941657951
linearDistance -0.10928942845970044
deprel:poss -0.13525967015220455
synpath:^dobj^advcl>nsubj -0.14825287831905246
deprel:dobj -0.16086166497535126
synpath:^nsubj^null>null>prep>pobj>poss -0.2601720751958899
NN -0.3195715801700655
synpath:^nsubj^null>null>poss -0.44196404207944345
NNS -2.4996696470920887
oppositeGender -4.952193388725023
synpath:^poss^pobj^prep^null>nsubj 2.23500425756401
PRP$ 1.1157587178303154
deprel:nsubjpass 0.9181455063235577
synpath:^nsubj^ccomp^null>null>nsubjpass 0.6274665420979655
deprel:nsubj 0.6023122290876726
synpath:^dobj^conj^null>ccomp>nsubj 0.5912597284472637
synpath:^nsubj^ccomp^null>null>advcl>nsubj 0.49687769349133726
synpath:^nsubj^dep^null>null>nsubj 0.4769034714244557
synpath:^poss^dobj^ccomp>nsubj 0.2893177312805945
synpath:^nsubj^null>null>nsubj 0.273500569588777
synpath:^poss^pobj^prep^nsubj^null>null>nsubj 0.24166638709512334
synpath:^poss^pobj^prep^conj^null>nsubjpass 0.22831870243652577
isPerson 0.17214877758779562
salience 0.11985447703843854
NNP 0.09372342722667712
synpath:^poss^pobj^prep^dobj^ccomp>nsubj 0.04736377182918983
deprel:pobj 0.031594183976217244
synpath:^nsubj^ccomp^xcomp^parataxis>nsubj 0.031073416659173424
synpath:^nsubj^conj^null>nsubj 0.025219114530353495
synpath:^nsubj^advcl^conj>prep>pobj 0.01915530511592635
synpath:^poss^pobj^prep^dobj^null>nsubj 0.013533797286846906
synpath:^nsubj^parataxis^null>nsubj 0.010173630482415196
synpath:^nsubj^ccomp^null>null>parataxis>nsubj 0.004002420320221095
synpath:^nsubj^advcl^null>nsubj 3.975463451989579E-4
synpath:^nsubj^null>advcl>nsubj -3.626398180532769E-4
synpath:^dobj^xcomp^null>nsubj -0.0039481279125020575
sameQuote -0.016871537523873426
deprel:poss -0.017198754685198547
synpath:^nsubj^null>null>ccomp>nsubj -0.024356599043122155
syndist -0.04807077523274833
PRP -0.059040902272392504
synpath:^pobj^prep^null>nsubjpass -0.08044813202835889
deprel:null -0.09204737588197029
linearDistance -0.09385910754637326
synpath:^poss^conj^null>null>nsubj -0.1033848526620039
synpath:^pobj^prep^ccomp>nsubj -0.15437328909280093
synpath:^nsubj^null>null>parataxis>nsubj -0.18377681772608997
synpath:^nsubj^ccomp^null>nsubj -0.18478330141920232
synpath:^dobj^null>null>nsubj -0.19333373302118825
synpath:^nsubj^null>null>dobj>prep>pobj>poss -0.27153906101910624
synpath:^nsubj^ccomp^null>null>nsubj -0.27178814024439074
synpath:^poss^dobj^parataxis^null>null>nsubj -0.2720074956345681
synpath:^nsubj^ccomp^conj>nsubj -0.2951913289122799
synpath:^nsubj^parataxis^null>null>nsubj -0.3053912182264542
synpath:^pobj^prep^ccomp^null>nsubj -0.3228381168903514
NN -0.331401114895203
synpath:^nsubj^null>null>poss -0.38657065003091684
synpath:^nsubj^xcomp^null>nsubj -0.4076584217490953
synpath:^nsubj^null>null>dobj -0.488179548989394
synpath:^nsubj^rcmod^pobj^prep^null>nsubj -0.489127485867776
synpath:^dobj^conj>nsubj -0.5694850360641267
synpath:^nsubj^ccomp^advcl>nsubj -0.5789508411431289
synpath:^pobj^prep^null>nsubj -1.01674615004074
synpath:^dobj^null>nsubj -1.0412024591160391
synpath:^nsubj^ccomp^ccomp>nsubj -1.2030847353933165
NNS -2.751885136283051
deprel:nn -2.976929155366783
oppositeGender -4.2830313567622715
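Each line of the weights files above pairs a feature name with a real-valued weight. As a rough illustration of how such a file could be consumed, the sketch below parses the lines into a map and scores a candidate as the weighted sum of its active features. The class name `WeightSketch`, the whitespace-separated format, and the sum-of-weights scoring rule are assumptions made for this example; the diff does not show how `CoreferenceAnnotator.readWeights` actually works.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of BookNLP: parses "feature weight" lines
// and scores a candidate as a weighted sum of its active features.
class WeightSketch {

    // Assumes one "name<whitespace>weight" pair per line; blank lines are skipped.
    static Map<String, Double> parse(List<String> lines) {
        Map<String, Double> weights = new HashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String[] parts = trimmed.split("\\s+");
            if (parts.length != 2) {
                throw new IllegalArgumentException("Malformed weight line: " + line);
            }
            // Double.parseDouble accepts scientific notation such as 3.975463451989579E-4.
            weights.put(parts[0], Double.parseDouble(parts[1]));
        }
        return weights;
    }

    // Features missing from the weight map contribute nothing to the score.
    static double score(Map<String, Double> weights, Map<String, Double> features) {
        double total = 0.0;
        for (Map.Entry<String, Double> f : features.entrySet()) {
            total += weights.getOrDefault(f.getKey(), 0.0) * f.getValue();
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Double> weights = parse(Arrays.asList(
                "PRP$ 1.1157587178303154",
                "isPerson 0.17214877758779562",
                "oppositeGender -4.2830313567622715"));
        Map<String, Double> features = new HashMap<>();
        features.put("PRP$", 1.0);
        features.put("oppositeGender", 1.0);
        System.out.println(score(weights, features));
    }
}
```

Under this reading, the large negative `oppositeGender` weight would outweigh most positive features, so a gender mismatch would effectively veto a candidate.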
Binary file removed lib/maltparser-1.7.2.jar
Binary file not shown.
Binary file added lib/maltparser-1.8.1.jar
Binary file not shown.
Binary file added lib/slf4j-api-1.7.21.jar
Binary file not shown.
Binary file removed lib/stanford-corenlp-3.3.1.jar
Binary file not shown.
Binary file added lib/stanford-corenlp-3.6.0.jar
Binary file not shown.
32 changes: 22 additions & 10 deletions src/novels/BookNLP.java
Expand Up @@ -39,8 +39,6 @@ public void process(Book book, File outputDirectory, String outputPrefix) {

process(book);

QuotationAnnotator quoteFinder = new QuotationAnnotator();
quoteFinder.findQuotations(book);

CharacterFeatureAnnotator featureAnno = new CharacterFeatureAnnotator();
featureAnno.annotatePaths(book);
Expand All @@ -49,23 +47,35 @@ public void process(Book book, File outputDirectory, String outputPrefix) {
}

public void process(Book book) {
System.out.println("Setting Dependents");
SyntaxAnnotator.setDependents(book);


System.out.println("Adding Dictionary");
Dictionaries dicts = new Dictionaries();
dicts.readAnimate(animacyFile, genderFile, maleFile, femaleFile);
dicts.processHonorifics(book.tokens);

System.out.println("Annotating Characters");
CharacterAnnotator charFinder = new CharacterAnnotator();

charFinder.findCharacters(book, dicts);
charFinder.resolveCharacters(book, dicts);


System.out.println("Getting Phrases");
PhraseAnnotator phraseFinder = new PhraseAnnotator();
phraseFinder.getPhrases(book, dicts);


System.out.println("Resolving Pronouns");
CoreferenceAnnotator coref = new CoreferenceAnnotator();
coref.readWeights(weights);
coref.resolvePronouns(book);

System.out.println("Setting Character IDs");
SyntaxAnnotator.setCharacterIds(book);


QuotationAnnotator quoteFinder = new QuotationAnnotator();
quoteFinder.findQuotations(book, dicts);
}

public void dumpForAnnotation(Book book, File outputDirectory, String prefix) {
Expand Down Expand Up @@ -94,6 +104,7 @@ public static void main(String[] args) throws Exception {

CommandLine cmd = null;
try {

CommandLineParser parser = new BasicParser();
cmd = parser.parse(options, args);
} catch (Exception e) {
Expand Down Expand Up @@ -153,7 +164,8 @@ public static void main(String[] args) throws Exception {
}

Book book = new Book(tokens);



if (cmd.hasOption("w")) {
bookNLP.weights = cmd.getOptionValue("w");
System.out.println(String.format("Using coref weights: ",
Expand All @@ -166,16 +178,16 @@ public static void main(String[] args) throws Exception {
book.id = prefix;
bookNLP.process(book, directory, prefix);

if (cmd.hasOption("printHTML")) {
File htmlOutfile = new File(directory, prefix + ".html");
PrintUtil.printWithLinksAndCorefAndQuotes(htmlOutfile, book);
}

if (cmd.hasOption("d")) {
System.out.println("Dumping for annotation");
bookNLP.dumpForAnnotation(book, directory, prefix);
}

if (cmd.hasOption("printHTML")) {
File htmlOutfile = new File(directory, prefix + ".html");
PrintUtil.printWithLinksAndCorefAndQuotes(htmlOutfile, book);
}
// Print out tokens
PrintUtil.printTokens(book, tokenFileString);

Expand Down
3 changes: 3 additions & 0 deletions src/novels/Dictionaries.java
Expand Up @@ -35,6 +35,7 @@ public Dictionaries() {
maleHonorifics.add("mr");
maleHonorifics.add("mister");
maleHonorifics.add("lord");
maleHonorifics.add("uncle");

femaleHonorifics.add("ms.");
femaleHonorifics.add("ms");
Expand All @@ -43,11 +44,13 @@ public Dictionaries() {
femaleHonorifics.add("miss");
femaleHonorifics.add("madam");
femaleHonorifics.add("lady");
femaleHonorifics.add("aunt");

generalHonorifics.add("dr.");
generalHonorifics.add("dr");
generalHonorifics.add("prof.");
generalHonorifics.add("prof");
generalHonorifics.add("professor");

honorifics.addAll(maleHonorifics);
honorifics.addAll(femaleHonorifics);
Expand Down
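The hunk above adds 'uncle', 'aunt' and 'professor' to the honorific sets. A minimal sketch of how sets like these might be used to derive a gender hint from a name's leading token follows. The class `HonorificSketch`, its method names, and the first-token rule are hypothetical and only illustrate the idea; the actual lookup in `Dictionaries` is not visible in this diff, and the sets below contain only the entries shown in the hunk.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration, not BookNLP's Dictionaries class.
class HonorificSketch {
    // Only the entries visible in the hunk are listed; the real sets may hold more.
    static final Set<String> MALE = new HashSet<>(Arrays.asList(
            "mr", "mister", "lord", "uncle"));
    static final Set<String> FEMALE = new HashSet<>(Arrays.asList(
            "ms.", "ms", "miss", "madam", "lady", "aunt"));
    static final Set<String> GENERAL = new HashSet<>(Arrays.asList(
            "dr.", "dr", "prof.", "prof", "professor"));

    // True if the token (case-insensitive) appears in any honorific set.
    static boolean isHonorific(String token) {
        String t = token.toLowerCase(Locale.ROOT);
        return MALE.contains(t) || FEMALE.contains(t) || GENERAL.contains(t);
    }

    // Returns "male", "female" or "unknown" based on the first token only.
    static String genderHint(String name) {
        String first = name.trim().toLowerCase(Locale.ROOT).split("\\s+")[0];
        if (MALE.contains(first)) {
            return "male";
        }
        if (FEMALE.contains(first)) {
            return "female";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        System.out.println(genderHint("Uncle Vernon"));
        System.out.println(genderHint("Professor McGonagall"));
    }
}
```

General honorifics such as 'professor' mark a token as a title without implying a gender, which is why they map to "unknown" here.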
3 changes: 2 additions & 1 deletion src/novels/Quotation.java
Expand Up @@ -5,11 +5,12 @@ public class Quotation {
public int end;
public int attributionId;
public int sentenceId;
public int p;

public Quotation(int start, int end, int sentenceId) {
this.start = start;
this.end = end;
this.sentenceId = sentenceId;
}

}
}
37 changes: 33 additions & 4 deletions src/novels/annotators/CharacterAnnotator.java
Expand Up @@ -30,6 +30,9 @@ public class CharacterAnnotator {

// the minimum number of times a name must show up to denote a character
int minCharacterNameMentions = 2;

// the minimum number of times a discovered character (with all its subset names) must show up to be a valid character
int minCharacterOccurences = 3;

// maximum length of a character name (in characters)
int maxCharacterNameLength = 50;
Expand Down Expand Up @@ -153,6 +156,7 @@ public HashSet<String> getVariants(String name, Dictionaries dicts) {
if (!dicts.honorifics.contains(parts[i])) {
variants.add(parts[i]);
}

for (int j = i + 1; j < parts.length; j++) {
variants.add(parts[i] + " " + parts[j]);
for (int k = j + 1; k < parts.length; k++) {
Expand Down Expand Up @@ -316,16 +320,41 @@ public void resolveCharacters(Book book, Dictionaries dicts) {
i++;

}


//delete characters that occur too rarely
//book.characters.length = 10;
/*
int tempLength = book.characters.length;
for (int c = 0; c < tempLength; c++){

if (book.characters[c].count < minCharacterOccurences){
tempLength--;
System.out.println("c " + c);
for (int k = c; k < tempLength; k++)
{
book.characters[k] = book.characters[k+1];
}

}
}

BookCharacter[] tempCharacters = new BookCharacter[tempLength];
for (i = 0; i < tempLength; i++)
tempCharacters[i] = book.characters[i];

book.characters = new BookCharacter[tempLength];

for (i = 0; i < tempLength; i++)
book.characters[i] = tempCharacters[i];
*/
// After all the tokens have been assigned, calculate and save
// properties of the characters (like most frequent name).
// for (int c = 0; c < tempLength; c++) {//
for (int c = 0; c < book.characters.length; c++) {
book.characters[c].setDominantName();
int charGender = dicts.getGender(book.characters[c].nameCounts);
book.characters[c].gender = charGender;
//System.out.println(String.format("%s\tCHAR: %s\t%s\t%s",
// book.characters[c].count, c, book.characters[c].name,
//charGender));
// System.out.println(String.format("%s\tCHAR: %s\t%s\t%s", book.characters[c].count, c, book.characters[c].name, charGender));

}
}
Expand Down
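The commented-out block in `resolveCharacters` drops rare characters by shifting array elements in place. As written, the element that moves into slot `c` after a shift does not appear to be re-checked before `c` advances, so two rare characters in a row could leave one behind. A simpler alternative, sketched below under stated assumptions, copies the survivors into a new array. `RareCharacterFilterSketch` and `SketchCharacter` are hypothetical stand-ins for illustration; `BookCharacter` is only assumed to expose a `count` field as the diff suggests.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not BookNLP code.
class RareCharacterFilterSketch {

    // Stand-in for BookCharacter; only the fields the filter needs.
    static class SketchCharacter {
        final String name;
        final int count;

        SketchCharacter(String name, int count) {
            this.name = name;
            this.count = count;
        }
    }

    // Returns a new array holding only characters seen at least minOccurrences times.
    // Relative order is preserved; the input array is not modified.
    static SketchCharacter[] keepFrequent(SketchCharacter[] characters, int minOccurrences) {
        List<SketchCharacter> kept = new ArrayList<>();
        for (SketchCharacter c : characters) {
            if (c.count >= minOccurrences) {
                kept.add(c);
            }
        }
        return kept.toArray(new SketchCharacter[0]);
    }

    public static void main(String[] args) {
        SketchCharacter[] all = {
            new SketchCharacter("a", 5),
            new SketchCharacter("b", 1),
            new SketchCharacter("c", 2),
            new SketchCharacter("d", 3)
        };
        for (SketchCharacter c : keepFrequent(all, 3)) {
            System.out.println(c.name + " " + c.count);
        }
    }
}
```

Filtering changes array positions, so any structure that refers to characters by index would need to be remapped afterwards; that step is not covered here.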
1 change: 1 addition & 0 deletions src/novels/annotators/CharacterFeatureAnnotator.java
Expand Up @@ -26,6 +26,7 @@ public void annotatePaths(Book book) {
}
}
}

for (int i = 0; i < book.tokens.size(); i++) {
if (book.tokenToCharacter.containsKey(i)) {
Antecedent ant = book.tokenToCharacter.get(i);
Expand Down