- Enron emails - https://www.cs.cmu.edu/~enron
- OntoNotes 5.0
- spaCy (v2.2.4) - https://spacy.io/
- Accuracy score
- ROC
- F1 score
- Confusion matrix
I annotated the OntoNotes dataset with spaCy's NER and treated the dataset's pre-annotations as gold labels for the performance and accuracy evaluation. Since the focus is only on organization labelling, I divided the data into three categories labelled -1, 0, and 1: -1 means spaCy did not assign any entity to the sentence, 0 means the entity label was not ORG, and 1 means the entity was ORG.
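A minimal sketch of how this three-way labelling can be derived from spaCy's output is shown below; the `label_sentence` helper and the example sentence are illustrative, not the exact code used.

```python
import spacy

# Assumes the small English model is installed: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def label_sentence(sentence):
    """Map a sentence to -1 / 0 / 1 from spaCy's entity predictions.

    -1: spaCy assigned no entity at all
     0: spaCy found entities, but none of them is ORG
     1: spaCy found at least one ORG entity
    """
    doc = nlp(sentence)
    if not doc.ents:
        return -1
    return 1 if any(ent.label_ == "ORG" for ent in doc.ents) else 0

print(label_sentence("Enron Corporation was based in Houston."))  # expected: 1
```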
Performance results on the OntoNotes dataset
- Accuracy score: 0.9676
- F1 scores for the three categories (-1, 0, 1): [0., 0.98101418, 0.91691935]. The score for -1 is 0 because the pre-annotated data contains no -1 labels.
- Confusion matrix (rows = true labels, columns = predicted labels, both in the order -1, 0, 1):

| true \ predicted | -1 | 0    | 1   |
|------------------|----|------|-----|
| -1               | 0  | 0    | 0   |
| 0                | 25 | 4082 | 98  |
| 1                | 4  | 35   | 756 |
So in total 756 'ORG' (label 1) sentences were classified correctly, while 4082 is the number of correctly classified non-ORG (label 0) sentences.
- ROC, with ORG (label 1) as the positive class (see the metrics sketch after these results):
  - fpr: [0., 0.02330559, 0.9940547, 1.]
  - tpr: [0., 0.9509434, 0.99496855, 1.]
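The metrics above can be reproduced with scikit-learn roughly as sketched below; `y_true` and `y_pred` are illustrative stand-ins for the pre-annotated and spaCy-derived label arrays, and the ROC is computed by treating ORG (label 1) as the positive class.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, roc_curve

# Illustrative stand-ins: y_true holds the pre-annotated OntoNotes labels,
# y_pred the spaCy-derived labels, both drawn from {-1, 0, 1}.
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, -1, 0, 1]
labels = [-1, 0, 1]

accuracy = accuracy_score(y_true, y_pred)
f1_per_class = f1_score(y_true, y_pred, labels=labels, average=None)
cm = confusion_matrix(y_true, y_pred, labels=labels)  # rows = true, columns = predicted

# ROC is a binary measure, so ORG (label 1) is taken as the positive class
# and the raw -1/0/1 predictions are used as scores.
fpr, tpr, thresholds = roc_curve([int(y == 1) for y in y_true], y_pred)

print(accuracy, f1_per_class, cm, fpr, tpr, sep="\n")
```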
The Enron data has so far been simplified into a readable dataframe (a rough loading sketch is shown below).
Further simplification of the Enron dataset is needed to properly evaluate spaCy's performance on it.
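A rough sketch of how the raw CMU maildir could be flattened into a pandas dataframe is given below; the directory layout, column names, and the `load_enron` helper are assumptions for illustration, not the exact processing used.

```python
import email
import os
import pandas as pd

def load_enron(maildir="maildir", limit=1000):
    """Walk the Enron maildir and collect a few header fields plus the body."""
    rows = []
    for root, _dirs, files in os.walk(maildir):
        for name in files:
            with open(os.path.join(root, name), "r", errors="ignore") as fh:
                msg = email.message_from_file(fh)
            rows.append({
                "from": msg.get("From"),
                "to": msg.get("To"),
                "subject": msg.get("Subject"),
                "body": msg.get_payload() if not msg.is_multipart() else "",
            })
            if len(rows) >= limit:
                return pd.DataFrame(rows)
    return pd.DataFrame(rows)

emails_df = load_enron()
```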
- Observing the 'ORG' entities that spaCy failed to classify, the issue can be mitigated by removing punctuation from the sentences or by explicitly specifying peculiar organisation names (e.g. The Truth Squad); see the sketch after this list.
- The documentation of the organization label could be improved by adding some research institutes and scientific publications, such as the Astrobiology Journal.
- Some of the errors made were due to missing context in the sentences.
- The crowdsourced labels could be used, but I found a lot of redundant labelling: some human labels should be empty (i.e. agreeing with the machine label) but instead repeat exactly the same values as the machine-labelled dictionary, which hampered processing of the dataset for evaluation (a small filtering sketch is given below).
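A sketch of both ideas from the list above (stripping punctuation and pre-registering peculiar organisation names via spaCy v2's EntityRuler) is shown below; the two example patterns come from the observations, everything else is illustrative.

```python
import string
import spacy

nlp = spacy.load("en_core_web_sm")

# Register known-but-unrecognised organisation names before the statistical NER.
ruler = nlp.create_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "The Truth Squad"},
    {"label": "ORG", "pattern": "Astrobiology Journal"},
])
nlp.add_pipe(ruler, before="ner")

def strip_punctuation(text):
    """Drop punctuation that was observed to break ORG recognition."""
    return text.translate(str.maketrans("", "", string.punctuation))

doc = nlp(strip_punctuation("Have you heard from The Truth Squad?"))
print([(ent.text, ent.label_) for ent in doc.ents])  # expected to include ('The Truth Squad', 'ORG')
```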
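For the crowdsourced labels, a small hypothetical filter like the one below could treat a human entry that merely repeats the machine entry as an (empty) agreement before evaluation; the dictionary schema is assumed, not the dataset's actual one.

```python
def clean_crowd_labels(machine_labels, human_labels):
    """Replace human labels that just repeat the machine label with an empty dict.

    Both arguments map an item id to its label dictionary (illustrative schema).
    """
    cleaned = {}
    for item_id, human in human_labels.items():
        machine = machine_labels.get(item_id)
        # An identical copy of the machine label carries no new information,
        # so record it as agreement (empty) instead of a redundant relabel.
        cleaned[item_id] = {} if human == machine else human
    return cleaned
```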