extended_penn_tokenizer

Fork of the Penn Treebank tokenizer

Original tokenizer written by Robert MacIntyre, University of Pennsylvania, late 1995
Original available at: http://www.cis.upenn.edu/~treebank/tokenizer.sed

Updated to:

fix 'comma in number' handling
fix open/close quote handling
generalize tokenization to documents with directional quotes
handle additional contractions
add an untokenizer that restores arbitrary tokenized documents to their original form
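
As a rough illustration of the tokenization changes listed above (commas inside numbers, open/close quotes, directional quotes, contractions), here is a minimal Python sketch. It is not this project's code: the function name `tokenize`, the rule order, and the contraction list are assumptions made for the example, loosely following the conventions of the original tokenizer.sed (`` and '' for quotes, splitting only a sentence-final period). It does not cover the untokenizer.

```python
import re

def tokenize(text):
    """Hypothetical sketch of Penn Treebank-style tokenization (one sentence per call)."""
    # Opening quotes: a directional left quote, or a straight quote at the
    # start of the text or right after whitespace or an opening bracket.
    text = text.replace("\u201c", " `` ")
    text = re.sub(r'(^|[\s(\[{<])"', r"\1`` ", text)
    # Split commas off, except a comma sitting between two digits (1,000).
    text = re.sub(r"(?<!\d),|,(?!\d)", " , ", text)
    # Split only a sentence-final period; closing quotes/brackets may follow it.
    text = re.sub(r"([^.])\.([\])}>\"'\u201d]*)\s*$", r"\1 .\2", text)
    text = re.sub(r"([?!;:])", r" \1 ", text)
    # Any double quote still present is treated as a closing quote.
    text = re.sub(r'["\u201d]', " '' ", text)
    # Contractions: split n't and a few common clitics (an assumed, partial list).
    text = re.sub(r"(?i)(\w)(n't)\b", r"\1 \2", text)
    text = re.sub(r"(?i)(\w)('(?:s|m|d|ll|re|ve))\b", r"\1 \2", text)
    return text.split()

print(tokenize('She said, "It\'s 1,000 dollars, isn\'t it?"'))
```

In this sketch the comma rule keeps "1,000" as one token while still splitting the comma after "said", and the quote rules decide open versus close from position for straight quotes and from the character itself for directional quotes.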