extended_penn_tokenizer

Fork of the Penn Treebank tokenizer

Original tokenizer written by Robert MacIntyre, University of Pennsylvania, late 1995
Original available at: http://www.cis.upenn.edu/~treebank/tokenizer.sed

Updated to:

  • fix 'comma in number' handling
  • fix open/close quote handling
  • generalize tokenization to documents with directional quotes
  • handle additional contractions
  • add an untokenizer that restores arbitrary tokenized documents to their original form
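
To give a feel for the behaviours listed above, here is a minimal Python sketch of Penn-Treebank-style tokenization. It is not the code in this repository: the function name `tokenize`, the regular expressions, and the choice of Python are all assumptions made for illustration, and the sketch covers only a few simple cases (commas inside numbers, straight and directional double quotes, a handful of contractions).

```python
import re


def tokenize(text: str) -> list[str]:
    """Hypothetical Penn-style tokenizer; illustration only, not this repo's code."""
    # Directional double quotes map to the Penn `` and '' tokens.
    text = text.replace("\u201c", " `` ").replace("\u201d", " '' ")
    # A straight double quote opens if it starts the text or follows
    # whitespace or an opening bracket; any remaining one closes.
    text = re.sub(r'(^|[\s(\[{<])"', r"\1 `` ", text)
    text = text.replace('"', " '' ")
    # Split off a comma unless it sits between two digits (keeps 1,000 intact).
    text = re.sub(r"(?<!\d),|,(?!\d)", " , ", text)
    # Split off common sentence punctuation.
    text = re.sub(r"([;:?!])", r" \1 ", text)
    # Split off a final period, even when closing quotes follow it.
    text = re.sub(r"([^.])\.((?:\s*'')*)\s*$", r"\1 . \2", text)
    # Contractions, with either a straight or a directional apostrophe:
    # "can't" -> "ca n't", "it's" -> "it 's".
    text = re.sub(r"(?i)(\w+)(n['\u2019]t)\b", r"\1 \2", text)
    text = re.sub(r"(?i)(\w)(['\u2019](?:s|m|d|ll|re|ve))\b", r"\1 \2", text)
    return text.split()


if __name__ == "__main__":
    print(tokenize("\u201cI can\u2019t count to 1,000,\u201d she said."))
    print(tokenize('She said, "it\'s fine."'))
```

Note that this sketch discards the original whitespace and quote style for double quotes, so by itself it could not support an untokenizer that recovers the original document; doing that would require keeping extra information that is not shown here.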