#!/bin/sed -f
### Untokenize script available at: https://github.com/vansky/extended_penn_tokenizer
# Sed script to undo Penn Treebank tokenization on arbitrary raw text.
# Expected input: tokenized text with ONE SENTENCE PER LINE.
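# Usage sketch (not part of the original script; assumes this file is saved as
# ptb_untokenizer.sed and that tokenized.txt is a hypothetical input file):
#   sed -f ptb_untokenizer.sed tokenized.txt > untokenized.txt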
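# Reattach directional quotes and punctuation to the adjacent words.
# For example, the rules in this block turn
#   `` Hello , world . ''
# into
#   ``Hello, world.''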
s=`` =``=g
s= ''=''=g
s= \.\.\.=...=g
s= \([!?;]\) \1= \1=g
s= \([%:!?;]\)=\1=g
s=\([$#]\) =\1=g
s= \([,.]\)=\1=g
s=,,=,=g
# parentheses, brackets, etc.
# Some taggers, such as Adwait Ratnaparkhi's MXPOST, use the parsed-file
# version of these symbols (-LRB-, -RRB-, and so on).
# The following 6 lines map them back to literal brackets; COMMENT THEM OUT
# if your text should keep the -LRB- style tokens.
s/-LRB-/(/g
s/-RRB-/)/g
s/-LSB-/\[/g
s/-RSB-/\]/g
s/-LCB-/{/g
s/-RCB-/}/g
s=\([([<{]\) =\1=g
s= \([])}>]\)=\1=g
s= --=--=g
# NOTE THAT SPLIT WORDS WERE NOT MARKED by the tokenizer, so the rules below
# rejoin them by pattern only.
# First off, add a space to the beginning and end of each line, to reduce
# the number of regexps needed.
s=$= =
s=^= =
# handle contractions
s= '='=g
s= n't=n't=g
s= N'T=N'T=g
#s= \([Cc]\)annot = \1an not =g
s= \([Dd]\)' ye = \1'ye =g
s= \([Gg]\)im me = \1imme =g
s= \([Gg]\)on na = \1onna =g
s= \([Gg]\)ot ta = \1otta =g
s= \([Ll]\)em me = \1emme =g
#s= \([Mm]\)ore'n = \1ore 'n =g
#s= '\([Tt]\)is = '\1 is =g
#s= '\([Tt]\)was = '\1 was =g
s= \([Ww]\)an na = \1anna =g
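# clean out extra spaces: collapse each run of spaces into a single space
# (the pattern is two spaces followed by *, i.e. one or more spaces; with a
# single space before * it would also match the empty string between characters)
s=  *= =g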
# strip the padding spaces added above from both ends of the line
s=^ *==
s= *$==