===================================================================
Ajavt (Ajaväljendite tuvastaja):
Temporal Expression Tagger for Estonian
===================================================================
A temporal expression tagger is a program that detects
time-referring expressions (timexes) in natural language text and
normalises the semantics of these expressions into a standard format.
This repository contains Ajavt, a rule-based, language-specific
temporal expression tagger for Estonian. The tool uses an
annotation format that is based on TimeML's TIMEX3 tag
(http://www.timeml.org), and it is currently tuned for temporal
tagging in the news domain.
Previous versions of this tool have been introduced in two
publications. Technical details of the implementation were
covered in (Orasmaa 2010), and a more general overview,
accompanied by an evaluation of the tagger on different text
genres, was provided in (Orasmaa 2012).
===========================================
Temporal expression tagging: an example
===========================================
The sentence:
Potsataja ütles eile, et vaatavad nüüd Genaga viie aasta plaanid
uuesti üle.
'Potsataja said yesterday that he and Gena will now go over
the five-year plans again.'
is tagged for temporal expressions in the following way (assuming
that the creation time of the text is 2014-10-06):
Potsataja ütles
<TIMEX tid="t1" type="DATE" value="2014-10-05">
eile
</TIMEX>,
et vaatavad
<TIMEX tid="t2" type="DATE" value="PRESENT_REF" anchorTimeID="t0">
nüüd
</TIMEX>
Genaga
<TIMEX tid="t3" type="DURATION" value="P5Y">
viie aasta
</TIMEX>
plaanid uuesti üle.
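As a rough illustration of what normalisation means for the relative
expression 'eile' (yesterday), the value above can be reproduced with plain
date arithmetic against the creation time. This is only a sketch: it is not
the tagger's rule engine, and the function name is hypothetical.

```python
from datetime import date, timedelta


def normalise_yesterday(dct: date) -> str:
    """Return the calendar value of 'eile' (yesterday) relative to dct.

    Hypothetical helper for illustration; Ajavt itself derives such
    values from its XML rules, not from this function.
    """
    return (dct - timedelta(days=1)).isoformat()


# Creation time from the example above: 2014-10-06
print(normalise_yesterday(date(2014, 10, 6)))  # 2014-10-05
```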
=========================
Requirements
=========================
For building the program (JAR file):
** Java JDK (at least version 1.8.x is expected);
** Apache Ant (at least version 1.8.2);
For using the program:
** A sentence segmenter;
** A word tokenizer;
** An Estonian morphological analyzer and disambiguator; possible
options:
-- Filosoft's Vabamorf: https://github.com/Filosoft/vabamorf
-- PyVabamorf: https://github.com/estnltk/pyvabamorf
-- T3MESTA (a commercial morphological analyzer)
(NB! The program also works on morphologically ambiguous input,
but the quality of the analysis is expected to be lower than
on the morphologically disambiguated text.)
=========================
Building the program
=========================
The most straightforward way to compile the program is to use
Apache Ant and the build script ("build.xml" in the root directory).
Before building, the correct path to the JDK must be set in the file
"build.properties" (variable "java.home.location"). Building
and deploying can then be invoked with the command:
ant deploy
(in the same directory where "build.xml" is located).
This compiles the Java source code, makes the JAR file (Ajavt.jar),
and copies the JAR file along with the required files into the folder
"test".
=========================
Using the program
=========================
Before Ajavt can be applied to a text, a number of preprocessing
steps must be performed: the text must be split into sentences and
tokens (words), and the words must be morphologically analysed (and
disambiguated).
These functionalities are provided by the EstNLTK toolkit, so the
easiest way to use the program is within this toolkit (see
https://github.com/estnltk/estnltk for more details).
Processing JSON format texts
------------------------------
In the JSON processing mode (flag '-format json'), the input of the
program is expected to be in the same format as the output of
Vabamorf's command 'etana analyze': a JSON structured text in UTF-8
encoding. The minimum JSON structure that the input should
have is:
{"words": [ {"analysis": [ ... ],
"text": "word1"},
{"analysis": [ ... ],
"text": "word2"},
...
{"analysis": [ ... ],
"text": "wordN"} ]}
That is, an object with key "words" must be present, indicating an
analysable sentence.
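The minimum structure described above can be checked before a file is handed
to the tagger. The following is a minimal sketch under that description only;
the function name is hypothetical, and the empty "analysis" list in the sample
is a placeholder rather than real Vabamorf output.

```python
import json


def check_minimal_input(raw: str) -> list:
    """Parse one JSON sentence and return its token texts.

    Raises ValueError if the "words" list, or the "text"/"analysis"
    keys of any word, are missing.
    """
    doc = json.loads(raw)
    if not isinstance(doc, dict) or not isinstance(doc.get("words"), list):
        raise ValueError('top-level object must contain a "words" list')
    texts = []
    for i, word in enumerate(doc["words"]):
        if not isinstance(word, dict) or "text" not in word or "analysis" not in word:
            raise ValueError(f'word {i} lacks "text" or "analysis"')
        texts.append(word["text"])
    return texts


# Placeholder analyses: a real input carries morphological analyses here.
sample = '{"words": [{"analysis": [], "text": "eile"}]}'
print(check_minimal_input(sample))
```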
Also note that Ajavt expects that word root analyses are 'clean',
i.e. without any phonetic markup symbols (which can be optionally added
in 'etana' with flag '-phonetic').
An example of JSON input can be found in the file "test/example_input.json".
In the "test" folder, the following command invokes Ajavt on the input file
"example_input.json" and writes the results to standard output:
java -jar Ajavt.jar -format json -in file example_input.json -pretty_print
(flag "-pretty_print" switches on the pretty printing mode, otherwise,
all of the output JSON is on a single line);
Alternatively, output can also be directed to a file by specifying:
java -jar Ajavt.jar -format json -in file example_input.json -pretty_print -out file my_output.json
Document creation time (DCT)
------------------------------
By default, temporal expressions with relative semantics (such as
'eile' / yesterday, 'reedel' / on Friday) are normalised with respect
to the execution time of the program. This setting can be overridden by
providing a separate document creation time as the first argument of the
program:
java -jar Ajavt.jar 1999-01-01TXX:XX -format json -in file example_input.json -pretty_print
The document creation time must be in the format "YYYY-MM-DDThh:mm".
Date/time fields marked with X-es are considered as unspecified/unknown.
For example, it can be specified that the creation time is only known
at month level (e.g. 1999-01-XXTXX:XX) or at year level (e.g.
1999-XX-XXTXX:XX). This also affects the normalisation, e.g. if DCT is
only specified at the year level, date-granularity relative expressions
(for example: 'reedel' / on Friday) will be normalised as unspecified
temporal references (XXXX-XX-XXTXX:XX);
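A DCT string can be checked against the "YYYY-MM-DDThh:mm" form before it is
passed on the command line. The sketch below is an assumption-laden
illustration: the names are hypothetical, it accepts only fully specified or
fully X-ed fields, and it also accepts an X-ed year, which the text above
does not explicitly confirm as valid input.

```python
import re

# One group per field: year, month, day, hour, minute.
# Each field is either all digits or all X (assumption: no mixed "199X").
DCT_PATTERN = re.compile(
    r"^([0-9]{4}|XXXX)-([0-9]{2}|XX)-([0-9]{2}|XX)T([0-9]{2}|XX):([0-9]{2}|XX)$"
)
FIELDS = ("year", "month", "day", "hour", "minute")


def known_dct_fields(dct: str) -> list:
    """Return the names of the DCT fields that are not X-ed out."""
    match = DCT_PATTERN.match(dct)
    if match is None:
        raise ValueError(f"not in YYYY-MM-DDThh:mm form: {dct!r}")
    return [name for name, value in zip(FIELDS, match.groups())
            if not value.startswith("X")]


print(known_dct_fields("1999-01-01TXX:XX"))
print(known_dct_fields("1999-XX-XXTXX:XX"))
```

A result that stops at "year" signals the situation described above, where
date-granularity relative expressions cannot be resolved.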
Other remarks
------------------------------
Ajavt.jar should be executed in a directory that contains the other files
required by the program:
javax.json-1.0.4.jar
joda-time-1.6.jar
reeglid.xml
The program can be executed with a custom configuration of rules, using the
flag "-r" followed by the full path to the XML rules file (e.g.
"reeglid.xml"):
java -jar Ajavt.jar -format json -in file example_input.json -pretty_print -r FULL/PATH/TO/reeglid.xml
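When the tagger is driven from a script, the command lines shown so far can
be assembled programmatically. This is a sketch only: the function and its
parameter names are hypothetical, it uses just the flags documented above,
and the relative order of "-out" and "-r" when both are given is an
assumption, since the examples never combine them.

```python
def build_ajavt_command(input_path, dct=None, rules_path=None,
                        output_path=None, pretty_print=True):
    """Assemble the argument list for a JSON-mode run of Ajavt.jar."""
    cmd = ["java", "-jar", "Ajavt.jar"]
    if dct is not None:
        cmd.append(dct)  # the DCT follows the JAR name, as in the example
    cmd += ["-format", "json", "-in", "file", input_path]
    if pretty_print:
        cmd.append("-pretty_print")
    if output_path is not None:
        cmd += ["-out", "file", output_path]
    if rules_path is not None:
        cmd += ["-r", rules_path]
    return cmd


print(" ".join(build_ajavt_command("example_input.json",
                                   dct="1999-01-01TXX:XX")))
```

The resulting list can be passed to subprocess.run with the working
directory set to the folder that holds Ajavt.jar and its required files.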
Flag "-pyvabamorf" invokes the program in a special standard input/output
processing mode, where the program reads a JSON formatted line from the
standard input, analyzes the line, and writes the results (as a single JSON
formatted line) to the standard output:
java -jar Ajavt.jar -pyvabamorf
============================
Interpreting the output
============================
The annotation format used by the program is described in the file
"doc/margendusformaat_et.pdf" (currently only in Estonian). Here,
we give a brief overview of how this format is expressed in JSON.
JSON format output
----------------------
In JSON input/output format, the presence of identified temporal expression(s)
is indicated by adding the key "timexes" to the token (at the same level as
the keys "text" and "analysis"). The value of "timexes" is a list of objects,
and each object has (at minimum) the following structure:
{
"tid": string,
}
where "tid" is a unique identifier of the temporal expression (in a form that
can be described by the regular expression "t[0-9]+").
(Note that in the "-pyvabamorf" processing mode, this uniqueness only holds
within a single input line, which is expected to be a single document.)
If the token begins a temporal expression phrase (either a single-word phrase
or a multiword phrase), additional attribute/value pairs will be specified in
the timex object:
"text" : string
// full extent phrase of the temporal expression
"type" : string
// one of the following: "DATE", "TIME", "DURATION", "SET"
"value": string
// calendrical value (largely follows TimeML TIMEX3 value format),
// but see "doc/margendusformaat_et.pdf" for details;
"temporalFunction": string ("true" or "false")
// indicates whether the semantics of the expression are relative
// to the context:
// *) For DATE and TIME expressions:
// "true" indicates that the expression is relative and
// semantics have been computed by heuristics;
// "false" indicates that the expression is absolute and
// semantics haven't been computed by heuristics;
// *) For DURATION expressions, the value is mostly "false",
// except for vague durations;
// *) For SET expressions, the value is always "true";
Depending on the (semantics of the) temporal expression, there can also be
other attribute/value pairs:
"mod" : string
// largely follows TimeML TIMEX3 mod format, with two additional
// values used to mark first/second half of the date/time (e.g. "in
// the first half of the month"): FIRST_HALF, SECOND_HALF;
"anchorTimeID"
// points to the temporal expression (by identifier) that this
// expression has been anchored to while calculating or determining
// the value;
// "t0" -- means that the expression is anchored to document
// creation time;
"beginPoint"
// in case of DURATION: points to the temporal expression (by
// identifier) that serves as a beginning point of this duration;
// "?" -- indicates problems on finding the beginning point;
"endPoint"
// in case of DURATION: points to the temporal expression (by
// identifier) that serves as an ending point of this duration;
// "?" -- indicates problems on finding the ending point;
"quant"
// Quantifier; Used only in some SET expressions, e.g. quant="EVERY"
"freq"
// Used in some SET expressions, marks frequency of repetition,
// e.g. "three days in each month" will have freq="3D"
An example
--------------
The sentence
"Potsataja ütles eile, et vaatavad nüüd Genaga viie aasta plaanid uuesti üle."
(created at 2014-10-06)
will obtain the following temporal expression annotations:
{
"words":[ { "analysis":[ ... ],
"text":"Potsataja"
},
{ "analysis":[ ... ],
"text":"ütles"
},
{ "analysis":[ ... ],
"text":"eile,",
"timexes":[ { "tid":"t1",
"text":"eile,",
"type":"DATE",
"temporalFunction":"true",
"value":"2014-10-05" } ]
},
{ "analysis":[ ... ],
"text":"et"
},
{ "analysis":[ ... ],
"text":"vaatavad"
},
{
"analysis":[ ... ],
"text":"nüüd",
"timexes":[ { "tid":"t2",
"text":"nüüd",
"type":"DATE",
"temporalFunction":"true",
"value":"PRESENT_REF",
"anchorTimeID":"t0" } ]
},
{ "analysis":[ ... ],
"text":"Genaga"
},
{ "analysis":[ ... ],
"text":"viie",
"timexes":[ { "tid":"t3",
"text":"viie aasta",
"type":"DURATION",
"temporalFunction":"false",
"value":"P5Y" } ]
},
{ "analysis":[ ... ],
"text":"aasta",
"timexes":[ { "tid":"t3",
"text":"viie aasta" } ]
},
{ "analysis":[ ... ],
"text":"plaanid"
},
{ "analysis":[ ... ],
"text":"uuesti"
},
{ "analysis":[ ... ],
"text":"üle"
}
]
}
which should be interpreted as:
"eile," -- is a single-word temporal expression of type
"DATE", which refers to the date "2014-10-05";
"nüüd" -- is a single-word temporal expression of type
"DATE", which has an uncertain calendrical value, but
refers to the present time (PRESENT_REF), contemporary with
the document creation time (t0, which is 2014-10-06);
"viie", "aasta" -- form a multiword temporal expression phrase
("viie aasta"), referring to a period ("DURATION") with a
length of 5 years;
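Because a multiword expression is spread over several tokens that share one
"tid", a consumer of the output usually has to regroup the annotations. The
following sketch does this for a single sentence object; the function name is
hypothetical, and the sample is an abridged copy of the example above with
the "analysis" lists left empty as placeholders.

```python
def collect_timexes(sentence: dict) -> dict:
    """Map each tid to its attributes plus the token texts it covers."""
    found = {}
    for word in sentence.get("words", []):
        for timex in word.get("timexes", []):
            entry = found.setdefault(timex["tid"], {"tokens": []})
            entry["tokens"].append(word["text"])
            # The phrase-initial token carries type/value; later tokens
            # repeat only tid and text, so already stored keys are kept.
            for key, value in timex.items():
                entry.setdefault(key, value)
    return found


sentence = {"words": [
    {"analysis": [], "text": "Genaga"},
    {"analysis": [], "text": "viie",
     "timexes": [{"tid": "t3", "text": "viie aasta", "type": "DURATION",
                  "temporalFunction": "false", "value": "P5Y"}]},
    {"analysis": [], "text": "aasta",
     "timexes": [{"tid": "t3", "text": "viie aasta"}]},
]}
for tid, info in collect_timexes(sentence).items():
    print(tid, info["type"], info["value"], info["tokens"])
```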
Specifics
--------------
I. Note that there can also be timexes with no "text" value, i.e. timexes that
form an implicit duration (A), or mark implicit beginning or ending points (B):
(A) e.g. "2001-2005" -- the period covering explicit timepoints "2001-" and
"2005" is annotated as a timex (DURATION) with no
textual content;
(B) e.g. "following three years" -- beginning and ending timepoints of the
explicit duration expression ("three years") are marked
as timexes with no textual content;
II. The program does not always resolve ambiguities between multiple possible
readings of a temporal expression, e.g. "aastas 2000 tundi" can be interpreted
as "aastas 2000" (in year 2000) or as "2000 tundi" (2000 hours). In such
cases, "timexes" lists multiple timex objects for the affected token.
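Tokens that carry more than one timex object can be picked out for manual
review with a few lines of code. This is a sketch with a hypothetical
function name and made-up sample identifiers. Several timex objects on one
token may also stem from the implicit timexes of point I, so the result is
only a list of candidates, not a definite list of ambiguities.

```python
def tokens_with_multiple_timexes(sentence: dict) -> list:
    """Return (token text, [tids]) for tokens carrying several timexes."""
    result = []
    for word in sentence.get("words", []):
        tids = [timex["tid"] for timex in word.get("timexes", [])]
        if len(tids) > 1:
            result.append((word["text"], tids))
    return result


# Illustrative data only: the tids and their placement are invented.
sentence = {"words": [
    {"analysis": [], "text": "aastas", "timexes": [{"tid": "t1"}]},
    {"analysis": [], "text": "2000",
     "timexes": [{"tid": "t1"}, {"tid": "t2"}]},
    {"analysis": [], "text": "tundi", "timexes": [{"tid": "t2"}]},
]}
print(tokens_with_multiple_timexes(sentence))
```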
===============================
Development and evaluation
===============================
The structure
------------------
The Ajavt project has the following directory structure:
[doc] <--- documentation about the annotation format
and about format of the rules file;
[lib] <--- Java dependencies of the program;
[res] <--- resources used by the program:
[res\reeglid.xml] <--- the rules file
[src] <--- source of the program:
[src\ee\ut\soras\ajavtV2] <--- main source of the tagger;
[src\ee\ut\soras\wrappers] <--- wrappers for handling different input formats,
and a common model for encapsulating
morphological analyses;
[test-src] <--- methods for automated testing & evaluation;
[test-src\ee\ut\soras\test_ajavt] <--- tools for evaluating the tagger on an
annotated corpus;
[test] <--- the testing folder;
tagger's JAR file along with required
dependencies will be deployed here;
readme.txt <--- you are here :)
build.properties <--- configuration for the Ant build script;
build.xml <--- the Ant build script for compiling, deploying and testing
the tagger;
Automated testing and evaluation
----------------------------------
This distribution also contains tools for automatically testing and evaluating
the tagger against manually annotated TIMEX corpora. To set up and use the
automatic evaluation, proceed as follows:
I. Download the Estonian TIMEX annotated corpora from the following repository:
https://github.com/soras/EstTimexCorpora
II. Modify "build.properties" of this program and set the root directory of
evaluation corpora:
test.root=FULL/PATH/TO/EstTimexCorpora
III. Modify "build.xml" of this program to enable automated testing: remove
the comments around properties "use.tml.corpus.04" and "use.t3o.corpus.03".
The property "use.tml.corpus.04" enables the evaluation task "test-tml-04",
and the property "use.t3o.corpus.03" enables the task "test-t3-olp-03";
IV. Execute the automatic evaluation on all corpora with the command:
ant test-all
Alternatively, evaluation can be executed only on the TML format corpus:
ant test-tml
and only on the T3-OLP-AJAV format corpus:
ant test-t3-olp
V. The evaluation program will output a detailed analysis of matching (and
mismatching) TIMEX annotations for each document. Additionally, precision
and recall on TIMEX extents and attributes will be reported for each
document, and micro-averages of these measures will be reported at the end
of the evaluation.
The results of the evaluation will also be written into text files, marked
with the timestamp of evaluation. Each evaluation corpus has a subdirectory
"testlog" that stores these text files.
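For readers unfamiliar with micro-averaging: the counts of all documents are
pooled first, and precision and recall are computed once from the pooled
totals. The sketch below illustrates only this arithmetic. It is not the
evaluation code in "test-src", the function name is hypothetical, and the
counts in the usage line are invented.

```python
def micro_average(per_document):
    """Micro-averaged precision and recall from per-document counts.

    per_document: iterable of (correct, predicted, gold) integer triples,
    i.e. matching annotations, annotations produced by the tagger, and
    annotations in the manually annotated corpus.
    """
    correct = predicted = gold = 0
    for c, p, g in per_document:
        correct += c
        predicted += p
        gold += g
    precision = correct / predicted if predicted else 0.0
    recall = correct / gold if gold else 0.0
    return precision, recall


# Invented counts for two documents.
print(micro_average([(8, 10, 16), (1, 2, 4)]))
```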
============================
Acknowledgements and
license
============================
Copyright (C) 2009-2016 University of Tartu
Author: Siim Orasmaa ( siim . orasmaa {at} ut . ee )
Ajavt is released under the GNU General Public License version 2.
Dependency libraries have their own respective license terms, see
"lib/LIB_LICENSES.txt" for details.
Development of this tool has been supported by the National Programme
for Estonian Language Technology under projects EKKTT09-66, EKT7 and
EKT57.
============================
References
============================
Orasmaa, S. (2010). Ajaväljendite tuvastamine eestikeelses tekstis
(Recognition and Resolution of Estonian Temporal Expressions). Master’s
thesis, University of Tartu. (in Estonian).
( url: http://comserv.cs.ut.ee/forms/ati_report/downloader.php?file=F0E53012D5F88F71DD6E2E84830460F334E14EA2 )
Orasmaa, S. (2012). Automaatne ajaväljendite tuvastamine eestikeelsetes tekstides
(Automatic Recognition and Normalization of Temporal Expressions in Estonian
Language Texts). Eesti Rakenduslingvistika Ühingu aastaraamat 8: 153-169.