-
Notifications
You must be signed in to change notification settings - Fork 45
/
Copy pathSAMPLE_GRAMMARS
52 lines (44 loc) · 2.77 KB
/
SAMPLE_GRAMMARS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
This SAMPLE_GRAMMARS file describes the sample grammars that come
with the distribution, and provides an overview of how the grammars
are organized.
Grammars written directly in the XML format used by OpenCCG appear in
separate directories under grammars/. There are currently four small
English grammars -- tiny, worldcup, flights, and comic -- plus a series of
related grammars, mini-*, for Basque, Dyirbal, English, Inuit, Tagalog and
Turkish, which are from Bozsahin and Steedman's (2003) study of
ergativity. The worldcup grammar includes the English examples from
Baldridge (2002). (The Dutch, Turkish, Tagalog, and Toba Batak grammars
have not been updated from Grok version 0.6.) The flights and comic
grammars (used in the FLIGHTS and COMIC systems) make use of a shared
grammar of core English, in the core-en dir, and contain categories for
pitch accents and boundary tones.
Grammars written in the front-end `dot CCG' format, which attempts to provide
a more powerful and easier-to-use format than the raw XML, are in separate
directories under ccg-format-grammars/. There are currently three grammars
here -- tiny, tinytiny, and arabic. `tiny' is a grammar originally based on
the `tiny' English grammar contained in the grammars/ directory and
documented above. It has been significantly expanded so as to demonstrate
the various features of the CCG format. `tinytiny' is a smaller English
grammar extracted from `tiny', which attempts to demonstrate a minimal-size
useful grammar. `arabic' is a grammar of a large chunk of Classical Arabic,
written by Ben Wing. It was created in particular to demonstrate the power
of CCG-format macros in handling complex morphology, and contains a nearly
full grammar of Arabic verbs. Dot CCG grammars are compiled using ccg2xml;
run ccg2xml -h for usage.
The best place to look for more info on the dot CCG format is in
ccg-format-grammars/tiny/tiny.ccg and in src/ccg2xml/README.
This release also includes a broad English coverage grammar from the
CCGBank and associated statistical models; see docs/ccgbank-README for
details.
Note that with all the grammars, there is the option to store logical
forms in an xml node-rel format for graphs. Conversion to this graph
format is done using a couple of XSLT transforms specified in the
grammar.xml file; see grammars/worldcup/grammar.xml for an example.
When using this graph format, it is also possible to visualize the
semantic graphs, as described in the main README file.
At present, ccg2xml does not support writing grammar.xml files with
the XSLT transforms for the node-rel graph format. As a workaround,
you can add these transforms to your own version of the file which you
then copy over the generated grammar.xml file, as shown below.
> ccg2xml --prefix= mygram.ccg
> cp mygram-grammar.xml grammar.xml