This program reads input in Constraint Grammar format, and splits special “multiword cohorts” into separate cohorts, leaving other cohorts and intervening blanks as they were.
For examples of input/output, see the files in test/
, e.g.
test/input.simple.cg.
A C++ compiler that goes all the way to 11.
Tested with gcc-5.2.0, gcc-5.3.1 and clang-703.0.29.
(Should work all the way down to gcc-4.9, but will fail with e.g. gcc-4.8.4 or clang-3.5.0.)
./autogen.sh
./configure # optionally with argument --prefix=$HOME/my/prefix
make
make install # with sudo if you didn't specify a prefix
Takes no options, just stdin and stdout:
cg-mwesplit < infile > outfile
More typically, it’ll be in a pipeline after hfst-tokenise
and some
step that disambiguates multiwords using vislcg3
:
echo words go here | hfst-tokenise --gtd tokeniser.pmhfst | vislcg3 -g mwe-dis.cg3 | cg-mwesplit
If you get
terminate called after throwing an instance of 'std::regex_error' what(): regex_error
then your C++ compiler is too old. See Prerequisites.