Chart-based Japanese parsers already exist, so why bother? Because shift-reduce parsing is much faster, and Stanford's shift-reduce parser is accurate.
> ./gradlew build
> ./gradlew copyRuntimeLibs
Annotate Japanese sentences with word boundaries and parts of speech (POS). For example, use KyTea:
> kytea -notag 2 < sentence_file.txt > tagged_sentence_file.txt
Note that words must be delimited by half-width spaces, and a POS tag must be attached to each word with a slash as the separator:
すもも/名詞 も/助詞 もも/名詞 も/助詞 もも/名詞 の/助詞 うち/名詞
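Before running the parser, it can help to check that every line of the tagged file really follows this format. The following is a minimal sketch in Python, not part of this project: the function name `parse_tagged_line` is hypothetical, and splitting each token on its last slash is an assumption about how the word and the POS tag should be separated.

```python
def parse_tagged_line(line):
    """Return a list of (word, pos) pairs, or raise ValueError.

    Assumption: the POS tag is whatever follows the LAST slash in a token,
    so a surface form that itself contains "/" is still accepted.
    """
    # Python's str.split() would silently split on full-width spaces too,
    # so reject them explicitly: the format requires half-width spaces.
    if "\u3000" in line:
        raise ValueError("full-width space found; use half-width spaces")
    pairs = []
    for token in line.split():
        word, sep, pos = token.rpartition("/")
        if not sep or not word or not pos:
            raise ValueError(f"token is not in word/POS form: {token!r}")
        pairs.append((word, pos))
    return pairs


line = "すもも/名詞 も/助詞 もも/名詞 も/助詞 もも/名詞 の/助詞 うち/名詞"
for word, pos in parse_tagged_line(line):
    print(word, pos)
```

Running this over each line of the tagged file and stopping at the first `ValueError` gives a quick way to locate malformed tokens.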
Be careful with the classpath and the model path.
> java -cp build/libs/yaraku-nlp-0.1.jar:lib/* \
com.yaraku.nlp.parser.shiftreduce.demo.JapaneseShiftReduceParserDemo \
-model ja.beam.rightmost.model.ser.gz \
< tagged_sentence_file.txt > parsed_sentence_file.txt
For example, given the input
すもも/名詞 も/助詞 もも/名詞 も/助詞 もも/名詞 の/助詞 うち/名詞
expect output like
(ROOT (名詞P (助詞P (助詞P (名詞 すもも) (助詞 も)) (助詞P (名詞 もも) (助詞 も))) (名詞P (助詞P (名詞 もも) (助詞 の)) (名詞 うち))))
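To use the bracketed output downstream, the tree has to be read back into a data structure. The sketch below is one possible way to do that in Python and is not part of this project: `read_tree` and `tagged_leaves` are hypothetical names, and the code assumes one complete tree per string with no literal parentheses inside words or labels.

```python
def read_tree(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    A terminal (word) is stored as a plain string inside its parent's
    children list. Raises ValueError on unbalanced or trailing input.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError(f"expected '(' but found {tokens[pos]!r}")
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    try:
        tree = node()
    except IndexError:
        # Ran off the end of the token list: a ")" is missing.
        raise ValueError("unbalanced parentheses") from None
    if pos != len(tokens):
        raise ValueError("trailing tokens after the tree")
    return tree


def tagged_leaves(tree):
    """Yield (word, pos) for each preterminal, left to right."""
    label, children = tree
    for child in children:
        if isinstance(child, str):
            yield (child, label)
        else:
            yield from tagged_leaves(child)


parsed = ("(ROOT (名詞P (助詞P (助詞P (名詞 すもも) (助詞 も)) "
          "(助詞P (名詞 もも) (助詞 も))) "
          "(名詞P (助詞P (名詞 もも) (助詞 の)) (名詞 うち))))")
print(" ".join(f"{w}/{p}" for w, p in tagged_leaves(read_tree(parsed))))
```

Joining the leaves back into word/POS form, as the last line does, is a simple way to confirm that the parser output still covers the whole input sentence.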
If you must...
- Get a Japanese treebank such as Japanese Dependency Corpus (JDC)
- Prepare the trees in Penn Treebank S-expression format.
- For example, use https://github.com/neubig/travatar/blob/master/script/tree/ja-dep2cfg.pl to convert JDC's trees.
- Build a model with training and development sets, e.g. JDC/train/all.cfg and JDC/dev/all.cfg
> java -cp build/libs/yaraku-nlp-0.1.jar:lib/* \
edu.stanford.nlp.parser.shiftreduce.ShiftReduceParser \
-headFinder com.yaraku.nlp.trees.RightHeadFinder \
-trainTreebank JDC/cfg/train/all.cfg \
-devTreebank JDC/cfg/dev/all.cfg \
-serializedPath ja.beam.rightmost.model.ser.gz \
-trainingThreads 8 \
-trainingIterations 60 \
-stalledIterationLimit 20 \
-trainingMethod REORDER_BEAM \
-trainBeamSize 4 \
-randomSeed 31337
- Thanks to Prof. Graham Neubig for the advice and scripts.
- This work is made possible through the support of Yaraku, Inc.