BMMM

Code for the Bayesian Multinomial Mixture Model (BMMM) from my 2011 paper (and thesis).

Requirements

  1. Java 1.7
  2. Maven (http://maven.apache.org/download.cgi)

Running BMMM

After cloning the project or downloading the zip, open the bmmm folder in a terminal and run:

mvn package
mvn dependency:copy-dependencies

If the build succeeds, list the available runtime configuration options by running:

java -cp target/bmmm-2.0.11.jar:target/dependency/* tagInducer.Inducer

The main requirement is an input file in CoNLL style with UPOS annotation (9 columns in total). If the input file contains dependencies (column 8), the deps feature can also be used. To use morphology (Morfessor) and PARG-based features, you will need the appropriate files. You can convert a raw tokenised corpus to CoNLL format with the following command:

java -cp target/bmmm-2.0.11.jar tagInducer.utils.RawToCoNLL corpus.txt

You can also use a JSON file with the following fields (one sentence per line; the example is wrapped here for readability):

{
    "words":[{"word":"more","pos":"qn","upos":"DET","cluster":"48"},
        {"word":"juice","pos":"n","upos":"NOUN","cluster":"48"},
        {"word":"?","pos":"?","upos":".","cluster":"-1"}]
}
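
If your corpus is already held in memory as tokens, a small script can write it out in this shape. The following is only a minimal sketch: the helper name `sentence_to_json_line` and the tuple layout are hypothetical, and the field names and the string-valued `cluster` are copied from the example above rather than from the BMMM source.

```python
import json


def sentence_to_json_line(tokens):
    """Serialise one sentence as a single-line JSON object.

    `tokens` is a list of (word, pos, upos, cluster) tuples. This layout is
    an assumption made for illustration, not part of BMMM itself.
    """
    words = [
        # The README example stores cluster ids as strings, so convert here.
        {"word": word, "pos": pos, "upos": upos, "cluster": str(cluster)}
        for (word, pos, upos, cluster) in tokens
    ]
    # Without `indent`, json.dumps emits no newlines: one sentence, one line.
    return json.dumps({"words": words}, ensure_ascii=False)


sentence = [
    ("more", "qn", "DET", 48),
    ("juice", "n", "NOUN", 48),
    ("?", "?", ".", -1),
]
line = sentence_to_json_line(sentence)
print(line)
```

Writing one such line per sentence to a file gives the one-sentence-per-line layout described above.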

Evaluating BMMM

To evaluate the output of the Inducer use:

java -cp target/bmmm-2.0.11.jar:target/dependency/* tagInducer.Evaluator

The input is a CoNLL-style file in which the clusters are in column 5 (index 4 when counting from 0). The same file must also contain either fine-grained tags (index 3 when counting from 0), UPOS (5th column) or CCG categories (6th column).
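
Before running the Evaluator, it can help to check that the cluster and tag columns sit where you expect. The sketch below is not part of BMMM: the function name `read_cluster_tag_pairs` is hypothetical, and it assumes tab-separated columns with blank lines between sentences. The sample rows are made up for illustration and may not match the exact layout the Evaluator expects.

```python
def read_cluster_tag_pairs(lines, cluster_col=4, tag_col=3):
    """Yield (cluster, tag) pairs from CoNLL-style lines.

    Column indices count from 0. The defaults follow the description above
    (clusters at index 4, fine-grained tags at index 3); tab separation is
    an assumption.
    """
    needed = max(cluster_col, tag_col)
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # blank line: sentence boundary
        cols = line.split("\t")
        if len(cols) <= needed:
            raise ValueError(
                f"expected at least {needed + 1} columns, got {len(cols)}: {line!r}"
            )
        yield cols[cluster_col], cols[tag_col]


# Made-up sample rows with 9 columns each.
sample = [
    "1\tmore\t_\tqn\t48\tDET\t_\t_\t_\n",
    "2\tjuice\t_\tn\t48\tNOUN\t_\t_\t_\n",
    "\n",
]
pairs = list(read_cluster_tag_pairs(sample))
print(pairs)
```

To check a different gold annotation, pass another `tag_col`; a row with too few columns raises a `ValueError` that quotes the offending line.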