module__TreeTagger

Robert Bossy edited this page Jul 27, 2017 · 1 revision

# org.bibliome.alvisnlp.modules.treetagger.TreeTagger

Synopsis

Runs tree-tagger.

Description

org.bibliome.alvisnlp.modules.treetagger.TreeTagger applies tree-tagger to the annotations in wordLayerName by generating an appropriate input file. This file contains one line for each annotation. The first column, the token surface form, is the value of the formFeature feature. The second column, the token's predefined POS tag, is the value of the posFeature feature. The third column, the token's predefined lemma, is the value of the lemmaFeature feature. If posFeature or lemmaFeature is not defined, the corresponding column (second or third) is left blank.
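The input line layout described above can be illustrated with a minimal Python sketch. It is not the module's implementation. The tab separator, the dictionary representation of an annotation, and the function name `build_input_line` are assumptions made for illustration. The sketch also treats both an unset parameter and a feature missing from the annotation as "not defined".

```python
from typing import Dict, Optional


def build_input_line(features: Dict[str, str],
                     form_feature: str = "form",
                     pos_feature: Optional[str] = "pos",
                     lemma_feature: Optional[str] = "lemma") -> str:
    """Build one input line for one annotation (hypothetical helper).

    Columns: surface form, predefined POS tag, predefined lemma.
    Assumption: columns are separated by tabs.
    """
    form = features.get(form_feature, "")
    # A column stays blank when the feature name is unset or the
    # annotation does not carry that feature.
    pos = features.get(pos_feature, "") if pos_feature else ""
    lemma = features.get(lemma_feature, "") if lemma_feature else ""
    return "\t".join([form, pos, lemma])
```

For example, an annotation that only has a `form` feature yields a line with the surface form followed by two empty columns.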

The tree-tagger binary is specified by treeTaggerExecutable, and the language model to use is specified by parFile. Additionally, a lexicon file can be given through lexiconFile.

If sentenceLayerName is defined, then org.bibliome.alvisnlp.modules.treetagger.TreeTagger considers annotations in this layer as sentences. Sentence boundaries are reinforced by providing tree-tagger an additional end-of-sentence marker.

Once tree-tagger has processed the corpus, org.bibliome.alvisnlp.modules.treetagger.TreeTagger adds the predicted POS tag and lemma to the respective posFeature and lemmaFeature features of the corresponding annotations.
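The write-back step can be sketched in Python as follows. This is an illustration under stated assumptions, not the module's code: it assumes tree-tagger output has one tab-separated line per token (form, POS tag, lemma) in the same order as the input, and it represents annotations as dictionaries. The function name `apply_predictions` is hypothetical.

```python
from typing import Dict, List


def apply_predictions(annotations: List[Dict[str, str]],
                      output_lines: List[str],
                      pos_feature: str = "pos",
                      lemma_feature: str = "lemma") -> None:
    """Copy predicted POS tags and lemmas onto annotations (hypothetical helper).

    Assumption: output_lines[i] corresponds to annotations[i] and has
    three tab-separated columns: form, POS tag, lemma.
    """
    if len(annotations) != len(output_lines):
        raise ValueError("output does not align with the input annotations")
    for annotation, line in zip(annotations, output_lines):
        _form, pos, lemma = line.rstrip("\n").split("\t")
        annotation[pos_feature] = pos
        annotation[lemma_feature] = lemma
```

The length check makes a misaligned output fail loudly instead of silently tagging the wrong tokens.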

If recordDir and recordFeatures are both defined, then tree-tagger predictions are written to the recordDir directory, one file per section. recordFeatures is an array of feature names to record. An additional feature, n, is recognized as the ordinal of the annotation in the section.
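As a rough illustration of how recordFeatures and the special feature n could shape a record file, the Python sketch below formats the lines for one section. The tab separator, the starting ordinal, and the name `record_lines` are assumptions, since this page does not specify the file layout.

```python
from typing import Dict, List


def record_lines(annotations: List[Dict[str, str]],
                 record_features: List[str],
                 first_ordinal: int = 0) -> List[str]:
    """Format the record lines of one section (hypothetical helper).

    Assumptions: one line per annotation, tab-separated columns, and
    ordinals counted from first_ordinal (the real start value is not
    documented here).
    """
    lines = []
    for n, annotation in enumerate(annotations, start=first_ordinal):
        # "n" is not a stored feature: it is the annotation ordinal.
        row = [str(n) if name == "n" else annotation.get(name, "")
               for name in record_features]
        lines.append("\t".join(row))
    return lines
```

With record features `["n", "form", "pos"]`, each line would carry the ordinal, the surface form, and the predicted POS tag.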

Parameters

parFile

Optional

Type: InputFile

Path to the language model file.

treeTaggerExecutable

Optional

Type: ExecutableFile

Path to the tree-tagger executable file.

constantAnnotationFeatures

Optional

Constant features to add to each annotation created by this module.

lexiconFile

Optional

Type: SourceStream

Path to a tree-tagger lexicon file. If set, the lexicon is applied to the corpus before tree-tagger processes it.

recordDir

Optional

Type: OutputDirectory

Path to the directory where to write tree-tagger result files (one file per section).

recordFeatures

Optional

Type: String[]

List of attributes to display in result files.

documentFilter

Default value: true

Type: Expression

Only process documents that satisfy this filter.

formFeature

Default value: form

Type: String

Name of the feature denoting the token surface form.

inputCharset

Default value: ISO-8859-1

Type: String

Tree-tagger input corpus character set.

lemmaFeature

Default value: lemma

Type: String

Name of the feature to set with the lemma.

noUnknownLemma

Default value: false

Whether to replace unknown lemmas with the surface form.
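The effect of this option can be sketched in Python. TreeTagger reports `<unknown>` as the lemma of tokens it cannot lemmatize. How this module detects that case internally is an assumption here, and the name `resolve_lemma` is hypothetical.

```python
UNKNOWN_LEMMA = "<unknown>"  # lemma string TreeTagger emits for unknown tokens


def resolve_lemma(form: str, lemma: str, no_unknown_lemma: bool = False) -> str:
    """Return the lemma to store on the annotation (hypothetical helper).

    When no_unknown_lemma is true, an unknown lemma falls back to the
    token surface form; otherwise the predicted lemma is kept as is.
    """
    if no_unknown_lemma and lemma == UNKNOWN_LEMMA:
        return form
    return lemma
```

With the default value (false), the `<unknown>` placeholder would be stored unchanged in the lemma feature.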

outputCharset

Default value: ISO-8859-1

Type: String

Tree-tagger output character set.

posFeature

Default value: pos

Type: String

Name of the feature to set with the POS tag.

recordCharset

Default value: UTF-8

Type: String

Character encoding of the result files.

sectionFilter

Default value: true

Type: Expression

Process only sections that satisfy this filter.

sentenceLayerName

Default value: sentences

Type: String

Name of the layer containing sentence annotations. Sentence boundaries are reinforced.

wordLayerName

Default value: words

Type: String

Name of the layer containing the word annotations.
