Weights
From the end-user perspective, a weight tells how probable a word or its analysis is. The weight can be thought of as a penalty, i.e. words/analyses with a bigger weight are less probable. Accordingly, when there are several analyses for a word, they are printed in ascending order of weight so that the most probable ones come first. It is possible to weight lexemes or grammatical rules, making it easier to disambiguate among several possible analyses of a given word. Of course, weights can also be used in generating word forms.
For weights, we use the tropical semiring: when there are several paths or transitions that differ only by their weight, the tropical semiring chooses the one with the lowest weight. All HFST functions and transducer classes support weights by default. If weights are not specified anywhere, the functions simply operate with zero weights. There are three back-end implementation formats available for almost all HFST functions: sfst, openfst-tropical and foma; openfst-tropical is the weighted one and is used by default.
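The two tropical-semiring operations can be illustrated in a few lines of plain Python (a hedged sketch of the arithmetic, not HFST code): "plus" is `min` (choose the lighter alternative) and "times" is `+` (accumulate weights along a path).

```python
# Tropical semiring sketch: "plus" picks the better (lighter) alternative,
# "times" accumulates weights along a path. A zero weight is the identity
# for "times", so unweighted transducers behave as if everything weighed 0.
def t_plus(w1, w2):   # alternative paths: keep the lower weight
    return min(w1, w2)

def t_times(w1, w2):  # consecutive transitions: add the weights
    return w1 + w2

# Two analyses of the same word with weights 0.7 and 0.3: the lighter wins.
print(t_plus(0.7, 0.3))
# A path of three weighted transitions: the weights accumulate.
print(t_times(t_times(0.5, 0.3), 0.2))
```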
Weights can be specified in regular expressions when using functions hfst.regex and hfst.start_xfst.
The mechanism for adding weights is the `::` operator, which can be used for assigning a weight to an individual transition or to any regular expression in brackets:

```
a::weight
a:b::weight
[ any regular expression ]::weight
```
The weights are most often from the tropical semiring. A tropical weight is represented as a float, i.e. one or more digits that may be preceded by a minus or plus sign and followed by a dot and at least one more digit. For example, the regular expression
```
[ a b:c::0.5 d::0.3 ]::0.2
```
will produce a transducer that maps abd to acd with weight 0.5 + 0.3 + 0.2 = 1.0. In this example, we basically have a transition a:a with no weight, followed by a transition b:c with weight 0.5, followed by a transition d:d with weight 0.3, leading to a final state with weight 0.2. However, operations that are called afterwards, e.g. minimization, may modify the exact positions of the weights in the transducer.
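As a sanity check of the arithmetic, the path just described can be walked in plain Python (hypothetical data structures for illustration, not the HFST API):

```python
# Transitions of the transducer for [ a b:c::0.5 d::0.3 ]::0.2, written as
# (input, output, weight) triples; the final state carries weight 0.2.
transitions = [('a', 'a', 0.0), ('b', 'c', 0.5), ('d', 'd', 0.3)]
final_weight = 0.2

def apply(path, final_w):
    out = ''.join(o for _, o, _ in path)          # concatenate output symbols
    w = sum(w for _, _, w in path) + final_w      # tropical "times": add weights
    return out, w

output, weight = apply(transitions, final_weight)
print(output, round(weight, 6))  # abd maps to acd with weight 1.0
```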
A more complex expression
```
[ [ foo:bar::-1.15 ]::+0.15 baz::0.5 ]::0.7
```
will yield a transducer that maps foobaz to barbaz with weight -1.15 + 0.15 + 0.5 + 0.7 = 0.2.
Note that using weights is only possible with the implementation openfst-tropical (and, in principle, openfst-log, which is not very well supported). Inserting weights with the unweighted implementations, i.e. sfst or foma, has no effect.
| Tool | Usage |
|---|---|
| hfst.compile_lexc_file | Weights can be defined for individual entries, and they can also be used in regular expressions. |
| hfst.compile_twolc_file | It may become possible to add weights to rules, which would determine the relative importance of a rule in a conflict situation. At this time it is only possible to compile weighted rules with zero weights. |
| hfst.fst | The weight of a string can be given after the string, separated by a tabulator. |
| read_att_input, read_att_string, read_att_transducer, class AttReader | In AT&T format, weights for transitions and final states can be given at the end of the transition or final-state line, separated by a tabulator. |
| class PrologReader | In prolog format, weights can be given as the last argument of the compounds arc and final. |
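The tab-separated AT&T conventions in the table can be sketched with a minimal parser in plain Python (an illustration of the line format only, not HFST's AttReader):

```python
# Minimal parser for AT&T lines: transition lines have the fields
# src, dst, input, output and an optional weight; final-state lines have a
# state number and an optional weight. Fields are separated by tabulators,
# and a missing weight defaults to zero.
def parse_att_line(line):
    fields = line.rstrip('\n').split('\t')
    if len(fields) >= 4:                            # transition line
        src, dst, inp, out = fields[:4]
        weight = float(fields[4]) if len(fields) > 4 else 0.0
        return ('arc', int(src), int(dst), inp, out, weight)
    state = int(fields[0])                          # final-state line
    weight = float(fields[1]) if len(fields) > 1 else 0.0
    return ('final', state, weight)

print(parse_att_line('0\t1\ta\tb\t0.5'))  # ('arc', 0, 1, 'a', 'b', 0.5)
print(parse_att_line('1\t0.2'))           # ('final', 1, 0.2)
```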
There are some issues with weights that must be considered when specifying them or applying certain operations on weighted transducers.
Negation is based on subtraction, so it does not preserve weights. Usually this is not a problem, but it must be taken into consideration when using double negation. With unweighted transducers double negation produces an equivalent transducer, but with weighted ones the weights are lost. For example, the expression `~[~[a:b::2.5]]` yields a transducer equivalent to `a:b::0`, not `a:b::2.5`.
Weights are not yet fully defined for all rule types. See hfst-xfst for examples of using weights in rules.
The containment operators `$.[t]` (contains t exactly once) and `$?[t]` (contains t once or not at all) support weights. The operator `$[t]` (contains one or more t's) ignores weights; if you want to weight each occurrence of t with a weight of w, you can use the operator `$::w[t]`. Note that the containment operators use shortest matching and allow overlaps when matching the expressions. E.g. `$.[a a]` will not accept the string "aaa", since it contains two overlapping occurrences of "aa". Similarly, the weighted expression `$::1[a a]` will accept the string "aaa", but with weight 2, because there are two occurrences of "aa", each with weight 1. The expression `$.[a::3|[a a]::1]` will not accept the string "aa", because according to shortest matching (which does not consider weights) it contains two occurrences of "a".
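The overlapping-match counting behind `$::w[t]` can be mimicked in plain Python (an illustration of the weight arithmetic, not of the actual transducer construction):

```python
# Count overlapping occurrences of t in s; with $::w[t] each occurrence
# contributes weight w, so the total weight is count * w.
def containment_weight(s, t, w):
    count, i = 0, s.find(t)
    while i != -1:
        count += 1
        i = s.find(t, i + 1)   # advance by one position to allow overlaps
    return count * w

# "aaa" contains two overlapping occurrences of "aa", each weighted 1.
print(containment_weight('aaa', 'aa', 1))  # 2
```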
Not all weighted transducers are determinizable (and thus not minimizable), so weights must be encoded when performing determinization or minimization on such transducers. Encoding means that weights are regarded as a part of the transition symbol pairs and cannot be moved freely. In some cases this results in minimized transducers that are not fully minimal.
For example, the determinization algorithm does not halt when given input such as
```
0 1 a a 1
0 2 a a 2
1 1 b b 3
1 3 c c 5
2 2 b b 4
2 3 d d 6
3 5 e e 0
3 4 e e 0
4 0
5 0
```
and will eventually run out of memory (example modified from Mohri, Efficient Algorithms for Testing the Twins Property, page 8). If weights are encoded, the algorithm produces the result
```
0 1 a a 1
0 2 a a 2
1 1 b b 3
1 3 c c 5
2 2 b b 4
2 3 d d 6
3 4 e e 0
4 0
```
where weights have effectively been ignored during determinization and determinization has only been applied to the parts of the transducer that do not have weights.
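Why the unencoded input defeats determinization can be sketched by tracking the residual weights of the weighted subset construction by hand (a plain-Python illustration of the argument, not HFST's actual algorithm): after reading a followed by n b's, the weights of the two surviving paths keep drifting apart, so every n yields a distinct subset state.

```python
# Residual weights after reading a b^n in the example above.
# Path via state 1: a::1, then b::3 per loop; path via state 2: a::2,
# then b::4 per loop. Residuals are taken relative to the cheaper path.
def residuals(n):
    w1 = 1 + 3 * n
    w2 = 2 + 4 * n
    m = min(w1, w2)
    return (w1 - m, w2 - m)

# The residual pair never repeats, so determinization keeps creating
# new subset states and never halts.
print([residuals(n) for n in range(4)])  # [(0, 1), (0, 2), (0, 3), (0, 4)]
```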
You can turn weight encoding on in hfst-xfst with `set encode-weights ON`, and in hfst-regexp2fst, hfst-determinize and hfst-minimize with the option `--encode-weights`.
When taking a projection, the weights are copied as such. If an expression is divided into its input and output projections and these are combined with cross product, the weights are duplicated. For example, the expression `[a:b::5].u .x. [a:b::5].l` will yield a transducer equivalent to `a:b::10`, not `a:b::5`.
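The doubling can be traced with a toy model of a weighted string pair in plain Python (hypothetical tuples for illustration; the assumption, per the text above, is that projections copy the weight unchanged and the cross product adds weights tropically):

```python
# A weighted symbol pair a:b::5 modelled as (input, output, weight).
pair = ('a', 'b', 5.0)

upper = (pair[0], pair[0], pair[2])   # [a:b::5].u  ->  a:a::5 (weight copied)
lower = (pair[1], pair[1], pair[2])   # [a:b::5].l  ->  b:b::5 (weight copied)

# .x. pairs the input side of the first with the output side of the second;
# tropical "multiplication" adds the two copied weights.
cross = (upper[0], lower[1], upper[2] + lower[2])
print(cross)  # ('a', 'b', 10.0) -- the weight has been duplicated
```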
Another issue with projection is that it can produce weighted epsilon cycles. For example, the expression `[[a:0::-1]*].l` will produce an epsilon cycle with a weight of minus one. When the epsilons are removed as part of determinization or minimization, the result is an infinitely negative weight. (Positive weights are not a problem in the tropical semiring, because they are regarded as penalties and discarded in the first place.) The following example illustrates this in more detail.
Consider the transducer
```
0 0 ε ε -1
0 1 a a 0
1 0
```
Performing epsilon removal or determinization on it will theoretically produce the result

```
0 1 a a -INF
1 0
```

where -INF denotes negative infinity. Similarly, for minimization we get:

```
0 1 a a 0
1 -INF
```

In practice, -INF will be the smallest (most negative) float value that the environment where HFST is running supports. This is basically the correct result given the finite range of a float, but it is very slow to calculate: the transducer above minimizes in a couple of seconds, but adding more states slows the algorithm down considerably. Epsilon cycles with negative weights are relatively rare, but can occur e.g. when creating rules and taking the projection on the false side.
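The divergence can be mimicked with min-plus arithmetic in plain Python (an illustration of the semiring behaviour with a hypothetical helper, not HFST code): each traversal of a weight -1 epsilon cycle contributes -1, so the minimum over all traversal counts has no lower bound.

```python
# Best (lowest) weight reachable by taking a weighted epsilon cycle up to
# n_max times: traversing the cycle n times contributes cycle_w * n.
def best_weight(cycle_w, n_max):
    return min(cycle_w * n for n in range(n_max + 1))

print(best_weight(-1.0, 10))  # -10.0, and it keeps decreasing as n_max grows
print(best_weight(+1.0, 10))  # 0.0: a positive cycle is simply never taken
```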
hfst-fst2strings has the option `--n-best=N`, which does not work if the transducer has loops with negative weights.
hfst-compare works poorly with weighted transducers due to floating-point precision issues.