Scoring Function
The similarity between a path through the graph and the haplotype is used to compute the path's score, and the highest-scoring path is kept. The scoring formula is
Score(P) = ((SS + SZ) / 2) − (λg ⋅ |L(P) − E|) − (λf ⋅ N)
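The formula can be sketched as a small Python function. This is an illustration rather than Kanpig's actual code: the function and argument names are hypothetical, and it assumes SS and SZ are sequence and size similarity expressed as fractions between 0 and 1, L(P) is the number of graph nodes in the path, E is the number of haplotype pileups, and N is the number of pileups treated as false negatives.

```python
def path_score(ss, sz, path_nodes, pileups, false_negatives,
               gpenalty=0.01, fpenalty=0.1):
    """Score one path using the formula above (illustrative sketch only).

    gpenalty defaults to the 0.01 default mentioned in the text; fpenalty
    uses the 0.1 value the text calls generally sufficient, which is an
    assumption about what a typical run would pass.
    """
    similarity = (ss + sz) / 2
    node_mismatch = abs(path_nodes - pileups)  # |L(P) - E|
    return similarity - gpenalty * node_mismatch - fpenalty * false_negatives


# A two-node path scored against a single pileup, with no false negatives.
print(round(path_score(0.96, 0.96, path_nodes=2, pileups=1,
                       false_negatives=0, gpenalty=0.02), 3))
```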
The first parameterized penalty (--gpenalty, λg) helps reduce genotyping of split variant representations. For example, imagine a window with three variants in its graph of -100bp, +49bp, and -55bp. These variants are pure tandem repeats, so -100bp and +49bp together make the same net change as a single -51bp variant. If the haplotype has a single -52bp pileup, the -100bp & +49bp path will have a similarity of 96.0%, while the path scores will be 94.2% for just the -49bp and 94.5% for just the -55bp. However, with a penalty factor of 0.02, the score of the -100bp & +49bp path drops to 94.0%, below both single-variant paths, because the path has one more node than the number of pileups. The default --gpenalty of 0.01 can be interpreted as a 1 percentage point penalty for every graph node or haplotype pileup that is not matched 1-to-1. For VCFs with a dozen or so samples, a --gpenalty of 0.02 may be beneficial, and for VCFs with large sample sets, 0.03 or higher is recommended.
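The arithmetic in the tandem-repeat example can be checked with a short script. It is only a sketch: the similarities are the percentages quoted above, the path labels are copied from the text, and the helper name `best_path` is hypothetical.

```python
# (label, similarity quoted in the text, number of graph nodes in the path)
PATHS = [
    ("-100bp & +49bp", 0.960, 2),
    ("-49bp", 0.942, 1),
    ("-55bp", 0.945, 1),
]
PILEUPS = 1  # the haplotype has a single -52bp pileup


def best_path(gpenalty):
    """Return (score, label) of the top path after applying the node penalty."""
    scored = [
        (similarity - gpenalty * abs(nodes - PILEUPS), label)
        for label, similarity, nodes in PATHS
    ]
    return max(scored)


for g in (0.0, 0.01, 0.02):
    score, label = best_path(g)
    print(f"gpenalty={g}: best path is {label} with score {score:.3f}")
```

With these numbers, the two-node path still wins at 0.01 (95.0% against 94.5%) and loses to the -55bp path at 0.02.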
The second parameterized penalty (--fpenalty, λf) helps use the full haplotype while allowing for false negatives in the variant graph. For example, imagine a haplotype that comprises two adjacent pileups of -50bp and -100bp and a variant graph with only a -100bp change. The full haplotype's -150bp change over the window can, at best, have a 66% similarity to the variant graph's path. Kanpig will allow up to three false negatives in the haplotype by comparing subsets of the pileups (in this case, the -50bp and the -100bp independently) and will find 100% similarity with the -100bp pileup. However, since this path has a false negative from the -50bp, its final score will be penalized by λf. Generally, a --fpenalty of 0.1 is sufficient, with lower values boosting specificity.
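The subset comparison can be illustrated with a toy search over which pileups to keep. This is a sketch under stated assumptions, not Kanpig's implementation: it uses size similarity alone as a stand-in for the full similarity, omits the λg term, always keeps at least one pileup, and the names `size_similarity` and `best_subset` are hypothetical.

```python
from itertools import combinations


def size_similarity(a, b):
    """Smaller absolute size over larger; 0.0 if either is zero or signs differ."""
    if a == 0 or b == 0 or (a < 0) != (b < 0):
        return 0.0
    return min(abs(a), abs(b)) / max(abs(a), abs(b))


def best_subset(pileups, path_change, fpenalty=0.1, max_fn=3):
    """Try dropping up to max_fn pileups; each dropped pileup costs fpenalty."""
    best = None
    max_dropped = min(max_fn, len(pileups) - 1)  # keep at least one pileup
    for dropped in range(max_dropped + 1):
        for kept in combinations(pileups, len(pileups) - dropped):
            score = size_similarity(sum(kept), path_change) - fpenalty * dropped
            if best is None or score > best[0]:
                best = (score, kept, dropped)
    return best


# Haplotype pileups of -50bp and -100bp against a graph path of -100bp.
score, kept, dropped = best_subset([-50, -100], path_change=-100)
print(f"kept={kept} dropped={dropped} score={score:.2f}")
```

In this toy case the full haplotype scores about 0.67, while keeping only the -100bp pileup scores 1.0 minus one λf of 0.1, so the subset with one false negative is preferred.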