Scoring Function

The similarity of a path through the graph to the haplotype is used to compute a score, and the highest-scoring path is kept. The scoring formula is

Score(P) = ((SS + SZ) / 2) − (λg ⋅ |L(P) − E|) − (λf ⋅ N)
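As a rough illustration, the formula can be written as a small Python function. This is only a sketch and not kanpig's implementation: the function and argument names are hypothetical, and it assumes SS and SZ are sequence and size similarity expressed as fractions between 0 and 1, L(P) is the number of nodes in the path, E is the number of haplotype pileups, and N is the number of false negatives.

```python
def path_score(seq_sim, size_sim, path_nodes, pileups, false_negatives,
               gpenalty=0.01, fpenalty=0.1):
    """Sketch of Score(P); all names here are illustrative assumptions.

    seq_sim, size_sim : assumed to be SS and SZ as fractions in [0, 1]
    path_nodes        : assumed to be L(P), the number of nodes in the path
    pileups           : assumed to be E, the number of haplotype pileups
    false_negatives   : assumed to be N
    """
    similarity = (seq_sim + size_sim) / 2
    graph_term = gpenalty * abs(path_nodes - pileups)
    fn_term = fpenalty * false_negatives
    return similarity - graph_term - fn_term


# A two-node path compared against a single pileup, with --gpenalty 0.02
print(round(path_score(0.96, 0.96, 2, 1, 0, gpenalty=0.02), 3))
```

Both penalties are subtracted from the mean similarity, so a path that matches the haplotype node-for-pileup with no false negatives keeps its similarity as its score.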

The first parameterized penalty (--gpenalty λg) helps reduce the genotyping of split variant representations. For example, imagine a window whose graph holds three variants of -100bp, +49bp, and -55bp. These variants are pure tandem repeats, so -100bp and +49bp together make the same net change as a single -51bp variant. If the haplotype has a single -52bp pileup, the -100bp & +49bp path will have a similarity of 96.0%, while the path scores will be 94.2% for just the -49bp and 94.5% for just the -55bp. However, with a penalty factor of 0.02, the score of the -100bp & +49bp path becomes 94.0%, since the path has one more node than the number of pileups, and it no longer outscores the 94.5% of the -55bp path. The default --gpenalty of 0.01 can be interpreted as a 1 percentage point penalty for every graph node or haplotype pileup that is not matched 1-to-1. For VCFs with a dozen or so samples, a --gpenalty of 0.02 may be beneficial, and for VCFs with large sample sets, 0.03 or higher is recommended.
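The arithmetic in this example can be checked with a short Python sketch. It takes the similarity figures quoted above as given inputs rather than computing them from sequence, and the names (`paths`, `rank`, `N_PILEUPS`) are hypothetical, not part of kanpig.

```python
# Pre-penalty similarity and node count for each candidate path, using the
# figures quoted in the text. The haplotype has a single -52bp pileup.
paths = {
    "-100bp & +49bp": (0.960, 2),
    "-49bp": (0.942, 1),
    "-55bp": (0.945, 1),
}
N_PILEUPS = 1


def rank(gpenalty):
    """Apply the node-count penalty and return (best path name, all scores)."""
    scores = {
        name: sim - gpenalty * abs(nodes - N_PILEUPS)
        for name, (sim, nodes) in paths.items()
    }
    return max(scores, key=scores.get), scores


for g in (0.0, 0.01, 0.02):
    best, scores = rank(g)
    print(g, best, {name: round(s, 3) for name, s in scores.items()})
```

Under these inputs the split path still wins at 0.01 (95.0% against 94.5%) and only loses to the single -55bp path once the penalty reaches 0.02.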

The second parameterized penalty (--fpenalty λf) helps use the full haplotype while allowing for false negatives in the variant graph. For example, imagine a haplotype that comprises two adjacent pileups of -50bp and -100bp and a variant graph with only a -100bp change. The full haplotype's -150bp change over the window can, at best, have a 66% similarity to the variant graph's path. Kanpig will allow up to three false negatives in the haplotype by comparing subsets of the pileups (in this case, only the -50bp and the -100bp independently) and will find 100% similarity with the -100bp pileup. However, since this path has a false negative from the -50bp, its final score will be penalized by λf. Generally, a --fpenalty of 0.1 is sufficient, with lower values boosting specificity.
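The subset comparison can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not kanpig's code: similarity is reduced to a size ratio (smaller absolute size over larger), sequence similarity and the --gpenalty term are ignored, and the function names are hypothetical.

```python
from itertools import combinations


def size_similarity(a, b):
    """Assumed size similarity: smaller over larger absolute size.

    Returns 0.0 when either size is zero or the signs differ.
    """
    if a == 0 or b == 0 or (a > 0) != (b > 0):
        return 0.0
    return min(abs(a), abs(b)) / max(abs(a), abs(b))


def best_subset(pileups, path_net, fpenalty=0.1, max_fn=3):
    """Try dropping up to max_fn pileups and keep the best penalized score.

    Returns (score, kept pileups, number of pileups dropped).
    """
    best = None
    # Always keep at least one pileup, so at most len(pileups) - 1 are dropped.
    for n_dropped in range(min(max_fn, len(pileups) - 1) + 1):
        for subset in combinations(pileups, len(pileups) - n_dropped):
            sim = size_similarity(sum(subset), path_net)
            score = sim - fpenalty * n_dropped
            if best is None or score > best[0]:
                best = (score, subset, n_dropped)
    return best


# Haplotype pileups of -50bp and -100bp against a path with a -100bp change
score, kept, dropped = best_subset([-50, -100], -100)
print(round(score, 3), kept, dropped)
```

With these inputs the full haplotype scores about 0.667, while keeping only the -100bp pileup gives a similarity of 1.0 that is reduced by one λf penalty for the dropped -50bp pileup.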
