Index Types
The following assumes vg version 1.41.0 or later unless otherwise noted.
This page provides further information on the indexes discussed in Index Construction. Some file types not listed here are described at File Types.
XG is a space-efficient immutable graph. It is typically used in mappers and other applications that do not modify the graph.
Multiple graphs can be combined into a single XG index with vg index -x. By default, the variants (alt paths) created with vg construct -a are not stored as paths in the XG index. They can be included with option -L.
A single graph can be converted into XG format with vg convert -x.
The usual extension for XG files is .xg.
GBWTGraph is an immutable graph induced by the paths stored in a GBWT index. It uses the GBWT for graph topology and stores the sequences in the graph itself. Nodes and edges of the original graph not used on any GBWT path are not included in the GBWTGraph. Because any changes to the GBWT index make the GBWTGraph invalid, standalone GBWTGraph files are now deprecated.
GBZ is a wrapper that stores a GBWT index and the corresponding GBWTGraph. While the GBWTGraph file format corresponds closely to the in-memory representation of the data, GBZ compresses the sequences and is much more space-efficient.
GBWTGraphs can be built with vg gbwt -g (step 5 of the pipeline). Option --gbz-format stores the graph in GBZ format.
A GBWTGraph often contains many haplotypes. Since the introduction of path senses in version 1.41.0, vg convert will include the haplotypes when converting a GBWTGraph / GBZ graph to other graph types. This can make the resulting graph excessively large. To avoid including the haplotypes, use vg convert with option -H / --drop-haplotypes.
The usual extension for a GBZ graph is .gbz.
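The idea of a graph "induced by paths" can be illustrated with a minimal Python sketch. It is not the GBWTGraph implementation: the function name induced_subgraph, the integer node identifiers, and the toy graph are all hypothetical, and node orientation is ignored for simplicity.

```python
def induced_subgraph(paths):
    """Return the nodes and edges visited by at least one path."""
    nodes = set()
    edges = set()
    for path in paths:
        nodes.update(path)
        # Consecutive node pairs on a path are the edges that path uses.
        edges.update(zip(path, path[1:]))
    return nodes, edges


# Hypothetical original graph: node 4 and edge (2, 5) lie on no path.
original_nodes = {1, 2, 3, 4, 5}
original_edges = {(1, 2), (2, 3), (2, 5), (3, 5), (1, 4), (4, 5)}
haplotype_paths = [[1, 2, 3, 5], [1, 2, 3]]

nodes, edges = induced_subgraph(haplotype_paths)
dropped_nodes = original_nodes - nodes  # nodes missing from the induced graph
dropped_edges = original_edges - edges  # edges missing from the induced graph
```

In this toy example the induced graph keeps only what the two paths touch, which mirrors why nodes and edges outside every GBWT path are absent from a GBWTGraph.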
GCSA (sometimes called GCSA2) is an FM-index of a pruned de Bruijn graph that approximates the original graph. Queries up to a certain length (typically 256 bp) map perfectly to the original graph (or its subgraph if the graph was complex and had to be pruned). Longer queries may yield false positives.
In most applications, GCSA also needs the LCP array for additional functionality.
A GCSA index can be built with vg index -g. The inputs are typically pruned single-chromosome graphs.
The usual extensions are .gcsa for GCSA and .gcsa.lcp for the LCP array.
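Why longer queries may yield false positives can be seen in a toy model. The following Python sketch replaces the graph with two plain strings and the FM-index with a set of k-mers, so it only approximates the idea and is not how GCSA is implemented. All names (build_kmer_set, debruijn_contains, truly_contains) and the value K = 3 are assumptions made for illustration.

```python
def build_kmer_set(sequences, k):
    """Collect every k-mer that occurs in any of the sequences."""
    result = set()
    for sequence in sequences:
        for i in range(len(sequence) - k + 1):
            result.add(sequence[i:i + k])
    return result


def debruijn_contains(kmer_set, query, k):
    """Check the query against an order-k de Bruijn graph of the k-mers.

    In a de Bruijn graph any two k-mers that overlap by k - 1 characters are
    adjacent, so a query of length >= k matches if all of its k-mers exist.
    """
    if len(query) < k:
        raise ValueError("this sketch only handles queries of length >= k")
    return all(query[i:i + k] in kmer_set for i in range(len(query) - k + 1))


def truly_contains(sequences, query):
    """Ground truth: is the query a substring of some original sequence?"""
    return any(query in sequence for sequence in sequences)


K = 3  # stands in for the maximum exact query length
sequences = ["AACGT", "TACGG"]
index = build_kmer_set(sequences, K)

exact_hit = debruijn_contains(index, "ACG", K)       # length K: exact
false_hit = debruijn_contains(index, "AACGG", K)     # longer than K
false_hit_truth = truly_contains(sequences, "AACGG")
```

Here "AACGG" is accepted by the k-mer model because AAC, ACG, and CGG all exist, even though the string switches from the first sequence to the second at the shared k-mer ACG and occurs in neither.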
The GBWT is an FM-index that stores similar paths space-efficiently. In most applications, it is used for storing (real or artificial) haplotypes.
A GBWT index may also contain metadata with structured (sample, contig, haplotype, fragment) names for each path. Paths with sample name _gbwt_ref are often interpreted as generic paths.
Steps 1-2 of the vg gbwt pipeline build a GBWT index from several input types, including phased VCF (-v), GFA (-G), GBZ (-Z), embedded paths in a graph (-E), and GAF/GAM (-A / -A --gam-format).
The usual extension for a GBWT index is .gbwt.
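The structured path names can be modeled as a four-field record. The sketch below is only an illustration of the naming scheme described above: the PathName type, the integer field types, and the is_generic helper are hypothetical and do not correspond to the GBWT library's API.

```python
from typing import NamedTuple

# Sample name that the text above says is often interpreted as "generic path".
REFERENCE_SAMPLE = "_gbwt_ref"


class PathName(NamedTuple):
    """Hypothetical record for a structured GBWT path name."""
    sample: str
    contig: str
    haplotype: int  # assumed to be an integer identifier
    fragment: int   # assumed to be an integer identifier


def is_generic(name: PathName) -> bool:
    """Treat paths under the reserved sample name as generic paths."""
    return name.sample == REFERENCE_SAMPLE


haplotype_path = PathName("HG002", "chr1", 1, 0)
generic_path = PathName(REFERENCE_SAMPLE, "chr1", 0, 0)
```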
Note: Older materials may refer to the paths stored in a GBWT index as threads, which can easily be confused with computational threads. This separation between full paths and lightweight threads existed because GBWT paths have more limited functionality and often cannot be used in places that expect a proper path. This distinction has been obsolete since the introduction of the Path Metadata Model.
A distance index stores a hierarchical snarl decomposition of the graph and uses it for computing shortest distances between positions in the graph.
The index is built with vg index -j.
The typical extension for a distance index is .dist.
A minimizer index stores a subset of kmers from the haplotypes and the corresponding graph positions in a hash table. The positions must be annotated with distance information for faster distance index queries.
Minimizer indexes are built with vg minimizer. The construction requires a GBZ graph.
The usual extension for a minimizer index is .min.
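To show what "a subset of kmers ... in a hash table" means, here is a minimal Python sketch of window minimizers over a single string. It makes several simplifying assumptions that differ from vg: it picks the lexicographically smallest k-mer in each window (real implementations typically order k-mers by a hash), it stores string offsets rather than graph positions, and it carries no distance annotations. The function name minimizer_index and the parameters k and w are chosen for illustration.

```python
from collections import defaultdict


def minimizer_index(sequence, k, w):
    """Map each minimizer k-mer to the sorted offsets where it was selected.

    A window covers w consecutive k-mers; the smallest k-mer in each window
    is its minimizer. Ties are broken by the leftmost offset.
    """
    index = defaultdict(set)
    num_kmers = len(sequence) - k + 1
    for start in range(num_kmers - w + 1):
        window = [(sequence[i:i + k], i) for i in range(start, start + w)]
        kmer, offset = min(window)
        index[kmer].add(offset)
    return {kmer: sorted(offsets) for kmer, offsets in index.items()}


# "GATTACA" has five 3-mers; only the selected minimizers are stored.
toy_index = minimizer_index("GATTACA", k=3, w=2)
```

Because adjacent windows often share their minimizer, the table holds fewer entries than there are k-mers, which is the source of the space savings.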
The vg map workflow requires a graph (usually XG index) and a GCSA index with the LCP array. It may also use a GBWT index with haplotypes for (potentially) more accurate alignment scores.
Giraffe requires the following indexes:
- A GBZ graph containing:
  - GBWT index with haplotypes. The GBWT should be augmented with artificial haplotypes for decoys etc. (vg gbwt -a). If the number of haplotypes is large (hundreds or more), it is usually better to downsample them (vg gbwt -l; also augments the result).
  - The corresponding GBWTGraph.
- A distance index.
- A minimizer index annotated with information from the distance index.
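The requirements of the two mapping workflows can be summarized as a small lookup keyed by the usual file extensions. This Python sketch is a hypothetical helper, not part of vg: it assumes the files follow the conventional extensions listed on this page, and it assumes that the vg map graph is an XG file even though other graph formats may work.

```python
# Usual extensions for the indexes each workflow needs (taken from this page).
REQUIRED_EXTENSIONS = {
    "map": [".xg", ".gcsa", ".gcsa.lcp"],  # assumes the graph is an XG index
    "giraffe": [".gbz", ".dist", ".min"],
}


def missing_indexes(workflow, filenames):
    """Return the required extensions that no given filename ends with."""
    missing = []
    for extension in REQUIRED_EXTENSIONS[workflow]:
        if not any(name.endswith(extension) for name in filenames):
            missing.append(extension)
    return missing


giraffe_missing = missing_indexes("giraffe", ["graph.gbz", "graph.dist"])
map_missing = missing_indexes("map", ["graph.xg", "graph.gcsa", "graph.gcsa.lcp"])
```

The check is purely name-based, so it cannot tell whether a minimizer index was actually annotated with the matching distance index; it only flags files that are absent.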