KissFormat
The KISS format is derived from the UCSC BED format, but with important differences that allow it to carry, for example, alignment information. This makes it well suited for describing the mapping of short sequence matches from Next Generation Sequencing, as well as the layout of multi-exon genes.
The BED format is by definition bound to chromosomes, which is not practical when working with bacterial contigs, plasmids, viruses, etc. The BED format is also unsuitable for carrying alignment information, and it contains the useless itemRgb field. Moreover, the BED position scheme is awkward because the chromEnd field includes one extra position that does not correspond to a base position but is used for drawing features.
The SAM format cannot be used to describe a multi-exon gene; it is specifically designed for short sequence alignments of Next Generation Sequencing data. The SAM format is also problematic to work with for several reasons. First, it is quite messy to parse even though it is claimed to be simple, because of the number of fields reserved for mate-pair sequence mapping and a complex bit field. Second, the alignment information is encoded using the Extended CIGAR format and two optional fields for mismatch information, but there is no information on inserted or deleted nucleotides; for that you need to recreate the alignment from the original sequence. Finally, the SAM format cannot be considered generic when it has an unlimited number of optional fields which may contain user-defined content.
The GFF format is impossible to work with because of the requirement of parsing multiple lines to resolve the parent/child features for describing e.g. a gene.
The KISS format (Keep it Simple Stupid) is a text based data format for describing generic feature information in a simple format with one feature per line in 12 tab-separated columns:
- S_ID: Subject ID - e.g. chr12.
- S_BEG: Begin position of a feature relating to the subject sequence. 0-based.
- S_END: End position of a feature relating to the subject sequence.
- Q_ID: Query ID - e.g. a Solexa read ID such as a3_2VCOjxwXsN1.
- SCORE: A float that can describe e.g. a BLAT score.
- STRAND: Denotes which strand a feature relates to. + or -.
- HITS: Number of times a feature is found in the subject sequence.
- ALIGN: Comma-separated list of alignment descriptors for mismatches, insertions, and deletions (*).
- BLOCK_COUNT: Number of blocks in a feature (e.g. introns + exons).
- BLOCK_BEGS: Comma-separated list of block begin positions. Offset is S_BEG.
- BLOCK_LENS: Comma-separated list of block lengths.
- BLOCK_TYPE: Comma-separated list of block types (0=Gap,1=Non-gap,2=CDS,3=5'UTR,4=3'UTR).
Values in fields 4-12 are optional and empty fields must contain a '.'.
(*) Alignment descriptors:
- mismatch: (offset:S-base>Q-base) - e.g. 0:C>T,13:G>C
- insertion: (offset:->Q-base) - e.g. 8:->G,18:->A
- deletion: (offset:S-base>-) - e.g. 5:A>-,16:T>-
The offset positions are relative to S_BEG and do not change with insertions or deletions. Alignment descriptors are based on the + strand.
Descriptors should be sorted by offset position.
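The three descriptor types above can be told apart by which side of the `>` carries a `-`. The following is a minimal sketch of a parser for the ALIGN field, not a reference implementation: the function name `parse_align` and the accepted base alphabet (A, C, G, T, N) are assumptions made for illustration.

```python
import re

# offset:S-base>Q-base, where '-' marks the missing side of an indel.
# The base alphabet ACGTN is an assumption, not part of the format text.
DESCRIPTOR = re.compile(r"^(\d+):([ACGTN-])>([ACGTN-])$", re.IGNORECASE)

def parse_align(field):
    """Return a list of (offset, kind, s_base, q_base) tuples (hypothetical helper)."""
    if field == ".":  # empty field
        return []
    parsed = []
    for item in field.split(","):
        match = DESCRIPTOR.match(item)
        if match is None:
            raise ValueError(f"bad alignment descriptor: {item!r}")
        offset, s_base, q_base = int(match.group(1)), match.group(2), match.group(3)
        if s_base == "-" and q_base == "-":
            raise ValueError(f"descriptor has no base on either side: {item!r}")
        if s_base == "-":
            kind = "insertion"
        elif q_base == "-":
            kind = "deletion"
        else:
            kind = "mismatch"
        parsed.append((offset, kind, s_base, q_base))
    offsets = [entry[0] for entry in parsed]
    if offsets != sorted(offsets):
        raise ValueError("descriptors are not sorted by offset")
    return parsed

print(parse_align("0:C>T,13:G>C"))
print(parse_align("8:->G,18:->A"))
print(parse_align("5:A>-,16:T>-"))
```

Rejecting unsorted descriptors is a strict reading of "should be sorted"; a more lenient reader could sort them instead.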
The mandatory Subject ID S_ID field is used to identify the subject sequence, which could be chr1, contig00001, or
gi|255961261|ref|NC_007492.2| Pseudomonas fluorescens Pf0-1, complete genome.
There are no restrictions on this field except that tabs must be escaped.
The mandatory Subject Begin S_BEG field is the 0-based begin position of a feature. S_BEG is a non-negative integer.
The mandatory Subject End S_END field is the end position of a feature. S_END is a non-negative integer, and is always equal to or greater than S_BEG.
The simplest KISS entry could look like this (empty fields are denoted with .):
Contig1 10 20 . . . . . . . . .
This describes an 11-base-long unidentified feature on the sequence of Contig1:
Pos:     0123456789012345678901234567890
Contig1: -------------------------------
                   ===========
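As a rough illustration of the record layout, the sketch below splits one line into the 12 named columns and maps `.` to an empty value. The function name `parse_kiss` is hypothetical, and the length formula assumes S_END is inclusive, which is what the 11-base example above implies.

```python
FIELDS = ["S_ID", "S_BEG", "S_END", "Q_ID", "SCORE", "STRAND", "HITS",
          "ALIGN", "BLOCK_COUNT", "BLOCK_BEGS", "BLOCK_LENS", "BLOCK_TYPE"]

def parse_kiss(line):
    """Split one tab-separated KISS line into a dict; '.' becomes None."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(cols)}")
    record = {name: (None if value == "." else value)
              for name, value in zip(FIELDS, cols)}
    # The first three columns are mandatory.
    if None in (record["S_ID"], record["S_BEG"], record["S_END"]):
        raise ValueError("S_ID, S_BEG and S_END are mandatory")
    record["S_BEG"] = int(record["S_BEG"])
    record["S_END"] = int(record["S_END"])
    if record["S_END"] < record["S_BEG"]:
        raise ValueError("S_END must not be smaller than S_BEG")
    return record

# The simplest entry: only the three mandatory fields are set.
line = "\t".join(["Contig1", "10", "20"] + ["."] * 9)
record = parse_kiss(line)
length = record["S_END"] - record["S_BEG"] + 1  # assumes an inclusive S_END
print(record["S_ID"], length)
```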
The optional Query ID Q_ID field is used to identify the query sequence, which could be the ID of a mapped sequence such as a Solexa read, e.g. a3_2VCOjxwXsN1, or a gene ID, e.g. NM_006140. There are no restrictions on this field except that tabs must be escaped.
Contig1 10 20 NM_006140 . . . . . . . .
The optional SCORE field is used to hold a score value of a sequence match, e.g. a BLAT or BLAST match. SCORE is a float value.
Contig1 10 20 NM_006140 0.123 . . . . . . .
The optional STRAND field is used to indicate the orientation of a feature. + and - are allowed.
Contig1 10 20 NM_006140 . - . . . . . .
The optional HITS field can be used to denote how many times this feature was found on this subject sequence or in a stack of subject sequences. This is useful for analyzing multi-mapping Next Generation Sequencing reads. The value of 123 in the HITS field below indicates that this sequence was also mapped at 122 other loci.
Contig1 10 20 NM_006140 . . 123 . . . . .
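The optional columns 4-7 each have their own value type. The sketch below shows one possible way to convert their raw text, with `.` treated as empty; the helper name `typed_optionals` is an assumption, and the strand check simply enforces the two allowed symbols.

```python
def typed_optionals(q_id, score, strand, hits):
    """Convert the raw text of columns 4-7; '.' becomes None (hypothetical helper)."""
    def opt(value, convert):
        return None if value == "." else convert(value)

    def check_strand(value):
        if value not in ("+", "-"):
            raise ValueError(f"STRAND must be + or -, got {value!r}")
        return value

    return {
        "Q_ID": opt(q_id, str),
        "SCORE": opt(score, float),
        "STRAND": opt(strand, check_strand),
        "HITS": opt(hits, int),
    }

# Columns 4-7 of the HITS example record above.
fields = typed_optionals("NM_006140", ".", ".", "123")
other_loci = fields["HITS"] - 1  # the feature itself accounts for one hit
print(fields["HITS"], other_loci)
```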
The optional ALIGN field consists of a comma-separated list of alignment descriptors, of which there are 3 types:
- mismatch: offset:S-base>Q-base
- insertion: offset:->Q-base
- deletion: offset:S-base>-
The offset positions are relative to S_BEG and do not change with insertions or deletions.
The alignment descriptors are sorted on the offset positions.
The alignment descriptors are based on the + strand.
Thus we can generate an alignment of a Subject and a Query sequence using only the Subject sequence and the ALIGN descriptors.
The following Subject sequence is used in the below examples:
S_SEQ: CCGTAAGACTACGGCTTAGGC
Here is an example of two mismatches: ALIGN: 0:C>T,13:G>C
                 1         2
       012345678901234567890
S_SEQ: CCGTAAGACTACGGCTTAGGC
        |||||||||||| |||||||
Q_SEQ: TCGTAAGACTACGCCTTAGGC
Here is an example of two insertions: ALIGN: 8:->G,18:->A
                  1          2
       01234567-8901234567-890
S_SEQ: CCGTAAGA-CTACGGCTTA-GGC
       |||||||| |||||||||| |||
Q_SEQ: CCGTAAGAGCTACGGCTTAAGGC
Here is an example with two deletions: ALIGN: 3:T>-,16:T>-
                 1         2
       012345678901234567890
S_SEQ: CCGTAAGACTACGGCTTAGGC
       ||| |||||||||||| ||||
Q_SEQ: CCG-AAGACTACGGCT-AGGC
And an example with all of the above: 0:C>T,3:T>-,8:->G,13:G>C,16:T>-,18:->A
                  1          2
       01234567-8901234567-890
S_SEQ: CCGTAAGA-CTACGGCTTA-GGC
        || |||| ||||| || | |||
Q_SEQ: TCG-AAGAGCTACGCCT-AAGGC
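The examples above can be reproduced with a short routine that walks the Subject sequence and applies the descriptors. This is only a sketch under stated assumptions: the function name `build_alignment` is hypothetical, the input is the + strand Subject sequence starting at S_BEG, and an insertion at offset n is placed immediately before the Subject base at offset n, as the examples suggest.

```python
def build_alignment(s_seq, align_field):
    """Return (subject row, match row, query row) for an ALIGN field."""
    mismatches, deletions, insertions = {}, set(), {}
    if align_field != ".":
        for item in align_field.split(","):
            offset, change = item.split(":", 1)
            s_base, q_base = change.split(">", 1)
            offset = int(offset)
            if s_base == "-":    # insertion: base present in the query only
                insertions.setdefault(offset, []).append(q_base)
            elif q_base == "-":  # deletion: base present in the subject only
                deletions.add(offset)
            else:                # mismatch
                mismatches[offset] = q_base

    s_row, m_row, q_row = [], [], []
    for pos, base in enumerate(s_seq):
        # Inserted query bases are emitted before the subject base at this offset.
        for q_base in insertions.get(pos, []):
            s_row.append("-"); m_row.append(" "); q_row.append(q_base)
        if pos in deletions:
            s_row.append(base); m_row.append(" "); q_row.append("-")
        elif pos in mismatches:
            s_row.append(base); m_row.append(" "); q_row.append(mismatches[pos])
        else:
            s_row.append(base); m_row.append("|"); q_row.append(base)
    return "".join(s_row), "".join(m_row), "".join(q_row)

S_SEQ = "CCGTAAGACTACGGCTTAGGC"
rows = build_alignment(S_SEQ, "0:C>T,3:T>-,8:->G,13:G>C,16:T>-,18:->A")
for label, row in zip(("S_SEQ:", "      ", "Q_SEQ:"), rows):
    print(label, row)
```

Stripping the `-` characters from the query row yields the ungapped query sequence.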
The optional BLOCK_COUNT field is used to denote the number of blocks in gapped alignments such as mapped paired-end reads, or multi-exon genes. BLOCK_COUNT is a positive integer.
The optional BLOCK_BEGS field contains a comma-separated list of block begin positions with S_BEG as offset.
The optional BLOCK_LENS field contains a comma-separated list of block lengths.
The optional BLOCK_TYPE field contains a comma-separated list of block types. Each block type is an integer code, one of the following:
0 = Gap (or intron)
1 = Non-gap
2 = CDS
3 = 5' UTR
4 = 3' UTR
Thus a paired-end read could result in the following KISS entry:
Contig1 10 27 ID00001 . . . . 3 0,5,10 5,5,8 1,0,1
ID00001: =====-----========
         Block1    Block2
A gene with a 5' UTR, 3 exons, and a 3' UTR with an intron can be described like this:
Contig1 10 42 GENE00001 . . . . 9 0,6,9,14,18,21,26,28,30 6,3,5,4,3,5,2,2,3 3,2,0,2,0,2,4,0,4
# CDS
= UTR
- Intron
GENE00001: ======###-----####---#####==--===
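Because BLOCK_BEGS is relative to S_BEG, absolute block coordinates follow from simple addition. The sketch below converts the gene example's BLOCK_BEGS and BLOCK_LENS into intervals on the subject; the function name `block_intervals` is hypothetical, and block ends are treated as inclusive on the assumption that they follow the same convention as S_END.

```python
def block_intervals(s_beg, block_begs, block_lens):
    """Return absolute (begin, end) pairs, with inclusive ends, for each block."""
    begs = [int(value) for value in block_begs.split(",")]
    lens = [int(value) for value in block_lens.split(",")]
    if len(begs) != len(lens):
        raise ValueError("BLOCK_BEGS and BLOCK_LENS differ in length")
    return [(s_beg + beg, s_beg + beg + length - 1)
            for beg, length in zip(begs, lens)]

# BLOCK_BEGS and BLOCK_LENS of the GENE00001 record, with S_BEG = 10.
intervals = block_intervals(10, "0,6,9,14,18,21,26,28,30", "6,3,5,4,3,5,2,2,3")
for begin, end in intervals:
    print(begin, end)

# In this record the blocks tile the feature without holes,
# and the last block ends at S_END = 42.
contiguous = all(nxt[0] == cur[1] + 1 for cur, nxt in zip(intervals, intervals[1:]))
print(contiguous, intervals[-1][1])
```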
Here are some real life KISS records from a BWA mapping of Solexa reads against a reference:
CP000046 49 92 1_524msxwXsN1 61.61 - . . 1 . . .
CP000046 50 93 5_LKLjAywXsN1 62.61 - . . 1 . . .
CP000046 51 94 1_64zDoxwXsN1 62.27 - . . 1 . . .
CP000046 55 98 3_WFMk4ywXsN1 61.52 - . 0:G>A,5:A>- 1 . . .
CP000046 59 102 5_XYjz6ywXsN1 60.98 + . 40:C>-,41:A>C 1 . . .
CP000046 59 102 5_XmvSlxwXsN1 62.73 + . 40:C>-,41:A>C 1 . . .
CP000046 64 107 7_8ZFZ3ywXsN1 62.32 + . . 1 . . .
CP000046 66 100 5_ay97zxwXsN1 61.97 + . . 1 . . .
CP000046 67 101 7_nqQ11ywXsN1 63.14 + . . 1 . . .
CP000046 67 110 5_eky4kxwXsN1 62.34 - . . 1 . . .
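As a small usage sketch, a few of the records above can be tallied by strand and filtered for those carrying alignment descriptors. The function name `summarize` is hypothetical. The lines are split on any whitespace here only because they are pasted from this page; real KISS files are tab-separated and should be split on tabs, since IDs may contain spaces.

```python
from collections import Counter

RECORDS = """\
CP000046 55 98 3_WFMk4ywXsN1 61.52 - . 0:G>A,5:A>- 1 . . .
CP000046 59 102 5_XYjz6ywXsN1 60.98 + . 40:C>-,41:A>C 1 . . .
CP000046 64 107 7_8ZFZ3ywXsN1 62.32 + . . 1 . . .
"""

def summarize(text):
    """Count records per strand and collect (Q_ID, descriptors) for aligned reads."""
    strands = Counter()
    with_align = []
    for line in text.splitlines():
        cols = line.split()  # use line.split("\t") on a real KISS file
        if len(cols) != 12:
            raise ValueError(f"expected 12 columns, got {len(cols)}")
        strands[cols[5]] += 1
        if cols[7] != ".":
            with_align.append((cols[3], cols[7].split(",")))
    return strands, with_align

strands, with_align = summarize(RECORDS)
print(dict(strands))
for q_id, descriptors in with_align:
    print(q_id, descriptors)
```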