This repo contains code for manipulating annotation file formats.
##Repository structure
The temp/ folder contains code that is not guaranteed to work.
The scripts outside this folder should work.
The test/ folder contains examples for checking the scripts.
##Meaning and intended purpose of the scripts
compare_vcf.py compares two non-reference samples by their differences from the reference. It outputs the variants (relative to the reference) that are present in the first sample, not the symmetric difference. It can be useful for comparing two samples, one of which is derived from the other.
Usage: compare_vcf.py <initial.vcf> <derived.vcf> <output.vcf>
Update: the script works, but bedtools intersect -v is much faster, so I strongly recommend using that well-known tool instead of my code.
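As an illustration of the comparison described above, here is a minimal sketch. It assumes the intended result is "variants present in the first file and absent from the second", and it matches variants on the exact (CHROM, POS, REF, ALT) tuple. This is not necessarily how compare_vcf.py works internally, and it differs from bedtools intersect -v, which compares by positional overlap. The function names are hypothetical.

```python
def variant_key(line):
    """Return (CHROM, POS, REF, ALT) for one tab-separated VCF data line."""
    fields = line.rstrip("\n").split("\t")
    return fields[0], fields[1], fields[3], fields[4]


def variants_only_in_first(first_lines, second_lines):
    """Yield header lines and the data lines of the first VCF whose
    variant key does not occur in the second VCF."""
    seen = {
        variant_key(line)
        for line in second_lines
        if line.strip() and not line.startswith("#")
    }
    for line in first_lines:
        if line.startswith("#"):
            yield line  # keep the header of the first file
        elif line.strip() and variant_key(line) not in seen:
            yield line
```

Both arguments can be open file handles or lists of lines, so the second file is read fully into memory while the first is streamed.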
correct_coord.py <input.gff/bed> <output.gff/bed>
Some programs (e.g. command-line Blast tabular output) list the coordinates of features in the order they find them, so start can be greater than stop. Such data cannot be used directly as input to bedtools. This script solves this problem.
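The core of such a fix can be sketched as follows. This is a minimal illustration, not the actual code of correct_coord.py; the function name and the column arguments are hypothetical. For GFF the start and end are columns 3 and 4 (0-based), for BED they are columns 1 and 2. The sketch leaves the strand column untouched.

```python
def fix_interval(line, start_col, end_col):
    """Swap start and end in a tab-separated line when start > end."""
    fields = line.rstrip("\n").split("\t")
    start, end = int(fields[start_col]), int(fields[end_col])
    if start > end:
        fields[start_col], fields[end_col] = str(end), str(start)
    return "\t".join(fields)
```

For example, fix_interval(line, 3, 4) would be applied to each GFF line and fix_interval(line, 1, 2) to each BED line.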
lexicoSV.py takes a bed file with reference-assisted annotation as input and returns the list of contigs where it finds genes from different chromosomes (could be due to structural variation or assembly errors). It is written for S. cerevisiae standard ORF names (SGD).
Usage: lexicoSV.py <input.bed> <output.txt>
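The idea can be sketched roughly as below. The sketch assumes that the ORF name sits in the fourth BED column and follows the systematic SGD pattern for nuclear ORFs (for example YAL001C, where the second letter encodes the chromosome). The function name is hypothetical and the real lexicoSV.py may work differently.

```python
import re
from collections import defaultdict

# Systematic nuclear ORF names: Y, chromosome letter A-P, arm L/R,
# three digits, strand W/C (an optional suffix such as -A may follow).
ORF_PATTERN = re.compile(r"^Y([A-P])[LR]\d{3}[WC]")


def contigs_with_mixed_chromosomes(bed_lines):
    """Map each contig to the chromosome letters of its genes and
    return only contigs carrying genes from more than one chromosome."""
    chroms_by_contig = defaultdict(set)
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # no name column, nothing to check
        match = ORF_PATTERN.match(fields[3])
        if match:
            chroms_by_contig[fields[0]].add(match.group(1))
    return {
        contig: sorted(chroms)
        for contig, chroms in chroms_by_contig.items()
        if len(chroms) > 1
    }
```

Names that do not match the pattern (mitochondrial ORFs, tRNAs, unnamed features) are simply ignored in this sketch.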
conversion.py is the main script in this folder.
Usage: conversion.py [-h] -f FUNCTION -i INPUT_FILENAME -o OUTPUT_FILENAME [--source SOURCE] [--cds_only CDS_ONLY]
Currently supported conversions: GTF to GFF3, GFF/GTF to BED, BED to GFF3.
rename_gff scripts are useful if you have an annotation in which genes are not given proper names (e.g. Maker2 output) and some knowledge about how these features should be named (e.g. from a related species or another subspecies / strain / cultivar etc).
rename_gff_by_table was written specifically for proteinortho output but should be applicable to any similar output.
proteinortho output looks like this:
# Species Genes Alg.-Conn. <sample1 name> <sample2 name>
The basic idea is to cluster sequences and use the table to rename entries.
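A minimal sketch of turning such a table into a rename map is shown below. It assumes a tab-separated table in which the sample columns start at index 3, missing genes are marked with `*`, and several genes in one cell are comma-separated. The function name and the default column indices are hypothetical, and only unambiguous one-to-one clusters are kept. The rename_gff_by_table script itself may handle these cases differently.

```python
def build_rename_map(table_lines, old_col=4, new_col=3):
    """Map names in column old_col to names in column new_col,
    keeping only one-to-one clusters."""
    mapping = {}
    for line in table_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip the header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= max(old_col, new_col):
            continue  # malformed or truncated row
        old_names = fields[old_col].split(",")
        new_names = fields[new_col].split(",")
        if len(old_names) != 1 or len(new_names) != 1:
            continue  # ambiguous cluster with several genes per sample
        if "*" in (old_names[0], new_names[0]):
            continue  # gene missing from one of the samples
        mapping[old_names[0]] = new_names[0]
    return mapping
```

The resulting dictionary can then be used to replace identifiers in the attribute column of the annotation being renamed.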
tests.py and tests.sh are just some scripts to test the code and report where it fails.
##Annotation file formats
###GFF2
GFF2 has nine required fields. GTF2 should be identical to GFF2.
- [0] seqname
- [1] source
- [2] feature
- [3] start
- [4] end
- [5] score
- [6] strand
- [7] frame
- [8] group
See more at https://genome.ucsc.edu/FAQ/FAQformat.html#format3, http://www.ensembl.org/info/website/upload/gff.html, and https://www.sanger.ac.uk/resources/software/gff/spec.html
###GTF2.2
GTF2.2 is similar to GFF2 but more strictly defined. It has eight required fields, with the 'group' field expanded into attributes plus optional comments:
- [0] seqname
- [1] source
- [2] feature
- [3] start
- [4] end
- [5] score
- [6] strand
- [7] frame
- [8] attributes + comments
Possible attributes: transcript_id / protein_id / gene_id
Of note: the name and the value of each attribute are separated with a space, and values are in quotes.
Biomart example:
1 ensembl gene 11193 15975 . + . gene_id "ENSECAG00000012421"; gene_version "1"; gene_name "SYCE1"; gene_source "ensembl"; gene_biotype "protein_coding";
See more at http://mblab.wustl.edu/GTF2.html
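The attribute syntax described above (name, space, quoted value, semicolon) can be parsed with a short regular expression. This is a minimal sketch with a hypothetical function name. It ignores unquoted values, and if a key is repeated only its last value is kept.

```python
import re

# One attribute: key, whitespace, double-quoted value, semicolon.
GTF_ATTRIBUTE = re.compile(r'(\S+)\s+"([^"]*)"\s*;')


def parse_gtf_attributes(column):
    """Return the attributes of a GTF ninth column as a dict."""
    return dict(GTF_ATTRIBUTE.findall(column))
```

Applied to the Biomart example above, the returned dictionary has the keys gene_id, gene_version, gene_name, gene_source and gene_biotype.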
###GFF3
GFF3 is the newer version of GFF. Some genome browsers (e.g. the UCSC genome browser) do not support it.
- [0] seqid
- [1] source
- [2] type
- [3] start
- [4] end
- [5] score
- [6] strand
- [7] phase
- [8] attributes
Possible attributes: ID, Name, Alias, Parent, Target, Gap, Derives_from, Note, Dbxref, Ontology_term.
Of note: the name and the value of each attribute are separated with a =, and values are in plain text (no quotes).
poff preferred example:
gi<something> sim CDS 1 1 . + . ID=C_1;
See more at http://gmod.org/wiki/GFF3#GFF3_Format
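For comparison with GTF, a GFF3 attribute column (name=value pairs separated by semicolons) can be parsed as sketched below. The function name is hypothetical. The sketch assumes that multiple values of one attribute are comma-separated and that special characters are percent-encoded, so every value is returned as a list of decoded strings.

```python
from urllib.parse import unquote


def parse_gff3_attributes(column):
    """Return the attributes of a GFF3 ninth column as a dict of lists."""
    attributes = {}
    for pair in column.strip().split(";"):
        pair = pair.strip()
        if not pair:
            continue  # tolerate a trailing semicolon
        key, _, value = pair.partition("=")
        attributes[key] = [unquote(v) for v in value.split(",")]
    return attributes
```

Returning lists for every attribute keeps the handling of single-valued attributes such as ID and multi-valued ones such as Parent uniform.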
###VCF
VCF is designed to store information about sequence variations.
- [0] CHROM
- [1] POS
- [2] ID
- [3] REF
- [4] ALT
- [5] QUAL
- [6] FILTER
- [7] INFO
- [8] FORMAT
- [9] ...
VCF4.1 and VCF4.2 differ mostly in the INFO field.
See more at http://samtools.github.io/hts-specs/VCFv4.1.pdf and http://samtools.github.io/hts-specs/VCFv4.2.pdf
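The column layout above can be read into a simple record as sketched below. This is only an illustration with hypothetical names, not a validating parser: INFO values are kept as strings, flags without a value are stored as True, and the sample columns are split on the colon to line up with FORMAT.

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line):
    """Split one tab-separated VCF data line into a dict."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")  # several ALT alleles are allowed
    info = {}
    for item in record["INFO"].split(";"):
        if item and item != ".":
            key, sep, value = item.partition("=")
            info[key] = value if sep else True  # flags carry no value
    record["INFO"] = info
    # FORMAT and the sample columns are optional
    record["FORMAT"] = fields[8].split(":") if len(fields) > 8 else []
    record["SAMPLES"] = [sample.split(":") for sample in fields[9:]]
    return record
```

Header lines (those starting with #) should be filtered out before calling the function.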
###BED
BED is the most flexible of these formats. It can describe any kind of genomic interval and has only 3 required fields plus 9 optional fields:
- [0] chrom
- [1] chromStart
- [2] chromEnd
- [3] name
- [4] score
- [5] strand
- [6] thickStart
- [7] thickEnd
- [8] itemRgb
- [9] blockCount
- [10] blockSizes
- [11] blockStarts
See more at https://genome.ucsc.edu/FAQ/FAQformat.html#format1
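One detail matters when converting between these formats: GFF coordinates are 1-based and inclusive, while BED coordinates are 0-based and half-open, so the start must be decreased by one when going from GFF to BED. The sketch below illustrates this for a six-column BED line. The function name, the name_key parameter and the choice of 0 for a missing score are assumptions made for the example, and this is not the code of conversion.py.

```python
def gff3_to_bed6(gff_line, name_key="ID"):
    """Convert one GFF3 line to a BED6 line (chrom, start, end, name, score, strand)."""
    f = gff_line.rstrip("\n").split("\t")
    seqid, start, end = f[0], int(f[3]), int(f[4])
    score, strand, attrs = f[5], f[6], f[8]
    name = "."
    for pair in attrs.split(";"):
        key, _, value = pair.strip().partition("=")
        if key == name_key:
            name = value
            break
    bed_score = "0" if score == "." else score  # assumed default for a missing score
    # GFF start is 1-based inclusive; BED start is 0-based, end is exclusive
    return "\t".join([seqid, str(start - 1), str(end), name, bed_score, strand])
```

Going in the opposite direction, from BED to GFF3, the start would be increased by one and the end left unchanged.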