User guide
GenoVi generates circular genome representations for complete or draft bacterial and archaeal genomes. The GenoVi pipeline combines several Python scripts to automatically generate all the files needed by Circos, including customisable options for colour palettes, fonts, font format, background colour and scaling options for complete genomes comprising more than one replicon. Optionally, GenoVi's built-in workflow integrates DeepNOG to annotate COG categories using alignment-free methods with user-defined thresholds.
- Description
- Index
- Requirements
- Installation
- Usage
- Tutorials
- Arguments
- Scripts
- Output
- Publication
- Acknowledgements
- Citation and License
- Circos 0.69-8
- Python 3.7 or later
- DeepNOG 1.2.3
- NumPy 1.20.2
- Pandas 1.2.4
- Biopython 1.79
- CairoSVG 2.5.2
- Perl 5
- List::MoreUtils (Perl library)
GenoVi dependencies can be installed in a dedicated environment with Python 3.7 or later.
conda create -n genovi python=3.7 circos
Activate the environment
conda activate genovi
GenoVi can then be installed using pip
pip install genovi
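After installation, a quick sanity check can confirm that the tools are reachable. The sketch below is only an illustration: it assumes that the conda and pip steps above place executables named circos and genovi on the PATH, and the function name check_environment is hypothetical.

```python
import shutil
import sys

# Assumption: these executable names are available on PATH after installation.
REQUIRED_TOOLS = ("circos", "genovi")


def check_environment(tools=REQUIRED_TOOLS, min_python=(3, 7)):
    """Return a dict mapping each check to True (ok) or False (missing)."""
    report = {"python": sys.version_info[:2] >= min_python}
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        report[tool] = shutil.which(tool) is not None
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```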
genovi [-h] [options ..] -i input_file -s status
- -i, --input_file. GenBank input file path.
- -o, --output_file. Output file name. Default: genovi.
- -s, --status. “complete” or “draft”. Complete genomes are drawn as separate circles for each contig/replicon.
- -h, --help. Shows this help message and exits.
- --version. Shows the currently installed version of genovi.
- -cu, --cogs_unclassified. Do not classify each coding sequence into Clusters of Orthologous Groups of proteins (COGs).
- --cogs COGS. Specifies which COG categories to include in the circular representation. For example, 'ABJKLX'.
- -b, --deepnog_confidence_threshold. DeepNOG confidence threshold, in the range [0, 1]. Default: 0. If provided, predictions below the threshold are discarded.
- -a, --alignment. When --status complete is specified, this flag defines the alignment of each individual contig. Options: center, top, bottom, A (first on top), < (first to the left), U (two on top, the rest below). By default, this is defined by contig sizes.
- --scale. When using --status complete, whether to use a different scale format to ensure visibility. Options: variable, linear, sqrt. Default: sqrt.
- -k, --keep_temporary_files. Keep temporary files.
- -r, --reuse_predictions. If available, reuse DeepNOG prediction results from the previous run. Useful only if the --keep_temporary_files flag is enabled.
- -w, --window. Window size (in base pairs) used for the GC analysis. Default: 5000.
- -v, --verbose. Enables verbose, in-console log messages.
- -c, --captions_not_included. Do not include captions in the figure.
- -cp, --captions_position. Captions position. Options: left, right, auto.
- -t, --title. Figure title.
- --title_position. Title position. Options: center, top, bottom.
- --italic_words. How many title words should be written in italic. Default: 2.
- --size. Displays the genome size of each independent circular representation.
- -te, --tracks_explain. Includes additional text on each track.
- -cs, --colour_scheme. Prebuilt color scheme to use for CDS, RNAs, and GC analysis. Options: strong, autumn, dawn, blossom, paradise, neutral, blue, purple, soil, grayscale, velvet, pastel, ocean, wood, beach, desert, ice, island, forest, toxic, fire, spring.
- -bc, --background. Background colour, in R, G, B format. Default: transparent.
- -fc, --font_colour. Font color. Default: black.
- -pc, --CDS_positive_colour. Colour for positive CDSs, in R, G, B format. Default: '180, 205, 222'.
- -nc, --CDS_negative_colour. Colour for negative CDSs, in R, G, B format. Default: '53, 176, 42'.
- -tc, --tRNA_colour. Colour for tRNAs, in R, G, B format. Default: '150, 5, 50'.
- -rc, --rRNA_colour. Colour for rRNAs, in R, G, B format. Default: '150, 150, 50'.
- -cc, --GC_content_colour. Colour for GC content, in R, G, B format. Default: '23, 0, 115'.
- -sc, --GC_skew_colour. Colour scheme for positive and negative GC skew. A pair of RGB colors. Default: '140, 150, 198 - 158, 188, 218'.
- -sl, --GC_skew_line_colour. Colour for GC skew line. Default: black.
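To make the 'R, G, B' and 'R, G, B - R, G, B' formats concrete, here is a minimal validation sketch that could be run before passing colour strings to genovi. It is not GenoVi's own parser; the function names parse_rgb and parse_rgb_pair and the 0 to 255 range check are assumptions for illustration.

```python
def parse_rgb(text):
    """Parse a string such as '180, 205, 222' into a tuple of three ints."""
    parts = [part.strip() for part in text.split(",")]
    if len(parts) != 3:
        raise ValueError(f"expected three components, got {text!r}")
    values = tuple(int(part) for part in parts)
    # Assumption: each component is an 8-bit value.
    if any(value < 0 or value > 255 for value in values):
        raise ValueError(f"component out of range in {text!r}")
    return values


def parse_rgb_pair(text):
    """Parse '140, 150, 198 - 158, 188, 218' into two RGB tuples."""
    left, separator, right = text.partition("-")
    if not separator:
        raise ValueError(f"expected two colours separated by '-', got {text!r}")
    return parse_rgb(left), parse_rgb(right)


print(parse_rgb("180, 205, 222"))
print(parse_rgb_pair("140, 150, 198 - 158, 188, 218"))
```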
genovi -i input_test/Corynebacterium_alimapuense_VA37.gbk -s draft -cs paradise --cogs_unclassified -bc white
This command renders a basic genome representation in png and svg formats, using the paradise color scheme and a white background. All contigs from the Corynebacterium alimapuense VA37 genome are drawn in a single circle (default behavior). From outside to inside, the tracks show the contig lengths (contigs alternately depicted in black and white), positive-strand and negative-strand coding sequences (CDSs), GC content, and finally GC skew.
genovi -i input_test/Acinetobacter_radioresistens_DD78.gbff -cs strong -s complete --size
This command renders an image separating each scaffold as an independent chromosome or plasmid showing its size in the middle. Additional image files are generated for each chromosome or plasmid, as 1.png and 1.svg, 2.png and 2.svg, and so on.
There is an additional option to render multiple genomes at once using a folder as input. All genomes will be drawn either as draft or as complete. To differentiate each ideogram, use --title 'filename', so that each filename is displayed as the title of its circular representation. As output, one folder is delivered for each file, containing a circular representation, general statistics, and COG information. Additionally, a joined figure and combined general and COG information tables are produced.
genovi -i input_test/Brevibacterium_Genomes -cs blossom -s draft --title 'filename'
Supplementary to the representations described above, the image includes two colored circumferences showing a DeepNOG COG classification of each CDS.
genovi -i input_test/Acinetobacter_radioresistens_DD78.gbff -cs paradise --scale linear --alignment '<' -s complete
By default, circles are scaled using a square root scale, so small plasmids are still visible. If a linear scale is needed, you may specify it explicitly with --scale linear.
The arrangement of the circles can also be changed, either by placing them on a line or by using a more complex layout like this one, where the chromosome is on the left and the plasmids are lined up on the right.
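When the tutorials are scripted from Python, building the command as a list avoids shell quoting issues with values such as '<'. The sketch below uses only flags documented in this guide; the helper names build_genovi_command and run_genovi are hypothetical and are not part of GenoVi.

```python
import subprocess


def build_genovi_command(input_file, status, colour_scheme=None,
                         scale=None, alignment=None):
    """Assemble a genovi command line as a list of arguments."""
    if status not in ("draft", "complete"):
        raise ValueError("status must be 'draft' or 'complete'")
    command = ["genovi", "-i", input_file, "-s", status]
    if colour_scheme:
        command += ["-cs", colour_scheme]
    if scale:
        command += ["--scale", scale]
    if alignment:
        # Passed as a separate list element, so '<' needs no shell quoting.
        command += ["--alignment", alignment]
    return command


def run_genovi(command):
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(command, check=True)


command = build_genovi_command(
    "input_test/Acinetobacter_radioresistens_DD78.gbff",
    "complete", colour_scheme="paradise", scale="linear", alignment="<",
)
print(command)
```

Calling run_genovi(command) would then execute the render, provided genovi is installed in the active environment.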
-i, --input_file. This mandatory argument specifies the path of the annotated genome file to be drawn. Accepted files are in GenBank format (.gbk and .gbff) and may be gzipped (.gz or .z). If a directory is specified instead, all of the supported files inside it will be drawn, summary tables will be generated and, if --status draft is specified, an additional image will be created including all of the assemblies (useful for comparative analysis).
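To preview which files in a folder would qualify as input, the extension rules above can be sketched as follows. This is an illustration only: the helper names are hypothetical, and case-insensitive matching is an assumption rather than documented GenoVi behaviour.

```python
from pathlib import Path

GENBANK_EXTENSIONS = (".gbk", ".gbff")
COMPRESSED_EXTENSIONS = (".gz", ".z")


def is_supported(path):
    """Return True for GenBank files, optionally with a compression suffix."""
    # Assumption: extensions are compared case-insensitively.
    suffixes = [suffix.lower() for suffix in Path(path).suffixes]
    if suffixes and suffixes[-1] in COMPRESSED_EXTENSIONS:
        suffixes = suffixes[:-1]  # look at the extension under the compression
    return bool(suffixes) and suffixes[-1] in GENBANK_EXTENSIONS


def find_genomes(directory):
    """List supported files directly inside a directory, sorted by name."""
    return sorted(
        entry for entry in Path(directory).iterdir()
        if entry.is_file() and is_supported(entry)
    )


for name in ("P_xenovorans_LB400.gbff", "genome.gbk.gz", "reads.fasta"):
    print(name, is_supported(name))
```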
-s, --status. Specifies whether your genome is complete or draft. If draft is selected (default), each contig is drawn in the same circular genome representation. If complete is selected, GenoVi draws a separate circle for each contig, generating several figures: one for each contig and a concatenated one. Below, the Paraburkholderia xenovorans genome is shown as both a draft and a complete genome. The -c flag omits the caption.
genovi -i input_test/P_xenovorans_LB400.gbff -cs autumn -s draft -c
Paraburkholderia xenovorans LB400 as a draft genome drawing
genovi -i input_test/P_xenovorans_LB400.gbff -cs autumn -a A -s complete -c
Paraburkholderia xenovorans LB400 as a complete genome drawing. '-a A' will set the largest scaffold on top, and the rest below.
-h, --help. Displays the help message.
--version. Displays the current version of GenoVi.
-o, --output_file. Output file name. GenoVi generates the image in both vector (svg) and raster (png) formats. This argument specifies the name of the image to create and the name of the directory that holds additional figures if --status complete is defined. The file extension should not be included as part of this argument.
-cu, --cogs_unclassified. By default, DeepNOG predicts the Clusters of Orthologous Groups of proteins (COGs) of each coding sequence (CDS). Use this flag to specify that you do not want CDSs to be classified into COGs. This saves time and lets you run the program even if you don’t have DeepNOG installed on your machine.
-b, --deepnog_confidence_threshold. DeepNOG confidence threshold, in the range [0, 1]. Predictions below the threshold are discarded. This is equivalent to DeepNOG's infer -c/--confidence_threshold argument.
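The effect of the threshold can be pictured with a small sketch. The filtering itself is performed by DeepNOG, so this is only an illustration; the tuple layout (CDS identifier, COG category, confidence) and the sample values are hypothetical.

```python
def filter_predictions(predictions, threshold=0.0):
    """Keep predictions whose confidence is at or above the threshold.

    Each prediction is a (cds_id, cog_category, confidence) tuple.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be within [0, 1]")
    return [p for p in predictions if p[2] >= threshold]


# Hypothetical predictions for three coding sequences.
predictions = [
    ("cds_1", "K", 0.98),
    ("cds_2", "E", 0.41),
    ("cds_3", "S", 0.75),
]
print(filter_predictions(predictions, threshold=0.5))
```

With the default threshold of 0, every prediction is kept.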
--cogs. By default, the figure shows the COG classification of every CDS. This can make it difficult to see the important information. With this argument you may specify a set of COG categories to draw. The argument is a string where each character represents a specific COG category, according to this table:
Character | COG |
---|---|
D | Cell cycle control, division, chromosome partitioning |
M | Cell wall/membrane/envelope biogenesis |
N | Cell motility |
O | Post-translational modification, protein turnover, chaperones |
T | Signal transduction mechanism |
U | Intracellular trafficking, secretion, and vesicular transport |
V | Defense mechanism |
W | Extracellular structures |
Y | Nuclear structure |
Z | Cytoskeleton |
A | RNA processing and modification |
B | Chromatin structure and dynamics |
J | Translation, ribosomal structure, and biogenesis |
K | Transcription |
L | Replication, recombination, and repair |
X | Mobilome: prophages, transposons |
C | Energy production and conversion |
E | Amino acid transport and metabolism |
F | Nucleotide transport and metabolism |
G | Carbohydrate transport and metabolism |
H | Coenzyme transport and metabolism |
I | Lipid transport and metabolism |
P | Inorganic ion transport and metabolism |
Q | Secondary metabolites biosynthesis, transport, and metabolism |
R | General function prediction only |
S | Function unknown |
There are also a few shortcuts available: cel- for DMNOTUVWYZ (cellular processes and signaling), inf- for ABJKLX (information storage and processing), met- for CEFGHIPQ (metabolism) and finally poo- for poorly characterized sequences.
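The shortcut expansion can be sketched as a simple lookup. This is not GenoVi's implementation: the function name expand_cogs is hypothetical, and mapping poo- to the letters R and S is an assumption based on the two poorly characterized rows of the table above. Numeric values (top-N selection) are outside this sketch.

```python
COG_SHORTCUTS = {
    "cel-": "DMNOTUVWYZ",  # cellular processes and signaling
    "inf-": "ABJKLX",      # information storage and processing
    "met-": "CEFGHIPQ",    # metabolism
    "poo-": "RS",          # assumption: poorly characterized (R and S)
}
ALL_COGS = set("".join(COG_SHORTCUTS.values()))


def expand_cogs(argument):
    """Turn a --cogs value into a string of single-letter COG categories."""
    if argument in COG_SHORTCUTS:
        return COG_SHORTCUTS[argument]
    letters = argument.upper()
    unknown = sorted(set(letters) - ALL_COGS)
    if unknown:
        raise ValueError(f"unknown COG categories: {''.join(unknown)}")
    return letters


print(expand_cogs("met-"))
print(expand_cogs("QX"))
```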
For instance, the following command draws the genome of Rhodococcus sp. H-CA8f, displaying only the metabolism-related COGs. This strain has a complete genome assembly, with one chromosome and one plasmid, so the -s complete flag should be used. If no color scheme is specified, genovi will use the strong color palette, which is colorblind-safe.
genovi -i input_test/Rhodococcus_H-CA8f.gbff -s complete --cogs met-
Rhodococcus sp. H-CA8f as a complete genome drawing displaying only metabolism COG categories.
There is an option to draw only specific COG categories using the --cogs flag. For example, displaying only the Q and X categories.
genovi -i input_test/Rhodococcus_H-CA8f.gbff -s complete --cogs QX
Rhodococcus sp. H-CA8f as a complete genome drawing displaying only X and Q COG categories.
Additionally, the --cogs flag accepts a number to display only the most frequent COG categories. For example, displaying only the top 5 COG categories.
genovi -i input_test/Rhodococcus_H-CA8f.gbff -s complete --cogs 5
Rhodococcus sp. H-CA8f as a complete genome drawing displaying only top 5 COG categories.
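Conceptually, a top-N selection ranks categories and keeps the first N. The sketch below ranks by the number of CDSs per category, which is an assumption about how "top" is defined; the function name and the sample data are hypothetical.

```python
from collections import Counter


def top_cog_categories(cds_categories, n):
    """Return the n most frequent COG letters, most frequent first."""
    counts = Counter(cds_categories)
    return [category for category, _ in counts.most_common(n)]


# Hypothetical per-CDS COG assignments: K x4, E x3, S x2, Q x1.
assignments = list("EEEKKKKSSQ")
print(top_cog_categories(assignments, 2))
```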
-a, --alignment. When drawing a complete genome, the circular representation of each contig can be aligned in three ways. A: the first contig above and the rest below, aligned horizontally. <: the first contig on the left and the rest on the right, aligned vertically. U: the first and second contigs on top and the rest below, aligned horizontally.
--scale. When drawing a complete genome, the relative size of each circular representation can be determined in three ways. A linear scale depicts circular representations proportional to the size of each contig. If variable is chosen, each circular representation is drawn on its own scale, and a rectangle indicates the scale factor (X times). The default is sqrt, a square root scale.
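The difference between the linear and sqrt options can be illustrated numerically. This sketch assumes that sizes are expressed relative to the largest contig and that the scale applies to the drawn diameter; both are assumptions, the function name is hypothetical, and the variable option is not covered.

```python
import math


def relative_sizes(contig_lengths, scale="sqrt"):
    """Return each contig's drawn size relative to the largest contig."""
    largest = max(contig_lengths)
    if scale == "linear":
        return [length / largest for length in contig_lengths]
    if scale == "sqrt":
        return [math.sqrt(length / largest) for length in contig_lengths]
    raise ValueError("this sketch only covers 'linear' and 'sqrt'")


# Hypothetical chromosome (4 Mbp) and plasmid (40 kbp).
lengths = [4_000_000, 40_000]
print(relative_sizes(lengths, "linear"))
print(relative_sizes(lengths, "sqrt"))
```

Under these assumptions the plasmid is drawn at 1% of the chromosome's size on a linear scale, but at about 10% on a square root scale, which is why small replicons remain visible by default.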
-k, --keep_temporary_files. Multiple files are generated within the user’s project folder and, by default, deleted upon completion. Specifying this argument stops the deletion of the files. Generated files are:
- circos.conf: Main CIRCOS configuration file.
- conf/colors_fonts_patterns.conf: Imports several files from the Circos distribution in order to define colors, fonts, and fill patterns.
- conf/highlight.conf: Defines ideogram highlights.
- conf/housekeeping.conf: Defines system and debug parameters.
- conf/image.conf: Imports generic Circos image configuration and background.
- conf/ticks.conf: Defines tick mark formatting.
- temp/_bands.kar: Contains band annotation positions of contigs and their color.
- temp/_CDS_neg.txt: Defines band annotation positions of negative-sense-strand coding sequences.
- temp/_CDS_pos.txt: Defines band annotation positions of positive-sense-strand coding sequences.
- temp/gbk_converted.fna: nucleotide fasta file converted from the original gbk.
- temp/GC_GC_content.wig: GC content percentage on each base-pair window.
- temp/GC_GC_skew.wig: Measures strand asymmetry in the distribution of guanines and cytosines on each base-pair window.
- temp/_rRNA_neg.txt: Defines band annotation positions of negative-sense-strand ribosomal RNA sequences.
- temp/_rRNA_pos.txt: Defines band annotation positions of positive-sense-strand ribosomal RNA sequences.
- temp/_tRNA_neg.txt: Defines band annotation positions of negative-sense-strand transfer RNA sequences.
- temp/_tRNA_pos.txt: Defines band annotation positions of positive-sense-strand transfer RNA sequences.
- temp/_prediction_deepnog.csv: Generated only if COG prediction is enabled (default behavior). Includes COG prediction and confidence for each coding sequence.
- temp/_CDS_pos_X.txt: Generated if COG prediction is enabled, one file for each COG category, “X” the corresponding letter. Defines band annotation positions of positive-sense-strand coding sequences of “X” COG category.
- temp/_CDS_neg_X.txt: Generated if COG prediction is enabled, one file for each COG category, “X” the corresponding letter. Defines band annotation positions of negative-sense-strand coding sequences of “X” COG category.
In the case of a complete genome, the files in the temp directory will be generated for each contig and identified with the prefix contig-X-, with X being 1, 2, 3, etc.
-w, --window. Window size for GC content and skew plotting. This indicates how many base pairs are considered in each calculation.
-v, --verbose. Displays additional information while executing GenoVi.
-c, --captions_not_included. By default, generated images include a caption with COGs and other colors. Use this flag to stop the program from including this caption.
-cp, --captions_position. Caption position. Options: left, right or auto.
-t, --title. Figure title, for example the genome being represented.
--title_position. Title position in the figure. Options: top, bottom, or center of the image.
--italic_words. If required, a number of words of the title can be written in italic. As the title is intended for organism specification, the default is 2. For example, if the title is “Paraburkholderia xenovorans LB400”, then “Paraburkholderia xenovorans” would be in italics, but “LB400” would not.
--size. Displays the genome size (in base pairs) of each circular representation.
As an example, let's use the genome of Streptomyces sp. H-KF8 to insert a genome title and size. This genome is in a permanent-draft state. We are going to insert the name of the strain as the title with the -t flag, place it on top of the figure using --title_position, and specify that only one word should be in italic with --italic_words. Additionally, the size of the genome will be displayed using --size. WARNING! The PNG version of the image may look odd because italic text transformation is not yet properly implemented. Please prefer the svg version instead.
genovi -i input_test/Streptomyces_H-KF8.gbff -s draft -t 'Streptomyces sp. H-KF8' --title_position top --italic_words 1 --size
Streptomyces sp. H-KF8 as a draft genome drawing displaying title and size.
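The way --italic_words divides a title can be sketched as a split on whitespace. This is only an illustration of the documented behaviour, not GenoVi's code, and the function name split_title is hypothetical.

```python
def split_title(title, italic_words=2):
    """Split a title into its italic part and its upright part."""
    words = title.split()
    italic = " ".join(words[:italic_words])
    upright = " ".join(words[italic_words:])
    return italic, upright


print(split_title("Paraburkholderia xenovorans LB400"))
print(split_title("Streptomyces sp. H-KF8", italic_words=1))
```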
-te, --tracks_explain. Adds a space break in the circular representation, including captions for each track within the ideogram.
As an example, consider the genome of the strain Alcaligenes aquatilis QD168. This genome is a complete assembly consisting of a single chromosome, so the choice between -s draft and -s complete is irrelevant. We will add the -te flag to add space in the ideogram, including captions for each feature. Additionally, we will use the blossom color palette.
genovi -i input_test/Alcaligenes_aquatilis_QD168.gbff -s complete -te -cs blossom
A. aquatilis QD168
-cs, --color_scheme. Prebuilt color scheme to use. Available color schemes include: strong, autumn, dawn, blossom, paradise, neutral, blue, purple, soil, grayscale, velvet, pastel, ocean, wood, beach, desert, ice, island, forest, toxic, fire, spring. The colour of specific parts of the image can be modified individually with --background, --CDS_positive_color, --CDS_negative_color, --tRNA_color, --rRNA_color, --GC_content_color, --GC_skew_color, and --GC_skew_line_color.
By default, genovi uses the strong color palette. The strong, autumn, dawn, blossom, and paradise color palettes are all colorblind-safe.
Main script. Uses custom arguments and calls the rest of the modules to generate the genome representations. Inputs and outputs are explained in the Arguments section.
Generates the .kar, CDS, rRNA, tRNA files for CIRCOS, and calls DeepNOG for predicting COGs.
Input:
- input file: GenBank file.
- output folder (-o/--output_folder): Path to the folder that will contain all raw files.
- CDS (-cds/--cds): CDS band files for CIRCOS will be created.
- tRNA (-trna/--trna): tRNA band files for CIRCOS will be created.
- rRNA (-rrna/--rrna): rRNA band files for CIRCOS will be created.
- COG categories (-gc/--get_categories): CDS COG categories will be predicted.
- Divided categories (-d/--divided): COG categories will be split in one file per category.
- Complete genome (-c/--complete_genome): Script will consider the input file to be a complete genome.
Writes the following CIRCOS configuration files: circos.conf, conf/highlight.conf, conf/colors_fonts_patterns.conf, conf/housekeeping.conf, conf/image.conf, and conf/ticks.conf.
Input:
- Min GC content (--content_min/--min_GC_content): Minimum GC content. Default: 0.
- Max GC content (--content_max/--max_GC_content): Maximum GC content. Default: 100.
- Min GC skew (--skew_min/--min_GC_skew): Minimum GC skew. Default: -1.
- Max GC skew (--skew_max/--max_GC_skew): Maximum GC skew. Default: 1.
- GC content color (-cc/--GC_content_color): GC content color. Default: '23, 0, 115'.
- GC skew color (-sc/--GC_skew_color)
- CDS positive color (-pc/--CDS_positive_color): Positive CDSs color.
- CDS negative color (-nc/--CDS_negative_color): Negative CDSs color.
Calculates GC percentage and GC skew of the genomic sequence, and writes them down to files.
Input:
- Input file (-i/--input_file): FASTA input file path.
- Window size (-w/--window_size): Number of base pairs over which the GC percentage is calculated.
- Shift increment (-s/--shift): Shift increment. By default, it is -1.
- Output file (-o/--output_file): Output file path. The default matches the input file path.
- Ignore trailing (-ot/--omit_tail): Trailing sequence will be omitted. By default, the leftover sequence is retained.
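The windowed calculation can be sketched as follows. This is not the script's actual code: it assumes the common definition of GC skew, (G - C) / (G + C), and it assumes that a shift of -1 means consecutive, non-overlapping windows. The function name gc_windows is hypothetical.

```python
def gc_windows(sequence, window=5000, shift=-1, omit_tail=False):
    """Return (start, gc_percent, gc_skew) for each window of the sequence."""
    if window <= 0:
        raise ValueError("window must be positive")
    # Assumption: a non-positive shift means "advance by one full window".
    step = window if shift <= 0 else shift
    seq = sequence.upper()
    results = []
    for start in range(0, len(seq), step):
        chunk = seq[start:start + window]
        if len(chunk) < window and omit_tail:
            break  # drop the leftover sequence at the end
        g, c = chunk.count("G"), chunk.count("C")
        gc_percent = 100.0 * (g + c) / len(chunk)
        # Skew is undefined without G or C; report 0.0 in that case.
        gc_skew = (g - c) / (g + c) if (g + c) else 0.0
        results.append((start, gc_percent, gc_skew))
    return results


for row in gc_windows("GGCCAATTGGGCGC", window=4):
    print(row)
```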
Transforms GenBank flat files into protein fasta format files. The output has the same name as the original file.
Transforms GenBank flat files into nucleotide fasta format files. The output has the same name as the original file.
Generates a .svg file with all scaled genome visualizations.
Input: List of dictionaries that includes filenames and each image's desired size. e.g. [{"fileName": "img1.svg", "size": 30000}, {"fileName": "img2.svg", "size": 10000}].
Adds the title and contig size to the visualization, and allows the legend color to be modified.
Parses color schemes.
Resulting images are saved in a folder called [name] as [name].svg and [name].png (the name being specified with the output_file argument or, by default, circos). In the case of a complete genome, individual contig image files are stored in a [name] subdirectory as [name]-contig_[i].png, with i ranging from 1 to the number of circles.
In addition to the images, if -k or --keep_temporary_files was specified, the files described under that argument in the user guide are also stored.
Three additional files are stored in [name] folder: a histogram displaying COG categories named [name]_COG_histogram.png; a file with the COG classification of each replicon named [name]_COG_Classification.csv; and a csv file named [name]_Gral_Stats.csv displaying general information of each replicon, including size, GC content, number of CDS, tRNA and rRNA.
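To check a run from a script, the naming scheme described above can be turned into a list of expected paths. This is a sketch built only from the names given in this section; it assumes that all files sit directly inside the [name] folder, and the function name expected_outputs is hypothetical.

```python
from pathlib import Path


def expected_outputs(name, n_contigs=0):
    """List the output paths described above for a given output name."""
    folder = Path(name)
    paths = [folder / f"{name}.svg", folder / f"{name}.png"]
    # Per-contig images are only produced for complete genomes.
    paths += [folder / f"{name}-contig_{i}.png" for i in range(1, n_contigs + 1)]
    paths += [
        folder / f"{name}_COG_histogram.png",
        folder / f"{name}_COG_Classification.csv",
        folder / f"{name}_Gral_Stats.csv",
    ]
    return paths


for path in expected_outputs("LB400", n_contigs=2):
    print(path, "exists" if path.exists() else "not found")
```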
Cumsille et al., 2022
GenoVi is under a BY-NC-SA Creative Commons License. Please cite Cumsille et al., 2022. You may remix, tweak, and build upon this work non-commercially, as long as you credit this work and license your new creations under identical terms.