This pipeline assemblies organelle genome from genomic skimming data.
Citation: Jian-Jun Jin*, Wen-Bin Yu*, Jun-Bo Yang, Yu Song, Ting-Shuang Yi, De-Zhu Li. 2018. GetOrganelle: a simple and fast pipeline for de novo assembly of a complete circular chloroplast genome using genome skimming data. bioRxiv, 256479. http://doi.org/10.1101/256479
License: GPL https://www.gnu.org/licenses/gpl-3.0.html
Bug&Usage contact: jinjianjun@mail.kib.ac.cn; yuwenbin@xtbg.ac.cn
Please cite the dependencies if they are used:
This pipeline was written in python 3.5.1, but compatible with 2.7.11. GetOrganelle is more efficient under Python 3.*.
Execute following simple git commands to download the latest version (suggested) or find older stable versions here:
# Supposing you are going to install it at ~/Applications/bin
GetOrganellePATH=~/Applications/bin
cd $GetOrganellePATH
git clone git://github.com/Kinggerm/GetOrganelle
then add GetOrganelle to the path:
# for MacOS
echo "PATH=$GetOrganellePATH/GetOrganelle:\$PATH" >> ~/.bash_profile
echo "PATH=$GetOrganellePATH/GetOrganelle/Utilities:\$PATH" >> ~/.bash_profile
echo "export PATH" >> ~/.bash_profile
# for Linux
echo "PATH=$GetOrganellePATH/GetOrganelle:\$PATH" >> ~/.bashrc
echo "PATH=$GetOrganellePATH/GetOrganelle/Utilities:\$PATH" >> ~/.bashrc
echo "export PATH" >> ~/.bashrc
and make them writable/executable if they are not:
chmod +x $GetOrganellePATH/GetOrganelle/*.py
chmod +x $GetOrganellePATH/GetOrganelle/Utilities/*.py
chmod +x $GetOrganellePATH/GetOrganelle/Library/*.py
chmod +w $GetOrganellePATH/GetOrganelle/Library/*Reference
It is also very IMPORTANT to keep updated (if you find your version is out of date!):
cd $GetOrganellePATH/GetOrganelle
rm Library/SeqReference/*index*
git pull
You could run the main script (get_organelle_reads.py) to get organelle reads (*.fastq) successfully, without any third-party libraries or software.
However, to get a complete organelle genome (such as a plastome) rather than organelle reads, other files in GetOrganelle are needed in the original relative path. Also, the following software/libraries are needed to be installed and added to the PATH, since they could be called automatically:
-
Python libraries numpy, scipy, sympy are used to solve the assembly graph, and could be easily installed by typing in:
pip install numpy scipy sympy
-
SPAdes is the assembler
-
Bowtie2 is used to speed up initial recruitment of target-like reads
-
BLAST+ is used to filter target-like contigs and simplify the final assembly graph
-
Bandage is suggested to view the final contig graph (
*.fastg
/*.gfa
).
Besides, if you installed python library psutil (pip install psutil), the memory cost of get_organelle_reads.py will be automatically logged.
What you actually need to do is just typing in one simple command as suggested in Example. But you are still invited to read the following introductions:
Preparing Data
Currently, this script was written for illumina pair-end/single-end data (fastq or fastq.gz). 1G per end is enough for chloroplast for most normal angiosperm samples, and 5G per end is enough for mitochondria data. You could simply assign a maximum number of reads (number of seqs, not number of bases) for GetOrganelle to use with flag --max-reads
or manually cut raw data into certain size before running GetOrganelle using the Linux or Mac OS build-in command (eg. head -n 20000000 large.fq > small.fq
).
Filtering and Assembly
Take your input reference (fasta or bowtie index; the default is Library/SeqReference/*.fasta
) as probe, the script would recruit target reads in successive rounds (extending process). You could also use a more related reference, which would be safer if the sequence quality is bad (say, degraded DNA samples). The value word size (followed with "-w"), like the kmer in assembly, is crucial to the feasibility and efficiency of this process. The best word size changes from data to data and will be affected by read length, read quality, base coverage, organ DNA percent and other factors. Since version 1.4.0, if there is no user assigned word size value, GetOrganelle would automatically estimate the initial word size based no the data characters and adjust the value ("--auto-wss") according to the behaviour of extending process. Although the automatically-estimated word size value does not ensure the best performance nor the best result, you do not need to adjust the value if a complete/circular organelle result is produced, because the circular result by GetOrganelle is generally consistent under different options. After extending, this script will automatically call SPAdes to assembly the target reads produced by the former step. The best kmer depends on a wide variety of factors too.
Producing Result
By default, SPAdes is automatically called and produce the assembly graph file filtered_spades/assembly_graph.fastg
. Then, Utilities/slim_fastg.py is called to modify the filtered_spades/assembly_graph.fastg
file and produce a new fastg file (would be assembly_graph.fastg.extend_plant_cp-del_plant_mt.fastg
if -F plant_cp been used) along with a tab-format annotation file (assembly_graph.fastg.extend_plant_cp-del_plant_mt.csv
).
The assembly_graph.fastg.extend_plant_cp-del_plant_mt.fastg
file along with the assembly_graph.fastg.extend_plant_cp-del_plant_mt.csv
file would be further parsed by disentangle_organelle_assembly.py, and your target sequence file(s) *complete*path_sequence.fasta
would be produced as the final result, if disentangle_organelle_assembly.py successfully solve the path. Otherwise, if disentangle_organelle_assembly.py failed to solve the path (produce *contigs*path_sequence.fasta
), you could use the incomplete sequence to conduct downstream analysis or manually view assembly_graph.fastg.extend_plant_cp-del_plant_mt.fastg
and load the assembly_graph.fastg.extend_plant_cp-del_plant_mt.csv
in Bandage, choose the best path(s) as the final result.
Here (or here) is a short video showing a standard way to extract the plastome from the assembly graph with Bandage. See here or here for more examples with more complicated (do not miss 3m01s - 5m53s
) situations.
For 2G raw data, 150 bp reads, to assembly chloroplast, typically I use:
get_organelle_reads.py -1 sample_1.fq -2 sample_2.fq -o chloroplast_output -R 15 -k 75,85,95,105 -F plant_cp
or in a draft way:
get_organelle_reads.py -1 sample_1.fq -2 sample_2.fq -o chloroplast_output --fast -k 75,85,95,105 -w 0.68 -F plant_cp
or in a slow and memory-economic way:
get_organelle_reads.py -1 sample_1.fq -2 sample_2.fq -s cp_reference.fasta -o chloroplast_output -R 30 -k 75,85,95,105 -F plant_cp --memory-save -a mitochondria.fasta
For 2G raw data, 150 bp reads, to assembly plant mitochondria
get_organelle_reads.py -1 sample_1.fq -2 sample_2.fq -s mt_reference.fasta -o mitochondria_output -R 50 -k 65,75,85,95,105 -P 1000000 -F plant_mt
For 2G raw data, 150 bp reads, to assembly plant nuclear ribosomal RNA (18S-ITS1-5.8S-ITS2-26S)
get_organelle_reads.py -1 sample_1.fq -2 sample_2.fq -o nr_output -R 7 -k 95,105,115 -P 0 -F plant_nr
For 1G raw data, 150 bp reads, to assembly fungus mitochondria (currently only tested on limited samples, suggested parameters might not be the best)
get_organelle_reads.py -1 sample_1.fq -2 sample_2.fq -s fungus_mt_reference.fasta --genes fungus_mt_genes.fasta -R 3 -k 65,75,85,95,105 -F fungus_mt
For 1G raw data, 150 bp reads, to assembly animal mitochondria (currently only tested on limited samples, suggested parameters might not be the best)
get_organelle_reads.py -1 sample_1.fq -2 sample_2.fq -s animal_mt_reference.fasta --genes animal_mt_genes.fasta -R 3 -k 65,75,85,95,105 -F animal_mt
See the detailed illustrations of those arguments by typing in:
get_organelle_reads.py -h
or see verbose illustrations:
get_organelle_reads.py --help
Also see GetOrganelleComparison for a benchmark test of GetOrganelle
and NOVOPlasty
using 50 online samples.
It was previously cited as GetOrganelle (https://github.com/Kinggerm/GetOrganelle), but now we have a report paper (see above) to cite.
Yu Song, Wen-Bin Yu, Yun-Bong Tan, Bing Liu, Xin Yao, Jian-Jun Jin, Michael Padmanaba, Jun-Bo Yang, Richard T. Corlett. 2017. Evolutionary comparisons of the chloroplast genome in Lauraceae and insights into loss events in the Magnoliids. Genome biology and evolution. 9(9): 2354-64. doi: https://doi.org/10.1093/gbe/evx180
Twyford AD, Ness RW. 2017. Strategies for complete plastid genome sequencing. Molecular Ecology Resources. 17(5):858-68. doi: https://doi.org/10.1111/1755-0998.12626
Guan-Song Yang, Yin-Huan Wang, Yue-Hua Wang, Shi-Kang Shen. 2017. The complete chloroplast genome of a vulnerable species Champereia manillana (Opiliaceae). Conservation Genetics Resources. 9(3): 415-418. doi: https://doi.org/10.1007/s12686-017-0697-1
Thanks to Chao-Nan Fu, Han-Tao Qin, Xiao-Jian Qu, Shuo Wang, and Rong Zhang for giving tests or suggestions.