This package is an integrated tool for microbial sequence analyses. We refactored and packed the methods of the published research and evaluated the reproducibility with the same raw data.
- Repository structure
- Available functions
- Download Seq2Geno and ensure the environment
- Usage and Input
- Example data and usages
- Train the phenotypic predictor with the Seq2Geno results
- FAQ
- License
- Contact
- Citation
This repository includes:
- install: information and scripts for installation
- examples: the example data (the tutorial )
- main: the scripts for user interface and the calling methods of core workflows
- snps: the scripts for generating SNPs table
- denovo: the scripts for computing de novo assemblies and the gene presence/absence table
- expr: the scripts for calculating expression levels
- phylo: the scripts for phylogenetic tree inference
- difexpr: the methods for identifying differentially expressed genes with the expression levels matrix
- cont_anc_rcn: ancestral reconstruction for continuous data such as expression levels
- detect single nucleotide variants
- create de novo assemblies
- compute gene presence/absence and the indels
- count gene expression levels
- infer the phylogenetic tree
- find differentially expressed genes (additional data that won't be used by Geno2Pheno)
- reconstruct ancestral values of expression level (additional data that won't be used by Geno2Pheno)
-
Check the prerequisites
-
Download this package
git clone --recurse-submodules https://github.com/hzi-bifo/seq2geno.git
cd seq2geno
git submodule update --init --recursive
The command uses --recurse-submodules to download the submodules. The flag is available only in git version >2.13. Earlier git versions might have the substitute. After the package is downloaded, main/seq2geno and main/seq2geno_gui are the executable scripts for Seq2Geno.
- To ensure the environment and test the package, please either follow the steps in install/README.md or use the automatic tools:
cd install/
./SETENV.sh snakemake_env
conda activate snakemake_env
./TESTING.sh
Once the environment is properly set up, Seq2Geno can be launched using the graphical user interface (GUI) or command line
The commands
S2G
or
seq2geno_gui
will launch the graphic user interface. Use the tool to read, edit, or save the arguments in a yaml file. Once the arguments are ready, the analyses can be launched with this interface; for large-scale researches, however, generating the yaml file and launching the analyses with the command line method (described below) might be more convenient, as having processes running in the background should be more convenient. To learn more, please read the manual doc/GUI_manual.pdf.
S2G -d -f [options_yaml] -z [zip_input] -l [log_file] --outzip [output_zip_type] --to_gp
Both options_yaml and zip_input specify the materials to use (see the examples/). At least one of them should be used. When options_yaml is properly set, zip_input will be neglected. The options_yaml describes all the options and paths to input data for Seq2Geno. The zip_input packs all the materials and has a structure that Seq2Geno can recognize (see input_zip_structure.md for more details).
The log_file should be a non-existing filename to store the log information; if not set, the messages will be directed to stdout and stderr.
The output_zip_type should be one of 'none' (default), 'all', 'main', or 'g2p'. The choice specifies whether or how the output results should be packed into an zip file.
The flag --to_gp
specifies whether to submit the results to the Geno2Pheno server.
The input file is an yaml file where all options are described (a template in examples/). The file includes two parts:
- features:
option | action | values ([default]) |
---|---|---|
dryrun | display the processes and exit | [Y]/N |
snps | SNPs calling | Y/[N] |
denovo | creating de novo assemblies | Y/[N] |
expr | counting expression levels | Y/[N] |
phylo | inferring the phylogeny | Y/[N] |
de | differential expression | Y/[N] |
ar | ancestral reconstruction of expression levels | Y/[N] |
To only create the folder and config files, please turn off the last six options.
-
general (* mandatory):
-
cores: number of cpus (integer; automatically adjusted if larger than the available cpu number)
-
mem_mb: memory size to use (integer in mb; automatically adjusted if larger than the free memory). Note: some processes may crush because of insufficiently allocated memory
-
*wd: the working directory. The intermediate and final files will be stored under this folder. The final outcomes will be symlinked to the sub-directory RESULTS/.
-
*dna_reads: the list of DNA-seq data
It should be a two-column list, where the first column includes all samples and the second column lists the paired-end reads files. The two reads file are separated by a comma. The first line is the first sample.
sample01 /paired/end/reads/sample01_1.fastq.gz,/paired/end/reads/sample01_2.fastq.gz sample02 /paired/end/reads/sample02_1.fastq.gz,/paired/end/reads/sample02_2.fastq.gz sample03 /paired/end/reads/sample03_1.fastq.gz,/paired/end/reads/sample03_2.fastq.gz
- *ref_fa, ref_gff, ref_gbk: the data of reference genome
The fasta, gff, and genbank files of a reference genome. They should have same sequence ids.
-
old_config: if recognizable, the config files that were previously stored in the working directory will be reused. ('Y': on; 'N': off)
-
rna_reads: the list of RNA-seq data. (string of filename)
It should be a two-column list, where the first column includes all samples and the second column lists the short reads files. The first line is the first sample.
sample01 /transcription/reads/sample01.rna.fastq.gz sample02 /transcription/reads/sample02.rna.fastq.gz sample03 /transcription/reads/sample03.rna.fastq.gz
- phe_table: the phenotype table (string of filename)
The table is tab-separated. For n samples with m phenotypes, the table is (n+1)-by-(m+1) as shown below. The first column should be sample names. The header line should includes names of phenotypes. Missing values are acceptable.
strains virulence sample01 high sample02 mediate sample03 low
- adaptor: the adaptor file (string of filename)
The fasta file of adaptors of DNA-seq. It is used to process the DNA-seq reads.
-
The folder examples/ includes a structured zip file and a yaml file--the two input formats that Seq2Geno can recognize. The zip file can be used as the input with this command:
S2G -z examples/example_input/seq2geno_input.zip\
-l examples/example_input_zip.log\
--outzip g2p
To use the configuration yaml file, please ensure unpacked example data (that is, the zip file) and edit the yaml file to ensure the right paths to those example data. After they are ready, please run with this command:
S2G -f examples/example_input/seq2geno_input.yml\
-l exapmles/seq2geno_input_yml.log\
--outzip g2p
To include automatic submission to the Geno2Pheno server, just use the flag
--to_gp
:
S2G -f examples/example_input/seq2geno_input.yml\
-l exapmles/seq2geno_input_yml.log\
--outzip g2p --to_gp
Please directly visit Geno2Pheno
Why the analyses crushed?
Please check the log file or STDOUT and STDERR and determine the exact error.
Will every procedure be rerun if I want to add one sample?
No, you just need to add one more line in your reads list (i.e., the dna or the rna reads. See section arguments for more details.) and then run the same workflow again. Seq2Geno uses Snakemake to determine whether certain intermediate data need to be recomputed or not.
Will every procedure be rerun if I want to exclude one sample?
No; however, besides excluding that sample from the reads list, you will need to remove all the subsequent results that were previously computed. That could be risky.
Will every procedure be rerun if I accidentally delete some data?
No, only the deleted one and the subsequent data will be recomputed.
Where is the final data?
In the working directory, the main results are collected in the subfolder RESULTS/
. You can also find the other intermediate data in the corresponding folders (e.g., mapping results)
What is the current status?
If the log file was specified, you could check the log file to determine the current status. Otherwise, the status should be directed to your STDOUT or STDERR.
GPLv3 (please refer to LICENSE)
Please contact Tzu-Hao Kuo (Tzu-Hao.Kuo@helmholtz-hzi.de) and specify:
- the error message or unexpected situation
- how to reproduce the problem
We will be publishing the paper for the joint work of Seq2Geno and Geno2Pheno. Before that, please use
Asgari*, E., Kuo*, T.-H., Bremges, A., Robertson, G., Weimann, A. & McHardy, A. C. (2021). Seq2Geno2Pheno [A Computational Workflow for Phenotype Predictive-Modeling and Biomarker Detection from Microbial Sequence Data]
or
@software{seq2geno2pheno2021,
author = {Ehsaneddin Asgari*, Tzu-Hao Kuo*, Andreas Bremges, Gary Robertson, Aaron Weimann, Alice C. McHardy},
title = {Seq2Geno2Pheno: A Computational Workflow for Phenotype Predictive-Modeling and Biomarker Detection from Microbial Sequence Data},
version = {v2.00001},
date = {2021-07-07},
}