strainFocus is a workflow for identifying the bacterial strains present in metagenomic samples. It produces visualizations that help users interpret the evolutionary relationships among the strains found in their samples and publicly available genomes of the same species.
To run strainFocus, we first run KneadData (Lloyd-Price et al., 2019), a quality control tool, on the raw reads, generating trimmed reads that are cleaned of contamination. To map the cleaned reads to reference genomes, we run HUMAnN (Franzosa et al., 2018) against the UniRef90 database. We then use StrainPhlAn (Truong et al., 2017) to concatenate the species marker genes present in the samples. A script fetches all other publicly available genomes of interest, and we compare the nucleotide similarity of the sample genomes and the fetched genomes using RAxML, calculating best-fit and parsimony trees. We run StrainPhlAn with the optional arguments --marker_in_n_samples 20 and --sample_with_n_markers 5 to balance conservative inclusion criteria against the desire to include as many samples as possible in the downstream phylogenetic analyses. The resulting multiple sequence alignment (MSA) files are used for further downstream analysis and strain comparison.
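To make the effect of the two StrainPhlAn thresholds concrete, here is a minimal sketch of a two-step filter on a marker presence table. It assumes that --marker_in_n_samples is a minimum percentage of samples in which a marker must be present, and that --sample_with_n_markers is a minimum number of markers a sample must carry. Those semantics, the order of the two steps, and the names filter_markers_and_samples and presence are illustrative assumptions, not StrainPhlAn's own code; check the StrainPhlAn documentation for the exact behavior.

```python
def filter_markers_and_samples(presence, marker_in_n_samples=20.0,
                               sample_with_n_markers=5):
    """Filter a {sample_name: set_of_marker_names} table.

    Assumed semantics (hypothetical, see the lead-in):
      marker_in_n_samples   -- minimum percentage of samples containing a marker
      sample_with_n_markers -- minimum number of retained markers per sample
    """
    if not presence:
        return {}

    # Count in how many samples each marker appears.
    counts = {}
    for markers in presence.values():
        for marker in markers:
            counts[marker] = counts.get(marker, 0) + 1

    # Step 1: keep markers that are present in enough samples.
    min_samples = marker_in_n_samples / 100.0 * len(presence)
    kept_markers = {m for m, c in counts.items() if c >= min_samples}

    # Step 2: keep samples that still carry enough of the retained markers.
    filtered = {}
    for sample, markers in presence.items():
        retained = markers & kept_markers
        if len(retained) >= sample_with_n_markers:
            filtered[sample] = retained
    return filtered
```

Lower thresholds admit more samples and markers into the alignment at the cost of more missing data; higher thresholds do the opposite.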
Xinyang Zhang, Tyson Dawson, Keith A. Crandall, Ali Rahnavard (2022+), strainFocus: Phylogenetic Analysis of Pooled Bacterial Genomes to Identify Evolutionary Patterns Among Strains, https://github.com/omicsEye/strainFocus
Please check our omicsEye Support Forum for common questions before opening an issue thread there.
- Features
- strainFocus
- Getting Started with strainFocus
- Tutorials for normalized mutual information calculation
- Applications
- Support
- Generic metagenomics software that can handle paired or unpaired transcriptomic or metagenomic reads
- Quality control built-in
- User-friendly
- Handles host reads
- Multiple outputs for further downstream analysis
python -m pip install git+https://github.com/omicsEye/strainFocus
To test if strainFocus is installed correctly, you may run the following command in the terminal:
strainFocus -h
This prints the strainFocus command-line options:
usage: strainFocus -h
--seqfile SEQFILE --seqtype SEQTYPE --meta_data META_DATA --metavar METAVAR --anatype {reg,cl} [--fraction FRACTION]
optional arguments:
-h, --help show this help message and exit
--seqfile SEQFILE, -sf SEQFILE
files contains the sequences
--seqtype SEQTYPE, -st SEQTYPE
type of sequence: nuc, amino-acid
--meta_data META_DATA, -md META_DATA
files contains the meta data
--metavar METAVAR, -mv METAVAR
name of the meta var (response variable). This is the label that will be used as the phenotype of interest to find genotypes related to it.
--anatype {reg,cl}, -a {reg,cl}
type of analysis
--fraction FRACTION, -fr FRACTION
fraction of main data to run
$ strainFocus -h
- --seqfile or -sf: PATH to a sequence data file
- --seqtype or -st: sequence type, values are amino-acid and nu for nucleotides
- --meta_data or -md: PATH to metadata file
- --metavar or -mv: name of the meta variable
- --anatype or -a: analysis type, options are reg for regression and cl for classification
- --fraction or -fr: fraction of the main data (sequence positions) to run. It is optional, but you can enter a value between 0 and 1 to sample from the main data set.
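The options above can be mirrored in a small argparse parser. This is a sketch reconstructed from the help text only, not the tool's actual source: the function names build_parser and unit_fraction are hypothetical, treating every option except --fraction as required is inferred from the usage line, and accepting fractions in the range (0, 1] is an assumption.

```python
import argparse


def unit_fraction(text):
    """Convert text to a float and require 0 < value <= 1 (assumed range)."""
    value = float(text)
    if not 0.0 < value <= 1.0:
        raise argparse.ArgumentTypeError("fraction must be between 0 and 1")
    return value


def build_parser():
    """Build a parser that mirrors the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="strainFocus")
    parser.add_argument("--seqfile", "-sf", required=True,
                        help="file that contains the sequences")
    parser.add_argument("--seqtype", "-st", required=True,
                        help="type of sequence")
    parser.add_argument("--meta_data", "-md", required=True,
                        help="file that contains the metadata")
    parser.add_argument("--metavar", "-mv", required=True,
                        help="name of the meta variable (response variable)")
    parser.add_argument("--anatype", "-a", required=True, choices=["reg", "cl"],
                        help="reg for regression, cl for classification")
    parser.add_argument("--fraction", "-fr", type=unit_fraction, default=None,
                        help="optional fraction of sequence positions to sample")
    return parser
```

For example, build_parser().parse_args(["-sf", "seqs.fasta", "-st", "nu", "-md", "meta.csv", "-mv", "host", "-a", "cl"]) returns a namespace with fraction set to None; the file names and the variable name host are placeholders.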
- correlated positions. We group all the collinear positions together.
- models summary. list of models and their performance metrics.
- plot of the feature importance of the top models in modelName_dpi.png format.
- csv files of feature importance based on the top models, containing the feature, importance, relative importance, and group of each position (all collinear positions are grouped together)
- plots and csv file of average of feature importance of top models.
- box plot (regression) or stacked bar plot (classification) for top positions of each model.
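The grouping of collinear positions mentioned in the outputs above can be illustrated with a short sketch. It merges positions whose numerically encoded columns have an absolute Pearson correlation at or above a threshold, using a union-find structure. The 0.9 threshold, the numeric encoding of positions, and the function names are assumptions made for illustration; the tool's own grouping method may differ.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / sqrt(var_x * var_y)


def group_collinear(columns, threshold=0.9):
    """Group column names whose |correlation| >= threshold (transitively)."""
    parent = {name: name for name in columns}

    def find(name):
        # Follow parent links to the group root, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(columns, 2):
        if abs(pearson(columns[a], columns[b])) >= threshold:
            parent[find(b)] = find(a)  # merge the two groups

    groups = {}
    for name in columns:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())
```

Because merging is transitive, two positions can end up in the same group through a shared neighbor even if their own correlation is below the threshold.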
Detailed options will be documented soon.
strainFocus --input myReads1.fastq --output strainFocus_demo_output
- Please submit your questions or issues with the software at the Issues tracker.