In these examples, you'll learn how to: (i) measure and fit correlation profiles across
whole genome sequences for viruses using mcorrViralGenome
and mcorr-viral-fit
,
(ii) measure and fit correlation profiles across single gene alignments
using mcorr-gene-aln
and mcorr-viral-fit
, as well as generate 95% bootstrap CIs
for the output parameters, and (iii) use mcorrLDGenome
to generate genome-wide maps of Pearson's r
for correlated substitutions.
For this example, we use the 191 sars-like coronaviruses we analyzed in our paper (link to preprint here), which were also used in the example of how to prepare viral sequences for use with viral-mcorr which you can find here. All of the alignment files are provided in this github repo. Before you begin, make sure you've installed viral-mcorr locally. Also download this repository onto your machine:
cd ~/Downloads
git clone clone https://github.com/kussell-lab/viral-mcorr_sl-cov_examples.git
cd viral-mcorr_sl-cov_examples
Here, we look at how to measure and fit correlation profiles across
whole genome sequences for viruses using mcorrViralGenome
and mcorr-viral-fit
.
Calculate correlation profiles across the aligned sars-like coronavirus sequences using the following commands:
mcorrViralGenome aligned_sl-cov/aligned_sl-cov.xmfa mvg_out/sl-cov_mvg_out
mcorr-viral-fit mvg_out/sl-cov_mvg_out.csv mvg_out/sl-cov_mvg
mcorrViralGenome
takes ~10 seconds and mcorr-viral-fit
takes ~5 seconds to run on a 2021 MacBook M1 with 16 GB RAM.
What your outputs should look like are in the subdirectory mvg_out
.
The main results can be found in the following files:
sl-cov_mvg_template-switch_best_fit.svg
andsl-cov_zero-recombo_best_fit.svg
show the plots of the correlation profile and fitting for the template-switching recombination model and for the zero recombination casesl-cov_comparemodels.csv
shows the table of fitted parameters for all recombination models (template-switching, fragment-incorporation, and zero-recombination) and AIC values
Here we look at how to measure and fit correlation profiles across single gene alignments
using mcorr-gene-aln
and mcorr-viral-fit
, as well as generate 95% bootstrap CIs
for inferred evolutionary parameters. This works with xmfa files for single genes.
For this example, let's use the spike protein from our sars-like coronavirus alignment (the CDS region spanning positions 21563 to 25384).
Calculate a correlation profile across the spike protein using the following commands:
mcorr-gene-aln aligned_sl-cov/genes/cds-YP_009724390.1_21563+25384.xmfa mga_out/spike_mga_out --show-progress
mcorr-viral-fit mga_out/spike_mga_out.csv mga_out/spike_mga
mcorr-gene-aln
takes ~20 minutes (using the default number of bootstraps)
and mcorr-viral-fit
takes ~30 seconds to run on a 2021 MacBook M1 with 16 GB RAM.
What your outputs should look like are in the subdirectory mga_out
.
The main results can be found in the following files:
spike_mga_template-switch_best_fit.svg
andspike_mga_zero-recombo_best_fit.svg
show the plots of the correlation profile and fitting for the template-switching recombination model and for the zero recombination casespike_mga_comparemodels.csv
shows the table of fitted parameters for all recombination models (template-switching, fragment-incorporation, and zero-recombination) and AIC valuesspike_mga_template-switch_fit_report.txt
gives 95% bootstrap CIs for the parameters generated by fitting bootstrap replicates with the template-switching model. Note: using the default 'template-switching' option, fbar is actually the genome size L, hence this parameter is fixed for all bootstraps.
Here, we look at using mcorrLDGenome
to measure correlations between specific genomic sites across viral genomes.
First, use mcorrLDGenome
to calculate the joint probability of a difference at two sites and the probability of a difference
at a single site for each genomic position in the aligned sars-like coronavirus whole genome sequences:
mcorrLDGenome aligned_sl-cov/aligned_sl-cov.xmfa mcorrLD_out/sl-cov_ld_genome_out
This took ~8 minutes to run on a 2021 MacBook M1 with 16 GB of RAM. The output file is sl-cov_ld_genome_out.csv
should be around 3.5 GB in size which, even compressed is a bit too large to store in this repo. You can then use the jupyter notebook pcc_example.ipynb
to make a visual map of Pearson's correlation coefficient
for synonymous substitutions across the sars-like coronavirus genome. The final output should look like the png file shown in the subdirectory mcorrLD_out.
The jupyter notebook takes ~10 minutes to run (again on a 2021 MacBook M1 w/ 16 GB of RAM)
Note on jupyter notebooks: If you haven't worked with a jupyter notebook before, see this guide to get started. Alternatively you can try to run the jupyter notebook using Google Colab.