This repository accompanies the paper Pilosof S, Alcalá-Corona SA, Wang T, Kim T, Maslov S. The network structure and eco-evolutionary dynamics of CRISPR-induced immune diversification. bioRxiv. 2020. Available: https://www.biorxiv.org/content/10.1101/850800v1.abstract. It contains the code and data for the simulation and empirical analysis included in the paper.
We implement a hybrid (deterministic/stochastic) model by Childs, et. al. 2012. The C++ source code to make the simulations can be found in the folder SourceCodeSimulatior
.
The code is written in C++ (version 11). It should be compiled using the command line with an appropriate C++11 compiler. For example:
g++ -std=c++11 CRISPR_Model.cpp -o CRISPRsimulator
Run the executable compiled code in the command line, using the next parameters:
./CRISPRsimulator Dp mu runT Pts Sp seed
Where:
Dp
: The initial number of phage strains.mu
: The prtospacer mutation rate per protospacer.runT
: Specifies the running time-length (in model hours) for the simulation. This is the maximum length of each simulation. Execution time varies depending on the other parameters.Pts
: The length of the protospacer cassette for virus strains.Sp
: The length of the spacer cassette for host strains.seed
: Set the seed for the random generator used for the stochastic part.
For example:
./CRISPRsimulator 1 1e-7 5000 15 10 12499
It is also possible to set the seed of each simulation using a random number generated by your Unix/Linux System. This is useful for running different simulations with the same set of parameters.
./CRISPRsimulator 1 1e-7 5000 15 10 $RANDOM
Notes:
- A full description of the parameters is given in the original paper by Childs, et. al. 2012 and in our paper.
- The analysis can be computationally intensive and we ran it on the Midway HPC cluster at the University of Chicago.
The simulator produces the following files files named muM_initialDiffDpD_SXPYR-Nsuffix, where M is mu
, D is Dp
, X is Sp
, Y is Pts
, N is the seed number of a particular simulation and suffix is one of the following:
Bacteria-abundance.txt
: The density of bacteria at different times.Phage-abundance.txt
: The density of viruses at different times.Bacteria-TREE.txt
: Specifies the parents and children of bacteria. Converted to a nwk file in sectionTrees
in filesimulations_analysis.R
.data-bact.txt
: Spacer composition of bacteria strains at each time.data-phage.txt
: Protospacer composition of virus strains at each time.
For example, our main simulation in the paper is mu1e-7_initialDiffDp1_S10P15_R-12499.
In this repository we include the output files of our main example in the paper in the folder data\mu1e-7_initialDiffDp1_S10P15_R-12499
. We also include a list with all the seeds for the 100 simulations we ran in file seed_index.csv
. Running the simulator with these seeds and same parameters as our main simulation will reproduce the exact data we use in the paper.
The simulator's output files are analyzed using the file simulations_analysis.R
. We also include the file host_spacer_simulation.Rmd
for specific analysis of modularity and phylogenetic distance in host-spacer and infection networks.
- The code is written in R and requires the packages specified in the code. Modularity analysis is performed with package infomapecology. See instructions for installation there.
- The analysis is computationally intensive and we ran it on the Midway HPC cluster at the University of Chicago. Some adaptation of the R code will be necessary to adequate it to other HPC systems.
- The section Set up in file
simulations_analysis.R
includes code that may need to be adapted.
The R file simulations_analysis.R
is divided into sections.
- Set up: The first line includes the parameter values used (Table is in the paper). Lines 11-24 are used when calling the file from an external sbatch file when running jobs on a HPC system.
- Initialize: Read simulation data and use it to define regimes. Also prepare the data frames.
- Functions: Define functions. The main function is
create_networks_hr
, which creates all the networks (Fig 2 in the paper). - bacteria / phage diversity: Plot abundance profiles of hosts and viruses.
- Generate networks: Generate networks for each time.
- Measures of diversity: Calculate host, virus and spacer abundance and richness. Other indices of diversity can be included in this section.
- Phage and bacteria diversification: Calculate the persistence of hosts and viruses.
- Trees: Create the phylogenetic trees for hosts and viruses.
- Modularity of infection networks: Calculate the modularity of infection networks in time. If
make_plots
is TRUE then the code will produce additional plots (e.g., of the infection network per time step). This is good for exploratory analysis. - WNODF: Calculate the WNDOF index for weighted nestedness in immunity networks at each time.
- Spacer matches: Calculate the proportion of spacer matches (the edge weights in immunity networks) at each time.
- Extinctions: Analyze order of virus extinction.
- R calculations: This calculates the epidemiological R quantities.
- Plot: Produce a PDF with all the plots for convenience.
The following sections are also used in file host_spacer_simulation.Rmd
. The analysis in these sections serves for comparison to empirical data.
- Significance of modularity of infection networks: Calculate the significance of modularity at the end of each VDR by comparing to shuffled networks.
- Phylogenetic signal in infection networks: Is there a phylogenetic signal in infection networks at the end of each VDR?
- Significance of modularity of host-spacer networks: Aggregates host-spacer networks within each VDR and test if modularity is non-random compared to shuffled networks.
- Phylogenetic signal in host-spacer modules: Is there a phylogenetic signal in the host-spacer networks analyzed in the previous section?
The data sets we use in the paper are stored in an SQL database CRISPR_database_NEE.sqlite
.
Code is concentrated in file empirical_data_analysis.R
. We also include empirical_data_analysis.Rmd
, which sources the R file to create a markdown file for convenience. The main functions that perform the analysis are:
- get_strain_spacers: Obtains the spacer set of a virus or a host from the host-spacer/virus-protospacer matrix.
- get_matches: Match spacers and protospacers and produce an immunity network.
- test_PD_modules: the same function as in the simulated data, to test for phylogenetic signal.
- calculate_ev_nestedness: Calculate leading eigenvalue weighted nestedness (Staniczenko PPA, Kopp JC, Allesina S. The ghost of nestedness in ecological networks. Nat Commun. 2013;4: 1391.).
- main: A wrapper to read data, build and analyze immunity networks.
The analysis of modularity is also performed with infomapecology.
In the paper we mention a comparison to a system without specific immune memory (specific spacer-protospacer matches). This done using a mean-field simplification of the hybrid model that does not explicitly include the diversity of hosts and viruses. Instead, we consider the model as a Lotka-Volterra system with all the host strains considered as one single prey, and all the virus strains as a single predator. The immunity is taken as an average value of the matrix M. The Python code for that can be found in folder MeanField-NeutralDynamics
. It is a simple ODE integrator of the LV mean-field simplified system, which uses different values of the average immunity matrix M.
python MFDyn_AverageImmMatrix.py M1 M2 M3 M4
Where Mi are the artificial average values of the matrix M.
For example:
python MFDyn_AverageImmMatrix.py 0.35 0.7 0.85 0.9167