Skip to content

Latest commit

 

History

History
205 lines (161 loc) · 10.3 KB

22a.fastsimcoal_fitting.md

File metadata and controls

205 lines (161 loc) · 10.3 KB

Demographic model fitting with fastsimcoal

We used fastsimcoal version 2.7 (the lastest version 2705 which fixed a bug regarding migration estimate in models with growth rate) to estimate demographic parameters based on the site frequency spectrum (SFS).

SNP filtering and SFS calculation

We used fastsimcoal 2.7 to infer past demographics by maximising a likelihood function based on the fit of a coalescent model to the genome-wide site frequency spectrum (SFS). We used all samples, except BR_5_121_S125_L004 (see 18.radseq_check ) which was likely mislabelled from inshore (29), all from north offshore (20), and all from south offshore (25). To minimise potential bias from linkage and selection, we only used non-genic SNPs, and filtered SNPs with an r2 > 0.3 which left us with 257,314 sites. To utilise the mutation rate in branch length computation, we estimated the monomorphic sites based on the proportional number of mappable sites defined by the SNPable pipeline we used in MSMC analysis. With this set of SNPs, we then used easySFS to generate the folded multidimensional allele frequency spectrum (jAFS) for three populations.

Prepare dataset

bedtools subtract -a Adigi.filtered.vcf.gz -b CDS.bed | bgzip >Adigi.nongenic.vcf.gz
bcftools +prune -m 0.3 -e'F_MISSING>=0.01' -w 1000 Adigi.nongenic.vcf.gz -o Adigi.nongenic.r03nomiss.vcf.gz
## calculate the number of monomorphic sites
## ([the number of mappability sites (from MSMC)] - [genic sites] )* (257,314/8,714,015) - SNPs
#(250,379,843 - 37,088,136)*(257,314/8,714,015) - 257314
#~ 6040924
## add number to sfs later

Calculate SFS

python easysfs.py -i Adigi.nongenic.r03nomiss.vcf.gz -p popfile.txt -a --GQ 20 --dtype int -o sfs --proj=58,40,52

After generating we visualise the SFS as a series of heatmaps.

Model Selection

We used a hierarchical approach to model selection, first testing alternative models for the tree topology (divergence), then alternative migration and growth scenarios based on the best topology.

Divergence

We first tested alternative models of the divergence topology. Among the four alternative models, the model where inshore split first clearly displays the best likelihood and AIC values.

Figure 1: The AIC distribution of four divergence models. Plot shows the best AIC value from each of 100 independent runs for each model.

Models with growth rate and migration.

Since our population genetic statistics and SMC++ analyses all point toward a bottleneck and recent population growth we added a growth phase to all populations after splitting. We then generated 6 alternative models based on the relative timing of migration as shown in the diagram below.

**Figure 2: Graphic illustration of six alternative models. Each model is given a long name and a short code in brackets. **

Model comparison

For each of the six models, we performed 100 independent runs and examined the distribution of parameter values and AIC values from the maximum likelihood estimate of each run.

In two of our models we allowed the migration phase to start independently of population splits (TMIG) (though constrained to be after TDIV1), however we found that these two models converged toward a simpler model, “Contemporary continuous migration IMc” (coded as growth_rate_SC internally). This can be seen in the parameter plot below where we see that the parameter TMIG converges to TMIG=TDIV1 for the model “Secondary contact SC” and converges to 0 for the model “Early migration EM”. Both scenarios reduce to the model “IMc” in this situation.

Figure 3: Estimated population split times and migration time (TMIG) in two models where the migration time (TMIG) represents migration onset (SC) or termination (EM)

Model selection analysis based on AIC for all six models clearly showed that the strict isolation model and ancient migration models were not supported (highest AIC). Other models, EM, IM, IMc and SC all had a similar distribution of AIC values however since SC and EM effectively reduce to IMc this left only the IM and IMc models as potential candidates for the best model. We decided to proceed with IMc as the best model on the basis that it is simpler (one less parameter) than IM and had a better overall weighted AIC value.

Non-parametric bootstrapping

To estimate the confidence interval of all the parameters. We did 100 times of sampling with replacement(using sample) of SNPs in our filtered SNP sets with sample number of loci and created the joint SFS the same way. Next, we performed 20 independent fastsimcoal runs with the same parameters for each bootstapping data set and obtain the range of parameters from the best estimates from those runs.

grep -v "#" Adigi.v2.nongenic.r03nomiss.SNPs.vcf > temp
grep "#" Adigi.v2.nongenic.r03nomiss.SNPs.vcf > header.txt
cat header.txt <(sample -k 257314 temp -r) > bootstrap.vcf

The distribution of the values of parameters estimated.

Divergence time range

Estimated population sizes

Migration rate

Tables: The point estimate of six models and the parameter 95% confidence intervals obtained under IMc model which were calcualted according to the percentile method.

parameters SI IM IMc SC EM AM
ANCSIZE 3.678630e+05 3.931100e+05 3.897220e+05 3.890450e+05 3.907860e+05 3.992670e+05
NAOFF 1.184500e+04 3.851000e+03 1.231900e+04 1.186000e+04 1.250300e+04 1.732000e+04
NAI 2.157000e+03 2.429000e+03 2.345000e+03 2.517000e+03 2.068000e+03 8.430000e+02
NAN 6.767000e+03 1.224200e+04 1.106100e+04 9.567000e+03 1.353000e+04 2.384000e+03
NAS 1.663000e+03 9.432000e+03 1.026700e+04 1.027100e+04 8.704000e+03 4.839000e+03
NI 8.239700e+04 3.625970e+05 2.175780e+05 2.318410e+05 4.407140e+05 2.585070e+05
NN 4.777400e+04 7.797890e+05 7.004010e+05 8.635340e+05 6.122120e+05 4.045820e+05
NS 3.141560e+05 4.296100e+05 3.347110e+05 2.671460e+05 6.594250e+05 2.792200e+04
TDIV1 1.220000e+02 8.360000e+02 9.040000e+02 9.170000e+02 9.670000e+02 1.180000e+02
TDIV2 3.110000e+02 9.560000e+02 1.275000e+03 1.258000e+03 1.354000e+03 7.060000e+02
RI -1.172690e-02 -5.236100e-03 -3.552100e-03 -3.596500e-03 -3.961300e-03 -8.110600e-03
RN -1.608080e-02 -4.971800e-03 -4.589700e-03 -4.910500e-03 -3.943900e-03 -4.363860e-02
RS -4.312660e-02 -4.570400e-03 -3.855100e-03 -3.553600e-03 -4.477200e-03 -1.489710e-02
MaxEstLhood -3.173814e+06 -3.168990e+06 -3.168978e+06 -3.169011e+06 -3.169213e+06 -3.169275e+06
MaxObsLhood -3.151638e+06 -3.151638e+06 -3.151638e+06 -3.151638e+06 -3.151638e+06 -3.151638e+06
MIGN NA 1.911000e-04 1.847000e-04 1.861000e-04 1.916000e-04 NA
MIGS NA 1.889000e-04 1.895000e-04 1.826000e-04 1.895000e-04 NA
MIGO NA 9.800000e-05 1.758000e-04 1.555000e-04 1.434000e-04 NA
MIGAO NA 0.000000e+00 NA NA NA 6.916000e-04
TMIG NA NA NA 9.030000e+02 1.020000e+02 NA
n 1.000000e+01 1.400000e+01 1.300000e+01 1.400000e+01 1.400000e+01 1.100000e+01
aic 1.461597e+07 1.459377e+07 1.459371e+07 1.459386e+07 1.459479e+07 1.459507e+07
delta_ll 2.217578e+04 1.735187e+04 1.733928e+04 1.737285e+04 1.757450e+04 1.763695e+04
Tdiff NA NA NA 1.400000e+01 8.650000e+02 NA
param lowerbound upperbound
ANCSIZE 3.864210e+05 3.968435e+05
MIGN 1.777000e-04 1.919000e-04
MIGO 4.000000e-07 1.788000e-04
MIGS 1.770000e-04 1.918000e-04
NAI 1.695000e+03 3.367800e+03
NAN 9.279200e+03 1.680715e+04
NAOFF 5.110000e+03 1.979170e+04
NAS 7.435550e+03 1.296545e+04
NI 1.505423e+05 6.506730e+05
NN 4.389432e+05 1.039143e+06
NS 2.729045e+05 9.757819e+05
RI -5.444300e-03 -2.506600e-03
RN -5.312200e-03 -3.779300e-03
RS -5.576100e-03 -3.447800e-03
TDIV1 8.010000e+02 9.780000e+02
TDIV2 1.012000e+03 1.586000e+03