We used fastsimcoal
version 2.7 (the lastest version 2705 which fixed
a bug regarding migration estimate in models with growth rate) to
estimate demographic parameters based on the site frequency spectrum
(SFS).
We used fastsimcoal 2.7 to infer past demographics by maximising a likelihood function based on the fit of a coalescent model to the genome-wide site frequency spectrum (SFS). We used all samples, except BR_5_121_S125_L004 (see 18.radseq_check ) which was likely mislabelled from inshore (29), all from north offshore (20), and all from south offshore (25). To minimise potential bias from linkage and selection, we only used non-genic SNPs, and filtered SNPs with an r2 > 0.3 which left us with 257,314 sites. To utilise the mutation rate in branch length computation, we estimated the monomorphic sites based on the proportional number of mappable sites defined by the SNPable pipeline we used in MSMC analysis. With this set of SNPs, we then used easySFS to generate the folded multidimensional allele frequency spectrum (jAFS) for three populations.
Prepare dataset
bedtools subtract -a Adigi.filtered.vcf.gz -b CDS.bed | bgzip >Adigi.nongenic.vcf.gz
bcftools +prune -m 0.3 -e'F_MISSING>=0.01' -w 1000 Adigi.nongenic.vcf.gz -o Adigi.nongenic.r03nomiss.vcf.gz
## calculate the number of monomorphic sites
## ([the number of mappability sites (from MSMC)] - [genic sites] )* (257,314/8,714,015) - SNPs
#(250,379,843 - 37,088,136)*(257,314/8,714,015) - 257314
#~ 6040924
## add number to sfs later
Calculate SFS
python easysfs.py -i Adigi.nongenic.r03nomiss.vcf.gz -p popfile.txt -a --GQ 20 --dtype int -o sfs --proj=58,40,52
After generating we visualise the SFS as a series of heatmaps.
We used a hierarchical approach to model selection, first testing alternative models for the tree topology (divergence), then alternative migration and growth scenarios based on the best topology.
We first tested alternative models of the divergence topology. Among the four alternative models, the model where inshore split first clearly displays the best likelihood and AIC values.
Figure 1: The AIC distribution of four divergence models. Plot shows the best AIC value from each of 100 independent runs for each model.
Since our population genetic statistics and SMC++ analyses all point toward a bottleneck and recent population growth we added a growth phase to all populations after splitting. We then generated 6 alternative models based on the relative timing of migration as shown in the diagram below.
**Figure 2: Graphic illustration of six alternative models. Each model is given a long name and a short code in brackets. **
Model comparison
For each of the six models, we performed 100 independent runs and examined the distribution of parameter values and AIC values from the maximum likelihood estimate of each run.
In two of our models we allowed the migration phase to start
independently of population splits (TMIG) (though constrained to be
after TDIV1), however we found that these two models converged toward a
simpler model, “Contemporary continuous migration IMc” (coded as
growth_rate_SC
internally). This can be seen in the parameter plot
below where we see that the parameter TMIG
converges to TMIG=TDIV1 for
the model “Secondary contact SC” and converges to 0 for the model “Early
migration EM”. Both scenarios reduce to the model “IMc” in this
situation.
Figure 3: Estimated population split times and migration time (TMIG) in two models where the migration time (TMIG) represents migration onset (SC) or termination (EM)
Model selection analysis based on AIC for all six models clearly showed
that the strict isolation model and ancient migration models were not
supported (highest AIC). Other models, EM, IM, IMc and SC all had a
similar distribution of AIC values however since SC and EM effectively
reduce to IMc this left only the IM and IMc models as potential
candidates for the best model. We decided to proceed with IMc
as the
best model on the basis that it is simpler (one less parameter) than
IM
and had a better overall weighted AIC value.
To estimate the confidence interval of all the parameters. We did 100 times of sampling with replacement(using sample) of SNPs in our filtered SNP sets with sample number of loci and created the joint SFS the same way. Next, we performed 20 independent fastsimcoal runs with the same parameters for each bootstapping data set and obtain the range of parameters from the best estimates from those runs.
grep -v "#" Adigi.v2.nongenic.r03nomiss.SNPs.vcf > temp
grep "#" Adigi.v2.nongenic.r03nomiss.SNPs.vcf > header.txt
cat header.txt <(sample -k 257314 temp -r) > bootstrap.vcf
Divergence time range
Estimated population sizes
Migration rate
Tables: The point estimate of six models and the parameter 95% confidence intervals obtained under IMc model which were calcualted according to the percentile method.
parameters | SI | IM | IMc | SC | EM | AM |
---|---|---|---|---|---|---|
ANCSIZE | 3.678630e+05 | 3.931100e+05 | 3.897220e+05 | 3.890450e+05 | 3.907860e+05 | 3.992670e+05 |
NAOFF | 1.184500e+04 | 3.851000e+03 | 1.231900e+04 | 1.186000e+04 | 1.250300e+04 | 1.732000e+04 |
NAI | 2.157000e+03 | 2.429000e+03 | 2.345000e+03 | 2.517000e+03 | 2.068000e+03 | 8.430000e+02 |
NAN | 6.767000e+03 | 1.224200e+04 | 1.106100e+04 | 9.567000e+03 | 1.353000e+04 | 2.384000e+03 |
NAS | 1.663000e+03 | 9.432000e+03 | 1.026700e+04 | 1.027100e+04 | 8.704000e+03 | 4.839000e+03 |
NI | 8.239700e+04 | 3.625970e+05 | 2.175780e+05 | 2.318410e+05 | 4.407140e+05 | 2.585070e+05 |
NN | 4.777400e+04 | 7.797890e+05 | 7.004010e+05 | 8.635340e+05 | 6.122120e+05 | 4.045820e+05 |
NS | 3.141560e+05 | 4.296100e+05 | 3.347110e+05 | 2.671460e+05 | 6.594250e+05 | 2.792200e+04 |
TDIV1 | 1.220000e+02 | 8.360000e+02 | 9.040000e+02 | 9.170000e+02 | 9.670000e+02 | 1.180000e+02 |
TDIV2 | 3.110000e+02 | 9.560000e+02 | 1.275000e+03 | 1.258000e+03 | 1.354000e+03 | 7.060000e+02 |
RI | -1.172690e-02 | -5.236100e-03 | -3.552100e-03 | -3.596500e-03 | -3.961300e-03 | -8.110600e-03 |
RN | -1.608080e-02 | -4.971800e-03 | -4.589700e-03 | -4.910500e-03 | -3.943900e-03 | -4.363860e-02 |
RS | -4.312660e-02 | -4.570400e-03 | -3.855100e-03 | -3.553600e-03 | -4.477200e-03 | -1.489710e-02 |
MaxEstLhood | -3.173814e+06 | -3.168990e+06 | -3.168978e+06 | -3.169011e+06 | -3.169213e+06 | -3.169275e+06 |
MaxObsLhood | -3.151638e+06 | -3.151638e+06 | -3.151638e+06 | -3.151638e+06 | -3.151638e+06 | -3.151638e+06 |
MIGN | NA | 1.911000e-04 | 1.847000e-04 | 1.861000e-04 | 1.916000e-04 | NA |
MIGS | NA | 1.889000e-04 | 1.895000e-04 | 1.826000e-04 | 1.895000e-04 | NA |
MIGO | NA | 9.800000e-05 | 1.758000e-04 | 1.555000e-04 | 1.434000e-04 | NA |
MIGAO | NA | 0.000000e+00 | NA | NA | NA | 6.916000e-04 |
TMIG | NA | NA | NA | 9.030000e+02 | 1.020000e+02 | NA |
n | 1.000000e+01 | 1.400000e+01 | 1.300000e+01 | 1.400000e+01 | 1.400000e+01 | 1.100000e+01 |
aic | 1.461597e+07 | 1.459377e+07 | 1.459371e+07 | 1.459386e+07 | 1.459479e+07 | 1.459507e+07 |
delta_ll | 2.217578e+04 | 1.735187e+04 | 1.733928e+04 | 1.737285e+04 | 1.757450e+04 | 1.763695e+04 |
Tdiff | NA | NA | NA | 1.400000e+01 | 8.650000e+02 | NA |
param | lowerbound | upperbound |
---|---|---|
ANCSIZE | 3.864210e+05 | 3.968435e+05 |
MIGN | 1.777000e-04 | 1.919000e-04 |
MIGO | 4.000000e-07 | 1.788000e-04 |
MIGS | 1.770000e-04 | 1.918000e-04 |
NAI | 1.695000e+03 | 3.367800e+03 |
NAN | 9.279200e+03 | 1.680715e+04 |
NAOFF | 5.110000e+03 | 1.979170e+04 |
NAS | 7.435550e+03 | 1.296545e+04 |
NI | 1.505423e+05 | 6.506730e+05 |
NN | 4.389432e+05 | 1.039143e+06 |
NS | 2.729045e+05 | 9.757819e+05 |
RI | -5.444300e-03 | -2.506600e-03 |
RN | -5.312200e-03 | -3.779300e-03 |
RS | -5.576100e-03 | -3.447800e-03 |
TDIV1 | 8.010000e+02 | 9.780000e+02 |
TDIV2 | 1.012000e+03 | 1.586000e+03 |