Species delimitation using Markov Chain Monte Carlo

We provide a Markov Chain Monte Carlo sampling method for assessing the confidence of the Maximum Likelihood delimitation scheme. The MCMC method is activated with the --mcmc switch followed by the number of MCMC steps and the switch for either the PTP model (--single) or the mPTP model (--multi). The user may define a number of additional parameters for the MCMC sampling as explained below.

#Overview of Command-line Parameters

Parameter	Explanation
`--mcmc INT`	Compute support values for the delimitation using INT MCMC steps.
`--mcmc_sample INT`	Sample every INT iterations (default: 1000).
`--mcmc_log`	Log samples and create SVG plot of log-likelihoods.
`--mcmc_burnin INT`	Ignore all MCMC steps below the threshold INT (burn-in).
`--mcmc_runs INT`	Run INT independent chains.
`--mcmc_credible`	Probability (0.0 to 1.0) for which to generate the credible interval, i.e., the probability that the true number of species falls within the credible interval given the observed data (default: 0.95).
`--mcmc_startnull`	Start each chain with the null model (one single species).
`--mcmc_startrandom`	Start each chain with a random delimitation.
`--mcmc_startml`	Start each chain with the delimitation obtained by the Maximum-likelihood heuristic.

Command line examples

The following command line executes a single MCMC analysis with 1 million steps under the mPTP model:

$ mptp --tree_file tree_filename --output_file output_filename --mcmc 1000000 --multi --minbr 0.0009330519

The --mcmc_sample option sets how often the chain is sampled. For example, the following command samples every 100th step, so 10000 samples are written to the output file.

$ mptp --tree_file tree_filename --output_file output_filename --mcmc 1000000 --multi --minbr 0.0009330519 --mcmc_sample 100
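The figure of 10000 above follows from simple arithmetic on the number of steps and the sampling interval. The sketch below makes that explicit. The function name `expected_sample_count` is hypothetical, and the counting rule (steps numbered from 1, a sample at every multiple of the interval, samples at steps up to and including the burn-in dropped) is an assumption for illustration, not a description of how mptp counts internally.

```python
def expected_sample_count(steps, sample_every=1000, burnin=0):
    """Estimate how many samples one chain keeps.

    Assumptions (not taken from mptp itself): steps are numbered
    1..steps, a sample is taken whenever the step number is a
    multiple of sample_every, and samples taken at steps up to and
    including burnin are discarded.
    """
    if steps < 0 or burnin < 0:
        raise ValueError("steps and burnin must be non-negative")
    if sample_every <= 0:
        raise ValueError("sample_every must be positive")
    burnin = min(burnin, steps)
    # Multiples of sample_every in (burnin, steps].
    return steps // sample_every - burnin // sample_every


if __name__ == "__main__":
    # 1 million steps sampled every 100 steps, no burn-in.
    print(expected_sample_count(1_000_000, sample_every=100))
```

Under the same assumptions, a burn-in of 100000 steps would leave 9000 of those samples.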

To assess convergence (see convergence), it is highly recommended to run at least two independent MCMC analyses. This can be done in a single mptp execution with the --mcmc_runs option followed by the number of analyses you want to run, as shown in the example below.

$ mptp --tree_file tree_filename --output_file output_filename --mcmc 1000000 --multi --minbr 0.0009330519 --mcmc_sample 100 --mcmc_runs 2

Another important point in assessing convergence is to use different starting delimitations for each MCMC run. By default, the starting point for each run is a randomly generated delimitation (--mcmc_startrandom). However, a user may choose to start from the ML delimitation scheme (--mcmc_startml) or from the null model (--mcmc_startnull), which assumes that all branch lengths fit a single exponential distribution.
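When the same analysis is repeated with different starting points, it can help to assemble the command line programmatically. The following is a minimal sketch, assuming only the switches listed in the parameter table and used in the examples above. The helper `build_mptp_mcmc_args` and its keyword names are hypothetical and not part of mptp.

```python
import shlex

# Starting-point switches as listed in the parameter table.
START_FLAGS = {
    "random": "--mcmc_startrandom",
    "ml": "--mcmc_startml",
    "null": "--mcmc_startnull",
}


def build_mptp_mcmc_args(tree_file, output_file, steps, model="multi",
                         minbr=None, sample_every=None, runs=None,
                         burnin=None, start=None):
    """Return the mptp argument list for an MCMC analysis (hypothetical helper)."""
    if model not in ("single", "multi"):
        raise ValueError("model must be 'single' (PTP) or 'multi' (mPTP)")
    args = ["mptp", "--tree_file", tree_file, "--output_file", output_file,
            "--mcmc", str(steps), "--" + model]
    if minbr is not None:
        # Pass minbr as a string so its digits are kept exactly as given.
        args += ["--minbr", str(minbr)]
    if sample_every is not None:
        args += ["--mcmc_sample", str(sample_every)]
    if runs is not None:
        args += ["--mcmc_runs", str(runs)]
    if burnin is not None:
        args += ["--mcmc_burnin", str(burnin)]
    if start is not None:
        args.append(START_FLAGS[start])
    return args


if __name__ == "__main__":
    for start in ("random", "ml", "null"):
        args = build_mptp_mcmc_args("tree_filename", "output_filename",
                                    1000000, minbr="0.0009330519",
                                    sample_every=100, runs=2, start=start)
        # Print a shell-quoted command line; nothing is executed here.
        print(shlex.join(args))
```

The list form can be passed directly to `subprocess.run`, which avoids shell quoting issues with file names.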

Support values and credible intervals

Support values

Average Support Values (ASV)

mPTP prints an Average Support Value (ASV) for each chain. This is useful for assessing the congruence of the support values with the ML delimitation. Assume we have obtained the ML delimitation below (left tree) and, using MCMC, we obtain the support values for each node u shown in the right tree. Note that the symbol S on the ML tree nodes indicates that the particular node belongs to the between-species splitting process, while C indicates that the node is part of the within-species splitting process.

We compute the ASV using the following formula:

where

The resulting ASV for our particular example is 0.864, indicating that there is 86.4% congruence between the support values and the point estimate.
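The ASV formula referred to above can be sketched as follows. This is a reading inferred from the S and C labels and the per-node support values described in this section, not a definition taken from the text. The symbols n (number of nodes considered), s(u) (support value of node u, taken here as the fraction of MCMC samples that place u in the between-species process) and f(u) are assumptions introduced for illustration.

```latex
\mathrm{ASV} = \frac{1}{n} \sum_{u=1}^{n} f(u),
\qquad
f(u) =
\begin{cases}
s(u)     & \text{if node } u \text{ is labelled S in the ML delimitation,} \\
1 - s(u) & \text{if node } u \text{ is labelled C in the ML delimitation.}
\end{cases}
```

Under this reading, an ASV of 1 would mean that every sampled delimitation assigns each node to the same process as the ML delimitation.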

At the end of an MCMC run, the ASV is printed at the bottom of the screen output as follows:

ML average support based on chain 268791095 : 0.98683685575051938

If more than one MCMC run was performed (with --mcmc_runs > 1), the ASV for each run is printed on the screen:

ML average support based on chain 268791095 : 0.98683685575051938

ML average support based on chain 15357013 : 0.98533246567769028
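To make the idea of congruence concrete, here is a minimal sketch of an ASV-style calculation. It assumes that a node labelled S in the ML delimitation contributes its support value and a node labelled C contributes one minus its support value, with the support value read as the fraction of samples placing the node in the between-species process. That rule, the function name `average_support_value`, and the toy numbers are assumptions for illustration; they are not taken from mptp or from the example tree.

```python
def average_support_value(ml_labels, support):
    """Average per-node agreement between MCMC support and the ML labels.

    ml_labels: dict mapping node id -> "S" (between-species) or "C" (within-species)
    support:   dict mapping node id -> fraction of samples in [0, 1] that
               place the node in the between-species process (assumed meaning)
    """
    if not ml_labels:
        raise ValueError("at least one node is required")
    total = 0.0
    for node, label in ml_labels.items():
        s = support[node]
        if not 0.0 <= s <= 1.0:
            raise ValueError("support values must lie in [0, 1]")
        if label == "S":
            total += s          # samples agree when the node is between-species
        elif label == "C":
            total += 1.0 - s    # samples agree when the node is within-species
        else:
            raise ValueError("labels must be 'S' or 'C'")
    return total / len(ml_labels)


if __name__ == "__main__":
    # Toy values, not the ones from the example tree.
    labels = {"a": "S", "b": "S", "c": "C", "d": "C"}
    supports = {"a": 1.0, "b": 0.8, "c": 0.1, "d": 0.0}
    print(average_support_value(labels, supports))
```

For the toy input the result is about 0.925: the four per-node agreements are 1.0, 0.8, 0.9 and 1.0.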

Average Standard Deviation of Delimitation Support Values (ASDDSV)

The ASDDSV is inspired by the standard deviation of split frequencies (Ronquist et al., 2012) and it is used for quantifying the similarity among independent MCMC runs. To calculate it we average the standard deviation of per-node delimitation support values across the independent runs. ASDDSV approaches zero as runs converge to the same distribution of delimitations.

At the end of the MCMC runs, the ASDDSV is printed at the bottom of the screen output as follows:

Average standard deviation of support values among chains: 0.001385
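The averaging described above can be sketched in a few lines. The function name `asddsv` and the toy inputs are hypothetical, and the choice of the population standard deviation (`statistics.pstdev`) rather than the sample standard deviation is an assumption; the text does not say which one mptp uses.

```python
from statistics import pstdev


def asddsv(runs):
    """Average, over nodes, of the standard deviation of support across runs.

    runs: list of dicts, one per independent MCMC run, each mapping
          node id -> delimitation support value for that node.
    """
    if len(runs) < 2:
        raise ValueError("at least two runs are required")
    nodes = list(runs[0])
    if not nodes:
        raise ValueError("at least one node is required")
    for run in runs[1:]:
        if set(run) != set(nodes):
            raise ValueError("all runs must cover the same nodes")
    # Population standard deviation per node (assumed), then the mean over nodes.
    per_node = [pstdev([run[node] for run in runs]) for node in nodes]
    return sum(per_node) / len(per_node)


if __name__ == "__main__":
    # Toy support values for two runs over the same two nodes.
    run1 = {"a": 1.0, "b": 0.5}
    run2 = {"a": 0.8, "b": 0.5}
    print(asddsv([run1, run2]))
```

Runs with identical support values give exactly 0, matching the statement that the ASDDSV approaches zero as runs converge.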