OptiRADCT is a pipeline for optimizing Clustering Threshold values in Restrinction-site Associated DNA sequencing (RADseq) data assemblies performed with ipyrad for phylogenetic and phylogeography studies. It can be use to assemble RADseq datasets with a range of clustering parameters and vizualise how properties of the assemblies (e.g., proportion of informative sites and of missing data) and downstream phylogenetic inferences are affected by the values used.
- ipyrad (https://ipyrad.readthedocs.io/en/latest/)
- R (https://www.r-project.org/)
- raxml-PTHREADS (optionnal, https://cme.h-its.org/exelixis/web/software/raxml/)
The use of OptiRADCT presupposes a good understanding of the way ipyrad works. I strongly encourage the users to first get familiar with ipyrad's documentation.
A typical RADseq analysis workflow with de novo assembly would include the following steps:
- Step 1: Data demultiplexing and reads filtering. This can be performed by running steps 1 & 2 of ipyard, or using the process_radtags module of stacks.
- Step 2: Optimization of the clustering parameters.
- Step 2.1: Optimization of the intra-samples clustering threshold, which is used to cluster reads within each samples.
- Step 2.2: Optimization of the between-samples clustering threshold, which is used to cluster consensus sequences across samples.
- Step 3: Additionnal filtering of the dataset assembled with "optimal" parameters, and data formating. This is performed by re-running step 7 of ipyrad on the "best" assembly produced in Step 2.2
- Step 4: Downstream analyses.
OptiRADCT is meant to facilitate data exploration and vizualisation for the optimization step 2. Because ipyrad encodes a unique clustering threshold value for intra-samples and across-samples clustering, this is done by running ipyrad sequentially twice. For more detailed explanations, refer to Rancilhac et al. (2023) BioRxiv.
1) optimization of the intra-samples Clustering Threshold (1_run_iCT.sh)
in this first part, intra-samples clustering is performed with a user-specified range of thresholds (iCTs). Between samples clustering can be subsequently completed in each assembly using a user-specified standardized between-samples threshold (bCT).
bash 1_run_iCT.sh -d /path/to/data -n assemblies_base_name -w /path/to/work/dir -i '0.80 0.81 0.82 0.83' -b 0.9 -D 8 -T 4 -t pairddrad -s /path/to/scripts -p -r /path/to/reference.tre
arguments:
-d | --data : full path to a directory containing the demultiplexed data. Files are expected to be gzip-compressed (.gz extension) and all .gz files in the directory will be used. See ipyrad's documentation for more details on files naming conventions.
-n | --name : a base name for ipyrad's assemblies.
-w | --work : full path to working directory where ipyrad assemblies and optiRADCT will be stored
-i | --iCT : the range of iCTs to test, provided between apostrophes and seprated by single spaces, e.g. '0.80 0.85 0.90'.
-b | --bCT : standardized clustering threshold to be used for between-samples clustering (e.g. 0.90) (optional)
-D | --depth : minimum depth of ipyrad's clusters (ipyrad parameter #11 & #12)
-T | --threads : number of threads to use (default=1)
-t | --type : the data type used (one of: rad, ddrad, gbs, pairddrad, pairgbs, 2brad, pair3rad; ipyard parameter #7)
-s | --scripts : full path to the directory containing the optiRADCT scripts
-p | --tree : if this option is specified, maximum likelihood trees will be inferred with RAxML for each iCT.
-r | --ref : can be used to specify a reference tree in newick format to compare the RAxML trees to (using topological distances). The tip labels must be identical to the samples names in the assemblies. Used only if -p is specified.
-h | --help : prints a help message and exits.
This script will execute ipyrad's steps 1-5 for each of the iCTs specified with -i. If -b is specified, it will then finish each ipyrad run (i.e., run steps 6-7) with the standardized bCT. Assembly metrics for the clusters and loci are harvested internaly with harvest_stats.sh and plots are produced. The assembly metrics are saved in the directory stats_iCT
, and plots vizualizing the different metrics can be found in stats_iCT/iCT_plots.pdf
. Finally, if -p if specified, a directory trees_iCT
is created, containing the assemblies concatenation matrices in phylip format, the RAxML trees and plots showing changes in branch length and bootstrap support in
.
2) optimization of the between-samples Clustering Threshold (3_run_bCT.sh)
Once the best iCT value has been chosen, it is time to optimize the between-samples clustering threshold. This is done by using the clusters generated by ipyrad at the previous step with the chosen iCT to repeat steps 6-7 with a range of bCT.
bash 3_run_bCT.sh -n assemblies_base_name -b '0.80 0.85 0.90' -w /path/to/working/directory -i params-best_iCT.txt -T 4 -s /path/to/scripts -p -r reference.tre
arguments:
-n | --name : a base name for ipyrad's assemblies
-b | --bCT : the range of bCTs to test, provided between apostrophes and seprated by single spaces, e.g. '0.80 0.85 0.90'.
-w | --work : path to working directory, where the outputs of the previous step are saved.
-i | --iCT : name of the parameters file generated by ipyrad corresponding to the chosen iCT value. Assuming the chosen value is 0.95 and the base name in the previous step (-n) was test, the file will be names params-test_iCT_95.txt.
-T | --threads: number of threads to use (default=1)
-s | --scripts : : full path to the directory containing the optiRADCT scripts
-p | --tree : if this option is specified, maximum likelihood trees will be inferred with RAxML for each iCT.
-r | --ref : can be used to specify a reference tree in newick format to compare the RAxML trees to (using topological distances). The tip labels must be identical to the samples names in the assemblies. Used only if -p is specified.
-h | --help : prints a help message and exits.
The outputs are similar to that of the previous step, and saved in the directories stats_bCT
and trees_bCT
Note: the R scripts, harvest_stats.sh and 5_infer_trees.sh are executed internally, so that only 1_run_iCT.sh
and 3_run_bCT.sh
need to be run by the user.