Introduction
The code enclosed in this repository provides an extension of the Normalized Phylogenetic Profiling algorithm. Phylogenetic profiling (PP) is an evolution bases method for prediction of gene function or interaction. PP relies on the assumption that genes that are functionally related (for example, participate in the same cellular pathway or are members of a common protein complex) manifest correlated patterns of occurrence and absence across the phylogenetic tree. In PP analysis, each gene is represented using a phylogeneitc profile, a vector indiacating the presence or absence of a homolog for the gene across many evolutionarily diverse species. Given a gene or a set of genes of interest, the algorithm searches for additonal genes with similiar phylogenetic profiles, thus highlighting potential funcational or interaction partners of the query gene(s).
While most current PP methods examine co-evolution between genes across the entire tree of life (global co-evolution), CladePP combines signals from global co-evolution with signals from gene pairs that are co-evolved within the context of a specific clade (local co-evolution) in order to refine predictions.
Perquisites
CladePP is implemented in R and requires the R libriries plyr, dplyr and optparse avilable from CRAN. Calculating a false discovery rate for the predictions also requires the qvalue library from Bioconductor or from https://github.com/StoreyLab/qvalue.
Generating CladePP predictions
The main algorithm is run using the CladePP.R script. The script takes three inputs:
clusters - a RDS file with a list of of hclust objects. Each object represents the output of hierarchial clustering of a subset of columns of a noramlized phylogenetic profilign matrix corresponding to a given clade. The names of the list should correspond to the names of the clades.
genelist - a text file with the gene symbols of the genes of interest.
outfile - path of the desired output file. Needs to have a .tsv extension.
The output of the CladePP script is a tab delimited table which indicates for each gene its score (higher score indicates that the phylogenetic profile of the gene clusters tightly with the phylogeneitc profile of gene(s) in the query genelist. See out paper for details on how the score is computed), with which genes in the query it co-evolved and in which clade.
Estimating a p-value for the predictions
Once you have generated your predictions, you may want to estimate how tightly the genes reported in the outouted are not due to chance. To do so, we provide two scripts for estimating an emprical p-value for the scores in the output (See paper for details). In the first stage, the CladePP algorhtim is ran agianst random set of genes in the same size as the query set using the simulate.R script.
This script takes four inputs:
clusters, genelist - identical to the inputs of CladePP.R (see above)
n - the number of random sets to generate. In our paper we tested our predictions against 10,000 random gene sets. For generating a large number of random sets, it is recommended to parallelize the running of the script and/or use high performence computing/cloud platform.
outfile - Path to which the output will be written. Needs to have a .rds extension. It is recommended that outputs of this script will be stored in a dedicated directory since the Calculate_FDR.R script called after this script assumes all rds files in its input directory contained simulations generated by this script.
After the simulated sets are generated and scores are calculated for each of them, a p-value and false discovery rate are estimated by running the script Calculate_FDR.R. This script takes three inputs:
permdir - Directory containing the output of simulate.R (and only them! The script assumes that any rds files in this directory are generated by simulate.R. If any other rds files are found in this directory the script is likely to fail). MRStable - output of CladePP.R. outfile - file to which to write the output.
The output of this script consists of the table generated by CladePP.R to which it adds three columns with the p-value, q-value and FDR for each gene.
Citing and algorithm details
Full details of the algorithm are described in
Sherill-Rofe, Dana et al. (forthcoming) Mapping global and local co-evolution across 600 species to identify novel homologous recombination repair genes.