Skip to content

[CRBHits](https://github.com/kullrich/CRBHits) is a reimplementation of the Conditional Reciprocal Best Hit algorithm [crb-blast](https://github.com/cboursnell/crb-blast) in R.

License

Notifications You must be signed in to change notification settings

kullrich/CRBHits

Repository files navigation

CRBHits

pipeline status Contributor Covenant MITlicense DOI

CRBHits - From Conditional Reciprocal Best Hits to Codon Alignments and Ka/Ks in R

R package source code: https://github.com/kullrich/CRBHits

R package pages: https://kullrich.github.io/CRBHits/

R package issues: https://github.com/kullrich/CRBHits/issues

CRBHits - Description

CRBHits is a reimplementation of the Conditional Reciprocal Best Hit (CRBH) algorithm crb-blast in R. It covers all necessary steps from CRBHit pair calculation to Codon Alignments and Ka/Ks.

The CRBH algorithm was introduced by Aubry S, Kelly S et al. (2014) and ported to python shmlast (Scott C. 2017) which benefits from the blast-like sequence search software LAST (Kiełbasa SM et al. 2011).

Like shmlast, CRBHits plots the fitted model of the CRBHit evalue based algorithm. In addition users can filter the CRBHit pairs prior fitting for other criteria like evalue, protein identity and/or the twilight zone of protein sequence alignments according to Rost B. (1999).

CRBHits only takes coding nucleotide sequences as the query and target inputs, since the secondary aim of CRBHits is to calculate synonymous and non-synonymous substitutions with the R package seqinr (Sharif D, Lobry JR. 2007) or the external tool KaKs_Calculator2.0 (Wang D, Zhang Y et al. 2010).

This is in contrast to crb-blast, which can take proteins or nucleotides as the target sequences and in contrast to shmlast, which uses nucleotides as queries and proteins as target sequences.

The resulting CRBHit pairs can be used to obtain pairwise codon alignments, which are further used to calculate synonymous and nonsynonymous substitutions using parallelization.

The Ka/Ks (also sometimes denoted as dN/dS) values can be obtained either via the codon model of Li WH. (1999) as implemented in the R package seqinr or the model of Yang Z and Nielson R. (2000) and other models from the software KaKs_Calculator2.0 as ported in the R package MSA2dist.

The following two images show the two main functions of the package cds2rbh() and rbh2kaks(), which are described in more detail in the package vignettes.

Figure: Overview of the `cds2rbh()` function

Figure: Overview of the `rbh2kaks()` function

Comparing the primary transcripts from Arabidopsis thaliana and Arabidopsis lyrata to get conditional recirpocal best hit pairs takes the following runtime, depending on the selected search tool:

Hardware used:

  • Apple M1 Max (10-core CPU with 8 performance cores and 2 efficiency cores)
search tool version sensitivity threads runtime (sec) crbh.pairs
last v1595 -D 1e6 -m 25 10 21.811 31045
mmseqs2 v15.6f452 -s 1.0 --max-seqs 25 10 38.200 28625
mmseqs2 v15.6f452 -s 4.0 --max-seqs 25 10 36.530 28938
mmseqs2 v15.6f452 -s 5.7 --max-seqs 25 10 50.643 28951
mmseqs2 v15.6f452 -s 7.5 --max-seqs 25 10 115.081 28957
diamond v2.1.10 --fast --max-target-seqs 25 10 19.039 31807
diamond v2.1.10 --sensitive --max-target-seqs 25 10 51.435 31233
lambda3 v3.1.0 -p fast --num-matches 25 10 37.440 33685
lambda3 v3.1.0 -p sensitive --num-matches 25 10 50.093 33185

Overlap of either reciprocal best hit pairs (RBH) or conditional reciprocal best hit pairs (CRBH) found between the search tools:

Figure: Overlap of search tool results

Installation

see also here for the R package pages https://kullrich.github.io/CRBHits/

R specific installation prerequisites

install packages from cran

In most cases you need to first install the following system-wide packages to be able to compile the R dependencies. If you do not have 'sudo' rights, please ask your system-admin to install them for you.

Ubuntu/Debian

sudo apt-get install libcurl4-openssl-dev libssl-dev libxml2-dev libglu1-mesa-dev libgit2-dev
#Pandoc is required to build R Markdown vignettes
#sudo apt-get install pandoc
#pkgdown dependencies - pkgdown is used to build R package pages
#sudo apt-get install libssh2-1-dev libfontconfig1-dev libharfbuzz-dev libfribidi-dev libfreetype6-dev libpng-dev libtiff5-dev libjpeg-dev

CentOS

sudo yum install libcurl-devel openssl-devel libxml2-devel mesa-libGLU-devel libgit2-devel
#Pandoc is required to build R Markdown vignettes
#sudo yum install pandoc
#pkgdown dependencies - pkgdown is used to build R package pages
#sudo yum install libssh2-devel fontconfig-devel harfbuzz-devel fribidi-devel freetype-devel libpng-devel libtiff-devel libjpeg-turbo-devel
install.packages("devtools")

install CRBHits

devtools::install_github("kullrich/CRBHits", build_vignettes = FALSE, dependencies = TRUE)
CRBHits::make_last()
CRBHits::make_dagchainer()

If the functions CRBHits::make_last() and CRBHits::make_dagchainer() fail on your system to install the prerequisites, please take a look at the detailed description how to install via e.g. conda or to compile from source.

Vignettes

These vignettes introduce CRBHits

  • CRBHits Basic Vignette - Basic Usage of CRBHits
    • includes CRBHit pair calculation
    • includes CRBHit pair filtering
    • includes Longest Isoform selection
    • includes Codon alignments
    • includes Ka/Ks calculations
  • KaKs Vignette - KaKs Calculations between two species and subsequent data filter steps
    • includes Arabidopsis thalina and Arabidopsis lyrata CRBHit pair calculation
    • includes Homo sapiens and Pan troglodytes CRBHit pair calculation
    • inlcudes Longest Isoform selection
    • includes Gene/Isoform chromosomal position extraction
    • includes Tandem Duplicate Assignment
    • includes Synteny Assignment
    • includes Ka/Ks colored Dot-Plot

Quick-guide

library(CRBHits)
## prerequisite: last
## if not done yet, try to compile last-1595 - uncomment the following line
#CRBHits::make_last()
## conditional reciprocal best hits (CRBHit pairs)
data("ath", package="CRBHits")
data("aly", package="CRBHits")
ath_aly_crbh <- cds2rbh(
    cds1=ath,
    cds2=aly,
    plotCurve=TRUE)
summary(ath_aly_crbh)
?cds2rbh

##kaks calculation - subset model "Li"
ath_aly_crbh$crbh.pairs <- head(ath_aly_crbh$crbh.pairs, 20)
ath_aly_crbh.kaks.Li <- rbh2kaks(
    rbhpairs=ath_aly_crbh,
    cds1=ath,
    cds2=aly,
    model="Li")
head(ath_aly_crbh.kaks.Li)
?rbh2kaks

## plot kaks
g.kaks.Li <- plot_kaks(ath_aly_crbh.kaks.Li)
?plot_kaks

## kaks calculation - subset model "YN"
ath_aly_crbh.kaks.YN <- rbh2kaks(
    rbhpairs=ath_aly_crbh,
    cds1=ath,
    cds2=aly,
    model="YN")

## kaks calculation using multiple threads
ath_aly_crbh.kaks <- rbh2kaks(
    rbhpairs=ath_aly_crbh,
    cds1=ath,
    cds2=aly,
    threads=2)
head(ath_aly_crbh.kaks)

## example how to calculate all pairwise kaks values given a MSA
data(hiv)
hiv_kaks.Li <- dnastring2kaks(
    cds=hiv,
    model="Li")
g.Li <- plot_kaks(hiv_kaks.Li)
g.Li

data(hiv)
hiv_kaks.NG86 <- dnastring2kaks(
    cds=hiv,
    model="NG86")
g.NG86 <- plot_kaks(hiv_kaks.NG86)
g.NG86

## codon plot - sites under possible positive selection
library(tidyr)
library(dplyr)
library(ggplot2)
hiv.xy <- codonmat2xy(dnastring2codonmat(hiv))
hiv.xy %>% select(Codon, SynMean, NonSynMean, IndelMean) %>%
    gather(variable, values, -Codon) %>% 
    ggplot(aes(x=Codon, y=values)) + 
    geom_line(aes(colour=factor(variable))) + 
    geom_point(aes(colour=factor(variable))) + 
    ggtitle("HIV-1 sample 136 patient 1 from Sweden envelope glycoprotein (env) gene")

##### see section install external tools via bioconda
## example how to use conda version of LAST
my.lastpath <- paste0(dirname(system2("which", "lastdb", stdout=TRUE)), "/")
ath_aly_crbh <- cds2rbh(ath, aly, plotCurve = TRUE,
                        lastpath = my.lastpath)
?cds2rbh

##### see section compile external tools from original source code
## example how to use own compiled version of LAST
my.lastpath <- "/tmp/last/last-1595/bin"
ath_aly_crbh <- cds2rbh(ath, aly, plotCurve = TRUE,
                        lastpath = my.lastpath)
?cds2rbh

## example how to use conda version of KaKs_Calculator2.0
my.kakspath <- "/tmp/KaKs_Calculator2/KaKs_Calculator2.0_src/src"
ath_aly_crbh.kaks <- rbh2kaks(ath_aly_crbh,
                              ath, aly, model = "YN",
                              kakscalcpath = my.kakspath)
?rbh2kaks

External tools installation prerequisites - detailed description {#prerequisites}

install external tools via bioconda

If the functions CRBHits::make_last() and CRBHits::make_dagchainer() fail on your system to install the prerequisites, there is the possibility to install them via conda:

conda config --add channels bioconda
conda config --add channels conda-forge

conda install last
conda install dagchainer

#optional sequence search algorithm
conda install mmseqs2
conda install diamond
conda install lambda

After this installation, the prerequisites are supposed to be in the PATH and you need to set the correct @param in the corresponding functions of CRBHits like this:

## example how to use conda versions of LAST and DAGchainer
my.lastpath <- paste0(dirname(system2("which", "lastdb", stdout=TRUE)), "/")
my.dagchainerpath <- paste0(dirname(system2("which", "dagchainer", stdout=TRUE)), "/")

?cds2rbh
cds2rbh(., ., lastpath=my.lastpath)

?rbh2dagchainer
rbh2dagchainer(., ., dagchainerpath=my.dagchainerpath)

## example how to use conda versions of MMSEQS2 and DIAMOND and LAMBDA
my.mmseqs2path <- paste0(dirname(system2("which", "mmseqs", stdout=TRUE)), "/")
my.diamondpath <- paste0(dirname(system2("which", "diamond", stdout=TRUE)), "/")
my.lambda3path <- paste0(dirname(system2("which", "lambda3", stdout=TRUE)), "/")

?cds2rbh
cds2rbh(., ., mmseqs2path=my.mmseqs2path, searchtool="mmseqs2")
cds2rbh(., ., diamondpath=my.diamondpath, searchtool="diamond")
cds2rbh(., ., lambda3path=my.lambda3path, searchtool="lambda3")

compile external tools from source code forked within this package

The source code for the prerequisites (LAST, DAGchainer) are forked within CRBHits.

To compile the forked version of LAST within the CRBHits R package directory try to use the function make_last():

## see more installation information here, if make fails
## last-install-help: https://gitlab.com/mcfrith/last
CRBHits::make_last()

All models from KaKs_Calculator2.0 have been ported to the MSA2dist R package with Rcpp and can be obtained via the MSA2dist::dnastring2kaks() function. There is no need anymore to compile this external tool.

## install MSA2dist
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

BiocManager::install("MSA2dist")
?MSA2dist::dnastring2kaks()

To compile the forked version of DAGchainer within the CRBHits R package directory try to use the function make_dagchainer():

## compile DAGchainer
CRBHits::make_dagchainer()

compile external tools from original source code

If you would like to install the latest version of the tools, you need to download the source code and compile again.

To compile LAST yourself on Linux/Unix/macOS into another folder:

## create and change into the directory to install LAST
## e.g.
mkdir /tmp/last
cd /tmp/last
## donwload last-1595
curl -O https://gitlab.com/mcfrith/last/-/archive/1550/last-1595.zip
unzip last-1595.zip
cd last-1595
## compile LAST
make

All models from KaKs_Calculator2.0 have been ported to the MSA2dist R package with Rcpp and can be obtained via the MSA2dist::dnastring2kaks() function. There is no need anymore to compile this external tool.

To compile DAGchainer:

Note: Due to some changes in the latest g++ compilers the source code was altered to meet this changes, which are directly incorporated into the dagchainer.zip that is distributed with CRBHits. It is recommended to compile from this file (see below):

## create and change into the directory to install DAGchainer
## e.g.
mkdir /tmp/DAGchainer
cd /tmp/DAGchainer
## donwload DAGchainer
curl -O https://github.com/kullrich/CRBHits/-/raw/devel/inst/extdata/dagchainer.zip
unzip dagchainer.zip
cd dagchainer
## compile DAGchainer
make

If you would like to use your own compiled versions of LAST and DAGchainer you need to set the correct @param in the corresponding functions of CRBHits.

## example how to use own compiled versions of LAST, KaKs_Calculator2.0 and DAGchainer
my.lastpath <- "/tmp/last/last-1595/bin"
my.dagchainerpath <- "/tmp/dagcahiner"

?cds2rbh
cds2rbh(., ., lastpath=my.lastpath)

?rbh2dagchainer
rbh2dagchainer(., ., dagchainerpath=my.dagchainerpath)

To compile MMseqs2 Pre-compiled MMseqs2 binaries

To compile diamond Pre-compiled diamond binaries

To compile lambda3 Pre-compiled lambda3 binaries

License

MIT (see LICENSE)

The CRBHits package includes source code that has been published under following licenses:

last-1595.zip

GNU General Public License Version 3, 29 June 2007 GPLv3

KaKs_Calculator2.0.tar.gz

The toolkit is freely available (licensed under GPLv3) online at https://sourceforge.net/projects/kakscalculator2/.

DAGchainer

All of this software is open source, free of license restrictions.

Contributing Code

If you would like to contribute to CRBHits, please file an issue so that one can establish a statement of need, avoid redundant work, and track progress on your contribution.

Before you do a pull request, you should always file an issue and make sure that someone from the CRBHits developer team agrees that it’s a problem, and is happy with your basic proposal for fixing it.

Once an issue has been filed and we've identified how to best orient your contribution with package development as a whole, fork the main repo, branch off a feature branch from devel, commit and push your changes to your fork and submit a pull request for CRBHits:devel.

By contributing to this project, you agree to abide by the Code of Conduct terms.

Bug reports

Please report any errors or requests regarding CRBHits to Kristian Ullrich (ullrich@evolbio.mpg.de)

or use the issue tracker at https://github.com/kullrich/CRBHits/issues

Code of Conduct - Participation guidelines

This repository adhere to Contributor Covenant code of conduct for in any interactions you have within this project. (see Code of Conduct)

See also the policy against sexualized discrimination, harassment and violence for the Max Planck Society Code-of-Conduct.

By contributing to this project, you agree to abide by its terms.

References

Aubry S., Kelly S., Kümpers B. M., Smith-Unna R. D., and Hibberd J. M. (2014). Deep evolutionary comparison of gene expression identifies parallel recruitment of trans-factors in two independent origins of C4 photosynthesis. PLoS Genetics, 10(6). https://doi.org/10.1371/journal.pgen.1004365

Buchfink B., Reuter K., Drost HG. (2021). Sensitive protein alignments at tree-of-life scale using DIAMOND. Nature Methods, 18, 366-368. https://doi.org/10.1038/s41592-021-01101-x

Charif D., and Lobry J. R. (2007). SeqinR 1.0-2: a contributed package to the R project for statistical computing devoted to biological sequences retrieval and analysis. In Structural approaches to sequence evolution (pp. 207-232). Springer, Berlin, Heidelberg. https://link.springer.com/chapter/10.1007/978-3-540-35306-5_10

Duong T., and Wand M. (2015). feature: Local Inferential Feature Significance for Multivariate Kernel Density Estimation. R package version 1.2.13. https://cran.r-project.org/web/packages/feature/

Ganeshan S., Dickover RE., Korber BT., Bryson YJ., Wolinsky SM. (1997). Human immunodeficiency virus type 1 genetic evolution in children with different rates of development of disease. Journal of Virology, 71(1), 663-677.

Haas B. J., Delcher A. L., Wortman J. R., and Salzberg S. L. (2004). DAGchainer: a tool for mining segmental genome duplications and synteny. Bioinformatics, 20(18), 3643-3646. https://doi.org/10.1093/bioinformatics/bth397

Haug-Baltzell A., Stephens S. A., Davey S., Scheidegger C. E., Lyons E. (2017). SynMap2 and SynMap3D: web-based wholge-genome synteny browsers. Bioinformatics, 33(14), 2197-2198. https://academic.oup.com/bioinformatics/article/33/14/2197/3072872

Hauswedell, H., Hetzel, S., Gottlieb, S. G., Kretzmer, H., Meissner, A., and Reinert, K. (2024). Lambda3: homology search for protein, nucleotide, and bisulfite-converted sequences. Bioinformatics, 40(3), btae097. https://doi.org/10.1093/bioinformatics/btae097

Kiełbasa S. M., Wan R., Sato K., Horton P., and Frith M. C. (2011). Adaptive seeds tame genomic sequence comparison. Genome Research, 21(3), 487-493. https://doi.org/10.1101/gr.113985.110

Kimura M. (1977). Preponderance of synonymous changes as evidence for the neutral theory of molecular evolution. Nature, 267, 275-276.

Li W. H. (1993). Unbiased estimation of the rates of synonymous and nonsynonymous substitution. Journal of Molecular Evolution, 36(1), 96-99. https://doi.org/10.1007/bf02407308

Microsoft, and Weston S. (2020). foreach: Provides Foreach Looping Construct. R package version, 1.5.1. foreach

Mugal C. F., Wolf J. B. W., Kaj I. (2014). Why Time Matters: Codon Evolution and the Temproal Dynamics of dN/dS. Molecular Biology and Evolution, 31(1), 212-231.

Nei M. and Gojobori T. (1986). Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Molecular Biology and Evolution, 3(5), 418-426. https://doi.org/10.1093/oxfordjournals.molbev.a040410

Ooms J. (2019). curl: A Modern and Flexible Web Client for R. R package version, 4.3. curl

Ota T. and Nei M. (1994). Variance and covariances of the numbers of synonymous and nonsynonymous substitutions per site . Molecular Biology and Evolution, 11(4), 613-619. https://doi.org/10.1093/oxfordjournals.molbev.a040140

Pagès H., Aboyoun P., Gentleman R., and DebRoy S. (2017). Biostrings: Efficient manipulation of biological strings. R package version, 2.56.0. Biostrings

Paradis E., Schliep K. (2019). ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics, 35, 526-528.

Revolution Analytics, and Weston S. (2020). doMC: Foreach Parallel Adaptor for 'parallel'. R package version, 1.3.7. doMC

Rost B. (1999). Twilight zone of protein sequence alignments. Protein Engineering, 12(2), 85-94. https://doi.org/10.1093/protein/12.2.85

Scott C. (2017). shmlast: an improved implementation of conditional reciprocal best hits with LAST and Python. Journal of Open Source Software, 2(9), 142. https://joss.theoj.org/papers/10.21105/joss.00142

Scrucca L., Fop M., Murphy T. B., and Raftery A. E. (2016). mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. The R Journal, 8(1), 289-317. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5096736/

Steinegger M., and Soeding J. (2017). MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, 35, 1026-1028. https://doi.org/10.1038/nbt.3988

Wickham H. (2011). testthat: Get Started with Testing. The R Journal, 3(1), 5. testthat

Wickham H. (2019). stringr: Simple, Consistent Wrappers for Common String Operations. R package version, 1.4.0. stringr

Wickham H. (2020). tidyr: Tidy Messy Data. R package version, 1.1.2. tidyr

Wickham H., Hester J., and Chang W. (2020). devtools: Tools to make Developing R Packages Easier. R package version, (2.3.2). devtools

Wickham H., François R., Henry L., and Müller K. (2020). dplyr: A Grammar of Data Manipulation. R package version, 1.0.2. dplyr

Yang Z., and Nielsen R. (2000). Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Molecular Biology and Evolution, 17(1), 32-43. https://doi.org/10.1093/oxfordjournals.molbev.a026236

Yang Z., Nielsen R., Goldman N., Pedersen AM. (2000). Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics, 155(1), 431-449. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1461088