btest: a method to link, rank, and visualize associations among omics features across multi-omics datasets
btest is a general-purpose, well-powered tool for association discovery in paired multi-omics datasets.
Citation:
Bahar Sayoldin, Mahdi Baghbanzadeh, Keith A. Crandall, Ali Rahnavard, btest: link, rank, and visualize associations among omics features across multi-omics datasets https://github.com/omicsEye/btest
- For installation and a quick demo, read the Initial Installation
btest combines block nonparametric hypothesis testing with false discovery rate correction to enable high-sensitivity discovery of linear and non-linear associations in high-dimensional datasets (which may be categorical, continuous, or mixed). btest performs its tests by:
- Rapid correlation testing for interdimensional paired omics datasets.
- Providing informative associations by incorporating relationships both within and among datasets.
- Clustering associations to form blocks of significantly associated features.
- Generating high-quality visualizations of the data.
- Providing guidelines for adjusting data for covariates.
Please join the community discussion forum at omicsEye/btest
- Features
- Overview workflow
- Requirements
- Initial Installation
- How to run
- Output files
- Result plots
- Configuration
- Tutorials
- Tools
- FAQs
- Complete option list
- Generality: btest can handle datasets from various omics profiles.
- Efficiency: the design and implementation of its functions are intended to work with large data.
- Reliability: btest uses multiple hypothesis testing in paired omics data.
- False discovery rate (FDR) correction using the Benjamini–Hochberg (BH) procedure.
- A simple user interface (single command driven flow).
- The user only needs to provide a paired dataset.
- File types: tab-delimited text files with features as rows (row names are feature names) and samples as columns. Sample names may be given as column headers; if the files have no header, the samples must appear in the same order in both files.
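To make the input layout concrete, here is a minimal sketch that writes a small paired dataset as two tab-delimited files with one row per feature and one column per sample. The file names, feature names, the `feature` header cell, and all values are invented for illustration; only the layout follows the description above. The usage text later in this README lists a `--header` flag for input files that contain a header line.

```python
import csv
import os
import tempfile

def write_feature_table(path, sample_names, rows):
    """Write a tab-delimited table: a header row of sample names, then one row per feature."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["feature"] + list(sample_names))
        for feature_name, values in rows.items():
            writer.writerow([feature_name] + list(values))

# The same samples, in the same order, are used for both files (paired data).
samples = ["S1", "S2", "S3", "S4"]
x_rows = {"gene_A": [1.2, 0.4, 3.1, 2.2], "gene_B": [0.0, 5.5, 1.1, 0.7]}
y_rows = {"metab_1": [10, 12, 9, 14], "metab_2": [0.3, 0.1, 0.9, 0.5]}

out_dir = tempfile.mkdtemp()
x_path = os.path.join(out_dir, "X_example.tsv")
y_path = os.path.join(out_dir, "Y_example.tsv")
write_feature_table(x_path, samples, x_rows)
write_feature_table(y_path, samples, y_rows)
print("wrote", x_path, "and", y_path)
```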
- Python (version >= 2.7 or >= 3.4)
- Numpy (version >= 1.9.2) (automatically installed)
- Scipy (version >= 0.17.1) (automatically installed)
- Matplotlib (version >= 1.5.1) (automatically installed)
- Scikit-learn (version >= 0.14.1) (automatically installed)
- pandas (version >= 0.18.1) (automatically installed)
- Memory usage depends on input size, mainly the number of features in each dataset
- Runtime depends on input size, mainly the number of features in each dataset, and on the similarity score that has been chosen
- Operating system (Linux, Mac, or Windows)
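If you want to check an existing environment against the minimum versions listed above, a small script along these lines may help. It is only a sketch: it assumes Python 3.8 or newer (for `importlib.metadata`), and the helper names `parse_version` and `check_requirements` are made up for this example and are not part of btest.

```python
from importlib import metadata

# Minimum versions copied from the requirements list above.
MINIMUM_VERSIONS = {
    "numpy": (1, 9, 2),
    "scipy": (0, 17, 1),
    "matplotlib": (1, 5, 1),
    "scikit-learn": (0, 14, 1),
    "pandas": (0, 18, 1),
}

def parse_version(text):
    """Turn '1.21.0rc1' into (1, 21, 0) by keeping the leading digits of each part."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for char in piece:
            if not char.isdigit():
                break
            digits += char
        parts.append(int(digits) if digits else 0)
    return tuple(parts)

def check_requirements(minimums=MINIMUM_VERSIONS):
    """Return {package: (installed_version_or_None, meets_minimum)}."""
    report = {}
    for package, minimum in minimums.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            report[package] = (None, False)
            continue
        report[package] = (installed, parse_version(installed) >= minimum)
    return report

for package, (installed, ok) in check_requirements().items():
    print(package, installed, "OK" if ok else "missing or too old")
```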
- First install conda
Go to the Anaconda website and download the latest version for your operating system.
- For Windows users: DO NOT FORGET TO ADD CONDA TO YOUR SYSTEM PATH
- Second is to check for conda availability
open a terminal (or command line for Windows users) and run:
conda --version
It should output something like:
conda 4.9.2
If not, you must make conda available to your system before continuing. If you have problems adding conda to PATH, you can find instructions here.
If you are using an Apple M1 Mac, please go to the Apple M1 Mac section for installation instructions.
If you have a working conda on your system, you can safely skip to step three.
- Create a new conda environment (let's call it btest_env) with the following command:
conda create --name btest_env python=3.8
- Activate your conda environment:
conda activate btest_env
- Install btest: you can install it directly from GitHub:
python -m pip install git+https://github.com/omicsEye/btest
- Update/install Xcode Command Line Tools
xcode-select --install
- Install Brew
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
- Install libraries for brew
brew install cmake libomp
- Install miniforge
brew install miniforge
- Close the current terminal and open a new terminal
- Create a new conda environment (let's call it btest_env) with the following command:
conda create --name btest_env python=3.8
- Activate the conda environment
conda activate btest_env
- Install packages from Conda
conda install numpy scipy scikit-learn==0.23.2
Then
conda install lightgbm
pip install xgboost
- Finally, install btest: you can install it directly from GitHub:
python -m pip install git+https://github.com/omicsEye/btest
- $ pip install btest
- This command will automatically install btest and its dependencies.
- To overwrite existing installations of dependencies, use "-U" to force an update.
- To keep the existing versions of dependencies, use "--no-dependencies".
- If you do not have write permissions to '/usr/lib/', add the option "--user" to the btest install command. This option installs the Python package into subdirectories of '~/.local' on Linux. Note that when using the "--user" install option on some platforms, you might need to add '~/.local/bin/' to your $PATH, as the default might not include it. You will know it needs to be added if you see the following message
btest: command not found
when trying to run btest after installing with the "--user" option.
- If you use the Windows operating system, you can install btest with administrator permission (open a terminal with administrator permission; the rest is the same).
To test if btest is installed correctly, you may run the following command in the terminal:
btest -h
Example command from a real study:
btest -X BCAA.tsv -Y lipoproteins.tsv -o btest_BCCA --fdr 0.05 --diagnostics-plot
blockplot btest_BCCA/simtable.tsv btest_BCCA/X_Y.tsv --strongest 100 --similarity Spearman --axlabels "BCAAs" "Lipoproteins" --outfile btest_BCCA/hetamap_100.pdf
b_scatter --datax BCAA.tsv --datay lipoproteins.tsv --b_test btest_BCCA/X_Y.tsv --ind 0-440 --out btest_BCCA/scatters
Running `btest -h` prints the btest command line options:
usage: btest [-h] [--version] -X <input_dataset_1.txt> [-Y <input_dataset_2.txt>] -o <output> [-m {pearson,spearman,kendall}] [--fdr FDR] [--var MIN_VAR] [-v VERBOSE] [--diagnostics-plot]
[--header] [--format-feature-names] [-s SEED]
block-wise association testing
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-X <input_dataset_1.txt>
first file: Tab-delimited text input file, one row per feature, one column per measurement
[REQUIRED]
-Y <input_dataset_2.txt>
second file: Tab-delimited text input file, one row per feature, one column per measurement
[default = the first file (-X)]
-o <output>, --output <output>
directory to write output files
[REQUIRED]
-m {pearson,spearman,kendall}
metric to be used for similarity measurement
[default = 'spearman']
--fdr FDR Target FDR correction using BH approach
--var MIN_VAR Minimum variation to keep a feature in tests
-v VERBOSE, --verbose VERBOSE
additional output is printed
--diagnostics-plot Diagnostics plot for associations
--header the input files contain a header line
--format-feature-names
Replaces special characters and for OTUs separated by | uses the known end of a clade
-s SEED, --seed SEED a seed number to make the random permutation reproducible
[default = 0,and -1 for random number]
- Test out the install with unit and functional tests
$ btest_test
With btest installed you can try out a demo run using two sample synthetic datasets.
$ btest -X demo/X_dataset.txt -Y demo/Y_dataset.txt -o $OUTPUT_DIR --fdr 0.1
The output from this demo run will be written to the folder $OUTPUT_DIR.
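When btest is driven from a script rather than typed by hand, the demo command can be assembled programmatically. The sketch below only builds the argument list, using flags that appear in the `btest -h` output above; the function name `build_btest_command` and the output directory name are invented for this example.

```python
import shlex

def build_btest_command(x_path, y_path, output_dir, fdr=None, metric=None,
                        diagnostics_plot=False):
    """Assemble a btest argument list using only flags shown in `btest -h`."""
    command = ["btest", "-X", x_path, "-Y", y_path, "-o", output_dir]
    if fdr is not None:
        command += ["--fdr", str(fdr)]
    if metric is not None:
        command += ["-m", metric]
    if diagnostics_plot:
        command.append("--diagnostics-plot")
    return command

command = build_btest_command("demo/X_dataset.txt", "demo/Y_dataset.txt",
                              "btest_demo_output", fdr=0.1)
print(" ".join(shlex.quote(part) for part in command))

# To actually run it (requires btest to be installed and on PATH):
# import subprocess
# subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.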
If you have already installed btest using the Initial Installation steps and would like to upgrade to the latest version, run:
sudo -H pip install btest --upgrade --no-deps
or
pip install btest --upgrade --no-deps
This command upgrades btest to the latest version and ignores updating btest's dependencies.
When btest is completed, three main output files will be created:
- File name:
$OUTPUT_DIR/associations.txt
- This file details the associations. Features are grouped in clusters that participated in an association with another cluster.
- Feature_1: a feature from the first dataset that participates in the association.
- Feature_2: a feature from the second dataset that participates in the association.
- Correlation coefficient: from the correlation test.
- complete_obs: number of complete observations between the two features.
- t_statistics: test statistic.
- pvalue: from the correlation test.
- P_adjusted: adjusted p-value (observed FDR).
- critical_bh_pval: target FDR for the Benjamini–Hochberg approach.
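For readers unfamiliar with the Benjamini–Hochberg procedure behind the adjusted p-value and critical value columns, the following is a minimal NumPy sketch of the textbook procedure. It is not btest's own code, and whether btest computes `P_adjusted` and `critical_bh_pval` in exactly this way is an assumption; the function name is hypothetical.

```python
import numpy as np

def benjamini_hochberg(pvalues, fdr=0.1):
    """Return (adjusted_pvalues, critical_values, rejected) for the BH procedure."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranks = np.arange(1, m + 1)
    sorted_p = p[order]
    # Critical value for the i-th smallest p-value is (i / m) * fdr.
    critical_sorted = ranks / m * fdr
    # Adjusted p-value: running minimum, taken from the largest rank down, of p * m / rank.
    adjusted_sorted = np.minimum.accumulate((sorted_p * m / ranks)[::-1])[::-1]
    adjusted_sorted = np.minimum(adjusted_sorted, 1.0)
    # Reject every hypothesis up to the largest rank whose p-value is within its critical value.
    below = sorted_p <= critical_sorted
    cutoff = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    rejected_sorted = ranks <= cutoff
    # Map the results back to the input order.
    adjusted = np.empty(m)
    critical = np.empty(m)
    rejected = np.empty(m, dtype=bool)
    adjusted[order] = adjusted_sorted
    critical[order] = critical_sorted
    rejected[order] = rejected_sorted
    return adjusted, critical, rejected

adjusted, critical, rejected = benjamini_hochberg([0.001, 0.04, 0.03, 0.5], fdr=0.1)
print(adjusted, critical, rejected)
```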
1. First dataset heatmap
2. Second dataset heatmap
3. Associations blockplot
4. Diagnostics scatter or confusion matrix plot
![](github.com/omicsEye/btest/img/hierarchical_heatmap_spearman_1.png =15x)
![](github.com/omicsEye/btest/img/hierarchical_heatmap_spearman_2.png =15x)
![](github.com/omicsEye/btest/img/blockplot_strongest_7.png =20x)
- File name:
$OUTPUT_DIR/blockplot.pdf
- This file visualizes the results as a heatmap. Rows are the features from the first dataset that participated in at least one association, ordered as in a hierarchical clustering linkage with the average method. Columns are the features from the second dataset that participated in at least one association, ordered in the same way.
- Each cell color represents the pairwise similarity between individual features.
- The number on each block is the rank of the significant association: associations are ordered by similarity score in descending order (largest first), and by p-value in ascending order when similarity scores are equal.
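The numbering rule for blocks (similarity score descending, then p-value ascending for ties) can be sketched with a plain sort. The records and field names below are placeholders invented for this example, not btest's column names or internal data structures.

```python
# Hypothetical association records; field names are placeholders.
associations = [
    {"pair": ("x1", "y3"), "similarity": 0.82, "pvalue": 0.004},
    {"pair": ("x2", "y1"), "similarity": 0.91, "pvalue": 0.001},
    {"pair": ("x4", "y2"), "similarity": 0.82, "pvalue": 0.002},
]

def rank_associations(records):
    """Sort by similarity (largest first); break ties by p-value (smallest first)."""
    ordered = sorted(records, key=lambda r: (-r["similarity"], r["pvalue"]))
    return [dict(record, rank=i) for i, record in enumerate(ordered, start=1)]

ranked = rank_associations(associations)
for record in ranked:
    print(record["rank"], record["pair"], record["similarity"], record["pvalue"])
```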
![](github.com/omicsEye/btest/img/Scatter_association1.png =20x)
- If the `--diagnostics-plot` option is provided on the btest command line, a set of plots is produced for each association at the end of the btest run.
- File name:
$OUTPUT_DIR/diagnostics_plot/association_1/Scatter_association1.pdf
- This file visualizes Association 1. The X's are a cluster of features from the first dataset that is significantly associated with a cluster of features, the Y's, from the second dataset. The scatter plot shows what the association looks like within each cluster and between the individual features.
btest produces a performance file to store user configuration settings. This configuration file is automatically created in the output directory.
$ vi btest.log
btest version: 1.1.1
btest can be used to test the relationship between metadata (e.g. age and gender) and data (e.g. microbial species abundance and immune cell counts). In this case, related (covarying) metadata cluster together. When two datasets are tested against each other, such as microbiome vs. metabolites, the effect of covariates (e.g. age, gender, and batch effect) should be regressed out of both datasets (here, microbial species and metabolites). Users should adjust for covariates themselves. Here we provide two examples in R showing how to adjust for a variable.
- Adjust for age: let's regress out the age effect from microbial species or metabolites:
# R: the residuals of each fit are the age-adjusted values (lm is used because there is no random effect)
lm(microbe ~ age, data = microbial_abundance_data)
lm(metabolite ~ age, data = metabolites_data)
- Adjust for time: adjustment involving group structure is more complex, and we recommend reading Winkler et al., Neuroimage, 2014, "Permutation inference for the general linear model". As a simple example, assume we have microbial samples from the same subject at several time points. A linear mixed-effects model is fit with the R lme4 package to each microbial species or metabolite, of the form:
lmer(microbe ~ 1 + (1 | subject) + time, data = microbial_abundance_data)
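For users who prepare their data in Python rather than R, the simple age adjustment can be sketched with NumPy as an ordinary least-squares fit followed by taking residuals. This covers only the fixed-effect case, not the mixed-effects model with repeated measures; the function name `regress_out` and the simulated data are assumptions made for the example.

```python
import numpy as np

def regress_out(values, covariate):
    """Return residuals of values after an ordinary least-squares fit on covariate."""
    values = np.asarray(values, dtype=float)
    covariate = np.asarray(covariate, dtype=float)
    # Design matrix with an intercept column and the covariate.
    design = np.column_stack([np.ones_like(covariate), covariate])
    coefficients, *_ = np.linalg.lstsq(design, values, rcond=None)
    return values - design @ coefficients

rng = np.random.default_rng(0)
age = rng.uniform(20, 70, size=50)
# Simulated abundance that depends on age plus noise.
microbe = 0.05 * age + rng.normal(0, 1, size=50)
adjusted = regress_out(microbe, age)
print("correlation with age before:", np.corrcoef(microbe, age)[0, 1])
print("correlation with age after: ", np.corrcoef(adjusted, age)[0, 1])
```

The adjusted values would then be written back to the input table, one feature per row, before running btest.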
btest by default uses 0.1 as the target false discovery rate. Users can change it to the desired value, for example 0.05 or 0.25, by using `--fdr 0.05`.
btest's implementation and hypothesis testing scheme are highly general, allowing them to be used with a wide variety of similarity measures. For the similarity measurement option we recommend: 1) Spearman coefficient for continuous data, 2) (default for btest) normalized mutual information (NMI) for mixed data (continuous, categorical, and binary data), and 3) discretized maximal information coefficient (DMIC) for complicated association types such as sine waves in continuous data. The similarity measures implemented in the current version of btest that users can select are: Spearman coefficient, discretized normalized mutual information, discretized adjusted mutual information, discretized maximal information coefficient, Pearson correlation, and distance correlation (dCor).
For example, `-m spearman` sets the similarity measurement to the Spearman coefficient and automatically bypasses the discretizing step. For the available similarity metrics, please look at the btest options using `btest -h`.
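To illustrate why a rank-based measure is recommended for continuous data, the sketch below compares Pearson and Spearman coefficients on a monotonic but non-linear relationship. It is a NumPy illustration, not btest's implementation, and the simple `rank` helper assumes there are no tied values.

```python
import numpy as np

def rank(values):
    """Ranks 1..n for values without ties (ties would need average ranks)."""
    values = np.asarray(values)
    ranks = np.empty(values.size)
    ranks[np.argsort(values)] = np.arange(1, values.size + 1)
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two vectors."""
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y):
    """Spearman coefficient computed as the Pearson correlation of the ranks."""
    return pearson(rank(x), rank(y))

x = np.linspace(0.0, 5.0, 30)
y = np.exp(x)  # monotonic but strongly non-linear
print("Pearson: ", round(pearson(x, y), 3))
print("Spearman:", round(spearman(x, y), 3))
```

Because the relationship is perfectly monotonic, the rank-based coefficient reaches its maximum while the linear coefficient does not.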
btest uses the medoid of each cluster as its representative when testing the relation between clusters. A user can choose other decomposition methods, such as PCA, ICA, or MCA, using `-d`. For example, `-d pca` will use the first principal component of a cluster as its representative.
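The two kinds of cluster representative mentioned above can be sketched as follows for a block of features (rows) by samples (columns). This is an illustration only: the distance btest uses to pick a medoid is not stated here, so Euclidean distance is an assumption, and both function names are hypothetical.

```python
import numpy as np

def medoid_index(cluster):
    """Index of the feature (row) with the smallest total Euclidean distance to the others."""
    cluster = np.asarray(cluster, dtype=float)
    differences = cluster[:, None, :] - cluster[None, :, :]
    distances = np.sqrt((differences ** 2).sum(axis=2))
    return int(np.argmin(distances.sum(axis=1)))

def first_principal_component(cluster):
    """Per-sample scores on the first principal component of a features x samples block."""
    cluster = np.asarray(cluster, dtype=float)
    # Center each feature across samples.
    centered = cluster - cluster.mean(axis=1, keepdims=True)
    # Rows of vt are directions across samples; the first one carries the most variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    return singular_values[0] * vt[0]

cluster = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.1, 2.9, 4.2],
    [0.0, 5.0, 1.0, 9.0],
])
print("medoid row:", medoid_index(cluster))
print("first PC scores:", first_principal_component(cluster))
```

The medoid is an actual feature from the cluster, which keeps the representative interpretable; the principal component is a combination of all features, and its sign is arbitrary.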
A user can choose AllA, a naive pairwise testing approach, using the `-a pair` option on the command line; the default is `-a block`, which uses the hierarchical approach.
btest by default removes features with low entropy (< 0.5) to reduce the number of unnecessary tests. A user can set a different threshold using the option `-e $THRESHOLD`; $THRESHOLD is 0.5 by default.
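As a rough picture of what an entropy filter does, the sketch below computes the Shannon entropy of each feature after binning and drops features below the threshold. The number of bins, equal-width binning, and the use of bits (log base 2) are all assumptions for this example; btest's own discretization may differ.

```python
import numpy as np

def shannon_entropy(values, bins=4):
    """Shannon entropy (in bits) of a feature after equal-width binning."""
    counts, _ = np.histogram(np.asarray(values, dtype=float), bins=bins)
    probabilities = counts[counts > 0] / counts.sum()
    return float(-(probabilities * np.log2(probabilities)).sum())

def filter_low_entropy(table, threshold=0.5, bins=4):
    """Keep the features whose entropy is at least the threshold."""
    return {name: values for name, values in table.items()
            if shannon_entropy(values, bins) >= threshold}

table = {
    "constant": [3, 3, 3, 3, 3, 3, 3, 3],
    "nearly_constant": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    "variable": [0, 1, 2, 3, 0, 1, 2, 3],
}
kept = filter_low_entropy(table)
print("kept features:", list(kept))
```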
btest includes tools to be used with results.
$ cd $OUTPUT_DIR
$ blockplot $blockplot similarity_table.txt hypotheses_tree.txt associations.txt blockplot.pdf
- $TABLE = gene/pathway table (tsv or biom format)
- $OUTPUT_DIR = the directory to write new gene/pathway tables (one per sample, in biom format if input is biom format)
usage: blockplot [-h] [--strongest STRONGEST] [--largest LARGEST] [--mask]
[--cmap CMAP] [--axlabels AXLABELS AXLABELS]
[--outfile OUTFILE] [--similarity SIMILARITY]
[--orderby ORDERBY]
simtable tree associations
positional arguments:
simtable table of pairwise similarity scores
tree hypothesis tree (for getting feature order)
associations btest associations
optional arguments:
-h, --help show this help message and exit
--strongest STRONGEST
isolate the N strongest associations
--largest LARGEST isolate the N largest associations
--mask mask feature pairs not in associations
--cmap CMAP matplotlib color map
--axlabels AXLABELS AXLABELS
axis labels
--outfile OUTFILE output file name
--similarity SIMILARITY
Similarity metric has been used for similarity
measurement
--orderby ORDERBY Order the significant association by similarity,
pvalue, or qvalue
btest provides a script, `scatter`, to make a scatter matrix of all features that participate in an association.
$ scatter 1 --input ./ --outfile scatter_1.pdf
usage: scatter [-h] [--input INPUT] [--outfile OUTFILE]
association_number
positional arguments:
association_number Association number to be plotted
optional arguments:
-h, --help show this help message and exit
--input INPUT btest output directory
--outfile OUTFILE output file name
datasim generates paired datasets with various properties, including: the size (number of features (rows) and samples (columns)), the number of blocks (clusters within each dataset), the structure of the clusters within each dataset, the type of association between features, the distribution of the data (normal or uniform), the strength of association between clusters among datasets, defined by the noise between associated blocks, and the strength of similarity between features within clusters, defined by the noise within blocks.
Here are two examples that generate paired datasets with associations between them, followed by btest runs.
datasim -f 32 -n 100 -a line -d uniform -s balanced -o btest
The outputs will be located in the btest_data_f32_s100_line directory and include the paired datasets X_line_32_100.txt and Y_line_32_100.txt, and A_line_32_100.txt, the associations between them. A's rows are features in the X dataset and A's columns are features in the Y dataset; in each cell of A, 0 means no significant association and 1 means a significant association.
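Because A records which feature pairs are truly associated, it can be used to score a set of discovered associations. The sketch below compares a list of discovered (row, column) index pairs with a binary truth matrix; the small matrix, the pair list, and the function name are invented for illustration, and mapping feature names from btest output to indices is left out.

```python
import numpy as np

def score_against_truth(truth, discovered_pairs):
    """Compare discovered (row, column) index pairs with a binary truth matrix."""
    truth = np.asarray(truth).astype(bool)
    called = np.zeros_like(truth, dtype=bool)
    for i, j in discovered_pairs:
        called[i, j] = True
    true_positives = int((called & truth).sum())
    false_positives = int((called & ~truth).sum())
    recall = true_positives / truth.sum() if truth.any() else 0.0
    fdr = false_positives / called.sum() if called.any() else 0.0
    return {"recall": float(recall), "observed_fdr": float(fdr)}

# Toy truth matrix: rows are X features, columns are Y features.
A = np.array([[1, 1, 0],
              [0, 0, 0],
              [0, 0, 1]])
discovered = [(0, 0), (0, 1), (1, 2)]
result = score_against_truth(A, discovered)
print(result)
```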
To run btest on this synthetic data, use:
btest -X btest/X_line_32_100.txt -Y btest/Y_line_32_100.txt -o btest_output_f32_n100_line_spearman
As all features in the datasets are continuous, btest uses the Spearman coefficient as the similarity metric. One can specify a different similarity metric. For example, try the same dataset with normalized mutual information:
btest -X btest/X_line_32_100.txt -Y btest/Y_line_32_100.txt -o btest_output_f32_n100_line_nmi -m nmi
For mixed data (categorical and continuous), btest automatically uses NMI as the similarity metric. Let's generate some mixed data:
datasim -f 32 -n 100 -a mixed -d uniform -s balanced -o btest_data_f32_n100_mixed
Run btest on the data:
btest -X btest_data_f32_n100_mixed/X_mixed_32_100.txt -Y btest_data_f32_n100_mixed/Y_mixed_32_100.txt -o btest_output_f32_n100_mixed
If you try Spearman on mixed data, btest issues a warning and exits, because Spearman does not work with non-continuous data.
usage: datasim [-h] [-v] [-f FEATURES] [-n SAMPLES] [-a ASSOCIATION]
[-d DISTRIBUTION] [-b NOISE_BETWEEN] [-w NOISE_WITHIN] -o
OUTPUT [-s STRUCTURE]
btest synthetic data generator to produce paired data sets with association among their features.
optional arguments:
-h, --help show this help message and exit
-v, --verbose additional output is printed
-f FEATURES, --features FEATURES
number of features in the input file D*N, Rows: D features and columns: N samples
-n SAMPLES, --samples SAMPLES
number of samples in the input file D*N, Rows: D features and columns: N samples
-a ASSOCIATION, --association ASSOCIATION
association type [sine, parabola, log, line, L, step, happy_face, default =parabola]
-d DISTRIBUTION, --distribution DISTRIBUTION
Distribution [normal, uniform, default =uniform]
-b NOISE_BETWEEN, --noise-between NOISE_BETWEEN
number of samples in the input file D*N, Rows: D features and columns: N samples
-w NOISE_WITHIN, --noise-within NOISE_WITHIN
number of samples in the input file D*N, Rows: D features and columns: N samples
-o OUTPUT, --output OUTPUT
the output directory
-s STRUCTURE, --structure STRUCTURE
structure [balanced, imbalanced, default =balanced]
In addition to the command line, the btest function can be called from other programs through the Python API. The following example shows how to import and use the btesttest function:
from btest.btest import btesttest

def main():
    # Run btest on a pair of datasets and write the results to the output directory
    btesttest(X='/path/to/first/dataset/X.txt',
              Y='/path/to/second/dataset/Y.txt',
              output_dir='/path/to/btest/output/btest_output_demo')

if __name__ == "__main__":
    main()
We have implemented two approaches for the permutation test: the empirical cumulative distribution function (ECDF) and a fast and accurate approach based on the generalized Pareto distribution (GPD) by Knijnenburg et al. 2009. The function can be imported into other Python programs:
from btest.stats import permutation_test_pvalue
import numpy as np

def main():
    # Generate a list of random values for the first vector
    np.random.seed(0)
    x_rand = np.random.rand(1, 10)[0]
    # Generate a list of random values for the second vector;
    # set a different numpy seed so the values differ from the first set
    np.random.seed(1)
    y_rand = np.random.rand(1, 10)[0]
    # Calculate the p-value using the empirical cumulative distribution function (ECDF)
    p_random_ecdf = permutation_test_pvalue(X=x_rand, Y=y_rand, similarity_method='spearman', permutation_func='ecdf')
    p_perfect_ecdf = permutation_test_pvalue(X=x_rand, Y=x_rand, similarity_method='spearman', permutation_func='ecdf')
    print("ECDF P-value for random data: %s, ECDF P-value for perfect correlation data: %s" % (p_random_ecdf, p_perfect_ecdf))
    # Calculate the p-value using btest's implementation of the generalized Pareto distribution (GPD) approach proposed by Knijnenburg et al. 2009
    p_random_gpd = permutation_test_pvalue(X=x_rand, Y=y_rand, similarity_method='spearman', permutation_func='gpd')
    p_perfect_gpd = permutation_test_pvalue(X=x_rand, Y=x_rand, similarity_method='spearman', permutation_func='gpd')
    print("GPD P-value for random data: %s, GPD P-value for perfect correlation data: %s" % (p_random_gpd, p_perfect_gpd))

if __name__ == "__main__":
    main()
The parameters that can be provided to the permutation test for calculating the p-value are:
- iterations: the number of permutations for the test (e.g. 1000)
- permutation_func: can be either 'ECDF' or 'GPD'
- similarity_method: a similarity metric supported by btest (check the available choices with 'btest -h')
- seed: if -1, each run uses a random seed; if 0, the same seed is used wherever a permutation is performed.
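To show what the ECDF variant of a permutation test computes, here is a self-contained NumPy sketch that does not depend on btest. It is not the `permutation_test_pvalue` implementation: the function names, the two-sided comparison on absolute values, and the add-one correction are assumptions made for the example.

```python
import numpy as np

def spearman(x, y):
    """Spearman coefficient via Pearson correlation of ranks (assumes no ties)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return float(np.corrcoef(rank_x, rank_y)[0, 1])

def permutation_pvalue(x, y, iterations=1000, seed=0):
    """Empirical p-value: share of shuffles whose |score| reaches the observed |score|."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = abs(spearman(x, y))
    hits = 0
    for _ in range(iterations):
        # Shuffling y breaks any pairing between x and y.
        if abs(spearman(x, rng.permutation(y))) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (hits + 1) / (iterations + 1)

rng = np.random.default_rng(1)
x = rng.random(30)
p_perfect = permutation_pvalue(x, x)
p_random = permutation_pvalue(x, rng.random(30))
print("perfect correlation:", p_perfect)
print("random data:", p_random)
```

With a fixed number of permutations the smallest attainable p-value is 1 / (iterations + 1), which is the limitation that tail-approximation approaches such as GPD are meant to address.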