btest

btest: a method to link, rank, and visualize associations among omics features across multi-omics datasets. The tool provides general-purpose, well-powered association discovery in paired multi-omics datasets.


Citation:

Bahar Sayoldin, Mahdi Baghbanzadeh, Keith A. Crandall, Ali Rahnavard, btest: link, rank, and visualize associations among omics features across multi-omics datasets https://github.com/omicsEye/btest

btest combines block-wise nonparametric hypothesis testing with false discovery rate correction to enable high-sensitivity discovery of linear and non-linear associations in high-dimensional datasets (which may be categorical, continuous, or mixed). btest performs testing by:

  1. Running rapid correlation tests on high-dimensional paired omics datasets.
  2. Providing informative associations by incorporating within-dataset and between-dataset relationships.
  3. Clustering associations to form blocks of significantly associated features.
  4. Generating high-quality visualizations of the data.
  5. Providing guidelines for adjusting data for covariates.

Please join the community discussion forum at omicsEye/btest


Features

  1. Generality: btest can handle datasets from various omics profiles.

  2. Efficiency: functions are designed and implemented to work with large datasets.

  3. Reliability: btest applies multiple hypothesis testing to paired omics data.

  4. False discovery rate (FDR) correction using the Benjamini–Hochberg (BH) procedure.

  5. A simple user interface (single-command-driven flow)

    • The user only needs to provide a paired dataset

Overview workflow

  • File types: tab-delimited text files, with column headers as sample names (or no header, provided samples appear in the same order in both files) and row names as features, for example:
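
For illustration, a minimal sketch of an input file (hypothetical feature and sample names; columns are tab-separated in a real file), one row per feature and one column per sample:

           Sample_1  Sample_2  Sample_3
Feature_A  0.12      0.35      0.27
Feature_B  1.50      0.98      1.21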

Requirements

Software

  1. Python (version >= 2.7 or >= 3.4)
  2. Numpy (version >= 1.9.2) (automatically installed)
  3. Scipy (version >= 0.17.1) (automatically installed)
  4. Matplotlib (version >= 1.5.1) (automatically installed)
  5. Scikit-learn (version >= 0.14.1) (automatically installed)
  6. pandas (version >= 0.18.1) (automatically installed)

Other

  1. Memory usage depends on the input size, mainly the number of features in each dataset
  2. Runtime depends on the input size, mainly the number of features in each dataset and the similarity metric chosen
  3. Operating system (Linux, Mac, or Windows)

Initial Installation

1. Install btest

Installing from source

INSTALLATION

  • First install conda
    Go to the Anaconda website and download the latest version for your operating system.
  • For Windows users: do not forget to add conda to your system PATH
  • Second, check for conda availability:
    open a terminal (or command line for Windows users) and run:
conda --version

It should output something like:

conda 4.9.2

If not, you must make conda available to your system for further steps. If you have problems adding conda to PATH, you can find instructions here.

Windows / Linux / Mac

If you are using an Apple M1 Mac, please go to the Apple M1 MAC section below for installation instructions.
If you have a working conda on your system, you can safely skip to step three.

  1. Create a new conda environment (let's call it btest_env) with the following command:
conda create --name btest_env python=3.8
  2. Activate your conda environment:
conda activate btest_env
  3. Install btest: you can install it directly from GitHub:
python -m pip install git+https://github.com/omicsEye/btest
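
You can verify the installation from the same environment, for example:

btest --version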

Apple M1 MAC

  1. Update/install Xcode Command Line Tools:
xcode-select --install
  2. Install Brew:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
  3. Install libraries for brew:
brew install cmake libomp
  4. Install miniforge:
brew install miniforge
  5. Close the current terminal and open a new terminal.
  6. Create a new conda environment (let's call it btest_env) with the following command:
conda create --name btest_env python=3.8
  7. Activate the conda environment:
conda activate btest_env
  8. Install packages from conda:
conda install numpy scipy scikit-learn==0.23.2

Then

conda install lightgbm
pip install xgboost
  9. Finally, install btest: you can install it directly from GitHub:
python -m pip install git+https://github.com/omicsEye/btest

Installing with pip

  • $ pip install btest
  • This command will automatically install btest and its dependencies.
  • To overwrite existing installations of dependencies, use "-U" to force-update them.
  • To keep the existing versions of dependencies, use "--no-deps".
  • If you do not have write permissions to '/usr/lib/', add the option "--user" to the btest install command. This installs the Python package into subdirectories of '~/.local' on Linux. Note that when using the "--user" option on some platforms, you might need to add '~/.local/bin/' to your $PATH, as it may not be included by default; you will know it needs to be added if you see the message btest: command not found when trying to run btest after installing with the "--user" option.
  • If you use the Windows operating system, you can install easily with administrator permission (open a terminal with administrator permission; the rest is the same).
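
For example, sketches of the install variants described above:

$ pip install -U btest
$ pip install --no-deps btest
$ pip install --user btest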

Getting Started with btest

Test btest

To test if btest is installed correctly, you may run the following command in the terminal:

btest -h

Example command from a real study:

btest -X BCAA.tsv -Y lipoproteins.tsv -o btest_BCCA --fdr 0.05 --diagnostics-plot
blockplot btest_BCCA/simtable.tsv btest_BCCA/X_Y.tsv --strongest 100 --similarity Spearman --axlabels "BCAAs" "Lipoproteins" --outfile btest_BCCA/heatmap_100.pdf
b_scatter --datax BCAA.tsv --datay lipoproteins.tsv --b_test btest_BCCA/X_Y.tsv --ind 0-440 --out btest_BCCA/scatters

Running btest -h yields the btest command-line options:

usage: btest [-h] [--version] -X <input_dataset_1.txt> [-Y <input_dataset_2.txt>] -o <output> [-m {pearson,spearman,kendall}] [--fdr FDR] [--var MIN_VAR] [-v VERBOSE] [--diagnostics-plot]
             [--header] [--format-feature-names] [-s SEED]

block-wise association testing

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  -X <input_dataset_1.txt>
                        first file: Tab-delimited text input file, one row per feature, one column per measurement
                        [REQUIRED]
  -Y <input_dataset_2.txt>
                        second file: Tab-delimited text input file, one row per feature, one column per measurement
                        [default = the first file (-X)]
  -o <output>, --output <output>
                        directory to write output files
                        [REQUIRED]
  -m {pearson,spearman,kendall}
                        metric to be used for similarity measurement
                        [default = 'spearman']
  --fdr FDR             Target FDR correction using BH approach
  --var MIN_VAR         Minimum variation to keep a feature in tests
  -v VERBOSE, --verbose VERBOSE
                        additional output is printed
  --diagnostics-plot    Diagnostics plot for associations 
  --header              the input files contain a header line
  --format-feature-names
                        Replaces special characters and for OTUs separated  by | uses the known end of a clade
  -s SEED, --seed SEED  a seed number to make the random permutation reproducible
                        [default = 0,and -1 for random number]

2. Test the installation

  1. Test out the install with unit and functional tests
    • $ btest_test

3. Try out a demo run

With btest installed, you can try out a demo run using two sample synthetic datasets:

$ btest -X demo/X_dataset.txt -Y demo/Y_dataset.txt -o $OUTPUT_DIR --fdr 0.1

The output from this demo run will be written to the folder $OUTPUT_DIR.
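
You can then inspect the results, for example the associations table described in the Output files section below:

$ head $OUTPUT_DIR/associations.txt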

Installation Update

If you have already installed btest, using the Initial Installation steps, and would like to upgrade your installed version to the latest version, please do:

sudo -H pip install btest --upgrade --no-deps

or

pip install btest --upgrade --no-deps

This command upgrades btest to the latest version and ignores updating btest's dependencies.

Output files

When btest completes, the following main output files are created:

1. Associations file

  • File name: $OUTPUT_DIR/associations.txt
  • This file details the associations. Features are grouped into clusters that participate in an association with another cluster.
  • Feature_1: a feature from the first dataset that participates in the association.
  • Feature_2: a feature from the second dataset that participates in the association.
  • Correlation coefficient: from the correlation test.
  • complete_obs: number of complete observations between the two features.
  • t_statistics: test statistic.
  • pvalue: p-value from the correlation test.
  • P_adjusted: adjusted p-value (observed FDR).
  • critical_bh_pval: target FDR threshold for the Benjamini–Hochberg approach.
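
As a quick way to work with this table programmatically, here is a minimal sketch (pandas is already a btest dependency; the file path is a placeholder and the column names follow the description above):

import pandas as pd

# load the associations table written by btest
assoc = pd.read_csv("btest_output/associations.txt", sep="\t")

# keep associations below a chosen target FDR and sort by adjusted p-value
significant = assoc[assoc["P_adjusted"] <= 0.05]
print(significant.sort_values("P_adjusted").head())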

Output plots

  1. First dataset heatmap
  2. Second dataset heatmap
  3. Associations blockplot
  4. Diagnostics scatter or confusion matrix plot

1. First dataset heatmap

![First dataset heatmap](img/hierarchical_heatmap_spearman_1.png)

2. Second dataset heatmap

![Second dataset heatmap](img/hierarchical_heatmap_spearman_2.png)

3. Associations blockplot

![Associations blockplot](img/blockplot_strongest_7.png)

  • File name: $OUTPUT_DIR/blockplot.pdf
  • This file is a heatmap visualization of the results. Rows are the features from the first dataset that participated in at least one association, ordered by average-linkage hierarchical clustering. Columns are the features from the second dataset that participated in at least one association, ordered in the same way.
  • Each cell color represents the pairwise similarity between individual features.
  • The number on each block is the rank of the significant association, ordered by descending similarity score (largest first), with ties broken by ascending p-value.

4. Diagnostics scatter or confusion matrix plot

![Diagnostics scatter plot](img/Scatter_association1.png)

  • If the option --diagnostics-plot is provided on the btest command line, a set of plots is produced for each association at the end of btest's run.
  • File name: $OUTPUT_DIR/diagnostics_plot/association_1/Scatter_association1.pdf
  • This file visualizes Association 1. The X variables are features from a cluster in the first dataset that is significantly associated with a cluster of features (the Y variables) in the second dataset. The scatter plot shows what the association looks like within clusters and between individual features.

Configuration

btest produces a log file that stores the run configuration settings. This file is automatically created in the output directory.

$ vi btest.log
btest version:	1.1.1

Tutorials

Adjust for covariates

btest can be used to test the relationship between metadata (e.g., age and gender) and data (e.g., microbial species abundances and immune cell counts). In this case, related (covarying) metadata cluster together. When two datasets are tested, such as microbiome vs. metabolites, the effect of covariates (e.g., age, gender, and batch) should be regressed out of both datasets (e.g., microbial species and metabolites). Users should adjust for covariates before running btest. Here we provide two examples in R of how to adjust for a variable.

  • Adjust for age: let's regress out the age effect from microbial species or metabolites. A minimal R sketch (ordinary linear models, since no random effects are involved; the age-adjusted values are the residuals):

fit_microbe    <- lm(microbe ~ age, data = microbial_abundance_data)
fit_metabolite <- lm(metabolite ~ age, data = metabolites_data)
# use the residuals as the age-adjusted feature values
adjusted_microbe    <- residuals(fit_microbe)
adjusted_metabolite <- residuals(fit_metabolite)
  • Adjust for time: this type of adjustment, involving group structure, is more complex, and we recommend reading Winkler et al., Neuroimage, 2014, entitled "Permutation inference for the general linear model." A simple example for this case: assume we have microbial samples from the same subject at several time points; a linear mixed-effects model is fit with the R lme4 package to each microbial species or metabolite, of the form:
lmer(microbe ~ time + (1 | subject), data = microbial_abundance_data)

Selecting a desired false discovery rate

btest by default uses 0.1 as the target false discovery rate. Users can change it to a desired value, for example 0.05 or 0.25, using --fdr 0.05.
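
For example (input file names are placeholders):

btest -X X_dataset.txt -Y Y_dataset.txt -o btest_output --fdr 0.05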

Selecting a similarity measurement

btest's implementation and hypothesis-testing scheme are highly general, allowing them to be used with a wide variety of similarity measures. For the similarity measurement option we recommend: 1) the Spearman coefficient for continuous data; 2) normalized mutual information (NMI, btest's default for mixed data) for mixed data (continuous, categorical, and binary); and 3) the discretized maximal information coefficient (DMIC) for complicated association types such as sine waves in continuous data. Similarity measures implemented in the current version of btest, available as options, are: Spearman coefficient, discretized normalized mutual information, discretized adjusted mutual information, discretized maximal information coefficient, Pearson correlation, and distance correlation (dCor).
-m spearman, for example, changes the similarity measurement to the Spearman coefficient and automatically bypasses the discretizing step. For the available similarity metrics, please look at the btest options using btest -h.

Selecting a decomposition method

btest uses the medoid of each cluster as a representative to test the relation between clusters. A user can select other decomposition methods, such as PCA, ICA, or MCA, using -d. For example, -d pca will use the first principal component of a cluster as its representative.

Pairwise association testing or AllA

A user can choose AllA, a naive pairwise testing approach, using the -a pair option on the command line; the default is -a block, which uses the hierarchical approach.
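
A hypothetical invocation (file names are placeholders; the -a option as described above):

btest -X X_dataset.txt -Y Y_dataset.txt -o btest_output_alla -a pair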

Filtering features from input by minimum entropy

btest by default removes features with low entropy (< 0.5) to reduce the number of unnecessary tests. A user can set a different threshold using the option -e $THRESHOLD; $THRESHOLD is 0.5 by default.
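
A hypothetical invocation (file names are placeholders; the -e option as described above):

btest -X X_dataset.txt -Y Y_dataset.txt -o btest_output -e 0.75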

Tools

blockplot: a tool for visualization

btest includes tools for working with its results.

$ cd $OUTPUT_DIR

$ blockplot similarity_table.txt hypotheses_tree.txt associations.txt --outfile blockplot.pdf

usage: blockplot [-h] [--strongest STRONGEST] [--largest LARGEST] [--mask]
                 [--cmap CMAP] [--axlabels AXLABELS AXLABELS]
                 [--outfile OUTFILE] [--similarity SIMILARITY]
                 [--orderby ORDERBY]
                 simtable tree associations

positional arguments:
  simtable              table of pairwise similarity scores
  tree                  hypothesis tree (for getting feature order)
  associations          btest associations

optional arguments:
  -h, --help            show this help message and exit
  --strongest STRONGEST
                        isolate the N strongest associations
  --largest LARGEST     isolate the N largest associations
  --mask                mask feature pairs not in associations
  --cmap CMAP           matplotlib color map
  --axlabels AXLABELS AXLABELS
                        axis labels
  --outfile OUTFILE     output file name
  --similarity SIMILARITY
                        Similarity metric has been used for similarity
                        measurement
  --orderby ORDERBY     Order the significant association by similarity,
                        pvalue, or qvalue

scatter: a tool for visualization

btest provides a script, scatter, to make a scatter matrix of all features that participate in an association.

$ scatter 1 --input ./ --outfile scatter_1.pdf

usage: scatter [-h] [--input INPUT] [--outfile OUTFILE]
                    association_number

positional arguments:
  association_number  Association number to be plotted

optional arguments:
  -h, --help          show this help message and exit
  --input INPUT       btest output directory
  --outfile OUTFILE   output file name

datasim: a tool for synthetic data

datasim generates paired datasets with various properties, including: the size (number of features (rows) and samples (columns)); the number of blocks (clusters) within each dataset; the structure of the clusters; the type of associations between features; the distribution of the data (normal or uniform); the strength of association between clusters across datasets, defined by the noise between associated blocks; and the strength of similarity between features within clusters, defined by the noise within blocks.

Here are two examples that generate paired datasets with associations between them, followed by btest runs.

datasim -f 32 -n 100 -a line -d uniform -s balanced -o btest

The outputs will be located in the btest_data_f32_s100_line directory and include a paired dataset, X_line_32_100.txt and Y_line_32_100.txt, and A_line_32_100.txt, the associations between them. A's rows are features in the X dataset and A's columns are features in the Y dataset; for each cell in A, zero means no significant association and 1 means a significant association. To run btest on this synthetic data use:

btest -X btest/X_line_32_100.txt -Y btest/Y_line_32_100.txt -o btest_output_f32_n100_line_spearman

As all features in these datasets are continuous, btest uses the Spearman coefficient as the similarity metric. One can specify a different similarity metric. For example, try the same dataset with normalized mutual information (NMI):

btest -X btest/X_line_32_100.txt -Y btest/Y_line_32_100.txt -o btest_output_f32_n100_line_nmi -m nmi

For mixed data (categorical and continuous data), btest automatically uses NMI as the similarity metric. Let's generate some mixed data:

datasim -f 32 -n 100 -a mixed -d uniform -s balanced -o btest_data_f32_n100_mixed

Run btest on the data:

btest -X btest_data_f32_n100_mixed/X_mixed_32_100.txt -Y btest_data_f32_n100_mixed/Y_mixed_32_100.txt -o btest_output_f32_n100_mixed

If you specify Spearman for mixed data, btest issues a warning and stops, as Spearman does NOT work with non-continuous data.

usage: datasim [-h] [-v] [-f FEATURES] [-n SAMPLES] [-a ASSOCIATION]
                 [-d DISTRIBUTION] [-b NOISE_BETWEEN] [-w NOISE_WITHIN] -o
                 OUTPUT [-s STRUCTURE]

btest synthetic data generator to produce paired data sets with association among their features.

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         additional output is printed
  -f FEATURES, --features FEATURES
                        number of features in the input file D*N, Rows: D features and columns: N samples 
  -n SAMPLES, --samples SAMPLES
                        number of samples in the input file D*N, Rows: D features and columns: N samples 
  -a ASSOCIATION, --association ASSOCIATION
                        association type [sine, parabola, log, line, L, step, happy_face, default =parabola] 
  -d DISTRIBUTION, --distribution DISTRIBUTION
                        Distribution [normal, uniform, default =uniform] 
  -b NOISE_BETWEEN, --noise-between NOISE_BETWEEN
                        noise between associated blocks across the paired datasets
  -w NOISE_WITHIN, --noise-within NOISE_WITHIN
                        noise within blocks (among features in the same cluster)
  -o OUTPUT, --output OUTPUT
                        the output directory
  -s STRUCTURE, --structure STRUCTURE
                        structure [balanced, imbalanced, default =balanced] 

btest Python API

In addition to the command line, the btest functionality can be called from other programs using the Python API we provide. An example is shown here of how to import and use the btesttest function:

from btest.btest import btesttest

def main():
    btesttest(X='/path/to/first/dataset/X.txt',
              Y='/path/to/second/dataset/Y.txt',
              output_dir='/path/to/btest/output/btest_output_demo')

if __name__ == "__main__":
    main()

Nonparametric p-value

For the permutation test, we have implemented both the empirical cumulative distribution function (ECDF) approach and the fast and accurate generalized Pareto distribution (GPD) approach by Knijnenburg et al. 2009. The function can be imported into other Python programs:

from btest.stats import permutation_test_pvalue 
import numpy as np

def main():
    
    # Generate a list of random values for first vector
    np.random.seed(0)
    x_rand = np.random.rand(1,10)[0]   
    
    # Generate a list of random values for second vector
    # set the numpy seed for different random values from the first set
    np.random.seed(1)
    y_rand = np.random.rand(1, 10)[0]
    
    # Calculate pvalue using empirical cumulative distribution function (ECDF) 
    p_random_ecdf = permutation_test_pvalue(X  = x_rand, Y = y_rand, similarity_method = 'spearman',permutation_func = 'ecdf')
    p_perfect_ecdf = permutation_test_pvalue(X  = x_rand, Y = x_rand, similarity_method = 'spearman', permutation_func = 'ecdf')
    print ("ECDF P-value for random data: %s, ECDF P-value for perfect correlation data: %s")%(p_random_ecdf, p_perfect_ecdf)
    
    # Calculate pvalue using our implementation in btest for generalized Pareto distribution (GPD) approach proposed by Knijnenburg et al. 2009 
    p_random_gpd = permutation_test_pvalue(X  = x_rand, Y = y_rand, similarity_method = 'spearman',permutation_func = 'gpd')
    p_perfect_gpd = permutation_test_pvalue(X  = x_rand, Y = x_rand, similarity_method = 'spearman', permutation_func = 'gpd')
    print ("GPD P-value for random data: %s, GPD P-value for perfect correlation data: %s")%(p_random_gpd, p_perfect_gpd)

if __name__ == "__main__":
    main( ) 

The parameters that can be provided to the permutation test for calculating the p-value are:

  • iterations: the number of permutations for the test (e.g., 1000)
  • permutation_func: either 'ecdf' or 'gpd'
  • similarity_method: a similarity metric supported by btest (check the choices with btest -h)
  • seed: -1 seeds each run with a random value; 0 uses the same seed wherever a permutation is performed.
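
For example, a minimal sketch passing these parameters explicitly (x_rand and y_rand as in the example above):

p_gpd = permutation_test_pvalue(X=x_rand, Y=y_rand, similarity_method='spearman',
                                permutation_func='gpd', iterations=1000, seed=0)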
