waveome

Overview

This repository houses code for the waveome package, an easy-to-use and powerful Python library that analyzes longitudinal data using Gaussian processes.

waveome is a computational method for longitudinal data analysis, designed to characterize and identify temporal dynamics of omics and clinical variables associated with a phenotype of interest. It uses Gaussian processes as priors to provide nonparametric estimates of the dynamics of the underlying measurements.


Key features:

  • Generality: waveome is a new computational tool for identifying temporal dynamics significantly associated with phenotypes of interest.
  • Validation: A comprehensive evaluation of waveome performance using synthetic data generation with known ground truth for genotype-phenotype association testing.
  • Interpretation: By prioritizing comprehensive and flexible kernel functions, waveome significantly reduces computational costs.
  • Elegance: User-friendly, open-source software allowing for high-quality visualization and statistical tests.
  • Optimization: Since omics data are often very high dimensional, all modules have been written and benchmarked for computing time.
  • Documentation: Open-source GitHub repository of code complete with tutorials and a wide range of real-world applications.

Citation:

Allen Ross, Ali Reza Taheriouyn, Jason Lloyd-Price, Ali Rahnavard (2024). **waveome: characterizing temporal dynamics of metabolites in longitudinal studies**, https://github.com/omicsEye/waveome/.


waveome user manual


Features

  1. Generic software that can handle any kind of sequencing data and phenotypes
  2. One place to perform all analyses and produce high-quality visualizations
  3. Optimized computation
  4. User-friendly software
  5. Provides temporal dynamics, associated omics features, and metadata
  6. Enhanced with diagnostic and summarizing visualizations

General usage

Running waveome involves several steps: installing the waveome package, loading data in the required format, specifying covariates and outcomes (omics features), running the kernel search (which takes some time and computing resources), and visualizing overall and individual associations. All of these steps are demonstrated in the waveome_overview.ipynb notebook, an example of the package modeling simulated data that you can use as a template and modify for your own input data. Each step is explained in detail in the following sections of this tutorial.

Installation

To install the package, we suggest creating a new conda environment. Python >= 3.9 and <= 3.11 is required for tensorflow.
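As a quick sanity check of the version requirement above, the following minimal sketch tests whether the running interpreter falls in the stated range. The function name `is_supported_python` is hypothetical and not part of waveome; the bounds are copied from the requirement stated above.

```python
import sys

# Bounds taken from the requirement stated above (Python >= 3.9 and <= 3.11).
MIN_SUPPORTED = (3, 9)
MAX_SUPPORTED = (3, 11)


def is_supported_python(version_info=None):
    """Return True if the (major, minor) version lies within the supported range."""
    if version_info is None:
        version_info = sys.version_info
    major_minor = (version_info[0], version_info[1])
    return MIN_SUPPORTED <= major_minor <= MAX_SUPPORTED


if __name__ == "__main__":
    current = "{}.{}".format(*sys.version_info[:2])
    if is_supported_python():
        print(f"Python {current} is within the supported range.")
    else:
        print(f"Python {current} is outside the supported range 3.9 to 3.11.")
```

Running this inside the activated environment confirms that the interpreter conda created is the one you expect.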

Overall requirements

waveome is installed in a conda environment, so you need to install conda first. Go to the Anaconda website and download the latest version for your operating system.

  • For Windows users: do not forget to add conda to your system PATH. You will be asked about this during installation.
  • Make sure conda is available. Open a terminal (or Command Prompt on Windows) and run:
conda --version

It should output something like:

conda 23.7.4

If not, you must make conda available on your system before continuing. If you have problems adding conda to PATH, you can find instructions here.
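The same check can be done from Python, which may be convenient in a setup script. This is a minimal sketch using only the standard library; the helper names `find_conda` and `conda_version` are hypothetical and not part of waveome or conda.

```python
import shutil
import subprocess


def find_conda():
    """Return the full path of the conda executable, or None if it is not on PATH."""
    return shutil.which("conda")


def conda_version(conda_path):
    """Run `conda --version` and return its output as a stripped string."""
    result = subprocess.run(
        [conda_path, "--version"], capture_output=True, text=True, check=True
    )
    # Some older conda releases print the version on stderr instead of stdout.
    return (result.stdout or result.stderr).strip()


if __name__ == "__main__":
    path = find_conda()
    if path is None:
        print("conda was not found on PATH; add it before continuing.")
    else:
        print(f"Found {path}: {conda_version(path)}")
```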

Windows\Linux\Mac

If you are using Windows, please make sure you have both git and 'Microsoft Visual C++ 14.0' or later installed. You also need to install the Microsoft C++ Build Tools. If you face issues with this step, this link may help you.

Regardless of what your operating system is, follow these steps:

  1. Open a terminal on Linux or Mac, or Command Prompt on Windows (press Win+R, type cmd, and press Enter), and run the following to create a conda environment:

    conda create --name waveome_env python=3.11
    
  2. Activate your conda environment:

    conda activate waveome_env 
    
  3. If you want to use waveome in a Python notebook such as Jupyter Notebook (recommended for running the waveome_overview.ipynb sample file and the example projects), we recommend installing Jupyter Notebook in this environment before the pip installation of waveome. On any system except an M1/M2 Mac, simply run:

    conda install jupyter 
    

    in waveome_env in your terminal or command prompt, then go to step 4. If you are an M1/M2 Mac user, first run the following in waveome_env before installing Jupyter Notebook:

    conda install -c conda-forge grpcio
    

    and afterwards run:

    conda install jupyter
    
  4. Install waveome directly from GitHub:

    python -m pip install git+https://github.com/omicsEye/waveome
    

Run using Jupyter Notebook & Jupyter kernel definition

If you would like to run waveome_overview.ipynb, you should also set up a Jupyter kernel for the new waveome_env environment. This can be done with

conda install -n waveome_env ipykernel

Then, to use waveome_env in Jupyter Notebook, register it as a kernel with

python -m ipykernel install --user --name=waveome_env

Change directory to where your IPython notebook is located:

cd /PATH-TO_YOUR_iPythonNotebook-DiRECTORY

Then, while waveome_env is active, start Jupyter Notebook from the terminal:

jupyter notebook

Loading and preparing data

Input

As input, waveome requires a pandas data frame that contains at least:

  1. A subject (individual/patient) index column.
  2. Columns of covariates; in longitudinal studies these include the time at which each sample was observed.
  3. An omics feature measurement.

All of the above must be stored as numeric types (int or float) in the data frame.

(Figure: sampledata, an example input data frame.)

The subject index is used to estimate the subject effect. Categorical factors should be encoded as dummy variables. For instance, in the sample data above, the factor Sex is encoded as 'is the subject female?', where 1 means yes. waveome does not handle samples with missing values, so rows with missing values must be deleted before running GPSearch.
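The preparation rules above (numeric columns, dummy-coded factors, no missing values) can be sketched with pandas as follows. The toy data, the column names, and the helper `prepare_input` are hypothetical and only illustrate the steps; they are not part of waveome.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
raw = pd.DataFrame(
    {
        "subject": ["A", "A", "B", "B", "C"],
        "time": [0.0, 1.0, 0.0, 1.0, 0.0],
        "sex": ["Female", "Female", "Male", "Male", "Female"],
        "metabolite_1": [2.3, 2.9, np.nan, 1.8, 3.1],
    }
)


def prepare_input(df):
    """Return an all-numeric copy of df without missing values."""
    out = df.copy()
    # Encode the subject label as an integer index.
    out["subject"] = pd.factorize(out["subject"])[0]
    # Encode the two-level factor as a 0/1 dummy: "is the subject female?"
    out["is_female"] = (out["sex"] == "Female").astype(int)
    out = out.drop(columns="sex")
    # Drop every row that contains a missing value.
    out = out.dropna().reset_index(drop=True)
    return out


prepared = prepare_input(raw)
print(prepared.dtypes)
```

Factors with more than two levels would need one dummy column per non-reference level, for example via `pd.get_dummies(..., drop_first=True)`.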

Output

The output of GPSearch.run_search contains the results of each Bayesian nonparametric regression model fit to the data. Each model corresponds to a kernel function (or a sum or product of kernels), and its results include, but are not restricted to, the BIC, the parameter estimates, and the residuals. The best kernel is selected based on the information criterion, and the coefficients of determination for each omics feature and all covariates can be displayed. The estimated mean function of the omics feature as a function of each covariate is provided alongside the corresponding residuals. Depending on the response distribution assumed for the omics feature (Gaussian and Poisson for now; the negative binomial distribution is under construction), the posterior mean of the omics feature is also included in the output. See the outputs of the waveome_overview.ipynb notebook for more details.
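To illustrate how an information criterion picks a kernel, here is a minimal sketch of BIC-based selection using the standard definition BIC = k ln(n) - 2 ln(L), where lower is better. The candidate labels, log-likelihoods, and parameter counts are made-up numbers, and this does not reproduce waveome's internal computation or API.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k * ln(n) - 2 * ln(L)."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


# Hypothetical fits: (kernel label, maximized log-likelihood, number of parameters).
candidates = [
    ("constant", -120.0, 2),
    ("time", -100.0, 4),
    ("time + subject", -98.5, 6),
]
n_obs = 50

# Score every candidate and keep the one with the smallest BIC.
scores = {name: bic(ll, k, n_obs) for name, ll, k in candidates}
best = min(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: BIC = {score:.2f}")
print("selected:", best)
```

In this toy example the richest kernel has the highest likelihood, but its two extra parameters cost more than the fit improves, so the simpler "time" kernel is selected.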

Tutorial

Multiple detailed IPython notebooks of waveome implementations are available in the examples, and the data required for the examples are available either in the data directory or in the corresponding application directory.

Applications

Here we apply waveome to different datasets and elaborate on the results.

Breastmilk RNA sequence, infant gut microbiome and metabolites analysis

GWDBB is a reference data library for clinical trials and omics data. It contains longitudinal gut microbiome and metabolomics data of infants and RNA from mothers' breast milk at different time points. Two longitudinal analyses were performed on these data and can be found in the breastmilk_infant_metabolites_Poisson.ipynb and Breastmilk_infant_Microbiome.ipynb notebook files.

Metagenomes targeting diverse body sites in multiple time-points

iHMP provides one of the broadest datasets of the human microbiome across different niches in the body at different time points. The available dataset was collected from 265 individuals. The longitudinal analysis for different body sites is presented in multioutput_ihmp.ipynb.

Treatment effect on longitudinal CD4 counts

This example considers the bivariate responses of HIV-1 RNA (count/ml) in seminal and blood samples of patients in HIV-RNA AIDS studies from the Seattle, Swiss, and UNCCH cohorts. The data were collected from N = 149 subjects divided into two groups: patients who were receiving a therapy (106 patients) and those with no therapy or an unknown therapy method (43 patients). The covariates are scaled time, baseline age, baseline CD4, and two factors, group and cohort. The data are also available through Wang (2013). The analysis using waveome is provided in CD4.ipynb.

Wang, W.-L. (2013). Multivariate t linear mixed models for irregularly observed multiple repeated measures with missing outcomes. Biom. J., 55: 554-571. doi:10.1002/bimj.201200001

Identifying important metabolites associated with inflammatory bowel disease


We used metabolomics data from the iHMP (Inflammatory Bowel Diseases) project (Lloyd-Price et al., 2017) for this application. Our goal was to characterize temporal dynamics of metabolites associated with the severity of IBD and other patient characteristics. This Jupyter Notebook illustrates the steps.

Support

  • Please submit your questions or issues with the software at the Issues tracker.
  • For community discussions, questions, and issue reporting, please visit our forum here.