This repository houses the code for the waveome package, an easy-to-use and powerful Python library for analyzing longitudinal data using Gaussian processes.
waveome is a computational method for longitudinal data analysis, designed to characterize and identify temporal dynamics of omics and clinical variables in association with a phenotype of interest. It uses Gaussian processes as priors to obtain a nonparametric estimate of the dynamics of the underlying measurements.
Key features:
- Generality: waveome is a new computational tool for identifying temporal dynamics significantly associated with phenotypes of interest.
- Validation: A comprehensive evaluation of waveome performance using synthetic data generation with known ground truth for genotype-phenotype association testing.
- Interpretation: By prioritizing comprehensive and flexible kernel functions, waveome significantly reduces computational costs.
- Elegance: User-friendly, open-source software allowing for high-quality visualization and statistical tests.
- Optimization: Since omics data are often very high dimensional, all modules have been written and benchmarked for computing time.
- Documentation: Open-source GitHub repository of code complete with tutorials and a wide range of real-world applications.
Citation:
Allen Ross, Ali Reza Taheriouyn, Jason Lloyd-Price, Ali Rahnavard (2024). **waveome: characterizing temporal dynamics of metabolites in longitudinal studies**, https://github.com/omicsEye/waveome/.
- Generic software that can handle any kind of sequencing data and phenotypes
- One place to perform all analyses and produce high-quality visualizations
- Optimized computation
- User-friendly software
- Provides temporal dynamics, associated omics features, and metadata
- Enhanced with diagnostic and summarizing visualizations
Running waveome involves several steps: installing the waveome package, loading data in the required format, specifying covariates and outcomes (omics features), running kernel services (which takes some time and computing resources), and visualizing overall and individual associations. All of these steps are demonstrated in the waveome_overview.ipynb notebook, an example of the package modeling simulated data that you can use as a template and modify for your own input data. Each step is explained in detail in the following sections of this tutorial.
To install the package, we suggest creating a new conda environment (Python >= 3.9 and <= 3.11 is required for TensorFlow).
waveome is installed in a conda environment, so you need to install conda first. Go to the Anaconda website and download the latest version for your operating system.
- For Windows users: do not forget to add conda to your system PATH. You will be asked about this during the installation procedure.
- Check that conda is available. Open a terminal (or a command prompt on Windows) and run:
conda --version
It should output something like:
conda 23.7.4
If it does not, you must make conda available on your system before continuing. If you have problems adding conda to PATH, you can find instructions here.
If you are using Windows, please make sure that you have both git and 'Microsoft Visual C++ 14.0' or later installed. You also need to install the Microsoft C++ build tools. If you run into issues with this step, this link may help you.
Regardless of what your operating system is, follow these steps:
1. Open a terminal on Linux or Mac, or a command prompt on Windows (press Win+R, type cmd, and press Enter), and use the following command to create a conda environment:
   conda create --name waveome_env python=3.11
2. Activate your conda environment:
   conda activate waveome_env
3. If you want to use waveome in a Python notebook, for instance in Jupyter Notebook (which is recommended for running the waveome_overview.ipynb sample file and the example projects), we recommend installing Jupyter Notebook in this environment before the pip installation of waveome. On any system other than an M1/M2 Mac, run the following in waveome_env in your terminal or command prompt and then go to step 4:
   conda install jupyter
   If you are an M1/M2 Mac user, first run the following in waveome_env:
   conda install -c conda-forge grpcio
   and afterwards run:
   conda install jupyter
4. Install waveome directly from GitHub:
   python -m pip install git+https://github.com/omicsEye/waveome
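After installation, a quick sanity check of the environment can save time later. The snippet below is a minimal sketch using only the Python standard library: it checks that the interpreter falls in the supported 3.9 to 3.11 range stated above and that a package can be found on the import path. The helper names `python_is_supported` and `package_is_installed` are hypothetical, and the import name `waveome` is an assumption based on the package name.

```python
import importlib.util
import sys

# Supported interpreter range stated in the installation notes (inclusive).
SUPPORTED_MIN = (3, 9)
SUPPORTED_MAX = (3, 11)


def python_is_supported(version=None):
    """Return True if the (major, minor) version lies in the supported range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return SUPPORTED_MIN <= major_minor <= SUPPORTED_MAX


def package_is_installed(name):
    """Return True if a top-level package can be located without importing it."""
    return importlib.util.find_spec(name) is not None


print("Python version supported:", python_is_supported())
# Assumption: the package is importable under the name "waveome".
print("waveome found:", package_is_installed("waveome"))
```

If either line prints False, check that waveome_env is active before retrying the pip installation.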
If you would like to run waveome_overview.ipynb, you should also set up a Jupyter kernel for the new waveome_env environment. First install ipykernel:
conda install -n waveome_env ipykernel
Then, to use waveome_env in Jupyter Notebook, register it as a kernel:
python -m ipykernel install --user --name=waveome_env
Change directory to where your IPython notebook is located:
cd /PATH-TO_YOUR_iPythonNotebook-DiRECTORY
Then, while waveome_env is active, start Jupyter Notebook from the terminal:
jupyter notebook
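Once the notebook is open, it can be worth confirming that it really runs on the waveome_env kernel. A minimal sketch, assuming the usual conda layout in which the environment's directory name matches the environment name; `active_env_name` is a hypothetical helper, not part of waveome:

```python
import os
import sys


def active_env_name(prefix=None):
    """Return the last path component of the interpreter prefix.

    For a conda environment this is normally the environment name,
    e.g. ".../envs/waveome_env" gives "waveome_env".
    """
    return os.path.basename(os.path.normpath(prefix or sys.prefix))


# Run this in the first notebook cell; it should print "waveome_env"
# when the notebook uses the kernel registered above.
print(active_env_name())
```

If a different name is printed, select the waveome_env kernel from the notebook's kernel menu.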
As input, waveome requires a pandas data frame that contains at least:
- A subject/individual/patient index column.
- Covariate columns; in longitudinal studies these include the time at which each sample was observed.
- An omics feature measurement.
All of the above items must be stored as numeric types (int or float) in the data frame.
The subject index is used to estimate the subject effect. Categorical factors should be encoded as dummy variables. For instance, a factor Sex can be coded as 'Is the subject female?', where '1' means "yes". waveome does not handle samples with missing values, so rows with missing values must be deleted before running GPSearch.
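The input requirements can be illustrated with a small, made-up data frame. The sketch below is not part of waveome: the column names, the values, and the helper `prepare_input` are all hypothetical. It encodes a Sex factor as a 0/1 dummy, drops rows with missing values, and checks that every remaining column is numeric.

```python
import pandas as pd

# Made-up toy data: three subjects observed at two time points each.
raw = pd.DataFrame(
    {
        "id": [1, 1, 2, 2, 3, 3],
        "time": [0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
        "sex": ["F", "F", "M", "M", "F", "F"],
        "metabolite_1": [2.3, 2.9, None, 1.7, 3.1, 3.4],
    }
)


def prepare_input(df):
    """Return a numeric, complete-case copy of df (hypothetical helper)."""
    out = df.copy()
    # Dummy coding: 1 means "the subject is female", 0 otherwise.
    out["sex"] = (out["sex"] == "F").astype(int)
    # Remove every row that contains a missing value.
    out = out.dropna().reset_index(drop=True)
    # Refuse to continue if any column is still non-numeric.
    non_numeric = [
        col for col in out.columns if not pd.api.types.is_numeric_dtype(out[col])
    ]
    if non_numeric:
        raise TypeError(f"Non-numeric columns remain: {non_numeric}")
    return out


clean = prepare_input(raw)
print(clean)
```

Here the row with the missing metabolite value is removed, leaving five complete rows.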
The output of GPSearch.run_search contains the results of each Bayesian nonparametric regression model fit to the data, one per kernel function (or sum or product of kernels), including but not limited to the BIC, the corresponding parameter estimates, and the residuals. The best kernel is selected based on the information criterion, and the coefficients of determination for each omics feature and all the covariates can be displayed. The estimated mean function of the omics feature as a function of each covariate is provided alongside the corresponding residuals. Depending on the distribution assumed for the omics feature (currently Gaussian or Poisson; the negative binomial distribution is under development), the posterior mean of the omics feature is also included in the output. We refer users to the outputs of the waveome_overview.ipynb notebook for more details.
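To make the selection criterion concrete, here is a minimal sketch of choosing among candidate kernels by BIC, using the standard definition BIC = k ln(n) - 2 ln(L), where lower is better. This is an illustration only and not waveome's internal code: the kernel labels, log-likelihood values, and parameter counts are invented for the example.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k * ln(n) - 2 * ln(L). Lower is better."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


# Invented fits: kernel label -> (maximized log-likelihood, number of parameters).
candidate_fits = {
    "time": (-120.0, 3),
    "id + time": (-110.0, 5),
    "id + time * sex": (-109.5, 8),
}
N_OBS = 100  # invented number of observations

scores = {
    kernel: bic(log_lik, n_params, N_OBS)
    for kernel, (log_lik, n_params) in candidate_fits.items()
}
best_kernel = min(scores, key=scores.get)

for kernel, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{kernel}: BIC = {score:.2f}")
print("Selected kernel:", best_kernel)
```

The richest kernel has the highest likelihood, but its extra parameters are penalized, so the intermediate kernel is selected in this toy case.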
Several detailed IPython notebooks showing waveome in use are available in the examples, and the data required for the examples are available either in the data directory or in the corresponding application directory.
Here we apply waveome to different datasets and elaborate on the results.
GWDBB is a reference data library for clinical trials and omics data. It contains longitudinal gut microbiome and metabolomics data from infants, and breast milk RNA data from their mothers, at different time points. Two longitudinal analyses have been performed on these data and can be found in the breastmilk_infant_metabolites_Poisson.ipynb and Breastmilk_infant_Microbiome.ipynb notebook files.
iHMP provides one of the broadest datasets of the human microbiome, sampled from different niches in the body at different time points. The available dataset was collected from 265 individuals. The longitudinal analyses for different body sites are presented in multioutput_ihmp.ipynb.
This example considers the bivariate responses of HIV-1 RNA (count/ml) in semen and blood of patients in HIV-RNA AIDS studies from the Seattle, Swiss and UNCCH cohorts. The data were collected from N = 149 subjects divided into two groups: patients who were receiving a therapy (106 patients) and those with no therapy or an unknown therapy method (43 patients). The covariates are scaled time, baseline age, baseline CD4 and two factors, group and cohort. The data are also available through Wang (2013). The analysis using waveome is provided in CD4.ipynb.
Wang, W.-L. (2013). Multivariate t linear mixed models for irregularly observed multiple repeated measures with missing outcomes. Biom. J., 55: 554-571. doi:10.1002/bimj.201200001
We used metabolomics data from the iHMP (Inflammatory Bowel Diseases) project (Lloyd-Price et al., 2017) for this application. Our goal was to characterize the temporal dynamics of metabolites associated with the severity of IBD and other patient characteristics. This Jupyter Notebook illustrates the steps.
- Please submit your questions or issues with the software on the Issues tracker.
- For community discussions, questions, and issue reporting, please visit our forum here