hakyimlab/scPrediXcan

scPrediXcan: Leveraging Single-Cell Data for Cell-Type–Specific Transcriptome-Wide Association Studies Using Transfer Learning

Description

Single-cell PrediXcan (scPrediXcan) is a framework designed to perform Transcriptome-Wide Association Studies (TWAS) at the cell-type level using single-cell data. This framework utilizes GWAS summary statistics and single-cell RNA-seq data to assess the association between gene expression and disease risk.

The scPrediXcan workflow consists of three steps: 1) training a deep learning model named ctPred for epigenomics-to-expression prediction at the cell-type level, 2) linearizing ctPred into a SNP-based elastic net model for downstream association tests using GWAS summary statistics, and 3) performing the association tests between genes and the trait of interest.

The framework is detailed in the bioRxiv preprint.

You can download the cell-type-specific prediction models and covariances from the links here.

Setup and installation

scPrediXcan mainly requires Python, Nextflow, and deep learning packages such as PyTorch and TensorFlow. You can install all the required packages and software by creating a new conda environment named scPrediXcan from the scPrediXcan_env.yml file.

conda env create -f scPrediXcan_env.yml
conda activate scPrediXcan
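
After activating the environment, a quick check that the main dependencies are visible can save time later. The snippet below is a minimal sketch, not part of the scPrediXcan repository. It assumes that PyTorch and TensorFlow are importable as torch and tensorflow and that Nextflow is on the PATH as nextflow; the function name check_environment is made up for illustration.

```python
import importlib.util
import shutil


def check_environment(modules=("torch", "tensorflow"), executables=("nextflow",)):
    """Return {name: True/False} for each required module and executable."""
    status = {}
    for name in modules:
        # find_spec returns None when a top-level module is not installed
        status[name] = importlib.util.find_spec(name) is not None
    for exe in executables:
        # shutil.which returns None when the executable is not on PATH
        status[exe] = shutil.which(exe) is not None
    return status


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

Any entry reported as MISSING suggests the conda environment was not created or activated correctly.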

Usage

Step1: Training the ctPred model

Note: if you only want to run TWAS for a disease using the pre-trained models of the given cell types, skip steps 1 and 2 and go directly to step 3 with the pre-trained l-ctPred models.

ctPred is a multilayer perceptron that predicts gene expression at the pseudobulk level from gene epigenomic representations (e.g., Enformer-output epigenomic features). The inputs to this step are a population-average gene expression file and a gene epigenomic features file. Because both are relatively small, they are combined into a single training data CSV file. The output of this step is a .pt file storing the model weights.

python ctPred_train.py --parameters ctPred_train.json --cell_file 'training_data.csv'

Here is an example of the training data file; it differs slightly from the example training data file in the tutorial. In the tutorial, the expression data has already been combined with the epigenomics data. Here, you only need to provide the expression data and specify the path to the epigenomics data in the JSON file, and the script combines them.

Alternatively, pass a folder that contains the training data files:

python ctPred_train.py --parameters ctPred_train.json --data_dir 'The_path_of_the_folder' # inside the folder, you have many training data files (.csv) for different cell types

For the details, check the code and tutorial here. The Enformer-predicted epigenomic features of protein-coding genes are shared here.
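
To make the combining step in Step 1 concrete, the sketch below joins a per-gene expression table with a per-gene epigenomic feature table on a shared gene identifier. It is an illustration only and not the code in ctPred_train.py: the column names (gene_id, mean_expression, feat_0, feat_1), the toy values, and the function merge_training_rows are all assumptions.

```python
import csv
import io


def merge_training_rows(expression_rows, feature_rows, key="gene_id"):
    """Inner-join expression and epigenomic feature rows on a gene identifier."""
    features_by_gene = {row[key]: row for row in feature_rows}
    merged = []
    for row in expression_rows:
        feats = features_by_gene.get(row[key])
        if feats is None:
            continue  # skip genes that have no epigenomic features
        combined = dict(feats)
        combined.update(row)  # adds the expression column(s)
        merged.append(combined)
    return merged


# Toy inputs; the column names and values are made up for illustration.
expression_csv = "gene_id,mean_expression\nG1,2.5\nG2,0.0\nG3,1.2\n"
features_csv = "gene_id,feat_0,feat_1\nG1,0.1,0.9\nG3,0.4,0.3\n"

expression_rows = list(csv.DictReader(io.StringIO(expression_csv)))
feature_rows = list(csv.DictReader(io.StringIO(features_csv)))
training_rows = merge_training_rows(expression_rows, feature_rows)
```

In this toy case gene G2 is dropped because it has no feature row, so the resulting table holds one row per gene with both features and the expression target.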

Step2: Linearizing the ctPred into l-ctPred

scPrediXcan uses the PrediXcan implementation to train an elastic-net model for ctPred linearization. In this step, we use genotype data from 448 Geuvadis individuals together with ctPred-predicted gene expression profiles to fit an elastic-net model for the corresponding cell type. In principle, alternative genotype reference panels can also be used at this stage.
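
For intuition about the linearization, the sketch below fits an elastic net by coordinate descent, regressing a predicted expression vector on SNP dosages for one gene. It is a toy, dependency-free illustration and not the PredictDb pipeline's implementation. The objective used here, (1/2n)||y - Xb||^2 + lam * (alpha * ||b||_1 + (1 - alpha)/2 * ||b||_2^2), the parameter names, and the toy data are all assumptions made for the example.

```python
def soft_threshold(z, gamma):
    """Soft-thresholding operator used by the L1 part of the penalty."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def elastic_net_cd(X, y, lam, alpha=0.5, n_iter=200):
    """Fit elastic-net weights by cyclic coordinate descent on centered data.

    X: list of rows (individuals) of SNP dosages; y: predicted expression.
    Returns one weight per SNP (the intercept is absorbed by centering).
    """
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    resid = [yi - y_mean for yi in y]  # residual for beta = 0
    col_sq = [sum(Xc[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # monomorphic SNP carries no information
            # Correlation of SNP j with the partial residual (SNP j excluded)
            rho = sum(Xc[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam * alpha) / (col_sq[j] + lam * (1.0 - alpha))
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= Xc[i][j] * delta
                beta[j] = new
    return beta


# Toy data: expression depends only on the first SNP.
toy_X = [[0, 1], [1, 0], [2, 1], [1, 2], [0, 0], [2, 2]]
toy_y = [1.5 * row[0] for row in toy_X]
weights = elastic_net_cd(toy_X, toy_y, lam=0.01)
```

With a small penalty the first SNP's weight stays close to the true effect of 1.5, while a large penalty shrinks every weight to zero. The nonzero weights are what a linearized model would store per gene.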

Here is a Nextflow pipeline for l-ctPred generation. The inputs are a genotype file and a ctPred-predicted cell-type-specific gene expression file. The outputs are a transcriptome model SQLite database (i.e., l-ctPred) and a SNP covariance matrix file, both of which are used in the final association analysis step. The detailed procedure for step 2 is:

  1. Clone the PredictDb-nextflow repository.
git clone https://github.com/hakyimlab/PredictDb-nextflow.git
  2. Run the PredictDb Nextflow pipeline.
nextflow run ./main.nf \
--gene_annotation 'Gene_anno.txt' \
--snp_annotation 'snp_annot.txt' \
--genotype 'genotype_file' \
--gene_exp 'ctPred_predicted_gene_expression.csv' \
--outdir results \
--keepIntermediate \
-resume

The detailed descriptions of the pipeline and the used data/output are here.

Step3: Performing association test between genes and traits

scPrediXcan uses Summary-PrediXcan (S-PrediXcan) to run the association test. A detailed description of S-PrediXcan is here. The inputs to this step are a transcriptome model SQLite database (i.e., l-ctPred), GWAS or meta-analysis summary statistics, and SNP covariance matrices. The l-ctPred database and the SNP covariance matrices come from the previous step. The detailed procedure for step 3 is:

  1. Clone the S-PrediXcan repository and go to the software folder.
git clone https://github.com/hakyimlab/MetaXcan
cd MetaXcan/software
  2. Run the high-level S-PrediXcan script.
./SPrediXcan.py \
--model_db_path 'l-ctPred_celli.db' \
--covariance 'covariance.txt.gz' \
--gwas_folder data/GWAS \
--gwas_file_pattern ".*gz" \
--snp_column SNP \
--effect_allele_column A1 \
--non_effect_allele_column A2 \
--beta_column BETA \
--pvalue_column P \
--output_file 'results/TWAS_result.csv'
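
To show how the three inputs fit together, the sketch below computes a gene-level z-score from model weights, a SNP covariance matrix, and GWAS z-scores, following the published S-PrediXcan approximation Z_g = sum_l w_l * (sigma_l / sigma_g) * z_l with sigma_g^2 = w' * Cov * w. It is a simplified illustration, not MetaXcan's code: the function names, the dictionary-based inputs, and the toy numbers are assumptions, and allele harmonization and missing SNPs are ignored.

```python
import math


def spredixcan_zscore(weights, snp_cov, gwas_z):
    """Gene-level z-score from SNP weights, SNP covariance, and GWAS z-scores.

    weights: {snp: weight}; snp_cov: {snp: {snp: covariance}};
    gwas_z: {snp: beta / standard error from the GWAS}.
    Returns None when the predicted expression has zero variance.
    """
    snps = list(weights)
    # Variance of predicted expression: w' * Cov * w
    var_g = sum(weights[a] * snp_cov[a][b] * weights[b] for a in snps for b in snps)
    if var_g <= 0.0:
        return None
    sigma_g = math.sqrt(var_g)
    # Each SNP contributes weight * (SNP standard deviation) * GWAS z-score
    numerator = sum(weights[s] * math.sqrt(snp_cov[s][s]) * gwas_z[s] for s in snps)
    return numerator / sigma_g


def two_sided_pvalue(z):
    """Two-sided p-value of a z-score under the standard normal."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Toy example: two uncorrelated SNPs with equal weights (made-up numbers).
toy_weights = {"rs1": 1.0, "rs2": 1.0}
toy_cov = {"rs1": {"rs1": 1.0, "rs2": 0.0}, "rs2": {"rs1": 0.0, "rs2": 1.0}}
toy_gwas_z = {"rs1": 2.0, "rs2": 2.0}
gene_z = spredixcan_zscore(toy_weights, toy_cov, toy_gwas_z)
gene_p = two_sided_pvalue(gene_z)
```

A useful sanity check is that a model with a single SNP returns that SNP's GWAS z-score unchanged, whatever its weight and variance.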

Pipeline details:

This step should take less than a minute on a 3 GHz computer. For the full specification of command-line parameters, check the wiki and the tutorial. The output CSV file is the TWAS result; detailed descriptions of each column are here.
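
Once the result file exists, a common next step is to rank genes and apply a multiple-testing threshold. The sketch below is one possible way to do that with a Bonferroni correction. It assumes the output contains gene_name, zscore, and pvalue columns, as in standard S-PrediXcan output; the helper significant_genes and the toy rows are made up for illustration.

```python
import csv
import io


def significant_genes(rows, alpha=0.05):
    """Return (gene_name, zscore, pvalue) tuples passing a Bonferroni threshold."""
    tested = []
    for row in rows:
        try:
            tested.append((row["gene_name"], float(row["zscore"]), float(row["pvalue"])))
        except ValueError:
            continue  # skip genes with missing ("NA") statistics
    if not tested:
        return []
    threshold = alpha / len(tested)  # Bonferroni over genes actually tested
    hits = [t for t in tested if t[2] < threshold]
    return sorted(hits, key=lambda t: t[2])


# Toy rows standing in for results/TWAS_result.csv (values are made up).
toy_csv = (
    "gene,gene_name,zscore,pvalue\n"
    "ENSG_A,GENE_A,5.2,2.0e-07\n"
    "ENSG_B,GENE_B,-1.1,0.27\n"
    "ENSG_C,GENE_C,NA,NA\n"
)
hits = significant_genes(list(csv.DictReader(io.StringIO(toy_csv))))
```

To use it on a real run, replace the toy string with an open file handle on the output CSV.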

You can download example data here. This may take a few minutes depending on your connection, since the download is approximately 200 MB. The downloaded data includes all the input data needed.

Citation

If you find this code useful, we would appreciate it if you cite the following publication