scPrediXcan: Leveraging Single-Cell Data for Cell-Type–Specific Transcriptome-Wide Association Studies Using Transfer Learning
Single-cell PrediXcan (scPrediXcan) is a framework designed to perform Transcriptome-Wide Association Studies (TWAS) at the cell-type level using single-cell data. This framework utilizes GWAS summary statistics and single-cell RNA-seq data to assess the association between gene expression and disease risk.
The scPrediXcan workflow consists of three steps: 1) training a deep learning model named ctPred for epigenomics-to-expression prediction at the cell-type level, 2) linearizing ctPred into a SNP-based elastic-net model (l-ctPred) for downstream association tests using GWAS summary statistics, and 3) performing the association tests between genes and the trait of interest.
The framework is detailed in the bioRxiv preprint.
You can download the cell-type-specific prediction models and covariances from the links here.
scPrediXcan mainly requires Python, Nextflow, and deep learning packages such as PyTorch and TensorFlow. You can install all the required packages and software by creating a new conda environment named scPrediXcan from the scPrediXcan_env.yml file.
conda env create -f scPrediXcan_env.yml
conda activate scPrediXcan
Note: if you only want to use the pre-trained models of the given cell types for TWAS of a particular disease, skip steps 1 and 2 and go directly to step 3 with the pre-trained l-ctPred models.
ctPred is a multilayer perceptron that predicts pseudobulk-level gene expression from gene epigenomic representations (e.g., Enformer-output epigenomic features). The inputs to this step are a population-average gene expression file and a gene epigenomic features file; because both are relatively small, they are combined into a single training data CSV file. The output of this step is a .pt file storing the model weights.
python ctPred_train.py --parameters ctPred_train.json --cell_file 'training_data.csv'
Here is an example of the training data file. It differs slightly from the example training data file in the tutorial: in the tutorial, the expression data has already been combined with the epigenomics data, whereas here you only need to provide the expression data and specify the path to the epigenomics data in the JSON file, and the script will combine them.
or
python ctPred_train.py --parameters ctPred_train.json --data_dir 'The_path_of_the_folder' # inside the folder, you have many training data files (.csv) for different cell types
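As a rough illustration of the combining step described above, the sketch below joins a per-gene expression table with a per-gene epigenomic feature table on a shared gene identifier. It is not the code in ctPred_train.py; the column names (gene_id, mean_expression, feat_0, feat_1) and the gene labels are hypothetical placeholders, and the real files may be laid out differently.

```python
import csv
import io


def combine_expression_and_features(expr_rows, feature_rows, gene_key="gene_id"):
    """Inner-join expression rows with epigenomic feature rows on a gene ID column."""
    features_by_gene = {row[gene_key]: row for row in feature_rows}
    combined = []
    for row in expr_rows:
        feats = features_by_gene.get(row[gene_key])
        if feats is None:
            continue  # skip genes that have no epigenomic features
        merged = dict(feats)   # start from the feature columns
        merged.update(row)     # add the expression column(s)
        combined.append(merged)
    return combined


# Toy in-memory CSVs standing in for the real files (all names are hypothetical).
expression_csv = "gene_id,mean_expression\nGENE_A,2.5\nGENE_B,0.0\nGENE_C,1.1\n"
features_csv = "gene_id,feat_0,feat_1\nGENE_A,0.12,0.80\nGENE_C,0.33,0.05\n"

expr_rows = list(csv.DictReader(io.StringIO(expression_csv)))
feature_rows = list(csv.DictReader(io.StringIO(features_csv)))
training_rows = combine_expression_and_features(expr_rows, feature_rows)
```

Only genes present in both tables are kept, so GENE_B (which has no features in the toy data) is dropped from the combined training rows.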
For the details, check the code and tutorial here. The Enformer-predicted epigenomic features of protein-coding genes are shared here.
scPrediXcan uses the PrediXcan implementation to train an elastic-net model for ctPred linearization. In this step, we use the genotype data from 448 Geuvadis individuals, together with the ctPred-predicted gene expression profiles, to fit an elastic-net model for the corresponding cell type. In principle, alternative genotype reference panels can also be used at this stage.
Here is a Nextflow pipeline for l-ctPred generation. The inputs are a genotype file and a ctPred-predicted cell-type-specific gene expression file. The outputs are a transcriptome model SQLite database (i.e., l-ctPred) and a SNP covariance matrix file, both of which are used in the final association analysis step. The detailed procedure for step 2 is as follows:
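To make the idea of linearization concrete, here is a minimal pure-Python sketch of an elastic-net fit by coordinate descent, regressing predicted expression for one gene on SNP dosages. It is an illustration only, not the PredictDb implementation; the function names, the penalty settings (lam, alpha), and the toy dosage matrix are all assumptions made for the example.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the lasso proximal operator)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_elastic_net(X, y, lam=0.1, alpha=0.5, n_iter=200):
    """Minimize (1/2n)||y - Xw||^2 + lam*(alpha*|w|_1 + (1-alpha)/2*|w|_2^2)."""
    n, p = len(X), len(X[0])
    # Center dosages and expression so that no intercept term is needed.
    x_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_means[j] for j in range(p)] for row in X]
    residual = [v - y_mean for v in y]  # equals yc - Xc @ weights (weights start at 0)
    weights = [0.0] * p
    col_sq = [sum(Xc[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # SNP with no variation carries no information
            # Correlation of SNP j with the partial residual (SNP j's effect added back).
            rho = sum(Xc[i][j] * residual[i] for i in range(n)) / n + col_sq[j] * weights[j]
            new_w = soft_threshold(rho, lam * alpha) / (col_sq[j] + lam * (1 - alpha))
            delta = new_w - weights[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= Xc[i][j] * delta
                weights[j] = new_w
    return weights


# Toy data: 6 individuals x 3 SNPs (dosages 0/1/2); expression depends on SNP 0 only.
dosages = [[0, 0, 1], [1, 0, 0], [2, 0, 1], [0, 2, 1], [1, 2, 0], [2, 2, 1]]
predicted_expression = [1.0 + 0.8 * row[0] for row in dosages]
snp_weights = fit_elastic_net(dosages, predicted_expression)
```

In this toy setting the fit assigns a positive, shrunken weight to the first SNP and zero weight to the other two; per-SNP weights of this kind are what a linearized expression model stores for each gene.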
- Clone the PredictDb-nextflow repository.
git clone https://github.com/hakyimlab/PredictDb-nextflow.git
- Run the PredictDb nextflow pipeline.
nextflow run ./main.nf \
--gene_annotation 'Gene_anno.txt' \
--snp_annotation 'snp_annot.txt' \
--genotype 'genotype_file' \
--gene_exp 'ctPred_predicted_gene_expression.csv' \
--outdir results \
--keepIntermediate \
-resume
The detailed descriptions of the pipeline and the used data/output are here.
scPrediXcan uses Summary-PrediXcan (S-PrediXcan) to run the association test. A detailed description of S-PrediXcan is here. The inputs to this step are a transcriptome model SQLite database (i.e., l-ctPred), GWAS or meta-analysis summary statistics, and SNP covariance matrices. The l-ctPred database and the SNP covariance matrices are obtained from the previous step. The detailed procedure for step 3 is as follows:
- Clone the S-PrediXcan repository and go to the software folder.
git clone https://github.com/hakyimlab/MetaXcan
cd MetaXcan/software
- Run the High-Level S-PrediXcan Script
./SPrediXcan.py \
--model_db_path 'l-ctPred_celli.db' \
--covariance 'covariance.txt.gz' \
--gwas_folder data/GWAS \
--gwas_file_pattern ".*gz" \
--snp_column SNP \
--effect_allele_column A1 \
--non_effect_allele_column A2 \
--beta_column BETA \
--pvalue_column P \
--output_file 'results/TWAS_result.csv'
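For intuition about what the association test computes, the sketch below evaluates the gene-level z-score from the published S-PrediXcan approximation, Z_g = sum_l w_l * (sigma_l / sigma_g) * z_l, where w_l are the model weights, z_l are the GWAS z-scores (beta / se), sigma_l^2 is the variance of SNP l, and sigma_g^2 = w' * Gamma * w uses the SNP covariance matrix Gamma. This is a simplified illustration, not the MetaXcan code; the function names and all numbers are made up for the example.

```python
import math


def spredixcan_zscore(weights, gwas_z, covariance):
    """Gene-level z-score from SNP weights, GWAS z-scores, and a SNP covariance matrix."""
    n = len(weights)
    # Variance of predicted expression: w' * Gamma * w
    var_g = sum(
        weights[i] * covariance[i][j] * weights[j]
        for i in range(n)
        for j in range(n)
    )
    if var_g <= 0.0:
        return float("nan")  # no usable variation in predicted expression
    sigma_g = math.sqrt(var_g)
    # Weighted sum of GWAS z-scores, each scaled by the SNP standard deviation.
    numerator = sum(
        weights[l] * math.sqrt(covariance[l][l]) * gwas_z[l] for l in range(n)
    )
    return numerator / sigma_g


def two_sided_pvalue(z):
    """Two-sided p-value of a standard normal z-score."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Toy inputs for one gene with two SNPs (all values are illustrative).
toy_weights = [0.5, -0.2]
toy_gwas_z = [3.0, -1.0]
toy_covariance = [[0.4, 0.1], [0.1, 0.3]]

gene_z = spredixcan_zscore(toy_weights, toy_gwas_z, toy_covariance)
gene_p = two_sided_pvalue(gene_z)
```

For a gene with a single SNP and a positive weight, the expression reduces to the GWAS z-score of that SNP, which is a quick sanity check on the formula.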
Pipeline details:
This step should take less than a minute on a 3 GHz computer. For the full specification of the command-line parameters, see the wiki and the tutorial. The output CSV file contains the TWAS results; detailed descriptions of each column are here.
You can download example data here. The download is approximately 200 MB, so it may take a few minutes depending on your connection. The downloaded data include all the input data needed.
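Once the result file exists, a common next step is to select genes that pass a multiple-testing threshold. The sketch below applies a Bonferroni correction over the tested genes. It assumes the result file has gene_name, zscore, and pvalue columns and that untested genes are marked NA; check the column descriptions for your version before relying on this. The rows shown are invented placeholders, not real results.

```python
import csv
import io


def significant_genes(rows, alpha=0.05):
    """Return rows passing a Bonferroni threshold, sorted by ascending p-value."""
    tested = [r for r in rows if r["pvalue"] not in ("", "NA")]
    if not tested:
        return []
    threshold = alpha / len(tested)  # Bonferroni: divide by the number of tested genes
    hits = [r for r in tested if float(r["pvalue"]) < threshold]
    return sorted(hits, key=lambda r: float(r["pvalue"]))


# Placeholder rows standing in for the S-PrediXcan output (values are made up).
toy_results = (
    "gene,gene_name,zscore,pvalue\n"
    "G1,GENE_A,5.2,2.0e-7\n"
    "G2,GENE_B,1.1,0.27\n"
    "G3,GENE_C,-4.9,9.6e-7\n"
    "G4,GENE_D,NA,NA\n"
)

rows = list(csv.DictReader(io.StringIO(toy_results)))
hits = significant_genes(rows)
hit_names = [r["gene_name"] for r in hits]
```

To run this on a real output, replace the in-memory string with an open file handle, for example `csv.DictReader(open("results/TWAS_result.csv", newline=""))`.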
If you find this code useful, we would appreciate it if you cite the following publication