scPCI

single-cell Post-Clustering Inference

Abstract

Single-cell RNA sequencing (scRNA-seq) is useful for uncovering hidden cellular heterogeneity in a cell population. In scRNA-seq data analysis, clustering is commonly used to identify groups of cells, and cluster-specific genes, which are differentially expressed (DE) in one or more clusters, are then detected. Unfortunately, because no valid statistical method exists for computing p-values, this cluster-specific DE gene detection has been done subjectively "by eye", i.e., based on visualization tools such as t-SNE. The intrinsic difficulty of the statistical analysis of cluster-specific DE genes lies in the double-dipping effect: the scRNA-seq data are first used to identify clusters, and then the same data are used again to detect cluster-specific DE genes.

We develop a new statistical method called single-cell post-clustering inference (scPCI) that properly corrects the clustering bias by using a recent statistical framework called selective inference. We demonstrate the validity and usefulness of the scPCI method in controlled simulation studies and re-analyses of published scRNA-seq datasets. The scPCI method enables researchers to obtain valid p-values for cluster-specific DE genes, which makes scRNA-seq analysis more quantitative, reliable, and reproducible.

Environment Requirements

  • Python version 2.7 or 3.6
  • If a Python "ImportError" occurs, install the missing package.

To reproduce the results of the real data analysis, you must run the preprocessing code, which is written in R, so you may also need the following environment.

  • R version 3.5
  • If a required R package is missing, install it.

Usage

In our study, we re-analyzed two datasets, PBMC and FACS fat. The flow of our analysis is explained below.

Preprocessing for real data

First, run the following command.

create_dir.py

Then perform the preprocessing for the two datasets. Preprocessing basically follows the original papers (PBMC dataset, FACS fat). The R code for preprocessing is in the "Preprocessing" directory. Download the two datasets into the same directory, as described in the original papers.

1. Clustering

Perform clustering on the preprocessed data.

python clustering_[dataset name].py

The dataset name is either "PBMC" or "FACSfat".

2. Post-Clustering Inference

Run the scPCI method, choosing either scPCI-gene or scPCI-cluster. To perform scPCI-gene, run the following command.

Run_[dataset name].py

To perform scPCI-cluster, run the following command.

Run_[dataset name]_gn.py
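Steps 1 and 2 can be chained with a small driver. The sketch below is a hypothetical helper, not a script shipped with this repository; it assumes the scripts are started with the `python` interpreter from the directory that contains them, and it only fills the dataset name into the script names given above.

```python
import subprocess


def pipeline_commands(dataset, mode="gene"):
    """Build the commands for clustering followed by scPCI.

    dataset: "PBMC" or "FACSfat"
    mode: "gene" for scPCI-gene, "cluster" for scPCI-cluster
    """
    if dataset not in ("PBMC", "FACSfat"):
        raise ValueError("dataset must be 'PBMC' or 'FACSfat'")
    if mode not in ("gene", "cluster"):
        raise ValueError("mode must be 'gene' or 'cluster'")

    # Script names follow the patterns in this README.
    clustering_script = "clustering_{}.py".format(dataset)
    if mode == "gene":
        run_script = "Run_{}.py".format(dataset)
    else:
        run_script = "Run_{}_gn.py".format(dataset)

    # Assumption: each script is invoked as `python <script>`.
    return [["python", clustering_script], ["python", run_script]]


def run_pipeline(dataset, mode="gene"):
    """Run the commands in order and stop at the first failure."""
    for command in pipeline_commands(dataset, mode):
        subprocess.check_call(command)
```

For example, `run_pipeline("PBMC", mode="gene")` would run `clustering_PBMC.py` and then `Run_PBMC.py`.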

Recalculation of p-value

In real data analysis, it is often impossible to calculate the p-values numerically, so we provide code that approximates them by importance sampling. If the results of step 2 contain nan, run the following after step 2.

Recomp_pval.py

Recomp_pval_gn.py
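To give an idea of how importance sampling can approximate a probability that is too small to compute reliably by direct simulation, here is a generic illustration. It is not the code in Recomp_pval.py; it assumes a standard normal tail probability P(Z > t) as the target and a normal proposal shifted to the threshold, and the function name is hypothetical.

```python
import math
import random


def tail_prob_importance_sampling(t, n_samples=100000, seed=0):
    """Estimate P(Z > t) for Z ~ N(0, 1) by importance sampling."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        # Proposal: N(t, 1), so about half of the draws land in the tail.
        x = rng.gauss(t, 1.0)
        if x > t:
            # Importance weight = target density / proposal density
            #                   = exp(-t * x + t^2 / 2).
            total += math.exp(-t * x + 0.5 * t * t)
    return total / float(n_samples)


if __name__ == "__main__":
    t = 5.0
    estimate = tail_prob_importance_sampling(t)
    exact = 0.5 * math.erfc(t / math.sqrt(2.0))
    print("importance sampling:", estimate)
    print("exact tail probability:", exact)
```

Plain Monte Carlo with the same number of samples would almost never draw a value beyond the threshold, whereas the shifted proposal concentrates samples in the region that matters and reweights them.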

Create Figures

The outputs of the steps above are saved in a directory named "Result". To reproduce the figures in the paper, move the results of PCI-gene from "Result" to PCI_[dataset name]/each/, and similarly those of PCIgn-gene from "Result" to PCI_[dataset name]/global/.
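Moving the results can be scripted. The sketch below is a hypothetical helper, not part of the repository; it assumes that everything currently in "Result" belongs to the variant you name (so run it right after the corresponding step) and that the target directory may need to be created.

```python
import os
import shutil


def move_results(dataset, variant, root=".", result_dir="Result"):
    """Move all files from Result to PCI_[dataset name]/each or /global.

    variant: "PCI-gene" (moved to each/) or "PCIgn-gene" (moved to global/)
    """
    subdir = {"PCI-gene": "each", "PCIgn-gene": "global"}[variant]
    src = os.path.join(root, result_dir)
    dst = os.path.join(root, "PCI_{}".format(dataset), subdir)
    if not os.path.isdir(dst):
        os.makedirs(dst)

    moved = []
    # Assumption: every entry in Result belongs to the chosen variant.
    for name in sorted(os.listdir(src)):
        shutil.move(os.path.join(src, name), os.path.join(dst, name))
        moved.append(name)
    return moved
```

For example, `move_results("PBMC", "PCI-gene")` moves the contents of Result into PCI_PBMC/each/.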

License

GNU General Public License