single-cell Post-Clustering Inference
Single-cell RNA-sequencing (scRNA-seq) is useful for uncovering hidden cellular heterogeneity in a cell population. In scRNA-seq data analysis, clustering is commonly used to identify cell groups, and then cluster-specific genes, i.e., genes that are differentially expressed (DE) in one or more clusters, are detected. Unfortunately, because no valid statistical method has been available for computing p-values, cluster-specific DE genes have so far been detected subjectively "by eye", i.e., based on visualization tools such as t-SNE. The intrinsic difficulty in the statistical analysis of cluster-specific DE genes lies in the double-dipping effect: the scRNA-seq data is first used to identify clusters, and then the same data is used again to detect cluster-specific DE genes.
We develop a new statistical method called single-cell post-clustering inference (scPCI) that properly corrects the clustering bias by using a recent statistical framework called selective inference. We demonstrate the validity and usefulness of the scPCI method in controlled simulation studies and re-analyses of published scRNA-seq datasets. The scPCI method enables researchers to obtain valid p-values for cluster-specific DE genes, which makes scRNA-seq analysis more quantitative, reliable and reproducible.
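The double-dipping effect can be seen in a small simulation. The sketch below is only an illustration and is not part of the scPCI code: it assumes a single null "gene" drawn from N(0, 1) for every cell, uses a median split as a toy stand-in for clustering, and uses hypothetical helper names (welch_t, naive_t_after_clustering, random_split_t). It compares how often a naive t-test rejects when the groups come from the data itself versus when they are chosen independently of the data.

```python
import random
import statistics
from math import sqrt


def welch_t(a, b):
    """Welch's t statistic for two samples."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / sqrt(va / len(a) + vb / len(b))


def naive_t_after_clustering(n_cells, rng):
    """Draw one null gene, 'cluster' the cells on it, then test on the same data."""
    # Null data: every cell comes from the same N(0, 1) distribution.
    expr = [rng.gauss(0.0, 1.0) for _ in range(n_cells)]
    # Toy clustering (an assumption for illustration): split at the median.
    cut = statistics.median(expr)
    low = [x for x in expr if x <= cut]
    high = [x for x in expr if x > cut]
    return welch_t(high, low)


def random_split_t(n_cells, rng):
    """Same test, but the two groups are fixed independently of the data."""
    expr = [rng.gauss(0.0, 1.0) for _ in range(n_cells)]
    half = n_cells // 2
    return welch_t(expr[:half], expr[half:])


rng = random.Random(0)
# |t| > 1.98 corresponds to roughly p < 0.05 (two-sided) at this sample size.
n_rep, n_cells, crit = 200, 100, 1.98
naive_rate = sum(abs(naive_t_after_clustering(n_cells, rng)) > crit
                 for _ in range(n_rep)) / n_rep
fair_rate = sum(abs(random_split_t(n_cells, rng)) > crit
                for _ in range(n_rep)) / n_rep
print("rejection rate, groups found by clustering:", naive_rate)
print("rejection rate, groups independent of data:", fair_rate)
```

Although no gene is truly DE here, the naive test rejects almost every time when the groups are derived from the same data, which is the bias that scPCI is designed to correct.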
- Python version 2.7 or 3.6
- If Python raises an "ImportError", please install the missing package.
If you want to reproduce the results of the real data analysis, you must run the preprocessing code, which is written in R. So you may also need the following environment.
- R version 3.5
- If some R packages are missing, please install them.
In our study, we re-analyzed two datasets, PBMC and FACS fat. Here, we explain the flow of our analysis.
First, run the following command.
python create_dir.py
Then preprocess the two datasets. Preprocessing basically follows the original papers (PBMC dataset, FACS fat). The R code for preprocessing is in the "Preprocessing" directory. Download the two datasets into that same directory, as described in the original papers.
Perform clustering on the preprocessed data.
python clustering_[dataset name].py
The dataset name is either "PBMC" or "FACSfat".
Run the scPCI method, choosing either scPCI-gene or scPCI-cluster. To run scPCI-gene, use the following command.
python Run_[dataset name].py
To run scPCI-cluster instead, use the following command.
python Run_[dataset name]_gn.py
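The Python steps up to this point can be collected into a small helper. This is a minimal sketch, not something shipped with the repository: the function name pipeline_commands and the "gene"/"cluster" labels are hypothetical, and it assumes the scripts are run with python from the repository root. The R preprocessing step is left out because its script names are not given here.

```python
def pipeline_commands(dataset, method):
    """Return the Python commands for one dataset, in the order described above.

    dataset: "PBMC" or "FACSfat"
    method:  "gene" for scPCI-gene, "cluster" for scPCI-cluster
    """
    if dataset not in ("PBMC", "FACSfat"):
        raise ValueError("dataset must be 'PBMC' or 'FACSfat'")
    if method not in ("gene", "cluster"):
        raise ValueError("method must be 'gene' or 'cluster'")
    # scPCI-cluster uses the script with the _gn suffix.
    suffix = "" if method == "gene" else "_gn"
    return [
        ["python", "create_dir.py"],
        # The R preprocessing must be run by hand before the next command.
        ["python", "clustering_{}.py".format(dataset)],
        ["python", "Run_{}{}.py".format(dataset, suffix)],
    ]


for cmd in pipeline_commands("PBMC", "gene"):
    print(" ".join(cmd))
```

Each returned command could then be passed to subprocess.check_call, stopping after create_dir.py to run the R preprocessing.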
In real data analysis, it is often impossible to calculate the p-values numerically.
So we provide code to approximate them by importance sampling.
If the results of the previous step (running scPCI) contain nan, run the following after that step.
python Recomp_pval.py
python Recomp_pval_gn.py
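To give a feel for why importance sampling helps with very small probabilities, here is a generic, self-contained sketch. It is not the code in Recomp_pval.py: the target (the upper tail of a standard normal), the shifted proposal N(t, 1), and the function names are all assumptions chosen for illustration.

```python
import math
import random


def tail_prob_naive(t, n, rng):
    """Plain Monte Carlo estimate of P(Z > t) for Z ~ N(0, 1)."""
    return sum(rng.gauss(0.0, 1.0) > t for _ in range(n)) / n


def tail_prob_importance(t, n, rng):
    """Importance-sampling estimate of P(Z > t), sampling from N(t, 1)."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(t, 1.0)  # proposal centred on the tail region
        if x > t:
            # Weight = N(0,1) density / N(t,1) density at x
            #        = exp(-t * x + t^2 / 2)
            total += math.exp(-t * x + 0.5 * t * t)
    return total / n


rng = random.Random(0)
t, n = 6.0, 10000
exact = 0.5 * math.erfc(t / math.sqrt(2.0))  # closed form, for comparison only
naive_estimate = tail_prob_naive(t, n, rng)
is_estimate = tail_prob_importance(t, n, rng)
print("exact     :", exact)
print("naive MC  :", naive_estimate)
print("importance:", is_estimate)
```

Plain sampling almost never reaches the tail, so its estimate carries no information, while sampling from a proposal concentrated on the tail and reweighting gives a usable approximation with the same number of draws.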
The outputs of the steps above are saved in a directory named "Result". To reproduce the figures in the paper, move the results of PCI-gene from "Result" to PCI_[dataset name]/each/, and similarly those of PCIgn-gene from "Result" to PCI_[dataset name]/global/.
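Moving the results can be scripted. The helper below is a hypothetical sketch (Python 3), not part of the repository: it assumes that every file currently in "Result" belongs to the run you just finished, and that the directories live under a common base directory.

```python
import os
import shutil


def move_results(dataset, scope, base_dir="."):
    """Move all files from Result/ to PCI_<dataset>/<scope>/ under base_dir.

    scope is "each" or "global", matching the directory names above.
    Returns the sorted list of file names that were moved.
    """
    if scope not in ("each", "global"):
        raise ValueError("scope must be 'each' or 'global'")
    src = os.path.join(base_dir, "Result")
    dest = os.path.join(base_dir, "PCI_{}".format(dataset), scope)
    os.makedirs(dest, exist_ok=True)  # create the destination if it is missing
    moved = []
    for name in sorted(os.listdir(src)):
        shutil.move(os.path.join(src, name), os.path.join(dest, name))
        moved.append(name)
    return moved


# Example (commented out so that nothing is moved by accident):
# move_results("PBMC", "each")
```

Because the helper moves everything in "Result", run it after each scPCI run and before starting the next one, so that results from different runs are not mixed.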
GNU General Public License