Workflow Qiu lab

Biological motivations

In our group, we develop computational methods for the analysis and visualization of single-cell data generated by flow cytometry, CyTOF, and sequencing. We only recently started to look at image-based single-cell data. The general biological problem is to understand the cellular heterogeneity underlying single-cell data, and to correlate that heterogeneity with overall phenotypic features such as disease progression and treatment mechanism.

Image analysis and feature extraction

We do not have experience in dealing with images. We have been relying on our collaborators to perform image segmentation and feature extraction. We work with the cell-by-feature data matrices derived by CellProfiler and ImageStream software.

Select features / reduce dimensionality

Features from image-based data can be highly correlated. We typically perform agglomerative hierarchical clustering to cluster features into groups containing highly correlated features. After that, an average feature is created from each group, so that highly correlated/redundant features are collapsed.
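The collapsing step can be sketched as follows. This is a minimal illustration, not our production code: the function name, the use of 1 - correlation as the distance, average linkage, and the 0.8 correlation cutoff are all assumptions chosen for the example, and constant (zero-variance) features are assumed to have been removed beforehand.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform


def collapse_correlated_features(X, corr_threshold=0.8):
    """Group correlated columns of X (cells x features) and average each group.

    Hypothetical helper: the distance (1 - Pearson correlation), the linkage
    method, and the threshold are illustrative choices.
    """
    # z-score each feature so that averaged features share a common scale
    Z = (X - X.mean(axis=0)) / X.std(axis=0)

    # Signed correlation distance: only positively correlated features merge
    dist = 1.0 - np.corrcoef(Z, rowvar=False)
    dist = (dist + dist.T) / 2.0       # remove floating-point asymmetry
    np.fill_diagonal(dist, 0.0)

    # Agglomerative clustering of the features, cut at the chosen distance
    tree = linkage(squareform(dist, checks=False), method="average")
    labels = fcluster(tree, t=1.0 - corr_threshold, criterion="distance")

    # One averaged feature per group of correlated features
    groups = [np.where(labels == g)[0] for g in np.unique(labels)]
    collapsed = np.column_stack([Z[:, idx].mean(axis=1) for idx in groups])
    return collapsed, groups
```

The returned `groups` list records which original feature indices were merged, so each averaged feature can be traced back to its members.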
Features can also be selected or ranked based on training labels of the images. For example, the distribution of a more informative feature should be relatively similar across images derived from replicates, and more different across images derived under different conditions. Such a univariate criterion can be used to select or rank features.
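One possible instantiation of such a criterion is a ratio of between-condition to within-condition variance, computed per feature on image-level summaries. The sketch below is an assumption made for illustration: the function name, the use of per-image summary values (for example, medians), and the variance ratio itself are not taken from a specific published procedure.

```python
import numpy as np


def rank_features(image_profiles, condition_labels):
    """Score each feature by between-condition / within-condition variance.

    image_profiles: images x features array of per-image summaries
    condition_labels: one condition label per image (replicates share a label)
    Hypothetical helper; the variance ratio is an illustrative criterion.
    """
    profiles = np.asarray(image_profiles, dtype=float)
    labels = np.asarray(condition_labels)
    conditions = np.unique(labels)

    # Mean profile of each condition, and spread among its replicate images
    cond_means = np.vstack([profiles[labels == c].mean(axis=0) for c in conditions])
    cond_vars = np.vstack([profiles[labels == c].var(axis=0) for c in conditions])

    between = cond_means.var(axis=0)   # large when conditions differ
    within = cond_vars.mean(axis=0)    # small when replicates agree
    scores = between / (within + 1e-12)

    order = np.argsort(scores)[::-1]   # feature indices, most informative first
    return scores, order
```

Features at the front of `order` vary little among replicates and strongly across conditions; a cutoff on `scores` or a fixed number of top features can then be used for selection.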

Workflow for analyzing cellular heterogeneity

We have developed the SPADE algorithm for uncovering the underlying cellular hierarchy of single-cell data generated by flow cytometry and CyTOF (Qiu et al. 2011). SPADE stands for Spanning-tree Progression Analysis of Density-normalized Events. The algorithm can also be applied to image-based single-cell data. Below is the SPADE workflow illustrated using a flow cytometry data set.

Input to SPADE is single-cell data matrices for multiple biological samples. Each sample/matrix can be viewed as a point cloud living in a high-dimensional space. The point clouds corresponding to two different samples can be very similar or very different from each other.

* SPADE first performs density-dependent downsampling and concatenates the samples. This process creates a "union" sample that corresponds to the union of all individual point clouds in the dataset.

* SPADE clusters the cells and constructs a minimum spanning tree to approximate the skeleton of the union cloud. This SPADE tree represents all cell types that exist in at least one sample. Different branches correspond to different cell types or cell activities. Here is an example of a SPADE tree derived from flow cytometry measurements of cell cycle markers in a drug response dataset, where different parts of the tree correspond to different stages of the cell cycle.

* The SPADE tree can be used to visualize an individual sample, showing which part of the tree is occupied by cells in this sample (in other words, which cell types exist in this sample and how its cells are distributed on the tree). This example dataset contains 14 controls and 14 samples treated with different drugs. From the figure below, we can observe that the control samples exhibit similar distributions, and that samples treated with drugs sharing the same mechanism also exhibit similar distributions.

* To quantify the distances among the distributions above, we use the Earth Mover's Distance (EMD). EMD can serve as a distance metric between probability distributions that takes into account additional structure, which here is the SPADE tree. Below is the 28-by-28 EMD matrix for the 28 distributions above, showing clear patterns for the controls and for drugs with similar mechanisms. We tried a few other distance measures (KL, L2, KS, etc.), but the resulting distance matrices were not as clean as the one from EMD.

* The EMD distance matrix can serve as the input (kernel matrix) for subsequent machine learning algorithms, for the purpose of clustering and classification of samples.
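The first two steps of the workflow above (density-dependent downsampling, then clustering and tree construction) can be sketched as follows. This is a simplified illustration and not the published SPADE implementation: the function names, the fixed-radius neighbor count used as a density estimate, the percentile-based target density, and the average-linkage clustering are assumptions made for the example, and the clustering step as written only scales to a few thousand pooled cells.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial import cKDTree
from scipy.spatial.distance import pdist, squareform


def density_downsample(X, radius, target_percentile=5, rng=None):
    """Keep each cell with probability min(1, target_density / local_density)."""
    rng = np.random.default_rng(rng)
    neighbors = cKDTree(X).query_ball_point(X, radius)
    # Local density = number of cells within `radius` (includes the cell itself)
    density = np.array([len(nb) for nb in neighbors])
    target = np.percentile(density, target_percentile)
    keep_prob = np.minimum(1.0, target / density)
    return X[rng.random(len(X)) < keep_prob]


def build_tree(samples, radius, n_clusters=50, rng=None):
    """Pool downsampled samples, cluster the cells, and connect clusters by an MST."""
    rng = np.random.default_rng(rng)
    # The "union" sample: downsample each sample, then concatenate
    pooled = np.vstack([density_downsample(X, radius, rng=rng) for X in samples])

    # Agglomerative clustering of the pooled cells into at most n_clusters groups
    labels = fcluster(linkage(pooled, method="average"),
                      t=n_clusters, criterion="maxclust")
    centroids = np.vstack([pooled[labels == k].mean(axis=0)
                           for k in np.unique(labels)])

    # Minimum spanning tree over the cluster centroids
    mst = minimum_spanning_tree(squareform(pdist(centroids)))
    edges = np.transpose(mst.nonzero())   # one (node_i, node_j) row per edge
    return centroids, edges
```

Downsampling before pooling keeps rare cell populations from being swamped by abundant ones, so that both contribute nodes to the tree.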

Note: this dataset is not published, but the workflow has been published using an AML flow cytometry dataset (Qiu 2012).

Per-sample profile of image-based data

We have applied the SPADE pipeline to single-cell data generated by CellProfiler. The data were published in (Ljosa 2013). Data preprocessing included normalization according to DMSO plates, and feature selection based on variances within and across images. Twenty-nine features were selected to build the SPADE tree. The per-well (or per-sample) profile is the cell distribution on the SPADE tree. Below are a few examples of cell distributions on the SPADE tree, where we observe that perturbations with the same mechanism lead to similar distributions.
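A per-well profile of this kind can be computed by assigning every cell to a tree node and normalizing the counts. The sketch below assumes that each node is represented by a centroid in the selected feature space and that cells are assigned to the nearest centroid; the function name and this assignment rule are illustrative assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree


def tree_profile(cells, centroids):
    """Fraction of a sample's cells falling on each tree node.

    cells: cells x features array for one well/sample
    centroids: nodes x features array, one row per tree node
    """
    # Index of the nearest node centroid for every cell
    _, nearest = cKDTree(centroids).query(cells)
    counts = np.bincount(nearest, minlength=len(centroids))
    return counts / counts.sum()
```

Each profile is a probability vector over the tree nodes, so wells with different cell counts can be compared directly.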

Similarity measure and downstream analysis

This dataset has a total of 103 compounds/perturbations to be classified into 13 mechanisms. The heatmap below shows the 103-by-103 EMD distance matrix, which correlates well with the mechanisms. Based on the EMD matrix, a simple nearest-neighbor classifier achieves 91% accuracy, comparable to that reported in (Ljosa 2013). We can also use SPADE to visualize the EMD matrix, showing which mechanisms are similar to each other and which are far apart.
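The two computations in this section can be sketched as follows. On a tree, EMD with path length as the ground distance has a closed form: for each edge, the mass that must cross it is the absolute difference in total mass on one side of the edge, and the EMD is the sum of edge length times that mass. The function names, the edge-list representation, and the leave-one-out evaluation are assumptions made for this example; they are not a statement of the exact protocol behind the accuracy quoted above.

```python
import numpy as np


def tree_emd(p, q, edges, weights):
    """EMD between distributions p and q over tree nodes (ground distance = tree path length)."""
    n = len(p)
    adj = [[] for _ in range(n)]
    for (u, v), w in zip(edges, weights):
        adj[u].append((v, w))
        adj[v].append((u, w))

    # Root the tree at node 0 and record a parent-before-child visiting order
    parent, parent_w = [-1] * n, [0.0] * n
    seen, order, stack = [False] * n, [], [0]
    seen[0] = True
    while stack:
        u = stack.pop()
        order.append(u)
        for v, w in adj[u]:
            if not seen[v]:
                seen[v], parent[v], parent_w[v] = True, u, w
                stack.append(v)

    # Push each subtree's surplus mass up to its parent, paying |surplus| * edge length
    excess = np.asarray(p, dtype=float) - np.asarray(q, dtype=float)
    cost = 0.0
    for u in reversed(order):          # children are processed before parents
        if parent[u] >= 0:
            cost += parent_w[u] * abs(excess[u])
            excess[parent[u]] += excess[u]
    return cost


def loo_nearest_neighbor_accuracy(D, labels):
    """Leave-one-out 1-nearest-neighbor accuracy from a square distance matrix."""
    D = np.array(D, dtype=float)       # copy, so the caller's matrix is untouched
    labels = np.asarray(labels)
    np.fill_diagonal(D, np.inf)        # a sample may not be its own neighbor
    return float(np.mean(labels[D.argmin(axis=1)] == labels))
```

Filling a matrix with `tree_emd` for every pair of per-sample profiles gives the distance matrix, which can then be passed to `loo_nearest_neighbor_accuracy` together with the mechanism labels.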