Workflow Rees lab

Instructions: Here is an example of the morphological profiling data analysis workflow followed by the Carpenter lab. Please write down the workflow that you use in your own group, adding new steps if needed. Also, please provide references where relevant. During the hackathon, each group will be allotted 8 minutes to present their workflow.

Biological motivations

Biological problems addressed by your group that use morphological profiling and types of perturbations that are profiled.

Analysis of the uptake of (fluorescent) nanoparticles by cells
Determination of cell cycle control mechanisms in fission yeast
Identification of cell phenotypes from ImageStream data using as few cell markers as possible
Development of a micronucleus assay on the ImageStream for genotoxicity assessment of compounds

A TYPICAL PIPELINE FOR IDENTIFICATION OF CELL CYCLE PHASES [1]

Image acquisition by imaging flow cytometry.

We typically use the ImageStream X platform to capture images of asynchronously growing Jurkat cells. For each cell, we capture brightfield and darkfield images as well as two fluorescent channels: propidium iodide (PI), which quantifies DNA content, and an anti-phospho-histone H3 (pH3) antibody, which identifies cells in mitosis.

Image quality control

After image acquisition, we use the IDEAS analysis tool (the software that accompanies the ImageStream X) to identify multiple cells and debris and omit them from further analysis.

We usually use an RMS gradient metric to discard out-of-focus cells, followed by a perimeter/area metric to discard cells that are 'clumped'.
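
The two gates above can be sketched as follows. This is only an illustration in plain Python, not the IDEAS implementation: the function names (rms_gradient, keep_cell), the finite-difference form of the gradient, and the threshold values are all assumptions, and in practice the gates are drawn interactively in IDEAS.

```python
import math

def rms_gradient(image):
    """Root-mean-square of finite-difference gradients of a 2D list of pixels.

    Simplified stand-in for a focus metric: sharper images give larger values.
    """
    rows, cols = len(image), len(image[0])
    total, count = 0.0, 0
    for r in range(rows - 1):
        for c in range(cols - 1):
            gx = image[r][c + 1] - image[r][c]  # horizontal difference
            gy = image[r + 1][c] - image[r][c]  # vertical difference
            total += gx * gx + gy * gy
            count += 1
    return math.sqrt(total / count)

def keep_cell(image, perimeter, area, min_focus, ratio_range):
    """Keep a cell if it is sharp enough and its perimeter/area ratio is in range.

    min_focus and ratio_range are hypothetical, user-chosen gate limits.
    """
    low, high = ratio_range
    in_focus = rms_gradient(image) >= min_focus
    single = low <= perimeter / area <= high
    return in_focus and single

# Toy example: a high-contrast patch passes, a featureless patch does not.
sharp = [[0, 10], [10, 0]]
flat = [[5, 5], [5, 5]]
kept_sharp = keep_cell(sharp, perimeter=40, area=100, min_focus=1.0, ratio_range=(0.2, 0.6))
kept_flat = keep_cell(flat, perimeter=40, area=100, min_focus=1.0, ratio_range=(0.2, 0.6))
```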

Image pre-processing.

The images from the ImageStream cytometer range in size between ~30x30 and 60x60 pixels. We reshape them to 55x55 pixels, either by adding pixels with random values sampled from the image background (for images that are too small) or by discarding pixels at the edge of the image (for images that are too large). We then tile the images into 15x15 montages, with up to 225 cells per montage.
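
A minimal sketch of the reshaping and tiling step, assuming images are held as 2D lists of pixel values. The function names (resize_to, tile_montage), the symmetric crop and centered padding, and the fill value for empty montage slots are assumptions made for illustration; the text above does not specify them.

```python
import random

def resize_to(image, size, background, rng):
    """Crop or pad a 2D list of pixels to size x size.

    Larger images lose edge pixels symmetrically; smaller images are
    surrounded by values drawn at random from the `background` samples.
    """
    def fit(length):
        # Returns (source offset, target offset, number of pixels copied).
        if length >= size:
            return (length - size) // 2, 0, size
        return 0, (size - length) // 2, length

    src_r, dst_r, n_r = fit(len(image))
    src_c, dst_c, n_c = fit(len(image[0]))
    out = [[rng.choice(background) for _ in range(size)] for _ in range(size)]
    for r in range(n_r):
        for c in range(n_c):
            out[dst_r + r][dst_c + c] = image[src_r + r][src_c + c]
    return out

def tile_montage(images, grid, size, fill=0):
    """Place up to grid*grid images of size x size row by row on one montage."""
    montage = [[fill] * (grid * size) for _ in range(grid * size)]
    for index, img in enumerate(images[: grid * grid]):
        top, left = (index // grid) * size, (index % grid) * size
        for r in range(size):
            montage[top + r][left:left + size] = img[r]
    return montage

# Toy example with a 4x4 target size and a 2x2 grid.
rng = random.Random(0)
small = [[1, 2], [3, 4]]
big = [[r * 6 + c for c in range(6)] for r in range(6)]
padded = resize_to(small, 4, background=[9], rng=rng)
cropped = resize_to(big, 4, background=[9], rng=rng)
montage = tile_montage([padded, cropped], grid=2, size=4)
```

For the workflow described here the corresponding calls would use size=55 and grid=15, with `background` holding pixel values sampled from the background of each image.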

Segmentation and feature extraction.

We load the image montages of 15x15 cells into the open-source image analysis software CellProfiler (version 2.1.1). The darkfield image shows light scattered from the cells within a cone centered at a 90° angle; hence it does not necessarily depict the cell’s physical shape, nor does it align with the brightfield image. Therefore we do not segment the darkfield image but instead use the full image for further analysis. In the brightfield image, there is sufficient contrast between the cells and the flow media to segment the cells robustly. We segment the cells in the brightfield image by enhancing the edges of the cells and thresholding on the pixel values. We then extract features, which we categorize into area and shape, Zernike polynomials, granularity, intensity, radial distribution, and texture. The CellProfiler pipeline to carry out all of these steps is provided. The measurements are exported in a text file, an example of which is provided. The measurements are post-processed using a Matlab script to discard cells with missing values.

Determination of ground truth.

To train the machine-learning algorithm we need a subset of cells where the cell’s true state is annotated, i.e. the ground truth is known. For the experiment shown in Figure 1, the cells were labeled with a PI and a pH3 stain. As the ground truth (expected results) for the cells’ DNA content, we extracted the integrated intensities of the nuclear PI stain with the imaging software CellProfiler. The mitotic cell cycle phases were identified with the IDEAS analysis tool by categorizing the pH3-positive cells into prophase, metaphase and anaphase using a limited set of user-formulated morphometric parameters on their PI stain images [2], followed by manual confirmation. The telophase cells were identified using a complex set of masks (using the IDEAS analysis tool) on the brightfield images to gate doublet cells [2]. We used those values as the ground truth to train the machine-learning algorithm and to evaluate the prediction of the nuclear stain intensity.
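
As a small illustration of the DNA-content ground truth, the integrated intensity of a stain is the sum of its pixel values inside the object mask. The sketch below assumes the PI image and the nuclear mask are 2D lists of equal shape; the function name and toy values are hypothetical, and in our workflow this measurement comes from CellProfiler rather than from custom code.

```python
def integrated_intensity(image, mask):
    """Sum the pixel values of `image` wherever `mask` is truthy."""
    return sum(
        pixel
        for image_row, mask_row in zip(image, mask)
        for pixel, inside in zip(image_row, mask_row)
        if inside
    )

# Toy example: only the two masked pixels contribute.
pi_image = [[1, 2], [3, 4]]
nuclear_mask = [[1, 0], [0, 1]]
dna_content = integrated_intensity(pi_image, nuclear_mask)
```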

Machine Learning.

For the prediction of the DNA content we use least-squares boosting (LSBoost) as implemented in Matlab’s fitensemble routine. For the assignment of the mitotic cell cycle phases we use random undersampling boosting (RUSBoost), also implemented in fitensemble. In both cases we partition the cells into a training and a testing set. The brightfield and darkfield features of the training set, together with the ground truth for these cells, are used to train the ensemble. Once the ensemble is trained, we evaluate its predictive power on the testing set. To demonstrate the generalizability of this approach and to obtain error bars for our results, the procedure is ten-fold cross-validated. To prevent overfitting, the stopping criterion of the training is determined via five-fold internal cross-validation. All of these steps are described in the tutorial.
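
To make the regression step more concrete, here is a minimal sketch of least-squares boosting together with a k-fold index split. It is not Matlab’s fitensemble: it uses one-split decision stumps in place of trees, has no internal cross-validation for the stopping criterion, and the names (fit_lsboost, predict, kfold_indices) as well as the number of rounds and the learning rate are assumptions chosen for illustration.

```python
def fit_stump(X, residuals):
    """Find the single-feature threshold split that minimizes squared error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, j, t, lm, rm)
    return best[1:] if best else None

def stump_predict(stump, row):
    j, t, left_mean, right_mean = stump
    return left_mean if row[j] <= t else right_mean

def fit_lsboost(X, y, rounds=100, learning_rate=0.1):
    """Start from the mean of y, then repeatedly fit a stump to the residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(X, residuals)
        if stump is None:  # no valid split (all feature values identical)
            break
        stumps.append(stump)
        pred = [p + learning_rate * stump_predict(stump, row) for p, row in zip(pred, X)]
    return base, learning_rate, stumps

def predict(model, row):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, row) for s in stumps)

def kfold_indices(n, k):
    """Yield (train, test) index lists for k interleaved folds of range(n)."""
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        yield train, test

# Toy example: one feature that separates two DNA-content levels.
X = [[float(x)] for x in range(10)]
y = [2.0] * 5 + [4.0] * 5
model = fit_lsboost(X, y)
folds = list(kfold_indices(len(X), 5))
```

In a cross-validated run, the model would be fit on the `train` indices of each fold and evaluated on the matching `test` indices, which gives one error estimate per fold.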

Additionally, we analyzed which features contribute most to the prediction. We find that leaving out any single feature has only a minor effect on the results of the supervised machine learning algorithms we used, likely because many features are highly correlated with one another. The most important features are the intensity, area and shape, and radial distribution features of the brightfield images.
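
One simple way to check how redundant the features are is to list feature pairs with a high Pearson correlation. The sketch below is an assumption about how such a check could look, not the analysis we ran: the function names, the 0.9 cut-off, and the toy feature values are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # constant feature: correlation is undefined, treated as 0 here
    return cov / math.sqrt(var_a * var_b)

def correlated_pairs(features, threshold=0.9):
    """Return (name_a, name_b, r) for feature pairs with |r| >= threshold.

    `features` maps a feature name to its list of per-cell values.
    """
    names = sorted(features)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(features[a], features[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs

# Toy example: area and perimeter move together, texture does not.
features = {
    "area": [1.0, 2.0, 3.0, 4.0],
    "perimeter": [2.0, 4.0, 6.0, 8.0],
    "texture": [3.0, 1.0, 4.0, 1.0],
}
pairs = correlated_pairs(features)
```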

Image analysis and feature extraction

Use image analysis software to extract features from images. This results in a data matrix where the rows correspond to cells in the experiment and the columns are the extracted image features.

(How we do it)

Image quality control

Flag/remove images that are affected by technical artifacts or segmentation errors.

(How we do it)

Data cleaning

Filter out or impute missing values in the data matrix.
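
Both options can be sketched briefly, assuming the data matrix is a list of rows with missing entries encoded as None or NaN. The function names and the choice of the per-feature median for imputation are assumptions made for illustration.

```python
import math
from statistics import median

def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def drop_incomplete(rows):
    """Filter: keep only cells (rows) with no missing feature value."""
    return [row for row in rows if not any(is_missing(v) for v in row)]

def impute_median(rows):
    """Impute: replace each missing value by the median of its feature column."""
    medians = []
    for j in range(len(rows[0])):
        observed = [row[j] for row in rows if not is_missing(row[j])]
        # A column with no observed values cannot be imputed and stays NaN.
        medians.append(median(observed) if observed else float("nan"))
    return [[medians[j] if is_missing(v) else v for j, v in enumerate(row)] for row in rows]

# Toy data matrix: three cells, two features.
rows = [[1.0, 2.0], [float("nan"), 4.0], [3.0, None]]
filtered = drop_incomplete(rows)
imputed = impute_median(rows)
```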

(How we do it)

Normalize features

Normalize cell features with respect to a reference distribution (e.g. by z-scoring against all DMSO cells on the plate).
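
A minimal sketch of z-scoring against control cells, assuming one row per cell and a parallel list of flags marking the DMSO cells on the same plate. The function name, the use of the sample standard deviation, and the handling of zero-variance features are assumptions.

```python
from statistics import mean, stdev

def zscore_against_controls(rows, is_control):
    """Z-score every cell using the mean and standard deviation of control cells."""
    controls = [row for row, flag in zip(rows, is_control) if flag]
    n_features = len(rows[0])
    centers = [mean([row[j] for row in controls]) for j in range(n_features)]
    scales = [stdev([row[j] for row in controls]) for j in range(n_features)]
    return [
        [
            # A feature that is constant across controls has no scale; it is set to 0.0 here.
            (value - centers[j]) / scales[j] if scales[j] > 0 else 0.0
            for j, value in enumerate(row)
        ]
        for row in rows
    ]

# Toy plate: two DMSO cells and one treated cell, one feature.
rows = [[1.0], [3.0], [5.0]]
is_control = [True, True, False]
normalized = zscore_against_controls(rows, is_control)
```

The normalization would be applied plate by plate, so that each plate's cells are scaled against that plate's own controls.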

(How we do it)

Create per-well profiles

Aggregate single-cell data from each well to create a per-well morphological profile. This is typically done by computing the median across all cells in the well, per feature. Other approaches include methods to first identify sub-populations, then construct a profile by counting the number of cells in each sub-population.
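
The median-based aggregation can be sketched as follows, assuming each cell is given as a (well identifier, feature vector) pair. The function name and the toy well identifiers are hypothetical; the sub-population approach is not shown.

```python
from collections import defaultdict
from statistics import median

def per_well_profiles(cells):
    """Return {well_id: per-feature median over all cells in that well}."""
    by_well = defaultdict(list)
    for well, features in cells:
        by_well[well].append(features)
    return {
        well: [median(column) for column in zip(*rows)]
        for well, rows in by_well.items()
    }

# Toy example: three cells in well A01, one cell in well B02.
cells = [
    ("A01", [1.0, 10.0]),
    ("A01", [3.0, 30.0]),
    ("A01", [2.0, 50.0]),
    ("B02", [4.0, 4.0]),
]
profiles = per_well_profiles(cells)
```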