
Workflow Jaensch lab

SteffenJaensch edited this page May 11, 2016 · 18 revisions

Instructions: Here is an example of the morphological profiling data analysis workflow followed by the Carpenter lab. Please write down the workflow that you use in your own group, adding new steps if needed. Also, please provide references where relevant. During the hackathon, each group will be allotted 8 minutes to present their workflow.

Biological motivations

Biological problems addressed by your group that use morphological profiling and types of perturbations that are profiled.

We focus on profiling of small molecules to predict target(s)/mechanism-of-action, identify off-target/toxicity effects, identify compounds with similar phenotypic effect but different chemical structure (scaffold-hopping), enrich hit sets from HTS campaigns, and characterize compound library composition (how different is this library from what we already have?).

Types of perturbations: small molecules, some projects +/- infectious pathogens; to a much lesser extent RNAi; CRISPR experiments are planned. Mostly in human cell lines, with some experiments in primary patient cells.

Image analysis and feature extraction

Use image analysis software to extract features from images. This results in a data matrix where the rows correspond to cells in the experiment and the columns are the extracted image features.

Acapella/Columbus ==> Well-level means and medians (NA values removed), single-cell data and montages of raw images and segmentation outlines

CellProfiler

For most assays we rely on the instrument-based illumination correction (Opera Phenix: data-based; CV7000: reference-plate based)

Image quality control

Flag/remove images that are affected by technical artifacts or segmentation errors.

We upload all data (well-level data, single-cell data and montages of raw images and segmentation outlines) to "Phaedra", an open-source high-content imaging analysis tool.

http://www.phaedra.io/

In Phaedra, heat maps, scatter plots and image views are linked with each other and help to efficiently identify and reject wells with technical artifacts by manual inspection. Wells can be sorted across plates by (QC-)features to narrow down the set of wells that are likely to have artifacts. Training sets for supervised machine learning can be assembled interactively by placing wells or single cells in 'silos' and Knime workflows can be executed directly in Phaedra.

Data cleaning

Filter out or impute missing values in the data matrix.

Remove features with a large number of NAs (typical case: large-scale texture features for the nucleus region). Remove wells (i.e., rows) with NAs (typical case: a well with a very small number of cells, all lacking a cytoplasm).
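The two filtering passes above can be sketched as follows. This is a minimal illustration, not the lab's actual script; the representation of the data matrix as a list of per-well dicts and the 10% NA threshold are assumptions chosen for the example.

```python
import math

def clean_matrix(rows, feature_names, max_feature_na_frac=0.1):
    """Drop features with many NAs, then drop wells that still contain NAs.

    rows: list of dicts, one per well (feature name -> value).
    max_feature_na_frac: illustrative threshold, not a documented lab value.
    """
    n = len(rows)
    # Pass 1: keep features whose NA fraction is at or below the threshold.
    kept = [f for f in feature_names
            if sum(math.isnan(r[f]) for r in rows) / n <= max_feature_na_frac]
    # Pass 2: drop wells (rows) that still have an NA in any kept feature.
    cleaned = [{f: r[f] for f in kept} for r in rows
               if not any(math.isnan(r[f]) for f in kept)]
    return cleaned, kept
```

Filtering features first is deliberate: a well is only discarded if it has NAs in features that survived the first pass.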

Normalize features

Normalize cell features with respect to a reference distribution (e.g. by z-scoring against all DMSO cells on the plate).

z-score against DMSO controls: subtract the per-plate mean of the DMSO controls, then divide by the standard deviation of all DMSO controls within a batch of plates. We found this approach to yield higher correlation between replicates than taking the standard deviation plate by plate.
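The per-plate centering with a batch-pooled standard deviation can be sketched as below. This is a simplified one-feature version under assumed inputs (parallel lists of values, plate IDs, and DMSO flags); the real pipeline operates on the full feature matrix.

```python
from statistics import mean, pstdev

def zscore_vs_dmso(values, plates, is_dmso):
    """Z-score one feature: subtract the per-plate DMSO mean, then divide
    by the DMSO standard deviation pooled across the whole plate batch."""
    # Per-plate mean of the DMSO control wells.
    plate_means = {
        p: mean(v for v, q, d in zip(values, plates, is_dmso) if q == p and d)
        for p in set(plates)
    }
    centered = [v - plate_means[p] for v, p in zip(values, plates)]
    # One pooled standard deviation over all DMSO wells in the batch.
    batch_sd = pstdev(c for c, d in zip(centered, is_dmso) if d)
    return [c / batch_sd for c in centered]
```

Pooling the standard deviation across the batch rather than per plate is exactly the choice the text reports as giving higher replicate correlation.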

Transform features

Transform features as appropriate, e.g. log transform.

We compute the log-transform of each intensity feature (at the single-cell level) as additional features.
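A minimal sketch of adding log-transformed intensity features alongside the originals; the `eps` offset guarding against log(0) and the `log_` naming convention are assumptions for illustration.

```python
import math

def add_log_intensity(cell, intensity_features, eps=1.0):
    """Return a copy of a single-cell record with log-transformed
    versions of each intensity feature appended as new columns.

    eps: offset to avoid log(0); an illustrative choice, not a lab value."""
    out = dict(cell)
    for f in intensity_features:
        out["log_" + f] = math.log(cell[f] + eps)
    return out
```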

Select features / reduce dimensionality

Select the features that are most informative, based on some appropriate criterion, or perform dimensionality reduction.

We use a modified version of mRMR [1] feature selection, using treatment IDs as class labels. To decide on the number of required features, we compute AUC values for "replicates vs. non-replicates" using Pearson correlation as the similarity measure. The smallest number of features whose AUC is within 1 standard error of the maximum AUC achieved is chosen as the 'optimal' number of features.
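The one-standard-error rule for choosing the feature count can be sketched as below. The function name and the parallel-list inputs (candidate counts with their replicate-vs-non-replicate AUCs and standard errors) are assumptions; computing the AUCs themselves and the mRMR ranking are out of scope here.

```python
def pick_feature_count(counts, aucs, ses):
    """One-standard-error rule: pick the smallest candidate feature count
    whose replicate-vs-non-replicate AUC is within 1 SE of the best AUC.

    counts, aucs, ses: parallel lists, one entry per candidate count."""
    best = max(range(len(aucs)), key=lambda i: aucs[i])
    threshold = aucs[best] - ses[best]
    # Scan candidates from smallest to largest feature count.
    for k, a in sorted(zip(counts, aucs)):
        if a >= threshold:
            return k
```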

In/active calling

Classify each replicate (i.e., well) as active or inactive using a threshold on the Euclidean distance to the mean DMSO profile. The threshold is the 95th percentile of the null distribution of "DMSO replicate to DMSO center" distances.

For each treatment, count the percentage of active replicates. Treatments with ≥ 50% active replicates are considered active.
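The two-step in/active call can be sketched as follows. Profiles are represented as plain lists of feature values, and the simple index-based 95th-percentile estimate is an assumption (any standard quantile estimator would do); in practice the null distribution should contain many DMSO wells.

```python
from math import dist  # Euclidean distance, Python 3.8+

def call_actives(treatment_profiles, dmso_profiles, active_frac=0.5):
    """Flag a replicate as active if its Euclidean distance to the mean
    DMSO profile exceeds the 95th percentile of DMSO-to-center distances;
    a treatment is active if >= active_frac of its replicates are active.

    treatment_profiles: dict mapping treatment ID -> list of replicate
    profiles; dmso_profiles: list of DMSO well profiles."""
    # Mean DMSO profile (the "DMSO center").
    center = [sum(col) / len(dmso_profiles) for col in zip(*dmso_profiles)]
    # Null distribution: DMSO replicate to DMSO center distances.
    null = sorted(dist(p, center) for p in dmso_profiles)
    thr = null[int(0.95 * (len(null) - 1))]  # crude 95th-percentile estimate
    calls = {}
    for trt, reps in treatment_profiles.items():
        n_active = sum(dist(r, center) > thr for r in reps)
        calls[trt] = n_active / len(reps) >= active_frac
    return calls
```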

Create per-well profile

Aggregate single-cell data from each well to create a per-well morphological profile. This is typically done by computing the median across all cells in the well, per feature. Other approaches include methods to first identify sub-populations, then construct a profile by counting the number of cells in each sub-population.

Mean and median of each feature (already computed, in the case of the Acapella script). We then create a per-treatment profile by taking the median (and MAD) over replicate wells.
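The two aggregation levels (cells to well, wells to treatment) can be sketched as below; the per-feature dict representation is an assumption, and the MAD computation mentioned above is omitted for brevity.

```python
from statistics import median

def well_profile(cells):
    """Per-well profile: median over all cells in the well, per feature.

    cells: list of dicts, one per cell (feature name -> value)."""
    features = cells[0].keys()
    return {f: median(c[f] for c in cells) for f in features}

def treatment_profile(well_profiles):
    """Per-treatment profile: median over replicate wells, per feature.
    A MAD could be computed analogously for a dispersion profile."""
    features = well_profiles[0].keys()
    return {f: median(w[f] for w in well_profiles) for f in features}
```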

Measure similarity between profiles

An appropriate similarity metric is crucial to the downstream analysis. Pearson correlation and Euclidean distance are the most common metrics used.

Pearson correlation between profiles seems to be the best choice based on how well replicates are separated from non-replicates. Spearman correlation and cosine distance give very similar results to Pearson correlation, whereas Euclidean and Mahalanobis distance perform worse at separating different treatments from each other.
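For reference, the Pearson correlation between two profiles (vectors of feature values) is just the covariance normalized by both standard deviations; a minimal stdlib implementation:

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)
```

In practice a vectorized routine (e.g. a correlation matrix over all profiles at once) would be used instead.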

Downstream analysis / visualization

Analysis/visualization performed after creating profiles. E.g. clustering, classification, visualization using 2D embeddings, etc.


References

  • [1] Ding, C. and Peng, H. Minimum redundancy feature selection from microarray gene expression data. J. Bioinform. Comput. Biol., 2005