Workflow Jaensch lab
Instructions: Here is an example of the morphological profiling data analysis workflow followed by the Carpenter lab. Please write down the workflow that you use in your own group, adding new steps if needed. Also, please provide references where relevant. During the hackathon, each group will be allotted 8 minutes to present their workflow.
Biological problems addressed by your group that use morphological profiling and types of perturbations that are profiled.
We focus on profiling small molecules to predict target(s)/mechanism of action, identify off-target/toxicity effects, identify compounds with similar phenotypic effects but different chemical structures (scaffold hopping), enrich hit sets from HTS campaigns, and characterize compound library composition (how different is this library from what we already have?).
Types of perturbations: small molecules, some projects +/- infectious pathogens; to a much lesser extent RNAi; CRISPR experiments are planned. Mostly in human cell lines, with some experiments in primary patient cells.
Use image analysis software to extract features from images. This results in a data matrix where the rows correspond to cells in the experiment and the columns are the extracted image features.
Acapella/Columbus ==> Well-level means and medians (NA values removed), single-cell data and montages of raw images and segmentation outlines
CellProfiler
For most assays, we rely on the instrument-based illumination correction (Opera Phenix: data-based; CV7000: reference-plate-based).
Flag/remove images that are affected by technical artifacts or segmentation errors.
We upload all data (well-level data, single-cell data and montages of raw images and segmentation outlines) to "Phaedra", an open-source high-content imaging analysis tool.
In Phaedra, heat maps, scatter plots and image views are linked with each other and help to efficiently identify and reject wells with technical artifacts by manual inspection. Wells can be sorted across plates by (QC) features to narrow down the set of wells that are likely to have artifacts. Training sets for supervised machine learning can be assembled interactively by placing wells or single cells in 'silos', and KNIME workflows can be executed directly in Phaedra.
Filter out or impute missing values in the data matrix.
Remove features with a large number of NAs (typical case: large-scale texture features for the nucleus region). Then remove wells (i.e., rows) that still contain NAs (typical case: a well with a very small number of cells, all with no cytoplasm).
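The two filtering steps above can be sketched as follows. This is a minimal illustration with NumPy, not the group's actual code; the function name `drop_missing` and the cutoff `max_na_fraction` are assumptions, since the workflow does not state what counts as a "large number" of NAs.

```python
import numpy as np

def drop_missing(X, max_na_fraction=0.05):
    """Drop NA-heavy features, then drop wells that still contain NAs.

    X: 2D float array, rows = wells, columns = features, NaN = missing.
    max_na_fraction: assumed cutoff (hypothetical, not from the workflow).
    Returns the filtered matrix and boolean masks of kept rows and columns.
    """
    X = np.asarray(X, dtype=float)
    na = np.isnan(X)
    # Step 1: keep only features whose NA fraction is at or below the cutoff
    keep_cols = na.mean(axis=0) <= max_na_fraction
    # Step 2: among the kept features, keep only wells without any NA
    keep_rows = ~na[:, keep_cols].any(axis=1)
    return X[np.ix_(keep_rows, keep_cols)], keep_rows, keep_cols
```

Filtering columns before rows matters here: a single NA-heavy feature would otherwise cause most wells to be discarded.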
Normalize cell features with respect to a reference distribution (e.g. by z-scoring against all DMSO cells on the plate).
We z-score against DMSO controls: for each plate, we subtract the mean of that plate's DMSO controls, then divide by the standard deviation of all DMSO controls within the batch of plates. We found that this approach results in higher correlation between replicates than computing the standard deviation plate by plate.
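A minimal sketch of this normalization, assuming the data are a wells-by-features NumPy array with a plate label and a DMSO flag per row. The function name, the use of the sample standard deviation (`ddof=1`), and the choice to compute the batch-level standard deviation on the uncentered DMSO values are assumptions; the description above does not pin these details down.

```python
import numpy as np

def zscore_vs_dmso(X, plate, is_dmso):
    """Center per plate on the DMSO mean, scale by the batch-wide DMSO std.

    X: 2D float array (rows = wells or cells, columns = features).
    plate: 1D array of plate labels, one per row.
    is_dmso: 1D boolean array marking DMSO control rows.
    """
    X = np.asarray(X, dtype=float)
    plate = np.asarray(plate)
    is_dmso = np.asarray(is_dmso, dtype=bool)

    centered = np.empty_like(X)
    for p in np.unique(plate):
        on_plate = plate == p
        # Subtract the mean of this plate's DMSO controls
        centered[on_plate] = X[on_plate] - X[on_plate & is_dmso].mean(axis=0)

    # One standard deviation per feature, from all DMSO rows in the batch
    # (assumption: computed on the raw, uncentered DMSO values)
    sd = X[is_dmso].std(axis=0, ddof=1)
    return centered / sd
```

The sketch does not guard against features with zero DMSO variance, which would cause a division by zero; such features would need to be removed beforehand.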
Transform features as appropriate, e.g. log transform.
We compute the log-transform of each intensity feature (on single-cell level) as additional features.
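As an illustration, log-transformed copies of the intensity features can be appended to the feature matrix as shown below. This is a sketch under assumptions: intensity features are recognized by the substring "Intensity" in their name, and an offset of 1 is added before taking the logarithm so that zero intensities stay finite. Neither convention is stated in the workflow.

```python
import numpy as np

def add_log_intensity(X, names, marker="Intensity", offset=1.0):
    """Append log-transformed copies of all intensity features.

    X: 2D float array (rows = single cells, columns = features).
    names: list of feature names, one per column.
    marker, offset: hypothetical conventions chosen for this sketch.
    Returns the widened matrix and the extended list of feature names.
    """
    X = np.asarray(X, dtype=float)
    idx = [i for i, name in enumerate(names) if marker in name]
    logged = np.log(X[:, idx] + offset)
    new_names = list(names) + ["log_" + names[i] for i in idx]
    # Original features are kept; the log versions are additional columns
    return np.hstack([X, logged]), new_names
```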
Select features that are most informative, based on some appropriate criterion, or, perform dimensionality reduction
We use a modified version of mRMR [1] feature selection, using treatment ID as the class label. To decide on the number of required features, we compute AUC values for "replicates vs. non-replicates" using Pearson correlation as the similarity measure. The smallest number of features that results in an AUC value within 1 standard error of the maximum AUC achieved is chosen as the 'optimal' number of features.
Classify each replicate (i.e., well) as active or inactive using a threshold on the Euclidean distance to the mean DMSO profile. The threshold is the 95th percentile of the null distribution of distances from each DMSO replicate to the DMSO center.
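The selection of the number of features can be sketched as follows. The code covers only the evaluation part (replicate vs. non-replicate AUC and the one-standard-error rule), not the modified mRMR ranking itself. The function names are hypothetical, and because the text does not say how the standard error of the AUC is estimated, it is taken as an input here.

```python
import numpy as np

def replicate_auc(profiles, treatment):
    """AUC for separating replicate pairs from non-replicate pairs.

    profiles: 2D array (rows = wells, columns = selected features).
    treatment: 1D array of treatment IDs, one per well.
    Pairs of wells are scored by Pearson correlation of their profiles.
    """
    profiles = np.asarray(profiles, dtype=float)
    treatment = np.asarray(treatment)
    corr = np.corrcoef(profiles)              # well-by-well correlations
    iu = np.triu_indices(len(treatment), k=1)  # each pair counted once
    scores = corr[iu]
    same = (treatment[:, None] == treatment[None, :])[iu]
    pos, neg = scores[same], scores[~same]
    # Mann-Whitney form of the AUC; ties count one half.
    # Builds a pos-by-neg matrix, so only suitable for small examples.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def smallest_k_within_one_se(n_features, auc, se):
    """Smallest feature count whose AUC is within 1 SE of the maximum AUC."""
    n_features = np.asarray(n_features)
    auc = np.asarray(auc, dtype=float)
    se = np.asarray(se, dtype=float)
    best = np.argmax(auc)
    ok = auc >= auc[best] - se[best]
    return int(n_features[ok].min())
```

In use, `replicate_auc` would be evaluated once per candidate feature count (for example on the top 5, 10, 20, ... ranked features), and the resulting AUC values passed to `smallest_k_within_one_se`.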
Count for each treatment the percentage of active replicates. Treatments with ≥ 50% active replicates are considered active.
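The activity call described above can be sketched as follows, assuming a wells-by-features array of normalized profiles with a treatment label and DMSO flag per well. The function name and the strict `>` comparison against the threshold are assumptions made for this illustration.

```python
import numpy as np

def call_active(profiles, treatment, is_dmso, percentile=95, min_fraction=0.5):
    """Call treatments active based on distance to the mean DMSO profile.

    profiles: 2D array (rows = wells, columns = features).
    treatment: 1D array of treatment IDs, one per well.
    is_dmso: 1D boolean array marking DMSO wells.
    Returns a dict {treatment: bool} and the distance threshold used.
    """
    profiles = np.asarray(profiles, dtype=float)
    treatment = np.asarray(treatment)
    is_dmso = np.asarray(is_dmso, dtype=bool)

    center = profiles[is_dmso].mean(axis=0)
    dist = np.linalg.norm(profiles - center, axis=1)
    # Null distribution: distances of DMSO wells to the DMSO center
    threshold = np.percentile(dist[is_dmso], percentile)
    well_active = dist > threshold

    result = {}
    for t in np.unique(treatment[~is_dmso]):
        # Active if at least min_fraction of the replicate wells are active
        fraction = well_active[treatment == t].mean()
        result[str(t)] = bool(fraction >= min_fraction)
    return result, threshold
```

By construction, roughly 5% of the DMSO wells themselves exceed a 95th-percentile threshold, so requiring at least half of the replicates to be active helps guard against single-well false positives.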
Aggregate single-cell data from each well to create a per-well morphological profile. This is typically done by computing the median across all cells in the well, per feature. Other approaches include methods to first identify sub-populations, then construct a profile by counting the number of cells in each sub-population.
We use the mean and median of each feature per well (already computed in the case of the Acapella script). We then create a per-treatment profile by taking the median (and MAD) over replicate wells.
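The two aggregation levels (cells to wells, wells to treatments) can be sketched with one helper that is applied twice. This is an illustration only: the function names are hypothetical, and the MAD is computed here as the unscaled median absolute deviation from the median, which is an assumption since the text does not say whether a scaling constant is applied.

```python
import numpy as np

def median_by_group(X, groups):
    """Per-group, per-feature median. Returns sorted labels and medians."""
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)
    med = np.vstack([np.median(X[groups == g], axis=0) for g in labels])
    return labels, med

def median_and_mad_by_group(X, groups):
    """Per-group median and unscaled MAD (median of |x - median|)."""
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    labels, med = median_by_group(X, groups)
    mad = np.vstack([
        np.median(np.abs(X[groups == g] - m), axis=0)
        for g, m in zip(labels, med)
    ])
    return labels, med, mad
```

Calling `median_by_group(cells, well_ids)` yields per-well profiles, and calling `median_and_mad_by_group(well_profiles, treatment_ids)` on that result yields per-treatment profiles together with a measure of replicate spread.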
An appropriate similarity metric is crucial to the downstream analysis. Pearson correlation and Euclidean distance are the most common metrics used.
Pearson correlation between profiles seems to be the best choice, based on how well replicates are separated from non-replicates. Spearman correlation and cosine distance give very similar results to Pearson correlation, whereas Euclidean and Mahalanobis distance perform worse at separating different treatments from each other.
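For reference, the measures compared above can be written in a few lines of NumPy. This is a minimal sketch for two profile vectors; the function names are illustrative, the Spearman version assumes no tied values, and Mahalanobis distance is omitted because it additionally requires a covariance estimate.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation between two profiles."""
    return float(np.corrcoef(a, b)[0, 1])

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks.

    This simple double-argsort ranking assumes there are no tied values.
    """
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return pearson(rank_a, rank_b)

def cosine_similarity(a, b):
    """Cosine of the angle between two profiles."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean(a, b):
    """Euclidean distance between two profiles."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.linalg.norm(diff))
```

One difference between the two families is easy to see on a toy case: for a profile and a copy of it scaled by a factor of 3, the three correlation-type measures all return 1, while the Euclidean distance is large. Correlation-type measures compare the shape of a profile and ignore its overall magnitude.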
Analysis/visualization performed after creating profiles. E.g. clustering, classification, visualization using 2D embeddings, etc.
- [1] Ding, C. and Peng, H. Minimum redundancy feature selection from microarray gene expression data. J. Bioinform. Comput. Biol., 2005