This repo describes techniques for immune cell phenotyping of non-small cell lung cancer using PCA, tSNE, and other dimensional reduction methods in R and FlowJo.
Modern single cell analysis techniques (flow cytometry, mass cytometry, single cell RNA sequencing, etc.) capture massive amounts of high dimensional data. For example, a comprehensive flow cytometry panel can stain cells with dozens of markers and identify hundreds of distinct cell types, and the raw data can occupy 1-25 GB. Interpreting this high dimensional data can be challenging. Dimensional reduction techniques can be applied either to raw flow cytometry data (to identify cell populations without prior gating) or to populations identified through traditional gating approaches (to identify population changes between groups). These techniques simplify complex high dimensional data and can reveal novel cell populations, such as interferon gamma producing cells among the immune cells isolated from non-small cell lung cancer (NSCLC) tumors:
tSNE plot of CD45+ immune cells derived from NSCLC tumor and non-tumor adjacent lung tissue; the z-axis and color indicate the degree of IFN-gamma production
Principal Component Analysis (PCA) is a linear dimensional reduction algorithm; computed by singular value decomposition, its cost scales as O(min(n^2*p, n*p^2)) for n cells and p markers. PCA is perhaps the most widely used dimensional reduction technique (it was introduced by Pearson in 1901 and developed further by Hotelling in 1933) and has many implementations in R and other programming languages. PCA preserves the global structure of the data but not the local structure: dissimilar points are placed far apart, but when high dimensional data is reduced to a low dimensional space, similar points are not necessarily placed close together.
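As a minimal sketch of what PCA computes, the base R example below runs `prcomp` on simulated data and checks that the principal axes match the right singular vectors of the scaled data. The matrix `events` and its marker names are made up for illustration and are not the NSCLC dataset.

```r
# Simulated stand-in for cytometry data: 200 events (rows) x 5 markers (columns).
# All values and marker names are hypothetical.
set.seed(1)
n <- 200
shared <- rnorm(n)  # a shared source of variation across three markers
events <- cbind(cd3 = shared + rnorm(n, sd = 0.3),
                cd4 = shared + rnorm(n, sd = 0.3),
                cd8 = -shared + rnorm(n, sd = 0.3),
                nk  = rnorm(n),
                b   = rnorm(n))

# prcomp centers the data and (with scale. = TRUE) scales each marker to unit variance
pr_sim <- prcomp(events, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each principal component
var_explained <- pr_sim$sdev^2 / sum(pr_sim$sdev^2)

# The same axes come from the SVD of the scaled data (compared up to sign)
sv <- svd(scale(events))
same_axes <- all(abs(abs(sv$v) - abs(pr_sim$rotation)) < 1e-8)

print(round(var_explained, 3))
```

Because three of the simulated markers share one source of variation, most of the variance should load onto the first component, which is the behavior PCA relies on when summarizing correlated markers.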
In this example of immune cell phenotyping of NSCLC and lung tissue, I examined a dataset of samples obtained from individuals undergoing surgical resection of NSCLC. These samples were processed fresh into single cell suspensions and stained with a panel of >30 antibodies against surface and intracellular antigens, and >50,000 events were acquired by flow cytometry.
Overlapping immune cell phenotypes of tumor and lung samples
Immune cell populations in tumor and lung samples with eigenvectors shown
We can also look at other clinical parameters, such as whether the sample came from a patient with COPD, based on either the degree of airflow obstruction (GOLD stage) or the degree of emphysema as measured radiologically (Goddard score). Although the degree of emphysema and the degree of airflow obstruction are correlated, there are phenotypic and immunologic differences, as seen below:
For example:
```r
library(dplyr)      # mutate(), select(), arrange()
library(ggplot2)    # ggtitle()
library(ggfortify)  # autoplot() method for prcomp objects

# define the presence of COPD
copd <- mutate(copd, new_gold = ifelse(gold_stage < 1, "No COPD present", "COPD present"))
copd <- mutate(copd, new_goddard = ifelse(goddard_score < 0.5, "No emphysema present", "Emphysema present"))
copd <- mutate(copd, new_copd_any_definition = ifelse(copd_anydefinition < 1, "No COPD", "COPD"))
copd <- na.omit(copd)

# select the correct cell types for inclusion
copd <- select(copd, gold_stage, goddard_score, new_gold, new_goddard, colNames, sample, copd_anydefinition,
               cd45, cd3, cd4, cd8, nkt, pmn, nk, b, #gdt, nk, b, mac, pmn, # basic cell types
               cd8ifng, th1, th17, treg, gdtil17, gdtifng # cytokine profiles
               #cd4_pd1, cd8_pd1, # pd-1 expression
               #pmnpdl1, macpdl1, monopdl1, nocd45pdl1, # pd-l1 expression
               #cd4_tim3, cd8_tim3, # tim3 expression
               #cd4pd1tim3, cd8pd1tim3, # dual checkpoint expression
)

# PCA needs numeric input: drop the clinical and label columns
# (assumed definition of minus_gold)
minus_gold <- select(copd, -gold_stage, -goddard_score, -new_gold, -new_goddard,
                     -colNames, -sample, -copd_anydefinition)
pr <- prcomp(minus_gold)

# loadings (eigenvectors) of each variable on the first two components
pc_comps <- data.frame(pr$rotation)
pc1_vars <- select(pc_comps, PC1)
pc2_vars <- select(pc_comps, PC2)
arrange(pc1_vars, PC1)

# plot the first two principal components, colored by COPD status
autoplot(pr, data = copd,
         colour = "new_gold", frame = TRUE, frame.type = "norm"
         #, loadings = TRUE, loadings.label = TRUE, loadings.colour = "black" # show eigenvectors
) +
  ggtitle(label = "COPD vs Non-COPD PCA")
```
This shows us the effect of COPD being present in the resected non-adjacent lung on the immune cell phenotype in the resected tumor. In this case we define COPD as the presence of airflow obstruction based on GOLD stage (see code above).
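To see which cell populations drive a separation like this, one option is to rank the PCA loadings (the eigenvectors returned in `rotation`). The sketch below uses simulated frequencies. The data frame `freqs` and the helper `top_loadings` are hypothetical and only stand in for the cell-population columns used above.

```r
# Hypothetical per-sample cell frequencies; not real patient data
set.seed(42)
n_samples <- 60
group <- rep(c("COPD present", "No COPD present"), each = n_samples / 2)
shift <- ifelse(group == "COPD present", 4, 0)  # two populations differ by group
freqs <- data.frame(th1  = rnorm(n_samples, mean = 10 + shift),
                    th17 = rnorm(n_samples, mean = 5 + shift),
                    treg = rnorm(n_samples, mean = 8),
                    nk   = rnorm(n_samples, mean = 12))

pr_freq <- prcomp(freqs, center = TRUE, scale. = TRUE)

# Return the k variables with the largest absolute loading on a given component
top_loadings <- function(pr, component = "PC1", k = 2) {
  loadings <- pr$rotation[, component]
  loadings[order(abs(loadings), decreasing = TRUE)][seq_len(k)]
}

top_pc1 <- top_loadings(pr_freq, "PC1", k = 2)
print(top_pc1)
```

The sign of a loading is arbitrary, so the ranking uses absolute values. Populations near the top of the list contribute most to the axis along which the groups separate.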
t-Distributed Stochastic Neighbor Embedding (tSNE) is a non-linear algorithm for dimensionality reduction, allowing visualization of complex multi-dimensional data in fewer dimensions while maintaining the overall structure of the data. tSNE was first described in 2008 and has become a widely used dimensional reduction technique (see the website of its creator, Laurens van der Maaten, for more details). Importantly, tSNE preserves the local structure of the data while still revealing much of its global structure, such as the presence of clusters. It is a powerful and useful technique that can be run either natively in FlowJo or in R using the Rtsne package. For a complete description of the underlying algorithm, see here
For tSNE in FlowJo there is excellent documentation available here.
To perform tSNE in R, we can use the Rtsne package. In this example, paired immune cell populations (CD45+) from lung tumor and non-tumor adjacent lung were concatenated (50,000 events from each sample) and analyzed using tSNE. Specific immune cell populations can be labeled according to origin (lung vs tumor), immune effector cell type (CD4+, CD8+, gamma delta TCR+), or intracellular cytokine production (interferon gamma, IL-17a, etc.). We can export flow cytometry data as a data frame in which each row represents a single event (cell) and each column holds the values for one marker.
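The layout described above (one row per event, one column per marker, a fixed number of events taken from each sample) can be sketched in base R as follows. The data frames `tumor` and `lung`, the marker names, and the helpers `make_events` and `sample_events` are hypothetical placeholders for data exported from the cytometry software.

```r
# Hypothetical exported events: one row per cell, one column per marker
set.seed(7)
make_events <- function(n) {
  data.frame(cd45 = rlnorm(n), cd4 = rlnorm(n), cd8 = rlnorm(n), ifng = rlnorm(n))
}
tumor <- make_events(1200)
lung  <- make_events(800)

# Draw the same number of events from each sample so neither dominates the map
sample_events <- function(df, n_events) {
  df[sample(nrow(df), min(n_events, nrow(df))), , drop = FALSE]
}

n_per_sample <- 500  # the text uses 50,000; smaller here to keep the sketch fast
concatenated <- rbind(
  cbind(origin = "tumor", sample_events(tumor, n_per_sample)),
  cbind(origin = "lung",  sample_events(lung,  n_per_sample))
)
rownames(concatenated) <- NULL

table(concatenated$origin)
```

Keeping the `origin` column alongside the marker columns makes it possible to color the embedding by tissue later, as long as that column is excluded from the matrix passed to tSNE.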
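```r
library(readxl)  # read_excel()
library(Rtsne)
training_set <- as.data.frame(read_excel("NSCLC.xlsx", sheet = 1))  # assumes column 1 is a label column named `label`
training_set$label <- as.factor(training_set$label)
colors <- setNames(rainbow(nlevels(training_set$label)), levels(training_set$label))  # one color per label
immune_cell_tsne <- Rtsne(training_set[,-1], dims = 2, perplexity = 25, theta = 0.2, verbose = TRUE, pca = TRUE, max_iter = 500)
plot(immune_cell_tsne$Y, t = 'n', main = "immune_cell_tsne")
text(immune_cell_tsne$Y, labels = training_set$label, col = colors[as.character(training_set$label)])
```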
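When performing tSNE it is important to carefully select hyperparameters:
- dimensions - how many dimensions are desired (usually 2)
- perplexity - increase with a larger number of cells or with denser clusters; typically 25-100
- maximum iterations - typically 500 or 1000
- theta (speed/accuracy tradeoff) - the Barnes-Hut approximation parameter; 0 runs exact tSNE, while larger values run faster but less accurately (the Rtsne default is 0.5)
- PCA (true or false) - whether to run an initial PCA step that reduces the input before tSNE (enabled by default in Rtsne)
- eta (learning rate) - controls how far the embedded points are moved at each iteration. Optimally set at 7% of the number of cells being mapped into tSNE space.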
Hyperparameter tuning requires experimentation. I recommend downsampling the dataset to 10,000 events while optimizing the parameters to save time.
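A minimal sketch of that tuning workflow is shown below. The helper names (`downsample_events`, `tsne_settings`, `run_tsne`) and the simulated `events` matrix are hypothetical. The `eta` value follows the 7% rule of thumb given above, and the argument names are intended to match those of `Rtsne::Rtsne`. The tSNE call is wrapped in a function so that nothing slow runs until it is called explicitly.

```r
# Randomly keep at most max_events rows (cells) for fast parameter tuning
downsample_events <- function(x, max_events = 10000) {
  if (nrow(x) <= max_events) return(x)
  x[sample(nrow(x), max_events), , drop = FALSE]
}

# Collect the hyperparameters in one place so they can be standardized later
tsne_settings <- function(n_cells, perplexity = 25, max_iter = 500, theta = 0.2) {
  list(dims = 2, perplexity = perplexity, max_iter = max_iter, theta = theta,
       eta = 0.07 * n_cells)  # learning rate: 7% of the number of cells
}

# Run tSNE with a given settings list (requires the Rtsne package)
run_tsne <- function(x, settings) {
  if (!requireNamespace("Rtsne", quietly = TRUE)) stop("Rtsne is not installed")
  do.call(Rtsne::Rtsne, c(list(X = x, pca = TRUE), settings))
}

set.seed(123)
events <- matrix(rnorm(30000 * 6), ncol = 6)  # simulated stand-in for real events
tuning_set <- downsample_events(events, max_events = 10000)
settings <- tsne_settings(nrow(tuning_set))

# Example call, left commented out because it can take a while:
# fit <- run_tsne(tuning_set, settings)
```

Once a combination of settings gives a stable map on the downsampled data, the same `settings` list can be reused for the full dataset, with `eta` recomputed for the larger cell count.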
In this example, tSNE demonstrates similar, overlapping immune cell populations in paired NSCLC tumor and lung samples.
Here is a summary of tSNE findings for multiple concatenated lung and tumor samples: tSNE analysis of multiple lung tumors - note the overlapping immune phenotype for several different lung/tumor samples
While tSNE is a powerful and useful tool for analyzing immune cell populations, there are some important limitations of tSNE to consider:
- computationally expensive; with O(n^2) time complexity this can take a long time to run (it makes sense to set up a cloud VM with plenty of RAM and compute for datasets larger than 100k cells)
- non-deterministic; running the same data can produce (slightly) different results
- sensitive to hyper-parameters; make sure to empirically tune and then standardize the settings used
- images can be deceptive; although tSNE space preserves the local and global aspects of the data, the relative area of different regions is not representative of the number of cells
- sensitive to the compensation of the data; one of the strengths of tSNE is that it can accommodate log-distributed data, but off-scale events will distort the analysis
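Two of these limitations can be partly managed in code: fixing the random seed before any stochastic step should make a run repeatable, and events sitting at the edge of the measurable range can be dropped before embedding. The sketch below is base R on simulated data. The helper `drop_off_scale` and the `scale_max` value are hypothetical and would need to match the actual range of the instrument.

```r
# Remove events where any marker sits at or beyond the measurable range
drop_off_scale <- function(x, scale_min = 0, scale_max = 262143) {
  on_scale <- apply(x, 1, function(event) all(event > scale_min & event < scale_max))
  x[on_scale, , drop = FALSE]
}

set.seed(2024)  # fix the seed before sampling or embedding
raw <- matrix(runif(1000 * 4, min = 1, max = 200000), ncol = 4,
              dimnames = list(NULL, c("cd45", "cd3", "cd4", "cd8")))
raw[1:10, "cd3"] <- 262143  # simulate events piled up at the top of the scale
raw[11:15, "cd8"] <- 0      # and at the bottom

clean <- drop_off_scale(raw)
n_removed <- nrow(raw) - nrow(clean)
print(n_removed)
```

Whether to drop or clip off-scale events is a judgment call. Dropping is shown here because it is the simplest option to audit.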
Examining TCR sequences, consensus sequences, and shared sequences in both tumor and adjacent tissue using the tcR R package.
- Uniform Manifold Approximation and Projection (UMAP) - An alternative non-linear, non-deterministic dimensional reduction algorithm. I have less experience using UMAP, but it has some clear advantages, including O(d*n^1.14) rather than O(n^2) time complexity, that make it appealing for large datasets. UMAP is available as an R package here.
- Hierarchical Clustering
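As a small illustration of hierarchical clustering, the base R sketch below groups simulated samples by their cell-population profiles using `dist`, `hclust`, and `cutree`. The `profiles` matrix and its two built-in groups are made up for illustration, and Ward linkage is only one reasonable choice of method.

```r
# Simulated profiles: 20 samples x 4 populations, with two underlying groups
set.seed(11)
group_a <- matrix(rnorm(10 * 4, mean = 0), ncol = 4)
group_b <- matrix(rnorm(10 * 4, mean = 6), ncol = 4)
profiles <- rbind(group_a, group_b)
rownames(profiles) <- paste0("sample_", seq_len(nrow(profiles)))
colnames(profiles) <- c("cd4", "cd8", "nk", "b")

# Euclidean distances between scaled profiles, then Ward linkage
distances <- dist(scale(profiles))
tree <- hclust(distances, method = "ward.D2")

# Cut the dendrogram into two clusters and compare with the known groups
clusters <- cutree(tree, k = 2)
truth <- rep(c("a", "b"), each = 10)
print(table(cluster = clusters, truth = truth))

# plot(tree) draws the dendrogram
```

Unlike tSNE, this procedure is deterministic for a given distance matrix and linkage method, which makes it a useful cross-check on clusters seen in an embedding.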
current version 0.1.3
- [ ] current version 0.1.0 - this is a work in progress
- [ ] need to clean up the tSNE R code
- [ ] add more detailed examples and explanations
- [ ] add additional references
- Mark NM et al, Chronic Obstructive Pulmonary Disease Alters Immune Cell Composition and Immune Checkpoint Inhibitor Efficacy in Non-Small Cell Lung Cancer, AJRCCM 2018
- Thorsson V et al, The Immune Landscape of Cancer, Immunity. 2018
- FlowJo tSNE documentation
- Comprehensive Guide on t-SNE algorithm with implementation in R & Python
- Becht E et al, Evaluation of UMAP as an alternative to t-SNE for single-cell data
- Nazarov, V.I., Pogorelyy, M.V., Komech, E.A. et al. tcR: an R package for T cell receptor repertoire advanced data analysis. BMC Bioinformatics 16, 175 (2015). https://doi.org/10.1186/s12859-015-0613-1