diff --git a/DESCRIPTION b/DESCRIPTION deleted file mode 100644 index b49d221..0000000 --- a/DESCRIPTION +++ /dev/null @@ -1,71 +0,0 @@ -Type: Package -Package: scDiagnostics -Title: Cell type annotation diagnostics -Version: 0.99.0 -Authors@R: c( - person("Anthony", "Christidis", role = c("aut", "cre"), - email = "anthony-alexander_christidis@hms.harvard.edu"), - person("Andrew", "Ghazi", role = "aut"), - person("Smriti", "Chawla", role = "aut"), - person("Nitesh", "Turaga", role = "ctb"), - person("Ludwig", "Geistlinger", role = "aut"), - person("Robert", "Gentleman", role = "aut") - ) -Description: The scDiagnostics package provides diagnostic plots to - assess the quality of cell type assignments from single cell gene - expression profiles. The implemented functionality allows to - assess the reliability of cell type annotations, investigate gene - expression patterns, and explore relationships between different - cell types in query and reference datasets allowing users to - detect potential misalignments between reference and query - datasets. The package also provides visualization capabilities for - diagnositics purposes. 
-License: Artistic-2.0 -URL: https://github.com/ccb-hms/scDiagnostics -BugReports: https://github.com/ccb-hms/scDiagnostics/issues -Depends: - R (>= 4.4.0) -Imports: - SingleCellExperiment, - isotree, - methods, - ggplot2, - RColorBrewer, - gridExtra, - SummarizedExperiment, - stats, - utils, - ranger, - BiocNeighbors, - Hotelling, - rlang -Suggests: - AUCell, - BiocStyle, - corrplot, - knitr, - Matrix, - rmarkdown, - scran, - scRNAseq, - SingleR, - celldex, - ComplexHeatmap, - scuttle, - scater, - testthat (>= 3.0.0) -VignetteBuilder: - knitr -biocViews: - Annotation, - Classification, - Clustering, - GeneExpression, - RNASeq, - SingleCell, - Software, - Transcriptomics -Encoding: UTF-8 -LazyData: true -RoxygenNote: 7.3.1 -Config/testthat/edition: 3 diff --git a/NAMESPACE b/NAMESPACE deleted file mode 100644 index a8e8b8d..0000000 --- a/NAMESPACE +++ /dev/null @@ -1,55 +0,0 @@ -# Generated by roxygen2: do not edit by hand - -S3method(plot,calculateAveragePairwiseCorrelation) -S3method(plot,calculateSampleDistances) -S3method(plot,calculateSampleSimilarityPCA) -S3method(plot,compareCCA) -S3method(plot,comparePCA) -S3method(plot,comparePCASubspace) -S3method(plot,detectAnomaly) -S3method(plot,nearestNeighborDiagnostics) -export(boxplotPCA) -export(calculateAveragePairwiseCorrelation) -export(calculateCategorizationEntropy) -export(calculateHVGOverlap) -export(calculateHotellingPValue) -export(calculatePairwiseDistancesAndPlotDensity) -export(calculateSampleDistances) -export(calculateSampleDistancesSimilarity) -export(calculateSampleSimilarityPCA) -export(calculateVarImpOverlap) -export(compareCCA) -export(comparePCA) -export(comparePCASubspace) -export(detectAnomaly) -export(histQCvsAnnotation) -export(nearestNeighborDiagnostics) -export(plotGeneExpressionDimred) -export(plotGeneSetScores) -export(plotMarkerExpression) -export(plotPCRegression) -export(plotQCvsAnnotation) -export(projectPCA) -export(regressPC) -export(visualizeCellTypeMDS) 
-export(visualizeCellTypePCA) -import(SingleCellExperiment) -import(ggplot2) -importFrom(SummarizedExperiment,assay) -importFrom(ggplot2,ggplot) -importFrom(gridExtra,grid.arrange) -importFrom(methods,is) -importFrom(rlang,.data) -importFrom(stats,approxfun) -importFrom(stats,cancor) -importFrom(stats,cmdscale) -importFrom(stats,cor) -importFrom(stats,density) -importFrom(stats,dist) -importFrom(stats,lm) -importFrom(stats,na.omit) -importFrom(stats,predict) -importFrom(stats,qnorm) -importFrom(stats,setNames) -importFrom(utils,combn) -importFrom(utils,tail) diff --git a/NEWS.md b/NEWS.md deleted file mode 100644 index 2389c17..0000000 --- a/NEWS.md +++ /dev/null @@ -1,4 +0,0 @@ -# scDiagnostics 0.99.0 - -* Initial CRAN submission. -* New package scDiagnostics, for cell type annotation diagnostics. diff --git a/R/boxplotPCA.R b/R/boxplotPCA.R deleted file mode 100644 index 22a7ace..0000000 --- a/R/boxplotPCA.R +++ /dev/null @@ -1,146 +0,0 @@ -#' @title Plot Principal Components for Different Cell Types -#' -#' @description This function generates a \code{ggplot2} boxplot visualization of principal components (PCs) for different -#' cell types across two datasets (query and reference). -#' -#' @details -#' The function \code{boxplotPCA} is designed to provide a visualization of principal component analysis (PCA) results. It projects -#' the query dataset onto the principal components obtained from the reference dataset. The results are then visualized -#' as boxplots, grouped by cell types and datasets (query and reference). This allows for a comparative analysis of the -#' distributions of the principal components across different cell types and datasets. The function internally calls \code{projectPCA} -#' to perform the PCA projection. It then reshapes the output data into a long format suitable for ggplot2 plotting. -#' The color scheme is automatically determined using the \code{RColorBrewer} package, ensuring a visually distinct and appealing plot. 
-#' -#' @param query_data A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the query cells. -#' @param reference_data A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the reference cells. -#' @param n_components An integer specifying the number of principal components to use for projection. Defaults to 10. -#' Must be less than or equal to the number of components available in the reference PCA. -#' @param cell_types A character vector specifying the cell types to include in the plot. If NULL, all cell types are included. -#' @param query_cell_type_col character. The column name in the \code{colData} of \code{query_data} -#' that identifies the cell types. -#' @param ref_cell_type_col character. The column name in the \code{colData} of \code{reference_data} -#' that identifies the cell types. -#' @param pc_subset A numeric vector specifying which principal components to include in the plot. Default is PC1 to PC5. -#' -#' @return A ggplot object representing the boxplots of specified principal components for the given cell types and datasets. 
-#' -#' @export -#' -#' @author Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu} -#' -#' @examples -#' # Load required libraries -#' library(scRNAseq) -#' library(scuttle) -#' library(SingleR) -#' library(scran) -#' library(scater) -#' -#' # Load data (replace with your data loading) -#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) -#' -#' # Divide the data into reference and query datasets -#' set.seed(100) -#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -#' ref_data <- sce[, indices] -#' query_data <- sce[, -indices] -#' -#' # log transform datasets -#' ref_data <- scuttle::logNormCounts(ref_data) -#' query_data <- scuttle::logNormCounts(query_data) -#' -#' # Get cell type scores using SingleR (or any other cell type annotation method) -#' scores <- SingleR::SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) -#' -#' # Add labels to query object -#' colData(query_data)$labels <- scores$labels -#' -#' # Selecting highly variable genes (can be customized by the user) -#' ref_var <- scran::getTopHVGs(ref_data, n = 2000) -#' query_var <- scran::getTopHVGs(query_data, n = 2000) -#' -#' # Intersect the gene symbols to obtain common genes -#' common_genes <- intersect(ref_var, query_var) -#' ref_data_subset <- ref_data[common_genes, ] -#' query_data_subset <- query_data[common_genes, ] -#' -#' # Run PCA on the reference data (assumed to be prepared) -#' ref_data_subset <- runPCA(ref_data_subset) -#' -#' pc_plot <- boxplotPCA(query_data_subset, ref_data_subset, -#' n_components = 10, -#' cell_types = c("CD4", "CD8", "B_and_plasma", "Myeloid"), -#' query_cell_type_col = "labels", -#' ref_cell_type_col = "reclustered.broad", -#' pc_subset = c(1:5)) -#' pc_plot -#' -#' -#' @importFrom stats approxfun cancor density setNames -#' @importFrom utils combn -#' -# Function to plot PC for different cell types -boxplotPCA <- function(query_data, reference_data, - n_components = 
10, - cell_types = NULL, - query_cell_type_col = NULL, - ref_cell_type_col = NULL, - pc_subset = c(1:5)){ - - # Get the projected PCA data - pca_output <- projectPCA(query_data = query_data, reference_data = reference_data, - n_components = n_components, - query_cell_type_col = query_cell_type_col, - ref_cell_type_col = ref_cell_type_col) - - # Create the long format data frame manually - pca_output <- pca_output[!is.na(pca_output$cell_type),] - if(!is.null(cell_types)){ - if(all(cell_types %in% pca_output$cell_type)){ - pca_output <- pca_output[which(pca_output$cell_type %in% cell_types),] - } else{ - stop("One or more of the specified \'cell_types\' are not available.") - } - } - pca_long <- data.frame(PC = rep(paste0("pc", pc_subset), each = nrow(pca_output)), - Value = unlist(c(pca_output[, pc_subset])), - dataset = rep(pca_output$dataset, length(pc_subset)), - cell_type = rep(pca_output$cell_type, length(pc_subset))) - pca_long$PC <- toupper(pca_long$PC) - - # Create a new variable representing the combination of cell type and dataset - pca_long$cell_type_dataset <- paste(pca_long$dataset, pca_long$cell_type, sep = " ") - - # Define the order of cell type and dataset combinations - order_combinations <- paste(rep(c("Reference", "Query"), length(unique(pca_long$cell_type))), - rep(sort(unique(pca_long$cell_type)), each = 2)) - - # Reorder the levels of cell type and dataset factor - pca_long$cell_type_dataset <- factor(pca_long$cell_type_dataset, levels = order_combinations) - - # Define the colors for cell types - color_mapping <- setNames(RColorBrewer::brewer.pal(length(order_combinations), "Paired"), order_combinations) - cell_type_colors <- color_mapping[order_combinations] - - # Create the ggplot - plot <- ggplot2::ggplot(pca_long, aes(x = cell_type, y = Value, fill = cell_type_dataset)) + - ggplot2::geom_boxplot(alpha = 0.7, outlier.shape = NA, width = 0.7) + - ggplot2::facet_wrap(~ PC, scales = "free") + - ggplot2::scale_fill_manual(values = 
cell_type_colors, name = "Cell Types") + - ggplot2::labs(x = "", y = "Value") + - ggplot2::theme_minimal() + - ggplot2::theme(legend.position = "right", - axis.text.x = ggplot2::element_text(angle = 45, hjust = 1, size = 10), - axis.title = ggplot2::element_text(size = 14), - strip.text = ggplot2::element_text(size = 12, face = "bold"), - panel.grid.major = ggplot2::element_line(color = "grey", linetype = "dotted", linewidth = 0.7), - panel.grid.minor = ggplot2::element_blank(), - panel.border = ggplot2::element_blank(), - strip.background = ggplot2::element_rect(fill = "lightgrey", color = "grey", linewidth = 0.5), - plot.title = ggplot2::element_text(size = 16, face = "bold", hjust = 0.5)) - - # Return the plot - return(plot) -} - - diff --git a/R/calculateAveragePairwiseCorrelation.R b/R/calculateAveragePairwiseCorrelation.R deleted file mode 100644 index 1c99f51..0000000 --- a/R/calculateAveragePairwiseCorrelation.R +++ /dev/null @@ -1,169 +0,0 @@ -#' Compute Average Pairwise Correlation between Cell Types -#' -#' Computes the average pairwise correlations between specified cell types -#' in single-cell gene expression data. -#' -#' @details This function operates on \code{\linkS4class{SingleCellExperiment}} objects, -#' ideal for single-cell analysis workflows. It calculates pairwise correlations between query and -#' reference cells using a specified correlation method, then averages these correlations for each -#' cell type pair. This function aids in assessing the similarity between cells in reference and query datasets, -#' providing insights into the reliability of cell type annotations in single-cell gene expression data. -#' -#' @param query_data A \code{\linkS4class{SingleCellExperiment}} containing the single-cell -#' expression data and metadata. -#' @param n_components The number of principal components to use for the dimensionality reduction of the data using PCA. Defaults to 10. 
-#' If set to \code{NULL} then no dimensionality reduction is performed and the assay data is used directly for computations. -#' @param reference_data A \code{\linkS4class{SingleCellExperiment}} object containing the single-cell -#' expression data and metadata. -#' @param query_cell_type_col character. The column name in the \code{colData} of \code{query_data} -#' that identifies the cell types. -#' @param ref_cell_type_col character. The column name in the \code{colData} of \code{reference_data} -#' that identifies the cell types. -#' @param cell_types A character vector specifying the cell types to be analysed consider. -#' @param correlation_method The correlation method to use for calculating pairwise correlations. -#' -#' @return A matrix containing the average pairwise correlation values. -#' Rows and columns are labeled with the cell types. Each element -#' in the matrix represents the average correlation between a pair -#' of cell types. -#' -#' @seealso \code{\link{plot.calculateAveragePairwiseCorrelation}} -#' -#' @examples -#' library(scater) -#' library(scran) -#' library(scRNAseq) -#' library(SingleR) -#' -#' # Load data -#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) -#' -#' # Divide the data into reference and query datasets -#' set.seed(100) -#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -#' ref_data <- sce[, indices] -#' query_data <- sce[, -indices] -#' -#' # log transform datasets -#' ref_data <- logNormCounts(ref_data) -#' query_data <- logNormCounts(query_data) -#' -#' # Get cell type scores using SingleR -#' scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) -#' -#' # Add labels to query object -#' colData(query_data)$labels <- scores$labels -#' -#' # Compute Pairwise Correlations -#' # Note: The selection of highly variable genes and desired cell types may vary -#' # based on user preference. 
-#' # The cell type annotation method used in this example is SingleR. -#' # User can use any other method for cell type annotation and provide -#' # the corresponding labels in the metadata. -#' -#' # Selecting highly variable genes -#' ref_var <- getTopHVGs(ref_data, n = 2000) -#' query_var <- getTopHVGs(query_data, n = 2000) -#' -#' # Intersect the gene symbols to obtain common genes -#' common_genes <- intersect(ref_var, query_var) -#' -#' # Select desired cell types -#' selected_cell_types <- c("CD4", "CD8", "B_and_plasma") -#' ref_data_subset <- ref_data[common_genes, ref_data$reclustered.broad %in% selected_cell_types] -#' query_data_subset <- query_data[common_genes, query_data$reclustered.broad %in% selected_cell_types] -#' -#' # Run PCA on the reference data -#' ref_data_subset <- runPCA(ref_data_subset) -#' -#' # Compute pairwise correlations -#' cor_matrix_avg <- calculateAveragePairwiseCorrelation(query_data = query_data_subset, -#' reference_data = ref_data_subset, -#' n_components = 10, -#' query_cell_type_col = "labels", -#' ref_cell_type_col = "reclustered.broad", -#' cell_types = selected_cell_types, -#' correlation_method = "spearman") -#' -#' # Visualize the results -#' plot(cor_matrix_avg) -#' -#' -#' @import SingleCellExperiment -#' @importFrom SummarizedExperiment assay -#' @importFrom stats cor -#' @export -calculateAveragePairwiseCorrelation <- function(query_data, - reference_data, - n_components = 10, - query_cell_type_col, - ref_cell_type_col, - cell_types, - correlation_method) { - # Sanity checks - - # Check if query_data is a SingleCellExperiment object - if (!is(query_data, "SingleCellExperiment")) { - stop("query_data must be a SingleCellExperiment object.") - } - - # Check if reference_data is a SingleCellExperiment object - if (!is(reference_data, "SingleCellExperiment")) { - stop("reference_data must be a SingleCellExperiment object.") - } - - # Check if query_cell_type_col is a valid column name in query_data - if 
(!query_cell_type_col %in% names(colData(query_data))) { - stop("query_cell_type_col: '", query_cell_type_col, "' is not a valid column name in query_data.") - } - - # Check if ref_cell_type_col is a valid column name in reference_data - if (!ref_cell_type_col %in% names(colData(reference_data))) { - stop("ref_cell_type_col: '", ref_cell_type_col, "' is not a valid column name in reference_data.") - } - - # Check if all cell_types are present in query_data - if (!all(cell_types %in% unique(query_data[[query_cell_type_col]]))) { - stop("One or more cell_types specified are not present in query_data.") - } - - # Check if all cell_types are present in reference_data - if (!all(cell_types %in% unique(reference_data[[ref_cell_type_col]]))) { - stop("One or more cell_types specified are not present in reference_data.") - } - - # Function to compute correlation between two cell types - .computeCorrelation <- function(type1, type2) { - - if(!is.null(n_components)){ - # Project query data onto PCA space of reference data - pca_output <- projectPCA(query_data = query_data, reference_data = reference_data, - n_components = n_components, return_value = "list") - ref_mat <- pca_output$ref[which(reference_data[[ref_cell_type_col]] == type2), paste0("PC", 1:n_components)] - query_mat <- pca_output$query[which(query_data[[query_cell_type_col]] == type1), paste0("PC", 1:n_components)] - } else{ - - # Subset query data to the specified cell type - query_subset <- query_data[ , query_data[[query_cell_type_col]] == type1, drop = FALSE] - ref_subset <- reference_data[ , reference_data[[ref_cell_type_col]] == type2, drop = FALSE] - - query_mat <- t(as.matrix(assay(query_subset, "logcounts"))) - ref_mat <- t(as.matrix(assay(ref_subset, "logcounts"))) - } - - cor_matrix <- cor(t(query_mat), t(ref_mat), method = correlation_method) - mean(cor_matrix) - } - - # Use outer to compute pairwise correlations - cor_matrix_avg <- outer(cell_types, cell_types, Vectorize(.computeCorrelation)) - - # 
Assign cell type names to rows and columns - rownames(cor_matrix_avg) <- paste0("Query-", cell_types) - colnames(cor_matrix_avg) <- paste0("Ref-", cell_types) - - # Update class of output - class(cor_matrix_avg) <- c(class(cor_matrix_avg), "calculateAveragePairwiseCorrelation") - - return(cor_matrix_avg) -} diff --git a/R/calculateCategorizationEntropy.R b/R/calculateCategorizationEntropy.R deleted file mode 100644 index 0d84716..0000000 --- a/R/calculateCategorizationEntropy.R +++ /dev/null @@ -1,123 +0,0 @@ -#' Calculate Categorization Entropy -#' @description This function takes a matrix of category scores (cell type by -#' cells) and calculates the entropy of the category probabilities for each -#' cell. This gives a sense of how confident the cell type assignments are. -#' High entropy = lots of plausible category assignments = low confidence. Low -#' entropy = only one or two plausible categories = high confidence. This is -#' confidence in the vernacular sense, not in the "confidence interval" -#' statistical sense. Also note that the entropy tells you nothing about -#' whether or not the assignments are correct -- see the other functionality -#' in the package for that. This functionality can be used for assessing how -#' comparatively confident different sets of assignments are (given that the -#' number of categories is the same). -#' @param X a matrix of category scores -#' @param inverse_normal_transform if TRUE, apply -#' @param verbose if TRUE, display messages about the calculations -#' @param plot if TRUE, plot a histogram of the entropies -#' @returns A vector of entropy values for each column in X. -#' @details The function checks if X is already on the probability scale. -#' Otherwise, it applies softmax columnwise. -#' -#' You can think about entropies on a scale from 0 to a maximum that depends -#' on the number of categories. This is the function for entropy (minus input -#' checking): \code{entropy(p) = -sum(p*log(p))} . 
If that input vector p is a -#' uniform distribution over the \code{length(p)} categories, the entropy will -#' be a high as possible. -#' @export -#' @examples -#' # Simulate 500 cells with scores on 4 possible cell types -#' X <- rnorm(500 * 4) |> matrix(nrow = 4) -#' X[1, 1:250] <- X[1, 1:250] + 5 # Make the first category highly scored in the first 250 cells -#' -#' -#' # The function will issue a message about softmaxing the scores, and the entropy histogram will be -#' # bimodal since we made half of the cells clearly category 1 while the other half are roughly even. -#' # entropy_scores <- calculateCategorizationEntropy(X) -calculateCategorizationEntropy <- function(X, - inverse_normal_transform = FALSE, - plot = TRUE, - verbose = TRUE) { - if (inverse_normal_transform) { - # https://cran.r-project.org/web/packages/RNOmni/vignettes/RNOmni.html#inverse-normal-transformation - if (verbose) message("Applying global inverse normal transformation.") - # You can't do the INT column-wise (by cell) because it will set a - # constant "range" to the probabilities, eliminating the differences in - # confidence across methods we're trying to quantify. - - # You can't do the INT row-wise (by cell-type) because even though - # different cell types exhibit different marginal distributions of - # scores (in SingleR at least), doing the transformation row-wise would - # eliminate any differences in which cell types are "hard to predict". - # You don't want a score of .5 for cytotoxic T cells (hard to predict - # type) to overwhelm a score of .62 from erythroid type 2 (easy to - # predict), even though the first would be extraordinary within its cell - # type and the latter unexceptional within its cell type. 
- - X <- inverse_normal_trans(X) - } - - colSumsX <- colSums(X) - - X_is_probabilities <- all(X >= 0 & X <= 1) & - all((colSumsX - 1) <= 1e-8) - - if (!X_is_probabilities) { - if (verbose) message("X doesn't seem to be on the probability scale, applying column-wise softmax.") - expX <- exp(X) - - X <- sweep(expX, MARGIN = 2, STATS = colSums(expX), FUN = "/") - } - - ncat <- nrow(X) - - max_ent <- calculate_entropy(rep(1 / ncat, ncat)) - - if (verbose) { - message( - "Max possible entropy given ", ncat, " categories: ", - round(max_ent, - digits = 2 - ) - ) - } - - entropies <- apply(X, 2, calculate_entropy) - - if (plot) { - p <- data.frame(entropies = entropies) |> - ggplot(aes(entropies)) + - geom_histogram( - color = "black", fill = "white", - bins = 30, - boundary = 0 - ) + - theme_bw() - print(p) - } - - return(entropies) -} - -calculate_entropy <- function(p) { - # p is one column of X, a vector of probabilities summing to 1. - - nonzeros <- p != 0 - - -sum(p[nonzeros] * log(p[nonzeros])) -} - -n_elements <- function(X) ifelse(is.matrix(X), prod(dim(X)), length(X)) - -inverse_normal_trans <- function(X, constant = 3 / 8) { - n <- n_elements(X) - - rankX <- rank(X) - - intX <- qnorm((rankX - constant) / (n - 2 * constant + 1)) - - if (is.matrix(X)) { - intX <- matrix(intX, nrow = nrow(X)) - } - - return(intX) -} diff --git a/R/calculateHVGOverlap.R b/R/calculateHVGOverlap.R deleted file mode 100644 index 9083a1d..0000000 --- a/R/calculateHVGOverlap.R +++ /dev/null @@ -1,82 +0,0 @@ -#' @title Calculate the Overlap Coefficient for Highly Variable Genes -#' -#' @description Calculates the overlap coefficient between the sets of highly variable genes -#' from a reference dataset and a query dataset. -#' -#' @details The overlap coefficient measures the similarity between two gene sets, indicating how well-aligned -#' reference and query datasets are in terms of their highly variable genes. 
This metric is -#' useful in single-cell genomics to understand the correspondence between different datasets. -#' -#' The coefficient is calculated using the formula: -#' -#' \deqn{Coefficient(X, Y) = \frac{|X \cap Y|}{min(|X|, |Y|)}} -#' -#' where X and Y are the sets of highly variable genes from the reference and query datasets, respectively, -#' |X ∩ Y| is the number of genes common to both X and Y, and min(|X|, |Y|) is the size of the smaller set among X and Y. -#' -#' @param reference_genes character. A vector of highly variable genes from the reference dataset. -#' @param query_genes character. A vector of highly variable genes from the query dataset. -#' -#' @return Overlap coefficient, a value between 0 and 1, where 0 indicates no overlap -#' and 1 indicates complete overlap of highly variable genes between datasets. -#' -#' @references Luecken et al. Benchmarking atlas-level data integration in -#' single-cell genomics. Nature Methods, 19:41-50, 2022. -#' -#' @examples -#' library(scater) -#' library(scran) -#' library(scRNAseq) -#' library(SingleR) -#' -#' # Load data -#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) -#' -#' # Divide the data into reference and query datasets -#' set.seed(100) -#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -#' ref_data <- sce[, indices] -#' query_data <- sce[, -indices] -#' -#' # log transform datasets -#' ref_data <- logNormCounts(ref_data) -#' query_data <- logNormCounts(query_data) -#' -#' # Selcting highly variable genes -#' -#' ref_var <- getTopHVGs(ref_data, n=2000) -#' query_var <- getTopHVGs(query_data, n=2000) -#' -#' overlap_coefficient <- calculateHVGOverlap(reference_genes = ref_var, -#' query_genes = query_var) -#' -#' @export -calculateHVGOverlap <- function(reference_genes, query_genes) { - - # Sanity checks - if (!is.vector(reference_genes) || !is.character(reference_genes)) { - stop("reference_genes must be a character vector.") - } - if 
(!is.vector(query_genes) || !is.character(query_genes)) { - stop("query_genes must be a character vector.") - } - if (length(reference_genes) == 0 || length(query_genes) == 0) { - stop("Input vectors must not be empty.") - } - - # Calculate the intersection of highly variable genes - common_genes <- intersect(reference_genes, query_genes) - - # Calculate the size of the intersection - intersection_size <- length(common_genes) - - # Calculate the size of the smaller set - min_size <- min(length(reference_genes), length(query_genes)) - - # Compute the overlap coefficient - overlap_coefficient <- intersection_size / min_size - overlap_coefficient <- round(overlap_coefficient, 2) - - # Return the overlap coefficient - return(overlap_coefficient) -} \ No newline at end of file diff --git a/R/calculateHotellingPValue.R b/R/calculateHotellingPValue.R deleted file mode 100644 index ca2d8a6..0000000 --- a/R/calculateHotellingPValue.R +++ /dev/null @@ -1,113 +0,0 @@ -#' @title Perform Hotelling's T-squared Test on PCA Scores for Single-cell RNA-seq Data -#' -#' @description This function performs Hotelling's T-squared test to assess the similarity between reference and query datasets -#' for each cell type based on their PCA scores. -#' -#' @details This function first performs PCA on the reference dataset and then projects the query dataset onto the PCA space -#' of the reference data. For each cell type, it computes pseudo-bulk signatures in the PCA space by averaging the principal -#' component scores of cells belonging to that cell type. Hotelling's T-squared test is then performed to compare the mean -#' vectors of the pseudo-bulk signatures between the reference and query datasets. The resulting p-values indicate the similarity -#' between the reference and query datasets for each cell type. -#' -#' @param query_data A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the query cells. 
-#' @param reference_data A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the reference cells. -#' @param n_components An integer specifying the number of principal components to use for projection. Defaults to 10. -#' @param query_cell_type_col character. The column name in the \code{colData} of \code{query_data} -#' that identifies the cell types. -#' @param ref_cell_type_col character. The column name in the \code{colData} of \code{reference_data} -#' that identifies the cell types. -#' @param pc_subset A numeric vector specifying which principal components to include in the plot. Default is PC1 to PC5. -#' -#' @return A named numeric vector of p-values from Hotelling's T-squared test for each cell type. -#' -#' @export -#' -#' @author Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu} -#' -#' @examples -#' # Load required libraries -#' library(scRNAseq) -#' library(scuttle) -#' library(SingleR) -#' library(scran) -#' library(scater) -#' -#' # Load data (replace with your data loading) -#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) -#' -#' # Divide the data into reference and query datasets -#' set.seed(100) -#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -#' ref_data <- sce[, indices] -#' query_data <- sce[, -indices] -#' -#' # log transform datasets -#' ref_data <- scuttle::logNormCounts(ref_data) -#' query_data <- scuttle::logNormCounts(query_data) -#' -#' # Get cell type scores using SingleR (or any other cell type annotation method) -#' scores <- SingleR::SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) -#' -#' # Add labels to query object -#' colData(query_data)$labels <- scores$labels -#' -#' # Selecting highly variable genes (can be customized by the user) -#' ref_var <- scran::getTopHVGs(ref_data, n = 2000) -#' query_var <- scran::getTopHVGs(query_data, n = 2000) -#' -#' # Intersect the gene symbols to 
obtain common genes -#' common_genes <- intersect(ref_var, query_var) -#' ref_data_subset <- ref_data[common_genes, ] -#' query_data_subset <- query_data[common_genes, ] -#' -#' # Run PCA on the reference data -#' ref_data_subset <- runPCA(ref_data_subset, ncomponents = 50) -#' -#' # Get the p-values from the test -#' p_values <- calculateHotellingPValue(query_data_subset, ref_data_subset, -#' n_components = 10, -#' query_cell_type_col = "reclustered.broad", -#' ref_cell_type_col = "reclustered.broad", -#' pc_subset = c(1:10)) -#' round(p_values, 5) -#' -# Function to perform Hotelling T^2 test for each cell type -# The test is performed on the PCA space of the reference data -# The query data projected onto PCA space of reference -calculateHotellingPValue <- function(query_data, reference_data, - n_components = 10, - query_cell_type_col, - ref_cell_type_col, - pc_subset = c(1:5)) { - - # Get the projected PCA data - pca_output <- projectPCA(query_data = query_data, reference_data = reference_data, - n_components = n_components, - query_cell_type_col = query_cell_type_col, - ref_cell_type_col = ref_cell_type_col, - return_value = "list") - - # Get unique cell types - unique_cell_types <- na.omit(unique(c(colData(reference_data)[[ref_cell_type_col]], - colData(query_data)[[query_cell_type_col]]))) - - # Create a list to store p-values for each cell type - p_values <- rep(NA, length(unique_cell_types)) - names(p_values) <- unique_cell_types - - for (cell_type in unique_cell_types) { - - # Subset principal component scores for current cell type - ref_subset_scores <- pca_output$ref[which(cell_type == reference_data[[ref_cell_type_col]]), pc_subset] - query_subset_scores <- pca_output$query[which(cell_type == query_data[[query_cell_type_col]]), pc_subset] - - # Calculate the p-value - hotelling_output <- Hotelling::hotelling.test(x = ref_subset_scores, y = query_subset_scores) - - # Store the result - p_values[cell_type] <- hotelling_output$pval - } - - # Return 
p-values - return(p_values) -} diff --git a/R/calculatePairwiseDistancesAndPlotDensity.R b/R/calculatePairwiseDistancesAndPlotDensity.R deleted file mode 100644 index 6176165..0000000 --- a/R/calculatePairwiseDistancesAndPlotDensity.R +++ /dev/null @@ -1,183 +0,0 @@ -#' @title Pairwise Distance Analysis and Density Visualization -#' -#' @description -#' Calculates pairwise distances or correlations between query and reference cells -#' of a specific cell type. -#' -#' @details -#' The function works with \code{\linkS4class{SingleCellExperiment}} objects, ensuring -#' compatibility with common single-cell analysis workflows. It subsets the data for specified cell types, -#' computes pairwise distances or correlations, and visualizes these measurements using density plots. By comparing the distances and correlations, -#' one can evaluate the consistency and reliability of annotated cell types within single-cell datasets. -#' -#' @param query_data A \code{\linkS4class{SingleCellExperiment}} containing the single-cell -#' expression data and metadata. -#' @param reference_data A \code{\linkS4class{SingleCellExperiment}} object containing the single-cell -#' expression data and metadata. -#' @param n_components The number of principal components to use for the dimensionality reduction of the data using PCA. Defaults to 10. -#' If set to \code{NULL} then no dimensionality reduction is performed and the assay data is used directly for computations. -#' @param query_cell_type_col character. The column name in the \code{colData} of \code{query_data} -#' that identifies the cell types. -#' @param ref_cell_type_col character. The column name in the \code{colData} of \code{reference_data} -#' that identifies the cell types. -#' @param cell_type_query The query cell type for which distances or correlations are calculated. -#' @param cell_type_reference The reference cell type for which distances or correlations are calculated. 
-#' @param distance_metric The distance metric to use for calculating pairwise distances, such as euclidean, manhattan etc. -#' Set it to "correlation" for calculating correlation coefficients. -#' @param correlation_method The correlation method to use when distance_metric is "correlation". -#' Possible values: "pearson", "spearman". -#' -#' @return A plot generated by \code{ggplot2}, showing the density distribution of -#' calculated distances or correlations. -#' -#' @examples -#' library(scran) -#' library(scRNAseq) -#' library(SingleR) -#' library(scater) -#' -#' # Load data -#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) -#' -#' # Divide the data into reference and query datasets -#' set.seed(100) -#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -#' ref_data <- sce[, indices] -#' query_data <- sce[, -indices] -#' -#' # log transform datasets -#' ref_data <- logNormCounts(ref_data) -#' query_data <- logNormCounts(query_data) -#' -#' # Get cell type scores using SingleR (or any other cell type annotation method) -#' scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) -#' -#' # Add labels to query object -#' colData(query_data)$labels <- scores$labels -#' -#' # Selecting highly variable genes (can be customized by the user) -#' ref_var <- getTopHVGs(ref_data, n = 2000) -#' query_var <- getTopHVGs(query_data, n = 2000) -#' -#' # Intersect the gene symbols to obtain common genes -#' common_genes <- intersect(ref_var, query_var) -#' -#' ref_data_subset <- ref_data[common_genes, ] -#' query_data_subset <- query_data[common_genes, ] -#' -#' # Run PCA on the reference data -#' ref_data_subset <- runPCA(ref_data_subset) -#' -#' # Example usage of the function -#' calculatePairwiseDistancesAndPlotDensity(query_data = query_data_subset, -#' reference_data = ref_data_subset, -#' n_components = 10, -#' query_cell_type_col = "labels", -#' ref_cell_type_col = "reclustered.broad", -#' 
cell_type_query = "CD8", -#' cell_type_reference = "CD8", -#' distance_metric = "euclidean") -#' -#' -#' @importFrom stats cor dist -#' @import SingleCellExperiment -#' @importFrom SummarizedExperiment assay -#' @export -#' -calculatePairwiseDistancesAndPlotDensity <- function(query_data, - reference_data, - n_components = 10, - query_cell_type_col, - ref_cell_type_col, - cell_type_query, - cell_type_reference, - distance_metric, - correlation_method = "pearson") { - - # Sanity checks - - # Check if query_data is a SingleCellExperiment object - if (!is(query_data, "SingleCellExperiment")) { - stop("query_data must be a SingleCellExperiment object.") - } - - # Check if reference_data is a SingleCellExperiment object - if (!is(reference_data, "SingleCellExperiment")) { - stop("reference_data must be a SingleCellExperiment object.") - } - - # Convert to matrix, optionally applying PCA dimensionality reduction - if(!is.null(n_components)){ - # Project query data onto PCA space of reference data - pca_output <- projectPCA(query_data = query_data, reference_data = reference_data, - n_components = n_components, return_value = "list") - ref_mat <- pca_output$ref[which(reference_data[[ref_cell_type_col]] == cell_type_reference), paste0("PC", 1:n_components)] - query_mat <- pca_output$query[which(query_data[[query_cell_type_col]] == cell_type_query), paste0("PC", 1:n_components)] - } else{ - - # Subset query and reference data to the specified cell types - query_data_subset <- query_data[, !is.na(query_data[[query_cell_type_col]]) & query_data[[query_cell_type_col]] == cell_type_query] - query_mat <- t(as.matrix(assay(query_data_subset, "logcounts"))) - ref_mat <- t(as.matrix(assay(reference_data[, which(reference_data[[ref_cell_type_col]] == cell_type_reference)], "logcounts"))) - } - - # Combine query and reference matrices - combined_mat <- rbind(query_mat, ref_mat) - - # Calculate pairwise distances or correlations for all comparisons - if (distance_metric == "correlation") { - if (correlation_method == "pearson") { - dist_matrix <-
cor(t(combined_mat), method = "pearson") - } else if (correlation_method == "spearman") { - dist_matrix <- cor(t(combined_mat), method = "spearman") - } else { - stop("Invalid correlation method. Available options: 'pearson', 'spearman'") - } - } else { - dist_matrix <- dist(combined_mat, method = distance_metric) - } - - # Convert dist_matrix to a square matrix - dist_matrix <- as.matrix(dist_matrix) - - # Extract the distances or correlations for the different pairwise comparisons - num_query_cells <- nrow(query_mat) - num_ref_cells <- nrow(ref_mat) - dist_query_query <- dist_matrix[1:num_query_cells, 1:num_query_cells] - dist_ref_ref <- dist_matrix[(num_query_cells+1):(num_query_cells+num_ref_cells), - (num_query_cells+1):(num_query_cells+num_ref_cells)] - dist_query_ref <- dist_matrix[1:num_query_cells, (num_query_cells+1):(num_query_cells+num_ref_cells)] - - # Create data frame for plotting - dist_df <- data.frame( - Comparison = c(rep("Query vs Query", length(dist_query_query)), - rep("Reference vs Reference", length(dist_ref_ref)), - rep("Query vs Reference", length(dist_query_ref))), - Distance = c(as.vector(dist_query_query), - as.vector(dist_ref_ref), - as.vector(dist_query_ref)) - ) - - # Plot density plots with improved aesthetics - ggplot2::ggplot(dist_df, ggplot2::aes(x = Distance, color = Comparison, fill = Comparison)) + - ggplot2::geom_density(alpha = 0.5, linewidth = 1) + # Updated: linewidth instead of size - ggplot2::scale_color_manual(values = c("#1f78b4", "#33a02c", "#e31a1c")) + - ggplot2::scale_fill_manual(values = c("#1f78b4", "#33a02c", "#e31a1c")) + - ggplot2::labs(x = ifelse(distance_metric == "correlation", - ifelse(correlation_method == "spearman", "Spearman Correlation", "Pearson Correlation"), - "Distance"), y = "Density", - title = "Pairwise Distance Analysis and Density Visualization") + - ggplot2::theme_minimal() + - ggplot2::theme( - plot.title = ggplot2::element_text(size = 16, hjust = 0.5, face = "bold"), - axis.title =
ggplot2::element_text(size = 14), - axis.text = ggplot2::element_text(size = 12), - legend.title = ggplot2::element_blank(), - legend.text = ggplot2::element_text(size = 12), - panel.grid.major = ggplot2::element_line(color = "gray", linetype = "dashed"), - panel.grid.minor = ggplot2::element_blank(), - panel.background = ggplot2::element_rect(fill = "white"), - panel.border = ggplot2::element_blank(), - legend.position = "top" - ) -} diff --git a/R/calculateSampleDistances.R b/R/calculateSampleDistances.R deleted file mode 100644 index 829c9e7..0000000 --- a/R/calculateSampleDistances.R +++ /dev/null @@ -1,147 +0,0 @@ -#' @title Compute Sample Distances Between Reference and Query Data -#' -#' @description This function computes the distances within the reference dataset and the distances from each query sample to all -#' reference samples for each cell type. It uses PCA for dimensionality reduction and Euclidean distance for distance calculation. -#' -#' @details The function first performs PCA on the reference dataset and projects the query dataset onto the same PCA space. -#' It then computes pairwise Euclidean distances within the reference dataset for each cell type, as well as distances from each -#' query sample to all reference samples of a particular cell type. The results are stored in a list, with one entry per cell type. -#' -#' @param query_data A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the query cells. -#' @param reference_data A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the reference cells. -#' @param n_components An integer specifying the number of principal components to use for projection. Defaults to 10. -#' @param query_cell_type_col character. The column name in the \code{colData} of \code{query_data} -#' that identifies the cell types. -#' @param ref_cell_type_col character. 
The column name in the \code{colData} of \code{reference_data} -#' that identifies the cell types. -#' @param pc_subset A numeric vector specifying which principal components to include in the plot. Default is PC1 to PC5. -#' -#' @return A list containing distance data for each cell type. Each entry in the list contains: -#' \describe{ -#' \item{ref_distances}{A vector of all pairwise distances within the reference subset for the cell type.} -#' \item{query_to_ref_distances}{A matrix of distances from each query sample to all reference samples for the cell type.} -#' } -#' -#' @export -#' -#' @author Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu} -#' -#' @seealso \code{\link{plot.calculateSampleDistances}} -#' -#' @examples -#' # Load required libraries -#' library(scRNAseq) -#' library(scuttle) -#' library(SingleR) -#' library(scran) -#' library(scater) -#' -#' # Load data (replace with your data loading) -#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) -#' -#' # Divide the data into reference and query datasets -#' set.seed(100) -#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -#' ref_data <- sce[, indices] -#' query_data <- sce[, -indices] -#' -#' # log transform datasets -#' ref_data <- scuttle::logNormCounts(ref_data) -#' query_data <- scuttle::logNormCounts(query_data) -#' -#' # Get cell type scores using SingleR (or any other cell type annotation method) -#' scores <- SingleR::SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) -#' -#' # Add labels to query object -#' colData(query_data)$labels <- scores$labels -#' -#' # Selecting highly variable genes (can be customized by the user) -#' ref_var <- getTopHVGs(ref_data, n = 2000) -#' query_var <- getTopHVGs(query_data, n = 2000) -#' -#' # Intersect the gene symbols to obtain common genes -#' common_genes <- intersect(ref_var, query_var) -#' ref_data_subset <- ref_data[common_genes, ] -#' query_data_subset 
<- query_data[common_genes, ] -#' -#' # Run PCA on the reference data -#' ref_data_subset <- runPCA(ref_data_subset) -#' -#' # Plot the PC data -#' distance_data <- calculateSampleDistances(query_data_subset, ref_data_subset, -#' n_components = 10, -#' query_cell_type_col = "labels", -#' ref_cell_type_col = "reclustered.broad", -#' pc_subset = c(1:10)) -#' -#' # Identify outliers for CD4 -#' cd4_anomalies <- detectAnomaly(ref_data_subset, query_data_subset, -#' query_cell_type_col = "labels", -#' ref_cell_type_col = "reclustered.broad", -#' n_components = 10, -#' n_tree = 500, -#' anomaly_treshold = 0.5)$CD4 -#' cd4_top5_anomalies <- names(sort(cd4_anomalies$query_anomaly_scores, decreasing = TRUE)[1:6]) -#' -#' # Plot the densities of the distances -#' plot(distance_data, ref_cell_type = "CD4", sample_names = cd4_top5_anomalies) -#' -# Function to compute distances within reference data and between query data and reference samples -calculateSampleDistances <- function(query_data, reference_data, - query_cell_type_col, - ref_cell_type_col, - n_components = 10, - pc_subset = c(1:5)) { - - # Get the projected PCA data - pca_output <- projectPCA(query_data = query_data, reference_data = reference_data, - n_components = n_components, - query_cell_type_col = query_cell_type_col, - ref_cell_type_col = ref_cell_type_col, - return_value = "list") - - # Get unique cell types - unique_cell_types <- na.omit(unique(c(colData(reference_data)[[ref_cell_type_col]], - colData(query_data)[[query_cell_type_col]]))) - - # Create a list to store distance data for each cell type - distance_data <- list() - - # Function to compute Euclidean distance between a vector and each row of a matrix - .compute_distances <- function(matrix, vector) { - - # Apply the distance function to each row of the matrix - distances <- apply(matrix, 1, function(row) { - sqrt(sum((row - vector) ^ 2)) - }) - - return(distances) - } - - for (cell_type in unique_cell_types) { - - # Subset principal component 
scores for current cell type - ref_subset_scores <- pca_output$ref[which(cell_type == reference_data[[ref_cell_type_col]]), pc_subset] - query_subset_scores <- pca_output$query[, pc_subset] - - # Compute all pairwise distances within the reference subset - ref_distances <- as.vector(dist(ref_subset_scores)) - - # Compute distances from each query sample to all reference samples - query_to_ref_distances <- apply(query_subset_scores, 1, function(query_sample, ref_subset_scores) { - .compute_distances(ref_subset_scores, query_sample) - }, ref_subset_scores = ref_subset_scores) - - # Store the distances - distance_data[[cell_type]] <- list( - ref_distances = ref_distances, - query_to_ref_distances = t(query_to_ref_distances) - ) - } - - # Add class of object - class(distance_data) <- c(class(distance_data), "calculateSampleDistances") - - # Return the distance data - return(distance_data) -} \ No newline at end of file diff --git a/R/calculateSampleDistancesSimilarity.R b/R/calculateSampleDistancesSimilarity.R deleted file mode 100644 index e0c0eae..0000000 --- a/R/calculateSampleDistancesSimilarity.R +++ /dev/null @@ -1,178 +0,0 @@ -#' @title Function to compute Bhattacharyya coefficients and Hellinger distances -#' -#' @description -#' This function computes Bhattacharyya coefficients and Hellinger distances to quantify the similarity of density -#' distributions between query samples and reference data for each cell type. - -#' -#' @details -#' This function first computes distance data using the \code{calculateSampleDistances} function, which calculates -#' pairwise distances between samples within the reference data and between query samples and reference samples in the PCA space. -#' Bhattacharyya coefficients and Hellinger distances are calculated to quantify the similarity of density distributions between query -#' samples and reference data for each cell type. 
The Bhattacharyya coefficient measures the similarity of two probability distributions, -#' while the Hellinger distance measures the distance between two probability distributions. -#' -#' Bhattacharyya coefficients range between 0 and 1. A value closer to 1 indicates higher similarity between distributions, while a value -#' closer to 0 indicates lower similarity. -#' -#' Hellinger distances range between 0 and 1. A value closer to 0 indicates higher similarity between distributions, while a value -#' closer to 1 indicates lower similarity. -#' -#' @param query_data A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the query cells. -#' @param reference_data A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the reference cells. -#' @param query_cell_type_col character. The column name in the \code{colData} of \code{query_data} -#' that identifies the cell types. -#' @param ref_cell_type_col character. The column name in the \code{colData} of \code{reference_data} -#' that identifies the cell types. -#' @param sample_names A character vector specifying the names of the query samples for which to compute distance measures. -#' @param n_components An integer specifying the number of principal components to use for projection. Defaults to 10. -#' @param pc_subset A numeric vector specifying which principal components to use in the distance computation. Default is PC1 to PC5. -#' -#' @return A list containing two data frames.
Each data frame has a \code{Sample} column followed by one column per cell type: -#' \describe{ -#' \item{bhattacharyya_coef}{The Bhattacharyya coefficients between each query sample and the reference data for each cell type.} -#' \item{hellinger_dist}{The Hellinger distances between each query sample and the reference data for each cell type.} -#' } -#' -#' @export -#' -#' @author Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu} -#' -#' @examples -#' # Load required libraries -#' library(scRNAseq) -#' library(scuttle) -#' library(SingleR) -#' library(scran) -#' library(scater) -#' -#' # Load data (replace with your data loading) -#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) -#' -#' # Divide the data into reference and query datasets -#' set.seed(100) -#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -#' ref_data <- sce[, indices] -#' query_data <- sce[, -indices] -#' -#' # log transform datasets -#' ref_data <- scuttle::logNormCounts(ref_data) -#' query_data <- scuttle::logNormCounts(query_data) -#' -#' # Get cell type scores using SingleR (or any other cell type annotation method) -#' scores <- SingleR::SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) -#' -#' # Add labels to query object -#' colData(query_data)$labels <- scores$labels -#' -#' # Selecting highly variable genes (can be customized by the user) -#' ref_var <- scran::getTopHVGs(ref_data, n = 2000) -#' query_var <- scran::getTopHVGs(query_data, n = 2000) -#' -#' # Intersect the gene symbols to obtain common genes -#' common_genes <- intersect(ref_var, query_var) -#' ref_data_subset <- ref_data[common_genes, ] -#' query_data_subset <- query_data[common_genes, ] -#' -#' # Run PCA on the reference data -#' ref_data_subset <- runPCA(ref_data_subset, ncomponents = 50) -#' -#' # Compute the distance data -#' distance_data <- calculateSampleDistances(query_data_subset, ref_data_subset, -#' n_components = 10, -#' query_cell_type_col = "labels", -#' ref_cell_type_col =
"reclustered.broad", -#' pc_subset = c(1:10)) -#' -#' # Identify outliers for CD4 -#' cd4_anomalies <- detectAnomaly(ref_data_subset, query_data_subset, -#' query_cell_type_col = "labels", -#' ref_cell_type_col = "reclustered.broad", -#' n_components = 10, -#' n_tree = 500, -#' anomaly_treshold = 0.5)$CD4 -#' cd4_top5_anomalies <- names(sort(cd4_anomalies$query_anomaly_scores, decreasing = TRUE)[1:6]) -#' -#' # Get overlap measures -#' overlap_measures <- calculateSampleDistancesSimilarity(query_data_subset,ref_data_subset, -#' sample_names = cd4_top5_anomalies, -#' n_components = 10, -#' query_cell_type_col = "labels", -#' ref_cell_type_col = "reclustered.broad", -#' pc_subset = c(1:10)) -#' -#' -# Function to compute Bhattacharyya coefficients and Hellinger distances -calculateSampleDistancesSimilarity <- function(query_data, reference_data, - query_cell_type_col, - ref_cell_type_col, - sample_names, - n_components = 10, - pc_subset = c(1:5)) { - - # Check if samples are available in data for that cell type - if(!all(sample_names %in% colnames(query_data))) - stop("One or more specified 'sample_names' are not available for that cell type.") - - # Compute distance data - query_data_subset <- query_data[, sample_names] - distance_data <- calculateSampleDistances(query_data = query_data_subset, reference_data = reference_data, - query_cell_type_col = query_cell_type_col, - ref_cell_type_col = ref_cell_type_col, - n_components = n_components, - pc_subset = pc_subset) - - # Initialize empty lists to store results - bhattacharyya_list <- list() - hellinger_list <- list() - - # Iterate over each cell type - for (cell_type in names(distance_data)) { - - # Extract distances within the reference dataset for the current cell type - ref_distances <- distance_data[[cell_type]]$ref_distances - - # Compute density of reference distances - ref_density <- density(ref_distances) - - # Initialize an empty vector to store overlap measures for the current cell type - 
bhattacharyya_coef <- numeric(length(sample_names)) - hellinger_dist <- numeric(length(sample_names)) - - # Iterate over each sample - for (i in 1:length(sample_names)) { - - # Extract distances from the current sample to reference samples - sample_distances <- distance_data[[cell_type]]$query_to_ref_distances[sample_names[i], ] - - # Compute density of sample distances - sample_density <- density(sample_distances) - - # Create a common grid for evaluating densities - common_grid <- seq(min(min(ref_density$x), min(sample_density$x), 0), - max(max(ref_density$x), max(sample_density$x)), length.out = 1000) - - # Interpolate densities onto the common grid - ref_density_interp <- approxfun(ref_density$x, ref_density$y)(common_grid) - ref_density_interp[is.na(ref_density_interp)] <- 0 - sample_density_interp <- approxfun(sample_density$x, sample_density$y)(common_grid) - sample_density_interp[is.na(sample_density_interp)] <- 0 - - # Compute and store Bhattacharyya coefficient/Hellinger distance - bhattacharyya_coef[i] <- sum(sqrt(ref_density_interp * sample_density_interp) * mean(diff(common_grid))) - hellinger_dist[i] <- sqrt(1 - sum(sqrt(ref_density_interp * sample_density_interp)) * mean(diff(common_grid))) - } - - # Store overlap measures for the current cell type - bhattacharyya_list[[cell_type]] <- bhattacharyya_coef - hellinger_list[[cell_type]] <- hellinger_dist - } - - # Return list with overlap measures - bhattacharyya_coef <- data.frame(Sample = sample_names, bhattacharyya_list) - hellinger_dist <- data.frame(Sample = sample_names, hellinger_list) - return(list(bhattacharyya_coef = bhattacharyya_coef, - hellinger_dist = hellinger_dist)) -} - - diff --git a/R/calculateSampleSimilarityPCA.R b/R/calculateSampleSimilarityPCA.R deleted file mode 100644 index f6c31ac..0000000 --- a/R/calculateSampleSimilarityPCA.R +++ /dev/null @@ -1,128 +0,0 @@ -#' @title Calculate Sample Similarity Using PCA Loadings -#' -#' @description -#' This function calculates the cosine 
similarity between samples based on the principal components (PCs) -#' obtained from PCA (Principal Component Analysis) loadings. -#' -#' @details -#' This function calculates the cosine similarity between samples based on the loadings of the selected -#' principal components obtained from PCA. It extracts the rotation matrix from the PCA results of the -#' \code{\linkS4class{SingleCellExperiment}} object and identifies the high-loading variables for each selected PC. -#' Then, it computes the cosine similarity between samples using the high-loading variables for each PC. -#' -#' @param se_object A \code{\linkS4class{SingleCellExperiment}} object containing expression data. -#' @param samples A character vector specifying the samples for which to compute the similarity. -#' @param pc_subset A numeric vector specifying the subset of principal components to consider (default: c(1:5)). -#' @param n_top_vars An integer indicating the number of top loading variables to consider for each PC (default: 50). -#' -#' @return A data frame containing cosine similarity values between samples for each selected principal component. 
-#' -#' @export -#' -#' @author Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu} -#' -#' @seealso \code{\link{plot.calculateSampleSimilarityPCA}} -#' -#' @examples -#' # Load required libraries -#' library(scRNAseq) -#' library(scuttle) -#' library(SingleR) -#' library(scran) -#' library(scater) -#' -#' # Load data (replace with your data loading) -#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) -#' -#' # Divide the data into reference and query datasets -#' set.seed(100) -#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -#' ref_data <- sce[, indices] -#' query_data <- sce[, -indices] -#' -#' # log transform datasets -#' ref_data <- scuttle::logNormCounts(ref_data) -#' query_data <- scuttle::logNormCounts(query_data) -#' -#' # Get cell type scores using SingleR (or any other cell type annotation method) -#' scores <- SingleR::SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) -#' -#' # Add labels to query object -#' colData(query_data)$labels <- scores$labels -#' -#' # Selecting highly variable genes (can be customized by the user) -#' ref_var <- scran::getTopHVGs(ref_data, n = 2000) -#' query_var <- scran::getTopHVGs(query_data, n = 2000) -#' -#' # Intersect the gene symbols to obtain common genes -#' common_genes <- intersect(ref_var, query_var) -#' ref_data_subset <- ref_data[common_genes, ] -#' query_data_subset <- query_data[common_genes, ] -#' -#' # Run PCA on the reference data (assumed to be prepared) -#' ref_data_subset <- runPCA(ref_data_subset) -#' -#' # Store PCA anomaly data and plots -#' anomaly_output <- detectAnomaly(reference_data = ref_data_subset, -#' ref_cell_type_col = "reclustered.broad", -#' n_components = 10, -#' n_tree = 500, -#' anomaly_treshold = 0.5) -#' top6_anomalies <- names(sort(anomaly_output$Combined$reference_anomaly_scores, -#' decreasing = TRUE)[1:6]) -#' -#' # Compute cosine similarity between anomalies and top PCs -#' 
cosine_similarities <- calculateSampleSimilarityPCA(ref_data_subset, samples = top6_anomalies, -#' pc_subset = c(1:10), n_top_vars = 50) -#' cosine_similarities -#' -#' # Plot similarities -#' plot(cosine_similarities, pc_subset = c(1:5)) -#' -# Function to calculate cosine similarities between samples and PCs -calculateSampleSimilarityPCA <- function(se_object, samples, pc_subset = c(1:5), n_top_vars = 50){ - - # Extract rotation matrix for SingleCellExperiment object - rotation_mat <- attributes(reducedDim(se_object, "PCA"))$rotation[, pc_subset] - - # Function to identify high-loading variables for each PC - .getHighLoadingVars <- function(rotation_mat, n_top_vars) { - high_loading_vars <- lapply(1:ncol(rotation_mat), function(pc) { - abs_loadings <- abs(rotation_mat[, pc]) - top_vars <- names(sort(abs_loadings, decreasing = TRUE))[1:n_top_vars] - return(top_vars) - }) - return(high_loading_vars) - } - - # Extract high-loading variables - high_loading_vars <- .getHighLoadingVars(rotation_mat, n_top_vars) - - # Function to compute cosine similarity - .cosine_similarity <- function(vector1, vector2) { - sum(vector1 * vector2) / (sqrt(sum(vector1^2)) * sqrt(sum(vector2^2))) - } - - # Function to compute cosine similarity for each PC using high-loading variables - .computeCosineSimilarity <- function(samples, rotation_mat, high_loading_vars) { - similarities <- lapply(1:length(high_loading_vars), function(pc) { - vars <- high_loading_vars[[pc]] - sample_subset <- samples[, vars, drop = FALSE] - pc_vector <- rotation_mat[vars, pc] - apply(sample_subset, 1, .cosine_similarity, vector2 = pc_vector) - }) - return(similarities) - } - - # Calculate similarities - assay_mat <- t(as.matrix(assay(se_object[, samples], "logcounts"))) - similarities <- .computeCosineSimilarity(assay_mat, rotation_mat, high_loading_vars) - - # Format the result into a data frame for easy interpretation - similarity_df <- do.call(cbind, similarities) - colnames(similarity_df) <- paste0("PC", 
1:ncol(rotation_mat)) - - # Update class of output - class(similarity_df) <- c(class(similarity_df), "calculateSampleSimilarityPCA") - return(similarity_df) -} \ No newline at end of file diff --git a/R/calculateVarImpOverlap.R b/R/calculateVarImpOverlap.R deleted file mode 100644 index 4060103..0000000 --- a/R/calculateVarImpOverlap.R +++ /dev/null @@ -1,140 +0,0 @@ -#' @title Compare Gene Importance Across Datasets Using Random Forest -#' -#' @description This function identifies and compares the most important genes for differentiating cell types between a query dataset -#' and a reference dataset using Random Forest. -#' -#' @details This function uses the Random Forest algorithm to calculate the importance of genes in differentiating between cell types -#' within both a reference dataset and a query dataset. The function then compares the top genes identified in both datasets to determine -#' the overlap in their importance scores. -#' -#' @param query_data A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the query cells. -#' @param reference_data A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the reference cells. -#' @param query_cell_type_col A character string specifying the column name in the query dataset containing cell type annotations. -#' @param ref_cell_type_col A character string specifying the column name in the reference dataset containing cell type annotations. -#' @param n_tree An integer specifying the number of trees to grow in the Random Forest. Default is 500. -#' @param n_top An integer specifying the number of top genes to consider when comparing variable importance scores. Default is 20. 
-#' -#' @return A list containing three elements: -#' \item{var_imp_ref}{A list of data frames containing variable importance scores for each combination of cell types in the reference -#' dataset.} -#' \item{var_imp_query}{A list of data frames containing variable importance scores for each combination of cell types in the query -#' dataset.} -#' \item{var_imp_comparison}{A named vector indicating the proportion of top genes that overlap between the reference and query -#' datasets for each combination of cell types.} -#' -#' @export -#' -#' @author Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu} -#' -#' @examples -#' # Load necessary libraries -#' library(scRNAseq) -#' library(scuttle) -#' library(SingleR) -#' library(scran) -#' -#' # Load data -#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) -#' -#' # Divide the data into reference and query datasets -#' set.seed(100) -#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -#' ref_data <- sce[, indices] -#' query_data <- sce[, -indices] -#' -#' # log transform datasets -#' ref_data <- logNormCounts(ref_data) -#' query_data <- logNormCounts(query_data) -#' -#' # Get cell type scores using SingleR (or any other cell type annotation method) -#' scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) -#' -#' # Add labels to query object -#' colData(query_data)$labels <- scores$labels -#' -#' # Selecting highly variable genes (can be customized by the user) -#' ref_var <- getTopHVGs(ref_data, n = 500) -#' query_var <- getTopHVGs(query_data, n = 500) -#' -#' # Intersect the gene symbols to obtain common genes -#' common_genes <- intersect(ref_var, query_var) -#' ref_data_subset <- ref_data[common_genes, ] -#' query_data_subset <- query_data[common_genes, ] -#' -#' # Compare variable importance scores between the datasets -#' rf_output <- calculateVarImpOverlap(query_data_subset, ref_data_subset, -#' query_cell_type_col = "labels", -#' ref_cell_type_col
= "reclustered.broad", -#' n_tree = 500, -#' n_top = 20) -#' -#' -# RF function to compare (between datasets) which genes are best at differentiating cell types from each other -calculateVarImpOverlap <- function(query_data, - reference_data, - query_cell_type_col, - ref_cell_type_col, - n_tree = 500, - n_top = 20){ - - # Extract assay data for reference and query datasets - ref_x <- t(as.matrix(assay(reference_data, "logcounts"))) - query_x <- t(as.matrix(assay(query_data, "logcounts"))) - - # Extract labels from reference and query datasets - ref_y <- reference_data[[ref_cell_type_col]] - query_y <- query_data[[query_cell_type_col]] - - # Remove NA from reference (logical indexing keeps all rows when there are no NAs) - ref_x <- ref_x[!is.na(ref_y),] - ref_y <- ref_y[!is.na(ref_y)] - - # Finding importance scores for each cell type in reference dataset - var_imp_ref <- list() - cell_types <- unique(intersect(ref_y, query_y)) - cell_types_combn <- combn(length(cell_types), 2) - for(combn_id in 1:ncol(cell_types_combn)){ - - ref_x_subset <- ref_x[which(ref_y %in% c(cell_types[cell_types_combn[1, combn_id]], cell_types[cell_types_combn[2, combn_id]])),] - ref_y_subset <- ref_y[which(ref_y %in% c(cell_types[cell_types_combn[1, combn_id]], cell_types[cell_types_combn[2, combn_id]]))] - training_data <- data.frame(ref_x_subset, cell_type = factor(ref_y_subset)) - rf_binary <- ranger::ranger(cell_type ~ ., data = training_data, num.trees = n_tree, importance = "impurity") - var_importance_name <- paste0(cell_types[cell_types_combn[1, combn_id]], "-", cell_types[cell_types_combn[2, combn_id]]) - var_imp_ref[[var_importance_name]] <- rf_binary$variable.importance - var_imp_ref[[var_importance_name]] <- - data.frame(Gene = names(var_imp_ref[[var_importance_name]])[order(var_imp_ref[[var_importance_name]], - decreasing = TRUE)], - RF_Importance = var_imp_ref[[var_importance_name]][order(var_imp_ref[[var_importance_name]], - decreasing = TRUE)]) - } - - # Finding importance scores for each cell type in query dataset -
var_imp_query <- list() - for(combn_id in 1:ncol(cell_types_combn)){ - - query_x_subset <- query_x[which(query_y %in% c(cell_types[cell_types_combn[1, combn_id]], cell_types[cell_types_combn[2, combn_id]])),] - query_y_subset <- query_y[which(query_y %in% c(cell_types[cell_types_combn[1, combn_id]], cell_types[cell_types_combn[2, combn_id]]))] - training_data <- data.frame(query_x_subset, cell_type = factor(query_y_subset)) - rf_binary <- ranger::ranger(cell_type ~ ., data = training_data, num.trees = n_tree, importance = "impurity") - var_importance_name <- paste0(cell_types[cell_types_combn[1, combn_id]], "-", cell_types[cell_types_combn[2, combn_id]]) - var_imp_query[[var_importance_name]] <- rf_binary$variable.importance - var_imp_query[[var_importance_name]] <- - data.frame(Gene = names(var_imp_query[[var_importance_name]])[order(var_imp_query[[var_importance_name]], - decreasing = TRUE)], - RF_Importance = var_imp_query[[var_importance_name]][order(var_imp_query[[var_importance_name]], - decreasing = TRUE)]) - } - - # Comparison vector - var_imp_comparison <- rep(NA, length(var_imp_ref)) - names(var_imp_comparison) <- names(var_imp_ref) - for(cells in names(var_imp_comparison)){ - var_imp_comparison[cells] <- length(intersect(var_imp_ref[[cells]]$Gene[1:n_top], - var_imp_query[[cells]]$Gene[1:n_top])) / n_top - } - - # Return variable importance scores for each combination of cell types in each dataset and the comparison - return(list(var_imp_ref = var_imp_ref, - var_imp_query = var_imp_query, - var_imp_comparison = var_imp_comparison)) -} \ No newline at end of file diff --git a/R/compareCCA.R b/R/compareCCA.R deleted file mode 100644 index 214f37f..0000000 --- a/R/compareCCA.R +++ /dev/null @@ -1,163 +0,0 @@ -#' @title Compare Subspaces Spanned by Top Principal Components Using Canonical Correlation Analysis -#' -#' @description -#' This function compares the subspaces spanned by the top principal components (PCs) of the reference -#' and query datasets using canonical
correlation analysis (CCA). It calculates the canonical variables, -#' correlations, and a similarity measure for the subspaces. -#' -#' @details -#' This function performs canonical correlation analysis (CCA) to compare the subspaces spanned by the -#' top principal components (PCs) of the reference and query datasets. The function extracts the rotation -#' matrices corresponding to the specified PCs and performs CCA on these matrices. It computes the canonical -#' variables and their corresponding correlations. Additionally, it calculates a similarity measure for the -#' canonical variables using cosine similarity. The output is a list containing the canonical coefficients -#' for both datasets, the cosine similarity values, and the canonical correlations. - -#' -#' @param query_data A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the query cells. -#' @param reference_data A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the reference cells. -#' @param pc_subset A numeric vector specifying the subset of principal components (PCs) -#' to compare. Default is the first five PCs. -#' @param n_top_vars An integer indicating the number of top loading variables to consider for each PC. Default is 25. 
-#' -#' @return A list containing the following elements: -#' \describe{ -#' \item{coef_ref}{Canonical coefficients for the reference dataset.} -#' \item{coef_query}{Canonical coefficients for the query dataset.} -#' \item{cosine_similarity}{Cosine similarity values for the canonical variables.} -#' \item{correlations}{Canonical correlations between the reference and query datasets.} -#' } -#' -#' @export -#' -#' @author Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu} -#' -#' @seealso \code{\link{plot.compareCCA}} -#' -#' @examples -#' # Load necessary library -#' library(scRNAseq) -#' library(scuttle) -#' library(scran) -#' library(SingleR) -#' library(ggplot2) -#' library(scater) -#' -#' # Load data -#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) -#' -#' # Divide the data into reference and query datasets -#' set.seed(100) -#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -#' ref_data <- sce[, indices] -#' query_data <- sce[, -indices] -#' -#' # Log transform datasets -#' ref_data <- logNormCounts(ref_data) -#' query_data <- logNormCounts(query_data) -#' -#' # Get cell type scores using SingleR (or any other cell type annotation method) -#' scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) -#' -#' # Add labels to query object -#' colData(query_data)$labels <- scores$labels -#' -#' # Selecting highly variable genes (can be customized by the user) -#' ref_var <- getTopHVGs(ref_data, n = 500) -#' query_var <- getTopHVGs(query_data, n = 500) -#' -#' # Intersect the gene symbols to obtain common genes -#' common_genes <- intersect(ref_var, query_var) -#' ref_data_subset <- ref_data[common_genes, ] -#' query_data_subset <- query_data[common_genes, ] -#' -#' # Subset reference and query data for a specific cell type -#' ref_data_subset <- ref_data_subset[, which(ref_data_subset$reclustered.broad == "CD8")] -#' query_data_subset <- query_data_subset[, 
which(colData(query_data_subset)$labels == "CD8")] -#' -#' # Run PCA on the reference and query datasets -#' ref_data_subset <- runPCA(ref_data_subset, ncomponents = 50) -#' query_data_subset <- runPCA(query_data_subset, ncomponents = 50) -#' -#' # Compare subspaces using CCA -#' cca_comparison <- compareCCA(query_data_subset, ref_data_subset, -#' pc_subset = c(1:5), n_top_vars = 25) -#' -#' # Visualize output of CCA comparison -#' plot(cca_comparison) -#' -#' -# Function to compare subspace spanned by top PCs in reference and query datasets -compareCCA <- function(query_data, reference_data, - pc_subset = c(1:5), - n_top_vars = 25){ - - # Check if query_data is a SingleCellExperiment object - if (!is(query_data, "SingleCellExperiment")) { - stop("query_data must be a SingleCellExperiment object.") - } - - # Check if reference_data is a SingleCellExperiment object - if (!is(reference_data, "SingleCellExperiment")) { - stop("reference_data must be a SingleCellExperiment object.") - } - - # Check if genes in both datasets are the same - if(!all(rownames(attributes(reducedDim(query_data, "PCA"))$rotation) %in% - rownames(attributes(reducedDim(reference_data, "PCA"))$rotation))) - stop("The genes in the rotation matrices differ.
Consider decreasing the number of genes used for PCA.") - - # Check input if PC subset is valid - if(!all(c(pc_subset %in% 1:ncol(reducedDim(reference_data, "PCA")), - pc_subset %in% 1:ncol(reducedDim(query_data, "PCA"))))) - stop("\'pc_subset\' is out of range.") - - # Extract the rotation matrices - ref_rotation <- attributes(reducedDim(reference_data, "PCA"))$rotation[, pc_subset] - query_rotation <- attributes(reducedDim(query_data, "PCA"))$rotation[, pc_subset] - - # Function to identify high-loading variables for each PC - .getHighLoadingVars <- function(rotation_mat, n_top_vars) { - high_loading_vars <- lapply(1:ncol(rotation_mat), function(pc) { - abs_loadings <- abs(rotation_mat[, pc]) - top_vars <- names(sort(abs_loadings, decreasing = TRUE))[1:n_top_vars] - return(top_vars) - }) - return(high_loading_vars) - } - - # Get union of variables with highest loadings - top_ref <- .getHighLoadingVars(ref_rotation, n_top_vars) - top_query <- .getHighLoadingVars(query_rotation, n_top_vars) - top_union <- unlist(lapply(1:length(pc_subset), function(i) return(union(top_ref[[i]], top_query[[i]])))) - - # Perform CCA - cca_result <- cancor(ref_rotation, query_rotation) - - # Extract canonical variables and correlations - canonical_ref <- cca_result$xcoef - canonical_query <- cca_result$ycoef - correlations <- cca_result$cor - - # Function to compute similarity measure (e.g., cosine similarity) - .cosine_similarity <- function(u, v) { - return(abs(sum(u * v)) / (sqrt(sum(u^2)) * sqrt(sum(v^2)))) - } - - # Compute similarities and account for correlations - similarities <- rep(0, length(pc_subset)) - for (i in 1:length(pc_subset)) { - similarities[i] <- .cosine_similarity(canonical_ref[, i], canonical_query[, i]) - } - - # Update class of return output - output <- list(coef_ref = canonical_ref, - coef_query = canonical_query, - cosine_similarity = similarities, - correlations = correlations) - class(output) <- c(class(output), "compareCCA") - - # Return cosine
similarity output - return(output) -} - diff --git a/R/comparePCA.R b/R/comparePCA.R deleted file mode 100644 index 53ad60f..0000000 --- a/R/comparePCA.R +++ /dev/null @@ -1,185 +0,0 @@ -#' @title Compare Principal Components Analysis (PCA) Results -#' -#' @description This function compares the principal components (PCs) obtained from separate PCA on reference and query -#' datasets for a single cell type using either cosine similarity or correlation. -#' -#' @details -#' This function compares the PCA results between the reference and query datasets by computing cosine -#' similarities or correlations between the loadings of top variables for each pair of principal components. It first -#' extracts the PCA rotation matrices from both datasets and identifies the top variables with highest loadings for -#' each PC. Then, it computes the cosine similarities or correlations between the loadings of top variables for each -#' pair of PCs. The resulting matrix contains the similarity values, where rows represent reference PCs and columns -#' represent query PCs. -#' -#' @param query_data A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the query cells. -#' @param reference_data A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the reference cells. -#' @param pc_subset A numeric vector specifying the subset of principal components (PCs) to compare. Default is the first five PCs. -#' @param n_top_vars An integer indicating the number of top loading variables to consider for each PC. Default is 50. -#' @param metric The similarity metric to use. It can be either "cosine" or "correlation". Default is "cosine". -#' @param correlation_method The correlation method to use if metric is "correlation". It can be "spearman" -#' or "pearson". Default is "spearman". -#' -#' @return A similarity matrix comparing the principal components of the reference and query datasets. 
-#' Each element (i, j) in the matrix represents the similarity between the i-th principal component -#' of the reference dataset and the j-th principal component of the query dataset. -#' -#' @export -#' -#' @author Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu} -#' -#' @seealso \code{\link{plot.comparePCA}} -#' -#' @examples -#' # Load necessary library -#' library(scRNAseq) -#' library(scuttle) -#' library(scran) -#' library(SingleR) -#' library(ComplexHeatmap) -#' library(scater) -#' -#' # Load data -#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) -#' -#' # Divide the data into reference and query datasets -#' set.seed(100) -#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -#' ref_data <- sce[, indices] -#' query_data <- sce[, -indices] -#' -#' # Log transform datasets -#' ref_data <- logNormCounts(ref_data) -#' query_data <- logNormCounts(query_data) -#' -#' # Get cell type scores using SingleR (or any other cell type annotation method) -#' scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) -#' -#' # Add labels to query object -#' colData(query_data)$labels <- scores$labels -#' -#' # Selecting highly variable genes (can be customized by the user) -#' ref_var <- getTopHVGs(ref_data, n = 500) -#' query_var <- getTopHVGs(query_data, n = 500) -#' -#' # Intersect the gene symbols to obtain common genes -#' common_genes <- intersect(ref_var, query_var) -#' ref_data_subset <- ref_data[common_genes, ] -#' query_data_subset <- query_data[common_genes, ] -#' -#' # Subset reference and query data for a specific cell type -#' ref_data_subset <- ref_data_subset[, which(ref_data_subset$reclustered.broad == "CD8")] -#' query_data_subset <- query_data_subset[, which(colData(query_data_subset)$labels == "CD8")] -#' -#' # Run PCA on the reference and query datasets -#' ref_data_subset <- runPCA(ref_data_subset) -#' query_data_subset <- runPCA(query_data_subset) -#' 
-#' # Call the PCA comparison function -#' similarity_mat <- comparePCA(query_data_subset, ref_data_subset, -#' pc_subset = c(1:5), -#' n_top_vars = 50, -#' metric = c("cosine", "correlation")[1], -#' correlation_method = c("spearman", "pearson")[1]) -#' -#' # Create the heatmap -#' plot(similarity_mat) -#' -# Compare PCA vectors of reference and query datasets for specific cell type. -comparePCA <- function(query_data, reference_data, - pc_subset = c(1:5), - n_top_vars = 50, - metric = c("cosine", "correlation")[1], - correlation_method = c("spearman", "pearson")[1]){ - - # Check if query_data is a SingleCellExperiment object - if (!is(query_data, "SingleCellExperiment")) { - stop("query_data must be a SingleCellExperiment object.") - } - - # Check if reference_data is a SingleCellExperiment object - if (!is(reference_data, "SingleCellExperiment")) { - stop("reference_data must be a SingleCellExperiment object.") - } - - # Check if genes in both datasets are the same - if(!all(rownames(attributes(reducedDim(query_data, "PCA"))$rotation) %in% - rownames(attributes(reducedDim(reference_data, "PCA"))$rotation))) - stop("The genes in the rotation matrices differ.
Consider decreasing the number of genes used for PCA.") - - # Check input if PC subset is valid - if(!all(c(pc_subset %in% 1:ncol(reducedDim(reference_data, "PCA")), - pc_subset %in% 1:ncol(reducedDim(query_data, "PCA"))))) - stop("\'pc_subset\' is out of range.") - - # Check input for metric - if(!(metric %in% c("cosine", "correlation"))) - stop("\'metric\' should be one of \'cosine\' or \'correlation\'.") - - # Check input for correlation method - if(!(correlation_method %in% c("spearman", "pearson"))) - stop("\'correlation_method\' should be one of \'spearman\' or \'pearson\'.") - - # Extract PCA data from reference and query data - ref_rotation <- attributes(reducedDim(reference_data, "PCA"))$rotation[, pc_subset] - query_rotation <- attributes(reducedDim(query_data, "PCA"))$rotation[, pc_subset] - - # Function to identify high-loading variables for each PC - .getHighLoadingVars <- function(rotation_mat, n_top_vars) { - high_loading_vars <- lapply(1:ncol(rotation_mat), function(pc) { - abs_loadings <- abs(rotation_mat[, pc]) - top_vars <- names(sort(abs_loadings, decreasing = TRUE))[1:n_top_vars] - return(top_vars) - }) - return(high_loading_vars) - } - - # Get union of variables with highest loadings - top_ref <- .getHighLoadingVars(ref_rotation, n_top_vars) - top_query <- .getHighLoadingVars(query_rotation, n_top_vars) - top_union <- lapply(1:length(pc_subset), function(i) return(union(top_ref[[i]], top_query[[i]]))) - - # Initialize a matrix to store cosine similarities - similarity_matrix <- matrix(NA, nrow = length(pc_subset), ncol = length(pc_subset)) - - if(metric == "cosine"){ - # Function to compute cosine similarity - .cosine_similarity <- function(x, y) { - sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2))) - } - - # Loop over each pair of columns and compute cosine similarity - for (i in 1:length(pc_subset)) { - for (j in 1:length(pc_subset)) { - combination_union <- union(top_union[[i]], top_union[[j]]) - similarity_matrix[i, j] <- 
.cosine_similarity(ref_rotation[combination_union, i], query_rotation[combination_union, j]) - } - } - } else if(metric == "correlation"){ - # Loop over each pair of columns and compute correlation - for (i in 1:length(pc_subset)) { - for (j in 1:length(pc_subset)) { - combination_union <- union(top_union[[i]], top_union[[j]]) - similarity_matrix[i, j] <- cor(ref_rotation[combination_union, i], query_rotation[combination_union, j], - method = correlation_method) - } - } - } - - # Add rownames and colnames with % of variance explained for each PC of each dataset - rownames(similarity_matrix) <- paste0("Ref PC", pc_subset, " (", - round(attributes(reducedDim(reference_data, "PCA"))$varExplained[pc_subset] / - sum(attributes(reducedDim(reference_data, "PCA"))$varExplained[pc_subset]) * - 100, 1), "%)") - colnames(similarity_matrix) <- paste0("Query PC", pc_subset, " (", - round(attributes(reducedDim(query_data, "PCA"))$varExplained[pc_subset] / - sum(attributes(reducedDim(query_data, "PCA"))$varExplained[pc_subset]) * - 100, 1), "%)") - - # Update class of return output - class(similarity_matrix) <- c(class(similarity_matrix), "comparePCA") - - # Return similarity matrix - return(similarity_matrix) -} - - diff --git a/R/comparePCASubspace.R b/R/comparePCASubspace.R deleted file mode 100644 index 5b292db..0000000 --- a/R/comparePCASubspace.R +++ /dev/null @@ -1,153 +0,0 @@ -#' @title Compare Subspaces Spanned by Top Principal Components -#' -#' @description -#' This function compares the subspace spanned by the top principal components (PCs) in a reference dataset to that -#' in a query dataset. It computes the cosine similarity between the loadings of the top variables for each PC in -#' both datasets and provides a weighted cosine similarity score. -#' -#' @details -#' This function compares the subspace spanned by the top principal components (PCs) in a reference dataset -#' to that in a query dataset.
It first computes the cosine similarity between the loadings of the top variables -#' for each PC in both datasets. The top cosine similarity scores are then selected, and their corresponding PC -#' indices are stored. Additionally, the function calculates the average percentage of variance explained by the -#' selected top PCs. Finally, it computes a weighted cosine similarity score based on the top cosine similarities -#' and the average percentage of variance explained. -#' -#' @param query_data A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the query cells. -#' @param reference_data A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the reference cells. -#' @param pc_subset A numeric vector specifying the subset of principal components (PCs) to compare. Default is the first five PCs. -#' @param n_top_vars An integer indicating the number of top loading variables to consider for each PC. Default is 50. 
-#' -#' @return A list containing the following components: -#' \item{principal_angles_cosines}{A numeric vector of cosine values of principal angles.} -#' \item{average_variance_explained}{A numeric vector of average variance explained by each PC.} -#' \item{weighted_cosine_similarity}{A numeric value representing the weighted cosine similarity.} -#' -#' @export -#' -#' @author Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu} -#' -#' @seealso \code{\link{plot.comparePCASubspace}} -#' -#' @examples -#' # Load necessary library -#' library(scRNAseq) -#' library(scuttle) -#' library(scran) -#' library(SingleR) -#' library(ggplot2) -#' library(scater) -#' -#' # Load data -#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) -#' -#' # Divide the data into reference and query datasets -#' set.seed(100) -#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -#' ref_data <- sce[, indices] -#' query_data <- sce[, -indices] -#' -#' # Log transform datasets -#' ref_data <- logNormCounts(ref_data) -#' query_data <- logNormCounts(query_data) -#' -#' # Get cell type scores using SingleR (or any other cell type annotation method) -#' scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) -#' -#' # Add labels to query object -#' colData(query_data)$labels <- scores$labels -#' -#' # Selecting highly variable genes (can be customized by the user) -#' ref_var <- getTopHVGs(ref_data, n = 500) -#' query_var <- getTopHVGs(query_data, n = 500) -#' -#' # Intersect the gene symbols to obtain common genes -#' common_genes <- intersect(ref_var, query_var) -#' ref_data_subset <- ref_data[common_genes, ] -#' query_data_subset <- query_data[common_genes, ] -#' -#' # Subset reference and query data for a specific cell type -#' ref_data_subset <- ref_data_subset[, which(ref_data_subset$reclustered.broad == "CD8")] -#' query_data_subset <- query_data_subset[, which(colData(query_data_subset)$labels 
== "CD8")] -#' -#' # Run PCA on the reference and query datasets -#' ref_data_subset <- runPCA(ref_data_subset, ncomponents = 50) -#' query_data_subset <- runPCA(query_data_subset, ncomponents = 50) -#' -#' # Compare PCA subspaces -#' subspace_comparison <- comparePCASubspace(query_data_subset, ref_data_subset, -#' pc_subset = c(1:5), n_top_vars = 50) -#' -#' # Visualize the subspace comparison -#' plot(subspace_comparison) -#' -# Function to compare subspace spanned by top PCs in reference and query datasets -comparePCASubspace <- function(query_data, reference_data, - pc_subset = c(1:5), - n_top_vars = 50){ - - # Check if query_data is a SingleCellExperiment object - if (!is(query_data, "SingleCellExperiment")) { - stop("query_data must be a SingleCellExperiment object.") - } - - # Check if reference_data is a SingleCellExperiment object - if (!is(reference_data, "SingleCellExperiment")) { - stop("reference_data must be a SingleCellExperiment object.") - } - - # Check if genes in both datasets are the same - if(!all(rownames(attributes(reducedDim(query_data, "PCA"))$rotation) %in% - rownames(attributes(reducedDim(reference_data, "PCA"))$rotation))) - stop("The genes in the rotation matrices differ.
Consider decreasing the number of genes used for PCA.") - - # Check input if PC subset is valid - if(!all(c(pc_subset %in% 1:ncol(reducedDim(reference_data, "PCA")), - pc_subset %in% 1:ncol(reducedDim(query_data, "PCA"))))) - stop("\'pc_subset\' is out of range.") - - # Compute the cosine similarity (cosine of principal angle) - cosine_similarity <- comparePCA(query_data = query_data, reference_data = reference_data, - pc_subset = pc_subset, n_top_vars = n_top_vars, metric = "cosine") - - # Vector to store top cosine similarities - top_cosine <- numeric(length(pc_subset)) - # Matrix to store PC IDs for each top cosine similarity - cosine_id <- matrix(NA, nrow = length(pc_subset), ncol = 2) - colnames(cosine_id) <- c("Ref", "Query") - - # Looping to store top cosine similarities and PC IDs - for(id in 1:length(pc_subset)){ - - # Store data for top cosine - top_ref <- which.max(apply(cosine_similarity, 1, max)) - top_query <- which.max(cosine_similarity[top_ref,]) - top_cosine[id] <- cosine_similarity[top_ref, top_query] - cosine_id[id,] <- c(top_ref, top_query) - - # Remove as candidate - cosine_similarity[top_ref,] <- -Inf - cosine_similarity[, top_query] <- -Inf - } - - # Vector of variance explained - var_explained_ref <- attributes(reducedDim(reference_data, "PCA"))$varExplained[pc_subset]/ - sum(attributes(reducedDim(reference_data, "PCA"))$varExplained[pc_subset]) - var_explained_query <- attributes(reducedDim(query_data, "PCA"))$varExplained[pc_subset]/ - sum(attributes(reducedDim(query_data, "PCA"))$varExplained[pc_subset]) - var_explained_avg <- (var_explained_ref[cosine_id[, 1]] + var_explained_query[cosine_id[, 2]]) / 2 - - # Weighted cosine similarity score - weighted_cosine_similarity <- sum(top_cosine * var_explained_avg) - - # Update class of return output - output <- list(cosine_similarity = top_cosine, - cosine_id = cosine_id, - var_explained_avg = var_explained_avg, - weighted_cosine_similarity = weighted_cosine_similarity) - class(output)
<- c(class(output), "comparePCASubspace") - - # Return cosine similarity output - return(output) -} - diff --git a/R/detectAnomaly.R b/R/detectAnomaly.R deleted file mode 100644 index 4ee1c62..0000000 --- a/R/detectAnomaly.R +++ /dev/null @@ -1,175 +0,0 @@ -#' -#' @importFrom methods is -#' @importFrom stats na.omit predict qnorm -#' @importFrom utils tail -#' -#' @title PCA Anomaly Scores via Isolation Forests with Visualization -#' -#' @description -#' This function detects anomalies in single-cell data by projecting the data onto a PCA space and using an isolation forest -#' algorithm to identify anomalies. -#' -#' @details This function projects the query data onto the PCA space of the reference data. An isolation forest is then built on the -#' reference data to identify anomalies in the query data based on their PCA projections. If no query dataset is provided by the user, -#' the anomaly scores are computed on the reference data itself. Anomaly scores for the data with all combined cell types are also -#' provided as part of the output. -#' -#' @param reference_data A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the reference cells. -#' @param query_data An optional \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the query cells. -#' If NULL, then the isolation forest anomaly scores are computed for the reference data. Default is NULL. -#' @param ref_cell_type_col A character string specifying the column name in the reference dataset containing cell type annotations. -#' @param query_cell_type_col A character string specifying the column name in the query dataset containing cell type annotations. -#' @param n_components An integer specifying the number of principal components to use. Default is 10. -#' @param n_tree An integer specifying the number of trees for the isolation forest. 
Default is 500. -#' @param anomaly_treshold A numeric value specifying the threshold for identifying anomalies. Default is 0.5. -#' @param ... Additional arguments passed to the `isolation.forest` function. -#' -#' @return A list containing the following components for each cell type and the combined data: -#' \item{anomaly_scores}{Anomaly scores for each cell in the query data.} -#' \item{anomaly}{Logical vector indicating whether each cell is classified as an anomaly.} -#' \item{reference_mat_subset}{PCA projections of the reference data.} -#' \item{query_mat_subset}{PCA projections of the query data (if provided).} -#' \item{var_explained}{Proportion of variance explained by the retained principal components.} -#' -#' @export -#' -#' @author Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu} -#' -#' @seealso \code{\link{plot.detectAnomaly}} -#' -#' @examples -#' # Load required libraries -#' library(scRNAseq) -#' library(scuttle) -#' library(SingleR) -#' library(scran) -#' library(scater) -#' -#' # Load data -#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) -#' -#' # Divide the data into reference and query datasets -#' set.seed(100) -#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -#' ref_data <- sce[, indices] -#' query_data <- sce[, -indices] -#' -#' # Log-transform datasets -#' ref_data <- logNormCounts(ref_data) -#' query_data <- logNormCounts(query_data) -#' -#' # Get cell type scores using SingleR (or any other cell type annotation method) -#' scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) -#' -#' # Add labels to query object -#' colData(query_data)$labels <- scores$labels -#' -#' # Selecting highly variable genes (can be customized by the user) -#' ref_var <- getTopHVGs(ref_data, n = 2000) -#' query_var <- getTopHVGs(query_data, n = 2000) -#' -#' # Intersect the gene symbols to obtain common genes -#' common_genes <- intersect(ref_var,
query_var) -#' ref_data_subset <- ref_data[common_genes, ] -#' query_data_subset <- query_data[common_genes, ] -#' -#' # Run PCA on the reference data -#' ref_data_subset <- runPCA(ref_data_subset) -#' -#' # Store PCA anomaly data and plots -#' anomaly_output <- detectAnomaly(ref_data_subset, query_data_subset, -#' ref_cell_type_col = "reclustered.broad", -#' query_cell_type_col = "labels", -#' n_components = 10, -#' n_tree = 500, -#' anomaly_treshold = 0.5) -#' -#' # Plot the output for a cell type -#' plot(anomaly_output, cell_type = "CD8", pc_subset = c(1:5), data_type = "query") -#' -# Function to perform diagnostics using isolation forest with PCA and visualization -detectAnomaly <- function(reference_data, - query_data = NULL, - ref_cell_type_col, - query_cell_type_col = NULL, - n_components = 10, - n_tree = 500, - anomaly_treshold = 0.5, - ...) { - - # Check whether the analysis is done only for one dataset - if (is.null(query_data)) { - include_query_in_output <- FALSE - } else{ - if(is.null(query_cell_type_col)) - stop("If \'query_data\' is not NULL, a value for \'query_cell_type_col\' must be provided.") - include_query_in_output <- TRUE - } - - if(!is.null(n_components)){ - reference_mat <- reducedDim(reference_data, "PCA")[, 1:n_components] - if(include_query_in_output){ - # Get PCA data from reference and query datasets (query data projected onto PCA space of reference dataset) - pca_output <- projectPCA(query_data = query_data, reference_data = reference_data, - query_cell_type_col = query_cell_type_col, ref_cell_type_col = ref_cell_type_col, - n_components = n_components, return_value = "list") - query_mat <- pca_output$query[, paste0("PC", 1:n_components)] - } - } else{ - reference_mat <- t(as.matrix(assay(reference_data, "logcounts"))) - if(include_query_in_output){ - query_mat <- t(as.matrix(assay(query_data, "logcounts"))) - } - } - - # List to store output - output <- list() - - # Extract reference and query annotations - reference_labels <-
reference_data[[ref_cell_type_col]] - if(!include_query_in_output){ - cell_types <- c(as.list(na.omit(unique(reference_labels))), - list(na.omit(unique(reference_labels)))) - } else{ - query_labels <- query_data[[query_cell_type_col]] - cell_types <- c(as.list(na.omit(intersect(unique(reference_labels), unique(query_labels)))), - list(na.omit(intersect(unique(reference_labels), unique(query_labels))))) - } - - for (cell_type in cell_types) { - - # Filter reference and query PCA data for the current cell type - reference_mat_subset <- na.omit(reference_mat[reference_labels %in% cell_type,]) - - # Build isolation forest on reference PCA data for this cell type - isolation_forest <- isotree::isolation.forest(reference_mat_subset, ntree = n_tree, ...) - - # Calculate anomaly scores for query data (scaled by reference path length) - reference_anomaly_scores <- predict(isolation_forest, newdata = reference_mat_subset, type = "score") - if(include_query_in_output){ - query_mat_subset <- na.omit(query_mat[query_labels %in% cell_type,]) - query_anomaly_scores <- predict(isolation_forest, newdata = query_mat_subset, type = "score") - } - - # Store cell type anomaly scores and PCA data - list_name <- ifelse(length(cell_type) == 1, cell_type, "Combined") - output[[list_name]] <- list() - output[[list_name]]$reference_anomaly_scores <- reference_anomaly_scores - output[[list_name]]$reference_anomaly <- reference_anomaly_scores > anomaly_treshold - output[[list_name]]$reference_mat_subset <- reference_mat_subset - if(include_query_in_output){ - output[[list_name]]$query_mat_subset <- query_mat_subset - output[[list_name]]$query_anomaly_scores <- query_anomaly_scores - output[[list_name]]$query_anomaly <- query_anomaly_scores > anomaly_treshold - } - if(!is.null(n_components)) - output[[list_name]]$var_explained <- (attributes(reducedDim(reference_data, "PCA"))$varExplained[1:n_components]) / - sum(attributes(reducedDim(reference_data, "PCA"))$varExplained) - } - - # Set the 
class of the output - class(output) <- c(class(output), "detectAnomaly") - - # Return anomaly scores and PCA data for each cell type - return(output) -} diff --git a/R/histQCvsAnnotation.R b/R/histQCvsAnnotation.R deleted file mode 100644 index 4d126aa..0000000 --- a/R/histQCvsAnnotation.R +++ /dev/null @@ -1,130 +0,0 @@ -#' @title Histograms: QC Stats and Annotation Scores Visualization -#' -#' @description -#' This function generates histograms for visualizing the distribution of quality control (QC) statistics and -#' annotation scores associated with cell types in single-cell genomic data. -#' -#' @details This function is particularly useful in the analysis of data from single-cell experiments, -#' where understanding the distribution of these metrics is crucial for quality assessment and -#' interpretation of cell type annotations. -#' -#' @param query_data A \code{\linkS4class{SingleCellExperiment}} containing the single-cell -#' expression data and metadata. -#' @param qc_col character. A column name in the \code{colData} of \code{query_data} that -#' contains the QC stats of interest. -#' @param label_col character. The column name in the \code{colData} of \code{query_data} -#' that contains the cell type labels. -#' @param score_col character. The column name in the \code{colData} of \code{query_data} that -#' contains the cell type scores. -#' @param label character. A vector of cell type labels to plot (e.g., c("T-cell", "B-cell")). -#' Defaults to \code{NULL}, which will include all the cells. -#' -#' @return An object containing two histograms displayed side by side. -#' The first histogram represents the distribution of QC stats, -#' and the second histogram represents the distribution of annotation scores.
-#' -#' @examples -#' \donttest{ -#' library(scater) -#' library(scran) -#' library(scRNAseq) -#' library(SingleR) -#' library(gridExtra) -#' -#' # Load data -#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) -#' -#' # Divide the data into reference and query datasets -#' set.seed(100) -#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -#' ref_data <- sce[, indices] -#' query_data <- sce[, -indices] -#' -#' # Log-transform datasets -#' ref_data <- logNormCounts(ref_data) -#' query_data <- logNormCounts(query_data) -#' -#' # Get cell type scores using SingleR -#' pred <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) -#' -#' # Assign labels to query data -#' colData(query_data)$labels <- pred$labels -#' -#' # Get annotation scores -#' scores <- apply(pred$scores, 1, max) -#' -#' # Assign scores to query data -#' colData(query_data)$cell_scores <- scores -#' -#' # Generate histograms -#' histQCvsAnnotation(query_data = query_data, -#' qc_col = "percent.mito", -#' label_col = "labels", -#' score_col = "cell_scores", -#' label = c("CD4", "CD8")) -#' -#' histQCvsAnnotation(query_data = query_data, -#' qc_col = "percent.mito", -#' label_col = "labels", -#' score_col = "cell_scores", -#' label = NULL) -#' } -#' -#' @export -histQCvsAnnotation <- function(query_data, - qc_col = qc_col, - label_col, - score_col, - label = NULL) { - # Sanity checks - - # Check if query_data is a SingleCellExperiment object - if (!is(query_data, "SingleCellExperiment")) { - stop("query_data must be a SingleCellExperiment object.") - } - - # Check if qc_col is a valid column name in query_data - if (!qc_col %in% colnames(colData(query_data))) { - stop("qc_col: '", qc_col, "' is not a valid column name in query_data.") - } - - # Check if label_col is a valid column name in query_data - if (!label_col %in% colnames(colData(query_data))) { - stop("label_col: '", label_col, "' is not a valid column name in 
query_data.") - } - - # Check if score_col is a valid column name in query_data - if (!score_col %in% colnames(colData(query_data))) { - stop("score_col: '", score_col, "' is not a valid column name in query_data.") - } - - # Filter cells based on label if specified - if (!is.null(label)) { - index <- which(colData(query_data)[[label_col]] %in% label) - query_data <- query_data[, index] - } - - # Extract QC stats and scores - qc_stats <- colData(query_data)[, qc_col] - cell_type_scores <- colData(query_data)[, score_col] - - # Combine QC stats and scores into a data frame - data <- data.frame(QCStats = qc_stats, Scores = cell_type_scores) - - # Create histogram for QC stats - qc_histogram <- ggplot2::ggplot(data, ggplot2::aes(x = QCStats)) + - ggplot2::geom_histogram(color = "black", fill = "white") + - ggplot2::xlab(paste(qc_col)) + - ggplot2::ylab("Frequency") + - ggplot2::theme_bw() - - # Create histogram for scores - scores_histogram <- ggplot2::ggplot(data, ggplot2::aes(x = Scores)) + - ggplot2::geom_histogram(color = "black", fill = "white") + - ggplot2::xlab("Annotation Scores") + - ggplot2::ylab("Frequency") + - ggplot2::theme_bw() - - # Arrange the two histograms side by side and return the result - return(gridExtra::grid.arrange(qc_histogram, scores_histogram, ncol = 2)) -} diff --git a/R/nearestNeighborDiagnostics.R b/R/nearestNeighborDiagnostics.R deleted file mode 100644 index 54d827f..0000000 --- a/R/nearestNeighborDiagnostics.R +++ /dev/null @@ -1,159 +0,0 @@ -#' @title Calculate Nearest Neighbor Diagnostics for Cell Type Classification -#' -#' @description -#' This function computes the probabilities for each sample of belonging to either the reference or query dataset for -#' each cell type using nearest neighbor analysis. - -#' -#' @details -#' This function performs a nearest neighbor search to calculate the probability of each sample in the query dataset -#' belonging to the reference dataset for each cell type. 
It uses principal component analysis (PCA) to reduce the dimensionality -#' of the data before performing the nearest neighbor search. The function balances the sample sizes between the reference and query -#' datasets by data augmentation if necessary. - -#' -#' @param query_data A \code{\linkS4class{SingleCellExperiment}} object containing a numeric expression matrix for the query cells. -#' @param reference_data A \code{\linkS4class{SingleCellExperiment}} object containing a numeric expression matrix for the reference cells. -#' @param n_neighbor An integer specifying the number of nearest neighbors to consider. Default is 15. -#' @param n_components An integer specifying the number of principal components to use for dimensionality reduction. Default is 10. -#' @param pc_subset A vector specifying the subset of principal components to use in the analysis. Default is c(1:10). -#' @param query_cell_type_col A character string specifying the column name in the query dataset containing cell type annotations. -#' @param ref_cell_type_col A character string specifying the column name in the reference dataset containing cell type annotations. -#' -#' @return A list where each element corresponds to a cell type and contains two vectors: -#' \item{prob_ref}{The probabilities of each query sample belonging to the reference dataset.} -#' \item{prob_query}{The probabilities of each query sample belonging to the query dataset.} -#' The list is assigned the class \code{"nearestNeighborDiagnostics"}. 
-#' -#' @export -#' -#' @author Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu} -#' -#' @seealso \code{\link{plot.nearestNeighborDiagnostics}} -#' -#' @examples -#' # Load necessary library -#' library(scRNAseq) -#' library(scuttle) -#' library(scran) -#' library(SingleR) -#' library(scater) -#' -#' # Load data -#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) -#' -#' # Divide the data into reference and query datasets -#' set.seed(100) -#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -#' ref_data <- sce[, indices] -#' query_data <- sce[, -indices] -#' -#' # log transform datasets -#' ref_data <- logNormCounts(ref_data) -#' query_data <- logNormCounts(query_data) -#' -#' # Get cell type scores using SingleR (or any other cell type annotation method) -#' scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) -#' -#' # Add labels to query object -#' colData(query_data)$labels <- scores$labels -#' -#' # Selecting highly variable genes (can be customized by the user) -#' ref_var <- getTopHVGs(ref_data, n = 500) -#' query_var <- getTopHVGs(query_data, n = 500) -#' -#' # Intersect the gene symbols to obtain common genes -#' common_genes <- intersect(ref_var, query_var) -#' ref_data_subset <- ref_data[common_genes, ] -#' query_data_subset <- query_data[common_genes, ] -#' -#' # Run PCA on the reference data -#' ref_data_subset <- runPCA(ref_data_subset) -#' -#' # Project the query data onto PCA space of reference -#' nn_output <- nearestNeighborDiagnostics(query_data_subset, ref_data_subset, -#' n_neighbor = 15, -#' n_components = 10, -#' pc_subset = c(1:10), -#' query_cell_type_col = "labels", -#' ref_cell_type_col = "reclustered.broad") -#' -#' # Plot output -#' plot(nn_output, cell_types = c("CD4", "CD8", "B_and_plasma", "Myeloid"), -#' prob_type = "query") -#' -#' -# Function to get probabilities for each sample of belonging to reference or query dataset for 
each cell type -nearestNeighborDiagnostics <- function(query_data, reference_data, - n_neighbor = 15, - n_components = 10, - pc_subset = c(1:10), - query_cell_type_col, - ref_cell_type_col){ - - # Check if n_components is a positive integer - if (!inherits(n_components, "numeric")) { - stop("n_components should be numeric") - } else if (any(!n_components == floor(n_components), n_components < 1)) { - stop("n_components should be an integer, greater than zero.") - } - - # Get PCA data - pca_output <- projectPCA(query_data = query_data, reference_data = reference_data, - n_components = n_components, - query_cell_type_col = query_cell_type_col, - ref_cell_type_col = ref_cell_type_col, - return_value = c("data.frame", "list")[2]) - - # Initialize list to store probabilities - probabilities <- list() - - # Get unique cell types - cell_types <- na.omit(intersect(unique(query_data[[query_cell_type_col]]), - unique(reference_data[[ref_cell_type_col]]))) - - # Loop through each cell type - for (cell_type in cell_types) { - - # Extract PCA-reduced data for the current cell type - ref_pca_cell_type <- pca_output$ref[which(reference_data[[ref_cell_type_col]] == cell_type), paste0("PC", pc_subset)] - query_pca_cell_type <- pca_output$query[which(query_data[[query_cell_type_col]] == cell_type), paste0("PC", pc_subset)] - - # Combine reference and query data for the current cell type - combined_data_cell_type <- rbind(ref_pca_cell_type, query_pca_cell_type) - - # Number of samples for reference and query datasets - n_ref <- nrow(ref_pca_cell_type) - n_query <- nrow(query_pca_cell_type) - - # Data augmentation to balance sample size of datasets - if(n_ref > n_query){ - - combined_data_cell_type <- rbind(combined_data_cell_type, - query_pca_cell_type[sample(1:n_query, n_ref - n_query, replace = TRUE),]) - } else if (n_query > n_ref){ - - combined_data_cell_type <- rbind(combined_data_cell_type, - ref_pca_cell_type[sample(1:n_ref, n_query - n_ref, replace = TRUE),]) - } - - # 
Perform nearest neighbors search - knn_result <- BiocNeighbors::findKNN(combined_data_cell_type, k = n_neighbor, warn.ties = FALSE) - - prob_ref <- apply(knn_result$index[(n_ref + 1):nrow(knn_result$index),], 1, function(x, n_ref) { - mean(x <= n_ref)}, - n_ref = n_ref) - - # Store the probabilities - probabilities[[cell_type]] <- list() - probabilities[[cell_type]]$prob_ref <- prob_ref - probabilities[[cell_type]]$prob_query <- 1 - prob_ref - } - - # Creating class for output - class(probabilities) <- c(class(probabilities), "nearestNeighborDiagnostics") - - # Return the list of probabilities - return(probabilities) -} - diff --git a/R/plot.calculateAveragePairwiseCorrelation.R b/R/plot.calculateAveragePairwiseCorrelation.R deleted file mode 100644 index a89d4d5..0000000 --- a/R/plot.calculateAveragePairwiseCorrelation.R +++ /dev/null @@ -1,107 +0,0 @@ -#' @title -#' Plot the output of the calculateAveragePairwiseCorrelation function -#' -#' @description -#' This function takes the output of the calculateAveragePairwiseCorrelation function, -#' which should be a matrix of pairwise correlations, and plots it as a heatmap. -#' -#' @details -#' This function converts the correlation matrix into a dataframe, creates a heatmap using ggplot2, -#' and customizes the appearance of the heatmap with updated colors and improved aesthetics. -#' -#' @param x Output matrix from calculateAveragePairwiseCorrelation function. -#' @param ... Additional arguments to be passed to the plotting function. -#' -#' @return A ggplot2 object representing the heatmap plot. 
-#' -#' @export -#' -#' @seealso \code{\link{calculateAveragePairwiseCorrelation}} -#' -#' @examples -#' library(scater) -#' library(scran) -#' library(scRNAseq) -#' library(SingleR) -#' -#' # Load data -#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) -#' -#' # Divide the data into reference and query datasets -#' set.seed(100) -#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -#' ref_data <- sce[, indices] -#' query_data <- sce[, -indices] -#' -#' # log transform datasets -#' ref_data <- logNormCounts(ref_data) -#' query_data <- logNormCounts(query_data) -#' -#' # Get cell type scores using SingleR -#' scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) -#' -#' # Add labels to query object -#' colData(query_data)$labels <- scores$labels -#' -#' # Compute Pairwise Correlations -#' # Note: The selection of highly variable genes and desired cell types may vary -#' # based on user preference. -#' # The cell type annotation method used in this example is SingleR. -#' # User can use any other method for cell type annotation and provide -#' # the corresponding labels in the metadata. 
-#' -#' # Selecting highly variable genes -#' ref_var <- getTopHVGs(ref_data, n = 2000) -#' query_var <- getTopHVGs(query_data, n = 2000) -#' -#' # Intersect the gene symbols to obtain common genes -#' common_genes <- intersect(ref_var, query_var) -#' -#' # Select desired cell types -#' selected_cell_types <- c("CD4", "CD8", "B_and_plasma") -#' ref_data_subset <- ref_data[common_genes, ref_data$reclustered.broad %in% selected_cell_types] -#' query_data_subset <- query_data[common_genes, query_data$reclustered.broad %in% selected_cell_types] -#' -#' # Run PCA on the reference data -#' ref_data_subset <- runPCA(ref_data_subset) -#' -#' # Compute pairwise correlations -#' cor_matrix_avg <- calculateAveragePairwiseCorrelation(query_data = query_data_subset, -#' reference_data = ref_data_subset, -#' n_components = 10, -#' query_cell_type_col = "labels", -#' ref_cell_type_col = "reclustered.broad", -#' cell_types = selected_cell_types, -#' correlation_method = "spearman") -#' -#' # Visualize the results -#' plot(cor_matrix_avg) -#' -#' -# Function to plot the output of the calculateAveragePairwiseCorrelation function -plot.calculateAveragePairwiseCorrelation <- function(x, ...){ - - # Convert matrix to dataframe (use the argument x, not a global object) - cor_df <- as.data.frame(as.table(x)) - cor_df$Var1 <- factor(cor_df$Var1, levels = rownames(x)) - cor_df$Var2 <- factor(cor_df$Var2, levels = rev(colnames(x))) - - # Create the heatmap with updated colors and improved aesthetics - heatmap_plot <- ggplot2::ggplot(cor_df, ggplot2::aes(x = Var2, y = Var1)) + - ggplot2::geom_tile(ggplot2::aes(fill = Freq), color = "white") + - ggplot2::geom_text(ggplot2::aes(label = round(Freq, 2)), color = "black", size = 3, family = "sans") + - ggplot2::scale_fill_gradient2(low = "blue", mid = "white", high = "red", - midpoint = 0, limits = c(min(cor_df$Freq), max(cor_df$Freq)), - name = "Correlation", - breaks = seq(-1, 1, by = 0.2)) + # Specify color scale breaks - ggplot2::labs(title = 
"Correlation Heatmap", x = "", y = "") + - ggplot2::theme_minimal() + - ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 45, hjust = 1), # Rotate x-axis labels - axis.text.y = ggplot2::element_text(family = "sans"), # Set font family for y-axis labels - plot.title = ggplot2::element_text(face = "bold"), # Make title bold - legend.position = "right", # Place legend on RHS - legend.title = ggplot2::element_text(face = "italic")) - - # Print the plot - print(heatmap_plot) -} diff --git a/R/plot.calculateSampleDistances.R b/R/plot.calculateSampleDistances.R deleted file mode 100644 index 28c468f..0000000 --- a/R/plot.calculateSampleDistances.R +++ /dev/null @@ -1,152 +0,0 @@ -#' @title Plot Distance Density Comparison for a Specific Cell Type and Selected Samples -#' -#' @description This function plots the density functions for the reference data and the distances from a specified query samples -#' to all reference samples within a specified cell type. -#' -#' @details The function first checks if the specified cell type and sample names are present in the \code{x}. If the -#' specified cell type or sample name is not found, an error is thrown. It then extracts the distances within the reference dataset -#' and the distances from the specified query sample to the reference samples. The function creates a density plot using \code{ggplot2} -#' to compare the distance distributions. The density plot will show two distributions: one for the pairwise distances within the -#' reference dataset and one for the distances from the specified query sample to each reference sample. These distributions are -#' plotted in different colors to visually assess how similar the query sample is to the reference samples of the specified cell type. -#' -#' @param x A list containing the distance data computed by \code{calculateSampleDistances}. -#' @param ref_cell_type A string specifying the reference cell type. 
-#' @param sample_names A string specifying the query sample name for which to plot the distances. -#' @param ... Additional arguments passed to the plotting function. -#' -#' @return A ggplot2 density plot comparing the reference distances and the distances from the specified sample to the reference samples. -#' -#' @export -#' -#' @author Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu} -#' -#' @seealso \code{\link{calculateSampleDistances}} -#' -#' @examples -#' # Load required libraries -#' library(scRNAseq) -#' library(scuttle) -#' library(SingleR) -#' library(scran) -#' library(scater) -#' -#' # Load data (replace with your data loading) -#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) -#' -#' # Divide the data into reference and query datasets -#' set.seed(100) -#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -#' ref_data <- sce[, indices] -#' query_data <- sce[, -indices] -#' -#' # log transform datasets -#' ref_data <- scuttle::logNormCounts(ref_data) -#' query_data <- scuttle::logNormCounts(query_data) -#' -#' # Get cell type scores using SingleR (or any other cell type annotation method) -#' scores <- SingleR::SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) -#' -#' # Add labels to query object -#' colData(query_data)$labels <- scores$labels -#' -#' # Selecting highly variable genes (can be customized by the user) -#' ref_var <- scran::getTopHVGs(ref_data, n = 2000) -#' query_var <- scran::getTopHVGs(query_data, n = 2000) -#' -#' # Intersect the gene symbols to obtain common genes -#' common_genes <- intersect(ref_var, query_var) -#' ref_data_subset <- ref_data[common_genes, ] -#' query_data_subset <- query_data[common_genes, ] -#' -#' # Run PCA on the reference data -#' ref_data_subset <- runPCA(ref_data_subset) -#' -#' # Plot the PC data -#' distance_data <- calculateSampleDistances(query_data_subset, ref_data_subset, -#' n_components = 10, -#' 
query_cell_type_col = "labels", -#' ref_cell_type_col = "reclustered.broad", -#' pc_subset = c(1:10)) -#' -#' # Identify outliers for CD4 -#' cd4_anomalies <- detectAnomaly(ref_data_subset, query_data_subset, -#' query_cell_type_col = "labels", -#' ref_cell_type_col = "reclustered.broad", -#' n_components = 10, -#' n_tree = 500, -#' anomaly_treshold = 0.5)$CD4 -#' cd4_top5_anomalies <- names(sort(cd4_anomalies$query_anomaly_scores, decreasing = TRUE)[1:6]) -#' -#' # Plot the densities of the distances -#' plot(distance_data, ref_cell_type = "CD4", sample_names = cd4_top5_anomalies) -#' plot(distance_data, ref_cell_type = "CD8", sample_names = cd4_top5_anomalies) -#' -#' -# Function to plot density functions for the reference data and the specified sample -plot.calculateSampleDistances <- function(x, ref_cell_type, sample_names, ...) { - - # Check if cell type is available - if(length(ref_cell_type) != 1 || !(ref_cell_type %in% names(x))) - stop("The specified \'ref_cell_type\' is not available.") - - # Filter distance data for the specified cell type - cell_distances <- x[[ref_cell_type]] - - # Check if samples are available in data for that cell type - if(!all(sample_names %in% rownames(cell_distances$query_to_ref_distances))) - stop("One or more specified 'sample_names' are not available for that cell type.") - - # Extract distances within the reference dataset - ref_distances <- cell_distances$ref_distances - - # Initialize an empty list to store data frames for each sample - plot_data_list <- list() - - # Loop through each sample to create the combined data frame - for(s in sample_names) { - # Extract distances for the current sample - sample_distances <- cell_distances$query_to_ref_distances[s, ] - - # Create a data frame for the current sample and reference distances - sample_data <- data.frame(Sample = s, Distance = sample_distances, Distance_Type = "Sample") - ref_data <- data.frame(Sample = s, Distance = ref_distances, Distance_Type = "Reference") - - # 
Combine the reference and sample data frames - combined_data <- rbind(ref_data, sample_data) - - # Append the combined data frame to the list - plot_data_list[[s]] <- combined_data - } - - # Combine all data frames into one data frame - plot_data <- do.call(rbind, plot_data_list) - - # Keep order of sample names - plot_data$Sample <- factor(plot_data$Sample, levels = sample_names) - - # Plot density comparison with facets for each sample - density_plot <- ggplot2::ggplot(plot_data, ggplot2::aes(x = Distance, fill = Distance_Type)) + - ggplot2::geom_density(alpha = 0.5) + - ggplot2::labs(title = paste("Distance Density Comparison for Cell Type:", ref_cell_type), - x = "Distance", y = "Density") + - ggplot2::scale_fill_manual(name = "Distance Type", values = c("Reference" = "blue", "Sample" = "red")) + - ggplot2::facet_wrap(~ Sample, scales = "free_y", labeller = ggplot2::labeller(Sample = ggplot2::label_parsed)) + - ggplot2::theme_minimal() + - ggplot2::theme( - strip.background = ggplot2::element_rect(fill = "lightgrey", color = "grey50"), - strip.text = ggplot2::element_text(color = "grey20", size = 10, face = "bold"), - panel.grid.major = ggplot2::element_line(color = "grey90", linetype = "dashed"), - panel.grid.minor = ggplot2::element_line(color = "grey95", linetype = "dashed") - ) - - # Print the density plot - print(density_plot) -} - - - - - - - diff --git a/R/plot.calculateSampleSimilarityPCA.R b/R/plot.calculateSampleSimilarityPCA.R deleted file mode 100644 index 26c8064..0000000 --- a/R/plot.calculateSampleSimilarityPCA.R +++ /dev/null @@ -1,118 +0,0 @@ -#' @title Plot Cosine Similarities Between Samples and PCs -#' -#' @description -#' This function creates a heatmap plot to visualize the cosine similarities between samples and principal components (PCs). -#' -#' @details -#' This function reshapes the input data frame to create a long format suitable for plotting as a heatmap. 
It then -#' creates a heatmap plot using ggplot2, where the x-axis represents the PCs, the y-axis represents the samples, and the -#' color intensity represents the cosine similarity values. -#' -#' @param x An object of class 'calculateSampleSimilarityPCA' containing a dataframe of cosine similarity values -#' between samples and PCs. -#' @param pc_subset A numeric vector specifying the subset of principal components to include in the plot (default: c(1:5)). -#' @param ... Additional arguments passed to the plotting function. -#' -#' @return A ggplot object representing the cosine similarity heatmap. -#' -#' @export -#' -#' @author Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu} -#' -#' @seealso \code{\link{calculateSampleSimilarityPCA}} -#' -#' @examples -#' # Load required libraries -#' library(scRNAseq) -#' library(scuttle) -#' library(SingleR) -#' library(scran) -#' library(scater) -#' -#' # Load data (replace with your data loading) -#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) -#' -#' # Divide the data into reference and query datasets -#' set.seed(100) -#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -#' ref_data <- sce[, indices] -#' query_data <- sce[, -indices] -#' -#' # log transform datasets -#' ref_data <- scuttle::logNormCounts(ref_data) -#' query_data <- scuttle::logNormCounts(query_data) -#' -#' # Get cell type scores using SingleR (or any other cell type annotation method) -#' scores <- SingleR::SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) -#' -#' # Add labels to query object -#' colData(query_data)$labels <- scores$labels -#' -#' # Selecting highly variable genes (can be customized by the user) -#' ref_var <- scran::getTopHVGs(ref_data, n = 2000) -#' query_var <- scran::getTopHVGs(query_data, n = 2000) -#' -#' # Intersect the gene symbols to obtain common genes -#' common_genes <- intersect(ref_var, query_var) -#' ref_data_subset <- 
ref_data[common_genes, ] -#' query_data_subset <- query_data[common_genes, ] -#' -#' # Run PCA on the reference data (assumed to be prepared) -#' ref_data_subset <- runPCA(ref_data_subset) -#' -#' # Store PCA anomaly data and plots -#' anomaly_output <- detectAnomaly(reference_data = ref_data_subset, -#' ref_cell_type_col = "reclustered.broad", -#' n_components = 10, -#' n_tree = 500, -#' anomaly_treshold = 0.5) -#' top6_anomalies <- names(sort(anomaly_output$Combined$reference_anomaly_scores, -#' decreasing = TRUE)[1:6]) -#' -#' # Compute cosine similarity between anomalies and top PCs -#' cosine_similarities <- calculateSampleSimilarityPCA(ref_data_subset, samples = top6_anomalies, -#' pc_subset = c(1:10), n_top_vars = 50) -#' cosine_similarities -#' -#' # Plot similarities -#' plot(cosine_similarities, pc_subset = c(1:5)) -#' -# Function to plot cosine similarities between samples and PCs -plot.calculateSampleSimilarityPCA <- function(x, pc_subset = c(1:5), ...){ - - # Subset data - x <- x[, paste0("PC", pc_subset)] - - # Initialize empty vectors for reshaped data - sample_names <- c() - pc_names <- c() - cosine_values <- c() - - # Loop through the data frame to manually reshape it - for (sample in rownames(x)) { - for (pc in colnames(x)) { - sample_names <- c(sample_names, sample) - pc_names <- c(pc_names, pc) - cosine_values <- c(cosine_values, x[sample, pc]) - } - } - - # Create a data frame with the reshaped data - cosine_long <- data.frame(Sample = factor(sample_names, levels = rev(rownames(x))), - PC = pc_names, CosineSimilarity = cosine_values) - - # Create the heatmap plot - plot <- ggplot(cosine_long, aes(x = PC, y = Sample, fill = CosineSimilarity)) + - geom_tile(color = "white") + - geom_text(aes(label = sprintf("%.2f", CosineSimilarity)), size = 3) + - scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0, - limits = c(-1, 1), space = "Lab", name = "Cosine Similarity") + - labs(title = "Cosine Similarity Heatmap", - x = "", - y 
= "") + - theme_minimal() + - theme(axis.text.x = element_text(angle = 45, hjust = 1), - plot.title = element_text(hjust = 0.5)) - return(plot) -} - diff --git a/R/plot.compareCCA.R b/R/plot.compareCCA.R deleted file mode 100644 index 55f5a30..0000000 --- a/R/plot.compareCCA.R +++ /dev/null @@ -1,95 +0,0 @@ -#' @title Plot Visualization of Output from compareCCA Function -#' -#' @description This function generates a visualization of the output from the `compareCCA` function. -#' The plot shows the cosine similarities of canonical correlation analysis (CCA) coefficients, -#' with point sizes representing the correlations. -#' -#' @details The function converts the input list into a data frame suitable for plotting with `ggplot2`. -#' Each point in the scatter plot represents the cosine similarity of CCA coefficients, with the size of the point -#' indicating the correlation. -#' -#' @param x A list containing the output from the `compareCCA` function. -#' This list should include `cosine_similarity` and `correlations`. -#' @param ... Additional arguments passed to the plotting function. -#' -#' @return A ggplot object representing the scatter plot of cosine similarities of CCA coefficients and correlations. 
-#' -#' @export -#' -#' @author Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu} -#' -#' @seealso \code{\link{compareCCA}} -#' -#' @examples -#' # Load necessary library -#' library(scRNAseq) -#' library(scuttle) -#' library(scran) -#' library(SingleR) -#' library(ggplot2) -#' library(scater) -#' -#' # Load data -#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) -#' -#' # Divide the data into reference and query datasets -#' set.seed(100) -#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -#' ref_data <- sce[, indices] -#' query_data <- sce[, -indices] -#' -#' # Log transform datasets -#' ref_data <- logNormCounts(ref_data) -#' query_data <- logNormCounts(query_data) -#' -#' # Get cell type scores using SingleR (or any other cell type annotation method) -#' scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) -#' -#' # Add labels to query object -#' colData(query_data)$labels <- scores$labels -#' -#' # Selecting highly variable genes (can be customized by the user) -#' ref_var <- getTopHVGs(ref_data, n = 500) -#' query_var <- getTopHVGs(query_data, n = 500) -#' -#' # Intersect the gene symbols to obtain common genes -#' common_genes <- intersect(ref_var, query_var) -#' ref_data_subset <- ref_data[common_genes, ] -#' query_data_subset <- query_data[common_genes, ] -#' -#' # Subset reference and query data for a specific cell type -#' ref_data_subset <- ref_data_subset[, which(ref_data_subset$reclustered.broad == "CD8")] -#' query_data_subset <- query_data_subset[, which(colData(query_data_subset)$labels == "CD8")] -#' -#' # Run PCA on the reference and query datasets -#' ref_data_subset <- runPCA(ref_data_subset, ncomponents = 50) -#' query_data_subset <- runPCA(query_data_subset, ncomponents = 50) -#' -#' # Compare CCA -#' cca_comparison <- compareCCA(query_data_subset, ref_data_subset, -#' pc_subset = c(1:5)) -#' -#' # Visualize output of CCA comparison -#' 
plot(cca_comparison) -#' -#' -# Plot visualization of output from compareCCA function -plot.compareCCA <- function(x, ...){ - - # Create a data frame for plotting - comparison_data <- data.frame(CCA = paste0("CC", seq_along(x$correlations)), - Cosine = x$cosine_similarity, - Correlation = x$correlations) - comparison_data$CCA <- factor(comparison_data$CCA, levels = comparison_data$CCA) - - - cca_plot <- ggplot2::ggplot(comparison_data, ggplot2::aes(x = CCA, y = Cosine, size = Correlation)) + - ggplot2::geom_point() + - ggplot2::scale_size_continuous(range = c(3, 10)) + - ggplot2::labs(title = "Cosine Similarities of CCA Coefficients with Correlation", - x = "", - y = "Cosine of CC Coefficients", - size = "Correlation") + - ggplot2::theme_minimal() - print(cca_plot) -} \ No newline at end of file diff --git a/R/plot.comparePCA.R b/R/plot.comparePCA.R deleted file mode 100644 index acba156..0000000 --- a/R/plot.comparePCA.R +++ /dev/null @@ -1,100 +0,0 @@ -#' @title Plot Heatmap of Cosine Similarities Between Principal Components -#' -#' @description This function generates a heatmap to visualize the cosine similarities between -#' principal components from the output of the `comparePCA` function. -#' -#' @details The function converts the input matrix into a long-format data frame -#' suitable for plotting with `ggplot2`. The rows in the heatmap are ordered in -#' reverse to match the conventional display format. The heatmap uses a blue-white-red -#' color gradient to represent cosine similarity values, where blue indicates negative -#' similarity, white indicates zero similarity, and red indicates positive similarity. -#' -#' @param x A numeric matrix output from the `comparePCA` function, representing -#' cosine similarities between query and reference principal components. -#' @param ... Additional arguments passed to the plotting function. -#' -#' @return A ggplot object representing the heatmap of cosine similarities. 
-#' -#' @export -#' -#' @author Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu} -#' -#' @seealso \code{\link{comparePCA}} -#' -#' @examples -#' # Load necessary library -#' library(scRNAseq) -#' library(scuttle) -#' library(scran) -#' library(SingleR) -#' library(ComplexHeatmap) -#' library(scater) -#' -#' # Load data -#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) -#' -#' # Divide the data into reference and query datasets -#' set.seed(100) -#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -#' ref_data <- sce[, indices] -#' query_data <- sce[, -indices] -#' -#' # Log transform datasets -#' ref_data <- logNormCounts(ref_data) -#' query_data <- logNormCounts(query_data) -#' -#' # Get cell type scores using SingleR (or any other cell type annotation method) -#' scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) -#' -#' # Add labels to query object -#' colData(query_data)$labels <- scores$labels -#' -#' # Selecting highly variable genes (can be customized by the user) -#' ref_var <- getTopHVGs(ref_data, n = 500) -#' query_var <- getTopHVGs(query_data, n = 500) -#' -#' # Intersect the gene symbols to obtain common genes -#' common_genes <- intersect(ref_var, query_var) -#' ref_data_subset <- ref_data[common_genes, ] -#' query_data_subset <- query_data[common_genes, ] -#' -#' # Subset reference and query data for a specific cell type -#' ref_data_subset <- ref_data_subset[, which(ref_data_subset$reclustered.broad == "CD8")] -#' query_data_subset <- query_data_subset[, which(colData(query_data_subset)$labels == "CD8")] -#' -#' # Run PCA on the reference and query datasets -#' ref_data_subset <- runPCA(ref_data_subset) -#' query_data_subset <- runPCA(query_data_subset) -#' -#' # Call the PCA comparison function -#' similarity_mat <- comparePCA(query_data_subset, ref_data_subset, -#' pc_subset = c(1:5), -#' metric = c("cosine", "correlation")[1], -#' 
correlation_method = c("spearman", "pearson")[1]) -#' -#' # Create the heatmap -#' plot(similarity_mat) -#' -#' -# Function to produce the heatmap from output of comparePCA function -plot.comparePCA <- function(x, ...){ - - # Convert the matrix to a data frame - similarity_df <- data.frame( - Ref = factor(rep(rownames(x), each = ncol(x)), levels = rev(rownames(x))), - Query = rep(colnames(x), times = nrow(x)), - value = as.vector(x)) - - # Create the heatmap - pc_plot <- ggplot2::ggplot(similarity_df, ggplot2::aes(x = Query, y = Ref, fill = value)) + - ggplot2::geom_tile(color = "white") + - ggplot2::scale_fill_gradient2(low = "blue", high = "red", mid = "white", - midpoint = 0, limit = c(min(x, -0.5), max(x, 0.5)), space = "Lab", - name = "Cosine Similarity") + - ggplot2::theme_minimal() + - ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 45, vjust = 1, - size = 12, hjust = 1)) + - ggplot2::labs(x = "", y = "", - title = "Heatmap of Cosine Similarities Between PCs") - print(pc_plot) -} diff --git a/R/plot.comparePCASubspace.R b/R/plot.comparePCASubspace.R deleted file mode 100644 index 9477a70..0000000 --- a/R/plot.comparePCASubspace.R +++ /dev/null @@ -1,96 +0,0 @@ -#' @title Plot Visualization of Output from comparePCASubspace Function -#' -#' @description This function generates a visualization of the output from the `comparePCASubspace` function. -#' The plot shows the cosine of principal angles between reference and query principal components, -#' with point sizes representing the variance explained. -#' -#' @details The function converts the input list into a data frame suitable for plotting with `ggplot2`. -#' Each point in the scatter plot represents the cosine of a principal angle, with the size of the point -#' indicating the average variance explained by the corresponding principal components. 
-#' -#' @param x A list output from the `comparePCASubspace` function, containing the cosines of the -#' principal angles, the matched PC indices, and the average variance explained. -#' @param ... Additional arguments passed to the plotting function. -#' -#' @return A ggplot object representing the scatter plot of principal angle cosines. -#' -#' @export -#' -#' @author Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu} -#' -#' @seealso \code{\link{comparePCASubspace}} -#' -#' @examples -#' # Load necessary libraries -#' library(scRNAseq) -#' library(scuttle) -#' library(scran) -#' library(SingleR) -#' library(ggplot2) -#' library(scater) -#' -#' # Load data -#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) -#' -#' # Divide the data into reference and query datasets -#' set.seed(100) -#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -#' ref_data <- sce[, indices] -#' query_data <- sce[, -indices] -#' -#' # Log transform datasets -#' ref_data <- logNormCounts(ref_data) -#' query_data <- logNormCounts(query_data) -#' -#' # Get cell type scores using SingleR (or any other cell type annotation method) -#' scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) -#' -#' # Add labels to query object -#' colData(query_data)$labels <- scores$labels -#' -#' # Selecting highly variable genes (can be customized by the user) -#' ref_var <- getTopHVGs(ref_data, n = 500) -#' query_var <- getTopHVGs(query_data, n = 500) -#' -#' # Intersect the gene symbols to obtain common genes -#' common_genes <- intersect(ref_var, query_var) -#' ref_data_subset <- ref_data[common_genes, ] -#' query_data_subset <- query_data[common_genes, ] -#' -#' # Subset reference and query data for a specific cell type -#' ref_data_subset <- ref_data_subset[, which(ref_data_subset$reclustered.broad == "CD8")] -#' query_data_subset <- query_data_subset[, which(colData(query_data_subset)$labels == "CD8")] -#' -#' # Run PCA on the reference 
and query datasets -#' ref_data_subset <- runPCA(ref_data_subset, ncomponents = 50) -#' query_data_subset <- runPCA(query_data_subset, ncomponents = 50) -#' -#' # Compare PCA subspaces -#' subspace_comparison <- comparePCASubspace(query_data_subset, ref_data_subset, -#' pc_subset = c(1:5)) -#' -#' # Plot the subspace comparison -#' plot(subspace_comparison) -#' -#' -# Function to produce the visualization of output from comparePCASubspace function -plot.comparePCASubspace <- function(x, ...){ - - # Create a data frame for plotting - x <- data.frame(PC = paste0("Ref PC", x$cosine_id[, 1], - " - Query PC", x$cosine_id[, 2]), - Cosine = x$cosine_similarity, - VarianceExplained = x$var_explained_avg) - x$PC <- factor(x$PC, levels = x$PC) - - # Create plot - pc_plot <- ggplot2::ggplot(x, ggplot2::aes(x = PC, y = Cosine, size = VarianceExplained)) + - ggplot2::geom_point() + - ggplot2::scale_size_continuous(range = c(3, 10)) + - ggplot2::labs(title = "Principal Angles Cosines with Variance Explained", - x = "", - y = "Cosine of Principal Angle", - size = "Variance Explained") + - ggplot2::theme_minimal() - print(pc_plot) -} \ No newline at end of file diff --git a/R/plot.detectAnomaly.R b/R/plot.detectAnomaly.R deleted file mode 100644 index 0480309..0000000 --- a/R/plot.detectAnomaly.R +++ /dev/null @@ -1,163 +0,0 @@ -#' @title Create Faceted Scatter Plots for Specified PC Combinations From \code{detectAnomaly} Object -#' -#' @description This function generates faceted scatter plots for specified principal component (PC) combinations -#' within an anomaly detection object. It allows visualization of the relationship between specified -#' PCs and highlights anomalies detected by the Isolation Forest algorithm. -#' -#' @details The function extracts the specified PCs from the given anomaly detection object and generates -#' scatter plots for each pair of PCs. 
It uses \code{ggplot2} to create a faceted plot where each facet represents -#' a pair of PCs. Anomalies are highlighted in red, while normal points are shown in black. -#' -#' @param x A list object containing the anomaly detection results from the \code{detectAnomaly} function. -#' Each element of the list should correspond to a cell type and contain \code{reference_mat_subset}, \code{query_mat_subset}, -#' \code{var_explained}, and \code{anomaly}. -#' @param cell_type A character string specifying the cell type for which the plots should be generated. This should -#' be a name present in \code{x}. If NULL, the "Combined" cell type will be plotted. Default is NULL. -#' @param pc_subset A numeric vector specifying the indices of the PCs to be included in the plots. If NULL, all PCs -#' in \code{reference_mat_subset} will be included. -#' @param data_type A character string specifying whether to plot the "query" data or the "reference" data. Default is "query". -#' @param ... Additional arguments. -#' -#' @return A ggplot2 object representing the PCA plots with anomalies highlighted. 
-#' -#' @export -#' -#' @author Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu} -#' -#' @seealso \code{\link{detectAnomaly}} -#' -#' @examples -#' # Load required libraries -#' library(scRNAseq) -#' library(scuttle) -#' library(SingleR) -#' library(scran) -#' library(scater) -#' -#' # Load data -#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) -#' -#' # Divide the data into reference and query datasets -#' set.seed(100) -#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -#' ref_data <- sce[, indices] -#' query_data <- sce[, -indices] -#' -#' # log transform datasets -#' ref_data <- logNormCounts(ref_data) -#' query_data <- logNormCounts(query_data) -#' -#' # Get cell type scores using SingleR (or any other cell type annotation method) -#' scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) -#' -#' # Add labels to query object -#' colData(query_data)$labels <- scores$labels -#' -#' # Selecting highly variable genes (can be customized by the user) -#' ref_var <- getTopHVGs(ref_data, n = 2000) -#' query_var <- getTopHVGs(query_data, n = 2000) -#' -#' # Intersect the gene symbols to obtain common genes -#' common_genes <- intersect(ref_var, query_var) -#' ref_data_subset <- ref_data[common_genes, ] -#' query_data_subset <- query_data[common_genes, ] -#' -#' # Run PCA on the reference data -#' ref_data_subset <- runPCA(ref_data_subset, ncomponents = 50) -#' -#' # Store PCA anomaly data and plots -#' anomaly_output <- detectAnomaly(ref_data_subset, query_data_subset, -#' ref_cell_type_col = "reclustered.broad", -#' query_cell_type_col = "labels", -#' n_components = 10, -#' n_tree = 500, -#' anomaly_treshold = 0.5) -#' -#' # Plot the output for a cell type -#' plot(anomaly_output, cell_type = "CD8", pc_subset = c(1:5), data_type = "query") -#' -# Function to create faceted scatter plots for specified PC combinations -plot.detectAnomaly <- function(x, cell_type = 
NULL, pc_subset = NULL, data_type = c("query", "reference"), ...) { - - # Check if PCA was used for computations - if(!("var_explained" %in% names(x[[names(x)[1]]]))) - stop("The plot function can only be used if \'n_components\' is not NULL.") - - # Check input for cell type - if(is.null(cell_type)){ - cell_type <- "Combined" - } else{ - if(!(cell_type %in% names(x))) - stop("\'cell_type\' is not available in \'x\'.") - } - - # Check input for pc_subset - if(!is.null(pc_subset)){ - if(!all(pc_subset %in% 1:ncol(x[[cell_type]]$reference_mat_subset))) - stop("\'pc_subset\' is out of range.") - } else{ - pc_subset <- 1:ncol(x[[cell_type]]$reference_mat_subset) - } - - # Check input for data_type - data_type <- match.arg(data_type) - - # Filter data to include only specified PCs - if(is.null(x[[cell_type]]$query_mat_subset) && data_type == "query"){ - stop("There is no query data available in the \'detectAnomaly\' object.") - } else{ - if(data_type == "query"){ - data_subset <- x[[cell_type]]$query_mat_subset[, pc_subset, drop = FALSE] - anomaly <- x[[cell_type]]$query_anomaly - - } else if(data_type == "reference"){ - data_subset <- x[[cell_type]]$reference_mat_subset[, pc_subset, drop = FALSE] - anomaly <- x[[cell_type]]$reference_anomaly - } - } - - # Modify column names to include percentage of variance explained - colnames(data_subset) <- paste0("PC", pc_subset, - " (", sprintf("%.1f%%", x[[cell_type]]$var_explained[pc_subset] * 100), ")") - - # Create all possible pairs of specified PCs - pc_names <- colnames(data_subset) - pairs <- expand.grid(x = pc_names, y = pc_names) - pairs <- pairs[pairs$x != pairs$y, ] - - # Create a new data frame with all possible pairs of specified PCs - data_pairs_list <- lapply(1:nrow(pairs), function(i) { - x_col <- pairs$x[i] - y_col <- pairs$y[i] - data_frame <- data.frame(data_subset[, c(x_col, y_col)]) - colnames(data_frame) <- c("x_value", "y_value") - data_frame$x <- x_col - data_frame$y <- y_col - data_frame - }) - 
data_pairs <- do.call(rbind, data_pairs_list) - - # Remove redundant data (to avoid duplicated plots) - data_pairs <- data_pairs[as.numeric(data_pairs$x) < as.numeric(data_pairs$y),] - - # Add anomalies vector to data_pairs dataframe - data_pairs$anomaly <- rep(anomaly, choose(length(pc_subset), 2)) - - # Create the ggplot object with facets - plot <- ggplot2::ggplot(data_pairs, ggplot2::aes(x = x_value, y = y_value, color = factor(anomaly))) + - ggplot2::geom_point(size = 2) + - ggplot2::scale_color_manual(values = c("black", "red"), labels = c("Normal", "Anomaly")) + - ggplot2::facet_grid(rows = ggplot2::vars(y), cols = ggplot2::vars(x), scales = "free") + - ggplot2::theme_minimal() + - ggplot2::theme(strip.background = ggplot2::element_rect(fill = "grey85", color = "grey70"), - strip.text = ggplot2::element_text(size = 10, face = "bold", color = "black"), - axis.title = ggplot2::element_blank(), - axis.text = ggplot2::element_text(size = 10), - panel.grid = ggplot2::element_blank(), - panel.background = ggplot2::element_rect(fill = "white", color = "black"), - legend.position = "right", - plot.title = ggplot2::element_text(size = 14, hjust = 0.5), - plot.background = ggplot2::element_rect(fill = "white")) + - ggplot2::labs(title = paste0("Isolation Forest Anomaly Plot: ", cell_type), color = "iForest Type") - print(plot) -} diff --git a/R/plot.nearestNeighborDiagnostics.R b/R/plot.nearestNeighborDiagnostics.R deleted file mode 100644 index c93bcdb..0000000 --- a/R/plot.nearestNeighborDiagnostics.R +++ /dev/null @@ -1,114 +0,0 @@ -#' @title Plot Density of Probabilities for Cell Type Classification -#' -#' @description This function generates a density plot showing the distribution of probabilities for each sample of belonging to -#' either the reference or query dataset for each cell type. 
-#' -#' @details This function creates a density plot to visualize the distribution of probabilities for each sample belonging to the -#' reference or query dataset for each cell type. It utilizes the ggplot2 package for plotting. -#' -#' @param x An object of class \code{nearestNeighborDiagnostics} containing the probabilities calculated by the \code{\link{nearestNeighborDiagnostics}} function. -#' @param cell_types A character vector specifying the cell types to include in the plot. If NULL, all cell types in \code{x} will be plotted. Default is NULL. -#' @param prob_type A character string specifying the type of probability to plot. Must be either "query" or "reference". Default is "query". -#' @param ... Additional arguments to be passed to \code{\link[ggplot2]{geom_density}}. -#' -#' @return A ggplot2 density plot. -#' -#' @export -#' -#' @author Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu} -#' -#' @seealso \code{\link{nearestNeighborDiagnostics}} -#' -#' @examples -#' # Load necessary libraries -#' library(scRNAseq) -#' library(scuttle) -#' library(scran) -#' library(SingleR) -#' library(scater) -#' -#' # Load data -#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) -#' -#' # Divide the data into reference and query datasets -#' set.seed(100) -#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -#' ref_data <- sce[, indices] -#' query_data <- sce[, -indices] -#' -#' # log transform datasets -#' ref_data <- logNormCounts(ref_data) -#' query_data <- logNormCounts(query_data) -#' -#' # Get cell type scores using SingleR (or any other cell type annotation method) -#' scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) -#' -#' # Add labels to query object -#' colData(query_data)$labels <- scores$labels -#' -#' # Selecting highly variable genes (can be customized by the user) -#' ref_var <- getTopHVGs(ref_data, n = 500) -#' query_var <- getTopHVGs(query_data, 
n = 500) -#' -#' # Intersect the gene symbols to obtain common genes -#' common_genes <- intersect(ref_var, query_var) -#' ref_data_subset <- ref_data[common_genes, ] -#' query_data_subset <- query_data[common_genes, ] -#' -#' # Run PCA on the reference data -#' ref_data_subset <- runPCA(ref_data_subset) -#' -#' # Project the query data onto PCA space of reference -#' nn_output <- nearestNeighborDiagnostics(query_data_subset, ref_data_subset, -#' n_neighbor = 15, -#' n_components = 10, -#' pc_subset = c(1:10), -#' query_cell_type_col = "labels", -#' ref_cell_type_col = "reclustered.broad") -#' -#' # Plot output -#' plot(nn_output, cell_types = c("CD4", "CD8", "B_and_plasma", "Myeloid"), -#' prob_type = "query") -#' -#' -# Function to plot probabilities of each sample of belonging to reference or query dataset for each cell type -plot.nearestNeighborDiagnostics <- function(x, cell_types = NULL, - prob_type = c("query", "reference")[1], ...) { - - # Check input for probability type - if(!(prob_type %in% c("query", "reference"))) - stop("\'prob_type\' must be one of \'query\' or \'reference\'.") - - # Convert probabilities to data frame - probabilities_df <- do.call(rbind, lapply(names(x), function(ct) { - data.frame(cell_types = ct, - probability = x[[ct]][[ifelse(prob_type == "reference", "prob_ref", "prob_query")]]) - })) - - if(!is.null(cell_types)){ - - if(!all(cell_types %in% names(x))) - stop("One or more of the \'cell_types\' is not available.") - - # Subset cell types - probabilities_df <- probabilities_df[probabilities_df$cell_types %in% cell_types,] - } - - # Create density plot - density_plot <- ggplot2::ggplot(probabilities_df, ggplot2::aes(x = probability, fill = cell_types)) + - ggplot2::geom_density(alpha = 0.7) + - ggplot2::labs(x = "Probability", y = "Density", title = "Density Plot of Probabilities") + - ggplot2::theme_minimal() + - ggplot2::theme( - legend.position = "none", - strip.background = ggplot2::element_rect(fill = "grey90", color = NA), - 
strip.text = ggplot2::element_text(face = "bold") - ) + - ggplot2::facet_wrap(~cell_types, scales = "free", labeller = ggplot2::labeller(cell_types = label_value)) - if(length(unique(probabilities_df$cell_types)) > 2) - density_plot <- density_plot + - ggplot2::scale_fill_manual(values = RColorBrewer::brewer.pal(n = nlevels(as.factor(probabilities_df$cell_types)), - name = "Set1")) - - return(density_plot) -} diff --git a/R/plotGeneExpressionDimred.R b/R/plotGeneExpressionDimred.R deleted file mode 100644 index 5cda893..0000000 --- a/R/plotGeneExpressionDimred.R +++ /dev/null @@ -1,84 +0,0 @@ -#' @title Visualize gene expression on a dimensional reduction plot -#' -#' @description -#' This function plots gene expression on a dimensional reduction plot using methods like t-SNE, UMAP, or PCA. Each single cell is color-coded based on the expression of a specific gene or feature. -#' -#' @param se_object An object of class "SingleCellExperiment" containing log-transformed expression matrix and other metadata. -#' It can be either a reference or query dataset. -#' @param method The reduction method to use for visualization. It should be one of the supported methods: "tSNE", "UMAP", or "PCA". -#' @param n_components A numeric vector of length 2 indicating the first two dimensions to be used for plotting. -#' @param feature A character string representing the name of the gene or feature to be visualized. -#' -#' @import ggplot2 -#' @importFrom ggplot2 ggplot -#' @importFrom SummarizedExperiment assay -#' @import SingleCellExperiment -#' -#' @return A ggplot object representing the dimensional reduction plot with gene expression. 
-#' @export -#' -#' @examples -#' library(scater) -#' library(scran) -#' library(scRNAseq) -#' -#' # Load data -#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) -#' -#' # Divide the data into reference and query datasets -#' set.seed(100) -#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -#' ref_data <- sce[, indices] -#' query_data <- sce[, -indices] -#' -#' # Log transform datasets -#' query_data <- logNormCounts(query_data) -#' -#' # Run PCA -#' query_data <- runPCA(query_data) -#' -#' # Plot gene expression on PCA plot -#' plotGeneExpressionDimred(se_object = query_data, -#' method = "PCA", -#' n_components = c(1, 2), -#' feature = "VPREB3") -#' -#' -plotGeneExpressionDimred <- function(se_object, - method, - n_components = c(1, 2), - feature) { - - # Error handling and validation - supported_methods <- c("tSNE", "UMAP", "PCA") - if (!(method %in% supported_methods)) { - stop("Unsupported method. Please choose one of: ", paste(supported_methods, collapse = ", ")) - } - - if (length(n_components) != 2) { - stop("n_components should be a numeric vector of length 2.") - } - - if (!feature %in% rownames(assay(se_object, "logcounts"))) { - stop("Specified feature does not exist in the expression matrix.") - } - - # Extract dimension reduction coordinates from SingleCellExperiment object - reduction <- reducedDim(se_object, method)[, n_components] - - # Extract gene expression vector - expression <- assay(se_object, "logcounts")[feature, ] - - # Prepare data for plotting - df <- data.frame(Dim1 = reduction[, 1], Dim2 = reduction[, 2], Expression = expression) - - # Create the plot object - plot <- ggplot(df, aes(x = Dim1, y = Dim2)) + - geom_point(aes(color = Expression)) + - scale_color_gradient(low = "grey90", high = "blue") + - xlab("Dimension 1") + - ylab("Dimension 2") + - theme_bw() - - return(plot) -} diff --git a/R/plotGeneSetScores.R b/R/plotGeneSetScores.R deleted file mode 100644 index 
39123f7..0000000 --- a/R/plotGeneSetScores.R +++ /dev/null @@ -1,149 +0,0 @@ -#' @title Visualization of gene sets or pathway scores on dimensional reduction plot -#' -#' @description -#' Plot gene sets or pathway scores on PCA, TSNE, or UMAP. Single cells are color-coded by scores of gene sets or pathways. -#' -#' @details -#' This function plots gene set scores on reduced dimensions such as PCA, t-SNE, or UMAP. -#' It extracts the reduced dimensions from the provided SingleCellExperiment object. -#' Gene set scores are visualized as a scatter plot with colors indicating the scores. -#' For PCA, the function automatically includes the percentage of variance explained -#' in the facet labels of the plot. -#' -#' @param se_object An object of class "SingleCellExperiment" containing numeric expression matrix and other metadata. -#' It can be either a reference or query dataset. -#' @param method A character string indicating the method for visualization ("PCA", "TSNE", or "UMAP"). -#' @param feature A character string representing the name of the feature (score) in the colData of \code{se_object} to plot. -#' @param pc_subset An optional vector specifying the principal components (PCs) to include in the plot if method = "PCA". -#' Default is c(1:5). -#' -#' @return A ggplot2 object representing the gene set scores plotted on the specified reduced dimensions. 
-#' @export -#' -#' @examples -#' library(scater) -#' library(scran) -#' library(scRNAseq) -#' library(AUCell) -#' -#' # Load data -#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) -#' -#' # Divide the data into reference and query datasets -#' set.seed(100) -#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -#' ref_data <- sce[, indices] -#' query_data <- sce[, -indices] -#' -#' ## log transform datasets -#' ref_data <- logNormCounts(ref_data) -#' query_data <- logNormCounts(query_data) -#' -#' # Run PCA on the query data -#' query_data <- runPCA(query_data) -#' -#' # Compute scores using AUCell -#' expression_matrix <- assay(query_data, "logcounts") -#' cells_rankings <- AUCell_buildRankings(expression_matrix, plotStats = FALSE) -#' # Generate gene sets -#' gene_set1 <- sample(rownames(expression_matrix), 10) -#' gene_set2 <- sample(rownames(expression_matrix), 20) -#' gene_sets <- list(geneSet1 = gene_set1, geneSet2 = gene_set2) -#' -#' # Calculate AUC scores for gene sets -#' cells_AUC <- AUCell_calcAUC(gene_sets, cells_rankings) -#' -#' # Assign scores to colData (users should ensure that the scores are present in the colData) -#' colData(query_data)$geneSetScores <- assay(cells_AUC)["geneSet1", ] -#' -#' # Plot gene set scores on PCA -#' plotGeneSetScores(se_object = query_data, -#' method = "PCA", -#' feature = "geneSetScores", -#' pc_subset = c(1:5)) -#' -#' # Note: Users can provide their own gene set scores in the colData of the 'se_object' object, -#' # using any method of their choice. -#' -plotGeneSetScores <- function(se_object, - method, - feature, - pc_subset = c(1:5)) { - - # Check if the specified method is valid - valid_methods <- c("PCA", "TSNE", "UMAP") - if (!(method %in% valid_methods)) { - stop("Invalid method. 
Please choose one of: ", paste(valid_methods, collapse = ", ")) - } - - # Create the plot object - if (method == "PCA") { - # Check if "PCA" is present in reference's reduced dimensions - if (!"PCA" %in% names(reducedDims(se_object))) { - stop("Reference data must have pre-computed PCA in \'reducedDims\'.") - } - - # Check input for pc_subset - if(!all(pc_subset %in% 1:ncol(reducedDim(se_object, "PCA")))) - stop("\'pc_subset\' is out of range.") - - # PCA data - plot_mat <- reducedDim(se_object, "PCA")[, pc_subset] - # Modify column names to include percentage of variance explained - colnames(plot_mat) <- paste0("PC", pc_subset, - " (", sprintf("%.1f%%", attributes(reducedDim(se_object, "PCA"))$varExplained[pc_subset] / - sum(attributes(reducedDim(se_object, "PCA"))$varExplained) * 100), ")") - } else if (method == "TSNE") { - # Check if "TSNE" is present in reference's reduced dimensions - if (!"TSNE" %in% names(reducedDims(se_object))) { - stop("Reference data must have pre-computed t-SNE in \'reducedDims\'.") - } - # TSNE data - plot_mat <- reducedDim(se_object, "TSNE") - } else if (method == "UMAP") { - # Check if "UMAP" is present in reference's reduced dimensions - if (!"UMAP" %in% names(reducedDims(se_object))) { - stop("Reference data must have pre-computed UMAP in \'reducedDims\'.") - } - # UMAP data - plot_mat <- reducedDim(se_object, "UMAP") - } - - # Create all possible pairs of specified PCs - plot_names <- colnames(plot_mat) - pairs <- expand.grid(x = plot_names, y = plot_names) - pairs <- pairs[pairs$x != pairs$y, ] - # Create a new data frame with all possible pairs of specified PCs - data_pairs_list <- lapply(1:nrow(pairs), function(i) { - x_col <- pairs$x[i] - y_col <- pairs$y[i] - data_frame <- data.frame(plot_mat[, c(x_col, y_col)]) - colnames(data_frame) <- c("x_value", "y_value") - data_frame$x <- x_col - data_frame$y <- y_col - data_frame - }) - # Plot data - data_pairs <- do.call(rbind, data_pairs_list) - # Remove redundant data (to avoid 
duplicated plots) - data_pairs <- data_pairs[as.numeric(data_pairs$x) < as.numeric(data_pairs$y),] - data_pairs$scores <- se_object[[feature]] - # Create the ggplot object (with facets if PCA) - plot_obj <- ggplot2::ggplot(data_pairs, ggplot2::aes(x = x_value, y = y_value, color = scores)) + - ggplot2::geom_point(size = 1, alpha = 0.5) + - ggplot2::scale_color_gradientn(colors = c("#2171B5", "#8AABC1", "#FFEDA0", "#E6550D"), - values = seq(0, 1, by = 1/3), - limits = c(0, max(data_pairs$scores))) + - ggplot2::facet_grid(rows = ggplot2::vars(y), cols = ggplot2::vars(x), scales = "free") + - ggplot2::theme_minimal() + - ggplot2::theme(strip.background = ggplot2::element_rect(fill = "grey85", color = "grey70"), - strip.text = ggplot2::element_text(size = 10, face = "bold", color = "black"), - axis.title = ggplot2::element_blank(), - axis.text = ggplot2::element_text(size = 10), - panel.grid = ggplot2::element_blank(), - panel.background = ggplot2::element_rect(fill = "white", color = "black"), - legend.position = "right", - plot.title = ggplot2::element_text(size = 14, hjust = 0.5), - plot.background = ggplot2::element_rect(fill = "white")) - return(plot_obj) -} diff --git a/R/plotMarkerExpression.R b/R/plotMarkerExpression.R deleted file mode 100644 index 317c2d8..0000000 --- a/R/plotMarkerExpression.R +++ /dev/null @@ -1,156 +0,0 @@ -#' @title Plot gene expression distribution from overall and cell type-specific perspective -#' -#' @description -#' This function generates density plots to visualize the distribution of gene expression values -#' for a specific gene across the overall dataset and within a specified cell type. -#' -#' @details -#' This function generates density plots to compare the distribution of a specific marker -#' gene between reference and query datasets. The aim is to inspect the alignment of gene expression -#' levels as a surrogate for dataset similarity. 
Similar distributions suggest a good alignment, -#' while differences may indicate discrepancies or incompatibilities between the datasets. -#' -#' @param query_data A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the query cells. -#' @param reference_data A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the reference cells. -#' @param query_cell_type_col character. The column name in the \code{colData} of \code{query_data} that identifies the cell types. -#' @param ref_cell_type_col character. The column name in the \code{colData} of \code{reference_data} that identifies the cell types. -#' @param gene_name character. A string representing the gene name for which the distribution is to be visualized. -#' @param label character. A vector of cell type labels to plot (e.g., c("T-cell", "B-cell")). -#' -#' @return A gtable object containing two arranged density plots as grobs. -#' The first plot shows the overall gene expression distribution, -#' and the second plot displays the cell type-specific expression -#' distribution. -#' -#' @examples -#' library(scater) -#' library(scran) -#' library(scRNAseq) -#' library(SingleR) -#' -#' # Load data -#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) -#' -#' # Divide the data into reference and query datasets -#' set.seed(100) -#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -#' ref_data <- sce[, indices] -#' query_data <- sce[, -indices] -#' -#' # Log transform datasets -#' ref_data <- logNormCounts(ref_data) -#' query_data <- logNormCounts(query_data) -#' -#' # Get cell type scores using SingleR or any other method -#' pred <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) -#' -#' # Add labels to query object -#' colData(query_data)$labels <- pred$labels -#' -#' # Note: Users can use SingleR or any other method to obtain the cell type annotations. 
-#' plotMarkerExpression(reference_data = ref_data, -#' query_data = query_data, -#' ref_cell_type_col = "reclustered.broad", -#' query_cell_type_col = "labels", -#' gene_name = "VPREB3", -#' label = "B_and_plasma") -#' -#' -#' @import ggplot2 -#' @importFrom ggplot2 ggplot -#' @importFrom gridExtra grid.arrange -#' @importFrom SummarizedExperiment assay -#' @import SingleCellExperiment -#' @export -plotMarkerExpression <- function(reference_data, - query_data, - ref_cell_type_col, - query_cell_type_col, - gene_name, - label) { - # Sanity checks - # Check if query_data is a SingleCellExperiment object - if (!is(query_data, "SingleCellExperiment")) { - stop("query_data must be a SingleCellExperiment object.") - } - - # Check if reference_data is a SingleCellExperiment object - if (!is(reference_data, "SingleCellExperiment")) { - stop("reference_data must be a SingleCellExperiment object.") - } - - # Check if gene_name is present in both query_data and reference_data - if (!(gene_name %in% rownames(assay(query_data)) && gene_name %in% - rownames(assay(reference_data)))) { - stop("gene_name: '", gene_name, "' is not present in the - row names of both query_data and reference_data.") - } - - # Check if all labels are present in query_data - if (!all(label %in% query_data[[query_cell_type_col]])) { - stop("One or more labels specified are not present in query_data.") - } - - # Check if all labels are present in reference_data - if (!all(label %in% reference_data[[ref_cell_type_col]])) { - stop("One or more labels specified are not present in reference_data.") - } - - # Get expression of the specified gene for reference and query datasets - reference_gene_expression <- assay(reference_data, "logcounts")[gene_name, ] - query_gene_expression <- assay(query_data, "logcounts")[gene_name, ] - - # Create a combined vector of gene expression values - combined_gene_expression <- c(reference_gene_expression, query_gene_expression) - - # Create a grouping vector for dataset labels 
- dataset_labels <- rep(c("Reference", "Query"), times = c(length(reference_gene_expression), - length(query_gene_expression))) - - # Combine the gene expression values and dataset labels - data <- data.frame( - GeneExpression = combined_gene_expression, - Dataset = dataset_labels - ) - - # Create a stacked density plot using ggplot2 for overall dataset - overall_plot <- ggplot(data, aes(x = GeneExpression, fill = Dataset)) + - geom_density(alpha = 0.5) + - labs(title = paste("Overall Distribution"), - x = paste("Log gene Expression", gene_name), - y = "Density") + - theme_minimal() - - # Create a subset of data for cell type-specific distribution - index1 <- which(reference_data[[ref_cell_type_col]] %in% label) - index2 <- which(query_data[[query_cell_type_col]] %in% label) - - reference_gene_expression_cell_type <- assay(reference_data, "logcounts")[gene_name, index1] - query_gene_expression_cell_type <- assay(query_data, "logcounts")[gene_name, index2] - - # Combine the gene expression values and dataset labels for cell type-specific - combined_gene_expression <- c(reference_gene_expression_cell_type, - query_gene_expression_cell_type) - - # Create a grouping vector for dataset labels - dataset_labels <- rep(c("Reference", "Query"), - times = c(length(reference_gene_expression_cell_type), - length(query_gene_expression_cell_type))) - - # Combine the gene expression values and dataset labels - cell_type_specific_data <- data.frame( - GeneExpression = combined_gene_expression, - Dataset = dataset_labels - ) - - # Create a stacked density plot using ggplot2 for cell type-specific dataset - cell_type_specific_plot <- ggplot(cell_type_specific_data, - aes(x = GeneExpression, fill = Dataset)) + - geom_density(alpha = 0.5) + - labs(title = paste("Cell Type-Specific Distribution"), - x = paste("Log gene Expression", gene_name), - y = "Density") + - theme_minimal() - - return(gridExtra::grid.arrange(overall_plot, cell_type_specific_plot, ncol = 2)) -} \ No newline at end 
of file diff --git a/R/plotQCvsAnnotation.R b/R/plotQCvsAnnotation.R deleted file mode 100644 index e56a7e6..0000000 --- a/R/plotQCvsAnnotation.R +++ /dev/null @@ -1,131 +0,0 @@ -#' Scatter plot: QC stats vs Cell Type Annotation Scores -#' -#' Creates a scatter plot to visualize the relationship between QC stats (e.g., library size) -#' and cell type annotation scores for one or more cell types. -#' -#' @details This function generates a scatter plot to explore the relationship between various quality -#' control (QC) statistics, such as library size and mitochondrial percentage, and cell type -#' annotation scores. By examining these relationships, users can assess whether specific QC -#' metrics systematically influence the confidence in cell type annotations, -#' which is essential for ensuring reliable cell type annotation. -#' -#' @param query_data A \code{\linkS4class{SingleCellExperiment}} containing the single-cell -#' expression data and metadata. -#' @param qc_col character. A column name in the \code{colData} of \code{query_data} that -#' contains the QC stats of interest. -#' @param label_col character. The column name in the \code{colData} of \code{query_data} -#' that contains the cell type labels. -#' @param score_col character. The column name in the \code{colData} of \code{query_data} that -#' contains the cell type annotation scores. -#' @param label character. A vector of cell type labels to plot (e.g., c("T-cell", "B-cell")). -#' Defaults to \code{NULL}, which will include all the cells. -#' -#' @return A ggplot object displaying a scatter plot of QC stats vs annotation scores, -#' where each point represents a cell, color-coded by its cell type. 
-#' -#' @examples -#' \donttest{ -#' library(celldex) -#' library(scater) -#' library(scran) -#' library(scRNAseq) -#' library(SingleR) -#' -#' # load reference dataset -#' ref_data <- fetchReference("hpca", "2024-02-26") -#' -#' # Load query dataset (Bunis haematopoietic stem and progenitor cell data) from -#' # Bunis DG et al. (2021). Single-Cell Mapping of Progressive Fetal-to-Adult -#' # Transition in Human Naive T Cells Cell Rep. 34(1): 108573 -#' query_data <- BunisHSPCData() -#' rownames(query_data) <- rowData(query_data)$Symbol -#' -#' # Add QC metrics to query data -#' query_data <- addPerCellQCMetrics(query_data) -#' -#' # Log transform query dataset -#' query_data <- logNormCounts(query_data) -#' -#' # Run SingleR to predict cell types -#' -#' pred <- SingleR(query_data, ref_data, labels = ref_data$label.main) -#' -#' # Assign predicted labels to query data -#' colData(query_data)$pred.labels <- pred$labels -#' -#' # Get annotation scores -#' scores <- apply(pred$scores, 1, max) -#' -#' # Assign scores to query data -#' colData(query_data)$cell_scores <- scores -#' -#' # Create a scatter plot between library size and annotation scores -#' -#' p1 <- plotQCvsAnnotation( -#' query_data = query_data, -#' qc_col = "total", -#' label_col = "pred.labels", -#' score_col = "cell_scores", -#' label = NULL) -#' p1 + xlab("Library Size") -#' } -#' -#' -#' @import ggplot2 -#' @export -#' -plotQCvsAnnotation <- function(query_data, - qc_col, - label_col, - score_col, - label = NULL) { - - # Sanity checks - - # Check if query_data is a SingleCellExperiment object - if (!is(query_data, "SingleCellExperiment")) { - stop("query_data must be a SingleCellExperiment object.") - } - - # Check if qc_col is a valid column name in query_data - if (!qc_col %in% colnames(colData(query_data))) { - stop("qc_col: '", qc_col, "' is not a valid column name in query_data.") - } - - # Check if label_col is a valid column name in query_data - if (!label_col %in% 
colnames(colData(query_data))) { - stop("label_col: '", label_col, "' is not a valid column name in query_data.") - } - - # Check if score_col is a valid column name in query_data - if (!score_col %in% colnames(colData(query_data))) { - stop("score_col: '", score_col, "' is not a valid column name in query_data.") - } - - # Filter cells based on label if specified - if (!is.null(label)) { - index <- which(colData(query_data)[[label_col]] %in% label) - query_data <- query_data[, index] - } - - # Extract QC stats, scores, and labels - qc_stats <- colData(query_data)[, qc_col] - cell_type_scores <- colData(query_data)[, score_col] - cell_labels <- colData(query_data)[[label_col]] - - # Combine QC stats, scores, and labels into a data frame - data <- data.frame(QCStats = qc_stats, - Scores = cell_type_scores, - CellType = cell_labels) - - # Create a scatter plot with color-coded points based on cell types or labels - plot <- ggplot(data, aes(x = QCStats, - y = Scores, - color = CellType)) + - geom_point() + - xlab("QC stats") + - ylab("Annotation Scores") + - theme_bw() - - return(plot) -} \ No newline at end of file diff --git a/R/projectPCA.R b/R/projectPCA.R deleted file mode 100644 index 6a5eaab..0000000 --- a/R/projectPCA.R +++ /dev/null @@ -1,180 +0,0 @@ -#' @title Project Query Data Onto PCA Space of Reference Data -#' -#' @description -#' This function projects a query SingleCellExperiment object onto the PCA space of a reference -#' SingleCellExperiment object. The PCA of the reference data is assumed to be pre-computed and stored within the object. -#' -#' @details -#' This function assumes that the "PCA" element exists within the \code{reducedDims} of the reference data -#' (obtained using \code{reducedDim(reference_data)}) and that the genes used for PCA are present in both the reference and query data. -#' It centers the query data using the reference gene means before projection; no scaling is applied. 
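The projection step described in these details can be sketched in base R. This is a minimal illustration on simulated matrices, not the package's implementation: `prcomp()` stands in for the pre-computed reference PCA, and the names `ref`, `query`, `ref_means`, and `rotation` are hypothetical. Like the `projectPCA` body in this diff, it centers the query with the reference gene means and passes `scale = FALSE`.

```r
# Minimal sketch with simulated data (cells in rows, genes in columns).
set.seed(1)
genes <- paste0("gene", 1:5)
ref   <- matrix(rnorm(200 * 5), nrow = 200, dimnames = list(NULL, genes))
query <- matrix(rnorm(50 * 5, mean = 0.3), nrow = 50, dimnames = list(NULL, genes))

# PCA on the reference only; prcomp() centers by default and does not scale.
pca <- prcomp(ref, center = TRUE, scale. = FALSE)
n_components <- 3
rotation <- pca$rotation[, seq_len(n_components)]

# Center the query with the *reference* gene means, then apply the rotation.
ref_means  <- colMeans(ref)[rownames(rotation)]
query_proj <- scale(query[, rownames(rotation)],
                    center = ref_means, scale = FALSE) %*% rotation

# Sanity check: projecting the reference itself reproduces its PCA scores.
ref_proj <- scale(ref, center = ref_means, scale = FALSE) %*% rotation
stopifnot(isTRUE(all.equal(unname(ref_proj),
                           unname(pca$x[, seq_len(n_components)]))))
```

Using the reference means rather than the query's own means keeps both datasets in the same coordinate system, so a shift between query and reference remains visible in the projected scores.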
-#' -#' @param query_data A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the query cells. -#' @param reference_data A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the reference cells. -#' @param n_components An integer specifying the number of principal components to use for projection. Defaults to 10. -#' Must be less than or equal to the number of components available in the reference PCA. -#' @param query_cell_type_col character. The column name in the \code{colData} of \code{query_data} -#' that identifies the cell types. -#' @param ref_cell_type_col character. The column name in the \code{colData} of \code{reference_data} -#' that identifies the cell types. -#' @param return_value A character string specifying the format of the returned data. Can be \code{data.frame} (combined reference -#' and query projections) or \code{list} (separate lists for reference and query projections) (default = \code{data.frame}). -#' -#' @return A \code{data.frame} containing the projected data in rows (reference and query data combined) or a \code{list} containing -#' separate matrices for reference and query projections, depending on the \code{return_value} argument. 
-#' -#' @export -#' -#' @author Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu} -#' -#' @examples -#' # Load required libraries -#' library(scRNAseq) -#' library(scuttle) -#' library(SingleR) -#' library(scran) -#' library(scater) -#' library(RColorBrewer) -#' -#' # Load data (replace with your data loading) -#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) -#' -#' # Divide the data into reference and query datasets -#' set.seed(100) -#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -#' ref_data <- sce[, indices] -#' query_data <- sce[, -indices] -#' -#' # log transform datasets -#' ref_data <- scuttle::logNormCounts(ref_data) -#' query_data <- scuttle::logNormCounts(query_data) -#' -#' # Get cell type scores using SingleR (or any other cell type annotation method) -#' scores <- SingleR::SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) -#' -#' # Add labels to query object -#' colData(query_data)$labels <- scores$labels -#' -#' # Selecting highly variable genes (can be customized by the user) -#' ref_var <- scran::getTopHVGs(ref_data, n = 2000) -#' query_var <- scran::getTopHVGs(query_data, n = 2000) -#' -#' # Intersect the gene symbols to obtain common genes -#' common_genes <- intersect(ref_var, query_var) -#' ref_data_subset <- ref_data[common_genes, ] -#' query_data_subset <- query_data[common_genes, ] -#' -#' # Run PCA on the reference data (assumed to be prepared) -#' ref_data_subset <- runPCA(ref_data_subset) -#' -#' # Project the query data onto PCA space of reference -#' pca_output <- projectPCA(query_data_subset, ref_data_subset, -#' n_components = 10, -#' query_cell_type_col = "labels", -#' ref_cell_type_col = "reclustered.broad", -#' return_value = c("data.frame", "list")[1]) -#' -#' # Compute t-SNE and UMAP using first 10 PCs -#' tsne_data <- data.frame(calculateTSNE(t(pca_output[, paste0("PC", 1:10)]))) -#' umap_data <- 
data.frame(calculateUMAP(t(pca_output[, paste0("PC", 1:10)]))) -#' -#' # Combine the cell type labels from both datasets -#' tsne_data$Type <- paste(pca_output$dataset, pca_output$cell_type) -#' -#' # Define the cell types and legend order -#' legend_order <- c("Query CD8", -#' "Reference CD8", -#' "Query CD4", -#' "Reference CD4", -#' "Query B_and_plasma", -#' "Reference B_and_plasma") -#' -#' # Define the colors for cell types -#' color_palette <- brewer.pal(length(legend_order), "Paired") -#' color_mapping <- setNames(color_palette, legend_order) -#' cell_type_colors <- color_mapping[legend_order] -#' -#' # Visualize t-SNE output -#' tsne_plot <- ggplot(tsne_data[tsne_data$Type %in% legend_order,], -#' aes(x = TSNE1, y = TSNE2, color = factor(Type, levels = legend_order))) + -#' geom_point(alpha = 0.5, size = 1) + -#' scale_color_manual(values = cell_type_colors) + -#' theme_bw() + -#' guides(color = guide_legend(title = "Cell Types")) -#' -#' -# Function to project query data onto PCA space of reference data -projectPCA <- function(query_data, reference_data, - n_components = 10, - query_cell_type_col = NULL, - ref_cell_type_col = NULL, - return_value = c("data.frame", "list")[1]){ - - # Check if query_data is a SingleCellExperiment object - if (!is(query_data, "SingleCellExperiment")) { - stop("query_data must be a SingleCellExperiment object.") - } - - # Check if reference_data is a SingleCellExperiment object - if (!is(reference_data, "SingleCellExperiment")) { - stop("reference_data must be a SingleCellExperiment object.") - } - - # Check if "PCA" is present in reference's reduced dimensions - if (!"PCA" %in% names(reducedDims(reference_data))) { - stop("Reference data must have pre-computed PCA in \'reducedDims\'.") - } - - # Check if n_components is a positive integer - if (!inherits(n_components, "numeric")) { - stop("n_components should be numeric") - } else if (any(!n_components == floor(n_components), n_components < 1)) { - stop("n_components should 
be an integer, greater than zero.") - } - - # Check if requested number of components is within available components - if (ncol(reducedDim(reference_data, "PCA")) < n_components) { - stop("\'n_components\' is larger than number of available components in reference PCA.") - } - - # Returning output as single matrix or a list - if (!return_value %in% c("data.frame", "list")) { - stop("Invalid \'return_value\'. Must be 'data.frame' or \'list\'.") - } - - # Extract reference PCA components and rotation matrix - ref_mat <- reducedDim(reference_data, "PCA")[, 1:n_components] - rotation_mat <- attributes(reducedDim(reference_data, "PCA"))$rotation[, 1:n_components] - PCA_genes <- rownames(rotation_mat) - - # Check if genes used for PCA are available in query data - if (!all(PCA_genes %in% rownames(assay(query_data)))) { - stop("Genes in reference PCA are not found in query data.") - } - - # Center and scale query data based on reference for projection - centering_vec <- apply(t(as.matrix(assay(reference_data, "logcounts"))), 2, mean)[PCA_genes] - query_mat <- scale(t(as.matrix(assay(query_data, "logcounts")))[, PCA_genes], center = centering_vec, scale = FALSE) %*% - rotation_mat - - # Returning output as single matrix or a list - if (return_value == "data.frame") { - return(data.frame(rbind(ref_mat, query_mat), - dataset = c(rep("Reference", nrow(ref_mat)), rep("Query", nrow(query_mat))), - cell_type = c(ifelse(rep(is.null(ref_cell_type_col), nrow(ref_mat)), - rep(NA, nrow(ref_mat)), - colData(reference_data)[[ref_cell_type_col]]), - ifelse(rep(is.null(query_cell_type_col), nrow(query_mat)), - rep(NA, nrow(query_mat)), - colData(query_data)[[query_cell_type_col]])))) - } else if (return_value == "list") { - return(list(ref = data.frame(ref_mat, - cell_type = ifelse(rep(is.null(ref_cell_type_col), nrow(ref_mat)), - rep(NA, nrow(ref_mat)), - colData(reference_data)[[ref_cell_type_col]])), - query = data.frame(query_mat, - cell_type = 
ifelse(rep(is.null(query_cell_type_col), nrow(query_mat)), - rep(NA, nrow(query_mat)), - colData(query_data)[[query_cell_type_col]])))) - } -} \ No newline at end of file diff --git a/R/regressPC.R b/R/regressPC.R deleted file mode 100644 index 5a2c8f8..0000000 --- a/R/regressPC.R +++ /dev/null @@ -1,273 +0,0 @@ - -#' Principal component regression -#' -#' This function performs linear regression of one or more principal -#' components on a covariate of interest, based on the data in a -#' SingleCellExperiment object. -#' -#' @details Principal component regression, derived from PCA, can be used to -#' quantify the variance explained by a covariate of interest. Applications for -#' single-cell analysis include quantification of batch removal, assessing -#' clustering homogeneity, and evaluation of alignment of query and reference -#' datasets in cell type annotation settings. Briefly, the R^2 is calculated -#' from a linear regression of each principal component on the covariate B of -#' interest. The variance contribution of the covariate effect per principal -#' component is then calculated as the product of the variance explained by -#' the ith principal component (PC) and the corresponding R^2(PC_i|B). The sum -#' across all variance contributions by the covariate effects in all principal -#' components gives the total variance explained by the covariate as follows: -#' -#' Var(C|B) = sum_{i=1}^G Var(C|PC_i) * R^2(PC_i|B) -#' -#' where Var(C|PC_i) is the variance of the data matrix C explained by the ith -#' principal component. See references. -#' -#' If the input is large (>3e4 cells) and the independent variable is -#' categorical with >10 categories, this function will use a stripped down -#' linear model function that is faster but doesn't return all the same -#' components. Namely, the \code{regression.summaries} component of the result -#' will contain only the R^2 values, nothing else. 
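As a worked illustration of the variance formula in these details, consider a hypothetical case with three principal components. The percentages and R^2 values below are made up for the arithmetic and are not results from any dataset; the percent units mirror the `percentVar` attribute that the function body reads.

```latex
\mathrm{Var}(C \mid B) \;=\; \sum_{i=1}^{G} \mathrm{Var}(C \mid PC_i)\, R^2(PC_i \mid B)

% Hypothetical inputs, G = 3:
%   Var(C | PC_1) = 40\%,  Var(C | PC_2) = 25\%,  Var(C | PC_3) = 10\%
%   R^2(PC_1 | B) = 0.80,  R^2(PC_2 | B) = 0.10,  R^2(PC_3 | B) = 0.50
\mathrm{Var}(C \mid B) \;=\; 40 \times 0.80 \;+\; 25 \times 0.10 \;+\; 10 \times 0.50
\;=\; 32 + 2.5 + 5 \;=\; 39.5\%
```

In this sketch the covariate dominates PC1 but explains little of PC2, so most of its total contribution comes from the first component.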
-#' -#' @param sce An object of class \code{\linkS4class{SingleCellExperiment}} -#' containing the data for regression analysis. -#' -#' @param dep.vars character. Dependent variable(s). Determines which principal -#' component(s) (e.g., "PC1", "PC2", etc.) are used as response variables. -#' Principal components are expected to be stored in a PC matrix named -#' \code{"PCA"} in the \code{reducedDims} of \code{sce}. Defaults to -#' \code{NULL}, in which case each principal component present in the PC -#' matrix is regressed on \code{indep.var}. -#' -#' @param indep.var character. Independent variable. A column name in the -#' \code{colData} of \code{sce} specifying the explanatory variable. -#' -#' @param regressPC_res A result from \code{\link{regressPC}}. -#' -#' @param max_pc The maximum number of PCs to show on the plot. Set to 0 to show -#' all. -#' -#' @return A \code{list} containing \itemize{ \item summaries of the linear -#' regression models for each specified principal component, \item the -#' corresponding R-squared (R2) values, \item the variance contributions for -#' each principal component, and \item the total variance explained.} -#' -#' @references Luecken et al. Benchmarking atlas-level data integration in -#' single-cell genomics. Nature Methods, 19:41-50, 2022. 
-#' -#' @examples -#' library(scater) -#' library(scran) -#' library(scRNAseq) -#' library(SingleR) -#' -#' # Load data -#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) -#' -#' # Divide the data into reference and query datasets -#' set.seed(100) -#' indices <- sample(ncol(sce), -#' size = floor(0.7 * ncol(sce)), -#' replace = FALSE -#' ) -#' ref <- sce[, indices] -#' query <- sce[, -indices] -#' -#' # log transform datasets -#' ref <- logNormCounts(ref) -#' query <- logNormCounts(query) -#' -#' # Run PCA -#' query <- runPCA(query) -#' -#' # Get cell type scores using SingleR -#' # Note: replace when using cell type annotation scores from other methods -#' scores <- SingleR(query, ref, labels = ref$reclustered.broad) -#' -#' # Add labels to query object -#' query$labels <- scores$labels -#' -#' # Specify the dependent variables (principal components) and -#' # independent variable (e.g., "labels") -#' dep.vars <- paste0("PC", 1:3) -#' indep.var <- "labels" -#' -#' # Perform linear regression on multiple principal components -#' res <- regressPC( -#' sce = query, -#' dep.vars = dep.vars, -#' indep.var = indep.var -#' ) -#' -#' # Obtain linear regression summaries and R-squared values -#' res$regression.summaries -#' res$rsquared -#' -#' -#' plotPCRegression(query, res, dep.vars, indep.var) -#' -#' @importFrom stats lm -#' @importFrom utils tail -#' @importFrom rlang .data -#' @import SingleCellExperiment -#' @export -regressPC <- - function( - sce, - dep.vars = NULL, - indep.var) { - ## sanity checks - stopifnot(is(sce, "SingleCellExperiment")) - stopifnot("PCA" %in% reducedDimNames(sce)) - - if (!is.null(dep.vars)) { - stopifnot(all(dep.vars %in% colnames(reducedDim(sce, "PCA")))) - } - - stopifnot(indep.var %in% colnames(colData(sce))) - - ## regress against all PCs if not instructed otherwise - if (is.null(dep.vars)) { - dep.vars <- colnames(reducedDim(sce, "PCA")) - } - - ## create a data frame with the dependent and independent variables - df 
<- data.frame( - Independent = sce[[indep.var]], - reducedDim(sce, "PCA")[, dep.vars] - ) - - ## perform linear regression for each principal component - .regress <- function(pc, df) { - f <- paste0(pc, " ~ Independent") - model <- lm(f, data = df) - s <- summary(model) - return(s) - } - - .regress_fast <- function(df) { - # This does the lms for large, categorical independent variables in - # one sweep. - ssts <- vapply( - df[, dep.vars], - \(x) sum((x - mean(x, na.rm = TRUE))^2, - na.rm = TRUE - ), - 1.0 - ) - - indp_list <- split( - df, - df$Independent - ) - - .get_sses <- function(x) { - vapply( - x[, dep.vars], - \(z) sum((z - mean(z, na.rm = TRUE))^2, - na.rm = TRUE - ), - 1.0 - ) - } - - sses <- rowSums(vapply( - indp_list, - .get_sses, - rep(1, length(dep.vars)) - )) - - s <- mapply( - \(err, tot) { - list( - "r.squared" = 1 - err / tot, - "regression.summaries" = NA - ) - }, - sses, ssts, - SIMPLIFY = FALSE - ) - - return(s) - } - - needs_fastlm <- (nrow(df) > 3e4) && - (is.character(df$Independent) || is.factor(df$Independent)) && - (length(unique(df$Independent)) > 10) - - if (needs_fastlm) { - summaries <- .regress_fast(df) - } else { - summaries <- lapply(dep.vars, .regress, df = df) - } - names(summaries) <- dep.vars - - ## calculate R-squared values - rsq <- vapply(summaries, `[[`, numeric(1), x = "r.squared") - - ## calculate variance contributions by principal component - ind <- match(dep.vars, colnames(reducedDim(sce, "PCA"))) - var.expl <- attr(reducedDim(sce, "PCA"), "percentVar")[ind] - var.contr <- var.expl * rsq - - ## calculate total variance explained by summing the variance contributions - total.var.expl <- sum(var.contr) - - ## return the summaries of the linear regression models, - ## R-squared values, and variance contributions - res <- list( - regression.summaries = summaries, - rsquared = rsq, - var.contributions = var.contr, - total.variance.explained = total.var.expl - ) - - res - } - -#' @rdname regressPC -#' @export 
-plotPCRegression <- function( - sce, - regressPC_res, - dep.vars = NULL, - indep.var, - max_pc = 20) { - - stopifnot(is(sce, "SingleCellExperiment")) - stopifnot("PCA" %in% reducedDimNames(sce)) - if (!is.null(dep.vars)) { - stopifnot(all(dep.vars %in% colnames(reducedDim(sce, "PCA")))) - } - stopifnot(indep.var %in% colnames(colData(sce))) - - if (is.null(dep.vars)) { - dep.vars <- colnames(reducedDim(sce, "PCA")) - } - - if (max_pc == 0) max_pc <- length(dep.vars) - - p2_input <- data.frame( - x = dep.vars[1:max_pc], - i = seq_along(dep.vars[1:max_pc]), - r2 = regressPC_res$rsquared[1:max_pc] - ) - - p2 <- ggplot2::ggplot(p2_input, aes(.data$i, .data$r2)) + - ggplot2::geom_point() + - ggplot2::geom_line() + - ggplot2::theme_bw() + - ggplot2::ylim(c(0, 1)) + - ggplot2::labs( - y = bquote(R^2 ~ of ~ "PC ~ " ~ .(indep.var)) - ) + - ggplot2::scale_x_continuous( - breaks = p2_input$i, - labels = p2_input$x - ) + - ggplot2::theme( - axis.title.x = ggplot2::element_blank(), - panel.grid.minor = ggplot2::element_blank() - ) - - return(p2) -} diff --git a/R/visualizeCellTypeMDS.R b/R/visualizeCellTypeMDS.R deleted file mode 100644 index 6e6bf33..0000000 --- a/R/visualizeCellTypeMDS.R +++ /dev/null @@ -1,138 +0,0 @@ -#' Visualizing Reference and Query Cell Types using MDS -#' -#' This function facilitates the assessment of similarity between reference and query datasets -#' through Multidimensional Scaling (MDS) scatter plots. It allows the visualization of cell types, -#' color-coded with user-defined custom colors, based on a dissimilarity matrix computed from a -#' user-selected gene set. -#' -#' @details To evaluate dataset similarity, the function selects specific subsets of cells from -#' both reference and query datasets. It then calculates Spearman correlations between gene expression profiles, -#' deriving a dissimilarity matrix. 
This matrix undergoes Classical Multidimensional Scaling (MDS) for -#' visualization, presenting cell types in a scatter plot, distinguished by colors defined by the user. -#' -#' @param query_data A \code{\linkS4class{SingleCellExperiment}} containing the single-cell -#' expression data and metadata. -#' @param reference_data A \code{\linkS4class{SingleCellExperiment}} object containing the single-cell -#' expression data and metadata. -#' @param cell_types A character vector specifying the cell types to include in the plot. If NULL, all cell types are included. -#' @param query_cell_type_col character. The column name in the \code{colData} of \code{query_data} -#' that identifies the cell types. -#' @param ref_cell_type_col character. The column name in the \code{colData} of \code{reference_data} -#' that identifies the cell types. -#' -#' @return A ggplot object representing the MDS scatter plot with cell type coloring. -#' -#' @examples -#' library(scater) -#' library(scran) -#' library(scRNAseq) -#' -#' # Load data (replace with your data loading) -#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) -#' -#' # Divide the data into reference and query datasets -#' set.seed(100) -#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -#' ref_data <- sce[, indices] -#' query_data <- sce[, -indices] -#' -#' # log transform datasets -#' ref_data <- scuttle::logNormCounts(ref_data) -#' query_data <- scuttle::logNormCounts(query_data) -#' -#' # Get cell type scores using SingleR (or any other cell type annotation method) -#' scores <- SingleR::SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) -#' -#' # Add labels to query object -#' colData(query_data)$labels <- scores$labels -#' -#' # Selecting highly variable genes (can be customized by the user) -#' ref_var <- scran::getTopHVGs(ref_data, n = 2000) -#' query_var <- scran::getTopHVGs(query_data, n = 2000) -#' -#' # Intersect the gene symbols to 
obtain common genes -#' common_genes <- intersect(ref_var, query_var) -#' ref_data_subset <- ref_data[common_genes, ] -#' query_data_subset <- query_data[common_genes, ] -#' -#' # Generate the MDS scatter plot with cell type coloring -#' plot <- visualizeCellTypeMDS(query_data = query_data_subset, -#' reference_data = ref_data_subset, -#' query_cell_type_col = "labels", -#' ref_cell_type_col = "reclustered.broad") -#' print(plot) -#' -#' @importFrom stats cmdscale cor -#' @importFrom ggplot2 ggplot -#' @importFrom SummarizedExperiment assay -#' @export -#' -visualizeCellTypeMDS <- function(query_data, - reference_data, - cell_types = NULL, - query_cell_type_col, - ref_cell_type_col) { - - # Check if query_data is a SingleCellExperiment object - if (!is(query_data, "SingleCellExperiment")) { - stop("query_data must be a SingleCellExperiment object.") - } - - # Check if reference_data is a SingleCellExperiment object - if (!is(reference_data, "SingleCellExperiment")) { - stop("reference_data must be a SingleCellExperiment object.") - } - - # Check if query_cell_type_col is a valid column name in query_data - if (!query_cell_type_col %in% names(colData(query_data))) { - stop("query_cell_type_col: '", query_cell_type_col, "' is not a valid column name in query_data.") - } - - # Check if ref_cell_type_col is a valid column name in reference_data - if (!ref_cell_type_col %in% names(colData(reference_data))) { - stop("ref_cell_type_col: '", ref_cell_type_col, "' is not a valid column name in reference_data.") - } - - # Check if cell types available in both single-cell experiments - if(!all(cell_types %in% reference_data[[ref_cell_type_col]]) || - !all(cell_types %in% query_data[[query_cell_type_col]])) - stop("One or more of the specified cell types are not available in \'reference_data\' or \'query_data\'.") - - # Cell types - if(is.null(cell_types)){ - cell_types <- na.omit(intersect(unique(query_data[[query_cell_type_col]]), 
unique(reference_data[[ref_cell_type_col]]))) - } - - # Subset data - query_data <- query_data[, which(query_data[[query_cell_type_col]] %in% cell_types)] - reference_data <- reference_data[, which(reference_data[[ref_cell_type_col]] %in% cell_types)] - - # Extract logcounts - queryExp <- as.matrix(assay(query_data, "logcounts")) - refExp <- as.matrix(assay(reference_data, "logcounts")) - - # Compute correlation and dissimilarity matrix - df <- cbind(queryExp, refExp) - corMat <- cor(df, method = "spearman") - disMat <- (1 - corMat) - cmd <- data.frame(cmdscale(disMat), c(rep("Query", ncol(queryExp)), rep("Reference", ncol(refExp))), - c(query_data[[query_cell_type_col]], reference_data[[ref_cell_type_col]])) - colnames(cmd) <- c("Dim1", "Dim2", "dataset", "cellType") - cmd <- na.omit(cmd) - cmd$cell_type_dataset <- paste(cmd$dataset, cmd$cellType, sep = " ") - - # Define the order of cell type and dataset combinations - order_combinations <- paste(rep(c("Reference", "Query"), length(cell_types)), rep(sort(cell_types), each = 2)) - cmd$cell_type_dataset <- factor(cmd$cell_type_dataset, levels = order_combinations) - - # Define the colors for cell types - color_mapping <- setNames(RColorBrewer::brewer.pal(length(order_combinations), "Paired"), order_combinations) - cell_type_colors <- color_mapping[order_combinations] - - plot <- ggplot2::ggplot(cmd, aes(x = Dim1, y = Dim2, color = cell_type_dataset)) + - ggplot2::geom_point(alpha = 0.5, size = 1) + - ggplot2::scale_color_manual(values = cell_type_colors, name = "Cell Types") + - ggplot2::theme_bw() + - ggplot2::guides(color = ggplot2::guide_legend(title = "Cell Types")) - return(plot) -} diff --git a/R/visualizeCellTypePCA.R b/R/visualizeCellTypePCA.R deleted file mode 100644 index 2a88d52..0000000 --- a/R/visualizeCellTypePCA.R +++ /dev/null @@ -1,144 +0,0 @@ -#' @title Visualize Principal Components for Different Cell Types -#' -#' @description -#' This function plots the principal components for different cell 
types in the query and reference datasets. -#' -#' @details -#' This function projects the query dataset onto the principal component space of the reference dataset and then visualizes the -#' specified principal components for the specified cell types. -#' It uses the `projectPCA` function to perform the projection and `ggplot2` to create the plots. -#' -#' @param query_data A \code{\linkS4class{SingleCellExperiment}} object containing a numeric expression matrix for the query cells. -#' @param reference_data A \code{\linkS4class{SingleCellExperiment}} object containing a numeric expression matrix for the reference cells. -#' @param n_components An integer specifying the number of principal components to use for projection. Defaults to 10. -#' Must be less than or equal to the number of components available in the reference PCA. -#' @param cell_types A character vector specifying the cell types to include in the plot. If NULL, all cell types are included. -#' @param query_cell_type_col character. The column name in the \code{colData} of \code{query_data} -#' that identifies the cell types. -#' @param ref_cell_type_col character. The column name in the \code{colData} of \code{reference_data} -#' that identifies the cell types. -#' @param pc_subset A numeric vector specifying which principal components to include in the plot. Default is PC1 to PC5. -#' -#' @return A ggplot object showing faceted scatter plots of pairs of the specified principal components for the given cell types and datasets. 
-#' -#' @export -#' -#' @author Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu} -#' -#' @examples -#' # Load required libraries -#' library(scRNAseq) -#' library(scuttle) -#' library(SingleR) -#' library(scran) -#' library(scater) -#' -#' # Load data (replace with your data loading) -#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) -#' -#' # Divide the data into reference and query datasets -#' set.seed(100) -#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -#' ref_data <- sce[, indices] -#' query_data <- sce[, -indices] -#' -#' # log transform datasets -#' ref_data <- scuttle::logNormCounts(ref_data) -#' query_data <- scuttle::logNormCounts(query_data) -#' -#' # Get cell type scores using SingleR (or any other cell type annotation method) -#' scores <- SingleR::SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) -#' -#' # Add labels to query object -#' colData(query_data)$labels <- scores$labels -#' -#' # Selecting highly variable genes (can be customized by the user) -#' ref_var <- scran::getTopHVGs(ref_data, n = 2000) -#' query_var <- scran::getTopHVGs(query_data, n = 2000) -#' -#' # Intersect the gene symbols to obtain common genes -#' common_genes <- intersect(ref_var, query_var) -#' ref_data_subset <- ref_data[common_genes, ] -#' query_data_subset <- query_data[common_genes, ] -#' -#' # Run PCA on the reference data (assumed to be prepared) -#' ref_data_subset <- runPCA(ref_data_subset) -#' -#' pc_plot <- visualizeCellTypePCA(query_data_subset, ref_data_subset, -#' n_components = 10, -#' cell_types = c("CD4", "CD8", "B_and_plasma", "Myeloid"), -#' query_cell_type_col = "labels", -#' ref_cell_type_col = "reclustered.broad", -#' pc_subset = c(1:5)) -#' pc_plot -#' -#' -#' @importFrom stats approxfun cancor density setNames -#' @importFrom utils combn -#' -# Function to plot PC for different cell types -visualizeCellTypePCA <- function(query_data, 
reference_data, - n_components = 10, - cell_types = NULL, - query_cell_type_col, - ref_cell_type_col, - pc_subset = c(1:5)){ - - # Cell types - if(is.null(cell_types)){ - cell_types <- na.omit(intersect(unique(query_data[[query_cell_type_col]]), unique(reference_data[[ref_cell_type_col]]))) - } - - # Get the projected PCA data - pca_output <- projectPCA(query_data = query_data, reference_data = reference_data, - n_components = n_components, - query_cell_type_col = query_cell_type_col, - ref_cell_type_col = ref_cell_type_col) - pca_output <- na.omit(pca_output) - - # Create all possible pairs of specified PCs - plot_names <- paste0("PC", pc_subset) - pairs <- expand.grid(x = plot_names, y = plot_names) - pairs <- pairs[pairs$x != pairs$y, ] - # Create a new data frame with all possible pairs of specified PCs - data_pairs_list <- lapply(1:nrow(pairs), function(i) { - x_col <- pairs$x[i] - y_col <- pairs$y[i] - data_frame <- data.frame(pca_output[, c(x_col, y_col)], paste(pca_output$dataset, pca_output$cell_type, sep = " ")) - colnames(data_frame) <- c("x_value", "y_value", "cell_type_dataset") - data_frame$x <- x_col - data_frame$y <- y_col - data_frame - }) - # Plot data - data_pairs <- do.call(rbind, data_pairs_list) - # Remove redundant data (to avoid duplicated plots) - data_pairs <- data_pairs[as.numeric(data_pairs$x) < as.numeric(data_pairs$y),] - - # Define the order of cell type and dataset combinations - order_combinations <- paste(rep(c("Reference", "Query"), length(cell_types)), rep(sort(cell_types), each = 2)) - data_pairs$cell_type_dataset <- factor(data_pairs$cell_type_dataset, levels = order_combinations) - color_mapping <- setNames(RColorBrewer::brewer.pal(length(order_combinations), "Paired"), order_combinations) - cell_type_colors <- color_mapping[order_combinations] - - # Create the ggplot object (with facets if PCA) - plot_obj <- ggplot2::ggplot(data_pairs, ggplot2::aes(x = x_value, y = y_value, color = cell_type_dataset)) + - 
ggplot2::geom_point(alpha = 0.5, size = 1) + - ggplot2::scale_color_manual(values = cell_type_colors, name = "Cell Types") + - ggplot2::facet_grid(rows = ggplot2::vars(y), cols = ggplot2::vars(x), scales = "free") + - ggplot2::theme_bw() + - ggplot2::theme(strip.background = ggplot2::element_rect(fill = "grey85", color = "grey70"), - strip.text = ggplot2::element_text(size = 10, face = "bold", color = "black"), - axis.title = ggplot2::element_blank(), - axis.text = ggplot2::element_text(size = 10), - panel.grid = ggplot2::element_blank(), - panel.background = ggplot2::element_rect(fill = "white", color = "black"), - legend.position = "right", - plot.title = ggplot2::element_text(size = 14, hjust = 0.5), - plot.background = ggplot2::element_rect(fill = "white")) - - # Return the plot - return(plot_obj) -} - - diff --git a/man/boxplotPCA.Rd b/man/boxplotPCA.Rd deleted file mode 100644 index af526d2..0000000 --- a/man/boxplotPCA.Rd +++ /dev/null @@ -1,101 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/boxplotPCA.R -\name{boxplotPCA} -\alias{boxplotPCA} -\title{Plot Principal Components for Different Cell Types} -\usage{ -boxplotPCA( - query_data, - reference_data, - n_components = 10, - cell_types = NULL, - query_cell_type_col = NULL, - ref_cell_type_col = NULL, - pc_subset = c(1:5) -) -} -\arguments{ -\item{query_data}{A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the query cells.} - -\item{reference_data}{A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the reference cells.} - -\item{n_components}{An integer specifying the number of principal components to use for projection. Defaults to 10. -Must be less than or equal to the number of components available in the reference PCA.} - -\item{cell_types}{A character vector specifying the cell types to include in the plot. 
If NULL, all cell types are included.} - -\item{query_cell_type_col}{character. The column name in the \code{colData} of \code{query_data} -that identifies the cell types.} - -\item{ref_cell_type_col}{character. The column name in the \code{colData} of \code{reference_data} -that identifies the cell types.} - -\item{pc_subset}{A numeric vector specifying which principal components to include in the plot. Default is PC1 to PC5.} -} -\value{ -A ggplot object representing the boxplots of specified principal components for the given cell types and datasets. -} -\description{ -This function generates a \code{ggplot2} boxplot visualization of principal components (PCs) for different -cell types across two datasets (query and reference). -} -\details{ -The function \code{boxplotPCA} is designed to provide a visualization of principal component analysis (PCA) results. It projects -the query dataset onto the principal components obtained from the reference dataset. The results are then visualized -as boxplots, grouped by cell types and datasets (query and reference). This allows for a comparative analysis of the -distributions of the principal components across different cell types and datasets. The function internally calls \code{projectPCA} -to perform the PCA projection. It then reshapes the output data into a long format suitable for ggplot2 plotting. -The color scheme is automatically determined using the \code{RColorBrewer} package, ensuring a visually distinct and appealing plot. 
-} -\examples{ -# Load required libraries -library(scRNAseq) -library(scuttle) -library(SingleR) -library(scran) -library(scater) - -# Load data (replace with your data loading) -sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) - -# Divide the data into reference and query datasets -set.seed(100) -indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -ref_data <- sce[, indices] -query_data <- sce[, -indices] - -# log transform datasets -ref_data <- scuttle::logNormCounts(ref_data) -query_data <- scuttle::logNormCounts(query_data) - -# Get cell type scores using SingleR (or any other cell type annotation method) -scores <- SingleR::SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) - -# Add labels to query object -colData(query_data)$labels <- scores$labels - -# Selecting highly variable genes (can be customized by the user) -ref_var <- scran::getTopHVGs(ref_data, n = 2000) -query_var <- scran::getTopHVGs(query_data, n = 2000) - -# Intersect the gene symbols to obtain common genes -common_genes <- intersect(ref_var, query_var) -ref_data_subset <- ref_data[common_genes, ] -query_data_subset <- query_data[common_genes, ] - -# Run PCA on the reference data (assumed to be prepared) -ref_data_subset <- runPCA(ref_data_subset) - -pc_plot <- boxplotPCA(query_data_subset, ref_data_subset, - n_components = 10, - cell_types = c("CD4", "CD8", "B_and_plasma", "Myeloid"), - query_cell_type_col = "labels", - ref_cell_type_col = "reclustered.broad", - pc_subset = c(1:5)) -pc_plot - - -} -\author{ -Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu} -} diff --git a/man/calculateAveragePairwiseCorrelation.Rd b/man/calculateAveragePairwiseCorrelation.Rd deleted file mode 100644 index 535f72c..0000000 --- a/man/calculateAveragePairwiseCorrelation.Rd +++ /dev/null @@ -1,117 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/calculateAveragePairwiseCorrelation.R 
-\name{calculateAveragePairwiseCorrelation} -\alias{calculateAveragePairwiseCorrelation} -\title{Compute Average Pairwise Correlation between Cell Types} -\usage{ -calculateAveragePairwiseCorrelation( - query_data, - reference_data, - n_components = 10, - query_cell_type_col, - ref_cell_type_col, - cell_types, - correlation_method -) -} -\arguments{ -\item{query_data}{A \code{\linkS4class{SingleCellExperiment}} containing the single-cell -expression data and metadata.} - -\item{reference_data}{A \code{\linkS4class{SingleCellExperiment}} object containing the single-cell -expression data and metadata.} - -\item{n_components}{The number of principal components to use for the dimensionality reduction of the data using PCA. Defaults to 10. -If set to \code{NULL} then no dimensionality reduction is performed and the assay data is used directly for computations.} - -\item{query_cell_type_col}{character. The column name in the \code{colData} of \code{query_data} -that identifies the cell types.} - -\item{ref_cell_type_col}{character. The column name in the \code{colData} of \code{reference_data} -that identifies the cell types.} - -\item{cell_types}{A character vector specifying the cell types to be analysed.} - -\item{correlation_method}{The correlation method to use for calculating pairwise correlations.} -} -\value{ -A matrix containing the average pairwise correlation values. - Rows and columns are labeled with the cell types. Each element - in the matrix represents the average correlation between a pair - of cell types. -} -\description{ -Computes the average pairwise correlations between specified cell types -in single-cell gene expression data. -} -\details{ -This function operates on \code{\linkS4class{SingleCellExperiment}} objects, -ideal for single-cell analysis workflows. It calculates pairwise correlations between query and -reference cells using a specified correlation method, then averages these correlations for each -cell type pair. 
This function aids in assessing the similarity between cells in reference and query datasets, -providing insights into the reliability of cell type annotations in single-cell gene expression data. -} -\examples{ -library(scater) -library(scran) -library(scRNAseq) -library(SingleR) - -# Load data -sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) - -# Divide the data into reference and query datasets -set.seed(100) -indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -ref_data <- sce[, indices] -query_data <- sce[, -indices] - -# log transform datasets -ref_data <- logNormCounts(ref_data) -query_data <- logNormCounts(query_data) - -# Get cell type scores using SingleR -scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) - -# Add labels to query object -colData(query_data)$labels <- scores$labels - -# Compute Pairwise Correlations -# Note: The selection of highly variable genes and desired cell types may vary -# based on user preference. -# The cell type annotation method used in this example is SingleR. -# User can use any other method for cell type annotation and provide -# the corresponding labels in the metadata. 
- -# Selecting highly variable genes -ref_var <- getTopHVGs(ref_data, n = 2000) -query_var <- getTopHVGs(query_data, n = 2000) - -# Intersect the gene symbols to obtain common genes -common_genes <- intersect(ref_var, query_var) - -# Select desired cell types -selected_cell_types <- c("CD4", "CD8", "B_and_plasma") -ref_data_subset <- ref_data[common_genes, ref_data$reclustered.broad \%in\% selected_cell_types] -query_data_subset <- query_data[common_genes, query_data$reclustered.broad \%in\% selected_cell_types] - -# Run PCA on the reference data -ref_data_subset <- runPCA(ref_data_subset) - -# Compute pairwise correlations -cor_matrix_avg <- calculateAveragePairwiseCorrelation(query_data = query_data_subset, - reference_data = ref_data_subset, - n_components = 10, - query_cell_type_col = "labels", - ref_cell_type_col = "reclustered.broad", - cell_types = selected_cell_types, - correlation_method = "spearman") - -# Visualize the results -plot(cor_matrix_avg) - - -} -\seealso{ -\code{\link{plot.calculateAveragePairwiseCorrelation}} -} diff --git a/man/calculateCategorizationEntropy.Rd b/man/calculateCategorizationEntropy.Rd deleted file mode 100644 index eea6d26..0000000 --- a/man/calculateCategorizationEntropy.Rd +++ /dev/null @@ -1,58 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/calculateCategorizationEntropy.R -\name{calculateCategorizationEntropy} -\alias{calculateCategorizationEntropy} -\title{Calculate Categorization Entropy} -\usage{ -calculateCategorizationEntropy( - X, - inverse_normal_transform = FALSE, - plot = TRUE, - verbose = TRUE -) -} -\arguments{ -\item{X}{a matrix of category scores} - -\item{inverse_normal_transform}{if TRUE, apply} - -\item{plot}{if TRUE, plot a histogram of the entropies} - -\item{verbose}{if TRUE, display messages about the calculations} -} -\value{ -A vector of entropy values for each column in X. 
-} -\description{ -This function takes a matrix of category scores (cell type by - cells) and calculates the entropy of the category probabilities for each - cell. This gives a sense of how confident the cell type assignments are. - High entropy = lots of plausible category assignments = low confidence. Low - entropy = only one or two plausible categories = high confidence. This is - confidence in the vernacular sense, not in the "confidence interval" - statistical sense. Also note that the entropy tells you nothing about - whether or not the assignments are correct -- see the other functionality - in the package for that. This functionality can be used for assessing how - comparatively confident different sets of assignments are (given that the - number of categories is the same). -} -\details{ -The function checks if X is already on the probability scale. - Otherwise, it applies softmax columnwise. - - You can think about entropies on a scale from 0 to a maximum that depends - on the number of categories. This is the function for entropy (minus input - checking): \code{entropy(p) = -sum(p*log(p))}. If that input vector p is a - uniform distribution over the \code{length(p)} categories, the entropy will - be as high as possible. -} -\examples{ -# Simulate 500 cells with scores on 4 possible cell types -X <- rnorm(500 * 4) |> matrix(nrow = 4) -X[1, 1:250] <- X[1, 1:250] + 5 # Make the first category highly scored in the first 250 cells - - -# The function will issue a message about softmaxing the scores, and the entropy histogram will be -# bimodal since we made half of the cells clearly category 1 while the other half are roughly even. 
-# entropy_scores <- calculateCategorizationEntropy(X) -} diff --git a/man/calculateHVGOverlap.Rd b/man/calculateHVGOverlap.Rd deleted file mode 100644 index d0fd7a7..0000000 --- a/man/calculateHVGOverlap.Rd +++ /dev/null @@ -1,65 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/calculateHVGOverlap.R -\name{calculateHVGOverlap} -\alias{calculateHVGOverlap} -\title{Calculate the Overlap Coefficient for Highly Variable Genes} -\usage{ -calculateHVGOverlap(reference_genes, query_genes) -} -\arguments{ -\item{reference_genes}{character. A vector of highly variable genes from the reference dataset.} - -\item{query_genes}{character. A vector of highly variable genes from the query dataset.} -} -\value{ -Overlap coefficient, a value between 0 and 1, where 0 indicates no overlap - and 1 indicates complete overlap of highly variable genes between datasets. -} -\description{ -Calculates the overlap coefficient between the sets of highly variable genes -from a reference dataset and a query dataset. -} -\details{ -The overlap coefficient measures the similarity between two gene sets, indicating how well-aligned -reference and query datasets are in terms of their highly variable genes. This metric is -useful in single-cell genomics to understand the correspondence between different datasets. - -The coefficient is calculated using the formula: - -\deqn{Coefficient(X, Y) = \frac{|X \cap Y|}{min(|X|, |Y|)}} - -where X and Y are the sets of highly variable genes from the reference and query datasets, respectively, -|X ∩ Y| is the number of genes common to both X and Y, and min(|X|, |Y|) is the size of the smaller set among X and Y. 
-} -\examples{ -library(scater) -library(scran) -library(scRNAseq) -library(SingleR) - -# Load data -sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) - -# Divide the data into reference and query datasets -set.seed(100) -indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -ref_data <- sce[, indices] -query_data <- sce[, -indices] - -# log transform datasets -ref_data <- logNormCounts(ref_data) -query_data <- logNormCounts(query_data) - -# Selecting highly variable genes - -ref_var <- getTopHVGs(ref_data, n=2000) -query_var <- getTopHVGs(query_data, n=2000) - -overlap_coefficient <- calculateHVGOverlap(reference_genes = ref_var, - query_genes = query_var) - -} -\references{ -Luecken et al. Benchmarking atlas-level data integration in -single-cell genomics. Nature Methods, 19:41-50, 2022. -} diff --git a/man/calculateHotellingPValue.Rd b/man/calculateHotellingPValue.Rd deleted file mode 100644 index 39e5605..0000000 --- a/man/calculateHotellingPValue.Rd +++ /dev/null @@ -1,95 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/calculateHotellingPValue.R -\name{calculateHotellingPValue} -\alias{calculateHotellingPValue} -\title{Perform Hotelling's T-squared Test on PCA Scores for Single-cell RNA-seq Data} -\usage{ -calculateHotellingPValue( - query_data, - reference_data, - n_components = 10, - query_cell_type_col, - ref_cell_type_col, - pc_subset = c(1:5) -) -} -\arguments{ -\item{query_data}{A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the query cells.} - -\item{reference_data}{A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the reference cells.} - -\item{n_components}{An integer specifying the number of principal components to use for projection. Defaults to 10.} - -\item{query_cell_type_col}{character. 
The column name in the \code{colData} of \code{query_data} -that identifies the cell types.} - -\item{ref_cell_type_col}{character. The column name in the \code{colData} of \code{reference_data} -that identifies the cell types.} - -\item{pc_subset}{A numeric vector specifying which principal components to include in the plot. Default is PC1 to PC5.} -} -\value{ -A named numeric vector of p-values from Hotelling's T-squared test for each cell type. -} -\description{ -This function performs Hotelling's T-squared test to assess the similarity between reference and query datasets -for each cell type based on their PCA scores. -} -\details{ -This function first performs PCA on the reference dataset and then projects the query dataset onto the PCA space -of the reference data. For each cell type, it computes pseudo-bulk signatures in the PCA space by averaging the principal -component scores of cells belonging to that cell type. Hotelling's T-squared test is then performed to compare the mean -vectors of the pseudo-bulk signatures between the reference and query datasets. The resulting p-values indicate the similarity -between the reference and query datasets for each cell type. 
-} -\examples{ -# Load required libraries -library(scRNAseq) -library(scuttle) -library(SingleR) -library(scran) -library(scater) - -# Load data (replace with your data loading) -sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) - -# Divide the data into reference and query datasets -set.seed(100) -indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -ref_data <- sce[, indices] -query_data <- sce[, -indices] - -# log transform datasets -ref_data <- scuttle::logNormCounts(ref_data) -query_data <- scuttle::logNormCounts(query_data) - -# Get cell type scores using SingleR (or any other cell type annotation method) -scores <- SingleR::SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) - -# Add labels to query object -colData(query_data)$labels <- scores$labels - -# Selecting highly variable genes (can be customized by the user) -ref_var <- scran::getTopHVGs(ref_data, n = 2000) -query_var <- scran::getTopHVGs(query_data, n = 2000) - -# Intersect the gene symbols to obtain common genes -common_genes <- intersect(ref_var, query_var) -ref_data_subset <- ref_data[common_genes, ] -query_data_subset <- query_data[common_genes, ] - -# Run PCA on the reference data -ref_data_subset <- runPCA(ref_data_subset, ncomponents = 50) - -# Get the p-values from the test -p_values <- calculateHotellingPValue(query_data_subset, ref_data_subset, - n_components = 10, - query_cell_type_col = "reclustered.broad", - ref_cell_type_col = "reclustered.broad", - pc_subset = c(1:10)) -round(p_values, 5) - -} -\author{ -Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu} -} diff --git a/man/calculatePairwiseDistancesAndPlotDensity.Rd b/man/calculatePairwiseDistancesAndPlotDensity.Rd deleted file mode 100644 index 28f70a9..0000000 --- a/man/calculatePairwiseDistancesAndPlotDensity.Rd +++ /dev/null @@ -1,108 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in 
R/calculatePairwiseDistancesAndPlotDensity.R -\name{calculatePairwiseDistancesAndPlotDensity} -\alias{calculatePairwiseDistancesAndPlotDensity} -\title{Pairwise Distance Analysis and Density Visualization} -\usage{ -calculatePairwiseDistancesAndPlotDensity( - query_data, - reference_data, - n_components = 10, - query_cell_type_col, - ref_cell_type_col, - cell_type_query, - cell_type_reference, - distance_metric, - correlation_method = "pearson" -) -} -\arguments{ -\item{query_data}{A \code{\linkS4class{SingleCellExperiment}} containing the single-cell -expression data and metadata.} - -\item{reference_data}{A \code{\linkS4class{SingleCellExperiment}} object containing the single-cell -expression data and metadata.} - -\item{n_components}{The number of principal components to use for the dimensionality reduction of the data using PCA. Defaults to 10. -If set to \code{NULL} then no dimensionality reduction is performed and the assay data is used directly for computations.} - -\item{query_cell_type_col}{character. The column name in the \code{colData} of \code{query_data} -that identifies the cell types.} - -\item{ref_cell_type_col}{character. The column name in the \code{colData} of \code{reference_data} -that identifies the cell types.} - -\item{cell_type_query}{The query cell type for which distances or correlations are calculated.} - -\item{cell_type_reference}{The reference cell type for which distances or correlations are calculated.} - -\item{distance_metric}{The distance metric to use for calculating pairwise distances, such as euclidean, manhattan etc. -Set it to "correlation" for calculating correlation coefficients.} - -\item{correlation_method}{The correlation method to use when distance_metric is "correlation". -Possible values: "pearson", "spearman".} -} -\value{ -A plot generated by \code{ggplot2}, showing the density distribution of - calculated distances or correlations. 
-} -\description{ -Calculates pairwise distances or correlations between query and reference cells -of a specific cell type. -} -\details{ -The function works with \code{\linkS4class{SingleCellExperiment}} objects, ensuring -compatibility with common single-cell analysis workflows. It subsets the data for specified cell types, -computes pairwise distances or correlations, and visualizes these measurements using density plots. By comparing the distances and correlations, -one can evaluate the consistency and reliability of annotated cell types within single-cell datasets. -} -\examples{ -library(scran) -library(scRNAseq) -library(SingleR) -library(scater) - -# Load data -sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) - -# Divide the data into reference and query datasets -set.seed(100) -indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -ref_data <- sce[, indices] -query_data <- sce[, -indices] - -# log transform datasets -ref_data <- logNormCounts(ref_data) -query_data <- logNormCounts(query_data) - -# Get cell type scores using SingleR (or any other cell type annotation method) -scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) - -# Add labels to query object -colData(query_data)$labels <- scores$labels - -# Selecting highly variable genes (can be customized by the user) -ref_var <- getTopHVGs(ref_data, n = 2000) -query_var <- getTopHVGs(query_data, n = 2000) - -# Intersect the gene symbols to obtain common genes -common_genes <- intersect(ref_var, query_var) - -ref_data_subset <- ref_data[common_genes, ] -query_data_subset <- query_data[common_genes, ] - -# Run PCA on the reference data -ref_data_subset <- runPCA(ref_data_subset) - -# Example usage of the function -calculatePairwiseDistancesAndPlotDensity(query_data = query_data_subset, - reference_data = ref_data_subset, - n_components = 10, - query_cell_type_col = "labels", - ref_cell_type_col = "reclustered.broad", - 
cell_type_query = "CD8", - cell_type_reference = "CD8", - distance_metric = "euclidean") - - -} diff --git a/man/calculateSampleDistances.Rd b/man/calculateSampleDistances.Rd deleted file mode 100644 index 0adaa02..0000000 --- a/man/calculateSampleDistances.Rd +++ /dev/null @@ -1,111 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/calculateSampleDistances.R -\name{calculateSampleDistances} -\alias{calculateSampleDistances} -\title{Compute Sample Distances Between Reference and Query Data} -\usage{ -calculateSampleDistances( - query_data, - reference_data, - query_cell_type_col, - ref_cell_type_col, - n_components = 10, - pc_subset = c(1:5) -) -} -\arguments{ -\item{query_data}{A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the query cells.} - -\item{reference_data}{A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the reference cells.} - -\item{query_cell_type_col}{character. The column name in the \code{colData} of \code{query_data} -that identifies the cell types.} - -\item{ref_cell_type_col}{character. The column name in the \code{colData} of \code{reference_data} -that identifies the cell types.} - -\item{n_components}{An integer specifying the number of principal components to use for projection. Defaults to 10.} - -\item{pc_subset}{A numeric vector specifying which principal components to include in the plot. Default is PC1 to PC5.} -} -\value{ -A list containing distance data for each cell type. 
Each entry in the list contains: -\describe{ - \item{ref_distances}{A vector of all pairwise distances within the reference subset for the cell type.} - \item{query_to_ref_distances}{A matrix of distances from each query sample to all reference samples for the cell type.} -} -} -\description{ -This function computes the distances within the reference dataset and the distances from each query sample to all -reference samples for each cell type. It uses PCA for dimensionality reduction and Euclidean distance for distance calculation. -} -\details{ -The function first performs PCA on the reference dataset and projects the query dataset onto the same PCA space. -It then computes pairwise Euclidean distances within the reference dataset for each cell type, as well as distances from each -query sample to all reference samples of a particular cell type. The results are stored in a list, with one entry per cell type. -} -\examples{ -# Load required libraries -library(scRNAseq) -library(scuttle) -library(SingleR) -library(scran) -library(scater) - -# Load data (replace with your data loading) -sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) - -# Divide the data into reference and query datasets -set.seed(100) -indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -ref_data <- sce[, indices] -query_data <- sce[, -indices] - -# log transform datasets -ref_data <- scuttle::logNormCounts(ref_data) -query_data <- scuttle::logNormCounts(query_data) - -# Get cell type scores using SingleR (or any other cell type annotation method) -scores <- SingleR::SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) - -# Add labels to query object -colData(query_data)$labels <- scores$labels - -# Selecting highly variable genes (can be customized by the user) -ref_var <- getTopHVGs(ref_data, n = 2000) -query_var <- getTopHVGs(query_data, n = 2000) - -# Intersect the gene symbols to obtain common genes -common_genes <- 
intersect(ref_var, query_var) -ref_data_subset <- ref_data[common_genes, ] -query_data_subset <- query_data[common_genes, ] - -# Run PCA on the reference data -ref_data_subset <- runPCA(ref_data_subset) - -# Plot the PC data -distance_data <- calculateSampleDistances(query_data_subset, ref_data_subset, - n_components = 10, - query_cell_type_col = "labels", - ref_cell_type_col = "reclustered.broad", - pc_subset = c(1:10)) - -# Identify outliers for CD4 -cd4_anomalies <- detectAnomaly(ref_data_subset, query_data_subset, - query_cell_type_col = "labels", - ref_cell_type_col = "reclustered.broad", - n_components = 10, - n_tree = 500, - anomaly_treshold = 0.5)$CD4 -cd4_top5_anomalies <- names(sort(cd4_anomalies$query_anomaly_scores, decreasing = TRUE)[1:6]) - -# Plot the densities of the distances -plot(distance_data, ref_cell_type = "CD4", sample_names = cd4_top5_anomalies) - -} -\seealso{ -\code{\link{plot.calculateSampleDistances}} -} -\author{ -Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu} -} diff --git a/man/calculateSampleDistancesSimilarity.Rd b/man/calculateSampleDistancesSimilarity.Rd deleted file mode 100644 index e530677..0000000 --- a/man/calculateSampleDistancesSimilarity.Rd +++ /dev/null @@ -1,125 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/calculateSampleDistancesSimilarity.R -\name{calculateSampleDistancesSimilarity} -\alias{calculateSampleDistancesSimilarity} -\title{Function to compute Bhattacharyya coefficients and Hellinger distances} -\usage{ -calculateSampleDistancesSimilarity( - query_data, - reference_data, - query_cell_type_col, - ref_cell_type_col, - sample_names, - n_components = 10, - pc_subset = c(1:5) -) -} -\arguments{ -\item{query_data}{A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the query cells.} - -\item{reference_data}{A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the 
reference cells.} - -\item{query_cell_type_col}{character. The column name in the \code{colData} of \code{query_data} -that identifies the cell types.} - -\item{ref_cell_type_col}{character. The column name in the \code{colData} of \code{reference_data} -that identifies the cell types.} - -\item{sample_names}{A character vector specifying the names of the query samples for which to compute distance measures.} - -\item{n_components}{An integer specifying the number of principal components to use for projection. Defaults to 10.} - -\item{pc_subset}{A numeric vector specifying which principal components to include in the plot. Default is PC1 to PC5.} -} -\value{ -A list containing distance data for each cell type. Each entry in the list contains: -\describe{ - \item{ref_distances}{A vector of all pairwise distances within the reference subset for the cell type.} - \item{query_to_ref_distances}{A matrix of distances from each query sample to all reference samples for the cell type.} -} -} -\description{ -This function computes Bhattacharyya coefficients and Hellinger distances to quantify the similarity of density -distributions between query samples and reference data for each cell type. -} -\details{ -This function first computes distance data using the \code{calculateSampleDistances} function, which calculates -pairwise distances between samples within the reference data and between query samples and reference samples in the PCA space. -Bhattacharyya coefficients and Hellinger distances are calculated to quantify the similarity of density distributions between query -samples and reference data for each cell type. Bhattacharyya coefficient measures the similarity of two probability distributions, -while Hellinger distance measures the distance between two probability distributions. - -Bhattacharyya coefficients range between 0 and 1. 
A value closer to 1 indicates higher similarity between distributions, while a value -closer to 0 indicates lower similarity. - -Hellinger distances range between 0 and 1. A value closer to 0 indicates higher similarity between distributions, while a value -closer to 1 indicates lower similarity. -} -\examples{ -# Load required libraries -library(scRNAseq) -library(scuttle) -library(SingleR) -library(scran) -library(scater) - -# Load data (replace with your data loading) -sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) - -# Divide the data into reference and query datasets -set.seed(100) -indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -ref_data <- sce[, indices] -query_data <- sce[, -indices] - -# log transform datasets -ref_data <- scuttle::logNormCounts(ref_data) -query_data <- scuttle::logNormCounts(query_data) - -# Get cell type scores using SingleR (or any other cell type annotation method) -scores <- SingleR::SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) - -# Add labels to query object -colData(query_data)$labels <- scores$labels - -# Selecting highly variable genes (can be customized by the user) -ref_var <- scran::getTopHVGs(ref_data, n = 2000) -query_var <- scran::getTopHVGs(query_data, n = 2000) - -# Intersect the gene symbols to obtain common genes -common_genes <- intersect(ref_var, query_var) -ref_data_subset <- ref_data[common_genes, ] -query_data_subset <- query_data[common_genes, ] - -# Run PCA on the reference data -ref_data_subset <- runPCA(ref_data_subset, ncomponents = 50) - -# Plot the PC data -distance_data <- calculateSampleDistances(query_data_subset, ref_data_subset, - n_components = 10, - query_cell_type_col = "labels", - ref_cell_type_col = "reclustered.broad", - pc_subset = c(1:10)) - -# Identify outliers for CD4 -cd4_anomalies <- detectAnomaly(ref_data_subset, query_data_subset, - query_cell_type_col = "labels", - ref_cell_type_col = "reclustered.broad", - 
n_components = 10, - n_tree = 500, - anomaly_treshold = 0.5)$CD4 -cd4_top5_anomalies <- names(sort(cd4_anomalies$query_anomaly_scores, decreasing = TRUE)[1:6]) - -# Get overlap measures -overlap_measures <- calculateSampleDistancesSimilarity(query_data_subset,ref_data_subset, - sample_names = cd4_top5_anomalies, - n_components = 10, - query_cell_type_col = "labels", - ref_cell_type_col = "reclustered.broad", - pc_subset = c(1:10)) - - -} -\author{ -Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu} -} diff --git a/man/calculateSampleSimilarityPCA.Rd b/man/calculateSampleSimilarityPCA.Rd deleted file mode 100644 index 66e24fb..0000000 --- a/man/calculateSampleSimilarityPCA.Rd +++ /dev/null @@ -1,98 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/calculateSampleSimilarityPCA.R -\name{calculateSampleSimilarityPCA} -\alias{calculateSampleSimilarityPCA} -\title{Calculate Sample Similarity Using PCA Loadings} -\usage{ -calculateSampleSimilarityPCA( - se_object, - samples, - pc_subset = c(1:5), - n_top_vars = 50 -) -} -\arguments{ -\item{se_object}{A \code{\linkS4class{SingleCellExperiment}} object containing expression data.} - -\item{samples}{A character vector specifying the samples for which to compute the similarity.} - -\item{pc_subset}{A numeric vector specifying the subset of principal components to consider (default: c(1:5)).} - -\item{n_top_vars}{An integer indicating the number of top loading variables to consider for each PC (default: 50).} -} -\value{ -A data frame containing cosine similarity values between samples for each selected principal component. -} -\description{ -This function calculates the cosine similarity between samples based on the principal components (PCs) -obtained from PCA (Principal Component Analysis) loadings. -} -\details{ -This function calculates the cosine similarity between samples based on the loadings of the selected -principal components obtained from PCA. 
It extracts the rotation matrix from the PCA results of the -\code{\linkS4class{SingleCellExperiment}} object and identifies the high-loading variables for each selected PC. -Then, it computes the cosine similarity between samples using the high-loading variables for each PC. -} -\examples{ -# Load required libraries -library(scRNAseq) -library(scuttle) -library(SingleR) -library(scran) -library(scater) - -# Load data (replace with your data loading) -sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) - -# Divide the data into reference and query datasets -set.seed(100) -indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -ref_data <- sce[, indices] -query_data <- sce[, -indices] - -# log transform datasets -ref_data <- scuttle::logNormCounts(ref_data) -query_data <- scuttle::logNormCounts(query_data) - -# Get cell type scores using SingleR (or any other cell type annotation method) -scores <- SingleR::SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) - -# Add labels to query object -colData(query_data)$labels <- scores$labels - -# Selecting highly variable genes (can be customized by the user) -ref_var <- scran::getTopHVGs(ref_data, n = 2000) -query_var <- scran::getTopHVGs(query_data, n = 2000) - -# Intersect the gene symbols to obtain common genes -common_genes <- intersect(ref_var, query_var) -ref_data_subset <- ref_data[common_genes, ] -query_data_subset <- query_data[common_genes, ] - -# Run PCA on the reference data (assumed to be prepared) -ref_data_subset <- runPCA(ref_data_subset) - -# Store PCA anomaly data and plots -anomaly_output <- detectAnomaly(reference_data = ref_data_subset, - ref_cell_type_col = "reclustered.broad", - n_components = 10, - n_tree = 500, - anomaly_treshold = 0.5) -top6_anomalies <- names(sort(anomaly_output$Combined$reference_anomaly_scores, - decreasing = TRUE)[1:6]) - -# Compute cosine similarity between anomalies and top PCs -cosine_similarities <- 
calculateSampleSimilarityPCA(ref_data_subset, samples = top6_anomalies, - pc_subset = c(1:10), n_top_vars = 50) -cosine_similarities - -# Plot similarities -plot(cosine_similarities, pc_subset = c(1:5)) - -} -\seealso{ -\code{\link{plot.calculateSampleSimilarityPCA}} -} -\author{ -Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu} -} diff --git a/man/calculateVarImpOverlap.Rd b/man/calculateVarImpOverlap.Rd deleted file mode 100644 index 8571519..0000000 --- a/man/calculateVarImpOverlap.Rd +++ /dev/null @@ -1,93 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/calculateVarImpOverlap.R -\name{calculateVarImpOverlap} -\alias{calculateVarImpOverlap} -\title{Compare Gene Importance Across Datasets Using Random Forest} -\usage{ -calculateVarImpOverlap( - query_data, - reference_data, - query_cell_type_col, - ref_cell_type_col, - n_tree = 500, - n_top = 20 -) -} -\arguments{ -\item{query_data}{A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the query cells.} - -\item{reference_data}{A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the reference cells.} - -\item{query_cell_type_col}{A character string specifying the column name in the query dataset containing cell type annotations.} - -\item{ref_cell_type_col}{A character string specifying the column name in the reference dataset containing cell type annotations.} - -\item{n_tree}{An integer specifying the number of trees to grow in the Random Forest. Default is 500.} - -\item{n_top}{An integer specifying the number of top genes to consider when comparing variable importance scores. 
Default is 20.} -} -\value{ -A list containing three elements: -\item{var_imp_ref}{A list of data frames containing variable importance scores for each combination of cell types in the reference -dataset.} -\item{var_imp_query}{A list of data frames containing variable importance scores for each combination of cell types in the query -dataset.} -\item{var_imp_comparison}{A named vector indicating the proportion of top genes that overlap between the reference and query -datasets for each combination of cell types.} -} -\description{ -This function identifies and compares the most important genes for differentiating cell types between a query dataset -and a reference dataset using Random Forest. -} -\details{ -This function uses the Random Forest algorithm to calculate the importance of genes in differentiating between cell types -within both a reference dataset and a query dataset. The function then compares the top genes identified in both datasets to determine -the overlap in their importance scores. 
-} -\examples{ -# Load necessary libraries -library(scRNAseq) -library(scuttle) -library(SingleR) -library(scran) - -# Load data -sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) - -# Divide the data into reference and query datasets -set.seed(100) -indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -ref_data <- sce[, indices] -query_data <- sce[, -indices] - -# log transform datasets -ref_data <- logNormCounts(ref_data) -query_data <- logNormCounts(query_data) - -# Get cell type scores using SingleR (or any other cell type annotation method) -scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) - -# Add labels to query object -colData(query_data)$labels <- scores$labels - -# Selecting highly variable genes (can be customized by the user) -ref_var <- getTopHVGs(ref_data, n = 500) -query_var <- getTopHVGs(query_data, n = 500) - -# Intersect the gene symbols to obtain common genes -common_genes <- intersect(ref_var, query_var) -ref_data_subset <- ref_data[common_genes, ] -query_data_subset <- query_data[common_genes, ] - -# Compare variable importance between the query and reference datasets -rf_output <- calculateVarImpOverlap(query_data_subset, ref_data_subset, - query_cell_type_col = "labels", - ref_cell_type_col = "reclustered.broad", - n_tree = 500, - n_top = 20) - - -} -\author{ -Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu} -} diff --git a/man/compareCCA.Rd b/man/compareCCA.Rd deleted file mode 100644 index 38a1d29..0000000 --- a/man/compareCCA.Rd +++ /dev/null @@ -1,100 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/compareCCA.R -\name{compareCCA} -\alias{compareCCA} -\title{Compare Subspaces Spanned by Top Principal Components Using Canonical Correlation Analysis} -\usage{ -compareCCA(reference_data, query_data, pc_subset = c(1:5), n_top_vars = 25) -} -\arguments{ -\item{reference_data}{A \code{\linkS4class{SingleCellExperiment}} object containing numeric
expression matrix for the reference cells.} - -\item{query_data}{A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the query cells.} - -\item{pc_subset}{A numeric vector specifying the subset of principal components (PCs) -to compare. Default is the first five PCs.} - -\item{n_top_vars}{An integer indicating the number of top loading variables to consider for each PC. Default is 25.} -} -\value{ -A list containing the following elements: -\describe{ - \item{coef_ref}{Canonical coefficients for the reference dataset.} - \item{coef_query}{Canonical coefficients for the query dataset.} - \item{cosine_similarity}{Cosine similarity values for the canonical variables.} - \item{correlations}{Canonical correlations between the reference and query datasets.} -} -} -\description{ -This function compares the subspaces spanned by the top principal components (PCs) of the reference -and query datasets using canonical correlation analysis (CCA). It calculates the canonical variables, -correlations, and a similarity measure for the subspaces. -} -\details{ -This function performs canonical correlation analysis (CCA) to compare the subspaces spanned by the -top principal components (PCs) of the reference and query datasets. The function extracts the rotation -matrices corresponding to the specified PCs and performs CCA on these matrices. It computes the canonical -variables and their corresponding correlations. Additionally, it calculates a similarity measure for the -canonical variables using cosine similarity. The output is a list containing the canonical coefficients -for both datasets, the cosine similarity values, and the canonical correlations. 
-} -\examples{ -# Load necessary libraries -library(scRNAseq) -library(scuttle) -library(scran) -library(SingleR) -library(ggplot2) -library(scater) - -# Load data -sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) - -# Divide the data into reference and query datasets -set.seed(100) -indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -ref_data <- sce[, indices] -query_data <- sce[, -indices] - -# Log transform datasets -ref_data <- logNormCounts(ref_data) -query_data <- logNormCounts(query_data) - -# Get cell type scores using SingleR (or any other cell type annotation method) -scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) - -# Add labels to query object -colData(query_data)$labels <- scores$labels - -# Selecting highly variable genes (can be customized by the user) -ref_var <- getTopHVGs(ref_data, n = 500) -query_var <- getTopHVGs(query_data, n = 500) - -# Intersect the gene symbols to obtain common genes -common_genes <- intersect(ref_var, query_var) -ref_data_subset <- ref_data[common_genes, ] -query_data_subset <- query_data[common_genes, ] - -# Subset reference and query data for a specific cell type -ref_data_subset <- ref_data_subset[, which(ref_data_subset$reclustered.broad == "CD8")] -query_data_subset <- query_data_subset[, which(colData(query_data_subset)$labels == "CD8")] - -# Run PCA on the reference and query datasets -ref_data_subset <- runPCA(ref_data_subset, ncomponents = 50) -query_data_subset <- runPCA(query_data_subset, ncomponents = 50) - -# Compare CCA (arguments named to match the documented signature) -cca_comparison <- compareCCA(reference_data = ref_data_subset, query_data = query_data_subset, - pc_subset = c(1:5), n_top_vars = 25) - -# Visualize output of CCA comparison -plot(cca_comparison) - - -} -\seealso{ -\code{\link{plot.compareCCA}} -} -\author{ -Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu} -} diff --git a/man/comparePCA.Rd b/man/comparePCA.Rd deleted file mode 100644 index c801530..0000000 ---
a/man/comparePCA.Rd +++ /dev/null @@ -1,108 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/comparePCA.R -\name{comparePCA} -\alias{comparePCA} -\title{Compare Principal Components Analysis (PCA) Results} -\usage{ -comparePCA( - reference_data, - query_data, - pc_subset = c(1:5), - n_top_vars = 50, - metric = c("cosine", "correlation")[1], - correlation_method = c("spearman", "pearson")[1] -) -} -\arguments{ -\item{reference_data}{A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the reference cells.} - -\item{query_data}{A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the query cells.} - -\item{pc_subset}{A numeric vector specifying the subset of principal components (PCs) to compare. Default is the first five PCs.} - -\item{n_top_vars}{An integer indicating the number of top loading variables to consider for each PC. Default is 50.} - -\item{metric}{The similarity metric to use. It can be either "cosine" or "correlation". Default is "cosine".} - -\item{correlation_method}{The correlation method to use if metric is "correlation". It can be "spearman" -or "pearson". Default is "spearman".} -} -\value{ -A similarity matrix comparing the principal components of the reference and query datasets. -Each element (i, j) in the matrix represents the similarity between the i-th principal component -of the reference dataset and the j-th principal component of the query dataset. -} -\description{ -This function compares the principal components (PCs) obtained from separate PCA on reference and query -datasets for a single cell type using either cosine similarity or correlation. -} -\details{ -This function compares the PCA results between the reference and query datasets by computing cosine -similarities or correlations between the loadings of top variables for each pair of principal components. 
It first -extracts the PCA rotation matrices from both datasets and identifies the top variables with highest loadings for -each PC. Then, it computes the cosine similarities or correlations between the loadings of top variables for each -pair of PCs. The resulting matrix contains the similarity values, where rows represent reference PCs and columns -represent query PCs. -} -\examples{ -# Load necessary library -library(scRNAseq) -library(scuttle) -library(scran) -library(SingleR) -library(ComplexHeatmap) -library(scater) - -# Load data -sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) - -# Divide the data into reference and query datasets -set.seed(100) -indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -ref_data <- sce[, indices] -query_data <- sce[, -indices] - -# Log transform datasets -ref_data <- logNormCounts(ref_data) -query_data <- logNormCounts(query_data) - -# Get cell type scores using SingleR (or any other cell type annotation method) -scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) - -# Add labels to query object -colData(query_data)$labels <- scores$labels - -# Selecting highly variable genes (can be customized by the user) -ref_var <- getTopHVGs(ref_data, n = 500) -query_var <- getTopHVGs(query_data, n = 500) - -# Intersect the gene symbols to obtain common genes -common_genes <- intersect(ref_var, query_var) -ref_data_subset <- ref_data[common_genes, ] -query_data_subset <- query_data[common_genes, ] - -# Subset reference and query data for a specific cell type -ref_data_subset <- ref_data_subset[, which(ref_data_subset$reclustered.broad == "CD8")] -query_data_subset <- query_data_subset[, which(colData(query_data_subset)$labels == "CD8")] - -# Run PCA on the reference and query datasets -ref_data_subset <- runPCA(ref_data_subset) -query_data_subset <- runPCA(query_data_subset) - -# Call the PCA comparison function -similarity_mat <- comparePCA(query_data_subset, 
ref_data_subset, - pc_subset = c(1:5), - n_top_vars = 50, - metric = c("cosine", "correlation")[1], - correlation_method = c("spearman", "pearson")[1]) - -# Create the heatmap -plot(similarity_mat) - -} -\seealso{ -\code{\link{plot.comparePCA}} -} -\author{ -Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu} -} diff --git a/man/comparePCASubspace.Rd b/man/comparePCASubspace.Rd deleted file mode 100644 index f510670..0000000 --- a/man/comparePCASubspace.Rd +++ /dev/null @@ -1,100 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/comparePCASubspace.R -\name{comparePCASubspace} -\alias{comparePCASubspace} -\title{Compare Subspaces Spanned by Top Principal Components} -\usage{ -comparePCASubspace( - reference_data, - query_data, - pc_subset = c(1:5), - n_top_vars = 50 -) -} -\arguments{ -\item{reference_data}{A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the reference cells.} - -\item{query_data}{A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the query cells.} - -\item{pc_subset}{A numeric vector specifying the subset of principal components (PCs) to compare. Default is the first five PCs.} - -\item{n_top_vars}{An integer indicating the number of top loading variables to consider for each PC. Default is 50.} -} -\value{ -A list containing the following components: - \item{principal_angles_cosines}{A numeric vector of cosine values of principal angles.} - \item{average_variance_explained}{A numeric vector of average variance explained by each PC.} - \item{weighted_cosine_similarity}{A numeric value representing the weighted cosine similarity.} -} -\description{ -This function compares the subspace spanned by the top principal components (PCs) in a reference dataset to that -in a query dataset. 
It computes the cosine similarity between the loadings of the top variables for each PC in -both datasets and provides a weighted cosine similarity score. -} -\details{ -This function compares the subspace spanned by the top principal components (PCs) in a reference dataset -to that in a query dataset. It first computes the cosine similarity between the loadings of the top variables -for each PC in both datasets. The top cosine similarity scores are then selected, and their corresponding PC -indices are stored. Additionally, the function calculates the average percentage of variance explained by the -selected top PCs. Finally, it computes a weighted cosine similarity score based on the top cosine similarities -and the average percentage of variance explained. -} -\examples{ -# Load necessary library -library(scRNAseq) -library(scuttle) -library(scran) -library(SingleR) -library(ggplot2) -library(scater) - -# Load data -sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) - -# Divide the data into reference and query datasets -set.seed(100) -indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -ref_data <- sce[, indices] -query_data <- sce[, -indices] - -# Log transform datasets -ref_data <- logNormCounts(ref_data) -query_data <- logNormCounts(query_data) - -# Get cell type scores using SingleR (or any other cell type annotation method) -scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) - -# Add labels to query object -colData(query_data)$labels <- scores$labels - -# Selecting highly variable genes (can be customized by the user) -ref_var <- getTopHVGs(ref_data, n = 500) -query_var <- getTopHVGs(query_data, n = 500) - -# Intersect the gene symbols to obtain common genes -common_genes <- intersect(ref_var, query_var) -ref_data_subset <- ref_data[common_genes, ] -query_data_subset <- query_data[common_genes, ] - -# Subset reference and query data for a specific cell type -ref_data_subset <- 
ref_data_subset[, which(ref_data_subset$reclustered.broad == "CD8")] -query_data_subset <- query_data_subset[, which(colData(query_data_subset)$labels == "CD8")] - -# Run PCA on the reference and query datasets -ref_data_subset <- runPCA(ref_data_subset, ncomponents = 50) -query_data_subset <- runPCA(query_data_subset, ncomponents = 50) - -# Compare PCA subspaces (arguments named to match the documented signature) -subspace_comparison <- comparePCASubspace(reference_data = ref_data_subset, query_data = query_data_subset, - pc_subset = c(1:5), n_top_vars = 50) - -# Plot the subspace comparison -plot(subspace_comparison) - -} -\seealso{ -\code{\link{plot.comparePCASubspace}} -} -\author{ -Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu} -} diff --git a/man/detectAnomaly.Rd b/man/detectAnomaly.Rd deleted file mode 100644 index ba1a8c9..0000000 --- a/man/detectAnomaly.Rd +++ /dev/null @@ -1,110 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/detectAnomaly.R -\name{detectAnomaly} -\alias{detectAnomaly} -\title{PCA Anomaly Scores via Isolation Forests with Visualization} -\usage{ -detectAnomaly( - reference_data, - query_data = NULL, - ref_cell_type_col, - query_cell_type_col, - n_components = 10, - n_tree = 500, - anomaly_treshold = 0.5, - ... -) -} -\arguments{ -\item{reference_data}{A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the reference cells.} - -\item{query_data}{An optional \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the query cells. -If NULL, then the isolation forest anomaly scores are computed for the reference data.
Default is NULL.} - -\item{ref_cell_type_col}{A character string specifying the column name in the reference dataset containing cell type annotations.} - -\item{query_cell_type_col}{A character string specifying the column name in the query dataset containing cell type annotations.} - -\item{n_components}{An integer specifying the number of principal components to use. Default is 10.} - -\item{n_tree}{An integer specifying the number of trees for the isolation forest. Default is 500.} - -\item{anomaly_treshold}{A numeric value specifying the threshold for identifying anomalies. Default is 0.5.} - -\item{...}{Additional arguments passed to the \code{isolation.forest} function.} -} -\value{ -A list containing the following components for each cell type and the combined data: -\item{anomaly_scores}{Anomaly scores for each cell in the query data.} -\item{anomaly}{Logical vector indicating whether each cell is classified as an anomaly.} -\item{reference_mat_subset}{PCA projections of the reference data.} -\item{query_mat_subset}{PCA projections of the query data (if provided).} -\item{var_explained}{Proportion of variance explained by the retained principal components.} -} -\description{ -This function detects anomalies in single-cell data by projecting the data onto a PCA space and using an isolation forest -algorithm to identify anomalies. -} -\details{ -This function projects the query data onto the PCA space of the reference data. An isolation forest is then built on the -reference data to identify anomalies in the query data based on their PCA projections. If no query dataset is provided by the user, -the anomaly scores are computed on the reference data itself. Anomaly scores for the data with all combined cell types are also -provided as part of the output.
-} -\examples{ -# Load required libraries -library(scRNAseq) -library(scuttle) -library(SingleR) -library(scran) -library(scater) - -# Load data -sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) - -# Divide the data into reference and query datasets -set.seed(100) -indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -ref_data <- sce[, indices] -query_data <- sce[, -indices] - -# log transform datasets -ref_data <- logNormCounts(ref_data) -query_data <- logNormCounts(query_data) - -# Get cell type scores using SingleR (or any other cell type annotation method) -scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) - -# Add labels to query object -colData(query_data)$labels <- scores$labels - -# Selecting highly variable genes (can be customized by the user) -ref_var <- getTopHVGs(ref_data, n = 2000) -query_var <- getTopHVGs(query_data, n = 2000) - -# Intersect the gene symbols to obtain common genes -common_genes <- intersect(ref_var, query_var) -ref_data_subset <- ref_data[common_genes, ] -query_data_subset <- query_data[common_genes, ] - -# Run PCA on the reference data -ref_data_subset <- runPCA(ref_data_subset) - -# Store PCA anomaly data and plots -anomaly_output <- detectAnomaly(ref_data_subset, query_data_subset, - ref_cell_type_col = "reclustered.broad", - query_cell_type_col = "labels", - n_components = 10, - n_tree = 500, - anomaly_treshold = 0.5) - -# Plot the output for a cell type -plot(anomaly_output, cell_type = "CD8", pc_subset = c(1:5), data_type = "query") - -} -\seealso{ -\code{\link{plot.detectAnomaly}} -} -\author{ -Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu} -} diff --git a/man/figures/CMD-Scatter-Plot-1.png b/man/figures/CMD-Scatter-Plot-1.png deleted file mode 100644 index b97d2c8..0000000 Binary files a/man/figures/CMD-Scatter-Plot-1.png and /dev/null differ diff --git a/man/figures/Cell-Type-Correlation-Analysis-Visualization-1.png 
b/man/figures/Cell-Type-Correlation-Analysis-Visualization-1.png deleted file mode 100644 index 794447d..0000000 Binary files a/man/figures/Cell-Type-Correlation-Analysis-Visualization-1.png and /dev/null differ diff --git a/man/figures/Gene-Expression-Histogram-1.png b/man/figures/Gene-Expression-Histogram-1.png deleted file mode 100644 index 5393ef3..0000000 Binary files a/man/figures/Gene-Expression-Histogram-1.png and /dev/null differ diff --git a/man/figures/Gene-Expression-Scatter-1.png b/man/figures/Gene-Expression-Scatter-1.png deleted file mode 100644 index 196a78a..0000000 Binary files a/man/figures/Gene-Expression-Scatter-1.png and /dev/null differ diff --git a/man/figures/Mito-Genes-Vs-Annotation-1.png b/man/figures/Mito-Genes-Vs-Annotation-1.png deleted file mode 100644 index f633ca9..0000000 Binary files a/man/figures/Mito-Genes-Vs-Annotation-1.png and /dev/null differ diff --git a/man/figures/Pairwise-Distance-Analysis-Density-Visualization-1.png b/man/figures/Pairwise-Distance-Analysis-Density-Visualization-1.png deleted file mode 100644 index 63af85f..0000000 Binary files a/man/figures/Pairwise-Distance-Analysis-Density-Visualization-1.png and /dev/null differ diff --git a/man/figures/Pairwise-Distance-Correlation-Based-Density-Visualization-1.png b/man/figures/Pairwise-Distance-Correlation-Based-Density-Visualization-1.png deleted file mode 100644 index a31491e..0000000 Binary files a/man/figures/Pairwise-Distance-Correlation-Based-Density-Visualization-1.png and /dev/null differ diff --git a/man/figures/Pathway-Scores-on-Dimensional-Reduction-Scatter-1.png b/man/figures/Pathway-Scores-on-Dimensional-Reduction-Scatter-1.png deleted file mode 100644 index d6cfb8c..0000000 Binary files a/man/figures/Pathway-Scores-on-Dimensional-Reduction-Scatter-1.png and /dev/null differ diff --git a/man/figures/QC-Annotation-Scatter-AllCellTypes-1.png b/man/figures/QC-Annotation-Scatter-AllCellTypes-1.png deleted file mode 100644 index 93e3650..0000000 Binary 
files a/man/figures/QC-Annotation-Scatter-AllCellTypes-1.png and /dev/null differ diff --git a/man/figures/QC-Annotation-Scatter-Mito-1.png b/man/figures/QC-Annotation-Scatter-Mito-1.png deleted file mode 100644 index dfba242..0000000 Binary files a/man/figures/QC-Annotation-Scatter-Mito-1.png and /dev/null differ diff --git a/man/figures/Scatter-Plot-LibrarySize-Vs-Annotation-Scores-1.png b/man/figures/Scatter-Plot-LibrarySize-Vs-Annotation-Scores-1.png deleted file mode 100644 index 0dc51b8..0000000 Binary files a/man/figures/Scatter-Plot-LibrarySize-Vs-Annotation-Scores-1.png and /dev/null differ diff --git a/man/figures/Scatter-Plot-QC-Stats-Vs-Annotation-Scores-1.png b/man/figures/Scatter-Plot-QC-Stats-Vs-Annotation-Scores-1.png deleted file mode 100644 index 1feb9ae..0000000 Binary files a/man/figures/Scatter-Plot-QC-Stats-Vs-Annotation-Scores-1.png and /dev/null differ diff --git a/man/histQCvsAnnotation.Rd b/man/histQCvsAnnotation.Rd deleted file mode 100644 index 3c3b2b8..0000000 --- a/man/histQCvsAnnotation.Rd +++ /dev/null @@ -1,92 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/histQCvsAnnotation.R -\name{histQCvsAnnotation} -\alias{histQCvsAnnotation} -\title{Histograms: QC Stats and Annotation Scores Visualization} -\usage{ -histQCvsAnnotation( - query_data, - qc_col = qc_col, - label_col, - score_col, - label = NULL -) -} -\arguments{ -\item{query_data}{A \code{\linkS4class{SingleCellExperiment}} containing the single-cell -expression data and metadata.} - -\item{qc_col}{character. A column name in the \code{colData} of \code{query_data} that -contains the QC stats of interest.} - -\item{label_col}{character. The column name in the \code{colData} of \code{query_data} -that contains the cell type labels.} - -\item{score_col}{numeric. The column name in the \code{colData} of \code{query_data} that -contains the cell type scores.} - -\item{label}{character. 
A vector of cell type labels to plot (e.g., c("T-cell", "B-cell")). -Defaults to \code{NULL}, which will include all the cells.} -} -\value{ -An object containing two histograms displayed side by side. -The first histogram represents the distribution of QC stats, -and the second histogram represents the distribution of annotation scores. -} -\description{ -This function generates histograms for visualizing the distribution of quality control (QC) statistics and -annotation scores associated with cell types in single-cell genomic data. -} -\details{ -This function is particularly useful in the analysis of data from single-cell experiments, -where understanding the distribution of these metrics is crucial for quality assessment and -interpretation of cell type annotations. -} -\examples{ -\donttest{ -library(scater) -library(scran) -library(scRNAseq) -library(SingleR) -library(gridExtra) - -# Load data -sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) - -# Divide the data into reference and query datasets -set.seed(100) -indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -ref_data <- sce[, indices] -query_data <- sce[, -indices] - -# Log-transform datasets -ref_data <- logNormCounts(ref_data) -query_data <- logNormCounts(query_data) - -# Get cell type scores using SingleR -pred <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) - -# Assign labels to query data -colData(query_data)$labels <- pred$labels - -# Get annotation scores -scores <- apply(pred$scores, 1, max) - -# Assign scores to query data -colData(query_data)$cell_scores <- scores - -# Generate histograms -histQCvsAnnotation(query_data = query_data, - qc_col = "percent.mito", - label_col = "labels", - score_col = "cell_scores", - label = c("CD4", "CD8")) - -histQCvsAnnotation(query_data = query_data, - qc_col = "percent.mito", - label_col = "labels", - score_col = "cell_scores", - label = NULL) -} - -} diff --git a/man/nearestNeighborDiagnostics.Rd
b/man/nearestNeighborDiagnostics.Rd deleted file mode 100644 index b45211a..0000000 --- a/man/nearestNeighborDiagnostics.Rd +++ /dev/null @@ -1,106 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/nearestNeighborDiagnostics.R -\name{nearestNeighborDiagnostics} -\alias{nearestNeighborDiagnostics} -\title{Calculate Nearest Neighbor Diagnostics for Cell Type Classification} -\usage{ -nearestNeighborDiagnostics( - query_data, - reference_data, - n_neighbor = 15, - n_components = 10, - pc_subset = c(1:10), - query_cell_type_col, - ref_cell_type_col -) -} -\arguments{ -\item{query_data}{A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the query cells.} - -\item{reference_data}{A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the reference cells.} - -\item{n_neighbor}{An integer specifying the number of nearest neighbors to consider. Default is 15.} - -\item{n_components}{An integer specifying the number of principal components to use for dimensionality reduction. Default is 10.} - -\item{pc_subset}{A vector specifying the subset of principal components to use in the analysis. Default is c(1:10).} - -\item{query_cell_type_col}{A character string specifying the column name in the query dataset containing cell type annotations.} - -\item{ref_cell_type_col}{A character string specifying the column name in the reference dataset containing cell type annotations.} -} -\value{ -A list where each element corresponds to a cell type and contains two vectors: -\item{prob_ref}{The probabilities of each query sample belonging to the reference dataset.} -\item{prob_query}{The probabilities of each query sample belonging to the query dataset.} -The list is assigned the class \code{"nearestNeighborDiagnostics"}.
-} -\description{ -This function computes the probabilities of each sample belonging to either the reference or query dataset for -each cell type using nearest neighbor analysis. -} -\details{ -This function performs a nearest neighbor search to calculate the probability of each sample in the query dataset -belonging to the reference dataset for each cell type. It uses principal component analysis (PCA) to reduce the dimensionality -of the data before performing the nearest neighbor search. The function balances the sample sizes between the reference and query -datasets by data augmentation if necessary. -} -\examples{ -# Load necessary libraries -library(scRNAseq) -library(scuttle) -library(scran) -library(SingleR) -library(scater) - -# Load data -sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) - -# Divide the data into reference and query datasets -set.seed(100) -indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -ref_data <- sce[, indices] -query_data <- sce[, -indices] - -# log transform datasets -ref_data <- logNormCounts(ref_data) -query_data <- logNormCounts(query_data) - -# Get cell type scores using SingleR (or any other cell type annotation method) -scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) - -# Add labels to query object -colData(query_data)$labels <- scores$labels - -# Selecting highly variable genes (can be customized by the user) -ref_var <- getTopHVGs(ref_data, n = 500) -query_var <- getTopHVGs(query_data, n = 500) - -# Intersect the gene symbols to obtain common genes -common_genes <- intersect(ref_var, query_var) -ref_data_subset <- ref_data[common_genes, ] -query_data_subset <- query_data[common_genes, ] - -# Run PCA on the reference data -ref_data_subset <- runPCA(ref_data_subset) - -# Compute nearest neighbor diagnostics -nn_output <- nearestNeighborDiagnostics(query_data_subset, ref_data_subset, - n_neighbor = 15, - n_components = 10, -
pc_subset = c(1:10), - query_cell_type_col = "labels", - ref_cell_type_col = "reclustered.broad") - -# Plot output -plot(nn_output, cell_types = c("CD4", "CD8", "B_and_plasma", "Myeloid"), - prob_type = "query") - - -} -\seealso{ -\code{\link{plot.nearestNeighborDiagnostics}} -} -\author{ -Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu} -} diff --git a/man/plot.calculateAveragePairwiseCorrelation.Rd b/man/plot.calculateAveragePairwiseCorrelation.Rd deleted file mode 100644 index 6caf9af..0000000 --- a/man/plot.calculateAveragePairwiseCorrelation.Rd +++ /dev/null @@ -1,88 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/plot.calculateAveragePairwiseCorrelation.R -\name{plot.calculateAveragePairwiseCorrelation} -\alias{plot.calculateAveragePairwiseCorrelation} -\title{Plot the output of the calculateAveragePairwiseCorrelation function} -\usage{ -\method{plot}{calculateAveragePairwiseCorrelation}(x, ...) -} -\arguments{ -\item{x}{Output matrix from calculateAveragePairwiseCorrelation function.} - -\item{...}{Additional arguments to be passed to the plotting function.} -} -\value{ -A ggplot2 object representing the heatmap plot. -} -\description{ -This function takes the output of the calculateAveragePairwiseCorrelation function, -which should be a matrix of pairwise correlations, and plots it as a heatmap. -} -\details{ -This function converts the correlation matrix into a dataframe, creates a heatmap using ggplot2, -and customizes the appearance of the heatmap with updated colors and improved aesthetics. 
-} -\examples{ -library(scater) -library(scran) -library(scRNAseq) -library(SingleR) - -# Load data -sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) - -# Divide the data into reference and query datasets -set.seed(100) -indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -ref_data <- sce[, indices] -query_data <- sce[, -indices] - -# log transform datasets -ref_data <- logNormCounts(ref_data) -query_data <- logNormCounts(query_data) - -# Get cell type scores using SingleR -scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) - -# Add labels to query object -colData(query_data)$labels <- scores$labels - -# Compute Pairwise Correlations -# Note: The selection of highly variable genes and desired cell types may vary -# based on user preference. -# The cell type annotation method used in this example is SingleR. -# User can use any other method for cell type annotation and provide -# the corresponding labels in the metadata. 
- -# Selecting highly variable genes -ref_var <- getTopHVGs(ref_data, n = 2000) -query_var <- getTopHVGs(query_data, n = 2000) - -# Intersect the gene symbols to obtain common genes -common_genes <- intersect(ref_var, query_var) - -# Select desired cell types -selected_cell_types <- c("CD4", "CD8", "B_and_plasma") -ref_data_subset <- ref_data[common_genes, ref_data$reclustered.broad \%in\% selected_cell_types] -query_data_subset <- query_data[common_genes, query_data$reclustered.broad \%in\% selected_cell_types] - -# Run PCA on the reference data -ref_data_subset <- runPCA(ref_data_subset) - -# Compute pairwise correlations -cor_matrix_avg <- calculateAveragePairwiseCorrelation(query_data = query_data_subset, - reference_data = ref_data_subset, - n_components = 10, - query_cell_type_col = "labels", - ref_cell_type_col = "reclustered.broad", - cell_types = selected_cell_types, - correlation_method = "spearman") - -# Visualize the results -plot(cor_matrix_avg) - - -} -\seealso{ -\code{\link{calculateAveragePairwiseCorrelation}} -} diff --git a/man/plot.calculateSampleDistances.Rd b/man/plot.calculateSampleDistances.Rd deleted file mode 100644 index 52e082f..0000000 --- a/man/plot.calculateSampleDistances.Rd +++ /dev/null @@ -1,99 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/plot.calculateSampleDistances.R -\name{plot.calculateSampleDistances} -\alias{plot.calculateSampleDistances} -\title{Plot Distance Density Comparison for a Specific Cell Type and Selected Samples} -\usage{ -\method{plot}{calculateSampleDistances}(x, ref_cell_type, sample_names, ...) 
-} -\arguments{ -\item{x}{A list containing the distance data computed by \code{calculateSampleDistances}.} - -\item{ref_cell_type}{A string specifying the reference cell type.} - -\item{sample_names}{A character vector specifying the query sample names for which to plot the distances.} - -\item{...}{Additional arguments passed to the plotting function.} -} -\value{ -A ggplot2 density plot comparing the reference distances and the distances from the specified samples to the reference samples. -} -\description{ -This function plots the density functions for the reference data and the distances from the specified query samples -to all reference samples within a specified cell type. -} -\details{ -The function first checks if the specified cell type and sample names are present in \code{x}. If the -specified cell type or sample name is not found, an error is thrown. It then extracts the distances within the reference dataset -and the distances from the specified query sample to the reference samples. The function creates a density plot using \code{ggplot2} -to compare the distance distributions. The density plot will show two distributions: one for the pairwise distances within the -reference dataset and one for the distances from the specified query sample to each reference sample. These distributions are -plotted in different colors to visually assess how similar the query sample is to the reference samples of the specified cell type.
-} -\examples{ -# Load required libraries -library(scRNAseq) -library(scuttle) -library(SingleR) -library(scran) -library(scater) - -# Load data (replace with your data loading) -sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) - -# Divide the data into reference and query datasets -set.seed(100) -indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -ref_data <- sce[, indices] -query_data <- sce[, -indices] - -# log transform datasets -ref_data <- scuttle::logNormCounts(ref_data) -query_data <- scuttle::logNormCounts(query_data) - -# Get cell type scores using SingleR (or any other cell type annotation method) -scores <- SingleR::SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) - -# Add labels to query object -colData(query_data)$labels <- scores$labels - -# Selecting highly variable genes (can be customized by the user) -ref_var <- scran::getTopHVGs(ref_data, n = 2000) -query_var <- scran::getTopHVGs(query_data, n = 2000) - -# Intersect the gene symbols to obtain common genes -common_genes <- intersect(ref_var, query_var) -ref_data_subset <- ref_data[common_genes, ] -query_data_subset <- query_data[common_genes, ] - -# Run PCA on the reference data -ref_data_subset <- runPCA(ref_data_subset) - -# Plot the PC data -distance_data <- calculateSampleDistances(query_data_subset, ref_data_subset, - n_components = 10, - query_cell_type_col = "labels", - ref_cell_type_col = "reclustered.broad", - pc_subset = c(1:10)) - -# Identify outliers for CD4 -cd4_anomalies <- detectAnomaly(ref_data_subset, query_data_subset, - query_cell_type_col = "labels", - ref_cell_type_col = "reclustered.broad", - n_components = 10, - n_tree = 500, - anomaly_treshold = 0.5)$CD4 -cd4_top5_anomalies <- names(sort(cd4_anomalies$query_anomaly_scores, decreasing = TRUE)[1:6]) - -# Plot the densities of the distances -plot(distance_data, ref_cell_type = "CD4", sample_names = cd4_top5_anomalies) -plot(distance_data, ref_cell_type = 
"CD8", sample_names = cd4_top5_anomalies) - - -} -\seealso{ -\code{\link{calculateSampleDistances}} -} -\author{ -Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu} -} diff --git a/man/plot.calculateSampleSimilarityPCA.Rd b/man/plot.calculateSampleSimilarityPCA.Rd deleted file mode 100644 index b025237..0000000 --- a/man/plot.calculateSampleSimilarityPCA.Rd +++ /dev/null @@ -1,90 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/plot.calculateSampleSimilarityPCA.R -\name{plot.calculateSampleSimilarityPCA} -\alias{plot.calculateSampleSimilarityPCA} -\title{Plot Cosine Similarities Between Samples and PCs} -\usage{ -\method{plot}{calculateSampleSimilarityPCA}(x, pc_subset = c(1:5), ...) -} -\arguments{ -\item{x}{An object of class 'calculateSampleSimilarityPCA' containing a dataframe of cosine similarity values -between samples and PCs.} - -\item{pc_subset}{A numeric vector specifying the subset of principal components to include in the plot (default: c(1:5)).} - -\item{...}{Additional arguments passed to the plotting function.} -} -\value{ -A ggplot object representing the cosine similarity heatmap. -} -\description{ -This function creates a heatmap plot to visualize the cosine similarities between samples and principal components (PCs). -} -\details{ -This function reshapes the input data frame to create a long format suitable for plotting as a heatmap. It then -creates a heatmap plot using ggplot2, where the x-axis represents the PCs, the y-axis represents the samples, and the -color intensity represents the cosine similarity values. 
-} -\examples{ -# Load required libraries -library(scRNAseq) -library(scuttle) -library(SingleR) -library(scran) -library(scater) - -# Load data (replace with your data loading) -sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) - -# Divide the data into reference and query datasets -set.seed(100) -indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -ref_data <- sce[, indices] -query_data <- sce[, -indices] - -# log transform datasets -ref_data <- scuttle::logNormCounts(ref_data) -query_data <- scuttle::logNormCounts(query_data) - -# Get cell type scores using SingleR (or any other cell type annotation method) -scores <- SingleR::SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) - -# Add labels to query object -colData(query_data)$labels <- scores$labels - -# Selecting highly variable genes (can be customized by the user) -ref_var <- scran::getTopHVGs(ref_data, n = 2000) -query_var <- scran::getTopHVGs(query_data, n = 2000) - -# Intersect the gene symbols to obtain common genes -common_genes <- intersect(ref_var, query_var) -ref_data_subset <- ref_data[common_genes, ] -query_data_subset <- query_data[common_genes, ] - -# Run PCA on the reference data (assumed to be prepared) -ref_data_subset <- runPCA(ref_data_subset) - -# Store PCA anomaly data and plots -anomaly_output <- detectAnomaly(reference_data = ref_data_subset, - ref_cell_type_col = "reclustered.broad", - n_components = 10, - n_tree = 500, - anomaly_treshold = 0.5) -top6_anomalies <- names(sort(anomaly_output$Combined$reference_anomaly_scores, - decreasing = TRUE)[1:6]) - -# Compute cosine similarity between anomalies and top PCs -cosine_similarities <- calculateSampleSimilarityPCA(ref_data_subset, samples = top6_anomalies, - pc_subset = c(1:10), n_top_vars = 50) -cosine_similarities - -# Plot similarities -plot(cosine_similarities, pc_subset = c(1:5)) - -} -\seealso{ -\code{\link{calculateSampleSimilarityPCA}} -} -\author{ -Anthony 
Christidis, \email{anthony-alexander_christidis@hms.harvard.edu} -} diff --git a/man/plot.compareCCA.Rd b/man/plot.compareCCA.Rd deleted file mode 100644 index 40f0522..0000000 --- a/man/plot.compareCCA.Rd +++ /dev/null @@ -1,87 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/plot.compareCCA.R -\name{plot.compareCCA} -\alias{plot.compareCCA} -\title{Plot Visualization of Output from compareCCA Function} -\usage{ -\method{plot}{compareCCA}(x, ...) -} -\arguments{ -\item{x}{A list containing the output from the `compareCCA` function. -This list should include `cosine_similarity` and `correlations`.} - -\item{...}{Additional arguments passed to the plotting function.} -} -\value{ -A ggplot object representing the scatter plot of cosine similarities of CCA coefficients and correlations. -} -\description{ -This function generates a visualization of the output from the `compareCCA` function. -The plot shows the cosine similarities of canonical correlation analysis (CCA) coefficients, -with point sizes representing the correlations. -} -\details{ -The function converts the input list into a data frame suitable for plotting with `ggplot2`. -Each point in the scatter plot represents the cosine similarity of CCA coefficients, with the size of the point -indicating the correlation. 
-} -\examples{ -# Load necessary library -library(scRNAseq) -library(scuttle) -library(scran) -library(SingleR) -library(ggplot2) -library(scater) - -# Load data -sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) - -# Divide the data into reference and query datasets -set.seed(100) -indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -ref_data <- sce[, indices] -query_data <- sce[, -indices] - -# Log transform datasets -ref_data <- logNormCounts(ref_data) -query_data <- logNormCounts(query_data) - -# Get cell type scores using SingleR (or any other cell type annotation method) -scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) - -# Add labels to query object -colData(query_data)$labels <- scores$labels - -# Selecting highly variable genes (can be customized by the user) -ref_var <- getTopHVGs(ref_data, n = 500) -query_var <- getTopHVGs(query_data, n = 500) - -# Intersect the gene symbols to obtain common genes -common_genes <- intersect(ref_var, query_var) -ref_data_subset <- ref_data[common_genes, ] -query_data_subset <- query_data[common_genes, ] - -# Subset reference and query data for a specific cell type -ref_data_subset <- ref_data_subset[, which(ref_data_subset$reclustered.broad == "CD8")] -query_data_subset <- query_data_subset[, which(colData(query_data_subset)$labels == "CD8")] - -# Run PCA on the reference and query datasets -ref_data_subset <- runPCA(ref_data_subset, ncomponents = 50) -query_data_subset <- runPCA(query_data_subset, ncomponents = 50) - -# Compare CCA -cca_comparison <- compareCCA(query_data_subset, ref_data_subset, - pc_subset = c(1:5)) - -# Visualize output of CCA comparison -plot(cca_comparison) - - -} -\seealso{ -\code{\link{compareCCA}} -} -\author{ -Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu} -} diff --git a/man/plot.comparePCA.Rd b/man/plot.comparePCA.Rd deleted file mode 100644 index 14662aa..0000000 --- 
a/man/plot.comparePCA.Rd +++ /dev/null @@ -1,90 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/plot.comparePCA.R -\name{plot.comparePCA} -\alias{plot.comparePCA} -\title{Plot Heatmap of Cosine Similarities Between Principal Components} -\usage{ -\method{plot}{comparePCA}(x, ...) -} -\arguments{ -\item{x}{A numeric matrix output from the `comparePCA` function, representing -cosine similarities between query and reference principal components.} - -\item{...}{Additional arguments passed to the plotting function.} -} -\value{ -A ggplot object representing the heatmap of cosine similarities. -} -\description{ -This function generates a heatmap to visualize the cosine similarities between -principal components from the output of the `comparePCA` function. -} -\details{ -The function converts the input matrix into a long-format data frame -suitable for plotting with `ggplot2`. The rows in the heatmap are ordered in -reverse to match the conventional display format. The heatmap uses a blue-white-red -color gradient to represent cosine similarity values, where blue indicates negative -similarity, white indicates zero similarity, and red indicates positive similarity. 
-} -\examples{ -# Load necessary library -library(scRNAseq) -library(scuttle) -library(scran) -library(SingleR) -library(ComplexHeatmap) -library(scater) - -# Load data -sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) - -# Divide the data into reference and query datasets -set.seed(100) -indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -ref_data <- sce[, indices] -query_data <- sce[, -indices] - -# Log transform datasets -ref_data <- logNormCounts(ref_data) -query_data <- logNormCounts(query_data) - -# Get cell type scores using SingleR (or any other cell type annotation method) -scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) - -# Add labels to query object -colData(query_data)$labels <- scores$labels - -# Selecting highly variable genes (can be customized by the user) -ref_var <- getTopHVGs(ref_data, n = 500) -query_var <- getTopHVGs(query_data, n = 500) - -# Intersect the gene symbols to obtain common genes -common_genes <- intersect(ref_var, query_var) -ref_data_subset <- ref_data[common_genes, ] -query_data_subset <- query_data[common_genes, ] - -# Subset reference and query data for a specific cell type -ref_data_subset <- ref_data_subset[, which(ref_data_subset$reclustered.broad == "CD8")] -query_data_subset <- query_data_subset[, which(colData(query_data_subset)$labels == "CD8")] - -# Run PCA on the reference and query datasets -ref_data_subset <- runPCA(ref_data_subset) -query_data_subset <- runPCA(query_data_subset) - -# Call the PCA comparison function -similarity_mat <- comparePCA(query_data_subset, ref_data_subset, - pc_subset = c(1:5), - metric = c("cosine", "correlation")[1], - correlation_method = c("spearman", "pearson")[1]) - -# Create the heatmap -plot(similarity_mat) - - -} -\seealso{ -\code{\link{comparePCA}} -} -\author{ -Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu} -} diff --git a/man/plot.comparePCASubspace.Rd 
b/man/plot.comparePCASubspace.Rd deleted file mode 100644 index 5889cfb..0000000 --- a/man/plot.comparePCASubspace.Rd +++ /dev/null @@ -1,87 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/plot.comparePCASubspace.R -\name{plot.comparePCASubspace} -\alias{plot.comparePCASubspace} -\title{Plot Visualization of Output from comparePCASubspace Function} -\usage{ -\method{plot}{comparePCASubspace}(x, ...) -} -\arguments{ -\item{x}{A list output from the `comparePCASubspace` function, describing the -principal angles between query and reference principal components.} - -\item{...}{Additional arguments passed to the plotting function.} -} -\value{ -A ggplot object representing the scatter plot of the cosines of principal angles. -} -\description{ -This function generates a visualization of the output from the `comparePCASubspace` function. -The plot shows the cosine of principal angles between reference and query principal components, -with point sizes representing the variance explained. -} -\details{ -The function converts the input list into a data frame suitable for plotting with `ggplot2`. -Each point in the scatter plot represents the cosine of a principal angle, with the size of the point -indicating the average variance explained by the corresponding principal components.
-} -\examples{ -# Load necessary libraries -library(scRNAseq) -library(scuttle) -library(scran) -library(SingleR) -library(ggplot2) -library(scater) - -# Load data -sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) - -# Divide the data into reference and query datasets -set.seed(100) -indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -ref_data <- sce[, indices] -query_data <- sce[, -indices] - -# Log transform datasets -ref_data <- logNormCounts(ref_data) -query_data <- logNormCounts(query_data) - -# Get cell type scores using SingleR (or any other cell type annotation method) -scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) - -# Add labels to query object -colData(query_data)$labels <- scores$labels - -# Selecting highly variable genes (can be customized by the user) -ref_var <- getTopHVGs(ref_data, n = 500) -query_var <- getTopHVGs(query_data, n = 500) - -# Intersect the gene symbols to obtain common genes -common_genes <- intersect(ref_var, query_var) -ref_data_subset <- ref_data[common_genes, ] -query_data_subset <- query_data[common_genes, ] - -# Subset reference and query data for a specific cell type -ref_data_subset <- ref_data_subset[, which(ref_data_subset$reclustered.broad == "CD8")] -query_data_subset <- query_data_subset[, which(colData(query_data_subset)$labels == "CD8")] - -# Run PCA on the reference and query datasets -ref_data_subset <- runPCA(ref_data_subset, ncomponents = 50) -query_data_subset <- runPCA(query_data_subset, ncomponents = 50) - -# Compare PCA subspaces -subspace_comparison <- comparePCASubspace(query_data_subset, ref_data_subset, - pc_subset = c(1:5)) - -# Visualize output of PCA subspace comparison -plot(subspace_comparison) - - -} -\seealso{ -\code{\link{comparePCASubspace}} -} -\author{ -Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu} -} diff --git a/man/plot.detectAnomaly.Rd b/man/plot.detectAnomaly.Rd deleted file mode 100644 index
a3ca284..0000000 --- a/man/plot.detectAnomaly.Rd +++ /dev/null @@ -1,99 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/plot.detectAnomaly.R -\name{plot.detectAnomaly} -\alias{plot.detectAnomaly} -\title{Create Faceted Scatter Plots for Specified PC Combinations From \code{detectAnomaly} Object} -\usage{ -\method{plot}{detectAnomaly}( - x, - cell_type = NULL, - pc_subset = NULL, - data_type = c("query", "reference"), - ... -) -} -\arguments{ -\item{x}{A list object containing the anomaly detection results from the \code{detectAnomaly} function. -Each element of the list should correspond to a cell type and contain \code{reference_mat_subset}, \code{query_mat_subset}, -\code{var_explained}, and \code{anomaly}.} - -\item{cell_type}{A character string specifying the cell type for which the plots should be generated. This should -be a name present in \code{x}. If NULL, the "Combined" cell type will be plotted. Default is NULL.} - -\item{pc_subset}{A numeric vector specifying the indices of the PCs to be included in the plots. If NULL, all PCs -in \code{reference_mat_subset} will be included.} - -\item{data_type}{A character string specifying whether to plot the "query" data or the "reference" data. Default is "query".} - -\item{...}{Additional arguments.} -} -\value{ -A ggplot2 object representing the PCA plots with anomalies highlighted. -} -\description{ -This function generates faceted scatter plots for specified principal component (PC) combinations -within an anomaly detection object. It allows visualization of the relationship between specified -PCs and highlights anomalies detected by the Isolation Forest algorithm. -} -\details{ -The function extracts the specified PCs from the given anomaly detection object and generates -scatter plots for each pair of PCs. It uses \code{ggplot2} to create a faceted plot where each facet represents -a pair of PCs. Anomalies are highlighted in red, while normal points are shown in black. 
-} -\examples{ -# Load required libraries -library(scRNAseq) -library(scuttle) -library(SingleR) -library(scran) -library(scater) - -# Load data -sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) - -# Divide the data into reference and query datasets -set.seed(100) -indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -ref_data <- sce[, indices] -query_data <- sce[, -indices] - -# log transform datasets -ref_data <- logNormCounts(ref_data) -query_data <- logNormCounts(query_data) - -# Get cell type scores using SingleR (or any other cell type annotation method) -scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) - -# Add labels to query object -colData(query_data)$labels <- scores$labels - -# Selecting highly variable genes (can be customized by the user) -ref_var <- getTopHVGs(ref_data, n = 2000) -query_var <- getTopHVGs(query_data, n = 2000) - -# Intersect the gene symbols to obtain common genes -common_genes <- intersect(ref_var, query_var) -ref_data_subset <- ref_data[common_genes, ] -query_data_subset <- query_data[common_genes, ] - -# Run PCA on the reference data -ref_data_subset <- runPCA(ref_data_subset, ncomponents = 50) - -# Store PCA anomaly data and plots -anomaly_output <- detectAnomaly(ref_data_subset, query_data_subset, - ref_cell_type_col = "reclustered.broad", - query_cell_type_col = "labels", - n_components = 10, - n_tree = 500, - anomaly_treshold = 0.5) - -# Plot the output for a cell type -plot(anomaly_output, cell_type = "CD8", pc_subset = c(1:5), data_type = "query") - -} -\seealso{ -\code{\link{detectAnomaly}} -} -\author{ -Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu} -} diff --git a/man/plot.nearestNeighborDiagnostics.Rd b/man/plot.nearestNeighborDiagnostics.Rd deleted file mode 100644 index 2464969..0000000 --- a/man/plot.nearestNeighborDiagnostics.Rd +++ /dev/null @@ -1,87 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please 
edit documentation in R/plot.nearestNeighborDiagnostics.R -\name{plot.nearestNeighborDiagnostics} -\alias{plot.nearestNeighborDiagnostics} -\title{Plot Density of Probabilities for Cell Type Classification} -\usage{ -\method{plot}{nearestNeighborDiagnostics}(x, cell_types = NULL, prob_type = c("query", "reference")[1], ...) -} -\arguments{ -\item{x}{An object of class \code{nearestNeighborDiagnostics} containing the probabilities calculated by the \code{\link{nearestNeighborDiagnostics}} function.} - -\item{cell_types}{A character vector specifying the cell types to include in the plot. If NULL, all cell types in \code{x} will be plotted. Default is NULL.} - -\item{prob_type}{A character string specifying the type of probability to plot. Must be either "query" or "reference". Default is "query".} - -\item{...}{Additional arguments to be passed to \code{\link[ggplot2]{geom_density}}.} -} -\value{ -A ggplot2 density plot. -} -\description{ -This function generates a density plot showing the distribution of probabilities of each sample belonging to -either the reference or query dataset for each cell type. -} -\details{ -This function creates a density plot to visualize the distribution of probabilities for each sample belonging to the -reference or query dataset for each cell type. It utilizes the ggplot2 package for plotting.
-} -\examples{ -# Load necessary library -library(scRNAseq) -library(scuttle) -library(scran) -library(SingleR) -library(scater) - -# Load data -sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) - -# Divide the data into reference and query datasets -set.seed(100) -indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -ref_data <- sce[, indices] -query_data <- sce[, -indices] - -# log transform datasets -ref_data <- logNormCounts(ref_data) -query_data <- logNormCounts(query_data) - -# Get cell type scores using SingleR (or any other cell type annotation method) -scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) - -# Add labels to query object -colData(query_data)$labels <- scores$labels - -# Selecting highly variable genes (can be customized by the user) -ref_var <- getTopHVGs(ref_data, n = 500) -query_var <- getTopHVGs(query_data, n = 500) - -# Intersect the gene symbols to obtain common genes -common_genes <- intersect(ref_var, query_var) -ref_data_subset <- ref_data[common_genes, ] -query_data_subset <- query_data[common_genes, ] - -# Run PCA on the reference data -ref_data_subset <- runPCA(ref_data_subset) - -# Project the query data onto PCA space of reference -nn_output <- nearestNeighborDiagnostics(query_data_subset, ref_data_subset, - n_neighbor = 15, - n_components = 10, - pc_subset = c(1:10), - query_cell_type_col = "labels", - ref_cell_type_col = "reclustered.broad") - -# Plot output -plot(nn_output, cell_types = c("CD4", "CD8", "B_and_plasma", "Myeloid"), - prob_type = "query") - - -} -\seealso{ -\code{\link{nearestNeighborDiagnostics}} -} -\author{ -Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu} -} diff --git a/man/plotGeneExpressionDimred.Rd b/man/plotGeneExpressionDimred.Rd deleted file mode 100644 index 2d9021e..0000000 --- a/man/plotGeneExpressionDimred.Rd +++ /dev/null @@ -1,52 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit 
documentation in R/plotGeneExpressionDimred.R -\name{plotGeneExpressionDimred} -\alias{plotGeneExpressionDimred} -\title{Visualize gene expression on a dimensional reduction plot} -\usage{ -plotGeneExpressionDimred(se_object, method, n_components = c(1, 2), feature) -} -\arguments{ -\item{se_object}{An object of class "SingleCellExperiment" containing log-transformed expression matrix and other metadata. -It can be either a reference or query dataset.} - -\item{method}{The reduction method to use for visualization. It should be one of the supported methods: "tSNE", "UMAP", or "PCA".} - -\item{n_components}{A numeric vector of length 2 indicating the first two dimensions to be used for plotting.} - -\item{feature}{A character string representing the name of the gene or feature to be visualized.} -} -\value{ -A ggplot object representing the dimensional reduction plot with gene expression. -} -\description{ -This function plots gene expression on a dimensional reduction plot using methods like t-SNE, UMAP, or PCA. Each single cell is color-coded based on the expression of a specific gene or feature. 
-} -\examples{ -library(scater) -library(scran) -library(scRNAseq) - -# Load data -sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) - -# Divide the data into reference and query datasets -set.seed(100) -indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -ref_data <- sce[, indices] -query_data <- sce[, -indices] - -# Log transform datasets -query_data <- logNormCounts(query_data) - -# Run PCA -query_data <- runPCA(query_data) - -# Plot gene expression on PCA plot -plotGeneExpressionDimred(se_object = query_data, - method = "PCA", - n_components = c(1, 2), - feature = "VPREB3") - - -} diff --git a/man/plotGeneSetScores.Rd b/man/plotGeneSetScores.Rd deleted file mode 100644 index dd18d86..0000000 --- a/man/plotGeneSetScores.Rd +++ /dev/null @@ -1,78 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/plotGeneSetScores.R -\name{plotGeneSetScores} -\alias{plotGeneSetScores} -\title{Visualization of gene sets or pathway scores on dimensional reduction plot} -\usage{ -plotGeneSetScores(se_object, method, feature, pc_subset = c(1:5)) -} -\arguments{ -\item{se_object}{An object of class "SingleCellExperiment" containing numeric expression matrix and other metadata. -It can be either a reference or query dataset.} - -\item{method}{A character string indicating the method for visualization ("PCA", "TSNE", or "UMAP").} - -\item{feature}{A character string representing the name of the feature (score) in the colData(query_data) to plot.} - -\item{pc_subset}{An optional vector specifying the principal components (PCs) to include in the plot if method = "PCA". -Default is c(1:5).} -} -\value{ -A ggplot2 object representing the gene set scores plotted on the specified reduced dimensions. -} -\description{ -Plot gene sets or pathway scores on PCA, TSNE, or UMAP. Single cells are color-coded by scores of gene sets or pathways. 
-} -\details{ -This function plots gene set scores on reduced dimensions such as PCA, t-SNE, or UMAP. -It extracts the reduced dimensions from the provided SingleCellExperiment object. -Gene set scores are visualized as a scatter plot with colors indicating the scores. -For PCA, the function automatically includes the percentage of variance explained -in the plot's legend. -} -\examples{ -library(scater) -library(scran) -library(scRNAseq) -library(AUCell) - -# Load data -sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) - -# Divide the data into reference and query datasets -set.seed(100) -indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -ref_data <- sce[, indices] -query_data <- sce[, -indices] - -## log transform datasets -ref_data <- logNormCounts(ref_data) -query_data <- logNormCounts(query_data) - -# Run PCA on the query data -query_data <- runPCA(query_data) - -# Compute scores using AUCell -expression_matrix <- assay(query_data, "logcounts") -cells_rankings <- AUCell_buildRankings(expression_matrix, plotStats = FALSE) -# Generate gene sets -gene_set1 <- sample(rownames(expression_matrix), 10) -gene_set2 <- sample(rownames(expression_matrix), 20) -gene_sets <- list(geneSet1 = gene_set1, geneSet2 = gene_set2) - -# Calculate AUC scores for gene sets -cells_AUC <- AUCell_calcAUC(gene_sets, cells_rankings) - -# Assign scores to colData (users should ensure that the scores are present in the colData) -colData(query_data)$geneSetScores <- assay(cells_AUC)["geneSet1", ] - -# Plot gene set scores on PCA -plotGeneSetScores(se_object = query_data, - method = "PCA", - feature = "geneSetScores", - pc_subset = c(1:5)) - -# Note: Users can provide their own gene set scores in the colData of the 'se_object' object, -# using any method of their choice. 
- -} diff --git a/man/plotMarkerExpression.Rd b/man/plotMarkerExpression.Rd deleted file mode 100644 index 780c894..0000000 --- a/man/plotMarkerExpression.Rd +++ /dev/null @@ -1,79 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/plotMarkerExpression.R -\name{plotMarkerExpression} -\alias{plotMarkerExpression} -\title{Plot gene expression distribution from overall and cell type-specific perspective} -\usage{ -plotMarkerExpression( - reference_data, - query_data, - ref_cell_type_col, - query_cell_type_col, - gene_name, - label -) -} -\arguments{ -\item{reference_data}{A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the reference cells.} - -\item{query_data}{A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the query cells.} - -\item{ref_cell_type_col}{character. The column name in the \code{colData} of \code{reference_data} that identifies the cell types.} - -\item{query_cell_type_col}{character. The column name in the \code{colData} of \code{query_data} that identifies the cell types.} - -\item{gene_name}{character. A string representing the gene name for which the distribution is to be visualized.} - -\item{label}{character. A vector of cell type labels to plot (e.g., c("T-cell", "B-cell")).} -} -\value{ -A gtable object containing two arranged density plots as grobs. - The first plot shows the overall gene expression distribution, - and the second plot displays the cell type-specific expression - distribution. -} -\description{ -This function generates density plots to visualize the distribution of gene expression values -for a specific gene across the overall dataset and within a specified cell type. -} -\details{ -This function generates density plots to compare the distribution of a specific marker -gene between reference and query datasets. 
The aim is to inspect the alignment of gene expression -levels as a surrogate for dataset similarity. Similar distributions suggest a good alignment, -while differences may indicate discrepancies or incompatibilities between the datasets. -} -\examples{ -library(scater) -library(scran) -library(scRNAseq) -library(SingleR) - -# Load data -sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) - -# Divide the data into reference and query datasets -set.seed(100) -indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -ref_data <- sce[, indices] -query_data <- sce[, -indices] - -# Log transform datasets -ref_data <- logNormCounts(ref_data) -query_data <- logNormCounts(query_data) - -# Get cell type scores using SingleR or any other method -pred <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) - -# Add labels to query object -colData(query_data)$labels <- pred$labels - -# Note: Users can use SingleR or any other method to obtain the cell type annotations. -plotMarkerExpression(reference_data = ref_data, - query_data = query_data, - ref_cell_type_col = "reclustered.broad", - query_cell_type_col = "labels", - gene_name = "VPREB3", - label = "B_and_plasma") - - -} diff --git a/man/plotQCvsAnnotation.Rd b/man/plotQCvsAnnotation.Rd deleted file mode 100644 index ac2b417..0000000 --- a/man/plotQCvsAnnotation.Rd +++ /dev/null @@ -1,88 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/plotQCvsAnnotation.R -\name{plotQCvsAnnotation} -\alias{plotQCvsAnnotation} -\title{Scatter plot: QC stats vs Cell Type Annotation Scores} -\usage{ -plotQCvsAnnotation(query_data, qc_col, label_col, score_col, label = NULL) -} -\arguments{ -\item{query_data}{A \code{\linkS4class{SingleCellExperiment}} containing the single-cell -expression data and metadata.} - -\item{qc_col}{character. 
A column name in the \code{colData} of \code{query_data} that -contains the QC stats of interest.} - -\item{label_col}{character. The column name in the \code{colData} of \code{query_data} -that contains the cell type labels.} - -\item{score_col}{character. The column name in the \code{colData} of \code{query_data} that -contains the cell type annotation scores.} - -\item{label}{character. A vector of cell type labels to plot (e.g., c("T-cell", "B-cell")). -Defaults to \code{NULL}, which will include all the cells.} -} -\value{ -A ggplot object displaying a scatter plot of QC stats vs annotation scores, - where each point represents a cell, color-coded by its cell type. -} -\description{ -Creates a scatter plot to visualize the relationship between QC stats (e.g., library size) -and cell type annotation scores for one or more cell types. -} -\details{ -This function generates a scatter plot to explore the relationship between various quality -control (QC) statistics, such as library size and mitochondrial percentage, and cell type -annotation scores. By examining these relationships, users can assess whether specific QC -metrics systematically influence the confidence in cell type annotations, -which is essential for ensuring reliable cell type annotation. -} -\examples{ -\donttest{ -library(celldex) -library(scater) -library(scran) -library(scRNAseq) -library(SingleR) - -# load reference dataset -ref_data <- fetchReference("hpca", "2024-02-26") - -# Load query dataset (Bunis haematopoietic stem and progenitor cell data) from -# Bunis DG et al. (2021). Single-Cell Mapping of Progressive Fetal-to-Adult -# Transition in Human Naive T Cells. Cell Rep. 
34(1): 108573 -query_data <- BunisHSPCData() -rownames(query_data) <- rowData(query_data)$Symbol - -# Add QC metrics to query data -query_data <- addPerCellQCMetrics(query_data) - -# Log transform query dataset -query_data <- logNormCounts(query_data) - -# Run SingleR to predict cell types - -pred <- SingleR(query_data, ref_data, labels = ref_data$label.main) - -# Assign predicted labels to query data -colData(query_data)$pred.labels <- pred$labels - -# Get annotation scores -scores <- apply(pred$scores, 1, max) - -# Assign scores to query data -colData(query_data)$cell_scores <- scores - -# Create a scatter plot between library size and annotation scores - -p1 <- plotQCvsAnnotation( - query_data = query_data, - qc_col = "total", - label_col = "pred.labels", - score_col = "cell_scores", - label = NULL) -p1 + xlab("Library Size") -} - - -} diff --git a/man/projectPCA.Rd b/man/projectPCA.Rd deleted file mode 100644 index f03b9a8..0000000 --- a/man/projectPCA.Rd +++ /dev/null @@ -1,125 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/projectPCA.R -\name{projectPCA} -\alias{projectPCA} -\title{Project Query Data Onto PCA Space of Reference Data} -\usage{ -projectPCA( - query_data, - reference_data, - n_components = 10, - query_cell_type_col = NULL, - ref_cell_type_col = NULL, - return_value = c("data.frame", "list")[1] -) -} -\arguments{ -\item{query_data}{A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the query cells.} - -\item{reference_data}{A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the reference cells.} - -\item{n_components}{An integer specifying the number of principal components to use for projection. Defaults to 10. -Must be less than or equal to the number of components available in the reference PCA.} - -\item{query_cell_type_col}{character. 
The column name in the \code{colData} of \code{query_data} -that identifies the cell types.} - -\item{ref_cell_type_col}{character. The column name in the \code{colData} of \code{reference_data} -that identifies the cell types.} - -\item{return_value}{A character string specifying the format of the returned data. Can be \code{data.frame} (combined reference -and query projections) or \code{list} (separate lists for reference and query projections) (default = \code{data.frame}).} -} -\value{ -A \code{data.frame} containing the projected data in rows (reference and query data combined) or a \code{list} containing -separate matrices for reference and query projections, depending on the \code{return_value} argument. -} -\description{ -This function projects a query singleCellExperiment object onto the PCA space of a reference -singleCellExperiment object. The PCA analysis on the reference data is assumed to be pre-computed and stored within the object. -} -\details{ -This function assumes that the "PCA" element exists within the \code{reducedDims} of the reference data -(obtained using \code{reducedDim(reference_data)}) and that the genes used for PCA are present in both the reference and query data. -It performs centering and scaling of the query data based on the reference data before projection. 
-} -\examples{ -# Load required libraries -library(scRNAseq) -library(scuttle) -library(SingleR) -library(scran) -library(scater) -library(RColorBrewer) - -# Load data (replace with your data loading) -sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) - -# Divide the data into reference and query datasets -set.seed(100) -indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -ref_data <- sce[, indices] -query_data <- sce[, -indices] - -# log transform datasets -ref_data <- scuttle::logNormCounts(ref_data) -query_data <- scuttle::logNormCounts(query_data) - -# Get cell type scores using SingleR (or any other cell type annotation method) -scores <- SingleR::SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) - -# Add labels to query object -colData(query_data)$labels <- scores$labels - -# Selecting highly variable genes (can be customized by the user) -ref_var <- scran::getTopHVGs(ref_data, n = 2000) -query_var <- scran::getTopHVGs(query_data, n = 2000) - -# Intersect the gene symbols to obtain common genes -common_genes <- intersect(ref_var, query_var) -ref_data_subset <- ref_data[common_genes, ] -query_data_subset <- query_data[common_genes, ] - -# Run PCA on the reference data (assumed to be prepared) -ref_data_subset <- runPCA(ref_data_subset) - -# Project the query data onto PCA space of reference -pca_output <- projectPCA(query_data_subset, ref_data_subset, - n_components = 10, - query_cell_type_col = "labels", - ref_cell_type_col = "reclustered.broad", - return_value = c("data.frame", "list")[1]) - -# Compute t-SNE and UMAP using first 10 PCs -tsne_data <- data.frame(calculateTSNE(t(pca_output[, paste0("PC", 1:10)]))) -umap_data <- data.frame(calculateUMAP(t(pca_output[, paste0("PC", 1:10)]))) - -# Combine the cell type labels from both datasets -tsne_data$Type <- paste(pca_output$dataset, pca_output$cell_type) - -# Define the cell types and legend order -legend_order <- c("Query CD8", - 
"Reference CD8", - "Query CD4", - "Reference CD4", - "Query B_and_plasma", - "Reference B_and_plasma") - -# Define the colors for cell types -color_palette <- brewer.pal(length(legend_order), "Paired") -color_mapping <- setNames(color_palette, legend_order) -cell_type_colors <- color_mapping[legend_order] - -# Visualize t-SNE output -tsne_plot <- ggplot(tsne_data[tsne_data$Type \%in\% legend_order,], - aes(x = TSNE1, y = TSNE2, color = factor(Type, levels = legend_order))) + - geom_point(alpha = 0.5, size = 1) + - scale_color_manual(values = cell_type_colors) + - theme_bw() + - guides(color = guide_legend(title = "Cell Types")) - - -} -\author{ -Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu} -} diff --git a/man/regressPC.Rd b/man/regressPC.Rd deleted file mode 100644 index 8afbbbd..0000000 --- a/man/regressPC.Rd +++ /dev/null @@ -1,121 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/regressPC.R -\name{regressPC} -\alias{regressPC} -\alias{plotPCRegression} -\title{Principal component regression} -\usage{ -regressPC(sce, dep.vars = NULL, indep.var) - -plotPCRegression(sce, regressPC_res, dep.vars = NULL, indep.var, max_pc = 20) -} -\arguments{ -\item{sce}{An object of class \code{\linkS4class{SingleCellExperiment}} -containing the data for regression analysis.} - -\item{dep.vars}{character. Dependent variable(s). Determines which principal -component(s) (e.g., "PC1", "PC2", etc.) are used as explanatory variables. -Principal components are expected to be stored in a PC matrix named -\code{"PCA"} in the \code{reducedDims} of \code{sce}. Defaults to -\code{NULL} which will then regress on each principal component present in -the PC matrix.} - -\item{indep.var}{character. Independent variable. A column name in the -\code{colData} of \code{sce} specifying the response variable.} - -\item{regressPC_res}{a result from \code{\link{regressPC}}} - -\item{max_pc}{The maximum number of PCs to show on the plot. 
Set to 0 to show -all.} -} -\value{ -A \code{list} containing \itemize{ \item summaries of the linear - regression models for each specified principal component, \item the - corresponding R-squared (R2) values, \item the variance contributions for - each principal component, and \item the total variance explained.} -} -\description{ -This function performs linear regression of a covariate of interest onto one -or more principal components, based on the data in a SingleCellExperiment -object. -} -\details{ -Principal component regression, derived from PCA, can be used to - quantify the variance explained by a covariate of interest. Applications for - single-cell analysis include quantification of batch removal, assessing - clustering homogeneity, and evaluation of alignment of query and reference - datasets in cell type annotation settings. Briefly, the R^2 is calculated - from a linear regression of the covariate B of interest onto each principal - component. The variance contribution of the covariate effect per principal - component is then calculated as the product of the variance explained by - the ith principal component (PC) and the corresponding R^2(PC_i|B). The sum - across all variance contributions by the covariate effects in all principal - components gives the total variance explained by the covariate as follows: - - Var(C|B) = sum_{i=1}^G Var(C|PC_i) * R^2 (PC_i | B) - - where Var(C|PC_i) is the variance of the data matrix C explained by the ith - principal component. See references. - - If the input is large (>3e4 cells) and the independent variable is - categorical with >10 categories, this function will use a stripped down - linear model function that is faster but doesn't return all the same - components. Namely, the \code{regression.summaries} component of the result - will contain only the R^2 values, nothing else. 
-} -\examples{ -library(scater) -library(scran) -library(scRNAseq) -library(SingleR) - -# Load data -sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) - -# Divide the data into reference and query datasets -set.seed(100) -indices <- sample(ncol(sce), - size = floor(0.7 * ncol(sce)), - replace = FALSE -) -ref <- sce[, indices] -query <- sce[, -indices] - -# log transform datasets -ref <- logNormCounts(ref) -query <- logNormCounts(query) - -# Run PCA -query <- runPCA(query) - -# Get cell type scores using SingleR -# Note: replace when using cell type annotation scores from other methods -scores <- SingleR(query, ref, labels = ref$reclustered.broad) - -# Add labels to query object -query$labels <- scores$labels - -# Specify the dependent variables (principal components) and -# independent variable (e.g., "labels") -dep.vars <- paste0("PC", 1:3) -indep.var <- "labels" - -# Perform linear regression on multiple principal components -res <- regressPC( - sce = query, - dep.vars = dep.vars, - indep.var = indep.var -) - -# Obtain linear regression summaries and R-squared values -res$regression.summaries -res$rsquared - - -plotPCRegression(query, res, dep.vars, indep.var) - -} -\references{ -Luecken et al. Benchmarking atlas-level data integration in - single-cell genomics. Nature Methods, 19:41-50, 2022. 
-} diff --git a/man/visualizeCellTypeMDS.Rd b/man/visualizeCellTypeMDS.Rd deleted file mode 100644 index 84dad90..0000000 --- a/man/visualizeCellTypeMDS.Rd +++ /dev/null @@ -1,85 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/visualizeCellTypeMDS.R -\name{visualizeCellTypeMDS} -\alias{visualizeCellTypeMDS} -\title{Visualizing Reference and Query Cell Types using MDS} -\usage{ -visualizeCellTypeMDS( - query_data, - reference_data, - cell_types = NULL, - query_cell_type_col, - ref_cell_type_col -) -} -\arguments{ -\item{query_data}{A \code{\linkS4class{SingleCellExperiment}} containing the single-cell -expression data and metadata.} - -\item{reference_data}{A \code{\linkS4class{SingleCellExperiment}} object containing the single-cell -expression data and metadata.} - -\item{cell_types}{A character vector specifying the cell types to include in the plot. If NULL, all cell types are included.} - -\item{query_cell_type_col}{character. The column name in the \code{colData} of \code{query_data} -that identifies the cell types.} - -\item{ref_cell_type_col}{character. The column name in the \code{colData} of \code{reference_data} -that identifies the cell types.} -} -\value{ -A ggplot object representing the MDS scatter plot with cell type coloring. -} -\description{ -This function facilitates the assessment of similarity between reference and query datasets -through Multidimensional Scaling (MDS) scatter plots. It allows the visualization of cell types, -color-coded with user-defined custom colors, based on a dissimilarity matrix computed from a -user-selected gene set. -} -\details{ -To evaluate dataset similarity, the function selects specific subsets of cells from -both reference and query datasets. It then calculates Spearman correlations between gene expression profiles, -deriving a dissimilarity matrix. 
This matrix undergoes Classical Multidimensional Scaling (MDS) for -visualization, presenting cell types in a scatter plot, distinguished by colors defined by the user. -} -\examples{ -library(scater) -library(scran) -library(scRNAseq) - -# Load data (replace with your data loading) -sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) - -# Divide the data into reference and query datasets -set.seed(100) -indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -ref_data <- sce[, indices] -query_data <- sce[, -indices] - -# log transform datasets -ref_data <- scuttle::logNormCounts(ref_data) -query_data <- scuttle::logNormCounts(query_data) - -# Get cell type scores using SingleR (or any other cell type annotation method) -scores <- SingleR::SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) - -# Add labels to query object -colData(query_data)$labels <- scores$labels - -# Selecting highly variable genes (can be customized by the user) -ref_var <- scran::getTopHVGs(ref_data, n = 2000) -query_var <- scran::getTopHVGs(query_data, n = 2000) - -# Intersect the gene symbols to obtain common genes -common_genes <- intersect(ref_var, query_var) -ref_data_subset <- ref_data[common_genes, ] -query_data_subset <- query_data[common_genes, ] - -# Generate the MDS scatter plot with cell type coloring -plot <- visualizeCellTypeMDS(query_data = query_data_subset, - reference_data = ref_data_subset, - query_cell_type_col = "labels", - ref_cell_type_col = "reclustered.broad") -print(plot) - -} diff --git a/man/visualizeCellTypePCA.Rd b/man/visualizeCellTypePCA.Rd deleted file mode 100644 index 3d62fb0..0000000 --- a/man/visualizeCellTypePCA.Rd +++ /dev/null @@ -1,97 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/visualizeCellTypePCA.R -\name{visualizeCellTypePCA} -\alias{visualizeCellTypePCA} -\title{Visualize Principal Components for Different Cell Types} -\usage{ 
-visualizeCellTypePCA( - query_data, - reference_data, - n_components = 10, - cell_types = NULL, - query_cell_type_col, - ref_cell_type_col, - pc_subset = c(1:5) -) -} -\arguments{ -\item{query_data}{A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the query cells.} - -\item{reference_data}{A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the reference cells.} - -\item{n_components}{An integer specifying the number of principal components to use for projection. Defaults to 10. -Must be less than or equal to the number of components available in the reference PCA.} - -\item{cell_types}{A character vector specifying the cell types to include in the plot. If NULL, all cell types are included.} - -\item{query_cell_type_col}{character. The column name in the \code{colData} of \code{query_data} -that identifies the cell types.} - -\item{ref_cell_type_col}{character. The column name in the \code{colData} of \code{reference_data} -that identifies the cell types.} - -\item{pc_subset}{A numeric vector specifying which principal components to include in the plot. Default is PC1 to PC5.} -} -\value{ -A ggplot object representing the boxplots of specified principal components for the given cell types and datasets. -} -\description{ -This function plots the principal components for different cell types in the query and reference datasets. -} -\details{ -This function projects the query dataset onto the principal component space of the reference dataset and then visualizes the -specified principal components for the specified cell types. -It uses the `projectPCA` function to perform the projection and `ggplot2` to create the plots. 
-} -\examples{ -# Load required libraries -library(scRNAseq) -library(scuttle) -library(SingleR) -library(scran) -library(scater) - -# Load data (replace with your data loading) -sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) - -# Divide the data into reference and query datasets -set.seed(100) -indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -ref_data <- sce[, indices] -query_data <- sce[, -indices] - -# log transform datasets -ref_data <- scuttle::logNormCounts(ref_data) -query_data <- scuttle::logNormCounts(query_data) - -# Get cell type scores using SingleR (or any other cell type annotation method) -scores <- SingleR::SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) - -# Add labels to query object -colData(query_data)$labels <- scores$labels - -# Selecting highly variable genes (can be customized by the user) -ref_var <- scran::getTopHVGs(ref_data, n = 2000) -query_var <- scran::getTopHVGs(query_data, n = 2000) - -# Intersect the gene symbols to obtain common genes -common_genes <- intersect(ref_var, query_var) -ref_data_subset <- ref_data[common_genes, ] -query_data_subset <- query_data[common_genes, ] - -# Run PCA on the reference data (assumed to be prepared) -ref_data_subset <- runPCA(ref_data_subset) - -pc_plot <- visualizeCellTypePCA(query_data_subset, ref_data_subset, - n_components = 10, - cell_types = c("CD4", "CD8", "B_and_plasma", "Myeloid"), - query_cell_type_col = "labels", - ref_cell_type_col = "reclustered.broad", - pc_subset = c(1:5)) -pc_plot - - -} -\author{ -Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu} -} diff --git a/pkgdown/extra.css b/pkgdown/extra.css deleted file mode 100644 index 19caa82..0000000 --- a/pkgdown/extra.css +++ /dev/null @@ -1,99 +0,0 @@ -/* -Developed and maintained by Kevin Rue-Albrecht (@kevinrue) -Borrowed by Leo from https://github.com/iSEE/iSEEhub/blob/main/pkgdown/extra.css -See 
https://github.com/lcolladotor/biocthis/issues/34 for more details. -*/ - -/* -#0092ac blue -#00758a darker blue (active menu) -#c4d931 green (on blue) -#87b13f green (on white) -*/ - -.headroom { - background-color: #0092ac; -} - -.navbar-default .navbar-link { - color: #ffffff; -} - -.navbar-default .navbar-link:hover { - color: #c4d931; -} - -.navbar-default .navbar-nav>.active>a, -.navbar-default .navbar-nav>.active>a:hover, -.navbar-default .navbar-nav>.active>a:focus { - color: #c4d931; - background-color: #00758a; -} - -.navbar-default .navbar-nav>.open>a, -.navbar-default .navbar-nav>.open>a:hover, -.navbar-default .navbar-nav>.open>a:focus { - color: #c4d931; - background-color: #00758a; -} - -.dropdown-menu>.active>a, -.dropdown-menu>.active>a:hover, -.dropdown-menu>.active>a:focus { - color: #c4d931; - background-color: #00758a; -} - -.navbar-default .navbar-nav>li>a:hover, -.navbar-default .navbar-nav>li>a:focus { - color: #c4d931; -} - -.dropdown-menu>li>a:hover { - color: #87b13f; - background-color: #ffffff; -} - -.navbar-default .navbar-nav>li>a { - color: #ffffff; -} - -h1 { - color: #87b13f; -} - -h2 { - color: #1a81c2; -} - -h3 { - color: #1a81c2; - font-weight: bold; -} - -.btn-copy-ex { - color: #ffffff; - background-color: #0092ac; - border-color: #0092ac; -} - -.btn-copy-ex:hover { - color: #ffffff; - background-color: #00758a; - border-color: #00758a; -} - -.btn-copy-ex:active:focus { - color: #c4d931; - background-color: #00758a; - border-color: #0092ac; -} - -p>.fa, -p>.fas { - color: #0092ac; -} - -img { - width: auto; -} diff --git a/scDiagnostics.Rproj b/scDiagnostics.Rproj deleted file mode 100644 index a4dce49..0000000 --- a/scDiagnostics.Rproj +++ /dev/null @@ -1,17 +0,0 @@ -Version: 1.0 - -RestoreWorkspace: Default -SaveWorkspace: Default -AlwaysSaveHistory: Default - -EnableCodeIndexing: Yes -UseSpacesForTab: Yes -NumSpacesForTab: 4 -Encoding: UTF-8 - -RnwWeave: Sweave -LaTeX: pdfLaTeX - -BuildType: Package -PackageUseDevtools: Yes 
-PackageInstallArgs: --no-multiarch --with-keep.source diff --git a/tests/testthat.R b/tests/testthat.R deleted file mode 100644 index 952fd33..0000000 --- a/tests/testthat.R +++ /dev/null @@ -1,12 +0,0 @@ -# This file is part of the standard setup for testthat. -# It is recommended that you do not modify it. -# -# Where should you do additional test configuration? -# Learn more about the roles of various files in: -# * https://r-pkgs.org/testing-design.html#sec-tests-files-overview -# * https://testthat.r-lib.org/articles/special-files.html - -library(testthat) -library(scDiagnostics) - -test_check("scDiagnostics") diff --git a/tests/testthat/test-calculateCategorizationEntropy.R b/tests/testthat/test-calculateCategorizationEntropy.R deleted file mode 100644 index 8849056..0000000 --- a/tests/testthat/test-calculateCategorizationEntropy.R +++ /dev/null @@ -1,3 +0,0 @@ -test_that("multiplication works", { - expect_equal(2 * 2, 4) -}) diff --git a/vignettes/scDiagnostics.Rmd b/vignettes/scDiagnostics.Rmd deleted file mode 100644 index e428f47..0000000 --- a/vignettes/scDiagnostics.Rmd +++ /dev/null @@ -1,760 +0,0 @@ ---- -title: "scDiagnostics: diagnostic functions to assess the quality of cell type annotations in single-cell RNA-seq data" -author: - - name: Anthony Christidis - affiliation: Center for Computational Biomedicine, Harvard Medical School - email: anthony-alexander_christidis@hms.harvard.edu - - name: Andrew Ghazi - affiliation: Center for Computational Biomedicine, Harvard Medical School - - name: Smriti Chawla - affiliation: Center for Computational Biomedicine, Harvard Medical School - - name: Nitesh Turaga - affiliation: Center for Computational Biomedicine, Harvard Medical School - - name: Ludwig Geistlinger - affiliation: Center for Computational Biomedicine, Harvard Medical School - - name: Robert Gentleman - affiliation: Center for Computational Biomedicine, Harvard Medical School -package: scDiagnostics -output: - BiocStyle::html_document: - 
toc: true - toc_float: true -vignette: > - %\VignetteIndexEntry{scDiagnostics} - %\VignetteEncoding{UTF-8} - %\VignetteEngine{knitr::rmarkdown} -editor_options: - markdown: - wrap: 72 ---- - -```{r setup, include = FALSE} -knitr::opts_chunk$set( - collapse = TRUE, - comment = "#>" -) -``` - -# Purpose - -Annotation transfer from a reference dataset for the cell type -annotation of a new query single-cell RNA-sequencing (scRNA-seq) -experiment is an integral component of the typical analysis workflow. -The approach provides a fast, automated, and reproducible alternative to -the manual annotation of cell clusters based on marker gene expression. -However, dataset imbalance and undiagnosed incompatibilities between -query and reference datasets can lead to erroneous annotation and distort -downstream applications. - -The `scDiagnostics` package provides functionality for the systematic -evaluation of cell type assignments in scRNA-seq data. `scDiagnostics` -offers a suite of diagnostic functions to assess whether both (query and -reference) datasets are aligned, ensuring that annotations can be -transferred reliably. `scDiagnostics` also provides functionality to -assess annotation ambiguity, cluster heterogeneity, and marker gene -alignment. The implemented functionality helps researchers to determine -how accurately cells from a new scRNA-seq experiment can be assigned to -known cell types. - -# Installation - -To install the development version of the package from GitHub, use the -following command: - -```{r dev_version_install, eval = FALSE} -BiocManager::install("ccb-hms/scDiagnostics") -``` - -NOTE: you will need the -[remotes](https://cran.r-project.org/web/packages/remotes/index.html) -package to install from GitHub. 
- -To build the package vignettes upon installation use: - -```{r build_vignettes, eval=FALSE} -BiocManager::install("ccb-hms/scDiagnostics", - build_vignettes = TRUE, - dependencies = TRUE) -``` - -# Usage - -To explore the capabilities of the scDiagnostics package, you can load -your own data or utilize publicly available datasets obtained from the -scRNAseq R package. In this guide, we will demonstrate how to use -scDiagnostics with such datasets, which serve as valuable resources for -exploring the package and assessing the appropriateness of cell type -assignments. - -```{r libraries, message = FALSE} -library(scDiagnostics) -library(celldex) -library(corrplot) -library(scater) -library(scran) -library(scRNAseq) -library(AUCell) -library(RColorBrewer) -library(SingleR) -library(ComplexHeatmap) -``` - -## Scatter Plot: QC stats vs. Annotation Scores - -Here, we will consider the Human Primary Cell Atlas (Mabbott et al. -2013) as a reference dataset and our query dataset consists of -Haematopoietic stem and progenitor cells from (Bunis DG et al. 2021). - -In scRNA-seq studies, assessing the quality of cells is important for -accurate downstream analyses. At the same time, assigning accurate cell -type labels based on gene expression profiles is an integral aspect of -scRNA-seq data interpretation. Generally, these two are performed -independently of each other. The rationale behind this function is to -inspect whether certain QC (Quality Control) criteria impact the -confidence level of cell type annotations. - -For instance, it is reasonable to hypothesize that higher library sizes -could contribute to increased annotation confidence due to enhanced -statistical power for identifying cell type-specific gene expression -patterns, as evident in the scatter plot below. 
- -```{r Scatter-Plot-LibrarySize-Vs-Annotation-Scores, message=FALSE, warning=FALSE, eval=FALSE} - -# Load reference dataset -ref_data <- celldex::fetchReference("hpca", "2024-02-26") - -# Load query dataset (Bunis haematopoietic stem and progenitor cell -# data) from Bunis DG et al. (2021). Single-Cell Mapping of -# Progressive Fetal-to-Adult Transition in Human Naive T Cells. Cell -# Rep. 34(1): 108573 - -query_data <- BunisHSPCData() -rownames(query_data) <- rowData(query_data)$Symbol - -# Add QC metrics to query data -query_data <- addPerCellQCMetrics(query_data) - -# Log transform query dataset -query_data <- logNormCounts(query_data) - -# Run SingleR to predict cell types -pred <- SingleR(query_data, ref_data, labels = ref_data$label.main) - -# Assign predicted labels to query data -colData(query_data)$pred.labels <- pred$labels - -# Get annotation scores -scores <- apply(pred$scores, 1, max) - -# Assign scores to query data -colData(query_data)$cell_scores <- scores - -# Create a scatter plot between library size and annotation scores -p1 <- plotQCvsAnnotation( - query_data = query_data, - qc_col = "total", - label_col = "pred.labels", - score_col = "cell_scores", - label = NULL -) -p1 + xlab("Library Size") -``` - -However, certain QC metrics, such as the proportion of mitochondrial -genes, may require careful consideration as they can sometimes be -associated with cellular states or functions rather than noise. The -interpretation of mitochondrial content should be context-specific and -informed by biological knowledge. - -In the next analysis, we investigate the relationship between mitochondrial -percentage and cell type annotation scores using liver tissue data from -He S et al. (2020). Notably, we observe high annotation scores for -macrophages and monocytes. These findings align with the established -biological characteristic of high mitochondrial activity in macrophages -and monocytes, adding biological context to our results. 
- -```{r QC-Annotation-Scatter-Mito, warning=FALSE, message=FALSE, eval=FALSE} -# Load query dataset -query_data <- HeOrganAtlasData( - tissue = c("Liver"), - ensembl = FALSE, - location = TRUE -) - -# Add QC metrics to query data - -mito_genes <- rownames(query_data)[grep("^MT-", rownames(query_data))] -query_data <- unfiltered <- addPerCellQC(query_data, subsets = list(mt = mito_genes)) -qc <- quickPerCellQC(colData(query_data), sub.fields = "subsets_mt_percent") -query_data <- query_data[, !qc$discard] - -# Log transform query dataset -query_data <- logNormCounts(query_data) - -# Run SingleR to predict cell types -pred <- SingleR(query_data, ref_data, labels = ref_data$label.main) - -# Assign predicted labels to query data -colData(query_data)$pred.labels <- pred$labels - -# Get annotation scores -scores <- apply(pred$scores, 1, max) - -# Assign scores to query data -colData(query_data)$cell_scores <- scores - -# Create a new column for the labels so it is easy to distinguish -# between macrophages, monocytes, and other cells -query_data$label_category <- - ifelse(query_data$pred.labels %in% c("Macrophage", "Monocyte"), - query_data$pred.labels, - "Other cells") - - -# Define custom colors for cell type labels -cols <- c("Other cells" = "grey", "Macrophage" = "green", "Monocyte" = "red") - -# Generate scatter plot for all cell types -p1 <- plotQCvsAnnotation( - query_data = query_data, - qc_col = "subsets_mt_percent", - label_col = "label_category", - score_col = "cell_scores", - label = NULL) + - scale_color_manual(values = cols) + - xlab("subsets_mt_percent") -p1 -``` - -## Examining Distribution of QC stats and Annotation Scores - -In addition to the scatter plot, we can gain further insights into the -gene expression profiles by visualizing the distribution of user-defined -QC stats and annotation scores for all the cell types or specific cell -types. 
This allows us to examine the variation and patterns in -expression levels and scores across cells assigned to the cell type of -interest. - -To accomplish this, we create two separate histograms. The first -histogram displays the distribution of the annotation scores. - -The second histogram visualizes the distribution of QC stats. This -provides insights into the overall gene expression levels for the -specific cell type. In this example, we investigate the -percentage of mitochondrial genes. - -By examining the histograms, we can observe the range, shape, and -potential outliers in the distribution of both annotation scores and QC -stats. This allows us to assess the appropriateness of the cell type -assignments and identify any potential discrepancies or patterns in the -gene expression profiles for the specific cell type. - -```{r Mito-Genes-Vs-Annotation, warning=FALSE, message=FALSE, eval=FALSE} -# Generate histogram -histQCvsAnnotation(query_data = query_data, qc_col = "subsets_mt_percent", - label_col = "pred.labels", - score_col = "cell_scores", - label = NULL) -``` - -The right-skewed distribution of mitochondrial percentages and the -left-skewed distribution of annotation scores in the histograms above -suggest that most cells have lower mitochondrial contamination and -higher confidence in their assigned cell types. - -## Exploring Gene Expression Distribution - -This function helps users explore the distribution of gene expression -values for a specific gene of interest across all the cells in both -reference and query datasets and within specific cell types. This helps -to evaluate whether the distributions are similar or aligned between the -datasets. Discrepancies in distribution patterns may indicate potential -incompatibilities or differences between the datasets. - -The function also allows users to narrow down their analysis to specific -cell types of interest. 
This enables investigation of whether alignment -between the query and reference datasets is consistent not only at a -global level but also within specific cell types. - -```{r Gene-Expression-Histogram, warning=FALSE, message=FALSE} - -# Load data -sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE) - -# Divide the data into reference and query datasets -set.seed(100) -indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE) -ref_data <- sce[, indices] -query_data <- sce[, -indices] - -# Log-transform datasets -ref_data <- logNormCounts(ref_data) -query_data <- logNormCounts(query_data) - -# Run PCA -ref_data <- runPCA(ref_data) -query_data <- runPCA(query_data) - -# Get cell type scores using SingleR -pred <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad) -pred <- as.data.frame(pred) - -# Assign labels to query data -colData(query_data)$labels <- pred$labels - -# Generate density plots -plotMarkerExpression(reference_data = ref_data, - query_data = query_data, - ref_cell_type_col = "reclustered.broad", - query_cell_type_col = "labels", - gene_name = "MS4A1", - label = "B_and_plasma") -``` - -In the provided example, we examined the distribution of expression -values for the gene MS4A1, a marker for naive B cells, in both the query -and reference datasets. Additionally, we also looked at the distribution -of MS4A1 expression in the B_and_plasma cell type. We observed -overlapping distributions in both cases, suggesting alignment between -the reference and query datasets. - -## Evaluating Alignment Between Reference and Query Datasets in Terms of Highly Variable Genes - -We are assessing the similarity or alignment between two datasets, the -reference dataset, and the query dataset, in terms of highly variable -genes (HVGs). We calculate the overlap coefficient between the sets of -highly variable genes in the reference and query datasets. 
The overlap -coefficient quantifies the degree of overlap or similarity between these -two sets of genes. A value closer to 1 indicates a higher degree of -overlap, while a value closer to 0 suggests less overlap. The computed -overlap coefficient is printed, providing a numerical measure of how -well the highly variable genes in the reference and query datasets -align. In this case, the overlap coefficient is 0.62, indicating a -moderate level of overlap. - -```{r HVG overlap, warning=FALSE, message=FALSE} - -# Selecting highly variable genes -ref_var <- getTopHVGs(ref_data, n=2000) -query_var <- getTopHVGs(query_data, n=2000) - -# Compute the overlap coefficient -overlap_coefficient <- calculateHVGOverlap(reference_genes = ref_var, - query_genes = query_var) -print(overlap_coefficient) -``` - -## Visualize Gene Expression on Dimensional Reduction Plot - -To gain insights into the gene expression patterns and their -representation in a dimensional reduction space, we can utilize the -plotGeneExpressionDimred function. This function allows us to plot the -gene expression values of a specific gene on a dimensional reduction -plot generated using methods like t-SNE, UMAP, or PCA. Each single cell -is color-coded based on its expression level of the gene of interest. - -In the provided example, we are visualizing the gene expression values -of the gene "VPREB3" on a PCA plot. 
The PCA plot represents the cells in -a lower-dimensional space, where the x-axis corresponds to the first -principal component (Dimension 1) and the y-axis corresponds to the -second principal component (Dimension 2). Each cell is represented as a -point on the plot, and its color reflects the expression level of the -gene "VPREB3," ranging from low (lighter color) to high (darker color). - -```{r Gene-Expression-Scatter, warning=FALSE, message=FALSE} -# Generate dimension reduction plot color code by gene expression -plotGeneExpressionDimred(se_object = query_data, - method = "PCA", - n_components = c(1, 2), - feature = "VPREB3") -``` - -The dimensional reduction plot allows us to observe how the gene -expression of VPREB3 is distributed across the cells and whether any -clusters or patterns emerge in the data. - -## Visualize Gene Sets or Pathway Scores on Dimensional Reduction Plot - -In addition to examining individual gene expression patterns, it is -often useful to assess the collective activity of gene sets or pathways -within single cells. This can provide insights into the functional -states or biological processes associated with specific cell types or -conditions. To facilitate this analysis, the scDiagnostics package -includes a function called plotGeneSetScores that enables the -visualization of gene set or pathway scores on a dimensional reduction -plot. - -The plotGeneSetScores function allows you to plot gene set or pathway -scores on a dimensional reduction plot generated using methods such as -PCA, t-SNE, or UMAP. Each single cell is color-coded based on its scores -for specific gene sets or pathways. This visualization helps identify -the heterogeneity and patterns of gene set or pathway activity within -the dataset, potentially revealing subpopulations with distinct -functional characteristics. 
- -```{r Pathway-Scores-on-Dimensional-Reduction-Scatter, warning=FALSE, message=FALSE} - -# Compute scores using AUCell -expression_matrix <- assay(query_data, "logcounts") -cells_rankings <- AUCell_buildRankings(expression_matrix, plotStats = FALSE) - -# Generate gene sets -gene_set1 <- sample(rownames(expression_matrix), 10) -gene_set2 <- sample(rownames(expression_matrix), 20) - -gene_sets <- list(geneSet1 = gene_set1, - geneSet2 = gene_set2) - -# Calculate AUC scores for gene sets -cells_AUC <- AUCell_calcAUC(gene_sets, cells_rankings) - -# Assign scores to colData -colData(query_data)$geneSetScores <- assay(cells_AUC)["geneSet1", ] - -# Plot gene set scores on PCA -plotGeneSetScores(se_object = query_data, - method = "PCA", - feature = "geneSetScores", - pc_subset = c(1:5)) -``` - -In the provided example, we demonstrate the usage of the -plotGeneSetScores function using the AUCell package to compute gene set -or pathway scores. Custom gene sets are generated for demonstration -purposes, but users can provide their own gene set scores using any -method of their choice. It is important to ensure that the scores are -assigned to the colData of the reference or query object and specify the -correct feature name for visualization. - -By visualizing gene set or pathway scores on a dimensional reduction -plot, you can gain a comprehensive understanding of the functional -landscape within your single-cell gene expression dataset and explore -the relationships between gene set activities and cellular phenotypes. - -## Visualizing Reference and Query Cell Types using Multidimensional Scaling (MDS) - -This function performs Multidimensional Scaling (MDS) analysis on the -query and reference datasets to examine their similarity. The -dissimilarity matrix is calculated based on the correlation between the -datasets, representing the distances between cells in terms of gene -expression patterns. MDS is then applied to derive low-dimensional -coordinates for each cell. 
Subsequently, a scatter plot is generated, -where each data point represents a cell, and cell types are color-coded -using custom colors provided by the user. This visualization enables the -comparison of cell type distributions between the query and reference -datasets in a reduced-dimensional space. - -The rationale behind this function is to visually assess the alignment -and relationships between cell types in the query and reference -datasets. - - -```{r CMD-Scatter-Plot, warning=FALSE, message=FALSE} - -# Intersect the gene symbols to obtain common genes -common_genes <- intersect(ref_var, query_var) - -# Select desired cell types -selected_cell_types <- c("CD4", "CD8", "B_and_plasma") -ref_data_subset <- ref_data[common_genes, ref_data$reclustered.broad %in% selected_cell_types] -query_data_subset <- query_data[common_genes, query_data$labels %in% selected_cell_types] - -# Extract cell types for visualization -ref_labels <- ref_data_subset$reclustered.broad -query_labels <- query_data_subset$labels - -# Generate the MDS scatter plot with cell type coloring -visualizeCellTypeMDS(query_data = query_data_subset, - reference_data = ref_data_subset, - query_cell_type_col = "labels", - ref_cell_type_col = "reclustered.broad") -``` - -Upon examining the MDS scatter plot, we observe that the CD4 and CD8 -cell types overlap to some extent. By observing the proximity or overlap -of different cell types, one can gain insights into their potential -relationships or shared characteristics. - -The selection of custom genes and desired cell types depends on the -user's research interests and goals. It allows for flexibility in -focusing on specific genes and examining particular cell types of -interest in the visualization. - -## Cell Type-specific Pairwise Correlation Analysis and Visualization - -This analysis aims to explore the correlation patterns between different -cell types in a single-cell gene expression dataset. 
The goal is to -compare the gene expression profiles of cells from a reference dataset -and a query dataset to understand the relationships and similarities -between various cell types. - -To perform the analysis, we start by computing the pairwise correlations -between the query and reference cells for selected cell types ("CD4", -"CD8", "B_and_plasma"). The Spearman correlation method is used here; users -can also choose the Pearson correlation coefficient. - -This returns an average correlation matrix, which can be visualized with the -user's method of choice. Here, the results are visualized as a -correlation plot using the corrplot package. - -```{r Cell-Type-Correlation-Analysis-Visualization, warning=FALSE, message=FALSE} -selected_cell_types <- c("CD4", "CD8", "B_and_plasma") -ref_data_subset <- runPCA(ref_data_subset) -cor_matrix_avg <- calculateAveragePairwiseCorrelation(query_data = query_data_subset, - reference_data = ref_data_subset, - n_components = 5, - query_cell_type_col = "labels", - ref_cell_type_col = "reclustered.broad", - cell_types = selected_cell_types, - correlation_method = "spearman") - -# Visualize the output -plot(cor_matrix_avg) -``` - -In this case, users have the flexibility to extract the gene expression -profiles of specific cell types from the reference and query datasets -and provide these profiles as input to the function. Additionally, they -can select their own set of genes that they consider relevant for -computing the pairwise correlations. For demonstration, we have used the -common highly variable genes from the reference and query datasets. - -By providing their own gene expression profiles and choosing specific -genes, users can focus the analysis on the cell types and genes of -interest to their research question. - -## Pairwise Distance Analysis and Density Visualization - -This function conducts an analysis of pairwise distances or -correlations between cells of specific cell types within a single-cell -gene expression dataset. 
By calculating these distances or correlations, -users can gain insights into the relationships and differences in gene -expression profiles between different cell types. The function -facilitates this analysis by generating density plots, allowing users to -visualize the distribution of distances or correlations for various -pairwise comparisons. - -The analysis offers the flexibility to select a particular cell type for -examination, and users can choose between different distance metrics, -such as "euclidean" or "manhattan," to calculate pairwise distances. - -To illustrate, the function is applied to the cell type CD8 using the -euclidean distance metric in the example below. - -```{r Pairwise-Distance-Analysis-Density-Visualization, fig.width=8, message=FALSE, warning=FALSE} -calculatePairwiseDistancesAndPlotDensity(query_data = query_data_subset, - reference_data = ref_data_subset, - n_components = 10, - query_cell_type_col = "labels", - ref_cell_type_col = "reclustered.broad", - cell_type_query = "CD8", - cell_type_reference = "CD8", - distance_metric = "euclidean") -``` - -Alternatively, users can opt for the "correlation" distance metric, -which measures the similarity in gene expression profiles between cells. - -To illustrate, the function is applied to the cell type CD8 using the -correlation distance metric in the example below. By selecting either -the "pearson" or "spearman" correlation method, users can emphasize -either linear or rank-based associations, respectively. 
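
Before the package call, the two kinds of distance discussed above can be sketched in base R on simulated PC scores. This is an illustration only: the matrices `query_pcs` and `ref_pcs` are hypothetical stand-ins for cells projected onto 10 principal components, and defining the correlation distance as one minus the Spearman correlation is one common convention assumed here, not necessarily what `calculatePairwiseDistancesAndPlotDensity()` does internally.

```r
# Illustrative sketch only: simulated PC scores, base R, not the package internals.
set.seed(7)
query_pcs <- matrix(rnorm(30 * 10), nrow = 30)              # 30 query cells x 10 PCs
ref_pcs   <- matrix(rnorm(40 * 10, mean = 0.5), nrow = 40)  # 40 reference cells x 10 PCs

# Euclidean distance between every query cell and every reference cell:
# compute all pairwise distances once, then keep the query-by-reference block.
all_dist  <- as.matrix(dist(rbind(query_pcs, ref_pcs), method = "euclidean"))
query_idx <- seq_len(nrow(query_pcs))
ref_idx   <- nrow(query_pcs) + seq_len(nrow(ref_pcs))
query_ref_euclid <- all_dist[query_idx, ref_idx]

# Correlation-based distance (assumed convention: 1 - Spearman correlation).
# cor() correlates columns, so cells are moved to columns with t().
query_ref_cor <- 1 - cor(t(query_pcs), t(ref_pcs), method = "spearman")

# Kernel density estimate of the query-versus-reference Euclidean distances;
# plot(dens) would draw the curve.
dens <- density(as.vector(query_ref_euclid))
```

Swapping `method = "spearman"` for `"pearson"` in the `cor()` call switches from rank-based to linear association, mirroring the choice described above.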
- -```{r Pairwise-Distance-Correlation-Based-Density-Visualization, warning=FALSE, message=FALSE, fig.width=8} -calculatePairwiseDistancesAndPlotDensity(query_data = query_data_subset, - reference_data = ref_data_subset, - n_components = 10, - query_cell_type_col = "labels", - ref_cell_type_col = "reclustered.broad", - cell_type_query = "CD8", - cell_type_reference = "CD8", - distance_metric = "correlation", - correlation_method = "spearman") -``` - -By utilizing this function, users can explore the pairwise distances -between query and reference cells of a specific cell type and gain -insights into the distribution of distances through density plots. This -analysis aids in understanding the similarities and differences in gene -expression profiles for the selected cell type within the query and -reference datasets. - - - - -## PC regression analysis - -Performing PC regression analysis on a SingleCellExperiment object -enables users to examine the relationship between a principal component -(PC) from the dimension reduction slot and an independent variable of -interest. By specifying the desired dependent variable as one of the -principal components (e.g., "PC1", "PC2", etc.) and providing the -corresponding independent variable from the colData slot (e.g. -"cell_type"), users can explore the associations between linear -structure in the single-cell gene expression dataset (reference and -query) and an independent variable of interest (e.g. cell type or -batch). - -The function prints two diagnostic plots by default: - -- a plot of the two PCs with the highest R^2^ with the specified - independent variable -- a dot plot showing the R^2^ of each consecutive PC \~ indep.var - regression - - Generally you should expect this plot to die off to near 0 - before \~PC10 - - Interpretation example: If the R^2^ values are high (\>=50%) - anywhere in PCs 1-5 and your independent variable is "batch", - you have batch effects! 
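
The idea behind the R^2^ dot plot can be sketched in base R on simulated data. This is a minimal illustration under stated assumptions, not the `regressPC()` implementation: the matrix `expr` and the factor `cell_type` are made up, and each PC is simply regressed on the categorical variable with `lm()`.

```r
# Illustrative sketch only: simulated data, base R, not the regressPC() internals.
set.seed(1)
n_cells <- 300
cell_type <- factor(rep(c("A", "B", "C"), each = n_cells / 3))

# Simulate 20 "genes"; the first five are shifted according to cell type.
expr <- matrix(rnorm(n_cells * 20), nrow = n_cells)
expr[, 1:5] <- expr[, 1:5] + 2 * as.integer(cell_type)

# PCA on the cell-by-gene matrix; $x holds the per-cell PC scores.
pcs <- prcomp(expr, center = TRUE, scale. = FALSE)$x

# Regress each of the first 10 PCs on the categorical variable and keep R^2.
r_squared <- vapply(seq_len(10), function(i) {
  summary(lm(pcs[, i] ~ cell_type))$r.squared
}, numeric(1))
names(r_squared) <- paste0("PC", seq_len(10))

round(r_squared, 3)
```

In this simulation the cell type signal is concentrated in a single direction, so the R^2^ is high for PC1 and close to zero for the later PCs, which is the "die off" pattern described in the list above.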
- -```{r Regression, warning=FALSE, message=FALSE} - -# Specify the dependent variables (principal components) and -# independent variable (e.g., "labels") -dep.vars <- paste0("PC", 1:12) -indep.var <- "labels" - -# Perform linear regression on multiple principal components -result <- regressPC(sce = query_data, - dep.vars = dep.vars, - indep.var = indep.var) - -# Print the summaries of the linear regression models and R-squared -# values - -# Summaries of the linear regression models -result$regression.summaries[[1]] - -# R-squared values -result$rsquared - -# Variance contributions for each principal component -result$var.contributions - -# Total variance explained -result$total.variance.explained -``` - -This analysis helps uncover whether there is systematic variation in -PC values across different cell types. In the example above, we can see -that the four cell types are spread out across both PC1 and PC2. Digging -into the genes with high loadings on these PCs can help explain the -biological or technical factors driving cellular heterogeneity. It can -help identify PC dimensions that capture variation specific to certain -cell types or distinguish different cellular states. - -Let's look at the genes driving PC1 by ordering the rotation matrix by -the absolute gene loadings for PC1: - -```{r} -pc_df <- attr(reducedDims(query_data)$PCA, "rotation")[,1:5] |> - as.data.frame() - -pc_df[order(abs(pc_df$PC1)),] |> - tail() -``` - -PC1 is mostly driven by NKG7 - Natural Killer Cell Granule Protein 7. -This gene is important in CD8+ T cells, so it makes sense that it -distinguishes the cell types shown. - -> Exercise: What genes are driving PC2? Do they make sense? - -```{r echo = FALSE, eval = FALSE} -pc_df[order(abs(pc_df$PC2)),] |> - tail() - -# It's IL32 mostly. -``` - -> Exercise: Try to use the command below to examine the spike on PC5. -> What's going on there? 
- -`plotPCA(query_data, ncomponents = c(1,5), color_by = "labels")` - -```{r eval=FALSE, echo=FALSE} -plotPCA(query_data, ncomponents = c(1,5), color_by = "labels") -# The myeloid cells are shifted off from the other types. - -pc_df[order(abs(pc_df$PC5)),] |> - tail() -# It's mostly driven by low GNLY expression in the myeloid cells. -``` - - -## Annotation entropy - -In order to assess the confidence of cell type predictions, we can use -the function `calculateCategorizationEntropy()`. This function -calculates the information entropy of assignment probabilities across a -set of cell types for each cell. If a set of class probabilities are -confident, the entropies will be low. - -This can be used to compare two sets of cell type assignments (e.g. from -different type assignment methods) to compare their relative confidence. -**Please note that this has nothing to do with their accuracy!** -Computational methods can sometimes be confidently incorrect. - -The cell type probabilities should be passed as a matrix with cell types -as rows and cells as columns. If the columns of the matrix are not valid -probability distributions (i.e. don't sum to 1 as in the below example), -the function will perform a column-wise softmax to convert them to a -probability scale. This may or may not work well depending on the -distribution of the inputs, so if at all possible try to pass -probabilities instead of arbitrary scores. - -In this example, we create 500 random cells with random normal cell type -"scores" across 4 cell types. For demonstration we make the score of the -first class much higher in the first 250 cells. After the softmax, this -will equate to a very high probability of cell type 1. The remaining 250 -will have assignments that are roughly even across the four cell types -(i.e. high entropy). 
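
The softmax-then-entropy computation described above can be sketched in base R. This is an illustration under stated assumptions rather than the source of `calculateCategorizationEntropy()`: the helper names `softmax_cols` and `column_entropy` are hypothetical, and the entropy here uses the natural logarithm without any rescaling, which may differ from the scale the package reports.

```r
# Illustrative sketch only; not the calculateCategorizationEntropy() source.
set.seed(42)
scores <- matrix(rnorm(500 * 4), nrow = 4)   # 4 cell types (rows) x 500 cells (columns)
scores[1, 1:250] <- scores[1, 1:250] + 5     # make class 1 dominant in the first 250 cells

# Column-wise softmax, subtracting each column's maximum for numerical stability.
softmax_cols <- function(x) {
  shifted <- sweep(x, 2, apply(x, 2, max))   # sweep() subtracts by default
  exp_x <- exp(shifted)
  sweep(exp_x, 2, colSums(exp_x), "/")
}

# Shannon entropy (natural log) of each column, treating 0 * log(0) as 0.
column_entropy <- function(p) {
  terms <- ifelse(p > 0, -p * log(p), 0)
  colSums(terms)
}

probs <- softmax_cols(scores)
entropy <- column_entropy(probs)

mean(entropy[1:250])    # cells with a dominant class
mean(entropy[251:500])  # cells with roughly even scores
```

With four classes the unscaled entropy is bounded above by `log(4)`, reached only when all four probabilities are equal, so the second group of cells sits much closer to that bound than the first.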
- -```{r} -X <- rnorm(500 * 4) |> matrix(nrow = 4) -X[1, 1:250] <- X[1, 1:250] + 5 - -entropy_scores <- calculateCategorizationEntropy(X) -``` - -From the plot we can see that half of the cells (the first half we -shifted to class 1) have low entropy, and half have high entropy. - -# Conclusion - -In this analysis, we have demonstrated the capabilities of the -scDiagnostics package for assessing the appropriateness of cell -assignments in single-cell gene expression profiles. By utilizing -various diagnostic functions and visualization techniques, we have -explored different aspects of the data, including total UMI counts, -annotation scores, gene expression distributions, dimensional reduction -plots, gene set scores, pairwise correlations, pairwise distances, and -linear regression analysis. - -Through the scatter plots, histograms, and dimensional reduction plots, -we were able to gain insights into the relationships between gene -expression patterns, cell types, and the distribution of cells in a -reduced-dimensional space. The examination of gene expression -distributions, gene sets, and pathways allowed us to explore the -functional landscape and identify subpopulations with distinct -characteristics within the dataset. Additionally, the pairwise -correlation and distance analyses provided a deeper understanding of the -similarities and differences between cell types, highlighting potential -relationships and patterns. - ------------------------------------------------------------------------- - -## R.session Info - -```{r SessionInfo, echo=FALSE, message=FALSE, warning=FALSE, comment=NA} -options(width = 80) #reset to 'default' width - -sessionInfo() #record the R and package versions used -```