diff --git a/DESCRIPTION b/DESCRIPTION
deleted file mode 100644
index b49d221..0000000
--- a/DESCRIPTION
+++ /dev/null
@@ -1,71 +0,0 @@
-Type: Package
-Package: scDiagnostics
-Title: Cell type annotation diagnostics
-Version: 0.99.0
-Authors@R: c(
-    person("Anthony", "Christidis", role = c("aut", "cre"), 
-            email = "anthony-alexander_christidis@hms.harvard.edu"),
-    person("Andrew", "Ghazi", role = "aut"),
-    person("Smriti", "Chawla", role = "aut"),
-    person("Nitesh", "Turaga", role = "ctb"),
-    person("Ludwig", "Geistlinger", role = "aut"),
-    person("Robert", "Gentleman", role = "aut")
-  )
-Description: The scDiagnostics package provides diagnostic plots to
-    assess the quality of cell type assignments from single cell gene
-    expression profiles. The implemented functionality allows users to
-    assess the reliability of cell type annotations, investigate gene
-    expression patterns, and explore relationships between different
-    cell types in query and reference datasets, so that potential
-    misalignments between reference and query datasets can be
-    detected. The package also provides visualization capabilities for
-    diagnostic purposes.
-License: Artistic-2.0
-URL: https://github.com/ccb-hms/scDiagnostics
-BugReports: https://github.com/ccb-hms/scDiagnostics/issues
-Depends:
-    R (>= 4.4.0)
-Imports:
-    SingleCellExperiment,
-    isotree,
-    methods,
-    ggplot2,
-    RColorBrewer,
-    gridExtra,
-    SummarizedExperiment,
-    stats,
-    utils,
-    ranger,
-    BiocNeighbors,
-    Hotelling,
-    rlang
-Suggests:
-    AUCell,
-    BiocStyle,
-    corrplot,
-    knitr,
-    Matrix,
-    rmarkdown,
-    scran,
-    scRNAseq,
-    SingleR,
-    celldex,
-    ComplexHeatmap,
-    scuttle,
-    scater,
-    testthat (>= 3.0.0)
-VignetteBuilder: 
-    knitr
-biocViews:
-    Annotation,
-    Classification,
-    Clustering,
-    GeneExpression,
-    RNASeq,
-    SingleCell,
-    Software,
-    Transcriptomics
-Encoding: UTF-8
-LazyData: true
-RoxygenNote: 7.3.1
-Config/testthat/edition: 3
diff --git a/NAMESPACE b/NAMESPACE
deleted file mode 100644
index a8e8b8d..0000000
--- a/NAMESPACE
+++ /dev/null
@@ -1,55 +0,0 @@
-# Generated by roxygen2: do not edit by hand
-
-S3method(plot,calculateAveragePairwiseCorrelation)
-S3method(plot,calculateSampleDistances)
-S3method(plot,calculateSampleSimilarityPCA)
-S3method(plot,compareCCA)
-S3method(plot,comparePCA)
-S3method(plot,comparePCASubspace)
-S3method(plot,detectAnomaly)
-S3method(plot,nearestNeighborDiagnostics)
-export(boxplotPCA)
-export(calculateAveragePairwiseCorrelation)
-export(calculateCategorizationEntropy)
-export(calculateHVGOverlap)
-export(calculateHotellingPValue)
-export(calculatePairwiseDistancesAndPlotDensity)
-export(calculateSampleDistances)
-export(calculateSampleDistancesSimilarity)
-export(calculateSampleSimilarityPCA)
-export(calculateVarImpOverlap)
-export(compareCCA)
-export(comparePCA)
-export(comparePCASubspace)
-export(detectAnomaly)
-export(histQCvsAnnotation)
-export(nearestNeighborDiagnostics)
-export(plotGeneExpressionDimred)
-export(plotGeneSetScores)
-export(plotMarkerExpression)
-export(plotPCRegression)
-export(plotQCvsAnnotation)
-export(projectPCA)
-export(regressPC)
-export(visualizeCellTypeMDS)
-export(visualizeCellTypePCA)
-import(SingleCellExperiment)
-import(ggplot2)
-importFrom(SummarizedExperiment,assay)
-importFrom(ggplot2,ggplot)
-importFrom(gridExtra,grid.arrange)
-importFrom(methods,is)
-importFrom(rlang,.data)
-importFrom(stats,approxfun)
-importFrom(stats,cancor)
-importFrom(stats,cmdscale)
-importFrom(stats,cor)
-importFrom(stats,density)
-importFrom(stats,dist)
-importFrom(stats,lm)
-importFrom(stats,na.omit)
-importFrom(stats,predict)
-importFrom(stats,qnorm)
-importFrom(stats,setNames)
-importFrom(utils,combn)
-importFrom(utils,tail)
diff --git a/NEWS.md b/NEWS.md
deleted file mode 100644
index 2389c17..0000000
--- a/NEWS.md
+++ /dev/null
@@ -1,4 +0,0 @@
-# scDiagnostics 0.99.0
-
-* Initial CRAN submission.
-* New package scDiagnostics, for cell type annotation diagnostics.
diff --git a/R/boxplotPCA.R b/R/boxplotPCA.R
deleted file mode 100644
index 22a7ace..0000000
--- a/R/boxplotPCA.R
+++ /dev/null
@@ -1,146 +0,0 @@
-#' @title Plot Principal Components for Different Cell Types
-#'
-#' @description This function generates a \code{ggplot2} boxplot visualization of principal components (PCs) for different 
-#' cell types across two datasets (query and reference).
-#'
-#' @details
-#' The function \code{boxplotPCA} is designed to provide a visualization of principal component analysis (PCA) results. It projects 
-#' the query dataset onto the principal components obtained from the reference dataset. The results are then visualized 
-#' as boxplots, grouped by cell types and datasets (query and reference). This allows for a comparative analysis of the 
-#' distributions of the principal components across different cell types and datasets. The function internally calls \code{projectPCA} 
-#' to perform the PCA projection. It then reshapes the output data into a long format suitable for ggplot2 plotting. 
-#' The color scheme is automatically determined using the \code{RColorBrewer} package, ensuring a visually distinct and appealing plot.
-#'
-#' @param query_data A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the query cells.
-#' @param reference_data A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the reference cells.
-#' @param n_components An integer specifying the number of principal components to use for projection. Defaults to 10. 
-#' Must be less than or equal to the number of components available in the reference PCA.
-#' @param cell_types A character vector specifying the cell types to include in the plot. If NULL, all cell types are included.
-#' @param query_cell_type_col character. The column name in the \code{colData} of \code{query_data} 
-#' that identifies the cell types.
-#' @param ref_cell_type_col character. The column name in the \code{colData} of \code{reference_data} 
-#' that identifies the cell types.
-#' @param pc_subset A numeric vector specifying which principal components to include in the plot. Default is PC1 to PC5.
-#'
-#' @return A ggplot object representing the boxplots of specified principal components for the given cell types and datasets.
-#'
-#' @export
-#'
-#' @author Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu}
-#'
-#' @examples
-#' # Load required libraries
-#' library(scRNAseq)
-#' library(scuttle)
-#' library(SingleR)
-#' library(scran)
-#' library(scater)
-#'
-#' # Load data (replace with your data loading)
-#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-#' 
-#' # Divide the data into reference and query datasets
-#' set.seed(100)
-#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-#' ref_data <- sce[, indices]
-#' query_data <- sce[, -indices]
-#' 
-#' # log transform datasets
-#' ref_data <- scuttle::logNormCounts(ref_data)
-#' query_data <- scuttle::logNormCounts(query_data)
-#' 
-#' # Get cell type scores using SingleR (or any other cell type annotation method)
-#' scores <- SingleR::SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-#' 
-#' # Add labels to query object
-#' colData(query_data)$labels <- scores$labels
-#' 
-#' # Selecting highly variable genes (can be customized by the user)
-#' ref_var <- scran::getTopHVGs(ref_data, n = 2000)
-#' query_var <- scran::getTopHVGs(query_data, n = 2000)
-#' 
-#' # Intersect the gene symbols to obtain common genes
-#' common_genes <- intersect(ref_var, query_var)
-#' ref_data_subset <- ref_data[common_genes, ]
-#' query_data_subset <- query_data[common_genes, ]
-#'
-#' # Run PCA on the reference data (assumed to be prepared)
-#' ref_data_subset <- runPCA(ref_data_subset)
-#'
-#' pc_plot <- boxplotPCA(query_data_subset, ref_data_subset,
-#'                       n_components = 10,
-#'                       cell_types = c("CD4", "CD8", "B_and_plasma", "Myeloid"),
-#'                       query_cell_type_col = "labels", 
-#'                       ref_cell_type_col = "reclustered.broad", 
-#'                       pc_subset = c(1:5))
-#' pc_plot
-#' 
-#' 
-#' @importFrom stats approxfun cancor density setNames
-#' @importFrom utils combn
-#'                          
-# Function to plot PC for different cell types
-boxplotPCA <- function(query_data, reference_data, 
-                       n_components = 10, 
-                       cell_types = NULL,
-                       query_cell_type_col = NULL, 
-                       ref_cell_type_col = NULL, 
-                       pc_subset = c(1:5)){
-    
-    # Get the projected PCA data
-    pca_output <- projectPCA(query_data = query_data, reference_data = reference_data, 
-                             n_components = n_components, 
-                             query_cell_type_col = query_cell_type_col, 
-                             ref_cell_type_col = ref_cell_type_col)
-    
-    # Create the long format data frame manually
-    pca_output <- pca_output[!is.na(pca_output$cell_type),]
-    if(!is.null(cell_types)){
-        if(all(cell_types %in% pca_output$cell_type)){
-            pca_output <- pca_output[which(pca_output$cell_type %in% cell_types),]
-        } else{
-            stop("One or more of the specified \'cell_types\' are not available.")
-        }
-    }
-    pca_long <- data.frame(PC = rep(paste0("pc", pc_subset), each = nrow(pca_output)),
-                           Value = unlist(c(pca_output[, pc_subset])),
-                           dataset = rep(pca_output$dataset, length(pc_subset)),
-                           cell_type = rep(pca_output$cell_type, length(pc_subset)))
-    pca_long$PC <- toupper(pca_long$PC)
-    
-    # Create a new variable representing the combination of cell type and dataset
-    pca_long$cell_type_dataset <- paste(pca_long$dataset, pca_long$cell_type, sep = " ")
-    
-    # Define the order of cell type and dataset combinations
-    order_combinations <- paste(rep(c("Reference", "Query"), length(unique(pca_long$cell_type))),
-                                rep(sort(unique(pca_long$cell_type)), each = 2))
-    
-    # Reorder the levels of cell type and dataset factor
-    pca_long$cell_type_dataset <- factor(pca_long$cell_type_dataset, levels = order_combinations)
-    
-    # Define the colors for cell types
-    color_mapping <- setNames(RColorBrewer::brewer.pal(length(order_combinations), "Paired"), order_combinations)
-    cell_type_colors <- color_mapping[order_combinations]
-    
-    # Create the ggplot
-    plot <- ggplot2::ggplot(pca_long, aes(x = cell_type, y = Value, fill = cell_type_dataset)) +
-        ggplot2::geom_boxplot(alpha = 0.7, outlier.shape = NA, width = 0.7) + 
-        ggplot2::facet_wrap(~ PC, scales = "free") +
-        ggplot2::scale_fill_manual(values = cell_type_colors, name = "Cell Types") + 
-        ggplot2::labs(x = "", y = "Value") +  
-        ggplot2::theme_minimal() +
-        ggplot2::theme(legend.position = "right",  
-                       axis.text.x = ggplot2::element_text(angle = 45, hjust = 1, size = 10),  
-                       axis.title = ggplot2::element_text(size = 14), 
-                       strip.text = ggplot2::element_text(size = 12, face = "bold"), 
-                       panel.grid.major = ggplot2::element_line(color = "grey", linetype = "dotted", linewidth = 0.7),  
-                       panel.grid.minor = ggplot2::element_blank(),  
-                       panel.border = ggplot2::element_blank(),  
-                       strip.background = ggplot2::element_rect(fill = "lightgrey", color = "grey", linewidth = 0.5),  
-                       plot.title = ggplot2::element_text(size = 16, face = "bold", hjust = 0.5))
-    
-    # Return the plot
-    return(plot)
-}
-
-
diff --git a/R/calculateAveragePairwiseCorrelation.R b/R/calculateAveragePairwiseCorrelation.R
deleted file mode 100644
index 1c99f51..0000000
--- a/R/calculateAveragePairwiseCorrelation.R
+++ /dev/null
@@ -1,169 +0,0 @@
-#' Compute Average Pairwise Correlation between Cell Types
-#'
-#' Computes the average pairwise correlations between specified cell types 
-#' in single-cell gene expression data.
-#' 
-#' @details This function operates on \code{\linkS4class{SingleCellExperiment}} objects, 
-#' ideal for single-cell analysis workflows. It calculates pairwise correlations between query and 
-#' reference cells using a specified correlation method, then averages these correlations for each 
-#' cell type pair. This function aids in assessing the similarity between cells in reference and query datasets, 
-#' providing insights into the reliability of cell type annotations in single-cell gene expression data.
-#'
-#' @param query_data  A \code{\linkS4class{SingleCellExperiment}} containing the single-cell 
-#' expression data and metadata.
-#' @param n_components The number of principal components to use for the dimensionality reduction of the data using PCA. Defaults to 10.
-#' If set to \code{NULL} then no dimensionality reduction is performed and the assay data is used directly for computations.
-#' @param reference_data A \code{\linkS4class{SingleCellExperiment}} object containing the single-cell 
-#' expression data and metadata.
-#' @param query_cell_type_col character. The column name in the \code{colData} of \code{query_data} 
-#' that identifies the cell types.
-#' @param ref_cell_type_col character. The column name in the \code{colData} of \code{reference_data} 
-#' that identifies the cell types.
-#' @param cell_types A character vector specifying the cell types to be analysed.
-#' @param correlation_method The correlation method to use for calculating pairwise correlations; it is passed to \code{\link[stats]{cor}} (for example, \code{"spearman"}).
-#'
-#' @return A matrix containing the average pairwise correlation values. 
-#'         Rows and columns are labeled with the cell types. Each element 
-#'         in the matrix represents the average correlation between a pair 
-#'         of cell types.
-#'         
-#' @seealso \code{\link{plot.calculateAveragePairwiseCorrelation}}
-#' 
-#' @examples
-#' library(scater)
-#' library(scran)
-#' library(scRNAseq)
-#' library(SingleR)
-#'
-#' # Load data
-#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-#'
-#' # Divide the data into reference and query datasets
-#' set.seed(100)
-#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-#' ref_data <- sce[, indices]
-#' query_data <- sce[, -indices]
-#'
-#' # log transform datasets
-#' ref_data <- logNormCounts(ref_data)
-#' query_data <- logNormCounts(query_data)
-#'
-#' # Get cell type scores using SingleR
-#' scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-#'
-#' # Add labels to query object
-#' colData(query_data)$labels <- scores$labels
-#'
-#' # Compute Pairwise Correlations
-#' # Note: The selection of highly variable genes and desired cell types may vary 
-#' # based on user preference. 
-#' # The cell type annotation method used in this example is SingleR. 
-#' # User can use any other method for cell type annotation and provide 
-#' # the corresponding labels in the metadata.
-#'
-#' # Selecting highly variable genes
-#' ref_var <- getTopHVGs(ref_data, n = 2000)
-#' query_var <- getTopHVGs(query_data, n = 2000)
-#'
-#' # Intersect the gene symbols to obtain common genes
-#' common_genes <- intersect(ref_var, query_var)
-#'
-#' # Select desired cell types
-#' selected_cell_types <- c("CD4", "CD8", "B_and_plasma")
-#' ref_data_subset <- ref_data[common_genes, ref_data$reclustered.broad %in% selected_cell_types]
-#' query_data_subset <- query_data[common_genes, query_data$reclustered.broad %in% selected_cell_types]
-#' 
-#' # Run PCA on the reference data
-#' ref_data_subset <- runPCA(ref_data_subset)
-#'
-#' # Compute pairwise correlations
-#' cor_matrix_avg <- calculateAveragePairwiseCorrelation(query_data = query_data_subset, 
-#'                                                       reference_data = ref_data_subset, 
-#'                                                       n_components = 10,
-#'                                                       query_cell_type_col = "labels", 
-#'                                                       ref_cell_type_col = "reclustered.broad", 
-#'                                                       cell_types = selected_cell_types, 
-#'                                                       correlation_method = "spearman")
-#'
-#' # Visualize the results
-#' plot(cor_matrix_avg)
-#' 
-#'
-#' @import SingleCellExperiment
-#' @importFrom SummarizedExperiment assay
-#' @importFrom stats cor
-#' @export
-calculateAveragePairwiseCorrelation <- function(query_data, 
-                                                reference_data, 
-                                                n_components = 10,
-                                                query_cell_type_col, 
-                                                ref_cell_type_col, 
-                                                cell_types, 
-                                                correlation_method) {
-  # Sanity checks
-  
-  # Check if query_data is a SingleCellExperiment object
-  if (!is(query_data, "SingleCellExperiment")) {
-    stop("query_data must be a SingleCellExperiment object.")
-  }
-  
-  # Check if reference_data is a SingleCellExperiment object
-  if (!is(reference_data, "SingleCellExperiment")) {
-    stop("reference_data must be a SingleCellExperiment object.")
-  }
-  
-  # Check if query_cell_type_col is a valid column name in query_data
-  if (!query_cell_type_col %in% names(colData(query_data))) {
-    stop("query_cell_type_col: '", query_cell_type_col, "' is not a valid column name in query_data.")
-  }
-  
-  # Check if ref_cell_type_col is a valid column name in reference_data
-  if (!ref_cell_type_col %in% names(colData(reference_data))) {
-    stop("ref_cell_type_col: '", ref_cell_type_col, "' is not a valid column name in reference_data.")
-  }
-  
-  # Check if all cell_types are present in query_data
-  if (!all(cell_types %in% unique(query_data[[query_cell_type_col]]))) {
-    stop("One or more cell_types specified are not present in query_data.")
-  }
-  
-  # Check if all cell_types are present in reference_data
-  if (!all(cell_types %in% unique(reference_data[[ref_cell_type_col]]))) {
-    stop("One or more cell_types specified are not present in reference_data.")
-  }
-    
-  # Function to compute correlation between two cell types
-  .computeCorrelation <- function(type1, type2) {
-      
-      if(!is.null(n_components)){
-          # Project query data onto PCA space of reference data
-          pca_output <- projectPCA(query_data = query_data, reference_data = reference_data, 
-                                   n_components = n_components, return_value = "list")
-          ref_mat <- pca_output$ref[which(reference_data[[ref_cell_type_col]] == type2), paste0("PC", 1:n_components)]
-          query_mat <- pca_output$query[which(query_data[[query_cell_type_col]] == type1), paste0("PC", 1:n_components)]
-      } else{
-          
-          # Subset query data to the specified cell type
-          query_subset <- query_data[ , query_data[[query_cell_type_col]] == type1, drop = FALSE]
-          ref_subset <- reference_data[ , reference_data[[ref_cell_type_col]] == type2, drop = FALSE]
-          
-          query_mat <- t(as.matrix(assay(query_subset, "logcounts")))
-          ref_mat <- t(as.matrix(assay(ref_subset, "logcounts")))
-      }
-
-    cor_matrix <- cor(t(query_mat), t(ref_mat), method = correlation_method)
-    mean(cor_matrix)
-  }
-  
-  # Use outer to compute pairwise correlations
-  cor_matrix_avg <- outer(cell_types, cell_types, Vectorize(.computeCorrelation))
-  
-  # Assign cell type names to rows and columns
-  rownames(cor_matrix_avg) <- paste0("Query-", cell_types)
-  colnames(cor_matrix_avg) <- paste0("Ref-", cell_types)
-  
-  # Update class of output
-  class(cor_matrix_avg) <- c(class(cor_matrix_avg), "calculateAveragePairwiseCorrelation")
-  
-  return(cor_matrix_avg)
-}
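The averaging step in `.computeCorrelation` can be illustrated on toy data. This is a minimal sketch, not the package's own code: the two matrices are random stand-ins (an assumption) for the cell-by-feature matrices of one query cell type and one reference cell type.

```r
# Toy stand-ins: rows are cells, columns are features (PCs or genes).
set.seed(1)
query_mat <- matrix(rnorm(5 * 10), nrow = 5)  # 5 query cells
ref_mat <- matrix(rnorm(8 * 10), nrow = 8)    # 8 reference cells

# cor() correlates columns, so transpose to put cells in columns.
# The result has one row per query cell and one column per reference cell.
cor_matrix <- cor(t(query_mat), t(ref_mat), method = "spearman")

# A single summary value for this pair of cell types.
avg_correlation <- mean(cor_matrix)
```

Repeating this for every (query type, reference type) pair fills the matrix that the function returns.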
diff --git a/R/calculateCategorizationEntropy.R b/R/calculateCategorizationEntropy.R
deleted file mode 100644
index 0d84716..0000000
--- a/R/calculateCategorizationEntropy.R
+++ /dev/null
@@ -1,123 +0,0 @@
-#' Calculate Categorization Entropy
-#' @description This function takes a matrix of category scores (cell type by
-#'   cells) and calculates the entropy of the category probabilities for each
-#'   cell. This gives a sense of how confident the cell type assignments are.
-#'   High entropy = lots of plausible category assignments = low confidence. Low
-#'   entropy = only one or two plausible categories = high confidence. This is
-#'   confidence in the vernacular sense, not in the "confidence interval"
-#'   statistical sense. Also note that the entropy tells you nothing about
-#'   whether or not the assignments are correct -- see the other functionality
-#'   in the package for that. This functionality can be used for assessing how
-#'   comparatively confident different sets of assignments are (given that the
-#'   number of categories is the same).
-#' @param X a matrix of category scores
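-#' @param inverse_normal_transform if TRUE, apply a global rank-based inverse normal transformation to the scores before the probability-scale check and softmax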
-#' @param verbose if TRUE, display messages about the calculations
-#' @param plot if TRUE, plot a histogram of the entropies
-#' @returns A vector of entropy values for each column in X.
-#' @details The function checks if X is already on the probability scale.
-#'   Otherwise, it applies softmax columnwise.
-#'
-#'   You can think about entropies on a scale from 0 to a maximum that depends
-#'   on the number of categories. This is the function for entropy (minus input
-#'   checking): \code{entropy(p) = -sum(p*log(p))} . If that input vector p is a
-#'   uniform distribution over the \code{length(p)} categories, the entropy will
-#'   be as high as possible.
-#' @export
-#' @examples
-#' # Simulate 500 cells with scores on 4 possible cell types
-#' X <- rnorm(500 * 4) |> matrix(nrow = 4)
-#' X[1, 1:250] <- X[1, 1:250] + 5 # Make the first category highly scored in the first 250 cells
-#' 
-#'
-#' # The function will issue a message about softmaxing the scores, and the entropy histogram will be
-#' # bimodal since we made half of the cells clearly category 1 while the other half are roughly even.
-#' # entropy_scores <- calculateCategorizationEntropy(X)
-calculateCategorizationEntropy <- function(X,
-    inverse_normal_transform = FALSE,
-    plot = TRUE,
-    verbose = TRUE) {
-    if (inverse_normal_transform) {
-        # https://cran.r-project.org/web/packages/RNOmni/vignettes/RNOmni.html#inverse-normal-transformation
-        if (verbose) message("Applying global inverse normal transformation.")
-        # You can't do the INT column-wise (by cell) because it will set a
-        # constant "range" to the probabilities, eliminating the differences in
-        # confidence across methods we're trying to quantify.
-
-        # You can't do the INT row-wise (by cell-type) because even though
-        # different cell types exhibit different marginal distributions of
-        # scores (in SingleR at least), doing the transformation row-wise would
-        # eliminate any differences in which cell types are "hard to predict".
-        # You don't want a score of .5 for cytotoxic T cells (hard to predict
-        # type) to overwhelm a score of .62 from erythroid type 2 (easy to
-        # predict), even though the first would be extraordinary within its cell
-        # type and the latter unexceptional within its cell type.
-
-        X <- inverse_normal_trans(X)
-    }
-
-    colSumsX <- colSums(X)
-
-    X_is_probabilities <- all(X >= 0 & X <= 1) &&
-        all(abs(colSumsX - 1) <= 1e-8)
-
-    if (!X_is_probabilities) {
-        if (verbose) message("X doesn't seem to be on the probability scale, applying column-wise softmax.")
-        expX <- exp(X)
-
-        X <- sweep(expX, MARGIN = 2, STATS = colSums(expX), FUN = "/")
-    }
-
-    ncat <- nrow(X)
-
-    max_ent <- calculate_entropy(rep(1 / ncat, ncat))
-
-    if (verbose) {
-        message(
-            "Max possible entropy given ", ncat, " categories: ",
-            round(max_ent,
-                digits = 2
-            )
-        )
-    }
-
-    entropies <- apply(X, 2, calculate_entropy)
-
-    if (plot) {
-        p <- data.frame(entropies = entropies) |>
-            ggplot(aes(entropies)) +
-            geom_histogram(
-                color = "black", fill = "white",
-                bins = 30,
-                boundary = 0
-            ) +
-            theme_bw()
-        print(p)
-    }
-
-    return(entropies)
-}
-
-calculate_entropy <- function(p) {
-    # p is one column of X, a vector of probabilities summing to 1.
-
-    nonzeros <- p != 0
-
-    -sum(p[nonzeros] * log(p[nonzeros]))
-}
-
-n_elements <- function(X) ifelse(is.matrix(X), prod(dim(X)), length(X))
-
-inverse_normal_trans <- function(X, constant = 3 / 8) {
-    n <- n_elements(X)
-
-    rankX <- rank(X)
-
-    intX <- qnorm((rankX - constant) / (n - 2 * constant + 1))
-
-    if (is.matrix(X)) {
-        intX <- matrix(intX, nrow = nrow(X))
-    }
-
-    return(intX)
-}
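The softmax and entropy steps described in the documentation above can be checked on a two-cell toy matrix. This is a sketch under stated assumptions: `softmax_columns` and `entropy` are hypothetical helper names, and the column-maximum subtraction is a numerical-stability detail added here that the deleted code does not use (it leaves the softmax result unchanged).

```r
# Hypothetical helper: column-wise softmax of a category-by-cell score matrix.
softmax_columns <- function(X) {
    # Shift each column by its maximum before exponentiating (stability only).
    shifted <- sweep(X, MARGIN = 2, STATS = apply(X, 2, max), FUN = "-")
    expX <- exp(shifted)
    sweep(expX, MARGIN = 2, STATS = colSums(expX), FUN = "/")
}

# Hypothetical helper: Shannon entropy of one probability vector,
# skipping zero entries so that 0 * log(0) is treated as 0.
entropy <- function(p) {
    nonzeros <- p != 0
    -sum(p[nonzeros] * log(p[nonzeros]))
}

# Two cells scored on four categories (matrix() fills column by column).
scores <- matrix(c(5, 0, 0, 0,   # cell 1: one dominant category
                   1, 1, 1, 1),  # cell 2: all categories tied
                 nrow = 4)

probs <- softmax_columns(scores)
entropies <- apply(probs, 2, entropy)

# The uniform distribution attains the maximum, log(number of categories).
max_entropy <- log(nrow(scores))
```

The tied cell reaches `max_entropy`, while the cell with one dominant category has a much lower entropy, matching the "high entropy = low confidence" reading.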
diff --git a/R/calculateHVGOverlap.R b/R/calculateHVGOverlap.R
deleted file mode 100644
index 9083a1d..0000000
--- a/R/calculateHVGOverlap.R
+++ /dev/null
@@ -1,82 +0,0 @@
-#' @title Calculate the Overlap Coefficient for Highly Variable Genes
-#' 
-#' @description Calculates the overlap coefficient between the sets of highly variable genes 
-#' from a reference dataset and a query dataset.
-#'
-#' @details The overlap coefficient measures the similarity between two gene sets, indicating how well-aligned 
-#' reference and query datasets are in terms of their highly variable genes. This metric is 
-#' useful in single-cell genomics to understand the correspondence between different datasets.
-#'
-#' The coefficient is calculated using the formula:
-#'
-#' \deqn{Coefficient(X, Y) = \frac{|X \cap Y|}{min(|X|, |Y|)}}
-#'
-#' where X and Y are the sets of highly variable genes from the reference and query datasets, respectively,
-#' |X ∩ Y| is the number of genes common to both X and Y, and min(|X|, |Y|) is the size of the smaller set among X and Y.
-#'
-#' @param reference_genes character. A vector of highly variable genes from the reference dataset.
-#' @param query_genes character. A vector of highly variable genes from the query dataset.
-#'
-#' @return Overlap coefficient, a value between 0 and 1, where 0 indicates no overlap 
-#'         and 1 indicates complete overlap of highly variable genes between datasets.
-#' 
-#' @references Luecken et al. Benchmarking atlas-level data integration in
-#' single-cell genomics. Nature Methods, 19:41-50, 2022.
-#' 
-#' @examples
-#' library(scater)
-#' library(scran)
-#' library(scRNAseq)
-#' library(SingleR)
-#' 
-#' # Load data
-#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-#' 
-#' # Divide the data into reference and query datasets
-#' set.seed(100)
-#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-#' ref_data <- sce[, indices]
-#' query_data <- sce[, -indices]
-#' 
-#' # log transform datasets
-#' ref_data <- logNormCounts(ref_data)
-#' query_data <- logNormCounts(query_data)
-#' 
-#' # Selecting highly variable genes
-#' 
-#' ref_var <- getTopHVGs(ref_data, n=2000)
-#' query_var <- getTopHVGs(query_data, n=2000)
-#' 
-#' overlap_coefficient <- calculateHVGOverlap(reference_genes = ref_var, 
-#'                                           query_genes = query_var)
-#' 
-#' @export                                       
-calculateHVGOverlap <- function(reference_genes, query_genes) {
-  
-  # Sanity checks
-  if (!is.vector(reference_genes) || !is.character(reference_genes)) {
-    stop("reference_genes must be a character vector.")
-  }
-  if (!is.vector(query_genes) || !is.character(query_genes)) {
-    stop("query_genes must be a character vector.")
-  }
-  if (length(reference_genes) == 0 || length(query_genes) == 0) {
-    stop("Input vectors must not be empty.")
-  }
-  
-  # Calculate the intersection of highly variable genes
-  common_genes <- intersect(reference_genes, query_genes)
-  
-  # Calculate the size of the intersection
-  intersection_size <- length(common_genes)
-  
-  # Calculate the size of the smaller set
-  min_size <- min(length(reference_genes), length(query_genes))
-  
-  # Compute the overlap coefficient
-  overlap_coefficient <- intersection_size / min_size
-  overlap_coefficient <- round(overlap_coefficient, 2)
-  
-  # Return the overlap coefficient
-  return(overlap_coefficient)
-}
\ No newline at end of file
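The overlap coefficient formula can be verified by hand on a small example. This is a minimal sketch: the function name `overlap_coefficient` and the gene vectors are illustrative assumptions, not package data.

```r
# Hypothetical stand-alone version of the formula |X n Y| / min(|X|, |Y|).
overlap_coefficient <- function(x, y) {
    length(intersect(x, y)) / min(length(x), length(y))
}

reference_genes <- c("CD3E", "CD4", "CD8A", "MS4A1")  # 4 genes
query_genes <- c("CD4", "CD8A", "LYZ")                # 3 genes

# Two shared genes (CD4, CD8A) divided by the smaller set size (3).
coef <- overlap_coefficient(reference_genes, query_genes)
```

Because the denominator is the smaller set, a small gene set that is fully contained in a larger one scores 1 even when the larger set has many additional genes.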
diff --git a/R/calculateHotellingPValue.R b/R/calculateHotellingPValue.R
deleted file mode 100644
index ca2d8a6..0000000
--- a/R/calculateHotellingPValue.R
+++ /dev/null
@@ -1,113 +0,0 @@
-#' @title Perform Hotelling's T-squared Test on PCA Scores for Single-cell RNA-seq Data
-#'
-#' @description This function performs Hotelling's T-squared test to assess the similarity between reference and query datasets 
-#' for each cell type based on their PCA scores.
-#'
-#' @details This function first performs PCA on the reference dataset and then projects the query dataset onto the PCA space 
-#' of the reference data. For each cell type, it computes pseudo-bulk signatures in the PCA space by averaging the principal 
-#' component scores of cells belonging to that cell type. Hotelling's T-squared test is then performed to compare the mean 
-#' vectors of the pseudo-bulk signatures between the reference and query datasets. The resulting p-values indicate the similarity 
-#' between the reference and query datasets for each cell type.
-#'
-#' @param query_data A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the query cells.
-#' @param reference_data A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the reference cells.
-#' @param n_components An integer specifying the number of principal components to use for projection. Defaults to 10. 
-#' @param query_cell_type_col character. The column name in the \code{colData} of \code{query_data} 
-#' that identifies the cell types.
-#' @param ref_cell_type_col character. The column name in the \code{colData} of \code{reference_data} 
-#' that identifies the cell types.
-#' @param pc_subset A numeric vector specifying which principal components to include in the test. Default is PC1 to PC5.
-#'
-#' @return A named numeric vector of p-values from Hotelling's T-squared test for each cell type.
-#'
-#' @export
-#'
-#' @author Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu}
-#'
-#' @examples
-#' # Load required libraries
-#' library(scRNAseq)
-#' library(scuttle)
-#' library(SingleR)
-#' library(scran)
-#' library(scater)
-#'
-#' # Load data (replace with your data loading)
-#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-#' 
-#' # Divide the data into reference and query datasets
-#' set.seed(100)
-#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-#' ref_data <- sce[, indices]
-#' query_data <- sce[, -indices]
-#' 
-#' # log transform datasets
-#' ref_data <- scuttle::logNormCounts(ref_data)
-#' query_data <- scuttle::logNormCounts(query_data)
-#' 
-#' # Get cell type scores using SingleR (or any other cell type annotation method)
-#' scores <- SingleR::SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-#' 
-#' # Add labels to query object
-#' colData(query_data)$labels <- scores$labels
-#' 
-#' # Selecting highly variable genes (can be customized by the user)
-#' ref_var <- scran::getTopHVGs(ref_data, n = 2000)
-#' query_var <- scran::getTopHVGs(query_data, n = 2000)
-#' 
-#' # Intersect the gene symbols to obtain common genes
-#' common_genes <- intersect(ref_var, query_var)
-#' ref_data_subset <- ref_data[common_genes, ]
-#' query_data_subset <- query_data[common_genes, ]
-#' 
-#' # Run PCA on the reference data
-#' ref_data_subset <- runPCA(ref_data_subset, ncomponents = 50)
-#'
-#' # Get the p-values from the test
-#' p_values <- calculateHotellingPValue(query_data_subset, ref_data_subset, 
-#'                                      n_components = 10, 
-#'                                      query_cell_type_col = "reclustered.broad", 
-#'                                      ref_cell_type_col = "reclustered.broad",
-#'                                      pc_subset = c(1:10)) 
-#' round(p_values, 5)
-#'                          
-# Function to perform Hotelling T^2 test for each cell type
-# The test is performed on the PCA space of the reference data 
-# The query data is projected onto the PCA space of the reference data
-calculateHotellingPValue <- function(query_data, reference_data, 
-                                     n_components = 10, 
-                                     query_cell_type_col, 
-                                     ref_cell_type_col,
-                                     pc_subset = c(1:5)) {
-    
-    # Get the projected PCA data
-    pca_output <- projectPCA(query_data = query_data, reference_data = reference_data, 
-                             n_components = n_components, 
-                             query_cell_type_col = query_cell_type_col, 
-                             ref_cell_type_col = ref_cell_type_col, 
-                             return_value = "list")
-    
-    # Get unique cell types
-    unique_cell_types <- na.omit(unique(c(colData(reference_data)[[ref_cell_type_col]],
-                                          colData(query_data)[[query_cell_type_col]])))
-    
-    # Create a named vector to store p-values for each cell type
-    p_values <- rep(NA, length(unique_cell_types))
-    names(p_values) <- unique_cell_types
-    
-    for (cell_type in unique_cell_types) {
-        
-        # Subset principal component scores for current cell type
-        ref_subset_scores <- pca_output$ref[which(cell_type == reference_data[[ref_cell_type_col]]), pc_subset]
-        query_subset_scores <- pca_output$query[which(cell_type == query_data[[query_cell_type_col]]), pc_subset]
-        
-        # Calculate the p-value
-        hotelling_output <- Hotelling::hotelling.test(x = ref_subset_scores, y = query_subset_scores)
-        
-        # Store the result
-        p_values[cell_type] <- hotelling_output$pval
-    }
-    
-    # Return p-values
-    return(p_values)
-}
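Note on the function above: it delegates the test to Hotelling::hotelling.test. As a rough illustration of what a two-sample Hotelling T-squared test computes, the sketch below evaluates the textbook pooled-covariance form by hand in base R on simulated PC scores. The helper name `hotelling_sketch` and the simulated matrices are assumptions made for this example; the result may differ in detail from what the package call returns.

```r
# Hypothetical helper (not part of scDiagnostics): two-sample Hotelling
# T-squared test assuming a common covariance matrix in both groups.
hotelling_sketch <- function(x, y) {
    n1 <- nrow(x)
    n2 <- nrow(y)
    p <- ncol(x)
    # Difference between the two group mean vectors
    d <- colMeans(x) - colMeans(y)
    # Pooled covariance estimate
    s_pooled <- ((n1 - 1) * stats::cov(x) + (n2 - 1) * stats::cov(y)) / (n1 + n2 - 2)
    # T-squared statistic and its F-distributed rescaling
    t2 <- (n1 * n2 / (n1 + n2)) * drop(t(d) %*% solve(s_pooled, d))
    f_stat <- t2 * (n1 + n2 - p - 1) / (p * (n1 + n2 - 2))
    p_value <- stats::pf(f_stat, p, n1 + n2 - p - 1, lower.tail = FALSE)
    list(t2 = t2, p_value = p_value)
}

set.seed(42)
ref_scores <- matrix(rnorm(200 * 3), ncol = 3)              # simulated reference PC scores
query_shifted <- matrix(rnorm(80 * 3, mean = 1), ncol = 3)  # simulated, shifted query scores
res <- hotelling_sketch(ref_scores, query_shifted)
res$p_value
```

A small p-value here indicates that the mean PC scores of the two groups differ, which is how the per-cell-type p-values returned above are read.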
diff --git a/R/calculatePairwiseDistancesAndPlotDensity.R b/R/calculatePairwiseDistancesAndPlotDensity.R
deleted file mode 100644
index 6176165..0000000
--- a/R/calculatePairwiseDistancesAndPlotDensity.R
+++ /dev/null
@@ -1,183 +0,0 @@
-#' @title Pairwise Distance Analysis and Density Visualization
-#'
-#' @description
-#' Calculates pairwise distances or correlations between query and reference cells 
-#' of a specific cell type.
-#' 
-#' @details  
-#' The function works with \code{\linkS4class{SingleCellExperiment}} objects, ensuring 
-#' compatibility with common single-cell analysis workflows. It subsets the data for specified cell types, 
-#' computes pairwise distances or correlations, and visualizes these measurements using density plots. By comparing the distances and correlations, 
-#' one can evaluate the consistency and reliability of annotated cell types within single-cell datasets.
-#' 
-#' @param query_data A \code{\linkS4class{SingleCellExperiment}} containing the single-cell 
-#' expression data and metadata.
-#' @param reference_data A \code{\linkS4class{SingleCellExperiment}} object containing the single-cell 
-#' expression data and metadata.
-#' @param n_components The number of principal components to use for the dimensionality reduction of the data using PCA. Defaults to 10.
-#' If set to \code{NULL} then no dimensionality reduction is performed and the assay data is used directly for computations.
-#' @param query_cell_type_col character. The column name in the \code{colData} of \code{query_data} 
-#' that identifies the cell types.
-#' @param ref_cell_type_col character. The column name in the \code{colData} of \code{reference_data} 
-#' that identifies the cell types.
-#' @param cell_type_query The query cell type for which distances or correlations are calculated.
-#' @param cell_type_reference The reference cell type for which distances or correlations are calculated.
-#' @param distance_metric The distance metric to use for calculating pairwise distances, such as "euclidean" or "manhattan"
-#'                        (passed to \code{\link[stats]{dist}}). Set it to "correlation" to calculate correlation coefficients instead.
-#' @param correlation_method The correlation method to use when distance_metric is "correlation".
-#'                           Possible values: "pearson", "spearman".
-#'
-#' @return A plot generated by \code{ggplot2}, showing the density distribution of 
-#'         calculated distances or correlations.
-#'
-#' @examples
-#' library(scran)
-#' library(scRNAseq)
-#' library(SingleR)
-#' library(scater)
-#'
-#' # Load data
-#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-#'
-#' # Divide the data into reference and query datasets
-#' set.seed(100)
-#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-#' ref_data <- sce[, indices]
-#' query_data <- sce[, -indices]
-#'
-#' # log transform datasets
-#' ref_data <- logNormCounts(ref_data)
-#' query_data <- logNormCounts(query_data)
-#'
-#' # Get cell type scores using SingleR (or any other cell type annotation method)
-#' scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-#'
-#' # Add labels to query object
-#' colData(query_data)$labels <- scores$labels
-#'
-#' # Selecting highly variable genes (can be customized by the user)
-#' ref_var <- getTopHVGs(ref_data, n = 2000)
-#' query_var <- getTopHVGs(query_data, n = 2000)
-#'
-#' # Intersect the gene symbols to obtain common genes
-#' common_genes <- intersect(ref_var, query_var)
-#'
-#' ref_data_subset <- ref_data[common_genes, ]
-#' query_data_subset <- query_data[common_genes, ]
-#' 
-#' # Run PCA on the reference data
-#' ref_data_subset <- runPCA(ref_data_subset)
-#'
-#' # Example usage of the function
-#' calculatePairwiseDistancesAndPlotDensity(query_data = query_data_subset, 
-#'                                          reference_data = ref_data_subset, 
-#'                                          n_components = 10,
-#'                                          query_cell_type_col = "labels", 
-#'                                          ref_cell_type_col = "reclustered.broad", 
-#'                                          cell_type_query = "CD8", 
-#'                                          cell_type_reference = "CD8", 
-#'                                          distance_metric = "euclidean")
-#' 
-#' 
-#' @importFrom stats cor dist
-#' @import SingleCellExperiment
-#' @importFrom SummarizedExperiment assay                                       
-#' @export
-#' 
-calculatePairwiseDistancesAndPlotDensity <- function(query_data, 
-                                                     reference_data, 
-                                                     n_components = 10,
-                                                     query_cell_type_col, 
-                                                     ref_cell_type_col, 
-                                                     cell_type_query, 
-                                                     cell_type_reference, 
-                                                     distance_metric, 
-                                                     correlation_method = "pearson") {
-  
-  # Sanity checks
-  
-  # Check if query_data is a SingleCellExperiment object
-  if (!is(query_data, "SingleCellExperiment")) {
-    stop("query_data must be a SingleCellExperiment object.")
-  }
-
-  # Check if reference_data is a SingleCellExperiment object
-  if (!is(reference_data, "SingleCellExperiment")) {
-    stop("reference_data must be a SingleCellExperiment object.")
-  }
-
-  # Convert to matrix and optionally apply PCA dimensionality reduction
-  if(!is.null(n_components)){
-      # Project query data onto PCA space of reference data
-      pca_output <- projectPCA(query_data = query_data, reference_data = reference_data, 
-                               n_components = n_components, return_value = "list")
-      ref_mat <- pca_output$ref[which(reference_data[[ref_cell_type_col]] == cell_type_reference), paste0("PC", 1:n_components)]
-      query_mat <- pca_output$query[which(query_data[[query_cell_type_col]] == cell_type_query), paste0("PC", 1:n_components)]
-  } else{
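-      # Subset query and reference data to the specified cell types
-      query_data_subset <- query_data[, !is.na(query_data[[query_cell_type_col]]) & query_data[[query_cell_type_col]] == cell_type_query]
-      ref_data_subset <- reference_data[, !is.na(reference_data[[ref_cell_type_col]]) & reference_data[[ref_cell_type_col]] == cell_type_reference]
-      query_mat <- t(as.matrix(assay(query_data_subset, "logcounts")))
-      ref_mat <- t(as.matrix(assay(ref_data_subset, "logcounts")))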
-  }
-
-  # Combine query and reference matrices
-  combined_mat <- rbind(query_mat, ref_mat)
-
-  # Calculate pairwise distances or correlations for all comparisons
-  if (distance_metric == "correlation") {
-    if (correlation_method == "pearson") {
-      dist_matrix <- cor(t(combined_mat), method = "pearson")
-    } else if (correlation_method == "spearman") {
-      dist_matrix <- cor(t(combined_mat), method = "spearman")
-    } else {
-      stop("Invalid correlation method. Available options: 'pearson', 'spearman'")
-    }
-  } else {
-    dist_matrix <- dist(combined_mat, method = distance_metric)
-  }
-
-  # Convert dist_matrix to a square matrix
-  dist_matrix <- as.matrix(dist_matrix)
-
-  # Extract the distances or correlations for the different pairwise comparisons
-  num_query_cells <- nrow(query_mat)
-  num_ref_cells <- nrow(ref_mat)
-  dist_query_query <- dist_matrix[1:num_query_cells, 1:num_query_cells]
-  dist_ref_ref <- dist_matrix[(num_query_cells+1):(num_query_cells+num_ref_cells), 
-                              (num_query_cells+1):(num_query_cells+num_ref_cells)]
-  dist_query_ref <- dist_matrix[1:num_query_cells, (num_query_cells+1):(num_query_cells+num_ref_cells)]
-
-  # Create data frame for plotting
-  dist_df <- data.frame(
-    Comparison = c(rep("Query vs Query", length(dist_query_query)),
-                   rep("Reference vs Reference", length(dist_ref_ref)),
-                   rep("Query vs Reference", length(dist_query_ref))),
-    Distance = c(as.vector(dist_query_query),
-                 as.vector(dist_ref_ref),
-                 as.vector(dist_query_ref))
-  )
-
-  # Plot density plots with improved aesthetics
-  ggplot2::ggplot(dist_df, ggplot2::aes(x = Distance, color = Comparison, fill = Comparison)) +
-      ggplot2::geom_density(alpha = 0.5, linewidth = 1) +  # Updated: linewidth instead of size
-      ggplot2::scale_color_manual(values = c("#1f78b4", "#33a02c", "#e31a1c")) +
-      ggplot2::scale_fill_manual(values = c("#1f78b4", "#33a02c", "#e31a1c")) +
-      ggplot2::labs(x = ifelse(distance_metric == "correlation", 
-                               ifelse(correlation_method == "spearman", "Spearman Correlation", "Pearson Correlation"), 
-                               "Distance"), y = "Density", 
-                    title = "Pairwise Distance Analysis and Density Visualization") +
-      ggplot2::theme_minimal() +
-      ggplot2::theme(
-          plot.title = ggplot2::element_text(size = 16, hjust = 0.5, face = "bold"),
-          axis.title = ggplot2::element_text(size = 14),
-          axis.text = ggplot2::element_text(size = 12),
-          legend.title = ggplot2::element_blank(),
-          legend.text = ggplot2::element_text(size = 12),
-          panel.grid.major = ggplot2::element_line(color = "gray", linetype = "dashed"),
-          panel.grid.minor = ggplot2::element_blank(),
-          panel.background = ggplot2::element_rect(fill = "white"),
-          panel.border = ggplot2::element_blank(),
-          legend.position = "top"
-      )
-}
diff --git a/R/calculateSampleDistances.R b/R/calculateSampleDistances.R
deleted file mode 100644
index 829c9e7..0000000
--- a/R/calculateSampleDistances.R
+++ /dev/null
@@ -1,147 +0,0 @@
-#' @title Compute Sample Distances Between Reference and Query Data
-#'
-#' @description This function computes the distances within the reference dataset and the distances from each query sample to all 
-#' reference samples for each cell type. It uses PCA for dimensionality reduction and Euclidean distance for distance calculation.
-#'
-#' @details The function first performs PCA on the reference dataset and projects the query dataset onto the same PCA space. 
-#' It then computes pairwise Euclidean distances within the reference dataset for each cell type, as well as distances from each 
-#' query sample to all reference samples of a particular cell type. The results are stored in a list, with one entry per cell type.
-#'
-#' @param query_data A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the query cells.
-#' @param reference_data A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the reference cells.
-#' @param n_components An integer specifying the number of principal components to use for projection. Defaults to 10. 
-#' @param query_cell_type_col character. The column name in the \code{colData} of \code{query_data} 
-#' that identifies the cell types.
-#' @param ref_cell_type_col character. The column name in the \code{colData} of \code{reference_data} 
-#' that identifies the cell types.
-#' @param pc_subset A numeric vector specifying which principal components to use for the distance computations. Default is PC1 to PC5.
-#'
-#' @return A list containing distance data for each cell type. Each entry in the list contains:
-#' \describe{
-#'   \item{ref_distances}{A vector of all pairwise distances within the reference subset for the cell type.}
-#'   \item{query_to_ref_distances}{A matrix of distances from each query sample to all reference samples for the cell type.}
-#' }
-#'
-#' @export
-#'
-#' @author Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu}
-#' 
-#' @seealso \code{\link{plot.calculateSampleDistances}}
-#'
-#' @examples
-#' # Load required libraries
-#' library(scRNAseq)
-#' library(scuttle)
-#' library(SingleR)
-#' library(scran)
-#' library(scater)
-#'
-#' # Load data (replace with your data loading)
-#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-#' 
-#' # Divide the data into reference and query datasets
-#' set.seed(100)
-#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-#' ref_data <- sce[, indices]
-#' query_data <- sce[, -indices]
-#' 
-#' # log transform datasets
-#' ref_data <- scuttle::logNormCounts(ref_data)
-#' query_data <- scuttle::logNormCounts(query_data)
-#' 
-#' # Get cell type scores using SingleR (or any other cell type annotation method)
-#' scores <- SingleR::SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-#' 
-#' # Add labels to query object
-#' colData(query_data)$labels <- scores$labels
-#' 
-#' # Selecting highly variable genes (can be customized by the user)
-#' ref_var <- getTopHVGs(ref_data, n = 2000)
-#' query_var <- getTopHVGs(query_data, n = 2000)
-#' 
-#' # Intersect the gene symbols to obtain common genes
-#' common_genes <- intersect(ref_var, query_var)
-#' ref_data_subset <- ref_data[common_genes, ]
-#' query_data_subset <- query_data[common_genes, ]
-#' 
-#' # Run PCA on the reference data
-#' ref_data_subset <- runPCA(ref_data_subset)
-#' 
-#' # Compute the sample distance data
-#' distance_data <- calculateSampleDistances(query_data_subset, ref_data_subset, 
-#'                                           n_components = 10, 
-#'                                           query_cell_type_col = "labels", 
-#'                                           ref_cell_type_col = "reclustered.broad",
-#'                                           pc_subset = c(1:10)) 
-#' 
-#' # Identify outliers for CD4
-#' cd4_anomalies <- detectAnomaly(ref_data_subset, query_data_subset, 
-#'                                query_cell_type_col = "labels", 
-#'                                ref_cell_type_col = "reclustered.broad",
-#'                                n_components = 10,
-#'                                n_tree = 500,
-#'                                anomaly_treshold = 0.5)$CD4
-#' cd4_top6_anomalies <- names(sort(cd4_anomalies$query_anomaly_scores, decreasing = TRUE)[1:6])
-#' 
-#' # Plot the densities of the distances
-#' plot(distance_data, ref_cell_type = "CD4", sample_names = cd4_top6_anomalies)
-#' 
-# Function to compute distances within reference data and between query data and reference samples
-calculateSampleDistances <- function(query_data, reference_data, 
-                                     query_cell_type_col, 
-                                     ref_cell_type_col,
-                                     n_components = 10, 
-                                     pc_subset = c(1:5)) {
-    
-    # Get the projected PCA data
-    pca_output <- projectPCA(query_data = query_data, reference_data = reference_data, 
-                             n_components = n_components, 
-                             query_cell_type_col = query_cell_type_col, 
-                             ref_cell_type_col = ref_cell_type_col, 
-                             return_value = "list")
-    
-    # Get unique cell types
-    unique_cell_types <- na.omit(unique(c(colData(reference_data)[[ref_cell_type_col]],
-                                          colData(query_data)[[query_cell_type_col]])))
-    
-    # Create a list to store distance data for each cell type
-    distance_data <- list()
-    
-    # Function to compute Euclidean distance between a vector and each row of a matrix
-    .compute_distances <- function(matrix, vector) {
-
-        # Apply the distance function to each row of the matrix
-        distances <- apply(matrix, 1, function(row) {
-            sqrt(sum((row - vector) ^ 2))
-        })
-        
-        return(distances)
-    }
-    
-    for (cell_type in unique_cell_types) {
-        
-        # Subset principal component scores for current cell type
-        ref_subset_scores <- pca_output$ref[which(cell_type == reference_data[[ref_cell_type_col]]), pc_subset]
-        query_subset_scores <- pca_output$query[, pc_subset]
-        
-        # Compute all pairwise distances within the reference subset
-        ref_distances <- as.vector(dist(ref_subset_scores))
-        
-        # Compute distances from each query sample to all reference samples
-        query_to_ref_distances <- apply(query_subset_scores, 1, function(query_sample, ref_subset_scores) {
-            .compute_distances(ref_subset_scores, query_sample)
-        }, ref_subset_scores = ref_subset_scores)
-        
-        # Store the distances
-        distance_data[[cell_type]] <- list(
-            ref_distances = ref_distances,
-            query_to_ref_distances = t(query_to_ref_distances)
-        )
-    }
-    
-    # Add class of object
-    class(distance_data) <- c(class(distance_data), "calculateSampleDistances")
-    
-    # Return the distance data
-    return(distance_data)
-}
\ No newline at end of file
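Note on the function above: the query-to-reference distances are built with nested apply() calls. The same matrix, with one row per query cell and one column per reference cell, can also be obtained in a vectorized way. The sketch below is a base-R illustration on simulated PC scores; the helper name `query_to_ref_sketch` and the simulated inputs are assumptions for the example and are not part of the package.

```r
# Hypothetical helper (not part of scDiagnostics): Euclidean distance from
# every row of `query_scores` to every row of `ref_scores`.
query_to_ref_sketch <- function(query_scores, ref_scores) {
    # ||q - r||^2 = ||q||^2 + ||r||^2 - 2 * q . r, evaluated for all pairs at once
    sq_dist <- outer(rowSums(query_scores^2), rowSums(ref_scores^2), "+") -
        2 * tcrossprod(query_scores, ref_scores)
    # Rounding can leave tiny negative values; clamp before the square root
    sqrt(pmax(sq_dist, 0))
}

set.seed(7)
query_scores <- matrix(rnorm(4 * 3), ncol = 3)  # 4 simulated query cells, 3 PCs
ref_scores <- matrix(rnorm(6 * 3), ncol = 3)    # 6 simulated reference cells
d <- query_to_ref_sketch(query_scores, ref_scores)
dim(d)
```

The layout matches the transposed `query_to_ref_distances` stored by the function above, so a single query cell's distances are one row of the result.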
diff --git a/R/calculateSampleDistancesSimilarity.R b/R/calculateSampleDistancesSimilarity.R
deleted file mode 100644
index e0c0eae..0000000
--- a/R/calculateSampleDistancesSimilarity.R
+++ /dev/null
@@ -1,178 +0,0 @@
-#' @title Function to compute Bhattacharyya coefficients and Hellinger distances
-#'
-#' @description 
-#' This function computes Bhattacharyya coefficients and Hellinger distances to quantify the similarity of density 
-#' distributions between query samples and reference data for each cell type.
-#'
-#'
-#' @details 
-#' This function first computes distance data using the \code{calculateSampleDistances} function, which calculates 
-#' pairwise distances between samples within the reference data and between query samples and reference samples in the PCA space.
-#' Bhattacharyya coefficients and Hellinger distances are calculated to quantify the similarity of density distributions between query 
-#' samples and reference data for each cell type. Bhattacharyya coefficient measures the similarity of two probability distributions, 
-#' while Hellinger distance measures the distance between two probability distributions.
-#'
-#' Bhattacharyya coefficients range between 0 and 1. A value closer to 1 indicates higher similarity between distributions, while a value 
-#' closer to 0 indicates lower similarity.
-#'
-#' Hellinger distances range between 0 and 1. A value closer to 0 indicates higher similarity between distributions, while a value 
-#' closer to 1 indicates lower similarity.
-#'
-#' @param query_data A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the query cells.
-#' @param reference_data A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the reference cells.
-#' @param query_cell_type_col character. The column name in the \code{colData} of \code{query_data} 
-#' that identifies the cell types.
-#' @param ref_cell_type_col character. The column name in the \code{colData} of \code{reference_data} 
-#' that identifies the cell types.
-#' @param sample_names A character vector specifying the names of the query samples for which to compute distance measures.
-#' @param n_components An integer specifying the number of principal components to use for projection. Defaults to 10. 
-#' @param pc_subset A numeric vector specifying which principal components to use for the distance computations. Default is PC1 to PC5.
-#'
-#' @return A list containing two data frames, each with one row per query sample and one column per cell type:
-#' \describe{
-#'   \item{bhattacharyya_coef}{A data frame of Bhattacharyya coefficients between each query sample and the reference data.}
-#'   \item{hellinger_dist}{A data frame of Hellinger distances between each query sample and the reference data.}
-#' }
-#'
-#' @export
-#'
-#' @author Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu}
-#'
-#' @examples
-#' # Load required libraries
-#' library(scRNAseq)
-#' library(scuttle)
-#' library(SingleR)
-#' library(scran)
-#' library(scater)
-#'
-#' # Load data (replace with your data loading)
-#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-#' 
-#' # Divide the data into reference and query datasets
-#' set.seed(100)
-#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-#' ref_data <- sce[, indices]
-#' query_data <- sce[, -indices]
-#' 
-#' # log transform datasets
-#' ref_data <- scuttle::logNormCounts(ref_data)
-#' query_data <- scuttle::logNormCounts(query_data)
-#' 
-#' # Get cell type scores using SingleR (or any other cell type annotation method)
-#' scores <- SingleR::SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-#' 
-#' # Add labels to query object
-#' colData(query_data)$labels <- scores$labels
-#' 
-#' # Selecting highly variable genes (can be customized by the user)
-#' ref_var <- scran::getTopHVGs(ref_data, n = 2000)
-#' query_var <- scran::getTopHVGs(query_data, n = 2000)
-#' 
-#' # Intersect the gene symbols to obtain common genes
-#' common_genes <- intersect(ref_var, query_var)
-#' ref_data_subset <- ref_data[common_genes, ]
-#' query_data_subset <- query_data[common_genes, ]
-#' 
-#' # Run PCA on the reference data
-#' ref_data_subset <- runPCA(ref_data_subset, ncomponents = 50)
-#' 
-#' # Compute the sample distance data
-#' distance_data <- calculateSampleDistances(query_data_subset, ref_data_subset, 
-#'                                           n_components = 10, 
-#'                                           query_cell_type_col = "labels", 
-#'                                           ref_cell_type_col = "reclustered.broad",
-#'                                           pc_subset = c(1:10)) 
-#' 
-#' # Identify outliers for CD4
-#' cd4_anomalies <- detectAnomaly(ref_data_subset, query_data_subset, 
-#'                                query_cell_type_col = "labels", 
-#'                                ref_cell_type_col = "reclustered.broad",
-#'                                n_components = 10,
-#'                                n_tree = 500,
-#'                                anomaly_treshold = 0.5)$CD4
-#' cd4_top6_anomalies <- names(sort(cd4_anomalies$query_anomaly_scores, decreasing = TRUE)[1:6])
-#' 
-#' # Get overlap measures
-#' overlap_measures <- calculateSampleDistancesSimilarity(query_data_subset, ref_data_subset, 
-#'                                                        sample_names = cd4_top6_anomalies,
-#'                                                        n_components = 10, 
-#'                                                        query_cell_type_col = "labels", 
-#'                                                        ref_cell_type_col = "reclustered.broad",
-#'                                                        pc_subset = c(1:10))
-#' 
-#' 
-# Function to compute Bhattacharyya coefficients and Hellinger distances
-calculateSampleDistancesSimilarity <- function(query_data, reference_data, 
-                                               query_cell_type_col, 
-                                               ref_cell_type_col,
-                                               sample_names,
-                                               n_components = 10, 
-                                               pc_subset = c(1:5)) {
-
-    # Check that all requested samples are present in the query data
-    if(!all(sample_names %in% colnames(query_data)))
-        stop("One or more specified 'sample_names' are not column names of 'query_data'.")
-    
-    # Compute distance data
-    query_data_subset <- query_data[, sample_names]
-    distance_data <- calculateSampleDistances(query_data = query_data_subset, reference_data = reference_data, 
-                                              query_cell_type_col = query_cell_type_col, 
-                                              ref_cell_type_col = ref_cell_type_col,
-                                              n_components = n_components, 
-                                              pc_subset = pc_subset)
-    
-    # Initialize empty lists to store results
-    bhattacharyya_list <- list()
-    hellinger_list <- list()
-    
-    # Iterate over each cell type
-    for (cell_type in names(distance_data)) {
-        
-        # Extract distances within the reference dataset for the current cell type
-        ref_distances <- distance_data[[cell_type]]$ref_distances
-        
-        # Compute density of reference distances
-        ref_density <- density(ref_distances)
-        
-        # Initialize an empty vector to store overlap measures for the current cell type
-        bhattacharyya_coef <- numeric(length(sample_names))
-        hellinger_dist <- numeric(length(sample_names))
-        
-        # Iterate over each sample
-        for (i in seq_along(sample_names)) {
-            
-            # Extract distances from the current sample to reference samples
-            sample_distances <- distance_data[[cell_type]]$query_to_ref_distances[sample_names[i], ]
-            
-            # Compute density of sample distances
-            sample_density <- density(sample_distances)
-            
-            # Create a common grid for evaluating densities
-            common_grid <- seq(min(min(ref_density$x), min(sample_density$x), 0), 
-                               max(max(ref_density$x), max(sample_density$x)), length.out = 1000)
-            
-            # Interpolate densities onto the common grid
-            ref_density_interp <- approxfun(ref_density$x, ref_density$y)(common_grid)
-            ref_density_interp[is.na(ref_density_interp)] <- 0
-            sample_density_interp <- approxfun(sample_density$x, sample_density$y)(common_grid)
-            sample_density_interp[is.na(sample_density_interp)] <- 0
-            
-            # Compute and store Bhattacharyya coefficient/Hellinger distance (clamped at 0 to guard against rounding error)
-            bhattacharyya_coef[i] <- sum(sqrt(ref_density_interp * sample_density_interp)) * mean(diff(common_grid))
-            hellinger_dist[i] <- sqrt(max(0, 1 - bhattacharyya_coef[i]))
-        }
-        
-        # Store overlap measures for the current cell type
-        bhattacharyya_list[[cell_type]] <- bhattacharyya_coef
-        hellinger_list[[cell_type]] <- hellinger_dist
-    }
-    
-    # Return list with overlap measures
-    bhattacharyya_coef <- data.frame(Sample = sample_names, bhattacharyya_list)
-    hellinger_dist <- data.frame(Sample = sample_names, hellinger_list)
-    return(list(bhattacharyya_coef = bhattacharyya_coef, 
-                hellinger_dist = hellinger_dist))
-}
-
-
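Note on the function above: the overlap computation can be tried in isolation with base R. The sketch below estimates two kernel densities, interpolates them onto a common grid, and approximates the Bhattacharyya coefficient by a Riemann sum, from which the Hellinger distance follows. The helper name `overlap_measures_sketch` and the simulated distances are assumptions made for this illustration and are not part of the package.

```r
# Hypothetical helper (not part of scDiagnostics): Bhattacharyya coefficient
# and Hellinger distance between two samples of distances.
overlap_measures_sketch <- function(x, y, n_grid = 1000) {
    dx <- stats::density(x)
    dy <- stats::density(y)
    # Common evaluation grid covering both kernel density estimates
    grid <- seq(min(dx$x, dy$x), max(dx$x, dy$x), length.out = n_grid)
    # Interpolate each density onto the grid; outside its own range it is zero
    fx <- stats::approx(dx$x, dx$y, xout = grid)$y
    fy <- stats::approx(dy$x, dy$y, xout = grid)$y
    fx[is.na(fx)] <- 0
    fy[is.na(fy)] <- 0
    # Riemann-sum approximation of the integral of sqrt(f * g)
    step <- grid[2] - grid[1]
    bc <- sum(sqrt(fx * fy)) * step
    # Clamp so rounding error cannot make the square-root argument negative
    list(bhattacharyya = bc, hellinger = sqrt(max(0, 1 - bc)))
}

set.seed(1)
same <- overlap_measures_sketch(rnorm(500, mean = 5), rnorm(500, mean = 5))
far <- overlap_measures_sketch(rnorm(500, mean = 5), rnorm(500, mean = 15))
```

Two samples drawn from the same distribution (`same`) should give a coefficient near 1 and a distance near 0, while well-separated samples (`far`) should give the opposite.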
diff --git a/R/calculateSampleSimilarityPCA.R b/R/calculateSampleSimilarityPCA.R
deleted file mode 100644
index f6c31ac..0000000
--- a/R/calculateSampleSimilarityPCA.R
+++ /dev/null
@@ -1,128 +0,0 @@
-#' @title Calculate Sample Similarity Using PCA Loadings
-#'
-#' @description 
-#' This function calculates the cosine similarity between samples based on the loadings of the
-#' principal components (PCs) obtained from principal component analysis (PCA).
-#'
-#' @details 
-#' This function calculates the cosine similarity between samples based on the loadings of the selected
-#' principal components obtained from PCA. It extracts the rotation matrix from the PCA results of the 
-#' \code{\linkS4class{SingleCellExperiment}} object and identifies the high-loading variables for each selected PC. 
-#' Then, it computes the cosine similarity between samples using the high-loading variables for each PC.
-#'
-#' @param se_object A \code{\linkS4class{SingleCellExperiment}} object containing expression data.
-#' @param samples A character vector specifying the samples for which to compute the similarity.
-#' @param pc_subset A numeric vector specifying the subset of principal components to consider (default: c(1:5)).
-#' @param n_top_vars An integer indicating the number of top loading variables to consider for each PC (default: 50).
-#'
-#' @return A data frame containing cosine similarity values between samples for each selected principal component.
-#'
-#' @export
-#'
-#' @author Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu}
-#' 
-#' @seealso \code{\link{plot.calculateSampleSimilarityPCA}}
-#'
-#' @examples
-#' # Load required libraries
-#' library(scRNAseq)
-#' library(scuttle)
-#' library(SingleR)
-#' library(scran)
-#' library(scater)
-#'
-#' # Load data (replace with your data loading)
-#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-#' 
-#' # Divide the data into reference and query datasets
-#' set.seed(100)
-#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-#' ref_data <- sce[, indices]
-#' query_data <- sce[, -indices]
-#' 
-#' # log transform datasets
-#' ref_data <- scuttle::logNormCounts(ref_data)
-#' query_data <- scuttle::logNormCounts(query_data)
-#' 
-#' # Get cell type scores using SingleR (or any other cell type annotation method)
-#' scores <- SingleR::SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-#' 
-#' # Add labels to query object
-#' colData(query_data)$labels <- scores$labels
-#' 
-#' # Selecting highly variable genes (can be customized by the user)
-#' ref_var <- scran::getTopHVGs(ref_data, n = 2000)
-#' query_var <- scran::getTopHVGs(query_data, n = 2000)
-#' 
-#' # Intersect the gene symbols to obtain common genes
-#' common_genes <- intersect(ref_var, query_var)
-#' ref_data_subset <- ref_data[common_genes, ]
-#' query_data_subset <- query_data[common_genes, ]
-#'
-#' # Run PCA on the reference data (assumed to be prepared)
-#' ref_data_subset <- runPCA(ref_data_subset)
-#'
-#' # Store PCA anomaly data and plots
-#' anomaly_output <- detectAnomaly(reference_data = ref_data_subset, 
-#'                                 ref_cell_type_col = "reclustered.broad", 
-#'                                 n_components = 10,
-#'                                 n_tree = 500,
-#'                                 anomaly_treshold = 0.5) 
-#' top6_anomalies <- names(sort(anomaly_output$Combined$reference_anomaly_scores, 
-#'                              decreasing = TRUE)[1:6])
-#' 
-#' # Compute cosine similarity between anomalies and top PCs
-#' cosine_similarities <- calculateSampleSimilarityPCA(ref_data_subset, samples = top6_anomalies, 
-#'                                                     pc_subset = c(1:10), n_top_vars = 50)
-#' cosine_similarities
-#' 
-#' # Plot similarities
-#' plot(cosine_similarities, pc_subset = c(1:5))
-#' 
-# Function to calculate cosine similarities between samples and PCs
-calculateSampleSimilarityPCA <- function(se_object, samples, pc_subset = c(1:5), n_top_vars = 50){
-    
-    # Extract rotation matrix for SingleCellExperiment object
-    rotation_mat <- attributes(reducedDim(se_object, "PCA"))$rotation[, pc_subset]
-    
-    # Function to identify high-loading variables for each PC
-    .getHighLoadingVars <- function(rotation_mat, n_top_vars) {
-        high_loading_vars <- lapply(1:ncol(rotation_mat), function(pc) {
-            abs_loadings <- abs(rotation_mat[, pc])
-            top_vars <- names(sort(abs_loadings, decreasing = TRUE))[1:n_top_vars]
-            return(top_vars)
-        })
-        return(high_loading_vars)
-    }
-    
-    # Extract high-loading variables
-    high_loading_vars <- .getHighLoadingVars(rotation_mat, n_top_vars)
-    
-    # Function to compute cosine similarity
-    .cosine_similarity <- function(vector1, vector2) {
-        sum(vector1 * vector2) / (sqrt(sum(vector1^2)) * sqrt(sum(vector2^2)))
-    }
-    
-    # Function to compute cosine similarity for each PC using high-loading variables
-    .computeCosineSimilarity <- function(samples, rotation_mat, high_loading_vars) {
-        similarities <- lapply(1:length(high_loading_vars), function(pc) {
-            vars <- high_loading_vars[[pc]]
-            sample_subset <- samples[, vars, drop = FALSE]
-            pc_vector <- rotation_mat[vars, pc]
-            apply(sample_subset, 1, .cosine_similarity, vector2 = pc_vector)
-        })
-        return(similarities)
-    }
-    
-    # Calculate similarities
-    assay_mat <- t(as.matrix(assay(se_object[, samples], "logcounts")))
-    similarities <- .computeCosineSimilarity(assay_mat, rotation_mat, high_loading_vars)
-    
-    # Combine the per-PC results into a matrix (samples in rows, PCs in columns)
-    similarity_df <- do.call(cbind, similarities)
-    colnames(similarity_df) <- paste0("PC", 1:ncol(rotation_mat))
-    
-    # Update class of output
-    class(similarity_df) <- c(class(similarity_df), "calculateSampleSimilarityPCA")
-    return(similarity_df)
-}
\ No newline at end of file
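Reviewer note: the core computation in the deleted calculateSampleSimilarityPCA() can be illustrated with base R alone. The sketch below is not the package's implementation; it assumes a toy expression matrix and uses stats::prcomp() in place of the PCA stored in a SingleCellExperiment. All object names (expr, pca, top_vars, sims) are hypothetical.

```r
# Toy data: 20 samples x 6 genes (hypothetical names, not package data)
set.seed(1)
expr <- matrix(rnorm(120), nrow = 20,
               dimnames = list(paste0("cell", 1:20), paste0("gene", 1:6)))
pca <- prcomp(expr)  # pca$rotation holds the gene loadings

# Cosine similarity between two numeric vectors
cosine_similarity <- function(u, v) {
    sum(u * v) / (sqrt(sum(u^2)) * sqrt(sum(v^2)))
}

# Top 3 genes by absolute loading on PC1
n_top_vars <- 3
top_vars <- names(sort(abs(pca$rotation[, 1]), decreasing = TRUE))[seq_len(n_top_vars)]

# Similarity of each sample to PC1, restricted to those genes
sims <- apply(expr[, top_vars, drop = FALSE], 1, cosine_similarity,
              v = pca$rotation[top_vars, 1])
```

Here sims is a named numeric vector with one value per sample; repeating the last step for each PC and binding the results column-wise gives the samples-by-PCs layout the function returns.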
diff --git a/R/calculateVarImpOverlap.R b/R/calculateVarImpOverlap.R
deleted file mode 100644
index 4060103..0000000
--- a/R/calculateVarImpOverlap.R
+++ /dev/null
@@ -1,140 +0,0 @@
-#' @title Compare Gene Importance Across Datasets Using Random Forest
-#'
-#' @description This function identifies and compares the most important genes for differentiating cell types between a query dataset 
-#' and a reference dataset using Random Forest.
-#'
-#' @details This function uses the Random Forest algorithm to calculate the importance of genes in differentiating between cell types 
-#' within both a reference dataset and a query dataset. The function then compares the top genes identified in both datasets to determine 
-#' the overlap in their importance scores.
-#'
-#' @param query_data A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the query cells.
-#' @param reference_data A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the reference cells.
-#' @param query_cell_type_col A character string specifying the column name in the query dataset containing cell type annotations.
-#' @param ref_cell_type_col A character string specifying the column name in the reference dataset containing cell type annotations.
-#' @param n_tree An integer specifying the number of trees to grow in the Random Forest. Default is 500.
-#' @param n_top An integer specifying the number of top genes to consider when comparing variable importance scores. Default is 20.
-#'
-#' @return A list containing three elements:
-#' \item{var_imp_ref}{A list of data frames containing variable importance scores for each combination of cell types in the reference 
-#' dataset.}
-#' \item{var_imp_query}{A list of data frames containing variable importance scores for each combination of cell types in the query 
-#' dataset.}
-#' \item{var_imp_comparison}{A named vector indicating the proportion of top genes that overlap between the reference and query 
-#' datasets for each combination of cell types.}
-#' 
-#' @export
-#' 
-#' @author Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu}
-#' 
-#' @examples
-#' # Load necessary libraries
-#' library(scRNAseq)
-#' library(scuttle)
-#' library(SingleR)
-#' library(scran)
-#'
-#' # Load data
-#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-#' 
-#' # Divide the data into reference and query datasets
-#' set.seed(100)
-#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-#' ref_data <- sce[, indices]
-#' query_data <- sce[, -indices]
-#' 
-#' # Log-normalize datasets
-#' ref_data <- logNormCounts(ref_data)
-#' query_data <- logNormCounts(query_data)
-#' 
-#' # Get cell type scores using SingleR (or any other cell type annotation method)
-#' scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-#' 
-#' # Add labels to query object
-#' colData(query_data)$labels <- scores$labels
-#' 
-#' # Selecting highly variable genes (can be customized by the user)
-#' ref_var <- getTopHVGs(ref_data, n = 500)
-#' query_var <- getTopHVGs(query_data, n = 500)
-#' 
-#' # Intersect the gene symbols to obtain common genes
-#' common_genes <- intersect(ref_var, query_var)
-#' ref_data_subset <- ref_data[common_genes, ]
-#' query_data_subset <- query_data[common_genes, ]
-#' 
-#' # Compare variable importance between the query and reference datasets
-#' rf_output <- calculateVarImpOverlap(query_data_subset, ref_data_subset, 
-#'                                     query_cell_type_col = "labels", 
-#'                                     ref_cell_type_col = "reclustered.broad", 
-#'                                     n_tree = 500,
-#'                                     n_top = 20)
-#' 
-#' 
-# RF function to compare (between datasets) which genes are best at differentiating cell types from each other
-calculateVarImpOverlap <- function(query_data, 
-                                   reference_data, 
-                                   query_cell_type_col, 
-                                   ref_cell_type_col,
-                                   n_tree = 500,
-                                   n_top = 20){
-    
-    # Extract assay data for reference and query datasets
-    ref_x <- t(as.matrix(assay(reference_data, "logcounts")))
-    query_x <- t(as.matrix(assay(query_data, "logcounts")))
-        
-    # Extract labels from reference and query datasets
-    ref_y <- reference_data[[ref_cell_type_col]]
-    query_y <- query_data[[query_cell_type_col]]
-    
-    # Remove NA from reference
-    ref_x <- ref_x[!is.na(ref_y), , drop = FALSE]
-    ref_y <- ref_y[!is.na(ref_y)]
-    
-    # Finding importance scores for each cell type in reference dataset
-    var_imp_ref <- list()
-    cell_types <- unique(intersect(ref_y, query_y))
-    cell_types_combn <- combn(length(cell_types), 2)
-    for(combn_id in 1:ncol(cell_types_combn)){
-        
-        ref_x_subset <- ref_x[which(ref_y %in% c(cell_types[cell_types_combn[1, combn_id]], cell_types[cell_types_combn[2, combn_id]])),]
-        ref_y_subset <- ref_y[which(ref_y %in% c(cell_types[cell_types_combn[1, combn_id]], cell_types[cell_types_combn[2, combn_id]]))]
-        training_data <- data.frame(ref_x_subset, cell_type = factor(ref_y_subset))
-        rf_binary <- ranger::ranger(cell_type ~ ., data = training_data, num.trees = n_tree, importance = "impurity")
-        var_importance_name <- paste0(cell_types[cell_types_combn[1, combn_id]], "-", cell_types[cell_types_combn[2, combn_id]])
-        var_imp_ref[[var_importance_name]] <- rf_binary$variable.importance
-        var_imp_ref[[var_importance_name]] <- 
-            data.frame(Gene = names(var_imp_ref[[var_importance_name]])[order(var_imp_ref[[var_importance_name]], 
-                                                                                    decreasing = TRUE)], 
-                       RF_Importance = var_imp_ref[[var_importance_name]][order(var_imp_ref[[var_importance_name]], 
-                                                                                   decreasing = TRUE)])
-    }
-    
-    # Finding importance scores for each cell type in query dataset
-    var_imp_query <- list()
-    for(combn_id in 1:ncol(cell_types_combn)){
-        
-        query_x_subset <- query_x[which(query_y %in% c(cell_types[cell_types_combn[1, combn_id]], cell_types[cell_types_combn[2, combn_id]])),]
-        query_y_subset <- query_y[which(query_y %in% c(cell_types[cell_types_combn[1, combn_id]], cell_types[cell_types_combn[2, combn_id]]))]
-        training_data <- data.frame(query_x_subset, cell_type = factor(query_y_subset))
-        rf_binary <- ranger::ranger(cell_type ~ ., data = training_data, num.trees = n_tree, importance = "impurity")
-        var_importance_name <- paste0(cell_types[cell_types_combn[1, combn_id]], "-", cell_types[cell_types_combn[2, combn_id]])
-        var_imp_query[[var_importance_name]] <- rf_binary$variable.importance
-        var_imp_query[[var_importance_name]] <- 
-            data.frame(Gene = names(var_imp_query[[var_importance_name]])[order(var_imp_query[[var_importance_name]], 
-                                                                                 decreasing = TRUE)], 
-                       RF_Importance = var_imp_query[[var_importance_name]][order(var_imp_query[[var_importance_name]], 
-                                                                                   decreasing = TRUE)])
-    }
-    
-    # Comparison vector
-    var_imp_comparison <- rep(NA, length(var_imp_ref))
-    names(var_imp_comparison) <- names(var_imp_ref)
-    for(cells in names(var_imp_comparison)){
-        var_imp_comparison[cells] <- length(intersect(var_imp_ref[[cells]]$Gene[1:n_top], 
-                                                      var_imp_query[[cells]]$Gene[1:n_top])) / n_top
-    }
-    
-    # Return variable importance scores for each combination of cell types in each dataset and the comparison 
-    return(list(var_imp_ref = var_imp_ref, 
-                var_imp_query = var_imp_query,
-                var_imp_comparison = var_imp_comparison))
-}
\ No newline at end of file
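Reviewer note: the comparison step of the deleted calculateVarImpOverlap() reduces to the fraction of shared genes among the top n_top of two importance rankings. The sketch below shows that step in isolation. The helper name top_overlap and the gene rankings are made up for illustration, and the guard against n_top exceeding the list length is an addition that the original code does not have.

```r
# Proportion of genes shared by the top n_top entries of two ranked gene lists.
# top_overlap is a hypothetical helper, not part of the package.
top_overlap <- function(genes_ref, genes_query, n_top) {
    # Guard (an addition): never index past the end of either ranking
    n_top <- min(n_top, length(genes_ref), length(genes_query))
    shared <- intersect(genes_ref[seq_len(n_top)], genes_query[seq_len(n_top)])
    length(shared) / n_top
}

# Made-up rankings, most important gene first
ref_ranked <- c("CD3E", "CD8A", "GZMB", "NKG7", "MS4A1")
query_ranked <- c("CD8A", "CD3E", "LYZ", "GZMB", "CD79A")

overlap <- top_overlap(ref_ranked, query_ranked, n_top = 4)
overlap  # 0.75: CD3E, CD8A and GZMB are shared among the top 4
```

The measure ignores the order of genes within the top n_top, so two rankings with the same top genes in a different order still score 1.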
diff --git a/R/compareCCA.R b/R/compareCCA.R
deleted file mode 100644
index 214f37f..0000000
--- a/R/compareCCA.R
+++ /dev/null
@@ -1,163 +0,0 @@
-#' @title Compare Subspaces Spanned by Top Principal Components Using Canonical Correlation Analysis
-#' 
-#' @description 
-#' This function compares the subspaces spanned by the top principal components (PCs) of the reference 
-#' and query datasets using canonical correlation analysis (CCA). It calculates the canonical variables, 
-#' correlations, and a similarity measure for the subspaces.
-#'
-#' @details
-#' This function performs canonical correlation analysis (CCA) to compare the subspaces spanned by the 
-#' top principal components (PCs) of the reference and query datasets. The function extracts the rotation 
-#' matrices corresponding to the specified PCs and performs CCA on these matrices. It computes the canonical 
-#' variables and their corresponding correlations. Additionally, it calculates a similarity measure for the 
-#' canonical variables using cosine similarity. The output is a list containing the canonical coefficients 
-#' for both datasets, the cosine similarity values, and the canonical correlations.
-#'
-#'
-#' @param query_data A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the query cells.
-#' @param reference_data A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the reference cells.
-#' @param pc_subset A numeric vector specifying the subset of principal components (PCs) 
-#' to compare. Default is the first five PCs.
-#' @param n_top_vars An integer indicating the number of top loading variables to consider for each PC. Default is 25.
-#'
-#' @return A list containing the following elements:
-#' \describe{
-#'   \item{coef_ref}{Canonical coefficients for the reference dataset.}
-#'   \item{coef_query}{Canonical coefficients for the query dataset.}
-#'   \item{cosine_similarity}{Cosine similarity values for the canonical variables.}
-#'   \item{correlations}{Canonical correlations between the reference and query datasets.}
-#' }
-#'
-#' @export
-#' 
-#' @author Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu}
-#' 
-#' @seealso \code{\link{plot.compareCCA}}
-#' 
-#' @examples
-#' # Load necessary libraries
-#' library(scRNAseq)
-#' library(scuttle)
-#' library(scran)
-#' library(SingleR)
-#' library(ggplot2)
-#' library(scater)
-#'
-#' # Load data
-#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-#' 
-#' # Divide the data into reference and query datasets
-#' set.seed(100)
-#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-#' ref_data <- sce[, indices]
-#' query_data <- sce[, -indices]
-#'
-#' # Log transform datasets
-#' ref_data <- logNormCounts(ref_data)
-#' query_data <- logNormCounts(query_data)
-#'
-#' # Get cell type scores using SingleR (or any other cell type annotation method)
-#' scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-#'
-#' # Add labels to query object
-#' colData(query_data)$labels <- scores$labels
-#'
-#' # Selecting highly variable genes (can be customized by the user)
-#' ref_var <- getTopHVGs(ref_data, n = 500)
-#' query_var <- getTopHVGs(query_data, n = 500)
-#'
-#' # Intersect the gene symbols to obtain common genes
-#' common_genes <- intersect(ref_var, query_var)
-#' ref_data_subset <- ref_data[common_genes, ]
-#' query_data_subset <- query_data[common_genes, ]
-#'
-#' # Subset reference and query data for a specific cell type
-#' ref_data_subset <- ref_data_subset[, which(ref_data_subset$reclustered.broad == "CD8")]
-#' query_data_subset <- query_data_subset[, which(colData(query_data_subset)$labels == "CD8")]
-#'
-#' # Run PCA on the reference and query datasets
-#' ref_data_subset <- runPCA(ref_data_subset, ncomponents = 50)
-#' query_data_subset <- runPCA(query_data_subset, ncomponents = 50)
-#' 
-#' # Compare CCA
-#' cca_comparison <- compareCCA(query_data_subset, ref_data_subset, 
-#'                              pc_subset = c(1:5), n_top_vars = 25)
-#' 
-#' # Visualize output of CCA comparison
-#' plot(cca_comparison)
-#' 
-#' 
-# Function to compare subspace spanned by top PCs in reference and query datasets
-compareCCA <- function(query_data, reference_data, 
-                       pc_subset = c(1:5),
-                       n_top_vars = 25){
-    
-    # Check if query_data is a SingleCellExperiment object
-    if (!is(query_data, "SingleCellExperiment")) {
-        stop("query_data must be a SingleCellExperiment object.")
-    }
-    
-    # Check if reference_data is a SingleCellExperiment object
-    if (!is(reference_data, "SingleCellExperiment")) {
-        stop("reference_data must be a SingleCellExperiment object.")
-    }
-    
-    # Check that the genes in both datasets are the same
-    if(!all(rownames(attributes(reducedDim(query_data, "PCA"))$rotation) %in%
-            rownames(attributes(reducedDim(reference_data, "PCA"))$rotation)))
-        stop("The genes in the rotation matrices differ. Consider decreasing the number of genes used for PCA.")
-    
-    # Check input if PC subset is valid
-    if(!all(c(pc_subset %in% 1:ncol(reducedDim(reference_data, "PCA")), 
-              pc_subset %in% 1:ncol(reducedDim(query_data, "PCA")))))
-        stop("\'pc_subset\' is out of range.")
-    
-    # Extract the rotation matrices
-    ref_rotation <- attributes(reducedDim(reference_data, "PCA"))$rotation[, pc_subset]
-    query_rotation <- attributes(reducedDim(query_data, "PCA"))$rotation[, pc_subset]
-    
-    # Function to identify high-loading variables for each PC
-    .getHighLoadingVars <- function(rotation_mat, n_top_vars) {
-        high_loading_vars <- lapply(1:ncol(rotation_mat), function(pc) {
-            abs_loadings <- abs(rotation_mat[, pc])
-            top_vars <- names(sort(abs_loadings, decreasing = TRUE))[1:n_top_vars]
-            return(top_vars)
-        })
-        return(high_loading_vars)
-    }
-    
-    # Get union of variables with highest loadings
-    top_ref <- .getHighLoadingVars(ref_rotation, n_top_vars)
-    top_query <- .getHighLoadingVars(query_rotation, n_top_vars)
-    top_union <- unlist(lapply(1:length(pc_subset), function(i) return(union(top_ref[[i]], top_query[[i]]))))
-    
-    # Perform CCA
-    cca_result <- cancor(ref_rotation, query_rotation)
-    
-    # Extract canonical variables and correlations
-    canonical_ref <- cca_result$xcoef
-    canonical_query <- cca_result$ycoef
-    correlations <- cca_result$cor
-    
-    # Function to compute similarity measure (e.g., cosine similarity)
-    .cosine_similarity <- function(u, v) {
-        return(abs(sum(u * v)) / (sqrt(sum(u^2)) * sqrt(sum(v^2))))
-    }
-    
-    # Compute cosine similarity between each pair of canonical coefficient vectors
-    similarities <- rep(0, length(pc_subset))
-    for (i in 1:length(pc_subset)) {
-        similarities[i] <- .cosine_similarity(canonical_ref[, i], canonical_query[, i])
-    }
-    
-    # Update class of return output
-    output <- list(coef_ref = canonical_ref,
-                   coef_query = canonical_query,
-                   cosine_similarity = similarities,
-                   correlations = correlations)
-    class(output) <- c(class(output), "compareCCA")
-
-    # Return cosine similarity output
-    return(output)
-}
-
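Reviewer note: the deleted compareCCA() relies on stats::cancor() applied to two PCA rotation matrices. The sketch below shows what that call returns, using toy data and stats::prcomp() as stand-ins for the SingleCellExperiment inputs. The object names are hypothetical and the random matrices carry no biological meaning.

```r
# Toy reference and query expression matrices sharing the same 30 genes
set.seed(42)
ref_expr <- matrix(rnorm(200 * 30), nrow = 200)    # 200 cells x 30 genes
query_expr <- matrix(rnorm(150 * 30), nrow = 150)  # 150 cells x 30 genes

# Gene loadings for the first three PCs of each dataset (30 x 3 matrices)
ref_rot <- prcomp(ref_expr)$rotation[, 1:3]
query_rot <- prcomp(query_expr)$rotation[, 1:3]

# CCA treats genes as observations and PCs as variables
cca <- cancor(ref_rot, query_rot)

cca$cor         # canonical correlations, one per canonical pair
dim(cca$xcoef)  # coefficients for the reference loadings
dim(cca$ycoef)  # coefficients for the query loadings
```

Because genes act as the observations, both rotation matrices need the same genes in the same row order before they are passed to cancor().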
diff --git a/R/comparePCA.R b/R/comparePCA.R
deleted file mode 100644
index 53ad60f..0000000
--- a/R/comparePCA.R
+++ /dev/null
@@ -1,185 +0,0 @@
-#' @title Compare Principal Components Analysis (PCA) Results
-#' 
-#' @description This function compares the principal components (PCs) obtained from separate PCA on reference and query 
-#' datasets for a single cell type using either cosine similarity or correlation.
-#' 
-#' @details
-#' This function compares the PCA results between the reference and query datasets by computing cosine 
-#' similarities or correlations between the loadings of top variables for each pair of principal components. It first 
-#' extracts the PCA rotation matrices from both datasets and identifies the top variables with highest loadings for 
-#' each PC. Then, it computes the cosine similarities or correlations between the loadings of top variables for each 
-#' pair of PCs. The resulting matrix contains the similarity values, where rows represent reference PCs and columns 
-#' represent query PCs.
-#' 
-#' @param query_data A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the query cells.
-#' @param reference_data A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the reference cells.
-#' @param pc_subset A numeric vector specifying the subset of principal components (PCs) to compare. Default is the first five PCs.
-#' @param n_top_vars An integer indicating the number of top loading variables to consider for each PC. Default is 50.
-#' @param metric The similarity metric to use. It can be either "cosine" or "correlation". Default is "cosine".
-#' @param correlation_method The correlation method to use if metric is "correlation". It can be "spearman" 
-#' or "pearson". Default is "spearman".
-#'
-#' @return A similarity matrix comparing the principal components of the reference and query datasets.
-#' Each element (i, j) in the matrix represents the similarity between the i-th principal component 
-#' of the reference dataset and the j-th principal component of the query dataset.
-#' 
-#' @export
-#' 
-#' @author Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu}
-#' 
-#' @seealso \code{\link{plot.comparePCA}}
-#' 
-#' @examples
-#' # Load necessary libraries
-#' library(scRNAseq)
-#' library(scuttle)
-#' library(scran)
-#' library(SingleR)
-#' library(ComplexHeatmap)
-#' library(scater)
-#'
-#' # Load data
-#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-#' 
-#' # Divide the data into reference and query datasets
-#' set.seed(100)
-#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-#' ref_data <- sce[, indices]
-#' query_data <- sce[, -indices]
-#'
-#' # Log transform datasets
-#' ref_data <- logNormCounts(ref_data)
-#' query_data <- logNormCounts(query_data)
-#'
-#' # Get cell type scores using SingleR (or any other cell type annotation method)
-#' scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-#'
-#' # Add labels to query object
-#' colData(query_data)$labels <- scores$labels
-#'
-#' # Selecting highly variable genes (can be customized by the user)
-#' ref_var <- getTopHVGs(ref_data, n = 500)
-#' query_var <- getTopHVGs(query_data, n = 500)
-#'
-#' # Intersect the gene symbols to obtain common genes
-#' common_genes <- intersect(ref_var, query_var)
-#' ref_data_subset <- ref_data[common_genes, ]
-#' query_data_subset <- query_data[common_genes, ]
-#'
-#' # Subset reference and query data for a specific cell type
-#' ref_data_subset <- ref_data_subset[, which(ref_data_subset$reclustered.broad == "CD8")]
-#' query_data_subset <- query_data_subset[, which(colData(query_data_subset)$labels == "CD8")]
-#'
-#' # Run PCA on the reference and query datasets
-#' ref_data_subset <- runPCA(ref_data_subset)
-#' query_data_subset <- runPCA(query_data_subset)
-#'
-#' # Call the PCA comparison function
-#' similarity_mat <- comparePCA(query_data_subset, ref_data_subset, 
-#'                              pc_subset = c(1:5), 
-#'                              n_top_vars = 50,
-#'                              metric = c("cosine", "correlation")[1], 
-#'                              correlation_method = c("spearman", "pearson")[1])
-#'
-#' # Create the heatmap
-#' plot(similarity_mat)
-#' 
-# Compare PCA vectors of reference and query datasets for specific cell type.
-comparePCA <- function(query_data, reference_data, 
-                       pc_subset = c(1:5),
-                       n_top_vars = 50,
-                       metric = c("cosine", "correlation")[1], 
-                       correlation_method = c("spearman", "pearson")[1]){
-    
-    # Check if query_data is a SingleCellExperiment object
-    if (!is(query_data, "SingleCellExperiment")) {
-        stop("query_data must be a SingleCellExperiment object.")
-    }
-    
-    # Check if reference_data is a SingleCellExperiment object
-    if (!is(reference_data, "SingleCellExperiment")) {
-        stop("reference_data must be a SingleCellExperiment object.")
-    }
-    
-    # Check that the genes in both datasets are the same
-    if(!all(rownames(attributes(reducedDim(query_data, "PCA"))$rotation) %in%
-            rownames(attributes(reducedDim(reference_data, "PCA"))$rotation)))
-        stop("The genes in the rotation matrices differ. Consider decreasing the number of genes used for PCA.")
-    
-    # Check input if PC subset is valid
-    if(!all(c(pc_subset %in% 1:ncol(reducedDim(reference_data, "PCA")), 
-              pc_subset %in% 1:ncol(reducedDim(query_data, "PCA")))))
-        stop("\'pc_subset\' is out of range.")
-    
-    # Check input for metric
-    if(!(metric %in% c("cosine", "correlation")))
-        stop("\'metric\' should be one of \'cosine\' or \'correlation\'.")
-    
-    # Check input for correlation method
-    if(!(correlation_method %in% c("spearman", "pearson")))
-        stop("\'correlation_method\' should be one of \'spearman\' or \'pearson\'.")
-    
-    # Extract PCA data from reference and query data
-    ref_rotation <- attributes(reducedDim(reference_data, "PCA"))$rotation[, pc_subset]
-    query_rotation <- attributes(reducedDim(query_data, "PCA"))$rotation[, pc_subset]
-    
-    # Function to identify high-loading variables for each PC
-    .getHighLoadingVars <- function(rotation_mat, n_top_vars) {
-        high_loading_vars <- lapply(1:ncol(rotation_mat), function(pc) {
-            abs_loadings <- abs(rotation_mat[, pc])
-            top_vars <- names(sort(abs_loadings, decreasing = TRUE))[1:n_top_vars]
-            return(top_vars)
-        })
-        return(high_loading_vars)
-    }
-    
-    # Get union of variables with highest loadings
-    top_ref <- .getHighLoadingVars(ref_rotation, n_top_vars)
-    top_query <- .getHighLoadingVars(query_rotation, n_top_vars)
-    top_union <- lapply(1:length(pc_subset), function(i) return(union(top_ref[[i]], top_query[[i]])))
-
-    # Initialize a matrix to store the similarity values
-    similarity_matrix <- matrix(NA, nrow = length(pc_subset), ncol = length(pc_subset))
-    
-    if(metric == "cosine"){
-        # Function to compute cosine similarity
-        .cosine_similarity <- function(x, y) {
-            sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
-        }
-        
-        # Loop over each pair of columns and compute cosine similarity
-        for (i in 1:length(pc_subset)) {
-            for (j in 1:length(pc_subset)) {
-                combination_union <- union(top_union[[i]], top_union[[j]])
-                similarity_matrix[i, j] <- .cosine_similarity(ref_rotation[combination_union, i], query_rotation[combination_union, j])
-            }
-        }
-    } else if(metric == "correlation"){
-        # Loop over each pair of columns and compute the correlation
-        for (i in 1:length(pc_subset)) {
-            for (j in 1:length(pc_subset)) {
-                combination_union <- union(top_union[[i]], top_union[[j]])
-                similarity_matrix[i, j] <- cor(ref_rotation[combination_union, i], query_rotation[combination_union, j], 
-                                               method = correlation_method)
-            }
-        }
-    }
-    
-    # Add rownames and colnames with % of variance explained for each PC of each dataset 
-    rownames(similarity_matrix) <- paste0("Ref PC", pc_subset, " (", 
-                                          round(attributes(reducedDim(reference_data, "PCA"))$varExplained[pc_subset] / 
-                                                    sum(attributes(reducedDim(reference_data, "PCA"))$varExplained[pc_subset]) *
-                                                            100, 1), "%)")
-    colnames(similarity_matrix) <- paste0("Query PC", pc_subset, " (", 
-                                          round(attributes(reducedDim(query_data, "PCA"))$varExplained[pc_subset] / 
-                                                    sum(attributes(reducedDim(query_data, "PCA"))$varExplained[pc_subset]) *
-                                                    100, 1), "%)")
-    
-    # Update class of return output
-    class(similarity_matrix) <- c(class(similarity_matrix), "comparePCA")
-    
-    # Return similarity matrix
-    return(similarity_matrix)
-}
-
-
diff --git a/R/comparePCASubspace.R b/R/comparePCASubspace.R
deleted file mode 100644
index 5b292db..0000000
--- a/R/comparePCASubspace.R
+++ /dev/null
@@ -1,153 +0,0 @@
-#' @title Compare Subspaces Spanned by Top Principal Components
-#' 
-#' @description
-#' This function compares the subspace spanned by the top principal components (PCs) in a reference dataset to that 
-#' in a query dataset. It computes the cosine similarity between the loadings of the top variables for each PC in 
-#' both datasets and provides a weighted cosine similarity score.
-#'
-#' @details
-#' This function compares the subspace spanned by the top principal components (PCs) in a reference dataset 
-#' to that in a query dataset. It first computes the cosine similarity between the loadings of the top variables 
-#' for each PC in both datasets. The top cosine similarity scores are then selected, and their corresponding PC 
-#' indices are stored. Additionally, the function calculates the average percentage of variance explained by the 
-#' selected top PCs. Finally, it computes a weighted cosine similarity score based on the top cosine similarities 
-#' and the average percentage of variance explained.
-#'
-#' @param query_data A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the query cells.
-#' @param reference_data A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the reference cells.
-#' @param pc_subset A numeric vector specifying the subset of principal components (PCs) to compare. Default is the first five PCs.
-#' @param n_top_vars An integer indicating the number of top loading variables to consider for each PC. Default is 50.
-#'
-#' @return A list containing the following components:
-#'   \item{principal_angles_cosines}{A numeric vector of cosine values of principal angles.}
-#'   \item{average_variance_explained}{A numeric vector of average variance explained by each PC.}
-#'   \item{weighted_cosine_similarity}{A numeric value representing the weighted cosine similarity.}
-#'
-#' @export
-#' 
-#' @author Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu}
-#' 
-#' @seealso \code{\link{plot.comparePCASubspace}}
-#' 
-#' @examples
-#' # Load necessary libraries
-#' library(scRNAseq)
-#' library(scuttle)
-#' library(scran)
-#' library(SingleR)
-#' library(ggplot2)
-#' library(scater)
-#'
-#' # Load data
-#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-#' 
-#' # Divide the data into reference and query datasets
-#' set.seed(100)
-#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-#' ref_data <- sce[, indices]
-#' query_data <- sce[, -indices]
-#'
-#' # Log transform datasets
-#' ref_data <- logNormCounts(ref_data)
-#' query_data <- logNormCounts(query_data)
-#'
-#' # Get cell type scores using SingleR (or any other cell type annotation method)
-#' scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-#'
-#' # Add labels to query object
-#' colData(query_data)$labels <- scores$labels
-#'
-#' # Selecting highly variable genes (can be customized by the user)
-#' ref_var <- getTopHVGs(ref_data, n = 500)
-#' query_var <- getTopHVGs(query_data, n = 500)
-#'
-#' # Intersect the gene symbols to obtain common genes
-#' common_genes <- intersect(ref_var, query_var)
-#' ref_data_subset <- ref_data[common_genes, ]
-#' query_data_subset <- query_data[common_genes, ]
-#'
-#' # Subset reference and query data for a specific cell type
-#' ref_data_subset <- ref_data_subset[, which(ref_data_subset$reclustered.broad == "CD8")]
-#' query_data_subset <- query_data_subset[, which(colData(query_data_subset)$labels == "CD8")]
-#'
-#' # Run PCA on the reference and query datasets
-#' ref_data_subset <- runPCA(ref_data_subset, ncomponents = 50)
-#' query_data_subset <- runPCA(query_data_subset, ncomponents = 50)
-#' 
-#' # Compare PCA subspaces
-#' subspace_comparison <- comparePCASubspace(reference_data = ref_data_subset, query_data = query_data_subset,
-#'                                           pc_subset = c(1:5), n_top_vars = 50)
-#' 
-#' # Plot the subspace comparison
-#' plot(subspace_comparison)
-#' 
-# Function to compare subspace spanned by top PCs in reference and query datasets
-comparePCASubspace <- function(reference_data, query_data, 
-                               pc_subset = c(1:5),
-                               n_top_vars = 50){
-    
-    # Check if query_data is a SingleCellExperiment object
-    if (!is(query_data, "SingleCellExperiment")) {
-        stop("query_data must be a SingleCellExperiment object.")
-    }
-    
-    # Check if reference_data is a SingleCellExperiment object
-    if (!is(reference_data, "SingleCellExperiment")) {
-        stop("reference_data must be a SingleCellExperiment object.")
-    }
-    
-    # Check if genes in both datasets are the same
-    if(!all(rownames(attributes(reducedDim(query_data, "PCA"))$rotation) %in%
-            rownames(attributes(reducedDim(reference_data, "PCA"))$rotation)))
-        stop("The genes in the rotation matrices differ. Consider decreasing the number of genes used for PCA.")
-    
-    # Check input if PC subset is valid
-    if(!all(c(pc_subset %in% 1:ncol(reducedDim(reference_data, "PCA")), 
-         pc_subset %in% 1:ncol(reducedDim(query_data, "PCA")))))
-        stop("\'pc_subset\' is out of range.")
-    
-    # Compute the cosine similarity (cosine of principal angle)
-    cosine_similarity <- comparePCA(query_data = query_data, reference_data = reference_data,
-                                    pc_subset = pc_subset, n_top_vars = n_top_vars, metric = "cosine")
-    
-    # Vector to store top cosine similarities
-    top_cosine <- numeric(length(pc_subset))
-    # Matrix to store PC IDs for each top cosine similarity
-    cosine_id <- matrix(NA, nrow = length(pc_subset), ncol = 2)
-    colnames(cosine_id) <- c("Ref", "Query")
-    
-    # Looping to store top cosine similarities and PC IDs
-    for(id in 1:length(pc_subset)){
-        
-        # Store data for top cosine
-        top_ref <- which.max(apply(cosine_similarity, 1, max))
-        top_query <- which.max(cosine_similarity[top_ref,])
-        top_cosine[id] <- cosine_similarity[top_ref, top_query]
-        cosine_id[id,] <- c(top_ref, top_query)
-        
-        # Remove as candidate
-        cosine_similarity[top_ref,] <- -Inf 
-        cosine_similarity[, top_query] <- -Inf
-    }
-    
-    # Vector of variance explained
-    var_explained_ref <- attributes(reducedDim(reference_data, "PCA"))$varExplained[pc_subset]/
-        sum(attributes(reducedDim(reference_data, "PCA"))$varExplained[pc_subset])
-    var_explained_query <- attributes(reducedDim(query_data, "PCA"))$varExplained[pc_subset]/
-        sum(attributes(reducedDim(query_data, "PCA"))$varExplained[pc_subset])
-    var_explained_avg <- (var_explained_ref[cosine_id[, 1]] + var_explained_query[cosine_id[, 2]]) / 2
-    
-    # Weighted cosine similarity score
-    weighted_cosine_similarity <- sum(top_cosine * var_explained_avg)
-    
-    # Update class of return output
-    output <- list(cosine_similarity = top_cosine,
-                   cosine_id = cosine_id,
-                   var_explained_avg = var_explained_avg,
-                   weighted_cosine_similarity = weighted_cosine_similarity)
-    class(output) <- c(class(output), "comparePCASubspace")
-    
-    # Return cosine similarity output
-    return(output)
-}
-
diff --git a/R/detectAnomaly.R b/R/detectAnomaly.R
deleted file mode 100644
index 4ee1c62..0000000
--- a/R/detectAnomaly.R
+++ /dev/null
@@ -1,175 +0,0 @@
-#' 
-#' @importFrom methods is
-#' @importFrom stats na.omit predict qnorm
-#' @importFrom utils tail
-#' 
-#' @title PCA Anomaly Scores via Isolation Forests with Visualization
-#'
-#' @description 
-#' This function detects anomalies in single-cell data by projecting the data onto a PCA space and using an isolation forest 
-#' algorithm to identify anomalies.
-#'
-#' @details This function projects the query data onto the PCA space of the reference data. An isolation forest is then built on the 
-#' reference data to identify anomalies in the query data based on their PCA projections. If no query dataset is provided by the user,
-#' the anomaly scores are computed on the reference data itself. Anomaly scores for the data with all combined cell types are also
-#' provided as part of the output.
-#' 
-#' @param reference_data A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the reference cells.
-#' @param query_data An optional \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the query cells. 
-#' If NULL, then the isolation forest anomaly scores are computed for the reference data. Default is NULL.
-#' @param ref_cell_type_col A character string specifying the column name in the reference dataset containing cell type annotations.
-#' @param query_cell_type_col A character string specifying the column name in the query dataset containing cell type annotations.
-#' @param n_components An integer specifying the number of principal components to use. Default is 10. If NULL, the log-normalized expression values are used instead of PCA projections.
-#' @param n_tree An integer specifying the number of trees for the isolation forest. Default is 500.
-#' @param anomaly_treshold A numeric value specifying the threshold for identifying anomalies. Default is 0.5.
-#' @param ... Additional arguments passed to the `isolation.forest` function.
-#' 
-#' @return A list containing the following components for each cell type and the combined data:
-#' \item{reference_anomaly_scores}{Anomaly scores for each cell in the reference data.}
-#' \item{reference_anomaly}{Logical vector indicating whether each reference cell is classified as an anomaly.}
-#' \item{reference_mat_subset}{PCA projections of the reference data.}
-#' \item{query_mat_subset, query_anomaly_scores, query_anomaly}{PCA projections, anomaly scores, and anomaly indicators for the query data (if provided).}
-#' \item{var_explained}{Proportion of variance explained by the retained principal components.}
-#' 
-#' @export
-#' 
-#' @author Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu}
-#' 
-#' @seealso \code{\link{plot.detectAnomaly}}
-#' 
-#' @examples
-#' # Load required libraries
-#' library(scRNAseq)
-#' library(scuttle)
-#' library(SingleR)
-#' library(scran)
-#' library(scater)
-#'
-#' # Load data
-#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-#' 
-#' # Divide the data into reference and query datasets
-#' set.seed(100)
-#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-#' ref_data <- sce[, indices]
-#' query_data <- sce[, -indices]
-#' 
-#' # log transform datasets
-#' ref_data <- logNormCounts(ref_data)
-#' query_data <- logNormCounts(query_data)
-#' 
-#' # Get cell type scores using SingleR (or any other cell type annotation method)
-#' scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-#' 
-#' # Add labels to query object
-#' colData(query_data)$labels <- scores$labels
-#' 
-#' # Selecting highly variable genes (can be customized by the user)
-#' ref_var <- getTopHVGs(ref_data, n = 2000)
-#' query_var <- getTopHVGs(query_data, n = 2000)
-#' 
-#' # Intersect the gene symbols to obtain common genes
-#' common_genes <- intersect(ref_var, query_var)
-#' ref_data_subset <- ref_data[common_genes, ]
-#' query_data_subset <- query_data[common_genes, ]
-#' 
-#' # Run PCA on the reference data
-#' ref_data_subset <- runPCA(ref_data_subset)
-#' 
-#' # Store PCA anomaly data and plots
-#' anomaly_output <- detectAnomaly(ref_data_subset, query_data_subset,
-#'                                 ref_cell_type_col = "reclustered.broad", 
-#'                                 query_cell_type_col = "labels",
-#'                                 n_components = 10,
-#'                                 n_tree = 500,
-#'                                 anomaly_treshold = 0.5) 
-#' 
-#' # Plot the output for a cell type
-#' plot(anomaly_output, cell_type = "CD8", pc_subset = c(1:5), data_type = "query")
-#' 
-# Function to perform diagnostics using isolation forest with PCA and visualization
-detectAnomaly <- function(reference_data, 
-                          query_data = NULL, 
-                          ref_cell_type_col,
-                          query_cell_type_col = NULL, 
-                          n_components = 10,
-                          n_tree = 500,
-                          anomaly_treshold = 0.5,
-                          ...) {
-    
-    # Check whether the analysis is done only for one dataset
-  if (is.null(query_data)) {
-      include_query_in_output <- FALSE
-  } else{
-      if(is.null(query_cell_type_col))
-          stop("If \'query_data\' is not NULL, a value for \'query_cell_type_col\' must be provided.")
-      include_query_in_output <- TRUE
-  }
-    
-  if(!is.null(n_components)){
-      reference_mat <- reducedDim(reference_data, "PCA")[, 1:n_components]
-      if(include_query_in_output){
-          # Get PCA data from reference and query datasets (query data projected onto PCA space of reference dataset)
-          pca_output <- projectPCA(query_data = query_data, reference_data = reference_data, 
-                                   query_cell_type_col = query_cell_type_col, ref_cell_type_col = ref_cell_type_col,
-                                   n_components = n_components, return_value = "list")
-          query_mat <- pca_output$query[, paste0("PC", 1:n_components)]
-      }
-  } else{
-      reference_mat <- t(as.matrix(assay(reference_data, "logcounts")))
-      if(include_query_in_output){
-          query_mat <- t(as.matrix(assay(query_data, "logcounts")))
-      }
-  }
-  
-  # List to store output
-  output <- list()
-  
-  # Extract reference and query annotations
-  reference_labels <- reference_data[[ref_cell_type_col]]
-  if(!include_query_in_output){
-      cell_types <- c(as.list(na.omit(unique(reference_labels))),
-                      list(na.omit(unique(reference_labels))))
-  } else{
-      query_labels <- query_data[[query_cell_type_col]]
-      cell_types <- c(as.list(na.omit(intersect(unique(reference_labels), unique(query_labels)))),
-                      list(na.omit(intersect(unique(reference_labels), unique(query_labels)))))
-  }
-
-  for (cell_type in cell_types) {
-    
-    # Filter reference and query PCA data for the current cell type
-    reference_mat_subset <- na.omit(reference_mat[reference_labels %in% cell_type,])
-    
-    # Build isolation forest on reference PCA data for this cell type
-    isolation_forest <- isotree::isolation.forest(reference_mat_subset, ntree = n_tree, ...)
-      
-    # Calculate anomaly scores for the reference data and, if provided, the query data
-    reference_anomaly_scores <- predict(isolation_forest, newdata = reference_mat_subset, type = "score")
-    if(include_query_in_output){
-        query_mat_subset <- na.omit(query_mat[query_labels %in% cell_type,])
-        query_anomaly_scores <- predict(isolation_forest, newdata = query_mat_subset, type = "score")
-    }
-
-    # Store cell type anomaly scores and PCA data
-    list_name <- ifelse(length(cell_type) == 1, cell_type, "Combined")
-    output[[list_name]] <- list()
-    output[[list_name]]$reference_anomaly_scores <- reference_anomaly_scores
-    output[[list_name]]$reference_anomaly <- reference_anomaly_scores > anomaly_treshold
-    output[[list_name]]$reference_mat_subset <- reference_mat_subset
-    if(include_query_in_output){
-        output[[list_name]]$query_mat_subset <- query_mat_subset
-        output[[list_name]]$query_anomaly_scores <- query_anomaly_scores
-        output[[list_name]]$query_anomaly <- query_anomaly_scores > anomaly_treshold
-    }
-    if(!is.null(n_components))
-        output[[list_name]]$var_explained <- (attributes(reducedDim(reference_data, "PCA"))$varExplained[1:n_components]) /
-        sum(attributes(reducedDim(reference_data, "PCA"))$varExplained) 
-  }
-  
-  # Set the class of the output
-  class(output) <- c(class(output), "detectAnomaly")
-  
-  # Return anomaly scores and PCA data for each cell type
-  return(output)
-}
diff --git a/R/histQCvsAnnotation.R b/R/histQCvsAnnotation.R
deleted file mode 100644
index 4d126aa..0000000
--- a/R/histQCvsAnnotation.R
+++ /dev/null
@@ -1,130 +0,0 @@
-#' @title Histograms: QC Stats and Annotation Scores Visualization
-#'
-#' @description
-#' This function generates histograms for visualizing the distribution of quality control (QC) statistics and 
-#' annotation scores associated with cell types in single-cell genomic data. 
-#' 
-#' @details This function is particularly useful in the analysis of data from single-cell experiments, 
-#' where understanding the distribution of these metrics is crucial for quality assessment and 
-#' interpretation of cell type annotations.
-#'
-#' @param query_data  A \code{\linkS4class{SingleCellExperiment}} containing the single-cell 
-#' expression data and metadata.
-#' @param qc_col character. A column name in the \code{colData} of \code{query_data} that 
-#' contains the QC stats of interest.
-#' @param label_col character. The column name in the \code{colData} of \code{query_data} 
-#' that contains the cell type labels.
-#' @param score_col character. The column name in the \code{colData} of \code{query_data} that 
-#' contains the cell type scores.
-#' @param label character. A vector of cell type labels to plot (e.g., c("T-cell", "B-cell")).  
-#' Defaults to \code{NULL}, which will include all the cells.
-#'
-#' @return An object containing two histograms displayed side by side. 
-#' The first histogram represents the distribution of QC stats, 
-#' and the second histogram represents the distribution of annotation scores.
-#' 
-#' @examples
-#' \donttest{
-#' library(scater)
-#' library(scran)
-#' library(scRNAseq)
-#' library(SingleR)
-#' library(gridExtra)
-#'
-#' # Load data
-#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-#'
-#' # Divide the data into reference and query datasets
-#' set.seed(100)
-#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-#' ref_data <- sce[, indices]
-#' query_data <- sce[, -indices]
-#'
-#' # Log-transform datasets
-#' ref_data <- logNormCounts(ref_data)
-#' query_data <- logNormCounts(query_data)
-#'
-#' # Get cell type scores using SingleR
-#' pred <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-#' 
-#' # Assign labels to query data
-#' colData(query_data)$labels <- pred$labels
-#' 
-#' # Get annotation scores
-#' scores <- apply(pred$scores, 1, max)
-#'
-#' # Assign scores to query data
-#' colData(query_data)$cell_scores <- scores
-#'
-#' # Generate histograms
-#' histQCvsAnnotation(query_data = query_data, 
-#'                    qc_col = "percent.mito", 
-#'                    label_col = "labels", 
-#'                    score_col = "cell_scores", 
-#'                    label = c("CD4", "CD8"))
-#'                   
-#' histQCvsAnnotation(query_data = query_data, 
-#'                    qc_col = "percent.mito", 
-#'                    label_col = "labels", 
-#'                    score_col = "cell_scores", 
-#'                    label = NULL)
-#' }
-#'
-#' @export
-histQCvsAnnotation <- function(query_data, 
-                               qc_col, 
-                               label_col, 
-                               score_col, 
-                               label = NULL) {
-  # Sanity checks
-  
-  # Check if query_data is a SingleCellExperiment object
-  if (!is(query_data, "SingleCellExperiment")) {
-    stop("query_data must be a SingleCellExperiment object.")
-  }
-  
-  # Check if qc_col is a valid column name in query_data
-  if (!qc_col %in% colnames(colData(query_data))) {
-    stop("qc_col: '", qc_col, "' is not a valid column name in query_data.")
-  }
-  
-  # Check if label_col is a valid column name in query_data
-  if (!label_col %in% colnames(colData(query_data))) {
-    stop("label_col: '", label_col, "' is not a valid column name in query_data.")
-  }
-  
-  # Check if score_col is a valid column name in query_data
-  if (!score_col %in% colnames(colData(query_data))) {
-    stop("score_col: '", score_col, "' is not a valid column name in query_data.")
-  }
-  
-  # Filter cells based on label if specified
-  if (!is.null(label)) {
-    index <- which(colData(query_data)[[label_col]] %in% label)
-    query_data <- query_data[, index]
-  }
-  
-  # Extract QC stats and scores
-  qc_stats <- colData(query_data)[, qc_col]
-  cell_type_scores <- colData(query_data)[, score_col]
-
-  # Combine QC stats and scores into a data frame
-  data <- data.frame(QCStats = qc_stats, Scores = cell_type_scores)
-  
-  # Create histogram for QC stats
-  qc_histogram <- ggplot2::ggplot(data, ggplot2::aes(x = QCStats)) +
-      ggplot2::geom_histogram(color = "black", fill = "white") +
-      ggplot2::xlab(paste(qc_col)) +
-      ggplot2::ylab("Frequency") +
-      ggplot2::theme_bw()
-  
-  # Create histogram for scores
-  scores_histogram <- ggplot2::ggplot(data, ggplot2::aes(x = Scores)) +
-      ggplot2::geom_histogram(color = "black", fill = "white") +
-      ggplot2::xlab("Annotation Scores") +
-      ggplot2::ylab("Frequency") +
-      ggplot2::theme_bw()
-  
-  # Arrange the two histograms side by side and return the result
-  return(gridExtra::grid.arrange(qc_histogram, scores_histogram, ncol = 2))
-}
diff --git a/R/nearestNeighborDiagnostics.R b/R/nearestNeighborDiagnostics.R
deleted file mode 100644
index 54d827f..0000000
--- a/R/nearestNeighborDiagnostics.R
+++ /dev/null
@@ -1,159 +0,0 @@
-#' @title Calculate Nearest Neighbor Diagnostics for Cell Type Classification
-#'
-#' @description 
-#' This function computes the probabilities for each sample of belonging to either the reference or query dataset for 
-#' each cell type using nearest neighbor analysis.
-#'
-#'
-#' @details 
-#' This function performs a nearest neighbor search to calculate the probability of each sample in the query dataset 
-#' belonging to the reference dataset for each cell type. It uses principal component analysis (PCA) to reduce the dimensionality 
-#' of the data before performing the nearest neighbor search. The function balances the sample sizes between the reference and query 
-#' datasets by data augmentation if necessary.
-#'
-#'
-#' @param query_data A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the query cells.
-#' @param reference_data A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the reference cells.
-#' @param n_neighbor An integer specifying the number of nearest neighbors to consider. Default is 15.
-#' @param n_components An integer specifying the number of principal components to use for dimensionality reduction. Default is 10.
-#' @param pc_subset A vector specifying the subset of principal components to use in the analysis. Default is c(1:10).
-#' @param query_cell_type_col A character string specifying the column name in the query dataset containing cell type annotations.
-#' @param ref_cell_type_col A character string specifying the column name in the reference dataset containing cell type annotations.
-#'
-#' @return A list where each element corresponds to a cell type and contains two vectors:
-#' \item{prob_ref}{The probabilities of each query sample belonging to the reference dataset.}
-#' \item{prob_query}{The probabilities of each query sample belonging to the query dataset.}
-#' The list is assigned the class \code{"nearestNeighborDiagnostics"}.
-#' 
-#' @export
-#' 
-#' @author Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu}
-#' 
-#' @seealso \code{\link{plot.nearestNeighborDiagnostics}}
-#' 
-#' @examples
-#' # Load necessary libraries
-#' library(scRNAseq)
-#' library(scuttle)
-#' library(scran)
-#' library(SingleR)
-#' library(scater)
-#'
-#' # Load data
-#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-#' 
-#' # Divide the data into reference and query datasets
-#' set.seed(100)
-#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-#' ref_data <- sce[, indices]
-#' query_data <- sce[, -indices]
-#' 
-#' # log transform datasets
-#' ref_data <- logNormCounts(ref_data)
-#' query_data <- logNormCounts(query_data)
-#' 
-#' # Get cell type scores using SingleR (or any other cell type annotation method)
-#' scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-#' 
-#' # Add labels to query object
-#' colData(query_data)$labels <- scores$labels
-#' 
-#' # Selecting highly variable genes (can be customized by the user)
-#' ref_var <- getTopHVGs(ref_data, n = 500)
-#' query_var <- getTopHVGs(query_data, n = 500)
-#' 
-#' # Intersect the gene symbols to obtain common genes
-#' common_genes <- intersect(ref_var, query_var)
-#' ref_data_subset <- ref_data[common_genes, ]
-#' query_data_subset <- query_data[common_genes, ]
-#' 
-#' # Run PCA on the reference data
-#' ref_data_subset <- runPCA(ref_data_subset)
-#'
-#' # Compute nearest neighbor diagnostics (query data is projected onto the PCA space of the reference)
-#' nn_output <- nearestNeighborDiagnostics(query_data_subset, ref_data_subset,
-#'                                         n_neighbor = 15, 
-#'                                         n_components = 10,
-#'                                         pc_subset = c(1:10),
-#'                                         query_cell_type_col = "labels", 
-#'                                         ref_cell_type_col = "reclustered.broad")
-#' 
-#' # Plot output
-#' plot(nn_output, cell_types = c("CD4", "CD8", "B_and_plasma", "Myeloid"),
-#'      prob_type = "query")
-#' 
-#' 
-# Function to get probabilities for each sample of belonging to reference or query dataset for each cell type
-nearestNeighborDiagnostics <- function(query_data, reference_data,
-                                       n_neighbor = 15,
-                                       n_components = 10,
-                                       pc_subset = c(1:10),
-                                       query_cell_type_col, 
-                                       ref_cell_type_col){
-    
-    # Check if n_components is a positive integer
-    if (!is.numeric(n_components)) {
-        stop("n_components should be numeric")
-    } else if (any(!n_components == floor(n_components), n_components < 1)) {
-        stop("n_components should be an integer, greater than zero.")
-    }
-    
-    # Get PCA data
-    pca_output <- projectPCA(query_data = query_data, reference_data = reference_data,
-                             n_components = n_components,
-                             query_cell_type_col = query_cell_type_col, 
-                             ref_cell_type_col = ref_cell_type_col, 
-                             return_value = "list")
-    
-    # Initialize list to store probabilities
-    probabilities <- list()
-    
-    # Get unique cell types
-    cell_types <- na.omit(intersect(unique(query_data[[query_cell_type_col]]), 
-                                    unique(reference_data[[ref_cell_type_col]])))
-
-    # Loop through each cell type
-    for (cell_type in cell_types) {
-        
-        # Extract PCA-reduced data for the current cell type
-        ref_pca_cell_type <- pca_output$ref[which(reference_data[[ref_cell_type_col]] == cell_type), paste0("PC", pc_subset)]
-        query_pca_cell_type <- pca_output$query[which(query_data[[query_cell_type_col]] == cell_type), paste0("PC", pc_subset)]
-        
-        # Combine reference and query data for the current cell type
-        combined_data_cell_type <- rbind(ref_pca_cell_type, query_pca_cell_type)
-        
-        # Number of samples for reference and query datasets
-        n_ref <- nrow(ref_pca_cell_type)
-        n_query <- nrow(query_pca_cell_type)
-        
-        # Data augmentation to balance sample size of datasets
-        if(n_ref > n_query){
-            
-            combined_data_cell_type <- rbind(combined_data_cell_type,
-                                             query_pca_cell_type[sample(1:n_query, n_ref - n_query, replace = TRUE),])
-        } else if (n_query > n_ref){
-            
-            combined_data_cell_type <- rbind(combined_data_cell_type,
-                                             ref_pca_cell_type[sample(1:n_ref, n_query - n_ref, replace = TRUE),])
-        }
-        
-        # Perform nearest neighbors search
-        knn_result <- BiocNeighbors::findKNN(combined_data_cell_type, k = n_neighbor, warn.ties = FALSE)
-        
-        prob_ref <- apply(knn_result$index[(n_ref + 1):nrow(knn_result$index),], 1, function(x, n_ref) {
-            mean(x <= n_ref)},
-            n_ref = n_ref)
-        
-        # Store the probabilities
-        probabilities[[cell_type]] <- list()
-        probabilities[[cell_type]]$prob_ref <- prob_ref
-        probabilities[[cell_type]]$prob_query <- 1 - prob_ref
-    }
-    
-    # Creating class for output
-    class(probabilities) <- c(class(probabilities), "nearestNeighborDiagnostics")
-    
-    # Return the list of probabilities
-    return(probabilities)
-}
-
diff --git a/R/plot.calculateAveragePairwiseCorrelation.R b/R/plot.calculateAveragePairwiseCorrelation.R
deleted file mode 100644
index a89d4d5..0000000
--- a/R/plot.calculateAveragePairwiseCorrelation.R
+++ /dev/null
@@ -1,107 +0,0 @@
-#' @title 
-#' Plot the output of the calculateAveragePairwiseCorrelation function
-#'
-#' @description 
-#' This function takes the output of the calculateAveragePairwiseCorrelation function,
-#' which should be a matrix of pairwise correlations, and plots it as a heatmap.
-#' 
-#' @details 
-#' This function converts the correlation matrix into a dataframe, creates a heatmap using ggplot2,
-#' and customizes the appearance of the heatmap with updated colors and improved aesthetics.
-#'
-#' @param x Output matrix from calculateAveragePairwiseCorrelation function.
-#' @param ... Additional arguments to be passed to the plotting function.
-#'
-#' @return A ggplot2 object representing the heatmap plot.
-#' 
-#' @export
-#'         
-#' @seealso \code{\link{calculateAveragePairwiseCorrelation}}
-#' 
-#' @examples
-#' library(scater)
-#' library(scran)
-#' library(scRNAseq)
-#' library(SingleR)
-#'
-#' # Load data
-#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-#'
-#' # Divide the data into reference and query datasets
-#' set.seed(100)
-#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-#' ref_data <- sce[, indices]
-#' query_data <- sce[, -indices]
-#'
-#' # log transform datasets
-#' ref_data <- logNormCounts(ref_data)
-#' query_data <- logNormCounts(query_data)
-#'
-#' # Get cell type scores using SingleR
-#' scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-#'
-#' # Add labels to query object
-#' colData(query_data)$labels <- scores$labels
-#'
-#' # Compute Pairwise Correlations
-#' # Note: The selection of highly variable genes and desired cell types may vary 
-#' # based on user preference. 
-#' # The cell type annotation method used in this example is SingleR. 
-#' # User can use any other method for cell type annotation and provide 
-#' # the corresponding labels in the metadata.
-#'
-#' # Selecting highly variable genes
-#' ref_var <- getTopHVGs(ref_data, n = 2000)
-#' query_var <- getTopHVGs(query_data, n = 2000)
-#'
-#' # Intersect the gene symbols to obtain common genes
-#' common_genes <- intersect(ref_var, query_var)
-#'
-#' # Select desired cell types
-#' selected_cell_types <- c("CD4", "CD8", "B_and_plasma")
-#' ref_data_subset <- ref_data[common_genes, ref_data$reclustered.broad %in% selected_cell_types]
-#' query_data_subset <- query_data[common_genes, query_data$reclustered.broad %in% selected_cell_types]
-#' 
-#' # Run PCA on the reference data
-#' ref_data_subset <- runPCA(ref_data_subset)
-#'
-#' # Compute pairwise correlations
-#' cor_matrix_avg <- calculateAveragePairwiseCorrelation(query_data = query_data_subset, 
-#'                                                       reference_data = ref_data_subset, 
-#'                                                       n_components = 10,
-#'                                                       query_cell_type_col = "labels", 
-#'                                                       ref_cell_type_col = "reclustered.broad", 
-#'                                                       cell_types = selected_cell_types, 
-#'                                                       correlation_method = "spearman")
-#'
-#' # Visualize the results
-#' plot(cor_matrix_avg)
-#' 
-#'
-# Function to plot the output of the calculateAveragePairwiseCorrelation function
-plot.calculateAveragePairwiseCorrelation <- function(x, ...){
-    
-    # Convert matrix to dataframe
-    cor_df <- as.data.frame(as.table(x))
-    cor_df$Var1 <- factor(cor_df$Var1, levels = rownames(x))
-    cor_df$Var2 <- factor(cor_df$Var2, levels = rev(colnames(x)))
-    
-    # Create the heatmap with updated colors and improved aesthetics
-    heatmap_plot <- ggplot2::ggplot(cor_df, ggplot2::aes(x = Var2, y = Var1)) +
-        ggplot2::geom_tile(ggplot2::aes(fill = Freq), color = "white") +
-        ggplot2::geom_text(ggplot2::aes(label = round(Freq, 2)), color = "black", size = 3, family = "sans") +
-        ggplot2::scale_fill_gradient2(low = "blue", mid = "white", high = "red", 
-                                      midpoint = 0, limits = c(min(cor_df$Freq), max(cor_df$Freq)),
-                                      name = "Correlation",
-                                      breaks = seq(-1, 1, by = 0.2)) +  # Specify color scale breaks
-        ggplot2::labs(title = "Correlation Heatmap", x = "", y = "") +
-        ggplot2::theme_minimal() +
-        ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 45, hjust = 1),  # Rotate x-axis labels
-                       axis.text.y = ggplot2::element_text(family = "sans"),  # Set font family for y-axis labels
-                       plot.title = ggplot2::element_text(face = "bold"),  # Make title bold
-                       legend.position = "right",  # Place legend on RHS
-                       legend.title = ggplot2::element_text(face = "italic"))
-    
-    # Print the plot
-    print(heatmap_plot)
-}
diff --git a/R/plot.calculateSampleDistances.R b/R/plot.calculateSampleDistances.R
deleted file mode 100644
index 28c468f..0000000
--- a/R/plot.calculateSampleDistances.R
+++ /dev/null
@@ -1,152 +0,0 @@
-#' @title Plot Distance Density Comparison for a Specific Cell Type and Selected Samples
-#'
-#' @description This function plots the density functions for the reference data and the distances from the specified query samples 
-#' to all reference samples within a specified cell type.
-#'
-#' @details The function first checks if the specified cell type and sample names are present in \code{x}. If the 
-#' specified cell type or sample name is not found, an error is thrown. It then extracts the distances within the reference dataset 
-#' and the distances from the specified query sample to the reference samples. The function creates a density plot using \code{ggplot2} 
-#' to compare the distance distributions. The density plot will show two distributions: one for the pairwise distances within the 
-#' reference dataset and one for the distances from the specified query sample to each reference sample. These distributions are 
-#' plotted in different colors to visually assess how similar the query sample is to the reference samples of the specified cell type.
-#'
-#' @param x A list containing the distance data computed by \code{calculateSampleDistances}.
-#' @param ref_cell_type A string specifying the reference cell type.
-#' @param sample_names A character vector specifying the query sample names for which to plot the distances.
-#' @param ... Additional arguments passed to the plotting function.
-#'
-#' @return A ggplot2 density plot comparing the reference distances and the distances from the specified sample to the reference samples.
-#'
-#' @export
-#'
-#' @author Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu}
-#' 
-#' @seealso \code{\link{calculateSampleDistances}}
-#'
-#' @examples
-#' # Load required libraries
-#' library(scRNAseq)
-#' library(scuttle)
-#' library(SingleR)
-#' library(scran)
-#' library(scater)
-#'
-#' # Load data (replace with your data loading)
-#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-#' 
-#' # Divide the data into reference and query datasets
-#' set.seed(100)
-#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-#' ref_data <- sce[, indices]
-#' query_data <- sce[, -indices]
-#' 
-#' # log transform datasets
-#' ref_data <- scuttle::logNormCounts(ref_data)
-#' query_data <- scuttle::logNormCounts(query_data)
-#' 
-#' # Get cell type scores using SingleR (or any other cell type annotation method)
-#' scores <- SingleR::SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-#' 
-#' # Add labels to query object
-#' colData(query_data)$labels <- scores$labels
-#' 
-#' # Selecting highly variable genes (can be customized by the user)
-#' ref_var <- scran::getTopHVGs(ref_data, n = 2000)
-#' query_var <- scran::getTopHVGs(query_data, n = 2000)
-#' 
-#' # Intersect the gene symbols to obtain common genes
-#' common_genes <- intersect(ref_var, query_var)
-#' ref_data_subset <- ref_data[common_genes, ]
-#' query_data_subset <- query_data[common_genes, ]
-#' 
-#' # Run PCA on the reference data
-#' ref_data_subset <- runPCA(ref_data_subset)
-#'
-#' # Compute the distances between query and reference samples
-#' distance_data <- calculateSampleDistances(query_data_subset, ref_data_subset, 
-#'                                           n_components = 10, 
-#'                                           query_cell_type_col = "labels", 
-#'                                           ref_cell_type_col = "reclustered.broad",
-#'                                           pc_subset = c(1:10)) 
-#' 
-#' # Identify outliers for CD4
-#' cd4_anomalies <- detectAnomaly(ref_data_subset, query_data_subset, 
-#'                                query_cell_type_col = "labels", 
-#'                                ref_cell_type_col = "reclustered.broad",
-#'                                n_components = 10,
-#'                                n_tree = 500,
-#'                                anomaly_treshold = 0.5)$CD4
-#' cd4_top6_anomalies <- names(sort(cd4_anomalies$query_anomaly_scores, decreasing = TRUE)[1:6])
-#' 
-#' # Plot the densities of the distances
-#' plot(distance_data, ref_cell_type = "CD4", sample_names = cd4_top6_anomalies)
-#' plot(distance_data, ref_cell_type = "CD8", sample_names = cd4_top6_anomalies)
-#' 
-#'  
-# Function to plot density functions for the reference data and the specified sample
-plot.calculateSampleDistances <- function(x, ref_cell_type, sample_names, ...) {
-    
-    # Check if cell type is available
-    if(length(ref_cell_type) != 1 || !(ref_cell_type %in% names(x)))
-        stop("The specified \'ref_cell_type\' is not available.")
-    
-    # Filter distance data for the specified cell type
-    cell_distances <- x[[ref_cell_type]]
-    
-    # Check if samples are available in data for that cell type
-    if(!all(sample_names %in% rownames(cell_distances$query_to_ref_distances)))
-        stop("One or more specified 'sample_names' are not available for that cell type.")
-    
-    # Extract distances within the reference dataset
-    ref_distances <- cell_distances$ref_distances
-    
-    # Initialize an empty list to store data frames for each sample
-    plot_data_list <- list()
-    
-    # Loop through each sample to create the combined data frame
-    for(s in sample_names) {
-        # Extract distances for the current sample
-        sample_distances <- cell_distances$query_to_ref_distances[s, ]
-        
-        # Create a data frame for the current sample and reference distances
-        sample_data <- data.frame(Sample = s, Distance = sample_distances, Distance_Type = "Sample")
-        ref_data <- data.frame(Sample = s, Distance = ref_distances, Distance_Type = "Reference")
-        
-        # Combine the reference and sample data frames
-        combined_data <- rbind(ref_data, sample_data)
-        
-        # Append the combined data frame to the list
-        plot_data_list[[s]] <- combined_data
-    }
-    
-    # Combine all data frames into one data frame
-    plot_data <- do.call(rbind, plot_data_list)
-    
-    # Keep order of sample names
-    plot_data$Sample <- factor(plot_data$Sample, levels = sample_names)
-    
-    # Plot density comparison with facets for each sample
-    density_plot <- ggplot2::ggplot(plot_data, ggplot2::aes(x = Distance, fill = Distance_Type)) +
-        ggplot2::geom_density(alpha = 0.5) +
-        ggplot2::labs(title = paste("Distance Density Comparison for Cell Type:", ref_cell_type),
-                      x = "Distance", y = "Density") +
-        ggplot2::scale_fill_manual(name = "Distance Type", values = c("Reference" = "blue", "Sample" = "red")) +
-        ggplot2::facet_wrap(~ Sample, scales = "free_y", labeller = ggplot2::labeller(Sample = ggplot2::label_parsed)) +
-        ggplot2::theme_minimal() +
-        ggplot2::theme(
-            strip.background = ggplot2::element_rect(fill = "lightgrey", color = "grey50"),
-            strip.text = ggplot2::element_text(color = "grey20", size = 10, face = "bold"),
-            panel.grid.major = ggplot2::element_line(color = "grey90", linetype = "dashed"),
-            panel.grid.minor = ggplot2::element_line(color = "grey95", linetype = "dashed")
-        )
-    
-    # Print the density plot
-    print(density_plot)
-}
-
-
-
-
-
-
-
diff --git a/R/plot.calculateSampleSimilarityPCA.R b/R/plot.calculateSampleSimilarityPCA.R
deleted file mode 100644
index 26c8064..0000000
--- a/R/plot.calculateSampleSimilarityPCA.R
+++ /dev/null
@@ -1,118 +0,0 @@
-#' @title Plot Cosine Similarities Between Samples and PCs
-#'
-#' @description 
-#' This function creates a heatmap plot to visualize the cosine similarities between samples and principal components (PCs).
-#'
-#' @details 
-#' This function reshapes the input data frame to create a long format suitable for plotting as a heatmap. It then
-#' creates a heatmap plot using ggplot2, where the x-axis represents the PCs, the y-axis represents the samples, and the
-#' color intensity represents the cosine similarity values.
-#'
-#' @param x An object of class 'calculateSampleSimilarityPCA' containing a dataframe of cosine similarity values 
-#' between samples and PCs.
-#' @param pc_subset A numeric vector specifying the subset of principal components to include in the plot (default: c(1:5)).
-#' @param ... Additional arguments passed to the plotting function.
-#'
-#' @return A ggplot object representing the cosine similarity heatmap.
-#'
-#' @export
-#'
-#' @author Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu}
-#' 
-#' @seealso \code{\link{calculateSampleSimilarityPCA}}
-#'
-#' @examples
-#' # Load required libraries
-#' library(scRNAseq)
-#' library(scuttle)
-#' library(SingleR)
-#' library(scran)
-#' library(scater)
-#'
-#' # Load data (replace with your data loading)
-#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-#' 
-#' # Divide the data into reference and query datasets
-#' set.seed(100)
-#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-#' ref_data <- sce[, indices]
-#' query_data <- sce[, -indices]
-#' 
-#' # log transform datasets
-#' ref_data <- scuttle::logNormCounts(ref_data)
-#' query_data <- scuttle::logNormCounts(query_data)
-#' 
-#' # Get cell type scores using SingleR (or any other cell type annotation method)
-#' scores <- SingleR::SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-#' 
-#' # Add labels to query object
-#' colData(query_data)$labels <- scores$labels
-#' 
-#' # Selecting highly variable genes (can be customized by the user)
-#' ref_var <- scran::getTopHVGs(ref_data, n = 2000)
-#' query_var <- scran::getTopHVGs(query_data, n = 2000)
-#' 
-#' # Intersect the gene symbols to obtain common genes
-#' common_genes <- intersect(ref_var, query_var)
-#' ref_data_subset <- ref_data[common_genes, ]
-#' query_data_subset <- query_data[common_genes, ]
-#'
-#' # Run PCA on the reference data (assumed to be prepared)
-#' ref_data_subset <- runPCA(ref_data_subset)
-#'
-#' # Store PCA anomaly data and plots
-#' anomaly_output <- detectAnomaly(reference_data = ref_data_subset, 
-#'                                 ref_cell_type_col = "reclustered.broad", 
-#'                                 n_components = 10,
-#'                                 n_tree = 500,
-#'                                 anomaly_treshold = 0.5) 
-#' top6_anomalies <- names(sort(anomaly_output$Combined$reference_anomaly_scores, 
-#'                              decreasing = TRUE)[1:6])
-#' 
-#' # Compute cosine similarity between anomalies and top PCs
-#' cosine_similarities <- calculateSampleSimilarityPCA(ref_data_subset, samples = top6_anomalies, 
-#'                                                     pc_subset = c(1:10), n_top_vars = 50)
-#' cosine_similarities
-#' 
-#' # Plot similarities
-#' plot(cosine_similarities, pc_subset = c(1:5))
-#' 
-# Function to plot cosine similarities between samples and PCs
-plot.calculateSampleSimilarityPCA <- function(x, pc_subset = c(1:5), ...){
-    
-    # Subset data
-    x <- x[, paste0("PC", pc_subset)]
-    
-    # Initialize empty vectors for reshaped data
-    sample_names <- c()
-    pc_names <- c()
-    cosine_values <- c()
-    
-    # Loop through the data frame to manually reshape it
-    for (sample in rownames(x)) {
-        for (pc in colnames(x)) {
-            sample_names <- c(sample_names, sample)
-            pc_names <- c(pc_names, pc)
-            cosine_values <- c(cosine_values, x[sample, pc])
-        }
-    }
-    
-    # Create a data frame with the reshaped data
-    cosine_long <- data.frame(Sample = factor(sample_names, levels = rev(rownames(x))), 
-                              PC = pc_names, CosineSimilarity = cosine_values)
-    
-    # Create the heatmap plot
-    plot <- ggplot2::ggplot(cosine_long, ggplot2::aes(x = PC, y = Sample, fill = CosineSimilarity)) +
-        ggplot2::geom_tile(color = "white") +
-        ggplot2::geom_text(ggplot2::aes(label = sprintf("%.2f", CosineSimilarity)), size = 3) +
-        ggplot2::scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0,
-                                      limits = c(-1, 1), space = "Lab", name = "Cosine Similarity") +
-        ggplot2::labs(title = "Cosine Similarity Heatmap",
-                      x = "",
-                      y = "") +
-        ggplot2::theme_minimal() +
-        ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 45, hjust = 1),
-                       plot.title = ggplot2::element_text(hjust = 0.5))
-    return(plot)
-}
-
diff --git a/R/plot.compareCCA.R b/R/plot.compareCCA.R
deleted file mode 100644
index 55f5a30..0000000
--- a/R/plot.compareCCA.R
+++ /dev/null
@@ -1,95 +0,0 @@
-#' @title Plot Visualization of Output from compareCCA Function
-#' 
-#' @description This function generates a visualization of the output from the `compareCCA` function.
-#' The plot shows the cosine similarities of canonical correlation analysis (CCA) coefficients,
-#' with point sizes representing the correlations.
-#'
-#' @details The function converts the input list into a data frame suitable for plotting with `ggplot2`.
-#' Each point in the scatter plot represents the cosine similarity of CCA coefficients, with the size of the point
-#' indicating the correlation.
-#'
-#' @param x A list containing the output from the `compareCCA` function. 
-#' This list should include `cosine_similarity` and `correlations`.
-#' @param ... Additional arguments passed to the plotting function.
-#'
-#' @return A ggplot object representing the scatter plot of cosine similarities of CCA coefficients and correlations.
-#'
-#' @export
-#' 
-#' @author Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu}
-#' 
-#' @seealso \code{\link{compareCCA}}
-#' 
-#' @examples
-#' # Load necessary libraries
-#' library(scRNAseq)
-#' library(scuttle)
-#' library(scran)
-#' library(SingleR)
-#' library(ggplot2)
-#' library(scater)
-#'
-#' # Load data
-#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-#' 
-#' # Divide the data into reference and query datasets
-#' set.seed(100)
-#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-#' ref_data <- sce[, indices]
-#' query_data <- sce[, -indices]
-#'
-#' # Log transform datasets
-#' ref_data <- logNormCounts(ref_data)
-#' query_data <- logNormCounts(query_data)
-#'
-#' # Get cell type scores using SingleR (or any other cell type annotation method)
-#' scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-#'
-#' # Add labels to query object
-#' colData(query_data)$labels <- scores$labels
-#'
-#' # Selecting highly variable genes (can be customized by the user)
-#' ref_var <- getTopHVGs(ref_data, n = 500)
-#' query_var <- getTopHVGs(query_data, n = 500)
-#'
-#' # Intersect the gene symbols to obtain common genes
-#' common_genes <- intersect(ref_var, query_var)
-#' ref_data_subset <- ref_data[common_genes, ]
-#' query_data_subset <- query_data[common_genes, ]
-#'
-#' # Subset reference and query data for a specific cell type
-#' ref_data_subset <- ref_data_subset[, which(ref_data_subset$reclustered.broad == "CD8")]
-#' query_data_subset <- query_data_subset[, which(colData(query_data_subset)$labels == "CD8")]
-#'
-#' # Run PCA on the reference and query datasets
-#' ref_data_subset <- runPCA(ref_data_subset, ncomponents = 50)
-#' query_data_subset <- runPCA(query_data_subset, ncomponents = 50)
-#' 
-#' # Compare CCA
-#' cca_comparison <- compareCCA(query_data_subset, ref_data_subset, 
-#'                              pc_subset = c(1:5))
-#' 
-#' # Visualize output of CCA comparison
-#' plot(cca_comparison)
-#' 
-#' 
-# Plot visualization of output from compareCCA function
-plot.compareCCA <- function(x, ...){
-    
-    # Create a data frame for plotting
-    comparison_data <- data.frame(CCA = paste0("CC", seq_along(x$correlations)),
-                                  Cosine = x$cosine_similarity,
-                                  Correlation = x$correlations)
-    comparison_data$CCA <- factor(comparison_data$CCA, levels = comparison_data$CCA)
-    
-    # Create the scatter plot
-    cca_plot <- ggplot2::ggplot(comparison_data, ggplot2::aes(x = CCA, y = Cosine, size = Correlation)) +
-        ggplot2::geom_point() +
-        ggplot2::scale_size_continuous(range = c(3, 10)) +
-        ggplot2::labs(title = "Cosine Similarities of CCA Coefficients with Correlation",
-                      x = "",
-                      y = "Cosine of CC Coefficients",
-                      size = "Correlation") +
-        ggplot2::theme_minimal()
-    print(cca_plot)
-}
\ No newline at end of file
diff --git a/R/plot.comparePCA.R b/R/plot.comparePCA.R
deleted file mode 100644
index acba156..0000000
--- a/R/plot.comparePCA.R
+++ /dev/null
@@ -1,100 +0,0 @@
-#' @title Plot Heatmap of Cosine Similarities Between Principal Components
-#' 
-#' @description This function generates a heatmap to visualize the cosine similarities between 
-#' principal components from the output of the `comparePCA` function.
-#' 
-#' @details The function converts the input matrix into a long-format data frame 
-#' suitable for plotting with `ggplot2`. The rows in the heatmap are ordered in 
-#' reverse to match the conventional display format. The heatmap uses a blue-white-red 
-#' color gradient to represent cosine similarity values, where blue indicates negative 
-#' similarity, white indicates zero similarity, and red indicates positive similarity.
-#' 
-#' @param x A numeric matrix output from the `comparePCA` function, representing 
-#' cosine similarities between query and reference principal components.
-#' @param ... Additional arguments passed to the plotting function.
-#'
-#' @return A ggplot object representing the heatmap of cosine similarities.
-#' 
-#' @export
-#' 
-#' @author Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu}
-#' 
-#' @seealso \code{\link{comparePCA}}
-#' 
-#' @examples
-#' # Load necessary libraries
-#' library(scRNAseq)
-#' library(scuttle)
-#' library(scran)
-#' library(SingleR)
-#' library(ComplexHeatmap)
-#' library(scater)
-#'
-#' # Load data
-#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-#' 
-#' # Divide the data into reference and query datasets
-#' set.seed(100)
-#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-#' ref_data <- sce[, indices]
-#' query_data <- sce[, -indices]
-#'
-#' # Log transform datasets
-#' ref_data <- logNormCounts(ref_data)
-#' query_data <- logNormCounts(query_data)
-#'
-#' # Get cell type scores using SingleR (or any other cell type annotation method)
-#' scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-#'
-#' # Add labels to query object
-#' colData(query_data)$labels <- scores$labels
-#'
-#' # Selecting highly variable genes (can be customized by the user)
-#' ref_var <- getTopHVGs(ref_data, n = 500)
-#' query_var <- getTopHVGs(query_data, n = 500)
-#'
-#' # Intersect the gene symbols to obtain common genes
-#' common_genes <- intersect(ref_var, query_var)
-#' ref_data_subset <- ref_data[common_genes, ]
-#' query_data_subset <- query_data[common_genes, ]
-#'
-#' # Subset reference and query data for a specific cell type
-#' ref_data_subset <- ref_data_subset[, which(ref_data_subset$reclustered.broad == "CD8")]
-#' query_data_subset <- query_data_subset[, which(colData(query_data_subset)$labels == "CD8")]
-#'
-#' # Run PCA on the reference and query datasets
-#' ref_data_subset <- runPCA(ref_data_subset)
-#' query_data_subset <- runPCA(query_data_subset)
-#'
-#' # Call the PCA comparison function
-#' similarity_mat <- comparePCA(query_data_subset, ref_data_subset, 
-#'                              pc_subset = c(1:5), 
-#'                              metric = c("cosine", "correlation")[1], 
-#'                              correlation_method = c("spearman", "pearson")[1])
-#'
-#' # Create the heatmap
-#' plot(similarity_mat)
-#' 
-#' 
-# Function to produce the heatmap from output of comparePCA function
-plot.comparePCA <- function(x, ...){
-    
-    # Convert the matrix to a long-format data frame (values taken row by row to match the labels)
-    similarity_df <- data.frame(
-        Ref = factor(rep(rownames(x), each = ncol(x)), levels = rev(rownames(x))),
-        Query = rep(colnames(x), times = nrow(x)),
-        value = as.vector(t(x)))
-    
-    # Create the heatmap
-    pc_plot <- ggplot2::ggplot(similarity_df, ggplot2::aes(x = Query, y = Ref, fill = value)) +
-        ggplot2::geom_tile(color = "white") +
-        ggplot2::scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
-                                      midpoint = 0, limits = c(min(x, -0.5), max(x, 0.5)), space = "Lab", 
-                                      name = "Cosine Similarity") +
-        ggplot2::theme_minimal() + 
-        ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 45, vjust = 1, 
-                                                           size = 12, hjust = 1)) +
-        ggplot2::labs(x = "", y = "", 
-                      title = "Heatmap of Cosine Similarities Between PCs")
-    print(pc_plot)
-}
diff --git a/R/plot.comparePCASubspace.R b/R/plot.comparePCASubspace.R
deleted file mode 100644
index 9477a70..0000000
--- a/R/plot.comparePCASubspace.R
+++ /dev/null
@@ -1,96 +0,0 @@
-#' @title Plot Visualization of Output from comparePCASubspace Function
-#' 
-#' @description This function generates a visualization of the output from the `comparePCASubspace` function.
-#' The plot shows the cosine of principal angles between reference and query principal components,
-#' with point sizes representing the variance explained.
-#' 
-#' @details The function converts the input list into a data frame suitable for plotting with `ggplot2`.
-#' Each point in the scatter plot represents the cosine of a principal angle, with the size of the point
-#' indicating the average variance explained by the corresponding principal components.
-#' 
-#' @param x A list output from the `comparePCASubspace` function, containing 
-#' `cosine_similarity`, `cosine_id`, and `var_explained_avg`.
-#' @param ... Additional arguments passed to the plotting function.
-#'
-#' @return A ggplot object representing the scatter plot of principal angle cosines, with point sizes showing the variance explained.
-#' 
-#' @export
-#' 
-#' @author Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu}
-#' 
-#' @seealso \code{\link{comparePCASubspace}}
-#'
-#' @examples
-#' # Load necessary libraries
-#' library(scRNAseq)
-#' library(scuttle)
-#' library(scran)
-#' library(SingleR)
-#' library(ggplot2)
-#' library(scater)
-#'
-#' # Load data
-#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-#' 
-#' # Divide the data into reference and query datasets
-#' set.seed(100)
-#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-#' ref_data <- sce[, indices]
-#' query_data <- sce[, -indices]
-#'
-#' # Log transform datasets
-#' ref_data <- logNormCounts(ref_data)
-#' query_data <- logNormCounts(query_data)
-#'
-#' # Get cell type scores using SingleR (or any other cell type annotation method)
-#' scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-#'
-#' # Add labels to query object
-#' colData(query_data)$labels <- scores$labels
-#'
-#' # Selecting highly variable genes (can be customized by the user)
-#' ref_var <- getTopHVGs(ref_data, n = 500)
-#' query_var <- getTopHVGs(query_data, n = 500)
-#'
-#' # Intersect the gene symbols to obtain common genes
-#' common_genes <- intersect(ref_var, query_var)
-#' ref_data_subset <- ref_data[common_genes, ]
-#' query_data_subset <- query_data[common_genes, ]
-#'
-#' # Subset reference and query data for a specific cell type
-#' ref_data_subset <- ref_data_subset[, which(ref_data_subset$reclustered.broad == "CD8")]
-#' query_data_subset <- query_data_subset[, which(colData(query_data_subset)$labels == "CD8")]
-#'
-#' # Run PCA on the reference and query datasets
-#' ref_data_subset <- runPCA(ref_data_subset, ncomponents = 50)
-#' query_data_subset <- runPCA(query_data_subset, ncomponents = 50)
-#' 
-#' # Compare PCA subspaces
-#' subspace_comparison <- comparePCASubspace(query_data_subset, ref_data_subset, 
-#'                                           pc_subset = c(1:5))
-#' 
-#' # Visualize output of PCA subspace comparison
-#' plot(subspace_comparison)
-#' 
-#' 
-# Function to produce the visualization of output from comparePCASubspace function
-plot.comparePCASubspace <- function(x, ...){
-    
-    # Create a data frame for plotting
-    plot_data <- data.frame(PC = paste0("Ref PC", x$cosine_id[, 1],
-                                        " - Query PC", x$cosine_id[, 2]),
-                            Cosine = x$cosine_similarity,
-                            VarianceExplained = x$var_explained_avg)
-    plot_data$PC <- factor(plot_data$PC, levels = plot_data$PC)
-    
-    # Create plot
-    pc_plot <- ggplot2::ggplot(plot_data, ggplot2::aes(x = PC, y = Cosine, size = VarianceExplained)) +
-        ggplot2::geom_point() +
-        ggplot2::scale_size_continuous(range = c(3, 10)) +
-        ggplot2::labs(title = "Principal Angles Cosines with Variance Explained",
-                      x = "",
-                      y = "Cosine of Principal Angle",
-                      size = "Variance Explained") +
-        ggplot2::theme_minimal()
-    print(pc_plot)
-}
\ No newline at end of file
diff --git a/R/plot.detectAnomaly.R b/R/plot.detectAnomaly.R
deleted file mode 100644
index 0480309..0000000
--- a/R/plot.detectAnomaly.R
+++ /dev/null
@@ -1,163 +0,0 @@
-#' @title Create Faceted Scatter Plots for Specified PC Combinations From \code{detectAnomaly} Object
-#'
-#' @description This function generates faceted scatter plots for specified principal component (PC) combinations
-#' within an anomaly detection object. It allows visualization of the relationship between specified
-#' PCs and highlights anomalies detected by the Isolation Forest algorithm.
-#'
-#' @details The function extracts the specified PCs from the given anomaly detection object and generates
-#' scatter plots for each pair of PCs. It uses \code{ggplot2} to create a faceted plot where each facet represents
-#' a pair of PCs. Anomalies are highlighted in red, while normal points are shown in black.
-#'
-#' @param x A list object containing the anomaly detection results from the \code{detectAnomaly} function. 
-#' Each element of the list should correspond to a cell type and contain \code{reference_mat_subset}, \code{query_mat_subset}, 
-#' \code{var_explained}, \code{reference_anomaly}, and \code{query_anomaly}.
-#' @param cell_type A character string specifying the cell type for which the plots should be generated. This should
-#' be a name present in \code{x}. If NULL, the "Combined" cell type will be plotted. Default is NULL.
-#' @param pc_subset A numeric vector specifying the indices of the PCs to be included in the plots. If NULL, all PCs
-#' in \code{reference_mat_subset} will be included.
-#' @param data_type A character string specifying whether to plot the "query" data or the "reference" data. Default is "query".
-#' @param ... Additional arguments.
-#' 
-#' @return A ggplot2 object representing the PCA plots with anomalies highlighted.
-#' 
-#' @export
-#'
-#' @author Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu}
-#' 
-#' @seealso \code{\link{detectAnomaly}}
-#' 
-#' @examples
-#' # Load required libraries
-#' library(scRNAseq)
-#' library(scuttle)
-#' library(SingleR)
-#' library(scran)
-#' library(scater)
-#'
-#' # Load data
-#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-#' 
-#' # Divide the data into reference and query datasets
-#' set.seed(100)
-#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-#' ref_data <- sce[, indices]
-#' query_data <- sce[, -indices]
-#' 
-#' # log transform datasets
-#' ref_data <- logNormCounts(ref_data)
-#' query_data <- logNormCounts(query_data)
-#' 
-#' # Get cell type scores using SingleR (or any other cell type annotation method)
-#' scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-#' 
-#' # Add labels to query object
-#' colData(query_data)$labels <- scores$labels
-#' 
-#' # Selecting highly variable genes (can be customized by the user)
-#' ref_var <- getTopHVGs(ref_data, n = 2000)
-#' query_var <- getTopHVGs(query_data, n = 2000)
-#' 
-#' # Intersect the gene symbols to obtain common genes
-#' common_genes <- intersect(ref_var, query_var)
-#' ref_data_subset <- ref_data[common_genes, ]
-#' query_data_subset <- query_data[common_genes, ]
-#' 
-#' # Run PCA on the reference data
-#' ref_data_subset <- runPCA(ref_data_subset, ncomponents = 50)
-#' 
-#' # Store PCA anomaly data and plots
-#' anomaly_output <- detectAnomaly(ref_data_subset, query_data_subset, 
-#'                                 ref_cell_type_col = "reclustered.broad", 
-#'                                 query_cell_type_col = "labels",
-#'                                 n_components = 10,
-#'                                 n_tree = 500,
-#'                                 anomaly_treshold = 0.5) 
-#' 
-#' # Plot the output for a cell type
-#' plot(anomaly_output, cell_type = "CD8", pc_subset = c(1:5), data_type = "query")
-#' 
-# Function to create faceted scatter plots for specified PC combinations
-plot.detectAnomaly <- function(x, cell_type = NULL, pc_subset = NULL, data_type = c("query", "reference"), ...) {
-    
-    # Check if PCA was used for computations
-    if(!("var_explained" %in% names(x[[names(x)[1]]])))
-        stop("The plot function can only be used if \'n_components\' is not NULL.")
-    
-    # Check input for cell type
-    if(is.null(cell_type)){
-        cell_type <- "Combined"
-    } else{
-        if(!(cell_type %in% names(x)))
-            stop("\'cell_type\' is not available in \'x\'.")
-    }
-    
-    # Check input for pc_subset
-    if(!is.null(pc_subset)){
-        if(!all(pc_subset %in% 1:ncol(x[[cell_type]]$reference_mat_subset)))
-            stop("\'pc_subset\' is out of range.")
-    } else{
-        pc_subset <- 1:ncol(x[[cell_type]]$reference_mat_subset)
-    }
-    
-    # Check input for data_type
-    data_type <- match.arg(data_type)
-    
-    # Filter data to include only specified PCs
-    if(is.null(x[[cell_type]]$query_mat_subset) && data_type == "query"){
-        stop("There is no query data available in the \'detectAnomaly\' object.")
-    } else{
-        if(data_type == "query"){
-            data_subset <- x[[cell_type]]$query_mat_subset[, pc_subset, drop = FALSE]
-            anomaly <- x[[cell_type]]$query_anomaly
-            
-        } else if(data_type == "reference"){
-            data_subset <- x[[cell_type]]$reference_mat_subset[, pc_subset, drop = FALSE]
-            anomaly <- x[[cell_type]]$reference_anomaly
-        }
-    }
-    
-    # Modify column names to include percentage of variance explained
-    colnames(data_subset) <- paste0("PC", pc_subset, 
-                                    " (", sprintf("%.1f%%", x[[cell_type]]$var_explained[pc_subset] * 100), ")")
-    
-    # Create all possible pairs of specified PCs
-    pc_names <- colnames(data_subset)
-    pairs <- expand.grid(x = pc_names, y = pc_names)
-    pairs <- pairs[pairs$x != pairs$y, ]
-    
-    # Create a new data frame with all possible pairs of specified PCs
-    data_pairs_list <- lapply(1:nrow(pairs), function(i) {
-        x_col <- pairs$x[i]
-        y_col <- pairs$y[i]
-        data_frame <- data.frame(data_subset[, c(x_col, y_col)])
-        colnames(data_frame) <- c("x_value", "y_value")
-        data_frame$x <- x_col
-        data_frame$y <- y_col
-        data_frame
-    })
-    data_pairs <- do.call(rbind, data_pairs_list)
-    
-    # Remove redundant data (to avoid duplicated plots)
-    data_pairs <- data_pairs[as.numeric(data_pairs$x) < as.numeric(data_pairs$y),]
-    
-    # Add anomalies vector to data_pairs dataframe
-    data_pairs$anomaly <- rep(anomaly, choose(length(pc_subset), 2))
-    
-    # Create the ggplot object with facets
-    plot <- ggplot2::ggplot(data_pairs, ggplot2::aes(x = x_value, y = y_value, color = factor(anomaly))) +
-        ggplot2::geom_point(size = 2) + 
-        ggplot2::scale_color_manual(values = c("black", "red"), labels = c("Normal", "Anomaly")) + 
-        ggplot2::facet_grid(rows = ggplot2::vars(y), cols = ggplot2::vars(x), scales = "free") +
-        ggplot2::theme_minimal() +
-        ggplot2::theme(strip.background = ggplot2::element_rect(fill = "grey85", color = "grey70"),   
-                       strip.text = ggplot2::element_text(size = 10, face = "bold", color = "black"), 
-                       axis.title = ggplot2::element_blank(),        
-                       axis.text = ggplot2::element_text(size = 10), 
-                       panel.grid = ggplot2::element_blank(),        
-                       panel.background = ggplot2::element_rect(fill = "white", color = "black"), 
-                       legend.position = "right",          
-                       plot.title = ggplot2::element_text(size = 14, hjust = 0.5), 
-                       plot.background = ggplot2::element_rect(fill = "white")) + 
-        ggplot2::labs(title = paste0("Isolation Forest Anomaly Plot: ", cell_type), color = "iForest Type")
-    print(plot)
-}
diff --git a/R/plot.nearestNeighborDiagnostics.R b/R/plot.nearestNeighborDiagnostics.R
deleted file mode 100644
index c93bcdb..0000000
--- a/R/plot.nearestNeighborDiagnostics.R
+++ /dev/null
@@ -1,114 +0,0 @@
-#' @title Plot Density of Probabilities for Cell Type Classification
-#'
-#' @description This function generates a density plot showing the distribution of the probabilities that each sample belongs to
-#' either the reference or query dataset, for each cell type.
-#'
-#' @details This function creates a density plot to visualize the distribution of the probabilities that each sample belongs to the
-#' reference or query dataset for each cell type. It uses the ggplot2 package for plotting.
-#'
-#' @param x An object of class \code{nearestNeighborDiagnostics} containing the probabilities calculated by the \code{\link{nearestNeighborDiagnostics}} function.
-#' @param cell_types A character vector specifying the cell types to include in the plot. If NULL, all cell types in \code{x} will be plotted. Default is NULL.
-#' @param prob_type A character string specifying the type of probability to plot. Must be either "query" or "reference". Default is "query".
-#' @param ... Additional arguments to be passed to \code{\link[ggplot2]{geom_density}}.
-#'
-#' @return A ggplot2 density plot.
-#' 
-#' @export
-#' 
-#' @author Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu}
-#' 
-#' @seealso \code{\link{nearestNeighborDiagnostics}}
-#' 
-#' @examples
-#' # Load necessary library
-#' library(scRNAseq)
-#' library(scuttle)
-#' library(scran)
-#' library(SingleR)
-#' library(scater)
-#'
-#' # Load data
-#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-#' 
-#' # Divide the data into reference and query datasets
-#' set.seed(100)
-#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-#' ref_data <- sce[, indices]
-#' query_data <- sce[, -indices]
-#' 
-#' # log transform datasets
-#' ref_data <- logNormCounts(ref_data)
-#' query_data <- logNormCounts(query_data)
-#' 
-#' # Get cell type scores using SingleR (or any other cell type annotation method)
-#' scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-#' 
-#' # Add labels to query object
-#' colData(query_data)$labels <- scores$labels
-#' 
-#' # Selecting highly variable genes (can be customized by the user)
-#' ref_var <- getTopHVGs(ref_data, n = 500)
-#' query_var <- getTopHVGs(query_data, n = 500)
-#' 
-#' # Intersect the gene symbols to obtain common genes
-#' common_genes <- intersect(ref_var, query_var)
-#' ref_data_subset <- ref_data[common_genes, ]
-#' query_data_subset <- query_data[common_genes, ]
-#' 
-#' # Run PCA on the reference data
-#' ref_data_subset <- runPCA(ref_data_subset)
-#'
-#' # Compute nearest neighbor diagnostics in the PCA space of the reference
-#' nn_output <- nearestNeighborDiagnostics(query_data_subset, ref_data_subset,
-#'                                         n_neighbor = 15, 
-#'                                         n_components = 10,
-#'                                         pc_subset = c(1:10),
-#'                                         query_cell_type_col = "labels", 
-#'                                         ref_cell_type_col = "reclustered.broad")
-#' 
-#' # Plot output
-#' plot(nn_output, cell_types = c("CD4", "CD8", "B_and_plasma", "Myeloid"),
-#'      prob_type = "query")
-#' 
-#' 
-# Function to plot the probability that each sample belongs to the reference or query dataset, for each cell type
-plot.nearestNeighborDiagnostics <- function(x, cell_types = NULL,
-                                            prob_type = c("query", "reference")[1], ...) {
-    
-    # Check input for probability type
-    if(!(prob_type %in% c("query", "reference")))
-        stop("\'prob_type\' must be one of \'query\' or \'reference\'.")
-    
-    # Convert probabilities to data frame
-    probabilities_df <- do.call(rbind, lapply(names(x), function(ct) {
-        data.frame(cell_types = ct, 
-                   probability = x[[ct]][[ifelse(prob_type == "reference", "prob_ref", "prob_query")]])
-    }))
-    
-    if(!is.null(cell_types)){
-        
-        if(!all(cell_types %in% names(x)))
-            stop("One or more of the \'cell_types\' is not available.")
-        
-        # Subset cell types
-        probabilities_df <- probabilities_df[probabilities_df$cell_types %in% cell_types,]
-    }
-    
-    # Create density plot
-    density_plot <- ggplot2::ggplot(probabilities_df, ggplot2::aes(x = probability, fill = cell_types)) +
-        ggplot2::geom_density(alpha = 0.7) +
-        ggplot2::labs(x = "Probability", y = "Density", title = "Density Plot of Probabilities") +
-        ggplot2::theme_minimal() +
-        ggplot2::theme(
-            legend.position = "none",
-            strip.background = ggplot2::element_rect(fill = "grey90", color = NA),
-            strip.text = ggplot2::element_text(face = "bold")
-        ) +
-        ggplot2::facet_wrap(~cell_types, scales = "free", labeller = ggplot2::labeller(cell_types = ggplot2::label_value))
-    if(length(unique(probabilities_df$cell_types)) > 2)
-        density_plot <- density_plot + 
-        ggplot2::scale_fill_manual(values = RColorBrewer::brewer.pal(n = nlevels(as.factor(probabilities_df$cell_types)), 
-                                                                     name = "Set1")) 
-    
-    return(density_plot)
-}
diff --git a/R/plotGeneExpressionDimred.R b/R/plotGeneExpressionDimred.R
deleted file mode 100644
index 5cda893..0000000
--- a/R/plotGeneExpressionDimred.R
+++ /dev/null
@@ -1,84 +0,0 @@
-#' @title Visualize gene expression on a dimensional reduction plot
-#'
-#' @description
-#' This function plots gene expression on a dimensional reduction plot using methods like t-SNE, UMAP, or PCA. Each single cell is color-coded based on the expression of a specific gene or feature.
-#'
-#' @param se_object An object of class "SingleCellExperiment" containing log-transformed expression matrix and other metadata.
-#'        It can be either a reference or query dataset.
-#' @param method The reduction method to use for visualization. It should be one of the supported methods: "tSNE", "UMAP", or "PCA".
-#' @param n_components A numeric vector of length 2 indicating the two dimensions to be used for plotting. Default is c(1, 2).
-#' @param feature A character string representing the name of the gene or feature to be visualized.
-#'
-#' @import ggplot2
-#' @importFrom ggplot2 ggplot
-#' @importFrom SummarizedExperiment assay
-#' @import SingleCellExperiment
-#'
-#' @return A ggplot object representing the dimensional reduction plot with gene expression.
-#' @export
-#'
-#' @examples
-#' library(scater)
-#' library(scran)
-#' library(scRNAseq)
-#'
-#' # Load data
-#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-#'
-#' # Divide the data into reference and query datasets
-#' set.seed(100)
-#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-#' ref_data <- sce[, indices]
-#' query_data <- sce[, -indices]
-#'
-#' # Log transform datasets
-#' query_data <- logNormCounts(query_data)
-#'
-#' # Run PCA
-#' query_data <- runPCA(query_data)
-#'
-#' # Plot gene expression on PCA plot
-#' plotGeneExpressionDimred(se_object = query_data, 
-#'                          method = "PCA", 
-#'                          n_components = c(1, 2), 
-#'                          feature = "VPREB3")
-#' 
-#'
-plotGeneExpressionDimred <- function(se_object, 
-                                     method, 
-                                     n_components = c(1, 2), 
-                                     feature) {
-
-  # Error handling and validation
-  supported_methods <- c("tSNE", "UMAP", "PCA")
-  if (!(method %in% supported_methods)) {
-    stop("Unsupported method. Please choose one of: ", paste(supported_methods, collapse = ", "))
-  }
-
-  if (length(n_components) != 2) {
-    stop("n_components should be a numeric vector of length 2.")
-  }
-
-  if (!feature %in% rownames(assay(se_object, "logcounts"))) {
-    stop("Specified feature does not exist in the expression matrix.")
-  }
-
-  # Extract dimension reduction coordinates from SingleCellExperiment object
-  reduction <- reducedDim(se_object, method)[, n_components]
-
-  # Extract gene expression vector
-  expression <- assay(se_object, "logcounts")[feature, ]
-
-  # Prepare data for plotting
-  df <- data.frame(Dim1 = reduction[, 1], Dim2 = reduction[, 2], Expression = expression)
-
-  # Create the plot object
-  plot <- ggplot(df, aes(x = Dim1, y = Dim2)) +
-    geom_point(aes(color = Expression)) +
-    scale_color_gradient(low = "grey90", high = "blue") +
-    xlab("Dimension 1") +
-    ylab("Dimension 2") +
-    theme_bw()
-
-  return(plot)
-}
diff --git a/R/plotGeneSetScores.R b/R/plotGeneSetScores.R
deleted file mode 100644
index 39123f7..0000000
--- a/R/plotGeneSetScores.R
+++ /dev/null
@@ -1,149 +0,0 @@
-#' @title Visualization of gene sets or pathway scores on dimensional reduction plot
-#' 
-#' @description
-#' Plot gene sets or pathway scores on PCA, TSNE, or UMAP. Single cells are color-coded by scores of gene sets or pathways.
-#' 
-#' @details 
-#' This function plots gene set scores on reduced dimensions such as PCA, t-SNE, or UMAP. 
-#' It extracts the reduced dimensions from the provided SingleCellExperiment object.
-#' Gene set scores are visualized as a scatter plot with colors indicating the scores.
-#' For PCA, the function automatically includes the percentage of variance explained
-#' in the facet labels of the plot.
-#'          
-#' @param se_object An object of class "SingleCellExperiment" containing numeric expression matrix and other metadata.
-#'        It can be either a reference or query dataset.
-#' @param method A character string indicating the method for visualization ("PCA", "TSNE", or "UMAP").
-#' @param feature A character string representing the name of the feature (score) in the \code{colData} of \code{se_object} to plot.
-#' @param pc_subset An optional vector specifying the principal components (PCs) to include in the plot if method = "PCA". 
-#'        Default is c(1:5).
-#'
-#' @return A ggplot2 object representing the gene set scores plotted on the specified reduced dimensions.
-#' @export
-#'
-#' @examples
-#' library(scater)
-#' library(scran)
-#' library(scRNAseq)
-#' library(AUCell)
-#'
-#' # Load data
-#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-#'
-#' # Divide the data into reference and query datasets
-#' set.seed(100)
-#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-#' ref_data <- sce[, indices]
-#' query_data <- sce[, -indices]
-#'
-#' ## log transform datasets
-#' ref_data <- logNormCounts(ref_data)
-#' query_data <- logNormCounts(query_data)
-#'
-#' # Run PCA on the query data
-#' query_data <- runPCA(query_data)
-#' 
-#' # Compute scores using AUCell
-#' expression_matrix <- assay(query_data, "logcounts")
-#' cells_rankings <- AUCell_buildRankings(expression_matrix, plotStats = FALSE)
-#' # Generate gene sets
-#' gene_set1 <- sample(rownames(expression_matrix), 10)
-#' gene_set2 <- sample(rownames(expression_matrix), 20)
-#' gene_sets <- list(geneSet1 = gene_set1, geneSet2 = gene_set2)
-#' 
-#' # Calculate AUC scores for gene sets
-#' cells_AUC <- AUCell_calcAUC(gene_sets, cells_rankings)
-#' 
-#' # Assign scores to colData (users should ensure that the scores are present in the colData)
-#' colData(query_data)$geneSetScores <- assay(cells_AUC)["geneSet1", ] 
-#'
-#' # Plot gene set scores on PCA
-#' plotGeneSetScores(se_object = query_data, 
-#'                   method = "PCA", 
-#'                   feature = "geneSetScores",
-#'                   pc_subset = c(1:5))
-#'
-#' # Note: Users can provide their own gene set scores in the colData of the 'se_object' object, 
-#' # using any method of their choice.
-#'
-plotGeneSetScores <- function(se_object, 
-                              method, 
-                              feature,
-                              pc_subset = c(1:5)) {
-
-  # Check if the specified method is valid
-  valid_methods <- c("PCA", "TSNE", "UMAP")
-  if (!(method %in% valid_methods)) {
-    stop("Invalid method. Please choose one of: ", paste(valid_methods, collapse = ", "))
-  }
-
-  # Create the plot object
-  if (method == "PCA") {
-      # Check if "PCA" is present in the reduced dimensions of se_object
-      if (!"PCA" %in% names(reducedDims(se_object))) {
-          stop("\'se_object\' must have pre-computed PCA in \'reducedDims\'.")
-      }
-      
-      # Check input for pc_subset
-      if(!all(pc_subset %in% 1:ncol(reducedDim(se_object, "PCA"))))
-          stop("\'pc_subset\' is out of range.")
-      
-      # PCA data
-      plot_mat <- reducedDim(se_object, "PCA")[, pc_subset]
-      # Modify column names to include percentage of variance explained
-      colnames(plot_mat) <- paste0("PC", pc_subset, 
-                                      " (", sprintf("%.1f%%", attributes(reducedDim(se_object, "PCA"))$varExplained[pc_subset] /
-                                                        sum(attributes(reducedDim(se_object, "PCA"))$varExplained) * 100), ")")
-  } else if (method == "TSNE") {
-      # Check if "TSNE" is present in the reduced dimensions of se_object
-      if (!"TSNE" %in% names(reducedDims(se_object))) {
-          stop("\'se_object\' must have pre-computed t-SNE in \'reducedDims\'.")
-      }
-      # TSNE data
-      plot_mat <- reducedDim(se_object, "TSNE")
-  } else if (method == "UMAP") {
-      # Check if "UMAP" is present in the reduced dimensions of se_object
-      if (!"UMAP" %in% names(reducedDims(se_object))) {
-          stop("\'se_object\' must have pre-computed UMAP in \'reducedDims\'.")
-      }
-      # UMAP data
-      plot_mat <- reducedDim(se_object, "UMAP")
-  }
-  
-  # Create all possible pairs of specified PCs
-  plot_names <- colnames(plot_mat)
-  pairs <- expand.grid(x = plot_names, y = plot_names)
-  pairs <- pairs[pairs$x != pairs$y, ]
-  # Create a new data frame with all possible pairs of specified PCs
-  data_pairs_list <- lapply(1:nrow(pairs), function(i) {
-      x_col <- pairs$x[i]
-      y_col <- pairs$y[i]
-      data_frame <- data.frame(plot_mat[, c(x_col, y_col)])
-      colnames(data_frame) <- c("x_value", "y_value")
-      data_frame$x <- x_col
-      data_frame$y <- y_col
-      data_frame
-  })
-  # Plot data
-  data_pairs <- do.call(rbind, data_pairs_list)
-  # Remove redundant data (to avoid duplicated plots)
-  data_pairs <- data_pairs[as.numeric(data_pairs$x) < as.numeric(data_pairs$y),]
-  data_pairs$scores <- se_object[[feature]]
-  # Create the ggplot object (with facets if PCA)
-  plot_obj <- ggplot2::ggplot(data_pairs, ggplot2::aes(x = x_value, y = y_value, color = scores)) +
-      ggplot2::geom_point(size = 1, alpha = 0.5) + 
-      ggplot2::scale_color_gradientn(colors = c("#2171B5", "#8AABC1", "#FFEDA0", "#E6550D"), 
-                                     values = seq(0, 1, by = 1/3), 
-                                     limits = c(0, max(data_pairs$scores))) +
-      ggplot2::facet_grid(rows = ggplot2::vars(y), cols = ggplot2::vars(x), scales = "free") +
-      ggplot2::theme_minimal() +
-      ggplot2::theme(strip.background = ggplot2::element_rect(fill = "grey85", color = "grey70"),   
-                     strip.text = ggplot2::element_text(size = 10, face = "bold", color = "black"), 
-                     axis.title = ggplot2::element_blank(),        
-                     axis.text = ggplot2::element_text(size = 10), 
-                     panel.grid = ggplot2::element_blank(),        
-                     panel.background = ggplot2::element_rect(fill = "white", color = "black"), 
-                     legend.position = "right",          
-                     plot.title = ggplot2::element_text(size = 14, hjust = 0.5), 
-                     plot.background = ggplot2::element_rect(fill = "white")) 
-  return(plot_obj)
-}
diff --git a/R/plotMarkerExpression.R b/R/plotMarkerExpression.R
deleted file mode 100644
index 317c2d8..0000000
--- a/R/plotMarkerExpression.R
+++ /dev/null
@@ -1,156 +0,0 @@
-#' @title Plot gene expression distribution from overall and cell type-specific perspective
-#' 
-#' @description
-#' This function generates density plots to visualize the distribution of gene expression values 
-#' for a specific gene across the overall dataset and within a specified cell type.
-#'
-#' @details 
-#' This function generates density plots to compare the distribution of a specific marker 
-#' gene between reference and query datasets. The aim is to inspect the alignment of gene expression 
-#' levels as a surrogate for dataset similarity. Similar distributions suggest a good alignment, 
-#' while differences may indicate discrepancies or incompatibilities between the datasets.
-#' 
-#' @param query_data A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the query cells.
-#' @param reference_data A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the reference cells.
-#' @param query_cell_type_col character. The column name in the \code{colData} of \code{query_data} that identifies the cell types.
-#' @param ref_cell_type_col character. The column name in the \code{colData} of \code{reference_data} that identifies the cell types.
-#' @param gene_name character. A string representing the gene name for which the distribution is to be visualized.
-#' @param label character. A vector of cell type labels to plot (e.g., c("T-cell", "B-cell")).
-#'
-#' @return A gtable object containing two arranged density plots as grobs. 
-#'         The first plot shows the overall gene expression distribution, 
-#'         and the second plot displays the cell type-specific expression 
-#'         distribution.
-#'         
-#' @examples
-#' library(scater)
-#' library(scran)
-#' library(scRNAseq)
-#' library(SingleR)
-#'
-#' # Load data
-#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-#'
-#' # Divide the data into reference and query datasets
-#' set.seed(100)
-#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-#' ref_data <- sce[, indices]
-#' query_data <- sce[, -indices]
-#'
-#' # Log transform datasets
-#' ref_data <- logNormCounts(ref_data)
-#' query_data <- logNormCounts(query_data)
-#'
-#' # Get cell type scores using SingleR or any other method
-#' pred <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-#'
-#' # Add labels to query object
-#' colData(query_data)$labels <- pred$labels
-#'
-#' # Note: Users can use SingleR or any other method to obtain the cell type annotations.
-#' plotMarkerExpression(reference_data = ref_data, 
-#'                      query_data = query_data, 
-#'                      ref_cell_type_col = "reclustered.broad", 
-#'                      query_cell_type_col = "labels", 
-#'                      gene_name = "VPREB3", 
-#'                      label = "B_and_plasma")
-#' 
-#'
-#' @import ggplot2
-#' @importFrom ggplot2 ggplot
-#' @importFrom gridExtra grid.arrange
-#' @importFrom SummarizedExperiment assay
-#' @import SingleCellExperiment
-#' @export
-plotMarkerExpression <- function(reference_data, 
-                                 query_data, 
-                                 ref_cell_type_col, 
-                                 query_cell_type_col, 
-                                 gene_name, 
-                                 label) {
-  # Sanity checks
-  # Check if query_data is a SingleCellExperiment object
-  if (!is(query_data, "SingleCellExperiment")) {
-    stop("query_data must be a SingleCellExperiment object.")
-  }
-  
-  # Check if reference_data is a SingleCellExperiment object
-  if (!is(reference_data, "SingleCellExperiment")) {
-    stop("reference_data must be a SingleCellExperiment object.")
-  }
-  
-  # Check if gene_name is present in both query_data and reference_data
-  if (!(gene_name %in% rownames(assay(query_data)) && gene_name %in% 
-        rownames(assay(reference_data)))) {
-    stop("gene_name: '", gene_name, "' is not present in the ",
-         "row names of both query_data and reference_data.")
-  }
-    
-  # Check if all labels are present in query_data
-  if (!all(label %in% query_data[[query_cell_type_col]])) {
-    stop("One or more labels specified are not present in query_data.")
-  }
-  
-  # Check if all labels are present in reference_data
-  if (!all(label %in% reference_data[[ref_cell_type_col]])) {
-    stop("One or more labels specified are not present in reference_data.")
-  }
-  
-  # Get expression of the specified gene for reference and query datasets
-  reference_gene_expression <- assay(reference_data, "logcounts")[gene_name, ]
-  query_gene_expression <- assay(query_data, "logcounts")[gene_name, ]
-  
-  # Create a combined vector of gene expression values
-  combined_gene_expression <- c(reference_gene_expression, query_gene_expression)
-  
-  # Create a grouping vector for dataset labels
-  dataset_labels <- rep(c("Reference", "Query"), times = c(length(reference_gene_expression), 
-                                                           length(query_gene_expression)))
-  
-  # Combine the gene expression values and dataset labels
-  data <- data.frame(
-    GeneExpression = combined_gene_expression,
-    Dataset = dataset_labels
-  )
-  
-  # Create an overlaid density plot using ggplot2 for the overall dataset
-  overall_plot <- ggplot(data, aes(x = GeneExpression, fill = Dataset)) +
-    geom_density(alpha = 0.5) +
-    labs(title = paste("Overall Distribution"), 
-         x = paste("Log gene Expression", gene_name), 
-         y = "Density") +
-    theme_minimal()
-  
-  # Create a subset of data for cell type-specific distribution
-  index1 <- which(reference_data[[ref_cell_type_col]] %in%  label)
-  index2 <- which(query_data[[query_cell_type_col]] %in%  label)
-  
-  reference_gene_expression_cell_type <- assay(reference_data, "logcounts")[gene_name, index1]
-  query_gene_expression_cell_type <- assay(query_data, "logcounts")[gene_name, index2]
-  
-  # Combine the gene expression values and dataset labels for cell type-specific
-  combined_gene_expression <- c(reference_gene_expression_cell_type, 
-                                query_gene_expression_cell_type)
-  
-  # Create a grouping vector for dataset labels
-  dataset_labels <- rep(c("Reference", "Query"), 
-                        times = c(length(reference_gene_expression_cell_type), 
-                                  length(query_gene_expression_cell_type)))
-  
-  # Combine the gene expression values and dataset labels
-  cell_type_specific_data <- data.frame(
-    GeneExpression = combined_gene_expression,
-    Dataset = dataset_labels
-  )
-  
-  # Create an overlaid density plot using ggplot2 for the cell type-specific data
-  cell_type_specific_plot <- ggplot(cell_type_specific_data, 
-                                    aes(x = GeneExpression, fill = Dataset)) +
-    geom_density(alpha = 0.5) +
-    labs(title = paste("Cell Type-Specific Distribution"), 
-         x = paste("Log gene Expression", gene_name), 
-         y = "Density") +
-    theme_minimal()
-  
-  return(gridExtra::grid.arrange(overall_plot, cell_type_specific_plot, ncol = 2))
-}
\ No newline at end of file
diff --git a/R/plotQCvsAnnotation.R b/R/plotQCvsAnnotation.R
deleted file mode 100644
index e56a7e6..0000000
--- a/R/plotQCvsAnnotation.R
+++ /dev/null
@@ -1,131 +0,0 @@
-#' Scatter plot: QC stats vs Cell Type Annotation Scores
-#'
-#' Creates a scatter plot to visualize the relationship between QC stats (e.g., library size) 
-#' and cell type annotation scores for one or more cell types.
-#'
-#' @details This function generates a scatter plot to explore the relationship between various quality 
-#' control (QC) statistics, such as library size and mitochondrial percentage, and cell type 
-#' annotation scores. By examining these relationships, users can assess whether specific QC 
-#' metrics systematically influence the confidence in cell type annotations,
-#' which is essential for ensuring reliable cell type annotation.
-#' 
-#' @param query_data A \code{\linkS4class{SingleCellExperiment}} containing the single-cell 
-#' expression data and metadata.
-#' @param qc_col character. A column name in the \code{colData} of \code{query_data} that 
-#' contains the QC stats of interest.
-#' @param label_col character. The column name in the \code{colData} of \code{query_data} 
-#' that contains the cell type labels.
-#' @param score_col character. The column name in the \code{colData} of \code{query_data} that 
-#' contains the cell type annotation scores.
-#' @param label character. A vector of cell type labels to plot (e.g., c("T-cell", "B-cell")).  
-#' Defaults to \code{NULL}, which will include all the cells.
-#'
-#' @return A ggplot object displaying a scatter plot of QC stats vs annotation scores, 
-#'         where each point represents a cell, color-coded by its cell type.
-#'
-#' @examples
-#' \donttest{
-#' library(celldex)
-#' library(scater)
-#' library(scran)
-#' library(scRNAseq)
-#' library(SingleR)
-#'
-#' # load reference dataset
-#' ref_data <- fetchReference("hpca", "2024-02-26")
-#' 
-#' # Load query dataset (Bunis haematopoietic stem and progenitor cell data) from 
-#' # Bunis DG et al. (2021). Single-Cell Mapping of Progressive Fetal-to-Adult 
-#' # Transition in Human Naive T Cells Cell Rep. 34(1): 108573
-#' query_data <- BunisHSPCData()
-#' rownames(query_data) <- rowData(query_data)$Symbol
-#' 
-#' # Add QC metrics to query data
-#' query_data <- addPerCellQCMetrics(query_data)
-#' 
-#' # Log transform query dataset
-#' query_data <- logNormCounts(query_data)
-#' 
-#' # Run SingleR to predict cell types
-#' 
-#' pred <- SingleR(query_data, ref_data, labels = ref_data$label.main)
-#' 
-#' # Assign predicted labels to query data
-#' colData(query_data)$pred.labels <- pred$labels
-#' 
-#' # Get annotation scores
-#' scores <- apply(pred$scores, 1, max)
-#' 
-#' # Assign scores to query data
-#' colData(query_data)$cell_scores <- scores
-#' 
-#' # Create a scatter plot between library size and annotation scores
-#' 
-#' p1 <- plotQCvsAnnotation(
-#'       query_data = query_data,
-#'       qc_col = "total",
-#'       label_col = "pred.labels",
-#'       score_col = "cell_scores",
-#'       label = NULL)
-#' p1 + xlab("Library Size")
-#' }
-#' 
-#'                    
-#' @import ggplot2
-#' @export
-#'
-plotQCvsAnnotation <- function(query_data, 
-                               qc_col, 
-                               label_col, 
-                               score_col, 
-                               label = NULL) {
-  
-  # Sanity checks
-  
-  # Check if query_data is a SingleCellExperiment object
-  if (!is(query_data, "SingleCellExperiment")) {
-    stop("query_data must be a SingleCellExperiment object.")
-  }
-  
-  # Check if qc_col is a valid column name in query_data
-  if (!qc_col %in% colnames(colData(query_data))) {
-    stop("qc_col: '", qc_col, "' is not a valid column name in query_data.")
-  }
-  
-  # Check if label_col is a valid column name in query_data
-  if (!label_col %in% colnames(colData(query_data))) {
-    stop("label_col: '", label_col, "' is not a valid column name in query_data.")
-  }
-  
-  # Check if score_col is a valid column name in query_data
-  if (!score_col %in% colnames(colData(query_data))) {
-    stop("score_col: '", score_col, "' is not a valid column name in query_data.")
-  }
-  
-  # Filter cells based on label if specified
-  if (!is.null(label)) {
-    index <- which(colData(query_data)[[label_col]] %in% label)
-    query_data <- query_data[, index]
-  }
-  
-  # Extract QC stats, scores, and labels
-  qc_stats <- colData(query_data)[, qc_col]
-  cell_type_scores <- colData(query_data)[, score_col]
-  cell_labels <- colData(query_data)[[label_col]]
-  
-  # Combine QC stats, scores, and labels into a data frame
-  data <- data.frame(QCStats = qc_stats, 
-                     Scores = cell_type_scores, 
-                     CellType = cell_labels)
-  
-  # Create a scatter plot with color-coded points based on cell types or labels
-  plot <- ggplot(data, aes(x = QCStats, 
-                           y = Scores, 
-                           color = CellType)) +
-    geom_point() +
-    xlab("QC stats") +
-    ylab("Annotation Scores") +
-    theme_bw()
-  
-  return(plot)
-}
\ No newline at end of file
diff --git a/R/projectPCA.R b/R/projectPCA.R
deleted file mode 100644
index 6a5eaab..0000000
--- a/R/projectPCA.R
+++ /dev/null
@@ -1,180 +0,0 @@
-#' @title Project Query Data Onto PCA Space of Reference Data
-#'
-#' @description 
-#' This function projects a query \code{SingleCellExperiment} object onto the PCA space of a reference
-#' \code{SingleCellExperiment} object. The PCA on the reference data is assumed to be pre-computed and stored within the object.
-#'
-#' @details 
-#' This function assumes that the "PCA" element exists within the \code{reducedDims} of the reference data 
-#' (obtained using \code{reducedDim(reference_data)}) and that the genes used for PCA are present in both the reference and query data. 
-#' It performs centering and scaling of the query data based on the reference data before projection.
-#'
-#' @param query_data A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the query cells.
-#' @param reference_data A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the reference cells.
-#' @param n_components An integer specifying the number of principal components to use for projection. Defaults to 10. 
-#' Must be less than or equal to the number of components available in the reference PCA.
-#' @param query_cell_type_col character. The column name in the \code{colData} of \code{query_data} 
-#' that identifies the cell types.
-#' @param ref_cell_type_col character. The column name in the \code{colData} of \code{reference_data} 
-#' that identifies the cell types.
-#' @param return_value A character string specifying the format of the returned data. Can be \code{data.frame} (combined reference 
-#' and query projections) or \code{list} (separate lists for reference and query projections) (default = \code{data.frame}).
-#'
-#' @return A \code{data.frame} containing the projected data in rows (reference and query data combined) or a \code{list} containing 
-#' separate matrices for reference and query projections, depending on the \code{return_value} argument.
-#'
-#' @export
-#'
-#' @author Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu}
-#'
-#' @examples
-#' # Load required libraries
-#' library(scRNAseq)
-#' library(scuttle)
-#' library(SingleR)
-#' library(scran)
-#' library(scater)
-#' library(RColorBrewer)
-#'
-#' # Load data (replace with your data loading)
-#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-#' 
-#' # Divide the data into reference and query datasets
-#' set.seed(100)
-#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-#' ref_data <- sce[, indices]
-#' query_data <- sce[, -indices]
-#' 
-#' # log transform datasets
-#' ref_data <- scuttle::logNormCounts(ref_data)
-#' query_data <- scuttle::logNormCounts(query_data)
-#' 
-#' # Get cell type scores using SingleR (or any other cell type annotation method)
-#' scores <- SingleR::SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-#' 
-#' # Add labels to query object
-#' colData(query_data)$labels <- scores$labels
-#' 
-#' # Selecting highly variable genes (can be customized by the user)
-#' ref_var <- scran::getTopHVGs(ref_data, n = 2000)
-#' query_var <- scran::getTopHVGs(query_data, n = 2000)
-#' 
-#' # Intersect the gene symbols to obtain common genes
-#' common_genes <- intersect(ref_var, query_var)
-#' ref_data_subset <- ref_data[common_genes, ]
-#' query_data_subset <- query_data[common_genes, ]
-#'
-#' # Run PCA on the reference data (assumed to be prepared)
-#' ref_data_subset <- runPCA(ref_data_subset)
-#'
-#' # Project the query data onto PCA space of reference
-#' pca_output <- projectPCA(query_data_subset, ref_data_subset,
-#'                          n_components = 10,
-#'                          query_cell_type_col = "labels",
-#'                          ref_cell_type_col = "reclustered.broad",
-#'                          return_value = "data.frame")
-#'
-#' # Compute t-SNE and UMAP using first 10 PCs
-#' tsne_data <- data.frame(calculateTSNE(t(pca_output[, paste0("PC", 1:10)])))
-#' umap_data <- data.frame(calculateUMAP(t(pca_output[, paste0("PC", 1:10)])))
-#'
-#' # Combine the cell type labels from both datasets
-#' tsne_data$Type <- paste(pca_output$dataset, pca_output$cell_type)
-#'
-#' # Define the cell types and legend order
-#' legend_order <- c("Query CD8",
-#'                   "Reference CD8",
-#'                   "Query CD4",
-#'                   "Reference CD4",
-#'                   "Query B_and_plasma",
-#'                   "Reference B_and_plasma")
-#'
-#' # Define the colors for cell types
-#' color_palette <- brewer.pal(length(legend_order), "Paired")
-#' color_mapping <- setNames(color_palette, legend_order)
-#' cell_type_colors <- color_mapping[legend_order]
-#'
-#' # Visualize t-SNE output
-#' tsne_plot <- ggplot(tsne_data[tsne_data$Type %in% legend_order,],
-#'                     aes(x = TSNE1, y = TSNE2, color = factor(Type, levels = legend_order))) +
-#'     geom_point(alpha = 0.5, size = 1) +
-#'     scale_color_manual(values = cell_type_colors) +
-#'     theme_bw() +
-#'     guides(color = guide_legend(title = "Cell Types"))
-#' 
-#'
-# Function to project query data onto PCA space of reference data
-projectPCA <- function(query_data, reference_data, 
-                       n_components = 10, 
-                       query_cell_type_col = NULL, 
-                       ref_cell_type_col = NULL, 
-                       return_value = c("data.frame", "list")[1]){
-    
-    # Check if query_data is a SingleCellExperiment object
-    if (!is(query_data, "SingleCellExperiment")) {
-        stop("query_data must be a SingleCellExperiment object.")
-    }
-    
-    # Check if reference_data is a SingleCellExperiment object
-    if (!is(reference_data, "SingleCellExperiment")) {
-        stop("reference_data must be a SingleCellExperiment object.")
-    }
-    
-    # Check if "PCA" is present in reference's reduced dimensions
-    if (!"PCA" %in% names(reducedDims(reference_data))) {
-        stop("Reference data must have pre-computed PCA in \'reducedDims\'.")
-    }
-    
-    # Check if n_components is a positive integer (is.numeric() also accepts integer input such as 10L)
-    if (!is.numeric(n_components) || length(n_components) != 1) {
-        stop("n_components should be a single numeric value.")
-    } else if (any(!n_components == floor(n_components), n_components < 1)) {
-        stop("n_components should be an integer, greater than zero.")
-    }
-    
-    # Check if requested number of components is within available components
-    if (ncol(reducedDim(reference_data, "PCA")) < n_components) {
-        stop("\'n_components\' is larger than number of available components in reference PCA.")
-    }
-    
-    # Check that return_value is one of the supported output formats
-    if (!return_value %in% c("data.frame", "list")) {
-        stop("Invalid 'return_value'. Must be 'data.frame' or 'list'.")
-    }
-    
-    # Extract reference PCA components and rotation matrix (drop = FALSE keeps matrices when n_components = 1)
-    ref_mat <- reducedDim(reference_data, "PCA")[, seq_len(n_components), drop = FALSE]
-    rotation_mat <- attributes(reducedDim(reference_data, "PCA"))$rotation[, seq_len(n_components), drop = FALSE]
-    PCA_genes <- rownames(rotation_mat)
-    
-    # Check if genes used for PCA are available in query data
-    if (!all(PCA_genes %in% rownames(assay(query_data)))) {
-        stop("Genes in reference PCA are not found in query data.")
-    }
-    
-    # Center query data with the reference gene means (no scaling) and project onto the reference PCs
-    centering_vec <- rowMeans(as.matrix(assay(reference_data, "logcounts"))[PCA_genes, , drop = FALSE])
-    query_mat <- scale(t(as.matrix(assay(query_data, "logcounts")))[, PCA_genes, drop = FALSE], center = centering_vec, scale = FALSE) %*%
-        rotation_mat
-    
-    # Cell type labels (NA if no column is given); as.character() avoids the factor-to-integer coercion done by ifelse()
-    ref_cell_type <- if (is.null(ref_cell_type_col)) {
-        rep(NA_character_, nrow(ref_mat))
-    } else {
-        as.character(colData(reference_data)[[ref_cell_type_col]])
-    }
-    query_cell_type <- if (is.null(query_cell_type_col)) {
-        rep(NA_character_, nrow(query_mat))
-    } else {
-        as.character(colData(query_data)[[query_cell_type_col]])
-    }
-    # Return a combined data frame or a list of two data frames
-    if (return_value == "data.frame") {
-        return(data.frame(rbind(ref_mat, query_mat),
-                          dataset = c(rep("Reference", nrow(ref_mat)), rep("Query", nrow(query_mat))),
-                          cell_type = c(ref_cell_type, query_cell_type)))
-    } else if (return_value == "list") {
-        return(list(ref = data.frame(ref_mat, cell_type = ref_cell_type),
-                    query = data.frame(query_mat, cell_type = query_cell_type)))
-    }
-}
\ No newline at end of file
diff --git a/R/regressPC.R b/R/regressPC.R
deleted file mode 100644
index 5a2c8f8..0000000
--- a/R/regressPC.R
+++ /dev/null
@@ -1,273 +0,0 @@
-
-#' Principal component regression
-#'
-#' This function regresses one or more principal components on a covariate of
-#' interest, based on the data in a \code{SingleCellExperiment} object, and
-#' reports how much of the variance in each component the covariate explains.
-#'
-#' @details Principal component regression, derived from PCA, can be used to
-#'   quantify the variance explained by a covariate of interest. Applications for
-#'   single-cell analysis include quantification of batch removal, assessing
-#'   clustering homogeneity, and evaluation of alignment of query and reference
-#'   datasets in cell type annotation settings. Briefly, an R^2 value is
-#'   obtained from a linear regression of each principal component (PC) on the
-#'   covariate B of interest. The variance contribution of the covariate per
-#'   principal component is then calculated as the product of the variance
-#'   explained by the ith PC and the corresponding R^2(PC_i | B). The sum of
-#'   these contributions across all G principal components gives the total
-#'   variance explained by the covariate:
-#'
-#'   Var(C | B) = sum_{i=1}^G Var(C | PC_i) * R^2(PC_i | B)
-#'
-#'   where Var(C | PC_i) is the variance of the data matrix C explained by the
-#'   ith principal component. See references.
-#'
-#'   If the input is large (>3e4 cells) and the independent variable is
-#'   categorical with >10 categories, this function will use a stripped down
-#'   linear model function that is faster but doesn't return all the same
-#'   components. Namely, the \code{regression.summaries} component of the result
-#'   will contain only the R^2 values, nothing else.
-#'
-#' @param sce An object of class \code{\linkS4class{SingleCellExperiment}}
-#'   containing the data for regression analysis.
-#'
-#' @param dep.vars character. Dependent variable(s). Determines which principal
-#'   component(s) (e.g., "PC1", "PC2", etc.) are used as response variables.
-#'   Principal components are expected to be stored in a PC matrix named
-#'   \code{"PCA"} in the \code{reducedDims} of \code{sce}. Defaults to
-#'   \code{NULL}, in which case each principal component present in the PC
-#'   matrix is regressed on \code{indep.var}.
-#'
-#' @param indep.var character. Independent variable. A column name in the
-#'   \code{colData} of \code{sce} specifying the explanatory variable.
-#'
-#' @param regressPC_res a result from \code{\link{regressPC}}
-#'
-#' @param max_pc The maximum number of PCs to show on the plot. Set to 0 to show
-#'   all.
-#'
-#' @return A \code{list} containing \itemize{ \item summaries of the linear
-#'   regression models for each specified principal component, \item the
-#'   corresponding R-squared (R2) values, \item the variance contributions for
-#'   each principal component, and \item the total variance explained.}
-#'
-#' @references Luecken et al. Benchmarking atlas-level data integration in
-#'   single-cell genomics. Nature Methods, 19:41-50, 2022.
-#'
-#' @examples
-#' library(scater)
-#' library(scran)
-#' library(scRNAseq)
-#' library(SingleR)
-#'
-#' # Load data
-#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-#'
-#' # Divide the data into reference and query datasets
-#' set.seed(100)
-#' indices <- sample(ncol(sce),
-#'     size = floor(0.7 * ncol(sce)),
-#'     replace = FALSE
-#' )
-#' ref <- sce[, indices]
-#' query <- sce[, -indices]
-#'
-#' # log transform datasets
-#' ref <- logNormCounts(ref)
-#' query <- logNormCounts(query)
-#'
-#' # Run PCA
-#' query <- runPCA(query)
-#'
-#' # Get cell type scores using SingleR
-#' # Note: replace when using cell type annotation scores from other methods
-#' scores <- SingleR(query, ref, labels = ref$reclustered.broad)
-#'
-#' # Add labels to query object
-#' query$labels <- scores$labels
-#'
-#' # Specify the dependent variables (principal components) and
-#' # independent variable (e.g., "labels")
-#' dep.vars <- paste0("PC", 1:3)
-#' indep.var <- "labels"
-#'
-#' # Perform linear regression on multiple principal components
-#' res <- regressPC(
-#'     sce = query,
-#'     dep.vars = dep.vars,
-#'     indep.var = indep.var
-#' )
-#'
-#' # Obtain linear regression summaries and R-squared values
-#' res$regression.summaries
-#' res$rsquared
-#'
-#' # Plot the R-squared value of each principal component
-#' plotPCRegression(query, res, dep.vars, indep.var)
-#'
-#' @importFrom stats lm
-#' @importFrom utils tail
-#' @importFrom rlang .data
-#' @import SingleCellExperiment
-#' @export
-regressPC <-
-    function(
-        sce,
-        dep.vars = NULL,
-        indep.var) {
-        ## sanity checks
-        stopifnot(is(sce, "SingleCellExperiment"))
-        stopifnot("PCA" %in% reducedDimNames(sce))
-
-        if (!is.null(dep.vars)) {
-            stopifnot(all(dep.vars %in% colnames(reducedDim(sce, "PCA"))))
-        }
-
-        stopifnot(indep.var %in% colnames(colData(sce)))
-
-        ## regress against all PCs if not instructed otherwise
-        if (is.null(dep.vars)) {
-            dep.vars <- colnames(reducedDim(sce, "PCA"))
-        }
-
-        ## create a data frame with the dependent and independent variables
-        ## (drop = FALSE keeps the PC column name when a single PC is requested)
-        df <- data.frame(
-            Independent = sce[[indep.var]],
-            reducedDim(sce, "PCA")[, dep.vars, drop = FALSE])
-
-        ## perform linear regression for each principal component
-        .regress <- function(pc, df) {
-            f <- paste0(pc, " ~ Independent")
-            model <- lm(f, data = df)
-            s <- summary(model)
-            return(s)
-        }
-
-        .regress_fast <- function(df) {
-            # This does the lms for large, categorical independent variables in
-            # one sweep.
-            ssts <- vapply(
-                df[, dep.vars, drop = FALSE],
-                \(x) sum((x - mean(x, na.rm = TRUE))^2,
-                    na.rm = TRUE
-                ),
-                1.0
-            )
-
-            indp_list <- split(
-                df,
-                df$Independent
-            )
-
-            .get_sses <- function(x) {
-                vapply(
-                    x[, dep.vars, drop = FALSE],
-                    \(z) sum((z - mean(z, na.rm = TRUE))^2,
-                        na.rm = TRUE
-                    ),
-                    1.0
-                )
-            }
-
-            sses <- rowSums(vapply(
-                indp_list,
-                .get_sses,
-                rep(1, length(dep.vars))
-            ))
-
-            s <- mapply(
-                \(err, tot) {
-                    list(
-                        "r.squared" = 1 - err / tot,
-                        "regression.summaries" = NA
-                    )
-                },
-                sses, ssts,
-                SIMPLIFY = FALSE
-            )
-
-            return(s)
-        }
-
-        needs_fastlm <- (nrow(df) > 3e4) &&
-            (is.character(df$Independent) || is.factor(df$Independent)) &&
-            (length(unique(df$Independent)) > 10)
-
-        if (needs_fastlm) {
-            summaries <- .regress_fast(df)
-        } else {
-            summaries <- lapply(dep.vars, .regress, df = df)
-        }
-        names(summaries) <- dep.vars
-
-        ## extract the R-squared value from each regression summary
-        rsq <- vapply(summaries, \(s) s[["r.squared"]], numeric(1))
-
-        ## calculate variance contributions by principal component
-        ind <- match(dep.vars, colnames(reducedDim(sce, "PCA")))
-        var.expl <- attr(reducedDim(sce, "PCA"), "percentVar")[ind]
-        var.contr <- var.expl * rsq
-
-        ## calculate total variance explained by summing the variance contributions
-        total.var.expl <- sum(var.contr)
-
-        ## return the summaries of the linear regression models,
-        ## R-squared values, and variance contributions
-        res <- list(
-            regression.summaries = summaries,
-            rsquared = rsq,
-            var.contributions = var.contr,
-            total.variance.explained = total.var.expl
-        )
-
-        res
-    }
-
-#' @rdname regressPC
-#' @export
-plotPCRegression <- function(
-        sce,
-        regressPC_res,
-        dep.vars = NULL,
-        indep.var,
-        max_pc = 20) {
-
-    stopifnot(is(sce, "SingleCellExperiment"))
-    stopifnot("PCA" %in% reducedDimNames(sce))
-    if (!is.null(dep.vars)) {
-        stopifnot(all(dep.vars %in% colnames(reducedDim(sce, "PCA"))))
-    }
-    stopifnot(indep.var %in% colnames(colData(sce)))
-
-    if (is.null(dep.vars)) {
-        dep.vars <- colnames(reducedDim(sce, "PCA"))
-    }
-
-    if (max_pc == 0 || max_pc > length(dep.vars)) max_pc <- length(dep.vars)
-
-    p2_input <- data.frame(
-        x = dep.vars[seq_len(max_pc)],
-        i = seq_len(max_pc),
-        r2 = regressPC_res$rsquared[seq_len(max_pc)]
-    )
-
-    p2 <- ggplot2::ggplot(p2_input, ggplot2::aes(.data$i, .data$r2)) +
-        ggplot2::geom_point() +
-        ggplot2::geom_line() +
-        ggplot2::theme_bw() +
-        ggplot2::ylim(c(0, 1)) +
-        ggplot2::labs(
-            y = bquote(R^2 ~ of ~ "PC ~ " ~ .(indep.var))
-        ) +
-        ggplot2::scale_x_continuous(
-            breaks = p2_input$i,
-            labels = p2_input$x
-        ) +
-        ggplot2::theme(
-            axis.title.x = ggplot2::element_blank(),
-            panel.grid.minor = ggplot2::element_blank()
-        )
-
-    return(p2)
-}
diff --git a/R/visualizeCellTypeMDS.R b/R/visualizeCellTypeMDS.R
deleted file mode 100644
index 6e6bf33..0000000
--- a/R/visualizeCellTypeMDS.R
+++ /dev/null
@@ -1,138 +0,0 @@
-#' Visualizing Reference and Query Cell Types using MDS
-#'
-#' This function facilitates the assessment of similarity between reference and query datasets 
-#' through Multidimensional Scaling (MDS) scatter plots. It allows the visualization of cell types, 
-#' color-coded with user-defined custom colors, based on a dissimilarity matrix computed from a 
-#' user-selected gene set.
-#' 
-#' @details To evaluate dataset similarity, the function selects specific subsets of cells from 
-#' both reference and query datasets. It then calculates Spearman correlations between gene expression profiles, 
-#' deriving a dissimilarity matrix. This matrix undergoes Classical Multidimensional Scaling (MDS) for 
-#' visualization, presenting cell types in a scatter plot, distinguished by colors defined by the user.
-#' 
-#' @param query_data A \code{\linkS4class{SingleCellExperiment}} containing the single-cell 
-#' expression data and metadata.
-#' @param reference_data A \code{\linkS4class{SingleCellExperiment}} object containing the single-cell 
-#' expression data and metadata.
-#' @param cell_types A character vector specifying the cell types to include in the plot. If NULL, all cell types are included.
-#' @param query_cell_type_col character. The column name in the \code{colData} of \code{query_data} 
-#' that identifies the cell types.
-#' @param ref_cell_type_col character. The column name in the \code{colData} of \code{reference_data} 
-#' that identifies the cell types.
-#' 
-#' @return A ggplot object representing the MDS scatter plot with cell type coloring.
-#'
-#' @examples
-#' library(scater)
-#' library(scran)
-#' library(scRNAseq)
-#'
-#' # Load data (replace with your data loading)
-#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-#' 
-#' # Divide the data into reference and query datasets
-#' set.seed(100)
-#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-#' ref_data <- sce[, indices]
-#' query_data <- sce[, -indices]
-#' 
-#' # log transform datasets
-#' ref_data <- scuttle::logNormCounts(ref_data)
-#' query_data <- scuttle::logNormCounts(query_data)
-#' 
-#' # Get cell type scores using SingleR (or any other cell type annotation method)
-#' scores <- SingleR::SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-#' 
-#' # Add labels to query object
-#' colData(query_data)$labels <- scores$labels
-#' 
-#' # Selecting highly variable genes (can be customized by the user)
-#' ref_var <- scran::getTopHVGs(ref_data, n = 2000)
-#' query_var <- scran::getTopHVGs(query_data, n = 2000)
-#' 
-#' # Intersect the gene symbols to obtain common genes
-#' common_genes <- intersect(ref_var, query_var)
-#' ref_data_subset <- ref_data[common_genes, ]
-#' query_data_subset <- query_data[common_genes, ]
-#' 
-#' # Generate the MDS scatter plot with cell type coloring
-#' plot <- visualizeCellTypeMDS(query_data = query_data_subset, 
-#'                              reference_data = ref_data_subset, 
-#'                              query_cell_type_col = "labels", 
-#'                              ref_cell_type_col = "reclustered.broad")
-#' print(plot)
-#'
-#' @importFrom stats cmdscale cor na.omit setNames
-#' @importFrom ggplot2 ggplot
-#' @importFrom SummarizedExperiment assay
-#' @export
-#' 
-visualizeCellTypeMDS <- function(query_data, 
-                                 reference_data, 
-                                 cell_types = NULL,
-                                 query_cell_type_col, 
-                                 ref_cell_type_col) {
-
-    # Check if query_data is a SingleCellExperiment object
-    if (!is(query_data, "SingleCellExperiment")) {
-        stop("query_data must be a SingleCellExperiment object.")
-    }
-    
-    # Check if reference_data is a SingleCellExperiment object
-    if (!is(reference_data, "SingleCellExperiment")) {
-        stop("reference_data must be a SingleCellExperiment object.")
-    }
-    
-    # Check if query_cell_type_col is a valid column name in query_data
-    if (!query_cell_type_col %in% names(colData(query_data))) {
-      stop("query_cell_type_col: '", query_cell_type_col, "' is not a valid column name in query_data.")
-    }
-    
-    # Check if ref_cell_type_col is a valid column name in reference_data
-    if (!ref_cell_type_col %in% names(colData(reference_data))) {
-      stop("ref_cell_type_col: '", ref_cell_type_col, "' is not a valid column name in reference_data.")
-    }
-    
-    # Check if cell types available in both single-cell experiments
-    if(!all(cell_types %in% reference_data[[ref_cell_type_col]]) || 
-       !all(cell_types %in% query_data[[query_cell_type_col]]))
-        stop("One or more of the specified cell types are not available in \'reference_data\' or \'query_data\'.")
-    
-    # Cell types
-    if(is.null(cell_types)){
-        cell_types <- na.omit(intersect(unique(query_data[[query_cell_type_col]]), unique(reference_data[[ref_cell_type_col]])))
-    }
-
-    # Subset data
-    query_data <- query_data[, which(query_data[[query_cell_type_col]] %in% cell_types)]
-    reference_data <- reference_data[, which(reference_data[[ref_cell_type_col]] %in% cell_types)]
-    
-    # Extract logcounts
-    queryExp <- as.matrix(assay(query_data, "logcounts"))
-    refExp <- as.matrix(assay(reference_data, "logcounts"))
-    
-    # Compute correlation and dissimilarity matrix
-    df <- cbind(queryExp, refExp)
-    corMat <- cor(df, method = "spearman")
-    disMat <- (1 - corMat)
-    cmd <- data.frame(cmdscale(disMat), c(rep("Query", ncol(queryExp)), rep("Reference", ncol(refExp))),
-                      c(query_data[[query_cell_type_col]], reference_data[[ref_cell_type_col]]))
-    colnames(cmd) <- c("Dim1", "Dim2", "dataset", "cellType")
-    cmd <- na.omit(cmd)
-    cmd$cell_type_dataset <- paste(cmd$dataset, cmd$cellType, sep = " ")
-
-    # Define the order of cell type and dataset combinations
-    order_combinations <- paste(rep(c("Reference", "Query"), length(cell_types)), rep(sort(cell_types), each = 2))
-    cmd$cell_type_dataset <- factor(cmd$cell_type_dataset, levels = order_combinations)
-    
-    # Define the colors for cell types
-    color_mapping <- setNames(RColorBrewer::brewer.pal(length(order_combinations), "Paired"), order_combinations)
-    cell_type_colors <- color_mapping[order_combinations]
-    
-    plot <- ggplot2::ggplot(cmd, ggplot2::aes(x = Dim1, y = Dim2, color = cell_type_dataset)) +
-      ggplot2::geom_point(alpha = 0.5, size = 1) +
-      ggplot2::scale_color_manual(values = cell_type_colors, name = "Cell Types") + 
-      ggplot2::theme_bw() +
-      ggplot2::guides(color = ggplot2::guide_legend(title = "Cell Types"))
-    return(plot)
-}
diff --git a/R/visualizeCellTypePCA.R b/R/visualizeCellTypePCA.R
deleted file mode 100644
index 2a88d52..0000000
--- a/R/visualizeCellTypePCA.R
+++ /dev/null
@@ -1,144 +0,0 @@
-#' @title Visualize Principal Components for Different Cell Types
-#'
-#' @description 
-#' This function plots the principal components for different cell types in the query and reference datasets.
-#'
-#' @details
-#' This function projects the query dataset onto the principal component space of the reference dataset and then visualizes the 
-#' specified principal components for the specified cell types.
-#' It uses the `projectPCA` function to perform the projection and `ggplot2` to create the plots.
-#'
-#' @param query_data A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the query cells.
-#' @param reference_data A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the reference cells.
-#' @param n_components An integer specifying the number of principal components to use for projection. Defaults to 10. 
-#' Must be less than or equal to the number of components available in the reference PCA.
-#' @param cell_types A character vector specifying the cell types to include in the plot. If NULL, all cell types are included.
-#' @param query_cell_type_col character. The column name in the \code{colData} of \code{query_data} 
-#' that identifies the cell types.
-#' @param ref_cell_type_col character. The column name in the \code{colData} of \code{reference_data} 
-#' that identifies the cell types.
-#' @param pc_subset A numeric vector specifying which principal components to include in the plot. Default is PC1 to PC5.
-#'
-#' @return A ggplot object with scatter plots of each pair of the specified principal components, colored by cell type and dataset.
-#'
-#' @export
-#'
-#' @author Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu}
-#'
-#' @examples
-#' # Load required libraries
-#' library(scRNAseq)
-#' library(scuttle)
-#' library(SingleR)
-#' library(scran)
-#' library(scater)
-#'
-#' # Load data (replace with your data loading)
-#' sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-#' 
-#' # Divide the data into reference and query datasets
-#' set.seed(100)
-#' indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-#' ref_data <- sce[, indices]
-#' query_data <- sce[, -indices]
-#' 
-#' # log transform datasets
-#' ref_data <- scuttle::logNormCounts(ref_data)
-#' query_data <- scuttle::logNormCounts(query_data)
-#' 
-#' # Get cell type scores using SingleR (or any other cell type annotation method)
-#' scores <- SingleR::SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-#' 
-#' # Add labels to query object
-#' colData(query_data)$labels <- scores$labels
-#' 
-#' # Selecting highly variable genes (can be customized by the user)
-#' ref_var <- scran::getTopHVGs(ref_data, n = 2000)
-#' query_var <- scran::getTopHVGs(query_data, n = 2000)
-#' 
-#' # Intersect the gene symbols to obtain common genes
-#' common_genes <- intersect(ref_var, query_var)
-#' ref_data_subset <- ref_data[common_genes, ]
-#' query_data_subset <- query_data[common_genes, ]
-#'
-#' # Run PCA on the reference data (assumed to be prepared)
-#' ref_data_subset <- runPCA(ref_data_subset)
-#'
-#' pc_plot <- visualizeCellTypePCA(query_data_subset, ref_data_subset,
-#'                                 n_components = 10,
-#'                                 cell_types = c("CD4", "CD8", "B_and_plasma", "Myeloid"),
-#'                                 query_cell_type_col = "labels", 
-#'                                 ref_cell_type_col = "reclustered.broad", 
-#'                                 pc_subset = c(1:5))
-#' pc_plot
-#' 
-#' 
-#' @importFrom stats approxfun cancor density na.omit setNames
-#' @importFrom utils combn
-#'                          
-# Function to plot PC for different cell types
-visualizeCellTypePCA <- function(query_data, reference_data, 
-                                 n_components = 10, 
-                                 cell_types = NULL,
-                                 query_cell_type_col, 
-                                 ref_cell_type_col, 
-                                 pc_subset = c(1:5)){
-    
-    # Cell types
-    if(is.null(cell_types)){
-        cell_types <- na.omit(intersect(unique(query_data[[query_cell_type_col]]), unique(reference_data[[ref_cell_type_col]])))
-    }
-    
-    # Get the projected PCA data
-    pca_output <- projectPCA(query_data = query_data, reference_data = reference_data, 
-                             n_components = n_components, 
-                             query_cell_type_col = query_cell_type_col, 
-                             ref_cell_type_col = ref_cell_type_col)
-    pca_output <- na.omit(pca_output)
-
-    # Create all possible pairs of specified PCs
-    plot_names <- paste0("PC", pc_subset)
-    pairs <- expand.grid(x = plot_names, y = plot_names)
-    pairs <- pairs[pairs$x != pairs$y, ]
-    # Build one data frame per PC pair; select columns by name, since indexing by a factor uses its integer codes
-    data_pairs_list <- lapply(seq_len(nrow(pairs)), function(i) {
-        x_col <- pairs$x[i]
-        y_col <- pairs$y[i]
-        data_frame <- data.frame(pca_output[, c(as.character(x_col), as.character(y_col))], paste(pca_output$dataset, pca_output$cell_type, sep = " "))
-        colnames(data_frame) <- c("x_value", "y_value", "cell_type_dataset")
-        data_frame$x <- x_col
-        data_frame$y <- y_col
-        data_frame
-    })
-    # Combine the per-pair data frames
-    data_pairs <- do.call(rbind, data_pairs_list)
-    # Keep each pair once (x before y) to avoid duplicated panels
-    data_pairs <- data_pairs[as.numeric(data_pairs$x) < as.numeric(data_pairs$y),]
-    
-    # Define the order of cell type and dataset combinations
-    order_combinations <- paste(rep(c("Reference", "Query"), length(cell_types)), rep(sort(cell_types), each = 2))
-    data_pairs$cell_type_dataset <- factor(data_pairs$cell_type_dataset, levels = order_combinations)
-    color_mapping <- setNames(RColorBrewer::brewer.pal(length(order_combinations), "Paired"), order_combinations)
-    cell_type_colors <- color_mapping[order_combinations]
-
-    # Create the ggplot object (with facets if PCA)
-    plot_obj <- ggplot2::ggplot(data_pairs, ggplot2::aes(x = x_value, y = y_value, color = cell_type_dataset)) +
-        ggplot2::geom_point(alpha = 0.5, size = 1) +
-        ggplot2::scale_color_manual(values = cell_type_colors, name = "Cell Types") + 
-        ggplot2::facet_grid(rows = ggplot2::vars(y), cols = ggplot2::vars(x), scales = "free") +
-        ggplot2::theme_bw() +
-        ggplot2::theme(strip.background = ggplot2::element_rect(fill = "grey85", color = "grey70"),   
-                       strip.text = ggplot2::element_text(size = 10, face = "bold", color = "black"), 
-                       axis.title = ggplot2::element_blank(),        
-                       axis.text = ggplot2::element_text(size = 10), 
-                       panel.grid = ggplot2::element_blank(),        
-                       panel.background = ggplot2::element_rect(fill = "white", color = "black"), 
-                       legend.position = "right",          
-                       plot.title = ggplot2::element_text(size = 14, hjust = 0.5), 
-                       plot.background = ggplot2::element_rect(fill = "white")) 
-    
-    # Return the plot
-    return(plot_obj)
-}
-
-
diff --git a/man/boxplotPCA.Rd b/man/boxplotPCA.Rd
deleted file mode 100644
index af526d2..0000000
--- a/man/boxplotPCA.Rd
+++ /dev/null
@@ -1,101 +0,0 @@
-% Generated by roxygen2: do not edit by hand
-% Please edit documentation in R/boxplotPCA.R
-\name{boxplotPCA}
-\alias{boxplotPCA}
-\title{Plot Principal Components for Different Cell Types}
-\usage{
-boxplotPCA(
-  query_data,
-  reference_data,
-  n_components = 10,
-  cell_types = NULL,
-  query_cell_type_col = NULL,
-  ref_cell_type_col = NULL,
-  pc_subset = c(1:5)
-)
-}
-\arguments{
-\item{query_data}{A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the query cells.}
-
-\item{reference_data}{A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the reference cells.}
-
-\item{n_components}{An integer specifying the number of principal components to use for projection. Defaults to 10. 
-Must be less than or equal to the number of components available in the reference PCA.}
-
-\item{cell_types}{A character vector specifying the cell types to include in the plot. If NULL, all cell types are included.}
-
-\item{query_cell_type_col}{character. The column name in the \code{colData} of \code{query_data} 
-that identifies the cell types.}
-
-\item{ref_cell_type_col}{character. The column name in the \code{colData} of \code{reference_data} 
-that identifies the cell types.}
-
-\item{pc_subset}{A numeric vector specifying which principal components to include in the plot. Default is PC1 to PC5.}
-}
-\value{
-A ggplot object representing the boxplots of specified principal components for the given cell types and datasets.
-}
-\description{
-This function generates a \code{ggplot2} boxplot visualization of principal components (PCs) for different 
-cell types across two datasets (query and reference).
-}
-\details{
-The function \code{boxplotPCA} is designed to provide a visualization of principal component analysis (PCA) results. It projects 
-the query dataset onto the principal components obtained from the reference dataset. The results are then visualized 
-as boxplots, grouped by cell types and datasets (query and reference). This allows for a comparative analysis of the 
-distributions of the principal components across different cell types and datasets. The function internally calls \code{projectPCA} 
-to perform the PCA projection. It then reshapes the output data into a long format suitable for ggplot2 plotting. 
-The color scheme is automatically determined using the \code{RColorBrewer} package, ensuring a visually distinct and appealing plot.
-}
-\examples{
-# Load required libraries
-library(scRNAseq)
-library(scuttle)
-library(SingleR)
-library(scran)
-library(scater)
-
-# Load data (replace with your data loading)
-sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-
-# Divide the data into reference and query datasets
-set.seed(100)
-indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-ref_data <- sce[, indices]
-query_data <- sce[, -indices]
-
-# log transform datasets
-ref_data <- scuttle::logNormCounts(ref_data)
-query_data <- scuttle::logNormCounts(query_data)
-
-# Get cell type scores using SingleR (or any other cell type annotation method)
-scores <- SingleR::SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-
-# Add labels to query object
-colData(query_data)$labels <- scores$labels
-
-# Selecting highly variable genes (can be customized by the user)
-ref_var <- scran::getTopHVGs(ref_data, n = 2000)
-query_var <- scran::getTopHVGs(query_data, n = 2000)
-
-# Intersect the gene symbols to obtain common genes
-common_genes <- intersect(ref_var, query_var)
-ref_data_subset <- ref_data[common_genes, ]
-query_data_subset <- query_data[common_genes, ]
-
-# Run PCA on the reference data (assumed to be prepared)
-ref_data_subset <- runPCA(ref_data_subset)
-
-pc_plot <- boxplotPCA(query_data_subset, ref_data_subset,
-                      n_components = 10,
-                      cell_types = c("CD4", "CD8", "B_and_plasma", "Myeloid"),
-                      query_cell_type_col = "labels", 
-                      ref_cell_type_col = "reclustered.broad", 
-                      pc_subset = c(1:5))
-pc_plot
-
-
-}
-\author{
-Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu}
-}
diff --git a/man/calculateAveragePairwiseCorrelation.Rd b/man/calculateAveragePairwiseCorrelation.Rd
deleted file mode 100644
index 535f72c..0000000
--- a/man/calculateAveragePairwiseCorrelation.Rd
+++ /dev/null
@@ -1,117 +0,0 @@
-% Generated by roxygen2: do not edit by hand
-% Please edit documentation in R/calculateAveragePairwiseCorrelation.R
-\name{calculateAveragePairwiseCorrelation}
-\alias{calculateAveragePairwiseCorrelation}
-\title{Compute Average Pairwise Correlation between Cell Types}
-\usage{
-calculateAveragePairwiseCorrelation(
-  query_data,
-  reference_data,
-  n_components = 10,
-  query_cell_type_col,
-  ref_cell_type_col,
-  cell_types,
-  correlation_method
-)
-}
-\arguments{
-\item{query_data}{A \code{\linkS4class{SingleCellExperiment}} containing the single-cell 
-expression data and metadata.}
-
-\item{reference_data}{A \code{\linkS4class{SingleCellExperiment}} object containing the single-cell 
-expression data and metadata.}
-
-\item{n_components}{The number of principal components to use for the dimensionality reduction of the data using PCA. Defaults to 10.
-If set to \code{NULL} then no dimensionality reduction is performed and the assay data is used directly for computations.}
-
-\item{query_cell_type_col}{character. The column name in the \code{colData} of \code{query_data} 
-that identifies the cell types.}
-
-\item{ref_cell_type_col}{character. The column name in the \code{colData} of \code{reference_data} 
-that identifies the cell types.}
-
-\item{cell_types}{A character vector specifying the cell types to consider in the analysis.}
-
-\item{correlation_method}{The correlation method to use for calculating pairwise correlations.}
-}
-\value{
-A matrix containing the average pairwise correlation values. 
-        Rows and columns are labeled with the cell types. Each element 
-        in the matrix represents the average correlation between a pair 
-        of cell types.
-}
-\description{
-Computes the average pairwise correlations between specified cell types 
-in single-cell gene expression data.
-}
-\details{
-This function operates on \code{\linkS4class{SingleCellExperiment}} objects, 
-ideal for single-cell analysis workflows. It calculates pairwise correlations between query and 
-reference cells using a specified correlation method, then averages these correlations for each 
-cell type pair. This function aids in assessing the similarity between cells in reference and query datasets, 
-providing insights into the reliability of cell type annotations in single-cell gene expression data.
-}
-\examples{
-library(scater)
-library(scran)
-library(scRNAseq)
-library(SingleR)
-
-# Load data
-sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-
-# Divide the data into reference and query datasets
-set.seed(100)
-indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-ref_data <- sce[, indices]
-query_data <- sce[, -indices]
-
-# log transform datasets
-ref_data <- logNormCounts(ref_data)
-query_data <- logNormCounts(query_data)
-
-# Get cell type scores using SingleR
-scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-
-# Add labels to query object
-colData(query_data)$labels <- scores$labels
-
-# Compute Pairwise Correlations
-# Note: The selection of highly variable genes and desired cell types may vary 
-# based on user preference. 
-# The cell type annotation method used in this example is SingleR. 
-# Users can use any other method for cell type annotation and provide 
-# the corresponding labels in the metadata.
-
-# Selecting highly variable genes
-ref_var <- getTopHVGs(ref_data, n = 2000)
-query_var <- getTopHVGs(query_data, n = 2000)
-
-# Intersect the gene symbols to obtain common genes
-common_genes <- intersect(ref_var, query_var)
-
-# Select desired cell types
-selected_cell_types <- c("CD4", "CD8", "B_and_plasma")
-ref_data_subset <- ref_data[common_genes, ref_data$reclustered.broad \%in\% selected_cell_types]
-query_data_subset <- query_data[common_genes, query_data$reclustered.broad \%in\% selected_cell_types]
-
-# Run PCA on the reference data
-ref_data_subset <- runPCA(ref_data_subset)
-
-# Compute pairwise correlations
-cor_matrix_avg <- calculateAveragePairwiseCorrelation(query_data = query_data_subset, 
-                                                      reference_data = ref_data_subset, 
-                                                      n_components = 10,
-                                                      query_cell_type_col = "labels", 
-                                                      ref_cell_type_col = "reclustered.broad", 
-                                                      cell_types = selected_cell_types, 
-                                                      correlation_method = "spearman")
-
-# Visualize the results
-plot(cor_matrix_avg)
-
-
-}
-\seealso{
-\code{\link{plot.calculateAveragePairwiseCorrelation}}
-}
diff --git a/man/calculateCategorizationEntropy.Rd b/man/calculateCategorizationEntropy.Rd
deleted file mode 100644
index eea6d26..0000000
--- a/man/calculateCategorizationEntropy.Rd
+++ /dev/null
@@ -1,58 +0,0 @@
-% Generated by roxygen2: do not edit by hand
-% Please edit documentation in R/calculateCategorizationEntropy.R
-\name{calculateCategorizationEntropy}
-\alias{calculateCategorizationEntropy}
-\title{Calculate Categorization Entropy}
-\usage{
-calculateCategorizationEntropy(
-  X,
-  inverse_normal_transform = FALSE,
-  plot = TRUE,
-  verbose = TRUE
-)
-}
-\arguments{
-\item{X}{a matrix of category scores}
-
-\item{inverse_normal_transform}{if TRUE, apply an inverse normal transformation to X}
-
-\item{plot}{if TRUE, plot a histogram of the entropies}
-
-\item{verbose}{if TRUE, display messages about the calculations}
-}
-\value{
-A vector of entropy values for each column in X.
-}
-\description{
-This function takes a matrix of category scores (cell type by
-  cells) and calculates the entropy of the category probabilities for each
-  cell. This gives a sense of how confident the cell type assignments are.
-  High entropy = lots of plausible category assignments = low confidence. Low
-  entropy = only one or two plausible categories = high confidence. This is
-  confidence in the vernacular sense, not in the "confidence interval"
-  statistical sense. Also note that the entropy tells you nothing about
-  whether or not the assignments are correct -- see the other functionality
-  in the package for that. This functionality can be used for assessing how
-  comparatively confident different sets of assignments are (given that the
-  number of categories is the same).
-}
-\details{
-The function checks if X is already on the probability scale.
-  Otherwise, it applies softmax columnwise.
-
-  You can think about entropies on a scale from 0 to a maximum that depends
-  on the number of categories. This is the function for entropy (minus input
-  checking): \code{entropy(p) = -sum(p*log(p))}. If that input vector p is a
-  uniform distribution over the \code{length(p)} categories, the entropy will
-  be as high as possible.
-}
-\examples{
-# Simulate 500 cells with scores on 4 possible cell types
-X <- rnorm(500 * 4) |> matrix(nrow = 4)
-X[1, 1:250] <- X[1, 1:250] + 5 # Make the first category highly scored in the first 250 cells
-
-
-# The function will issue a message about softmaxing the scores, and the entropy histogram will be
-# bimodal since we made half of the cells clearly category 1 while the other half are roughly even.
-# entropy_scores <- calculateCategorizationEntropy(X)
-}
diff --git a/man/calculateHVGOverlap.Rd b/man/calculateHVGOverlap.Rd
deleted file mode 100644
index d0fd7a7..0000000
--- a/man/calculateHVGOverlap.Rd
+++ /dev/null
@@ -1,65 +0,0 @@
-% Generated by roxygen2: do not edit by hand
-% Please edit documentation in R/calculateHVGOverlap.R
-\name{calculateHVGOverlap}
-\alias{calculateHVGOverlap}
-\title{Calculate the Overlap Coefficient for Highly Variable Genes}
-\usage{
-calculateHVGOverlap(reference_genes, query_genes)
-}
-\arguments{
-\item{reference_genes}{character. A vector of highly variable genes from the reference dataset.}
-
-\item{query_genes}{character. A vector of highly variable genes from the query dataset.}
-}
-\value{
-Overlap coefficient, a value between 0 and 1, where 0 indicates no overlap 
-        and 1 indicates complete overlap of highly variable genes between datasets.
-}
-\description{
-Calculates the overlap coefficient between the sets of highly variable genes 
-from a reference dataset and a query dataset.
-}
-\details{
-The overlap coefficient measures the similarity between two gene sets, indicating how well-aligned 
-reference and query datasets are in terms of their highly variable genes. This metric is 
-useful in single-cell genomics to understand the correspondence between different datasets.
-
-The coefficient is calculated using the formula:
-
-\deqn{Coefficient(X, Y) = \frac{|X \cap Y|}{\min(|X|, |Y|)}}
-
-where X and Y are the sets of highly variable genes from the reference and query datasets, respectively,
-|X ∩ Y| is the number of genes common to both X and Y, and min(|X|, |Y|) is the size of the smaller set among X and Y.
-}
-\examples{
-library(scater)
-library(scran)
-library(scRNAseq)
-library(SingleR)
-
-# Load data
-sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-
-# Divide the data into reference and query datasets
-set.seed(100)
-indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-ref_data <- sce[, indices]
-query_data <- sce[, -indices]
-
-# log transform datasets
-ref_data <- logNormCounts(ref_data)
-query_data <- logNormCounts(query_data)
-
-# Selecting highly variable genes
-
-ref_var <- getTopHVGs(ref_data, n=2000)
-query_var <- getTopHVGs(query_data, n=2000)
-
-overlap_coefficient <- calculateHVGOverlap(reference_genes = ref_var, 
-                                          query_genes = query_var)
-
-}
-\references{
-Luecken et al. Benchmarking atlas-level data integration in
-single-cell genomics. Nature Methods, 19:41-50, 2022.
-}
diff --git a/man/calculateHotellingPValue.Rd b/man/calculateHotellingPValue.Rd
deleted file mode 100644
index 39e5605..0000000
--- a/man/calculateHotellingPValue.Rd
+++ /dev/null
@@ -1,95 +0,0 @@
-% Generated by roxygen2: do not edit by hand
-% Please edit documentation in R/calculateHotellingPValue.R
-\name{calculateHotellingPValue}
-\alias{calculateHotellingPValue}
-\title{Perform Hotelling's T-squared Test on PCA Scores for Single-cell RNA-seq Data}
-\usage{
-calculateHotellingPValue(
-  query_data,
-  reference_data,
-  n_components = 10,
-  query_cell_type_col,
-  ref_cell_type_col,
-  pc_subset = c(1:5)
-)
-}
-\arguments{
-\item{query_data}{A \code{\linkS4class{SingleCellExperiment}} object containing a numeric expression matrix for the query cells.}
-
-\item{reference_data}{A \code{\linkS4class{SingleCellExperiment}} object containing a numeric expression matrix for the reference cells.}
-
-\item{n_components}{An integer specifying the number of principal components to use for projection. Defaults to 10.}
-
-\item{query_cell_type_col}{character. The column name in the \code{colData} of \code{query_data} 
-that identifies the cell types.}
-
-\item{ref_cell_type_col}{character. The column name in the \code{colData} of \code{reference_data} 
-that identifies the cell types.}
-
-\item{pc_subset}{A numeric vector specifying which principal components to include in the test. Default is PC1 to PC5.}
-}
-\value{
-A named numeric vector of p-values from Hotelling's T-squared test for each cell type.
-}
-\description{
-This function performs Hotelling's T-squared test to assess the similarity between reference and query datasets 
-for each cell type based on their PCA scores.
-}
-\details{
-This function first performs PCA on the reference dataset and then projects the query dataset onto the PCA space 
-of the reference data. For each cell type, it computes pseudo-bulk signatures in the PCA space by averaging the principal 
-component scores of cells belonging to that cell type. Hotelling's T-squared test is then performed to compare the mean 
-vectors of the pseudo-bulk signatures between the reference and query datasets. A small p-value indicates that the mean PC scores 
-of the reference and query datasets differ for that cell type; a large p-value is consistent with similar means.
-}
-\examples{
-# Load required libraries
-library(scRNAseq)
-library(scuttle)
-library(SingleR)
-library(scran)
-library(scater)
-
-# Load data (replace with your data loading)
-sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-
-# Divide the data into reference and query datasets
-set.seed(100)
-indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-ref_data <- sce[, indices]
-query_data <- sce[, -indices]
-
-# log transform datasets
-ref_data <- scuttle::logNormCounts(ref_data)
-query_data <- scuttle::logNormCounts(query_data)
-
-# Get cell type scores using SingleR (or any other cell type annotation method)
-scores <- SingleR::SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-
-# Add labels to query object
-colData(query_data)$labels <- scores$labels
-
-# Selecting highly variable genes (can be customized by the user)
-ref_var <- scran::getTopHVGs(ref_data, n = 2000)
-query_var <- scran::getTopHVGs(query_data, n = 2000)
-
-# Intersect the gene symbols to obtain common genes
-common_genes <- intersect(ref_var, query_var)
-ref_data_subset <- ref_data[common_genes, ]
-query_data_subset <- query_data[common_genes, ]
-
-# Run PCA on the reference data
-ref_data_subset <- runPCA(ref_data_subset, ncomponents = 50)
-
-# Get the p-values from the test
-p_values <- calculateHotellingPValue(query_data_subset, ref_data_subset, 
-                                     n_components = 10, 
-                                     query_cell_type_col = "reclustered.broad", 
-                                     ref_cell_type_col = "reclustered.broad",
-                                     pc_subset = c(1:10)) 
-round(p_values, 5)
-                         
-}
-\author{
-Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu}
-}
diff --git a/man/calculatePairwiseDistancesAndPlotDensity.Rd b/man/calculatePairwiseDistancesAndPlotDensity.Rd
deleted file mode 100644
index 28f70a9..0000000
--- a/man/calculatePairwiseDistancesAndPlotDensity.Rd
+++ /dev/null
@@ -1,108 +0,0 @@
-% Generated by roxygen2: do not edit by hand
-% Please edit documentation in R/calculatePairwiseDistancesAndPlotDensity.R
-\name{calculatePairwiseDistancesAndPlotDensity}
-\alias{calculatePairwiseDistancesAndPlotDensity}
-\title{Pairwise Distance Analysis and Density Visualization}
-\usage{
-calculatePairwiseDistancesAndPlotDensity(
-  query_data,
-  reference_data,
-  n_components = 10,
-  query_cell_type_col,
-  ref_cell_type_col,
-  cell_type_query,
-  cell_type_reference,
-  distance_metric,
-  correlation_method = "pearson"
-)
-}
-\arguments{
-\item{query_data}{A \code{\linkS4class{SingleCellExperiment}} containing the single-cell 
-expression data and metadata.}
-
-\item{reference_data}{A \code{\linkS4class{SingleCellExperiment}} object containing the single-cell 
-expression data and metadata.}
-
-\item{n_components}{The number of principal components to use for the dimensionality reduction of the data using PCA. Defaults to 10.
-If set to \code{NULL} then no dimensionality reduction is performed and the assay data is used directly for computations.}
-
-\item{query_cell_type_col}{character. The column name in the \code{colData} of \code{query_data} 
-that identifies the cell types.}
-
-\item{ref_cell_type_col}{character. The column name in the \code{colData} of \code{reference_data} 
-that identifies the cell types.}
-
-\item{cell_type_query}{The query cell type for which distances or correlations are calculated.}
-
-\item{cell_type_reference}{The reference cell type for which distances or correlations are calculated.}
-
-\item{distance_metric}{The distance metric to use for calculating pairwise distances, such as "euclidean" or "manhattan".
-Set it to "correlation" for calculating correlation coefficients.}
-
-\item{correlation_method}{The correlation method to use when distance_metric is "correlation".
-Possible values: "pearson", "spearman".}
-}
-\value{
-A plot generated by \code{ggplot2}, showing the density distribution of 
-        calculated distances or correlations.
-}
-\description{
-Calculates pairwise distances or correlations between query and reference cells 
-of a specific cell type.
-}
-\details{
-The function works with \code{\linkS4class{SingleCellExperiment}} objects, ensuring 
-compatibility with common single-cell analysis workflows. It subsets the data for specified cell types, 
-computes pairwise distances or correlations, and visualizes these measurements using density plots. By comparing the distances and correlations, 
-one can evaluate the consistency and reliability of annotated cell types within single-cell datasets.
-}
-\examples{
-library(scran)
-library(scRNAseq)
-library(SingleR)
-library(scater)
-
-# Load data
-sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-
-# Divide the data into reference and query datasets
-set.seed(100)
-indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-ref_data <- sce[, indices]
-query_data <- sce[, -indices]
-
-# log transform datasets
-ref_data <- logNormCounts(ref_data)
-query_data <- logNormCounts(query_data)
-
-# Get cell type scores using SingleR (or any other cell type annotation method)
-scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-
-# Add labels to query object
-colData(query_data)$labels <- scores$labels
-
-# Selecting highly variable genes (can be customized by the user)
-ref_var <- getTopHVGs(ref_data, n = 2000)
-query_var <- getTopHVGs(query_data, n = 2000)
-
-# Intersect the gene symbols to obtain common genes
-common_genes <- intersect(ref_var, query_var)
-
-ref_data_subset <- ref_data[common_genes, ]
-query_data_subset <- query_data[common_genes, ]
-
-# Run PCA on the reference data
-ref_data_subset <- runPCA(ref_data_subset)
-
-# Example usage of the function
-calculatePairwiseDistancesAndPlotDensity(query_data = query_data_subset, 
-                                         reference_data = ref_data_subset, 
-                                         n_components = 10,
-                                         query_cell_type_col = "labels", 
-                                         ref_cell_type_col = "reclustered.broad", 
-                                         cell_type_query = "CD8", 
-                                         cell_type_reference = "CD8", 
-                                         distance_metric = "euclidean")
-
-
-}
diff --git a/man/calculateSampleDistances.Rd b/man/calculateSampleDistances.Rd
deleted file mode 100644
index 0adaa02..0000000
--- a/man/calculateSampleDistances.Rd
+++ /dev/null
@@ -1,111 +0,0 @@
-% Generated by roxygen2: do not edit by hand
-% Please edit documentation in R/calculateSampleDistances.R
-\name{calculateSampleDistances}
-\alias{calculateSampleDistances}
-\title{Compute Sample Distances Between Reference and Query Data}
-\usage{
-calculateSampleDistances(
-  query_data,
-  reference_data,
-  query_cell_type_col,
-  ref_cell_type_col,
-  n_components = 10,
-  pc_subset = c(1:5)
-)
-}
-\arguments{
-\item{query_data}{A \code{\linkS4class{SingleCellExperiment}} object containing a numeric expression matrix for the query cells.}
-
-\item{reference_data}{A \code{\linkS4class{SingleCellExperiment}} object containing a numeric expression matrix for the reference cells.}
-
-\item{query_cell_type_col}{character. The column name in the \code{colData} of \code{query_data} 
-that identifies the cell types.}
-
-\item{ref_cell_type_col}{character. The column name in the \code{colData} of \code{reference_data} 
-that identifies the cell types.}
-
-\item{n_components}{An integer specifying the number of principal components to use for projection. Defaults to 10.}
-
-\item{pc_subset}{A numeric vector specifying which principal components to use for the distance calculation. Default is PC1 to PC5.}
-}
-\value{
-A list containing distance data for each cell type. Each entry in the list contains:
-\describe{
-  \item{ref_distances}{A vector of all pairwise distances within the reference subset for the cell type.}
-  \item{query_to_ref_distances}{A matrix of distances from each query sample to all reference samples for the cell type.}
-}
-}
-\description{
-This function computes the distances within the reference dataset and the distances from each query sample to all 
-reference samples for each cell type. It uses PCA for dimensionality reduction and Euclidean distance for distance calculation.
-}
-\details{
-The function first performs PCA on the reference dataset and projects the query dataset onto the same PCA space. 
-It then computes pairwise Euclidean distances within the reference dataset for each cell type, as well as distances from each 
-query sample to all reference samples of a particular cell type. The results are stored in a list, with one entry per cell type.
-}
-\examples{
-# Load required libraries
-library(scRNAseq)
-library(scuttle)
-library(SingleR)
-library(scran)
-library(scater)
-
-# Load data (replace with your data loading)
-sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-
-# Divide the data into reference and query datasets
-set.seed(100)
-indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-ref_data <- sce[, indices]
-query_data <- sce[, -indices]
-
-# log transform datasets
-ref_data <- scuttle::logNormCounts(ref_data)
-query_data <- scuttle::logNormCounts(query_data)
-
-# Get cell type scores using SingleR (or any other cell type annotation method)
-scores <- SingleR::SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-
-# Add labels to query object
-colData(query_data)$labels <- scores$labels
-
-# Selecting highly variable genes (can be customized by the user)
-ref_var <- getTopHVGs(ref_data, n = 2000)
-query_var <- getTopHVGs(query_data, n = 2000)
-
-# Intersect the gene symbols to obtain common genes
-common_genes <- intersect(ref_var, query_var)
-ref_data_subset <- ref_data[common_genes, ]
-query_data_subset <- query_data[common_genes, ]
-
-# Run PCA on the reference data
-ref_data_subset <- runPCA(ref_data_subset)
-
-# Compute the sample distances
-distance_data <- calculateSampleDistances(query_data_subset, ref_data_subset, 
-                                          n_components = 10, 
-                                          query_cell_type_col = "labels", 
-                                          ref_cell_type_col = "reclustered.broad",
-                                          pc_subset = c(1:10)) 
-
-# Identify outliers for CD4
-cd4_anomalies <- detectAnomaly(ref_data_subset, query_data_subset, 
-                               query_cell_type_col = "labels", 
-                               ref_cell_type_col = "reclustered.broad",
-                               n_components = 10,
-                               n_tree = 500,
-                               anomaly_treshold = 0.5)$CD4
-cd4_top5_anomalies <- names(sort(cd4_anomalies$query_anomaly_scores, decreasing = TRUE)[1:5])
-
-# Plot the densities of the distances
-plot(distance_data, ref_cell_type = "CD4", sample_names = cd4_top5_anomalies)
-
-}
-\seealso{
-\code{\link{plot.calculateSampleDistances}}
-}
-\author{
-Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu}
-}
diff --git a/man/calculateSampleDistancesSimilarity.Rd b/man/calculateSampleDistancesSimilarity.Rd
deleted file mode 100644
index e530677..0000000
--- a/man/calculateSampleDistancesSimilarity.Rd
+++ /dev/null
@@ -1,125 +0,0 @@
-% Generated by roxygen2: do not edit by hand
-% Please edit documentation in R/calculateSampleDistancesSimilarity.R
-\name{calculateSampleDistancesSimilarity}
-\alias{calculateSampleDistancesSimilarity}
-\title{Function to compute Bhattacharyya coefficients and Hellinger distances}
-\usage{
-calculateSampleDistancesSimilarity(
-  query_data,
-  reference_data,
-  query_cell_type_col,
-  ref_cell_type_col,
-  sample_names,
-  n_components = 10,
-  pc_subset = c(1:5)
-)
-}
-\arguments{
-\item{query_data}{A \code{\linkS4class{SingleCellExperiment}} object containing a numeric expression matrix for the query cells.}
-
-\item{reference_data}{A \code{\linkS4class{SingleCellExperiment}} object containing a numeric expression matrix for the reference cells.}
-
-\item{query_cell_type_col}{character. The column name in the \code{colData} of \code{query_data} 
-that identifies the cell types.}
-
-\item{ref_cell_type_col}{character. The column name in the \code{colData} of \code{reference_data} 
-that identifies the cell types.}
-
-\item{sample_names}{A character vector specifying the names of the query samples for which to compute distance measures.}
-
-\item{n_components}{An integer specifying the number of principal components to use for projection. Defaults to 10.}
-
-\item{pc_subset}{A numeric vector specifying which principal components to use for the distance calculation. Default is PC1 to PC5.}
-}
-\value{
-A list containing distance data for each cell type. Each entry in the list contains:
-\describe{
-  \item{ref_distances}{A vector of all pairwise distances within the reference subset for the cell type.}
-  \item{query_to_ref_distances}{A matrix of distances from each query sample to all reference samples for the cell type.}
-}
-}
-\description{
-This function computes Bhattacharyya coefficients and Hellinger distances to quantify the similarity of density 
-distributions between query samples and reference data for each cell type.
-}
-\details{
-This function first computes distance data using the \code{calculateSampleDistances} function, which calculates 
-pairwise distances between samples within the reference data and between query samples and reference samples in the PCA space.
-Bhattacharyya coefficients and Hellinger distances are calculated to quantify the similarity of density distributions between query 
-samples and reference data for each cell type. Bhattacharyya coefficient measures the similarity of two probability distributions, 
-while Hellinger distance measures the distance between two probability distributions.
-
-Bhattacharyya coefficients range between 0 and 1. A value closer to 1 indicates higher similarity between distributions, while a value 
-closer to 0 indicates lower similarity.
-
-Hellinger distances range between 0 and 1. A value closer to 0 indicates higher similarity between distributions, while a value 
-closer to 1 indicates lower similarity.
-}
-\examples{
-# Load required libraries
-library(scRNAseq)
-library(scuttle)
-library(SingleR)
-library(scran)
-library(scater)
-
-# Load data (replace with your data loading)
-sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-
-# Divide the data into reference and query datasets
-set.seed(100)
-indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-ref_data <- sce[, indices]
-query_data <- sce[, -indices]
-
-# log transform datasets
-ref_data <- scuttle::logNormCounts(ref_data)
-query_data <- scuttle::logNormCounts(query_data)
-
-# Get cell type scores using SingleR (or any other cell type annotation method)
-scores <- SingleR::SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-
-# Add labels to query object
-colData(query_data)$labels <- scores$labels
-
-# Selecting highly variable genes (can be customized by the user)
-ref_var <- scran::getTopHVGs(ref_data, n = 2000)
-query_var <- scran::getTopHVGs(query_data, n = 2000)
-
-# Intersect the gene symbols to obtain common genes
-common_genes <- intersect(ref_var, query_var)
-ref_data_subset <- ref_data[common_genes, ]
-query_data_subset <- query_data[common_genes, ]
-
-# Run PCA on the reference data
-ref_data_subset <- runPCA(ref_data_subset, ncomponents = 50)
-
-# Compute the sample distances
-distance_data <- calculateSampleDistances(query_data_subset, ref_data_subset, 
-                                          n_components = 10, 
-                                          query_cell_type_col = "labels", 
-                                          ref_cell_type_col = "reclustered.broad",
-                                          pc_subset = c(1:10)) 
-
-# Identify outliers for CD4
-cd4_anomalies <- detectAnomaly(ref_data_subset, query_data_subset, 
-                               query_cell_type_col = "labels", 
-                               ref_cell_type_col = "reclustered.broad",
-                               n_components = 10,
-                               n_tree = 500,
-                               anomaly_treshold = 0.5)$CD4
-cd4_top5_anomalies <- names(sort(cd4_anomalies$query_anomaly_scores, decreasing = TRUE)[1:5])
-
-# Get overlap measures
-overlap_measures <- calculateSampleDistancesSimilarity(query_data_subset,ref_data_subset, 
-                                                       sample_names = cd4_top5_anomalies,
-                                                       n_components = 10, 
-                                                       query_cell_type_col = "labels", 
-                                                       ref_cell_type_col = "reclustered.broad",
-                                                       pc_subset = c(1:10))
-
-
-}
-\author{
-Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu}
-}
diff --git a/man/calculateSampleSimilarityPCA.Rd b/man/calculateSampleSimilarityPCA.Rd
deleted file mode 100644
index 66e24fb..0000000
--- a/man/calculateSampleSimilarityPCA.Rd
+++ /dev/null
@@ -1,98 +0,0 @@
-% Generated by roxygen2: do not edit by hand
-% Please edit documentation in R/calculateSampleSimilarityPCA.R
-\name{calculateSampleSimilarityPCA}
-\alias{calculateSampleSimilarityPCA}
-\title{Calculate Sample Similarity Using PCA Loadings}
-\usage{
-calculateSampleSimilarityPCA(
-  se_object,
-  samples,
-  pc_subset = c(1:5),
-  n_top_vars = 50
-)
-}
-\arguments{
-\item{se_object}{A \code{\linkS4class{SingleCellExperiment}} object containing expression data.}
-
-\item{samples}{A character vector specifying the samples for which to compute the similarity.}
-
-\item{pc_subset}{A numeric vector specifying the subset of principal components to consider (default: c(1:5)).}
-
-\item{n_top_vars}{An integer indicating the number of top loading variables to consider for each PC (default: 50).}
-}
-\value{
-A data frame containing cosine similarity values between samples for each selected principal component.
-}
-\description{
-This function calculates the cosine similarity between samples based on the principal components (PCs)
-obtained from PCA (Principal Component Analysis) loadings.
-}
-\details{
-This function calculates the cosine similarity between samples based on the loadings of the selected
-principal components obtained from PCA. It extracts the rotation matrix from the PCA results of the 
-\code{\linkS4class{SingleCellExperiment}} object and identifies the high-loading variables for each selected PC. 
-Then, it computes the cosine similarity between samples using the high-loading variables for each PC.
-}
-\examples{
-# Load required libraries
-library(scRNAseq)
-library(scuttle)
-library(SingleR)
-library(scran)
-library(scater)
-
-# Load data (replace with your data loading)
-sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-
-# Divide the data into reference and query datasets
-set.seed(100)
-indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-ref_data <- sce[, indices]
-query_data <- sce[, -indices]
-
-# log transform datasets
-ref_data <- scuttle::logNormCounts(ref_data)
-query_data <- scuttle::logNormCounts(query_data)
-
-# Get cell type scores using SingleR (or any other cell type annotation method)
-scores <- SingleR::SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-
-# Add labels to query object
-colData(query_data)$labels <- scores$labels
-
-# Selecting highly variable genes (can be customized by the user)
-ref_var <- scran::getTopHVGs(ref_data, n = 2000)
-query_var <- scran::getTopHVGs(query_data, n = 2000)
-
-# Intersect the gene symbols to obtain common genes
-common_genes <- intersect(ref_var, query_var)
-ref_data_subset <- ref_data[common_genes, ]
-query_data_subset <- query_data[common_genes, ]
-
-# Run PCA on the reference data (assumed to be prepared)
-ref_data_subset <- runPCA(ref_data_subset)
-
-# Store PCA anomaly data and plots
-anomaly_output <- detectAnomaly(reference_data = ref_data_subset, 
-                                ref_cell_type_col = "reclustered.broad", 
-                                n_components = 10,
-                                n_tree = 500,
-                                anomaly_treshold = 0.5) 
-top6_anomalies <- names(sort(anomaly_output$Combined$reference_anomaly_scores, 
-                             decreasing = TRUE)[1:6])
-
-# Compute cosine similarity between anomalies and top PCs
-cosine_similarities <- calculateSampleSimilarityPCA(ref_data_subset, samples = top6_anomalies, 
-                                                    pc_subset = c(1:10), n_top_vars = 50)
-cosine_similarities
-
-# Plot similarities
-plot(cosine_similarities, pc_subset = c(1:5))
-
-}
-\seealso{
-\code{\link{plot.calculateSampleSimilarityPCA}}
-}
-\author{
-Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu}
-}
diff --git a/man/calculateVarImpOverlap.Rd b/man/calculateVarImpOverlap.Rd
deleted file mode 100644
index 8571519..0000000
--- a/man/calculateVarImpOverlap.Rd
+++ /dev/null
@@ -1,93 +0,0 @@
-% Generated by roxygen2: do not edit by hand
-% Please edit documentation in R/calculateVarImpOverlap.R
-\name{calculateVarImpOverlap}
-\alias{calculateVarImpOverlap}
-\title{Compare Gene Importance Across Datasets Using Random Forest}
-\usage{
-calculateVarImpOverlap(
-  query_data,
-  reference_data,
-  query_cell_type_col,
-  ref_cell_type_col,
-  n_tree = 500,
-  n_top = 20
-)
-}
-\arguments{
-\item{query_data}{A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the query cells.}
-
-\item{reference_data}{A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the reference cells.}
-
-\item{query_cell_type_col}{A character string specifying the column name in the query dataset containing cell type annotations.}
-
-\item{ref_cell_type_col}{A character string specifying the column name in the reference dataset containing cell type annotations.}
-
-\item{n_tree}{An integer specifying the number of trees to grow in the Random Forest. Default is 500.}
-
-\item{n_top}{An integer specifying the number of top genes to consider when comparing variable importance scores. Default is 20.}
-}
-\value{
-A list containing three elements:
-\item{var_imp_ref}{A list of data frames containing variable importance scores for each combination of cell types in the reference 
-dataset.}
-\item{var_imp_query}{A list of data frames containing variable importance scores for each combination of cell types in the query 
-dataset.}
-\item{var_imp_comparison}{A named vector indicating the proportion of top genes that overlap between the reference and query 
-datasets for each combination of cell types.}
-}
-\description{
-This function identifies and compares the most important genes for differentiating cell types between a query dataset 
-and a reference dataset using Random Forest.
-}
-\details{
-This function uses the Random Forest algorithm to calculate the importance of genes in differentiating between cell types 
-within both a reference dataset and a query dataset. The function then compares the top genes identified in both datasets to determine 
-the overlap in their importance scores.
-}
-\examples{
-# Load necessary libraries
-library(scRNAseq)
-library(scuttle)
-library(SingleR)
-library(scran)
-
-# Load data
-sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-
-# Divide the data into reference and query datasets
-set.seed(100)
-indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-ref_data <- sce[, indices]
-query_data <- sce[, -indices]
-
-# log transform datasets
-ref_data <- logNormCounts(ref_data)
-query_data <- logNormCounts(query_data)
-
-# Get cell type scores using SingleR (or any other cell type annotation method)
-scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-
-# Add labels to query object
-colData(query_data)$labels <- scores$labels
-
-# Selecting highly variable genes (can be customized by the user)
-ref_var <- getTopHVGs(ref_data, n = 500)
-query_var <- getTopHVGs(query_data, n = 500)
-
-# Intersect the gene symbols to obtain common genes
-common_genes <- intersect(ref_var, query_var)
-ref_data_subset <- ref_data[common_genes, ]
-query_data_subset <- query_data[common_genes, ]
-
-# Compare variable importance between datasets using Random Forest
-rf_output <- calculateVarImpOverlap(query_data_subset, ref_data_subset, 
-                                    query_cell_type_col = "labels", 
-                                    ref_cell_type_col = "reclustered.broad", 
-                                    n_tree = 500,
-                                    n_top = 20)
-
-
-}
-\author{
-Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu}
-}
diff --git a/man/compareCCA.Rd b/man/compareCCA.Rd
deleted file mode 100644
index 38a1d29..0000000
--- a/man/compareCCA.Rd
+++ /dev/null
@@ -1,100 +0,0 @@
-% Generated by roxygen2: do not edit by hand
-% Please edit documentation in R/compareCCA.R
-\name{compareCCA}
-\alias{compareCCA}
-\title{Compare Subspaces Spanned by Top Principal Components Using Canonical Correlation Analysis}
-\usage{
-compareCCA(reference_data, query_data, pc_subset = c(1:5), n_top_vars = 25)
-}
-\arguments{
-\item{reference_data}{A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the reference cells.}
-
-\item{query_data}{A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the query cells.}
-
-\item{pc_subset}{A numeric vector specifying the subset of principal components (PCs) 
-to compare. Default is the first five PCs.}
-
-\item{n_top_vars}{An integer indicating the number of top loading variables to consider for each PC. Default is 25.}
-}
-\value{
-A list containing the following elements:
-\describe{
-  \item{coef_ref}{Canonical coefficients for the reference dataset.}
-  \item{coef_query}{Canonical coefficients for the query dataset.}
-  \item{cosine_similarity}{Cosine similarity values for the canonical variables.}
-  \item{correlations}{Canonical correlations between the reference and query datasets.}
-}
-}
-\description{
-This function compares the subspaces spanned by the top principal components (PCs) of the reference 
-and query datasets using canonical correlation analysis (CCA). It calculates the canonical variables, 
-correlations, and a similarity measure for the subspaces.
-}
-\details{
-This function performs canonical correlation analysis (CCA) to compare the subspaces spanned by the 
-top principal components (PCs) of the reference and query datasets. The function extracts the rotation 
-matrices corresponding to the specified PCs and performs CCA on these matrices. It computes the canonical 
-variables and their corresponding correlations. Additionally, it calculates a similarity measure for the 
-canonical variables using cosine similarity. The output is a list containing the canonical coefficients 
-for both datasets, the cosine similarity values, and the canonical correlations.
-}
-\examples{
-# Load necessary libraries
-library(scRNAseq)
-library(scuttle)
-library(scran)
-library(SingleR)
-library(ggplot2)
-library(scater)
-
-# Load data
-sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-
-# Divide the data into reference and query datasets
-set.seed(100)
-indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-ref_data <- sce[, indices]
-query_data <- sce[, -indices]
-
-# Log transform datasets
-ref_data <- logNormCounts(ref_data)
-query_data <- logNormCounts(query_data)
-
-# Get cell type scores using SingleR (or any other cell type annotation method)
-scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-
-# Add labels to query object
-colData(query_data)$labels <- scores$labels
-
-# Selecting highly variable genes (can be customized by the user)
-ref_var <- getTopHVGs(ref_data, n = 500)
-query_var <- getTopHVGs(query_data, n = 500)
-
-# Intersect the gene symbols to obtain common genes
-common_genes <- intersect(ref_var, query_var)
-ref_data_subset <- ref_data[common_genes, ]
-query_data_subset <- query_data[common_genes, ]
-
-# Subset reference and query data for a specific cell type
-ref_data_subset <- ref_data_subset[, which(ref_data_subset$reclustered.broad == "CD8")]
-query_data_subset <- query_data_subset[, which(colData(query_data_subset)$labels == "CD8")]
-
-# Run PCA on the reference and query datasets
-ref_data_subset <- runPCA(ref_data_subset, ncomponents = 50)
-query_data_subset <- runPCA(query_data_subset, ncomponents = 50)
-
-# Compare CCA
-cca_comparison <- compareCCA(ref_data_subset, query_data_subset, 
-                             pc_subset = c(1:5), n_top_vars = 25)
-
-# Visualize output of CCA comparison
-plot(cca_comparison)
-
-
-}
-\seealso{
-\code{\link{plot.compareCCA}}
-}
-\author{
-Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu}
-}
diff --git a/man/comparePCA.Rd b/man/comparePCA.Rd
deleted file mode 100644
index c801530..0000000
--- a/man/comparePCA.Rd
+++ /dev/null
@@ -1,108 +0,0 @@
-% Generated by roxygen2: do not edit by hand
-% Please edit documentation in R/comparePCA.R
-\name{comparePCA}
-\alias{comparePCA}
-\title{Compare Principal Components Analysis (PCA) Results}
-\usage{
-comparePCA(
-  reference_data,
-  query_data,
-  pc_subset = c(1:5),
-  n_top_vars = 50,
-  metric = c("cosine", "correlation")[1],
-  correlation_method = c("spearman", "pearson")[1]
-)
-}
-\arguments{
-\item{reference_data}{A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the reference cells.}
-
-\item{query_data}{A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the query cells.}
-
-\item{pc_subset}{A numeric vector specifying the subset of principal components (PCs) to compare. Default is the first five PCs.}
-
-\item{n_top_vars}{An integer indicating the number of top loading variables to consider for each PC. Default is 50.}
-
-\item{metric}{The similarity metric to use. It can be either "cosine" or "correlation". Default is "cosine".}
-
-\item{correlation_method}{The correlation method to use if metric is "correlation". It can be "spearman" 
-or "pearson". Default is "spearman".}
-}
-\value{
-A similarity matrix comparing the principal components of the reference and query datasets.
-Each element (i, j) in the matrix represents the similarity between the i-th principal component 
-of the reference dataset and the j-th principal component of the query dataset.
-}
-\description{
-This function compares the principal components (PCs) obtained from separate PCA on reference and query 
-datasets for a single cell type using either cosine similarity or correlation.
-}
-\details{
-This function compares the PCA results between the reference and query datasets by computing cosine 
-similarities or correlations between the loadings of top variables for each pair of principal components. It first 
-extracts the PCA rotation matrices from both datasets and identifies the top variables with highest loadings for 
-each PC. Then, it computes the cosine similarities or correlations between the loadings of top variables for each 
-pair of PCs. The resulting matrix contains the similarity values, where rows represent reference PCs and columns 
-represent query PCs.
-}
-\examples{
-# Load necessary libraries
-library(scRNAseq)
-library(scuttle)
-library(scran)
-library(SingleR)
-library(ComplexHeatmap)
-library(scater)
-
-# Load data
-sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-
-# Divide the data into reference and query datasets
-set.seed(100)
-indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-ref_data <- sce[, indices]
-query_data <- sce[, -indices]
-
-# Log transform datasets
-ref_data <- logNormCounts(ref_data)
-query_data <- logNormCounts(query_data)
-
-# Get cell type scores using SingleR (or any other cell type annotation method)
-scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-
-# Add labels to query object
-colData(query_data)$labels <- scores$labels
-
-# Selecting highly variable genes (can be customized by the user)
-ref_var <- getTopHVGs(ref_data, n = 500)
-query_var <- getTopHVGs(query_data, n = 500)
-
-# Intersect the gene symbols to obtain common genes
-common_genes <- intersect(ref_var, query_var)
-ref_data_subset <- ref_data[common_genes, ]
-query_data_subset <- query_data[common_genes, ]
-
-# Subset reference and query data for a specific cell type
-ref_data_subset <- ref_data_subset[, which(ref_data_subset$reclustered.broad == "CD8")]
-query_data_subset <- query_data_subset[, which(colData(query_data_subset)$labels == "CD8")]
-
-# Run PCA on the reference and query datasets
-ref_data_subset <- runPCA(ref_data_subset)
-query_data_subset <- runPCA(query_data_subset)
-
-# Call the PCA comparison function
-similarity_mat <- comparePCA(ref_data_subset, query_data_subset, 
-                             pc_subset = c(1:5), 
-                             n_top_vars = 50,
-                             metric = c("cosine", "correlation")[1], 
-                             correlation_method = c("spearman", "pearson")[1])
-
-# Create the heatmap
-plot(similarity_mat)
-
-}
-\seealso{
-\code{\link{plot.comparePCA}}
-}
-\author{
-Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu}
-}
diff --git a/man/comparePCASubspace.Rd b/man/comparePCASubspace.Rd
deleted file mode 100644
index f510670..0000000
--- a/man/comparePCASubspace.Rd
+++ /dev/null
@@ -1,100 +0,0 @@
-% Generated by roxygen2: do not edit by hand
-% Please edit documentation in R/comparePCASubspace.R
-\name{comparePCASubspace}
-\alias{comparePCASubspace}
-\title{Compare Subspaces Spanned by Top Principal Components}
-\usage{
-comparePCASubspace(
-  reference_data,
-  query_data,
-  pc_subset = c(1:5),
-  n_top_vars = 50
-)
-}
-\arguments{
-\item{reference_data}{A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the reference cells.}
-
-\item{query_data}{A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the query cells.}
-
-\item{pc_subset}{A numeric vector specifying the subset of principal components (PCs) to compare. Default is the first five PCs.}
-
-\item{n_top_vars}{An integer indicating the number of top loading variables to consider for each PC. Default is 50.}
-}
-\value{
-A list containing the following components:
-  \item{principal_angles_cosines}{A numeric vector of cosine values of principal angles.}
-  \item{average_variance_explained}{A numeric vector of average variance explained by each PC.}
-  \item{weighted_cosine_similarity}{A numeric value representing the weighted cosine similarity.}
-}
-\description{
-This function compares the subspace spanned by the top principal components (PCs) in a reference dataset to that 
-in a query dataset. It computes the cosine similarity between the loadings of the top variables for each PC in 
-both datasets and provides a weighted cosine similarity score.
-}
-\details{
-This function compares the subspace spanned by the top principal components (PCs) in a reference dataset 
-to that in a query dataset. It first computes the cosine similarity between the loadings of the top variables 
-for each PC in both datasets. The top cosine similarity scores are then selected, and their corresponding PC 
-indices are stored. Additionally, the function calculates the average percentage of variance explained by the 
-selected top PCs. Finally, it computes a weighted cosine similarity score based on the top cosine similarities 
-and the average percentage of variance explained.
-}
-\examples{
-# Load necessary libraries
-library(scRNAseq)
-library(scuttle)
-library(scran)
-library(SingleR)
-library(ggplot2)
-library(scater)
-
-# Load data
-sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-
-# Divide the data into reference and query datasets
-set.seed(100)
-indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-ref_data <- sce[, indices]
-query_data <- sce[, -indices]
-
-# Log transform datasets
-ref_data <- logNormCounts(ref_data)
-query_data <- logNormCounts(query_data)
-
-# Get cell type scores using SingleR (or any other cell type annotation method)
-scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-
-# Add labels to query object
-colData(query_data)$labels <- scores$labels
-
-# Selecting highly variable genes (can be customized by the user)
-ref_var <- getTopHVGs(ref_data, n = 500)
-query_var <- getTopHVGs(query_data, n = 500)
-
-# Intersect the gene symbols to obtain common genes
-common_genes <- intersect(ref_var, query_var)
-ref_data_subset <- ref_data[common_genes, ]
-query_data_subset <- query_data[common_genes, ]
-
-# Subset reference and query data for a specific cell type
-ref_data_subset <- ref_data_subset[, which(ref_data_subset$reclustered.broad == "CD8")]
-query_data_subset <- query_data_subset[, which(colData(query_data_subset)$labels == "CD8")]
-
-# Run PCA on the reference and query datasets
-ref_data_subset <- runPCA(ref_data_subset, ncomponents = 50)
-query_data_subset <- runPCA(query_data_subset, ncomponents = 50)
-
-# Compare PCA subspaces
-subspace_comparison <- comparePCASubspace(ref_data_subset, query_data_subset, 
-                                          pc_subset = c(1:5), n_top_vars = 50)
-
-# Visualize the subspace comparison
-plot(subspace_comparison)
-
-}
-\seealso{
-\code{\link{plot.comparePCASubspace}}
-}
-\author{
-Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu}
-}
diff --git a/man/detectAnomaly.Rd b/man/detectAnomaly.Rd
deleted file mode 100644
index ba1a8c9..0000000
--- a/man/detectAnomaly.Rd
+++ /dev/null
@@ -1,110 +0,0 @@
-% Generated by roxygen2: do not edit by hand
-% Please edit documentation in R/detectAnomaly.R
-\name{detectAnomaly}
-\alias{detectAnomaly}
-\title{PCA Anomaly Scores via Isolation Forests with Visualization}
-\usage{
-detectAnomaly(
-  reference_data,
-  query_data = NULL,
-  ref_cell_type_col,
-  query_cell_type_col,
-  n_components = 10,
-  n_tree = 500,
-  anomaly_treshold = 0.5,
-  ...
-)
-}
-\arguments{
-\item{reference_data}{A \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the reference cells.}
-
-\item{query_data}{An optional \code{\linkS4class{SingleCellExperiment}} object containing numeric expression matrix for the query cells. 
-If NULL, then the isolation forest anomaly scores are computed for the reference data. Default is NULL.}
-
-\item{ref_cell_type_col}{A character string specifying the column name in the reference dataset containing cell type annotations.}
-
-\item{query_cell_type_col}{A character string specifying the column name in the query dataset containing cell type annotations.}
-
-\item{n_components}{An integer specifying the number of principal components to use. Default is 10.}
-
-\item{n_tree}{An integer specifying the number of trees for the isolation forest. Default is 500.}
-
-\item{anomaly_treshold}{A numeric value specifying the threshold for identifying anomalies. Default is 0.5.}
-
-\item{...}{Additional arguments passed to the \code{isolation.forest} function.}
-}
-\value{
-A list containing the following components for each cell type and the combined data:
-\item{anomaly_scores}{Anomaly scores for each cell in the query data.}
-\item{anomaly}{Logical vector indicating whether each cell is classified as an anomaly.}
-\item{reference_mat_subset}{PCA projections of the reference data.}
-\item{query_mat_subset}{PCA projections of the query data (if provided).}
-\item{var_explained}{Proportion of variance explained by the retained principal components.}
-}
-\description{
-This function detects anomalies in single-cell data by projecting the data onto a PCA space and using an isolation forest 
-algorithm to identify anomalies.
-}
-\details{
-This function projects the query data onto the PCA space of the reference data. An isolation forest is then built on the 
-reference data to identify anomalies in the query data based on their PCA projections. If no query dataset is provided by the user,
-the anomaly scores are computed on the reference data itself. Anomaly scores for the data with all combined cell types are also
-provided as part of the output.
-}
-\examples{
-# Load required libraries
-library(scRNAseq)
-library(scuttle)
-library(SingleR)
-library(scran)
-library(scater)
-
-# Load data
-sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-
-# Divide the data into reference and query datasets
-set.seed(100)
-indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-ref_data <- sce[, indices]
-query_data <- sce[, -indices]
-
-# log transform datasets
-ref_data <- logNormCounts(ref_data)
-query_data <- logNormCounts(query_data)
-
-# Get cell type scores using SingleR (or any other cell type annotation method)
-scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-
-# Add labels to query object
-colData(query_data)$labels <- scores$labels
-
-# Selecting highly variable genes (can be customized by the user)
-ref_var <- getTopHVGs(ref_data, n = 2000)
-query_var <- getTopHVGs(query_data, n = 2000)
-
-# Intersect the gene symbols to obtain common genes
-common_genes <- intersect(ref_var, query_var)
-ref_data_subset <- ref_data[common_genes, ]
-query_data_subset <- query_data[common_genes, ]
-
-# Run PCA on the reference data
-ref_data_subset <- runPCA(ref_data_subset)
-
-# Store PCA anomaly data and plots
-anomaly_output <- detectAnomaly(ref_data_subset, query_data_subset,
-                                ref_cell_type_col = "reclustered.broad", 
-                                query_cell_type_col = "labels",
-                                n_components = 10,
-                                n_tree = 500,
-                                anomaly_treshold = 0.5) 
-
-# Plot the output for a cell type
-plot(anomaly_output, cell_type = "CD8", pc_subset = c(1:5), data_type = "query")
-
-}
-\seealso{
-\code{\link{plot.detectAnomaly}}
-}
-\author{
-Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu}
-}
diff --git a/man/figures/CMD-Scatter-Plot-1.png b/man/figures/CMD-Scatter-Plot-1.png
deleted file mode 100644
index b97d2c8..0000000
Binary files a/man/figures/CMD-Scatter-Plot-1.png and /dev/null differ
diff --git a/man/figures/Cell-Type-Correlation-Analysis-Visualization-1.png b/man/figures/Cell-Type-Correlation-Analysis-Visualization-1.png
deleted file mode 100644
index 794447d..0000000
Binary files a/man/figures/Cell-Type-Correlation-Analysis-Visualization-1.png and /dev/null differ
diff --git a/man/figures/Gene-Expression-Histogram-1.png b/man/figures/Gene-Expression-Histogram-1.png
deleted file mode 100644
index 5393ef3..0000000
Binary files a/man/figures/Gene-Expression-Histogram-1.png and /dev/null differ
diff --git a/man/figures/Gene-Expression-Scatter-1.png b/man/figures/Gene-Expression-Scatter-1.png
deleted file mode 100644
index 196a78a..0000000
Binary files a/man/figures/Gene-Expression-Scatter-1.png and /dev/null differ
diff --git a/man/figures/Mito-Genes-Vs-Annotation-1.png b/man/figures/Mito-Genes-Vs-Annotation-1.png
deleted file mode 100644
index f633ca9..0000000
Binary files a/man/figures/Mito-Genes-Vs-Annotation-1.png and /dev/null differ
diff --git a/man/figures/Pairwise-Distance-Analysis-Density-Visualization-1.png b/man/figures/Pairwise-Distance-Analysis-Density-Visualization-1.png
deleted file mode 100644
index 63af85f..0000000
Binary files a/man/figures/Pairwise-Distance-Analysis-Density-Visualization-1.png and /dev/null differ
diff --git a/man/figures/Pairwise-Distance-Correlation-Based-Density-Visualization-1.png b/man/figures/Pairwise-Distance-Correlation-Based-Density-Visualization-1.png
deleted file mode 100644
index a31491e..0000000
Binary files a/man/figures/Pairwise-Distance-Correlation-Based-Density-Visualization-1.png and /dev/null differ
diff --git a/man/figures/Pathway-Scores-on-Dimensional-Reduction-Scatter-1.png b/man/figures/Pathway-Scores-on-Dimensional-Reduction-Scatter-1.png
deleted file mode 100644
index d6cfb8c..0000000
Binary files a/man/figures/Pathway-Scores-on-Dimensional-Reduction-Scatter-1.png and /dev/null differ
diff --git a/man/figures/QC-Annotation-Scatter-AllCellTypes-1.png b/man/figures/QC-Annotation-Scatter-AllCellTypes-1.png
deleted file mode 100644
index 93e3650..0000000
Binary files a/man/figures/QC-Annotation-Scatter-AllCellTypes-1.png and /dev/null differ
diff --git a/man/figures/QC-Annotation-Scatter-Mito-1.png b/man/figures/QC-Annotation-Scatter-Mito-1.png
deleted file mode 100644
index dfba242..0000000
Binary files a/man/figures/QC-Annotation-Scatter-Mito-1.png and /dev/null differ
diff --git a/man/figures/Scatter-Plot-LibrarySize-Vs-Annotation-Scores-1.png b/man/figures/Scatter-Plot-LibrarySize-Vs-Annotation-Scores-1.png
deleted file mode 100644
index 0dc51b8..0000000
Binary files a/man/figures/Scatter-Plot-LibrarySize-Vs-Annotation-Scores-1.png and /dev/null differ
diff --git a/man/figures/Scatter-Plot-QC-Stats-Vs-Annotation-Scores-1.png b/man/figures/Scatter-Plot-QC-Stats-Vs-Annotation-Scores-1.png
deleted file mode 100644
index 1feb9ae..0000000
Binary files a/man/figures/Scatter-Plot-QC-Stats-Vs-Annotation-Scores-1.png and /dev/null differ
diff --git a/man/histQCvsAnnotation.Rd b/man/histQCvsAnnotation.Rd
deleted file mode 100644
index 3c3b2b8..0000000
--- a/man/histQCvsAnnotation.Rd
+++ /dev/null
@@ -1,92 +0,0 @@
-% Generated by roxygen2: do not edit by hand
-% Please edit documentation in R/histQCvsAnnotation.R
-\name{histQCvsAnnotation}
-\alias{histQCvsAnnotation}
-\title{Histograms: QC Stats and Annotation Scores Visualization}
-\usage{
-histQCvsAnnotation(
-  query_data,
-  qc_col = qc_col,
-  label_col,
-  score_col,
-  label = NULL
-)
-}
-\arguments{
-\item{query_data}{A \code{\linkS4class{SingleCellExperiment}} containing the single-cell 
-expression data and metadata.}
-
-\item{qc_col}{character. A column name in the \code{colData} of \code{query_data} that 
-contains the QC stats of interest.}
-
-\item{label_col}{character. The column name in the \code{colData} of \code{query_data} 
-that contains the cell type labels.}
-
-\item{score_col}{character. The column name in the \code{colData} of \code{query_data} that 
-contains the cell type scores.}
-
-\item{label}{character. A vector of cell type labels to plot (e.g., c("T-cell", "B-cell")).
-Defaults to \code{NULL}, which will include all the cells.}
-}
-\value{
-An object containing two histograms displayed side by side. 
-The first histogram represents the distribution of QC stats, 
-and the second histogram represents the distribution of annotation scores.
-}
-\description{
-This function generates histograms for visualizing the distribution of quality control (QC) statistics and 
-annotation scores associated with cell types in single-cell genomic data.
-}
-\details{
-This function is particularly useful in the analysis of data from single-cell experiments, 
-where understanding the distribution of these metrics is crucial for quality assessment and 
-interpretation of cell type annotations.
-}
-\examples{
-\donttest{
-library(scater)
-library(scran)
-library(scRNAseq)
-library(SingleR)
-library(gridExtra)
-
-# Load data
-sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-
-# Divide the data into reference and query datasets
-set.seed(100)
-indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-ref_data <- sce[, indices]
-query_data <- sce[, -indices]
-
-# Log-transform datasets
-ref_data <- logNormCounts(ref_data)
-query_data <- logNormCounts(query_data)
-
-# Get cell type scores using SingleR
-pred <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-
-# Assign labels to query data
-colData(query_data)$labels <- pred$labels
-
-# Get annotation scores
-scores <- apply(pred$scores, 1, max)
-
-# Assign scores to query data
-colData(query_data)$cell_scores <- scores
-
-# Generate histograms
-histQCvsAnnotation(query_data = query_data, 
-                   qc_col = "percent.mito", 
-                   label_col = "labels", 
-                   score_col = "cell_scores", 
-                   label = c("CD4", "CD8"))
-
-histQCvsAnnotation(query_data = query_data, 
-                   qc_col = "percent.mito", 
-                   label_col = "labels", 
-                   score_col = "cell_scores", 
-                   label = NULL)
-}
-
-}
diff --git a/man/nearestNeighborDiagnostics.Rd b/man/nearestNeighborDiagnostics.Rd
deleted file mode 100644
index b45211a..0000000
--- a/man/nearestNeighborDiagnostics.Rd
+++ /dev/null
@@ -1,106 +0,0 @@
-% Generated by roxygen2: do not edit by hand
-% Please edit documentation in R/nearestNeighborDiagnostics.R
-\name{nearestNeighborDiagnostics}
-\alias{nearestNeighborDiagnostics}
-\title{Calculate Nearest Neighbor Diagnostics for Cell Type Classification}
-\usage{
-nearestNeighborDiagnostics(
-  query_data,
-  reference_data,
-  n_neighbor = 15,
-  n_components = 10,
-  pc_subset = c(1:10),
-  query_cell_type_col,
-  ref_cell_type_col
-)
-}
-\arguments{
-\item{query_data}{A \code{\linkS4class{SingleCellExperiment}} object containing a numeric expression matrix for the query cells.}
-
-\item{reference_data}{A \code{\linkS4class{SingleCellExperiment}} object containing a numeric expression matrix for the reference cells.}
-
-\item{n_neighbor}{An integer specifying the number of nearest neighbors to consider. Default is 15.}
-
-\item{n_components}{An integer specifying the number of principal components to use for dimensionality reduction. Default is 10.}
-
-\item{pc_subset}{A vector specifying the subset of principal components to use in the analysis. Default is c(1:10).}
-
-\item{query_cell_type_col}{A character string specifying the column name in the query dataset containing cell type annotations.}
-
-\item{ref_cell_type_col}{A character string specifying the column name in the reference dataset containing cell type annotations.}
-}
-\value{
-A list where each element corresponds to a cell type and contains two vectors:
-\item{prob_ref}{The probabilities of each query sample belonging to the reference dataset.}
-\item{prob_query}{The probabilities of each query sample belonging to the query dataset.}
-The list is assigned the class \code{"nearestNeighborDiagnostics"}.
-}
-\description{
-This function computes, for each cell type, the probability that each sample belongs to either the reference or the query dataset, 
-using nearest neighbor analysis.
-}
-\details{
-This function performs a nearest neighbor search to calculate the probability of each sample in the query dataset 
-belonging to the reference dataset for each cell type. It uses principal component analysis (PCA) to reduce the dimensionality 
-of the data before performing the nearest neighbor search. The function balances the sample sizes between the reference and query 
-datasets by data augmentation if necessary.
-}
-\examples{
-# Load necessary libraries
-library(scRNAseq)
-library(scuttle)
-library(scran)
-library(SingleR)
-library(scater)
-
-# Load data
-sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-
-# Divide the data into reference and query datasets
-set.seed(100)
-indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-ref_data <- sce[, indices]
-query_data <- sce[, -indices]
-
-# log transform datasets
-ref_data <- logNormCounts(ref_data)
-query_data <- logNormCounts(query_data)
-
-# Get cell type scores using SingleR (or any other cell type annotation method)
-scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-
-# Add labels to query object
-colData(query_data)$labels <- scores$labels
-
-# Selecting highly variable genes (can be customized by the user)
-ref_var <- getTopHVGs(ref_data, n = 500)
-query_var <- getTopHVGs(query_data, n = 500)
-
-# Intersect the gene symbols to obtain common genes
-common_genes <- intersect(ref_var, query_var)
-ref_data_subset <- ref_data[common_genes, ]
-query_data_subset <- query_data[common_genes, ]
-
-# Run PCA on the reference data
-ref_data_subset <- runPCA(ref_data_subset)
-
-# Project the query data onto PCA space of reference
-nn_output <- nearestNeighborDiagnostics(query_data_subset, ref_data_subset,
-                                        n_neighbor = 15, 
-                                        n_components = 10,
-                                        pc_subset = c(1:10),
-                                        query_cell_type_col = "labels", 
-                                        ref_cell_type_col = "reclustered.broad")
-
-# Plot output
-plot(nn_output, cell_types = c("CD4", "CD8", "B_and_plasma", "Myeloid"),
-     prob_type = "query")
-
-
-}
-\seealso{
-\code{\link{plot.nearestNeighborDiagnostics}}
-}
-\author{
-Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu}
-}
diff --git a/man/plot.calculateAveragePairwiseCorrelation.Rd b/man/plot.calculateAveragePairwiseCorrelation.Rd
deleted file mode 100644
index 6caf9af..0000000
--- a/man/plot.calculateAveragePairwiseCorrelation.Rd
+++ /dev/null
@@ -1,88 +0,0 @@
-% Generated by roxygen2: do not edit by hand
-% Please edit documentation in R/plot.calculateAveragePairwiseCorrelation.R
-\name{plot.calculateAveragePairwiseCorrelation}
-\alias{plot.calculateAveragePairwiseCorrelation}
-\title{Plot the output of the calculateAveragePairwiseCorrelation function}
-\usage{
-\method{plot}{calculateAveragePairwiseCorrelation}(x, ...)
-}
-\arguments{
-\item{x}{Output matrix from calculateAveragePairwiseCorrelation function.}
-
-\item{...}{Additional arguments to be passed to the plotting function.}
-}
-\value{
-A ggplot2 object representing the heatmap plot.
-}
-\description{
-This function takes the output of the calculateAveragePairwiseCorrelation function,
-which should be a matrix of pairwise correlations, and plots it as a heatmap.
-}
-\details{
-This function converts the correlation matrix into a dataframe, creates a heatmap using ggplot2,
-and customizes the appearance of the heatmap with updated colors and improved aesthetics.
-}
-\examples{
-library(scater)
-library(scran)
-library(scRNAseq)
-library(SingleR)
-
-# Load data
-sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-
-# Divide the data into reference and query datasets
-set.seed(100)
-indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-ref_data <- sce[, indices]
-query_data <- sce[, -indices]
-
-# log transform datasets
-ref_data <- logNormCounts(ref_data)
-query_data <- logNormCounts(query_data)
-
-# Get cell type scores using SingleR
-scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-
-# Add labels to query object
-colData(query_data)$labels <- scores$labels
-
-# Compute Pairwise Correlations
-# Note: The selection of highly variable genes and desired cell types may vary 
-# based on user preference. 
-# The cell type annotation method used in this example is SingleR. 
-# Users can use any other method for cell type annotation and provide 
-# the corresponding labels in the metadata.
-
-# Selecting highly variable genes
-ref_var <- getTopHVGs(ref_data, n = 2000)
-query_var <- getTopHVGs(query_data, n = 2000)
-
-# Intersect the gene symbols to obtain common genes
-common_genes <- intersect(ref_var, query_var)
-
-# Select desired cell types
-selected_cell_types <- c("CD4", "CD8", "B_and_plasma")
-ref_data_subset <- ref_data[common_genes, ref_data$reclustered.broad \%in\% selected_cell_types]
-query_data_subset <- query_data[common_genes, query_data$reclustered.broad \%in\% selected_cell_types]
-
-# Run PCA on the reference data
-ref_data_subset <- runPCA(ref_data_subset)
-
-# Compute pairwise correlations
-cor_matrix_avg <- calculateAveragePairwiseCorrelation(query_data = query_data_subset, 
-                                                      reference_data = ref_data_subset, 
-                                                      n_components = 10,
-                                                      query_cell_type_col = "labels", 
-                                                      ref_cell_type_col = "reclustered.broad", 
-                                                      cell_types = selected_cell_types, 
-                                                      correlation_method = "spearman")
-
-# Visualize the results
-plot(cor_matrix_avg)
-
-
-}
-\seealso{
-\code{\link{calculateAveragePairwiseCorrelation}}
-}
diff --git a/man/plot.calculateSampleDistances.Rd b/man/plot.calculateSampleDistances.Rd
deleted file mode 100644
index 52e082f..0000000
--- a/man/plot.calculateSampleDistances.Rd
+++ /dev/null
@@ -1,99 +0,0 @@
-% Generated by roxygen2: do not edit by hand
-% Please edit documentation in R/plot.calculateSampleDistances.R
-\name{plot.calculateSampleDistances}
-\alias{plot.calculateSampleDistances}
-\title{Plot Distance Density Comparison for a Specific Cell Type and Selected Samples}
-\usage{
-\method{plot}{calculateSampleDistances}(x, ref_cell_type, sample_names, ...)
-}
-\arguments{
-\item{x}{A list containing the distance data computed by \code{calculateSampleDistances}.}
-
-\item{ref_cell_type}{A string specifying the reference cell type.}
-
-\item{sample_names}{A character vector specifying the names of the query samples for which to plot the distances.}
-
-\item{...}{Additional arguments passed to the plotting function.}
-}
-\value{
-A ggplot2 density plot comparing the reference distances and the distances from the specified sample to the reference samples.
-}
-\description{
-This function plots the density functions for the reference data and the distances from the specified query samples 
-to all reference samples within a specified cell type.
-}
-\details{
-The function first checks if the specified cell type and sample names are present in \code{x}. If the 
-specified cell type or sample name is not found, an error is thrown. It then extracts the distances within the reference dataset 
-and the distances from the specified query sample to the reference samples. The function creates a density plot using \code{ggplot2} 
-to compare the distance distributions. The density plot will show two distributions: one for the pairwise distances within the 
-reference dataset and one for the distances from the specified query sample to each reference sample. These distributions are 
-plotted in different colors to visually assess how similar the query sample is to the reference samples of the specified cell type.
-}
-\examples{
-# Load required libraries
-library(scRNAseq)
-library(scuttle)
-library(SingleR)
-library(scran)
-library(scater)
-
-# Load data (replace with your data loading)
-sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-
-# Divide the data into reference and query datasets
-set.seed(100)
-indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-ref_data <- sce[, indices]
-query_data <- sce[, -indices]
-
-# log transform datasets
-ref_data <- scuttle::logNormCounts(ref_data)
-query_data <- scuttle::logNormCounts(query_data)
-
-# Get cell type scores using SingleR (or any other cell type annotation method)
-scores <- SingleR::SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-
-# Add labels to query object
-colData(query_data)$labels <- scores$labels
-
-# Selecting highly variable genes (can be customized by the user)
-ref_var <- scran::getTopHVGs(ref_data, n = 2000)
-query_var <- scran::getTopHVGs(query_data, n = 2000)
-
-# Intersect the gene symbols to obtain common genes
-common_genes <- intersect(ref_var, query_var)
-ref_data_subset <- ref_data[common_genes, ]
-query_data_subset <- query_data[common_genes, ]
-
-# Run PCA on the reference data
-ref_data_subset <- runPCA(ref_data_subset)
-
-# Compute the sample distances in PC space
-distance_data <- calculateSampleDistances(query_data_subset, ref_data_subset, 
-                                          n_components = 10, 
-                                          query_cell_type_col = "labels", 
-                                          ref_cell_type_col = "reclustered.broad",
-                                          pc_subset = c(1:10)) 
-
-# Identify outliers for CD4
-cd4_anomalies <- detectAnomaly(ref_data_subset, query_data_subset, 
-                               query_cell_type_col = "labels", 
-                               ref_cell_type_col = "reclustered.broad",
-                               n_components = 10,
-                               n_tree = 500,
-                               anomaly_treshold = 0.5)$CD4
-cd4_top5_anomalies <- names(sort(cd4_anomalies$query_anomaly_scores, decreasing = TRUE)[1:5])
-
-# Plot the densities of the distances
-plot(distance_data, ref_cell_type = "CD4", sample_names = cd4_top5_anomalies)
-plot(distance_data, ref_cell_type = "CD8", sample_names = cd4_top5_anomalies)
-
- 
-}
-\seealso{
-\code{\link{calculateSampleDistances}}
-}
-\author{
-Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu}
-}
diff --git a/man/plot.calculateSampleSimilarityPCA.Rd b/man/plot.calculateSampleSimilarityPCA.Rd
deleted file mode 100644
index b025237..0000000
--- a/man/plot.calculateSampleSimilarityPCA.Rd
+++ /dev/null
@@ -1,90 +0,0 @@
-% Generated by roxygen2: do not edit by hand
-% Please edit documentation in R/plot.calculateSampleSimilarityPCA.R
-\name{plot.calculateSampleSimilarityPCA}
-\alias{plot.calculateSampleSimilarityPCA}
-\title{Plot Cosine Similarities Between Samples and PCs}
-\usage{
-\method{plot}{calculateSampleSimilarityPCA}(x, pc_subset = c(1:5), ...)
-}
-\arguments{
-\item{x}{An object of class 'calculateSampleSimilarityPCA' containing a dataframe of cosine similarity values 
-between samples and PCs.}
-
-\item{pc_subset}{A numeric vector specifying the subset of principal components to include in the plot (default: c(1:5)).}
-
-\item{...}{Additional arguments passed to the plotting function.}
-}
-\value{
-A ggplot object representing the cosine similarity heatmap.
-}
-\description{
-This function creates a heatmap plot to visualize the cosine similarities between samples and principal components (PCs).
-}
-\details{
-This function reshapes the input data frame to create a long format suitable for plotting as a heatmap. It then
-creates a heatmap plot using ggplot2, where the x-axis represents the PCs, the y-axis represents the samples, and the
-color intensity represents the cosine similarity values.
-}
-\examples{
-# Load required libraries
-library(scRNAseq)
-library(scuttle)
-library(SingleR)
-library(scran)
-library(scater)
-
-# Load data (replace with your data loading)
-sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-
-# Divide the data into reference and query datasets
-set.seed(100)
-indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-ref_data <- sce[, indices]
-query_data <- sce[, -indices]
-
-# log transform datasets
-ref_data <- scuttle::logNormCounts(ref_data)
-query_data <- scuttle::logNormCounts(query_data)
-
-# Get cell type scores using SingleR (or any other cell type annotation method)
-scores <- SingleR::SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-
-# Add labels to query object
-colData(query_data)$labels <- scores$labels
-
-# Selecting highly variable genes (can be customized by the user)
-ref_var <- scran::getTopHVGs(ref_data, n = 2000)
-query_var <- scran::getTopHVGs(query_data, n = 2000)
-
-# Intersect the gene symbols to obtain common genes
-common_genes <- intersect(ref_var, query_var)
-ref_data_subset <- ref_data[common_genes, ]
-query_data_subset <- query_data[common_genes, ]
-
-# Run PCA on the reference data (assumed to be prepared)
-ref_data_subset <- runPCA(ref_data_subset)
-
-# Store PCA anomaly data and plots
-anomaly_output <- detectAnomaly(reference_data = ref_data_subset, 
-                                ref_cell_type_col = "reclustered.broad", 
-                                n_components = 10,
-                                n_tree = 500,
-                                anomaly_treshold = 0.5) 
-top6_anomalies <- names(sort(anomaly_output$Combined$reference_anomaly_scores, 
-                             decreasing = TRUE)[1:6])
-
-# Compute cosine similarity between anomalies and top PCs
-cosine_similarities <- calculateSampleSimilarityPCA(ref_data_subset, samples = top6_anomalies, 
-                                                    pc_subset = c(1:10), n_top_vars = 50)
-cosine_similarities
-
-# Plot similarities
-plot(cosine_similarities, pc_subset = c(1:5))
-
-}
-\seealso{
-\code{\link{calculateSampleSimilarityPCA}}
-}
-\author{
-Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu}
-}
diff --git a/man/plot.compareCCA.Rd b/man/plot.compareCCA.Rd
deleted file mode 100644
index 40f0522..0000000
--- a/man/plot.compareCCA.Rd
+++ /dev/null
@@ -1,87 +0,0 @@
-% Generated by roxygen2: do not edit by hand
-% Please edit documentation in R/plot.compareCCA.R
-\name{plot.compareCCA}
-\alias{plot.compareCCA}
-\title{Plot Visualization of Output from compareCCA Function}
-\usage{
-\method{plot}{compareCCA}(x, ...)
-}
-\arguments{
-\item{x}{A list containing the output from the `compareCCA` function. 
-This list should include `cosine_similarity` and `correlations`.}
-
-\item{...}{Additional arguments passed to the plotting function.}
-}
-\value{
-A ggplot object representing the scatter plot of cosine similarities of CCA coefficients and correlations.
-}
-\description{
-This function generates a visualization of the output from the `compareCCA` function.
-The plot shows the cosine similarities of canonical correlation analysis (CCA) coefficients,
-with point sizes representing the correlations.
-}
-\details{
-The function converts the input list into a data frame suitable for plotting with `ggplot2`.
-Each point in the scatter plot represents the cosine similarity of CCA coefficients, with the size of the point
-indicating the correlation.
-}
-\examples{
-# Load necessary libraries
-library(scRNAseq)
-library(scuttle)
-library(scran)
-library(SingleR)
-library(ggplot2)
-library(scater)
-
-# Load data
-sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-
-# Divide the data into reference and query datasets
-set.seed(100)
-indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-ref_data <- sce[, indices]
-query_data <- sce[, -indices]
-
-# Log transform datasets
-ref_data <- logNormCounts(ref_data)
-query_data <- logNormCounts(query_data)
-
-# Get cell type scores using SingleR (or any other cell type annotation method)
-scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-
-# Add labels to query object
-colData(query_data)$labels <- scores$labels
-
-# Selecting highly variable genes (can be customized by the user)
-ref_var <- getTopHVGs(ref_data, n = 500)
-query_var <- getTopHVGs(query_data, n = 500)
-
-# Intersect the gene symbols to obtain common genes
-common_genes <- intersect(ref_var, query_var)
-ref_data_subset <- ref_data[common_genes, ]
-query_data_subset <- query_data[common_genes, ]
-
-# Subset reference and query data for a specific cell type
-ref_data_subset <- ref_data_subset[, which(ref_data_subset$reclustered.broad == "CD8")]
-query_data_subset <- query_data_subset[, which(colData(query_data_subset)$labels == "CD8")]
-
-# Run PCA on the reference and query datasets
-ref_data_subset <- runPCA(ref_data_subset, ncomponents = 50)
-query_data_subset <- runPCA(query_data_subset, ncomponents = 50)
-
-# Compare CCA
-cca_comparison <- compareCCA(query_data_subset, ref_data_subset, 
-                             pc_subset = c(1:5))
-
-# Visualize output of CCA comparison
-plot(cca_comparison)
-
-
-}
-\seealso{
-\code{\link{compareCCA}}
-}
-\author{
-Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu}
-}
diff --git a/man/plot.comparePCA.Rd b/man/plot.comparePCA.Rd
deleted file mode 100644
index 14662aa..0000000
--- a/man/plot.comparePCA.Rd
+++ /dev/null
@@ -1,90 +0,0 @@
-% Generated by roxygen2: do not edit by hand
-% Please edit documentation in R/plot.comparePCA.R
-\name{plot.comparePCA}
-\alias{plot.comparePCA}
-\title{Plot Heatmap of Cosine Similarities Between Principal Components}
-\usage{
-\method{plot}{comparePCA}(x, ...)
-}
-\arguments{
-\item{x}{A numeric matrix output from the `comparePCA` function, representing 
-cosine similarities between query and reference principal components.}
-
-\item{...}{Additional arguments passed to the plotting function.}
-}
-\value{
-A ggplot object representing the heatmap of cosine similarities.
-}
-\description{
-This function generates a heatmap to visualize the cosine similarities between 
-principal components from the output of the `comparePCA` function.
-}
-\details{
-The function converts the input matrix into a long-format data frame 
-suitable for plotting with `ggplot2`. The rows in the heatmap are ordered in 
-reverse to match the conventional display format. The heatmap uses a blue-white-red 
-color gradient to represent cosine similarity values, where blue indicates negative 
-similarity, white indicates zero similarity, and red indicates positive similarity.
-}
-\examples{
-# Load necessary libraries
-library(scRNAseq)
-library(scuttle)
-library(scran)
-library(SingleR)
-library(ComplexHeatmap)
-library(scater)
-
-# Load data
-sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-
-# Divide the data into reference and query datasets
-set.seed(100)
-indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-ref_data <- sce[, indices]
-query_data <- sce[, -indices]
-
-# Log transform datasets
-ref_data <- logNormCounts(ref_data)
-query_data <- logNormCounts(query_data)
-
-# Get cell type scores using SingleR (or any other cell type annotation method)
-scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-
-# Add labels to query object
-colData(query_data)$labels <- scores$labels
-
-# Selecting highly variable genes (can be customized by the user)
-ref_var <- getTopHVGs(ref_data, n = 500)
-query_var <- getTopHVGs(query_data, n = 500)
-
-# Intersect the gene symbols to obtain common genes
-common_genes <- intersect(ref_var, query_var)
-ref_data_subset <- ref_data[common_genes, ]
-query_data_subset <- query_data[common_genes, ]
-
-# Subset reference and query data for a specific cell type
-ref_data_subset <- ref_data_subset[, which(ref_data_subset$reclustered.broad == "CD8")]
-query_data_subset <- query_data_subset[, which(colData(query_data_subset)$labels == "CD8")]
-
-# Run PCA on the reference and query datasets
-ref_data_subset <- runPCA(ref_data_subset)
-query_data_subset <- runPCA(query_data_subset)
-
-# Call the PCA comparison function
-similarity_mat <- comparePCA(query_data_subset, ref_data_subset, 
-                             pc_subset = c(1:5), 
-                             metric = "cosine", 
-                             correlation_method = "spearman")
-
-# Create the heatmap
-plot(similarity_mat)
-
-
-}
-\seealso{
-\code{\link{comparePCA}}
-}
-\author{
-Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu}
-}
diff --git a/man/plot.comparePCASubspace.Rd b/man/plot.comparePCASubspace.Rd
deleted file mode 100644
index 5889cfb..0000000
--- a/man/plot.comparePCASubspace.Rd
+++ /dev/null
@@ -1,87 +0,0 @@
-% Generated by roxygen2: do not edit by hand
-% Please edit documentation in R/plot.comparePCASubspace.R
-\name{plot.comparePCASubspace}
-\alias{plot.comparePCASubspace}
-\title{Plot Visualization of Output from comparePCASubspace Function}
-\usage{
-\method{plot}{comparePCASubspace}(x, ...)
-}
-\arguments{
-\item{x}{A list output from the `comparePCASubspace` function, containing the 
-cosines of the principal angles between reference and query principal components.}
-
-\item{...}{Additional arguments passed to the plotting function.}
-}
-\value{
-A ggplot object representing the scatter plot of the cosines of the principal angles.
-}
-\description{
-This function generates a visualization of the output from the `comparePCASubspace` function.
-The plot shows the cosine of principal angles between reference and query principal components,
-with point sizes representing the variance explained.
-}
-\details{
-The function converts the input list into a data frame suitable for plotting with `ggplot2`.
-Each point in the scatter plot represents the cosine of a principal angle, with the size of the point
-indicating the average variance explained by the corresponding principal components.
-}
-\examples{
-# Load necessary libraries
-library(scRNAseq)
-library(scuttle)
-library(scran)
-library(SingleR)
-library(ggplot2)
-library(scater)
-
-# Load data
-sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-
-# Divide the data into reference and query datasets
-set.seed(100)
-indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-ref_data <- sce[, indices]
-query_data <- sce[, -indices]
-
-# Log transform datasets
-ref_data <- logNormCounts(ref_data)
-query_data <- logNormCounts(query_data)
-
-# Get cell type scores using SingleR (or any other cell type annotation method)
-scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-
-# Add labels to query object
-colData(query_data)$labels <- scores$labels
-
-# Selecting highly variable genes (can be customized by the user)
-ref_var <- getTopHVGs(ref_data, n = 500)
-query_var <- getTopHVGs(query_data, n = 500)
-
-# Intersect the gene symbols to obtain common genes
-common_genes <- intersect(ref_var, query_var)
-ref_data_subset <- ref_data[common_genes, ]
-query_data_subset <- query_data[common_genes, ]
-
-# Subset reference and query data for a specific cell type
-ref_data_subset <- ref_data_subset[, which(ref_data_subset$reclustered.broad == "CD8")]
-query_data_subset <- query_data_subset[, which(colData(query_data_subset)$labels == "CD8")]
-
-# Run PCA on the reference and query datasets
-ref_data_subset <- runPCA(ref_data_subset, ncomponents = 50)
-query_data_subset <- runPCA(query_data_subset, ncomponents = 50)
-
-# Compare PCA subspaces
-subspace_comparison <- comparePCASubspace(query_data_subset, ref_data_subset, 
-                                          pc_subset = c(1:5))
-
-# Visualize output of PCA subspace comparison
-plot(subspace_comparison)
-
-
-}
-\seealso{
-\code{\link{comparePCASubspace}}
-}
-\author{
-Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu}
-}
diff --git a/man/plot.detectAnomaly.Rd b/man/plot.detectAnomaly.Rd
deleted file mode 100644
index a3ca284..0000000
--- a/man/plot.detectAnomaly.Rd
+++ /dev/null
@@ -1,99 +0,0 @@
-% Generated by roxygen2: do not edit by hand
-% Please edit documentation in R/plot.detectAnomaly.R
-\name{plot.detectAnomaly}
-\alias{plot.detectAnomaly}
-\title{Create Faceted Scatter Plots for Specified PC Combinations From \code{detectAnomaly} Object}
-\usage{
-\method{plot}{detectAnomaly}(
-  x,
-  cell_type = NULL,
-  pc_subset = NULL,
-  data_type = c("query", "reference"),
-  ...
-)
-}
-\arguments{
-\item{x}{A list object containing the anomaly detection results from the \code{detectAnomaly} function. 
-Each element of the list should correspond to a cell type and contain \code{reference_mat_subset}, \code{query_mat_subset}, 
-\code{var_explained}, and \code{anomaly}.}
-
-\item{cell_type}{A character string specifying the cell type for which the plots should be generated. This should
-be a name present in \code{x}. If NULL, the "Combined" cell type will be plotted. Default is NULL.}
-
-\item{pc_subset}{A numeric vector specifying the indices of the PCs to be included in the plots. If NULL, all PCs
-in \code{reference_mat_subset} will be included.}
-
-\item{data_type}{A character string specifying whether to plot the "query" data or the "reference" data. Default is "query".}
-
-\item{...}{Additional arguments.}
-}
-\value{
-A ggplot2 object representing the PCA plots with anomalies highlighted.
-}
-\description{
-This function generates faceted scatter plots for specified principal component (PC) combinations
-within an anomaly detection object. It allows visualization of the relationship between specified
-PCs and highlights anomalies detected by the Isolation Forest algorithm.
-}
-\details{
-The function extracts the specified PCs from the given anomaly detection object and generates
-scatter plots for each pair of PCs. It uses \code{ggplot2} to create a faceted plot where each facet represents
-a pair of PCs. Anomalies are highlighted in red, while normal points are shown in black.
-}
-\examples{
-# Load required libraries
-library(scRNAseq)
-library(scuttle)
-library(SingleR)
-library(scran)
-library(scater)
-
-# Load data
-sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-
-# Divide the data into reference and query datasets
-set.seed(100)
-indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-ref_data <- sce[, indices]
-query_data <- sce[, -indices]
-
-# log transform datasets
-ref_data <- logNormCounts(ref_data)
-query_data <- logNormCounts(query_data)
-
-# Get cell type scores using SingleR (or any other cell type annotation method)
-scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-
-# Add labels to query object
-colData(query_data)$labels <- scores$labels
-
-# Selecting highly variable genes (can be customized by the user)
-ref_var <- getTopHVGs(ref_data, n = 2000)
-query_var <- getTopHVGs(query_data, n = 2000)
-
-# Intersect the gene symbols to obtain common genes
-common_genes <- intersect(ref_var, query_var)
-ref_data_subset <- ref_data[common_genes, ]
-query_data_subset <- query_data[common_genes, ]
-
-# Run PCA on the reference data
-ref_data_subset <- runPCA(ref_data_subset, ncomponents = 50)
-
-# Store PCA anomaly data and plots
-anomaly_output <- detectAnomaly(ref_data_subset, query_data_subset, 
-                                ref_cell_type_col = "reclustered.broad", 
-                                query_cell_type_col = "labels",
-                                n_components = 10,
-                                n_tree = 500,
-                                anomaly_treshold = 0.5) 
-
-# Plot the output for a cell type
-plot(anomaly_output, cell_type = "CD8", pc_subset = c(1:5), data_type = "query")
-
-}
-\seealso{
-\code{\link{detectAnomaly}}
-}
-\author{
-Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu}
-}
diff --git a/man/plot.nearestNeighborDiagnostics.Rd b/man/plot.nearestNeighborDiagnostics.Rd
deleted file mode 100644
index 2464969..0000000
--- a/man/plot.nearestNeighborDiagnostics.Rd
+++ /dev/null
@@ -1,87 +0,0 @@
-% Generated by roxygen2: do not edit by hand
-% Please edit documentation in R/plot.nearestNeighborDiagnostics.R
-\name{plot.nearestNeighborDiagnostics}
-\alias{plot.nearestNeighborDiagnostics}
-\title{Plot Density of Probabilities for Cell Type Classification}
-\usage{
-\method{plot}{nearestNeighborDiagnostics}(x, cell_types = NULL, prob_type = c("query", "reference")[1], ...)
-}
-\arguments{
-\item{x}{An object of class \code{nearestNeighborDiagnostics} containing the probabilities calculated by the \code{\link{nearestNeighborDiagnostics}} function.}
-
-\item{cell_types}{A character vector specifying the cell types to include in the plot. If NULL, all cell types in \code{x} will be plotted. Default is NULL.}
-
-\item{prob_type}{A character string specifying the type of probability to plot. Must be either "query" or "reference". Default is "query".}
-
-\item{...}{Additional arguments to be passed to \code{\link[ggplot2]{geom_density}}.}
-}
-\value{
-A ggplot2 density plot.
-}
-\description{
-This function generates a density plot showing, for each cell type, the distribution of the probabilities that each sample belongs to 
-either the reference or query dataset.
-}
-\details{
-This function creates a density plot to visualize the distribution of probabilities for each sample belonging to the 
-reference or query dataset for each cell type. It utilizes the ggplot2 package for plotting.
-}
-\examples{
-# Load necessary libraries
-library(scRNAseq)
-library(scuttle)
-library(scran)
-library(SingleR)
-library(scater)
-
-# Load data
-sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-
-# Divide the data into reference and query datasets
-set.seed(100)
-indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-ref_data <- sce[, indices]
-query_data <- sce[, -indices]
-
-# log transform datasets
-ref_data <- logNormCounts(ref_data)
-query_data <- logNormCounts(query_data)
-
-# Get cell type scores using SingleR (or any other cell type annotation method)
-scores <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-
-# Add labels to query object
-colData(query_data)$labels <- scores$labels
-
-# Selecting highly variable genes (can be customized by the user)
-ref_var <- getTopHVGs(ref_data, n = 500)
-query_var <- getTopHVGs(query_data, n = 500)
-
-# Intersect the gene symbols to obtain common genes
-common_genes <- intersect(ref_var, query_var)
-ref_data_subset <- ref_data[common_genes, ]
-query_data_subset <- query_data[common_genes, ]
-
-# Run PCA on the reference data
-ref_data_subset <- runPCA(ref_data_subset)
-
-# Run the nearest neighbor diagnostics on the query and reference data
-nn_output <- nearestNeighborDiagnostics(query_data_subset, ref_data_subset,
-                                        n_neighbor = 15, 
-                                        n_components = 10,
-                                        pc_subset = c(1:10),
-                                        query_cell_type_col = "labels", 
-                                        ref_cell_type_col = "reclustered.broad")
-
-# Plot output
-plot(nn_output, cell_types = c("CD4", "CD8", "B_and_plasma", "Myeloid"),
-     prob_type = "query")
-
-
-}
-\seealso{
-\code{\link{nearestNeighborDiagnostics}}
-}
-\author{
-Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu}
-}
diff --git a/man/plotGeneExpressionDimred.Rd b/man/plotGeneExpressionDimred.Rd
deleted file mode 100644
index 2d9021e..0000000
--- a/man/plotGeneExpressionDimred.Rd
+++ /dev/null
@@ -1,52 +0,0 @@
-% Generated by roxygen2: do not edit by hand
-% Please edit documentation in R/plotGeneExpressionDimred.R
-\name{plotGeneExpressionDimred}
-\alias{plotGeneExpressionDimred}
-\title{Visualize gene expression on a dimensional reduction plot}
-\usage{
-plotGeneExpressionDimred(se_object, method, n_components = c(1, 2), feature)
-}
-\arguments{
-\item{se_object}{An object of class "SingleCellExperiment" containing a log-transformed expression matrix and other metadata.
-It can be either a reference or query dataset.}
-
-\item{method}{The reduction method to use for visualization. It should be one of the supported methods: "tSNE", "UMAP", or "PCA".}
-
-\item{n_components}{A numeric vector of length 2 indicating the first two dimensions to be used for plotting.}
-
-\item{feature}{A character string representing the name of the gene or feature to be visualized.}
-}
-\value{
-A ggplot object representing the dimensional reduction plot with gene expression.
-}
-\description{
-This function plots gene expression on a dimensional reduction plot using methods like t-SNE, UMAP, or PCA. Each single cell is color-coded based on the expression of a specific gene or feature.
-}
-\examples{
-library(scater)
-library(scran)
-library(scRNAseq)
-
-# Load data
-sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-
-# Divide the data into reference and query datasets
-set.seed(100)
-indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-ref_data <- sce[, indices]
-query_data <- sce[, -indices]
-
-# Log transform datasets
-query_data <- logNormCounts(query_data)
-
-# Run PCA
-query_data <- runPCA(query_data)
-
-# Plot gene expression on PCA plot
-plotGeneExpressionDimred(se_object = query_data, 
-                         method = "PCA", 
-                         n_components = c(1, 2), 
-                         feature = "VPREB3")
-
-
-}
diff --git a/man/plotGeneSetScores.Rd b/man/plotGeneSetScores.Rd
deleted file mode 100644
index dd18d86..0000000
--- a/man/plotGeneSetScores.Rd
+++ /dev/null
@@ -1,78 +0,0 @@
-% Generated by roxygen2: do not edit by hand
-% Please edit documentation in R/plotGeneSetScores.R
-\name{plotGeneSetScores}
-\alias{plotGeneSetScores}
-\title{Visualization of gene sets or pathway scores on dimensional reduction plot}
-\usage{
-plotGeneSetScores(se_object, method, feature, pc_subset = c(1:5))
-}
-\arguments{
-\item{se_object}{An object of class "SingleCellExperiment" containing a numeric expression matrix and other metadata.
-It can be either a reference or query dataset.}
-
-\item{method}{A character string indicating the method for visualization ("PCA", "TSNE", or "UMAP").}
-
-\item{feature}{A character string representing the name of the feature (score) in the \code{colData} of \code{se_object} to plot.}
-
-\item{pc_subset}{An optional vector specifying the principal components (PCs) to include in the plot if method = "PCA". 
-Default is c(1:5).}
-}
-\value{
-A ggplot2 object representing the gene set scores plotted on the specified reduced dimensions.
-}
-\description{
-Plot gene sets or pathway scores on PCA, TSNE, or UMAP. Single cells are color-coded by scores of gene sets or pathways.
-}
-\details{
-This function plots gene set scores on reduced dimensions such as PCA, t-SNE, or UMAP. 
-It extracts the reduced dimensions from the provided SingleCellExperiment object.
-Gene set scores are visualized as a scatter plot with colors indicating the scores.
-For PCA, the function automatically includes the percentage of variance explained 
-in the plot's legend.
-}
-\examples{
-library(scater)
-library(scran)
-library(scRNAseq)
-library(AUCell)
-
-# Load data
-sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-
-# Divide the data into reference and query datasets
-set.seed(100)
-indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-ref_data <- sce[, indices]
-query_data <- sce[, -indices]
-
-## log transform datasets
-ref_data <- logNormCounts(ref_data)
-query_data <- logNormCounts(query_data)
-
-# Run PCA on the query data
-query_data <- runPCA(query_data)
-
-# Compute scores using AUCell
-expression_matrix <- assay(query_data, "logcounts")
-cells_rankings <- AUCell_buildRankings(expression_matrix, plotStats = FALSE)
-# Generate gene sets
-gene_set1 <- sample(rownames(expression_matrix), 10)
-gene_set2 <- sample(rownames(expression_matrix), 20)
-gene_sets <- list(geneSet1 = gene_set1, geneSet2 = gene_set2)
-
-# Calculate AUC scores for gene sets
-cells_AUC <- AUCell_calcAUC(gene_sets, cells_rankings)
-
-# Assign scores to colData (users should ensure that the scores are present in the colData)
-colData(query_data)$geneSetScores <- assay(cells_AUC)["geneSet1", ] 
-
-# Plot gene set scores on PCA
-plotGeneSetScores(se_object = query_data, 
-                  method = "PCA", 
-                  feature = "geneSetScores",
-                  pc_subset = c(1:5))
-
-# Note: Users can provide their own gene set scores in the colData of the 'se_object' object, 
-# using any method of their choice.
-
-}
diff --git a/man/plotMarkerExpression.Rd b/man/plotMarkerExpression.Rd
deleted file mode 100644
index 780c894..0000000
--- a/man/plotMarkerExpression.Rd
+++ /dev/null
@@ -1,79 +0,0 @@
-% Generated by roxygen2: do not edit by hand
-% Please edit documentation in R/plotMarkerExpression.R
-\name{plotMarkerExpression}
-\alias{plotMarkerExpression}
-\title{Plot gene expression distribution from overall and cell type-specific perspective}
-\usage{
-plotMarkerExpression(
-  reference_data,
-  query_data,
-  ref_cell_type_col,
-  query_cell_type_col,
-  gene_name,
-  label
-)
-}
-\arguments{
-\item{reference_data}{A \code{\linkS4class{SingleCellExperiment}} object containing a numeric expression matrix for the reference cells.}
-
-\item{query_data}{A \code{\linkS4class{SingleCellExperiment}} object containing a numeric expression matrix for the query cells.}
-
-\item{ref_cell_type_col}{character. The column name in the \code{colData} of \code{reference_data} that identifies the cell types.}
-
-\item{query_cell_type_col}{character. The column name in the \code{colData} of \code{query_data} that identifies the cell types.}
-
-\item{gene_name}{character. A string representing the gene name for which the distribution is to be visualized.}
-
-\item{label}{character. A vector of cell type labels to plot (e.g., c("T-cell", "B-cell")).}
-}
-\value{
-A gtable object containing two arranged density plots as grobs. 
-        The first plot shows the overall gene expression distribution, 
-        and the second plot displays the cell type-specific expression 
-        distribution.
-}
-\description{
-This function generates density plots to visualize the distribution of gene expression values 
-for a specific gene across the overall dataset and within a specified cell type.
-}
-\details{
-This function generates density plots to compare the distribution of a specific marker 
-gene between reference and query datasets. The aim is to inspect the alignment of gene expression 
-levels as a surrogate for dataset similarity. Similar distributions suggest a good alignment, 
-while differences may indicate discrepancies or incompatibilities between the datasets.
-}
-\examples{
-library(scater)
-library(scran)
-library(scRNAseq)
-library(SingleR)
-
-# Load data
-sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-
-# Divide the data into reference and query datasets
-set.seed(100)
-indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-ref_data <- sce[, indices]
-query_data <- sce[, -indices]
-
-# Log transform datasets
-ref_data <- logNormCounts(ref_data)
-query_data <- logNormCounts(query_data)
-
-# Get cell type scores using SingleR or any other method
-pred <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-
-# Add labels to query object
-colData(query_data)$labels <- pred$labels
-
-# Note: Users can use SingleR or any other method to obtain the cell type annotations.
-plotMarkerExpression(reference_data = ref_data, 
-                     query_data = query_data, 
-                     ref_cell_type_col = "reclustered.broad", 
-                     query_cell_type_col = "labels", 
-                     gene_name = "VPREB3", 
-                     label = "B_and_plasma")
-
-
-}
diff --git a/man/plotQCvsAnnotation.Rd b/man/plotQCvsAnnotation.Rd
deleted file mode 100644
index ac2b417..0000000
--- a/man/plotQCvsAnnotation.Rd
+++ /dev/null
@@ -1,88 +0,0 @@
-% Generated by roxygen2: do not edit by hand
-% Please edit documentation in R/plotQCvsAnnotation.R
-\name{plotQCvsAnnotation}
-\alias{plotQCvsAnnotation}
-\title{Scatter plot: QC stats vs Cell Type Annotation Scores}
-\usage{
-plotQCvsAnnotation(query_data, qc_col, label_col, score_col, label = NULL)
-}
-\arguments{
-\item{query_data}{A \code{\linkS4class{SingleCellExperiment}} containing the single-cell 
-expression data and metadata.}
-
-\item{qc_col}{character. A column name in the \code{colData} of \code{query_data} that 
-contains the QC stats of interest.}
-
-\item{label_col}{character. The column name in the \code{colData} of \code{query_data} 
-that contains the cell type labels.}
-
-\item{score_col}{character. The column name in the \code{colData} of \code{query_data} that 
-contains the cell type annotation scores.}
-
-\item{label}{character. A vector of cell type labels to plot (e.g., c("T-cell", "B-cell")).
-Defaults to \code{NULL}, which will include all the cells.}
-}
-\value{
-A ggplot object displaying a scatter plot of QC stats vs annotation scores, 
-        where each point represents a cell, color-coded by its cell type.
-}
-\description{
-Creates a scatter plot to visualize the relationship between QC stats (e.g., library size) 
-and cell type annotation scores for one or more cell types.
-}
-\details{
-This function generates a scatter plot to explore the relationship between various quality 
-control (QC) statistics, such as library size and mitochondrial percentage, and cell type 
-annotation scores. By examining these relationships, users can assess whether specific QC 
-metrics systematically influence the confidence in cell type annotations, 
-which is essential for ensuring reliable cell type annotation.
-}
-\examples{
-\donttest{
-library(celldex)
-library(scater)
-library(scran)
-library(scRNAseq)
-library(SingleR)
-
-# load reference dataset
-ref_data <- fetchReference("hpca", "2024-02-26")
-
-# Load query dataset (Bunis haematopoietic stem and progenitor cell data) from 
-# Bunis DG et al. (2021). Single-Cell Mapping of Progressive Fetal-to-Adult 
-# Transition in Human Naive T Cells. Cell Rep. 34(1): 108573
-query_data <- BunisHSPCData()
-rownames(query_data) <- rowData(query_data)$Symbol
-
-# Add QC metrics to query data
-query_data <- addPerCellQCMetrics(query_data)
-
-# Log transform query dataset
-query_data <- logNormCounts(query_data)
-
-# Run SingleR to predict cell types
-
-pred <- SingleR(query_data, ref_data, labels = ref_data$label.main)
-
-# Assign predicted labels to query data
-colData(query_data)$pred.labels <- pred$labels
-
-# Get annotation scores
-scores <- apply(pred$scores, 1, max)
-
-# Assign scores to query data
-colData(query_data)$cell_scores <- scores
-
-# Create a scatter plot between library size and annotation scores
-
-p1 <- plotQCvsAnnotation(
-      query_data = query_data,
-      qc_col = "total",
-      label_col = "pred.labels",
-      score_col = "cell_scores",
-      label = NULL)
-p1 + xlab("Library Size")
-}
-
-                   
-}
diff --git a/man/projectPCA.Rd b/man/projectPCA.Rd
deleted file mode 100644
index f03b9a8..0000000
--- a/man/projectPCA.Rd
+++ /dev/null
@@ -1,125 +0,0 @@
-% Generated by roxygen2: do not edit by hand
-% Please edit documentation in R/projectPCA.R
-\name{projectPCA}
-\alias{projectPCA}
-\title{Project Query Data Onto PCA Space of Reference Data}
-\usage{
-projectPCA(
-  query_data,
-  reference_data,
-  n_components = 10,
-  query_cell_type_col = NULL,
-  ref_cell_type_col = NULL,
-  return_value = c("data.frame", "list")[1]
-)
-}
-\arguments{
-\item{query_data}{A \code{\linkS4class{SingleCellExperiment}} object containing a numeric expression matrix for the query cells.}
-
-\item{reference_data}{A \code{\linkS4class{SingleCellExperiment}} object containing a numeric expression matrix for the reference cells.}
-
-\item{n_components}{An integer specifying the number of principal components to use for projection. Defaults to 10. 
-Must be less than or equal to the number of components available in the reference PCA.}
-
-\item{query_cell_type_col}{character. The column name in the \code{colData} of \code{query_data} 
-that identifies the cell types.}
-
-\item{ref_cell_type_col}{character. The column name in the \code{colData} of \code{reference_data} 
-that identifies the cell types.}
-
-\item{return_value}{A character string specifying the format of the returned data. Can be \code{data.frame} (combined reference 
-and query projections) or \code{list} (separate lists for reference and query projections) (default = \code{data.frame}).}
-}
-\value{
-A \code{data.frame} containing the projected data in rows (reference and query data combined) or a \code{list} containing 
-separate matrices for reference and query projections, depending on the \code{return_value} argument.
-}
-\description{
-This function projects a query SingleCellExperiment object onto the PCA space of a reference 
-SingleCellExperiment object. The PCA analysis on the reference data is assumed to be pre-computed and stored within the object.
-}
-\details{
-This function assumes that the "PCA" element exists within the \code{reducedDims} of the reference data 
-(obtained using \code{reducedDim(reference_data)}) and that the genes used for PCA are present in both the reference and query data. 
-It performs centering and scaling of the query data based on the reference data before projection.
-}
-\examples{
-# Load required libraries
-library(scRNAseq)
-library(scuttle)
-library(SingleR)
-library(scran)
-library(scater)
-library(RColorBrewer)
-
-# Load data (replace with your data loading)
-sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-
-# Divide the data into reference and query datasets
-set.seed(100)
-indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-ref_data <- sce[, indices]
-query_data <- sce[, -indices]
-
-# log transform datasets
-ref_data <- scuttle::logNormCounts(ref_data)
-query_data <- scuttle::logNormCounts(query_data)
-
-# Get cell type scores using SingleR (or any other cell type annotation method)
-scores <- SingleR::SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-
-# Add labels to query object
-colData(query_data)$labels <- scores$labels
-
-# Selecting highly variable genes (can be customized by the user)
-ref_var <- scran::getTopHVGs(ref_data, n = 2000)
-query_var <- scran::getTopHVGs(query_data, n = 2000)
-
-# Intersect the gene symbols to obtain common genes
-common_genes <- intersect(ref_var, query_var)
-ref_data_subset <- ref_data[common_genes, ]
-query_data_subset <- query_data[common_genes, ]
-
-# Run PCA on the reference data (assumed to be prepared)
-ref_data_subset <- runPCA(ref_data_subset)
-
-# Project the query data onto PCA space of reference
-pca_output <- projectPCA(query_data_subset, ref_data_subset,
-                         n_components = 10,
-                         query_cell_type_col = "labels",
-                         ref_cell_type_col = "reclustered.broad",
-                         return_value = c("data.frame", "list")[1])
-
-# Compute t-SNE and UMAP using first 10 PCs
-tsne_data <- data.frame(calculateTSNE(t(pca_output[, paste0("PC", 1:10)])))
-umap_data <- data.frame(calculateUMAP(t(pca_output[, paste0("PC", 1:10)])))
-
-# Combine the cell type labels from both datasets
-tsne_data$Type <- paste(pca_output$dataset, pca_output$cell_type)
-
-# Define the cell types and legend order
-legend_order <- c("Query CD8",
-                  "Reference CD8",
-                  "Query CD4",
-                  "Reference CD4",
-                  "Query B_and_plasma",
-                  "Reference B_and_plasma")
-
-# Define the colors for cell types
-color_palette <- brewer.pal(length(legend_order), "Paired")
-color_mapping <- setNames(color_palette, legend_order)
-cell_type_colors <- color_mapping[legend_order]
-
-# Visualize t-SNE output
-tsne_plot <- ggplot(tsne_data[tsne_data$Type \%in\% legend_order,],
-                    aes(x = TSNE1, y = TSNE2, color = factor(Type, levels = legend_order))) +
-    geom_point(alpha = 0.5, size = 1) +
-    scale_color_manual(values = cell_type_colors) +
-    theme_bw() +
-    guides(color = guide_legend(title = "Cell Types"))
-
-
-}
-\author{
-Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu}
-}
diff --git a/man/regressPC.Rd b/man/regressPC.Rd
deleted file mode 100644
index 8afbbbd..0000000
--- a/man/regressPC.Rd
+++ /dev/null
@@ -1,121 +0,0 @@
-% Generated by roxygen2: do not edit by hand
-% Please edit documentation in R/regressPC.R
-\name{regressPC}
-\alias{regressPC}
-\alias{plotPCRegression}
-\title{Principal component regression}
-\usage{
-regressPC(sce, dep.vars = NULL, indep.var)
-
-plotPCRegression(sce, regressPC_res, dep.vars = NULL, indep.var, max_pc = 20)
-}
-\arguments{
-\item{sce}{An object of class \code{\linkS4class{SingleCellExperiment}}
-containing the data for regression analysis.}
-
-\item{dep.vars}{character. Dependent variable(s). Determines which principal
-component(s) (e.g., "PC1", "PC2", etc.) are used as response variables.
-Principal components are expected to be stored in a PC matrix named
-\code{"PCA"} in the \code{reducedDims} of \code{sce}. Defaults to
-\code{NULL} which will then regress on each principal component present in
-the PC matrix.}
-
-\item{indep.var}{character. Independent variable. A column name in the
-\code{colData} of \code{sce} specifying the explanatory variable.}
-
-\item{regressPC_res}{a result from \code{\link{regressPC}}}
-
-\item{max_pc}{The maximum number of PCs to show on the plot. Set to 0 to show
-all.}
-}
-\value{
-A \code{list} containing \itemize{ \item summaries of the linear
-  regression models for each specified principal component, \item the
-  corresponding R-squared (R2) values, \item the variance contributions for
-  each principal component, and \item the total variance explained.}
-}
-\description{
-This function performs linear regression of a covariate of interest onto one
-or more principal components, based on the data in a SingleCellExperiment
-object.
-}
-\details{
-Principal component regression, derived from PCA, can be used to
-  quantify the variance explained by a covariate of interest. Applications for
-  single-cell analysis include quantification of batch removal, assessing
-  clustering homogeneity, and evaluation of alignment of query and reference
-  datasets in cell type annotation settings.  Briefly, the R^2 is calculated
-  from a linear regression of the covariate B of interest onto each principal
-  component. The variance contribution of the covariate effect per principal
-  component is then calculated as the product of the variance explained by
-  the ith principal component (PC) and the corresponding R^2(PC_i|B). The sum
-  across all variance contributions by the covariate effects in all principal
-  components gives the total variance explained by the covariate as follows:
-
-  Var(C|B) = sum_{i=1}^G Var(C|PC_i) * R^2(PC_i|B)
-
-  where Var(C|PC_i) is the variance of the data matrix C explained by the ith
-  principal component. See references.
-
-  If the input is large (>3e4 cells) and the independent variable is
-  categorical with >10 categories, this function will use a stripped down
-  linear model function that is faster but doesn't return all the same
-  components. Namely, the \code{regression.summaries} component of the result
-  will contain only the R^2 values, nothing else.
-}
-\examples{
-library(scater)
-library(scran)
-library(scRNAseq)
-library(SingleR)
-
-# Load data
-sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-
-# Divide the data into reference and query datasets
-set.seed(100)
-indices <- sample(ncol(sce),
-    size = floor(0.7 * ncol(sce)),
-    replace = FALSE
-)
-ref <- sce[, indices]
-query <- sce[, -indices]
-
-# log transform datasets
-ref <- logNormCounts(ref)
-query <- logNormCounts(query)
-
-# Run PCA
-query <- runPCA(query)
-
-# Get cell type scores using SingleR
-# Note: replace when using cell type annotation scores from other methods
-scores <- SingleR(query, ref, labels = ref$reclustered.broad)
-
-# Add labels to query object
-query$labels <- scores$labels
-
-# Specify the dependent variables (principal components) and
-# independent variable (e.g., "labels")
-dep.vars <- paste0("PC", 1:3)
-indep.var <- "labels"
-
-# Perform linear regression on multiple principal components
-res <- regressPC(
-    sce = query,
-    dep.vars = dep.vars,
-    indep.var = indep.var
-)
-
-# Obtain linear regression summaries and R-squared values
-res$regression.summaries
-res$rsquared
-
-
-plotPCRegression(query, res, dep.vars, indep.var)
-
-}
-\references{
-Luecken et al. Benchmarking atlas-level data integration in
-  single-cell genomics. Nature Methods, 19:41-50, 2022.
-}
diff --git a/man/visualizeCellTypeMDS.Rd b/man/visualizeCellTypeMDS.Rd
deleted file mode 100644
index 84dad90..0000000
--- a/man/visualizeCellTypeMDS.Rd
+++ /dev/null
@@ -1,85 +0,0 @@
-% Generated by roxygen2: do not edit by hand
-% Please edit documentation in R/visualizeCellTypeMDS.R
-\name{visualizeCellTypeMDS}
-\alias{visualizeCellTypeMDS}
-\title{Visualizing Reference and Query Cell Types using MDS}
-\usage{
-visualizeCellTypeMDS(
-  query_data,
-  reference_data,
-  cell_types = NULL,
-  query_cell_type_col,
-  ref_cell_type_col
-)
-}
-\arguments{
-\item{query_data}{A \code{\linkS4class{SingleCellExperiment}} containing the single-cell 
-expression data and metadata.}
-
-\item{reference_data}{A \code{\linkS4class{SingleCellExperiment}} object containing the single-cell 
-expression data and metadata.}
-
-\item{cell_types}{A character vector specifying the cell types to include in the plot. If NULL, all cell types are included.}
-
-\item{query_cell_type_col}{character. The column name in the \code{colData} of \code{query_data} 
-that identifies the cell types.}
-
-\item{ref_cell_type_col}{character. The column name in the \code{colData} of \code{reference_data} 
-that identifies the cell types.}
-}
-\value{
-A ggplot object representing the MDS scatter plot with cell type coloring.
-}
-\description{
-This function facilitates the assessment of similarity between reference and query datasets 
-through Multidimensional Scaling (MDS) scatter plots. It allows the visualization of cell types, 
-color-coded with user-defined custom colors, based on a dissimilarity matrix computed from a 
-user-selected gene set.
-}
-\details{
-To evaluate dataset similarity, the function selects specific subsets of cells from 
-both reference and query datasets. It then calculates Spearman correlations between gene expression profiles, 
-deriving a dissimilarity matrix. This matrix undergoes Classical Multidimensional Scaling (MDS) for 
-visualization, presenting cell types in a scatter plot, distinguished by colors defined by the user.
-}
-\examples{
-library(scater)
-library(scran)
-library(scRNAseq)
-
-# Load data (replace with your data loading)
-sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-
-# Divide the data into reference and query datasets
-set.seed(100)
-indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-ref_data <- sce[, indices]
-query_data <- sce[, -indices]
-
-# log transform datasets
-ref_data <- scuttle::logNormCounts(ref_data)
-query_data <- scuttle::logNormCounts(query_data)
-
-# Get cell type scores using SingleR (or any other cell type annotation method)
-scores <- SingleR::SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-
-# Add labels to query object
-colData(query_data)$labels <- scores$labels
-
-# Selecting highly variable genes (can be customized by the user)
-ref_var <- scran::getTopHVGs(ref_data, n = 2000)
-query_var <- scran::getTopHVGs(query_data, n = 2000)
-
-# Intersect the gene symbols to obtain common genes
-common_genes <- intersect(ref_var, query_var)
-ref_data_subset <- ref_data[common_genes, ]
-query_data_subset <- query_data[common_genes, ]
-
-# Generate the MDS scatter plot with cell type coloring
-plot <- visualizeCellTypeMDS(query_data = query_data_subset, 
-                             reference_data = ref_data_subset, 
-                             query_cell_type_col = "labels", 
-                             ref_cell_type_col = "reclustered.broad")
-print(plot)
-
-}
diff --git a/man/visualizeCellTypePCA.Rd b/man/visualizeCellTypePCA.Rd
deleted file mode 100644
index 3d62fb0..0000000
--- a/man/visualizeCellTypePCA.Rd
+++ /dev/null
@@ -1,97 +0,0 @@
-% Generated by roxygen2: do not edit by hand
-% Please edit documentation in R/visualizeCellTypePCA.R
-\name{visualizeCellTypePCA}
-\alias{visualizeCellTypePCA}
-\title{Visualize Principal Components for Different Cell Types}
-\usage{
-visualizeCellTypePCA(
-  query_data,
-  reference_data,
-  n_components = 10,
-  cell_types = NULL,
-  query_cell_type_col,
-  ref_cell_type_col,
-  pc_subset = c(1:5)
-)
-}
-\arguments{
-\item{query_data}{A \code{\linkS4class{SingleCellExperiment}} object containing a numeric expression matrix for the query cells.}
-
-\item{reference_data}{A \code{\linkS4class{SingleCellExperiment}} object containing a numeric expression matrix for the reference cells.}
-
-\item{n_components}{An integer specifying the number of principal components to use for projection. Defaults to 10. 
-Must be less than or equal to the number of components available in the reference PCA.}
-
-\item{cell_types}{A character vector specifying the cell types to include in the plot. If NULL, all cell types are included.}
-
-\item{query_cell_type_col}{character. The column name in the \code{colData} of \code{query_data} 
-that identifies the cell types.}
-
-\item{ref_cell_type_col}{character. The column name in the \code{colData} of \code{reference_data} 
-that identifies the cell types.}
-
-\item{pc_subset}{A numeric vector specifying which principal components to include in the plot. Default is PC1 to PC5.}
-}
-\value{
-A ggplot object representing the boxplots of specified principal components for the given cell types and datasets.
-}
-\description{
-This function plots the principal components for different cell types in the query and reference datasets.
-}
-\details{
-This function projects the query dataset onto the principal component space of the reference dataset and then visualizes the 
-specified principal components for the specified cell types.
-It uses the \code{\link{projectPCA}} function to perform the projection and \pkg{ggplot2} to create the plots.
-}
-\examples{
-# Load required libraries
-library(scRNAseq)
-library(scuttle)
-library(SingleR)
-library(scran)
-library(scater)
-
-# Load data (replace with your data loading)
-sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-
-# Divide the data into reference and query datasets
-set.seed(100)
-indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
-ref_data <- sce[, indices]
-query_data <- sce[, -indices]
-
-# log transform datasets
-ref_data <- scuttle::logNormCounts(ref_data)
-query_data <- scuttle::logNormCounts(query_data)
-
-# Get cell type scores using SingleR (or any other cell type annotation method)
-scores <- SingleR::SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-
-# Add labels to query object
-colData(query_data)$labels <- scores$labels
-
-# Selecting highly variable genes (can be customized by the user)
-ref_var <- scran::getTopHVGs(ref_data, n = 2000)
-query_var <- scran::getTopHVGs(query_data, n = 2000)
-
-# Intersect the gene symbols to obtain common genes
-common_genes <- intersect(ref_var, query_var)
-ref_data_subset <- ref_data[common_genes, ]
-query_data_subset <- query_data[common_genes, ]
-
-# Run PCA on the reference data (assumed to be prepared)
-ref_data_subset <- runPCA(ref_data_subset)
-
-pc_plot <- visualizeCellTypePCA(query_data_subset, ref_data_subset,
-                                n_components = 10,
-                                cell_types = c("CD4", "CD8", "B_and_plasma", "Myeloid"),
-                                query_cell_type_col = "labels", 
-                                ref_cell_type_col = "reclustered.broad", 
-                                pc_subset = c(1:5))
-pc_plot
-
-
-}
-\author{
-Anthony Christidis, \email{anthony-alexander_christidis@hms.harvard.edu}
-}
diff --git a/pkgdown/extra.css b/pkgdown/extra.css
deleted file mode 100644
index 19caa82..0000000
--- a/pkgdown/extra.css
+++ /dev/null
@@ -1,99 +0,0 @@
-/*
-Developed and maintained by Kevin Rue-Albrecht (@kevinrue)
-Borrowed by Leo from https://github.com/iSEE/iSEEhub/blob/main/pkgdown/extra.css
-See https://github.com/lcolladotor/biocthis/issues/34 for more details.
-*/
-
-/*
-#0092ac   blue
-#00758a   darker blue (active menu)
-#c4d931   green (on blue)
-#87b13f   green (on white)
-*/
-
-.headroom {
-  background-color: #0092ac;
-}
-
-.navbar-default .navbar-link {
-    color: #ffffff;
-}
-
-.navbar-default .navbar-link:hover {
-    color: #c4d931;
-}
-
-.navbar-default .navbar-nav>.active>a,
-.navbar-default .navbar-nav>.active>a:hover,
-.navbar-default .navbar-nav>.active>a:focus {
-    color: #c4d931;
-    background-color: #00758a;
-}
-
-.navbar-default .navbar-nav>.open>a,
-.navbar-default .navbar-nav>.open>a:hover,
-.navbar-default .navbar-nav>.open>a:focus {
-    color: #c4d931;
-    background-color: #00758a;
-}
-
-.dropdown-menu>.active>a,
-.dropdown-menu>.active>a:hover,
-.dropdown-menu>.active>a:focus {
-    color: #c4d931;
-    background-color: #00758a;
-}
-
-.navbar-default .navbar-nav>li>a:hover,
-.navbar-default .navbar-nav>li>a:focus {
-    color: #c4d931;
-}
-
-.dropdown-menu>li>a:hover {
-    color: #87b13f;
-    background-color: #ffffff;
-}
-
-.navbar-default .navbar-nav>li>a {
-    color: #ffffff;
-}
-
-h1 {
-  color: #87b13f;
-}
-
-h2 {
-  color: #1a81c2;
-}
-
-h3 {
-  color: #1a81c2;
-  font-weight: bold;
-}
-
-.btn-copy-ex {
-  color: #ffffff;
-  background-color: #0092ac;
-  border-color: #0092ac;
-}
-
-.btn-copy-ex:hover {
-  color: #ffffff;
-  background-color: #00758a;
-  border-color: #00758a;
-}
-
-.btn-copy-ex:active:focus {
-  color: #c4d931;
-  background-color: #00758a;
-  border-color: #0092ac;
-}
-
-p>.fa,
-p>.fas {
-  color: #0092ac;
-}
-
-img {
-  width: auto;
-}
diff --git a/scDiagnostics.Rproj b/scDiagnostics.Rproj
deleted file mode 100644
index a4dce49..0000000
--- a/scDiagnostics.Rproj
+++ /dev/null
@@ -1,17 +0,0 @@
-Version: 1.0
-
-RestoreWorkspace: Default
-SaveWorkspace: Default
-AlwaysSaveHistory: Default
-
-EnableCodeIndexing: Yes
-UseSpacesForTab: Yes
-NumSpacesForTab: 4
-Encoding: UTF-8
-
-RnwWeave: Sweave
-LaTeX: pdfLaTeX
-
-BuildType: Package
-PackageUseDevtools: Yes
-PackageInstallArgs: --no-multiarch --with-keep.source
diff --git a/tests/testthat.R b/tests/testthat.R
deleted file mode 100644
index 952fd33..0000000
--- a/tests/testthat.R
+++ /dev/null
@@ -1,12 +0,0 @@
-# This file is part of the standard setup for testthat.
-# It is recommended that you do not modify it.
-#
-# Where should you do additional test configuration?
-# Learn more about the roles of various files in:
-# * https://r-pkgs.org/testing-design.html#sec-tests-files-overview
-# * https://testthat.r-lib.org/articles/special-files.html
-
-library(testthat)
-library(scDiagnostics)
-
-test_check("scDiagnostics")
diff --git a/tests/testthat/test-calculateCategorizationEntropy.R b/tests/testthat/test-calculateCategorizationEntropy.R
deleted file mode 100644
index 8849056..0000000
--- a/tests/testthat/test-calculateCategorizationEntropy.R
+++ /dev/null
@@ -1,3 +0,0 @@
-test_that("multiplication works", {
-  expect_equal(2 * 2, 4)
-})
diff --git a/vignettes/scDiagnostics.Rmd b/vignettes/scDiagnostics.Rmd
deleted file mode 100644
index e428f47..0000000
--- a/vignettes/scDiagnostics.Rmd
+++ /dev/null
@@ -1,760 +0,0 @@
----
-title: "scDiagnostics: diagnostic functions to assess the quality of cell type annotations in single-cell RNA-seq data"
-author:
-    - name: Anthony Christidis
-      affiliation: Center for Computational Biomedicine, Harvard Medical School
-      email: anthony-alexander_christidis@hms.harvard.edu
-    - name: Andrew Ghazi
-      affiliation: Center for Computational Biomedicine, Harvard Medical School
-    - name: Smriti Chawla
-      affiliation: Center for Computational Biomedicine, Harvard Medical School
-    - name: Nitesh Turaga
-      affiliation: Center for Computational Biomedicine, Harvard Medical School
-    - name: Ludwig Geistlinger
-      affiliation: Center for Computational Biomedicine, Harvard Medical School
-    - name: Robert Gentleman
-      affiliation: Center for Computational Biomedicine, Harvard Medical School
-package: scDiagnostics
-output: 
-  BiocStyle::html_document:
-    toc: true
-    toc_float: true
-vignette: >
-  %\VignetteIndexEntry{scDiagnostics}
-  %\VignetteEncoding{UTF-8}
-  %\VignetteEngine{knitr::rmarkdown}
-editor_options: 
-  markdown: 
-    wrap: 72
----
-
-```{r setup, include = FALSE}
-knitr::opts_chunk$set(
-  collapse = TRUE,
-  comment = "#>"
-)
-```
-
-# Purpose
-
-Annotation transfer from a reference dataset for the cell type
-annotation of a new query single-cell RNA-sequencing (scRNA-seq)
-experiment is an integral component of the typical analysis workflow.
-The approach provides a fast, automated, and reproducible alternative to
-the manual annotation of cell clusters based on marker gene expression.
-However, dataset imbalance and undiagnosed incompatibilities between
-query and reference datasets can lead to erroneous annotations and
-distort downstream applications.
-
-The `scDiagnostics` package provides functionality for the systematic
-evaluation of cell type assignments in scRNA-seq data. `scDiagnostics`
-offers a suite of diagnostic functions to assess whether both (query and
-reference) datasets are aligned, ensuring that annotations can be
-transferred reliably. `scDiagnostics` also provides functionality to
-assess annotation ambiguity, cluster heterogeneity, and marker gene
-alignment. The implemented functionality helps researchers to determine
-how accurately cells from a new scRNA-seq experiment can be assigned to
-known cell types.
-
-# Installation
-
-To install the development version of the package from GitHub, use the
-following command:
-
-```{r dev_version_install, eval = FALSE}
-BiocManager::install("ccb-hms/scDiagnostics")
-```
-
-NOTE: you will need the
-[remotes](https://cran.r-project.org/web/packages/remotes/index.html)
-package to install from GitHub.
-
-To build the package vignettes upon installation use:
-
-```{r build_vignettes, eval=FALSE}
-BiocManager::install("ccb-hms/scDiagnostics",
-                     build_vignettes = TRUE,
-                     dependencies = TRUE)
-```
-
-# Usage
-
-To explore the capabilities of the scDiagnostics package, you can load
-your own data or utilize publicly available datasets obtained from the
-scRNAseq R package. In this guide, we will demonstrate how to use
-scDiagnostics with such datasets, which serve as valuable resources for
-exploring the package and assessing the appropriateness of cell type
-assignments.
-
-```{r libraries, message = FALSE}
-library(scDiagnostics)
-library(celldex)
-library(corrplot)
-library(scater)
-library(scran)
-library(scRNAseq)
-library(AUCell)
-library(RColorBrewer)
-library(SingleR)
-library(ComplexHeatmap)
-```
-
-## Scatter Plot: QC stats vs. Annotation Scores
-
-Here, we use the Human Primary Cell Atlas (Mabbott et al. 2013) as the
-reference dataset; the query dataset consists of haematopoietic stem
-and progenitor cells from Bunis DG et al. (2021).
-
-In scRNA-seq studies, assessing the quality of cells is important for
-accurate downstream analyses. At the same time, assigning accurate cell
-type labels based on gene expression profiles is an integral aspect of
-scRNA-seq data interpretation. Generally, these two steps are performed
-independently of each other. The `plotQCvsAnnotation()` function lets us
-inspect whether certain quality control (QC) criteria affect the
-confidence of cell type annotations.
-
-For instance, it is reasonable to hypothesize that higher library sizes
-could contribute to increased annotation confidence due to enhanced
-statistical power for identifying cell type-specific gene expression
-patterns, as evident in the scatter plot below.
-
-```{r Scatter-Plot-LibrarySize-Vs-Annotation-Scores, message=FALSE, warning=FALSE, eval=FALSE}
-
-# load reference dataset
-ref_data <- celldex::fetchReference("hpca", "2024-02-26")
-
-# Load query dataset (Bunis haematopoietic stem and progenitor cell
-# data) from Bunis DG et al. (2021). Single-Cell Mapping of
-# Progressive Fetal-to-Adult Transition in Human Naive T Cells.
-# Cell Rep. 34(1): 108573
-
-query_data <- BunisHSPCData()
-rownames(query_data) <- rowData(query_data)$Symbol
-
-# Add QC metrics to query data
-query_data <- addPerCellQCMetrics(query_data)
-
-# Log transform query dataset
-query_data <- logNormCounts(query_data)
-
-# Run SingleR to predict cell types
-pred <- SingleR(query_data, ref_data, labels = ref_data$label.main)
-
-# Assign predicted labels to query data
-colData(query_data)$pred.labels <- pred$labels
-
-# Get annotation scores
-scores <- apply(pred$scores, 1, max)
-
-# Assign scores to query data
-colData(query_data)$cell_scores <- scores
-
-# Create a scatter plot between library size and annotation scores
-p1 <- plotQCvsAnnotation(
-    query_data = query_data,
-    qc_col = "total",
-    label_col = "pred.labels",
-    score_col = "cell_scores",
-    label = NULL
-)
-p1 + xlab("Library Size")
-```
-
-However, certain QC metrics, such as the proportion of mitochondrial
-genes, may require careful consideration as they can sometimes be
-associated with cellular states or functions rather than noise. The
-interpretation of mitochondrial content should be context-specific and
-informed by biological knowledge.
-
-In the next analysis, we investigated the relationship between
-mitochondrial percentage and cell type annotation scores using liver
-tissue data from He S et al. 2020. Notably, we observed high annotation
-scores for macrophages and monocytes. These findings align with the
-established biological characteristic of high mitochondrial activity in
-macrophages and monocytes, adding biological context to our results.
-
-```{r QC-Annotation-Scatter-Mito, warning=FALSE, message=FALSE, eval=FALSE}
-# load query dataset
-query_data <- HeOrganAtlasData(
-    tissue = c("Liver"),
-    ensembl = FALSE,
-    location = TRUE
-)
-
-# Add QC metrics to query data
-
-mito_genes <- rownames(query_data)[grep("^MT-", rownames(query_data))]
-query_data <- unfiltered <- addPerCellQC(query_data, subsets = list(mt = mito_genes))
-qc <- quickPerCellQC(colData(query_data), sub.fields = "subsets_mt_percent")
-query_data <- query_data[,!qc$discard]
-
-# Log transform query dataset
-query_data <- logNormCounts(query_data)
-
-# Run SingleR to predict cell types
-pred <- SingleR(query_data, ref_data, labels = ref_data$label.main)
-
-# Assign predicted labels to query data
-colData(query_data)$pred.labels <- pred$labels
-
-# Get annotation scores
-scores <- apply(pred$scores, 1, max)
-
-# Assign scores to query data
-colData(query_data)$cell_scores <- scores
-
-# Create a new column for the labels so it is easy to distinguish
-#  between macrophages, monocytes and other cells
-query_data$label_category <-
-    ifelse(query_data$pred.labels %in% c("Macrophage", "Monocyte"),
-           query_data$pred.labels,
-           "Other cells")
-
-
-# Define custom colors for cell type labels
-cols <- c("Other cells" = "grey", "Macrophage" = "green", "Monocyte" = "red")
-
-# Generate scatter plot for all cell types
-p1 <- plotQCvsAnnotation(
-    query_data = query_data,
-    qc_col = "subsets_mt_percent",
-    label_col = "label_category",
-    score_col = "cell_scores",
-    label = NULL) + 
-    scale_color_manual(values = cols) +
-    xlab("subsets_mt_percent")
-p1
-```
-
-## Examining Distribution of QC stats and Annotation Scores
-
-In addition to the scatter plot, we can gain further insights into the
-gene expression profiles by visualizing the distribution of user-defined
-QC stats and annotation scores for all the cell types or specific cell
-types. This allows us to examine the variation and patterns in
-expression levels and scores across cells assigned to the cell type of
-interest.
-
-To accomplish this, we create two separate histograms. The first
-histogram displays the distribution of the annotation scores.
-
-The second histogram visualizes the distribution of QC stats. This
-provides insights into the overall gene expression levels for the
-specific cell type. In this example, we examine the percentage of
-mitochondrial genes.
-
-By examining the histograms, we can observe the range, shape, and
-potential outliers in the distribution of both annotation scores and QC
-stats. This allows us to assess the appropriateness of the cell type
-assignments and identify any potential discrepancies or patterns in the
-gene expression profiles for the specific cell type.
-
-```{r Mito-Genes-Vs-Annotation, warning=FALSE, message=FALSE, eval=FALSE}
-# Generate histogram
-histQCvsAnnotation(query_data = query_data, qc_col = "subsets_mt_percent", 
-                   label_col = "pred.labels", 
-                   score_col = "cell_scores", 
-                   label = NULL)
-```
-
-The right-skewed distribution of mitochondrial percentages and the
-left-skewed distribution of annotation scores in the histograms above
-suggest that most cells have lower mitochondrial contamination and
-higher confidence in their assigned cell types.
-
-## Exploring Gene Expression Distribution
-
-The `plotMarkerExpression()` function helps users explore the
-distribution of expression values for a gene of interest, both across
-all cells in the reference and query datasets and within specific cell
-types. This shows whether the distributions are aligned between the
-datasets. Discrepancies in distribution patterns may indicate potential
-incompatibilities or differences between the datasets.
-
-The function also allows users to narrow down their analysis to specific
-cell types of interest. This enables investigation of whether alignment
-between the query and reference datasets is consistent not only at a
-global level but also within specific cell types.
-
-```{r Gene-Expression-Histogram, warning=FALSE, message=FALSE}
-
-# Load data
-sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
-
-# Divide the data into reference and query datasets
-set.seed(100)
-indices <- sample(ncol(sce), size = floor(0.7 * ncol(sce)), replace = FALSE)
-ref_data <- sce[, indices]
-query_data <- sce[, -indices]
-
-# Log-transform datasets
-ref_data <- logNormCounts(ref_data)
-query_data <- logNormCounts(query_data)
-
-# Run PCA
-ref_data <- runPCA(ref_data)
-query_data <- runPCA(query_data)
-
-# Get cell type scores using SingleR
-pred <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
-pred <- as.data.frame(pred)
-   
-# Assign labels to query data
-colData(query_data)$labels <- pred$labels
-   
-# Generate density plots
-plotMarkerExpression(reference_data = ref_data, 
-                     query_data = query_data, 
-                     ref_cell_type_col = "reclustered.broad", 
-                     query_cell_type_col = "labels", 
-                     gene_name = "MS4A1", 
-                     label = "B_and_plasma")
-```
-
-In the provided example, we examined the distribution of expression
-values for the gene MS4A1, a marker for naive B cells, in both the query
-and reference datasets. We also looked at the distribution of MS4A1
-expression within the B_and_plasma cell type. We observed overlapping
-distributions in both cases, suggesting alignment between the reference
-and query datasets.
-
-## Evaluating Alignment Between Reference and Query Datasets in Terms of Highly Variable Genes
-
-Here we assess the alignment between the reference dataset and the
-query dataset in terms of their highly variable genes (HVGs). To do
-so, we calculate the overlap coefficient between the sets of
-highly variable genes in the reference and query datasets. The overlap
-coefficient quantifies the degree of overlap or similarity between these
-two sets of genes. A value closer to 1 indicates a higher degree of
-overlap, while a value closer to 0 suggests less overlap. The computed
-overlap coefficient is printed, providing a numerical measure of how
-well the highly variable genes in the reference and query datasets
-align. In this case, the overlap coefficient is 0.62, indicating a
-moderate level of overlap.
-
-```{r HVG-overlap, warning=FALSE, message=FALSE}
-
-# Selecting highly variable genes
-ref_var <- getTopHVGs(ref_data, n=2000)
-query_var <- getTopHVGs(query_data, n=2000)
-
-# Compute the overlap coefficient
-overlap_coefficient <- calculateHVGOverlap(reference_genes = ref_var, 
-                                           query_genes = query_var)
-print(overlap_coefficient)
-```
-
-## Visualize Gene Expression on Dimensional Reduction Plot
-
-To gain insights into the gene expression patterns and their
-representation in a dimensional reduction space, we can utilize the
-plotGeneExpressionDimred function. This function allows us to plot the
-gene expression values of a specific gene on a dimensional reduction
-plot generated using methods like t-SNE, UMAP, or PCA. Each single cell
-is color-coded based on its expression level of the gene of interest.
-
-In the provided example, we are visualizing the gene expression values
-of the gene "VPREB3" on a PCA plot. The PCA plot represents the cells in
-a lower-dimensional space, where the x-axis corresponds to the first
-principal component (Dimension 1) and the y-axis corresponds to the
-second principal component (Dimension 2). Each cell is represented as a
-point on the plot, and its color reflects the expression level of the
-gene "VPREB3," ranging from low (lighter color) to high (darker color).
-
-```{r Gene-Expression-Scatter, warning=FALSE, message=FALSE}
-# Generate a dimension reduction plot color-coded by gene expression
-plotGeneExpressionDimred(se_object = query_data, 
-                         method = "PCA", 
-                         n_components = c(1, 2), 
-                         feature = "VPREB3")
-```
-
-The dimensional reduction plot allows us to observe how the gene
-expression of VPREB3 is distributed across the cells and whether any
-clusters or patterns emerge in the data.
-
-## Visualize Gene Sets or Pathway Scores on Dimensional Reduction Plot
-
-In addition to examining individual gene expression patterns, it is
-often useful to assess the collective activity of gene sets or pathways
-within single cells. This can provide insights into the functional
-states or biological processes associated with specific cell types or
-conditions. To facilitate this analysis, the scDiagnostics package
-includes a function called plotGeneSetScores that enables the
-visualization of gene set or pathway scores on a dimensional reduction
-plot.
-
-The plotGeneSetScores function allows you to plot gene set or pathway
-scores on a dimensional reduction plot generated using methods such as
-PCA, t-SNE, or UMAP. Each single cell is color-coded based on its scores
-for specific gene sets or pathways. This visualization helps identify
-the heterogeneity and patterns of gene set or pathway activity within
-the dataset, potentially revealing subpopulations with distinct
-functional characteristics.
-
-```{r Pathway-Scores-on-Dimensional-Reduction-Scatter, warning=FALSE, message=FALSE}
-
-# Compute scores using AUCell
-expression_matrix <- assay(query_data, "logcounts")
-cells_rankings <- AUCell_buildRankings(expression_matrix, plotStats = FALSE)
-
-# Generate gene sets
-gene_set1 <- sample(rownames(expression_matrix), 10)
-gene_set2 <- sample(rownames(expression_matrix), 20)
-
-gene_sets <- list(geneSet1 = gene_set1,
-                  geneSet2 = gene_set2)
-
-# Calculate AUC scores for gene sets
-cells_AUC <- AUCell_calcAUC(gene_sets, cells_rankings)
-
-# Assign scores to colData
-colData(query_data)$geneSetScores <- assay(cells_AUC)["geneSet1", ]
-
-# Plot gene set scores on PCA
-plotGeneSetScores(se_object = query_data, 
-                  method = "PCA", 
-                  feature = "geneSetScores",
-                  pc_subset = c(1:5))
-```
-
-In the provided example, we demonstrate the usage of the
-plotGeneSetScores function using the AUCell package to compute gene set
-or pathway scores. Custom gene sets are generated for demonstration
-purposes, but users can provide their own gene set scores using any
-method of their choice. It is important to ensure that the scores are
-assigned to the colData of the reference or query object and specify the
-correct feature name for visualization.
-
-By visualizing gene set or pathway scores on a dimensional reduction
-plot, you can gain a comprehensive understanding of the functional
-landscape within your single-cell gene expression dataset and explore
-the relationships between gene set activities and cellular phenotypes.
-
-## Visualizing Reference and Query Cell Types using Multidimensional Scaling (MDS)
-
-The `visualizeCellTypeMDS()` function performs Multidimensional Scaling
-(MDS) on the query and reference datasets to examine their similarity. The
-dissimilarity matrix is calculated based on the correlation between the
-datasets, representing the distances between cells in terms of gene
-expression patterns. MDS is then applied to derive low-dimensional
-coordinates for each cell. Subsequently, a scatter plot is generated,
-where each data point represents a cell, and cell types are color-coded
-using custom colors provided by the user. This visualization enables the
-comparison of cell type distributions between the query and reference
-datasets in a reduced-dimensional space.
-
-The rationale behind this function is to visually assess the alignment
-and relationships between cell types in the query and reference
-datasets.
-
-
-```{r CMD-Scatter-Plot, warning=FALSE, message=FALSE}
-
-# Intersect the gene symbols to obtain common genes
-common_genes <- intersect(ref_var, query_var)
-
-# Select desired cell types
-selected_cell_types <- c("CD4", "CD8", "B_and_plasma")
-ref_data_subset <- ref_data[common_genes, ref_data$reclustered.broad %in% selected_cell_types]
-query_data_subset <- query_data[common_genes, query_data$labels %in% selected_cell_types]
-
-# Extract cell types for visualization
-ref_labels <- ref_data_subset$reclustered.broad
-query_labels <- query_data_subset$labels
-
-# Generate the MDS scatter plot with cell type coloring
-visualizeCellTypeMDS(query_data = query_data_subset, 
-                     reference_data = ref_data_subset, 
-                     query_cell_type_col = "labels",
-                     ref_cell_type_col = "reclustered.broad")
-```
-
-Upon examining the MDS scatter plot, we observe that the CD4 and CD8
-cell types overlap to some extent. By observing the proximity or overlap
-of different cell types, one can gain insights into their potential
-relationships or shared characteristics.
-
-The selection of custom genes and desired cell types depends on the
-user's research interests and goals. It allows for flexibility in
-focusing on specific genes and examining particular cell types of
-interest in the visualization.
-
-## Cell Type-specific Pairwise Correlation Analysis and Visualization
-
-This analysis aims to explore the correlation patterns between different
-cell types in a single-cell gene expression dataset. The goal is to
-compare the gene expression profiles of cells from a reference dataset
-and a query dataset to understand the relationships and similarities
-between various cell types.
-
-To perform the analysis, we start by computing the pairwise correlations
-between the query and reference cells for selected cell types ("CD4",
-"CD8", "B_and_plasma"). The Spearman correlation method is used here;
-users can also choose Pearson's correlation coefficient.
-
-This returns an average correlation matrix, which can be visualized with
-the user's method of choice. Here, the results are visualized as a
-correlation plot using the corrplot package.
-
-```{r Cell-Type-Correlation-Analysis-Visualization, warning=FALSE, message=FALSE}
-selected_cell_types <- c("CD4", "CD8", "B_and_plasma")
-ref_data_subset <- runPCA(ref_data_subset)
-cor_matrix_avg <- calculateAveragePairwiseCorrelation(query_data = query_data_subset, 
-                                                      reference_data = ref_data_subset, 
-                                                      n_components = 5,
-                                                      query_cell_type_col = "labels", 
-                                                      ref_cell_type_col = "reclustered.broad", 
-                                                      cell_types = selected_cell_types, 
-                                                      correlation_method = "spearman")
-
-# Visualize the output
-plot(cor_matrix_avg)
-```
-
-In this case, users have the flexibility to extract the gene expression
-profiles of specific cell types from the reference and query datasets
-and provide these profiles as input to the function. Additionally, they
-can select their own set of genes that they consider relevant for
-computing the pairwise correlations. For demonstration, we have used the
-common highly variable genes from the reference and query datasets.
-
-By providing their own gene expression profiles and choosing specific
-genes, users can focus the analysis on the cell types and genes of
-interest to their research question.
-
-## Pairwise Distance Analysis and Density Visualization
-
-This function serves to conduct an analysis of pairwise distances or
-correlations between cells of specific cell types within a single-cell
-gene expression dataset. By calculating these distances or correlations,
-users can gain insights into the relationships and differences in gene
-expression profiles between different cell types. The function
-facilitates this analysis by generating density plots, allowing users to
-visualize the distribution of distances or correlations for various
-pairwise comparisons.
-
-The analysis offers the flexibility to select a particular cell type for
-examination, and users can choose between different distance metrics,
-such as "euclidean" or "manhattan", to calculate pairwise distances.
-
-To illustrate, the function is applied to the cell type CD8 using the
-Euclidean distance metric in the example below.
-
-```{r Pairwise-Distance-Analysis-Density-Visualization, fig.width=8, message=FALSE, warning=FALSE}
-calculatePairwiseDistancesAndPlotDensity(query_data = query_data_subset, 
-                                         reference_data = ref_data_subset, 
-                                         n_components = 10,
-                                         query_cell_type_col = "labels", 
-                                         ref_cell_type_col = "reclustered.broad", 
-                                         cell_type_query = "CD8", 
-                                         cell_type_reference = "CD8", 
-                                         distance_metric = "euclidean")
-```
-
-Alternatively, users can opt for the "correlation" distance metric,
-which measures the similarity in gene expression profiles between cells.
-
-To illustrate, the function is applied to the cell type CD8 using the
-correlation distance metric in the example below. By selecting either
-the "pearson" or "spearman" correlation method, users can emphasize
-either linear or rank-based associations, respectively.
-
-```{r Pairwise-Distance-Correlation-Based-Density-Visualization, warning=FALSE, message=FALSE, fig.width=8}
-calculatePairwiseDistancesAndPlotDensity(query_data = query_data_subset, 
-                                         reference_data = ref_data_subset, 
-                                         n_components = 10,
-                                         query_cell_type_col = "labels", 
-                                         ref_cell_type_col = "reclustered.broad", 
-                                         cell_type_query = "CD8", 
-                                         cell_type_reference = "CD8", 
-                                         distance_metric = "correlation",
-                                         correlation_method = "spearman")
-```
-
-By utilizing this function, users can explore the pairwise distances
-between query and reference cells of a specific cell type and gain
-insights into the distribution of distances through density plots. This
-analysis aids in understanding the similarities and differences in gene
-expression profiles for the selected cell type within the query and
-reference datasets.
-
-
-
-
-## PC regression analysis
-
-Performing PC regression analysis on a SingleCellExperiment object
-enables users to examine the relationship between a principal component
-(PC) from the dimension reduction slot and an independent variable of
-interest. By specifying the desired dependent variable as one of the
-principal components (e.g., "PC1", "PC2", etc.) and providing the
-corresponding independent variable from the colData slot (e.g.
-"cell_type"), users can explore the associations between linear
-structure in the single-cell gene expression dataset (reference and
-query) and an independent variable of interest (e.g. cell type or
-batch).
-
-The function prints two diagnostic plots by default:
-
--   a plot of the two PCs with the highest R^2^ with the specified
-    independent variable
--   a dot plot showing the R^2^ of each consecutive PC \~ indep.var
-    regression
-    -   Generally you should expect this plot to die off to near 0
-        before \~PC10
-    -   Interpretation example: If the R^2^ values are high (\>=50%)
-        anywhere in PCs 1-5 and your independent variable is "batch",
-        you have batch effects!
-
-```{r Regression, warning=FALSE, message=FALSE}
-
-# Specify the dependent variables (principal components) and
-#  independent variable (e.g., "labels")
-dep.vars <- paste0("PC", 1:12)
-indep.var <- "labels"
-
-# Perform linear regression on multiple principal components
-result <- regressPC(sce = query_data,
-                    dep.vars = dep.vars, 
-                    indep.var = indep.var)
-
-# Print the summaries of the linear regression models and R-squared
-#  values
-
-# Summaries of the linear regression models
-result$regression.summaries[[1]]
-
-# R-squared values
-result$rsquared
-
-# Variance contributions for each principal component
-result$var.contributions
-
-# Total variance explained
-result$total.variance.explained
-```
-
-This analysis helps uncover whether there is a systematic variation in
-PC values across different cell types. In the example above, we can see
-that the four cell types are spread out across both PC1 and PC2. Digging
-into the genes with high loadings on these PCs can help explain the
-biological or technical factors driving cellular heterogeneity. It can
-help identify PC dimensions that capture variation specific to certain
-cell types or distinguish different cellular states.
-
-Let's look at the genes driving PC1 by ordering the rotation matrix by
-the absolute gene loadings for PC1:
-
-```{r}
-pc_df <-  attr(reducedDims(query_data)$PCA, "rotation")[,1:5] |> 
-  as.data.frame()
-
-pc_df[order(abs(pc_df$PC1)),] |> 
-  tail()
-```
-
-PC1 is mostly driven by NKG7 (Natural Killer Cell Granule Protein 7).
-This gene is important in CD8+ T cells, so it makes sense that it
-distinguishes the cell types shown.
-
-> Exercise: What genes are driving PC2? Do they make sense?
-
-```{r echo = FALSE, eval = FALSE}
-pc_df[order(abs(pc_df$PC2)),] |> 
-  tail()
-
-# It's IL32 mostly.
-```
-
-> Exercise: Try to use the command below to examine the spike on PC5.
-> What's going on there?
-
-`plotPCA(query_data, ncomponents = c(1,5), color_by = "labels")`
-
-```{r eval=FALSE, echo=FALSE}
-plotPCA(query_data, ncomponents = c(1,5), color_by = "labels")
-# The myeloid cells are shifted off from the other types.
-
-pc_df[order(abs(pc_df$PC5)),] |> 
-  tail()
-# It's mostly driven by low GNLY expression in the myeloid cells.
-```
-
-
-## Annotation entropy
-
-In order to assess the confidence of cell type predictions, we can use
-the function `calculateCategorizationEntropy()`. This function
-calculates the information entropy of assignment probabilities across a
-set of cell types for each cell. If a cell's class probabilities are
-concentrated on one cell type (a confident assignment), its entropy is low.
-
-This can be used to compare two sets of cell type assignments (e.g. from
-different type assignment methods) in terms of their relative confidence.
-**Please note that this has nothing to do with their accuracy!**
-Computational methods can sometimes be confidently incorrect.
-
-The cell type probabilities should be passed as a matrix with cell types
-as rows and cells as columns. If the columns of the matrix are not valid
-probability distributions (i.e. they don't sum to 1, as in the example
-below), the function will perform a column-wise softmax to convert them
-to a probability scale. This may or may not work well depending on the
-distribution of the inputs, so if at all possible pass probabilities
-instead of arbitrary scores.
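-
-To make the two steps concrete, here is a minimal sketch of a
-column-wise softmax followed by a per-cell Shannon entropy. The helpers
-`softmax_cols()` and `shannon_entropy()` are hypothetical and written
-only for illustration; they are not part of scDiagnostics, and the
-package's own computation may differ (for example in the log base or in
-how the entropy is normalized).
-
-```{r}
-# Hypothetical helper: convert each column of scores to probabilities
-softmax_cols <- function(scores) {
-  # Subtract the column maximum first for numerical stability
-  shifted <- sweep(scores, 2, apply(scores, 2, max))
-  exp_scores <- exp(shifted)
-  sweep(exp_scores, 2, colSums(exp_scores), "/")
-}
-
-# Hypothetical helper: Shannon entropy (natural log) of each column
-shannon_entropy <- function(probs) {
-  # Treat 0 * log(0) as 0
-  terms <- ifelse(probs > 0, -probs * log(probs), 0)
-  colSums(terms)
-}
-
-# Two toy cells scored across 4 cell types (rows)
-toy_scores <- cbind(confident = c(10, 0, 0, 0),
-                    uncertain = c(1, 1, 1, 1))
-toy_probs <- softmax_cols(toy_scores)
-toy_entropy <- shannon_entropy(toy_probs)
-toy_entropy
-```
-
-The `uncertain` cell has equal probability for all four types, so its
-entropy equals `log(4)`, the maximum for four classes, while the
-`confident` cell has an entropy close to zero.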
-
-In the example below, we create 500 random cells with random normal cell
-type "scores" across 4 cell types. For demonstration, we make the score
-of the first class much higher in the first 250 cells. After the
-softmax, this corresponds to a very high probability of cell type 1. The
-remaining 250 cells have assignments that are roughly even across the
-four cell types (i.e. high entropy).
-
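-```{r}
-set.seed(42) # make the simulated scores reproducible
-# 4 cell types (rows) x 500 cells (columns) of random normal scores
-X <- rnorm(500 * 4) |> matrix(nrow = 4)
-# Boost the score of cell type 1 in the first 250 cells
-X[1, 1:250] <- X[1, 1:250] + 5
-
-entropy_scores <- calculateCategorizationEntropy(X)
-```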
-
-From the plot we can see that half of the cells (the first half we
-shifted to class 1) have low entropy, and half have high entropy.
-
-# Conclusion
-
-In this analysis, we have demonstrated the capabilities of the
-scDiagnostics package for assessing the appropriateness of cell type
-assignments in single-cell gene expression profiles. By utilizing
-various diagnostic functions and visualization techniques, we have
-explored different aspects of the data, including total UMI counts,
-annotation scores, gene expression distributions, dimensionality
-reduction plots, gene set scores, pairwise correlations, pairwise
-distances, and linear regression analysis.
-
-Through the scatter plots, histograms, and dimensional reduction plots,
-we were able to gain insights into the relationships between gene
-expression patterns, cell types, and the distribution of cells in a
-reduced-dimensional space. The examination of gene expression
-distributions, gene sets, and pathways allowed us to explore the
-functional landscape and identify subpopulations with distinct
-characteristics within the dataset. Additionally, the pairwise
-correlation and distance analyses provided a deeper understanding of the
-similarities and differences between cell types, highlighting potential
-relationships and patterns.
-
-------------------------------------------------------------------------
-
-## R session info
-
-```{r SessionInfo, echo=FALSE, message=FALSE, warning=FALSE, comment=NA}
-options(width = 80) #reset to 'default' width
-
-sessionInfo() #record the R and package versions used
-```