Add support to provide expression data frame to PANTHER enrichment

- Following PANTHER's documentation, the analysis is named over-representation analysis if a gene vector is supplied, and statistical enrichment if a data frame with expression values is supplied.
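
As a rough offline illustration of the dispatch described above, the sketch below picks the endpoint from the type of `genes` and, for a data frame, writes the two columns to a tab-separated temporary file without a header. `panther_analysis_plan` is a hypothetical helper written only for this note; it is not part of rbioapi and sends no request. The endpoint paths and the `geneInputList` field name are taken from the diff in this commit.

```r
# Hypothetical helper (not part of rbioapi): mirrors how the type of `genes`
# selects the analysis in this commit. It performs no HTTP request.
panther_analysis_plan <- function(genes) {
  if (is.character(genes)) {
    # Character vector: over-representation analysis, IDs joined by commas
    list(analysis = "over-representation",
         endpoint = "enrich/overrep",
         body = list(geneInputList = paste(genes, collapse = ",")))
  } else if (is.data.frame(genes)) {
    # Data frame: statistical enrichment, written as "id<TAB>value" lines
    tmp <- tempfile(pattern = "rba_", fileext = ".txt")
    utils::write.table(x = genes, file = tmp, sep = "\t", quote = FALSE,
                       row.names = FALSE, col.names = FALSE)
    list(analysis = "statistical enrichment",
         endpoint = "enrich/statenrich",
         file = tmp)
  } else {
    stop("`genes` must be a character vector or a data frame.")
  }
}

vec_plan <- panther_analysis_plan(c("BRCA1", "CDK1", "PLK1"))
df_plan <- panther_analysis_plan(
  data.frame(genes = c("BRCA1", "CDK1", "PLK1"), expr = c(12.5, 3.1, 48))
)

# Inspect the file that would be uploaded for statistical enrichment
readLines(df_plan$file)
```

In the package itself, the file is attached to the request with `httr::upload_file()` and sent as a multipart body, as shown in the diff.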
moosa-r · Dec 28, 2024 · 5be8e93 · 5be8e93
1 parent a0866ad
commit 5be8e93
Showing 2 changed files with 236 additions and 73 deletions.
diff --git a/R/panther.R b/R/panther.R
@@ -81,45 +81,90 @@ rba_panther_mapping <- function(genes,
   return(final_output)
 }
 
-#' PANTHER Over-Representation Enrichment Analysis
-#'
-#' Using this function you can use PANTHER services to perform
-#'   over-representation enrichment analysis. This statistical test will
-#'   compare your input genes to a set of defined gene lists to determine
-#'   if they are over/under-represented.
-#'
-#' @param genes Character vector of genes identifiers with maximum length of
-#'   10000. Can be any of: Ensemble gene ID, Ensemble protein ID, Ensemble
-#'   transcript ID, Entrez gene ID, gene symbol, NCBI GI, HGNC ID,
+#' PANTHER Over-Representation or Enrichment Analysis
+#'
+#' Use PANTHER services to perform over-representation or statistical
+#'   enrichment analysis. You can either provide a character vector of gene
+#'   IDs for over-representation analysis, or a data frame of gene IDs and
+#'   expression values for statistical enrichment analysis.\cr Please refer
+#'   to the details section for more information on the statistical analysis.
+#'
+#' \strong{Over-representation Test}: It assesses whether specific gene sets are
+#'   represented in your input gene list differently from what is expected by
+#'   chance. It uses Fisher's exact test or Binomial test to calculate p-values.
+#'   Fisher's exact test determines the probability of observing the gene
+#'   counts in a category based on a hypergeometric distribution; the binomial
+#'   test compares the observed proportion of genes in a category to the
+#'   expected proportion based on the reference list. A significant p-value
+#'   indicates over-representation or under-representation of a gene set.
+#'
+#' \strong{Statistical Enrichment Test}: The statistical enrichment test uses the
+#'   Mann-Whitney U (Wilcoxon Rank-Sum) test to assess if the expression values
+#'   associated with genes in a specific category differ significantly from the
+#'   overall distribution in the input list. This non-parametric test first
+#'   ranks the numerical values, then tests whether the values of the genes
+#'   in the category were randomly drawn from the overall distribution. A small
+#'   p-value indicates that the numerical values for the genes in the category
+#'   differ significantly from the background distribution, which suggests a
+#'   non-random pattern.
+#'
+#' Please note that starting from rbioapi version 0.8.2, you can supply a
+#'   gene expression data frame to perform statistical enrichment analysis.
+#'   In earlier versions, only a character vector of gene IDs was possible,
+#'   thus only over-representation analysis.
+#'
+#' @param genes Either a character vector or a data frame. Depending on
+#'   this parameter, the analysis type is determined.
+#'   \describe{
+#'   \item{Character vector:}{If a character vector is supplied,
+#'   over-representation analysis will be performed using either Fisher's exact
+#'   test (default), or binomial.}
+#'   \item{Data frame:}{If a data.frame is supplied, statistical enrichment
+#'   test is performed using Mann-Whitney U (Wilcoxon Rank-Sum) test. The
+#'   data frame should have two columns: the first column is a character vector
+#'   with gene identifiers and the second column is a numerical vector with
+#'   expression values.}
+#'   }
+#'   In both cases, a maximum of 10000 genes can be supplied.
+#'   The gene identifiers can be any of: Ensembl gene ID, Ensembl protein ID,
+#'   Ensembl transcript ID, Entrez gene ID, gene symbol, NCBI GI, HGNC ID,
 #'   International protein index ID, NCBI UniGene ID, UniProt accession
-#'   and/or UniProt ID.
+#'   or UniProt ID.
 #' @param organism (numeric) NCBI taxon ID. run \code{\link{rba_panther_info}}
 #'   with argument 'what = "organisms"' to get a list of PANTHER's
 #'   supported organisms.
 #' @param annot_dataset A PANTHER dataset ID to test your input against it.
-#'   run \code{\link{rba_panther_info}}with argument 'what = "datasets"' to
+#'   run \code{\link{rba_panther_info}} with argument 'what = "datasets"' to
 #'   get a list of PANTHER's supported datasets. Note that you should enter
 #'   the "id" of the dataset, not its label (e.g. entering "biological_process"
 #'   is incorrect, you should rather enter "GO:0008150").
-#' @param test_type statistical test type to calculate the p values. either
-#'   "FISHER" (default) or "BINOMIAL".
+#' @param test_type statistical test type to calculate the p values.
+#'   \itemize{
+#'   \item If performing over-representation analysis (i.e. `genes` parameter
+#'   is a character vector), valid values are "FISHER" (default) or "BINOMIAL".
+#'   \item If performing statistical enrichment analysis (i.e. `genes`
+#'   parameter is a data.frame), the only valid value is "Mann-Whitney".
+#'   }
 #' @param correction p value correction method. either "FDR" (default),
 #'   "BONFERRONI" or "NONE".
 #' @param cutoff (Numeric) (Optional) a threshold to filter the results.
 #'   if correction is "FDR", the threshold will be applied to fdr column's
 #'   values; if otherwise, the threshold will be applied to p value column.
-#' @param ref_genes (Optional) A set of genes that will be used as the test's
-#'   background (reference/universe) gene set. If no value supplied, all of
-#'   the genes in specified organism will be used. maximum length and supported
-#'   IDs are the same as 'genes' argument.
-#' @param ref_organism (Optional) if 'ref_genes' is used, you can specify
-#'   the organisms which correspond to your supplied IDs in 'ref_genes'
-#'   argument. see 'organism' argument for supported values.
+#' @param ref_genes (Optional, only valid if genes is a character vector)
+#'   A character vector of genes that will be used as the test's background
+#'   (reference/universe) gene set. If no value supplied, all of the genes in
+#'   specified organism will be used. The maximum length and supported IDs are
+#'   the same as 'genes' argument.
+#' @param ref_organism (Optional, only valid if genes is a character vector)
+#'   if 'ref_genes' is used, you can specify the organisms which correspond to
+#'   your supplied IDs in 'ref_genes' argument. see 'organism' argument for
+#'   supported values.
 #' @param ... rbioapi option(s). See \code{\link{rba_options}}'s
 #'   arguments manual for more information on available options.
 #'
 #' @section Corresponding API Resources:
 #'  "POST https://www.pantherdb.org/services/oai/pantherdb/enrich/overrep"
+#'  \cr "POST https://www.pantherdb.org/services/oai/pantherdb/enrich/statenrich"
 #'
 #' @return A list with the parameters and results. If the analysis was
 #'   successful, the results data frame are returned in the "results" element
@@ -142,18 +187,34 @@ rba_panther_mapping <- function(genes,
 #'
 #' @examples
 #' \donttest{
-#' rba_panther_enrich(genes = c("p53", "BRCA1", "cdk2", "Q99835", "CDC42",
-#'         "CDK1", "KIF23", "PLK1", "RAC2", "RACGAP1"),
-#'     organism = 9606, annot_dataset = "GO:0008150",
-#'     cutoff = 0.01)
+#' rba_panther_enrich(
+#'   genes = c("p53", "BRCA1", "cdk2", "Q99835", "CDC42",
+#'     "CDK1", "KIF23", "PLK1", "RAC2", "RACGAP1"),
+#'   organism = 9606, annot_dataset = "GO:0008150",
+#'   cutoff = 0.01
+#'   )
+#' }
+#'
+#' \donttest{
+#' expression_df <- data.frame(
+#'   genes = c("p53", "BRCA1", "cdk2", "Q99835", "CDC42",
+#'     "CDK1", "KIF23", "PLK1", "RAC2", "RACGAP1"),
+#'   expr = runif(10, 0, 100)
+#'   )
+#'
+#' rba_panther_enrich(
+#'   genes = expression_df,
+#'   organism = 9606,
+#'   annot_dataset = "GO:0008150"
+#'   )
 #' }
 #'
 #' @family "PANTHER"
 #' @export
 rba_panther_enrich <- function(genes,
                                organism,
                                annot_dataset,
-                               test_type = "FISHER",
+                               test_type = NULL,
                                correction = "FDR",
                                cutoff = NULL,
                                ref_genes = NULL,
@@ -164,8 +225,7 @@ rba_panther_enrich <- function(genes,
   ## Check User-input Arguments
   .rba_args(cons = list(list(arg = "genes",
                              class = c("character",
-                                       "numeric"),
-                             max_len = 100000),
+                                       "data.frame")),
                         list(arg = "organism",
                              class = "numeric",
                              len = 1),
@@ -174,7 +234,7 @@ rba_panther_enrich <- function(genes,
                              len = 1),
                         list(arg = "test_type",
                              class = "character",
-                             val = c("FISHER", "BINOMIAL"),
+                             val = c("FISHER", "BINOMIAL", "Mann-Whitney"),
                              len = 1),
                         list(arg = "correction",
                              class = "character",
@@ -191,25 +251,66 @@ rba_panther_enrich <- function(genes,
                         list(arg = "ref_organism",
                              class = "numeric",
                              len = 1)),
-            cond = list(list(quote(!is.null(ref_organism) && is.null(ref_genes)),
-                             "'ref_organism' was ignored because no 'ref_genes' was supplied.")),
-            cond_warning = TRUE)
-  .msg("Performing over-representation enrichment analysis of %s input genes of organism %s against %s datasets.",
-       length(genes), organism, annot_dataset)
+            cond = list(list(quote(xor(is.null(ref_organism), is.null(ref_genes))),
+                             "'ref_organism' and 'ref_genes' should be supplied together."),
+                        list(quote(is.data.frame(genes) && (ncol(genes) != 2 || !is.character(genes[[1]]) || !is.numeric(genes[[2]]))),
+                             "If the `genes` parameter is a data frame, statistical enrichment analysis will be performed.\nThe `genes` parameter should be a data frame with 2 columns, where the first column contains the gene identifiers and the second column contains numerical expression values."),
+                        list(quote(is.data.frame(genes) && !is.null(test_type) && test_type != "Mann-Whitney"),
+                             "If the `genes` parameter is a data frame, statistical enrichment analysis will be performed.\nThus, the only valid value for `test_type` is 'Mann-Whitney'."),
+                        list(quote(is.character(genes) && !is.null(test_type) && test_type == "Mann-Whitney"),
+                             "If the `genes` parameter is a character vector, over-representation analysis will be performed.\nThus, the valid values for `test_type` are 'FISHER' or 'BINOMIAL'."),
+                        list(quote(is.data.frame(genes) && any(!is.null(ref_genes), !is.null(ref_organism))),
+                             "If the `genes` parameter is a data frame, statistical enrichment analysis will be performed.\nProviding a reference gene list (`ref_genes` and `ref_organism`) is not possible in this mode.")
+            )
+  )
 
-  ## Build POST API Request's body
-  call_body <- .rba_query(init = list(geneInputList =  paste(genes,
-                                                             collapse =  ","),
-                                      organism = organism,
-                                      annotDataSet = annot_dataset,
-                                      enrichmentTestType = test_type,
-                                      correction = correction),
-                          list("refInputList",
-                               !all(is.null(ref_genes)),
-                               paste(ref_genes, collapse =  ",")),
-                          list("refOrganism",
-                               !is.null(ref_organism),
-                               ref_organism))
+  if (is.character(genes)) {
+    if (is.null(test_type)) { test_type <- "FISHER" }
+    # Over-representation analysis
+    .msg("Performing PANTHER over-representation analysis (%s test) on %s genes from `organism %s` against `%s` datasets.",
+         switch(test_type, "FISHER" = "Fisher's exact", "BINOMIAL" = "Binomial"),
+         length(genes), organism, annot_dataset)
+
+    path_input <- "enrich/overrep"
+    encode_input <- "form"
+
+    ## Build POST API Request's body
+    call_body <- .rba_query(init = list(geneInputList =  paste(genes,
+                                                               collapse =  ","),
+                                        organism = organism,
+                                        annotDataSet = annot_dataset,
+                                        enrichmentTestType = test_type,
+                                        correction = correction),
+                            list("refInputList",
+                                 !all(is.null(ref_genes)),
+                                 paste(ref_genes, collapse =  ",")),
+                            list("refOrganism",
+                                 !is.null(ref_organism),
+                                 ref_organism))
+
+  } else {
+    # Enrichment analysis
+    .msg("Performing PANTHER statistical enrichment analysis (Mann-Whitney U Test) on %s genes and expression values from `organism %s` against `%s` datasets.",
+         nrow(genes), organism, annot_dataset)
+    path_input <- "enrich/statenrich"
+    encode_input <- "multipart"
+
+    ## Build POST API Request's body
+    temp_file <- tempfile(pattern = "rba_", fileext = ".txt")
+
+    utils::write.table(x = genes,
+                       file = temp_file,
+                       sep = "\t",
+                       quote = FALSE,
+                       row.names = FALSE,
+                       col.names = FALSE)
+
+    call_body <- list(organism = organism,
+                      annotDataSet = annot_dataset,
+                      correction = correction,
+                      geneExp = httr::upload_file(temp_file))
+
+  }
 
   ## Build Function-Specific Call
   parser_input <- list("json->list_simp",
@@ -233,8 +334,8 @@ rba_panther_enrich <- function(genes,
   input_call <- .rba_httr(httr = "post",
                           url = .rba_stg("panther", "url"),
                           path = paste0(.rba_stg("panther", "pth"),
-                                        "enrich/overrep"),
-                          encode = "form",
+                                        path_input),
+                          encode = encode_input,
                           body = call_body,
                           accept = "application/json",
                           parser = parser_input,