Add support to provide expression data frame to PANTHER enrichment
- Refined based on PANTHER's documentation: if a gene vector is supplied, the analysis is named over-representation analysis; if a data frame with expression values is supplied, the analysis is named statistical enrichment.
moosa-r committed Dec 28, 2024
1 parent a0866ad commit 5be8e93
Showing 2 changed files with 236 additions and 73 deletions.
199 changes: 150 additions & 49 deletions R/panther.R
@@ -81,45 +81,90 @@ rba_panther_mapping <- function(genes,
return(final_output)
}

#' PANTHER Over-Representation Enrichment Analysis
#'
#' Using this function you can use PANTHER services to perform
#' over-representation enrichment analysis. This statistical test will
#' compare your input genes to a set of defined gene lists to determine
#' if they are over/under-represented.
#'
#' @param genes Character vector of genes identifiers with maximum length of
#' 10000. Can be any of: Ensemble gene ID, Ensemble protein ID, Ensemble
#' transcript ID, Entrez gene ID, gene symbol, NCBI GI, HGNC ID,
#' PANTHER Over-Representation or Enrichment Analysis
#'
#' Use PANTHER services to perform over-representation or statistical
#' enrichment analysis. You can either provide a character vector of gene IDs
#' for over-representation analysis, or a data frame of gene IDs and
#' expression values for statistical enrichment analysis.\cr Please refer to
#' the details section for more information on the statistical analysis.
#'
#' \strong{Over-representation Test}: It assesses whether specific gene sets are
#' represented in your input gene list differently from what is expected by
#' chance. It uses Fisher's exact test or Binomial test to calculate p-values.
#' Fisher's exact test determines the probability of observing the gene
#' counts in a category based on a hypergeometric distribution; the binomial
#' test compares the observed proportion of genes in a category to the
#' expected proportion based on the reference list. A significant p-value
#' indicates over-representation or under-representation of a gene set.
#'
#' \strong{Statistical Enrichment Test}: The statistical enrichment test uses the
#' Mann-Whitney U (Wilcoxon Rank-Sum) test to assess if the expression values
#' associated with genes in a specific category differ significantly from the
#' overall distribution in the input list. This non-parametric test first
#' ranks the numerical values and then tests whether the values of the genes
#' in the category could have been drawn at random from the overall
#' distribution of values. A small p-value indicates that the numerical
#' values for the genes in the category differ significantly from the
#' background distribution, i.e. that they follow a non-random pattern.
#'
#' Please note that starting from rbioapi version 0.8.2, you can supply a
#' gene expression data frame to perform statistical enrichment analysis.
#' In earlier versions, only a character vector of gene IDs was possible,
#' thus only over-representation analysis.
#'
#' @param genes Either a character vector or a data frame. Depending on
#' this parameter, the analysis type is determined.
#' \describe{
#' \item{Character vector:}{If a character vector is supplied,
#' over-representation analysis will be performed using either Fisher's exact
#' test (default) or the binomial test.}
#' \item{Data frame:}{If a data.frame is supplied, statistical enrichment
#' test is performed using Mann-Whitney U (Wilcoxon Rank-Sum) test. The
#' data frame should have two columns: the first column is a character vector
#' with gene identifiers and the second column is a numerical vector with
#' expression values.}
#' }
#' In both cases, a maximum of 10000 genes can be supplied.
#' The gene identifiers can be any of: Ensembl gene ID, Ensembl protein ID,
#' Ensembl transcript ID, Entrez gene ID, gene symbol, NCBI GI, HGNC ID,
#' International protein index ID, NCBI UniGene ID, UniProt accession
#' and/or UniProt ID.
#' or UniProt ID.
#' @param organism (numeric) NCBI taxon ID. run \code{\link{rba_panther_info}}
#' with argument 'what = "organisms"' to get a list of PANTHER's
#' supported organisms.
#' @param annot_dataset A PANTHER dataset ID to test your input against it.
#' run \code{\link{rba_panther_info}}with argument 'what = "datasets"' to
#' run \code{\link{rba_panther_info}} with argument 'what = "datasets"' to
#' get a list of PANTHER's supported datasets. Note that you should enter
#' the "id" of the dataset, not its label (e.g. entering "biological_process"
#' is incorrect, you should rather enter "GO:0008150").
#' @param test_type statistical test type to calculate the p values. either
#' "FISHER" (default) or "BINOMIAL".
#' @param test_type statistical test type to calculate the p values.
#' \itemize{
#' \item If performing over-representation analysis (i.e. `genes` parameter
#' is a character vector), valid values are "FISHER" (default) or "BINOMIAL".
#' \item If performing statistical enrichment analysis (i.e. `genes`
#' parameter is a data.frame), the only valid value is "Mann-Whitney".
#' }
#' @param correction p value correction method. either "FDR" (default),
#' "BONFERRONI" or "NONE".
#' @param cutoff (Numeric) (Optional) a threshold to filter the results.
#' if correction is "FDR", the threshold will be applied to fdr column's
#' values; if otherwise, the threshold will be applied to p value column.
#' @param ref_genes (Optional) A set of genes that will be used as the test's
#' background (reference/universe) gene set. If no value supplied, all of
#' the genes in specified organism will be used. maximum length and supported
#' IDs are the same as 'genes' argument.
#' @param ref_organism (Optional) if 'ref_genes' is used, you can specify
#' the organisms which correspond to your supplied IDs in 'ref_genes'
#' argument. see 'organism' argument for supported values.
#' @param ref_genes (Optional, only valid if genes is a character vector)
#' A character vector of genes that will be used as the test's background
#' (reference/universe) gene set. If no value supplied, all of the genes in
#' specified organism will be used. The maximum length and supported IDs are
#' the same as 'genes' argument.
#' @param ref_organism (Optional, only valid if genes is a character vector)
#' if 'ref_genes' is used, you can specify the organisms which correspond to
#' your supplied IDs in 'ref_genes' argument. see 'organism' argument for
#' supported values.
#' @param ... rbioapi option(s). See \code{\link{rba_options}}'s
#' arguments manual for more information on available options.
#'
#' @section Corresponding API Resources:
#' "POST https://www.pantherdb.org/services/oai/pantherdb/enrich/overrep"
#' \cr "POST https://www.pantherdb.org/services/oai/pantherdb/enrich/statenrich"
#'
#' @return A list with the parameters and results. If the analysis was
#' successful, the results data frame is returned in the "results" element
@@ -142,18 +187,34 @@ rba_panther_mapping <- function(genes,
#'
#' @examples
#' \donttest{
#' rba_panther_enrich(genes = c("p53", "BRCA1", "cdk2", "Q99835", "CDC42",
#' "CDK1", "KIF23", "PLK1", "RAC2", "RACGAP1"),
#' organism = 9606, annot_dataset = "GO:0008150",
#' cutoff = 0.01)
#' rba_panther_enrich(
#' genes = c("p53", "BRCA1", "cdk2", "Q99835", "CDC42",
#' "CDK1", "KIF23", "PLK1", "RAC2", "RACGAP1"),
#' organism = 9606, annot_dataset = "GO:0008150",
#' cutoff = 0.01
#' )
#' }
#'
#' \donttest{
#' expression_df <- data.frame(
#' genes = c("p53", "BRCA1", "cdk2", "Q99835", "CDC42",
#' "CDK1", "KIF23", "PLK1", "RAC2", "RACGAP1"),
#' expr = runif(10, 0, 100)
#' )
#'
#' rba_panther_enrich(
#' genes = expression_df,
#' organism = 9606,
#' annot_dataset = "GO:0008150"
#' )
#' }
#'
#' @family "PANTHER"
#' @export
rba_panther_enrich <- function(genes,
organism,
annot_dataset,
test_type = "FISHER",
test_type = NULL,
correction = "FDR",
cutoff = NULL,
ref_genes = NULL,
@@ -164,8 +225,7 @@ rba_panther_enrich <- function(genes,
## Check User-input Arguments
.rba_args(cons = list(list(arg = "genes",
class = c("character",
"numeric"),
max_len = 100000),
"data.frame")),
list(arg = "organism",
class = "numeric",
len = 1),
@@ -174,7 +234,7 @@ rba_panther_enrich <- function(genes,
len = 1),
list(arg = "test_type",
class = "character",
val = c("FISHER", "BINOMIAL"),
val = c("FISHER", "BINOMIAL", "Mann-Whitney"),
len = 1),
list(arg = "correction",
class = "character",
@@ -191,25 +251,66 @@ rba_panther_enrich <- function(genes,
list(arg = "ref_organism",
class = "numeric",
len = 1)),
cond = list(list(quote(!is.null(ref_organism) && is.null(ref_genes)),
"'ref_organism' was ignored because no 'ref_genes' was supplied.")),
cond_warning = TRUE)
.msg("Performing over-representation enrichment analysis of %s input genes of organism %s against %s datasets.",
length(genes), organism, annot_dataset)
cond = list(list(quote(xor(is.null(ref_organism), is.null(ref_genes))),
"'ref_organism' and 'ref_genes' should be supplied together."),
list(quote(is.data.frame(genes) && (ncol(genes) != 2 || !inherits(genes[[1]], "character") || !inherits(genes[[2]], "numeric"))),
"If the `genes` parameter is a data frame, statistical enrichment analysis will be performed.\nThe `genes` parameter should be a data frame with 2 columns, where the first column contains the gene identifiers and the second column contains numerical expression values."),
list(quote(is.data.frame(genes) && !is.null(test_type) && test_type != "Mann-Whitney"),
"If the `genes` parameter is a data frame, statistical enrichment analysis will be performed.\nThus, the only valid parameter for `test_type` is 'Mann-Whitney'."),
list(quote(is.character(genes) && !is.null(test_type) && test_type == "Mann-Whitney"),
"If the `genes` parameter is a character vector, over-representation analysis will be performed.\nThus, the valid parameters for `test_type` are either 'FISHER' or 'BINOMIAL'."),
list(quote(is.data.frame(genes) && any(!is.null(ref_genes), !is.null(ref_organism))),
"If the `genes` parameter is a data frame, statistical enrichment analysis will be performed.\nProviding a reference gene list (`ref_genes` and `ref_organism`) is not possible in this mode.")
)
)

## Build POST API Request's body
call_body <- .rba_query(init = list(geneInputList = paste(genes,
collapse = ","),
organism = organism,
annotDataSet = annot_dataset,
enrichmentTestType = test_type,
correction = correction),
list("refInputList",
!all(is.null(ref_genes)),
paste(ref_genes, collapse = ",")),
list("refOrganism",
!is.null(ref_organism),
ref_organism))
if (is.character(genes)) {
if (is.null(test_type)) { test_type <- "FISHER" }
# Over-representation analysis
.msg("Performing PANTHER over-representation analysis (%s test) on %s genes from `organism %s` against `%s` datasets.",
switch(test_type, "FISHER" = "Fisher's exact", "BINOMIAL" = "Binomial"),
length(genes), organism, annot_dataset)

path_input <- "enrich/overrep"
encode_input <- "form"

## Build POST API Request's body
call_body <- .rba_query(init = list(geneInputList = paste(genes,
collapse = ","),
organism = organism,
annotDataSet = annot_dataset,
enrichmentTestType = test_type,
correction = correction),
list("refInputList",
!all(is.null(ref_genes)),
paste(ref_genes, collapse = ",")),
list("refOrganism",
!is.null(ref_organism),
ref_organism))

} else {
# Enrichment analysis
.msg("Performing PANTHER statistical enrichment analysis (Mann-Whitney U Test) on %s genes and expression values from `organism %s` against `%s` datasets.",
nrow(genes), organism, annot_dataset)
path_input <- "enrich/statenrich"
encode_input <- "multipart"

## Build POST API Request's body
temp_file <- tempfile(pattern = "rba_", fileext = ".txt")

utils::write.table(x = genes,
file = temp_file,
sep = "\t",
quote = FALSE,
row.names = FALSE,
col.names = FALSE)

call_body <- list(organism = organism,
annotDataSet = annot_dataset,
correction = correction,
geneExp = httr::upload_file(temp_file))

}

## Build Function-Specific Call
parser_input <- list("json->list_simp",
@@ -233,8 +334,8 @@ rba_panther_enrich <- function(genes,
input_call <- .rba_httr(httr = "post",
url = .rba_stg("panther", "url"),
path = paste0(.rba_stg("panther", "pth"),
"enrich/overrep"),
encode = "form",
path_input),
encode = encode_input,
body = call_body,
accept = "application/json",
parser = parser_input,
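
The branching introduced in the hunks above can be sketched outside the package. The helper below is hypothetical (`choose_panther_mode` is not an rbioapi function); it only mirrors how the class of `genes` selects the endpoint path, the body encoding and the default test. One deliberate difference, stated as an assumption: it checks the expression column with `is.numeric()`, which also accepts integer columns, whereas the commit uses `inherits(genes[[2]], "numeric")`.

```r
# Hypothetical helper, for illustration only; not part of rbioapi.
# Mirrors the dispatch on the class of `genes` shown in the diff.
choose_panther_mode <- function(genes, test_type = NULL) {
  if (is.character(genes)) {
    # Character vector: over-representation analysis, Fisher's test by default
    if (is.null(test_type)) {
      test_type <- "FISHER"
    }
    stopifnot(test_type %in% c("FISHER", "BINOMIAL"))
    list(path = "enrich/overrep",
         encode = "form",
         test_type = test_type,
         n_genes = length(genes))
  } else if (is.data.frame(genes)) {
    # Data frame: statistical enrichment, two columns (IDs, numeric values)
    stopifnot(ncol(genes) == 2,
              is.character(genes[[1]]),
              is.numeric(genes[[2]]))  # assumption: integers are accepted too
    if (!is.null(test_type)) {
      stopifnot(identical(test_type, "Mann-Whitney"))
    }
    list(path = "enrich/statenrich",
         encode = "multipart",
         test_type = "Mann-Whitney",
         n_genes = nrow(genes))
  } else {
    stop("`genes` must be a character vector or a data frame.")
  }
}

# Usage: a character vector and a two-column data frame
mode_vec <- choose_panther_mode(c("TP53", "BRCA1"))
mode_df <- choose_panther_mode(
  data.frame(id = c("TP53", "BRCA1", "CDK2"),
             expr = c(1.5, 2, 3),
             stringsAsFactors = FALSE)
)
```

Keeping the mode decision in one place like this makes it easy to see that `test_type`, the endpoint and the encoding always change together.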

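The three tests named in the documentation of this commit can be tried locally with base R's `stats` functions. This is only an illustration of the statistics, not PANTHER's implementation: all counts and expression values are made up, and the layout of the 2x2 table is an assumption.

```r
# Illustration only: made-up numbers, base R tests, not PANTHER's code.

# Over-representation, Fisher's exact test.
# Assumed layout: rows = in category / not in category,
# columns = input gene list / reference list.
counts <- matrix(c(8, 92, 200, 19700), nrow = 2,
                 dimnames = list(c("in_category", "not_in_category"),
                                 c("input_list", "reference")))
fisher_res <- stats::fisher.test(counts)

# Over-representation, binomial test: 8 of 100 input genes fall in the
# category, compared with the proportion expected from the reference list.
binom_res <- stats::binom.test(x = 8, n = 100, p = 200 / 19900)

# Statistical enrichment, Mann-Whitney U (Wilcoxon rank-sum) test:
# expression values of genes in a category versus all other input genes.
values_in_category <- c(9.1, 8.7, 9.5, 8.9, 9.8)
values_other <- c(1.2, 2.5, 3.1, 0.8, 2.2, 4.0, 3.6, 1.9, 2.8, 3.3)
wilcox_res <- stats::wilcox.test(values_in_category, values_other)

p_values <- c(fisher = fisher_res$p.value,
              binomial = binom_res$p.value,
              mann_whitney = wilcox_res$p.value)
```

In all three made-up cases the category clearly stands out from the background, so each test returns a small p-value.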