PANTHER Over-Representation or Enrichment Analysis

Use PANTHER services to perform over-representation or statistical enrichment analysis. You can provide either a character vector of gene IDs for over-representation analysis, or a data frame of gene IDs and expression values for statistical enrichment analysis.
See the Details section for more information on the statistical tests.

Usage

rba_panther_enrich(
  genes,
  organism,
  annot_dataset,
  test_type = NULL,
  correction = "FDR",
  cutoff = NULL,
  ref_genes = NULL,
  ref_organism = NULL,
  ...
)

Arguments

genes

Either a character vector or a data frame. Depending on this parameter, the analysis type is determined.

Character vector: If a character vector is supplied, over-representation analysis is performed using either Fisher's exact test (default) or the binomial test.
Data frame: If a data frame is supplied, a statistical enrichment test is performed using the Mann-Whitney U (Wilcoxon Rank-Sum) test. The data frame should have two columns: the first is a character vector of gene identifiers and the second is a numeric vector of expression values.

In both cases, a maximum of 10000 genes can be supplied. The gene identifiers can be any of: Ensembl gene ID, Ensembl protein ID, Ensembl transcript ID, Entrez gene ID, gene symbol, NCBI GI, HGNC ID, International protein index ID, NCBI UniGene ID, UniProt accession or UniProt ID.
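As a rough illustration of the rules above, the following sketch classifies an input and checks its shape before it is sent anywhere. The helper `describe_genes_input` and its `max_genes` argument are hypothetical and not part of rbioapi; the package performs its own argument checks.

```r
# Hypothetical helper (not part of rbioapi): report which analysis a given
# `genes` input would trigger, following the rules described above.
describe_genes_input <- function(genes, max_genes = 10000) {
  if (is.data.frame(genes)) {
    # Data frame: gene identifiers in column 1, numeric values in column 2
    stopifnot(
      ncol(genes) == 2,
      is.numeric(genes[[2]]),
      nrow(genes) <= max_genes
    )
    "statistical enrichment"
  } else {
    # Character vector of gene identifiers
    stopifnot(is.character(genes), length(genes) <= max_genes)
    "over-representation"
  }
}

describe_genes_input(c("BRCA1", "CDK1", "PLK1"))
describe_genes_input(
  data.frame(genes = c("BRCA1", "CDK1", "PLK1"), expr = c(1.2, 5.4, 3.3))
)
```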

organism

(Numeric) NCBI taxon ID. Run rba_panther_info with the argument 'what = "organisms"' to get a list of PANTHER's supported organisms.

annot_dataset

A PANTHER dataset ID to test your input against. Run rba_panther_info with the argument 'what = "datasets"' to get a list of PANTHER's supported datasets. Note that you should enter the "id" of the dataset, not its label (e.g. entering "biological_process" is incorrect; enter "GO:0008150" instead).

test_type

Statistical test type used to calculate the p-values.

If performing over-representation analysis (i.e. `genes` parameter is a character vector), valid values are "FISHER" (default) or "BINOMIAL".
If performing statistical enrichment analysis (i.e. `genes` parameter is a data.frame), the only valid value is "Mann-Whitney".

correction

P-value correction method. Either "FDR" (default), "BONFERRONI" or "NONE".

cutoff

(Numeric) (Optional) A threshold to filter the results. If correction is "FDR", the threshold is applied to the fdr column's values; otherwise, it is applied to the p-value column.
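The filtering rule can be pictured on a mock results table. This is only a sketch: the helper `apply_cutoff`, the column names `fdr` and `pValue`, and the use of `<=` as the comparison are assumptions made for illustration, and the values are invented.

```r
# Mock results table; column names and values are assumptions, not real output
mock_results <- data.frame(
  term = c("term_a", "term_b", "term_c"),
  pValue = c(0.001, 0.020, 0.300),
  fdr = c(0.004, 0.060, 0.300)
)

# Hypothetical helper mirroring the rule described for `cutoff`
apply_cutoff <- function(results, correction = "FDR", cutoff = NULL) {
  if (is.null(cutoff)) {
    return(results) # no threshold supplied: keep every row
  }
  # FDR correction filters on the fdr column, anything else on the p-value
  column <- if (correction == "FDR") "fdr" else "pValue"
  results[results[[column]] <= cutoff, , drop = FALSE]
}

apply_cutoff(mock_results, correction = "FDR", cutoff = 0.05)
apply_cutoff(mock_results, correction = "NONE", cutoff = 0.05)
```

With these mock numbers, the first call keeps one row and the second keeps two, because "term_b" passes the threshold on its raw p-value but not on its FDR.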

ref_genes

(Optional, only valid if genes is a character vector) A character vector of genes that will be used as the test's background (reference/universe) gene set. If no value is supplied, all of the genes in the specified organism will be used. The maximum length and supported IDs are the same as for the 'genes' argument.

ref_organism

(Optional, only valid if genes is a character vector) If 'ref_genes' is used, you can specify the organism that corresponds to the IDs supplied in the 'ref_genes' argument. See the 'organism' argument for supported values.

...

rbioapi option(s). See rba_options's arguments manual for more information on available options.

Value

A list with the parameters and results. If the analysis was successful, the results data frame is returned in the "results" element of the list. Otherwise, an error message is returned under the "search$error" element of the list.
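One way to handle both outcomes is sketched below. The helper `extract_enrichment` is hypothetical, and the two mock lists only imitate the shape described above; they are not real PANTHER output.

```r
# Hypothetical helper; `enrich_out` stands for a list returned by
# rba_panther_enrich()
extract_enrichment <- function(enrich_out) {
  error_msg <- enrich_out$search$error
  if (!is.null(error_msg)) {
    stop("PANTHER reported an error: ", error_msg, call. = FALSE)
  }
  enrich_out$results
}

# Mock objects shaped like the description above (not real PANTHER output)
ok_out <- list(results = data.frame(term = "term_a", fdr = 0.01))
bad_out <- list(search = list(error = "mock error message"))

extract_enrichment(ok_out)
try(extract_enrichment(bad_out))
```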

Details

Over-representation Test: It assesses whether specific gene sets are represented in your input gene list differently from what is expected by chance. It uses Fisher's exact test or Binomial test to calculate p-values. Fisher's exact test determines the probability of observing the gene counts in a category based on a hypergeometric distribution; the binomial test compares the observed proportion of genes in a category to the expected proportion based on the reference list. A significant p-value indicates over-representation or under-representation of a gene set.
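To make the two tests concrete, here is a minimal sketch using base R on a single category with invented counts. It assumes the input list is a subset of the reference list; how PANTHER builds its contingency table internally is not specified here, so treat this only as an illustration of the statistics.

```r
# Invented counts for one annotation category
n_input <- 100   # genes in the input list
k_input <- 12    # input genes annotated to the category
n_ref <- 20000   # genes in the reference list
k_ref <- 400     # reference genes annotated to the category

# Fisher's exact test on a 2x2 table: input list vs. rest of the reference
contingency <- matrix(
  c(k_input, n_input - k_input,
    k_ref - k_input, (n_ref - n_input) - (k_ref - k_input)),
  nrow = 2,
  dimnames = list(
    c("in_category", "not_in_category"),
    c("input", "rest_of_reference")
  )
)
fisher_p <- fisher.test(contingency)$p.value

# Binomial test: observed proportion in the input vs. the proportion
# expected from the reference list
binom_p <- binom.test(k_input, n_input, p = k_ref / n_ref)$p.value

c(fisher = fisher_p, binomial = binom_p)
```

Here 2 annotated genes would be expected by chance (100 × 400 / 20000) while 12 are observed, so both tests point to over-representation.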

Statistical Enrichment Test: The statistical enrichment test uses the Mann-Whitney U (Wilcoxon Rank-Sum) test to assess whether the expression values associated with genes in a specific category differ significantly from the overall distribution in the input list. This non-parametric test first ranks the numerical values and then tests whether the values for the genes in the category could have been randomly drawn from the overall distribution of values. A small p-value indicates that the numerical values for the genes in the category are significantly different from the background distribution, suggesting a non-random pattern.
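The idea can be sketched with base R's `wilcox.test` on simulated data. The gene names, the values and the category membership below are all invented, and comparing the category against the remaining genes is an assumption made for this illustration rather than a description of PANTHER's exact procedure.

```r
set.seed(1)

# Simulated expression values for 200 made-up genes
expr <- data.frame(
  gene = paste0("gene_", 1:200),
  value = rnorm(200)
)

# Pretend the first 20 genes belong to one category and shift their values
in_category <- rep(c(TRUE, FALSE), times = c(20, 180))
expr$value[in_category] <- expr$value[in_category] + 2

# Mann-Whitney U (Wilcoxon rank-sum) test: category genes vs. the other genes
mw <- wilcox.test(expr$value[in_category], expr$value[!in_category])
mw$p.value
```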

Please note that starting from rbioapi version 0.8.2, you can supply a gene expression data frame to perform statistical enrichment analysis. In earlier versions, only a character vector of gene IDs was possible, thus only over-representation analysis.
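If a script needs to run on machines with different rbioapi versions, a guard along these lines may help. The helper name `supports_stat_enrichment` is hypothetical; the version number is the one stated above.

```r
# Hypothetical helper: TRUE only if rbioapi is installed and its version is
# at least 0.8.2 (the version stated above for data frame input)
supports_stat_enrichment <- function() {
  requireNamespace("rbioapi", quietly = TRUE) &&
    utils::packageVersion("rbioapi") >= "0.8.2"
}

supports_stat_enrichment()
```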

Corresponding API Resources

"POST https://www.pantherdb.org/services/oai/pantherdb/enrich/overrep"
"POST https://www.pantherdb.org/services/oai/pantherdb/enrich/statenrich"

References

Huaiyu Mi, Dustin Ebert, Anushya Muruganujan, Caitlin Mills, Laurent-Philippe Albou, Tremayne Mushayamaha, Paul D Thomas, PANTHER version 16: a revised family classification, tree-based classification tool, enhancer regions and extensive API, Nucleic Acids Research, Volume 49, Issue D1, 8 January 2021, Pages D394–D403, https://doi.org/10.1093/nar/gkaa1106
PANTHER Services Details
Citations note on PANTHER website

Examples

# \donttest{
rba_panther_enrich(
  genes = c("p53", "BRCA1", "cdk2", "Q99835", "CDC42",
    "CDK1", "KIF23", "PLK1", "RAC2", "RACGAP1"),
  organism = 9606, annot_dataset = "GO:0008150",
  cutoff = 0.01
  )
# }

# \donttest{
expression_df <- data.frame(
  genes = c("p53", "BRCA1", "cdk2", "Q99835", "CDC42",
    "CDK1", "KIF23", "PLK1", "RAC2", "RACGAP1"),
  expr = runif(10, 0, 100)
  )

rba_panther_enrich(
  genes = expression_df,
  organism = 9606,
  annot_dataset = "GO:0008150"
  )
# }