---
title: "ADAM: Activity and Diversity Analysis Module"
author: "André L. Molan, Giordano B. S. Seco, Agnes A. S. Takeda, Jose L. Rybarczyk-Filho"
date: "`r doc_date()`"
package: "`r pkg_ver('ADAM')`"
bibliography: bibliography.bib
fig_caption: yes
output:
  BiocStyle::html_document:
    css: custom.css
vignette: >
  %\VignetteIndexEntry{"Using ADAM"}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Overview

*ADAM* is a GSEA (Gene Set Enrichment Analysis) R package created to group a set of genes from comparative samples (control *versus* experiment) according to their respective functions (Gene Ontology and KEGG pathways by default) and to show their significance by calculating p-values for gene diversity and activity (@Castro2009). Each group of genes is called a GFAG (Group of Functionally Associated Genes). The package supports many species with regard to genes and their respective functions.

In the package's analysis, all genes present in the expression data are grouped by their respective functions according to the domains given by the *AnalysisDomain* argument. The relationship between genes and functions is established from the species annotation package. If there is no annotation package, a three-column file (gene, function and function description) must be provided. For each GFAG, gene diversity and activity are calculated in each sample. As the package always compares two samples (control *versus* experiment), relative gene diversity and activity are then calculated for each GFAG. Using the bootstrap method, two p-values are calculated for each GFAG, one for relative gene diversity and one for relative gene activity. The p-values are then corrected with the method defined by the *PCorrectionMethod* argument, generating q-values (@molan2018). The significant GFAGs are those whose q-values fall below the cutoff set by the *PCorrection* argument. Optionally, the Wilcoxon rank sum test and/or Fisher's exact test can also be run (@fontoura2016).
These tests also provide a corrected p-value, and significant groups can be identified through them.

# GFAGAnalysis

The *GFAGAnalysis* function provides a complete analysis, using all available arguments. As an example, let's consider a gene expression set of *Aedes aegypti*:

```{r eval=TRUE, out.height.px=6, out.width.px=6}
suppressMessages(library(ADAM))
```

```{r eval=TRUE, out.height.px=6, out.width.px=6}
data("ExpressionAedes")
head(ExpressionAedes)
```

The first column holds the gene names, while the others hold the expression values obtained from a specific experiment (in this case, RNA-seq). ADAM always needs two samples (control *versus* experiment). We must therefore select two sample columns from the expression data for each comparison:

```{r eval=TRUE, out.height.px=6, out.width.px=6}
Comparison <- c("control1,experiment1","control2,experiment2")
```

Each GFAG has a number of genes associated with it. The analysis can consider all GFAGs or only those whose number of genes falls within a given range:

```{r eval=TRUE, out.height.px=6, out.width.px=6}
Minimum <- 3
Maximum <- 20
```

The p-values for each GFAG are calculated through the bootstrap method, which requires a seed for random number generation and a number of bootstrap steps (the number of steps should be large enough to ensure the precision of the p-values):

```{r eval=TRUE, out.height.px=6, out.width.px=6}
SeedBootstrap <- 1049
StepsBootstrap <- 1000
```

The p-values will be corrected by a specific method with a certain cutoff value:

```{r eval=TRUE, out.height.px=6, out.width.px=6}
CutoffValue <- 0.05
MethodCorrection <- "fdr"
```

Grouping the genes according to their biological functions requires either an annotation package or a file relating genes and functions. In this case, *Aedes aegypti* doesn't have an annotation package.
Therefore, we supply our own file:

```{r eval=TRUE, out.height.px=6, out.width.px=6}
data("KeggPathwaysAedes")
head(KeggPathwaysAedes)
```

We must also specify which function domain and gene nomenclature are being used. As *Aedes aegypti* doesn't have an annotation package, the domain will be "own" and the nomenclature "geneStableID":

```{r eval=TRUE, out.height.px=6, out.width.px=6}
Domain <- "own"
Nomenclature <- "geneStableID"
```

The Wilcoxon rank sum test and Fisher's exact test will both be run:

```{r eval=TRUE, out.height.px=6, out.width.px=6}
Wilcoxon <- TRUE
Fisher <- TRUE
```

With all arguments defined, we can run the *GFAGAnalysis* function:

```{r eval=TRUE, out.height.px=6, out.width.px=6}
ResultAnalysis <- suppressMessages(GFAGAnalysis(ComparisonID = Comparison,
                            ExpressionData = ExpressionAedes,
                            MinGene = Minimum,
                            MaxGene = Maximum,
                            SeedNumber = SeedBootstrap,
                            BootstrapNumber = StepsBootstrap,
                            PCorrection = CutoffValue,
                            DBSpecies = KeggPathwaysAedes,
                            PCorrectionMethod = MethodCorrection,
                            WilcoxonTest = Wilcoxon,
                            FisherTest = Fisher,
                            AnalysisDomain = Domain,
                            GeneIdentifier = Nomenclature))
```

In the example above, we used the function *suppressMessages* only to hide the messages shown while *GFAGAnalysis* runs. The output of *GFAGAnalysis* is a *list* with two elements. The first is a *data frame* showing genes and their respective functions:

```{r eval=TRUE, out.height.px=6, out.width.px=6}
head(ResultAnalysis[[1]])
```

The second element of the output list holds one data frame for each comparison defined in the *ComparisonID* argument:

```{r eval=TRUE, out.height.px=6, out.width.px=6}
DT::datatable(as.data.frame(ResultAnalysis[[2]][1]), width = 800, options = list(scrollX = TRUE))
DT::datatable(as.data.frame(ResultAnalysis[[2]][2]), width = 800, options = list(scrollX = TRUE))
```

The data frames in the second element of the list have the following columns:

* **ID** - A code identifying the GFAG (GO term, KEGG pathway or one according to the user's annotations).
* **Description** - Description of the GFAG.
* **Raw_Number_Genes** - Total number of genes related to the GFAG.
* **Sample_Number_Genes** - Number of genes, present in the sample, related to the GFAG.
* **H_** - Two columns. GFAG gene diversity of each sample (control *versus* experiment).
* **N_** - Two columns. GFAG gene activity of each sample (control *versus* experiment).
* **h** - Relative gene diversity.
* **n** - Relative gene activity.
* **pValue_h** - GFAG p-value related to gene diversity.
* **pValue_n** - GFAG p-value related to gene activity.
* **qValue_h** - GFAG corrected p-value related to gene diversity.
* **qValue_n** - GFAG corrected p-value related to gene activity.
* **Significance_h** - GFAG significance related to gene diversity. "significative" means the q-value is lower than the cutoff set by the *PCorrection* argument, while "not-significative" means the opposite.
* **Significance_n** - GFAG significance related to gene activity. "significative" means the q-value is lower than the cutoff set by the *PCorrection* argument, while "not-significative" means the opposite.
* **Wilcox_pvalue** - GFAG p-value generated by the Wilcoxon rank sum test.
* **Wilcox_qvalue** - Wilcoxon GFAG corrected p-value.
* **Wilcox_significance** - GFAG significance related to the Wilcoxon test. "significative" means the q-value is lower than the cutoff set by the *PCorrection* argument, while "not-significative" means the opposite.
* **Fisher_pvalue** - GFAG p-value generated by Fisher's exact test.
* **Fisher_qvalue** - Fisher GFAG corrected p-value.
* **Fisher_significance** - GFAG significance related to Fisher's exact test. "significative" means the q-value is lower than the cutoff set by the *PCorrection* argument, while "not-significative" means the opposite.

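
To make the relationship between the p-value, q-value and significance columns concrete, here is a minimal sketch using base R only. The data frame `toy`, its IDs and its p-values are made up for illustration and do not come from ADAM; the sketch assumes the "fdr" correction is applied as in `stats::p.adjust` and that the labels follow the rule described above (q-value lower than the cutoff).

```r
# Hypothetical p-values standing in for a pValue_n column (made-up numbers).
toy <- data.frame(
  ID = c("path1", "path2", "path3", "path4"),
  pValue_n = c(0.001, 0.02, 0.04, 0.60),
  stringsAsFactors = FALSE
)

cutoff <- 0.05

# "fdr" is an alias for the Benjamini-Hochberg method in stats::p.adjust().
toy$qValue_n <- p.adjust(toy$pValue_n, method = "fdr")

# Label each group using the rule stated for the Significance_ columns.
toy$Significance_n <- ifelse(toy$qValue_n < cutoff,
                             "significative", "not-significative")

# Keep only the groups that pass the cutoff.
significant <- toy[toy$qValue_n < cutoff, ]
print(significant)
```

Note that `path3` has a raw p-value below 0.05 but is no longer significant after correction, which is why filtering should be done on the q-value columns rather than on the raw p-values.
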
# ADAnalysis

The *ADAnalysis* function provides a partial analysis, in which only the gene diversity and activity of each GFAG are calculated, with no significance assessment by bootstrap, Wilcoxon or Fisher. As an example, let's consider the same gene expression set of *Aedes aegypti* previously used in the *GFAGAnalysis* function example:

```{r eval=TRUE, out.height.px=6, out.width.px=6}
suppressMessages(library(ADAM))
data("ExpressionAedes")
data("KeggPathwaysAedes")
```

As ADAM always needs two samples (control *versus* experiment), let's select two sample columns from the expression data and define the minimum and maximum number of genes per GFAG:

```{r eval=TRUE, out.height.px=6, out.width.px=6}
Comparison <- c("control1,experiment1")
Minimum <- 3
Maximum <- 100
```

*Aedes aegypti* doesn't have an annotation package. Therefore, we supply our own file:

```{r eval=TRUE, out.height.px=6, out.width.px=6}
SpeciesID <- "KeggPathwaysAedes"
```

We must also specify which function domain and gene nomenclature are being used. As *Aedes aegypti* doesn't have an annotation package, the domain will be "own" and the nomenclature "geneStableID":

```{r eval=TRUE, out.height.px=6, out.width.px=6}
Domain <- "own"
Nomenclature <- "geneStableID"
```

With all arguments defined, we can run the *ADAnalysis* function:

```{r eval=TRUE, out.height.px=6, out.width.px=6}
ResultAnalysis <- suppressMessages(ADAnalysis(ComparisonID = Comparison,
                            ExpressionData = ExpressionAedes,
                            MinGene = Minimum,
                            MaxGene = Maximum,
                            DBSpecies = KeggPathwaysAedes,
                            AnalysisDomain = Domain,
                            GeneIdentifier = Nomenclature))
```

In the example above, we used the function *suppressMessages* only to hide the messages shown while *ADAnalysis* runs. The output of *ADAnalysis* is a *list* with two elements.
The first is a *data frame* showing genes and their respective functions:

```{r eval=TRUE, out.height.px=6, out.width.px=6}
head(ResultAnalysis[[1]])
```

The second element of the output list holds one data frame for each comparison defined in the *ComparisonID* argument:

```{r eval=TRUE, out.height.px=6, out.width.px=6}
DT::datatable(as.data.frame(ResultAnalysis[[2]][1]), width = 800, options = list(scrollX = TRUE))
```

The data frames in the second element of the list have the following columns:

* **ID** - A code identifying the GFAG (GO term, KEGG pathway or one according to the user's annotations).
* **Description** - Description of the GFAG.
* **Raw_Number_Genes** - Total number of genes related to the GFAG.
* **Sample_Number_Genes** - Number of genes, present in the sample, related to the GFAG.
* **H_** - Two columns. GFAG gene diversity of each sample (control *versus* experiment).
* **N_** - Two columns. GFAG gene activity of each sample (control *versus* experiment).
* **h** - Relative gene diversity.
* **n** - Relative gene activity.

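
The following is a minimal sketch of how per-sample and relative quantities such as **H_**, **N_**, **h** and **n** could be related. It rests on assumptions that this vignette does not state: diversity is taken to be a Shannon entropy of the expression proportions, normalized by the logarithm of the number of genes; activity is taken to be the total expression of the group; and each relative quantity is the experiment value divided by the sum of the experiment and control values. The helper names `diversity` and `activity` and all numbers are hypothetical, and ADAM's internal computation may differ.

```r
# Hypothetical expression values for one GFAG with four genes (made-up numbers).
control    <- c(10, 10, 10, 10)   # expression spread evenly across genes
experiment <- c(37, 1, 1, 1)      # expression concentrated in one gene

# Assumed definition: activity is the total expression of the group.
activity <- function(s) sum(s)

# Assumed definition: diversity is the Shannon entropy of the expression
# proportions, divided by log(number of genes) so that it lies in [0, 1].
# Requires at least two genes, otherwise log(length(s)) is zero.
diversity <- function(s) {
  p <- s / sum(s)
  p <- p[p > 0]                   # treat 0 * log(0) as 0
  -sum(p * log(p)) / log(length(s))
}

H_control    <- diversity(control)
H_experiment <- diversity(experiment)
N_control    <- activity(control)
N_experiment <- activity(experiment)

# Assumed definition: relative values compare experiment against the total.
h <- H_experiment / (H_experiment + H_control)
n <- N_experiment / (N_experiment + N_control)

print(c(h = h, n = n))
```

Under these assumptions, a relative value of 0.5 means that control and experiment are equal, while a value below 0.5 means that the experiment sample has the lower value. In this toy example both samples have the same total expression, so `n` is 0.5, but the experiment concentrates expression in a single gene, so `h` falls below 0.5.
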
# Session information

```{r, eval=TRUE, out.height.px=6, out.width.px=6, label='Session information', , echo=FALSE}
sessionInfo()
```

# References