--- title: "mdgsa Library" author: "[David Montaner](http://www.dmontaner.com)" date: "(2014-11-24)" output: BiocStyle::pdf_document: toc: yes fig_width: 4 fig_height: 4.5 BiocStyle::html_document: toc: yes fig_width: 5 fig_height: 5 vignette: > %\VignetteIndexEntry{mdgsa_vignette} %\VignetteEngine{knitr::rmarkdown} %\usepackage[utf8]{inputenc} --- ```{r style, echo = FALSE, results = 'asis'} BiocStyle::markdown() ``` ```{r, include = FALSE} library (knitr) ## this is needed when R CMD build opts_chunk$set (message = FALSE) ``` [montaner2010]:http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0010348 "mdgsa Paper" [paperALL]: http://bloodjournal.hematologylibrary.org/cgi/pmidlookup?view=long&pmid=14684422 "Reference of the ALL dataset" [bioconductor]:http://bioconductor.org/ "Bioconductor home page" [all]:http://bioconductor.org/packages/release/data/experiment/html/ALL.html "ALL Bioconductor data library" [limma]:http://bioconductor.org/packages/release/bioc/html/limma.html "Bioconductor limma package" [hgu95av2.db]:http://bioconductor.org/packages/release/data/annotation/html/hgu95av2.db.html "Bioconductor annotation package" [kegg]:http://www.genome.jp/kegg/ "KEGG web site" [go]:http://geneontology.org/ "Gene Ontology Consortium" Introduction ================================================================================ The `mdgsa` library implements the _gene set analysis_ methodology developed in [Montaner and Dopazo (2010)][montaner2010]. It presents a flexible framework for analyzing the enrichment of _gene sets_ along a given _ranking_ of genes. The novelty is that, not just one _ranking index_ but two, may be analyzed and jointly explored in a __multidimensional gene set analysis__. As classical _GSEA_, our approach allows for the functional profiling of isolated genomic characteristics; differential gene expression, copy number analyses, or variant to disease associations may be interpreted in terms of _gene sets_ using the `mdgsa` package. But more interestingly, our multivariate approach may be used to find out _gene set_ enrichments due to the combined effect of two of such genomic dimensions. We could for instance detect _gene sets_ affected by the interaction of gene expression changes and copy number alterations. Citation ---------------------------------------- Further description of the `mdgsa` methods may be found at: > _Multidimensional gene set analysis of genomic data._ > David Montaner and Joaquin Dopazo. > [PLoS One. 2010 Apr 27;5(4):e10348. doi: 10.1371/journal.pone.0010348.][montaner2010] Functional Profiling of Gene Expression Data ================================================================================ In this tutorial we use the data in the [Acute Lymphocytic Leukemia expression dataset][all] package of [Bioconductor]. First we will use the [limma] library to compute a differential gene expression analysis. Then we will use the functions `uvGsa` and `mdGsa` in the `mdgsa` package to perform uni-dimensional and bi-dimensional gene set analyses respectively. The functional interpretation will be done in terms of [KEGG] Pathways. The annotation will be taken from the [hgu95av2.db] library in [Bioconductor]. First we load the data and describe the design matrix of the experiment ```{r} library (ALL) data (ALL) des.mat <- model.matrix (~ 0 + mol.biol, data = ALL) colnames (des.mat) <- c("ALL", "BCR", "E2A", "NEG", "NUP", "p15") head (des.mat) ``` Then we can use `limma` to carry out some gene expression comparisons. We can for instance compare _ALL_ samples to _NEG_ control samples or explore gene differential expression between _BCR_ and _NEG_. ```{r} library (limma) cont.mat <- makeContrasts (ALL-NEG, BCR-NEG, levels = des.mat) cont.mat fit <- lmFit (ALL, design = des.mat) fit <- contrasts.fit (fit, cont.mat) fit <- eBayes (fit) ``` From this analysis we get _test statistics_ and _p-values_ for each of the two contrasts ```{r} fit$t[1:3,] fit$p.value[1:3,] ``` These gene level information may be now interpreted in terms of _gene sets_. For this example we will carry out a _gene set analysis_ using the functional blocks described in [KEGG], but any other functional data base such as the [Gene Ontology][go] or even a customized one may be analyzed using `mdgsa`. We can get the [KEGG] annotation from the [hgu95av2.db] library as follows. ```{r} library (hgu95av2.db) anmat <- toTable (hgu95av2PATH) anmat[1:3,] ``` Univariate Gene Set Analysis -------------------------------------------------------------------------------- We can now carry out the functional interpretation of the contrast "BCR - NEG" for instance. The data needed for the gene set analysis are the p-values and test statistics returned by `limma` at gene level. ```{r} fit$t[1:3, "BCR - NEG"] fit$p.value[1:3, "BCR - NEG"] ``` We load the `mdgsa` library. ```{r} library (mdgsa) ``` The first step in the procedure is to combine p-values and test statistics in a single ranking value. ```{r} rindex <- pval2index (pval = fit$p.value[,"BCR - NEG"], sign = fit$t[,"BCR - NEG"]) rindex <- indexTransform (rindex) rindex[1:3] ``` This _ranking index_ keeps the sign of the test statistic; that is, the information of whether the gene is over or underexpressed. ```{r} plot (fit$t[,"BCR - NEG"], rindex) ``` but the evidence of the differential expression is taken directly from the p-value ```{r} plot (fit$p.value[,"BCR - NEG"], rindex) ``` Next we will need to format the annotation. The function `annotMat2list` in the `mdgsa` library converts the annotation matrix into a list. ```{r} anmat[1:3,] annot <- annotMat2list (anmat) length (annot) ``` Each element of the list contains the gene names of a gene set. ```{r} lapply (annot[1:3], head, n= 3) ``` It is also important to make sure that the _gene universe_ described by the _ranking index_ and the _annotation_ are concordant. The function `annotFilter` encompasses the annotation list to the names in the ranking index; it is also used to exclude functional blocks too small to be considered gene sets, or too big to be specific of any biological process of interest. ```{r} annot <- annotFilter (annot, rindex) ``` Now everything is ready to carry out the _univariate_ gene set analysis. ```{r uvGsa} res.uv <- uvGsa (rindex, annot) ``` The output of the `uvGsa` function is a data frame which rows correspond to functional blocks analyzed. ```{r} res.uv[1:3,] ``` The `uvGsa` function fits a _logistic regression model_ relating, the probability of genes belonging to the gene set, with the value of the ranking statistic. Significant and __positive__ _log odds ratio_ (`lor`) indicate that the gene set is enriched in those genes with __high values__ of the ranking statistic. In our example this means that genes up regulated in _BCR_ compared to _NEG_ are more likely to belong to the functional block. We could also say that the block of genes shows a significant degree of over expression _BCR_ compared to _NEG_. On the other hand, when a gene set has a __negative__ _log odds ratio_ we can say that the genes in the set are more likely to be associated to __low values__, negative in our case, of the ranking statistics. In our case this means that the gene set is down regulated in _BCR_ compared to _NEG_. The function `uvPat` helps you classifying the analyzed gene sets ```{r} res.uv[,"pat"] <- uvPat (res.uv, cutoff = 0.05) table (res.uv[,"pat"]) ``` Positive values (1) correspond to significant and positive _log odds ratios_. Negative values (-1) correspond to significant and negative _log odds ratios_. Zeros correspond to non enriched blocks. As in this example we are analyzing [KEGG] pathways we can use the function `getKEGGnames` to find out the "name" of the pathways^[Similar function `getGOnames` is available to be used with [Gene Ontology][go] terms.]. ```{r, results = "hide"} res.uv[,"KEGG"] <- getKEGGnames (res.uv) ``` Finally, the function `uvSignif` may help us displaying just the enriched blocks. ```{r} res <- uvSignif (res.uv) res[,c("pat", "KEGG")] ``` Multivariate Gene Set Analysis -------------------------------------------------------------------------------- But with the `mdgsa` library we can analyze not just one but two ranking statistics at a time. In our example we were interested not in just one differential expression contrast but in two: ALL vs. NEG and BCR vs. NEG. We used `limma` to fit this two contrasts (see previous sections) and computed gene statistics and p-values for for each of them. ```{r} fit$t[1:3,] fit$p.value[1:3,] ``` We can combine this two matrices in a single one containing a _ranking statistic_ for each contrast, just as we did in the _univariate_ example. ```{r} rindex <- pval2index (pval = fit$p.value, sign = fit$t) rindex <- indexTransform (rindex) rindex[1:3,] ``` Now we can explore the bi-dimensional distribution of this ranking indexes ```{r} plot (rindex, pch = ".") ``` and search for gene sets enrichment patterns in both _dimensions_ The same annotation list we __filtered__ in the previous section may be used in this second example. Thus everything is ready to carry out our _multidimensional_ analysis. ```{r mdGsa} res.md <- mdGsa (rindex, annot) ``` As in the _univariate_ analysis, the output of the `mdGsa` function is a data frame with a row per analyzed gene set. ```{r} res.md[1:3,] ``` The column of the row contain _log odds ratios_ and p-values for each of the analyzed _dimensions_ and also for their _interaction_ effect. The function `mdPat` helps clarifying the bi-dimensional pattern of enrichment. ```{r} res.md[,"pat"] <- mdPat (res.md) table (res.md[,"pat"]) ``` And as before we can incorporate the KEGG names to our results. ```{r, results = "hide"} res.md[,"KEGG"] <- getKEGGnames (res.md) res.md[1:3,] ``` Thus we could for instance explore the KEGG classified as having a with a __q3f__ pattern. The __q3f__ classification means that the genes of this gene set are located in the __third quadrant__ of the _bivariate ranking index_ representation. The `plotMdGsa` function help us understanding such pattern. ```{r} Q3 <- rownames (res.md)[res.md$pat == "q3f"] Q3 plotMdGsa (rindex, block = annot[[Q3]], main = res.md[Q3, "KEGG"]) ``` Red dots in the figure represent the genes within the gene set. They show, in both dimensions of our ranking, values significantly more negatives that the remaining genes in the study. This same pattern may be appreciated in the ellipses drawn in the figure. The blue one represents a _confidence region_ for all the genes in the study. The red one shows the same _confidence region_ but just for those genes within the gene set. We can see how the distribution of the genes in KEGG _`r Q3`_ is displaced towards the third quadrant of the plot. This indicates us that "`r res.md[Q3, "KEGG"]`" is a _pathway_ jointly down regulated in both _ALL_ and _BCR_ when compared to the controls in the _NEG_ group. Similarly we can explore a _bimodal_ (__b13__) KEGG. This pattern classification indicates us that the functional block has two __sub-modules__ of genes with opposite patterns of expression. One subset of the KEGG is up-regulated in both conditions while the other subset is down-regulated also in both dimensions analyzed. It may be worth pointing here that, this _bimodal_ pattern will be missed by standard _univariate_ gene set methods, and that it may be just detected in the _multidimensional_ analysis. ```{r} BI <- rownames (res.md)[res.md$pat == "b13"] plotMdGsa (rindex, block = annot[[BI]], main = res.md[BI, "KEGG"]) ``` As third example we could display some KEGG enriched in one dimension but not in the other. The __yh__ pattern indicates gene set over-representation just in the _Y_ axis but not in the horizontal axis. In our case, KEGG pathways with at __yh__ pattern are those over expressed in _BRC_ compared to _NEG_ but not enriched in the comparison between _ALL_ and _NEG_. Similarly an __yl__ pattern indicates a down-regulation of the block in the _BRC vs. NEG_ comparison but not in the _ALL vs. NEG_ one. The plot below displays one of such __yl__ classified KEGGs. ```{r} rownames (res.md)[res.md$pat == "yl"] YL <- "00280" plotMdGsa (rindex, block = annot[[YL]], main = res.md[YL, "KEGG"]) ``` All possible multidimensional enrichment __patterns__ are listed in the [Appendix](#appendix1). Appendix ================================================================================ 1.- Multidimensional Functional Classification {#appendix1} -------------------------------------------------------------------------------- All possible functional block classifications in the bi-dimensional gene set analysis are: - __q1i__: block displaced toward quadrant __1__ (0 < X & 0 < Y) with interaction. - __q2i__: block displaced toward quadrant __2__ (0 > X & 0 < Y) with interaction. - __q3i__: block displaced toward quadrant __3__ (0 > X & 0 > Y) with interaction. - __q4i__: block displaced toward quadrant __4__ (0 < X & 0 > Y) with interaction. - __q1f__: block displaced toward quadrant __1__, no interaction. - __q2f__: block displaced toward quadrant __2__, no interaction. - __q3f__: block displaced toward quadrant __3__, no interaction. - __q4f__: block displaced toward quadrant __4__, no interaction. - __xh__: block shifted to __positive X__ values. - __xl__: block shifted to __negative X__ values. - __yh__: block shifted to __positive Y__ values. - __yl__: block shifted to __negative Y__ values. - __b13__: bimodal block. Half of the genes displaced towards quadrant __1__ and the other half towards quadrant __3__. - __b24__: bimodal block. Half of the genes displaced towards quadrant __2__ and the other half towards quadrant __4__. - __NS__: __non significant__ block. A detailed description of each of the patterns can be found in [Montaner and Dopazo (2010)][montaner2010]. The function `mdPat` in the `mdgsa` package is devised to help the user classifying bi-dimensional GSA results in such patterns. Session Info ================================================================================ ```{r} sessionInfo() ```