---
title: "mdgsa Library"
author: "[David Montaner](http://www.dmontaner.com)"
date: "(2014-11-24)"
output:
  BiocStyle::pdf_document:
    toc: yes
    fig_width: 4
    fig_height: 4.5
  BiocStyle::html_document:
    toc: yes
    fig_width: 5
    fig_height: 5
vignette: >
  %\VignetteIndexEntry{mdgsa_vignette}
  %\VignetteEngine{knitr::rmarkdown}
  %\usepackage[utf8]{inputenc}
---


<!-- OPTIONS ------------------------------------------------------------------>

```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
```

```{r, include = FALSE}
library (knitr) ## this is needed when R CMD build
opts_chunk$set (message = FALSE)
```

<!-- LINKS -------------------------------------------------------------------->

[montaner2010]:http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0010348 "mdgsa Paper"

[paperALL]: http://bloodjournal.hematologylibrary.org/cgi/pmidlookup?view=long&pmid=14684422 "Reference of the ALL dataset"

[bioconductor]:http://bioconductor.org/ "Bioconductor home page"
[all]:http://bioconductor.org/packages/release/data/experiment/html/ALL.html "ALL Bioconductor data library"
[limma]:http://bioconductor.org/packages/release/bioc/html/limma.html "Bioconductor limma package"
[hgu95av2.db]:http://bioconductor.org/packages/release/data/annotation/html/hgu95av2.db.html "Bioconductor annotation package"

[kegg]:http://www.genome.jp/kegg/ "KEGG web site"
[go]:http://geneontology.org/ "Gene Ontology Consortium"


Introduction
================================================================================

The `mdgsa` library implements the _gene set analysis_ methodology developed in
[Montaner and Dopazo (2010)][montaner2010].
It presents a flexible framework for analyzing the enrichment of _gene sets_
along a given _ranking_ of genes.
The novelty is that,
not just one _ranking index_ but two,
may be analyzed and jointly explored 
in a __multidimensional gene set analysis__.

As classical _GSEA_,
our approach allows for the functional profiling of
isolated genomic characteristics; 
differential gene expression,
copy number analyses,
or variant to disease associations
may be interpreted in terms of _gene sets_ using the `mdgsa` package.
But more interestingly,
our multivariate approach may be used to find out _gene set_ enrichments
due to the combined effect of two of such genomic dimensions.
We could for instance detect _gene sets_
affected by the interaction of
gene expression changes and copy number alterations. 


Citation
----------------------------------------

Further description of the `mdgsa` methods may be found at:

> _Multidimensional gene set analysis of genomic data._  
> David Montaner and Joaquin Dopazo.  
> [PLoS One. 2010 Apr 27;5(4):e10348. doi: 10.1371/journal.pone.0010348.][montaner2010]


Functional Profiling of Gene Expression Data
================================================================================

In this tutorial we use the data in the 
[Acute Lymphocytic Leukemia expression dataset][all] package of [Bioconductor]. 
First we will use the [limma] library to compute a differential gene expression
analysis.
Then we will use the functions `uvGsa` and `mdGsa` in the `mdgsa` package
to perform uni-dimensional and bi-dimensional gene set analyses respectively.
The functional interpretation will be done in terms of [KEGG] Pathways.
The annotation will be taken from the [hgu95av2.db] library in [Bioconductor].


First we load the data and describe the design matrix of the experiment

```{r}
library (ALL)
data (ALL)

des.mat <- model.matrix (~ 0 + mol.biol, data = ALL)
colnames (des.mat) <- c("ALL", "BCR", "E2A", "NEG", "NUP", "p15")
head (des.mat)
```

Then we can use `limma` to carry out
some gene expression comparisons.
We can for instance compare 
_ALL_ samples to _NEG_ control samples or 
explore gene differential expression between 
_BCR_ and _NEG_. 


```{r}
library (limma)
cont.mat <- makeContrasts (ALL-NEG, BCR-NEG, levels = des.mat)
cont.mat

fit <- lmFit (ALL, design = des.mat)
fit <- contrasts.fit (fit, cont.mat)
fit <- eBayes (fit)
```

From this analysis we get _test statistics_ and _p-values_ for each of the
two contrasts

```{r}
fit$t[1:3,]
fit$p.value[1:3,]
```

These gene level information may be now interpreted in terms of _gene sets_.
For this example we will carry out a _gene set analysis_ using the
functional blocks described in [KEGG],
but any other functional data base such as the [Gene Ontology][go] or
even a customized one may be analyzed using `mdgsa`.
We can get the [KEGG] annotation from the [hgu95av2.db] library as follows.

```{r}
library (hgu95av2.db)
anmat <- toTable (hgu95av2PATH)
anmat[1:3,]
```


Univariate Gene Set Analysis
--------------------------------------------------------------------------------

We can now carry out the functional interpretation
of the contrast "BCR - NEG" for instance.

The data needed for the gene set analysis
are the p-values and test statistics returned by `limma` at gene level.

```{r}
fit$t[1:3, "BCR - NEG"]
fit$p.value[1:3, "BCR - NEG"]
```

We load the `mdgsa` library.

```{r}
library (mdgsa)
```

The first step in the procedure is to combine p-values and test statistics
in a single ranking value. 

```{r}
rindex <- pval2index (pval = fit$p.value[,"BCR - NEG"], sign = fit$t[,"BCR - NEG"])
rindex <- indexTransform (rindex)
rindex[1:3]
```

This _ranking index_ keeps the sign of the test statistic;
that is, the information of whether the gene is over or underexpressed.

```{r}
plot (fit$t[,"BCR - NEG"], rindex)
```

but the evidence of the differential expression is taken directly
from the p-value

```{r}
plot (fit$p.value[,"BCR - NEG"], rindex)
```

Next we will need to format the annotation.
The function `annotMat2list` in the `mdgsa` library
converts the annotation matrix into a list.

```{r}
anmat[1:3,]
annot <- annotMat2list (anmat)
length (annot)
```

Each element of the list contains the gene names of a gene set.

```{r}
lapply (annot[1:3], head, n= 3)
```

It is also important to make sure that the
_gene universe_ described by the _ranking index_
and the _annotation_ are concordant.
The function `annotFilter`
encompasses the annotation list to the names in the ranking index;
it is also used to exclude functional blocks too small to be considered
gene sets,
or too big to be specific of any biological process of interest.

```{r}
annot <- annotFilter (annot, rindex)
```

Now everything is ready to carry out the _univariate_ gene set analysis.

```{r uvGsa}
res.uv <- uvGsa (rindex, annot)
```

The output of the `uvGsa` function is a data frame
which rows correspond to functional blocks analyzed.


```{r}
res.uv[1:3,]
```

The `uvGsa` function fits a _logistic regression model_
relating, the probability of genes belonging to the gene set,
with the value of the ranking statistic.

Significant and __positive__ _log odds ratio_ (`lor`) indicate that
the gene set is enriched in those genes with __high values__ of the
ranking statistic.
In our example this means that genes up regulated in _BCR_ compared to
_NEG_ are more likely to belong to the functional block.
We could also say that the block of genes shows a significant degree
of over expression _BCR_ compared to _NEG_.

On the other hand, when a gene set has a __negative__  _log odds ratio_
we can say that the genes in the set are more likely to be associated to
__low values__,
negative in our case, of the ranking statistics.
In our case this means that the gene set is down regulated in _BCR_
compared to _NEG_.

The function `uvPat` helps you classifying the analyzed gene sets

```{r}
res.uv[,"pat"] <- uvPat (res.uv, cutoff = 0.05)
table (res.uv[,"pat"])
```

Positive values (1)  correspond to significant and positive _log odds ratios_.
Negative values (-1) correspond to significant and negative _log odds ratios_.
Zeros correspond to non enriched blocks.

As in this example we are analyzing [KEGG] pathways
we can use the function `getKEGGnames`
to find out the "name" of the
pathways^[Similar function `getGOnames` is available to be used with [Gene Ontology][go] terms.].

```{r, results = "hide"}
res.uv[,"KEGG"] <- getKEGGnames (res.uv)
```

Finally, the function `uvSignif` may help us displaying
just the enriched blocks.

```{r}
res <- uvSignif (res.uv)
res[,c("pat", "KEGG")]
```


Multivariate Gene Set Analysis
--------------------------------------------------------------------------------

But with the `mdgsa` library we can analyze not just one
but two ranking statistics at a time. 

In our example we were interested
not in just one differential expression contrast but in two: 
ALL vs. NEG and BCR vs. NEG.
We used `limma` to fit this two contrasts (see previous sections)
and computed gene statistics and p-values for for each of them.

```{r}
fit$t[1:3,]
fit$p.value[1:3,]
```

We can combine this two matrices in a single one
containing a _ranking statistic_ for each contrast,
just as we did in the _univariate_ example.

```{r}
rindex <- pval2index (pval = fit$p.value, sign = fit$t)
rindex <- indexTransform (rindex)
rindex[1:3,]
```

Now we can explore the bi-dimensional distribution of this ranking indexes

```{r}
plot (rindex, pch = ".")
```

and search for gene sets enrichment patterns in both _dimensions_

The same annotation list we __filtered__ in the previous section may be used in this second example.
Thus everything is ready to carry out our _multidimensional_ analysis.

```{r mdGsa}
res.md <- mdGsa (rindex, annot)
```

As in the _univariate_ analysis,
the output of the `mdGsa` function is a data frame with a row per analyzed gene set. 

```{r}
res.md[1:3,]
```

The column of the row contain _log odds ratios_ and p-values for each of the analyzed _dimensions_ and also for their _interaction_ effect.
The function `mdPat` helps clarifying the bi-dimensional pattern of enrichment. 

```{r}
res.md[,"pat"] <- mdPat (res.md)
table (res.md[,"pat"])
```

And as before we can incorporate the KEGG names to our results.

```{r, results = "hide"}
res.md[,"KEGG"] <- getKEGGnames (res.md)
res.md[1:3,]
```

Thus we could for instance explore the KEGG classified as having a with a
__q3f__ pattern.
The __q3f__ classification means that the genes of this gene set
are located in the
__third quadrant__ of the _bivariate ranking index_ representation.

The `plotMdGsa` function help us understanding such pattern.

```{r}
Q3 <- rownames (res.md)[res.md$pat == "q3f"]
Q3
plotMdGsa (rindex, block = annot[[Q3]], main = res.md[Q3, "KEGG"])
```

Red dots in the figure represent the genes within the gene set.
They show, in both dimensions of our ranking, 
values significantly more negatives that the remaining genes in the study.
This same pattern may be appreciated in the ellipses drawn in the figure.
The blue one represents a _confidence region_ for all the genes in the study.
The red one shows the same _confidence region_ but just for those genes within
the gene set.
We can see how the distribution of the genes in KEGG _`r Q3`_ is displaced
towards the third quadrant of the plot.
This indicates us that "`r res.md[Q3, "KEGG"]`" is a _pathway_
jointly down regulated in both
_ALL_ and _BCR_ when compared to the controls in the _NEG_ group.


Similarly we can explore a _bimodal_ (__b13__) KEGG.
This pattern classification indicates us that
the functional block has two __sub-modules__ of genes
with opposite patterns of expression. 
One subset of the KEGG is up-regulated in both conditions while
the other subset is down-regulated also in both dimensions analyzed.

It may be worth pointing here that,
this _bimodal_ pattern will be missed by standard _univariate_ gene set methods,
and that it may be just detected in the _multidimensional_ analysis. 

```{r}
BI <- rownames (res.md)[res.md$pat == "b13"]
plotMdGsa (rindex, block = annot[[BI]], main = res.md[BI, "KEGG"])
```


As third example we could display some KEGG enriched in one dimension
but not in the other.
The __yh__ pattern indicates gene set over-representation
just in the _Y_ axis but not in the horizontal axis.
In our case, KEGG pathways with at __yh__ pattern are those
over expressed in _BRC_ compared to _NEG_ but not
enriched in the comparison between _ALL_ and _NEG_.

Similarly an __yl__ pattern indicates a down-regulation of the block 
in the _BRC vs. NEG_ comparison but not in the _ALL vs. NEG_ one.

The plot below displays one of such __yl__ classified KEGGs.


```{r}
rownames (res.md)[res.md$pat == "yl"]
YL <- "00280"
plotMdGsa (rindex, block = annot[[YL]], main = res.md[YL, "KEGG"])
```


All possible multidimensional enrichment __patterns__ are listed
in the [Appendix](#appendix1).


<!-- SOME MORE EXAMPLES --------------------------------------------------------

```{r}
Q2 <- rownames (res.md)[res.md$pat == "q2f"]
Q2
plotMdGsa (rindex, block = annot[[Q2]], main = res.md[Q2, "KEGG"])
```

```{r}
rownames (res.md)[res.md$pat == "yh"]
YH <- "05332"
plotMdGsa (rindex, block = annot[[YH]], main = res.md[YH, "KEGG"])
```
-->


<!--

Gene Set analysis of miRNA expression Data
================================================================================

To Do

-->


Appendix
================================================================================

1.- Multidimensional Functional Classification {#appendix1}
--------------------------------------------------------------------------------

All possible functional block classifications in the bi-dimensional gene set analysis are:

- __q1i__: block displaced toward quadrant __1__ (0 < X & 0 < Y) with interaction.
- __q2i__: block displaced toward quadrant __2__ (0 > X & 0 < Y) with interaction.
- __q3i__: block displaced toward quadrant __3__ (0 > X & 0 > Y) with interaction.
- __q4i__: block displaced toward quadrant __4__ (0 < X & 0 > Y) with interaction.

- __q1f__: block displaced toward quadrant __1__, no interaction. 
- __q2f__: block displaced toward quadrant __2__, no interaction. 
- __q3f__: block displaced toward quadrant __3__, no interaction. 
- __q4f__: block displaced toward quadrant __4__, no interaction. 

- __xh__: block shifted to __positive X__ values.
- __xl__: block shifted to __negative X__ values. 
- __yh__: block shifted to __positive Y__ values.
- __yl__: block shifted to __negative Y__ values.

- __b13__: bimodal block. Half of the genes displaced towards quadrant __1__ and the other half towards quadrant __3__.
- __b24__: bimodal block. Half of the genes displaced towards quadrant __2__ and the other half towards quadrant __4__.

- __NS__: __non significant__ block.

A detailed  description of each of the patterns can be found in
[Montaner and Dopazo (2010)][montaner2010].

The function `mdPat` in the `mdgsa` package is devised to help the user
classifying bi-dimensional GSA results in such patterns.


Session Info
================================================================================

```{r}
sessionInfo()
```

<!-- 
```{r, echo = FALSE, results = "asis"}
toLatex (sessionInfo())
```
-->