%\VignetteIndexEntry{EnrichmentBrowser Manual}
%\VignetteDepends{EnrichmentBrowser}
%\VignettePackage{EnrichmentBrowser}
%\VignetteEngine{utils::Sweave}

\documentclass[a4paper]{article}

<<style, eval=TRUE, echo=FALSE, results=tex>>=
BiocStyle::latex()
@

\usepackage{graphicx, subfig}
\usepackage[utf8]{inputenc} 

\title{EnrichmentBrowser: Seamless navigation through combined \\
        results of set-based and network-based enrichment analysis}
\author{Ludwig Geistlinger} 

\begin{document}

\maketitle
\tableofcontents

\section{Introduction}
The \Rpackage{EnrichmentBrowser} package implements essential functionality for
the enrichment analysis of gene expression data.
The analysis combines the advantages of set-based and network-based enrichment 
analysis in order to derive high-confidence gene sets and biological pathways 
that are differentially regulated in the expression data under investigation.
Besides, the package facilitates the visualization and exploration of such sets 
and pathways.\\  
To demonstrate the functionality of the package, we consider microarray 
expression data of patients suffering from acute lymphoblastic leukemia 
\cite{Chiaretti04}. A frequent chromosomal defect found among these patients 
is a translocation, in which parts of chromosome 9 and 22 swap places. 
This results in the oncogenic fusion gene BCR/ABL created by positioning the 
ABL1 gene on chromosome 9 to a part of the BCR gene on chromosome 22.\\
We load the \Biocpkg{ALL} dataset
<<>>=
library(ALL)
data(ALL)
@
and select B-cell ALL patients with and without the BCR/ABL fusion as it has 
been described previously \cite{Scholtens05}.
<<>>=
ind.bs <- grep("^B", ALL$BT)
ind.mut <- which(ALL$mol.biol %in% c("BCR/ABL", "NEG"))
sset <- intersect(ind.bs, ind.mut)
eset <- ALL[, sset]
@
Typically, the expression data is not already available as an 
\Rclass{ExpressionSet} in \R{} but rather has to be read in from file. 
This can be done using the function \Rfunction{read.eset}, which reads 
the expression data (exprs) along with the phenotype data (pdat) and feature 
data (fdat) into an \Rclass{ExpressionSet}. 
<<>>=
library(EnrichmentBrowser)
data.dir <- system.file("extdata", package="EnrichmentBrowser")
exprs.file <- file.path(data.dir, "ALL_exprs.tab")
pdat.file <- file.path(data.dir, "ALL_pData.tab")
fdat.file <- file.path(data.dir, "ALL_fData.tab")
eset2 <- read.eset(exprs.file, pdat.file, fdat.file)
@
%Thus, we write the restricted ALL data in a simple format that requires minimal 
%annotation to file. In accordance to the structure of the \Rclass{ExpressionSet}
%class the input should contain
%\begin{enumerate}
%\item{The expression matrix (features as rows, samples as columns)
%<<>>=
%out.dir <- system.file("extdata", package="EnrichmentBrowser")
%exprs.file <- file.path(out.dir, "ALL_exprs.tab")
%write.table(exprs(b.eset), file=exprs.file, 
%    sep="\t", col.names=FALSE, row.names=FALSE, quote=FALSE)
%@
%}nnotationDbi
%\item{The phenotype data (group assignment for the samples)
%<<>>=
%pdat.file <- file.path(out.dir, "ALL_pData.tab")
%grp <- ifelse(b.eset$mol.biol == "BCR/ABL", 1, 0)
%pdat <- cbind(sampleNames(b.eset), grp)
%write.table(pdat, file=pdat.file, sep="\t",
%        col.names=FALSE, row.names=FALSE, quote=FALSE)
%@
%}
%\item{The feature data (maps features to genes)
%<<>>=
%library(hgu95av2.db)
%xx <- as.list(hgu95av2ENTREZID)
%entrez.ids <- xx[featureNames(ALL)]
%fdat <- cbind(featureNames(ALL), unlist(entrez.ids))
%fdat.file <- file.path(out.dir, "ALL_fData.tab")
%write.table(fdat, file=fdat.file, sep="\t",
%        col.names=FALSE, row.names=FALSE, quote=FALSE)
%@
%}
%\end{enumerate}  
%
We can now access the expression values, which are intensity measurements 
on a log-scale for 12,625 probes (rows) across 79 patients (columns).

<<>>=
exprs(eset)[1:4,1:4]
dim(exprs(eset))
@

The phenotype data should contain for each patient a binary group assignment
indicating here whether the BCR-ABL gene fusion is present (1) or not (0).
<<>>=
grp <- ifelse(eset$mol.biol == "BCR/ABL", "1", "0")
pData(eset)$GROUP <- grp
table(pData(eset)$GROUP)
@

The \Rfunction{de.ana} function carries out a differential 
expression analysis between the two groups using the function 
\Rfunction{get.fold.change.and.t.test} from the \Biocpkg{simpleaffy} package.\\
Resulting fold changes and $t$-test derived $p$-values for each probe are 
appended to the \Robject{fData} slot.

<<>>=
eset <- de.ana(eset)
head(fData(eset), n=4)
@
 
Raw $p$-values (\Robject{RAW.PVAL}) are already corrected for multiple testing 
(\Robject{ADJ.PVAL}) using the method from Benjamini and Hochberg implemented 
in the function \Rfunction{p.adjust} from the \CRANpkg{stats} package.\\
To get a first overview, we inspect the $p$-value distribution and the volcano 
plot (fold change against $p$-value).
 
\begin{center}
<<fig=TRUE, width=12, height=6>>=
par(mfrow=c(1,2))
pdistr(eset)
volcano(eset)
@
\end{center}

The expression change of highest statistical significance is observed for the 
probe \verb=1636_g_at=. 
<<>>=
fData(eset)[ which.min(fData(eset)$ADJ.PVAL), ]
@
This turns out to be ABL1 oncogene itself 
(\href{http://www.genome.jp/dbget-bin/www_bget?hsa:25}{hsa:25@KEGG}).
As we often have more than one probe per gene, we compute gene expression values
as the average of the corresponding probe values.
<<>>=
gene.eset <- probe.2.gene.eset(eset)
head(fData(gene.eset))
@
(Note, that the mapping from probe to gene is done automatically as long as 
as you have the corresponding annotation package, here the 
\Biocpkg{hgu95av2.db} package, installed. Otherwise, the mapping can be 
defined in the \Rfunction{fData} slot.)
<<>>=
head(fData(eset2))
@  
Now, we subject the ALL gene expression data to the enrichment analysis. 

\newpage
\section{Set-based enrichment analysis}
In the following, we introduce how the \Rpackage{EnrichmentBrowser} package 
can be used to perform state-of-the-art enrichment analysis of gene sets. 
We consider the ALL gene expression set as it has been processed in the previous
 section. We are now interested whether there are not only single genes that are
 differentially expressed, but also sets of genes known to work together, 
e.g.~as defined by their membership in KEGG pathways.\\
Hence, we use the function \Rfunction{get.kegg.genesets}, which is based on 
functionality from the \Biocpkg{KEGGREST} package, to download all human KEGG 
pathways as gene sets.

<<>>=
# hsa.gs <- get.kegg.genesets("hsa")
gmt.file <- file.path(data.dir, "hsa_kegg_gs.gmt")
hsa.gs <- parse.genesets.from.GMT(gmt.file)
length(hsa.gs)
hsa.gs[1:2]
@

Currently, the following set-based enrichment analysis methods are supported
<<>>=
sbea.methods()
@

\begin{itemize}
\item{ORA: Overrepresentation Analysis (simple and frequently used test based 
        on the hypergeometric distribution \cite{Goeman07} for a critical 
        review)}
\item{SAFE: Significance Analysis of Function and Expression 
        (generalization of ORA, includes other test statistics, e.g.~Wilcoxon's
        rank sum, and allows to estimate the significance of gene sets by 
        sample permutation; implemented in the \Biocpkg{safe} package)} 
\item{GSEA: Gene Set Enrichment Analysis (frequently used and widely accepted, 
        uses a Kolmogorov–Smirnov statistic to test whether the ranks of the 
        $p$-values of genes in a gene set resemble a uniform distribution 
        \cite{Subramanian05})}
\item{SAMGS: Significance Analysis of Microarrays on Gene Sets 
        (extending the SAM method for single genes to gene set analysis 
        \cite{Dinu07})}
\end{itemize}

For demonstration we perform here a basic ORA choosing a significance level 
$\alpha$ of 0.05.
<<>>=
sbea.res <- sbea(method="ora", eset=gene.eset, gs=hsa.gs, perm=0, alpha=0.05)
gs.ranking(sbea.res)
@

The result of every enrichment analysis is a ranking of gene sets by the 
corresponding $p$-value. The \Rfunction{gs.ranking} function displays only those
gene sets satisfying the chosen significance level $\alpha$.\\
While such a ranked list is the standard output of existing enrichment tools, 
the functionality of the \Rpackage{EnrichmentBrowser} package allows visualization 
and interactive exploration of resulting gene sets far beyond that point.
Using the \Rfunction{ea.browse} function creates a HTML summary from which each
gene set can be inspected in more detail. The various options are described in
Figure \ref{fig:oraRes}. 

<<eval=FALSE>>=
ea.browse(sbea.res)
@

\begin{figure}
\centering
\includegraphics[width=13.5cm, height=9cm]{ora_ebrowse.png}
\caption{ORA result view. For each significant gene set in the ranking, the user
    can select to view (1) a basic report, that lists all genes of a set along
    with fold change and $t$-test derived p-value, (2) overview plots, such as
    heatmap, $p$-value distribution, and volcano plot, (3) the pathway in KEGG 
    with differentially expressed genes highlighted in red.}\label{fig:oraRes}
\end{figure}

The goal of the \Rpackage{EnrichmentBrowser} package is to provide the most
frequently used enrichment methods. However, it is also possible to exploit
its visualization capabilities while using one's own set-based enrichment method.
This requires to implement a function that takes the characteristical arguments
\Robject{eset} (expression data), \Robject{gs} (gene sets), \Robject{alpha}
(significance level), and \Robject{perm} (number of permutations). 
In addition, it must return a numeric vector \Robject{ps} storing the resulting
$p$-value for each gene set in \Robject{gs}. The $p$-value vector must be also 
named accordingly (i.e.~\Rcode{names(ps) == names(gs)}).\\
Let us consider the following dummy enrichment method, which randomly renders 
five gene sets significant and all others insignificant.

<<>>=
dummy.sbea <- function(eset, gs, alpha, perm)
{
        sig.ps <- sample(seq(0,0.05, length=1000),5)
        insig.ps <- sample(seq(0.1,1, length=1000), length(gs)-5)
        ps <- sample(c(sig.ps, insig.ps), length(gs))
        names(ps) <- names(gs)
        return(ps)
}
@

We can plug this method into \Rfunction{sbea} as before.

<<>>=
sbea.res2 <- sbea(method="dummy.sbea", eset=gene.eset, gs=hsa.gs)
gs.ranking(sbea.res2)
@ 

\newpage
\section{Network-based enrichment analysis}
Having found sets of genes that are differentially regulated in the ALL data, 
we are now interested whether these findings can be supported by known 
regulatory interactions. 
For example, we want to know whether transcription factors and their target 
genes are expressed in accordance to the connecting regulations. 
Such information is usually given in a gene regulatory network derived from 
specific experiments, e.g.~using the \Biocpkg{GeneNetworkBuilder}, or compiled 
from the literature (\cite{Geistlinger13} for an example). 
There are well-studied processes and organisms for which comprehensive and 
well-annotated regulatory networks are available, e.g.~the \verb=RegulonDB= for
\emph{E.~coli} and \verb=Yeastract= for \emph{S.~cerevisiae}. 
However, in many cases such a network is missing. 
A first simple workaround is to compile a network from regulations in the KEGG
database.\\
We can download all KEGG pathways of a specified organism via the 
\Rfunction{download.kegg.pathways} function that exploits functionality
from the \Biocpkg{KEGGREST} package.

<<eval=FALSE>>=
pwys <- download.kegg.pathways("hsa")
@

In this case, we have already downloaded all human KEGG pathways.
We parse them making use of the \Biocpkg{KEGGgraph} package and compile the 
resulting gene regulatory network.   
<<>>=
pwys <- file.path(data.dir, "hsa_kegg_pwys.zip")
hsa.grn <- compile.grn.from.kegg(pwys)
head(hsa.grn)
@

Now we are able to perform enrichment analysis based on the compiled network.
Currently the following network-based enrichment analysis methods are supported
<<>>=
nbea.methods()
@
 
\begin{itemize}
\item{GGEA: Gene Graph Enrichment Analysis 
    (evaluates consistency of known regulatory interactions with the observed
    expression data \cite{Geistlinger11})}
\item{NEA: Network Enrichment Analysis 
    (implemented in the \Biocpkg{neaGUI} package)}
\item{SPIA: Signaling Pathway Impact Analysis 
    (implemented in the \Biocpkg{SPIA} package)}
\end{itemize}

For demonstration we perform here GGEA using the gene regulatory network 
compiled above
(Note: to produce meaningful $p$-values of suitable granularity, 
1000 permutations is the suggested default. Here, we perform only 100 
permutations for demonstration purpose).

<<>>=
nbea.res <- nbea(method="ggea", eset=gene.eset, gs=hsa.gs, grn=hsa.grn, perm=100)
gs.ranking(nbea.res)
@

The resulting ranking lists for each statistically significant gene set the 
number of relations (NR.RELS) of the given gene regulatory network that involve
a gene set member, the sum of consistencies over all relations (RAW.SCORE), the 
score normalized by induced network size (NORM.SCORE = RAW.SCORE / NR.RELS), 
and the statistical significance of each gene set based on a permutation 
approach.\\
A GGEA graph for a gene set of interest displays the consistency of each 
interaction in the network that involves a gene set member. Nodes (genes) are 
colored according to expression (up-/down-regulated) and edges (interactions) 
are colored according to consistency, i.e.~how well the interaction type 
(activation/inhibition) is reflected in the correlation of the observed 
expression of both interaction partners.

\begin{center}
<<fig=TRUE, width=12, height=6>>=
par(mfrow=c(1,2))
ggea.graph(
    gs=hsa.gs[["hsa05217_Basal_cell_carcinoma"]], 
    grn=hsa.grn, eset=gene.eset)
ggea.graph.legend()
@
\end{center}

As described in the previous section it is also possible to plug in one's own
network-based enrichment method.

\newpage
\section{Combining results}
Different enrichment analysis methods usually result in different gene set 
rankings for the same dataset.
To compare results and detect gene sets that are supported by different methods,
the \Rpackage{EnrichmentBrowser} package allows to combine results from the 
different set-based and network-based enrichment analysis methods.
The combination of results yields a new ranking of the gene sets under 
investigation either by the average rank across methods or a combined $p$-value
using Fisher's method or Stouffer's method \cite{Kim13}.\\
We consider the ORA result and the GGEA result from the previous sections and 
use the function \Rfunction{comb.ea.results}.

<<>>=
res.list <- list(sbea.res, nbea.res)
comb.res <- comb.ea.results(res.list, pcomb.meth="fisher")
@
 
The combined result can be detailedly inspected as before and interactively 
ranked as depicted in Figure \ref{fig:combRes}. 

<<eval=FALSE>>=
ea.browse(comb.res, graph.view=hsa.grn)
@

\begin{figure}[!h]
\centering
\includegraphics[width=17cm, height=11.3cm]{comb_ebrowse.png}
\caption{Combined result view. By clicking on one of the blue columns 
    (ORA.RANK, ..., COMB.PVAL) the result can be interactively ranked according
    to the selected criterion.}\label{fig:combRes}
\end{figure}

\newpage
\section{Putting it all together}
There are cases where it is necessary to perform some steps of the demonstrated
enrichment analysis pipeline individually.
However, often it is more convenient to run the complete standardized pipeline.
This can be done using the all-in-one wrapper function \Rfunction{ebrowser}.
Thus, in order to produce the result page displayed in Figure \ref{fig:combRes}
from scratch, without going through the individual steps listed above, the 
following call would do the job. 

<<eval=FALSE>>=
ebrowser(   meth=c("ora", "ggea"), 
        exprs=exprs.file, pdat=pdat.file, fdat=fdat.file, 
        gs=hsa.gs, grn=hsa.grn, comb=TRUE)
@

%\bibliographystyle{unsrturl}
\bibliography{refs}
\end{document}