\documentclass[a4paper]{article}
\usepackage[utf8]{inputenc}

<<style-Sweave, eval=TRUE, echo=FALSE, results=tex>>=
BiocStyle::latex()
@

%\VignetteIndexEntry{UsersGuide}
%\VignetteKeyword{GOexpress}

\bioctitle[GOexpress: Visualise gene expression data using gene ontology
annotations]{GOexpress: identify and visualise robust gene ontology
signatures through supervised classification of gene expression data}

\author{Kévin Rue-Albrecht, Paul A. McGettigan, Belinda Hernández,\\
David A. Magee, Nicolas C. Nalpas, Andrew C. Parnell,\\
Stephen V. Gordon, and David E. MacHugh}


\begin{document}
\SweaveOpts{concordance=TRUE}
\setkeys{Gin}{width=1\textwidth}

<<echo=FALSE>>=
options(useFancyQuotes="UTF-8")
@

\maketitle

\tableofcontents

\section[Introduction]{Introduction}

\subsection[The origin and purpose of \Biocpkg{GOexpress}]{The origin and
purpose of \Biocpkg{GOexpress}}

The idea leading to the \Biocpkg{GOexpress} \R{} package emerged from a set of
plotting functions I regularly copy-pasted across various complex
multifactorial transcriptomics studies from both microarray and RNA-seq
platforms. Those functions were repeatedly used to visualise the expression
profile of genes across groups of samples, to annotate technical gene
identifiers from both microarray and RNA-seq platforms (\emph{i.e.},
probesets,
Ensembl gene identifiers) with their associated gene name, and to evaluate the
classification of samples based on genes participating in a common cellular
function or location (i.e. gene ontology). While developing the
\Biocpkg{GOexpress} package and discussing its features with colleagues and
potential users, a few more features were added, to enhance and complement the
initial functions, leading to the present version of the package.

Complex multifactorial experiments have become the norm in many research
fields, thanks to the decrease in cost of high-throughput transcriptomics
platforms and the barcoding/multiplexing of samples on the RNA-seq platform.
While much effort has been (correctly!) spent on the development of adequate
statistical frameworks for the processing of raw expression data, much of the
genewise exploration and visualisation is left to the end-user. However,
data summarisation
and visualisation can be a daunting task in multifactorial experiments, or
require large amounts of copy-pasting to investigate the expression profile of
a handful or genes and cellular pathways.

Developed and tested on multiple RNA-seq and microarray datasets,
\Biocpkg{GOexpress} offers
an extendable set of data-driven plotting functions readily applicable to the
output of widely used analytic packages estimating (differential) gene
expression. Once the initial analysis and filtering of \Biocpkg{GOexpress}
results is complete --- literaly two command lines ---, each gene and gene
ontology is accessible by a single line of code to produce high-quality
graphics and summary tables. In short, \Biocpkg{GOexpress} is
a software package
developed based on real experimental datasets to ease the visualisation and
interpretation of multifactorial transcriptomics data by bioinformaticians and
biologists, while striving to keep it a simple, fast, and intuitive toolkit.

Notably, the use of the \Biocpkg{biomaRt} package enables
\Biocpkg{GOexpress}
to support and annotate gene expression identifiers from any species and any
microarray platform present in the Ensembl BioMart server
(\url{http://www.ensembl.org/biomart/martview}), while custom
annotations may also be provided for the analysis of species or platforms
not supported yet, the classification of non-transcriptomics datasets
(\emph{e.g.},
proteomics), or the comparison of panels of biomarkers independent from
gene ontology annotations.


\subsection[Purpose of this document]{Purpose of this document}

This User's Guide was intended as a helpful description of the main features
implemented in the \Biocpkg{GOexpress} package, as well as a tutorial taking
the user through a typical analysis pipeline that \Biocpkg{GOexpress} was
designed for. While an example usage will be provided for each function of the
package, the many arguments of each function cannot realistically be
demonstrated in this Guide, and we kindly ask users to also read the
individual help pages accompanying the corresponding package functions for
further details.


\section[Before you start]{Before you start}

\subsection[Installation]{Installation}

Installing \Biocpkg{GOexpress} should be as easy as running the two lines
below:

<<eval=FALSE>>=
source("http://bioconductor.org/biocLite.R")
biocLite("GOexpress")
@

Installation issues should be reported to the \Bioconductor{} mailing list.

\subsection[Getting help]{Getting help}

The \Biocpkg{GOexpress} package is still relatively young and
may require some fine-tuning or bug fixes. Please contact the
maintainer with a copy of the error message and the command run.

<<echo=TRUE>>=
maintainer("GOexpress")
@

Despite our efforts to repeatedly test the package on
in-house datasets of both microarray and RNA-seq platform, and of human and
bovine origin, many of the model species and gene expression platforms have
not been tested yet. We welcome feedback!

Interesting suggestions for additional package functions, or improvement of
existing ones are most welcome and may be implemented when time allows.
Alternatively, we also encourage users to fork the GitHub repository of the
project, develop and test their own feature(s), and finally generate a pull
request to integrate it to the original repository
(\url{https://github.com/kevinrue/GOexpress}).

As for all \Bioconductor{} packages, the \Bioconductor{} support site is the
best place to seek advice with a large and active community of \Bioconductor{}
users.
More detailed information is available at:

\url{http://www.bioconductor.org/help/support}.


\subsection[Citing \Biocpkg{GOexpress}]{Citing \Biocpkg{GOexpress}}

The work underlying \Biocpkg{GOexpress} has not been formally published yet. A
manuscript has been submitted for peer-review. In the meantime, users of the
\Biocpkg{GOexpress} package are encouraged to cite it using the output of the
\Rfunction{citation} function in the \Rpackage{utils} package,
as follows:

<<echo=FALSE>>=
citation(package="GOexpress")
@


\section[Quick start]{Quick start}

\subsection[Input data]{Input data}

Despite their different underlying technologies, microarray and RNA-seq
analytic pipelines typically yield a matrix measuring the expression level of
many gene features in each sample. Commonly, this expression matrix is
filtered to
retain only genes qualified as ``informative'' (e.g. > 1 cpm in at least N
replicates; N being the number of replicates for a given set of experimental
conditions); and genes lowly expressed are removed to limit the False
Discovery
Rate (FDR) of differentially expressed genes induced by the larger variability
of expression at the lower end of the dynamic range.

\Biocpkg{GOexpress} requires this filtered normalised expression matrix to
be accompanied by an \Rclass{AnnotatedDataFrame} object of the
\Biocpkg{Biobase} package providing phenotypic information for each of those
samples (e.g. unique identifier, treatment, time-point). \Biocpkg{GOexpress}
expects those two variables in an \Rclass{ExpressionSet} container of the
\Biocpkg{Biobase} package, both simplifying the manipulation of the data and,
most importantly, ensuring interoperatibility with other packages that handle
\Bioconductor{} \Rclass{ExpressionSet} objects. The other fields of the
\Rclass{ExpressionSet} container may be left empty as \Biocpkg{GOexpress}
does not currently access them. Instructions to create
\Rclass{AnnotatedDataFrame} and \Rclass{ExpressionSet}
objects are detailed in the vignettes of the \Biocpkg{Biobase} package.

To use the analytical part of the \Biocpkg{GOexpress} package, the phenotypic
data-frame --- \Robject{phenodata} slot of the \Rclass{ExpressionSet} ---
must
contain at least one column containing an experimental factor made of two or
more levels in the strict meaning of ``factor'' and ``levels'' in the \R{}
programming language. The above \Rclass{ExpressionSet} and the name of the
column containing such a factor are the minimal two input variables required
for the \Rfunction{GO\_analyse} function to work. Additional arguments may be
required, in particular for microarray datasets, but those are discussed
in section \ref{sec:MicroarrayArguments}.

In the examples below, we will use the toy dataset \Robject{AlvMac}
provided
with the package and made of a subset of 100 bovine Ensembl gene identifiers
(rows) measured in 117 samples (columns), extracted from a larger RNA-seq
experiment (see help page for the \Robject{AlvMac} object).
This toy \Rclass{ExpressionSet} also includes an \Rclass{AnnotationDataFrame}
detailing a number of phenotypic information fields describing each
sample.

Let us load the \Biocpkg{GOexpress} package and import the toy dataset in the
workspace:

<<>>=
library(GOexpress) # load the GOexpress package
data(AlvMac) # import the training dataset
@

Now, the expression matrix and phenotypic data of the \Rclass{ExpressionSet}
container can be accessed using dedicated functions from the \Biocpkg{Biobase}
package:

<<>>=
exprs(AlvMac)[1:5,1:5] # Subset of the expression data
head(pData(AlvMac)) # Subset of the phenotypic information
@

An advantage of the \Rclass{ExpressionSet} container is that it takes care of
the compatibility between the expression matrix and the phenotypic information
data-frame. For instance, it will check that samples names do not differ
between expression matrix and phenotypic information.

Users can visually inspect that adequate row names are used in the
expression matrix:

<<>>=
head(rownames(exprs(AlvMac))) # Subset of gene identifiers
@

\bioccomment{In the training dataset, the \Robject{Time} column of
\Rcode{pData(targets)} is an \R{} factor while the \Robject{Timepoint} column
is
a numeric vector. The fomer is useful for grouping the samples for the
analysis, while the latter is better suited to plot gene expression profiles
respecting the relative distance between the time-points. See
section \ref{sec:expressionPlots} for examples using of a numeric value or a
factor as the variable of the X-axis.}

\subsection[Main analysis]{Main analysis}

\subsubsection[Preparing the grouping factor to analyse]{Preparing the
grouping factor to analyse}

In this example, we search for GO terms containing genes that best classify
samples according their \Robject{Treatment} level. In other words, after
estimating the capacity of each gene to classify the different experimental
groups, the algorithm will use this gene ranking to rank GO terms based on
the average rank (alternatively, score) of their annotated genes.
But first, let us make
sure that the \Robject{Treatment} column of \Rcode{pData(targets)} is indeed
an \R{} factor:

<<>>=
is.factor(AlvMac$Treatment) # assertion test
AlvMac$Treatment # visual inspection
@

In this case, it is already a properly formatted factor. If that was not the
case, the following line of code would convert the column to an \R{} factor
and allow to continue the analysis (note that in some cases, it may be
preferrable
to order the different levels of a factor, for an example see factor
\Robject{Time}):

<<eval=FALSE>>=
AlvMac$Treatment <- factor(AlvMac$Treatment)
@

\subsubsection[Running the random forest algorithm using local annotations]{
Running the random forest algorithm using local annotations}

Now, we use the random forest statistical framework to score each gene
feature on its
ability to classify samples from different treatments separately,
before summarising this information at the ontology level. The
ensuing analysis therefore considers the \Robject{Treatment} factor,
irrespective of the
\Robject{Time} and \Robject{Animal} factors (we search for time- and
animal-independent discriminants of infection). Alternatively, a subset of
samples from the input \Rclass{ExpressionSet} may be specified to address
more specific hypotheses. In this
example, we use locally saved copies of gene and gene ontology annotations
previously downloaded from Ensembl annotation release 75 using the
\Biocpkg{biomaRt} package (See section \ref{sec:customAnnotations}
to generate suitable local annotations, and the benefits of using them).

\label{sec:GOanalyse}
<<>>=
set.seed(4543) # set random seed for reproducibility
AlvMac_results <- GO_analyse(
    eSet = AlvMac, f = "Treatment",
    GO_genes=AlvMac_GOgenes, all_GO=AlvMac_allGO, all_genes=AlvMac_allgenes)
@

At this stage, it is a good idea to save the result variable into an \R{}
data-file using the \Rfunction{save} function. Mostly because the stochastic
aspect of the sampling approach implemented by the \CRANpkg{randomForest}
package may return slightly different scores in each run (as opposed to the
use of ANOVA F-score).

The output variable of the analysis summarises the parameters of the analysis
and can easily be browsed with standard \R{} functions:

<<>>=
names(AlvMac_results) # Data slot names
head(AlvMac_results$GO[, c(1:5, 7)], n=5) # Ranked table of GO terms (subset)
head(AlvMac_results$genes[, c(1:3)], n=5) # Ranked table of genes (subset)
head(AlvMac_results$mapping) # Gene to gene ontology mapping table (subset)
@
\bioccomment{In the tables of GO terms and genes above, the column containing
the name of the GO terms and the column containing the description of the
genes are hidden, as their content is very long in some cases, affecting
the readability of this document.}


\subsubsection[Important notes in the absence of local annotations]{
Important notes in the absence of local annotations}

If no annotations mapping gene features identifiers to gene ontology
identifiers are provided, \Rfunction{GO\_analyse} will
connect to the Ensembl server to fetch appropriate annotations in a
semi-automated procedure
(See arguments \Robject{dataset} and \Robject{microarray} of the
\Rfunction{GO\_analyse} function).
Typically, the first feature identifier in the
\Robject{ExpressionSet} is used to determine the corresponding species and
type of
data. This is a fairly straightforward process for Ensembl gene identifiers
(e.g. in the prefix `ENSBTAG', `BT' indicates \textit{Bos taurus}). Hence,
the simplest use of the \Rfunction{GO\_analyse} is:

<<eval=FALSE>>=
AlvMac_results <- GO_analyse(eSet = AlvMac, f = "Treatment")
@

\warning{
Without local annotations, connecting to the Ensembl server and
downloading the annotations significantly impacts the runtime of the function.
}

\label{sec:MicroarrayArguments}
However, it can be more difficult to identify the microarray used to obtain a
certain dataset, as many different Affymetrix chips contain probesets named
with the pattern ``AFFX.*'' for instance.
In cases where the microarray platform cannot be detected
automatically, we recommend using the \Robject{microarray} argument of
the \Rfunction{GO\_analyse} function. The list of valid values for the
microarray
argument is available in the \Robject{microarray2dataset} data frame which can
be loaded in the workspace using:

<<eval=FALSE>>=
data(microarray2dataset)
@

\subsection[Permutation-based P-value for ontologies]{Permutation-based
P-value for ontologies}

To assess the significance of GO term ranking --- or scoring ---, we
implemented a permutation-based function randomising the gene feature
ranking, and counting how many times each GO term is ranked (scored) equal
or higher than the real rank (score).

Critically, the function should be applied directly to the output of the
\Rfunction{GO\_analyse} function prior to filtering, in order to use the full
list of gene features as a background for permutation:

<<eval=FALSE>>=
AlvMac_results.pVal = pValue_GO(result=AlvMac_results, N=100)
@

<<echo=FALSE>>=
data(AlvMac_results.pVal)
@

\warning{
The \Rfunction{pValue\_GO} function is relatively lengthy. However, it is
suggested to calculate P-values on the basis of at least 1,000 permutation
(approximately 50 min on a standard Ubuntu server)
to obtain reach minimal non-zero P-values as low as 0.001.}

Given that many genes are associated to various gene ontologies due to the
hierarchical relationships in the three Direct Acyclic Graphs (DAGs), a
permutation-based approach appears the best suited strategy to assess the
significance of ontology-related genes to display a specific average rank (or
score). As further discussed in the next section, this approach also
addresses the issue
of the considerable range of gene counts associated with each known gene
ontology.

\subsection[Filtering of results]{Filtering of results}
\label{sec:Filtering}

\subsubsection[Filtering of the result object]{Filtering of the result object}

The \Rfunction{subset\_scores} function allows users to filter for GO terms
passing certain criteria (e.g. maximal P-value, minimal gene counts,
type of ontology). A \Robject{filters.GO} slot will be created in the
filtered result object, stating the filters and cutoff values applied.
A filtered object may be further filtered, and the \Robject{filters.GO} slot
will be updated accordingly.

Importantly, an early-identified bias of the scoring function is that GO terms
associated with fewer genes are favored at the top of the ranking table. This
is due to the fact that it is much easier for a group of 5 genes (e.g.
``\texttt{B cell apoptotic process}'') to have an high average rank
--- and average score --- than it is
for a group of ~6,000 genes (e.g. ``\texttt{protein binding}''). Indeed, the
highest possible average rank of 5 genes is 3 while it is 3,000 for a group of
6,000 genes. The calculation of P-values using the \Rfunction{pValue\_GO}
partially controls that bias, as smaller groups of ontology-related genes may
appear by chance at a higher average rank than observed using the
ranking based on actual gene expression.

Furthermore, in our experience, this bias presents some benefits. First, it
implicitely favors specific and well-defined GO terms (e.g. ``\texttt{negative
regulation of T cell apoptotic process}'') as opposed to vague and
uninformative GO terms (e.g. ``\texttt{cytoplasm}''). Secondly, we observed
many top-ranking GO terms associated with a single gene. Those GO terms are
consequently susceptible to single-gene events and artefacts in the expression
data, as opposed to GO terms with a reasonable number of associated genes.
Using the above filtering function, it is straightforward to filter out those
GO terms with only a handful of associated genes, in combination with a
standard P-value filter:

<<>>=
BP.5 <- subset_scores(
    result = AlvMac_results.pVal,
    namespace = "biological_process",
    total = 5, # requires 5 or more associated genes
    p.val=0.05)
MF.10 <- subset_scores(
    result = AlvMac_results.pVal,
    namespace = "molecular_function",
    total = 10,
    p.val=0.05)
CC.15 <- subset_scores(
    result = AlvMac_results.pVal,
    namespace = "cellular_component",
    total = 15,
    p.val=0.05)
@

Finally, the inherent hierarchical structure and ``granularity'' of gene
ontology terms can be browsed conveniently by using increasingly large values
of the
\Robject{total} filter. Note that this filter retains only GO terms associated
with a minimal given count of genes in the gene-GO mapping table. It is also
possible to use the \Robject{data} argument to filter for GO terms associated
with a certain count of genes in the given expression dataset, although this
approach is obviously more data-dependent and less robust.

\warning{
To optimise the use of memory space, after removing all ontologies not passing
the criteria, the function will also discard from the filtered result object
all gene features not associated with the remaining gene ontologies, and the
mapping information related to gene ontologies absent from the filtered data.
However, those data will still be left in the original object containing
unfiltered raw results.}


\subsubsection[Quick filtering of the ontology scoring table]{Quick filtering
of the ontology scoring table}

As an alternative to the above cascade-filtering of gene ontologies and gene
features result tables, users can extract and filter information from either
the
gene or the gene ontology scoring tables using the \Rfunction{subset}
function.

An example of filtering the gene ontology result table for the top
ontologies of the `Biological Process' type, associated with at least 5
genes, and a P-value lower than 0.05:

<<eval=FALSE>>=
subset(
    AlvMac_results.pVal$GO,
    total_count >= 5 & p.val<0.05 & namespace_1003=='biological_process'
    )
@

\bioccomment{
Note that the above command will return a data-frame containing only the
filtered
GO terms, as opposed to the full result object returned by the
\Rfunction{subset\_scores} function.
}

\subsection[Details of the top-ranking GO terms]{Details of the top-ranking GO
terms}

Once the GO terms are ranked (and filtered), the top-ranking GO
terms in the filtered object are those containing the largest proportion of
top-ranking genes with expression
levels that best classify the predefined groups of samples, based on the levels
of the factor considered (\Rcode{raw\_results\$factor}).

In this example, we list the top filtered ``Biological Process'' GO terms
extracted above and their statistics (currently ranked by increasing
average rank of their associated genes):

<<>>=
head(BP.5$GO)
@

\subsection[Hierarchical clustering of samples based on gene expression
associated with a GO term]{Hierarchical clustering of samples based on gene
expression associated with a GO term}

In the previous section, we identified the GO terms containing the largest
proportion of top-ranking genes that best classify
samples according to their treatment. We will now generate for the
top-ranked GO term (``\texttt{toll-like receptor 4 signaling pathway}'') a
heatmap to visualise simulatenously the clustering of samples and the
expression level of each gene in each sample:

\begin{center}
<<fig=TRUE>>=
heatmap_GO(
    go_id = "GO:0034142", result = BP.5, eSet=AlvMac, cexRow=0.4,
    cexCol=1, cex.main=1, main.Lsplit=30)
@
\end{center}

In this example, we can observe a group of ``Control'' (\emph{i.e.},
untreated; green
color) samples clustering together at the bottom of the heatmap.

Re-labelling of samples by \Robject{Group} (i.e. combination of treatment
and time-point) reveals that those samples are mainly 24 and 48 hours
post-infection control samples:

\begin{center}
<<fig=FALSE>>=
heatmap_GO(
    go_id = "GO:0034142", result = BP.5, eSet=AlvMac, cexRow=0.4,
    cexCol=1, cex.main=1, main.Lsplit=30,
    labRow=AlvMac$Group)
@
\end{center}

Subsequently encouraging the generation of a heatmap restricted to samples
from those time-points (i.e. \Robject{24H} and \Robject{48H}):

\begin{center}
<<fig=TRUE>>=
heatmap_GO(
    go_id = "GO:0034142", result = BP.5, eSet=AlvMac, cexRow=0.6,
    cexCol=1, cex.main=1, main.Lsplit=30,
    labRow='Group', subset=list(Time=c('24H','48H')))
@
\end{center}

Alternatively, it is possible to focus only on the hierarchical clustering of
samples. The following code will build a dendrogram clustering samples using
the expression data of the subset of genes associated with the
``\texttt{toll-like receptor 4 signaling pathway}'' gene ontology, considering
only samples obtained 24 and 48 hours post-infection, and labelling samples by
\Robject{Group} rather than simply \Robject{Treatment}:

\begin{center}
<<fig=TRUE>>=
cluster_GO(
    go_id = "GO:0034142", result = BP.5, eSet=AlvMac,
    cex.main=1, cex=0.6, main.Lsplit=30,
    subset=list(Time=c("24H", "48H")), f="Group")
@
\end{center}

Note that labelling samples by another factor does not affect the clustering
process itself, as the underlying expression data has not changed.


\subsection[Details of genes associated with a GO term]{Details of genes
associated with a GO term}

Following the identification of relevant GO terms in the above sections, users
may want to have a closer look at the individual genes associated with a given
GO term. The default behaviour of the function is to order the gene features
by increasing rank (equivalent to decreasing score):

<<>>=
table_genes(go_id = "GO:0034142", result = BP.5)[,c(1:3)]
@
\bioccomment{In the table above, the column containing the description of the
genes is hidden, as its content was very long in some cases, affecting
the readability of this document.}

Note that the default behaviour of the above function is to return a table of
all the genes associated with the GO term based on the annotations collected.
For obvious reasons, genes present in the annotations
but absent from the expression dataset will be absent from the score table
and consequently lack data in this result table.
It is possible to restrict the above table
to only genes present in the expression dataset using the \Robject{data.only}
argument set to \Robject{TRUE}.

If only the feature identifiers associated with a given GO identifier are
needed, users may use the function below:

<<>>=
list_genes(go_id = "GO:0034142", result = BP.5)
@

\subsection[Expression profile of a gene by sample group]{Expression profile
of a gene by sample group}

\subsubsection[Using the unique feature identifier]{Using the unique feature
identifier}
\label{sec:expressionPlots}

In the above section, we listed the genes associated with a particular gene
ontology. In our example, the respective score and rank of each gene estimates
the capacity of the gene to classify the samples according to the treatment
factor. The genes that most improve the classification of samples will have
the highest scores
and the lowest ranks. Those genes will likely produce the expression profiles
with the most consistent differential expression between the treatment groups
over time. Here is one example:

\setkeys{Gin}{width=0.75\textwidth}

\begin{center}
<<fig=TRUE>>=
expression_plot(
    gene_id = "ENSBTAG00000047107", result = BP.5, eSet=AlvMac,
    x_var = "Timepoint", title.size=1.5,
    legend.title.size=10, legend.text.size=10, legend.key.size=15)
@
\end{center}

Note that \Robject{Timepoint} is another column of \Rcode{pData(targets)}.
that encodes a numeric vector, as opposed to the column named \Robject{Time},
encoding a factor. This difference
enables the plotting function to respect the relative distance between the
time-points for an output more representative of the actual time-scale.

To investigate the impact of other factors on the expression level of the
same gene, users are encouraged to use the \Robject{f} and \Robject{x\_var}
arguments to specify alternate grouping factor and X variable, respectively.
Note that the \Rfunction{geom\_smooth} of the \Biocpkg{ggplot2} package may
fail if a minimal number of replicates is not available to calculate proper
confidence intervals. In such cases, it is recommended to use the function
\Rfunction{expression\_profiles} described in section
\ref{sec:expressionProfiles}.

Here is another valid example separating samples by the factor
\Robject{Animal} on the X axis and summarising all time-points in a
confidence 95\% confidence interval on the Y-axis:

\begin{center}
<<fig=TRUE>>=
expression_plot(
    gene_id = "ENSBTAG00000047107", result = BP.5, eSet=AlvMac,
    x_var = "Animal", title.size=1.5, axis.text.angle=90,
    legend.title.size=10, legend.text.size=10, legend.key.size=15)
@
\end{center}

\subsubsection[Using the associated gene name]{Using the associated gene name}

It is also possible to visualise the expression profile of genes from their
associated gene name if any. This is a more human-friendly version of the
function presented in the previous subsection:

\begin{center}
<<eval=FALSE>>=
expression_plot_symbol(
    gene_symbol = "BIKBA", result = BP.5, eSet=AlvMac,
    x_var = "Timepoint", title.size=1.5,
    legend.title.size=10, legend.text.size=10, legend.key.size=15)
@
\end{center}

However, the benefits of this feature are balanced by the fact that genes
lacking an associated gene name cannot be visualised in this manner, and that
some gene symbols are associated with multiple Ensembl gene identifiers and
probesets (e.g. `RPL36A'). In the latter case, we turned the ambiguity into an
additional useful feature: a lattice is created, and each of the multiple
features associated with the given gene symbol are plotted simultaneously in
the lattice. Subsequently, each of the sub-figures plotted may be re-plotted
by itself using the \Robject{index} argument as indicated in the accompanying
message printed in the \R{} console:

\begin{center}
<<fig=FALSE>>=
expression_plot_symbol(
    gene_symbol = "RPL36A", result = AlvMac_results, eSet=AlvMac,
    x_var = "Timepoint", title.size=1.5,
    legend.title.size=10, legend.text.size=10, legend.key.size=15)
@
\end{center}

\subsection[Expression profile of a gene by individual sample series]{
Expression profile of a gene by individual sample series}

\label{sec:expressionProfiles}

\subsubsection[Using the unique feature identifier]{Using the unique feature
identifier}

It may be useful to track and visualise the expression profile of genes in
each individual sample series, rather than their average. This could help
identify outliers within sample groups, or visually compare paired samples,
for instance.

In the \Robject{AlvMac} dataset, samples from each of the animals were
subjected to all three treatments in parallel (i.e. paired samples). In the
figure below, a sample series is defined by a given \Robject{Animal} and a
given \Robject{Treatment}. Each sample series is then tracked over time, and
coloured according to the \Robject{Treatment} factor (default, factor stored
in \Rcode{raw\_results\$factor}):

\begin{center}
<<fig=TRUE>>=
AlvMac$Animal.Treatment <- paste(AlvMac$Animal, AlvMac$Treatment, sep="_")
expression_profiles(
    gene_id = "ENSBTAG00000047107", result = AlvMac_results,
    eSet=AlvMac, x_var = "Timepoint", line.size=1,
    seriesF="Animal.Treatment", linetypeF="Animal",
    legend.title.size=10, legend.text.size=10,
    legend.key.size=15)
@
\end{center}

In the figure above, the \Robject{linetypeF} helps to highlight samples from
an animal which start at unusually high expression values, while those samples
progressively return to expression values similar to other samples in their
respective treatment groups.

If omitted, the \Robject{linetypeF} argument will mirror the
\Robject{colourF},
which can be useful for colour-blind people. Alternatively, a single line-type
can be applied to all groups using the \Robject{lty} argument as follows:

\begin{center}
<<eval=FALSE>>=
expression_profiles(
    gene_id = "ENSBTAG00000047107", result = AlvMac_results,
    eSet=AlvMac, x_var = "Timepoint",
    lty=rep(1,10), # use line-type 1 for all 10 groups
    seriesF="Animal.Treatment", linetypeF="Animal",
    legend.title.size=10, legend.text.size=10,
    legend.key.size=15, line.size=1)
@
\end{center}

\subsubsection[Using the associated gene name]{Using the associated gene name}

Similarly to the \Rfunction{expression\_plot} function, an alternative was
implemented to use gene names instead of feature identifiers. An example:

\begin{center}
<<eval=FALSE>>=
expression_profiles_symbol(
    gene_symbol="TNIP3", result = AlvMac_results,
    x_var = "Timepoint", linetypeF="Animal", line.size=1,
    eSet=AlvMac, lty=rep(1,10), seriesF="Animal.Treatment",
    title.size=1.5, legend.title.size=10, legend.text.size=10,
    legend.key.size=15)
@
\end{center}

\subsection[Comparison of univariate effects on gene expression]{Comparison of
univariate effects on gene expression}

While the analysis is restricted to the evaluation of a single factor, it can
be helpful to compare the relative impact of all known factors present in the
the accompanying \Robject{phenoData} on the gene expression in the different
groups of samples.

In other words, given a GO term identifier this feature will generate a plot
for each associated gene, where the mean (default; can be changed) expression
level will be computed for each level of each factor and compared to one
another:

\begin{center}
<<fig=TRUE>>=
plot_design(
    go_id = "GO:0034134", result = BP.5, eSet=AlvMac,
    ask = FALSE, factors = c("Animal", "Treatment", "Time", "Group"),
    main.Lsplit=30)
@
\end{center}


\section[Additional controls and advanced functions]{Additional controls and
advanced functions}

\subsection[Custom annotations]{Custom annotations}

\label{sec:customAnnotations}

To enable all downstream filtering and visualisation features, the
\Rfunction{GO\_analyse} function uses three types of annotations:

\begin{itemize}
\item \Robject{GO\_genes}: Mapping of gene feature identifiers to gene
ontology identifiers.
\item \Robject{all\_genes}: Annotations of gene feature identifiers.
\item \Robject{all\_GO}: Annotations of gene ontology identifiers.
\end{itemize}

The use of custom annotations has several advantages:

\begin{itemize}
\item \textbf{Traceability and reproducibility:} the Ensembl annotations are
updated on a regular basis. A local copy of annotations allows use of archived
annotation releases. In the absence of local annotations,
\Rfunction{GO\_analyse()} systematically connects to the latest (i.e. current)
Ensembl release.
\item \textbf{Speed:} Providing custom annotations will skip calls to the
Ensembl server, significantly reducing the runtime of the
\Rfunction{GO\_analyse} function.
\item \textbf{Autonomy from web-services:} occasionally the Ensembl server may
be unavailable (e.g. maintenance, internet connection). A local copy of
annotations allows to work independently from any web-service.
\item \textbf{Alternative annotations:} Users working with gene feature
identifiers not supported in the Ensembl annotations (e.g. species, or
platforms), or comparing panels of biomarkers instead of GO terms for
instance, may provide their own custom annotations to
enable the ranking and visualisation of their data using \Biocpkg{GOexpress}.
\end{itemize}

\subsubsection[Generating custom annotations]{Generating custom annotations}
\label{sec:generateCustomAnnotations}

Using the Ensembl release 75, we show below the code
used to retrieve annotations for \textit{Bos taurus} Ensembl gene identifiers:

<<eval=FALSE>>=
# Load the interface to BioMart databases
library(biomaRt)
# See available resources in Ensembl release 75
listMarts(host='feb2014.archive.ensembl.org')
# Connect to the Ensembl Genes annotation release 75 for Bos taurus
ensembl75 = useMart(
    host='feb2014.archive.ensembl.org',
    biomart='ENSEMBL_MART_ENSEMBL', dataset='btaurus_gene_ensembl')
## Download all the Ensembl gene annotations (no filtering)
allgenes.Ensembl = getBM(
    attributes=c('ensembl_gene_id', 'external_gene_id', 'description'),
    mart=ensembl75)
# Rename the gene identifier column to 'gene_id'
# This allows GOexpress to treat microarray and RNA-seq data identically
colnames(allgenes.Ensembl)[1] = 'gene_id'
## Download all the gene ontology annotations (no filtering)
allGO.Ensembl = getBM(
    attributes=c('go_id', 'name_1006', 'namespace_1003'),
    mart=ensembl75)
## Download all the mapping between gene and gene ontology identifiers
GOgenes.Ensembl = getBM(
    attributes=c('ensembl_gene_id', 'go_id'),
    mart=ensembl75)
# Rename the gene identifier column to 'gene_id'
colnames(GOgenes.Ensembl)[1] = 'gene_id'
# Cleanup: remove some blank fields often found in both columns
GOgenes.Ensembl = GOgenes.Ensembl[GOgenes.Ensembl$go_id != '',]
GOgenes.Ensembl = GOgenes.Ensembl[GOgenes.Ensembl$gene_id != '',]
@

\bioccomment{
The automated retrieval procedure retrieves all gene ontology annotations
from the Ensembl server, inclusive of annotations not Inferred from Experiment
(EXP). Users may consider filtering local annotations for desired GO Evidence
code(s).
}

\subsubsection[Using custom annotations]{Using custom annotations}
\label{sec:useCustomAnnotations}

The annotations download above can then be saved in local R data files, and
subsequently used to run entirely offline analyses of \Rclass{ExpressionSet}
objects with corresponding gene feature identifiers:

<<eval=FALSE>>=
# save each custom annotation to a R data file
save(GOgenes.Ensembl, file='GOgenes.Ensembl75.rda')
save(allGO.Ensembl, file='allGO.Ensembl75.rda')
save(allgenes.Ensembl, file='allgenes.Ensembl75.rda')
# Run an analysis using those local annotations
GO_analyse(
    eSet=AlvMac, f='Treatment',
    GO_genes=GOgenes.Ensembl,
    all_GO=allGO.Ensembl,
    all_genes=allgenes.Ensembl)
@


Ideally, all three annotation objects should be provided, to enable all
downstream features. A toy example for each type of custom annotations,
ready for analysis, is provided with the package:

<<eval=FALSE>>=
data(AlvMac_GOgenes)
data(AlvMac_allGO)
data(AlvMac_allgenes)
@

\warning{
Critically, it is highly recommended to provide full genome annotations for
the
species of interest, including annotations for feature identifiers that are
absent from the given \Rclass{ExpressionSet}. As described in section
\ref{sec:StatsOverview}, all genes present in the annotations will affect the
scoring of gene ontologies, even if they are absent from the
\Rclass{ExpressionSet}.}


\subsection[Using subsets of samples]{Using subsets of samples}

\label{sec:subsetArgument}

It may be desirable to rank genes and gene ontologies according to their
capacity to classify only specific subsets of samples, while visualising the
expression data of all samples. Instead of creating two separate
\Rclass{ExpressionSet} objects (one containing all the samples to visualise,
another one containing only the samples to analyse), a \Robject{subset}
argument was added to most functions in the \Biocpkg{GOexpress} package,
allowing the use of a single \Rclass{ExpressionSet} object from which
the desired subset of samples is extracted at run time.

This argument takes a named list where names must be column names existing in
\Rcode{colnames(pData(eSet))}, and values must be vectors of values existing
in the corresponding column of \Rcode{pData(eSet)}. The original
\Rclass{ExpressionSet} will be left unchanged. An example:

\begin{center}
<<eval=FALSE>>=
AlvMac_results <- GO_analyse(
    eSet = AlvMac, f = "Treatment",
    subset=list(
        Time=c("6H","24H", "48H"),
        Treatment=c("CN","MB"))
    )

expression_plot(
    gene_id = "ENSBTAG00000047107", result = BP.5, eSet=AlvMac,
    x_var = "Timepoint", title.size=1.5,
    legend.title.size=10, legend.text.size=10, legend.key.size=15,
    subset=list(Treatment=c("TB","MB"))
    )
@
\end{center}

\subsection[Distribution of scores]{Distribution of scores}

Users might be interested in the general distribution of score and rank
statistics produced by \Biocpkg{GOexpress}. The distribution of scores may be
represented as a histogram:

\begin{center}
<<fig=TRUE>>=
hist_scores(result = BP.5, labels = TRUE)
@
\end{center}

Alternatively, quantile values can be returned for default or customised
percentiles:

<<>>=
quantiles_scores(result = BP.5)
@

\subsection[Reordering scoring tables]{Reordering scoring tables}

\subsubsection[Reordering by score]{Reordering by score}

While scores are more prone to extreme outlier values and may slightly
fluctuate between multiple runs of the random forest algorithm; ranks of genes
and subsequently average ranks of GO terms tend to be more stable
and may be more reliable estimators of the importance of genes and cellular
functions. Therefore, the default behaviour of \Biocpkg{GOexpress} is to use
the rank and average rank metrics to order genes and GO terms, respectively,
in the returned score tables.

It is however possible to re-order the tables in the output variable according
to the score metric (or revert back to the original one) as in the example
below:

<<eval=FALSE>>=
BP.5.byScore <- rerank(result = BP.5, rank.by = "score")
@

\subsubsection[Reordering by P-value]{Reordering by P-value}

Additionally, it is possible to reorder GO terms by increasing P-value,
provided those values were were computed using the \Rfunction{pValue\_GO}
function.

<<eval=FALSE>>=
BP.5.byPval <- rerank(result = BP.5, rank.by = "p.val")
@

\subsubsection[Reordering and breaking ties]{Reordering and breaking ties}

For instance, to rank GO terms by P-value, while breaking ties on their
\Robject{ave\_rank} value, one needs to first rank the object by
\Robject{rank}, and rank the resulting object by \Robject{p.val}:

<<eval=FALSE>>=
BP.5.pVal_rank <- rerank(result = BP.5, rank.by = "rank")
BP.5.pVal_rank <- rerank(result = BP.5.pVal_rank, rank.by = "p.val")
@

\subsection[Subsetting an ExpressionSet to specific sample groups]{Subsetting
an ExpressionSet to specific sample groups}

While this feature exists, users may want to consider the newer section
\ref{sec:subsetArgument} describing the definition of a subset of
samples from the given \Rclass{ExpressionSet} on-the-fly
without the need to create a new object
containing the subsetted \Rclass{ExpressionSet}.

It is straightforward to subset an \Rclass{ExpressionSet} by extracting
given columns (i.e. samples) and rows (i.e. gene features). Nevertheless,
the \CRANpkg{randomForest} package is quite sensitive to the definition of
\R{} factors; for instance, the \Rfunction{randomForest} function will crash
if a factor is declared to have 3 levels (e.g. "A", "B", and "C"), while the
\Rclass{ExpressionSet} only contains samples for two of them (e.g. "A" and
"B"). A simple fix is to update the known levels of the factor after having
subsetted the \Rclass{ExpressionSet}:

<<>>=
pData(AlvMac) <- droplevels(pData(AlvMac))
@

\bioccomment{
Note that this operation preserves the order of ordered factors.
}

Typically, subsetting an \Rclass{ExpressionSet} by rows and columns does not
automatically update the known levels of each factor to the remaining levels.
The function \Rfunction{subEset} performs this additional task. The function
takes a named list, where names must be column names from the
\Robject{phenoData} slots and values
must be present in the corresponding columns, and returns a subset of the
original \Rclass{ExpressionSet} including only the samples which match those
values:

<<eval=FALSE>>=
subEset(
    eSet=AlvMac, subset=list(
    Time=c("2H","6H","24H"),
    Treatment=c("CN","MB")))
@

\section[Statistics]{Statistics}

\subsection[Overview]{Overview}

\label{sec:StatsOverview}

\Biocpkg{GOexpress} was initially created from a set of gene-based, and later
ontology-based, visualisation functions. Following the integration of those
various plotting functions at the core of the \Biocpkg{GOexpress} package, the
need for a ranking of genes and GO terms soon became apparent in order to
rapidly identify those with gene expression data best classify samples
according to the experimental factor studied.
A two-fold procedure was implemented:
\begin{enumerate}
    \item Using the available expression data, each gene present in the
    dataset is scored, evaluating its ability to classify the predefined
    groups of samples. Genes are ranked according to their respective score;
    ties are resolved by assigning the lowest rank R to all G genes, giving
    the rank G+n to the next gene(s). The genewise scoring functions
    implemented are described in the following subsections, and all were
    validated on in-house datasets cross-checked with comparable
    analytic pipelines (e.g. Ingenuity\textregistered Pathway Analysis,
    SIGORA)
    \item Using the above ranks and scores, each GO term in the
    gene ontology annotations
    is assigned the mean score and the mean rank of all the genes
    associated with it in the gene-GO annotations. Genes present in the
    annotations but
    absent from the
    \Rclass{ExpressioSet} are assigned a score of 0 (minimum valid
    score; indicates no power to discriminates the predefined groups of
    sample) and a rank equal to the number of genes present in the
    \Rclass{ExpressioSet} plus one (worst rank, while preserving discrete
    continuity of
    the ranking).
\end{enumerate}

\subsection[Data-driven visualisation functions]{Data-driven visualisation
functions}

Importantly, the statistics performed to rank GO terms and genes do \emph{not}
influence the behaviour of the subsequent plotting functions; heatmaps,
dendrograms and gene expression profiles are purely driven by the
expression data, sample phenotype annotations provided and GO terms
annotations, without any transformation or normalisation applied
to the data. Therefore, users are encouraged to use and suggest alternative
relevant scoring and ranking strategies, which could prioritise GO terms and
genes in different ways. A current acknowledged bias is the higher scoring of
GO terms associated with fewer genes, which is discussed
in section \ref{sec:Filtering}.

\subsection[Random Forest]{Random Forest}

We implemented the Random Forest framework to answer the question: ``How well
does each gene in the dataset classify predefined groups of samples?''. The
random forest consists of multiple decision trees. Each tree is built based on
a bootstrap sample (sample with replacement) of observations and a random
sample of variables. The \CRANpkg{randomForest} package first calculates the
Gini index (Breiman et al, 1984) for each node in each tree. The Gini index
is a measure of homogeneity from 0 (homogeneous) to 1 (heterogeneous). The
decrease in the Gini index resulting from a split on a variable is then
calculated for each node and averaged for each variable over all the trees in
the model. The variable with the biggest mean decrease in the Gini index is
then considered the most important. Technically, \Biocpkg{GOexpress} extracts
the \Robject{MeanDecreaseGini} value from the \Robject{importance} slot of the
\Rfunction{randomForest} output and uses this value as the score for each
gene.

A key feature of the Random Forest framework is the implicit handling of
interactions between genes, given a sufficient number of trees generated and
genes sampled. Indeed, at each node in the decision tree, genes are sampled
from the expression data and tested for their individual capacity to improve
the partitioning reached in the previous node. The larger the number of trees
and genes sampled, the more complete the coverage of interactions will be.

\subsection[One-way Analysis of Variance (ANOVA)]{One-way Analysis of Variance
(ANOVA)}

We implemented the ANOVA statistical framework to answer the question: "How
different is the expression level of each gene in the dataset in the different
groups of samples?". Given the expression level of a gene in all the samples,
the one-way ANOVA determines the ratio of the variance between the groups
compared to the variance within the groups, summarised as an F statistic
that \Biocpkg{GOexpress} uses as a score for each gene. Simply put, if samples
from the same groups show gene expression values similar to each other while
samples from different groups show different levels of expression, those genes
will produce a higher score. This score cannot be less than 0 (variance
between groups insignificant to variance within groups), while very large
ratios can easily be reached for genes markedly different between groups.

Contrary to the random forest framework, the Analysis of Variance makes
important assumptions on the data: namely, the independence of observations,
the normality of residuals, and the equality of variances in all groups. While
the former two are the responsibility of the user to verify, the latter is
taken care of by \Biocpkg{GOexpress}. Indeed, the \Rfunction{oneway.test}
function of the package \Rpackage{stats} is used with parameter
\Robject{var.equal} set
to to \Rcode{FALSE}. While this reduces the sensitivity of the test, all
genes are affected by this correction based on the relative amount of
variance in the different predefined groups of samples. Finally, it is once
again important to note that the one-way ANOVA only evaluates univariate
changes,
while the random forest framework implicitely allows for interactions between
genes.

\section[Integration with other packages]{Integration with other packages}

\subsection[shiny]{shiny}

Shiny is an \R{} package that makes it easy to build interactive web
applications (apps) straight from \R{}
\incfig{images/shiny_screenshot.png}{\textwidth}{Shiny app.}
{This simple application allows visualisation of genes using the
\Biocpkg{GOexpress} \Rfunction{expression\_profiles\_symbol} function.
An online interactive version was made available at:
\url{https://kevinrue.shinyapps.io/alvmac/}}
as shown in Figure~\ref{images/shiny_screenshot.png}. More information is
available at \url{http://shiny.rstudio.com/}.

\section[Notes]{Notes}

\subsection[Authors' contributions]{Authors' contributions}

Conception and development of the \Biocpkg{GOexpress} package was carried out
by KR-A with contributions by PAM, under the supervision of SVG and DEM.
Experimental data used for testing was generated and analysed with the help of
DAM and NC.
Integration of the random forest statistical frameworks was advised by BH and
ACP. This User's Guide was prepared by KR-A, and edited by PAM, BH, DAM, NCN,
ACP, SVG, and DEM.

\subsection[Acknowledgement]{Acknowledgments}

Since the early beginning, \Biocpkg{GOexpress} has grown from constructive
feedback, and I would like to thank a number of colleagues and scientists from
all backgrounds who contributed each in their own way to the present version
and features of the package.
Special thanks to Dr. Paul Cormican, Simone Coughlan, Dr. Karsten Hokamp,
and an unknown reviewer for feedback leading to some features.
Sincere thanks to Dr. Kate Killick for testing on human data and feedback.
My thanks to the \Bioconductor{} staff and in particular to Hervé Pagès for
the helpful feedback which improved the standards of the code and
documentation.
Last but not least, thanks to the University College Dublin "OpenSequencing"
group,
the "Virtual Institute of Bioinformatics and Evolution" (VIBE), and the
"UCD PhD Symposium in Computational Biology \& Innovation" where
I first presented raw versions of \Biocpkg{GOexpress} and received valuable
feedback and advice in return, underlying a significant number of updated and
new features.

\subsection[Session information]{Session information}

<<>>=
sessionInfo()
@

\end{document}