%\textwidth=6.2in \textheight=8.5in
%\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in
%\headheight=-.3in
%\VignetteIndexEntry{FELLA}
%\VignettePackage{BiocStyle}
%\VignetteEngine{utils::Sweave}

\documentclass[12pt]{article}

<<style-Sweave, echo=FALSE, results=tex>>=
BiocStyle::latex()
@

\usepackage{dsfont}
\usepackage{booktabs}
\usepackage{fixltx2e} %subscripts
\usepackage{multirow}
\usepackage{adjustbox}
%\usepackage{hyperref}


\newcommand{\textq}[1]{``#1''}
% \newcommand{\Rcode}[1]{{\texttt{#1}}}
% \newcommand{\Robject}[1]{{\texttt{#1}}}
% \newcommand{\Rfunction}[1]{{\texttt{#1}}}
% \newcommand{\Rpackage}[1]{{\textit{#1}}}
% \newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\textit{#1}}}
\newcommand{\Rfunarg}[1]{{\textit{#1}}}

\newcommand{\fella}{{\Rpackage{FELLA}}}
\newcommand{\metaboanalyst}{{\href{www.metaboanalyst.ca}{MetaboAnalyst}}}
\newcommand{\impala}{{\href{http://impala.molgen.mpg.de}{IMPaLA}}}
\newcommand{\warnkegg}{{\footnote{
This analysis is subject to KEGG release 83.0, from August 17th, 2017. 
Posterior KEGG releases might alter the reported sub-network
}}}

\begin{document}

\SweaveOpts{concordance=TRUE, include=TRUE, echo=TRUE, results=hide}
%\SweaveOpts{fig=FALSE, results=hide, prefix.string=fella_latexfiles/}
%\setkeys{Gin}{width=1\textwidth}
\bioctitle[The FELLA R package]{An R package to enrich 
metabolomics data: FELLA}
\author[1,2]{Sergio Picart-Armada\thanks{\email{sergi.picart@upc.edu}}}
\author[1,2]{Francesc Fern\'andez-Albert}
\author[3,4,5]{Maria Vinaixa}
\author[3,4]{Oscar Yanes}
\author[1,2]{Alexandre Perera-Lluna}
\affil[1]{B2SLab, Departament d'Enginyeria de Sistemes,
Autom\`atica i Inform\`atica Industrial,
Universitat Polit\`ecnica de Catalunya, CIBER-BBN, Barcelona, 08028, Spain}
\affil[2]{Institut de Recerca Pedi\`atrica
Hospital Sant Joan de D\'eu, Esplugues de Llobregat, Barcelona, 08950, Spain}
\affil[3]{Centre for Omic Sciences, Department of Electronic Engineering, 
Rovira i Virgili University, Reus, 43204, Spain}
\affil[4]{Metabolomics Platform, Spanish Biomedical Research Centre in 
Diabetes and Associated Metabolic Disorders, Madrid, 28029, Spain}
\affil[5]{Manchester Synthetic Biology Research Centre for Fine and 
Speciality Chemicals, University of Manchester, 
Manchester, M1 7DN, United Kingdom}

\maketitle

\section{Abstract}

Pathway enrichment techniques are useful for giving 
context to experimental metabolomics data.
The primary analysis of the raw metabolomics data leads to 
annotated metabolites with abundance measures. 
These metabolites are compared between experimental conditions, 
in order to find discriminative molecular signatures. 
The secondary analysis of the dataset aims at giving context 
to the affected metabolites in terms of the 
prior biological knowledge gathered in metabolic pathways. 
Several statistical approaches are available to derive 
a list of prioritised metabolic pathways that relate to 
the underlying changes in metabolite abundances.
However, the interpretation of a prioritised pathway list 
remains challenging, as pathways are not disjoint and 
show overlap and cross talk effects. 
Furthermore, it is not straightforward to automatically propose 
novel enzymatic targets given a pathway enrichment. 

We introduce \fella, an R package to perform a 
network-based enrichment of a list of affected metabolites. 
\fella~builds a hierarchical network representation of 
the organism of choice using the Kyoto Encyclopedia of Genes and Genomes, 
which contains pathways, modules, enzymes, reactions and metabolites. 
The enrichment is accomplished by applying 
diffusion algorithms in the knowledge network. 
Flow is introduced in the metabolites from the input list 
and propagates to the rest of nodes, 
resulting in diffusion scores for all the nodes in the network. 
The top scoring nodes contain not only relevant pathways, 
but also the intermediate entities that build a plausible explanation 
on how the input metabolites translate into reported pathways. 
The highlighted sub-network can shed light on 
pathway cross talk under the experimental condition 
and potential enzymatic targets for further study.

The implementation and the programmatic use of \fella~is hereby described, 
along with a graphical user interface that wraps the package functionality. 
The algorithmic part in \fella~was previously validated on 
the study of an uncharacterised mitochondrial protein. 
The functionality of \fella~has been demonstrated on 
three public human metabolomics studies, respectively on 
(a) ovarian cancer cells, (b) dry eye and (c) malaria 
and other febrile illnesses. 
\fella~has been able to reproduce findings 
from the original publications and to report sub-network representations 
that can be manually handled. 

\section{Introduction}

Metabolomics is the science that studies the chemical reactions 
in living organisms by quantifying their lightweight molecules, 
called metabolites.
The utilities of metabolomics range from disease diagnosis 
through biomarkers and personalised medicine 
to the generation of biological knowledge \cite{metaboreview}. 

Metabolomics data is mainly acquired through technologies such as, 
but not limited to, Nuclear Magnetic Resonance (NMR) 
and Mass Spectrometry (MS). 
MS is usually preceded by Liquid Chromatography (LC) or 
Gas Chromatography (GC) \cite{metabolomicstechnologiescomplete}. 
The primary analysis of the raw metabolomics data 
can be achieved through publicly available tools: 
the R packages \Biocpkg{xmcs} \cite{xcms} for peak identification 
and \Biocpkg{CAMERA} \cite{camera} for peak annotation.
There are pipelines that cover the whole process, 
for example the online tool 
\href{https://meltdb.cebitec.uni-bielefeld.de}{MeltDB} \cite{meltdb} 
or the R package \Biocpkg{MAIT} \cite{mait}. 
Metabolites found in samples are mapped to specral databases 
such as the Human Metabolome Database \cite{hmdb}. 

The secondary analysis, or data interpretation, 
starts when the metabolites are mapped to a database and 
their abundances are available \cite{reviewsecondary}. 
The existence of experimental conditions enables a 
statistical differential analysis that yields a set of 
metabolites that exhibit changes in the intervention. 
It is, however, increasingly important to understand the 
underlying biological perturbation by giving context to the 
affected metabolites rather than focusing on the ability to 
classify samples through them \cite{metaboreview}. 
Pathway analysis is a fundamental methodology for data interpretation 
\cite{review10years} that enriches the affected metabolites with 
current knowledge on biology, available in pathway databases 
including the Kyoto Encyclopedia of Genes and Genomes 
or KEGG \cite{kegg}, Reactome \cite{reactome} 
and WikiPathways \cite{wikipathways}. 
Enrichment techniques will be discussed in three categories 
or generations, according to the classification 
proposed in the review \cite{review10years}. 
Commercial pathway analysis products such as \software{IPA} 
(QIAGEN Inc., 
\url{https://www.qiagenbioinformatics.com/products/ingenuitypathway-analysis}) 
are out of the scope of this work.

The first generation of methods, 
named over representation analysis (ORA), 
are based on testing if the proportion of 
affected metabolites within a pathway is statistically meaningful. 
ORA is based in statistical tests on probability distribution 
like the hypergeometric, binomial or chi-squared \cite{review10years}. 
ORA is available in tools like the web servers 
\metaboanalyst~\cite{metaboanalyst} and \impala~\cite{impala} 
and the R package \Biocpkg{clusterProfiler} \cite{clusterprofiler}. 
The online resource 
\href{http://www.bio-bigdata.com/SubpathwayMiner/}{SubPathwayMiner} 
identifies sub-pathways from KEGG pathways 
by mining $k$-cliques in each metabolic pathway prior to ORA. 
With this strategy, significant sub-regions can be spotted 
even if the whole pathway is not significant \cite{subpathwayminer}. 

The second generation of methods, functional class scoring (FCS), 
uses quantitative data instead and seeks 
subtle but coordinated changes in the metabolites belonging to a pathway.
MSEA \cite{msea} in \metaboanalyst~\cite{metaboanalyst} 
and \impala~\cite{impala} contain implementations of FCS for metabolomics. 
The R package \Biocpkg{PAPi} calculates pathways activity scores 
per sample, based on the number of metabolites identified 
from each pathway and their relative abundances.
Significantly affected pathways are found by applying an ANOVA 
or a t-test on those scores \cite{papi}. 
On the other hand, there is an ensemble approach relying on 
several pathway-based statistical tests \cite{egsea} 
and is available in the R pacakge \Biocpkg{EGSEA}. 

The third generation, known as pathway topology-based (PT) methods, 
further includes topological measures of the metabolites 
in the statistic, accounting for their inequivalence 
in the metabolic network. 
PT analyses can be performed using \metaboanalyst~\cite{metaboanalyst}, 
where metabolites are weighted by their centrality within the pathway. 
The R package \CRANpkg{MPINet} builds a pathway-level statistic 
that accounts for metabolite inequivalence in the 
global metabolic network and for bias in technical equipment \cite{mpinet}. 

Another perspective for understanding metabolomics data is 
through the construction and inquiry of metabolic networks. 
The \software{MetScape} plugin \cite{metscape} within the 
\software{Cytoscape} environment \cite{cytoscape} is useful for 
representing metabolite-reaction-enzyme-gene networks. 
\Biocpkg{KEGGGraph} is an R package for constructing 
metabolic networks from the KEGG pathways \cite{kegggraph}.
\Biocpkg{MetaboSignal} is an R package for 
building and examining the topology of 
gene-metabolite networks \cite{metabosignal}. 
The R package \Githubpkg{dgrapov/MetaMapR} helps reduce sparsity 
in metabolic networks by integrating biochemical transformations, 
structural similarity, mass spectral similarity and 
empirical correlation information \cite{metamapr}.

Here, we introduce the R package \fella~for 
metabolomics data interpretation that combines concepts from 
pathway enrichment and network analysis. 
The main objective of \fella~is providing the user with a 
biological explanation involving biological pathways. 
\fella~starts from a single, comprehensive network 
consisting of metabolites, reactions, 
enzymes, modules and pathways as nodes.
The list of affected metabolites and the pathways 
highlighted by \fella~are connected through intermediate entities 
-reactions, enzymes and KEGG modules- and returned as a sub-network.
The intermediate entities suggest how the perturbation spreads 
from metabolites to pathways and how pathways cross talk. 
The provided enzymes are candidates for further examination, 
whereas new metabolites might be reported as well. 
\fella~is publicly available in 
\url{https://github.com/b2slab/FELLA} under the GPL-3 license. 

\section{Methodology}

\subsection{Implementation details}

\fella~is written entirely in R \cite{r} and relies on 
the \Biocpkg{KEGGREST} R package \cite{keggrest} for retreiving KEGG, 
the \CRANpkg{igraph} R package \cite{igraph} for network analysis and 
the \CRANpkg{shiny} R package \cite{shiny} for providing 
a graphical user interface.

\fella~defines two S4 classes for handling its main purposes: 
a \Rclass{FELLA.DATA} object that encompasses the knowledge model 
from KEGG and a \Rclass{FELLA.USER} object that contains 
the current analysis by the user.
Table \ref{tab:classes} contains further details about 
the slots and sub-slots in each one of these classes, 
whereas figure \ref{fig:workflow} depicts the package 
workflow and main functions. 

\begin{table}[h]
%\centering
%\begin{adjustbox}{width=\linewidth,center}
% \begin{adjustbox}{center}
\resizebox{\linewidth}{!}{%
\begin{tabular}{lllll}
\hline
Custom class & Slot & Sub-slot & Class & Description \\
\hline
\multirow{12}{*}{\texttt{FELLA.DATA}} & \multirow{5}{*}{\texttt{@keggdata}} & 
\texttt{@graph} & igraph & Knowledge graph object \\
& & \texttt{@id2name} & list & Dictionary from KEGG ID to common name \\
& & \texttt{@pvalues.size} & matrix & Matrix with largest 
CC size probabilities \\
& & \texttt{@id} & list & Correspondence between IDs and category \\
& & \texttt{@status} & character & Status indicator 
of the object \\ \cline{2-5}
& \multirow{1}{*}{\texttt{@hypergeom}} & \texttt{@matrix} & 
Matrix & Metabolite-pathway binary relationship \\ \cline{2-5}
& \multirow{3}{*}{\texttt{@diffusion}} & \texttt{@matrix} & 
matrix & Matrix to compute diffusion as a matrix-vector product \\
& & \texttt{@rowSums} & vector & Internal data to compute the z-scores \\
& & \texttt{@squaredRowSums} & vector & Internal data to compute 
the z-scores \\ \cline{2-5}
& \multirow{3}{*}{\texttt{@pagerank}} & \texttt{@matrix} & 
matrix & Matrix to compute PageRank as a matrix-vector product \\
& & \texttt{@rowSums} & vector & Internal data to compute the z-scores \\
& & \texttt{@squaredRowSums} & vector & Internal data 
to compute the z-scores \\ \hline
\multirow{17}{*}{\texttt{FELLA.USER}} & \multirow{3}{*}{\texttt{@userinput}} & 
\texttt{@metabolites} & vector & KEGG IDs that map to the knowledge graph \\
& & \texttt{@metabolitesbackground} & vector & Background KEGG IDs \\
& & \texttt{@excluded} & vector & Input IDs not mapping to the 
knowledge graph \\ \cline{2-5}
& \multirow{6}{*}{\texttt{@hypergeom}} & \texttt{@valid} & logical & 
Indicator of analysis validity \\
& & \texttt{@pvalues} & vector & Pathway p-values \\
& & \texttt{@pathhits} & vector & Number of hits in each pathway \\
& & \texttt{@pathbackground} & vector & Number of metabolites 
in each pathway \\
& & \texttt{@nbackground} & numeric & Number of compounds 
in the background \\
& & \texttt{@ninput} & numeric & Number of compounds 
in the input \\ \cline{2-5}
& \multirow{4}{*}{\texttt{@diffusion}} & \texttt{@valid} & 
logical & Indicator of analysis validity \\
& & \texttt{@pscores} & vector & P-scores for each node in the network \\
& & \texttt{@approx} & character & Chosen approximation \\
& & \texttt{@niter} & numeric & Chosen iterations \\ \cline{2-5}
& \multirow{4}{*}{\texttt{@parerank}} & \texttt{@valid} & logical & 
Indicator of analysis validity \\
& & \texttt{@pscores} & vector & P-scores for each node in the network \\
& & \texttt{@approx} & character & Chosen approximation \\
& & \texttt{@niter} & numeric & Chosen iterations \\ \hline
\end{tabular}
}\caption{Summary of the S4 classes defined in \fella.}
\label{tab:classes}
\end{table}

\begin{figure}[!tpb]%figure1
\centerline{\includegraphics[width=\linewidth]{workflow3.eps}}
\caption{Design of the R package \fella.
Block \textbf{I} covers the creation of a graph object from an 
organism code and its database, which can be loaded 
into a \Rclass{FELLA.DATA} object.
This object is needed in all the following blocks.
Block \textbf{II} requires block \textbf{I} and shows 
how to map the KEGG identifiers to the database 
in a \Rclass{FELLA.USER} object and run the propagation algorithms 
(diffusion, PageRank) to score all the entities in the graph.
Block \textbf{III} requires blocks \textbf{I} and \textbf{II} and 
exports the results as a sub-network or as a table.}
\label{fig:workflow}
\end{figure}


\fella~contains two vignettes that illustrate its capabilities: 
(1) a quick-start example with the main functions applied to 
a toy dataset, and (2) this document, an in-depth 
demonstration on three real studies. 
This vignette requires an internet connection 
and can take up some time and memory to build, 
as it builds the internal KEGG representation 
for Homo sapiens on the fly.

\subsection{Database and knowledge model}

A distinctive feature of \fella~is its unique knowledge model. 
Instead of using individual pathway representations, 
either as a list of metabolites (ORA) or as a metabolic network (TP), 
\fella~builds a unique network that encompasses 
all the pathways at once: the KEGG graph.
Figure \ref{fig:kegggraph} shows the hierarchical representation 
of the KEGG database, ranging from the small, specific 
molecular level (metabolite) to the large, complex unit (pathway). 
Intermediate levels contain, from bottom to top: 
reactions relating the metabolites, enzymes catalising the 
reactions and KEGG modules containing the enzymes.
More details on the construction and curation of this structure, 
resemblant to the one used by MetScape \cite{metscape}, 
can be found in \cite{metabopicart}.
The enrichment is therefore achieved by finding a sub-network 
from the whole KEGG graph that is statistically 
relevant for a list of input metabolites.

\begin{figure}[!tpb]%figure2
\centerline{\includegraphics[width=\linewidth]{kegggraph.eps}}
\caption{Internal knowledge representation from KEGG. 
The scheme outlines the KEGG graph, a heterogeneous network 
whose nodes belong to a category in KEGG: 
compound, reaction, enzyme, module or pathway. 
Lower levels are expected to be more specific entities, 
while top levels are broader concepts. 
The enrichment procedure starts from input metabolites and 
extracts a relevant sub-network from the KEGG graph. 
Figure extractred from \cite{metabopicart}}
\label{fig:kegggraph}
\end{figure}

As shown in the block (I) of figure \ref{fig:workflow}, 
the first step is to build a KEGG graph from an organism 
in KEGG -Homo sapiens by default- using the 
\Rfunction{buildGraphFromKEGGREST} command. 
Afterwards, a local database can be built from 
the KEGG graph through the \Rfunction{buildDataFromGraph} command. 
The main purposes of \Rfunction{buildDataFromGraph} are to 
save (1) the matrices that allow computing diffusion and 
PageRank as a matrix-vector product, and (2) the null distribution 
of the largest connected component of a $k$-th order subgraph, 
with uniformly chosen nodes. 
Point (1) is required to compute the diffusion scores, 
whereas (2) is useful for filtering small connected components 
in the reported subgraphs.

The user should be aware that KEGG is frequently updated and 
therefore the derived KEGG graph can change between KEGG releases.
The metadata from the KEGG version used to build a 
\Rclass{FELLA.DATA} object can be retrieved through \Rfunction{getInfo}.

\subsection{Enrichment analysis}

Once the database is ready as a \Rclass{FELLA.DATA} object 
and the input is formatted as a list of KEGG compounds, 
the enrichment can be performed. 
The results of the enrichment are stored in a 
\Rclass{FELLA.USER} object, possibly using three 
methodologies described below. 

\subsubsection{Hypergeometric test}

For completeness purposes, the hypergeometric test is 
included in \fella~in the function \Rfunction{runHypergeom}. 
As in several ORA implementations, the hypergeometric distribution 
is used to assess whether a biological pathway contains 
more hits within the input list than expected from chance given its size. 
Pathways are ranked according to their p-value 
after multiple testing correction. 

Note that the results from this test will differ from 
a hypergeometric test using the original KEGG pathways, 
because metabolite-pathway connections are inferred from the KEGG graph. 
A metabolite is included in a pathway if the pathway 
can be reached from the metabolite in the upwards-directed 
KEGG graph, depicted in figure \ref{fig:pr}. 
In consequence, metabolites related to the enzymes 
within a pathway will belong to the pathway, 
even if they were not in the original definition of the KEGG pathway. 

\subsubsection{Diffusion}\label{sec:hd}

Diffusion algorithms have been extensively used in computational biology. 
For instance, HotNet is an algorithm for finding 
sub-networks with a large amount of mutated genes \cite{hotnet}, 
whereas TieDIE attemps to link a source set and 
a target set of molecular entities through 
two diffusion processes \cite{tiedie}.
Other applications include the prioritisation of 
disease genes \cite{gwadata} and the prediction of 
gene function \cite{genemania}.

In \fella, diffusion is a natural way to score all the nodes 
in the KEGG graph given an input list of metabolites, 
available using \Rcode{method = "diffusion"} 
in the function \Rfunction{runDiffusion}.
The input metabolites introduce unitary flow in the network.
Flow can only leave the network through pathway nodes, 
forcing it to propagate through the intermediate entities 
as well (reactions, enzymes and modules), see figure \ref{fig:hd}. 
Further details can found in \cite{metabopicart}. 

\begin{figure}[!tpb]
\centerline{\includegraphics[width=.8\linewidth]{HD_contour.eps}}
\caption{Network setup for the diffusion process. 
nput metabolites (in black rings) introduce a unitary flow 
in the network and only the pathway nodes (blue rings) 
can leak the flow. The final score of the nodes reflects 
the \textq{temperature} of a stationary state. 
Figure extractred from \cite{metabopicart}.}
\label{fig:hd}
\end{figure}

However, the diffusion scores are biased due to 
the network topology \cite{metabopicart} and therefore a 
normalisation step is required. 
\fella~offers a normalisation through a z-score 
(\Rcode{approx = "normality"}) or through an empirical p-value 
(\Rcode{approx = "simulation"}), both assessing whether 
the diffusion score of a node is likely to be reached 
in a permutation analysis, i.e. if the input is random. 

The normalisation through the z-scores leads to p-scores, defined as:

$$ ps_i = 1 - \Phi(z_i) $$

Where $ps_i$ is the p-score of node $i$, $z_i$ is 
its z-score \cite{metabopicart} and $\Phi$ is the 
cumulative distribution function of the standard gaussian distribution. 
Under this definition, nodes are ranked using increasing p-scores. 

For completeness, two alternative parametric scores have been added. 
The heavier-tailed \textbf{t}-distribution can be used 
instead of the gaussian by choosing \Rcode{approx = "t"} 
and supplying the desired degrees of freedom $\nu$. 

Similarly, the \textbf{gamma} distribution 
can be used through \Rcode{approx = "gamma"}. 
The p-score is obtained with

$$ ps_i = 1 - F_i(T_i) $$

Being $T_i$ the raw temperature of node $i$ and $F_i$ 
the cumulative distribution function of a gamma distribution, 
adjusted by its shape ($\frac{\mu_i^2}{\sigma_i^2}$) and scale 
($\frac{\sigma_i^2}{\mu_i}$) parameters. 
The quantities $\mu_i$ and $\sigma_i^2$ are the mean 
and variance of the null temperatures and are analytically known 
from the null model formulation \cite{metabopicart}. 

\subsubsection{PageRank}

PageRank \cite{pagerank} offers a scoring method for 
the nodes in the KEGG graph, based on a random walks approach.
The random walks start at the input metabolites and 
are forced to explore their reachable nodes, see figure \ref{fig:pr}. 
As random walks take into account the direction of the edges, 
PageRank is applied to the upwards-directed KEGG graph 
(figure \ref{fig:kegggraph}) in order to force 
the walks to reach pathway nodes. 
Nodes that are frequently visited by the random walks 
earn a higher PageRank, analogously to the diffusion scores. 
More details about this particular formulation, 
implemented in \Rfunction{runPagerank}, 
can be found in \cite{metabopicart}.

\begin{figure}[!tpb]
\centerline{\includegraphics[width=.8\linewidth]{PR_contour.eps}}
\caption{Network setup for PageRank. 
Input metabolites (in black rings) are the source of random walks 
that must climb through the graph levels, up to the pathway nodes. 
Figure extractred from \cite{metabopicart}.}
\label{fig:pr}
\end{figure}

The PageRank scores are statistically normalised, 
providing the same options as in the diffusion scores in section \ref{sec:hd}. 
Therefore, the argument \Rfunarg{approx} can be set 
to \Rcode{"simulation"} for the permutation analysis, 
or to \Rcode{"normality"}, \Rcode{"t"} or \Rcode{"gamma"} 
for the parametric alternatives.

\subsection{Enrichment wrapper}

\fella~contains the wrapper \Rfunction{enrich} 
that maps the KEGG ids and runs the desired enrichment 
procedure with a single call. 
This can be convenient for producing compact scripts 
and running quick analyses.

\subsection{Limitations}

\fella~currently starts the statistical analysis 
from a list of affected metabolites. 
Therefore, it inherits a limitation from ORA methods: 
the need of choosing a cutoff to derive the list of 
affected metabolites, assuming that the metabolites 
stem from a differential abundance analysis. 

Another limitation, shared among network-based models, 
is the incomplete biological knowledge from which the network is built. 
The knowledge model in \fella~might also constraint 
the complexity of the mechanisms that can be found through it.
Processes such as genetic and epigenetic events, 
or the type and directionality of regulatory events, 
are not considered at the moment. 

The user should be aware that \fella~neither builds 
a dynamic model of the biochemical reactions in the metabolism, 
nor relies on flux balance analysis.  
Conversely, \fella~is built on a knowledge representation 
from the biology in KEGG that focuses on offering 
interpretability to the final user.

\section{Case studies}

The functionalities of \fella~are demonstrated by (1) 
building a Homo sapiens database and (2) 
enriching summary metabolomics data from three public datasets. 

\subsection{Building the database}\label{sec:builddb}

\fella~requires a database built from KEGG to perform any data enrichment. 
\fella~contains a small example database as a 
\Rclass{FELLA.DATA} object, accessible via 
\Rcode{data("FELLA.sample")}, but this is a toy example 
for demonstration purposes, not suited for regular analyses. 

Therefore, the database for the corresponding organism 
has to be built before any analysis is run. 
The first step is to build the KEGG graph from the current 
KEGG release with the function \Rfunction{buildGraphFromKEGGREST}. 
Note that the user can force specific KEGG pathways to be 
excluded from the graph - the following code removes 
\textq{overview} metabolic pathways based on 
\href{http://www.genome.jp/kegg/brite.html}{KEGG brite}. 


<<01_graph, eval=TRUE>>=
library(FELLA)
set.seed(1)
# Filter overview pathways
graph <- buildGraphFromKEGGREST(
    organism = "hsa", 
    filter.path = c("01100", "01200", "01210", "01212", "01230"))
@

Once the KEGG graph is ready, the database will be saved 
locally using \Rfunction{buildDataFromGraph}. 
The user can choose which matrices shall be stored 
using the \Rfunarg{matrices} argument - saving both 
\Rcode{"diffusion"} and \Rcode{"pagerank"} 
might take up to 1GB of disk space. 

If the user plans on using the z-score approximation, 
it is advisable to set the \Rfunarg{normality} 
argument to \Rcode{c("diffusion", "pagerank")} 
in order to speed up future computations. 
Using the z-scores with a custom metabolite background 
will require the matrices to be saved as well. 

Finally, the argument \Rfunarg{niter} controls how many 
random trials are performed in the estimation of the 
null distribution of the largest connected component 
of a $k$-th order random subgraph. 
As this is a property of the KEGG graph, 
it is performed once and reused in each analysis. 
This finds application when filtering 
small connected components from the reported sub-network, 
see section \ref{sec:export}.

<<01_database, eval=TRUE>>=
tmpdir <- paste0(tempdir(), "/my_database")
# Mke sure the database does not exist from a former vignette build
# Otherwise the vignette will rise an error 
# because FELLA will not overwrite an existing database
unlink(tmpdir, recursive = TRUE)  
buildDataFromGraph(
    keggdata.graph = graph, 
    databaseDir = tmpdir, 
    internalDir = FALSE, 
    matrices = "diffusion", 
    normality = "diffusion", 
    niter = 50)
@

When the database is available in local, it can be loaded 
in an \R{} session and assigned to a \Rclass{FELLA.DATA} 
object using the function \Rfunction{loadKEGGdata}. 
This should be the only procedure for creating any \Rclass{FELLA.DATA} object.
The user is given the choice of loading the diffusion 
and pagerank matrices to ease memory saving. 

<<01_loadkegg, eval=TRUE>>=
fella.data <- loadKEGGdata(
    databaseDir = tmpdir, 
    internalDir = FALSE, 
    loadMatrix = "diffusion"
)
@

The contents of the \Rclass{FELLA.DATA} object can be summarised as well:

<<01_summary, eval=TRUE, results=verbatim>>=
fella.data
@

The function \Rfunction{getInfo} provides the KEGG 
release and organism that generated a \Rclass{FELLA.DATA} object:

<<01_info, eval=TRUE, results=verbatim>>=
cat(getInfo(fella.data))
@

Please note that the database built for this vignette is 
stored in a temporary folder and will not be persistent.  
The user should build his or her own database and save it 
in a persistent location, either in the package 
installation directory (\Rcode{internalDir = TRUE}) 
or in a custom folder (\Rcode{internalDir = FALSE}). 
Internal databases can be listed using \Rfunction{listInternalDatabases}.

A cautionary note if the user is relying on the 
internal directory: reinstalling \fella~will wipe 
existent databases because its internal directory is overwritten.
Also, if the database name already exists when saving 
a new database, the existing database will be renamed by 
appending \Rcode{\_old} in order to avoid overwriting.

\subsection{Epithelial cells dataset}

This example data is extracted from the epithelial 
cancer cells dataset \cite{epithelial}, an in vitro model 
of dry eye in which the human epithelial cells IOBA-NHC 
are put under hyperosmotic stress. 
The original study files are deposited in the Metabolights repository 
\cite{metabolights} under the identifier MTBLS214: 
\url{https://www.ebi.ac.uk/metabolights/MTBLS214}.
The list of metabolites hereby used reflects 
metabolic changes in \textq{Treatment 1} (24 hours in 
serum-free media at 380 mOsm) against control (24 hours at 280 mOsm). 
The metabolites have been extracted 
from \textq{Table 1} in the original manuscript and mapped to KEGG ids. 


\subsubsection{Mapping the input metabolites}

The input metabolites should be provided as 
\href{http://www.genome.jp/kegg/compound/}{KEGG compound} identifiers. 
If the user starts from another source (common names, 
\href{http://www.hmdb.ca/}{HMDB} identifiers), tools like the 
\textq{compound ID conversor} from 
\metaboanalyst~can be useful for the ID conversion. 

<<02_compounds, eval=TRUE>>=
compounds.epithelial <- c(
    "C02862", "C00487", "C00025", "C00064", 
    "C00670", "C00073", "C00588", "C00082", "C00043")
@

The first step is to map the input metabolites 
to the KEGG graph with \Rfunction{defineCompounds}. 
This step requires the \Rclass{FELLA.DATA} object, 
loaded in section \ref{sec:builddb}. 
The user can impose a custom metabolite background 
with the \Rfunarg{compoundsBackground} argument. 
By default, all the KEGG compounds in the graph are used. 

<<02_mapcompounds, eval=TRUE>>=
analysis.epithelial <- defineCompounds(
    compounds = compounds.epithelial, 
    data = fella.data)
@

Notice that \Rfunction{defineCompounds} throws a 
warning if any of the input metabolites does not map to the graph. 
The user can retrieve the mapped and unmapped identifiers 
through \Rfunction{getInput} and \Rfunction{getExcluded}, respectively.

<<02_mapped, results=verbatim, eval=TRUE>>=
getInput(analysis.epithelial)
getExcluded(analysis.epithelial)
@

The status of a \Rclass{FELLA.USER} object 
can be checked by printing the object.

<<02_print, results=verbatim, eval=TRUE>>=
analysis.epithelial
@

\subsubsection{Enriching using diffusion}

Having mapped the compounds, the enrichment can be performed. 
In this vignette, only the diffusion method 
in \Rfunction{runDiffusion} will be applied, 
although PageRank has an almost identical usage in \Rfunction{runPagerank}.

If the user prefers an explicit permutation analysis, 
the option \Rcode{approx = "simulation"} performs 
the amount of iterations specified in the \Rfunarg{niter} argument.

Conversely, if the desired approximation is 
the z-score (\Rcode{approx = "normality"}), 
the process does not require permutations. 
The z-scores are converted to \Rcode{p.scores} using 
the \Rcode{pnorm} routine. 
Likewise, \Rcode{approx = "t"} and \Rcode{approx = "gamma"} 
respectively rely on \Rcode{pt} and \Rcode{pgamma}. 
Section \ref{sec:hd} contains further details on the scores.

This example applies \Rcode{approx = "normality"}, a fast option. 
For a comparison between prioritisations using 
Monte Carlo trials or the parametric z-score, 
the user can is referred to \cite{metabopicart}.

<<03_enrich, eval=TRUE>>=
analysis.epithelial <- runDiffusion(
    object = analysis.epithelial, 
    data = fella.data, 
    approx = "normality")
@

The \Rclass{FELLA.USER} object has been updated 
with the \Rcode{p.scores} from the diffusion results:

<<03_summary, results=verbatim, eval=TRUE>>=
analysis.epithelial
@

At this point, the subgraph consisting of top scoring nodes 
can be plotted in a heterogeneous network layout. 
In the presence of signal, this subgraph will exhibit 
large connected components and contain nodes from 
all the levels in the KEGG graph. 
It is also expected that the algorithm gives a high priority 
to the metabolites specified in the input, 
although not all of them must necessarily be top ranked. 

Therefore, the user should expect to find 
the presence of intermediate entities 
(reactions, enzymes and modules) that connect 
the input to relevant KEGG pathways. 
Note that \fella~can also pinpoint new KEGG 
compounds as potentially relevant.

In this example, the plot is limited to $150$ nodes 
using the \Rfunarg{nlimit} argument from \Rfunction{plot}. 

<<03_plot, fig=TRUE, fig.wide=TRUE, fig.asp=1, eval=TRUE>>=
nlimit <- 150
vertex.label.cex <- .5
plot(
    analysis.epithelial, 
    method = "diffusion", 
    data = fella.data, 
    nlimit = nlimit, 
    vertex.label.cex = vertex.label.cex)
@

In the original work \cite{epithelial}, the activation of 
the \textbf{glycerophosphocholine synthesis} rather than 
the \textbf{carnitine} response is a main result. 
\fella~highlights\warnkegg~the related pathway 
\textit{choline metabolism in cancer} and the 
\textit{choline} metabolite as well. 
Another key process is the \textbf{O-linked glycosilation}, 
which is close to the KEGG module 
\textit{O-glycan biosynthesis, mucin type core} and to 
the KEGG pathway \textit{Mucin type O-glycan biosynthesis}.
Finally, \fella~reproduces the finding of \textbf{UAP1} 
by reporting the enzyme \textit{2.7.7.23}, named 
\textit{UDP-N-acetylglucosamine diphosphorylase}. 
\textbf{UAP1} is a key protein in the study, 
pinpointed by iTRAQ and validated via western blot. 


\subsubsection{Exporting the results}\label{sec:export}

After an initial exploration of the results, 
these can be exported using three functions that 
lead to network and tabular formats. 

The top scoring nodes can be exported as a network in 
\CRANpkg{igraph} with the function \Rfunction{generateResultsGraph}. 
The number $k$ of nodes in the subgraph is controlled 
by the most stringent filter between \Rfunarg{nlimit} 
(limit on the number of nodes) and \Rfunarg{threshold} 
(limit on the \Rcode{p.score}). 

Once $k$ is determined, the argument \Rfunarg{thresholdConnectedComponent} 
further filters small connected components from the subgraph, 
implying that the resulting subgraph can have less than $k$ nodes. 
A connected component of order $r$ will be kept only if 
the probability that a random subgraph of order $k$ contains 
a connected component of order at least $r$ is smaller than 
the specified threshold.
In other words, small connected components can arise 
from random sampling of the subgraph, whereas larger 
connected components are highly unlikely under a uniform sampling. 
The user can filter connected components that are 
too small to be meaningful in that sense. 

Lastly, the argument \Rfunarg{LabelLengthAtPlot} allows to 
truncate the KEGG names at the given number of 
characters for visualisation purposes.

<<04_graph, results=verbatim, eval=TRUE>>=
g <- generateResultsGraph(
    object = analysis.epithelial, 
    method = "diffusion", 
    nlimit = nlimit, 
    data = fella.data)
g
@

The exported (sub)graph can be further complemented with 
data from GO, the \href{www.geneontology.org}{Gene Ontology} 
\cite{geneontology}. 
Specifically, the enzymes can be equipped with annotations 
from their underlying genes in any ontology from GO. 
Note that this requires additional packages: \Biocpkg{biomaRt} 
and \Biocpkg{org.Hs.eg.db}.
The latter should be changed in case the analysis and 
the database are not from Homo sapiens.

The function \Rfunction{addGOToGraph} achieves this by 
accepting a query GO term and computing the semantic similarity of 
all the genes within each enzyme to the query GO term. 
The semantic similarity is detailed and implemented in the 
package \Biocpkg{GOSemSim} \cite{gosemsim}.

In the current example, enzymes are going to be compared to 
the GO cellular component term 
\href{http://amigo.geneontology.org/amigo/term/GO:0005739}{mitochondrion}.
Enzymes that contain genes whose cellular component is 
closer or coincident with the mitochondrion will be highlighted.

<<04_go, results=verbatim, eval=TRUE>>=
# GO:0005739 is the term for mitochondrion
g.go <- addGOToGraph(
    graph = g, 
    GOterm = "GO:0005739", 
    godata.options = list(
        OrgDb = "org.Hs.eg.db", ont = "CC"), 
    mart.options = list(
        biomart = "ensembl", dataset = "hsapiens_gene_ensembl"))
g.go
@

Plotting the graph with the function \Rfunction{plotGraph} 
reveals the addition of the GO term due to 
a slight change in the plotting legend. 
Enzyme nodes have a different shape and their colour scale 
reflects their degree of similarity to the queried GO term. 

<<04_plot_go, fig=TRUE, eval=TRUE>>=
plotGraph(
    g.go, 
    vertex.label.cex = vertex.label.cex)
@

The second way to export the enrichment results is to 
write the data from the KEGG entries in the top $k$ 
p.scores using \Rfunction{generateResultsTable}. 
This function accepts arguments similar to those 
in \Rfunction{generateResultsTable}. 

<<04_table, results=tex, eval=TRUE>>=
tab.all <- generateResultsTable(
    method = "diffusion", 
    nlimit = 100, 
    object = analysis.epithelial, 
    data = fella.data)
# Show head of the table
knitr::kable(head(tab.all), format = "latex")
@

The last exporting option, \Rfunction{generateEnzymesTable}, 
is to a tabular format with details from the enzymes 
reported among the top $k$ KEGG entries. 
In particular, the table contains the genes that belong to 
each enzyme family, separated by semicolons.

<<04_enzyme, results=tex, eval=TRUE>>=
tab.enzyme <- generateEnzymesTable(
    method = "diffusion", 
    nlimit = 100, 
    object = analysis.epithelial, 
    data = fella.data)
# Show head of the table
knitr::kable(head(tab.enzyme, 10), format = "latex")
@

The three exporting options shown above are included in 
the wrapper function \Rfunction{exportResults}, 
using \Rcode{format = "csv"} for the general tabular data, 
\Rcode{format = "enzyme"} for the enzyme tabular data and 
\Rcode{format = "igraph"} for saving an \Rcode{.RData} 
object with the igraph sub-network object. 

For instance, the general tabular data:

<<04_file, eval=TRUE>>=
tmpfile <- tempfile()
exportResults(
    format = "csv", 
    file = tmpfile, 
    method = "diffusion", 
    object = analysis.epithelial, 
    data = fella.data)
@

If the argument \Rfunarg{format} is none of the former, 
\fella~saves the sub-network using \Rfunction{write.graph} 
from the \CRANpkg{igraph} package with the desired format.

<<04_pajek, eval=TRUE>>=
tmpfile <- tempfile()
exportResults(
    format = "pajek", 
    file = tmpfile, 
    method = "diffusion", 
    object = analysis.epithelial, 
    data = fella.data)
@

\subsubsection{Deploying the graphical user interface}

\fella~is equipped with a graphical user interface that 
eases data analysis without learning the package syntax. 
The app is divided in the following tabs:

\begin{itemize}

\item Compounds upload (figure \ref{fig:shiny01}): 
contains a general description of 
the tabs and a handle to submit the input metabolite list 
as a text file. Examples are provided as well. 
The right panel shows the mapped and the mismatching 
compounds with regard to the default database.

\begin{figure}[!h]
\centerline{\includegraphics[width=11cm]{shiny1.png}}
\caption{Graphical interface: compounds upload}
\label{fig:shiny01}
\end{figure}

\item Advanced options (figure \ref{fig:shiny02}): 
widgets that contain the main 
function arguments for customising the enrichment procedure. 
Allows database choice from the internal package directory, 
method and approximation choice and parameter tweaking. 
It also allows defining a GO label for the semantic similarity 
analysis on the reported enzymes.

\begin{figure}[!h]
\centerline{\includegraphics[width=11cm]{shiny2.png}}
\caption{Graphical interface: advanced options}
\label{fig:shiny02}
\end{figure}

\item Results (figure \ref{fig:shiny03}): 
interactive plot with the sub-graph with the 
top $k$ KEGG entries. Nodes can be selected, 
queried and link to the KEGG entries when hovered. 
Below the network lies an interactive table with the graph nodes, 
allowing the user to look into particular entries.

\begin{figure}[!h]
\centerline{\includegraphics[width=12cm]{shiny3.png}}
\caption{Graphical interface: results}
\label{fig:shiny03}
\end{figure}

\item Export (figure \ref{fig:shiny04}): 
several tabular and network exporting options.

\begin{figure}[!h]
\centerline{\includegraphics[width=12cm]{shiny4.png}}
\caption{Graphical interface: export}
\label{fig:shiny04}
\end{figure}

\end{itemize}

The app is based on \CRANpkg{shiny} \cite{shiny} and can 
be launched through \Rfunction{launchApp}.

\subsubsection{Helper functions}

\fella~is equipped with helper functions that ease the 
user experience and avoid direct manipulation of the S4 classes. 
Some of them have been already introduced - a complete 
enumeration of the exported functions is hereby provided. 

Functions of the type \Rcode{get-} ease object and slot 
retrieval, with the following possibilities: 
\Rfunction{getBackground}, \Rfunction{getExcluded}, 
\Rfunction{getInfo}, \Rfunction{getInput}, 
\Rfunction{getName}, \Rfunction{getPscores}. 

On the other hand, functions starting by \Rcode{list-} 
provide general purpose data about the package 
(\Rfunction{listMethods}, \Rfunction{listApprox}, 
\Rfunction{listCategories}) and a listing of the available 
internal databases (\Rfunction{listInternalDatabases}).

Finally, functions starting by \Rcode{is-} check if 
an object belongs to a certain class: \Rfunction{is.FELLA.DATA} 
and \Rfunction{is.FELLA.USER}.


\subsection{Ovarian cancer cells dataset}

The next example has been extracted from the study on 
metabolic responses of ovarian cancer cells \cite{ovarian}. 
The original files can be found in the MTBLS150 study 
in the Metabolights respository: 
\url{https://www.ebi.ac.uk/metabolights/MTBLS150}.
OCSCs are isogenic ovarian cancer stem cells derived from 
the OVCAR-3 ovarian cancer cells.
The abundances of six metabolites are affected by the 
exposure to several environmental conditions: 
glucose deprivation, hypoxia and ischemia 
(column \textq{All} in \textq{Figure 3} from their main manuscript).

The common names have been converted to KEGG ids prior to applying \fella. 
The analysis is performed using the wrapper \Rfunction{enrich} 
that maps the compounds to the internal representation 
and runs the desired methods.

<<05, fig=TRUE, eval=TRUE>>=
compounds.ovarian <- c(
    "C00275", "C00158", "C00042", 
    "C00346", "C00122", "C06468")
analysis.ovarian <- enrich(
    compounds = compounds.ovarian, 
    data = fella.data, 
    methods = "diffusion")

plot(
    analysis.ovarian, 
    method = "diffusion", 
    data = fella.data, 
    nlimit = 150, 
    vertex.label.cex = vertex.label.cex, 
    plotLegend = FALSE)
@

The resulting subnetwork\warnkegg~reports several 
\textbf{TCA cycle}-related entities, also reported by 
the authors and by previous work \cite{ovarian_tca_fumarate_succinate}. 
It also mentions \textit{sphingosine degradation}, 
closely related to the reported \textbf{sphingosine metabolism} 
in the original work. 
Enzymes that have been formerly related to 
cancer are suggested within the TCA cycle, 
like \textit{fumarate hydratase} 
\cite{ovarian_fumarate_pithukpakorn,ovarian_4_2_1_2_lehtonen, 
ovarian_tca_fumarate_succinate} 
\textit{succinate dehydrogenase} 
\cite{ovarian_succinate_dehydrogenase_ni,ovarian_tca_fumarate_succinate} 
and \textit{aconitase} \cite{ovarian_4_2_1_3_singh}.
Another suggestion is \textit{lysosome} - 
lysosomes suffer changes in cancer cells and directly affect 
apoptosis \cite{ovarian_lysosome_kirkegaard}. 
Finally, the graph contains several \textit{hexokinases}, 
potential targets to disrupt glycolysis, a fundamental need 
in cancer cells \cite{ovarian_hexokinase_kaelin}.

\subsection{Malaria dataset}

The metabolites in the last example are related to 
the distinction between malaria and other febrile ilnesses in \cite{malaria}. 
The study files can be found under the MTBLS315 identifier in Metabolights: 
\url{https://www.ebi.ac.uk/metabolights/MTBLS315}.
Specifically, the list of KEGG identifiers has been 
extracted from the supplementary data spreadsheet, 
using all the possible KEGG matches for the 
\textq{non malaria} patient group. 

<<06, fig=TRUE, eval=TRUE>>=
compounds.malaria <- c(
    "C05471", "C14831", "C02686", "C06462", "C00735", "C14833", 
    "C18175", "C00550", "C01124", "C05474", "C05469")

analysis.malaria <- enrich(
    compounds = compounds.malaria, 
    data = fella.data, 
    methods = "diffusion")

plot(
    analysis.malaria, 
    method = "diffusion", 
    data = fella.data, 
    nlimit = 50, 
    vertex.label.cex = vertex.label.cex, 
    plotLegend = FALSE)
@

In this case, the depicted subnetwork\warnkegg~contains 
the modules \textit{C21-Steroid hormone biosynthesis, 
progesterone => corticosterone/aldosterone} and 
\textit{C21-Steroid hormone biosynthesis, progesterone => cortisol/cortisone}, 
related to the \textbf{corticosteroids} as a main 
pathway reported in the original text.  
This is part of the also reported 
\textit{Aldosterone synthesis and secretion}; 
aldosterone is known to show changes related to 
fever as a metabolic response to infection \cite{mala_nonmala_aldosterone}.
Another plausible hit in the sub-network is 
\textit{linoleic acid metabolism}, as erythrocytes infected 
by various malaria parasytes can be enriched in 
linoleic acid \cite{mala_nonmala_linoleic}. 
In addition, the pathway \textit{sphingolipid metabolism} 
can play a role in the immune response 
\cite{mala_nonmala_sphingo, mala_nonmala_sphingo2}. 
As for the enzymes, \textit{3alpha-hydroxysteroid 3-dehydrogenase 
(Si-specific)} and \textit{Delta4-3-oxosteroid 5beta-reductase} 
are related to three input metabolites each 
and might be candidates for further examination. 

\section{Conclusions}

The \fella~R package provides a simple, programmatic 
and intuitive enrichment tool for metabolomics summary data.
Starting from a list of metabolites, \fella~not only 
pinpoints relevant pathways but also intermediate reactions, 
enzymes and modules that links the input metabolites to the pathways.
The reported entries have a network structure focused on 
interpretability and new hypotheses generation, 
giving a richer perspective than classical pathway enrichment tools.
This comprehensive layout can also suggest potential 
enzymes and new metabolites for further study.
Finally, \fella~comes equipped with a graphical user interface 
that promotes its usage to a wider audience and offers 
interactive sub-network examination.

\section{Funding}
This work was supported by the Spanish
Ministry of Economy and Competitiveness (MINECO)
[BFU2014-57466-P to O.Y., TEC2014-60337-R and DPI2017-89827-R to A.P.].
O.Y., A.P. and S.P. thank for funding the
Spanish Biomedical Research Centre in Diabetes
and Associated Metabolic Disorders (CIBERDEM)
and the Networking Biomedical Research Centre
in the subject area of Bioengineering,
Biomaterials and Nanomedicine (CIBER-BBN),
both initiatives of Instituto de Investigaci\'on Carlos III (ISCIII).
SP. thanks the AGAUR FI-scholarship programme.

\newpage

\bibliography{bibliography}

\newpage

\appendix

\section{Session info}

Here is the output of \Rfunction{sessionInfo()} on the system that
compiled this vignette:

<<sessioninfo, results=tex, echo=FALSE>>=
toLatex(sessionInfo())
@

\end{document}