%\VignetteIndexEntry{The oposSOM users guide}
\documentclass{article}
\usepackage{hyperref}
\usepackage[authoryear,round]{natbib}
\usepackage{graphicx}

\begin{document}
\SweaveOpts{concordance=TRUE}

\title{The oposSOM Package}
\author{Henry L\"offler-Wirth, Martin Kalcher}
\maketitle


High-throughput technologies such as whole genome transcriptional profiling
have revolutionized molecular biology and provide an enormous amount of data. On
the other hand, these techniques pose elementary methodological challenges
simply by the huge and ever increasing amount of data produced: researchers
need adequate tools to extract the information content of the data in an
effective and intelligent way. This includes algorithmic tasks such as data
compression and filtering, feature selection, linkage with the functional
context, and proper visualization.
The latter task is especially important because an intuitive
visualization of massive data clearly promotes quality control, the discovery
of their intrinsic structure, functional data mining and finally the generation
of hypotheses.
We aim at adopting a holistic view on the gene activation patterns seen in
expression studies rather than considering single genes or single pathways.
This view requires methods which support an integrative and reductionist
approach to disentangle the complex gene-phenotype interactions related to
cancer genesis and progression. With this motivation we implemented an analysis
pipeline based on data processing by a Self-Organizing Map (SOM)
\citep{Wirth2011,Wirth2012,Loffler-Wirth2015}. This approach simultaneously
searches for features which are differentially expressed and correlated in
their profiles in the set of samples studied. We include functional
information about such co-expressed genes to extract distinct functional
modules inherent in the data and attribute them to particular types of cellular
and biological processes such as inflammation, cell division, etc. This modular
view facilitates the understanding of the gene expression patterns
characterizing different cancer subtypes on the molecular level. Importantly,
SOMs preserve the information richness of the original data, allowing the
detailed study of the samples after SOM clustering. A central role in our
analysis is played by the so-called expression portraits, which serve as
intuitive and easy-to-interpret fingerprints of the transcriptional activity of
the samples. Their analysis provides a holistic view on the expression patterns
activated in a particular sample. They also allow the identification and
interpretation of outlier samples and thus help to improve data quality
\citep{Hopp2013a,Hopp2013}.


\section{Example data: transcriptome of healthy human tissue samples}

The data was downloaded from the Gene Expression Omnibus repository
(\href{http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE7307}%
{GEO accession no. GSE7307}).
About 20,000 genes in more than 650 tissue samples were measured using the
Affymetrix HGU133-Plus2 microarray. A subset of 12 selected tissues from
different categories is used here as an example data set for the oposSOM package.

\section{Setting up the environment}

To set the analysis parameters and to create the enclosing environment,
\textbf{opossom.new} must be called first. If a parameter is not explicitly
defined, its default value will be used (see also the Parameter settings section):

<<>>=
library(oposSOM)
env <- opossom.new(list(dataset.name="Tissues",
                        dim.1stLvlSom=20))
@

\ \\
The oposSOM package requires expression data as input, for example preprocessed RNA microarray or sequencing data. We recommend transforming the data to logarithmic scale before passing them to the pipeline.\\
The workflow accepts two formats. The first is a simple two-dimensional numerical matrix in which the rows represent the genes and the columns represent the samples:

<<>>=
data(opossom.tissues)
str(opossom.tissues, vec.len=3)

env$indata <- opossom.tissues
@
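
If the expression values are still on a linear scale, they can be log-transformed before being handed to the pipeline. The following chunk is a minimal sketch and not part of the package: \textit{raw.intensities} is a hypothetical matrix of toy values, and the pseudocount of 1 is an assumption made here to avoid taking the logarithm of zero.

<<eval=FALSE>>=
# Hypothetical linear-scale matrix: rows are genes, columns are samples
raw.intensities <- matrix(c(0, 15, 255, 1023, 3, 63),
                          nrow=3,
                          dimnames=list(c("gene1", "gene2", "gene3"),
                                        c("sampleA", "sampleB")))

# log2 transformation; the pseudocount of 1 is an assumption
log.intensities <- log2(raw.intensities + 1)
log.intensities
@

A matrix transformed in this way could then be assigned to \texttt{env\$indata} in the same manner as the example data.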

\pagebreak
Secondly, the input data can also be given as a \textit{Biobase::ExpressionSet} object:

<<>>=
data(opossom.tissues)

library(Biobase)
opossom.tissues.eset = ExpressionSet(assayData=opossom.tissues)
opossom.tissues.eset

env$indata <- opossom.tissues.eset
@


\ \\
Each sample may be assigned to a distinct group and a corresponding color to
improve data visualization and result presentations. \textit{group.labels} can also be set to \textit{"auto"} to apply unsupervised grouping of samples according to their expression module activation patterns.
If no group labels are provided, all samples will be collected within one group
and colored using a standard scheme.

<<>>=
env$group.labels <- c(rep("Homeostasis", 2),
                          "Endocrine",
                          "Digestion",
                          "Exocrine",
                          "Epithelium",
                          "Reproduction",
                          "Muscle",
                      rep("Immune System", 2),
                      rep("Nervous System", 2) )
@
<<>>=
env$group.colors <- c(rep("gold", 2),
                          "red2",
                          "brown",
                          "purple",
                          "cyan",
                          "pink",
                          "green2",
                      rep("blue2", 2),
                      rep("gray", 2) )
@
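
To keep labels and colors consistent, the color vector can also be derived from the labels through a named lookup vector. This is a sketch in base R and not a feature of the package; \textit{group.palette} is a name introduced here for illustration only.

<<eval=FALSE>>=
group.labels <- c(rep("Homeostasis", 2), "Endocrine", "Digestion",
                  "Exocrine", "Epithelium", "Reproduction", "Muscle",
                  rep("Immune System", 2), rep("Nervous System", 2))

# One color per group (hypothetical helper vector)
group.palette <- c("Homeostasis"="gold", "Endocrine"="red2",
                   "Digestion"="brown", "Exocrine"="purple",
                   "Epithelium"="cyan", "Reproduction"="pink",
                   "Muscle"="green2", "Immune System"="blue2",
                   "Nervous System"="gray")

# Look up the color of each sample's group
group.colors <- unname(group.palette[group.labels])

# Every sample must have received a color
stopifnot(!anyNA(group.colors),
          length(group.colors) == length(group.labels))
@

With this approach a misspelled group label is caught by the final check instead of silently producing a missing color.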

\pagebreak
Alternatively, the \textit{group.labels} and \textit{group.colors} can also be defined within the phenotype information of the ExpressionSet:

<<>>=
group.info <- data.frame( 
                  group.labels = c(rep("Homeostasis", 2),
                                       "Endocrine",
                                       "Digestion",
                                       "Exocrine",
                                       "Epithelium",
                                       "Reproduction",
                                       "Muscle",
                                   rep("Immune System", 2),
                                   rep("Nervous System", 2) ),

                  group.colors = c(rep("gold", 2),
                                       "red2",
                                       "brown",
                                       "purple",
                                       "cyan",
                                       "pink",
                                       "green2",
                                   rep("blue2", 2),
                                   rep("gray", 2) ),
													
                  row.names=colnames(opossom.tissues))
@
<<>>=
opossom.tissues.eset = ExpressionSet(assayData=opossom.tissues,
                                     phenoData=AnnotatedDataFrame(group.info) )
opossom.tissues.eset

env$indata <- opossom.tissues.eset
@


\pagebreak
Finally, the pipeline will run through all analysis modules without further
input. Periodic status messages report running and completed tasks. Please
note that the tissue example will take approximately 30 minutes to finish,
depending on the user's hardware:

<<eval=FALSE>>=
opossom.run(env)
@

\begin{figure}[h!]
	\begin{center}
	\includegraphics[width=0.9\textwidth]{Summary.pdf}
	\end{center}
	\caption{A few selected results provided by the oposSOM package: (a) Expression
landscape portraits represent fingerprints of transcriptional activity.
The \textit{group.labels} and \textit{group.colors} parameters are used to
arrange and represent the samples throughout all analyses. (b) Functional
expression modules are identified in the expression landscapes and described
using appropriate summary portraits (left part), and expression profiles,
enrichment analyses and differential gene lists (right part). (c) Sample
similarity structure is analysed using different algorithms and distance
metrics. Here a clustered pairwise correlation matrix is shown.}
	\label{fig:Results summary}
\end{figure}

\pagebreak
\section{Browsing the results}

The pipeline will store the results in a defined folder structure. These
results comprise a variety of PDF documents with plots and images addressing
the input data, supplementary descriptions of the SOM generated, the metadata
obtained by the SOM algorithm, the sample similarity structures and also
functional annotations. The PDF reports are accompanied by detailed CSV
spreadsheets which render the complete information richness accessible.\\
Figure~\ref{fig:Results summary} shows a few selected outputs generated by the
pipeline. The expression landscape portraits
(Figure~\ref{fig:Results summary}a) represent fingerprints of transcriptional
activity. They are used to identify functional expression modules, which are
further visualized and evaluated (Figure~\ref{fig:Results summary}b). Sample
similarity structure is analysed using different algorithms and distance
metrics, for example by clustering the pairwise sample correlation matrix
(Figure~\ref{fig:Results summary}c).\\
A detailed description of the respective algorithms and visualizations would
exceed the scope of this outline. We therefore refer to our publications on
methodical issues and applications of the pipeline
\citep{Wirth2011,Wirth2012m,Wirth2012,Wirth2012a,Steiner2012,Binder2012,Hopp2013a,Hopp2013}.\\
HTML files are generated to provide straightforward access to this large
number of analysis results (see Figure~\ref{fig:Results HTML}). They guide the
user by presenting the most important links at a glance and by leading from
one analysis module to another. The \textbf{Summary.html} is the starting point
of this browsing and can be found in the results folder created by the oposSOM
pipeline.
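
The summary page can also be located and opened from within R. The sketch below assumes that the pipeline was run in the current working directory; it searches for \textit{Summary.html} recursively rather than assuming a particular name of the results folder, and it uses only base R and \textit{utils} functions.

<<eval=FALSE>>=
# Search the working directory recursively for the summary page
summary.file <- list.files(".", pattern="^Summary\\.html$",
                           recursive=TRUE, full.names=TRUE)

# Open the first hit in the default browser (interactive sessions only)
if (length(summary.file) > 0 && interactive()) {
  browseURL(normalizePath(summary.file[1]))
}
@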

\pagebreak

\begin{figure}[h!]
	\begin{center}
	\includegraphics[width=0.9\textwidth]{HTML.pdf}
	\end{center}
	\caption{HTML files allow browsing all results provided by the oposSOM
package: (a) The central \textit{Summary.html} serves as starting point and
contains general information and results, as well as links to other HTML files
such as (b) the sample summary page, (c) the spot module summary page and (d)
the functional analyses page.}
	\label{fig:Results HTML}
\end{figure}

\pagebreak

\section{Parameter settings}

All parameters are optional and will be set to default values if missing.
However, we recommend adapting the following parameters to the respective
analysis:

\begin{itemize}
	\item \textit{dataset.name} (character): name of the dataset. Used to name
    results folder and environment image (default: "Unnamed").
	\item \textit{dim.1stLvlSom} (integer): dimension of primary SOM
    (default: "auto"). Given as a single value defining the size of the square
    SOM grid. Use "auto" to set SOM size to recommendation (see below).
	\item \textit{feature.centralization} (boolean): enables or disables
    centralization of the features (default: TRUE).
	\item \textit{sample.quantile.normalization} (boolean): enables quantile
    normalization of the samples (default: TRUE).
\end{itemize}
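
For illustration, the parameters listed above can be passed to \textbf{opossom.new} in the same way as in the introductory example. The values shown here are the documented defaults, apart from the dataset name; the chunk is a sketch and is not evaluated in this document.

<<eval=FALSE>>=
library(oposSOM)

# Primary parameters set explicitly; all other parameters keep their defaults
env <- opossom.new(list(dataset.name="Tissues",
                        dim.1stLvlSom="auto",
                        feature.centralization=TRUE,
                        sample.quantile.normalization=TRUE))
@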

\ \\
Database parameters are required to enable gene annotations and functional analyses (details are given below):

\begin{itemize}
  \item \textit{database.dataset} (character): type of Ensembl dataset
    queried using the biomaRt interface (default: "auto"). Use "auto" to detect
    database parameters automatically.
	\item \textit{database.id.type} (character): type of rowname identifier in
    biomaRt database (default: ""). Obsolete if
    \textit{database.dataset="auto"}. 
    
\end{itemize}

\ \\
The parameters below are secondary and may be left unattended by the user:

\begin{itemize}
  \item \textit{activated.modules} (list): activates/deactivates pipeline functionalities:
    \begin{itemize} 
      \item \textit{reporting} (boolean): enables or disables output of PDF and CSV results and HTML summaries (default: TRUE). When deactivated, only the R workspace will be stored after analysis.
      \item \textit{primary.analysis} (boolean): enables or disables data preprocessing and SOM training (default: TRUE). When deactivated, prior SOM training results are required to be contained in the workspace environment.
      \item \textit{sample.similarity.analysis} (boolean): enables or disables diversity analyses such as clustering heatmaps, correlation networks and ICA (default: TRUE).
    	\item \textit{geneset.analysis} (boolean): enables or disables geneset analysis (default: TRUE).
    	\item \textit{geneset.analysis.exact} (boolean): enables or disables p-value and FDR calculation in geneset analysis (default: FALSE). Obsolete if \textit{geneset.analysis=FALSE}.
    	\item \textit{group.analysis} (boolean): enables or disables group centered analyses such as group portraits and functional mining (default: TRUE).
    	\item \textit{difference.analysis} (boolean): enables or disables pairwise comparisons of the groups and of pairs provided by \textit{pairwise.comparison.list} as described below (default: TRUE).
    \end{itemize}
	\item \textit{dim.2ndLvlSom} (integer): dimension of the second level SOM
    (default: 20). Given as a single value defining the size of the square SOM
    grid.
	\item \textit{training.extension} (numerical, >0): factor extending the
    number of iterations in SOM training (default: 1).
	\item \textit{rotate.SOM.portraits} (integer \{0,1,2,3\}): number of rotations
    of the primary SOM in counter-clockwise fashion (default: 0). This solely
    influences the orientation of the portraits.
	\item \textit{flip.SOM.portraits} (boolean): mirroring the primary SOM along
    the bottom-left to top-right diagonal (default: FALSE). This solely
    influences the orientation of the portraits.\\
	\item \textit{standard.spot.modules} (character, one of \{"overexpression", "underexpression", "kmeans", "correlation", "group.overexpression", "dmap"\}): spot modules utilized in diverse downstream analyses and visualizations, e.g. PAT detection and module correlation map (default: "dmap").
	\item \textit{spot.threshold.modules} (numerical, between 0 and 1): spot
    detection in summary maps, expression threshold (default: 0.95).
	\item \textit{spot.coresize.modules} (integer, >0): spot detection in summary
    maps, minimum spot size (default: 3).
	\item \textit{spot.threshold.groupmap} (numerical, between 0 and 1): spot
    detection in group-specific summary maps, expression threshold
    (default: 0.75).
	\item \textit{spot.coresize.groupmap} (integer, >0): spot detection in
    group-specific summary maps, minimum spot size (default: 5).\\

	\item \textit{pairwise.comparison.list} (list of group lists): group list
    for pairwise analyses (default: NULL). Each element is a list of
    two character vectors containing the sample names to be analysed in
    pairwise comparison. The sample names must be contained in the column names
    of the input data matrix. For example, the following setting will compare
    the homeostasis samples (liver, kidney cortex) to the nervous system
    samples (accumbens, cerebral cortex), and also tongue and small intestine
    to the same nervous system samples:

<<>>=
env$preferences$pairwise.comparison.list <-
    list(list(c("liver","kidney cortex"),
              c("accumbens","cerebral cortex")),
         list(c("tongue","small intestine"),
              c("accumbens","cerebral cortex")))
@

\end{itemize}
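
Because the sample names in \textit{pairwise.comparison.list} must match the column names of the input data, it may help to check them before starting the pipeline. The following base R sketch is not part of the package; the toy matrix \textit{indata} and the variable names are assumptions made for illustration.

<<eval=FALSE>>=
# Toy input matrix; only the column (sample) names matter here
indata <- matrix(0, nrow=2, ncol=4,
                 dimnames=list(c("gene1", "gene2"),
                               c("liver", "kidney cortex",
                                 "accumbens", "cerebral cortex")))

pairwise.comparison.list <-
    list(list(c("liver", "kidney cortex"),
              c("accumbens", "cerebral cortex")))

# Collect all sample names used in the comparisons
used.samples <- unique(unlist(pairwise.comparison.list))

# Names missing from the input data (character(0) if all are present)
missing.samples <- setdiff(used.samples, colnames(indata))
missing.samples
@

An empty result indicates that every sample named in the comparisons is present in the input data.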

\pagebreak
\section{Recommended SOM size and runtime estimation}

The size of the SOM required to resolve main expression modules depends on both the number of features (e.g. genes measured) and the number of samples. Here we give a recommendation based on previous analyses of a multitude of different data sets (see Figure~\ref{fig:Size recommendation}). Additionally, we give an estimate of the runtime of the SOM training algorithm (upper limits on an Intel Core i7 system with 16GB RAM).

\begin{figure}[h!]
	\begin{center}
	\includegraphics[width=1.0\textwidth]{SizeRecommendation.pdf}
	\end{center}
	\caption{Recommended size of the SOM and estimated runtime of the SOM training on an Intel Core i7 system (16GB RAM).}
	\label{fig:Size recommendation}
\end{figure}



\pagebreak
\section{Biomart database settings}

Two parameters are required to access gene annotations and functional information via the biomaRt interface:\\
\\
\textbf{\textit{database.dataset}} defines the Ensembl data set to be queried, e.g.\\ "hsapiens\_gene\_ensembl", "mmusculus\_gene\_ensembl" or "rnorvegicus\_gene\_ensembl". A complete list of possible entries can be obtained by
<<eval=FALSE>>=
library(biomaRt)
mart<-useMart("ensembl")
listDatasets(mart)
@
The default setting "auto" will cause oposSOM to test frequently used settings of \textit{database.dataset} and \textit{database.id.type}. If this automatic download of annotation data fails, a warning will be given and manual definition of the parameters will be necessary to enable functional analyses.\\
\\
\textbf{\textit{database.id.type}} provides information about the identifier type constituted by the rownames of the expression matrix, e.g. "ensembl\_gene\_id", "refseq\_mrna" or "affy\_hg\_u133\_plus\_2". A complete list of possible entries can be obtained by
<<eval=FALSE>>=
library(biomaRt)
mart<-useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl")
listFilters(mart)
@
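
If the automatic detection fails, both parameters can be set manually when the environment is created. The chunk below is a sketch that assumes a human data set whose row names are Ensembl gene identifiers; other data sets will need the corresponding entries from \textit{listDatasets} and \textit{listFilters}.

<<eval=FALSE>>=
library(oposSOM)

# Manual database settings; the values are assumptions for a human
# data set with Ensembl gene IDs as row names
env <- opossom.new(list(dataset.name="Tissues",
                        database.dataset="hsapiens_gene_ensembl",
                        database.id.type="ensembl_gene_id"))
@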



\pagebreak
\section{Citing oposSOM}

Please cite \citep{Loffler-Wirth2015} when using the package.


\section{Details}

This document was written using:
<<>>=
sessionInfo()
@

\pagebreak
\bibliographystyle{plainnat}
\bibliography{opossom}
\end{document}