% \VignetteIndexEntry{RTools4TB}
% \VignetteKeywords{Expression Analysis}
% \VignettePackage{RTools4TB}

\documentclass[10pt]{article}
\usepackage{amsmath}
\usepackage{hyperref}
\usepackage[numbers]{natbib}
\textwidth=6.2in
\textheight=8.5in
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in
\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}


\begin{document}
\title{\bf The RTools4TB package: data mining of public microarray data through connections to the TranscriptomeBrowser database.}
\author{A. Bergon, F. Lopez, J. Textoris, S. Granjeaud and D. Puthier}
\maketitle


\begin{center}
TAGC/Inserm U928. Parc Scientifique de Luminy case 928. 163, avenue de Luminy. 13288 MARSEILLE cedex 09. FRANCE

   {\tt bergon@tagc.univ-mrs.fr}\\
\end{center}

\tableofcontents


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Overview}
TranscriptomeBrowser (TBrowser, \url{http://tagc.univ-mrs.fr/tbrowser}) hosts a large collection of transcriptional signatures (TS)  automatically extracted from the Gene Expression Omnibus (GEO) database. Each GEO experiment (GSE) was processed so that a subset of the original expression matrix containing the most relevant/informative genes was kept and organized into a set of homogeneous signatures \cite{lopez}. Each signature was tested for functional  enrichment using annotations terms obtained from numerous ontologies or curated databases (Gene Ontology, KEGG, BioCarta, Swiss-Prot, BBID, SMART, NIH Genetic Association DB, COG/KOG...) using the DAVID knowledgebase \cite{sher}.\\
\Rpackage{RTools4TB} is a library for data mining of public microarray data.  \Rpackage{RTools4TB} can be helpful (i) to define the biological contexts (i.e, experiments) in which a set of genes are co-expressed and (ii) to define their most frequent neighbors \cite{lopez}. The RTools4TB package also implements the DBF-MCL algorithm ("Density Based Filtering And Markov Clustering") that can be used for fast and automated partitioning of microarray data. DBF-MCL is a tree-step adaptative algorithm that (i) find elements located in dense areas (ie. clusters) (ii) uses selected items to construct a graph and (iii) performs graph partitioning using MCL \cite{vd}. Note that a UNIX-like systems is required to use DBF-MCL.


\section{Fetching transcriptional signatures from TBrowserDB}
\subsection{The getSignatures function}

Connection to the TranscriptomeBrowser database (TBrowserDB) relies on the \Rfunction{getSignatures}, \Rfunction{getExpressionMatrix} and \Rfunction{getTBInfo} functions.

Basically, the  \Rfunction{getSignatures} function can be used to retrieve transcriptional signature IDs using gene symbol(s), probe ID(s), experiment ID, microarray platform  ID or annotation term(s) as input. This is controled by the "field" argument.

<<eval=TRUE,echo=TRUE>>=
library(RTools4TB)
args(getSignatures)
@

Once the field argument is set, one need to provide a value as input. For instance the following query use gene name as input with value "PCNA".

<<eval=FALSE, echo=TRUE>>=
res <- getSignatures(field="gene", value="PCNA")
head(res)
@


Transcriptional signature IDs can also be obtained by selecting the relevant experiment IDs, platform IDs and probe IDs. To get all transcriptional signature IDs associated with GSE2004 experiment, one should use the following syntax:

<<eval=TRUE,echo=TRUE>>=
res <- getSignatures(field="experiment", value="GSE2004")
@

To get all signatures obtained on GPL96 platform, use the following syntax:

<<eval=TRUE,echo=TRUE>>=
res <- getSignatures(field="platform", value="GPL96")
@

Moreover, as all signatures were tested for functional enrichment using keywords from the DAVID knowledgebase these terms can be used to query the database (DAVID collects a wide range of annotation from several databases including: GO, BIOCARTA, KEGG, PANTHER, BBID,...) . The \Robject{annotationList} dataset contains the annotations terms.

<<eval=TRUE,echo=TRUE>>=
data(annotationList)
names(annotationList)
attach(annotationList)
annotationList[1:4,]
table(TableName)
@

The selected terms can be used to select TS IDs. In this case, user should define a q-value. For instance one can select TS enriched in genes related to the "HSA04110:CELL CYCLE" KEGG pathway with q-value below $10e^{-20}$.

<<eval=TRUE,echo=TRUE>>=
#Signatures containing genes related to Cell-Cycle
cc <- getSignatures(field="annotation", value="HSA04110:CELL CYCLE", qValue=20)
@

Of note, one can also search for TS IDs containing genes located in the same chromosomal region. For instance one can select TS IDs enriched in genes located in the "8q" region which is frequently amplified or deleted in tumors. This will point out the biological contexts in which sets of genes located in the 8q region share the same expression profile, suggesting amplifications or deletions in some biological samples.

<<eval=TRUE,echo=TRUE>>=
query <- paste(grep("^8q", Keyword,val=T),collapse="|")
query
cc <- getSignatures(field="annotation", value=query, qValue=10)
@


Next sections will introduce more complexe queries using sets of genes with or without Boolean operators.

\subsubsection{Request without logical operators (gene list)}
When field is set to  "gene" or "probe", user can perform a request using a list of item separated by blanks. These blanks are interpreted as the OR logical operators. In this case, all signatures containing at least one gene of the list will be returned. To select more informative signatures we suggest to use the nbMin argument that will select signatures containing at least nbMin genes out of the list.

The following examples search for signatures containing at least 2 genes of the input list (CD3D, CD3E and CD4).
<<eval=TRUE,echo=TRUE>>=
gl <- getSignatures(field="gene", value="CD3D CD3E CD4", nbMin=2)
head(gl)
@


\subsubsection{Request using logical operators}

The "value" argument of \Rfunction{getSignatures} may contain the following Boolean operators (see help section on TranscriptomeBrowser web site for more informations, \url{http://tagc.univ-mrs.fr/tbrowser})

\begin{itemize}
 \item \& : AND
 \item | : OR
 \item ! : NOT , (used in conjonction with \&)
\end{itemize}

This is a convenient way to create relevant queries. Suppose your field of interest is related to T-cell activation. You could be interested in retrieving all TS IDs that contain the CD4 gene as they likely contain additional T cell markers. Comparing these TS IDs should help you to define frequent CD4 neighbors (very likely related to TCR signaling cascade). Thereby, your request should be:

<<eval=TRUE,echo=TRUE>>=
res <- getSignatures(field="gene", value="CD4")
@


This gene is found in 371 TS (with the current database release). Obtaining associated gene lists would be time consuming and would not be as specific as expected. Indeed, the CD4 marker is also expressed by macrophages. Another solution would be to search for TS containing two T-cell markers (CD4 and CD3E for instance) and to exclude (using the NOT operator) those containing the CD14 marker (a macrophages marker). The syntax should be the following:

<<eval=TRUE,echo=TRUE>>=
res <- getSignatures(field="gene", value="CD4 & CD3E & !CD14")
@


In the same way, one can try to exclude TS containing B-cells markers by discarding those containing the CD19 or IGHM markers. The resulting query would be the following:


<<eval=TRUE,echo=TRUE>>=
res <- getSignatures(field="gene", value="CD4 & CD3E & !(CD19 | IGHM)")
@


\subsection{Finding the biological contexts in which sets of genes are co-expressed}

As mentioned by Lacroix \textit{et al.}, ESR1, GATA3,  XBP1 are co-expressed in breast cancer tumors (see \cite{lac}). This assumption can be easily verified using \Rpackage{RTools4TB}. For instance, in the following examples, we fetch transcriptional signature IDs that contain "XBP1 \& ESR1 \& GATA3". Next the \Rfunction{getTBInfo} function is used to retrieve the experiment description from which they are derived (here only for TS ID "3DE64836D").


<<eval=TRUE, echo=TRUE>>=
TS <- getSignatures(field="gene", value="XBP1 & ESR1 & GATA3")
head(TS)
a <- getTBInfo(field="signature", value="3DE64836D", verbose=FALSE)
exp <- a["Experiment",1]
info <- getTBInfo(field="experiment", value=exp, verbose=TRUE)
@

As expected the transcriptional signature "3DE64836D" correspond to a breast cancer tumor analysis. This is also true for the other TS (not shown).


\subsection{Finding transcriptional neighbors}

One interesting feature of \Rpackage{RTools4TB} its ability to find genes frequently co-expressed with the input list. Indeed, results from a request to TBrowserDB can be displayed as a graph using the \Rfunction{createGraph4BioC} function. This function retrieves the list of TS that verify the constrain (here "XBP1 \& ESR1 \& GATA3"). A list of gene falling in at least one of the TS is next computed. A gene-gene matrix $M$ is created that will record for each pair of gene the number of time they were observed in the same signature. In the following example, only a subset of this adjacency matrix (containing  genes falling in a significant proportion of signatures, \Rfunction{prop=80\%}) is used to create a graph.

<<eval=TRUE,echo=TRUE>>=
library(biocGraph)
@

<<eval=FALSE,echo=TRUE>>=
adjMat <- createGraph4BioC(request="XBP1 & ESR1 & GATA3", prop=80)
@
<<eval=TRUE,echo=FALSE>>=
adjMat <- as.matrix(read.table("ftp://tagc.univ-mrs.fr/public/TranscriptomeBrowser/RTools4TB/adjMat.txt", sep="\t", quote="", head=TRUE, row=1))
@
<<graph,fig=TRUE,prefix=FALSE,echo=TRUE,include=FALSE>>=
g1 <- new("graphAM", adjMat=adjMat)
nodes(g1)
nAt <- makeNodeAttrs(g1)
nAt$fillcolor[match(rownames(as.matrix(nAt$fillcolor)), c("GATA3", "XBP1", "ESR1"), nomatch=F)!=0] <- "green"
nAt$fillcolor[match(rownames(as.matrix(nAt$fillcolor)), c("TBC1D9", "FOXA1"), nomatch=F)!=0] <- "yellow"
plot(g1,"fdp",nodeAttrs=nAt)
@

As expected the list of gene contains "XBP1 \& ESR1 \& GATA3" but also FOXA1/HNF3A that was reported to be co expressed with ESR1 in several experiments (see \cite{lac}). Other genes are also particularly relevant such as TBC1D9/MDR1 (Multidrug Resistance 1) (figure \ref{fig:3}).

\begin{figure}
\begin{center}
\includegraphics[width=5in,height=5in,angle=0]{graph}
\end{center}
\caption{A graph containing "XBP1 \& ESR1 \& GATA3" together with their most frequent transcriptional neighbors.}
\protect\label{fig:3}
\end{figure}


\subsection{Vizualising expression matrix.}

The TS "3DE64836D" is related to experiment "GSE7904". In this experiments, the authors were interested in analysing several classes of breast cancer tumors especially "Sporadic basal-like cancers".

<<eval=TRUE,echo=TRUE>>=
a <- getTBInfo(field="signature", value="3DE64836D", verbose=FALSE)
exp <- a["Experiment",1]
info <- getTBInfo(field="experiment", value=exp, verbose=TRUE)
@

The samples that were used are the following.
<<eval=TRUE,echo=TRUE>>=
sampleInfo <- getTBInfo(field="samples", value = "3DE64836D")
head(sampleInfo[,1:2])
@


 Using the \Rfunction{getExpressionMatrix} function, the expression matrix for signature "3DE64836D" can be fetched in order to visualize the expression profile of ESR1, GATA3 and XBP1 tumors compare to normal breast tissues.

<<eval=TRUE,echo=TRUE>>=
em <- getExpressionMatrix(signatureID="3DE64836D")
class(em)
@

The \Rfunction{getExpressionMatrix} function returns a \Rfunction{data.frame}. The first two columns store probe IDs and gene symbols. Additional columns contain corresponding expression values (figure \ref{fig:1}).

<<echo=FALSE,eval=TRUE>>=
pdf(file="heat.pdf")
@

<<echo=TRUE,eval=TRUE>>=
library(RColorBrewer)
col <- colorRampPalette(brewer.pal(10, "RdBu"))(256)
geneNames <- paste(em[,1],em[,2],sep="||")
em <- as.matrix(em[,-c(1,2)])
ind <- match(colnames(em), sampleInfo[,1])
colnames(em) <- sampleInfo[ind,2]
row <- rep(1,nrow(em))
ind <- grep("(XBP1)|(ESR1)|(GATA3)", geneNames,perl=TRUE)
row[ind] <- 2
rc <- rainbow(2, start=0, end=.3)
rc <- rc[row]
col <- colorRampPalette(brewer.pal(10, "RdBu"))(256)
split <- strsplit(colnames(em)," (", fixed=TRUE)
pheno <- unlist(lapply(split,"[",1))
pheno <- as.factor(pheno)
levels(pheno) <- 1:5
cc <- rainbow(5, start=0, end=.3)
cc <- cc[pheno]
heatmap(em, col=col, RowSideColors=rc, ColSideColors=cc, labRow=geneNames, cexRow=0.3)
@

<<echo=FALSE,eval=TRUE, results=hide>>=
dev.off()
@

\begin{figure}[!h]
\begin{center}
\includegraphics[width=4in,height=4in,angle=0]{heat}
\end{center}
\caption{The expression matrix corresponding to signature "3DE64836D". }
\protect\label{fig:1}
\end{figure}

Of note, the \Rfunction{plotGeneExpProfiles} is a high level function to visualize gene expression levels in a signature (figure \ref{fig:2}).

<<profil,fig=TRUE,prefix=FALSE,echo=TRUE,include=FALSE>>=
  plotGeneExpProfiles(data=em, X11=FALSE)
@


\begin{figure}[!h]
\begin{center}
\includegraphics[width=3in,height=3in,angle=0]{profil}
\end{center}
\caption{Gene expression profiles of signatures containing XBP1, ESR1 and GATA3: the centroid is highlighted in green.}
\protect\label{fig:2}
\end{figure}

\section{Creating transcriptional signatures from a user defined data set using DBF-MCL algorithm.}

When analyzing a noisy dataset, one is interested in isolating dense regions as they are populated with genes/elements that display weak distances to their nearest neighbors (i.e. strong profile similarities). To isolate these regions DBF-MCL computes, for each gene/element, the distance with its kth nearest neighbor (DKNN). In order to define a critical DKNN value that will depend on the dataset and below which a gene/element will be considered as falling in a dense area, DBF-MCL computes simulated DKNN values by using an empirical randomization procedure. Given a dataset containing n genes and p samples, a simulated DKNN value is obtained by sampling n distance values from the gene-gene distance matrix D and by extracting the kth-smallest value. This procedure is repeated n times to obtain a set of simulated DKNN values S. Computed distributions of simulated DKNN are used to compute a FDR value for each observed DKNN value. The critical value of DKNN is the one for which a user-defined FDR value (typically 10\%) is observed. Genes with DKNN value below this threshold are selected and used to construct a graph. In this graph, edges are constructed between two genes (nodes) if one of them belongs to the k-nearest neighbors of the other. Edges are weighted based on the respective coefficient of correlation (\emph{i.e.}, similarity) and the graph obtained is partitioned using the Markov CLustering algorithm (MCL).

\subsection{Installation}
With the current implementation DBFMCL function works only on UNIX-like plateforms. MCL is required and can be installed  using your package manager or using the following command lines pasted in a terminal:

\begin{verbatim}
# Download the latest version of mcl
# (the library has been tested successfully with the 06-058 version)
wget http://micans.org/mcl/src/mcl-latest.tar.gz
# Uncompress and install mcl
tar xvfz mcl-latest.tar.gz
cd mcl-xx-xxx
./configure
make
sudo make install
# You should get mcl in your path
mcl -h

\end{verbatim}

\subsection{Examples}
We will search for transcriptional signatures in a subset of the ALL dataset.
<<eval=FALSE, echo=TRUE>>=
library(ALL)
data(ALL)
sub <- exprs(ALL)[1:3000,]
@

First, we will normalize the data set using the \Rfunction{doNormalScore} function. This function performs normal score transformation of a matrix. The \Rfunction{doNormalScore} transforms each sample to follow a normal distribution (with \Rfunction{mean = 0} and \Rfunction{sd = 1}). Alternatively, users may also use other normalization routines such as : \Rfunction{doRankTransformation} or \Rfunction{limma::normalizeQuantiles}.

<<eval=FALSE, echo=TRUE>>=
subNorm <- doNormalScore(sub)
@


The \Rfunction{DBFMCL} function allows one to extract TS from a data set. Its behaviour is controlled by several arguments.

<<eval=TRUE, echo=TRUE>>=
args(DBFMCL)
@
The \Rfunction{DBFMCL} function accepts a tab-delimited file (argument \Robject{filename}), an expressionSet, a data.frame or a matrix (argument \Robject{data}) as input. The input data must contain an expression matrix with gene as rows and samples as columns. Note that space characters inside gene names are not allowed (as they are not supported by the mcl command-line program). \\

The two main parameters of DBF-MCL are \Robject{k} that controls the size of the neighborhood and the \Robject{inflation} (range 1.1 to 5) which controls the way the underlying graph is partitioned. In the following example, the neighborhood size (\Robject{k})  is set to 150 and the MCL inflation parameters is set to 2.0 (default MCL setting). Most generally these default parameters give very good results on microarray datasets. For a detailed discussion about these parameters please read the section "Performances of DBF-MCL on Complex9RN200 dataset" in the article describing TranscriptomeBrowser stategy \cite{lopez}. Morevover, in our example, the distance method is set to "pearson" although the "spearman" (that is the default method for computing TS in the TranscriptomeBrowser projet) also give very relevant results. Note that additional distance, including "euclidean" and two mixtures of pearson" and "spearman" ("spm" and "spgm") are also available.


<<eval=FALSE, echo=TRUE>>=
res <- DBFMCL(subNorm, distance.method="pearson", memory=512)
@

The results are stored in an instance of class \Robject{DBFMCLresult}.

<<eval=FALSE, echo=TRUE>>=
class(res)
@
\begin{verbatim}
[1] "DBFMCLresult"
attr(,"package")
[1] "RTools4TB"
\end{verbatim}

<<eval=FALSE, echo=TRUE>>=
res
@

\begin{verbatim}
An object of class DBFMCLresult
		Name: exprs
		Memory used:  1140412
		Number of samples:  128
		Number of informative genes:  1053
		Number of clusters:  3
		This object contains the following informations:
			 -  name
			 -  data
			 -  cluster
			 -  size
			 -  center
			 -  parameters
				 *  distanceMethod  =  pearson
				 *  k  =  150
				 *  random  =  3
				 *  fdr  =  10
				 *  set.seed  =  123
				 *  inflation  =  2
\end{verbatim}

The expression matrix is stored in the \Robject{data} slot. This matrix contains only genes detected as informative (that is falling into a cluster).

<<eval=FALSE, echo=TRUE>>=
 head(res@data[,1:2])
@

\begin{verbatim}
  01005            01010
1153_f_at   1.14383543   1.14401685
1025_g_at -0.35119623 -0.39702813
31604_at    0.21453883  0.35885572
1455_f_at  -0.06960184  0.01066486
1908_at      0.31048075  0.42706913
1492_f_at   0.27323570  0.44352271
\end{verbatim}

The partitioning results are stored in the \Robject{cluster} slot.
<<eval=FALSE, echo=TRUE>>=
slotNames(res)
@

\begin{verbatim}
[1] "name"       "data"       "cluster"    "size"       "center"
[6] "parameters"
\end{verbatim}

Here, 3 TS were found.
<<eval=FALSE, echo=TRUE>>=
res@size
@
\begin{verbatim}
[1] 498 461  94
\end{verbatim}

The following instruction can be used to get the expression matrix corresponding to the first TS.

<<eval=FALSE, echo=TRUE>>=
res@data[res@cluster ==1,]
@


The high level function \Rfunction{plotGeneExpProfiles} can be used to visualize, for instance, gene expression profiles corresponding to the first signature.
<<eval=FALSE, echo=TRUE>>=
plotGeneExpProfiles(res, sign=1)
@

To stored the partitioning results onto disk (as a tab-delimited file), use the \Rfunction{writeDBFMCLresult} function as show below.
<<eval=FALSE, echo=TRUE>>=
writeDBFMCLresult(res, filename.out="ALL.sign.txt")
@

\begin{thebibliography}{2}

\bibitem[1]{lopez} Lopez F.,Textoris J., Bergon A., Didier G., Remy E., Granjeaud S., Imbert J. , Nguyen C. and Puthier D. TranscriptomeBrowser: a powerful and
flexible toolbox to explore productively the transcriptional landscape of the Gene Expression Omnibus database. PLoSONE, 2008;3(12):e4001.
\bibitem[2]{sher} Sherman BT, Huang DW, Tan Q,  Guo Y, Bour S, Liu D, Stephens R, Baseler MW, Lane HC, Lempicki RA. DAVID Knowledgebase: A Gene-centered Database Integrating Heterogeneous Gene Annotation Resources to Facilitate High-throughput Gene Functional Analysis. BMC Bioinformatics. 2007 Nov 2;8(1):426.
\bibitem[3]{vd} Van Dongen S. (2000) A cluster algorithm for graphs. National Research Institute for Mathematics and Computer Science in the 1386-3681.
\bibitem[4]{lac} Lacroix M, Leclercq G. About GATA3, HNF3A, and XBP1, three genes co-expressed with the oestrogen receptor-alpha gene (ESR1) in breast cancer.  Mol Cell Endocrinol. 2004 Apr 30;219(1-2):1-7.


\end{thebibliography}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\end{document}