%\VignetteEngine{knitr::knitr}
%\VignetteIndexEntry{Introduction to RGSEA}
%\VignettePackage{RGSEA}
\documentclass{article}
<<style, eval=TRUE, echo=FALSE, results="asis">>=
BiocStyle::latex()
@
\title{Random Gene Set Enrichment Analysis}
\author{Chengcheng Ma}
\begin{document}
\maketitle
\section{Introduction}

Random Gene Set Enrichment Analysis (RGSEA) is an algorithm for measuring 
the similarities between samples and classifies the samples based on 
transcriptome data. The algorithm combines bootstrap aggregating[1] 
and gene set enrichment analysis (GSEA)[2], which is similar to random forests[3] 
and random generalized linear model[4]. This algorithm is non-parametric 
and does not need to fit parameter, so the robustness of this method 
is high and there is no overfitting problem. Using this algorithm, researchers 
can compare the data from different studies or classify the samples 
with the data from other studies.
There are three functions in this package: RGSEAfix, RGSEAsd and RGSEApredict.

RGSEAfix and RGSEAsd are the functions implementing RGSEA algorithm, RGSEApredict
 is for classification based on the results generated from RGSEA fix and RGSEAsd.
Seven parameters are needed to be input for RGSEAfix.
\subsection{Inputs}
\begin{itemize}
     \item query:a matrix, the query data
     \item reference: matrix, the reference data
     \item queryclasses: character vector, the classes of the query data
     \item refclasses: character vector, the classes of reference data
     \item random: numeric variable, the number of randomly sampled features
           to form the subset features.
     \item featurenum: numeric variable, the number of features selected from
           both top and bottom of the subset features
     \item iteration: numeric variable, the times of random sampling
     \item RGSEAsd includes all the seven parameters but
     \item instead a parameter named as sd, which indicates features with sd
           deviations from the mean value of the subset features be selected
           for the calculation.
\end{itemize}
RGSEAsd includes all the seven parameters but 5), instead a parameter named as sd,
 which indicates features with sd deviations from the mean value of the subset 
 features be selected for the calculation.
RGSEAprecdict is the function calculating the relative probability of the query 
data to be from a class of the refclasses based on the result generated by 
RGSEAfix or RGSEAsd. 


\section{Example Datset Description}
For RGSEA, we need two datasets - The reference data is defined as the data,
whose classes we already know.  
whereas query data is defined as the data, which we want to know the classes.
In our given example, four samples from GDS4100, stored in (e2) is the 
reference data and two samples from GDS4102,stored in (e1) is the query data.
The data were downloaded by command getGEO in GEOquery and transformed to expression
datasets with GDS2eSet. The four samples from GDS4100 are GSM356796, GSM356797(tumor),
GSM356828 and GSM356829(normal). The two samples from GDS4102 are GSM414924(tumor) 
and GSM414975(normal).


The cmap data is the part of the dataset of conncectivity map build 01 downloaded 
form 

http://0-www.ncbi.nlm.nih.gov.elis.tmu.edu.tw/geo/query/acc.cgi?acc=GSE5258.
An instance means the transcriptome of a cell line perturbed by a chemical and
its corresponding negative control. We normalized the whole data, subtract the 
controls from the corresponding perturbations. The query ddata is 
5202764005791175120104.C08, treated with thioridazine. The reference data are 
5202764005789148112904.G05,	5202764005789148112904.F03,	

5202764005789148112904.F05, 5202764005789148112904.E02,
5202764005789148112904.E04. 
They were treated with tretinoin,	prochlorperazine,	chlorpromazine,	vorinostat,	
sirolimus respectively. All the data were generated by MCF7 cell line.


\subsection {Data download}

We downloaded the file using :
<<RGSEA-DOWNLOAD, message=FALSE, warning= FALSE>>=
library(GEOquery)
g4100 <- GDS2eSet(getGEO("GDS4100"))
g4102 <- GDS2eSet(getGEO("GDS4102"))
@

\subsection{Data transformations}

The data was then transformed in the following way
<<RGSEA-transform>>=
e4102<-exprs(g4102)
e4100<-exprs(g4100)
@

\subsection{Final Exmple Data}
<<RSGEA-data>>=
e1<-e4102[,c(1,51)]
e2<-e4100[,c(1,2,23,24)]
colnames(e1)<-c("tumor", "normal")
colnames(e2)<-c("tumor","tumor","normal","normal")
@ 
This data was stored in variables e1 and e2 and is now availiable for you. 

\section{Running RGSEA}
Here are two examples for how to use RGSEA
Suppose we want to classify two samples from GDS4102 based on the data from GDS4100. 
<<>>=
library(RGSEA)
data(e1)
data(e2)
RGSEAfix(e1,e2, queryclasses=colnames(e1), refclasses=colnames(e2), random=20000, 
featurenum=1000, iteration=100)->test
@
We can see the result from "test".
<<>>=
test[[1]]
@
The column names are the class of reference data. The row names are the class 
of the query data. We can also predict the relative probability of the query data.
<<>>=
RGSEApredict(test[[1]], colnames(e2))
@

	Here, we can see that the relative probability of the probability to be each of the class.

Another example is measuring the similarities of the data from Connectivity map build 01. 

The query data is treatment of MCF7 cell line by thioridazine. The reference data 
is treatment of MCF7 cell line by tretinoin,	prochlorperazine,	chlorpromazine,	
vorinostat and sirolimus respectivelly.

<<>>=
data(cmap)
test2<-RGSEAsd(cmap[,1],cmap[,2:6], queryclasses=colnames(cmap)[1], 

refclasses=colnames(cmap)[2:6], random=5000, sd=2, iteration=100)
test2[[1]]
@
As we can see from the result, the value of chlorpromazine is the largest, which 
means among the five chemicals the function of chlorpromazine is most 
similar to thioridazine.

\section{References}
\begin{itemize}
\item  Breiman L (1996) Bagging predictors. Machine learning 24: 123-140.
\item Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, et al. (2005) 
Gene setenrichment analysis: a knowledge-based approach for interpreting genome-wide 
expression profiles. Proceedings of the National Academy of Sciences of 
the United States of America102: 15545-15550.
\item Breiman L (2001) Random forests. Machine learning 45: 5-32.
\item Song L, Langfelder P, Horvath S (2013) Random generalized linear model: 
a highly accurate and interpretable ensemble predictor. BMC bioinformatics 14: 5.
\item Pei H, Li L, Fridley BL, Jenkins GD et al. FKBP51 affects cancer cell response to 
chemotherapy by negatively regulating Akt. Cancer Cell 2009 Sep 8;16(3):259-66. 
PMID: 19732725
\item Zhang L, Farrell JJ, Zhou H, Elashoff D et al. Salivary transcriptomic 
biomarkers for detection of resectable pancreatic cancer. Gastroenterology 2010 Mar;
138(3):949-57.e1-7. PMID: 19931263
\end{itemize}


\end{document}