%\VignetteIndexEntry{SimulatorZ}
%\VignettePackage{SimulatorZ}
\documentclass{article}

<<style, eval=TRUE, echo=FALSE, results=tex>>=
BiocStyle::latex()
@

\title{\Rpackage{simulatorZ} package vignette}
\author{Yuqing Zhang\footnote{\email{zhangyuqing.pkusms@gmail.com}},
  Christoph Bernau\footnote{\email{Christoph.Bernau@lrz.de}},
  Levi Waldron\footnote{\email{levi.waldron@hunter.cuny.edu}}}

\begin{document}
\SweaveOpts{concordance=TRUE}
\maketitle
\tableofcontents

\section{Introduction}

\Rpackage{simulatorZ} is a package for generating realistic simulations of a range of genomic data types, based on real experimental data contained in \Rclass{ExpressionSet} or \Rclass{RangedSummarizedExperiment} data containers. It can generate simulations of a single dataset based on a reference experimental dataset, or of a collection of independent datasets based on a reference collection. Details of the simulation methodology are described by Bernau et al. \cite{Bernau2014} and in section \ref{sec:approach}. Briefly, a standard non-parametric bootstrap is used to sample (with replacement) individuals from experiments and to sample experiments from a collection. Optionally, a form of parametric bootstrap is used to simulate censored time-to-event outcomes based on the baseline hazard, follow-up time, and association between outcome and genomic features found in the experimental datasets.

Development of this package was motivated by the need to simulate collections of independent genomic datasets, and to perform training and validation of prediction algorithms across independent datasets. It derives from studies of the performance of risk prediction algorithms across independent transcriptome studies of cancer patients. Cross-study validation (CSV) is an alternative to traditional cross-validation (CV) in research domains where predictive modeling is applied to high-dimensional data and where multiple comparable datasets are publicly available. Recent studies have both demonstrated the potential power of this method and raised questions about it. For example, Bernau et al. observed a dramatic drop in CSV performance compared to that of CV. \Rpackage{simulatorZ} enables statistical investigation of potential causes of this discrepancy through realistic simulation studies.

In this vignette, we give a short tour of the package and show how to use it for typical tasks.

\section{Data Set Simulation}

<<>>=
library(simulatorZ)
@

\subsection{Approach in the simulation}
\label{sec:approach}

The \Rpackage{simulatorZ} package implements a resampling scheme, illustrated in Bernau et al. \cite{Bernau2014}, to simulate a collection of independent genomic datasets with gene expression profiles and time-to-event outcomes. The resampling procedure involves both parametric and non-parametric bootstrap, and can be divided into three steps:
\begin{itemize}
\item Non-parametric bootstrap at the set level. \\
In this step we sample the set labels, with replacement, from the list of original datasets. This captures the differences between studies, and the variability in which studies happen to be available.
\item Non-parametric bootstrap at the patient level. \\
We resample observations, with replacement, from each dataset corresponding to the randomly drawn labels obtained in step 1.
\item Parametric bootstrap to simulate time to event (sketched after this list). \\
The parametric step uses a generative model that combines the truncated inversion method of Bender et al. (2005) \cite{Bender2005}, the Nelson--Aalen estimator of the cumulative hazard functions, and the CoxBoost method (Binder and Schumacher, 2008) \cite{Binder2008} to simulate survival times.
\end{itemize}
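In our own notation (a sketch of the generative model; the exact implementation details are in \Rfunction{simTime}): let $\hat\beta$ be the CoxBoost coefficient vector, $\eta_i = \hat\beta^\top x_i$ the linear predictor of patient $i$, and $\hat H_0$ the Nelson--Aalen estimate of the baseline cumulative hazard. Drawing $U_i \sim \mathrm{Uniform}(0,1)$, a survival time is generated by inverting the cumulative hazard,
\[
T_i = \hat H_0^{-1}\!\left( \frac{-\log U_i}{\exp(\eta_i)} \right),
\]
truncated at the maximum follow-up time observed in the original dataset. A censoring time $C_i$ is drawn analogously from the estimated censoring hazard, and the simulated outcome is $\bigl(\min(T_i, C_i),\, \mathbf{1}\{T_i \le C_i\}\bigr)$.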
\Rpackage{simulatorZ} makes use of this approach and improves on it by applying a stratified non-parametric bootstrap at the patient level (step 2) to balance the prevalence of covariates in the simulated datasets, allowing the simulated collections to serve a wider range of research purposes. The package provides separate functions for the non-parametric and parametric steps of the simulation (\Rfunction{simData} and \Rfunction{simTime}), as well as a driver function that completes the whole process (\Rfunction{simBootstrap}). It supports both \Rclass{ExpressionSet} and \Rclass{RangedSummarizedExperiment} objects, and the examples below demonstrate how to use the functions with each of these types. The examples in this vignette use \Rclass{ExpressionSet} objects from \Biocexptpkg{curatedOvarianData} \cite{Ganzfried2013} and a \Rclass{RangedSummarizedExperiment} from \Biocexptpkg{parathyroidSE} in \Bioconductor{}. Users who are not familiar with these two classes can refer to the corresponding manuals for more information.

\subsection{Simulation on \Rclass{ExpressionSet} objects}

\subsubsection{Non-parametric bootstrap}

First, we use the \Biocexptpkg{curatedOvarianData} package \cite{Ganzfried2013} to obtain a list of \Rclass{ExpressionSet} objects. A collection of 14 datasets with overall survival information can be obtained as follows, as demonstrated in the \Biocexptpkg{curatedOvarianData} package:

<<eval=FALSE>>=
library(curatedOvarianData)
source(system.file("extdata", "patientselection.config",
                   package="curatedOvarianData"))
source(system.file("extdata", "createEsetList.R",
                   package="curatedOvarianData"))
@

However, this takes a while, so to obtain a smaller collection quickly we simply select three datasets and construct a list called \Robject{esets.orig} containing three \Rclass{ExpressionSet} objects:

<<>>=
library(curatedOvarianData)
data(GSE17260_eset)
data(E.MTAB.386_eset)
data(GSE14764_eset)
esets.orig <- list(GSE17260=GSE17260_eset, E.MTAB.386=E.MTAB.386_eset,
                   GSE14764=GSE14764_eset)
@

The following function helps clean a list of \Rclass{ExpressionSet} objects. It provides options to keep only the features (rows) common to all datasets (\Rcode{keep.common.only=TRUE}), and to keep only the samples for which certain annotations are available (if \Robject{meta.required} is a vector of column names found in the \Robject{phenoData} of each dataset):

<<>>=
cleanEsets <- function(obj, keep.common.only=TRUE, meta.required=NULL){
    if(keep.common.only){
        ## keep only the features present in every dataset
        intersect.features <- Reduce(intersect, lapply(obj, featureNames))
        for (i in seq_along(obj))
            obj[[i]] <- obj[[i]][intersect.features, ]
    }
    if(!is.null(meta.required)){
        ## drop samples with missing values in any required annotation,
        ## then drop datasets left with no samples at all
        obj <- lapply(obj, function(obj1){
            for (x in meta.required)
                obj1 <- obj1[, !is.na(obj1[[x]])]
            if(ncol(obj1) > 0){
                return(obj1)
            }else{
                return(NULL)
            }
        })
        obj <- obj[!sapply(obj, is.null)]
    }
    return(obj)
}
@

Now create a list of \Rclass{ExpressionSet} objects that contains only samples for which overall survival is known, and only the intersection of features available in all datasets. To speed up later simulations, we keep only 1000 features:

<<>>=
esets <- cleanEsets(esets.orig, meta.required=c("days_to_death",
                                                "vital_status"))
esets <- lapply(esets, function(eset) eset[1:1000, ])
sapply(esets, dim)
@

This will be our list of original datasets for the simulations that follow. Note that we take a subset here only for convenience; it is not obligatory for the datasets to have the same number of observations. However, when performing the parametric bootstrap with multiple datasets, it is important that they share the \emph{same} features.
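As a quick sanity check (a minimal sketch, not part of the original workflow), we can verify that the cleaned datasets do share an identical feature set:

<<>>=
stopifnot(all(sapply(esets, function(eset)
    identical(featureNames(eset), featureNames(esets[[1]])))))
@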
\Rfunction{simData} is the function that performs non-parametric bootstrap resampling at both the set and the patient level. To simulate a collection of datasets, each containing 500 observations, using the method described above, we call \Rfunction{simData} like this:

<<>>=
set.seed(8)
sim.esets <- simData(esets, n.samples=500)
names(sim.esets)
@

The output is a list describing the simulated datasets. Its elements are:
\begin{itemize}
\item \Robject{obj}: the simulated data, as a list of \Rclass{ExpressionSet} or \Rclass{RangedSummarizedExperiment} objects.
\item \Robject{setsID} and \Robject{indices}: the original set and patient labels selected during the simulation, which make it easy to keep track of the data.
\item \Robject{prob.desired} and \Robject{prob.real}: discussed in detail later.
\end{itemize}

By default, \Rfunction{simData} performs a two-step bootstrap: the first step samples complete datasets, and the second step samples individuals from each dataset. If we prefer instead to sample individuals from each original dataset, keeping all datasets without resampling at the dataset level, we set the \Robject{type} argument to ``one-step'':

<<>>=
sim.esets <- simData(esets, 500, type="one-step")
@

\subsubsection{Balance the Prevalence of Covariates}

As mentioned before, \Rpackage{simulatorZ} supports balancing the distribution of covariates, which is also done with \Rfunction{simData}. The parameter \Robject{balance.variables} is a string, or a vector of strings, giving the names of the covariates to be balanced. Balancing re-weights the sampling probabilities so that the distribution of the covariate (or the joint distribution of several covariates) in each simulated dataset resembles the distribution in all the original datasets combined. For instance, we balance \Rcode{"tumorstage"} as follows. First, we remove observations with a missing \Rcode{"tumorstage"} value (any dataset in which \Rcode{"tumorstage"} is entirely unavailable is excluded from the collection):

<<>>=
cleaned.esets <- cleanEsets(esets.orig, meta.required="tumorstage")
@

Then we can balance the distribution of the covariate:

<<>>=
sim.sets <- simData(cleaned.esets, 500, balance.variables="tumorstage")
sim.sets$prob.desired
sim.sets$prob.real[[1]]
@

We can now explain the outputs \Robject{prob.desired} and \Robject{prob.real}, which are relevant only when covariate distributions are being balanced. \Robject{prob.desired} is the overall covariate distribution, calculated by combining all observations from all original datasets. \Robject{prob.real} is a list of probabilities: each element corresponds to one original dataset and gives, for each patient, the probability of being sampled given that patient's covariate value. In the code above, \Rcode{prob.real[[1]]} prints the tumor stage values and the corresponding sampling probabilities for the first original dataset.

We can also consider the joint distribution of two covariates. For example, to re-weight the resampling probabilities by both tumor stage and grade, we first perform similar cleaning:

<<>>=
cleaned.esets <- cleanEsets(esets.orig, meta.required=c("tumorstage",
                                                        "grade"))
@

Then we set \Robject{balance.variables} to a character vector:

<<>>=
sim.sets <- simData(cleaned.esets, 500,
                    balance.variables=c("tumorstage", "grade"))
sim.sets$prob.desired
sim.sets$prob.real[[1]]
@

Note that balancing is only suitable for discrete covariates; continuous variables must be discretized before they can be balanced, as sketched below.
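For example (a minimal sketch; it assumes the \Rcode{"age\_at\_initial\_pathologic\_diagnosis"} annotation is available in these datasets), a continuous covariate such as age can be binned with \Rfunction{cut} before balancing:

<<>>=
esets.age <- cleanEsets(esets.orig,
                        meta.required="age_at_initial_pathologic_diagnosis")
esets.age <- lapply(esets.age, function(eset){
    ## discretize age into four bins, then balance on the binned version
    eset$age.bin <- cut(eset$age_at_initial_pathologic_diagnosis,
                        breaks=c(0, 50, 60, 70, Inf))
    eset
})
sim.sets <- simData(esets.age, 500, balance.variables="age.bin")
@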
Before we move on, one more comment on this function: \Rfunction{simData} does not require a response variable, even though it has a parameter for one. This is convenient when \Rfunction{simBootstrap} is used directly. If a response variable is specified, \Rfunction{simData} passes it through to the output without modifying it.

\subsubsection{Parametric Bootstrap}

To simulate the corresponding time-to-event outcomes, we first need to obtain the true model from the original datasets. The \Rfunction{getTrueModel} function first fits CoxBoost to the original datasets to obtain coefficients, then calculates the baseline cumulative survival and censoring hazards. All of these are used by \Rfunction{simTime}, which takes the output of \Rfunction{getTrueModel} to simulate the survival times and censoring statuses of the simulated datasets. This requires a response variable, which in this example we extract from the datasets:

<<>>=
y.list <- lapply(esets, function(eset){
    return(Surv(eset$days_to_death, eset$vital_status == "deceased"))
})
@

<<>>=
true.mod <- getTrueModel(esets, y.list, 100)
@

<<>>=
names(true.mod)
@

\Robject{beta} contains the coefficients obtained by fitting CoxBoost to the original datasets. \Robject{lp} is the linear predictor, calculated by multiplying the coefficients with the gene expression matrix. \Robject{survH} and \Robject{censH} are the cumulative survival and censoring hazards, and \Robject{grid} is the corresponding time grid.

<<>>=
simmodel <- simData(esets, 500, y.list)
new.esets <- simTime(simmodel, true.mod)
@

\Rfunction{simTime} outputs a list of the same form as its input, differing only in that \Robject{y.vars} is updated with the simulated outcomes. Both parameters of \Rfunction{simTime} can be constructed from scratch; please refer to the examples in the help manual.

\subsubsection{The driver function}

The three-step simulation can be completed step by step in the following way:

<<>>=
true.mod <- getTrueModel(esets, y.list, 100)
sim.sets <- simData(esets, 500, y.list)
sim.sets <- simTime(sim.sets, true.mod)
@

Or, we can use the driver function \Rfunction{simBootstrap} directly:

<<>>=
sim.sets <- simBootstrap(esets, y.list, 500, 100)
@

The driver function can also skip the resampling of set labels, or balance covariates: its \Robject{type} and \Robject{balance.variables} arguments work the same way as in \Rfunction{simData}, as sketched below.
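For instance (a usage sketch, re-using \Robject{esets} and \Robject{y.list} from above), the driver can keep all original datasets and resample only at the patient level:

<<>>=
sim.one <- simBootstrap(esets, y.list, 500, 100, type="one-step")
@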
However, separating the functions allows users to work with different true models. For example, we can combine all the datasets into one large dataset, use it to fit a single true model, and then apply that model to every simulated dataset:

<<>>=
y.vars <- y.list
balance.variables <- c("tumorstage", "grade")
X.list <- lapply(esets, function(eset){
    return(t(exprs(eset)))
})
all.X <- do.call(rbind, X.list)
cov.list <- lapply(esets, function(eset){
    return(pData(eset)[, balance.variables])
})
all.cov <- as(data.frame(do.call(rbind, cov.list)), "AnnotatedDataFrame")
rownames(all.cov) <- rownames(all.X)
all.yvars <- do.call(rbind, y.vars)
whole_eset <- ExpressionSet(t(all.X), all.cov)
@

Then we use this large merged \Rclass{ExpressionSet} to fit a parametric survival model that serves as the ``true'' model:

<<>>=
## index of the original dataset that each observation came from
lp.index <- c()
for(i in seq_along(esets)){
    lp.index <- c(lp.index, rep(i, length(y.vars[[i]][, 1])))
}
truemod <- getTrueModel(list(whole_eset), list(all.yvars), parstep=100)
simmodels <- simData(esets, 150, y.vars)
## distribute the single true model to every original dataset
beta <- grid <- survH <- censH <- lp <- list()
for(listid in seq_along(esets)){
    beta[[listid]] <- truemod$beta[[1]]
    grid[[listid]] <- truemod$grid[[1]]
    survH[[listid]] <- truemod$survH[[1]]
    censH[[listid]] <- truemod$censH[[1]]
    lp[[listid]] <- truemod$lp[[1]][which(lp.index == listid)]
}
res <- list(beta=beta, grid=grid, survH=survH, censH=censH, lp=lp)
simmodels <- simTime(simmodels, res)
@

Users can also build their own true model, following the examples in the \Rfunction{simTime} help page, which adds to the flexibility of the package. The \Robject{y.vars} parameter also accepts \Rclass{Surv}, \Rclass{matrix}, and \Rclass{data.frame} objects for additional flexibility.

\subsection{Simulation on \Rclass{RangedSummarizedExperiment}}

Simulation with a \Rclass{RangedSummarizedExperiment} is similar to simulation with \Rclass{ExpressionSet} objects. Here we also illustrate that \Rpackage{simulatorZ} can simulate from a single original dataset. First, we use the \Biocexptpkg{parathyroidSE} package to obtain a \Rclass{RangedSummarizedExperiment} object:

<<>>=
library(parathyroidSE)
data("parathyroidGenesSE")
@

We enclose \Robject{parathyroidGenesSE} in a list for compatibility with \Rfunction{simData}:

<<>>=
sim.sets <- simData(list(parathyroidGenesSE), 100)
names(sim.sets)
sim.sets$obj[[1]]  # the simulated RangedSummarizedExperiment
@

In this simple example the simulated object is simply a bootstrap sample (with replacement) of the original object. In the original object, ``run'' is unique:

<<>>=
table(colData(parathyroidGenesSE)$run)
@

But ``run'' is not unique in the simulated object, due to the bootstrap sampling:

<<>>=
table(colData(sim.sets$obj[[1]])$run)
@

\section{Training and Validation on Genomic Datasets}

Another important feature of \Rpackage{simulatorZ} is support for training and validating predictive models across datasets.

\subsection{Independent within-study validation (SuperPC)}

The following example shows how to use the SuperPC algorithm (Bair and Tibshirani, 2004) to train and validate on a single \Rclass{ExpressionSet}.

\subsubsection{Create a training set and a large validation set}

We use the first element of \Robject{esets} for the within-study validation. First, we generate a training set with 450 observations:

<<>>=
tr.size <- 450
simmodel.tr <- simBootstrap(esets[1], y.list[1], tr.size, 100)
tr.set <- simmodel.tr$obj.list[[1]]
X.tr <- t(exprs(tr.set))
y.tr <- simmodel.tr$y.vars.list[[1]]
@

Then we simulate the validation set from the same original dataset, so that the training and validation sets are independent yet drawn from the same study:

<<>>=
val.size <- 450
simmodel.val <- simBootstrap(esets[1], y.list[1], val.size, 100)
val.set <- simmodel.val$obj.list[[1]]
X.val <- t(exprs(val.set))
y.val <- simmodel.val$y.vars.list[[1]]
@

We use the C-Index (G\"onen and Heller, 2005) as the validation measure.
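As a reminder (our notation, not part of the original method description), the C-Index of a risk score $\eta$ is the probability that, for a randomly chosen comparable pair of patients, the predicted risk ordering agrees with the observed survival ordering:
\[
C = \Pr\bigl(\eta_i > \eta_j \mid T_i < T_j\bigr),
\]
so that $C = 0.5$ corresponds to random prediction and $C = 1$ to perfect concordance. Because \Rfunction{rcorr.cens} from \Rpackage{Hmisc} measures concordance between a predictor and the survival time, and a higher linear predictor implies higher risk (shorter survival), the negated predictor \Rcode{-lp} is passed to it below.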
First, we use the true linear predictor to calculate the C-Index; no algorithm is expected to produce a C-Index larger than this one.

<<>>=
## C-Index for the true linear predictor
val.par <- getTrueModel(esets, y.list, 100)
lpboot <- val.par$lp[[1]][simmodel.val$indices[[1]]]
library(Hmisc)
c.ind <- rcorr.cens(-lpboot, y.val)[1]
@

<<>>=
print(c.ind)
@

\subsubsection{Fit SuperPC}

Now we fit SuperPC on the training set, with a tuning step to select the parameters that work best for validation:

<<>>=
library(superpc)
tr.data <- list(x=t(X.tr), y=y.tr[, 1], censoring.status=y.tr[, 2],
                featurenames=colnames(X.tr))
fit.tr <- superpc.train(data=tr.data, type="survival")
## tuning
cv.tr <- superpc.cv(fit.tr, data=tr.data)
n.comp <- which.max(apply(cv.tr$scor, 1, max, na.rm=TRUE))
thresh <- cv.tr$thresholds[which.max(cv.tr$scor[n.comp, ])]
@

Then we compute the linear predictors using the optimal parameters:

<<superpcbox, fig=TRUE, include=FALSE>>=
lp.tr <- superpc.predict(fit.tr, tr.data, tr.data, threshold=thresh,
                         n.components=n.comp)$v.pred.1df
boxplot(lp.tr)
@

\incfig{simulatorZ-vignette-superpcbox}{0.8\textwidth}{Distribution of linear predictors calculated on the training set.}{The boxplot summarizes the SuperPC linear predictor on the training data.}

\subsubsection{Validation on the large validation set}

<<>>=
data.val <- list(x=t(X.val), y=y.val[, 1], censoring.status=y.val[, 2],
                 featurenames=colnames(X.val))
lp.val <- superpc.predict(fit.tr, tr.data, data.val, threshold=thresh,
                          n.components=n.comp)$v.pred.1df
@

Now we can compute the C-Index for the validation data:

<<>>=
print("C-Index")
(c.ind <- rcorr.cens(-lp.val, y.val)[1])
@

The correlation of this linear predictor with the true one is another important measure of validation quality:

<<>>=
print("correlation to true lp")
(corlps <- cor(lp.val, lpboot, method="pearson"))
@

\Rpackage{simulatorZ} also contains a function implementing the M\'as-o-menos algorithm (Zhao et al., 2013) to support simple training and validation tasks; please see the help page of the \Rfunction{plusMinus} function.

\subsection{Cross-Study Validation}

With enough datasets, we can perform training and validation within one dataset, as illustrated in the previous section, or perform cross-study validation with multiple datasets. To perform cross-study validation between each pair of datasets, \Rpackage{simulatorZ} provides \Rfunction{zmatrix}, which generates a matrix of C-Indices; for the diagonal elements, it performs cross-validation. An example follows:

<<>>=
z <- zmatrix(obj=esets, y.vars=y.list, fold=3, trainingFun=plusMinus)
@

<<>>=
print(z)
@
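A quick way to summarize such a matrix (a minimal sketch in base R, assuming rows index the training sets and columns the validation sets) is to compare the mean of the diagonal entries (CV) with the mean of the off-diagonal entries (CSV):

<<>>=
cv.mean <- mean(diag(z))
csv.mean <- mean(z[row(z) != col(z)])
c(CV=cv.mean, CSV=csv.mean)
@

The simulation loop below repeats this comparison over many simulated collections.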
Combining the two main features of \Rpackage{simulatorZ}, we can first simulate a collection of independent genomic datasets, then perform training and prediction with these datasets. The last example in this vignette shows how to complete this whole process.

<<>>=
Z.list <- list()
CV <- CSV <- c()
for(b in 1:20){
    print(paste("iteration: ", b, sep=""))
    sim2.esets <- simBootstrap(obj=esets, n.samples=150, y.vars=y.list,
                               parstep=100, type="two-steps")
    Z.list[[b]] <- zmatrix(obj=sim2.esets$obj.list,
                           y.vars=sim2.esets$y.vars.list, fold=4,
                           trainingFun=plusMinus)
    sum.cv <- 0
    for(i in seq_along(esets)){
        sum.cv <- sum.cv + Z.list[[b]][i, i]
    }
    CV[b] <- sum.cv / length(esets)
    CSV[b] <- (sum(Z.list[[b]]) - sum.cv) /
        (length(esets) * (length(esets) - 1))
}
@

We have now run 20 simulations; each yields a matrix of C-Indices, from which we compute the average CV and CSV validation statistics. This provides insight into the difference between cross-study validation and traditional cross-validation.

<<box, fig=TRUE, include=FALSE>>=
average.Z <- Z.list[[1]]
for(i in 2:length(Z.list)){
    average.Z <- average.Z + Z.list[[i]]
}
average.Z <- average.Z / 20
print(average.Z)
resultlist <- list(CSV=CSV, CV=CV)
boxplot(resultlist, col=c("white", "grey"), ylab="C-Index",
        boxwex=0.25, xlim=c(0.5, 2.5))
@

\incfig{simulatorZ-vignette-box}{0.8\textwidth}{Boxplots of C-Index performance with cross-validation and cross-study validation.}{There is a dramatic drop in the CSV statistic compared to CV.}

\section{Session Info}

<<>>=
sessionInfo()
@

\clearpage
\bibliography{simulatorZ-vignette}

\end{document}