%\VignetteIndexEntry{R packages: CellScore}
%\VignetteDepends{Biobase, CellScore, hgu133plus2CellScore}
%\VignetteEngine{knitr::knitr}
%\VignetteKeywords{CellScore}
%\VignettePackage{CellScore}

\documentclass[a4paper, 10pt]{article}
\usepackage[toc,page]{appendix}
\usepackage{hyperref} % \href
\usepackage{framed} % framed element/text
\usepackage{rotating} % sidewaysfigure
\usepackage{pdflscape} % \landscape
\usepackage[labelfont=bf]{caption} % Figure label in bold
\usepackage{etoolbox} % \ifboolexpr

%% new commands for string formating, for consistent and
%% systematic change across the entire text
%% \pkg{} - for formating R package names
\newcommand{\pkg}[1]{{\normalfont\fontseries{b}\selectfont #1}}
%% \prog{} - for formating programming language names like R
\let\prog=\texttt
%% \prog{} - for formating names of R data objects in the text
\newcommand{\code}[1]{\texttt{\detokenize{#1}}}
%% \vars{} - for new concepts, names of classes ...
\newcommand{\vars}[1]{\textit{\detokenize{#1}}}

<<label='version', include=FALSE, cache=FALSE>>=
#suppressPackageStartupMessages(library(Biobase))
library(Biobase)
pkg.ver <- package.version("CellScore")
@

\title{\pkg{CellScore} \Sexpr{pkg.ver}: Evaluation of Cell Identity}
\author{Nancy Mah, Katerina Ta\v{s}kova\\
\texttt{nancy.l.mah@googlemail.com, katerina@tashkova.org}}
\date{\today}

\begin{document}
\maketitle

%% for the problem of too long words (usualy variable names)
%% getting over the text margin
\emergencystretch=5em

<<label='Setup', include=FALSE, cache=FALSE>>=

## Save the current working directory
dir.main <- getwd()
## Set the name of the directory in which figures will be saved (if any)
dir.figures <- 'figures'

## global chunk options
library(knitr)
opts_chunk$set(
    concordance=FALSE,
    cashe=2,
    ## cache is only valid with a specific version of R and session info
    ## cache will be kept for at most a month (re-compute the next month)
    cache.extra=list(R.version,
                     sessionInfo(),
                     format(Sys.Date(), '%Y-%m')
                     ),
    autodep=TRUE,
    fig.path=paste0(dir.figures,"/"),
    tidy=FALSE,
    size="small",
    message=FALSE,
    warning=FALSE
)
options(width=70, promp="R> ", continue="+  ", useFancyQuotes=FALSE)
@

\tableofcontents
\newpage

\section{Introduction}
The \pkg{CellScore} package contains functions to evaluate the cell identity of
a test sample undergoing a cell transition, given a starting (donor) cell type
and a desired target cell type. The evaluation is based upon a scoring system,
which uses a set of standard samples
of known cell types as the reference set. It combines the benefits of two
metrics, cosine similarity of expression profiles and fractions of expressed
cell type specific genes, into a single score for the cell identity,
called CellScore.
The CellScore evaluation has been carried out on a large set of
microarray data from one platform (Affymetrix Human Genome U133
Plus 2.0). In principle, the method could be applied to any expression dataset,
provided that there are a sufficient number of standard samples and that the
data are properly normalized (to account for study-specific and
platform-specific effects).

\section{Installation}
This vignette assumes that you have already installed \prog{R} ($\geq$
\Sexpr{paste(R.version$major, R.version$minor, sep=".")})
and that you have basic working knowledge of \prog{R}. You will
additionally need to install the core Bioconductor packages if these have not
already been installed:

<<eval=FALSE>>=
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install()
@

To complete this tutorial, you will to install the \pkg{hgu133plus2CellScore}
and \pkg{CellScore} packages from the Bioconductor repository:

<<eval=FALSE>>=
BiocManager::install(c("hgu133plus2CellScore", "CellScore"))
@


\section{Data preparation for analysis with \pkg{CellScore}}
In order to use the package functionality, you will need normalized expression
data from both the reference samples (from standard cell types) and from the 
test samples (from engineered cell types). A
pre-selected and formatted dataset of reference samples from the
Affymetrix Human Genome U133 Plus 2.0 platform is provided as a separate data
package (\pkg{hgu133plus2CellScore}). The reference dataset (stored as
 \code{eset.std})
includes 837 samples covering over 100 tissues or distinct cell types. To
illustrate the usage of the package, a test dataset is also distributed with
the package.

For an explicit description of the format of the data and
how to generate your own reference data, please see the
Appendix \ref{appendix:inputdata}.

\section{Example analysis with \pkg{CellScore}}
In this section, we include a full analysis on the provided test dataset to
demonstrate the functionality of \pkg{CellScore}. The typical workflow includes
the following steps:
\begin{enumerate}
    \item {\it Prepare the data}: A formatted expression dataset from 
    reference cell types profiled with a specific 
    Affymetrix platform is provided as a data 
    package. You will also need to prepare the expression profiles of your test
    samples in the same way as the reference data, and provide an additional
    table with the information about the cell transitions that should be
    evaluated (described in Appendix \ref{appendix:inputdata}).
    \item {\it Calculate the on/off score}: The first metric of CellScore
    is based on present/absent probe calls between donor and target cell types.
    \item {\it Calculate the cosine similarity}: The second metric of 
    CellScore is the cosine similarity between the standard and the test cell
    types, based on the normalized expression data.
    \item {\it Generate the CellScore}: This calculates the overall cell
    identity score of a given test sample and a given cell transition.
    \item {\it Generate the CellScore report}: The PDF report contains plots and
    heatmaps of the CellScore metrics to help you visualize the
    progress/direction of the evaluated cell transitions.
\end{enumerate}

\subsection{Load the data}

<<eval=TRUE, echo=FALSE, cache=FALSE>>=
options(BIOCINSTALLER_ONLINE_DCF=FALSE)
pc_info <- ifelse(grepl(getwd(),"kate"),
  "Intel Core i7-7500U CPU @ 2.70GHz",
  "Intel Core i7-5600U CPU @ 2.60GHz")
@

Load the packages and standard dataset:
<<eval=TRUE, echo=1:3, cache=FALSE, include=TRUE>>=
library(Biobase)
library(CellScore)
library(hgu133plus2CellScore) # loads eset.std
@

For this vignette, you will need the data from two external files distributed
with the \pkg{CellScore} package source:
\begin{itemize}
    \item {\code{eset48.RData:}} An \prog{R} data file that stores \code{eset48},
    an \vars{ExpressionSet} object with the normalized gene expression values 
    of 48 samples from engineered (derived) cells representing several cell
    transitions.
    \item {\code{cell_change_test.tsv:}} A tab-delimited text file with the
    information on the cell transitions that should be evaluated; each row
    depicts a single cell transition.
\end{itemize}
These data can be loaded with the code below, which also combines the gene
expression values of the test samples (\code{eset48}) with the ones of the standard
reference samples (\code{eset.std}) in one single \vars{ExpressionSet} object
\code{eset}.
<<eval=TRUE>>=
## Locate the external data files in the CellScore package
rdata.path <- system.file("extdata", "eset48.RData",
                          package="CellScore")
tsvdata.path <- system.file("extdata", "cell_change_test.tsv",
                            package="CellScore")

if (file.exists(rdata.path) && file.exists(tsvdata.path)) {
    ## Load the normalized expressions of 48 test samples
    load(rdata.path)

    ## Import the cell change info for the loaded test samples
    cell.change <- read.delim(file=tsvdata.path, sep="\t",
                              header=TRUE, stringsAsFactors=FALSE)
    print("Content of cell.change")
    print(cell.change)

    ## Combine the standards and the test data
    eset <- combine(eset.std, eset48)
    print("dim(eset) returns")
    print(dim(eset))
}
@

The particular examples covered in this tutorial (\code{cell.change}) include
two cell transitions producing three different derived cell types: induced
pluripotent stem cells derived from fibroblasts or keratinocytes (iPS-FIB and
iPS-KER, respectively), and partially induced pluripotent stem cells from
fibroblast (piPS-FIB).

\subsection{Calculate the on/off score}

The first CellScore metric we will calculate is the on/off score. This score is
based on the fraction of donor markers lost and the fraction of target markers
gained by the derived cells in a given cell transition. It can be calculated
on a sample level (for each sample individually) or on a group level
(for each derived cell type, aggregating the information across all samples).
The function call below calculates the on/off scores for each derived cell
type listed in \code{cell.change}. The outcome is a list of two data frames.
<<label='grouponoff', eval=TRUE, echo=FALSE>>=
group.OnOff <- OnOff(eset, cell.change, out.put="marker.list")
@
<<eval=TRUE>>=
<<grouponoff>>
summary(group.OnOff)
@

In this example, we see that as a group, the iPS-KER samples have the best
on/off score (high donor marker loss and high target marker gain).
<<eval=TRUE>>=
head(group.OnOff$scores)
@

The OnOff() function also outputs a data frame of marker genes used for the
calculation of the on/off score. A data frame of the marker genes can be found
here:
<<eval=TRUE>>=
head(group.OnOff$markers)
@

Next, we calculate the on/off score for all samples in the expression dataset.
This is provided as a separate step since it can take some time to
calculate, depending on how many samples there are in the \code{eset} object. In
this case, it was fast and took a few seconds of CPU time on an \Sexpr{pc_info}.
<<eval=TRUE, echo=TRUE, results='hide'>>=
individ.OnOff <- OnOff(eset, cell.change, out.put="individual")
@


The calculations are done and it is a good idea to save the calculated scores:
<<eval=TRUE>>=
save(file="OnOffscore.RData", individ.OnOff, group.OnOff)
@

\textbf{Things to check at this point}: Looking at the on/off score of
individual samples that make up the standards,
check if there are any clear outliers. For example, these outliers may be
caused by wrong annotation or modifications to the cell line that make them
unsuitable as standards. We can either eliminate these samples form the
analysis, or keep them but score them for different cell transitions. For this
purpose, the values of the \code{category} column in \code{eset@phenoData@data}
(the \vars{phenotype data frame}, described in Appendix
\ref{appendix:inputdata}) can be set to ``NA'' or ``unknown''.
If the category is ``NA'', the corresponding sample will not be scored. If
the category is ``unknown'', the corresponding sample will be scored for all
available cell transitions specified in \code{cell.change}.

For example, look at all samples that are embryonic stem cells (ESC) in the
transition from FIB to ESC:
<<eval=TRUE>>=
sel.transition <- individ.OnOff$scores$start == "FIB" &
                  individ.OnOff$scores$target == "ESC"
sel.esc <- grepl("embryonic stem cell", individ.OnOff$scores$cell_type)
onoff.sel <- individ.OnOff$scores[sel.esc & sel.transition,
                                  c(4,6,7,12,13,19,20,21)]
@

Ideally, standard ESC samples in the transition from FIB to ESC should have
an on/off score close to 2. However, the scores of the ESC cells ranges from
\Sexpr{round(min(onoff.sel$OnOffScore), digits=2)} to
\Sexpr{round(max(onoff.sel$OnOffScore), digits=2)}:
<<eval=TRUE>>=
summary(onoff.sel$OnOffScore)
@

These samples are clear outliers and therefore already have been removed
from the ESC standards by assigning them to the category ``test''.
<<eval=TRUE>>=
onoff.sel[order(onoff.sel$OnOffScore)[1:4],]
@

Finally, you can plot the group-wise on/off scores in a pyramid barplot
(see Figure ~\ref{fig:pyramidbarplot}). The function \code{BarplotOnOff()} also
returns the data (a list of two data frames) used for making the barplot:
<<label='barplotonoffcode', eval=FALSE, echo=FALSE>>=
barplot.out <- BarplotOnOff(eset, group.OnOff$scores)
@

<<label='barplotonoffdata', eval=TRUE, echo=TRUE, fig.keep='none'>>=
<<barplotonoffcode>>
barplot.out
@


<<pyramidbarplot, eval=TRUE, echo=FALSE, fig.pos ='!ht', fig.keep='high', fig.align='center', out.width='\\linewidth', fig.width=6, fig.height=4, fig.cap='\\textbf{Pyramid barplot of on/off scores.} The horizontal barplot shows the fraction of donor markers lost (blue) and the fraction of target markers gained (green). Each bar shows the results of one cell transition (left margin) for a particular derived cell type (right margin). The longer the coloured bars, the more successful the transition.'>>=
<<barplotonoffcode>>
@

The plot in Figure ~\ref{fig:pyramidbarplot} can be easily saved in a PDF format
as follows.
<<eval=FALSE>>=
pdf(file="GroupOnOffScore_Barplot.pdf")
<<barplotonoffcode>>
dev.off()
@


\subsection{Calculate the cosine similarity}

The second metric for cell scoring is the cosine similarity that can be
calculated with the function \code{CosineSimScore()}. First, the
expression matrix (containing only standard samples) is filtered for the most
variable probes/genes, as defined by the interquartile range (IQR). By default
we keep only the top 10\% of the genes ranked by IQR, and this cutoff can be
changed with the function argument \code{iqr.cutoff}. Then the mean
centroid of the expression values for each standard cell type (defined by
\code{eset$general_cell_type}) is calculated. Finally, the cosine similarity
 is calculated between each centroid and sample.

Note that this function can take few minutes to calculate,
depending on how many samples there are in the \code{eset} object. So,
start the execution and go for a coffee break ...
<<eval=TRUE>>=
tmp.time <- system.time(cs <- CosineSimScore(eset, cell.change,
                                             iqr.cutoff=0.1))
tmp.time
@
In our case, the calculation took approximately
\Sexpr{round(tmp.time[1], digits=0)} seconds of CPU time on an \Sexpr{pc_info}.
The result of the calculations is a list of five
data objects:
\begin{itemize}
    \item \code{cs$pdataSub:} A data frame with the phenotype of the reference
    samples.
    \item \code{cs$esetSub.IQR:} An \vars{ExpressionSet} object with the
    gene-filtered expression profiles of the reference samples.
    \item \code{cs$cosine.general.groups:} A matrix with the values of
    cosine similarity between all general groups defined by
    \code{eset$general_cell_type}
    \item \code{cs$cosine.subgroups:} A matrix with the values of
    cosine similarity between all sub groups defined by
    \code{eset$sub_cell_type1}
    \item \code{cs$cosine.samples:} A matrix with the values of cosine
    Similarity between all samples, general groups and subgroups
\end{itemize}

Now we can visualize the cosine similarity values in a heatmap. The function
\code{PlotCosineSimHeatmap()} will generate a PDF file. For example, we can plot
the cosine similarity between
\begin{itemize}
    \item the mean centroid of all general cell types, annotated by
    \code{eset$general_cell_type}
<<eval=TRUE, results='hide'>>=
PlotCosineSimHeatmap(cs$cosine.general.groups, "general groups",
                     width=20, height=20, x=-20, y=3)
@
    \item the mean centroid of all subgroup cell types, annotated by
    \code{eset$sub_cell_type1}
<<eval=TRUE, results='hide'>>=
PlotCosineSimHeatmap(cs$cosine.subgroups, "subgroups",
                     width=14, height=14, x=-14, y=3)
@
    \item all samples and subgroups; note that this plot can be enormous,
    depending on the number of samples, so you need to adjust the page size to
    make it viewable.
<<eval=FALSE>>=
PlotCosineSimHeatmap(cs$cosine.samples, "samples",
                     width=50, height=50, x=-50, y=10)
@
\end{itemize}

Rather than plotting the cosine similarity for all samples and subgroups,
you can pick which standards and test samples to plot. For example, plot the
general groups (fibroblast and embryonic stem cells)
for the transition from FIB to ESC and the relevant test cells for this
cell transition.

<<label='HeatmapDataPrep', eval=TRUE, echo=FALSE, results='hide'>>=
## Get the names (IDs) of the sample and their description
samples.cs <- colnames(cs$cosine.samples)
samples.eset <- sampleNames(eset)

## Select the samples of interest and their corresponding score
sel.ips <- eset$category == "test" &
            eset$sub_cell_type1 %in% c("piPS-FIB", "iPS-FIB")
sel <- samples.cs %in% c(c("FIB", "ESC"), samples.eset[sel.ips])
cs.sel <- cs$cosine.samples[sel, sel]

## Rename columns/rownames to more descriptive labels
## as cs.sel is a symetric matrix, these are identical
ids <- match(colnames(cs.sel), samples.eset)
ids.na <- is.na(ids)
ids.rest <- na.omit(ids)
new.colnames <- c(colnames(cs.sel)[ids.na],
                  paste(eset$sub_cell_type1[ids.rest],
                        eset$sample_id[ids.rest],
                        sep="_")
                  )
colnames(cs.sel) <- rownames(cs.sel) <- new.colnames
@

<<label='HeatmapCode', eval=TRUE, echo=FALSE, results='hide', fig.keep='none'>>=
## Plot the heatmap
PlotCosineSimHeatmap(cs.sel, "piPS", width=10, height=10, x=-10, y=3)
@
%% the chunk 'HeatmapCode' will generate a PDF in the working directory,
%% move the figure in the figure to dir.figures
%% TODO change this function to output the plot in a given folder
<<eval=TRUE, echo=FALSE, results='hide'>>=
heatmap.filename <- "CosineSimilarityHeatmap_piPS.pdf"
heatmap.newpath <- file.path(dir.figures, heatmap.filename)
system(paste("mv", heatmap.filename, heatmap.newpath))
@

<<eval=FALSE, echo=TRUE>>=
<<HeatmapDataPrep>>

<<HeatmapCode>>
@

%% include the figure in the vignette
\begin{figure}[!ht]
\includegraphics[width=1.2\linewidth]{{\Sexpr{heatmap.newpath}}}
\caption{\textbf{Heatmap of cosine similarity.} The triangular heatmap shows
the cosine similarity calculated between the centroids of the donor
(fibroblasts; FIB) and desired target cells (embryonic stem cells; ESC).
In addition, it shows the similarity between the individual samples of two cell
transitions, that is iPS-FIB and piPS-FIB.}
\label{fig:cosinesim}
\end{figure}

Notice that in the last plot, many of the partial iPS cell lines are more similar
to FIB rather than to ESC, which is to be expected (see Figure ~\ref{fig:cosinesim})

As an additional sanity check, we can perform a Principal Component Analysis (PCA)
and plot the data in the space of the the first two principal components.
For example, the PCA plot of the standard reference samples (used to calculate the
cosine similarity) should show that the similar cell types cluster closer together.
The \code{PcaStandards()} function can be used to generate such a plot. The same
allows the samples to be colored according to different properties of the samples:
\begin{itemize}
    \item experiment ID
<<eval=FALSE>>=
PcaStandards(cs$pdataSub$experiment_id, "Experiment ID",
             cs$esetSub.IQR)
@

    \item general cell type (see Figure ~\ref{fig:Pca}):
<<label='PcaCode', eval=FALSE>>=
PcaStandards(cs$pdataSub$general_cell_type, "General Cell Type",
             cs$esetSub.IQR)
@
\end{itemize}

\clearpage
\begin{landscape}

<<label='Pca', eval=TRUE, echo=FALSE, results='hide', fig.keep='high', out.width='1.3\\linewidth', fig.align='center', fig.pos='!ht', fig.width=14, fig.height=7, fig.cap='\\textbf{Principal component analysis of the reference dataset.} The reference samples cover a wide range of tissues and cell types from many studies and are the basis for comparison in the CellScore method. The analysis was applied on the expression matrix corresponidng to the most variable genes (as defined by the IQR cutoff). The plot, based on the first two prinicipal components, shows that similar cell types tend to cluster together. Brain tissue clusters on the right side, pluripotent and multipotent stem cells at the bottom, and somatic cell types on the left and upper clusters.'>>=
<<PcaCode>>
@

\end{landscape}

To identify specific sample (points) on the PCA plot you could plot the figure
including a text label for each sample. Samples with missing label (``NA'')
will have no text annotation on the plot:
<<eval=FALSE>>=
pdf(file="StandardSamples_PCA_Labels.pdf", width=28, height=14)
PcaStandards(cs$pdataSub$general_cell_type, "General Cell Type",
             cs$esetSub.IQR)
PcaStandards(cs$pdataSub$general_cell_type, "General Cell Type",
             cs$esetSub.IQR,
             text.label=cs$pdataSub$general_cell_type)

dev.off()
@

\subsection{Generate the CellScore}
Now that we've calculated the on/off scores and cosine similarities, it's
straightforward to calculate the CellScores for every sample in the expression
matrix.

<<eval=TRUE, echo=TRUE>>=
cellscore <- CellScore(eset, cell.change, individ.OnOff$scores,
                       cs$cosine.samples)
@

Finally, we can save all the scores and data in one file for later
manipulations.
<<eval=FALSE>>=
save(file="VignetteResults.RData",
     eset,                       # the combined expression dataset
     group.OnOff, individ.OnOff, # the on/off score values
     cs,                         # the cosine similaritiy values
     cellscore                   # the CellScore values
     )
@

\subsection{Generate CellScore reports}

In the final step, we will generate some diagnostic plots and save all the plots
in a summary report outputted to a PDF file.

\subsubsection{Scatter plot of donor-like and target-like scores}

In the scatter plot of all test samples, you can quickly identify samples that
have not completely transitioned to the desired cell type. Ideally, samples
with good transition to the target cell type are located in the upper
left-hand corner. Samples with poor transition will retain more donor-like
expression and tend to be located on the right-hand side of the plot
(see Figure ~\ref{fig:Scatterplot}):

The following will generate a two-paneled plot.
<<label='ScatterplotCode', eval=FALSE>>=
ScatterplotCellScoreComponents(cellscore, cell.change, FALSE)
@

\begin{landscape}
<<label='Scatterplot', eval=TRUE, echo=FALSE, out.width='\\linewidth', fig.pos='!ht', fig.align='center', fig.width=14, fig.height=7, fig.cap='\\textbf{Scatter plot of donor-like and target-like scores.} Derived cell types with the most successful transition have low donor-like score and high target-like scores and should cluster in the upper left-hand corner. In this example, the partially reprogrammed iPS cells (piPS-FIB) show gradual transition to their desired target cell type. A few iPS-FIB samples retain high donor-like scores, indicating unusual properties of these lines.'>>=
<<ScatterplotCode>>
@
\end{landscape}

If there are too many points to plot, you could also choose which transitions
to plot. For example, the code below will only plot the partial iPS
cells:

<<eval=FALSE>>=
ScatterplotCellScoreComponents(cellscore, cell.change[2,], FALSE)
@

\subsubsection{Boxplot of CellScore values}
The \code{BoxplotCellScore} function plots an overview of the CellScore values
across all the test samples, grouped by subgroup. The minimum score is
about -1.2, and the maximum score is about 1.2 (see Figure ~\ref{fig:Boxplot}):

<<label='Boxplot', eval=TRUE, fig.pos='!ht', out.width='\\linewidth', fig.align='center', fig.width=9, fig.height=6, fig.cap='\\textbf{Boxplot of CellScore values by subgroups.}'>>=
BoxplotCellScore(cellscore, cell.change)
@

\subsubsection{Rug plot of CellScore values}

The \code{RugplotCellScore()} function generates a rug plot of the CellScore
values by experiment, and the samples will be colored by a secondary property
from the \vars{phenotype data frame}. In this case, we choose to color the
samples by the values in the \code{transition_induction_method} column.
(Figure ~\ref{fig:Rugplot}).

<<label='RugplotCode', eval=FALSE, echo=FALSE>>=
secondary.property <- "transition_induction_method"
RugplotCellScore(cellscore, cell.change, secondary.property)
@

\begin{landscape}
<<label='Rugplot', eval=TRUE, echo=FALSE, out.width='1.1\\linewidth', fig.pos='!ht', fig.align='center', fig.width=14, fig.height=7, fig.cap='\\textbf{Rug plot of CellScore values.} The CellScore values are plotted for each derived cell type (left margin) within a study (GEO accession numbers in left margin). Each test sample is represented by a vertical line and colored by its transition induction method.'>>=
<<RugplotCode>>
@
\end{landscape}

\subsubsection{The summary report}

Finally, the last plotting function will generate a CellScore report for
each study and cell transition that has been defined in the \code{cell.change}
data frame.

As the report consists of many pages, it is best to simply plot everything
to a single PDF file.

<<eval=FALSE>>=
pdf(file="CellScoreReport_PerTransition.pdf", width=7, height=11)
CellScoreReport(cellscore, cell.change, group.OnOff$markers, eset)
dev.off()
@

The report is composed of several plots, including
\begin{itemize}
	\item \vars{scatter plot of the donor- and target-like scores}:
	Test samples, along with the standards for donor and target, are
	shown on the scatter plot for easy comparison
	(see Figure ~\ref{fig:ReportFig1}).

	\item \vars{density plot of the CellScore values}: The CellScore
	values of the test samples should be located somewhere between
	the CellScore values for the donor and target standards
	(see Figure ~\ref{fig:ReportFig2}, upper panel).

	\item \vars{rug plot of the CellScore values}: This plot has a
	closer look at the test samples and the target standards.
	Dashed vertical lines indicate the number of standard deviations
	from the mean target CellScore (see Figure ~\ref{fig:ReportFig2},
	lower panel).

	\item \vars{heatmap of the donor and target markers}: It gives a quick
	overview of the number of marker genes being expressed above detection
	level (as defined by the present calls in the \vars{calls matrix}) in
	the test samples (see Figure ~\ref{fig:ReportFig3}).

\end{itemize}

%% GENERATE A SUBEST-TEST DATA FOR THE REPORT PLOTS
<<label='ReportDataPerp', include=FALSE, echo=FALSE>>=
# get the test cellscores from valid transitions
# defined by cell.change table
plot.data <- extractTransitions(cellscore, cell.change)
# Extract CellScores of test that should be plotted
# on the same page into list
plotgroup <- paste(plot.data$experiment_id,
                   plot.data$cxkey.subcelltype,
                   sep="_")
temp <- data.frame(plot.data, plotgroup, stringsAsFactors=FALSE)
# sort the table, show when the list is made, everything
# is already in the right order:
#  o by target
#  o by donor
#  o by study
ind <- order(temp$target, temp$donor_tissue, temp$experiment_id)
plot.data.ordered <- temp[ind,]
tg <- unique(paste(plot.data.ordered$experiment_id,
                   plot.data.ordered$cxkey.subcelltype,
                   sep="_")
             )
# get test data from plotgroup
test.data <- plot.data.ordered[plot.data.ordered$plotgroup %in% tg[1], ]
@

%% REPORT PLOT 1
<<label='ReportFig1', echo=FALSE, out.width='1.2\\linewidth', fig.keep='high', fig.width=14, fig.height=7, fig.pos='!ht', fig.align='center', fig.cap='\\textbf{Scatter plot of CellScore components.} The first plot of the CellScore report shows a scatter plot of the donor-like and target-like scores of the donor standard (in this case, fibroblasts (FIB); red) and target standard (embryonic stem cell (ESC); blue), as well as the derived cell types (induced pluripotent stem cells from fibroblasts (iPS-FIB); brown crosses). The number of samples from each group is indicated in parentheses in the figure legend.' >>=
mat <- matrix(c(1,1,2,2), nrow=1, ncol=4, byrow=TRUE)
layout(mat)
scatterplotDonorTargetTest(test.data, cellscore, FALSE)
@

%% REPORT PLOT 2
<<label='ReportFig2', echo=FALSE, out.width='\\linewidth', fig.width=8, fig.height=8, fig.keep='high', fig.pos='!ht', fig.align='center', fig.cap='\\textbf{Density and rug plots of CellScore values.} The CellScore values are shown as a density plot (upper panel) and as a rug plot (lower panel). The rug plot displays the test cells (iPS-FIB) in relation to the desired target cell type (ESC). Vertical dashed lines indicate the number of standard deviations away from the mean CellScore (bold vertical dashed line) of the target cell type.'>>=
mat <- matrix(c(1,1,1,2,2,2), nrow=2, ncol=3, byrow=TRUE)
layout(mat)
rugplotDonorTargetTest(test.data, cellscore)
@

%% REPORT PLOT 3
<<label='ReportFig3', echo=FALSE, out.width='\\linewidth', fig.width=9, fig.height=12, fig.keep='high', fig.pos='!ht', fig.align='center', fig.cap='\\textbf{Heatmap of donor and target markers.} Fibroblasts (red) are the donor cells, and embryonic stem cells are the desired target cells (blue). The test cells are induced pluripotent stem cells from fibroblasts (iPS-FIB; green).'>>=
mat <- matrix(c(1,1,1,1,1,1), nrow=2, ncol=3, byrow=TRUE)
layout(mat)
calls <- assayDataElement(eset, "calls")
rownames(calls) <- fData(eset)[, "probe_id"]
heatmapOnOffMarkers(test.data, group.OnOff$markers, pData(eset), calls)
@

\clearpage
\subsection{R session information}
<<eval=TRUE, echo=TRUE>>=
sessionInfo()
@

\clearpage
% Make an appendix
\begin{appendices}

\section{Specifications for the input datasets}
\label{appendix:inputdata}
\subsection{Reference dataset}
If you have test samples from another microarray platform, you may want to build
your own reference data object to suit your own needs. In this case
you have to select your own reference samples. This can be as extensive as
including many cell types and biological replicates, or as little as only
providing some biological replicates for standard (donor and target) and test
cell types. You must have at least three samples in each standard
(donor/target), and at least one sample from a test cell type.
The method will perform better if there
are more samples, given that the quality of the samples is reasonable. The
advantage of having more samples is to capture a robust signal, which should be
representative for all samples of the same comparison. Limiting samples to only
a handful of replicates runs the risk of focussing on study-specific
effects.

The expression data should be transformed into an object of the \vars{ExpressionSet}
class (defined in the Biconductor package \pkg{Biobase}). For in-depth description
of the \vars{ExpressionSet} class please refer to the \pkg{Biobase} package
docmentation. In the following, we will describe only the relevant data
structures for \pkg{CellScore}.
\begin{enumerate}
    \item \vars{Calls matrix} with probe IDs in rows, samples in columns;
    1=present, 0=absent. The probe IDs should be located in the rownames of
    the matrix.
    \item \vars{Normalized expression matrix} with probe IDs in rows, samples
    in columns. The probe IDs should be located in the rownames of the matrix.
    This matrix MUST have the same dimensions and row order as the
    \vars{calls matrix}.
    \item \vars{Annotation data frame} for the probe IDs in the \vars{calls
    matrix}. The same probe IDs used in the \vars{calls matrix} MUST be
    used as the rownames of this data frame, and their order should match
    the order of the \vars{calls matrix}. There should be no duplicate probe
    IDs.
    This data frame must contain the following five columns:
    \begin{itemize}
        \item \code{probe_id}
        \item \code{platform_id}
        \item \code{gene_symbol}
        \item \code{gene_name}
        \item \code{entrezgene_id}
    \end{itemize}
    Other columns will be ignored.
    \item \vars{Phenotype data frame} with information
    about the samples in the expression matrix. Samples are in rows and columns
    are attributes of the sample. The rownames of the data frame must be the
    sample IDs and must exactly match the column names of the \vars{calls matrix}
    and \vars{normalized expression matrix}. This data frame must contain the
    following columns:
    \begin{itemize}
        \item \code{experiment_id}: This should be a unique identifier for an
        experiment, for example a GEO experiment ID or an ArrayExpress
        experiment ID.
        \item \code{sample_id}: Sample IDs should be unique, such as the GSM
        accession numbers from GEO.
        \item \code{platform_id}: Use a unique ID for each platform.
        \item \code{cell_type}: Use a short text description here.
        \item \code{category}: Each sample must be assigned to one of
        ``standard''/``test''/``NA''
        \item \code{general_cell_type}: Use an abbreviation to label the general
        cell type. For example, FIB for fibroblast.
        \item \code{donor_tissue}: If the sample is a derived cell type,
        then enter the abbreviation for the donor cell type.
        If the sample is standard cell type, then enter the donor tissue
        from which it was isolated. Otherwise, enter ``NA''.
        \item \code{sub_cell_type1}: The sub-cell type is a compound term of the
        general cell type and its donor tissue.
    \end{itemize}
    Additional columns are optional. The properties in these columns could be
    used to color the individual samples in the rug plots
    (Figure ~\ref{fig:Rugplot}). For example, the samples could be colored
    by
    \begin{itemize}
        \item \code{transition_induction_method}, the method used to engineer
        the derived cell types, or
        \item \code{donor_cell_body_location}, the anatomical area from which
        the donor cell was taken.
    \end{itemize}

\end{enumerate}

\subsection{Test dataset}
The test dataset refers to the set of test samples, that is the samples of
experimentally derived cell that we would like to evaluate by means of CellScore.
Technically, every sample not included in the reference dataset could be supplied
as a test sample. The expression profiles of the test samples should be normalized
in the same manner as the reference dataset, and stored in an
\vars{ExpressionSet} object in the same manner as the reference dataset.
The order of the genes (i.e. probes in the rows) in both (reference and test)
data objects must be the same.

\subsection{Input expression matrix for \pkg{CellScore} functions}
The \vars{ExpressionSet} objects for the standards and the test samples
should just be combined to yield one \vars{ExpressionSet} object.

\subsection{Input table of cell transitions for \pkg{CellScore} functions}
A separate data frame should define the cell transitions to be evaluated,
based on the donor and target cell type. This data frame must contain the
following columns:
\begin{itemize}
    \item \code{start}: the donor cell type, using the same abbreviations
    as in the \code{general_cell_type} column of the \vars{phenotype data frame}.
    \item \code{test}: the engineered cell type, using the same
    abbreviations as in the \code{sub_cell_type1} column of the
    \vars{phenotype data frame}.
    \item \code{target}: the target cell type, using the same abbreviations
    as in the \code{general_cell_type} column of the \vars{phenotype data frame}.
\end{itemize}

\end{appendices}

\end{document}