% automatic manuscript creation for spkTools
% -*- mode: noweb; noweb-default-code-mode: R-mode; -*-
%\VignetteIndexEntry{spkTools: Spike-in Data Analysis and Visualization}
%\VignetteDepends{Biobase,RColorBrewer,gtools,xtable}
%\VignettePackage{spkTools}
\documentclass[12pt]{article}
\usepackage{hyperref}
\usepackage[authoryear, round]{natbib}

\textwidth=6.2in
\textheight=8.5in
\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand\Rpackage[1]{{\textsf{#1}\index{#1 (package)}}}
\newcommand\dataset[1]{{\textit{#1}\index{#1 (data set)}}}
\newcommand\Rclass[1]{{\textit{#1}\index{#1 (class)}}}
\newcommand\Rfunction[1]{{{\small\texttt{#1}}\index{#1 (function)}}}
\newcommand\Rfunarg[1]{{\small\texttt{#1}}}
\newcommand\Robject[1]{{\small\texttt{#1}}}

\author{Matthew N. McCall}

\begin{document}
\title{Tools for Spike-in Data Analysis and Visualization (spkTools)}
\maketitle
\tableofcontents

\section{Introduction}
  As the number of users of microarray technology continues to grow,
  so does the importance of platform assessments and
  comparisons. Spike-in experiments have been successfully used for
  internal technology assessments by microarray manufacturers and for
  comparisons of competing data analysis approaches. The microarray
  literature is saturated with statistical assessments based on
  spike-in experiment data. Unfortunately, the statistical assessments
  vary widely and are applicable only in specific cases. This has
  introduced confusion into the debate over best practices with
  regards to which platform, protocols, and data analysis tools are
  best. Furthermore, cross-platform comparisons have proven difficult
  because reported concentrations are not comparable. We present a
  novel statistical solution that enables cross-platform comparisons,
  and propose a comprehensive procedure for assessments based on
  spike-in experiments. The ideas are implemented in a user friendly
  Bioconductor package: \Rpackage{spkTools}.

  This document describes \Rpackage{spkTools}, which implements
  the methods suggested in the Nucleic Acids Research paper,
  \emph{Consolidated strategy for the analysis of microarray spike-in data}
  (McCall \& Irizarry 2008). While ideally someone interested in using
  this package would use that paper as a guide, sections relating specifically to
  the \Rpackage{spkTools} package are included here.

\section{SpikeInExpressionSet}
  This package defines a new S4 class that extends the
  \Rclass{ExpressionSet} class to include a matrix of
  nominal concentrations; this new class is called a
  \Rclass{SpikeInExpressionSet}. The functions implemented in this
  package take an object of this type as their input and
  produce the tables and plots presented in this paper. Of particular
  interest is the function \Rfunction{spkAll}, which is a wrapper
  function for all the functions contained in this package. When run
  on a \Rclass{SpikeInExpressionSet} object, it produces the full
  complement of tables and plots shown in this paper and saves them
  with easily recognizable file-names. Although this package was
  designed with the intent of producing the full array of results for
  each experiment, the functions can also be applied separately with a
  few exceptions where the output of one function is required as the
  input of another.

\section{spkTools Methods}

\subsection{Relating nominal concentrations across data sets}
  Our solution to the problem of mapping was to replace each nominal
  concentration with the average log expression across arrays (ALE)
  for genes spiked in at that concentration.  This approach assures
  that performance assessments based on spike-in data are related to
  expression measurements that are defined consistently across
  platforms: low, medium, and high ALE values correspond to low,
  medium, and high observed expression values respectively.

\subsection{Accuracy assessment}
With the ALE values in place, we were ready to adapt some of the
existing statistical assessments to cross-platform comparisons. We
started with a basic assessment of accuracy: the \emph{signal
  detection slope}. Microarrays are designed to
  measure the abundance of sample RNA. In principle, we expect a
  doubling of nominal concentration to result in a doubling of
  observed intensity. In other words, on the $\log_{2}$ scale, the
slope from the regression of expression on nominal concentration can
be interpreted as the expected observed difference when the true
difference is a fold change of 2. Thus, an optimal result is a slope
of one, and values higher and lower than one are associated with over
and under estimation respectively.

\subsection{ALE strata}
It has been noted that at very high and very low concentrations one
typically observes lower slopes compared to those seen at medium
concentrations. To address this, we consider the
signal detection slopes for genes spiked-in at low, medium, and high
ALE values. We implemented a data-driven approach to selecting
  these two cut-offs.

We defined $f$ to be the function that maps nominal log
  concentration $x$ to expected observed concentration $f(x)$. Using a
  cubic spline, fitted to the observed data, we obtained a parametric
  representation of $f$. We then looked for concentrations for which
  clear changes in sensitivity occurred, i.e. values of $x$ with large
  slope changes. Note that large changes in slope result in local
  maxima in the absolute value of the second derivative of $f$. For
  each platform, the absolute value of the second derivative $f''$
  showed two clear local maxima. For each
  platform we mapped each concentration $x$ to its corresponding
  empirical percentile $\Phi(x)$ and plotted $|f''(x)|$ against
  $\Phi(x)$. The percentiles that maximized
  the slope change were similar across platforms. The modes for the
  average curve were 0.615 and 0.993. Therefore, for the purpose of
  this comparison, we assigned as \emph{low}, ALE values less than the
  60th percentile of the distribution of background RNA. Similarly we
  defined as \emph{high} ALE values above the 99th percentile. The
  remaining ALE values, between the 60th and 99th percentile, were
  denoted as \emph{medium}. Our choice of cut-points was further
  motivated by observing that for the Affymetrix data the 60th
  percentile provided a good cut-off for distinguishing genes called
  present from genes called absent.

\subsection{Precision assessments}
To complete our comparison we needed to assess specificity. Because the
majority of microarray studies rely on relative measures (e.g. fold
change) as opposed to absolute ones, we focused on the precision of
the basic unit of relative expression: log-ratios.  We adapted the
precision assessment of Cope et al. that focused on
the variability of log-ratios generated by comparisons expected to
produce log-ratios of 0.  Our set of comparisons was created by making
all possible comparisons between spiked-in transcripts across arrays
in which they had the same nominal concentration and from all possible
comparisons within the background RNA. We referred to this group of
comparisons as the \emph{Null} set. The standard deviation (SD) of
these log-ratios served as a basic assessment of precision and has a
useful interpretation: it is the expected range of observed log-ratios
for genes that are not differentially expressed. 

Because specificity varies with nominal concentration, we stratified
these comparisons into low, medium, and high ALE values. Many outliers
were observed on each platform. This was expected given the documented
problem of cross-hybridization. Because a platform with larger SD and
small outliers might be preferable to one with a smaller SD but large
outliers we included the 99.5th percentile of the null distribution as
a second summary assessment of specificity. Note that in a typical
experiment close to 0.5\% of null genes are expected to exceed this
value, which translates to approximately 100 genes on whole genome
arrays. We also included comparisons of spike-ins expected to
yield a certain fold change. These serve to further demonstrate the
variability of relative expression across ALE strata. They also serve
as a rough illustration of the accuracy of log-ratios for each ALE
strata.

\subsection{Performance assessments}
Precision and accuracy assessments on their own may not be of much
practical use. However, the summary statistics described above can be easily combined to answer any practical question, as long as
it can be posed in a statistical context. We focus on two summaries
  related to the common problem of detecting differentially expressed
  genes. Note that we purposely developed summaries that do not
  directly penalize for a lack of accuracy and precision as long as the
  real differences are detected. However, as expected, detection
  ability was highly dependent on accuracy and precision.

For the first example, we computed the chance that, when comparing two
samples, a gene with true log fold change $\Delta=1$ will appear in a
list of the top 100 genes (highest log-ratios). We refer to this
quantity as the \emph{probability of being at the top} (POT) and
recommend computing it separately in each ALE strata. Specifically, we
assume that the log-ratios in each ALE strata follow a normal
distribution with mean and variance estimated from the data (accuracy
slope and standard deviation) and compute the probability
that a random variable from that distribution exceeds the 99.5th
percentile of the null distribution.

As a second example, we computed the expected size of a gene list
  one would have to consider to find $n$ genes that have a true log
  fold change $\Delta$. To perform this calculation we assumed $m_1$
  genes were differentially expressed and $m_0$ were not. Note that
  $m_1+m_0$ is the number of genes on the array. Furthermore, we
  assumed that the true log-ratios in each ALE strata followed a
  normal distribution with mean and variance estimated from the data
  (accuracy slope and standard deviation). The empirical
  distribution was used for the null genes.  With these assumptions in
  place we computed the gene list size for $n=10$, $m_1=100$, and
  $m_0=10000$, we calculate the gene list size, $N$, required to
  obtain $n=10$ true fold changes.  We refer to this
  quantity as the \emph{gene-list needed to detect $n$ true-positives
    (GNN)}. Again, we recommend computing it separately in each ALE
  strata.

\subsection{Imbalance measure} 
Those interested in taking advantage of our methodology should know
that an important requirement is a spike-in experimental design that
does not confound nominal concentrations and genes. A large source of
variability in microarray data is the probe-effect and these vary
across platforms. We fitted an ANOVA model to describe the probe
effect for each platform. Note that if nominal concentrations are
confounded with genes, it becomes impossible to separate differences
due to signal detection from differences in probe affinities. Many of
the previously published spike-in experiments suffer from this
confounding effect. To quantify design imbalance we used the following
measure of imbalance developed by Wu (1981):
\begin{equation}
IB = \sum_{i=1}^p \lambda_{i} \sum_{u_{i}}^{r_{i}} \Big( \sum_{t=1}^T
n_{t}^2(u_{i})-\frac{1}{T}n^2(u_i) \Big)
\end{equation}
where $i$ denotes each covariate, $\lambda_{i}$ an optional weight
associated with each covariate, $u_{i}$ are the possible levels for
covariate $i$, $t$ represents the treatment levels, $n_{t}(u_{i})$ is
the number of units with its $i$th covariate at level $u_{i}$
receiving treatment $t$, and $n(u_{i})$ is the total number of units
with its $i$th covariate at level $u_{i}$. In our
case, the two covariates are probe and array, and the treatment is
nominal concentration. Since imbalance is defined as a weighted sum of
the imbalance due to each covariate, we chose to report the probe and
array imbalance separately to give a better understanding of the
source of the imbalance in each design. In order to not penalize large
designs, we divided the probe imbalance by the number of probes and
the array imbalance by the number of arrays. 

\section{Example}

<<echo=T,results=hide>>=
library(spkTools)
@ 
Load a SpikeInExpressionSet object :
<<echo=T,results=hide>>=
data(affy)
object <- affy
@ 

Set a few parameters:
<<echo=T,results=hide>>=
fc=2
label="Affymetrix"
par(mar=c(3,2.5,2,0.5), cex=1.8)
@ 

\newpage

\begin{figure}[!h]
\begin{center}
<<fig=TRUE,echo=T>>=
spkSlopeOut <- spkSlope(object, label, pch="+")
@
\end{center} {{\bf Observed versus nominal values}: This plot
  depicts expression values plotted against the log (base 2) of the
  reported nominal concentration. The regression slope obtained
  utilizing all the data and the regression slopes obtained within each
  ALE value strata are shown. The slope of each line is reported in
  the legend. The vertical lines divide the ALE strata.}
\end{figure}

\newpage

\begin{figure}[!h]
\begin{center}
<<fig=TRUE,echo=T>>=
spkDensity(object, spkSlopeOut, cuts=TRUE, label)
@
\end{center} {{\bf Empirical densities}: This plot depicts the
  empirical density of the average (across arrays) expression values
  for the background RNA. The tick marks on the x-axis show the
  average expression at each nominal concentration. The dotted lines
  represent the cut points for low, medium, and high ALE values.}
\end{figure}

\newpage

\begin{figure}[!h]
\begin{center}
<<echo=T,fig=T>>=
spkBoxOut <- spkBox(object, spkSlopeOut, fc)
plotSpkBox(spkBoxOut, fc, ylim=c(-1.5,2.5))
sbox <- summarySpkBox(spkBoxOut)
@
\end{center} {{\bf Log-ratio distributions}: This plot depicts
  the distribution of observed log ratios for a given nominal fold
  change. The log ratios are stratified by the ALE strata into which
  the two nominal concentrations fall. The null distributions'
  log-ratios are divided into background RNA (Bg-Null) and spike-ins
  at the same nominal concentration (S-Null), for each bin. The dotted
  horizontal lines represent the expected or nominal log-ratios: zero
  for the null distribution and one for the other comparisons.}
\end{figure}

\newpage

\begin{figure}[!h]
\begin{center}
<<echo=T,fig=T>>=
spkMA(object, spkSlopeOut, fc, label=label, ylim=c(-2.5,2.5))
@ 
\end{center} {{\bf MA plots}: For each platform, we performed all
  pair-wise comparisons of the arrays. From each comparison we
  computed the log-ratio (M) and average expression value (A) for each
  gene. These plots show M plotted against A. To avoid drawing
  hundreds of points on top of each other we use a smooth scatter plot
  which shows the distribution of these points: dark and light shades
  of blue show high and low concentrations of points
  respectively. Points not associated with the spike-in transcripts
  (expected M=0) that achieved fold changes above 2 are shown as large
  blue dots. The points associated with spike-in transcripts with
  nominal fold changes of 2 are shown as triangles. The different
  colors denote the ALE groups.}
\end{figure}

\newpage

<<results=tex,echo=T>>=
vtmp <- spkVar(object)
sv <- as.numeric(vtmp[,2][-nrow(vtmp)])
bin <- c("Low", "Med", "High")
bins <- bin[spkSlopeOut$breaks[2,]]
tab1 <- data.frame(NominalConc=2^spkSlopeOut$breaks[1,],
                   AvgExp=round(spkSlopeOut$avgExp,1),
                   PropGenesBelow=round(spkSlopeOut$prop,2),
                   ALEStrata=bins,
                   SD=round(sv,2))
colnames(tab1) <- c("Nominal Conc",
                    "Avg Expression",
                    "Prop of Genes Below",
                    "ALE Strata",
                    "Std Dev")
@ 

<<results=tex,echo=F>>=
library(xtable)
tab1x <- xtable(tab1)
print(tab1x)
@ 
{{\bf Nominal concentration to ALE mapping}: This table contains
  summary measures specific to each nominal spike-in level.  The first
  column shows the nominal concentrations as originally reported. The
  second column shows the average of all observed expression values
  associated with the row's nominal concentration. The third column
  shows the proportion of background RNA with expression values less
  than the average expression value. The fourth column shows the ALE
  strata associated with the row's nominal
  concentration.  Finally, the fifth column shows the standard
  deviation of all observed expression values associated with the
  row's nominal concentration.}

\newpage

<<results=tex,echo=T>>=
AccuracySlope <- round(spkSlopeOut$slopes[-1], digits=2)
AccuracySD <- round(spkAccSD(object, spkSlopeOut), digits=2)
pot <- spkPot(object, spkSlopeOut, AccuracySlope, AccuracySD, 
              precisionQuantile=.995)
PrecisionSD <- round(sbox$madFC[1:3], digits=2)
PrecisionQuantile <- round(pot$quantiles, digits=2)
SNR <- round(AccuracySlope/PrecisionSD, digits=2)
POT <- round(pot$POTs, digits=2)
tab2 <- data.frame(AccuracySlope=AccuracySlope,
                   AccuracySD=AccuracySD,
                   PrecisionSD=PrecisionSD,
                   PrecisionQuantile=PrecisionQuantile,
                   SNR=SNR,
                   POT=POT)
@ 
<<results=tex,echo=F>>=
tab2x <- xtable(tab2)
print(tab2x)
@ 
{{\bf Assessment results}: For each of the ALE strata we
  report summary assessments for accuracy, precision, and overall
  performance. The first column shows the signal detection slope which
  can be interpreted as the expected observed difference when the true
  difference is a fold change of 2. In parenthesis is the standard
  deviation of the log-ratios associated with non-zero nominal
  log-ratios. The second column shows the standard deviation of null
  log-ratios. The SD can be interpreted as the expected range of
  observed log-ratios for genes that are not differentially
  expressed. The third column shows the 99.5th percentile of the null
  distribution. It can be interpreted as the expected minimum value
  that the top 100 non-differentially expressed genes will reach. The
  fourth column shows the ratio of the values in column 1 and column
  2. It is a rough measure of signal to noise ratio. The fifth column
  shows the probability that, when comparing two samples, a gene with
  a true log fold change of 2 will appear in a list of the 100 genes
  with the highest log-ratios.}

\newpage

<<results=tex,echo=T>>=
bals <- round(spkBal(object))
anv <- round(spkAnova(object), digits=2)
tab3 <- t(c(anv,bals))
@ 
<<results=tex,echo=F>>=
tab3x <- xtable(tab3)
print(tab3x)
@ 
{{\bf ANOVA results}: To understand the variability contributed by
  differences in nominal concentrations, probe effect, and array, we
  fitted a 3-way ANOVA model containing only main effects to the
  expression values from the spike-in transcripts. The estimated
  standard deviation of each effect is shown in the first three
  columns. The forth column shows the standard deviation of the error
  term. Finally, a measure of the amount of confounding between
  nominal concentration and the other two effects is included in
  columns five and six. We use the measure presented by Wu in
  Technometrics (1981), Volume 23, Number 1. An optimal design, such
  as a Latin Square, will have a measure of 0 for each imbalance. The
  more confounding the larger these values. Because Affymetrix using a
  latin square design, there is no imbalance.}

\nocite{*}

\bibliographystyle{plainnat}
\bibliography{spkDoc}

\end{document}