\documentclass[a4paper]{article}

%\VignetteIndexEntry{Missing value imputation}

\usepackage{hyperref}

\title{Imputing missing values using the pcaMethods package}
\author{Wolfram Stacklies and Henning Redestig\\
CAS-MPG Partner Institute for Computational Biology (PICB)\\
Shanghai, P.R. China \\
and\\
Max Planck Institute for Molecular Plant Physiology\\
Potsdam, Germany\\
\url{http://bioinformatics.mpimp-golm.mpg.de/}
}
\date{\today}

\begin{document}
\setkeys{Gin}{width=1.0\textwidth}
@
\maketitle

\section{Missing value imputation}
One application for missing value robust principal component analysis
is that it effectively can be used to impute the missing values and
thus obtain an estimated complete data set. The pcaMethods package was
partly written with this application in mind.

PCA is a way of creating a model of a matrix, $X$, by defining two
parameter matrices, the scores, $T$, and the loadings, $P$, which
together have less values than the original matrix but when multiplied
with each other well reconstruct the original matrix. I.e.:
$$X=1\times{}\bar{x} + TP' + E$$
where $E$ is the error matrix and $1\times{}\bar{x}$ denotes the
original variable averages. Now if $X$ contains missing values but we
still are able to get complete estimates of $P$ and $T$ than we can
use:
$$\hat{X}=1\times{}\bar{x} + TP'$$
as an estimate for $x_{i,j}$ if $x_{i,j}$ is missing. 

This is can be done as the following example illustrates. First we
attach the metabolite data set with missing values.
<<results=hide>>=
library(pcaMethods)
@
<<>>=
data(metaboliteData)
mD <- metaboliteData
sum(is.na(mD))
@ 
Now we get the estimated data set by using PPCA and three principal components.
<<>>=
pc <- pca(mD, nPcs=3, method="ppca")
imputed <- completeObs(pc)
@ 
If we compare with the original values we see  that the error is
rather low.
<<>>=
data(metaboliteDataComplete)
mdComp <- metaboliteDataComplete
sum((mdComp[is.na(mD)] - imputed[is.na(mD)])^2) / sum(mdComp[is.na(mD)]^2)
@
When using a different PCA algorithm, we get different performance.
<<>>=
imputedNipals <- completeObs(pca(mD, nPcs=3, method="nipals"))
sum((mdComp[is.na(mD)] - imputedNipals[is.na(mD)])^2) / sum(mdComp[is.na(mD)]^2)
@ 

If the data we are interested in was gene expression set of class
'ExpressionSet' we could simply do
<<>>= 
library(Biobase)
data(sample.ExpressionSet)
exSet <- sample.ExpressionSet
exSetNa <- exSet
exprs(exSetNa)[sample(13000, 200)] <- NA
lost <- is.na(exprs(exSetNa))
pc <- pca(exSetNa, nPcs=2, method="ppca")
impExSet <- asExprSet(pc, exSetNa)
sum((exprs(exSet)[lost] - exprs(impExSet)[lost])^2) / sum(exprs(exSet)[lost]^2)
@ 
Different results will be obtained with different PCA
algorithms. Which one to use depends on the general structure of the
data set and the imputation performance can be estimated by
cross-validation. Please see the 'introduction' vignette on further
details on how to use the cross-validation capabilities of this
package.

\end{document}