%\VignetteIndexEntry{PCA, PLS and OPLS with \emph{ropls}}
%\VignetteKeywords{ropls} \\
%\VignettePackage{ropls} \\

\documentclass[a4paper,11pt]{article}
\usepackage[table]{xcolor}
\usepackage[nottoc, notlof, notlot]{tocbibind}
\setlength{\parindent}{0cm}

<<style-Sweave, eval=TRUE, echo=FALSE, results=tex>>=
BiocStyle::latex()
@

\begin{document}
\title{\emph{ropls}: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and feature selection of omics data}
\author{Etienne A. Th\'evenot}
\maketitle

\vspace{1\baselineskip}
\vspace{1\baselineskip}
\vspace{1\baselineskip}

\tableofcontents

\pagebreak

\setkeys{Gin}{width=0.6\textwidth}


\section{The \emph{ropls} package}


The \emph{ropls} R package implements the PCA, PLS(-DA) and OPLS(-DA) approaches with the original, NIPALS-based, versions of the algorithms \cite{Wold2001,Trygg2002}. It includes the R2 and Q2 quality metrics \cite{Eriksson2001,Tenenhaus1998}, the permutation diagnostics \cite{Szymanska2012}, the computation of the VIP values \cite{Wold2001}, the score and orthogonal distances to detect outliers \cite{Hubert2005}, as well as many graphics (scores, loadings, predictions, diagnostics, outliers, etc).

\vspace{1\baselineskip}

The functionalities from \emph{ropls} can also be accessed via a graphical user interface in the \emph{Multivariate} module from the \emph{Workflow4Metabolomics.org} online resource for computational metabolomics (\url{http://workflow4metabolomics.org/}) which provides a user-friendly, web-based environment for data pre-processing, statistical analysis, and annotation \cite{Giacomoni2015}.


\section{Context}


\subsection{Orthogonal Partial Least-Squares}

Partial least-squares, which is a latent variable regression method based on covariance between the predictors and the response, has been shown to efficiently handle datasets with multi-collinear predictors, as in the case of spectrometry measurements \cite{Wold2001}. More recently, Trygg and Wold introduced the orthogonal projection to latent structures (OPLS) algorithm to model separately the variations of the predictors correlated and orthogonal to the response \cite{Trygg2002}.

\vspace{1\baselineskip}

OPLS has a similar predictive capacity compared to PLS and improves the interpretation of the predictive components and of the systematic variation \cite{Pinto2012a}. In particular, OPLS modeling of single responses only requires one predictive component.

\vspace{1\baselineskip}

Diagnostics such as the Q2Y metrics and permutation testing are of high importance to avoid overfitting and assess the statistical significance of the model. The variable importance in projection (VIP), which reflects both the loading weights for each component and the variability of the response explained by this component \cite{Pinto2012a,Mehmood2012} is often used for feature selection \cite{Trygg2002,Pinto2012a}.


\subsection{OPLS software}

OPLS is available in the SIMCA-P commercial software (Umetrics, Ume\r{a}, Sweden; \cite{Eriksson2001}). In addition, the kernel-based version of OPLS \cite{Bylesjo2008a} is available in the open-source R statistical environment \cite{RDCT2008}, and a single implementation of the linear algorithm in R has been described recently \cite{Gaude2013}.


\section{The \emph{sacurine} metabolomics dataset}


\subsection{Study objective}

The objective was to study the influence of age, body mass index and gender on metabolite concentrations in urine, by analysing 183 samples from a cohort of adults with liquid chromatography coupled to high-resolution mass spectrometry (\cite{Thevenot2015}).


\subsection{Pre-processing and annotation}

Urine samples were analyzed by using an LTQ Orbitrap in the negative ionization mode. A total of 109 metabolites were identified or annotated at the MSI level 1 or 2. After retention time alignment with XCMS, peaks were integrated with Quan Browser. Signal drift and batch effect were corrected, and each urine profile was normalized to the osmolality of the sample. Finally, the data were log10 transformed.


\subsection{Covariates}

The volunteers' \emph{age}, \emph{body mass index}, and \emph{gender} were recorded.


\section{Hands-on}


\subsection{Loading}

We first load the \emph{ropls} package:

<<>>=
library(ropls)
@

We then load the \emph{sacurine} dataset which contains:

\begin{enumerate}
\item The \emph{dataMatrix} matrix of numeric type containing the intensity profiles (log10 transformed)
\item The \emph{sampleMetadata} data frame containg sample metadata
\item The \emph{variableMetadata} data frame containg variable metadata
\end{enumerate}

<<>>=
data(sacurine)
names(sacurine)
@

We attach \emph{sacurine} to the search path and display a summary of the content of the \emph{dataMatrix}, \emph{sampleMetadata} and \emph{variableMetadata} with the \emph{strF} \textbf{F}unction of the \emph{ropls} package (see also \emph{str})

<<>>=
attach(sacurine)
strF(dataMatrix)
strF(sampleMetadata)
strF(variableMetadata)
@


\subsection{Principal Component Analysis (PCA)}

We perform a PCA on the \emph{dataMatrix} matrix (samples as rows, variables as columns):

<<echo=TRUE, eval=FALSE>>=
sacurine.pca <- opls(dataMatrix)
@

<<echo=FALSE>>=
sacurine.pca <- opls(dataMatrix, plotL = FALSE)
@

\begin{figure}[!h]
\begin{center}
<<label=figPca,fig=TRUE,echo=FALSE,results=hide>>=
layout(matrix(1:4, nrow = 2, byrow = TRUE))
for(typeC in c("overview", "outlier", "x-score", "x-loading"))
plot(sacurine.pca, typeVc = typeC, parDevNewL = FALSE)
@
\end{center}
\caption{PCA \texttt{summary} plot. Top left \texttt{overview}: the \emph{scree} plot (i.e., inertia barplot) suggests that 3 components may be sufficient to capture most of the inertia; Top right \texttt{outlier}: this graphics shows the distances within and orthogonal to the projection plane (\cite{Hubert2005}): the name of the samples with a high value for at least one of the distances are indicated; Bottom left \texttt{x-score}; Bottom right \texttt{x-loading}.}
\label{fig:pca}
\end{figure}

A summary of the model (8 components were selected) is printed. In addition the default \texttt{summary} figure is displayed (Figure~\ref{fig:pca}).

Note:
\begin{enumerate}
\item Since \emph{dataMatrix} does not contain missing value, the singular value decomposition was used by default; NIPALS can be selected with the \texttt{algoC} argument specifying the \textbf{algo}rithm (\textbf{C}haracter),
\item The \texttt{predI = NA} default number of \textbf{pred}ictive components (\textbf{I}nteger) means that only components with a variance superior to the mean variance of all components are kept (note that this rule requires all components to be computed and can be quite time-consuming for large datasets with the NIPALS algorithm; in such cases, one may specify a limited number of components with the \texttt{predI} parameter).
\end{enumerate}

Let us see if we notice any partition according to gender, by labeling/coloring the samples according to the gender and drawing the Mahalanobis ellipses for the male and female subgroups (Figure \ref{fig:pcaMah}).

<<eval=FALSE,results=hide>>=
genderFc <- sampleMetadata[, "gender"]
plot(sacurine.pca, typeVc = "x-score",
parAsColFcVn = genderFc, parEllipsesL = TRUE)
@

\begin{figure}[!h]
\begin{center}
<<label=figPcaMah,fig=TRUE,echo=FALSE,results=hide>>=
genderFc <- sampleMetadata[, "gender"]
plot(sacurine.pca, typeVc = "x-score",
parAsColFcVn = genderFc, parEllipsesL = TRUE, parDevNewL = FALSE)
@

\end{center}
\caption{PCA \texttt{score} plot colored according to \emph{gender}. The displayed components can be specified with \texttt{parCompVi} (plotting \textbf{par}ameter specifying the \textbf{Comp}onents: \textbf{V}ector of 2 \textbf{i}ntegers)}
\label{fig:pcaMah}
\end{figure}

Note that the plotting \textbf{par}ameter to be used \textbf{As} \textbf{Col}ors (\textbf{F}actor of \textbf{c}haracter type or \textbf{V}ector of \textbf{n}umeric type) has a length equal to the number of rows of the dataMatrix matrix (ie of samples) and that this qualitative or quantitative variable is converted into colors (by using an internal palette or color scale, respectively). We could have visualized the \emph{age} of the individuals by specifying \texttt{parAsColFcVn = sampleMetadata[, "age"]}.


\subsection{Partial least-squares: PLS and PLS-DA}

For PLS (and OPLS), the \textbf{Y} response(s) must be provided. \textbf{Y} can be either a numeric vector (respectively matrix) for single (respectively multiple) (O)PLS regression, or a character factor for (O)PLS-DA classification as in the following example (Figure \ref{fig:pls}).

<<eval=FALSE,results=hide>>=
sacurine.plsda <- opls(dataMatrix, genderFc)
@
<<echo=FALSE>>=
sacurine.plsda <- opls(dataMatrix, genderFc, plotL = FALSE)
@

\begin{figure}[!h]
\begin{center}
<<label=figPls,fig=TRUE,echo=FALSE,results=hide>>=
layout(matrix(1:4, nrow = 2, byrow = TRUE))
for(typeC in c("permutation", "overview", "outlier", "x-score"))
plot(sacurine.plsda, typeVc = typeC, parDevNewL = FALSE)
@
\end{center}
\caption{PLS-DA model of the \emph{gender} response. The default \texttt{overview} figure displays: Top left: \emph{significance} diagnostic: the $R2Y$ and $Q2Y$ of the model are compared with the corresponding values obtained after random permutation of the \textbf{y} response; Top right: \emph{inertia} barplot: the graphic here suggests that 3 orthogonal components may be sufficient to capture most of the inertia; Bottom left: \emph{outlier} diagnostics (see Figure~\ref{fig:pca}); Bottom right: \emph{X-score} plot: the number of components and the cumulative $R2X$, $R2Y$ and $Q2Y$ are indicated below the plot.}
\label{fig:pls}
\end{figure}

Note:
\begin{enumerate}
\item When set to NA (as in the default), the number of components \texttt{predI} is determined automatically as follows (\cite{Eriksson2001}): A new component \emph{h} is added to the model if :
\begin{enumerate}
\item $R2Y_h \geq 1\%$, i.e., the percentage of Y dispersion (i.e., sum of squares) explained by component \emph{h} is more than 1\%, and
\item $Q2Y_h=1-PRESS_h/RSS_{h-1} \geq 0$ for PLS (or 5\% when the number of samples is less than 100) or 1\% for OPLS: $Q2Y_h \geq 0$ means that the predicted residual sum of squares ($PRESS_h$) of the model including the new component \emph{h} estimated by 7-fold cross-validation is less than the residual sum of squares ($RSS_{h-1}$) of the model with the previous components only (with $RSS_0$ being the sum of squared Y values).
\end{enumerate}
\item The predictive performance of the full model is assessed by the cumulative Q2Y metric: $Q2Y=1-\prod\limits_{h=1}^r (1-Q2Y_h)$. We have $Q2Y \in [0,1]$, and the higher the Q2Y, the better the performance. Models trained on datasets with a larger number of features compared with the number of samples can be prone to overfitting: in that case, high Q2Y values are obtained by chance only (see section~\ref{subsubsec:overfit}). To estimate the significance of Q2Y (and R2Y) for single response models, permutation testing can be used \cite{Szymanska2012}: models are built after random permutation of the Y values, and $Q2Y_{perm}$ are computed. The \textit{p}-value is equal to the proportion of $Q2Y_{perm}$ above $Q2Y$ (the default number of permutations is 20 as a compromise between quality control and computation speed; it can be increased with the \texttt{permI} parameter, e.g. to 1,000, to assess if the model is significant at the 0.05 level),
\item The NIPALS algorithm is used for PLS (and OPLS); \emph{dataMatrix} matrices with (a moderate amount of) missing values can thus be analysed.
\end{enumerate}

We see that our model with 3 predictive (\texttt{pre}) components has significant and quite high R2Y and Q2Y values.

\subsection{Orthogonal partial least squares: OPLS and OPLS-DA}

To perform OPLS(-DA), we set \texttt{orthoI} (number of components which are \textbf{ortho}gonal; \textbf{I}nteger) to either a specific number of orthogonal components, or to NA. Let us build an OPLS-DA model of the \emph{gender} response (Figure \ref{fig:opl}). Note that for OPLS modeling of a single response, the number of predictive component is 1.

<<eval=FALSE,results=hide>>=
sacurine.oplsda <- opls(dataMatrix, genderFc,
predI = 1, orthoI = NA)
@
<<echo=FALSE>>=
sacurine.oplsda <- opls(dataMatrix, genderFc,
predI = 1, orthoI = NA, plotL = FALSE)
@

\begin{figure}[!h]
\begin{center}
<<label=figOpl,fig=TRUE,echo=FALSE,results=hide>>=
layout(matrix(1:4, nrow = 2, byrow = TRUE))
for(typeC in c("permutation", "overview", "outlier", "x-score"))
plot(sacurine.oplsda, typeVc = typeC, parDevNewL = FALSE)
@
\end{center}
\caption{OPLS-DA model of the \emph{gender} response. In the (\texttt{score} plot), the predictive component is displayed as abscissa and the (selected; default = 1) orthogonal component as ordinate.}
\label{fig:opl}
\end{figure}

Let us assess the predictive performance of our model. We first train the model on a subset of the samples (here we use the \texttt{odd} subset value which splits the data set into two halves with similar proportions of samples for each class; alternatively, we could have used a specific subset of indices for training):

<<eval=FALSE,results=hide>>=
sacurine.oplsda <- opls(dataMatrix, genderFc, predI = 1, orthoI = NA, subset = "odd")
@
<<echo=FALSE>>=
sacurine.oplsda <- opls(dataMatrix, genderFc, predI = 1, orthoI = NA, permI = 0, subset = "odd", plotL = FALSE)
@

We compute the performances on the training subset:

<<>>=
trainVi <- getSubsetVi(sacurine.oplsda)
table(genderFc[trainVi], fitted(sacurine.oplsda))
@

We then compute the performances on the test subset:

<<>>=
table(genderFc[-trainVi],
      predict(sacurine.oplsda, dataMatrix[-trainVi, ]))
@

As expected, the predictions on the test subset are (slightly) lower. The classifier however still achieves 91\% of correct predictions.

\subsection{Comments}

\subsubsection{Overfitting}
\label{subsubsec:overfit}

Overfitting (i.e., building a model with good performances on the training set but poor performances on a new test set) is a major caveat of machine learning techniques applied to data sets with more variables than samples. A simple simulation of a random \textbf{X} data set and a \textbf{y} response shows that perfect PLS-DA classification can be achieved as soon as the number of variables exceeds the number of samples, as detailed in the example below (Figure~\ref{fig:overfit}; \cite{Wehrens2011}). It is therefore essential to check that the $Q2Y$ value of the model is significant by random permutation of the labels.

\begin{figure}[!h]
\begin{center}
<<label=figOverfit,echo=FALSE,fig=TRUE,results=hide>>=
set.seed(123)
obsI <- 20
featVi <- c(2, 20, 200)
featMaxI <- max(featVi)
xRandMN <- matrix(runif(obsI * featMaxI), nrow = obsI)
yRandVn <- sample(c(rep(0, obsI / 2), rep(1, obsI / 2)))

layout(matrix(1:4, nrow = 2, byrow = TRUE))

for(featI in featVi) {
    randPlsi <- opls(xRandMN[, 1:featI], yRandVn,
                  predI = 2,
                  permI = ifelse(featI == featMaxI, 100, 0),
                  plotL = FALSE)
    plot(randPlsi, typeVc = "x-score", parDevNewL = FALSE,
         parCexN = 1.3, parTitleL = FALSE)
    mtext(featI/obsI, font = 2, line = 2)
    if(featI == featMaxI)
         plot(randPlsi, typeVc = "permutation", parDevNewL = FALSE,
           parCexN = 1.3)
    }
mtext(" obs./feat. ratio:", adj = 0, at = 0, font = 2, line = -2, outer = TRUE)
@
\end{center}
\caption{PLS overfit when the number of features exceeds the number of observations. A random matrix \textbf{X} of 20 observations $\times$ 200 features was generated by sampling from the uniform distribution $U(0, 1)$. A random \textbf{y} response was obtained by sampling (without replacement) from a vector of 10 zeros and 10 ones. Top left, top right, and bottom left: the X-score plots of the PLS modeling of \textbf{y} by the (sub)matrix of \textbf{X} restricted to the first 2, 20, or 200 features, are displayed (i.e., the observation/feature ratios are 0.1, 1, and 10, respectively). Despite the good separation obtained on the bottom left score plot, we see that the $Q2Y$ estimation of predictive performance is low (negative). Bottom right: a significant proportion of the models (in fact here all models) trained after random permutations of the labels have a higher $Q2Y$ value than the model trained with the true labels, confirming that PLS cannot specifically model the \textbf{y} response with the \textbf{X} predictors, as expected.}
\label{fig:overfit}
\end{figure}

<<eval=FALSE,results=hide>>=
set.seed(123)
obsI <- 20
featVi <- c(2, 20, 200)
featMaxI <- max(featVi)
xRandMN <- matrix(runif(obsI * featMaxI), nrow = obsI)
yRandVn <- sample(c(rep(0, obsI / 2), rep(1, obsI / 2)))

layout(matrix(1:4, nrow = 2, byrow = TRUE))

for(featI in featVi) {
    randPlsi <- opls(xRandMN[, 1:featI], yRandVn,
                  predI = 2,
                  permI = ifelse(featI == featMaxI, 100, 0),
                  plotL = FALSE)
    plot(randPlsi, typeVc = "x-score", parDevNewL = FALSE,
         parCexN = 1.3, parTitleL = FALSE)
    mtext(featI/obsI, font = 2, line = 2)
    if(featI == featMaxI)
         plot(randPlsi, typeVc = "permutation", parDevNewL = FALSE,
           parCexN = 1.3)
    }
mtext(" obs./feat. ratio:", adj = 0, at = 0, font = 2, line = -2, outer = TRUE)
@

\subsubsection{VIP from OPLS models}

The classical VIP metric is not useful for OPLS modeling of a single response since (\cite{Galindo-Prieto2014,Thevenot2015}):
\begin{enumerate}
\item VIP values remain identical whatever the number of orthogonal components selected,
\item VIP values are univariate (i.e., they do not provide information about interactions between variables).
\end{enumerate}
In fact, when features are standardized, we can demonstrate a mathematical relationship between VIP and \textit{p}-values from a Pearson correlation test (\cite{Thevenot2015}), as illustrated by the code below (Figure~\ref{fig:vip}). 

\begin{figure}[!h]
\begin{center}
<<label=figVip,echo=FALSE,fig=TRUE,results=hide>>=
ageVn <- sampleMetadata[, "age"]

pvaVn <- apply(dataMatrix, 2,
               function(feaVn) cor.test(ageVn, feaVn)[["p.value"]])

vipVn <- getVipVn(opls(dataMatrix, ageVn, predI = 1, orthoI = NA, plot = FALSE))

quantVn <- qnorm(1 - pvaVn / 2)
rmsQuantN <- sqrt(mean(quantVn^2))

par(font = 2, font.axis = 2, font.lab = 2, las = 1,
    mar = c(5.1, 4.6, 4.1, 2.1),
    lwd = 2, pch = 16)

plot(pvaVn, vipVn,
     col = "red",
     pch = 16,
     xlab = "p-value", ylab = "VIP", xaxs = "i", yaxs = "i")

box(lwd = 2)

curve(qnorm(1 - x / 2) / rmsQuantN, 0, 1, add = TRUE, col = "red", lwd = 3)

abline(h = 1, col = "blue")
abline(v = 0.05, col = "blue")
@
\end{center}
\caption{Relationship between VIP from one-predictive PLS or OPLS models with standardized variables, and \textit{p}-values from Pearson correlation test. The $(p_j, VIP_j)$ pairs corresponding respectively to the VIP values from OPLS modelling of the \emph{age} response with the \emph{sacurine} dataset, and the \textit{p}-values from the Pearson correlation test are shown as \emph{\textcolor{red}{red}} dots. The $y = \Phi^{-1}(1 - x/2) / z_{rms}$ curve is shown in \emph{\textcolor{red}{red}} (where $\Phi^{-1}$ is the inverse of the probability density function of the standard normal distribution, and $z_{rms}$ is the quadratic mean of the $z_j$ quantiles from the standard normal distribution; $z_{rms} = 2.6$ for the \emph{sacurine} dataset and the \emph{age} response). The vertical (resp. horizontal) \emph{\textcolor{blue}{blue}} line corresponds to univariate (resp. multivariate) thresholds of $p=0.05$ and $VIP=1$, respectively (\cite{Thevenot2015}).}
\label{fig:vip}
\end{figure}

<<eval=FALSE,results=hide>>=
ageVn <- sampleMetadata[, "age"]

pvaVn <- apply(dataMatrix, 2,
               function(feaVn) cor.test(ageVn, feaVn)[["p.value"]])

vipVn <- getVipVn(opls(dataMatrix, ageVn, predI = 1, orthoI = NA, plot = FALSE))

quantVn <- qnorm(1 - pvaVn / 2)
rmsQuantN <- sqrt(mean(quantVn^2))

par(font = 2, font.axis = 2, font.lab = 2, las = 1,
    mar = c(5.1, 4.6, 4.1, 2.1),
    lwd = 2, pch = 16)

plot(pvaVn, vipVn,
     col = "red",
     pch = 16,
     xlab = "p-value", ylab = "VIP", xaxs = "i", yaxs = "i")

box(lwd = 2)

curve(qnorm(1 - x / 2) / rmsQuantN, 0, 1, add = TRUE, col = "red", lwd = 3)

abline(h = 1, col = "blue")
abline(v = 0.05, col = "blue")
@

The VIP properties above result from:
\begin{enumerate}
\item OPLS models of a single response have a single predictive component,
\item in the case of one-predictive component (O)PLS models, the general formula for VIPs can be simplified to $VIP_j = \sqrt{m} \times |w_j|$ for each feature $j$, were $m$ is the total number of features and \textbf{w} is the vector of loading weights,
\item in OPLS, \textbf{w} remains identical whatever the number of extracted orthogonal components,
\item for a single-response model, \textbf{w} is proportional to $\textbf{X}' \textbf{y}$ (where $'$ denotes the matrix transposition),
\item if \textbf{X} and \textbf{y} are standardized, $\textbf{X}' \textbf{y}$ is the vector of the correlations between the features and the response.
\end{enumerate}

Galindo-Prieto et al. (2014) have recently suggested new VIP metrics for OPLS, $VIP_{pred}$ and $VIP_{ortho}$, to separately measure the influence of the features in the modeling of the dispersion correlated to, and orthogonal to the response, respectively (\cite{Galindo-Prieto2014}).

For OPLS(-DA) models, you can therefore get from the model generated with \emph{opls}:
\begin{itemize}
\item the predictive VIP vector (which corresponds to the $VIP_{4,pred}$ metric measuring the variable importance in prediction) with \texttt{getVipVn(model)},
\item the orthogonal VIP vector which is the $VIP_{4,ortho}$ metric measuring the variable importance in orthogonal modeling with \texttt{getVipVn(model, orthoL = TRUE)}.
\end{itemize}
As for the classical $VIP$, we still have the mean of $VIP_{pred}^2$ (and of $VIP_{ortho}^2$) which, each, equals 1.

\subsubsection{(Orthogonal) Partial Least Squares Discriminant Analysis: (O)PLS-DA}

\paragraph{Two classes}

When the \textbf{y} response is a factor of 2 levels (character vectors are also allowed), it is internally transformed into a vector of values $\in \{0,1\}$ encoding the classes. The vector is centered and unit-variance scaled, and the (O)PLS analysis is performed.

Brereton and Lloyd (2014) have demonstrated that when the sizes of the 2 classes are unbalanced, a bias is introduced in the computation of the decision rule, which penalizes the class with the highest size (\cite{Brereton2014}). In this case, an external procedure using resampling (to balance the classes) and taking into account the class sizes should be used for optimal results.

\paragraph{Multiclass}

In the case of more than 2 levels, the \textbf{y} response is internally transformed into a matrix (each class is encoded by one column of values $\in \{0,1\}$). The matrix is centered and unit-variance scaled, and the PLS analysis is performed.

In this so-called `PLS2' implementation, the proportions of 0 and 1 in the columns is usually unbalanced (even in the case of balanced size of the classes) and the bias described previously occurs (\cite{Brereton2014}). The multiclass PLS-DA results from \emph{ropls} are therefore indicative only, and we recommend to set an external procedure where each column of the matrix is modeled separately (as described above) and the resulting probabilities are aggregated (see for instance \cite{Bylesjo2006}).

\subsection{Closing session}

Before closing this example session, we detach \emph{sacurine} from the search path:

<<>>=
detach(sacurine)
@


\subsection{Session information}

<<eval=FALSE,results=hide>>=
sessionInfo()
@
<<echo=FALSE,results=tex>>=
toLatex(sessionInfo())
@

\section{Pre-processing and annotation of mass spectrometry data}


To illustrate how \emph{dataMatrix}, \emph{sampleMetadata} and \emph{variableMetadata} can be obtained from raw mass spectra file, we use the LC-MS data from the \emph{faahKO} package \cite{Saghatelian2004}. We will pre-process the raw files with the \emph{xcms} package \cite{Smith2006} and annotate isotopes and adducts with the \emph{CAMERA} package \cite{Kuhl2012}, as described in the corresponding vignettes (all these packages are from \emph{bioconductor}).

Let us start by getting the paths to the 12 raw files (6 KO and 6 WT mice) in the ".cdf" open format. The files are grouped in two sub-directories ("KO" and "WT") since \emph{xcms} can use sample class information when grouping the peaks and correcting retention times.

<<>>=
library(faahKO)
cdfpath <- system.file("cdf", package = "faahKO")
cdffiles <- list.files(cdfpath, recursive = TRUE, full.names = TRUE)
basename(cdffiles)
@

Next, \emph{xcms} is used to pre-process the individual raw files, as described in the vignette.

<<>>=
library(xcms)
@
<<results=hide>>=
xset <- xcmsSet(cdffiles)
@
<<>>=
xset
xset <- group(xset)
@
<<eval=FALSE>>=
xset2 <- retcor(xset, family = "symmetric", plottype = "mdevden")
@
<<echo=FALSE>>=
xset2 <- retcor(xset, family = "symmetric", plottype = "none")
@
<<>>=
xset2 <- group(xset2, bw = 10)
@
<<results=hide>>=
xset3 <- fillPeaks(xset2)
@

Finally, the \emph{annotateDiffreport} from \emph{CAMERA} annotates isotopes and adducts and builds a peak table containing the peak intensities and the variable metadata.

<<>>=
library(CAMERA)
diffreport <- annotateDiffreport(xset3, quick=TRUE)
diffreport[1:4, ]
@

We then build the dataMatrix, sampleMetadata and variableMetadata matrix and dataframes as follows:

<<>>=
sampleVc <- grep("^ko|^wt", colnames(diffreport), value = TRUE)
dataMatrix <- t(as.matrix(diffreport[, sampleVc]))
dimnames(dataMatrix) <- list(sampleVc, diffreport[, "name"])
sampleMetadata <- data.frame(row.names = sampleVc,
genotypeFc = substr(sampleVc, 1, 2))
variableMetadata <- diffreport[, !(colnames(diffreport) %in% c("name", sampleVc))]
rownames(variableMetadata) <- diffreport[, "name"]
@

The data can now be analysed with the \emph{ropls} package as described in the previous section (i.e. by performing a PCA and an OPLS-DA):

<<>>=
library(ropls)
opls(dataMatrix)
opls(dataMatrix, sampleMetadata[, "genotypeFc"], orthoI = NA)
@

<<echo=FALSE>>=
rm(list = ls())
@


\section{Other datasets}

In addition to the \emph{sacurine} dataset presented above, the package contains the following datasets to illustrate the functionalities of PCA, PLS and OPLS (see the examples in the documentation of the opls function):

\paragraph{aminoacids} Amino-Acids Dataset. Quantitative structure property relationship (QSPR; \cite{Wold2001}).

\paragraph{cellulose} NIR-Viscosity example data set to illustrate multivariate calibration using PLS, spectral filtering and OPLS (Multivariate calibration using spectral data. Simca tutorial. Umetrics, Sweden).

\paragraph{cornell} Octane of various blends of gasoline: Twelve mixture component proportions of the blend are analysed \cite{Tenenhaus1998}.

\paragraph{foods} Food consumption patterns accross European countries (FOODS). The relative consumption of 20 food items was compiled for 16 countries. The values range between 0 and 100 percent and a high value corresponds to a high consumption. The dataset contains 3 missing data \cite{Eriksson2001}.

\paragraph{linnerud} Three physiological and three exercise variables are measured on twenty middle-aged men in a fitness club \cite{Tenenhaus1998}.

\paragraph{lowarp} A multi response optimization data set (LOWARP) \cite{Eriksson2001}.

\paragraph{mark} Marks obtained by french students in mathematics, physics, french and english. Toy example to illustrate the potentialities of PCA \cite{Baccini2010}.

\bibliography{ropls}

\end{document}