%\VignetteIndexEntry{spikeLI}
%\VignetteKeywords{hybridisation, Langmuir}
%\VignetteDepends{spikeLI}
%\VignettePackage{spikeLI}

\documentclass[a4paper]{article}

\usepackage{graphicx}
\usepackage{hyperref}


\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\ul}{\underline}
\newcommand{\bc}{\begin{center}}
\newcommand{\ec}{\end{center}}
\newcommand{\be}{\begin{equation}}
\newcommand{\ee}{\end{equation}}
\newcommand{\ba}{\begin{array}}
\newcommand{\ea}{\end{array}}
\newcommand{\beqn}{\begin{eqnarray}}
\newcommand{\eeqn}{\end{eqnarray}}

\title{SpikeLI: Analysis of Affymetrix Spike-in data with Langmuir Isotherms}
\author{Delphine Baillon, Paul Leclercq, Sarah Ternisien, Thomas Heim\\ 
and Enrico Carlon}

%\affiliation{$^{(1)}$Ecole Polytechnique Universitaire de Lille, Cit\'e
%Scientifique, F-59655 Villeneuve d'Ascq, France}

%\affiliation{$^{(2)}$Interdisciplinary Research Institute c/o IEMN, Cit\'e
%Scientifique BP 60069, F-59652 Villeneuve d'Ascq, France}


\begin{document}

\maketitle

\tableofcontents

\section{Introduction}

SpikeLI (pron. spike lee) is a package that performs the analysis of
the Affymetrix spike-in data using the Langmuir Isotherm.  The aim
of this package is to show the advantages of a physical-chemistry
based analysis of the Affymetrix microarray data compared to the
traditional methods.  The spike-in (or Latin square) data for the HGU95
and HGU133 chipsets have been downloaded from the Affymetrix web site
in Ref. \cite{affy}. The model used in the spikeLI package is described
in details in Ref. \cite{carl06}.

In this section we briefly review the basic Langmuir model and an
extension introduced in Ref. \cite{carl06}, which takes into account the
effect of the hybridization in solution. The latter is the actual model
used in the calculation of target concentrations (or expression levels).

\subsection{The basic Langmuir model}

The Langmuir Isotherm describes the hybridization of target RNA to
complementary probes PM or MM. According to this model the fluorescent 
intensity measured on a feature is given by
\begin{equation}
I = I_0 + \frac{A c e^{\beta \Delta G}}{1 + c e^{\beta \Delta G}}
\label{lang_1}
\end{equation}
where $c$ is the target concentration, or a measure of the gene expression
level that one wants to determine, $\Delta G$ is the hybridization free
energy, $\beta = 1/RT$ the inverse temperature (R the gas constant),
$A$ a scale factor and $I_0$ the background non-specific signal.

In the model we handle PM and MM signals on the same footing: perfect
matches hybridize more efficiently to the probes than mismatches and this
difference results in a different hybridization free energy $\Delta G$.
The latter is calculated from experimental data for RNA/DNA hybrids
taken from Refs. \cite{sugi95_sh} and \cite{sugi00}, using the nearest
neighbor model.

In Affymetrix chips the MM probes are generated by changing the nucleotide
at the 13th position according to the rules A$\leftrightarrow$T and
C$\leftrightarrow$G. Therefore in MM probes the mismatches are of the
type dA-rA, dT-rU, dC-rC and dG-rG (where d denotes the DNA strand and r
the RNA strand). The hybridization experiments in solution \cite{sugi00}
have shown that the free energies of mismatches depend strongly on the
nature and order of the two flanking nucleotides. In total, for the four
mismatches given above, there are 64 triplet free energies as there are
16 combinations of flanking nucleotides.  Unfortunately only a part of
these have been experimentally determined, therefore we had to limit
our analysis to those mismatches for which $\Delta G$ can be calculated
(see Ref.\cite{carl06}).

In Eq. (\ref{lang_1}) we take $A$ and $\beta$ as fitting parameters. We
set $A = 10^4$ and $\beta = 1 / RT = 0.74$ mol/kcal corresponding to a
temperature of $T = 675$K.

\subsection{Including hybridization in solution}

A second more refined model (see again for details Ref.\cite{carl06})
is:
\begin{equation}
I = I_0 + \frac{A \alpha c e^{\beta \Delta G}}{1 + \alpha c e^{\beta \Delta G}}
\label{lang_2}
\end{equation}
which is similiar to the basic langmuir isotherm except for the presence
of the factor $\alpha$, describing the effective reduction of target
molecules due to hybridization in solution. The idea is the following:
as the target is composed by a complex mixture of sequences there is
a possibility that some of the sequences in solution are partially
complementary. These will form stable duplexes in solution. Therefore
the concentration of target available for hybridization to the array is
not $c$, but $\alpha c$ with $\alpha < 1$ (see Fig. \ref{langmuir2}).

For the factor $\alpha$ we use the simple input:
\begin{equation}
\alpha = \frac 1 {1 + \tilde c e^{\beta' \Delta G_R}}
\end{equation}
where $\tilde c$ and $\beta'$ are fitting parameters. The $\Delta G_R$
is the free energy of a 25-mer RNA/RNA duplex in solution (data taken
from Ref. \cite{xia98_sh}). Note that $\alpha$ is sequence dependent:
sequences richer in CG nucleotides will have a stronger affinity for
forming stable duplexes in solution.

\begin{figure}[h]
\begin{center}
\includegraphics[width=7.5cm]{langmuir2.pdf}
\caption{Hybridization model of a set of Microarray probes. The
two reactions are 1) Target-probe hybridization and 2) Target-target
hybridization in solution leading to an effective reduction of the target
concentration to a factor $\alpha c$.}
\end{center}
\label{langmuir2}
\end{figure}

\section{Loading the library and data structure}
You install the package using the terminal simply type in:
\begin{Sinput}
R> CMD INSTALL spikeLI_x.x-x.tar.gz
\end{Sinput} 
or try to install the package via the menu (for windows and Mac users).

To load the package type
\begin{Sinput}
R> library("spikeLI")
\end{Sinput}
in order to perform some non-linear fits SpikeLI requires
the following library as well:
\begin{Sinput}
R> library("stats")
\end{Sinput}
There are three basic functions in this package: collapse(), Ivsc() and
IvsDG(). For convenience there are also some vectors giving all probe
sets in the HGU95 and HGU133 spike-in sets. These vectors are
SPIKE\_IN (all spike-in probe sets for the HGU133), SPIKE\_INH (human
HGU133 spikes), SPIKE\_INA (artificial HGU133 spikes), SPIKE\_INB
(bacterial HGU133 spikes) and SPIKE\_IN95.
Try the following 
\begin{Sinput}
R> SPIKE_INH

R> SPIKE_INA
\end{Sinput}

The concentration vectors hold the values of the concentration used for
the latin square matrix, for the HGU95 and HGU133 Affymetrix Microarrays.

\begin{Sinput}
R> conc133

R> conc95
\end{Sinput}

\begin{figure}[h]
\begin{center}
\includegraphics[width=12cm]{Ivsc.pdf}
\caption{Intensity versus concentration graph for probe 1 of the "37777" gene}
\end{center}
\label{Ivsc_37777}
\end{figure}

\section{Plotting Intensities vs. concentration: Ivsc()}

A first function plots the measured intensities as function of the spike-in 
concentration for a given probe. 
For instance:
\begin{Sinput}
R> Ivsc("37777_at",1)
\end{Sinput}
shows a plot of intensity vs. spike-in concentration for the probe
1 of the probe set 37777\_at (see Fig. \ref{Ivsc_37777}). The routine
also performs a 3 parameters non-linear fit of the data using the ``nls"
function of the stats package the formula:
\begin{equation}
I = I_0 + \frac{A c}{K+c}
\end{equation}
with $I_0$, $A$ and $K$ fitting parameters.
If the probe number is not given, the routine automatically takes
probe=1. If a vector is given for the probe or for the probe set the
first elements of both is taken. So, as SPIKE\_IN95[1] is ``37777\_at"
typing 
\begin{Sinput}
R> Ivsc(SPIKE_IN95)
\end{Sinput}
has the same effects as the above.

The screen displays the fitted value of $I_{max} = I_0 + A$ which is
the asymptotic intensity expected for the limit of large concentrations
($c \to \infty$).

\section{Plotting Intensities vs. free energies: IvsDG()}

This function plots measured intensities at given fixed spike-in
concentration as function of the hybridization free energies.
The spike-in concentrations range from 0 to 1024 pM (pico molar) for
the HGU95 experiment and from 0 to 512 pM for the HGU133.  For instance
\begin{Sinput}
R> IvsDG("AFFX-r2-TagE_at",128)
\end{Sinput}
shows two plots (see Fig. \ref{IvsDG_TagE}) for the probe set
``AFFX-r2-TagE\_at" at a concentration of 128 pM.  On the left side the
plot of the PM and MM as functions of the probe number for the given
probe set. The data are background subtracted, ie the intensities at zero
spike-in concentrations are subtracted from the data. What is plotted in
the vertical axis of Fig. \ref{IvsDG_TagE} is $I(c=128 {\rm pM}) - I(0)$.
The right side of Fig. \ref{IvsDG_TagE} shows the same data points plotted
as functions of the variable $z=\Delta G - RT \log \alpha$. With this
choice the Langmuir isotherm of Eq. (\ref{lang_2}) takes the form:
\begin{equation}
I - I_0 = \frac{A c e^{z}}{1 + c e^z}
\end{equation}
which is the curve shown as a solid line (for $c=128$ pM) in
Fig. \ref{IvsDG_TagE} right. The two dashed lines correspond to
concentrations a factor $4$ higher and lower than the central one
(i.e. $512$ and $32$ pM). Note that only two MM probes (2 and 11) are
shown in Fig. \ref{IvsDG_TagE}. This is due to the fact, as mentioned
above, that the free energies only for a limited numbers of mismatches
can be calculated.

All the data in Fig. \ref{IvsDG_TagE} fall inside the two dashed lines.
We identify outliers with this analysis on the basis of their devation
from the Langmuir model.

The spike-in concentrations are stored in the two vectors conc95 and
conc133. It is possible to generate a simple demo using
\begin{Sinput}
R> for(i in (2:length(conc133))) IvsDG(SPIKE_IN[1],conc133[i])
\end{Sinput}

\begin{figure}[h]
\begin{center}
\includegraphics[width=12cm]{IvsDG_TagE.pdf}
\caption{Graph showing the Intensity for each probe and the Intensity versus 
the Delta G value for each probes for gene ``AFFX-r2-TagE" }
\label{IvsDG_TagE}
\end{center}
\end{figure}

\section{Plotting all concentrations: collapse()}

This is the main function of the whole package. It allows to plot on a
same graph all spike-in data at different concentrations corresponding
to a given probe set. Up to four probe sets can be visualized
simultaneously. For instance
\begin{Sinput}
R> collapse(SPIKE_INA[1:4])
\end{Sinput}
displays the collapse plots for the first four probe sets in the vector
SPIKE\_INA (see Fig. \ref{collapse_A14}). These are the artificial
sequences in the HGU133 experiment: AFFX-r2-TagA\_at, AFFX-r2-TagB\_at,
AFFX-r2-TagC\_at and AFFX-r2-TagD\_at.

\begin{figure}[t]
\begin{center}
\includegraphics[width=16cm]{collapse_A14.pdf}
\caption{Graph showing the Intensity versus the Hybridization value 
for each probe on 4 different genes}
\label{collapse_A14}
\end{center}
\end{figure}

The collapse is simply a plot of the background subtracted intensities
$I-I_0$ as a function of the rescaled variable $x' = \alpha c e^{\beta
\Delta G}$. When plotted as function of these variables, the data for
different concentrations and probes ought to collapse into a single
curve which has the form $Ax'/(1+x')$. The collapse is far from perfect.
It works better for some probe sets compared to others. Probes strongly
deviating from the Langmuir curve should be considered as outliers
(this is for instance probe 2 of the probe set AFFX-r2-TagA\_at in
Fig. \ref{collapse_A14}). The plot displays parameters measuring the
spreading of the data.

If only one probe set is given in input, as for instance with
\begin{Sinput}
R> collapse("1091_at")
\end{Sinput}
then two graphs are given as output. The graph in the left side is a
plot as a function of the variable $x'$, while the one to the right side
a plot as a function of the variable $x = c e^{\beta \Delta G}$, which
is the collapse plot for the basic Langmuir model (Eq. \ref{lang_1}),
in which the effect of hybridization in solution is neglected.



\begin{thebibliography}{3}
\expandafter\ifx\csname natexlab\endcsname\relax\def\natexlab#1{#1}\fi
\expandafter\ifx\csname bibnamefont\endcsname\relax
  \def\bibnamefont#1{#1}\fi
\expandafter\ifx\csname bibfnamefont\endcsname\relax
  \def\bibfnamefont#1{#1}\fi
\expandafter\ifx\csname citenamefont\endcsname\relax
  \def\citenamefont#1{#1}\fi
\expandafter\ifx\csname url\endcsname\relax
  \def\url#1{\texttt{#1}}\fi
\expandafter\ifx\csname urlprefix\endcsname\relax\def\urlprefix{URL }\fi
\providecommand{\bibinfo}[2]{#2}
\providecommand{\eprint}[2][]{\url{#2}}

\bibitem{affy}
\bibinfo{author}{\bibnamefont{Affymetrix download center}} 
  \bibinfo{url}{\url{http://www.affymetrix.com/analysis/index.affx}}

\bibitem{carl06}
\bibinfo{author}{\bibfnamefont{E.}~\bibnamefont{Carlon}} \bibnamefont{and}
  \bibinfo{author}{\bibfnamefont{T.}~\bibnamefont{Heim}},
  \bibinfo{journal}{Physica A} \textbf{\bibinfo{volume}{362}},
  \bibinfo{pages}{433} (\bibinfo{year}{2006}).

\bibitem{sugi95_sh}
\bibinfo{author}{\bibfnamefont{N.}~\bibnamefont{Sugimoto}}
  \bibnamefont{et~al.}, \bibinfo{journal}{Biochemistry}
  \textbf{\bibinfo{volume}{34}}, \bibinfo{pages}{11211} (\bibinfo{year}{1995}).

\bibitem{sugi00}
\bibinfo{author}{\bibfnamefont{N.}~\bibnamefont{Sugimoto}},
  \bibinfo{author}{\bibfnamefont{M.}~\bibnamefont{Nakano}}, \bibnamefont{and}
  \bibinfo{author}{\bibfnamefont{S.}~\bibnamefont{Nakano}},
  \bibinfo{journal}{Biochemistry} \textbf{\bibinfo{volume}{39}},
  \bibinfo{pages}{11270} (\bibinfo{year}{2000}).

\bibitem{xia98_sh}
\bibinfo{author}{\bibfnamefont{T.}~\bibnamefont{Xia}}
\bibnamefont{et~al.},
\bibinfo{journal}{Biochemistry} \textbf{\bibinfo{volume}{37}},
  \bibinfo{pages}{14719} (\bibinfo{year}{1998}).

\end{thebibliography}
\end{document}