%\VignetteIndexEntry{spikeLI} %\VignetteKeywords{hybridisation, Langmuir} %\VignetteDepends{spikeLI} %\VignettePackage{spikeLI} \documentclass[a4paper]{article} \usepackage{graphicx} \usepackage{hyperref} \textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \newcommand{\ul}{\underline} \newcommand{\bc}{\begin{center}} \newcommand{\ec}{\end{center}} \newcommand{\be}{\begin{equation}} \newcommand{\ee}{\end{equation}} \newcommand{\ba}{\begin{array}} \newcommand{\ea}{\end{array}} \newcommand{\beqn}{\begin{eqnarray}} \newcommand{\eeqn}{\end{eqnarray}} \title{SpikeLI: Analysis of Affymetrix Spike-in data with Langmuir Isotherms} \author{Delphine Baillon, Paul Leclercq, Sarah Ternisien, Thomas Heim\\ and Enrico Carlon} %\affiliation{$^{(1)}$Ecole Polytechnique Universitaire de Lille, Cit\'e %Scientifique, F-59655 Villeneuve d'Ascq, France} %\affiliation{$^{(2)}$Interdisciplinary Research Institute c/o IEMN, Cit\'e %Scientifique BP 60069, F-59652 Villeneuve d'Ascq, France} \begin{document} \maketitle \tableofcontents \section{Introduction} SpikeLI (pron. spike lee) is a package that performs the analysis of the Affymetrix spike-in data using the Langmuir Isotherm. The aim of this package is to show the advantages of a physical-chemistry based analysis of the Affymetrix microarray data compared to the traditional methods. The spike-in (or Latin square) data for the HGU95 and HGU133 chipsets have been downloaded from the Affymetrix web site in Ref. \cite{affy}. The model used in the spikeLI package is described in details in Ref. \cite{carl06}. In this section we briefly review the basic Langmuir model and an extension introduced in Ref. \cite{carl06}, which takes into account the effect of the hybridization in solution. The latter is the actual model used in the calculation of target concentrations (or expression levels). \subsection{The basic Langmuir model} The Langmuir Isotherm describes the hybridization of target RNA to complementary probes PM or MM. According to this model the fluorescent intensity measured on a feature is given by \begin{equation} I = I_0 + \frac{A c e^{\beta \Delta G}}{1 + c e^{\beta \Delta G}} \label{lang_1} \end{equation} where $c$ is the target concentration, or a measure of the gene expression level that one wants to determine, $\Delta G$ is the hybridization free energy, $\beta = 1/RT$ the inverse temperature (R the gas constant), $A$ a scale factor and $I_0$ the background non-specific signal. In the model we handle PM and MM signals on the same footing: perfect matches hybridize more efficiently to the probes than mismatches and this difference results in a different hybridization free energy $\Delta G$. The latter is calculated from experimental data for RNA/DNA hybrids taken from Refs. \cite{sugi95_sh} and \cite{sugi00}, using the nearest neighbor model. In Affymetrix chips the MM probes are generated by changing the nucleotide at the 13th position according to the rules A$\leftrightarrow$T and C$\leftrightarrow$G. Therefore in MM probes the mismatches are of the type dA-rA, dT-rU, dC-rC and dG-rG (where d denotes the DNA strand and r the RNA strand). The hybridization experiments in solution \cite{sugi00} have shown that the free energies of mismatches depend strongly on the nature and order of the two flanking nucleotides. In total, for the four mismatches given above, there are 64 triplet free energies as there are 16 combinations of flanking nucleotides. Unfortunately only a part of these have been experimentally determined, therefore we had to limit our analysis to those mismatches for which $\Delta G$ can be calculated (see Ref.\cite{carl06}). In Eq. (\ref{lang_1}) we take $A$ and $\beta$ as fitting parameters. We set $A = 10^4$ and $\beta = 1 / RT = 0.74$ mol/kcal corresponding to a temperature of $T = 675$K. \subsection{Including hybridization in solution} A second more refined model (see again for details Ref.\cite{carl06}) is: \begin{equation} I = I_0 + \frac{A \alpha c e^{\beta \Delta G}}{1 + \alpha c e^{\beta \Delta G}} \label{lang_2} \end{equation} which is similiar to the basic langmuir isotherm except for the presence of the factor $\alpha$, describing the effective reduction of target molecules due to hybridization in solution. The idea is the following: as the target is composed by a complex mixture of sequences there is a possibility that some of the sequences in solution are partially complementary. These will form stable duplexes in solution. Therefore the concentration of target available for hybridization to the array is not $c$, but $\alpha c$ with $\alpha < 1$ (see Fig. \ref{langmuir2}). For the factor $\alpha$ we use the simple input: \begin{equation} \alpha = \frac 1 {1 + \tilde c e^{\beta' \Delta G_R}} \end{equation} where $\tilde c$ and $\beta'$ are fitting parameters. The $\Delta G_R$ is the free energy of a 25-mer RNA/RNA duplex in solution (data taken from Ref. \cite{xia98_sh}). Note that $\alpha$ is sequence dependent: sequences richer in CG nucleotides will have a stronger affinity for forming stable duplexes in solution. \begin{figure}[h] \begin{center} \includegraphics[width=7.5cm]{langmuir2.pdf} \caption{Hybridization model of a set of Microarray probes. The two reactions are 1) Target-probe hybridization and 2) Target-target hybridization in solution leading to an effective reduction of the target concentration to a factor $\alpha c$.} \end{center} \label{langmuir2} \end{figure} \section{Loading the library and data structure} You install the package using the terminal simply type in: \begin{Sinput} R> CMD INSTALL spikeLI_x.x-x.tar.gz \end{Sinput} or try to install the package via the menu (for windows and Mac users). To load the package type \begin{Sinput} R> library("spikeLI") \end{Sinput} in order to perform some non-linear fits SpikeLI requires the following library as well: \begin{Sinput} R> library("stats") \end{Sinput} There are three basic functions in this package: collapse(), Ivsc() and IvsDG(). For convenience there are also some vectors giving all probe sets in the HGU95 and HGU133 spike-in sets. These vectors are SPIKE\_IN (all spike-in probe sets for the HGU133), SPIKE\_INH (human HGU133 spikes), SPIKE\_INA (artificial HGU133 spikes), SPIKE\_INB (bacterial HGU133 spikes) and SPIKE\_IN95. Try the following \begin{Sinput} R> SPIKE_INH R> SPIKE_INA \end{Sinput} The concentration vectors hold the values of the concentration used for the latin square matrix, for the HGU95 and HGU133 Affymetrix Microarrays. \begin{Sinput} R> conc133 R> conc95 \end{Sinput} \begin{figure}[h] \begin{center} \includegraphics[width=12cm]{Ivsc.pdf} \caption{Intensity versus concentration graph for probe 1 of the "37777" gene} \end{center} \label{Ivsc_37777} \end{figure} \section{Plotting Intensities vs. concentration: Ivsc()} A first function plots the measured intensities as function of the spike-in concentration for a given probe. For instance: \begin{Sinput} R> Ivsc("37777_at",1) \end{Sinput} shows a plot of intensity vs. spike-in concentration for the probe 1 of the probe set 37777\_at (see Fig. \ref{Ivsc_37777}). The routine also performs a 3 parameters non-linear fit of the data using the ``nls" function of the stats package the formula: \begin{equation} I = I_0 + \frac{A c}{K+c} \end{equation} with $I_0$, $A$ and $K$ fitting parameters. If the probe number is not given, the routine automatically takes probe=1. If a vector is given for the probe or for the probe set the first elements of both is taken. So, as SPIKE\_IN95[1] is ``37777\_at" typing \begin{Sinput} R> Ivsc(SPIKE_IN95) \end{Sinput} has the same effects as the above. The screen displays the fitted value of $I_{max} = I_0 + A$ which is the asymptotic intensity expected for the limit of large concentrations ($c \to \infty$). \section{Plotting Intensities vs. free energies: IvsDG()} This function plots measured intensities at given fixed spike-in concentration as function of the hybridization free energies. The spike-in concentrations range from 0 to 1024 pM (pico molar) for the HGU95 experiment and from 0 to 512 pM for the HGU133. For instance \begin{Sinput} R> IvsDG("AFFX-r2-TagE_at",128) \end{Sinput} shows two plots (see Fig. \ref{IvsDG_TagE}) for the probe set ``AFFX-r2-TagE\_at" at a concentration of 128 pM. On the left side the plot of the PM and MM as functions of the probe number for the given probe set. The data are background subtracted, ie the intensities at zero spike-in concentrations are subtracted from the data. What is plotted in the vertical axis of Fig. \ref{IvsDG_TagE} is $I(c=128 {\rm pM}) - I(0)$. The right side of Fig. \ref{IvsDG_TagE} shows the same data points plotted as functions of the variable $z=\Delta G - RT \log \alpha$. With this choice the Langmuir isotherm of Eq. (\ref{lang_2}) takes the form: \begin{equation} I - I_0 = \frac{A c e^{z}}{1 + c e^z} \end{equation} which is the curve shown as a solid line (for $c=128$ pM) in Fig. \ref{IvsDG_TagE} right. The two dashed lines correspond to concentrations a factor $4$ higher and lower than the central one (i.e. $512$ and $32$ pM). Note that only two MM probes (2 and 11) are shown in Fig. \ref{IvsDG_TagE}. This is due to the fact, as mentioned above, that the free energies only for a limited numbers of mismatches can be calculated. All the data in Fig. \ref{IvsDG_TagE} fall inside the two dashed lines. We identify outliers with this analysis on the basis of their devation from the Langmuir model. The spike-in concentrations are stored in the two vectors conc95 and conc133. It is possible to generate a simple demo using \begin{Sinput} R> for(i in (2:length(conc133))) IvsDG(SPIKE_IN[1],conc133[i]) \end{Sinput} \begin{figure}[h] \begin{center} \includegraphics[width=12cm]{IvsDG_TagE.pdf} \caption{Graph showing the Intensity for each probe and the Intensity versus the Delta G value for each probes for gene ``AFFX-r2-TagE" } \label{IvsDG_TagE} \end{center} \end{figure} \section{Plotting all concentrations: collapse()} This is the main function of the whole package. It allows to plot on a same graph all spike-in data at different concentrations corresponding to a given probe set. Up to four probe sets can be visualized simultaneously. For instance \begin{Sinput} R> collapse(SPIKE_INA[1:4]) \end{Sinput} displays the collapse plots for the first four probe sets in the vector SPIKE\_INA (see Fig. \ref{collapse_A14}). These are the artificial sequences in the HGU133 experiment: AFFX-r2-TagA\_at, AFFX-r2-TagB\_at, AFFX-r2-TagC\_at and AFFX-r2-TagD\_at. \begin{figure}[t] \begin{center} \includegraphics[width=16cm]{collapse_A14.pdf} \caption{Graph showing the Intensity versus the Hybridization value for each probe on 4 different genes} \label{collapse_A14} \end{center} \end{figure} The collapse is simply a plot of the background subtracted intensities $I-I_0$ as a function of the rescaled variable $x' = \alpha c e^{\beta \Delta G}$. When plotted as function of these variables, the data for different concentrations and probes ought to collapse into a single curve which has the form $Ax'/(1+x')$. The collapse is far from perfect. It works better for some probe sets compared to others. Probes strongly deviating from the Langmuir curve should be considered as outliers (this is for instance probe 2 of the probe set AFFX-r2-TagA\_at in Fig. \ref{collapse_A14}). The plot displays parameters measuring the spreading of the data. If only one probe set is given in input, as for instance with \begin{Sinput} R> collapse("1091_at") \end{Sinput} then two graphs are given as output. The graph in the left side is a plot as a function of the variable $x'$, while the one to the right side a plot as a function of the variable $x = c e^{\beta \Delta G}$, which is the collapse plot for the basic Langmuir model (Eq. \ref{lang_1}), in which the effect of hybridization in solution is neglected. \begin{thebibliography}{3} \expandafter\ifx\csname natexlab\endcsname\relax\def\natexlab#1{#1}\fi \expandafter\ifx\csname bibnamefont\endcsname\relax \def\bibnamefont#1{#1}\fi \expandafter\ifx\csname bibfnamefont\endcsname\relax \def\bibfnamefont#1{#1}\fi \expandafter\ifx\csname citenamefont\endcsname\relax \def\citenamefont#1{#1}\fi \expandafter\ifx\csname url\endcsname\relax \def\url#1{\texttt{#1}}\fi \expandafter\ifx\csname urlprefix\endcsname\relax\def\urlprefix{URL }\fi \providecommand{\bibinfo}[2]{#2} \providecommand{\eprint}[2][]{\url{#2}} \bibitem{affy} \bibinfo{author}{\bibnamefont{Affymetrix download center}} \bibinfo{url}{\url{http://www.affymetrix.com/analysis/index.affx}} \bibitem{carl06} \bibinfo{author}{\bibfnamefont{E.}~\bibnamefont{Carlon}} \bibnamefont{and} \bibinfo{author}{\bibfnamefont{T.}~\bibnamefont{Heim}}, \bibinfo{journal}{Physica A} \textbf{\bibinfo{volume}{362}}, \bibinfo{pages}{433} (\bibinfo{year}{2006}). \bibitem{sugi95_sh} \bibinfo{author}{\bibfnamefont{N.}~\bibnamefont{Sugimoto}} \bibnamefont{et~al.}, \bibinfo{journal}{Biochemistry} \textbf{\bibinfo{volume}{34}}, \bibinfo{pages}{11211} (\bibinfo{year}{1995}). \bibitem{sugi00} \bibinfo{author}{\bibfnamefont{N.}~\bibnamefont{Sugimoto}}, \bibinfo{author}{\bibfnamefont{M.}~\bibnamefont{Nakano}}, \bibnamefont{and} \bibinfo{author}{\bibfnamefont{S.}~\bibnamefont{Nakano}}, \bibinfo{journal}{Biochemistry} \textbf{\bibinfo{volume}{39}}, \bibinfo{pages}{11270} (\bibinfo{year}{2000}). \bibitem{xia98_sh} \bibinfo{author}{\bibfnamefont{T.}~\bibnamefont{Xia}} \bibnamefont{et~al.}, \bibinfo{journal}{Biochemistry} \textbf{\bibinfo{volume}{37}}, \bibinfo{pages}{14719} (\bibinfo{year}{1998}). \end{thebibliography} \end{document}