%\VignetteIndexEntry{Handling probe sequence information}
%\VignetteDepends{Biobase, matchprobes, affy, hgu95av2cdf}
%\VignetteKeywords{Expression Analysis}
%\VignettePackage{matchprobes}

\documentclass[11pt]{article}
\usepackage[margin=2cm,nohead]{geometry}

\usepackage{color}
\definecolor{darkblue}{rgb}{0.0,0.0,0.75}
\usepackage[%
baseurl={http://www.bioconductor.org},%
pdftitle={Basic infrastructure for using oligonucleotide microarray reporter 
sequence information for preprocessing and quality assessment},%
pdfauthor={Wolfgang Huber and Robert Gentleman},%
pdfkeywords={Bioconductor},%
pagebackref,bookmarks,colorlinks,linkcolor=darkblue,citecolor=darkblue,%
pagecolor=darkblue,raiselinks,plainpages,pdftex]{hyperref}

\SweaveOpts{keep.source=TRUE,eps=FALSE,include=FALSE} 

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\Rfile}[1]{{\texttt{#1}}}

\newcommand{\myincfig}[3]{%
  \begin{figure}[tb]
    \begin{center}
      \includegraphics[width=#2]{#1}
      \caption{\label{#1}#3}
    \end{center}
  \end{figure}
}


\bibliographystyle{plainnat} 

\begin{document}
\title{Basic infrastructure for using oligonucleotide microarray reporter 
sequence information for preprocessing and quality assessment}
\author{Wolfgang Huber and Robert Gentleman}
\maketitle
\tableofcontents

%------------------------------------------------------------
\section{Overview}
%------------------------------------------------------------
This package provides some basic and simple tools for dealing with the
oligonucleotide microarray reporter sequence information in the
Bioconductor \textit{probe} packages. This information is used, for
example, in the \Rpackage{gcrma} package.

\textbf{For general work with genomic sequences, this package is
  inadequate.} Please refer to the \Rpackage{Biostrings} package for
that.

As an example, the package \Rpackage{hgu95av2} provides microarray
reporter sequences for Affymetrix' \textit{HgU95a version 2} genechip,
and for almost all major Affymetrix genechips, the corresponding
packages can be downloaded from the Bioconductor website. If you
have the reporter sequence information of a particular chip, you can
also create such a package yourself. This is described in
section~\ref{sec.mpp}.

First, let us load the \Rpackage{matchprobes} package and some other packages 
we will use.
%
<<startup,results=hide>>=
library("matchprobes")
library("affy")
library("hgu95av2cdf")
library("hgu95av2probe")
@

%------------------------------------------------------------
\section{Using probe packages}\label{sec.prbpkg}
%------------------------------------------------------------
Help for the probe sequence data packages can be accessed through
\begin{Schunk}
\begin{Sinput}
> ? hgu95av2probe
\end{Sinput}
\end{Schunk}

\Rpackage{matchprobes} is an infrastructure package. Using it requires
basic familiarity with R and with the design of the \Rclass{AffyBatch}
class in the \Rpackage{affy} package, Bioconductor's basic container
for Affymetrix genechip data. One of the issues that you have to deal
with is that the \textit{probe} packages do not provide the reporter
sequences of all the features present in an \Rclass{AffyBatch}. Some
sequences are missing, some are implied; in particular, the data
structure in the \textit{probe} packages does not contain explicitely
contain the sequences of the mismatch probes, since they are implied
by the perfect match probes. Also, some other features, typically
harbouring control probes or empty, do not have sequences. This is the
choice that Affymetrix made when they made files with probe sequences
available, and we followed it.

Practically, this means that the vector of probe sequences in a
\textit{probe} package does not align 1:1 with the rows of the
corresponding \Rpackage{AffyBatch}; you need to keep track of this
mapping, and some tools for this are provided and explained below (see
Section~\ref{subsec.relating}). It also means that some functions from
the \Rpackage{affy} package, such as \Rfunction{pm}, cannot be used
when the sequences of the probes corresponding to their result are
needed; since \Rfunction{pm} reports the intensities, but not the
identity of the probes it has selected, yet the latter would be needed
to retrieve their sequences.

%------------------------------------------------------------
\subsection{Basic functions}\label{subsec.fcts}
%------------------------------------------------------------
Let us look at some basic operations on the sequences.

\subsubsection{Complementary sequence}
<<complementSeq>>=
example(complementSeq)
@

\subsubsection{Reverse sequence}
<<reverseSeq>>=
example(reverseSeq)
@

\subsubsection{Matching sets of probes against each other}
<<matchprobes>>=
pm    <- hgu95av2probe$sequence
pmdup <- c(pm, pm)
query <- hgu95av2probe$sequence[21:23]
m = matchprobes(query, pmdup)
unlist(m)
@

\subsubsection{Base content}
The base content (number of occurence of each character) of the
sequences can be computed with the function
\Rfunction{basecontent}.
%
<<basecontent>>=
bcpm <- basecontent(hgu95av2probe$sequence)
head(bcpm)
@


%------------------------------------------------------------
\subsection{Relating to the features of an \Rclass{AffyBatch}}
\label{subsec.relating}
%------------------------------------------------------------
<<hgu95av2dim,print=TRUE>>=
nr = hgu95av2dim$NROW
nc = hgu95av2dim$NCOL
@ 
Each column of an \Rclass{AffyBatch} corresponds to an array, each row
to a certain probe on the arrays. The probes are stored in a way that is
related to their geometrical position on the array. For example, the
\textit{hgu95av2} array is geometrically arranged into \Sexpr{nr} rows and
\Sexpr{nc} columns; we also call them the $x$- and $y$-coordinates.  
This results in \Sexpr{nr} $\times$
\Sexpr{nc} $=$
\Sexpr{nc*nr} rows of the \Rclass{AffyBatch}; we
also call them indices. To convert between $x$- and $y$-coordinates and
indices, you can use the functions \Rfunction{xy2indices} and
\Rfunction{indices2xy} from the \Rpackage{affy} package.

The sequence data in the \textit{probe} packages is addressed by their
$x$ and $y$--coordinates. Let us construct a vector \Robject{abseq}
that aligns with the rows of an \textit{hgu95av2} \Rclass{AffyBatch}
and fill in the sequences:
<<>>=
abseq = rep(as.character(NA), nr*nc)  

ipm = with(hgu95av2probe, xy2indices(x, y, nr=nr))

abseq[ipm] = pm
table(is.na(abseq))
@ 

The mismatch sequences are not explicitely stored in the probe
packages. They are implied by the match sequences, by flipping the
middle base. This can be done with the function
\Rfunction{complementSeq}.  For Affymetrix GeneChips the length of the
probe sequences is 25, and since we start counting at 1, the middle
position is 13.
%

We compute \Robject{imm}, the indices of the mismatch probes, by
noting that each mismatch has the same $x$-coordinate as its
associated perfect match, while its $y$-coordinate is increased by 1.
%
<<mismatchSeq>>=
mm <- complementSeq(pm, start=13, stop=13)
cat(pm[1], mm[1], sep="\n")

imm = with(hgu95av2probe, xy2indices(x, y+1, nr=nr))

abseq[imm] = mm
table(is.na(abseq))
@
%
See Figures~\ref{matchprobes-bap}--\ref{matchprobes-p2p} for some applications 
of the probe sequence information to preprocessing and data quality related plots.

%------------------------------------------------------------
\section{CombineAffyBatch}\label{sec.combine}
%------------------------------------------------------------
In this section, we show how to combine the data from two different array types
whose reporter sequences are partially the same.
%
<<>>=
library("hu6800cdf")
library("hu6800probe")
@
Load the data. 
<<loadcel>>=
f1 <- system.file("extdata", "118T1.cel", package="matchprobes")
f2 <- system.file("extdata", "CL2001032020AA.cel", package="matchprobes")
pd1 <- new("AnnotatedDataFrame", data=data.frame(fromFile=I(f1), row.names="f1"))
pd2 <- new("AnnotatedDataFrame", data=data.frame(fromFile=I(f2), row.names="f2"))
x1 <- read.affybatch(filenames=f1, compress=TRUE, phenoData=pd1)
x2 <- read.affybatch(filenames=f2, compress=TRUE, phenoData=pd2)
@
\Robject{f1} and \Robject{f2} contain the filenames of the CEL files. 
In practice, they will be vectors with many filenames per array type, here, 
for demonstration, we only use one single CEL file per array type, 
which are provided in the subdirectory \Rfile{extdata} of the package.

Combine the data. For this to work it is required that the packages 
\Rpackage{hu6800probe} and \Rpackage{hgu95av2probe} are installed.
<<docombine>>=
res <- combineAffyBatch(list(x1, x2), c("hu6800probe","hgu95av2probe"), newcdf="comb")
@

The function returns a list \Robject{res} with two elements:
an \Robject{AffyBatch} \Robject{data} and a
\Robject{CDF enviroment} \Robject{newcdf}.
%
<<dorma, results=hide>>=
comb <- res$cdf
z    <- rma(res$dat)
@


First, let us look at the distribution of the number of common probes
per probeset (see Figure~\ref{matchprobes-nrcp}).
%
<<nrcp, fig=TRUE, width=8, height=4>>=
prs <- mget(ls(comb), comb, ifnotfound=NA)
nrprobes <- sapply(prs, function(x) nrow(x))
barplot(table(nrprobes), 
   xlab="number of probes per probeset", 
   ylab="frequency")
@
% 
\myincfig{matchprobes-nrcp}{0.8\textwidth}{The distribution of the number of common probes per probeset.}

Let us also make scatterplots of the two combined arrays,
first on the probe and then at the probe set level (see Figure~\ref{matchprobes-scp}).
%
\myincfig{matchprobes-scp}{0.8\textwidth}{Comparison of the intensity
data from common probes (left panel) and probesets (right panel) of the two array platforms.}
%
<<scp, results=hide>>=
png("matchprobes-scp.png", width=900, height=480)
par(mfrow=c(1,2))
plot(exprs(res$dat), main="after combine", 
  pch=".", asp=1, xlab="f1", ylab="f2", log="xy")
plot(exprs(z), main="after RMA",     
  pch=".", asp=1, xlab="f1", ylab="f2")
dev.off()
@
%
Here, we explicitely used the \Rfunction{png} device to write this plot to; this
circuments the default choice of \Rfunction{Sweave}, PDF, which can become cumbersomely
large in size for feature-rich plots such as this one.

%------------------------------------------------------------
\section{Some sequence related ``preprocessing and quality'' plots}
\label{subsec.prqplots}
%------------------------------------------------------------
The function \Rfunction{basecontent} counts the number of occurences of each
of the four bases A, C, G, T in each probe sequence.
<<bc>>=
bc = basecontent(abseq)
head(na.omit(bc))
@ 
%
Let us define an ordered factor variable for GC content:
%
<<GC>>=
GC = ordered(bc[,"G"] + bc[,"C"])
colores = rainbow(nlevels(GC))
@ 
%
Figure~\ref{matchprobes-bap} shows a barplot of the frequencies of 
counts in \Robject{GC}:
%
<<bap,fig=TRUE,width=8,height=5>>=
barplot(table(GC), col=colores, xlab="GC", ylab="number")
@ 

Figure~\ref{matchprobes-bxp}:
<<bxp,fig=TRUE,width=8,height=5>>=
boxplot(log2(exprs(x2)[,1]) ~ GC, outline=FALSE,
  col=colores, , xlab="GC", ylab=expression(log[2]~intensity))
@ 

Figure~\ref{matchprobes-p2p}:
<<p2p, results=hide>>=
png("matchprobes-p2p.png", width=900, height=480)
plot(exprs(x2)[ipm,1], exprs(x2)[imm,1], asp=1, pch=".", log="xy",
     xlab="PM", ylab="MM", col=colores[GC[ipm]])
abline(a=0, b=1, col="#404040", lty=3)
dev.off()
@
%
\myincfig{matchprobes-bap}{0.7\textwidth}{Distribution of probe GC content.
  The height of each bar corresponds to the number of probes with the 
  corresponing GC content.}
\myincfig{matchprobes-bxp}{0.7\textwidth}{Boxplots of $\log_2$ intensity 
  stratified by probe GC content.}
\myincfig{matchprobes-p2p}{0.7\textwidth}{Scatterplot of PM vs MM intensities, colored 
  by probe GC content.}


%------------------------------------------------------------
\section{Creating probe packages}\label{sec.mpp}
%------------------------------------------------------------
Probe packages are a convenient way for distributing and storing the 
probe sequences and related information. 

%------------------------------------------------------------
\subsection{For Affymetrix genechips}\label{subsec.affy}
%------------------------------------------------------------
In this section we see how a probe package can be created for
Affymetrix genechips from the tabulator-separated sequence files that
can be obtained from the vendor (at
\url{http://www.affymetrix.com/support}). As an example, the file
\Rfile{HG-U95Av2\_probe\_tab.gz} is provided in the \Rfile{extdata}
subdirectory of the \Rpackage{matchprobes} package.

<<makeprobepackage>>=
filename <- system.file("extdata", "HG-U95Av2_probe_tab.gz", 
                          package="matchprobes")
outdir   <- tempdir()
me       <- "Wolfgang Huber <w.huber@dkfz.de>"
species  <- "Homo_sapiens"
makeProbePackage("HG-U95Av2",
                 datafile   = gzfile(filename, open="r"),
                 outdir     = outdir,
                 maintainer = me,
                 species    = species,
                 version    = "0.0.1", 
                 force      = TRUE)
dir(outdir)
@

%------------------------------------------------------------
\subsection{For other chiptypes}\label{subsec.otherarraytypes}
%------------------------------------------------------------ 
To deal with different file formats and additional types of probe
annotation data from public or in-house databases, the function
\Rfunction{makeProbePackage} offers a great deal of flexibility. The
user can specify her own import function through the argument
\Robject{importfun}. By default, its value is
\Rfunction{getProbeDataAffy}, a function that reads tabular Affymetrix
genechip sequence files. Import functions for other types of arrays
can be adapted from this prototype.

The help pages and R code contained in the produced packages are
derived from a template directory that obeys the usual R package
conventions~\cite{RExt}. A prototype for such a directory is provided
within the package \Rpackage{matchprobes}.  To facilitate the
automated production of large numbers of similar packages, we provide
a text substitution mechanism similar to the one used in the GNU
\Rfunction{configure} system.

The input parameters of an import function are
\begin{itemize}
\item a character string naming the array type
\item the input data files
\item a character string with the directory name of a package 
  template directory, containing at least a file \Rfile{DESCRIPTION}, 
  a directory \Rfile{man} with a file @PKGNAME@.Rd, 
  a directory \Rfile{data}, and possible other directories and files,
  conforming to the usual R package conventions.
\item ... any sort of further parameters, as necessary.
\end{itemize}

The output of an import function is a named list with elements
\begin{itemize}
\item \texttt{pkgname}: a character string with the package name.
\item \texttt{dataEnv}: an environment containing an arbitrary number 
  of data objects; these make the core of the package. Among these,
  there should be a data frame whose name is the value of
  \texttt{pkgname} with a column \Robject{sequence}.
  This data frame may have other columns such as the $x$- and $y$-position of the probes
  on the array, identifiers linking a probe to genomic databases, or its
  length and relative position within the gene it represents.  
  The objects in the environment will be saved as individual 
  \Rfile{.rda} files of the same name into the data directory of the 
  produced package. Documentation needs to be provided for the 
  columns of the data frame, as well as  for the other objects.
\item \texttt{symVals}: a named list, containing the symbol-value 
  substitutions that are used in the text processing. It must at least 
  contain the elements
  \begin{itemize}
  \item \texttt{ARRAYTYPE}: name of the array
  \item \texttt{DATASOURCE}: a textual description of how the data
     were obtained. It should contain the URL, or the name of the
     company / person.
  \end{itemize}
\end{itemize}

For more details, please refer to the help files for the functions 
\Rfunction{makeProbePackage} and \Rfunction{getProbeDataAffy}.
For an example, refer to the source code of \Rfunction{getProbeDataAffy}.

\begin{thebibliography}{10}
\bibitem{RExt}
R Foundation (1999).
\newblock {\em Writing R Extensions}.
\newblock \url{http://www.r-project.org}.
\end{thebibliography}

\end{document}