%\VignetteIndexEntry{Handling probe sequence information} %\VignetteDepends{Biobase, matchprobes, affy, hgu95av2cdf} %\VignetteKeywords{Expression Analysis} %\VignettePackage{matchprobes} \documentclass[11pt]{article} \usepackage[margin=2cm,nohead]{geometry} \usepackage{color} \definecolor{darkblue}{rgb}{0.0,0.0,0.75} \usepackage[% baseurl={http://www.bioconductor.org},% pdftitle={Basic infrastructure for using oligonucleotide microarray reporter sequence information for preprocessing and quality assessment},% pdfauthor={Wolfgang Huber and Robert Gentleman},% pdfkeywords={Bioconductor},% pagebackref,bookmarks,colorlinks,linkcolor=darkblue,citecolor=darkblue,% pagecolor=darkblue,raiselinks,plainpages,pdftex]{hyperref} \SweaveOpts{keep.source=TRUE,eps=FALSE,include=FALSE} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rfile}[1]{{\texttt{#1}}} \newcommand{\myincfig}[3]{% \begin{figure}[tb] \begin{center} \includegraphics[width=#2]{#1} \caption{\label{#1}#3} \end{center} \end{figure} } \bibliographystyle{plainnat} \begin{document} \title{Basic infrastructure for using oligonucleotide microarray reporter sequence information for preprocessing and quality assessment} \author{Wolfgang Huber and Robert Gentleman} \maketitle \tableofcontents %------------------------------------------------------------ \section{Overview} %------------------------------------------------------------ This package provides some basic and simple tools for dealing with the oligonucleotide microarray reporter sequence information in the Bioconductor \textit{probe} packages. This information is used, for example, in the \Rpackage{gcrma} package. \textbf{For general work with genomic sequences, this package is inadequate.} Please refer to the \Rpackage{Biostrings} package for that. As an example, the package \Rpackage{hgu95av2} provides microarray reporter sequences for Affymetrix' \textit{HgU95a version 2} genechip, and for almost all major Affymetrix genechips, the corresponding packages can be downloaded from the Bioconductor website. If you have the reporter sequence information of a particular chip, you can also create such a package yourself. This is described in section~\ref{sec.mpp}. First, let us load the \Rpackage{matchprobes} package and some other packages we will use. % <>= library("matchprobes") library("affy") library("hgu95av2cdf") library("hgu95av2probe") @ %------------------------------------------------------------ \section{Using probe packages}\label{sec.prbpkg} %------------------------------------------------------------ Help for the probe sequence data packages can be accessed through \begin{Schunk} \begin{Sinput} > ? hgu95av2probe \end{Sinput} \end{Schunk} \Rpackage{matchprobes} is an infrastructure package. Using it requires basic familiarity with R and with the design of the \Rclass{AffyBatch} class in the \Rpackage{affy} package, Bioconductor's basic container for Affymetrix genechip data. One of the issues that you have to deal with is that the \textit{probe} packages do not provide the reporter sequences of all the features present in an \Rclass{AffyBatch}. Some sequences are missing, some are implied; in particular, the data structure in the \textit{probe} packages does not contain explicitely contain the sequences of the mismatch probes, since they are implied by the perfect match probes. Also, some other features, typically harbouring control probes or empty, do not have sequences. This is the choice that Affymetrix made when they made files with probe sequences available, and we followed it. Practically, this means that the vector of probe sequences in a \textit{probe} package does not align 1:1 with the rows of the corresponding \Rpackage{AffyBatch}; you need to keep track of this mapping, and some tools for this are provided and explained below (see Section~\ref{subsec.relating}). It also means that some functions from the \Rpackage{affy} package, such as \Rfunction{pm}, cannot be used when the sequences of the probes corresponding to their result are needed; since \Rfunction{pm} reports the intensities, but not the identity of the probes it has selected, yet the latter would be needed to retrieve their sequences. %------------------------------------------------------------ \subsection{Basic functions}\label{subsec.fcts} %------------------------------------------------------------ Let us look at some basic operations on the sequences. \subsubsection{Complementary sequence} <>= example(complementSeq) @ \subsubsection{Reverse sequence} <>= example(reverseSeq) @ \subsubsection{Matching sets of probes against each other} <>= pm <- hgu95av2probe$sequence pmdup <- c(pm, pm) query <- hgu95av2probe$sequence[21:23] m = matchprobes(query, pmdup) unlist(m) @ \subsubsection{Base content} The base content (number of occurence of each character) of the sequences can be computed with the function \Rfunction{basecontent}. % <>= bcpm <- basecontent(hgu95av2probe$sequence) head(bcpm) @ %------------------------------------------------------------ \subsection{Relating to the features of an \Rclass{AffyBatch}} \label{subsec.relating} %------------------------------------------------------------ <>= nr = hgu95av2dim$NROW nc = hgu95av2dim$NCOL @ Each column of an \Rclass{AffyBatch} corresponds to an array, each row to a certain probe on the arrays. The probes are stored in a way that is related to their geometrical position on the array. For example, the \textit{hgu95av2} array is geometrically arranged into \Sexpr{nr} rows and \Sexpr{nc} columns; we also call them the $x$- and $y$-coordinates. This results in \Sexpr{nr} $\times$ \Sexpr{nc} $=$ \Sexpr{nc*nr} rows of the \Rclass{AffyBatch}; we also call them indices. To convert between $x$- and $y$-coordinates and indices, you can use the functions \Rfunction{xy2indices} and \Rfunction{indices2xy} from the \Rpackage{affy} package. The sequence data in the \textit{probe} packages is addressed by their $x$ and $y$--coordinates. Let us construct a vector \Robject{abseq} that aligns with the rows of an \textit{hgu95av2} \Rclass{AffyBatch} and fill in the sequences: <<>>= abseq = rep(as.character(NA), nr*nc) ipm = with(hgu95av2probe, xy2indices(x, y, nr=nr)) abseq[ipm] = pm table(is.na(abseq)) @ The mismatch sequences are not explicitely stored in the probe packages. They are implied by the match sequences, by flipping the middle base. This can be done with the function \Rfunction{complementSeq}. For Affymetrix GeneChips the length of the probe sequences is 25, and since we start counting at 1, the middle position is 13. % We compute \Robject{imm}, the indices of the mismatch probes, by noting that each mismatch has the same $x$-coordinate as its associated perfect match, while its $y$-coordinate is increased by 1. % <>= mm <- complementSeq(pm, start=13, stop=13) cat(pm[1], mm[1], sep="\n") imm = with(hgu95av2probe, xy2indices(x, y+1, nr=nr)) abseq[imm] = mm table(is.na(abseq)) @ % See Figures~\ref{matchprobes-bap}--\ref{matchprobes-p2p} for some applications of the probe sequence information to preprocessing and data quality related plots. %------------------------------------------------------------ \section{CombineAffyBatch}\label{sec.combine} %------------------------------------------------------------ In this section, we show how to combine the data from two different array types whose reporter sequences are partially the same. % <<>>= library("hu6800cdf") library("hu6800probe") @ Load the data. <>= f1 <- system.file("extdata", "118T1.cel", package="matchprobes") f2 <- system.file("extdata", "CL2001032020AA.cel", package="matchprobes") pd1 <- new("AnnotatedDataFrame", data=data.frame(fromFile=I(f1), row.names="f1")) pd2 <- new("AnnotatedDataFrame", data=data.frame(fromFile=I(f2), row.names="f2")) x1 <- read.affybatch(filenames=f1, compress=TRUE, phenoData=pd1) x2 <- read.affybatch(filenames=f2, compress=TRUE, phenoData=pd2) @ \Robject{f1} and \Robject{f2} contain the filenames of the CEL files. In practice, they will be vectors with many filenames per array type, here, for demonstration, we only use one single CEL file per array type, which are provided in the subdirectory \Rfile{extdata} of the package. Combine the data. For this to work it is required that the packages \Rpackage{hu6800probe} and \Rpackage{hgu95av2probe} are installed. <>= res <- combineAffyBatch(list(x1, x2), c("hu6800probe","hgu95av2probe"), newcdf="comb") @ The function returns a list \Robject{res} with two elements: an \Robject{AffyBatch} \Robject{data} and a \Robject{CDF enviroment} \Robject{newcdf}. % <>= comb <- res$cdf z <- rma(res$dat) @ First, let us look at the distribution of the number of common probes per probeset (see Figure~\ref{matchprobes-nrcp}). % <>= prs <- mget(ls(comb), comb, ifnotfound=NA) nrprobes <- sapply(prs, function(x) nrow(x)) barplot(table(nrprobes), xlab="number of probes per probeset", ylab="frequency") @ % \myincfig{matchprobes-nrcp}{0.8\textwidth}{The distribution of the number of common probes per probeset.} Let us also make scatterplots of the two combined arrays, first on the probe and then at the probe set level (see Figure~\ref{matchprobes-scp}). % \myincfig{matchprobes-scp}{0.8\textwidth}{Comparison of the intensity data from common probes (left panel) and probesets (right panel) of the two array platforms.} % <>= png("matchprobes-scp.png", width=900, height=480) par(mfrow=c(1,2)) plot(exprs(res$dat), main="after combine", pch=".", asp=1, xlab="f1", ylab="f2", log="xy") plot(exprs(z), main="after RMA", pch=".", asp=1, xlab="f1", ylab="f2") dev.off() @ % Here, we explicitely used the \Rfunction{png} device to write this plot to; this circuments the default choice of \Rfunction{Sweave}, PDF, which can become cumbersomely large in size for feature-rich plots such as this one. %------------------------------------------------------------ \section{Some sequence related ``preprocessing and quality'' plots} \label{subsec.prqplots} %------------------------------------------------------------ The function \Rfunction{basecontent} counts the number of occurences of each of the four bases A, C, G, T in each probe sequence. <>= bc = basecontent(abseq) head(na.omit(bc)) @ % Let us define an ordered factor variable for GC content: % <>= GC = ordered(bc[,"G"] + bc[,"C"]) colores = rainbow(nlevels(GC)) @ % Figure~\ref{matchprobes-bap} shows a barplot of the frequencies of counts in \Robject{GC}: % <>= barplot(table(GC), col=colores, xlab="GC", ylab="number") @ Figure~\ref{matchprobes-bxp}: <>= boxplot(log2(exprs(x2)[,1]) ~ GC, outline=FALSE, col=colores, , xlab="GC", ylab=expression(log[2]~intensity)) @ Figure~\ref{matchprobes-p2p}: <>= png("matchprobes-p2p.png", width=900, height=480) plot(exprs(x2)[ipm,1], exprs(x2)[imm,1], asp=1, pch=".", log="xy", xlab="PM", ylab="MM", col=colores[GC[ipm]]) abline(a=0, b=1, col="#404040", lty=3) dev.off() @ % \myincfig{matchprobes-bap}{0.7\textwidth}{Distribution of probe GC content. The height of each bar corresponds to the number of probes with the corresponing GC content.} \myincfig{matchprobes-bxp}{0.7\textwidth}{Boxplots of $\log_2$ intensity stratified by probe GC content.} \myincfig{matchprobes-p2p}{0.7\textwidth}{Scatterplot of PM vs MM intensities, colored by probe GC content.} %------------------------------------------------------------ \section{Creating probe packages}\label{sec.mpp} %------------------------------------------------------------ Probe packages are a convenient way for distributing and storing the probe sequences and related information. %------------------------------------------------------------ \subsection{For Affymetrix genechips}\label{subsec.affy} %------------------------------------------------------------ In this section we see how a probe package can be created for Affymetrix genechips from the tabulator-separated sequence files that can be obtained from the vendor (at \url{http://www.affymetrix.com/support}). As an example, the file \Rfile{HG-U95Av2\_probe\_tab.gz} is provided in the \Rfile{extdata} subdirectory of the \Rpackage{matchprobes} package. <>= filename <- system.file("extdata", "HG-U95Av2_probe_tab.gz", package="matchprobes") outdir <- tempdir() me <- "Wolfgang Huber " species <- "Homo_sapiens" makeProbePackage("HG-U95Av2", datafile = gzfile(filename, open="r"), outdir = outdir, maintainer = me, species = species, version = "0.0.1", force = TRUE) dir(outdir) @ %------------------------------------------------------------ \subsection{For other chiptypes}\label{subsec.otherarraytypes} %------------------------------------------------------------ To deal with different file formats and additional types of probe annotation data from public or in-house databases, the function \Rfunction{makeProbePackage} offers a great deal of flexibility. The user can specify her own import function through the argument \Robject{importfun}. By default, its value is \Rfunction{getProbeDataAffy}, a function that reads tabular Affymetrix genechip sequence files. Import functions for other types of arrays can be adapted from this prototype. The help pages and R code contained in the produced packages are derived from a template directory that obeys the usual R package conventions~\cite{RExt}. A prototype for such a directory is provided within the package \Rpackage{matchprobes}. To facilitate the automated production of large numbers of similar packages, we provide a text substitution mechanism similar to the one used in the GNU \Rfunction{configure} system. The input parameters of an import function are \begin{itemize} \item a character string naming the array type \item the input data files \item a character string with the directory name of a package template directory, containing at least a file \Rfile{DESCRIPTION}, a directory \Rfile{man} with a file @PKGNAME@.Rd, a directory \Rfile{data}, and possible other directories and files, conforming to the usual R package conventions. \item ... any sort of further parameters, as necessary. \end{itemize} The output of an import function is a named list with elements \begin{itemize} \item \texttt{pkgname}: a character string with the package name. \item \texttt{dataEnv}: an environment containing an arbitrary number of data objects; these make the core of the package. Among these, there should be a data frame whose name is the value of \texttt{pkgname} with a column \Robject{sequence}. This data frame may have other columns such as the $x$- and $y$-position of the probes on the array, identifiers linking a probe to genomic databases, or its length and relative position within the gene it represents. The objects in the environment will be saved as individual \Rfile{.rda} files of the same name into the data directory of the produced package. Documentation needs to be provided for the columns of the data frame, as well as for the other objects. \item \texttt{symVals}: a named list, containing the symbol-value substitutions that are used in the text processing. It must at least contain the elements \begin{itemize} \item \texttt{ARRAYTYPE}: name of the array \item \texttt{DATASOURCE}: a textual description of how the data were obtained. It should contain the URL, or the name of the company / person. \end{itemize} \end{itemize} For more details, please refer to the help files for the functions \Rfunction{makeProbePackage} and \Rfunction{getProbeDataAffy}. For an example, refer to the source code of \Rfunction{getProbeDataAffy}. \begin{thebibliography}{10} \bibitem{RExt} R Foundation (1999). \newblock {\em Writing R Extensions}. \newblock \url{http://www.r-project.org}. \end{thebibliography} \end{document}