%\VignetteIndexEntry{affyPLM: Model Based QC Assessment of Affymetrix GeneChips} %\VignettePackage{affyPLM} \documentclass[12pt]{article} \usepackage{amsmath} \usepackage{hyperref} \usepackage[authoryear,round]{natbib} \textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \newcommand\Rpackage[1]{{\textsf{#1}\index{#1 (package)}}} \newcommand\dataset[1]{{\textit{#1}\index{#1 (data set)}}} \newcommand\Rclass[1]{{\textit{#1}\index{#1 (class)}}} \newcommand\Rfunction[1]{{{\small\texttt{#1}}\index{#1 (function)}}} \newcommand\Rfunarg[1]{{\small\texttt{#1}}} \newcommand\Robject[1]{{\small\texttt{#1}}} \newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle} \title{affyPLM: Model Based QC Assessment of Affymetrix GeneChips} \author{Ben Bolstad \\ {\tt bmb@bmbolstad.com}\\ \url{http://bmbolstad.com}} \begin{document} \maketitle \tableofcontents \section{Introduction} This document is a basic users guide to the quality assessment facilities of the \Rpackage{affyPLM} package. Other vignettes for this package describe other functionalities. Quality assessment is something that should be carried out in the initial analysis of any Affymetrix GeneChip dataset. \Rpackage{affyPLM} provides a number of useful tools based on probe-level modeling procedures. To begin, load the package using <>= library(affyPLM) options(width=40) @ \section{Fitting Probe Level Models for Quality Assessment} The core function for fitting the probe-level models is \Rfunction{fitPLM}. As input it takes an \Rclass{AffyBatch} object and outputs a \Rclass{PLMset} object. Quality assessment functions operate on \Rclass{PLMset} objects. Note that there is another vignette which describes in much greater deal the \Rfunction{fitPLM}. Note that \Rpackage{affyPLM} also provides the \Rfunction{rmaPLM} and \Rfunction{threestepPLM} functions which will also take \Rclass{AffyBatch} objects and return \Rclass{PLMset} objects. Some quality assessment functions may operate successfully on these \Rclass{PLMset} objects, but in general for appropriate probe-level model quality assessment use \Rfunction{fitPLM}. In this vignette we will use the \dataset{Dilution} dataset. Note that this is a small test dataset provided by the \Rpackage{affydata} package. The following code loads this dataset and then fits probe-level models to all of the probesets. <>= require(affydata) data(Dilution) # an example dataset provided by the affydata package #FIXME:remove the next line Dilution = updateObject(Dilution) Pset <- fitPLM(Dilution) @ \section{Quality Diagnostics} \subsection{Chip Pseudo-images} Chip pseudo-images are very useful for detecting artifacts on arrays that could pose potential quality problems. More specifically, weights and residuals from model fitting procedures are stored in \Rclass{PLMset} objects and may be graphically displayed using the \Rfunction{image} function. When the \Rfunction{image} function is applied to a \Rclass{PLMset} object a pseudo-image of the chip based upon the residual is produced. By default, chip pseudo-image of the weights are produced. The following code produces a chip pseudo image of the weights for the second array in the dataset. <>= image(Pset,which=2) @ Note that the \Rfunarg{which} argument is used to control which array is drawn. Its value should be between $1$ and $n$, where $n$ is the number of arrays in the dataset. If \Rfunarg{which} is not supplied, then images for all chips are drawn, one by one in succession. Figure \ref{weights} shows the resultant chip pseudo image. Areas of low weight are greener, high weights (ie weight of 1) are light grey. <>= png("Quality-weightimage1a.png",height=4,width=4,pointsize=10,res=300,units="in") par(mar=c(2.0,2.1,1.6,1.1),oma=c(1,1,0,0)) image(Pset,which=2) dev.off() @ \begin{figure}[htbp] \begin{center} \includegraphics[width=0.49\textwidth]{Quality-weightimage1a} \end{center} \caption{\label{weights}An image of the weights for the second chip in the dataset} \end{figure} While the conventional coloring scheme for weights pseudo chip images used terrain coloring, it is possible to override this by supplying the \Rfunarg{col} argument to \Rfunction{image}. The following code creates weights images using two different greyscale (black to white) and (white to black): <>= image(Pset,which=2,col=gray(0:25/25),add.legend=TRUE) image(Pset,which=2,col=gray(25:0/25),add.legend=TRUE) @ with Figure \ref{weightscolor} showing the results. Note that it is recommended that you use a legend if you use an alternative coloring scheme. Legends are added when \Rfunarg{add.legend=TRUE} is supplied. <>= png("Quality-weightimage2a.png",height=8,width=8,res=300,units="in") par(mar=c(2.0,2.1,1.6,1.1),oma=c(1,1,0,0)) image(Pset,which=2,col=gray(0:25/25),add.legend=T) dev.off() png("Quality-weightimage2b.png",height=8,width=8,res=300,units="in") par(mar=c(2.0,2.1,1.6,1.1),oma=c(1,1,0,0)) image(Pset,which=2,col=gray(25:0/25),add.legend=T) dev.off() @ \begin{figure}[htbp] \begin{center} \includegraphics[width=0.49\textwidth]{Quality-weightimage2a} \includegraphics[width=0.49\textwidth]{Quality-weightimage2b} \end{center} \caption{\label{weightscolor}Image of the weights for the second chip in the dataset using different colorings.} \end{figure} Residuals are the second quantity for which \Rpackage{affyPLM} produces chip pseudo images. Four different views of residuals are provided: residuals, positive residuals, negative residuals and sign of residuals. The \Rfunarg{type} is used to control which of these images is drawn. Setting \Rfunarg{type="resids"} gives a chip pseudo image of residuals, with higher red intensities corresponding with higher positive residuals, white corresponding with residuals close to 0 and more intense blues corresponding with high negative residuals. When \Rfunarg{type="pos.resids"} only high positive residuals are drawn in red, with negative and near 0 residuals being drawn in white. Using \Rfunarg{type="neg.resids"} only extreme negative residuals are drawn in blue, with positive negative and near 0 residuals being drawn in white. Finally, \Rfunarg{type="sign.resids"} gives images where all negative residuals regardless of magnitude are indicated by blue and all positive residuals by red. The following code produces these images, as shown in Figure \ref{Residuals}, for the second chip in the \dataset{Dilution} dataset: <>= image(Pset,which=2, type="resids") image(Pset,which=2, type="pos.resids") image(Pset,which=2, type="neg.resids") image(Pset,which=2, type="sign.resids") @ <>= png("Quality-residualimages1.png",height=4,width=4,res=300,units="in") par(mar=c(2.0,2.1,1.6,1.1),oma=c(1,1,0,0)) image(Pset,which=2, type="resids") dev.off() png("Quality-residualimages2.png",height=4,width=4,res=300,units="in") par(mar=c(2.0,2.1,1.6,1.1),oma=c(1,1,0,0)) image(Pset,which=2, type="pos.resids") dev.off() png("Quality-residualimages3.png",height=4,width=4,res=300,units="in") par(mar=c(2.0,2.1,1.6,1.1),oma=c(1,1,0,0)) image(Pset,which=2, type="neg.resids") dev.off() png("Quality-residualimages4.png",height=4,width=4,res=300,units="in") par(mar=c(2.0,2.1,1.6,1.1),oma=c(1,1,0,0)) image(Pset,which=2, type="sign.resids") dev.off() @ One thing that should be noted about the residual chips pseudo images is that a logarithmic color space is used by default. This tends to intensify the coloring of large residuals without highlighting small residuals. Supplying \Rfunarg{use.log=FALSE} to the call to \Rfunction{image} will turn this off, but tends to give duller images. \begin{figure}[htbp] \begin{center} \begin{tabular}{cc} a. & b. \\\includegraphics[width=0.45\textwidth]{Quality-residualimages1} & \includegraphics[width=0.45\textwidth]{Quality-residualimages2} \\ c. & d. \\\includegraphics[width=0.45\textwidth]{Quality-residualimages3} & \includegraphics[width=0.45\textwidth]{Quality-residualimages4} \\ \end{tabular} \caption{Various chip pseudo-images of residuals. a) Residuals b) Positive residuals c) Negative residuals d) Sign of Residuals\label{Residuals}} \end{center} \end{figure} As with chip pseudo images of weights, it is possible to use alternative color schemes for residual images. It is recommended that you use the \Rfunction{pseudoPalette} function to control your color palette in this case. This function creates a color space that moves from one color to another. Some examples are given by the following code, with the results shown in Figure \ref{residualcolor}: <>= image(Pset,which=2,type="resids",col=pseudoPalette(low="darkgreen",high="magenta",mid="lightgrey"),add.legend=TRUE) image(Pset,which=2,type="pos.resids",col=pseudoPalette(low="yellow",high="darkblue"),add.legend=TRUE) @ <>= png("Quality-residualimages5.png",height=8,width=8,res=300,units="in") par(mar=c(2.0,2.1,1.6,1.1),oma=c(1,1,0,0)) image(Pset,which=2,type="resids",col=pseudoPalette(low="darkgreen",high="magenta",mid="lightgrey"),add.legend=TRUE) dev.off() png("Quality-residualimages6.png",height=8,width=8,res=300,units="in") par(mar=c(2.0,2.1,1.6,1.1),oma=c(1,1,0,0)) image(Pset,which=2,type="pos.resids",col=pseudoPalette(low="yellow",high="darkblue"),add.legend=TRUE) dev.off() @ \begin{figure}[htbp] \begin{center} \includegraphics[width=0.49\textwidth]{Quality-residualimages5} \includegraphics[width=0.49\textwidth]{Quality-residualimages6} \caption{Chip pseudo-images of residuals using different coloring.\label{residualcolor}} \end{center} \end{figure} \clearpage \subsection{RLE} Another quality assessment tool are Relative Log Expression (RLE) values. Specifically, these RLE values are computed for each probeset by comparing the expression value on each array against the median expression value for that probeset across all arrays. Assuming that most genes are not changing in expression across arrays means ideally most of these RLE values will be near 0. Boxplots of these values, for each array, provides a quality assessment tool. A RLE boxplot can be produced using: <>= RLE(Pset,main="RLE for Dilution dataset") @ with the resulting plot shown in Figure \ref{RLE}. When examining this plot focus should be on the shape and position of each of the boxes. Typically arrays with poorer quality show up with boxes that are not centered about 0 and/or are more spread out. For this particular dataset there is no such array. <>= png("Quality-RLE.png",height=4,width=4,pointsize=10,res=300,units="in") par(mar=c(2.0,2.1,1.6,1.1),oma=c(1,1,0,0)) RLE(Pset,main="RLE for Dilution dataset") dev.off() @ \begin{figure}[htbp] \begin{center} \includegraphics[width=0.6\textwidth]{Quality-RLE} \caption{RLE boxplot.\label{RLE}} \end{center} \end{figure} As an alternative to the RLE boxplot it is possible to compute summary statistics, by array, using the following command: <>= RLE(Pset,type="stats") @ the median and IQR of the RLE values for each array is returned. Note that calling \Rfunction{RLE(Pset,type=values)} returns all the RLE expression values. \subsection{NUSE} Normalized Unscaled Standard Errors (NUSE) can also be used for assessing quality. In this case, the standard error estimates obtained for each gene on each array from \Rfunction{fitPLM} are taken and standardized across arrays so that the median standard error for that genes is 1 across all arrays. This process accounts for differences in variability between genes. An array were there are elevated SE relative to the other arrays is typically of lower quality. Boxplots of these values, separated by array can be used to compare arrays. The \Rfunction{NUSE} function will produce such a plot: <>= NUSE(Pset,main="NUSE for Dilution dataset") @ <>= png("Quality-NUSE.png",height=4,width=4,pointsize=10,res=300,units="in") par(mar=c(2.0,2.1,1.6,1.1),oma=c(1,1,0,0)) NUSE(Pset,main="NUSE for Dilution dataset") dev.off() @ \begin{figure}[htbp] \begin{center} \includegraphics[width=0.6\textwidth]{Quality-NUSE} \caption{NUSE boxplot.\label{NUSE}} \end{center} \end{figure} As an alternative to the NUSE boxplot it is possible to compute summary statistics, by array, using the following command: <>= NUSE(Pset,type="stats") @ the median and IQR of the NUSE values for each array is returned. Note that calling \Rfunction{NUSE(Pset,type=values)} returns all the NUSE values. \section{Final Comments} The tools discussed here are most useful for assessing relative quality within a dataset. A typical use of these tools would be to decide whether or not any arrays should be excluded from down-stream data analysis. Some users might find it useful to see the results of these quality assessment procedures on other datasets. Such information can be found on the PLM Image Gallery website \url{http://plmimagegallery.bmbolstad.com}. <>= ## give ghostscript on Windows a few seconds to catch up (UGLY HACK) Sys.sleep(10) @ \end{document}