\documentclass{article} \title{Example of a Statistical Analysis} \author{James W. MacDonald} \usepackage{hyperref} \parindent=0.25in \usepackage[authoryear,round]{natbib} \usepackage{times} \usepackage{comment} \bibliographystyle{plainnat} \begin{document} \maketitle <>= library(affycoretools) library(KEGG) library(xtable) library(affyPLM) ## make AnnotatedDataFrame pd <- read.AnnotatedDataFrame(paste(system.file("examples",package="affycoretools"), "/pdata.txt", sep=""), header = TRUE, row.names = 1) ## no celfiles in package any more, fake this step #dat <- ReadAffy() #eset <- rma(dat) load(paste(system.file("examples", package="affycoretools"), "/abatch.Rdata", sep="")) load(paste(system.file("examples", package="affycoretools"), "/exprSet.Rdata", sep="")) ## load annotation package options(show.error.messages = FALSE) a <- try(do.call("library", list(annotation(eset)))) options(show.error.messages = TRUE) if(inherits(a, "try-error")){ source("http://www.bioconductor.org/biocLite.R") biocLite(annotation(eset)) do.call("library", list(annotation(eset))) } type <- rep(LETTERS[1:4], each = 3) @ This analysis is based on microarrays that were processed in the microarray facility in August 2004. Filenames and samples were as follows: \SweaveOpts{echo = false} <>= ## Put a table of filenames and sample types in PDF type <- rep(LETTERS[1:4], each = 3) tmp <- data.frame(sampleNames(eset), type) names(tmp) <- c("Filenames","Samples") xtable(tmp) @ The goal of this analysis is to see if there is a difference between the A and B samples, as well as between the C and D samples. The first step in my analysis was to make some quality control plots that can be used to check the overall quality of the raw data. \begin{figure} \centering <>= plotHist(dat, sampleNames(eset)) plotHist(dat[,1:6]) plotHist(dat[,7:12]) @ \caption{Plot of perfect match (PM) chip densities} \label{fig:digest} \end{figure} Figure~\ref{fig:digest} shows the distribution of the PM probes for each chip. One of the underlying assumptions for the normalization procedure I use is that the PM probe data all come from the same distribution, with the only differences being the location and scale. Basically, this means that we want the shape of the curves to be very similar, and we want the curves to be fairly close to each other. There appears to be a large difference between the A/B and C/D samples that may have an impact on our analysis. Since we are only concerned with the comparison of A/B and C/D it might make sense to pre-process these samples separately. \begin{figure} \centering <>= plotDeg(dat, sampleNames(eset)) plotDeg(dat[,1:6]) plotDeg(dat[,7:12]) @ \caption{RNA degradation plot} \label{fig:deg} \end{figure} Figure~\ref{fig:deg} is designed to show differences between samples due to mRNA degradation or from the \emph{in vitro} translation step. Any differences between samples will be indicated by a different slope. Again, the only differences are between the two sample sets. These two plots indicate that we should probably process each set of samples separately and then combine later. \begin{figure} \centering <>= boxplot(data.frame(exprs(eset))) @ \caption{Boxplot of Expression values from both sets} \label{fig:box} \end{figure} I calculated expression values for each gene using a robust multi-array average (RMA) \citet{Irizarry2003}. This is a modeling strategy that converts the PM probe values into an expression value for each gene. Note that the expression values are $log_2$ transformed data. These data can be converted to the natural scale by exponentiating (e.g., convert by using $2^x$, where \emph{x} is the expression value). Figure~\ref{fig:box} shows a boxplot of the expression values for each set of data. It appears here that the data are fairly well normalized. I then fit a principal components analysis (PCA) on the expression values and then plotted the first two principal components (PCs). PCA can be used to visualize the overall structure of high dimensional data; in this case, we are using it to see if the replicated samples are grouping together, which would indicate that the replicates are relatively similar in their gene expression profiles. \begin{figure} \centering <>= plotPCA(eset, groups = rep(1:4, each = 3), groupnames = unique(paste(pData(pd)[,1], pData(pd)[,2], sep = "-"))) plotPCA(eset[,1:6], groups=rep(1:2, each=3), groupnames=unique(paste(pData(pd)[1:4,1], pData(pd)[1:4,2], sep="-"))) plotPCA(eset[,7:12], groups=rep(1:2, each=3), groupnames=unique(paste(pData(pd)[7:10,1], pData(pd)[7:10,2], sep="-"))) @ \caption{PCA plot} \label{fig:pca} \end{figure} Figure~\ref{fig:pca} shows the PCA plot. Here again we can see that there is a very large difference between the A/B and C/D samples. The PCA plot indicates that the replicated samples are quite similar, but doesn't tell us much about the quality of the RMA model fit. For this we can do boxplots of the normalized unscaled standard errors (NUSE) from the model fit. Any chip that has large standard errors in comparison to the others is probably of lower quality, as the model isn't fitting the data from that chip very well. \begin{figure} \centering <>= library(affyPLM) pset <- fitPLM(dat[,7:12]) boxplot(pset, xaxt="n", main="NUSE") axis(1, at=7:12, label=paste("Sample", 1:6), las=2, cex.axis=0.8) @ \caption{NUSE plot} \label{fig:nuse} \end{figure} Figure~\ref{fig:nuse} shows the NUSE plot for the first six chips. The standard errors are all quite similar for each chip, indicating that the model fits the data similarly on each chip. One last QA plot that might be of interest is a relative log expression (RLE) plot. For this plot we compute the relative log expression for each probeset on each chip, relative to the median value for that probeset. Any chip that is very different from the others in this plot is typically of lower quality. \begin{figure} \centering <>= Mbox(pset, xaxt="n", main="RLE") axis(1, at=1:6, label=paste("Sample", 1:6), las=2, cex.axis=0.8) @ \caption{NUSE plot} \label{fig:nuse} \end{figure} <>= ## filter data to remove probesets that aren't changing index <- apply(exprs(eset)[,1:6], 1, var) > 0.01 eset1 <- eset[index,] ## create design matrix and give reasonable column names design <- model.matrix(~ 0 + factor(rep(1:4,each=3))) colnames(design) <- LETTERS[1:4] @ <>= ## set up a contrasts matrix using makeContrasts() contrast <- makeContrasts(A - B, C - D, levels = design) ## now do the same using matrix() contrast <- matrix(c(1,-1,0,0,0,0,1,-1), ncol=2, dimnames=list(unique(type), paste(type[c(1,7)],type[c(4,10)], sep=" vs "))) @ <>= ## fit model fit <- lmFit(eset1, design) fit2 <- contrasts.fit(fit, contrast) ## empirical Bayes step fit2 <- eBayes(fit2) @ <>= ## output text and HTML tables using limma2annaffy() out <- limma2annaffy(eset1, fit2, design, contrast, annotation(eset), pfilt = 0.05, fldfilt = 1, save = TRUE, text = TRUE, interactive = FALSE) @ <>= ## output text and HTML tables using limma2biomaRt() ## this is usually more applicable to analyses of chips that don't have BioC ## annotation packages ## Change mysql to FALSE if you don't have RMySQL installed out2 <- limma2biomaRt(eset1, fit2, design, contrast, "hsapiens", ann.source="affy_hg_focus", pfilt=0.05, fldfilt=1, addname="biomart", affyid=TRUE, mysql=TRUE, interactive=FALSE) @ Prior to making comparisons, I filtered the genes to remove any that do not appear to be differentially expressed in any samples, based on at least samples having an expression value of or greater. This resulted in a total of \Sexpr{sum(index)} genes. I then made the requested comparisons using a modeling strategy developed for microarray analyses (\citet{Smyth:2004}), selecting those probesets with an adjusted $p$-value less than 0.05 and a two-fold difference (adjusting for multiple comparisons using false discovery rate (\citet{Benjamini:1995})). This resulted in the following number of probesets: <>= a <- data.frame(paste(unique(type)[c(1,3)], unique(type)[c(2,4)], sep = " vs "), sapply(out, function(x) dim(x)[1])) names(a) <- c("Comparisons","Probesets") xtable(a, digits = rep(0, 3)) @ I output these data in HTML and text tables that include various statistics, as well as annotation for the different probesets. I also output all the expression values in a text table that can be opened using a spreadsheet and used for ongoing analyses, or to look for probesets that might not appear in the HTML tables. Please note that I used the affy, and limma packages of Bioconductor for this analysis. If you publish these results, please use the following citations. \bibliography{BioC2007} \end{document}