Lab7.Rnw
%%%%% % NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is % likely to be overwritten. % % \VignetteIndexEntry{Lab 7} %\VignetteDepends{Biobase,affy,ctest,multtest,bioclabs,annotate,hgu95av2} %\VignetteKeywords{Microarray} \documentclass[12pt]{article}
\usepackage{amsmath,pstricks} \usepackage[authoryear,round]{natbib} \usepackage{hyperref}
\textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in
\newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle}
\bibliographystyle{plainnat}
\title{Lab 7: Analyzing Microarray Data: From Images to List of Candidate Genes}
\begin{document}
\maketitle
In this lab, we demonstrate how to use R and Bioconductor to 1) read-in Affymetrix cel files, 2) do some quality control checks, 3) pre-process the cel level data to obtain expression measures, 4) obtain scores for differential expression and cut-offs to create list, and finally 5) create user-friendly web pages to report results.
<
\section{Read-In Cel Files} The function \verb+ReadAffy+ can be used to read in cel files and enter phenotypic data as well as MIAME information.
\begin{Sinput} R> spikein <- ReadAffy(phenoData="phenodata.txt",description="miame.txt", verbose=TRUE) R> phenodata <- read.phenoData("phenodata.txt",check.names=FALSE) R> phenoData(spikein) <- phenodata \end{Sinput}
The data one would obtain by doing this is available from the \verb+bioclabs+ package.
<
\section{Quality Control} Various functions exist in the \verb+affy+ package that can be used for quality control. Let's try a few...
<
The different colors represent the two different populations.
We can also use \verb+image+, \verb+boxplot+, among others...
\section{Pre-processing} Now we need to convert the probe-level data into expression measures. Various methods are available through the function \verb+expresso+ and one can easily create a new one using the function \verb+express+.
Now we will use the function \verb+rma+ which is implemented in C and
is therefore quite fast.
<
\section{Differential Expression} In this section we will compute the average log ratio for between the two populations and the t-test as well. We will then obtain adjusted p-values and create a list with genes that are statistically significant.
First, notice there are two populations and 12 replicates in each.
<
Let's get average intensities, average log ratios, t-tests, and
p-values using the \verb+t.test+ function.
<
Now let's make the genes be in the rows and give appropriate names to the columns:
<
Now let's make an M (average log ratio) vs A (average intensity plot). The horizontal line shows the typical two-fold-change cutoff.
<
Should we take the variability of the estimates into account? It's only 3 replicates but we can try a t-test. The following is a so-called volcano plot which plots the t-test versus the estimate.
<
How many genes have p-values less than 0.05? How about 0.01?
<
Maybe we should adjust the p-values. Let's use the \verb+multtest+ package to obtain adjusted p-values using Benjamini and Yekutieli's procedures for (strong) control of the false discovery rate (FDR).
The function \verb+mt.rawp2adjp+ gives adjusted p-values according to various methods using only the raw p-values.
<
This assumes that the t-test are actually t-distributed. If we had more time we could try a non-parametric method such as maxT using the function \verb+mt.maxT+.
At what FDR would we be happy? 0.01 is pretty conservative. Let's try it anyway. In the next section we will make a
\section{Creating Report}
First let's pick the AffyIDs corresponding to the genes with adjusted p-values of less than 0.01.
<
Now using the data available through the metadata package \verb+hgu95av2+ lets find the corresponding gene names and locus link IDs.
<
We can now make a nice web-page
<
How we faired with the known to be differentially expressed genes? Of
the genes we called how many where actually differential expressed?
<
Not bad... but not 0.01 FDR either.
Let's make an MVA plot with the gene names.
<
Was it worth using a t-tests over the more simple fold change estimates?
Let's see which one does better at ranking the truly differentially expressed genes:
<
\end{document}