% automatic manuscript creation for frma % -*- mode: noweb; noweb-default-code-mode: R-mode; -*- %\VignetteIndexEntry{frma: Preprocessing for single arrays and array batches} %\VignetteDepends{frma, hgu133afrmavecs, frmaExampleData} %\VignettePackage{frma} \documentclass[12pt]{article} \usepackage{hyperref} \usepackage[authoryear, round]{natbib} \textwidth=6.2in \textheight=8.5in \parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \newcommand{\R}{{$\mathbf{R}$}} \newcommand\Rpackage[1]{{\textsf{#1}\index{#1 (package)}}} \newcommand\dataset[1]{{\textit{#1}\index{#1 (data set)}}} \newcommand\Rclass[1]{{\textit{#1}\index{#1 (class)}}} \newcommand\Rfunction[1]{{{\small\texttt{#1}}\index{#1 (function)}}} \newcommand\Rfunarg[1]{{\small\texttt{#1}}} \newcommand\Robject[1]{{\small\texttt{#1}}} \author{Matthew N. McCall} \begin{document} \title{Frozen Robust Multi-Array Analysis and the Gene Expression Barcode} \maketitle \tableofcontents The \Rpackage{frma} implements methods for the preprocessing (frma) and analysis (barcode) of single microarrays. The fRMA algorithm allows the user to preprocess arrays individually while retaining the advantages of multi-array preprocessing methods such as RMA. The Gene Expression Barcode provides estimates of absolute expression rather than relative expression. \section{Frozen Robust Multiarray Analysis (fRMA)} Frozen RMA (fRMA) is a microarray preprocessing algorithm that allows one to analyze microarrays individually or in small batches and then combine the data for analysis. This is accomplished by utilizing information from the large publicly available microarray databases. Specifically, estimates of probe-specific effects and variances are precomputed and frozen. Then, with new data sets, these frozen parameters are used in concert with information from the new array(s) to preprocess the data. The fRMA is particularly useful when it is not feasible to preprocess all of the data simultaneously. Such situations often arise when: \begin{itemize} \item large meta-analyses require one to preprocess more data than can be stored in memory, \item datasets grow incrementally and it would be laborious to preprocess all of the data each time a new array is added, \item microarrays are used to aid in clinical diagnosis and treatment, one needs to obtain information based on a single sample hybridized to a single microarray. \end{itemize} Additionally, fRMA down-weights probes that appear to be \emph{batchy}, those that are most susceptible to batch-effects. \subsection{From CEL files to expression estimates} For Affymetrix microarray data, it is customary to view CEL files as the starting point for preprocessing and analysis. One or more CEL files can be read into \R{} using the \Rfunction{ReadAffy} function from the \Rpackage{affy} package to produce an \Robject{AffyBatch} object or using the \Rfunction{read.celfiles} function from the \Rpackage{oligo} to produce a \Robject{FeatureSet}. The goal of fRMA is to obtain reliable gene-level intensities from the raw microarray data. This amounts to converting raw probe-level intensities into background-corrected, normalized, and summarized gene-level intensities. The \Rfunction{frma} function takes an \Robject{AffyBatch} or \Robject{FeatureSet} as input and produces an object containing gene-level expression values. This object can take one of two forms, an \Robject{ExpressionSet} or a \Robject{frmaExpressionSet}, depending upon the additional information that is requested. The gene-level intensities stored in both these objects can be accessed using the \Rfunction{exprs} method. In addition to the raw data, the fRMA algorithm requires a number of \emph{frozen} parameter vectors. Among these are the reference distribution to which the data are normalized and the probe-effect estimates. We have computed these frozen parameters for many popular Affymetrix platforms. The data for each of these platforms is stored in an \R{} package of the form \Rpackage{$<$platform$>$frmavecs}. By default, the \Rfunction{frma} function attempts to load the appropriate data package for the input data object. The remainder of this section describes typical use of the \Rfunction{frma}. For details of the statistical methodology implemented in the frma function, please read the following papers: McCall MN, Bolstad BM, and Irizarry RA (2010). Frozen Robust Multi-Array Analysis (fRMA), Biostatistics, 11(2):242-253. McCall MN, Jaffee HA, Irizarry RA (2012). fRMA ST: Frozen robust multiarray analysis for Affymetrix Exon and Gene ST arrays, Bioinformatics, 28(23):3153-3154. The following is a simple description of how to preprocess a single CEL file using the default version of frma: \begin{enumerate} \item Download and install frma and the appropriate frozen parameter package (i.e. hgu133afrmavecs, hgu133plus2frmavecs, etc.). \item Load the frma package. <>= library(frma) @ \item Read in the data as either an \Robject{AffyBatch} or \Robject{FeatureSet} object. For the early Affymetrix platforms (e.g. HGU133plus2), you should use the \Rfunction{ReadAffy} function from the \Rpackage{affy} package to read in the CEL files. For the later Affymetrix platforms (e.g. Gene ST or Exon ST), you should use the \Rfunction{read.celfiles} function from the \Rpackage{oligo} package. \item In this vignette we will load an example \Robject{AffyBatch} object: <>= library(frmaExampleData) data(AffyBatchExample) @ \item Preprocess the raw data object using the default version of frma. <>= object <- frma(AffyBatchExample) @ \end{enumerate} The final \Robject{object} will be an \Robject{ExpressionSet} or \Robject{frmaExpressionSet}. The latter is an extension of the ExpressionSet class to hold additional information related to the preprocessing procedure such as weights and residuals. To obtain a matrix of gene-level expression values, enter the following command: <>= e <- exprs(object) @ To assess the quality of the expression estimates, one can use the \Rfunction{GNUSE} function: <>= GNUSE(object,type="stats") @ The GNUSE is a modification of the NUSE quality metric that allows one to assess the quality of individual samples relative to the large training data set used to compute the frozen parameter vectors. The GNUSE is described in more detail in the following paper: McCall MN, Murakami PN, Lukk M, Huber W, Irizarry RA (2011). Assessing Affymetrix GeneChip Microarray Quality, BMC Bioinformatics, 12:137. Depending on the number of CEL files and on the memory available to your system, you might experience errors like `Cannot allocate vector \ldots'. An obvious option is to increase the memory available to your R process (by adding memory and/or closing external applications). You might also consider analyzing the data in smaller batches. Recall that fRMA allows you to preprocess data separately and then combine the preprocessed data for further analysis. \subsection{Advanced Options} The default arguments to the \Rfunction{frma} function will be sufficient for most users; however, additional options have been implemented to allow the user to control each stage of the preprocessing, as well as, the information returned by the \Rfunction{frma} function. This flexibility is instrumental in allowing users to easily explore alternative methods of preprocessing. \subsubsection{Summarization Methods} \emph{Summarization} refers to the method used to combine probe-level expression values to obtain gene-level expression estimates. There are several summarization methods that one can choose from when running frma. A brief description of each of the methods follow: \begin{itemize} \item {\bf{average}:} subtract the probe-effect and then compute the mean of the probes in each probeset. \item {\bf{median}:} subtract the probe-effect and compute the median of the probes in each probeset. \item {\bf{median\_polish}:} this is the same as the median summarization because the probe-effects have already been removed. \item {\bf{weighted average}:} compute a weighted average of the probes in each probeset with weights equal to the inverse of the sum of the precomputed within and between batch variance estimates. \item {\bf{robust\_weighted\_ average}:} compute a weighted average of the probes in each probeset with weights equal to the weights returned by an M-estimation procedure divided by the sum of the precomputed within and between batch variance estimates. \item {\bf{random\_ effect}:} the robust weighted average method adapted for a batch of new arrays (see the fRMA paper for details). \end{itemize} \subsubsection{Input Vectors} While the vast majority of users will use the precomputed vectors provided in the \Rpackage{frmavecs} packages, the \Rfunction{frma} function will accept user-supplied frozen parameter vectors. The \Rpackage{frmaTools} package contains functions to create your own frozen parameter vectors. There are several situations in which creating your own frozen parameter vectors may be beneficial. These are described in detail in: McCall MN and Irizarry RA (2011). Thawing Frozen Robust Multi-array Analysis (fRMA), BMC Bioinformatics, 12:369. If you create your own frozen parameter vectors using functions in the \Rpackage{frmaTools} package, the vectors will already be in the correct format: a list with elements normVec, probeVec, probeVarBetween, probeVarWithin, probesetSD, and medianSE. A description of each of these elements follows: \begin{itemize} \item {\bf{normVec}:} a vector containing values of the reference distribution to which samples will be quantile normalized \item {\bf{probeVec}:} a vector of probe-effect estimates \item {\bf{probeVarBetween}:} a vector of the between batch variance for each probe \item {\bf{probeVarWithin}:} a vector of the within batch variance for each probe \item {\bf{probesetSD}:} a vector of average within probeset standard deviations \item {\bf{medianSE}:} a vector of median standard errors of the expression estimates (used by the GNUSE function) \end{itemize} \subsubsection{Output Parameters} While the default is to only return the gene-level expression estimates and if applicable their standard errors, the frma function can also return additional information about the estimates depending on the summarization method chosen. A description of the arguments that can be included in the output.param argument follows: \begin{itemize} \item {\bf{weights}:} the weights from the M-estimation procedure \item {\bf{residuals}:} the residuals from fitting the probe-level model \item {\bf{randomeffects}:} estimated random effects from fitting the probe-level model with random effect summarization \end{itemize} Note that not all of these outputs are available for all of the summarization methods. \section{Creation of Gene Expression Barcodes} The barcode algorithm is designed to estimate which genes are expressed and which are unexpressed in a given microarray hybridization. This is accomplished by: (1) using the distribution of observed $log_2$ intensities across a wide variety of tissues to estimate an expressed and an unexpressed distribution for each gene, and (2) using these estimated distributions to determine which genes are expressed / unexpressed in a given sample. The first step is accomplished by fitting a hierarchical mixture model to the plethora of publicly available data. The second step is accomplished by determining where the observed intensities from the new array fall in the estimated distributions. The default output of the \Rfunction{barcode} function is a vector of ones and zeros denoting which genes are estimated to be expressed (ones) and unexpressed (zeros). We call this a \emph{gene expression barcode}. For more details about the Gene Expression Barcode, please read the following papers: McCall MN, Uppal K, Jaffee HA, Zilliox MJ, and Irizarry RA (2011). The Gene Expression Barcode: leveraging public data repositories to begin cataloging the human and murine transcriptomes, Nucleic Acids Research, 39:D1011-D1015. McCall MN, Jaffee HA, Zelisko SJ, Sinha N, Hooiveld G, Irizarry RA, Zilliox MJ (2014). The Gene Expression Barcode 3.0: improved data processing and mining tools, Nucleic Acids Research, 42(D1):D938-D943. \subsection{Getting Started} To create a gene expression barcode, one needs estimates of the gene expression distributions -- specifically the mean and variance of the unexpressed distribution for each gene. We have computed these for several popular Affymetrix platforms. To use one of these, simply preprocess your data using the default options of \Rfunction{frma} and then run the \Rfunction{barcode} function on the resulting object. Similar to the \Rfunction{frma} function, the \Rfunction{barcode} function requires platform specific precomputed parameters. These parameters are stored in the same \R{} packages as the frozen parameter vectors used by \Rfunction{frma}. In the default implementation, the \Rfunction{barcode} function attempts to load the appropriate set of parameters for the given \Robject{ExpressionSet} or \Robject{frmaExpressionSet} object. It is also possible for the user to supply the necessary parameters via optional arguments. \subsubsection{Example} \begin{enumerate} \item Download and install the frma package and the appropriate data package(s) (i.e. hgu133afrmavecs). \item Load the frma package. <>= library(frma) @ \item Read in the data and preprocess using the default options. <>= library(frmaExampleData) data(AffyBatchExample) object <- frma(AffyBatchExample) @ \item Convert the expression values to a gene expression barcode. <>= bc <- barcode(object) @ \end{enumerate} \subsection{Output Options} The default output of the \Rfunction{barcode} function is to return a vector of ones (expressed) and zeros (unexpressed); however, there are alternative output options. A brief description of each of these follows: \begin{itemize} \item {\bf{weight}:} a vector of weights which roughly correspond to the probability of expression for each gene. \item {\bf{z-score}:} a vector of z-scores under the unexpressed normal distribution for each gene. \item {\bf{p-value}:} a vector of p-values under the unexpressed normal distributionfor each gene. \item {\bf{lod}:} a vector of LOD scores under the unexpressed normal distributionfor each gene. \end{itemize} \end{document}