% % NOTE -- ONLY EDIT Biobase.Rnw!!! % Biobase.tex file will get overwritten. % %\VignetteIndexEntry{Biobase Primer} %\VignetteDepends{} %\VignetteKeywords{Expression Analysis} %\VignettePackage{Biobase} \documentclass{article} \usepackage{hyperref} \textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textsf{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\classdef}[1]{% {\em #1} } %%FIXME: Jianhua, can you go through the current class definitions for %%exprSet and phenoData etc and update the documentation here accordingly. \begin{document} \title{Textual Description of Biobase} \maketitle \section*{Introduction} Biobase is part of the Bioconductor project. It is meant to be the location of any reusable (or non--specific) functionality. Biobase will be required by most of the other Bioconductor libraries. \section{Data Structures} Part of the Biobase functionality is the standardization of data structures for genomic data. Currently we have designed some data structures to handle microarray data. The \classdef{exprSet} class has the following slots: \begin{description} \item[exprs] A matrix of expression levels. Arrays are columns and genes are rows. \item[se.exprs] A matrix of standard errors for expressions if they are available. It will have length 0 if they are not. \item[phenoData] An object of class \verb+phenoData+ that contains phenotypic and/or experimental data. \item[description] A description of the experiment (object of class MIAME) \item[annotation] A character string indicating the base name for the associated annotation. \item[notes] A set of notes describing aspects or features of the data, the analysis, processing done, etc. \end{description} These data are extremely large and complex. To deal with them effectively we will need better tools for combining data and documentation. The \verb+exprSet+ class represents an initial attempt by the Bioconductor project to provide better tools for documenting and handling these large and complex data sets. The expression data represent experimentally derived data. In most cases these data will benefit from making use of biologically relevant meta-data. The meta-data are very large and diverse. In order to facilitate interactions and explorations we have taken the approach of constructing a specialized meta-data package for each chip or instrument. Many of these packages are available from the Bioconductor web site. These packages contain information such as the gene name, symbol and chromosomal location. There are other meta-data packages that contain the information that is provided by other initiatives such as GO and KEGG. These data can then be linked to the \verb+exprSet+ via the \verb+annotation+ slot. The \textit{annotate} package provides basic data manipulation tools for the meta-data packages. The \texttt{phenoData} class has the following slots: \begin{description} \item[pData] A dataframe where the rows are cases and the columns are variables. \item[varLabels] A list of labels and descriptions for the variables represented by the columns of \verb+pData+. \item[varMetaData] A \Rclass{data.frame} defining meta-data for the variables contained in the \Robject{pData} slot. \end{description} Instances of this class are essentially \verb+data.frame+'s with some additional documentation on the variables stored in the \verb+varLabels+ and \Robject{varMetaData} slots. %%FIXME: does this currently exist? It should be easy to require that %%the names of the pData data.frame are the same, and in the same %%order as the column names of the exprs A mechanism for ensuring that the elements of the \verb+phenoData+ slot of an instance of \verb+exprSet+ are in the same order as the columns of the \verb+exprs+ array is needed. It is important that these be properly aligned since analyses will require this and automatic tools for checking will probably be better than ad hoc ones. In addition to the class definitions a number of special methods (or functions) have been defined to operate on instances of these classes. Some particular attention has been paid to subsetting operations. Instances of both \verb+phenoData+ and \verb+exprSet+ are closed under subsetting operations. That is, any subset of one of these objects retains its class. There are also specialized print methods for objects of both classes. We consider an instance of an \verb+exprSet+ to be an expression array with some additional information. Thus there are two subscripts, one for the rows and one for the columns. For that reason subsetting works in the following ways: \begin{itemize} \item If the first subscript is given then the appropriate subset of rows from \verb+exprs+ and \verb+se.exprs+ is taken. All the data in \verb+phenoData+ is propagated since no subset of cases was made. \item If the second subscript is given then the appropriate set of columns from \verb+exprs+ and \verb+se.exprs+ is taken. At the same time the corresponding set of {\em rows} of \verb+phenoData+ are taken. \end{itemize} \subsection{An exprSet Vignette} In the data directory for Biobase there is a small anonymized data set. It consists of expression level data for 500 genes on 26 patients. The data can be accessed with the command \verb+data(geneData)+. There are three artificial covariates provided as well. These can be accessed using \verb+data(geneCovariate)+ once the Biobase library is attached. %%FIXME: is there a better way to do this? The following vignette shows how to read in these data and to create an instance of the \verb+exprSet+ class using those data. The first step is to create an object of \verb+phenoData+ calss. This object is used to store the phenotype data. We will use the example data frame \verb+geneCovariate+ to create such an object. <>= library(Biobase) @ %%<>= %%data(geneCov) %%data(geneData) %%covN <- list(cov1 = "Covariate 1; 2 levels", %% cov2 = "Covariate 2; 2 levels", %% cov3 = "Covariate 3; 3 levels") %% %%pD <- new("phenoData", pData=geneCov, varLabels=covN) %%eSet <- new("exprSet", exprs=geneData, phenoData=pD) %% %%@ <>= data(geneCovariate) head(geneCovariate) phenoD <- new("phenoData", pData=geneCovariate, varLabels=list("sex"="Female/Male","type"="Case/Control","score"="Testing Score") ) phenoD @ We also need a \verb+MIAME+ object to represent the background information, such as the names of the experimenter and laboratory, for this data set. <>= descr <- new("MIAME", name="Pierre Fermat", lab="Francis Galton Lab", contact="pfermat@lab.not.exist", title="Smoking-Cancer Experiment", abstract="An example object of expression set (exprSet) class", url="www.lab.not.exist", other=list( notes="An example object of expression set (exprSet) class") ) descr @ Moreover, we need a \verb+data.frame+ object to group the reporters (probes) at this data set (chip). The definition of probe types is applied from \textit{Affymetrix GeneChip Expression Analysis Data Analysis Fundamentals} (\url{http://www.affymetrix.com/Auth/support/downloads/manuals/data_analysis_fundamentals_manual.pdf}), and the result is stored in the \verb+data.frame+ object \Robject{reporter}. This is a one-column data frame. The rows represent the probe ids in the data set, and the values in the data frame are the predefined types associated with each probe. <>= data(reporter) head(reporter) @ The data frame \Robject{geneData}, containing 500 probes and 26 samples, is a subset of real expression data from an Affymetrix U95v2 chip. The \Robject{seD} is a \Rclass{matrix} object containing the standard errors of \Robject{geneData}. Finaly, we can put all objects together to create the \verb+exprSet+ object. <>= data(geneData) data(seD) exSet <- new("exprSet", exprs = geneData, se.exprs = seD, phenoData = phenoD, annotation = "hgu95av2", description = descr, notes = descr@other$notes, reporterInfo = reporter ) exSet @ If instead, you have the data stored in files, e.g. \texttt{.csv} files and want to read it in using \Rfunction{read.table} you will need to do some conversion. \Rfunction{read.table} reads data into a \Robject{data.frame}, and this is fine for the phenotypic data, but not for the expression data. You will need to convert the \Robject{data.frame} into a matrix. The code example gives a short example of the commands that you would need to execute if you have the expression data stored in a file named \texttt{myexprs.csv} and the phenotypic data in a file named \texttt{mycovs.csv}. \begin{verbatim} myExprs = read.csv("myexprs.csv") myExprsmat = as.matrix(myExprs) myCovs = read.csv("mycovs.csv") if( ncol(myExprsmat) != nrow(myCovs) ) print("ERROR: these should be the same") \end{verbatim} You should probably also check to ensure that you have the appropriate samples names used as row names for \Robject{myCovs} and as column names for \Robject{myExprsmat}, and they must be in the same order. So, again some pseudo-code that gives the essential ingredients is given next. If you are unclear on the difference between a \Robject{matrix} and a \Robject{data.frame} then you should consult one of the very many introductory texts on using R as well as the manual pages and documents that come with R. \begin{verbatim} if( !all.equal( row.names(myCovs), dimnames(myExprsmat)[[2]] ) ) print("ERROR: samples names are wrong") \end{verbatim} Now, we are almost ready to create an instance of an \Robject{exprSet} object. The missing piece is a list of descriptions for the covariates that constitute the phenotypic data. Suppose that we have two variables, one the age of the patient in years and the other the particular translocation they are known to have. And suppose that in the \Robject{data.frame} \Robject{myCovs} the column names are \texttt{age} and \texttt{translocation}. Then we could make up the covariate descriptions as follows. \begin{verbatim} covDesc = list(age="age of the patient in years", translocation="known translocation") \end{verbatim} The purpose of this level of documentation is to help others who might want to re-analyze your data, or you, if at some time in the future you need to go back and see what was done. By carefully annotating the data, in a self-describing way, you increase the chances that you will understand what was done. Below is the simplest form for creating an \Robject{exprSet}. We note that typically you would fill in more details, like the name of the chip used and some additional data describing the experiment itself. \begin{verbatim} myphenoData = new("phenoData", pData=myCovs, varLabels=covDesc) myEset = new("exprSet", exprs=myExprsmat, phenoData=myphenoData) \end{verbatim} \section{Aggregate} When performing an interative computation such as cross--validation or bootstrapping it is often useful to be able to aggregate certain intermediate results. The \verb+Aggregate+ functions (and soon the Aggregate class) provide some simple tools for doing this. The strategy employed is to maintain the summary statistics in an environment. This is passed to the iterative function. It does not need to be returned since environments have a {\em pass--by--reference} semantic. Once the function has finished the environment can be queried for the summary statistics. One simple task that people often want to carry out is to determine in a cross--validation calculation which genes are selected the most often. In some sense these genes may form a more stable basis for inference. Achieving that using an Aggregator is very straight forward. At each iteration we will pass the names of the selected genes to the Aggregator. It has two functions, one for initializing and one for updating (or aggregating). The aggregator also has an environment. This environment stores the data that is being aggregated. For our cross--validation example the process goes as follows: \begin{enumerate} \item At each iteration Aggregate is called with the list of genes selected. \item For each gene in that list we check to see if it was selected before. \begin{enumerate} \item If not then \verb+initfun+ is called with that gene name. The value returned by \verb+initfun+ is then associated with the gene name in the aggregation environment. \item If so, then the current value is obtained and \verb+agfun+ is called with the the gene name and the current value. This returns a new value that is then associated with the gene name in the aggregation environment. \end{enumerate} \end{enumerate} Basically we are using this as a form of updating hash table. At the same time we are slightly subverting R's usual pass--by--value semantics. %\section*{Resampling exprSet's} %need to fill this in again \section{Session Information} The version number of R and packages loaded for generating the vignette were: <>= toLatex(sessionInfo()) @ \end{document}