% \VignetteDepends{tools} % \VignetteIndexEntry{bridge Tutorial} % \VignetteKeywords{Expression Analysis} % \VignettePackage{bridge} \documentclass[11pt]{article} \usepackage{amsmath,epsfig,fullpage} \usepackage{graphicx} \usepackage{hyperref} \usepackage{chicago} \parindent 0in \begin{document} \title{\bf Bayesian Robust Inference of Differential Gene Expression\\ The bridge package} \author{Raphael Gottardo$^*$} \maketitle \begin{center} $^*$Department Statistics, University of Washington\\ \url{http://www.rglab.org}\\ {\tt raph@stat.washington.edu} \end{center} \tableofcontents %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Overview} The {\tt bridge} package consists of several functions for testing for differential expression among multiple samples. As for now, the package can only be used with two and three samples. The extension to more than three samples should be straight forward. However, it is good to note that as the number of sample increases, the complexity of the model increases exponentially. In pratice, it would be hard to deal with a large number of samples. \\ We use a robust Bayesian hierarchical model for testing for differential expression with microarrays. The model is a generalization of a model used by \shortciteN{Gottardo:2006p1950} to estimate the true mean intensities from replicates. Outliers are modeled explicitly using a $t$-distribution. The model is flexible with an exchangeable prior for the variances which allow different variances for the genes but still shrink extreme variances. Our model can be used for testing for differentially expressed genes among multiple samples. Parameter estimation is carried out using Markov Chain Monte Carlo. The robust estimation is achieved by hierarchical Bayesian modeling \shortcite{Gottardo:2006p12}. Realizations are generated from the posterior distribution via Markov chain Monte Carlo (MCMC) algorithms \cite{bib:gelfand90}. Where the full conditionals are of simple form, Gibbs updates are used; where the full conditionals were not of standard form, slice sampling is used. We use the ``stepping out'' procedure for the slice sampling as introduced in \citeN{bib:neal03}. \\ Convergence of the MCMC can be assessed using the {\tt coda} library available from CRAN. Our experience with the software is that 50,000 iterations are usually enough to estimate quantities of interest such as posterior means, credible intervals, \textit{etc}. Because the number of genes $I$ can be quite large, the computing time can be long. We recommend running the functions provided in the package in batch mode. This can be done via {\tt R CMD BATCH}. An example is presented in the next section. \\ {\bf Help files.} As with any R package, detailed information on functions, their arguments and value, can be obtained in the help files. For instance, to view the help file for the function {\tt bridge.2samples} in a browser, use {\tt ?bridge.2samples}. \section{Installation} \subsection{Windows} You can install the package using the menu \textit{Packages $\rightarrow$ Install Packages from local zip ...}. Then browse you local directory to find the package. Make sure you have root privilegies if you want to install the package in the default directory. \subsection{Unix/Linux} Under Unix, go into your favorite shell window and type {\tt R CMD INSTALL bridge\_xx.tar.gz} (the package has to be in the directory you are executing the command from). By default R will install the package in {\tt /usr/local/R/lib/R ...} and you will need to have root privilegies in order to do so. You can also install the package locally by using the option -l. If you want to install the package in /home/user/Rlib/ use the command {\tt R CMD INSTALL bridge\_xx.tar.gz -l /home/user/Rlib}. If you install the package in a directory different from the R default directory you will need to specify the full path when you load the package, i.e. <>= library(bridge) @ \section{Data Import} \subsection{bridge.2samples} You can read your data into R, using the {\tt read.table} command. It is hard to give a general guideline as different data will have different format. In order to use the package you will need to create two data sets, one for each sample (control and treatment). The data should be on the raw scale, i.e. not transformed. The two datasets should be arranged such that rows correspond to genes and columns to replicates. Table \ref{tab:data_format} give an example of a data set in the right format. \begin{table}[htbp] \caption{Format of the control and treatment datasets used in the rama package. The two samples must be seperated into two datasets. } \centering \begin{tabular}{ccccccccc}\hline \hline sample & 1 & 1 & 1 & 1 & 2 & 2 & 2 & 2 \\ replicate & 1 & 2 & 3 & 4 & 1 & 2 & 3 & 4 \\ \hline & \multicolumn{4}{c}{Dataset 1} & \multicolumn{4}{c}{Dataset 2} \\ \cline{2-5} \cline{6-9} gene 1 & \dots & \dots & \dots & \dots & \dots & \dots & \dots & \dots \\ gene 2 & \dots & \dots & \dots & \dots & \dots & \dots & \dots & \dots \\ \vdots & \dots & \dots & \dots & \dots & \dots & \dots & \dots & \dots \\ gene n & \dots & \dots & \dots & \dots & \dots & \dots & \dots & \dots \\ \hline \end{tabular} \label{tab:data_format} \end{table} \subsection{bridge.3samples} In this case, one would need to have three datasets, one for each sample. Again the data should be on the raw scale, i.e. not transformed! \section{Testing for differential expression between two samples} Testing for differential between two samples is done via the function {\tt bridge.2samples}. We demonstrate its functionality using gene expression data from an HIV study of \shortciteN{bib:wout03}. To load the HIV dataset, use {\tt data(hiv)}, and to view a description of the experiments and data, type {\tt ?hiv}. We first load the hiv dataset. <<>>= data(hiv) @ This data set consists of 4 experiments using the same RNA preparation on 4 different slides. The expression levels of about 8000 genes were assessed in CD4-T-cell lines at time $t=24$ hour after infection with HIV virus type 1. The first 4 columns correspond to the first treatment state (hiv infected). The second four represent the control state. The experiment is a balanced dye swap experiment. Finally, the last two columns contain the row and column positions of each gene on the array (slide). \\ Our goal is to test for differentially expressed genes between the two samples. To demonstrate the function {\tt bridge.2samples} we will only use a subset of 640 genes. \\ We model $y^*=\log_2(y)$ where $y$ are the raw intensities. Testing of differential expression is done using the function {\tt bridge.2samples}. This function output an object of type {\tt bridge2} containing the sampled values from the posterior distribution. In our case the inference will be based on the posterior probability of differential expression, {\tt post.p}. <<>>= bridge.hiv<-bridge.2samples(hiv[1:640,c(1:4)],hiv[1:640,c(5:8)],B=2000,min.iter=0,batch=1,mcmc.obj=NULL,affy=FALSE,verbose=FALSE) @ Once you have fitted the model you may view or plot the sampled parameters from the posterior distribution. We can also obtain point estimates of the paramaters of interest such as the gene effects <<>>= gamma1<-mat.mean(bridge.hiv$gamma1)[,1] gamma2<-mat.mean(bridge.hiv$gamma2)[,1] @ and/or plot the resulting posterior probabilities as a function of the posterior log ratios of $\gamma_1-\gamma_2$. \begin{center} <>= plot(gamma1-gamma2,bridge.hiv$post.p,col=1,pch=1,main="Posterior probability",xlab="log ratio",ylab="posterior probability") @ \end{center} Robustness is achieved by using a $t$-distribution for the errors with a different degree of freedoms for each replicate array. This is formally equivalent to giving each replicate a different weight in the estimation process for each gene. A low number of degrees of freedom for one array will indicate that the corresponding array contains numerous outliers. One could use the function {\tt hist} to look at the posterior mode of the degrees of freedom for each array. \begin{center} <>= hist(bridge.hiv$nu1[3,],main="Posterior degree of freedoms, array 3",xlab="nu",50) @ \end{center} Fitting a $t$-distribution can be seen as a weighted least square problem with a special hierarchical structure for the variances. Similarly to a weighted least squares function, our function computes weights associated with each replicate. The weights can be useful as a tool for diagnosing the quality of the replicate measurements. If one knows the exact positions of the genes on the array, it is possible to look at the spatial variation of the weights. This can be done via the function {\tt weight.plot} of the {\tt rama} package. We refer the reader to the rama package vignette for further details. \section{Testing for differential expression among three samples} Because (most) microarray technologies allow the direct comparison of at most two samples, the model used in {\tt bridge.2samples} needs to be slightly modified as well to accommodate for changes in the experimental design. In {\tt bridge.2samples}, the log measurements were modeled as a bivariate $t$-distribution for the log measurements. Depending on the technology used, with three samples, the data will either be ratios (cDNA) or raw measurements (Affymetrix) and therefore a natural model is a univariate $t$-distribution. The estimation of the parameters is done similarly to the function {\tt bridge.2samples}. The function output an object of type {\tt bridge3} containing the sampled values from the posterior distribution. In our case the inference will be based on the posterior probability of differential expression. <<>>= sample1<-matrix(exp(rnorm(150)),50,3) sample2<-matrix(exp(rnorm(200)),50,4) sample3<-matrix(exp(rnorm(150)),50,3) mcmc.bridge3<-bridge.3samples(sample1,sample2,sample3,B=10,min.iter=0,batch=1,mcmc.obj=NULL,all.out=TRUE,verbose=FALSE) @ \section{Computing time and batch mode} The computing time can be quite long depending on the number of genes and replicates. It takes about 5 hours to run the MCMC on the full HIV data set using a Linux machine with AMD Athlon at 2GHz. However we feel that this additional computational cost is worth the effort and can lead to good results. The code can be run in batch mode without any user intervention; this optimizes computing resources and allows the user to perform other tasks at the same time. R code can be run in batch mode by typing {\tt R CMD BATCH myfile.R} at the shell prompt. The file {\tt myfile.R} contains the command to execute. An example file could be the following, %<>= %library(bridge) %data(hiv) %bridge.hiv<-bridge.2samples(hiv[1:640,c(1:4)],hiv[1:640,c(5:8)],B=2000,min.iter=0,batch=1,mcmc.obj=NULL) %save(bridge.hiv,file='bridge.hiv.R') %@ Then, when the execution is done one can load the resulting output into R using the {\tt load} command. For further details about the {\tt BATCH} command type {\tt ?BATCH} in R.\\ \section{Acknowledgment} This research was supported by NIH Grant 8 R01 EB002137-02. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \bibliographystyle{chicago} \bibliography{robust-testing} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \end{document}