% -*- mode: noweb;noweb-font-lock-mode: R-mode; -*- % % NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is % likely to be overwritten. % % \VignetteIndexEntry{timecourse manual} % \VignetteDepends{timecourse} % \VignetteKeywords{timecourse} % \VignettePackage{timecourse} \documentclass[11pt]{article} \usepackage{amsmath,fullpage} \usepackage[authoryear,round]{natbib} \usepackage{hyperref} \SweaveOpts{echo=FALSE} \usepackage{a4wide} \parindent 0in \bibliographystyle{abbrvnat} \begin{document} \title{\bf The timecourse Package} \author{Yu Chuan Tai} \maketitle \begin{center} Institute for Human Genetics, University of California, San Francisco \\ {\tt taiy@humgen.ucsf.edu} \end{center} \tableofcontents %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Overview} The {\tt timecourse} package aims to serve as a comprehensive library for the analysis of developmental microarray time course data. The current version includes functions for identifying genes of interest from longitudinal replicated developmental microarray time course experiments with one or more biological conditions. The main functions are {\bf mb.long()} and {\bf mb.MANOVA()}. They calculate the $MB$-statistics and/or $\widetilde{T}^2$ statistics derived in \cite{TaiSpeed2006} and \cite{Taithesis2005}, using multivariate empirical Bayes approaches. The input object for {\bf mb.long()} and {\bf mb.MANOVA()} can be a {\tt matrix}, {\tt MAList}, {\tt marrayNorm}, or {\tt ExpressionSet}, containing properly preprocessed $log_2$ expression values or ratios. \par The multivariate empirical Bayes models proposed in \cite{TaiSpeed2006} and \cite{Taithesis2005} have the advantage over the traditional $F$-statistic in that they incorporate replicate variances, the correlations among gene expression time point samples from longitudinal data, and moderation borrowing the information across genes into the analysis to reduce the numbers of false positives and false negatives induced by those poorly estimated variance-covariance matrices. Please see \cite{TaiSpeed2006}, \cite{Tai2005}, and \cite{Taithesis2005} for more details. The results from {\bf mb.long()} and {\bf mb.MANOVA()} are stored in a list object of class {\tt MArrayTC}, which contains various useful information and is used as the input object for the plotting function {\bf plotProfile()}. {\tt MArrayTC} is a subclass of the virtual class {\tt LargeDataObject} defined in package {\tt limma} \citep{Smyth2004, Smyth2005} and inherits a show method there. For more information, type <>= library(timecourse) # load timecourse library @ \begin{Schunk} \begin{Sinput} >help("MArrayTC-class") # look at the help files >help(mb.long) >help(mb.MANOVA) \end{Sinput} \end{Schunk} \section{Longitudinal one-sample problem} \subsection{Data} The dataset used in the example was generated by \cite{Tomancak2002}. Samples from {\it Drosophila} embryos were collected and hybridized onto Affymetrix chips at hours 1 to 12 post egg laying. The same procedure was conducted on 3 different days, yielding 3 replicates and 12 time points. Since mRNA samples at different times were extracted from different embryos, it is not a {\it true} longitudinal design. However, since parallel and identically treated embryos within each experiment were very similar, sampling these embryos within each experiment at different time points may approximate a longitudinal study. Therefore, we treat this dataset as longitudinal and proceed with the one-sample statistic calculated from {\bf mb.long()}. The dataset has been preprocessed using RMA algorithm \citep{Irizarry2003, Bolstad2003} and $log_2$ expression values were extracted from the RMA output. To reduce the computational time, we only include the first 2000 probesets in our demonstration. \subsection{Genes of interest} Suppose we are interested in genes which change over time, the following codes give it a quick start: 1. Assign time and replicate group to each array: The default of {\bf mb.long()} assumes the columns (arrays) of the expression matrix from the input object are arranged in the order of biological conditions, replicates and times. For the one-sample problem, we only have 1 biological condition, so the function assumes the columns are arranged in the order of replicates and times. Hence, if the 3 replicates are named as {\tt A}, {\tt B}, and {\tt C}, the following column arrangement of the matrix {\tt fruitfly} will be the same as the default. <>= data(fruitfly) dim(fruitfly) colnames(fruitfly) gnames <- rownames(fruitfly) @ The number of time points and sample sizes are always required as inputs. If the arrays are not in the default order, one needs to assign the time and replicate groups for each array. In our example, they are in the default order, so we make the assignments just for demo purpose. <>= assay <- rep(c("A", "B", "C"), each=12) time.grp <- rep(c(1:12), 3) size <- rep(3, 2000) @ 2. Calculate the $MB$-statistics and $\widetilde{T}^2$ statistics, and plot individual genes of interest. <>= out1 <- mb.long(fruitfly, times=12, reps=size) out2 <- mb.long(fruitfly, times=12, reps=size, rep.grp=assay, time.grp=time.grp) @ \section{Longitudinal two-sample problem} \subsection{Data} Now we simulate a dataset with two biological conditions. There are 1000 genes in total, and 20 of them have the same expected time course profiles between these two biological conditions. Each gene has 3 time course replicates and 5 time points for each condition. Note that this simulation is for demonstration purpose only, and does not necessarily reflect the real features of longitudinal time course data. <>= SS <- matrix(c( 1e-02, -8e-04, -0.003, 7e-03, 2e-03, -8e-04, 2e-02, 0.002, -4e-04, -1e-03, -3e-03, 2e-03, 0.030, -5e-03, -9e-03, 7e-03, -4e-04, -0.005, 2e-02, 8e-04, 2e-03, -1e-03, -0.009, 8e-04, 7e-02), ncol=5) sim.Sigma <- function() { S <- matrix(rep(0,25),ncol=5) x <- mvrnorm(n=10, mu=rep(0,5), Sigma=10*SS) for(i in 1:10) S <- S+crossprod(t(x[i,])) solve(S) } sim.data2 <- function(x, indx=1) { mu <- rep(runif(1,8,x[1]),5) if(indx==1) res <- c(as.numeric(t(mvrnorm(n=3, mu=mu+rnorm(5,sd=5), Sigma=sim.Sigma()))), as.numeric(t(mvrnorm(n=3, mu=mu+rnorm(5,sd=3.2), Sigma=sim.Sigma())))) if(indx==0) res <- as.numeric(t(mvrnorm(n=6, mu=mu+rnorm(5,sd=3), Sigma=sim.Sigma()))) res } M2 <- matrix(rep(14,1000*30), ncol=30) M2[1:20,] <- t(apply(M2[1:20,],1,sim.data2)) M2[21:1000,] <- t(apply(M2[21:1000,],1,sim.data2, 0)) @ \subsection{Genes of interest: paired two-sample problem} Assume it is a paired two-sample problem, we can do the following. Note that, R is case-sensitive and since {\tt mt} $<$ {\tt wt }, the first column of the input argument {\tt reps} should be the sample sizes of all genes for {\tt mt}, while the second column is those for {\tt wt}. The same rule applies for the input argument {\tt size} for {\bf mb.MANOVA()}. <>= trt <- rep(c("wt","mt"),each=15) assay <- rep(rep(c("rep1","rep2","rep3"),each=5),2) size <- matrix(3, nrow=1000, ncol=2) MB.paired <- mb.long(M2, method="paired", times=5, reps=size, condition.grp=trt, rep.grp=assay) genenames <- as.character(1:1000) @ \subsection{Genes of interest: independent two-sample problem} Now assume it is an independent two-sample problem <>= MB.2D <- mb.long(M2, method="2", times=5, reps=size, condition.grp=trt, rep.grp=assay) @ \section{Longitudinal multi-sample problem} \subsection{Data} Now let's simulate a dataset with three biological conditions 500 genes in total, 10 of them have different expected time course profiles across biological conditions: the first condition has 3 replicates, while the second condition has 4 replicates, and the third condition has 2 replicates. 5 time points for each condition. <>= sim.data <- function(x, indx=1) { mu <- rep(runif(1,8,x[1]),5) if(indx==1) res <- c(as.numeric(t(mvrnorm(n=3, mu=mu+rnorm(5,sd=5), Sigma=sim.Sigma()))), as.numeric(t(mvrnorm(n=4, mu=mu+rnorm(5,sd=3.2), Sigma=sim.Sigma()))), as.numeric(t(mvrnorm(n=2, mu=mu+rnorm(5,sd=2), Sigma=sim.Sigma())))) if(indx==0) res <- as.numeric(t(mvrnorm(n=9, mu=mu+rnorm(5,sd=3), Sigma=sim.Sigma()))) res } M <- matrix(rep(14,500*45), ncol=45) M[1:10,] <- t(apply(M[1:10,],1,sim.data)) M[11:500,] <- t(apply(M[11:500,],1,sim.data, 0)) @ \subsection{Genes of interest} The following codes show an example of identifying genes with different temporal profiles across biological conditions: <>= assay <- rep(c("1.2.04","2.4.04","3.5.04","5.21.04","7.17.04","9.10.04","12.1.04","1.2.05","4.1.05"),each=5) trt <- c(rep(c("wildtype","mutant1"),each=15),rep("mutant1",5), rep("mutant2", 10)) # Caution: since "mutant1" < "mutant2" < "wildtype", the sample sizes should be in the order of 4,2,3, # but NOT 3,4,2. size <- matrix(c(4,2,3), byrow=TRUE, nrow=500, ncol=3) MB.multi <- mb.MANOVA(M, times=5, D=3, size=size, rep.grp=assay, condition.grp=trt) @ \section{The amount of moderation} It is always a good idea to check the amount of moderation, which is determined by the diagonal elements of the (transformed) gene-specific (pooled) variance-covariance matrices. The more homogenous these diagonal elements are across genes, the larger the amount of moderation will occur. Try different values of the input argument {\bf prior.df} to see what happens. \section{Which Statistic?} In the one- and two- sample longitudinal problems, when the sample size(s) are {\it different} across genes, the $MB$-statistics are the ones you should use for ranking. Otherwise, the $MB$-statistics and $\widetilde{T}^2$ produce the same ranking. \section{Missing data} For version 1.0.0, missing time point samples are not allowed in {\bf mb.long()} and {\bf mb.MANOVA()}, while missing replicates are allowed for a subset of genes. In the latter case, the user still need to input a complete data matrix with missing replicates indicated by NAs. <>= fruitfly[6, 13:24] <- NA # The 6th gene has 1 missing replicate size <- rep(3, 2000) size[6] <- 2 MB.missing <- mb.long(fruitfly, times=12, reps=size, HotellingT2.only=FALSE) @ \section{Outliers} When type$=$robust, the numerator of the $\widetilde{T}^2$ statistic is calculated using the weighted average time course vector(s), where the weight for each data point is determined by Huber's weight function \citep{Huber1981, Fox2002} with the default tuning constant 1.345. \par When there are only 2 replicates within conditions, type$=$robust produces the same rankings as type$=$none since there is no concensus on gene expression values. We recommend that at least 3 replicates should be performed when designing an experiment so that outliers can be determined by concensus. Check the output weights for these outliers. %%\SweaveOpts{echo=true} \section{Acknowledgements} The {\tt timecourse} package has greatly benefited from many critical comments by Terry Speed. The {\it Drosophila} data were provided by Pavel Tomancak and his colleagues. Thanks are also due to Ben Bolstad for his advices on building up this package, and Karen Vranizan, Yun Zhou, Jun Li and their Pritzker colleagues, and Soyeon Ahn for using the test version of this package and their helpful feedbacks. \bibliography{timecourse} \section{Temporal profile plots} The temporal profile(s) for any single gene can be plotted using the {\tt plotProfile()} with the resulting {\tt MArrayTC} object as the input, see below for examples. The user can either specify the rank or the gene ID of the gene to be plotted. \begin{figure}[htbp] \begin{center} <>= ## plots the no. 1 gene plotProfile(out2, type="b", gnames=gnames, legloc=c(2,15), pch=c("A","B","C"), xlab="Hour") @ \end{center} \caption{The no. 1 gene of {\it fruitfly}.} \end{figure} \begin{figure}[htbp] \begin{center} <>= ## plots the no. 100 gene plotProfile(out2, type="b", gnames=gnames, pch=c("A","B","C"), xlab="Hour", ranking=100) @ \end{center} \caption{The no. 100 gene of {\it fruitfly}} \end{figure} \begin{figure}[htbp] \begin{center} <>= ## plots the gene 141404_at plotProfile(out2, type="b", gnames=gnames, pch=c("A","B","C"), xlab="Hour", gid="141404_at") @ \end{center} \caption{Gene 141404\_at in {\it fruitfly}} \end{figure} \begin{figure}[htbp] \begin{center} <>= plotProfile(MB.paired,type="b", gnames=genenames) @ \end{center} \caption{The no. 1 gene for paired two-sample problem.} \end{figure} \begin{figure}[htbp] \begin{center} <>= plotProfile(MB.2D,type="b", gnames=genenames) @ \end{center} \caption{The no. 1 gene for independent two-sample problem.} \end{figure} \begin{figure}[htbp] \begin{center} <>= plotProfile(MB.multi, stats="MB", type="b") @ \end{center} \caption{The no. 1 gene for multi-sample problem.} \end{figure} \begin{figure}[h] \begin{center} <>= plotProfile(MB.missing, stats="MB", type="b", gid="141205_at", gnames=gnames,pch=c("A","B","C")) @ \end{center} \caption{The gene 141205\_at in fruitfly: one replicate is missing.} \end{figure} \end{document}