%\VignetteIndexEntry{TCC} %\VignettePackage{TCC} \documentclass{article} \usepackage[a4paper]{geometry} \usepackage{color} \usepackage{Sweave} \SweaveOpts{pdf=TRUE} \definecolor{TccBlue}{cmyk}{0.74,0.26,0,0.14} \definecolor{TccRed}{cmyk}{0,1,0.78,0.1} \definecolor{TccGreen}{cmyk}{0.99,0,0.29,0.47} \definecolor{TccOrange}{cmyk}{0,0.27,1,0.03} \renewcommand{\floatpagefraction}{0.9} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rfunction}[1]{{\Robject{#1}}} \newcommand{\Rpackage}[1]{\textbf{\texttt{#1}}} \newcommand{\Rclass}[1]{{\texttt{#1}}} \DefineVerbatimEnvironment{Sinput}{Verbatim}{fontshape=n,formatcom=\color{TccRed}} \DefineVerbatimEnvironment{Soutput}{Verbatim}{fontshape=n,formatcom=\color{TccBlue}} \author {\small Jianqiang Sun$^{1\S}$, Tomoaki Nishiyama$^{2\S}$, Kentaro Shimizu$^1$, and Koji Kadota$^1$\\ \texttt{\footnotesize $^1$ The University of Tokyo, Tokyo, Japan}\\ \texttt{\footnotesize $^2$ Kanazawa University, Kanazawa, Japan}\\ \texttt{\footnotesize $^\S$ Maintainer: Jianqiang Sun (wukong@bi.a.u-tokyo.ac.jp),}\\ \texttt{\footnotesize Tomoaki Nishiyama (tomoakin@staff.kanazawa-u.ac.jp)} } \title{TCC: Differential expression analysis for tag count data with robust normalization strategies} \begin{document} \maketitle\thispagestyle{empty} \begin{abstract} The R/Bioconductor package, \Rpackage{TCC}, provides users with a robust and accurate framework to perform differential expression (DE) analysis of tag count data. We recently developed a multi-step normalization method (TbT; Kadota et al., 2012 \cite{kadota}) for two-group RNA-seq data. The strategy (called DEGES) is to remove data that are potential differentially expressed genes (DEGs) before performing the data normalization. DEGES in \Rpackage{TCC} is essential for accurate normalization of tag count data, especially when the up- and down-regulated DEGs in one of the groups are extremely biased in their number. A major characteristic of \Rpackage{TCC} is to provide the DEGES-based normalization methods for several kinds of count data (two-group with or without replicates, multi-group, and so on) by virtue of the use of combinations of functions in other sophisticated packages (especially \Rpackage{edgeR}, \Rpackage{DESeq}, and \Rpackage{baySeq}). The appropriate combination provided by \Rpackage{TCC} allows a more robust and accurate estimation to be performed more easily than directly using original packages and \Rpackage{TCC} provides a simple unified interface to perform the robust normalization. \end{abstract} \newpage \tableofcontents \newpage \section{Introduction} Differential expression analysis based on tag count data has become a fundamental task for identifying differentially expressed genes or transcripts (DEGs). The \Rpackage{TCC} package (Tag Count Comparison; Sun et al., 2013 \cite{sun}) provides users with a robust and accurate framework to perform differential expression analysis of tag count data. \Rpackage{TCC} provides integrated analysis pipelines with improved data normalization steps, compared with other packages such as \Rpackage{edgeR}, \Rpackage{DESeq}, and \Rpackage{baySeq}, by appropriately combining their functionalities. The package incorporates multi-step normalization methods whose strategy is to remove data that are potential DEGs before performing the data normalization. Kadota et al. 
(2012) \cite{kadota} recently reported that the normalization methods implemented in R packages (such as \Rpackage{edgeR} (Robinson et al., 2010 \cite{robinson}), \Rpackage{DESeq} (Anders and Huber, 2010 \cite{anders}), and \Rpackage{baySeq} (Hardcastle and Kelly, 2010 \cite{hardcastle})) for differential expression (DE) analysis between samples are inadequate when the up- and down-regulated DEGs in one of the samples are extremely biased in their number (i.e., biased DE). This is because the current methods implicitly assume a balanced DE, wherein the numbers of highly and lowly expressed DE entities in samples are (nearly) equal. As a result, methods assuming unbiased DE will not work well on data with biased DE. Although a major purpose of data normalization is to detect such DE entities, their very existence consequently interferes with their opportunity to be top-ranked. Conventional two-step procedures for identifying DEGs from tag count data (i.e., data normalization followed by identification of DEGs) cannot in principle eliminate the potential DE entities before data normalization. To normalize data that potentially include various scenarios (both unbiased and biased DE), we recently proposed a multi-step normalization strategy (called TbT, an acronym for the TMM-\Rpackage{baySeq}-TMM pipeline; Kadota et al., 2012 \cite{kadota}), in which the TMM normalization method (Robinson and Oshlack, 2010 \cite{robinson2}) is used in steps 1 and 3 and an empirical Bayesian method implemented in the \Rpackage{baySeq} package (Hardcastle and Kelly, 2010 \cite{hardcastle}) is used in step 2. Although this multi-step DEG elimination strategy (called "DEGES" for short) can successfully remove potential DE entities identified in step 2 prior to the estimation of the normalization factors using the TMM normalization method in step 3, the \Rpackage{baySeq} package used in step 2 of the TbT method is much more computationally intensive than competing packages like \Rpackage{edgeR} and \Rpackage{DESeq}. While the three-step TbT normalization method performed best on simulated and real tag count data, it is practically possible to make different choices for the methods in each step. A more comprehensive study regarding better choices for DEGES is needed.

This package provides tools to perform multi-step normalization methods based on DEGES and enables differential expression analysis of tag count data without having to worry much about biased distributions of DEGs. The DEGES-based normalization functions implemented in \Rpackage{TCC} include the TbT method for two-group data with or without replicates, much faster alternatives, and methods for multi-group comparison. \Rpackage{TCC} provides a simple unified interface to perform data normalization with combinations of functions provided by \Rpackage{baySeq}, \Rpackage{DESeq}, and \Rpackage{edgeR}. Functions to produce simulation data under various conditions and to plot the data are also provided.

\subsection{Installation} \label{section-1-1}
This package is available from the Bioconductor website (http://bioconductor.org/). To install the package, enter the following command after starting R:
\begin{Schunk}
\begin{Sinput}
> source("http://bioconductor.org/biocLite.R")
> biocLite("TCC")
\end{Sinput}
\end{Schunk}

\subsection{Citations} \label{section-1-2}
This package internally uses many of the functions implemented in the other packages.
This is because our normalization procedures consist, in part, of combinations of existing normalization methods and differential expression (DE) methods. For example, the TbT normalization method (Kadota et al., 2012 \cite{kadota}), which is a particular functionality of the \Rpackage{TCC} package (Sun et al., 2013 \cite{sun}), consists of the TMM normalization method (Robinson and Oshlack, 2010 \cite{robinson2}) implemented in the \Rpackage{edgeR} package (Robinson et al., 2010 \cite{robinson}) and the empirical Bayesian method implemented in the \Rpackage{baySeq} package (Hardcastle and Kelly, 2010 \cite{hardcastle}). Therefore, please cite the appropriate references when you publish your results.
\begin{Schunk}
\begin{Sinput}
> citation("TCC")
\end{Sinput}
\end{Schunk}

\subsection{Quick start} \label{section-1-3}
Let us begin by showing two examples (Cases 1 and 2) of identifying DEGs between two groups from tag count data consisting of $1,000$ genes and a total of six samples (each group has three biological replicates). The hypothetical count data (termed "\Robject{hypoData}") is stored in this package (for details, see section \ref{section-2-1}). We then describe the DE analysis of count data without replicates (i.e., two samples), using the data of the first and the fourth columns of \Robject{hypoData} (Case 3). We recommend the use of the commands in Cases 2 and 3.

Case 1: DE analysis of two-group count data with replicates by using the exact test (Robinson and Smyth, 2008 \cite{robinson3}) in \Rpackage{edgeR} coupled with TbT normalization (termed the TbT-\Rpackage{edgeR} combination). The \Rpackage{TCC} package was originally designed with the TbT normalization method, and the original study (Kadota et al., 2012 \cite{kadota}) recommended this analysis pipeline. Note that a smaller sampling size (i.e., \Robject{samplesize = 100}) is used here to reduce the computation time, but a larger sampling size of around $10,000$ (i.e., \Robject{samplesize = 10000}) is recommended (Hardcastle and Kelly, 2010 \cite{hardcastle}). Suggested citations are as follows: \Rpackage{TCC} (Sun et al., 2013 \cite{sun}), TbT (Kadota et al., 2012 \cite{kadota}), TMM (Robinson and Oshlack, 2010 \cite{robinson2}), \Rpackage{baySeq} (Hardcastle and Kelly, 2010 \cite{hardcastle}), and \Rpackage{edgeR} (Robinson et al., 2010 \cite{robinson}). For details, see section \ref{section-3-1-1}.
<>=
library(TCC)
data(hypoData)
samplesize <- 100
group <- c(1, 1, 1, 2, 2, 2)
tcc <- new("TCC", hypoData, group)
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "bayseq",
                       iteration = 1, samplesize = samplesize)
tcc <- estimateDE(tcc, test.method = "edger", FDR = 0.1)
result <- getResult(tcc, sort = TRUE)
head(result)
@

Case 2: DE analysis for two-group count data with replicates by using the exact test coupled with iterative DEGES/\Rpackage{edgeR} normalization (i.e., the iDEGES/\Rpackage{edgeR}-\Rpackage{edgeR} combination). This is an alternative pipeline designed to reduce the runtime (approx. $20$ sec.), yet its performance is comparable to that of the above pipeline. Accordingly, we recommend using this pipeline as a default when analyzing tag count data with replicates. A notable advantage of this pipeline is that the multi-step normalization strategy only needs the methods implemented in the \Rpackage{edgeR} package.
The suggested citations are as follows: \Rpackage{TCC} (Sun et al., 2013 \cite{sun}), TMM (Robinson and Oshlack, 2010 \cite{robinson2}), the exact test (Robinson and Smyth, 2008 \cite{robinson3}), and \Rpackage{edgeR} (Robinson et al., 2010 \cite{robinson}). For details, see section \ref{section-3-1-3}. <>= library(TCC) data(hypoData) group <- c(1, 1, 1, 2, 2, 2) tcc <- new("TCC", hypoData, group) tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger", iteration = 3, FDR = 0.1, floorPDEG = 0.05) tcc <- estimateDE(tcc, test.method = "edger", FDR = 0.1) result <- getResult(tcc, sort = TRUE) head(result) @ Case 3: DE analysis for two-group count data without replicates by using the negative binomial (NB) test in \Rpackage{DESeq} coupled with iDEGES/\Rpackage{DESeq} normalization (i.e., the iDEGES/\Rpackage{DESeq}-\Rpackage{DESeq} combination). A procedure using the data of the first and fourth columns of \Robject{hypoData} is shown here. Similar to Case 2, this pipeline entirely consists of methods implemented in the \Rpackage{DESeq} package. Suggested citations are as follows: \Rpackage{TCC} (Sun et al., 2013 \cite{sun}) and \Rpackage{DESeq} (Anders and Huber, 2010 \cite{anders}). For details, see section \ref{section-3-2}. <>= library(TCC) data(hypoData) group <- c(1, 2) tcc <- new("TCC", hypoData[,c(1,4)], group) tcc <- calcNormFactors(tcc, norm.method = "deseq", test.method = "deseq", iteration = 3, FDR = 0.1, floorPDEG = 0.05) tcc <- estimateDE(tcc, test.method = "deseq", FDR = 0.1) result <- getResult(tcc, sort = TRUE) head(result) @ \newpage \section{Preparations} \subsection{Reading the count data} \label{section-2-1} Similar to the other packages, \Rpackage{TCC} typically starts the DE analysis with a count table matrix where each row indicates a gene (or transcript), each column indicates a sample (or library), and each cell indicates the number of counts for a gene in a sample. Here, we assume a hypothetical count matrix consisting of 1,000 rows (or genes) and a total of six columns (the first three columns are produced from biological replicates of Group 1 and the remaining columns are from Group 2); i.e., \{G1\_rep1, G1\_rep2, G1\_rep3\} vs. \{G2\_rep1, G2\_rep2, G2\_rep3\}. We start by loading the hypothetical data (\Robject{hypoData}) from \Rpackage{TCC} and giving a numeric vector (\Robject{group}) indicating which group each sample belongs to. <>= library(TCC) data(hypoData) head(hypoData) dim(hypoData) group <- c(1, 1, 1, 2, 2, 2) @ If you want to analyze another count matrix consisting of nine columns (e.g., the first four columns are produced from biological replicates of G1, and the remaining five columns are from G2), the \Robject{group} vector should be indicated as follows. <>= group <- c(1, 1, 1, 1, 2, 2, 2, 2, 2) @ \subsection{Constructing TCC class object} \label{section-2-2} The \Rfunction{new} function has to be used to perform the main functionalities of \Rpackage{TCC}. This function constructs a \Rpackage{TCC} class object, and subsequent analyses are performed on this class object. The object is constructed from i) a count matrix (\Robject{hypoData}) and ii) the corresponding numeric vector (\Robject{group}) as follows. <>= library(TCC) data(hypoData) group <- c(1, 1, 1, 2, 2, 2) tcc <- new("TCC", hypoData, group) tcc @ The count matrix and group vector information can be retrieved from the stored class object by using \Robject{tcc\$count} and \Robject{tcc\$group}, respectively. 
<>=
head(tcc$count)
tcc$group
@
A subset of the \Rpackage{TCC} class object can be obtained with the \Rfunction{subset} or \Rfunction{"["} functions.
<>=
dim(tcc$count)
tcc.sub1 <- subset(tcc, c(rep(TRUE, 20), rep(FALSE, 980)))
dim(tcc.sub1$count)
tcc.sub2 <- tcc[1:20]
dim(tcc.sub2$count)
@

\subsection{Filtering low-count genes (optional)} \label{section-2-3}
The way to filter out genes with low-count tags across samples depends on the user's philosophy. Although we recommend removing tags with zero counts across samples as a minimum filtering, this step is optional. The \Rfunction{filterLowCountGenes} function performs this filtering.
<>=
library(TCC)
data(hypoData)
group <- c(1, 1, 1, 2, 2, 2)
tcc <- new("TCC", hypoData, group)
tcc <- filterLowCountGenes(tcc)
dim(tcc$count)
@
It can be seen that $\Sexpr{nrow(hypoData) - nrow(tcc$count)} (= \Sexpr{nrow(hypoData)} - \Sexpr{nrow(tcc$count)})$ genes were filtered out as non-expressed. The same procedure can be performed without the \Rfunction{filterLowCountGenes} function, in which case the filtering is performed before the \Rpackage{TCC} class object is constructed.
<>=
filter <- as.logical(rowSums(hypoData) > 0)
dim(hypoData[filter, ])
tcc <- new("TCC", hypoData[filter, ], group)
dim(tcc$count)
@

\newpage

\section{Normalization}
\subsection{Normalization of two-group count data with replicates} \label{section-3-1}
This package provides robust normalization methods based on DEGES proposed by Kadota et al. (2012) \cite{kadota}. When obtaining normalization factors from two-group data with replicates, users can select a total of six combinations (two normalization methods $\times$ three DEG identification methods) coupled with an arbitrary number of iterations ($n = 0, 1, 2, \dots, 100$) in our DEGES-based normalization pipeline. We show some of the practical combinations below. Since the three-step TbT normalization method was originally designed for normalizing tag count data with (biological) replicates, we will first explain the TbT method (\ref{section-3-1-1} DEGES/TbT). In relation to the other DEGES-based methods, we will call the method "DEGES/TbT" for convenience. As mentioned in the original study, DEGES/TbT needs a long computation time. Accordingly, we present three faster alternatives (\ref{section-3-1-2} DEGES/\Rpackage{edgeR}, \ref{section-3-1-3} iDEGES/\Rpackage{edgeR}, and \ref{section-3-1-4} DEGES/\Rpackage{DESeq}). Note that the purpose here is to obtain accurate normalization factors to be used with statistical models (e.g., the exact test or the empirical Bayesian method) for the DE analysis described in the next section (\ref{section-4} \textbf{Differential expression}).

\subsubsection{DEGES/TbT} \label{section-3-1-1}
The DEGES/TbT method (Kadota et al., 2012 \cite{kadota}) with default parameter settings can be performed as follows.
<>=
set.seed(1000)
library(TCC)
data(hypoData)
samplesize <- 100
group <- c(1, 1, 1, 2, 2, 2)
tcc <- new("TCC", hypoData, group)
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "bayseq",
                       iteration = 1, samplesize = samplesize)
@
Note that a smaller sampling size (i.e., \Robject{samplesize = 100}) is used here to reduce the computation time when performing the empirical Bayesian method in step 2, but a larger sampling size of around $10,000$ (i.e., \Robject{samplesize = 10000}) is recommended (Hardcastle and Kelly, 2010 \cite{hardcastle}). This method estimates an empirical distribution of the parameters of the NB distribution by bootstrapping from the input data.
While the sampling size can be made smaller to reduce the computation time (e.g., \Robject{samplesize = 40}), the resulting normalization factors will vary from trial to trial. In this vignette, we will call the \Rfunction{set.seed} function for obtaining reproducible results (i.e., the \Robject{tcc\$norm.factors} values) whenever a random function is involved. The calculated normalization factors and the computation time can be retrieved with the following commands.
<>=
tcc$norm.factors
tcc$DEGES$execution.time
@
Of course, the procedure can be performed by using functions in \Rpackage{edgeR} and \Rpackage{baySeq}, instead of using the \Rfunction{calcNormFactors} function in \Rpackage{TCC}. The \Rfunction{calcNormFactors} function together with the above parameter settings can be regarded as a wrapper function for the following commands.
<>=
set.seed(1000)
library(TCC)
data(hypoData)
samplesize <- 100
group <- c(1, 1, 1, 2, 2, 2)
### STEP 1 ###
d <- DGEList(count = hypoData, group = group)
d <- calcNormFactors(d)
norm.factors <- d$samples$norm.factors
norm.factors <- norm.factors / mean(norm.factors)
### STEP 2 ###
cD <- new("countData", data = hypoData, replicates = group,
          groups = list(NDE = rep(1, length = length(group)), DE = group),
          libsizes = colSums(hypoData) * norm.factors)
cD <- getPriors.NB(cD, samplesize = samplesize, estimation = "QL", cl = NULL)
cD <- getLikelihoods.NB(cD, pET = "BIC", cl = NULL)
is.DEG <- as.logical(rank(-cD@posteriors[, "DE"]) < (nrow(hypoData) * cD@estProps[2]))
### STEP 3 ###
d <- DGEList(count = hypoData[!is.DEG, ], group = group)
d <- calcNormFactors(d)
norm.factors <- d$samples$norm.factors * colSums(hypoData[!is.DEG, ]) / colSums(hypoData)
norm.factors <- norm.factors / mean(norm.factors)
norm.factors
@

\subsubsection{DEGES/edgeR} \label{section-3-1-2}
Now let us describe an alternative approach that is roughly $200$-$400$ times faster than DEGES/TbT, yet has comparable performance. The TMM-\Rpackage{edgeR}-TMM pipeline (called DEGES/\Rpackage{edgeR}) employs the exact test implemented in \Rpackage{edgeR} in step 2. To use this pipeline, we have to provide a reasonable threshold for defining potential DEGs in step 2. We define the threshold as an arbitrary false discovery rate (FDR) with a floor value of $P_{\mbox{\tiny DEG}}$. The default FDR is $< 0.1$, and the default floor $P_{\mbox{\tiny DEG}}$ is $5\%$, but different choices are of course possible. For example, with the default settings, $x\%$ ($x > 5\%$) of the top-ranked potential DEGs are eliminated in step 2 if the percentage ($= x\%$) of genes satisfying FDR $< 0.1$ is over $5\%$. The DEGES/\Rpackage{edgeR} pipeline has a clear advantage over DEGES/TbT in terms of computation time. It can be performed as follows:
<>=
library(TCC)
data(hypoData)
group <- c(1, 1, 1, 2, 2, 2)
tcc <- new("TCC", hypoData, group)
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger",
                       iteration = 1, FDR = 0.1, floorPDEG = 0.05)
tcc$norm.factors
tcc$DEGES$execution.time
@
The normalization factors calculated from DEGES/\Rpackage{edgeR} are very similar to those of DEGES/TbT with the default parameter settings (i.e., \Robject{samplesize = 10000}). For \Rpackage{edgeR} users, we provide commands, consisting of functions in \Rpackage{edgeR}, to perform the DEGES/\Rpackage{edgeR} pipeline without \Rpackage{TCC}. The \Rfunction{calcNormFactors} function together with the above parameter settings can be regarded as a wrapper function for the following commands.
<>=
library(TCC)
data(hypoData)
group <- c(1, 1, 1, 2, 2, 2)
FDR <- 0.1
floorPDEG <- 0.05
d <- DGEList(counts = hypoData, group = group)
### STEP 1 ###
d <- calcNormFactors(d)
### STEP 2 ###
d <- estimateCommonDisp(d)
d <- estimateTagwiseDisp(d)
result <- exactTest(d)
q.value <- p.adjust(result$table$PValue, method = "BH")
if (sum(q.value < FDR) > (floorPDEG * nrow(hypoData))) {
  is.DEG <- as.logical(q.value < FDR)
} else {
  is.DEG <- as.logical(rank(result$table$PValue, ties.method = "min") <=
                       nrow(hypoData) * floorPDEG)
}
### STEP 3 ###
d <- DGEList(counts = hypoData[!is.DEG, ], group = group)
d <- calcNormFactors(d)
norm.factors <- d$samples$norm.factors * colSums(hypoData[!is.DEG, ]) / colSums(hypoData)
norm.factors <- norm.factors / mean(norm.factors)
norm.factors
@

\subsubsection{iDEGES/edgeR} \label{section-3-1-3}
Our multi-step normalization can be repeated until the calculated normalization factors converge (Kadota et al., 2012 \cite{kadota}). An iterative version of DEGES/TbT (i.e., iDEGES/TbT) can be described as the TMM-(\Rpackage{baySeq}-TMM)$_{n}$ pipeline with $n \ge 2$. Although iDEGES/TbT would not be practical in terms of the computation time, the TMM-(\Rpackage{edgeR}-TMM)$_{n}$ pipeline (iDEGES/\Rpackage{edgeR}) is potentially superior to both DEGES/\Rpackage{edgeR} and DEGES/TbT. A suggested iDEGES/\Rpackage{edgeR} implementation ($n = 3$) consists of seven steps, as follows:
<>=
library(TCC)
data(hypoData)
group <- c(1, 1, 1, 2, 2, 2)
tcc <- new("TCC", hypoData, group)
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger",
                       iteration = 3, FDR = 0.1, floorPDEG = 0.05)
tcc$norm.factors
tcc$DEGES$execution.time
@

\subsubsection{DEGES/DESeq} \label{section-3-1-4}
The DEGES pipeline can also be performed by using only the functions in the \Rpackage{DESeq} package. Similar to the \Rpackage{edgeR} case above, this \Rpackage{DESeq}-\Rpackage{DESeq}-\Rpackage{DESeq} pipeline (DEGES/\Rpackage{DESeq}) simply changes the corresponding \Robject{norm.method} and \Robject{test.method} arguments as follows:
<>=
library(TCC)
data(hypoData)
group <- c(1, 1, 1, 2, 2, 2)
tcc <- new("TCC", hypoData, group)
tcc <- calcNormFactors(tcc, norm.method = "deseq", test.method = "deseq",
                       iteration = 1, FDR = 0.1, floorPDEG = 0.05)
tcc$norm.factors
tcc$DEGES$execution.time
@
For \Rpackage{DESeq} users, we also provide commands, consisting of functions in \Rpackage{DESeq}, to perform the DEGES/\Rpackage{DESeq} pipeline without \Rpackage{TCC}. The \Rfunction{calcNormFactors} function together with the above arguments can be regarded as a wrapper function for the following commands.
<>=
library(TCC)
data(hypoData)
group <- c(1, 1, 1, 2, 2, 2)
FDR <- 0.1
floorPDEG <- 0.05
cds <- newCountDataSet(hypoData, group)
### STEP 1 ###
cds <- estimateSizeFactors(cds)
### STEP 2 ###
cds <- estimateDispersions(cds)
result <- nbinomTest(cds, 1, 2)
result$pval[is.na(result$pval)] <- 1
result$padj[is.na(result$padj)] <- 1
q.value <- result$padj
if (sum(q.value < FDR) > (floorPDEG * nrow(hypoData))) {
  is.DEG <- as.logical(q.value < FDR)
} else {
  is.DEG <- as.logical(rank(result$pval, ties.method = "min") <=
                       nrow(hypoData) * floorPDEG)
}
### STEP 3 ###
cds <- newCountDataSet(hypoData[!is.DEG, ], group)
cds <- estimateSizeFactors(cds)
norm.factors <- sizeFactors(cds) / colSums(hypoData)
norm.factors <- norm.factors / mean(norm.factors)
norm.factors
@

\subsection{Normalization of two-group count data without replicates} \label{section-3-2}
It is important to keep in mind that most R packages (including \Rpackage{edgeR}, \Rpackage{DESeq}, and \Rpackage{baySeq}) are primarily designed for analyzing data with biological replicates because the biological variability has to be accurately estimated to avoid spurious DE calls (Glaus et al., 2012 \cite{glaus}). In fact, the functions for the DEG identification method implemented in \Rpackage{edgeR} (i.e., the exact test; ver. 3.0.4) do not allow analysis without replicates, though the TMM normalization method in the package can be applied to data regardless of whether it has replicates. Although the \Rpackage{edgeR} manual provides users with some ideas on how to perform the DE analysis in this situation, it is difficult to customize such an analysis with DEGES for data without replicates. When obtaining normalization factors from two-group count data without replicates, users can select a total of four combinations (two normalization methods $\times$ two DEG identification methods) coupled with an arbitrary number of iterations ($n = 0, 1, 2, \dots, 100$) in our DEGES-based normalization pipeline. That is, the \Rfunction{calcNormFactors} function can be used with \Robject{norm.method = "deseq"} or \Robject{"tmm"} and \Robject{test.method = "deseq"} or \Robject{"bayseq"}. Let us explain the procedure by retrieving the data of the first and the fourth columns of \Robject{hypoData}, i.e.,
<>=
library(TCC)
data(hypoData)
group <- c(1, 2)
tcc <- new("TCC", hypoData[, c(1, 4)], group)
head(tcc$count)
tcc$group
@
A DEGES pipeline (DEGES/\Rpackage{DESeq}) for obtaining normalization factors is as follows.
<>=
tcc <- calcNormFactors(tcc, norm.method = "deseq", test.method = "deseq",
                       iteration = 1, FDR = 0.1, floorPDEG = 0.05)
tcc$norm.factors
@
An advantage of this DEGES/\Rpackage{DESeq} pipeline is that the multi-step normalization strategy only needs the methods in the \Rpackage{DESeq} package. These factors should be the same as those produced by the following procedure consisting of functions implemented in \Rpackage{DESeq}.
<>=
library(TCC)
data(hypoData)
group <- c(1, 2)
FDR <- 0.1
floorPDEG <- 0.05
cds <- newCountDataSet(hypoData[, c(1, 4)], group)
### STEP 1 ###
cds <- estimateSizeFactors(cds)
### STEP 2 ###
cds <- estimateDispersions(cds, method = "blind", sharingMode = "fit-only")
result <- nbinomTest(cds, 1, 2)
result$pval[is.na(result$pval)] <- 1
result$padj[is.na(result$padj)] <- 1
q.value <- result$padj
if (sum(q.value < FDR) > (floorPDEG * nrow(hypoData))) {
  is.DEG <- as.logical(q.value < FDR)
} else {
  is.DEG <- as.logical(rank(result$pval, ties.method = "min") <=
                       nrow(hypoData) * floorPDEG)
}
### STEP 3 ###
cds <- newCountDataSet(hypoData[!is.DEG, c(1, 4)], group)
cds <- estimateSizeFactors(cds)
norm.factors <- sizeFactors(cds) / colSums(hypoData[, c(1, 4)])
norm.factors <- norm.factors / mean(norm.factors)
norm.factors
@

\subsection{Normalization of multi-group count data with replicates} \label{section-3-3}
Many R packages (including \Rpackage{edgeR}, \Rpackage{DESeq}, and \Rpackage{baySeq}) support DE analysis for multi-group tag count data. \Rpackage{TCC} provides some prototypes of DEGES-based pipelines for such data. Here, we analyze another hypothetical three-group count matrix, the \Robject{hypoData\_mg} object, provided in \Rpackage{TCC}. It consists of $1,000$ genes and a total of nine columns for testing any difference among three groups that each have triplicates.
<>=
library(TCC)
data(hypoData_mg)
group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
tcc <- new("TCC", hypoData_mg, group)
tcc
dim(tcc$count)
@
Of the $1,000$ genes, the first $200$ genes are DEGs and the remaining $800$ genes are non-DEGs. The breakdowns for the $200$ DEGs are as follows: $140$, $40$, and $20$ DEGs are up-regulated in Groups 1, 2, and 3, respectively. Below, we show some DEGES-based normalization pipelines for this multi-group data (\ref{section-3-3-1} DEGES/TbT, \ref{section-3-3-2} DEGES/\Rpackage{edgeR}, and \ref{section-3-3-3} DEGES/\Rpackage{DESeq}).

\subsubsection{DEGES/TbT} \label{section-3-3-1}
The DEGES/TbT pipeline for multi-group data is essentially the same as that for two-group data with or without replicates. Note that a smaller sampling size (i.e., \Robject{samplesize = 100}) is used here to reduce the computation time, but a larger sampling size of around $10,000$ (i.e., \Robject{samplesize = 10000}) is recommended (Hardcastle and Kelly, 2010 \cite{hardcastle}).
<>=
set.seed(1000)
library(TCC)
data(hypoData_mg)
samplesize <- 100
group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
tcc <- new("TCC", hypoData_mg, group)
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "bayseq",
                       iteration = 1, samplesize = samplesize)
tcc$norm.factors
@

\subsubsection{DEGES/edgeR} \label{section-3-3-2}
\Rpackage{edgeR} employs generalized linear models (GLMs) to find DEGs between any of the groups. The DEGES/\Rpackage{edgeR} normalization pipeline in \Rpackage{TCC} internally uses functions for the GLM approach that require two models (a full model and a null model). The full model corresponds to a design matrix that describes the sample groups. The null model corresponds to the full model with the specified model coefficients (\Robject{coef}) set to zero. The two models can be defined as follows:
<>=
library(TCC)
data(hypoData_mg)
group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
tcc <- new("TCC", hypoData_mg, group)
design <- model.matrix(~ as.factor(group))
coef <- 2:length(unique(group))
@
The design matrix (\Robject{design}) can be constructed by using the \Rfunction{model.matrix} function. For the model coefficients (\Robject{coef}), the user should specify all the coefficients except for the intercept term.
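For reference, in this three-group example the design matrix has an intercept column (the Group 1 baseline) and one indicator column for each of Groups 2 and 3, so \Robject{coef} evaluates to \Robject{c(2, 3)}, i.e., the two non-intercept coefficients. Inspecting the two objects created above makes the parameterization explicit (shown here as display-only commands; the comments indicate the expected contents):
\begin{Schunk}
\begin{Sinput}
> colnames(design)  # "(Intercept)" "as.factor(group)2" "as.factor(group)3"
> coef              # 2 3: the coefficients tested against the Group 1 baseline
\end{Sinput}
\end{Schunk}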
The two models (\Robject{design} and \Robject{coef}) will automatically be generated when performing the following \Rfunction{calcNormFactors} function if they are not explicitly indicated.
<>=
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger",
                       iteration = 1)
tcc$norm.factors
@
For \Rpackage{edgeR} users, we provide commands, consisting of functions in \Rpackage{edgeR}, to perform the DEGES/\Rpackage{edgeR} pipeline without \Rpackage{TCC}. The \Rfunction{calcNormFactors} function together with the above parameter settings can be regarded as a wrapper function for the following commands.
<>=
library(TCC)
data(hypoData_mg)
group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
tcc <- new("TCC", hypoData_mg, group)
FDR <- 0.1
floorPDEG <- 0.05
design <- model.matrix(~ as.factor(group))
coef <- 2:length(unique(group))
d <- DGEList(counts = hypoData_mg, group = group)
### STEP 1 ###
d <- calcNormFactors(d)
### STEP 2 ###
d <- estimateGLMCommonDisp(d, design)
d <- estimateGLMTrendedDisp(d, design)
d <- estimateGLMTagwiseDisp(d, design)
fit <- glmFit(d, design)
lrt <- glmLRT(fit, coef = coef)
result <- topTags(lrt, n = nrow(hypoData_mg))
result <- result$table[rownames(hypoData_mg), ]
if (sum(result$FDR < FDR) > (floorPDEG * nrow(hypoData_mg))) {
  is.DEG <- as.logical(result$FDR < FDR)
} else {
  is.DEG <- as.logical(rank(result$PValue, ties.method = "min") <=
                       nrow(hypoData_mg) * floorPDEG)
}
### STEP 3 ###
d <- DGEList(counts = hypoData_mg[!is.DEG, ], group = group)
d <- calcNormFactors(d)
norm.factors <- d$samples$norm.factors * colSums(hypoData_mg[!is.DEG, ]) / colSums(hypoData_mg)
norm.factors <- norm.factors / mean(norm.factors)
norm.factors
@

\subsubsection{DEGES/DESeq} \label{section-3-3-3}
\Rpackage{DESeq} also employs GLMs for analyzing multi-group experiments. Similar to the \Rpackage{edgeR} package, it requires two models (a full model and a reduced model). The full model (\Robject{fit1}) and reduced model (\Robject{fit0}) can be created as follows:
<>=
library(TCC)
data(hypoData_mg)
group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
tcc <- new("TCC", hypoData_mg, group)
fit1 <- count ~ condition
fit0 <- count ~ 1
@
The two models (\Robject{fit1} and \Robject{fit0}) will automatically be generated when performing the following \Rfunction{calcNormFactors} function if they are not explicitly indicated.
<>=
tcc <- calcNormFactors(tcc, norm.method = "deseq", test.method = "deseq",
                       iteration = 1)
tcc$norm.factors
@
For \Rpackage{DESeq} users, we provide commands, consisting of functions in \Rpackage{DESeq}, to perform the DEGES/\Rpackage{DESeq} pipeline without \Rpackage{TCC}. The \Rfunction{calcNormFactors} function together with the above parameter settings can be regarded as a wrapper function for the following commands.
<>=
library(TCC)
data(hypoData_mg)
group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
FDR <- 0.1
floorPDEG <- 0.05
tcc <- new("TCC", hypoData_mg, group)
fit1 <- count ~ condition
fit0 <- count ~ 1
cds <- newCountDataSet(hypoData_mg, group)
### STEP 1 ###
cds <- estimateSizeFactors(cds)
### STEP 2 ###
cds <- estimateDispersions(cds)
reduced.model <- fitNbinomGLMs(cds, fit0)
full.model <- fitNbinomGLMs(cds, fit1)
p.value <- nbinomGLMTest(full.model, reduced.model)
p.value[is.na(p.value)] <- 1
q.value <- p.adjust(p.value, method = "BH")
if (sum(q.value < FDR) > (floorPDEG * nrow(hypoData_mg))) {
  is.DEG <- as.logical(q.value < FDR)
} else {
  is.DEG <- as.logical(rank(p.value, ties.method = "min") <=
                       nrow(hypoData_mg) * floorPDEG)
}
### STEP 3 ###
cds <- newCountDataSet(hypoData_mg[!is.DEG, ], group)
cds <- estimateSizeFactors(cds)
norm.factors <- sizeFactors(cds) / colSums(hypoData_mg)
norm.factors <- norm.factors / mean(norm.factors)
norm.factors
@

\subsection{Retrieving normalized data} \label{section-3-4}
Functions similar to ours for calculating normalization factors are the \Rfunction{calcNormFactors} function in \Rpackage{edgeR} and the \Rfunction{estimateSizeFactors} function in \Rpackage{DESeq}. Note that the terminology used in \Rpackage{DESeq} (i.e., size factors) is different from that used in \Rpackage{edgeR} (i.e., effective library sizes) and ours. The effective library size in \Rpackage{edgeR} is calculated as the library size multiplied by the normalization factor. The size factors in the \Rpackage{DESeq} package are comparable to the {\it normalized} effective library sizes wherein the summary statistics for the effective library sizes are adjusted to one. Our normalization factors, which can be obtained from \Robject{tcc\$norm.factors}, are defined in the same way as those in \Rpackage{edgeR}. Accordingly, the normalization factors calculated from \Rpackage{TCC} with arbitrary options should be manipulated together with the library sizes when normalized read counts are to be obtained. Since biologists are often interested in such information (Dillies et al., 2012 \cite{dillies}), we provide the \Rfunction{getNormalizedData} function for retrieving normalized data.

Note that the \Robject{hypoData} consists of $1,000$ genes and a total of six samples (three biological replicates for G1 and three biological replicates for G2); i.e., \{G1\_rep1, G1\_rep2, G1\_rep3\} vs. \{G2\_rep1, G2\_rep2, G2\_rep3\}. These simulation data have basically the same conditions as shown in Fig. 1 of the TbT paper (Kadota et al., 2012 \cite{kadota}); i.e., (i) the first $200$ genes are DEGs ($P_{\mbox{\small DEG}} = 200/1000 = 20\%$), (ii) the first $180$ genes of the $200$ DEGs are higher in G1 ($P_{\mbox{\small G1}} = 180/200 = 90\%$) and the remaining $20$ DEGs are higher in G2, and (iii) the level of DE is four-fold. The last $800$ genes were designed to be non-DEGs. The different normalization strategies can roughly be evaluated in terms of how similar the summary statistics for {\it normalized} data labeled as non-DEGs in one group (e.g., G1) are to those of the other group (e.g., G2). The basic statistics for the non-DEGs are as follows.
<>=
library(TCC)
data(hypoData)
nonDEG <- 201:1000
summary(hypoData[nonDEG, ])
@
From now on, we will display only the median values for simplicity, i.e.,
<>=
apply(hypoData[nonDEG, ], 2, median)
@
<>=
hypoData.median <- apply(hypoData[nonDEG, ], 2, median)
@
In what follows, we show detailed examples using \Robject{hypoData}. Note, however, that the basic usage is simple.
<>=
normalized.count <- getNormalizedData(tcc)
@

\subsubsection{Retrieving two-group DEGES/edgeR-normalized data with replicates} \label{section-3-4-1}
The \Rfunction{getNormalizedData} function can be applied to the \Rpackage{TCC} class object after the normalization factors have been calculated.
<>=
library(TCC)
data(hypoData)
nonDEG <- 201:1000
group <- c(1, 1, 1, 2, 2, 2)
tcc <- new("TCC", hypoData, group)
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger",
                       iteration = 1, FDR = 0.1, floorPDEG = 0.05)
normalized.count <- getNormalizedData(tcc)
apply(normalized.count[nonDEG, ], 2, median)
@
The same procedure consisting of functions in \Rpackage{edgeR} is
<>=
library(TCC)
data(hypoData)
nonDEG <- 201:1000
group <- c(1, 1, 1, 2, 2, 2)
FDR <- 0.1
floorPDEG <- 0.05
d <- DGEList(counts = hypoData, group = group)
### Step 1 ###
d <- calcNormFactors(d)
### Step 2 ###
d <- estimateCommonDisp(d)
d <- estimateTagwiseDisp(d)
result <- exactTest(d)
q.value <- p.adjust(result$table$PValue, method = "BH")
if (sum(q.value < FDR) > (floorPDEG * nrow(hypoData))) {
  is.DEG <- as.logical(q.value < FDR)
} else {
  is.DEG <- as.logical(rank(result$table$PValue, ties.method = "min") <=
                       nrow(hypoData) * floorPDEG)
}
### Step 3 ###
d <- DGEList(counts = hypoData[!is.DEG, ], group = group)
d <- calcNormFactors(d)
norm.factors <- d$samples$norm.factors * colSums(hypoData[!is.DEG, ]) / colSums(hypoData)
norm.factors <- norm.factors / mean(norm.factors)
effective.libsizes <- colSums(hypoData) * norm.factors
normalized.count <- sweep(hypoData, 2, mean(effective.libsizes) / effective.libsizes, "*")
apply(normalized.count[nonDEG, ], 2, median)
@
It is obvious that the summary statistics (ranging from $\Sexpr{sprintf("%.5f", min(apply(normalized.count[nonDEG,],2,median)))}$ to $\Sexpr{sprintf("%.5f", max(apply(normalized.count[nonDEG,],2,median)))}$) from DEGES/\Rpackage{edgeR}-normalized data are close to the truth (i.e., ranging from $\Sexpr{sprintf("%.1f", min(hypoData.median))}$ to $\Sexpr{sprintf("%.1f", max(hypoData.median))}$). For comparison, the summary statistics for TMM-normalized data produced using the original normalization method (i.e., TMM) in \Rpackage{edgeR} are obtained as follows.
<>=
library(TCC)
data(hypoData)
nonDEG <- 201:1000
group <- c(1, 1, 1, 2, 2, 2)
d <- DGEList(count = hypoData, group = group)
d <- calcNormFactors(d)
norm.factors <- d$samples$norm.factors
norm.factors <- norm.factors / mean(norm.factors)
effective.libsizes <- colSums(hypoData) * norm.factors
normalized.count <- sweep(hypoData, 2, mean(effective.libsizes) / effective.libsizes, "*")
apply(normalized.count[nonDEG, ], 2, median)
@
This is the same as
<>=
library(TCC)
data(hypoData)
nonDEG <- 201:1000
group <- c(1, 1, 1, 2, 2, 2)
tcc <- new("TCC", hypoData, group)
tcc <- calcNormFactors(tcc, norm.method = "tmm", iteration = 0)
normalized.count <- getNormalizedData(tcc)
apply(normalized.count[nonDEG, ], 2, median)
@
From the viewpoint of the data distribution of non-DEGs, these statistics (ranging from $\Sexpr{sprintf("%.5f", min(apply(normalized.count[nonDEG,],2,median)))}$ to $\Sexpr{sprintf("%.5f", max(apply(normalized.count[nonDEG,],2,median)))}$) are not as good as those of DEGES/\Rpackage{edgeR}.

\subsubsection{Retrieving two-group DEGES/DESeq-normalized data with replicates} \label{section-3-4-2}
Similar to the DEGES/\Rpackage{edgeR} case, DEGES/\Rpackage{DESeq}-normalized data can be retrieved as follows.
<>= library(TCC) data(hypoData) group <- c(1, 1, 1, 2, 2, 2) nonDEG <- 201:1000 tcc <- new("TCC", hypoData, group) tcc <- calcNormFactors(tcc, norm.method = "deseq", test.method = "deseq", iteration = 1, FDR = 0.1, floorPDEG = 0.05) normalized.count <- getNormalizedData(tcc) apply(normalized.count[nonDEG, ], 2, median) @ The same procedure consisting of functions in \Rpackage{DESeq} is <>= library(TCC) data(hypoData) nonDEG <- 201:1000 group <- c(1, 1, 1, 2, 2, 2) FDR <- 0.1 floorPDEG <- 0.05 cds <- newCountDataSet(hypoData, group) ### Step 1 ### cds <- estimateSizeFactors(cds) ### Step 2 ### cds <- estimateDispersions(cds) result <- nbinomTest(cds, 1, 2) result$pval[is.na(result$pval)] <- 1 result$padj[is.na(result$padj)] <- 1 q.value <- result$padj if (sum(q.value < FDR) > (floorPDEG * nrow(hypoData))) { is.DEG <- as.logical(q.value < FDR) } else { is.DEG <- as.logical(rank(result$pval, ties.method = "min") <= nrow(hypoData) * floorPDEG) } ### Step 3 ### cds <- newCountDataSet(hypoData[!is.DEG, ], group) cds <- estimateSizeFactors(cds) norm.factors <- sizeFactors(cds) / colSums(hypoData) norm.factors <- norm.factors / mean(norm.factors) effective.libsizes <- colSums(hypoData) * norm.factors normalized.count <- sweep(hypoData, 2, mean(effective.libsizes) / effective.libsizes, "*") apply(normalized.count[nonDEG, ], 2, median) @ \subsubsection{Retrieving two-group DEGES/DESeq-normalized data without replicates} \label{section-3-4-3} Similar to the case of count data with replicates, the DEGES/\Rpackage{DESeq}-normalized data without replicates can be retrieved as follows. <>= library(TCC) data(hypoData) nonDEG <- 201:1000 group <- c(1, 2) tcc <- new("TCC", hypoData[, c(1, 4)], group) tcc <- calcNormFactors(tcc, norm.method = "deseq", test.method = "deseq", iteration = 1, FDR = 0.1, floorPDEG = 0.05) normalized.count <- getNormalizedData(tcc) apply(normalized.count[nonDEG, ], 2, median) @ The same procedure consisting of functions in \Rpackage{DESeq} is <>= library(TCC) data(hypoData) nonDEG <- 201:1000 group <- c(1, 2) FDR <- 0.1 floorPDEG <- 0.05 cds <- newCountDataSet(hypoData[,c(1, 4)], group) ### Step 1 ### cds <- estimateSizeFactors(cds) ### Step 2 ### cds <- estimateDispersions(cds, method = "blind", sharingMode = "fit-only") result <- nbinomTest(cds, 1, 2) result$pval[is.na(result$pval)] <- 1 result$padj[is.na(result$padj)] <- 1 q.value <- result$padj if (sum(q.value < FDR) > (floorPDEG * nrow(hypoData))) { is.DEG <- as.logical(q.value < FDR) } else { is.DEG <- as.logical(rank(result$pval, ties.method = "min") <= nrow(hypoData) * floorPDEG) } ### Step 3 ### cds <- newCountDataSet(hypoData[!is.DEG, c(1, 4)], group) cds <- estimateSizeFactors(cds) norm.factors <- sizeFactors(cds) / colSums(hypoData[, c(1, 4)]) norm.factors <- norm.factors / mean(norm.factors) effective.libsizes <- colSums(hypoData[, c(1, 4)]) * norm.factors normalized.count <- sweep(hypoData[, c(1, 4)], 2, mean(effective.libsizes) / effective.libsizes, "*") apply(normalized.count[nonDEG, ], 2, median) @ The above summary statistics from DEGES/\Rpackage{DESeq}-normalized data are closer to the truth (i.e., $\Sexpr{sprintf("%.1f", hypoData.median[1])}$ for G1\_rep1 and $\Sexpr{sprintf("%.1f",hypoData.median[4])}$ for G2\_rep1) than are the following summary statistics from data normalized using the original normalization method implemented in \Rpackage{DESeq}. 
<>= library(TCC) data(hypoData) nonDEG <- 201:1000 group <- c(1, 2) cds <- newCountDataSet(hypoData[, c(1, 4)], group) cds <- estimateSizeFactors(cds) normalized.count <- counts(cds, normalized = TRUE) apply(normalized.count[nonDEG, ], 2, median) @ \subsubsection{Retrieving multi-group iDEGES/edgeR-normalized data with replicates} \label{section-3-4-4} Here, we analyze another hypothetical three-group count matrix, the \Robject{hypoData\_mg} object, provided in \Rpackage{TCC}. It consists of $1,000$ genes and a total of nine columns for testing any difference among three groups that each have triplicates. Similar to the \Robject{hypoData} object, the first $200$ genes are DEGs and the remaining $800$ genes are non-DEGs. The basic statistics for the non-DEGs are as follows. <>= library(TCC) data(hypoData_mg) nonDEG <- 201:1000 summary(hypoData_mg[nonDEG, ]) @ From now on, we will display only the median values for simplicity, i.e., <>= apply(hypoData_mg[nonDEG, ], 2, median) @ <>= hypoData_mg.median <- apply(hypoData_mg[nonDEG, ], 2, median) @ The iDEGES/\Rpackage{edgeR}-normalized data can be retrieved as follows. <>= library(TCC) data(hypoData_mg) nonDEG <- 201:1000 group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3) tcc <- new("TCC", hypoData_mg, group) design <- model.matrix(~ as.factor(group)) coef <- 2:length(unique(group)) tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger", iteration = 3) normalized.count <- getNormalizedData(tcc) apply(normalized.count[nonDEG, ], 2, median) range(apply(normalized.count[nonDEG, ], 2, median)) @ <>= normByiDEGES <- range(apply(normalized.count[nonDEG, ], 2, median)) @ For comparison, the summary statistics for TMM-normalized data produced using the original normalization method (i.e., TMM) in \Rpackage{edgeR} are obtained as follows. <>= library(TCC) data(hypoData_mg) nonDEG <- 201:1000 group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3) tcc <- new("TCC", hypoData_mg, group) tcc <- calcNormFactors(tcc, norm.method = "tmm", iteration = 0) normalized.count <- getNormalizedData(tcc) apply(normalized.count[nonDEG, ], 2, median) range(apply(normalized.count[nonDEG, ], 2, median)) @ <>= normByTMM <- range(apply(normalized.count[nonDEG, ], 2, median)) @ It is obvious that the summary statistics (ranging from $\Sexpr{sprintf("%.5f", min(normByiDEGES))}$ to $\Sexpr{sprintf("%.5f", max(normByiDEGES))}$) from iDEGES/\Rpackage{edgeR}-normalized data are closer to the truth (i.e., ranging from $\Sexpr{sprintf("%.1f", min(hypoData_mg.median))}$ to $\Sexpr{sprintf("%.1f", max(hypoData_mg.median))}$) than those (ranging from $\Sexpr{sprintf("%.5f", min(normByTMM))}$ to $\Sexpr{sprintf("%.5f", max(normByTMM))}$) from TMM-normalized data. \newpage \section{Differential expression (DE)} \label{section-4} The particular feature of \Rpackage{TCC} is that it calculates robust normalization factors. Moreover, end users would like to have some accessory functions for subsequent analyses. Here, we provide the \Rfunction{estimateDE} function for identifying DEGs. Specifically, the function internally uses the corresponding functions implemented in three packages: \Rfunction{exactTest} in \Rpackage{edgeR}, \Rfunction{nbinomTest} in \Rpackage{DESeq}, and \Rfunction{getLikelihoods.NB} in \Rpackage{baySeq}. 
Similar to the usage of the \Rfunction{calcNormFactors} function with the \Robject{test.method} argument in \Rpackage{TCC}, those DE methods in \Rpackage{edgeR}, \Rpackage{DESeq}, and \Rpackage{baySeq} can be performed by using the \Rfunction{estimateDE} function with \Robject{test.method = "edger"}, \Robject{"deseq"}, and \Robject{"bayseq"}, respectively. Here, we show some examples of DE analysis for two-group data with replicates (\ref{section-4-1}), two-group data without replicates (\ref{section-4-2}), and multi-group data with replicates (\ref{section-4-3}).

\subsection{DE analysis for two-group data with replicates} \label{section-4-1}
\subsubsection{edgeR coupled with iDEGES/edgeR normalization} \label{section-4-1-1}
We give a procedure for DE analysis using the exact test implemented in \Rpackage{edgeR} together with iDEGES/\Rpackage{edgeR} normalization factors (i.e., the iDEGES/\Rpackage{edgeR}-\Rpackage{edgeR} combination) for the hypothetical two-group count data with replicates (i.e., the \Robject{hypoData} object). If the user wants to declare the genes satisfying an FDR threshold of $< 10\%$ as DEGs, this can be done as follows.
<>=
library(TCC)
data(hypoData)
group <- c(1, 1, 1, 2, 2, 2)
tcc <- new("TCC", hypoData, group)
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger",
                       iteration = 3, FDR = 0.1, floorPDEG = 0.05)
tcc <- estimateDE(tcc, test.method = "edger", FDR = 0.1)
@
The results of the DE analysis are stored in the \Rpackage{TCC} class object. The summary statistics for top-ranked genes can be retrieved by using the \Rfunction{getResult} function.
<>=
result <- getResult(tcc, sort = TRUE)
head(result)
@
The DE results can be broken down as follows.
<>=
table(tcc$estimatedDEG)
@
This means that $\Sexpr{sum(tcc$estimatedDEG == FALSE)}$ non-DEGs and $\Sexpr{sum(tcc$estimatedDEG == TRUE)}$ DEGs satisfy FDR $< 0.1$. The \Rfunction{plot} function generates an M-A plot, where "M" indicates the log-ratio (i.e., $\mbox{M} = \log_{2}\mbox{G2} - \log_{2}\mbox{G1}$) and "A" indicates the average read count (i.e., $\mbox{A} = (\log_{2}\mbox{G2} + \log_{2}\mbox{G1}) / 2$), from the normalized count data. The magenta points indicate the identified DEGs at FDR $< 0.1$.
<>=
plot(tcc)
@

\subsubsection{baySeq coupled with iDEGES/edgeR normalization} \label{section-4-1-2}
If the user wants to employ the empirical Bayesian method in \Rpackage{baySeq} together with iDEGES/\Rpackage{edgeR} normalization factors (i.e., the iDEGES/\Rpackage{edgeR}-\Rpackage{baySeq} combination), one can do as follows.
<>=
set.seed(1000)
library(TCC)
data(hypoData)
samplesize <- 100
group <- c(1, 1, 1, 2, 2, 2)
tcc <- new("TCC", hypoData, group)
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger",
                       iteration = 3, FDR = 0.1, floorPDEG = 0.05)
tcc <- estimateDE(tcc, test.method = "bayseq", FDR = 0.1, samplesize = samplesize)
result <- getResult(tcc, sort = TRUE)
head(result)
table(tcc$estimatedDEG)
@
Note that a smaller sampling size (i.e., \Robject{samplesize = 100}) is used here to reduce the computation time, but a larger sampling size of around $10,000$ (i.e., \Robject{samplesize = 10000}) is recommended (Hardcastle and Kelly, 2010 \cite{hardcastle}). Note also that \Rpackage{baySeq} outputs posterior likelihoods instead of the $p$-values obtained from \Rpackage{edgeR} and \Rpackage{DESeq}. The $p$-value column stores the ($1 - likelihood$) values when the \Rfunction{estimateDE} function is executed with the empirical Bayesian method in \Rpackage{baySeq}.
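As a small sketch of this convention (assuming the \Robject{result} object returned by \Rfunction{getResult} above and its \Robject{p.value} column), the posterior likelihoods can therefore be recovered simply by inverting the stored values:
\begin{Schunk}
\begin{Sinput}
> likelihood <- 1 - result$p.value  # undo the (1 - likelihood) transformation
> head(likelihood)
\end{Sinput}
\end{Schunk}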
Now let us describe an alternative procedure for \Rpackage{baySeq} users that corresponds to the \Rfunction{estimateDE} function. The $likelihood$ values and $p$-values (calculated as $1 - likelihood$) are retrieved as follows. <>= set.seed(1000) library(TCC) data(hypoData) samplesize <- 100 group <- c(1, 1, 1, 2, 2, 2) tcc <- new("TCC", hypoData, group) tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger", iteration = 3, FDR = 0.1, floorPDEG = 0.05) effective.libsizes <- colSums(tcc$count) * tcc$norm.factors groups <- list(NDE = rep(1, length(group)), DE = group) cD <- new("countData", data = tcc$count, replicates = group, libsizes = effective.libsizes, groups = groups) cD <- getPriors.NB(cD, samplesize = samplesize, estimation = "QL", cl = NULL) cD <- getLikelihoods.NB(cD, pET = "BIC", cl = NULL) tmp <- topCounts(cD, group = "DE", number = nrow(tcc$count)) tmp <- tmp[rownames(tcc$count), ] p.value <- 1 - tmp$Likelihood q.value <- tmp$FDR result <- cbind(p.value, q.value) rownames(result) <- rownames(tmp) head(result) @ \subsection{DE analysis for two-group data without replicates} \label{section-4-2} As described previously, the functions for the DEG identification method implemented in \Rpackage{edgeR} (i.e., the exact test; ver. 3.0.4) do not allow analysis without replicates. Currently, the \Rfunction{estimateDE} function only allows the \Robject{"deseq"} or \Robject{"bayseq"} options for the \Robject{test.method} argument. Here, we show a procedure for DE analysis using the NB test implemented in \Rpackage{DESeq} together with iDEGES/\Rpackage{DESeq} normalization factors (i.e., the iDEGES/\Rpackage{DESeq}-\Rpackage{DESeq} combination) for the hypothetical two-group count data without replicates (i.e., the \Robject{hypoData[, c(1, 4)]} object). If the user wants to determine the genes having an FDR threshold of $< 10\%$ as DEGs, one can do as follows. <>= library(TCC) data(hypoData) group <- c(1, 2) tcc <- new("TCC", hypoData[, c(1, 4)], group) head(tcc$count) tcc$group tcc <- calcNormFactors(tcc, norm.method = "deseq", test.method = "deseq", iteration = 3, FDR = 0.1, floorPDEG = 0.05) tcc$norm.factors tcc <- estimateDE(tcc, test.method = "deseq", FDR = 0.1) result <- getResult(tcc, sort = TRUE) head(result) table(tcc$estimatedDEG) @ It can be seen that there is no DEG having FDR $< 0.1$. \subsection{DE analysis for multi-group data with replicates} \label{section-4-3} Here, we give three examples of DE analysis coupled with DEGES/\Rpackage{edgeR} normalization for the hypothetical three-group data with replicates, i.e., the \Robject{hypoData\_mg} object. The use of the DEGES/\Rpackage{edgeR} normalization factors is simply for reducing the computation time. \subsubsection{baySeq coupled with DEGES/edgeR normalization} \label{section-4-3-1} The empirical Bayesian method implemented in \Rpackage{baySeq} after executing the DEGES/\Rpackage{edgeR} normalization (i.e., the DEGES/\Rpackage{edgeR}-\Rpackage{baySeq} combination) can be performed as follows. 
<>= library(TCC) data(hypoData_mg) group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3) tcc <- new("TCC", hypoData_mg, group) ### Normalization ### tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger", iteration = 1) ### DE analysis ### set.seed(1000) samplesize <- 100 tcc <- estimateDE(tcc, test.method = "bayseq", FDR = 0.1, samplesize = samplesize) result <- getResult(tcc, sort = TRUE) head(result) table(tcc$estimatedDEG) @ It can be seen that the \Rpackage{baySeq} method identified $\Sexpr{sum(tcc$estimatedDEG == TRUE)}$ DEGs having FDR $< 0.1$. One can obtain the number of DEGs with another threshold (e.g., FDR $< 0.2$) from the result object as follows. <>= sum(result$q.value < 0.2) @ For \Rpackage{baySeq} users, we provide commands, consisting of functions in \Rpackage{baySeq}, to perform the DEG identification without the function in \Rpackage{TCC}. The \Rfunction{estimateDE} function with \Robject{test.method = "bayseq"} can be regarded as a wrapper function for the following commands after the DEGES/\Rpackage{edgeR} normalization. <>= set.seed(1000) samplesize <- 100 effective.libsizes <- colSums(tcc$count) * tcc$norm.factors groups <- list(NDE = rep(1, length(group)), DE = group) cD <- new("countData", data = tcc$count, replicates = group, libsizes = effective.libsizes, groups = groups) cD <- getPriors.NB(cD, samplesize = samplesize, estimation = "QL", cl = NULL) cD <- getLikelihoods.NB(cD, pET = "BIC", cl = NULL) tmp <- topCounts(cD, group = "DE", number = nrow(tcc$count)) tmp <- tmp[rownames(tcc$count), ] p.value <- 1 - tmp$Likelihood q.value <- tmp$FDR result <- cbind(p.value, q.value) rownames(result) <- rownames(tmp) head(result) sum(q.value < 0.1) sum(q.value < 0.2) @ \subsubsection{edgeR coupled with DEGES/edgeR normalization} \label{section-4-3-2} The exact test implemented in \Rpackage{edgeR} after executing the DEGES/\Rpackage{edgeR} normalization (i.e., the DEGES/\Rpackage{edgeR}-\Rpackage{edgeR} combination) can be performed as follows. 
<>=
library(TCC)
data(hypoData_mg)
group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
tcc <- new("TCC", hypoData_mg, group)
### Normalization ###
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger",
                       iteration = 1)
### DE analysis ###
tcc <- estimateDE(tcc, test.method = "edger", FDR = 0.1)
result <- getResult(tcc, sort = TRUE)
head(result)
table(tcc$estimatedDEG)
@
Note that these DEGs having FDR $< 0.1$ display DE between any of the groups because the two arguments used here (\Robject{design} and \Robject{coef}) correspond to an ANOVA-like test for any differences, as provided in \Rpackage{edgeR}, i.e.,
<>=
library(TCC)
data(hypoData_mg)
group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
design <- model.matrix(~ as.factor(group))
coef <- 2:length(unique(group))
tcc <- new("TCC", hypoData_mg, group)
### Normalization ###
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger",
                       iteration = 1)
### DE analysis ###
d <- DGEList(tcc$count, group = group)
d$samples$norm.factors <- tcc$norm.factors
d <- estimateGLMCommonDisp(d, design)
d <- estimateGLMTrendedDisp(d, design)
d <- estimateGLMTagwiseDisp(d, design)
fit <- glmFit(d, design)
lrt <- glmLRT(fit, coef = coef)
tmp <- topTags(lrt, n = nrow(tcc$count))
p.value <- tmp$table$PValue
q.value <- tmp$table$FDR
result <- cbind(p.value, q.value)
rownames(result) <- rownames(tmp)
head(result)
sum(q.value < 0.1)
sum(q.value < 0.2)
@
As described in the \Rpackage{edgeR} manual, the second and third columns in the \Robject{design} object are relative to the baseline (i.e., Group 1 or G1): \Robject{coef = 2} means G2 vs. G1 and \Robject{coef = 3} means G3 vs. G1. The above procedure with the \Robject{coef} object (i.e., \Robject{2:length(unique(group))}) specifies both comparisons (i.e., G2 vs. G1 and G3 vs. G1) and identifies DEGs between any of the three groups. Of course, one can also perform any two-group comparison of interest from multi-group data with replicates. For example, the DE analysis for G3 vs. G1 together with DEGES/\Rpackage{edgeR} normalization can be performed as follows.
<>=
library(TCC)
data(hypoData_mg)
group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
tcc <- new("TCC", hypoData_mg, group)
### Normalization ###
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger",
                       iteration = 1)
### DE analysis ###
coef <- 3
tcc <- estimateDE(tcc, test.method = "edger", FDR = 0.1, coef = coef)
result <- getResult(tcc, sort = TRUE)
head(result)
table(tcc$estimatedDEG)
@

\subsubsection{DESeq coupled with DEGES/edgeR normalization} \label{section-4-3-3}
The NB test implemented in \Rpackage{DESeq} after executing the DEGES/\Rpackage{edgeR} normalization (i.e., the DEGES/\Rpackage{edgeR}-\Rpackage{DESeq} combination) can be performed as follows.
<>=
library(TCC)
data(hypoData_mg)
group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
tcc <- new("TCC", hypoData_mg, group)
### Normalization ###
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger",
                       iteration = 1)
### DE analysis ###
fit1 <- count ~ condition
fit0 <- count ~ 1
tcc <- estimateDE(tcc, test.method = "deseq", FDR = 0.1, fit0 = fit0, fit1 = fit1)
result <- getResult(tcc, sort = TRUE)
head(result)
table(tcc$estimatedDEG)
@
For \Rpackage{DESeq} users, we provide commands, consisting of functions in \Rpackage{DESeq}, to perform the DEG identification without the functions in \Rpackage{TCC}. The \Rfunction{estimateDE} function with \Robject{test.method = "deseq"} can be regarded as a wrapper function for the following commands after the DEGES/\Rpackage{edgeR} normalization.
<>= library(TCC) data(hypoData_mg) group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3) tcc <- new("TCC", hypoData_mg, group) ### Normalization ### tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger", iteration = 1) ### DE analysis ### fit1 <- count ~ condition fit0 <- count ~ 1 cds <- newCountDataSet(tcc$count, group) sizeFactors(cds) <- tcc$norm.factors * colSums(tcc$count) cds <- estimateDispersions(cds) reduced.model <- fitNbinomGLMs(cds, fit0) full.model <- fitNbinomGLMs(cds, fit1) p.value <- nbinomGLMTest(full.model, reduced.model) p.value[is.na(p.value)] <- 1 q.value <- p.adjust(p.value, method = "BH") tmp <- cbind(p.value, q.value) rownames(tmp) <- tcc$gene_id result <- tmp[order(p.value), ] head(result) sum(q.value < 0.1) sum(q.value < 0.2) @ \section{Generation of simulation data} \subsection{Introduction and basic usage} \label{section-5-1} As demonstrated in our previous study (Kadota et al., 2012 \cite{kadota}), the DEGES-based normalization methods implemented in \Rpackage{TCC} theoretically outperform the other normalization methods when the numbers of DEGs (G1 vs. G2) in the tag count data are biased. However, it is difficult to determine whether the up- and down-regulated DEGs in one of the groups are actually biased in their number when analyzing real data (Dillies et al., 2012 \cite{dillies}). This means we have to evaluate the potential performance of our DEGES-based methods using mainly simulation data. The \Rfunction{simulateReadCounts} function generates simulation data under various conditions. This function can generate simulation data analyzed in the TbT paper (Kadota et al., 2012 \cite{kadota}), and that means it enables other researchers to compare the methods they develop with our DEGES-based methods. For example, the \Robject{hypoData} object, a hypothetical count dataset provided in \Rpackage{TCC}, was generated by using this function. The output of the \Rfunction{simulateReadCounts} function is stored as a \Rpackage{TCC} class object and is therefore ready-to-analyze. Note that different trials of simulation analysis generally yield different count data even under the same simulation conditions. As mentioned in section \ref{section-3-1-1}, we can call the \Rfunction{set.seed} function in order to obtain reproducible results (i.e., the \Robject{tcc\$count}) with the \Rfunction{simulateReadCounts} function. <>= set.seed(1000) library(TCC) tcc <- simulateReadCounts(Ngene = 1000, PDEG = 0.2, DEG.assign = c(0.9, 0.1), DEG.foldchange = c(4, 4), replicates = c(3, 3)) dim(tcc$count) head(tcc$count) tcc$group @ The simulation conditions for comparing two groups (G1 vs. G2) with biological replicates are as follows: (i) the number of genes is $1,000$ (i.e., \Robject{Ngene = 1000}), (ii) the first $20\%$ of genes are DEGs (\Robject{PDEG = 0.2}), (iii) the first $90\%$ of the DEGs are up-regulated in G1, and the remaining $10\%$ are up-regulated in G2 (\Robject{DEG.assign = c(0.9, 0.1)}), (iv) the levels of DE are four-fold in both groups (\Robject{DEG.foldchange = c(4, 4)}), and (v) there are a total of six samples (three biological replicates for G1 and three biological replicates for G2) (\Robject{replicates = c(3, 3)}). The variance of the NB distribution can be modeled as $V = \mu + \phi\mu^{2}$. 
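For readers unfamiliar with this parameterization, the following sketch (not part of the analysis pipeline; the values of \Robject{mu} and \Robject{phi} are arbitrary) illustrates that counts drawn with \Rfunction{rnbinom} using \Robject{size = 1/phi} have a variance close to $\mu + \phi\mu^{2}$.
\begin{Schunk}
\begin{Sinput}
> mu <- 200                      # arbitrary mean
> phi <- 0.3                     # arbitrary dispersion
> mu + phi * mu^2                # theoretical variance: 200 + 0.3 * 200^2 = 12200
> set.seed(1)
> x <- rnbinom(100000, mu = mu, size = 1 / phi)
> var(x)                         # should be close to the theoretical variance
\end{Sinput}
\end{Schunk}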
The empirical distribution of the read counts for producing the mean ($\mu$) and dispersion ($\phi$) parameters of the model was obtained from {\it Arabidopsis} data (three biological replicates for each of the treated and non-treated groups) in \Rpackage{NBPSeq} (Di et al., 2011 \cite{di}). The \Robject{tcc\$count} object is essentially the same as the \Robject{hypoData} object of \Rpackage{TCC}. The information about the simulation conditions can be viewed as follows. <>= str(tcc$simulation) @ Specifically, the entries for $0, 1$, and $2$ in the \Robject{tcc\$simulation\$trueDEG} object are for non-DEGs, DEGs up-regulated in G1, and DEGs up-regulated in G2, respectively. The breakdowns for individual entries are the same as stated above: $800$ entries are non-DEGs, $180$ DEGs are up-regulated in G1, and $20$ DEGs are up-regulated in G2. <>= table(tcc$simulation$trueDEG) @ This information can be used to evaluate the performance of the DEGES-based normalization methods in terms of the sensitivity and specificity of the results of their DE analysis. A good normalization method coupled with a DE method such as the exact test (Robinson and Smyth, 2008 \cite{robinson3}) or the empirical Bayesian method (Hardcastle and Kelly, 2010 \cite{hardcastle}) should produce well-ranked gene lists in which the true DEGs are top-ranked and non-DEGs are bottom-ranked when all genes are ranked according to the degree of DE. The ranked gene list after performing the DEGES/\Rpackage{edgeR}-\Rpackage{edgeR} combination can be obtained as follows. <>= tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger", iteration = 1, FDR = 0.1, floorPDEG = 0.05) tcc <- estimateDE(tcc, test.method = "edger", FDR = 0.1) result <- getResult(tcc, sort = TRUE) head(result) @ We can now calculate the area under the \Rpackage{ROC} curve (i.e., AUC; $0 \le $AUC$ \le 1$) between the ranked gene list and the truth (i.e., DEGs or non-DEGs) and thereby evaluate the sensitivity and specificity simultaneously. A well-ranked gene list should have a high AUC value (i.e., high sensitivity and specificity). The \Rfunction{calcAUCValue} function calculates the AUC value based on the information stored in the \Rpackage{TCC} class object. <>= calcAUCValue(tcc) @ <>= auc.degesedger <- calcAUCValue(tcc) @ This is essentially the same as <>= AUC(rocdemo.sca(truth = as.numeric(tcc$simulation$trueDEG != 0), data = -tcc$stat$rank)) @ The following classic \Rpackage{edgeR} procedure (i.e., the TMM-\Rpackage{edgeR} combination) makes it clear that the DEGES-based normalization method (i.e., the DEGES/\Rpackage{edgeR} pipeline) outperforms the default normalization method (i.e., TMM) implemented in \Rpackage{edgeR}. <>= tcc <- calcNormFactors(tcc, norm.method = "tmm", iteration = 0) tcc <- estimateDE(tcc, test.method = "edger", FDR = 0.1) calcAUCValue(tcc) @ The following is an alternative procedure for \Rpackage{edgeR} users.
<>= d <- DGEList(counts = tcc$count, group = tcc$group$group) d <- calcNormFactors(d) d$samples$norm.factors <- d$samples$norm.factors / mean(d$samples$norm.factors) d <- estimateCommonDisp(d) d <- estimateTagwiseDisp(d) result <- exactTest(d) result$table$PValue[is.na(result$table$PValue)] <- 1 AUC(rocdemo.sca(truth = as.numeric(tcc$simulation$trueDEG != 0), data = -rank(result$table$PValue))) @ <>= set.seed(1000) samplesize <- 100 tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "bayseq", iteration = 1, samplesize = samplesize) tcc <- estimateDE(tcc, test.method = "edger", FDR = 0.1) auc.degestbt <- calcAUCValue(tcc) @ As can be expected from the similarity of the normalization factors of DEGES/TbT (\ref{section-3-1-1}) and DEGES/\Rpackage{edgeR} (\ref{section-3-1-2}), the AUC value ($\Sexpr{sprintf("%.7f", auc.degesedger)}$) of DEGES/\Rpackage{edgeR} is quite similar to the AUC value ($\Sexpr{sprintf("%.7f", auc.degestbt)}$) of the original TbT method (i.e., DEGES/TbT): <>= set.seed(1000) samplesize <- 100 tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "bayseq", iteration = 1, samplesize = samplesize) tcc <- estimateDE(tcc, test.method = "edger", FDR = 0.1) calcAUCValue(tcc) @ \subsection{Two-group data without replicates} \label{section-5-2} Let us generate tag count data without replicates, such as those used in section \ref{section-3-2}. For simplicity, we first generate simulation data whose conditions are essentially the same as those in the previous section (i.e., \ref{section-5-1}), except for the number of replicates in each group: (i) the number of genes is $1,000$ (i.e., \Robject{Ngene = 1000}), (ii) the first $20\%$ of genes are DEGs (\Robject{PDEG = 0.2}), (iii) the first $90\%$ of the DEGs are up-regulated in G1, and the remaining $10\%$ are up-regulated in G2 (\Robject{DEG.assign = c(0.9, 0.1)}), (iv) the levels of DE are four-fold in both groups (\Robject{DEG.foldchange = c(4, 4)}), and (v) there are a total of two samples (one from G1 and the other from G2) (\Robject{replicates = c(1, 1)}). <>= set.seed(1000) library(TCC) tcc <- simulateReadCounts(Ngene = 1000, PDEG = 0.2, DEG.assign = c(0.9, 0.1), DEG.foldchange = c(4, 4), replicates = c(1, 1)) dim(tcc$count) head(tcc$count) tcc$group @ Now let us see how the DEGES/\Rpackage{DESeq}-\Rpackage{DESeq} combination performs compared with the original \Rpackage{DESeq}-\Rpackage{DESeq} combination. First, we calculate the AUC value for the ranked gene list obtained from the DEGES/\Rpackage{DESeq}-\Rpackage{DESeq} combination. <>= tcc <- calcNormFactors(tcc, norm.method = "deseq", test.method = "deseq", iteration = 1, FDR = 0.1, floorPDEG = 0.05) tcc <- estimateDE(tcc, test.method = "deseq") calcAUCValue(tcc) @ Next, we calculate the corresponding value using the original \Rpackage{DESeq} procedure (i.e., the \Rpackage{DESeq}-\Rpackage{DESeq} combination). <>= tcc <- calcNormFactors(tcc, norm.method = "deseq", iteration = 0) tcc <- estimateDE(tcc, test.method = "deseq") calcAUCValue(tcc) @ It can be seen that the DEGES/\Rpackage{DESeq}-\Rpackage{DESeq} combination outperforms the original procedure under the given simulation conditions. The following is an alternative approach for \Rpackage{DESeq} users.
<>= cds <- newCountDataSet(tcc$count, tcc$group$group) cds <- estimateSizeFactors(cds) norm.factors <- sizeFactors(cds) / colSums(tcc$count) norm.factors <- norm.factors / mean(norm.factors) sizeFactors(cds) <- colSums(tcc$count) * norm.factors cds <- estimateDispersions(cds, method="blind", sharingMode="fit-only") result <- nbinomTest(cds, 1, 2) result$pval[is.na(result$pval)] <- 1 AUC(rocdemo.sca(truth = as.numeric(tcc$simulation$trueDEG != 0), data = -rank(result$pval))) @ This procedure is the same as the one used in \Rpackage{TCC}, which derives normalization factors comparable to those of \Rpackage{edgeR} regardless of the package used. However, the following commands from the \Rpackage{DESeq} manual are of practical value because they give approximately the same AUC value as above. <>= cds <- newCountDataSet(tcc$count, tcc$group$group) cds <- estimateSizeFactors(cds) cds <- estimateDispersions(cds, method="blind", sharingMode="fit-only") result <- nbinomTest(cds, 1, 2) result$pval[is.na(result$pval)] <- 1 AUC(rocdemo.sca(truth = as.numeric(tcc$simulation$trueDEG != 0), data = -rank(result$pval))) @ \subsection{Multi-group data with and without replicates} \label{section-5-3} The \Rfunction{simulateReadCounts} function can generate simulation data with a more complex design. First, we generate a dataset consisting of three groups. The simulation conditions for this dataset are as follows: (i) the number of genes is $1,000$ (i.e., \Robject{Ngene = 1000}), (ii) the first $30\%$ of genes are DEGs (\Robject{PDEG = 0.3}), (iii) the breakdowns of the up-regulated DEGs are respectively $70\%$, $20\%$, and $10\%$ in Groups 1-3 (\Robject{DEG.assign = c(0.7, 0.2, 0.1)}), (iv) the levels of DE are $3$-, $10$-, and $6$-fold in individual groups (\Robject{DEG.foldchange = c(3, 10, 6)}), and (v) there are a total of nine libraries (2, 4, and 3 replicates for Groups 1-3) (\Robject{replicates = c(2, 4, 3)}). <>= set.seed(1000) library(TCC) tcc <- simulateReadCounts(Ngene = 1000, PDEG = 0.3, DEG.assign = c(0.7, 0.2, 0.1), DEG.foldchange = c(3, 10, 6), replicates = c(2, 4, 3)) dim(tcc$count) tcc$group head(tcc$count) @ The pseudo-color image for the generated simulation data regarding the DEGs can be obtained from the \Rfunction{plotFCPseudocolor} function. The right bar (from white to magenta) indicates the degree of fold-change (FC). As expected, it can be seen that the first $210$, $60$, and $30$ genes are up-regulated in G1, G2, and G3, respectively. <>= plotFCPseudocolor(tcc) @ Now let us see how the DEGES/\Rpackage{edgeR}-\Rpackage{edgeR} combination performs compared with the original \Rpackage{edgeR}-\Rpackage{edgeR} combination. First, we calculate the AUC value for the ranked gene list obtained from the DEGES/\Rpackage{edgeR}-\Rpackage{edgeR} combination. <>= tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger", iteration = 1) tcc <- estimateDE(tcc, test.method = "edger", FDR = 0.1) calcAUCValue(tcc) @ Next, we calculate the corresponding value using the original \Rpackage{edgeR} procedure for a single-factor experimental design (i.e., the \Rpackage{edgeR}-\Rpackage{edgeR} combination). <>= tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger", iteration = 0) tcc <- estimateDE(tcc, test.method = "edger", FDR = 0.1) calcAUCValue(tcc) @ It can be seen that the DEGES/\Rpackage{edgeR}-\Rpackage{edgeR} combination outperforms the original \Rpackage{edgeR} procedure under the given simulation conditions.
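When several such pipelines are to be compared on the same simulation data, it may be convenient to wrap the normalization, DE analysis, and AUC calculation into one small helper. The sketch below is not part of \Rpackage{TCC}; the function name \Rfunction{calcAUCForPipeline} and the fixed \Robject{FDR = 0.1} threshold are our own choices for illustration.
\begin{Schunk}
\begin{Sinput}
> calcAUCForPipeline <- function(tcc, norm.method, test.method, iteration) {
+     # DEGES-based normalization when iteration >= 1, default normalization
+     # of the selected package when iteration = 0
+     tcc <- calcNormFactors(tcc, norm.method = norm.method,
+                            test.method = test.method, iteration = iteration)
+     # DE analysis followed by AUC calculation against the known truth
+     tcc <- estimateDE(tcc, test.method = test.method, FDR = 0.1)
+     calcAUCValue(tcc)
+ }
> calcAUCForPipeline(tcc, "tmm", "edger", iteration = 1)   # DEGES/edgeR-edgeR
> calcAUCForPipeline(tcc, "tmm", "edger", iteration = 0)   # original edgeR-edgeR
\end{Sinput}
\end{Schunk}
The two calls correspond to the DEGES/\Rpackage{edgeR}-\Rpackage{edgeR} and \Rpackage{edgeR}-\Rpackage{edgeR} procedures shown above and should reproduce the AUC values obtained there.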
Note that the \Robject{test.method} argument will be ignored when \Robject{iteration = 0} is specified. Next, let us generate another dataset consisting of a total of eight groups. The simulation conditions for this dataset are as follows: (i) the number of genes is $10,000$ (i.e., \Robject{Ngene = 10000}), (ii) the first $34\%$ of genes are DEGs (\Robject{PDEG = 0.34}), (iii) the breakdowns of the up-regulated DEGs are respectively $10\%$, $30\%$, $5\%$, $10\%$, $5\%$, $21\%$, $9\%$, and $10\%$ in Groups 1-8 (\Robject{DEG.assign = c(0.1, 0.3, 0.05, 0.1, 0.05, 0.21, 0.09, 0.1)}), (iv) the levels of DE are $3.1$-, $13$-, $2$-, $1.5$-, $9$-, $5.6$-, $4$-, and $2$-fold in individual groups (\Robject{DEG.foldchange = c(3.1, 13, 2, 1.5, 9, 5.6, 4, 2)}), and (v) there are a total of nine libraries (except for G3, none of the groups have replicates) (\Robject{replicates = c(1, 1, 2, 1, 1, 1, 1, 1)}). <>= set.seed(1000) library(TCC) tcc <- simulateReadCounts(Ngene = 10000, PDEG = 0.34, DEG.assign = c(0.1, 0.3, 0.05, 0.1, 0.05, 0.21, 0.09, 0.1), DEG.foldchange = c(3.1, 13, 2, 1.5, 9, 5.6, 4, 2), replicates = c(1, 1, 2, 1, 1, 1, 1, 1)) dim(tcc$count) tcc$group head(tcc$count) plotFCPseudocolor(tcc) @ This kind of simulation data may be useful for evaluating methods aimed at identifying tissue-specific (or tissue-selective) genes. \subsection{Multi-factor data} \label{section-5-3-p1} The \Rfunction{simulateReadCounts} function can also generate simulation data under a multi-factor experimental design. Unlike the single-factor experimental designs above, the \Robject{group} argument should be used instead of \Robject{replicates} to specify the sample conditions (or factors) when generating simulation data with a multi-factor design. In relation to the \Robject{group} specification, the \Robject{DEG.foldchange} argument should also be specified as a data frame object. We generate a dataset consisting of two factors for comparing (i) two groups (i.e., "WT" vs. "KO") as the first factor, at (ii) two time points (i.e., "1d" vs. "2d") as the second factor, with all samples obtained from independent subjects. There are a total of four conditions ("WT\_1d", "WT\_2d", "KO\_1d", and "KO\_2d"), each of which has two biological replicates, comprising a total of eight samples. The \Robject{group} argument for this experimental design can be described as follows: <>= group <- data.frame( GROUP = c("WT", "WT", "WT", "WT", "KO", "KO", "KO", "KO"), TIME = c("1d", "1d", "2d", "2d", "1d", "1d", "2d", "2d") ) @ Next, we specify the number of DEG types and their fold-change levels with the \Robject{DEG.foldchange} argument. We here introduce three types of DEGs: (a) 2-fold up-regulation in the first four samples (i.e., "WT"), (b) 3-fold up-regulation in the last four samples (i.e., "KO"), and (c) 2-fold down-regulation at "2d" in "WT" and 4-fold up-regulation at "2d" in "KO". This implies that the first two types of DEGs are related to the first factor (i.e., "WT" vs. "KO") and the third type of DEG is related to the second factor (i.e., "1d" vs. "2d"). <>= DEG.foldchange <- data.frame( FACTOR1.1 = c(2, 2, 2, 2, 1, 1, 1, 1), FACTOR1.2 = c(1, 1, 1, 1, 3, 3, 3, 3), FACTOR2 = c(1, 1, 0.5, 0.5, 1, 1, 4, 4) ) @ The other simulation conditions for this dataset are as follows: (1) the number of genes is 10,000 (i.e., \Robject{Ngene = 10000}), (2) the first 20\% of genes are DEGs (i.e., \Robject{PDEG = 0.2}), and (3) the breakdowns of the three types of DEGs are 50\%, 20\%, and 30\% (i.e., \Robject{DEG.assign = c(0.5, 0.2, 0.3)}).
<>= set.seed(1000) tcc <- simulateReadCounts(Ngene = 10000, PDEG = 0.2, DEG.assign = c(0.5, 0.2, 0.3), DEG.foldchange = DEG.foldchange, group = group) @ Since the first six rows in the dataset correspond to the first type of DEGs, we can see the 2-fold up-regulation in the first four columns (i.e., WT-related samples) compared to the last four columns (i.e., KO-related samples). <>= head(tcc$count) tcc$group plotFCPseudocolor(tcc) @ \subsection{Other utilities} \label{section-5-4} Recall that the simulation framework can handle different levels of DE for DEGs in individual groups, and the shape of the distribution for these DEGs is the same as that of non-DEGs. Let us confirm those distributions by introducing more drastic simulation conditions for comparing two groups (G1 vs. G2) with biological replicates; i.e., (i) the number of genes is $20,000$ (i.e., \Robject{Ngene = 20000}), (ii) the first $30\%$ of genes are DEGs (\Robject{PDEG = 0.30}), (iii) the first $85\%$ of the DEGs are up-regulated in G1 and the remaining $15\%$ are up-regulated in G2 (\Robject{DEG.assign = c(0.85, 0.15)}), (iv) the levels of DE are eight-fold in G1 and sixteen-fold in G2 (\Robject{DEG.foldchange = c(8, 16)}), and (v) there are a total of four samples (two biological replicates for G1 and two biological replicates for G2) (\Robject{replicates = c(2, 2)}). <>= set.seed(1000) library(TCC) tcc <- simulateReadCounts(Ngene = 20000, PDEG = 0.30, DEG.assign = c(0.85, 0.15), DEG.foldchange = c(8, 16), replicates = c(2, 2)) head(tcc$count) @ An M-A plot for the simulation data can be viewed as follows; the points for up-regulated DEGs in G1 and G2 are colored blue and red, respectively. The non-DEGs are in black: <>= plot(tcc) @ This plot is generated from simulation data that has been scaled in such a way that the library sizes of each sample are the same as the mean library size of the original data. That is, <>= normalized.count <- getNormalizedData(tcc) colSums(normalized.count) colSums(tcc$count) mean(colSums(tcc$count)) @ <>= xy <- plot(tcc) isnot.na <- as.logical(xy[, 1] != min(xy[, 1])) median.G1 <- median(xy[(tcc$simulation$trueDEG == 1) & isnot.na, 2]) median.G2 <- median(xy[(tcc$simulation$trueDEG == 2) & isnot.na, 2]) median.nonDEG <- median(xy[(tcc$simulation$trueDEG == 0) & isnot.na, 2]) @ The summary statistics for non-DEGs and up-regulated DEGs in G1 and G2 are upshifted compared with the original intentions of the user (i.e., respective M values of $0$, $-3$, and $4$ for non-DEGs and up-regulated DEGs in G1 and G2). Indeed, the median values, indicated as horizontal lines, are respectively $\Sexpr{sprintf("%.3f", median.nonDEG)}$, $\Sexpr{sprintf("%.3f", median.G1)}$, and $\Sexpr{sprintf("%.3f", median.G2)}$ for non-DEGs and up-regulated DEGs in G1 and G2. <>= plot(tcc, median.lines = TRUE) @ <>= tcc <- calcNormFactors(tcc, "tmm", "edger", iteration = 3, FDR = 0.1, floorPDEG = 0.05) xy <- plot(tcc) isnot.na <- as.logical(xy[, 1] != min(xy[, 1])) median.nonDEG <- median(xy[(tcc$simulation$trueDEG == 0) & isnot.na, 2]) @ These upshifted M values for non-DEGs can be corrected by performing the iDEGES/\Rpackage{edgeR} normalization, e.g., the median M value ($= \Sexpr{sprintf("%.3f", median.nonDEG)}$) for non-DEGs based on the iDEGES/\Rpackage{edgeR}-normalized data is nearly zero.
<>= tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger", iteration = 3, FDR = 0.1, floorPDEG = 0.05) plot(tcc, median.lines = TRUE) @ Comparing the median M values for non-DEGs obtained from different normalization methods might be another useful evaluation metric. \newpage \section{Session info} <>= sessionInfo() @ \newpage \section{References} \renewcommand{\refname}{} \renewcommand\refname{\vskip -1cm} \begin{thebibliography}{99} \bibitem{robinson} Robinson MD, McCarthy DJ, and Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 2010, 26(1): 139-140 \bibitem{dillies} Dillies MA, Rau A, Aubert J, Hennequet-Antier C, Jeanmougin M, Servant N, Keime C, Marot G, Castel D, Estelle J, Guernec G, Jagla B, Jouneau L, Lalo\"e D, Le Gall C, Scha\"effer B, Le Crom S, Guedj M, Jaffr\'ezic F; on behalf of The French StatOmique Consortium. A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Brief Bioinform, in press \bibitem{kadota} Kadota K, Nishiyama T, and Shimizu K. A normalization strategy for comparing tag count data. Algorithms Mol Biol. 2012, 7: 5 \bibitem{robinson2} Robinson MD and Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010, 11: R25 \bibitem{wad} Kadota K, Nakai Y, and Shimizu K. A weighted average difference method for detecting differentially expressed genes from microarray data. Algorithms Mol Biol. 2008, 3: 8 \bibitem{hardcastle} Hardcastle TJ and Kelly KA. baySeq: empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics 2010, 11: 422 \bibitem{aicentropy} Kadota K, Nishimura SI, Bono H, Nakamura S, Hayashizaki Y, Okazaki Y, and Takahashi K. Detection of genes with tissue-specific expression patterns using Akaike's Information Criterion (AIC) procedure. Physiol Genomics 2003, 12: 251-259 \bibitem{anders} Anders S and Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010, 11(10): R106 \bibitem{mccarthy} McCarthy DJ, Chen Y, and Smyth GK. Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Res. 2012, 40(10): 4288-4297 \bibitem{glaus} Glaus P, Honkela A, and Rattray M. Identifying differentially expressed transcripts from RNA-seq data with biological variation. Bioinformatics 2012, 28(13): 1721-1728 \bibitem{roku} Kadota K, Ye J, Nakai Y, Terada T, and Shimizu K. ROKU: a novel method for identification of tissue-specific genes. BMC Bioinformatics 2006, 7: 294 \bibitem{ueda} Ueda T. Simple method for the detection of outliers. Japanese J Appl Stat 1996, 25: 17-26 \bibitem{robinson3} Robinson MD and Smyth GK. Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics 2008, 9: 321-332 \bibitem{sun} Sun J, Nishiyama T, Shimizu K, and Kadota K. TCC: an R package for comparing tag count data with robust normalization strategies. BMC Bioinformatics 2013, 14: 219 \bibitem{di} Di Y, Schafer DW, Cumbie JS, and Chang JH. The NBP negative binomial model for assessing differential gene expression from RNA-Seq. Stat Appl Genet Mol Biol. 2011, 10: art24 \end{thebibliography} \end{document}