%\VignetteIndexEntry{TCC}
%\VignettePackage{TCC}
\documentclass[oneside,12pt]{article}
\usepackage[a4paper,left=2.2cm,top=2.6cm,bottom=2.3cm,right=2.2cm]{geometry}
\usepackage{cite}
\usepackage{color}
\usepackage{Sweave}
%%\usepackage{here}
\SweaveOpts{pdf=TRUE}
\definecolor{TccBlue}{cmyk}{0.74,0.26,0,0.14}
\definecolor{TccRed}{cmyk}{0,1,0.78,0.1}
\definecolor{TccGreen}{cmyk}{0.99,0,0.29,0.47}
\definecolor{TccOrange}{cmyk}{0,0.27,1,0.03}
\renewcommand{\floatpagefraction}{0.9}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rfunction}[1]{{\Robject{#1}}}
\newcommand{\Rpackage}[1]{\textbf{#1}}
\newcommand{\Rclass}[1]{{\texttt{#1}}}
\DefineVerbatimEnvironment{Sinput}{Verbatim}{fontshape=n,formatcom=\color{TccRed}, fontsize=\footnotesize}
\DefineVerbatimEnvironment{Soutput}{Verbatim}{fontshape=n,formatcom=\color{TccBlue}, fontsize=\footnotesize}
\fvset{baselinestretch=0.9}
\author{\small Jianqiang Sun$^{1\S}$, Tomoaki Nishiyama$^{2\S}$, Kentaro Shimizu$^1$, and Koji Kadota$^1$\\
\texttt{\footnotesize $^1$ The University of Tokyo, Tokyo, Japan}\\
\texttt{\footnotesize $^2$ Kanazawa University, Kanazawa, Japan}\\
\texttt{\footnotesize $^\S$ Maintainer: Jianqiang Sun (wukong@bi.a.u-tokyo.ac.jp),}\\
\texttt{\footnotesize Tomoaki Nishiyama (tomoakin@staff.kanazawa-u.ac.jp)}
}
\title{TCC: Differential expression analysis for tag count data with robust normalization strategies}
\begin{document}
\maketitle\thispagestyle{empty}

\begin{abstract}
The R/Bioconductor package \Rpackage{TCC} provides users with a robust and accurate framework to perform differential expression (DE) analysis of tag count data. We recently developed a multi-step normalization method (TbT; Kadota et al., 2012) for two-group RNA-seq data. The strategy (called DEGES) is to remove data that are potential differentially expressed genes (DEGs) before performing the data normalization. DEGES in \Rpackage{TCC} is essential for accurate normalization of tag count data, especially when the up- and down-regulated DEGs in one of the groups are extremely biased in their number. A major characteristic of \Rpackage{TCC} is that it provides DEGES-based normalization methods for several kinds of count data (two-group with or without replicates, multi-group, and so on) by combining functions from other sophisticated packages (especially \Rpackage{edgeR}, \Rpackage{DESeq}, and \Rpackage{baySeq}). The appropriate combinations provided by \Rpackage{TCC} allow a more robust and accurate estimation to be performed more easily than directly using the original packages, and \Rpackage{TCC} offers a simple unified interface for performing this robust normalization.
\end{abstract}
\newpage
\tableofcontents
\newpage

\section{Introduction}
Differential expression analysis based on tag count data has become a fundamental task for identifying differentially expressed genes or transcripts (DEGs). The \Rpackage{TCC} package (Tag Count Comparison; Sun et al., 2013) provides users with a robust and accurate framework to perform differential expression analysis of tag count data. \Rpackage{TCC} provides integrated analysis pipelines with improved data normalization steps, compared with other packages such as \Rpackage{edgeR}, \Rpackage{DESeq}, and \Rpackage{baySeq}, by appropriately combining their functionalities. The package incorporates multi-step normalization methods whose strategy is to remove data that are potential DEGs before performing the data normalization. Kadota et al.
(2012) recently reported that the normalization methods implemented in R packages (such as \Rpackage{edgeR} (Robinson et al., 2010), \Rpackage{DESeq} (Anders and Huber, 2010), and \Rpackage{baySeq} (Hardcastle and Kelly, 2010)) for differential expression (DE) analysis between samples are inadequate when the up- and down-regulated DEGs in one of the samples are extremely biased in their number (i.e., biased DE). This is because the current methods implicitly assume a balanced DE, wherein the numbers of highly and lowly expressed DE entities in samples are (nearly) equal. As a result, methods assuming unbiased DE will not work well on data with biased DE. Although a major purpose of data normalization is to detect such DE entities, their very existence interferes with their chance of being top-ranked. Conventional two-step procedures for identifying DEGs from tag count data (i.e., data normalization followed by identification of DEGs) cannot, in principle, eliminate potential DE entities before the data normalization. To normalize data under various possible scenarios (including both unbiased and biased DE), we recently proposed a multi-step normalization strategy (called TbT, an acronym for the TMM-baySeq-TMM pipeline; Kadota et al., 2012), in which the TMM normalization method (Robinson and Oshlack, 2010) is used in steps 1 and 3 and an empirical Bayesian method implemented in the \Rpackage{baySeq} package (Hardcastle and Kelly, 2010) is used in step 2. Although this multi-step DEG elimination strategy (called ``DEGES'' for short) can successfully remove potential DE entities identified in step 2 prior to the estimation of the normalization factors using the TMM normalization method in step 3, the \Rpackage{baySeq} package used in step 2 of the TbT method is much more computationally intensive than competing packages like \Rpackage{edgeR} and \Rpackage{DESeq}. While the three-step TbT normalization method performed best on simulated and real tag count data, it is practically possible to make different choices for the methods in each step. A more comprehensive study of better choices for DEGES is needed.

This package provides tools to perform multi-step normalization methods based on DEGES and enables differential expression analysis of tag count data without having to worry much about biased distributions of DEGs. The DEGES-based normalization functions implemented in \Rpackage{TCC} include the TbT method for two-group data with or without replicates, much faster alternatives, and methods for multi-group comparisons. \Rpackage{TCC} provides a simple unified interface to perform data normalization with combinations of functions provided by \Rpackage{baySeq}, \Rpackage{DESeq}, and \Rpackage{edgeR}. Functions to produce simulation data under various conditions and to plot the data are also provided.

\subsection{Installation}
\label{section-1-1}
This package is available from the Bioconductor website (http://bioconductor.org/). To install the package, enter the following command after starting R:
\begin{Schunk}
\begin{Sinput}
> source("http://bioconductor.org/biocLite.R")
> biocLite("TCC")
\end{Sinput}
\end{Schunk}

\subsection{Citations}
\label{section-1-2}
This package internally uses many of the functions implemented in other packages. This is because our normalization procedures consist, in part, of combinations of existing normalization methods and differential expression (DE) methods.
For example, the TbT normalization method (Kadota et al., 2012), which is a particular functionality of the \Rpackage{TCC} package (Sun et al., 2013), consists of the TMM normalization method (Robinson and Oshlack, 2010) implemented in the \Rpackage{edgeR} package (Robinson et al., 2010) and the empirical Bayesian method implemented in the \Rpackage{baySeq} package (Hardcastle and Kelly, 2010). Therefore, please cite the appropriate references when you publish your results.
\begin{Schunk}
\begin{Sinput}
> citation("TCC")
\end{Sinput}
\end{Schunk}

\subsection{Quick start}
\label{section-1-3}
Let us begin by showing an example (Case 1) of identifying DEGs between two groups from tag count data consisting of $1,000$ genes and a total of six samples (each group has three biological replicates). The hypothetical count data (termed ``\Robject{hypoData}'') is stored in this package (for details, see section \ref{section-2-1}). We then describe the DE analysis of count data without replicates (i.e., two samples), using the data in the first and fourth columns of \Robject{hypoData} (Case 2).

\textbf{Case 1}: DE analysis for two-group count data with replicates by using the exact test coupled with iterative DEGES/edgeR normalization (i.e., the iDEGES/edgeR-edgeR combination). This pipeline is designed to reduce the runtime (approx. $20$ sec.), yet its performance is comparable to that of the original TbT-based pipeline. Accordingly, we recommend using this pipeline as a default when analyzing tag count data with replicates. A notable advantage of this pipeline is that the multi-step normalization strategy only needs the methods implemented in the \Rpackage{edgeR} package. The suggested citations are as follows: \Rpackage{TCC} (Sun et al., 2013), TMM (Robinson and Oshlack, 2010), the exact test (Robinson and Smyth, 2008), and \Rpackage{edgeR} (Robinson et al., 2010). For details, see sections \ref{section-3-1-3} and \ref{section-4-1-1}.
<<3408021901>>=
library(TCC)
data(hypoData)
group <- c(1, 1, 1, 2, 2, 2)
tcc <- new("TCC", hypoData, group)
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger",
                       iteration = 3, FDR = 0.1, floorPDEG = 0.05)
tcc <- estimateDE(tcc, test.method = "edger", FDR = 0.1)
result <- getResult(tcc, sort = TRUE)
head(result)
@

\textbf{Case 2}: DE analysis for two-group count data without replicates by using the negative binomial (NB) test in \Rpackage{DESeq} coupled with iDEGES/DESeq normalization (i.e., the iDEGES/DESeq-DESeq combination). A procedure using the data in the first and fourth columns of \Robject{hypoData} is shown here. This pipeline entirely consists of methods implemented in the \Rpackage{DESeq} package. Suggested citations are as follows: \Rpackage{TCC} (Sun et al., 2013) and \Rpackage{DESeq} (Anders and Huber, 2010). For details, see section \ref{section-3-2}.
<<4301933845>>=
library(TCC)
data(hypoData)
group <- c(1, 2)
tcc <- new("TCC", hypoData[, c(1, 4)], group)
tcc <- calcNormFactors(tcc, norm.method = "deseq", test.method = "deseq",
                       iteration = 3, FDR = 0.1, floorPDEG = 0.05)
tcc <- estimateDE(tcc, test.method = "deseq", FDR = 0.1)
result <- getResult(tcc, sort = TRUE)
head(result)
@
\newpage

\section{Preparations}
\subsection{Preparing the count data}
\label{section-2-1}
Similar to the other packages, \Rpackage{TCC} typically starts the DE analysis with a count table matrix, where each row indicates a gene (or transcript), each column indicates a sample (or library), and each cell indicates the number of counts for a gene in a sample.
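For illustration, a count matrix of the expected shape can be assembled by hand; the following sketch uses made-up integer counts and hypothetical gene and sample names, and shows the structure that \Rpackage{TCC} expects as input:
\begin{Schunk}
\begin{Sinput}
> # A toy count matrix: genes in rows, samples in columns,
> # raw (unnormalized) integer counts in cells.
> library(TCC)
> toy <- matrix(c( 34,  41,  29,  38,  55,  47,
+                   0,   2,   1,   0,   1,   0,
+                 512, 498, 601, 222, 210, 240),
+               ncol = 6, byrow = TRUE,
+               dimnames = list(c("gene_1", "gene_2", "gene_3"),
+                               c("G1_rep1", "G1_rep2", "G1_rep3",
+                                 "G2_rep1", "G2_rep2", "G2_rep3")))
> tcc <- new("TCC", toy, c(1, 1, 1, 2, 2, 2))
\end{Sinput}
\end{Schunk}
A matrix like this (typically with thousands of rows) is what the \Rfunction{new} function described in section \ref{section-2-2} takes as input.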
There are many ways to obtain the count data from aligned read files such as BAM (Binary Alignment/Map). These include the \Rfunction{islandCounts} function in \Rpackage{htSeqTools} (Planet et al., 2012), \Rfunction{summarizeOverlaps} in \Rpackage{GenomicRanges} (Lawrence et al., 2013), \Rfunction{qCount} in \Rpackage{QuasR}, and so on. For Windows end users, we recommend using the \Rpackage{QuasR} package. It provides a comprehensive workflow for many kinds of NGS data, including ChIP-seq, RNA-seq, smallRNA-seq, and BS-seq. Its main functionalities are: (1) preprocessing raw sequence reads, (2) aligning reads to the reference genome or transcriptome using \Rpackage{Rbowtie}, and (3) producing a count matrix from the aligned reads. \Rpackage{TCC} uses the raw count data (a matrix of integer values) as input. As is also clearly stated for \Rpackage{DESeq2}, the raw count data correspond to a matrix of integer values. Please DO NOT use normalized counts such as RPM (reads per million), CPM (counts per million), or RPKM.

\subsection{Reading the count data}
\label{ss_reading_the_count_data}
Here, we assume a hypothetical count matrix consisting of 1,000 rows (or genes) and a total of six columns: the first three columns are produced from biological replicates of Group 1 (G1\_rep1, G1\_rep2, and G1\_rep3) and the remaining columns are from Group 2 (G2\_rep1, G2\_rep2, and G2\_rep3). We start by loading the hypothetical data (\Robject{hypoData}) from \Rpackage{TCC} and giving a numeric vector (\Robject{group}) indicating which group each sample belongs to.
<<4378919285>>=
library(TCC)
data(hypoData)
head(hypoData)
dim(hypoData)
group <- c(1, 1, 1, 2, 2, 2)
@
If you want to analyze another count matrix consisting of nine columns (e.g., the first four columns are produced from biological replicates of G1, and the remaining five columns are from G2), the \Robject{group} vector should be indicated as follows.
<<1209835923>>=
group <- c(1, 1, 1, 1, 2, 2, 2, 2, 2)
@

\subsection{Constructing TCC class object}
\label{section-2-2}
The \Rfunction{new} function has to be used to perform the main functionalities of \Rpackage{TCC}. This function constructs a \Rclass{TCC} class object, and subsequent analyses are performed on this class object. The object is constructed from (i) a count matrix (\Robject{hypoData}) and (ii) the corresponding numeric vector (\Robject{group}) as follows.
<<4389570100>>=
library(TCC)
data(hypoData)
group <- c(1, 1, 1, 2, 2, 2)
tcc <- new("TCC", hypoData, group)
tcc
@
The count matrix and group vector information can be retrieved from the stored class object by using \Robject{tcc\$count} and \Robject{tcc\$group}, respectively.
<<5048219812>>=
head(tcc$count)
tcc$group
@

\subsection{Filtering low-count genes (optional)}
\label{section-2-3}
The way to filter out genes with low-count tags across samples depends on the user's philosophy. Although we recommend removing tags with zero counts across samples as a minimum filtering, this step is optional. The \Rfunction{filterLowCountGenes} function performs this filtering.
<<4085249378>>=
library(TCC)
data(hypoData)
group <- c(1, 1, 1, 2, 2, 2)
tcc <- new("TCC", hypoData, group)
tcc <- filterLowCountGenes(tcc)
dim(tcc$count)
@
It can be seen that $\Sexpr{nrow(hypoData) - nrow(tcc$count)} (= \Sexpr{nrow(hypoData)} - \Sexpr{nrow(tcc$count)})$ genes were filtered out as non-expressed.
The same procedure can be performed without the \Rfunction{filterLowCountGenes} function, in which case the filtering is performed before the \Rclass{TCC} class object is constructed.
<<>>=
filter <- as.logical(rowSums(hypoData) > 0)
dim(hypoData[filter, ])
tcc <- new("TCC", hypoData[filter, ], group)
dim(tcc$count)
@
\newpage

\section{Normalization}
\label{s_normalization}
This chapter describes the details of our robust normalization strategy (i.e., DEGES) implemented in \Rpackage{TCC}. Alternative DEGES procedures consisting of functions in other packages (\Rpackage{edgeR}, \Rpackage{DESeq}, and \Rpackage{baySeq}) are also provided, so that {\it advanced} users familiar with the existing packages can easily see what is done in \Rpackage{TCC}. Note that {\it end} users, who just want to perform robust differential expression analysis using \Rpackage{TCC}, can skip this chapter (\ref{s_normalization} Normalization) and move on, for example, to the next chapter (\ref{s_differential_expression} Differential expression). Note also that the purpose here is to obtain accurate normalization factors to be used with statistical models (e.g., the exact test or empirical Bayes) for the DE analysis described in the next section (\ref{s_differential_expression} Differential expression).

\Rpackage{TCC} can handle various kinds of experimental designs. The following tables show some examples of individual designs; please refer to the appropriate section for your specific experimental design.

\begin{table}[h]
\begin{center}
\caption{\ref{ss_norm_twogroup_reps} unpaired samples (two-group)}
\begin{tabular}{|ccc|ccc|}
\hline
Label & Example 1 & Example 2 & Label & Example 3 & Example 4 \\
\hline
G1\_rep1 & G1(mouse\_A) & Tumor(patient\_A) & G1\_rep1 & G1(mouse\_A) & Tumor(patient\_A) \\
G1\_rep2 & G1(mouse\_B) & Tumor(patient\_B) & G1\_rep2 & G1(mouse\_B) & Tumor(patient\_B) \\
G1\_rep3 & G1(mouse\_C) & Tumor(patient\_C) & G2\_rep1 & G2(mouse\_D) & Normal(patient\_G) \\
G2\_rep1 & G2(mouse\_D) & Normal(patient\_G) & G2\_rep2 & G2(mouse\_E) & Normal(patient\_H) \\
G2\_rep2 & G2(mouse\_E) & Normal(patient\_H) & & & \\
G2\_rep3 & G2(mouse\_F) & Normal(patient\_K) & & & \\
\hline
\end{tabular}
\end{center}
\end{table}

\begin{table}[h]
\begin{center}
\caption{\ref{ss_norm_twogroup_without_reps} paired samples (two-group)}
\begin{tabular}{|ccc|ccc|}
\hline
Label & Example 1 & Example 2 & Label & Example 3 & Example 4 \\
\hline
G1\_rep1 & G1(mouse\_A) & Tumor(patient\_G) & G1\_rep1 & G1(mouse\_B) & Tumor(patient\_J) \\
G1\_rep2 & G1(mouse\_B) & Tumor(patient\_H) & G1\_rep2 & G1(mouse\_C) & Tumor(patient\_K) \\
G1\_rep3 & G1(mouse\_C) & Tumor(patient\_K) & G2\_rep1 & G2(mouse\_B) & Normal(patient\_J) \\
G2\_rep1 & G2(mouse\_A) & Normal(patient\_G) & G2\_rep2 & G2(mouse\_C) & Normal(patient\_K) \\
G2\_rep2 & G2(mouse\_B) & Normal(patient\_H) & & & \\
G2\_rep3 & G2(mouse\_C) & Normal(patient\_K) & & & \\
\hline
\end{tabular}
\end{center}
\end{table}
\newpage

\begin{table}[h]
\begin{center}
\caption{\ref{ss_norm_multifactor_reps} unpaired samples (multi-group)}
\begin{tabular}{|ccc|ccc|}
\hline
Label & Example 1 & Example 2 & Label & Example 3 & Example 4 \\
\hline
G1\_rep1 & G1(mouse\_A) & Kidney(sample\_A) & G1\_rep1 & G1(mouse\_A) & Liver(sample\_G) \\
G1\_rep2 & G1(mouse\_B) & Kidney(sample\_B) & G1\_rep2 & G1(mouse\_B) & Liver(sample\_H) \\
G1\_rep3 & G1(mouse\_C) & Kidney(sample\_C) & G2\_rep1 & G2(mouse\_D) & Brain(sample\_Y) \\
G2\_rep1 & G2(mouse\_D) & Liver(sample\_G) & G2\_rep2 & G2(mouse\_E) & Brain(sample\_Z) \\
G2\_rep2 & G2(mouse\_E) & Liver(sample\_H) & G3\_rep1 & G3(mouse\_U) & Kidney(sample\_B) \\
G2\_rep3 & G2(mouse\_F) & Liver(sample\_K) & G3\_rep2 & G3(mouse\_T) & Kidney(sample\_C) \\
G3\_rep1 & G3(mouse\_G) & Brain(sample\_X) & & & \\
G3\_rep2 & G3(mouse\_H) & Brain(sample\_Y) & & & \\
G3\_rep3 & G3(mouse\_I) & Brain(sample\_Z) & & & \\
\hline
\end{tabular}
\end{center}
\end{table}
\subsection{Normalization of two-group count data with replicates}
\label{ss_norm_twogroup_reps}
This package provides robust normalization methods based on DEGES proposed by Kadota et al. (2012). When obtaining normalization factors from two-group data with replicates, users can select a total of six combinations (two normalization methods $\times$ three DEG identification methods) coupled with an arbitrary number of iterations ($n = 0, 1, 2, \dots, 100$) in our DEGES-based normalization pipeline. We show some of the practical combinations below. Since the three-step TbT normalization method was originally designed for normalizing tag count data with (biological) replicates, we will first explain the TbT method (\ref{section-3-1-1} DEGES/TbT). In relation to the other DEGES-based methods, we will call this method ``DEGES/TbT'' for convenience. As mentioned in the original study, DEGES/TbT needs a long computation time. Accordingly, we present three faster alternatives (\ref{section-3-1-2} DEGES/edgeR, \ref{section-3-1-3} iDEGES/edgeR, and \ref{section-3-1-4} DEGES/DESeq).

\subsubsection{DEGES/TbT}
\label{section-3-1-1}
The DEGES/TbT method (Kadota et al., 2012) with default parameter settings can be performed as follows.
<<>>=
set.seed(1000)
library(TCC)
data(hypoData)
samplesize <- 100
group <- c(1, 1, 1, 2, 2, 2)
tcc <- new("TCC", hypoData, group)
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "bayseq",
                       iteration = 1, samplesize = samplesize)
@
Note that a smaller sampling size (i.e., \Robject{samplesize = 100}) is used here to reduce the computation time when performing the empirical Bayesian method in step 2, but a larger sampling size of around $10,000$ (i.e., \Robject{samplesize = 10000}) is recommended (Hardcastle and Kelly, 2010). This method estimates an empirical distribution of the parameters of the NB distribution by bootstrapping from the input data. While the sampling size can be made smaller to reduce the computation time (e.g., \Robject{samplesize = 40}), the resulting normalization factors will vary from trial to trial. In this vignette, we call the \Rfunction{set.seed} function to obtain reproducible results (i.e., the \Robject{tcc\$norm.factors} values) whenever a random function is used. The calculated normalization factors and the computation time can be retrieved with the following commands.
<<>>=
tcc$norm.factors
tcc$DEGES$execution.time
@
Of course, the procedure can be performed by using functions in \Rpackage{edgeR} and \Rpackage{baySeq}, instead of using the \Rfunction{calcNormFactors} function in \Rpackage{TCC}. The \Rfunction{calcNormFactors} function together with the above parameter settings can be regarded as a wrapper function for the following commands.
<<>>=
set.seed(1000)
library(TCC)
data(hypoData)
samplesize <- 100
group <- c(1, 1, 1, 2, 2, 2)
floorPDEG <- 0.05
FDR <- 0.1
### STEP 1 ###
d <- DGEList(counts = hypoData, group = group)
d <- calcNormFactors(d)
norm.factors <- d$samples$norm.factors
norm.factors <- norm.factors / mean(norm.factors)
### STEP 2 ###
cD <- new("countData", data = hypoData, replicates = group,
          groups = list(NDE = rep(1, length = length(group)), DE = group),
          libsizes = colSums(hypoData) * norm.factors)
cD <- getPriors.NB(cD, samplesize = samplesize, estimation = "QL", cl = NULL)
cD <- getLikelihoods(cD, pET = "BIC", cl = NULL)
result <- topCounts(cD, group = "DE", number = nrow(hypoData))
result <- result[order(result$rowID), ]
pval <- result$FWER.DE
qval <- result$FDR.DE
if (sum(qval < FDR) > (floorPDEG * nrow(hypoData))) {
  is.DEG <- as.logical(qval < FDR)
} else {
  is.DEG <- as.logical(rank(pval, ties.method = "min") <=
                       nrow(hypoData) * floorPDEG)
}
### STEP 3 ###
d <- DGEList(counts = hypoData[!is.DEG, ], group = group)
d <- calcNormFactors(d)
norm.factors <- d$samples$norm.factors * colSums(hypoData[!is.DEG, ]) /
                colSums(hypoData)
norm.factors <- norm.factors / mean(norm.factors)
norm.factors
@

\subsubsection{DEGES/edgeR}
\label{section-3-1-2}
Now let us describe an alternative approach that is roughly $200$-$400$ times faster than DEGES/TbT, yet has comparable performance. The TMM-edgeR-TMM pipeline (called DEGES/edgeR) employs the exact test implemented in \Rpackage{edgeR} in step 2. To use this pipeline, we have to provide a reasonable threshold for defining potential DEGs in step 2. We define the threshold as an arbitrary false discovery rate (FDR) with a floor value of $P_{\mbox{\tiny DEG}}$. The default FDR threshold is $0.1$ and the default floor $P_{\mbox{\tiny DEG}}$ is $5\%$, but different choices are of course possible. For example, under the default settings, if the percentage of genes satisfying FDR $< 0.1$ is $x\%$ with $x > 5$, those top-ranked $x\%$ of genes are eliminated as potential DEGs in step 2; otherwise, the top-ranked $5\%$ of genes (ranked by $p$-value) are eliminated. The DEGES/edgeR pipeline has a clear advantage over TbT in computation time. It can be performed as follows:
<<>>=
library(TCC)
data(hypoData)
group <- c(1, 1, 1, 2, 2, 2)
tcc <- new("TCC", hypoData, group)
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger",
                       iteration = 1, FDR = 0.1, floorPDEG = 0.05)
tcc$norm.factors
tcc$DEGES$execution.time
@
The normalization factors calculated from DEGES/edgeR are very similar to those of DEGES/TbT with the default parameter settings (i.e., \Robject{samplesize = 10000}). For \Rpackage{edgeR} users, we provide commands, consisting of functions in \Rpackage{edgeR}, to perform the DEGES/edgeR pipeline without \Rpackage{TCC}. The \Rfunction{calcNormFactors} function together with the above parameter settings can be regarded as a wrapper function for the following commands.
<<>>=
library(TCC)
data(hypoData)
group <- c(1, 1, 1, 2, 2, 2)
FDR <- 0.1
floorPDEG <- 0.05
d <- DGEList(counts = hypoData, group = group)
### STEP 1 ###
d <- calcNormFactors(d)
### STEP 2 ###
d <- estimateCommonDisp(d)
d <- estimateTagwiseDisp(d)
result <- exactTest(d)
pval <- result$table$PValue
qval <- p.adjust(pval, method = "BH")
if (sum(qval < FDR) > (floorPDEG * nrow(hypoData))) {
  is.DEG <- as.logical(qval < FDR)
} else {
  is.DEG <- as.logical(rank(pval, ties.method = "min") <=
                       nrow(hypoData) * floorPDEG)
}
### STEP 3 ###
d <- DGEList(counts = hypoData[!is.DEG, ], group = group)
d <- calcNormFactors(d)
norm.factors <- d$samples$norm.factors * colSums(hypoData[!is.DEG, ]) /
                colSums(hypoData)
norm.factors <- norm.factors / mean(norm.factors)
norm.factors
@

\subsubsection{iDEGES/edgeR}
\label{section-3-1-3}
Our multi-step normalization can be repeated until the calculated normalization factors converge (Kadota et al., 2012). An iterative version of DEGES/TbT (i.e., iDEGES/TbT) can be described as the TMM-(baySeq-TMM)$_{n}$ pipeline with $n \ge 2$. Although iDEGES/TbT would not be practical in terms of computation time, the TMM-(edgeR-TMM)$_{n}$ pipeline (iDEGES/edgeR) is potentially superior to both DEGES/edgeR and DEGES/TbT. A suggested iDEGES/edgeR implementation ($n = 3$) consists of seven steps, as follows:
<<>>=
library(TCC)
data(hypoData)
group <- c(1, 1, 1, 2, 2, 2)
tcc <- new("TCC", hypoData, group)
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger",
                       iteration = 3, FDR = 0.1, floorPDEG = 0.05)
tcc$norm.factors
tcc$DEGES$execution.time
@

\subsubsection{DEGES/DESeq}
\label{section-3-1-4}
The DEGES pipeline can also be performed by using only the functions in the \Rpackage{DESeq} package. Similar to the \Rpackage{edgeR} case above, this DESeq-DESeq-DESeq pipeline (DEGES/DESeq) simply changes the corresponding \Robject{norm.method} and \Robject{test.method} arguments as follows:
<<>>=
library(TCC)
data(hypoData)
group <- c(1, 1, 1, 2, 2, 2)
tcc <- new("TCC", hypoData, group)
tcc <- calcNormFactors(tcc, norm.method = "deseq", test.method = "deseq",
                       iteration = 1, FDR = 0.1, floorPDEG = 0.05)
tcc$norm.factors
tcc$DEGES$execution.time
@
For \Rpackage{DESeq} users, we also provide commands, consisting of functions in \Rpackage{DESeq}, to perform the DEGES/DESeq pipeline without \Rpackage{TCC}. The \Rfunction{calcNormFactors} function together with the above arguments can be regarded as a wrapper function for the following commands.
<<>>=
library(TCC)
data(hypoData)
group <- c(1, 1, 1, 2, 2, 2)
FDR <- 0.1
floorPDEG <- 0.05
cds <- newCountDataSet(hypoData, group)
### STEP 1 ###
cds <- estimateSizeFactors(cds)
### STEP 2 ###
cds <- estimateDispersions(cds)
result <- nbinomTest(cds, 1, 2)
result$pval[is.na(result$pval)] <- 1
result$padj[is.na(result$padj)] <- 1
pval <- result$pval
qval <- result$padj
if (sum(qval < FDR) > (floorPDEG * nrow(hypoData))) {
  is.DEG <- as.logical(qval < FDR)
} else {
  is.DEG <- as.logical(rank(pval, ties.method = "min") <=
                       nrow(hypoData) * floorPDEG)
}
### STEP 3 ###
cds <- newCountDataSet(hypoData[!is.DEG, ], group)
cds <- estimateSizeFactors(cds)
norm.factors <- sizeFactors(cds) / colSums(hypoData)
norm.factors <- norm.factors / mean(norm.factors)
norm.factors
@

\subsubsection{DEGES/DESeq2}
\label{section-3-1-5}
The DEGES pipeline can also be performed by using only the functions in the \Rpackage{DESeq2} package.
Similar to the \Rpackage{DESeq} case above, this DESeq2-DESeq2-DESeq2 pipeline (DEGES/DESeq2) simply changes the corresponding \Robject{norm.method} and \Robject{test.method} arguments as follows:
<<4390811048>>=
library(TCC)
data(hypoData)
group <- c(1, 1, 1, 2, 2, 2)
tcc <- new("TCC", hypoData, group)
tcc <- calcNormFactors(tcc, norm.method = "deseq2", test.method = "deseq2",
                       iteration = 1, FDR = 0.1, floorPDEG = 0.05)
tcc$norm.factors
tcc$DEGES$execution.time
@
For \Rpackage{DESeq2} users, we also provide commands, consisting of functions in \Rpackage{DESeq2}, to perform the DEGES/DESeq2 pipeline without \Rpackage{TCC}. The \Rfunction{calcNormFactors} function together with the above arguments can be regarded as a wrapper function for the following commands.
<<0869034832>>=
library(TCC)
data(hypoData)
FDR <- 0.1
floorPDEG <- 0.05
group <- factor(c(1, 1, 1, 2, 2, 2))
cl <- data.frame(group = group)
design <- formula(~ group)
dds <- DESeqDataSetFromMatrix(countData = hypoData, colData = cl,
                              design = design)
### STEP 1 ###
dds <- estimateSizeFactors(dds)
sizeFactors(dds) <- sizeFactors(dds) / mean(sizeFactors(dds))
### STEP 2 ###
dds <- estimateDispersions(dds)
dds <- nbinomWaldTest(dds)
result <- results(dds)
result$pvalue[is.na(result$pvalue)] <- 1
pval <- result$pvalue
qval <- p.adjust(pval, method = "BH")
if (sum(qval < FDR) > (floorPDEG * nrow(hypoData))) {
  is.DEG <- as.logical(qval < FDR)
} else {
  is.DEG <- as.logical(rank(pval, ties.method = "min") <=
                       nrow(hypoData) * floorPDEG)
}
### STEP 3 ###
dds <- DESeqDataSetFromMatrix(countData = hypoData[!is.DEG, ],
                              colData = cl, design = design)
dds <- estimateSizeFactors(dds)
norm.factors <- sizeFactors(dds) / colSums(hypoData)
norm.factors <- norm.factors / mean(norm.factors)
norm.factors
@

\subsection{Normalization of two-group count data without replicates}
\label{section-3-2}
It is important to keep in mind that most R packages (including \Rpackage{edgeR}, \Rpackage{DESeq}, and \Rpackage{baySeq}) are primarily designed for analyzing data with biological replicates, because the biological variability has to be accurately estimated to avoid spurious DE calls (Glaus et al., 2012). Similar to \ref{ss_norm_twogroup_reps}, users can select a total of six combinations (two normalization methods $\times$ three DEG identification methods) when obtaining normalization factors from two-group count data without replicates. We here describe two DEGES-based normalization pipelines (\ref{ss_2gs_degesedger} DEGES/edgeR and \ref{ss_2gs_degesdeseq} DEGES/DESeq) using the subset of \Robject{hypoData} consisting of the first and fourth columns, i.e.,
<<9308283201>>=
library(TCC)
data(hypoData)
group <- c(1, 2)
tcc <- new("TCC", hypoData[, c(1, 4)], group)
head(tcc$count)
tcc$group
@

\subsubsection{DEGES/edgeR}
\label{ss_2gs_degesedger}
The DEGES/edgeR pipeline for two-group data without replicates can be performed as follows:
<<4352419012>>=
library(TCC)
data(hypoData)
group <- c(1, 2)
tcc <- new("TCC", hypoData[, c(1, 4)], group)
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger",
                       iteration = 1, FDR = 0.1, floorPDEG = 0.05)
tcc$norm.factors
@
For \Rpackage{edgeR} users, we provide commands, consisting of functions in \Rpackage{edgeR}, to perform the DEGES/edgeR pipeline without \Rpackage{TCC}. The \Rfunction{calcNormFactors} function together with the above parameter settings can be regarded as a wrapper function for the following commands.
<<5670112812>>=
library(TCC)
data(hypoData)
group <- c(1, 2)
FDR <- 0.1
floorPDEG <- 0.05
d <- DGEList(counts = hypoData[, c(1, 4)], group = group)
### STEP 1 ###
d <- calcNormFactors(d)
### STEP 2 ###
d <- estimateGLMCommonDisp(d, method = "deviance", robust = TRUE,
                           subset = NULL)
result <- exactTest(d)
pval <- result$table$PValue
qval <- p.adjust(pval, method = "BH")
if (sum(qval < FDR) > (floorPDEG * nrow(hypoData))) {
  is.DEG <- as.logical(qval < FDR)
} else {
  is.DEG <- as.logical(rank(pval, ties.method = "min") <=
                       nrow(hypoData) * floorPDEG)
}
### STEP 3 ###
d <- DGEList(counts = hypoData[!is.DEG, c(1, 4)], group = group)
d <- calcNormFactors(d)
norm.factors <- d$samples$norm.factors * colSums(hypoData[!is.DEG, c(1, 4)]) /
                colSums(hypoData[, c(1, 4)])
norm.factors <- norm.factors / mean(norm.factors)
norm.factors
@

\subsubsection{DEGES/DESeq}
\label{ss_2gs_degesdeseq}
The DEGES/DESeq pipeline for two-group data without replicates can be performed as follows:
<<0408278414>>=
library(TCC)
data(hypoData)
group <- c(1, 2)
tcc <- new("TCC", hypoData[, c(1, 4)], group)
tcc <- calcNormFactors(tcc, norm.method = "deseq", test.method = "deseq",
                       iteration = 1, FDR = 0.1, floorPDEG = 0.05)
tcc$norm.factors
@
For \Rpackage{DESeq} users, we provide commands, consisting of functions in \Rpackage{DESeq}, to perform the DEGES/DESeq pipeline without \Rpackage{TCC}. The \Rfunction{calcNormFactors} function together with the above parameter settings can be regarded as a wrapper function for the following commands.
<<4389572193>>=
library(TCC)
data(hypoData)
group <- c(1, 2)
FDR <- 0.1
floorPDEG <- 0.05
cds <- newCountDataSet(hypoData[, c(1, 4)], group)
### STEP 1 ###
cds <- estimateSizeFactors(cds)
### STEP 2 ###
cds <- estimateDispersions(cds, method = "blind", sharingMode = "fit-only")
result <- nbinomTest(cds, 1, 2)
result$pval[is.na(result$pval)] <- 1
result$padj[is.na(result$padj)] <- 1
pval <- result$pval
qval <- result$padj
if (sum(qval < FDR) > (floorPDEG * nrow(hypoData))) {
  is.DEG <- as.logical(qval < FDR)
} else {
  is.DEG <- as.logical(rank(pval, ties.method = "min") <=
                       nrow(hypoData) * floorPDEG)
}
### STEP 3 ###
cds <- newCountDataSet(hypoData[!is.DEG, c(1, 4)], group)
cds <- estimateSizeFactors(cds)
norm.factors <- sizeFactors(cds) / colSums(hypoData[, c(1, 4)])
norm.factors <- norm.factors / mean(norm.factors)
norm.factors
@

\subsection{Normalization of two-group count data without replicates (paired)}
\label{ss_norm_twogroup_without_reps}
\Rpackage{edgeR} and \Rpackage{DESeq} (and \Rpackage{DESeq2}) employ generalized linear models (GLMs), which can be applied to detect DEGs from paired two-group count data. When obtaining normalization factors from paired two-group samples, users can select a total of four combinations (two normalization methods $\times$ two DEG identification methods) coupled with an arbitrary number of iterations ($n = 0, 1, 2, \dots, 100$) in our DEGES-based normalization pipeline. The analysis for paired samples can be performed by indicating (1) the pair information when constructing the \Rclass{TCC} class object and (2) \Robject{paired = TRUE} when performing the \Rfunction{calcNormFactors} function.
For analyzing paired data, we here use the hypothetical count data (\Robject{hypoData}; for details, see \ref{ss_reading_the_count_data}) after changing the label information, i.e.,
<<9347533928>>=
library(TCC)
data(hypoData)
colnames(hypoData) <- c("T_dogA", "T_dogB", "T_dogC",
                        "N_dogA", "N_dogB", "N_dogC")
head(hypoData)
@
These count data consist of tumor and normal samples from three individuals (or sibships): dogA, dogB, and dogC. ``T'' and ``N'' indicate tumor and normal tissues, respectively.

\subsubsection{DEGES/edgeR}
\label{norm_twogroup_without_reps_degesedger}
The DEGES/edgeR pipeline for two-group paired data can be performed as follows:
<<2394782901>>=
library(TCC)
data(hypoData)
colnames(hypoData) <- c("T_dogA", "T_dogB", "T_dogC",
                        "N_dogA", "N_dogB", "N_dogC")
group <- c(1, 1, 1, 2, 2, 2)
pair <- c(1, 2, 3, 1, 2, 3)
cl <- data.frame(group = group, pair = pair)
tcc <- new("TCC", hypoData, cl)
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger",
                       iteration = 1, FDR = 0.1, floorPDEG = 0.05,
                       paired = TRUE)
tcc$norm.factors
@
For \Rpackage{edgeR} users, we provide commands, consisting of functions in \Rpackage{edgeR}, to perform the DEGES/edgeR pipeline without \Rpackage{TCC}. The \Rfunction{calcNormFactors} function together with the above parameter settings can be regarded as a wrapper function for the following commands.
<<8939487592>>=
library(TCC)
data(hypoData)
colnames(hypoData) <- c("T_dogA", "T_dogB", "T_dogC",
                        "N_dogA", "N_dogB", "N_dogC")
group <- factor(c(1, 1, 1, 2, 2, 2))
pair <- factor(c(1, 2, 3, 1, 2, 3))
design <- model.matrix(~ group + pair)
coef <- 2
FDR <- 0.1
floorPDEG <- 0.05
d <- DGEList(counts = hypoData, group = group)
### STEP 1 ###
d <- calcNormFactors(d)
### STEP 2 ###
d <- estimateGLMCommonDisp(d, design)
d <- estimateGLMTrendedDisp(d, design)
d <- estimateGLMTagwiseDisp(d, design)
fit <- glmFit(d, design)
lrt <- glmLRT(fit, coef = coef)
pval <- lrt$table$PValue
qval <- p.adjust(pval, method = "BH")
if (sum(qval < FDR) > (floorPDEG * nrow(hypoData))) {
  is.DEG <- as.logical(qval < FDR)
} else {
  is.DEG <- as.logical(rank(pval, ties.method = "min") <=
                       nrow(hypoData) * floorPDEG)
}
### STEP 3 ###
d <- DGEList(counts = hypoData[!is.DEG, ], group = group)
d <- calcNormFactors(d)
norm.factors <- d$samples$norm.factors * colSums(hypoData[!is.DEG, ]) /
                colSums(hypoData)
norm.factors <- norm.factors / mean(norm.factors)
norm.factors
@

\subsubsection{DEGES/DESeq}
\label{nrom_twogroup_without_reps_degesdeseq}
The DEGES/DESeq pipeline for two-group paired data can be performed as follows:
<<7573490899>>=
library(TCC)
data(hypoData)
colnames(hypoData) <- c("T_dogA", "T_dogB", "T_dogC",
                        "N_dogA", "N_dogB", "N_dogC")
group <- c(1, 1, 1, 2, 2, 2)
pair <- c(1, 2, 3, 1, 2, 3)
cl <- data.frame(group = group, pair = pair)
tcc <- new("TCC", hypoData, cl)
tcc <- calcNormFactors(tcc, norm.method = "deseq", test.method = "deseq",
                       iteration = 1, FDR = 0.1, floorPDEG = 0.05,
                       paired = TRUE)
tcc$norm.factors
@
For \Rpackage{DESeq} users, we provide commands, consisting of functions in \Rpackage{DESeq}, to perform the DEGES/DESeq pipeline without \Rpackage{TCC}. The \Rfunction{calcNormFactors} function together with the above parameter settings can be regarded as a wrapper function for the following commands.
<<3489438904>>=
library(TCC)
data(hypoData)
colnames(hypoData) <- c("T_dogA", "T_dogB", "T_dogC",
                        "N_dogA", "N_dogB", "N_dogC")
group <- factor(c(1, 1, 1, 2, 2, 2))
pair <- factor(c(1, 2, 3, 1, 2, 3))
FDR <- 0.1
floorPDEG <- 0.05
cl <- data.frame(group = group, pair = pair)
full <- count ~ group + pair
reduced <- count ~ pair
cds <- newCountDataSet(hypoData, cl)
### STEP 1 ###
cds <- estimateSizeFactors(cds)
### STEP 2 ###
cds <- estimateDispersions(cds, method = "blind", sharingMode = "fit-only")
full.model <- fitNbinomGLMs(cds, full)
reduced.model <- fitNbinomGLMs(cds, reduced)
pval <- nbinomGLMTest(full.model, reduced.model)
pval[is.na(pval)] <- 1
qval <- p.adjust(pval, method = "BH")
if (sum(qval < FDR) > (floorPDEG * nrow(hypoData))) {
  is.DEG <- as.logical(qval < FDR)
} else {
  is.DEG <- as.logical(rank(pval, ties.method = "min") <=
                       nrow(hypoData) * floorPDEG)
}
### STEP 3 ###
cds <- newCountDataSet(hypoData[!is.DEG, ], group)
cds <- estimateSizeFactors(cds)
norm.factors <- sizeFactors(cds) / colSums(hypoData)
norm.factors <- norm.factors / mean(norm.factors)
norm.factors
@

\subsection{Normalization of multi-group count data with replicates}
\label{ss_norm_multifactor_reps}
Many R packages (including \Rpackage{edgeR}, \Rpackage{DESeq}, and \Rpackage{baySeq}) support DE analysis for multi-group tag count data. \Rpackage{TCC} provides some prototypes of DEGES-based pipelines for such data. Here, we analyze another hypothetical three-group count matrix, the \Robject{hypoData\_mg} object, provided in \Rpackage{TCC}. It consists of $1,000$ genes and a total of nine columns for testing any difference among three groups that each have triplicates.
<<>>=
library(TCC)
data(hypoData_mg)
group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
tcc <- new("TCC", hypoData_mg, group)
tcc
dim(tcc$count)
@
Of the $1,000$ genes, the first $200$ genes are DEGs and the remaining $800$ genes are non-DEGs. The breakdown of the $200$ DEGs is as follows: $140$, $40$, and $20$ DEGs are up-regulated in Groups 1, 2, and 3, respectively. Below, we show some DEGES-based normalization pipelines for this multi-group data (\ref{section-3-3-1} DEGES/TbT, \ref{section-3-3-2} DEGES/edgeR, and \ref{section-3-3-3} DEGES/DESeq).

\subsubsection{DEGES/TbT}
\label{section-3-3-1}
The DEGES/TbT pipeline for multi-group data is essentially the same as those for two-group data with or without replicates. Note that a smaller sampling size (i.e., \Robject{samplesize = 100}) is used here to reduce the computation time, but a larger sampling size of around $10,000$ (i.e., \Robject{samplesize = 10000}) is recommended (Hardcastle and Kelly, 2010).
<<>>=
set.seed(1000)
library(TCC)
data(hypoData_mg)
samplesize <- 100
group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
tcc <- new("TCC", hypoData_mg, group)
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "bayseq",
                       iteration = 1, samplesize = samplesize)
tcc$norm.factors
@

\subsubsection{DEGES/edgeR}
\label{section-3-3-2}
\Rpackage{edgeR} employs generalized linear models (GLMs) to find DEGs between any of the groups. The DEGES/edgeR normalization pipeline in \Rpackage{TCC} internally uses functions for the GLM approach that require two models (a full model and a null model). The full model is specified as a design matrix describing the sample groups; the null model corresponds to dropping the specified model coefficients from the full model.
The two models can be defined as follows:
<<>>=
library(TCC)
data(hypoData_mg)
group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
tcc <- new("TCC", hypoData_mg, group)
design <- model.matrix(~ as.factor(group))
coef <- 2:length(unique(group))
@
The design matrix (\Robject{design}) can be constructed by using the \Rfunction{model.matrix} function. For the model coefficients (\Robject{coef}), the user should specify all the coefficients except for the intercept term. The DEGES/edgeR pipeline with the two models (\Robject{design} and \Robject{coef}) can be performed as follows.
<<>>=
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger",
                       iteration = 1, design = design, coef = coef)
tcc$norm.factors
@
The two models (\Robject{design} and \Robject{coef}) will automatically be generated when performing the following \Rfunction{calcNormFactors} function if those models are not explicitly indicated. That is,
<<3498028981>>=
library(TCC)
data(hypoData_mg)
group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
tcc <- new("TCC", hypoData_mg, group)
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger",
                       iteration = 1)
tcc$norm.factors
@
For \Rpackage{edgeR} users, we provide commands, consisting of functions in \Rpackage{edgeR}, to perform the DEGES/edgeR pipeline without \Rpackage{TCC}. The \Rfunction{calcNormFactors} function together with the above parameter settings can be regarded as a wrapper function for the following commands.
<<>>=
library(TCC)
data(hypoData_mg)
group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
tcc <- new("TCC", hypoData_mg, group)
FDR <- 0.1
floorPDEG <- 0.05
design <- model.matrix(~ as.factor(group))
coef <- 2:length(unique(group))
d <- DGEList(counts = hypoData_mg, group = group)
### STEP 1 ###
d <- calcNormFactors(d)
### STEP 2 ###
d <- estimateGLMCommonDisp(d, design)
d <- estimateGLMTrendedDisp(d, design)
d <- estimateGLMTagwiseDisp(d, design)
fit <- glmFit(d, design)
lrt <- glmLRT(fit, coef = coef)
result <- as.data.frame(topTags(lrt, n = nrow(hypoData_mg)))
result <- result[rownames(hypoData_mg), ]
pval <- lrt$table$PValue
qval <- p.adjust(pval, method = "BH")
if (sum(qval < FDR) > (floorPDEG * nrow(hypoData_mg))) {
  is.DEG <- as.logical(qval < FDR)
} else {
  is.DEG <- as.logical(rank(pval, ties.method = "min") <=
                       nrow(hypoData_mg) * floorPDEG)
}
### STEP 3 ###
d <- DGEList(counts = hypoData_mg[!is.DEG, ], group = group)
d <- calcNormFactors(d)
norm.factors <- d$samples$norm.factors * colSums(hypoData_mg[!is.DEG, ]) /
                colSums(hypoData_mg)
norm.factors <- norm.factors / mean(norm.factors)
norm.factors
@

\subsubsection{DEGES/DESeq}
\label{section-3-3-3}
\Rpackage{DESeq} also employs GLMs for analyzing multi-group experiments. Similar to the \Rpackage{edgeR} package, it requires two models (a full model and a reduced model). The full model (\Robject{full}) and reduced model (\Robject{reduced}) can be created as follows:
<<>>=
library(TCC)
data(hypoData_mg)
group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
tcc <- new("TCC", hypoData_mg, group)
full <- count ~ condition
reduced <- count ~ 1
@
The DEGES/DESeq pipeline with the two models (\Robject{full} and \Robject{reduced}) can be performed as follows.
<<>>=
tcc <- calcNormFactors(tcc, norm.method = "deseq", test.method = "deseq",
                       iteration = 1, full = full, reduced = reduced)
tcc$norm.factors
@
The two models (\Robject{full} and \Robject{reduced}) will automatically be generated when performing the following \Rfunction{calcNormFactors} function if those models are not explicitly indicated.
That is,
<<3894410011>>=
library(TCC)
data(hypoData_mg)
group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
tcc <- new("TCC", hypoData_mg, group)
tcc <- calcNormFactors(tcc, norm.method = "deseq", test.method = "deseq",
                       iteration = 1)
tcc$norm.factors
@
For \Rpackage{DESeq} users, we provide commands, consisting of functions in \Rpackage{DESeq}, to perform the DEGES/DESeq pipeline without \Rpackage{TCC}. The \Rfunction{calcNormFactors} function together with the above parameter settings can be regarded as a wrapper function for the following commands.
<<>>=
library(TCC)
data(hypoData_mg)
group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
FDR <- 0.1
floorPDEG <- 0.05
tcc <- new("TCC", hypoData_mg, group)
full <- count ~ condition
reduced <- count ~ 1
cds <- newCountDataSet(hypoData_mg, group)
### STEP 1 ###
cds <- estimateSizeFactors(cds)
### STEP 2 ###
cds <- estimateDispersions(cds)
full.model <- fitNbinomGLMs(cds, full)
reduced.model <- fitNbinomGLMs(cds, reduced)
pval <- nbinomGLMTest(full.model, reduced.model)
pval[is.na(pval)] <- 1
qval <- p.adjust(pval, method = "BH")
if (sum(qval < FDR) > (floorPDEG * nrow(hypoData_mg))) {
  is.DEG <- as.logical(qval < FDR)
} else {
  is.DEG <- as.logical(rank(pval, ties.method = "min") <=
                       nrow(hypoData_mg) * floorPDEG)
}
### STEP 3 ###
cds <- newCountDataSet(hypoData_mg[!is.DEG, ], group)
cds <- estimateSizeFactors(cds)
norm.factors <- sizeFactors(cds) / colSums(hypoData_mg)
norm.factors <- norm.factors / mean(norm.factors)
norm.factors
@

\subsection{Retrieving normalized data}
\label{s_retr_norm_data}
Functions similar to our \Rfunction{calcNormFactors} function for calculating normalization factors are the \Rfunction{calcNormFactors} function in \Rpackage{edgeR} and the \Rfunction{estimateSizeFactors} function in both \Rpackage{DESeq} and \Rpackage{DESeq2}. Note that the terminology used in \Rpackage{DESeq} (i.e., size factors) is different from that used in \Rpackage{edgeR} (i.e., effective library sizes) and ours. The effective library size in \Rpackage{edgeR} is calculated as the library size multiplied by the normalization factor. The size factors in both the \Rpackage{DESeq} and \Rpackage{DESeq2} packages are comparable to the {\it normalized} effective library sizes, wherein the summary statistics for the effective library sizes are adjusted to one. Our normalization factors, which can be obtained from \Robject{tcc\$norm.factors}, have the same meaning as those in \Rpackage{edgeR}. Accordingly, the normalization factors calculated by \Rpackage{TCC} with arbitrary options should be combined with the library sizes when normalized read counts are to be obtained. Since biologists are often interested in such information (Dillies et al., 2012), we provide the \Rfunction{getNormalizedData} function for retrieving normalized data.

Note that \Robject{hypoData} consists of $1,000$ genes and a total of six samples (three biological replicates for G1 and three biological replicates for G2); i.e., \{G1\_rep1, G1\_rep2, G1\_rep3\} vs. \{G2\_rep1, G2\_rep2, G2\_rep3\}. These simulation data have basically the same conditions as shown in Fig.~1 of the TbT paper (Kadota et al., 2012); i.e., (i) the first $200$ genes are DEGs ($P_{\mbox{\tiny DEG}} = 200/1000 = 20\%$), (ii) of these $200$ DEGs, the first $180$ genes are more highly expressed in G1 ($P_{\mbox{\tiny G1}} = 180/200 = 90\%$) and the remaining $20$ DEGs are more highly expressed in G2, and (iii) the level of DE is four-fold. The last $800$ genes were designed to be non-DEGs.
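For reference, simulation data with the same design can be generated with the \Rfunction{simulateReadCounts} function in \Rpackage{TCC} (mentioned in the Introduction); the following is a minimal sketch in which the parameter values mirror the \Robject{hypoData} design described above, though the exact counts obtained will depend on the random seed:
\begin{Schunk}
\begin{Sinput}
> # A sketch regenerating data with the hypoData design:
> # 1,000 genes, 20% DEGs, 90% of DEGs up-regulated in G1,
> # four-fold DE, and three replicates per group.
> set.seed(1000)
> library(TCC)
> sim <- simulateReadCounts(Ngene = 1000, PDEG = 0.2,
+                           DEG.assign = c(0.9, 0.1),
+                           DEG.foldchange = c(4, 4),
+                           replicates = c(3, 3))
> head(sim$count)
\end{Sinput}
\end{Schunk}
The returned \Rclass{TCC} class object also records the true DEG assignments in its \Robject{simulation} field, which is convenient for evaluations such as the AUC calculation in section \ref{section-4-1-1}.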
The different normalization strategies can roughly be evaluated in terms of how similar the summary statistics for {\it normalized} data labeled as non-DEGs in one group (e.g., G1) are to those of the other group (e.g., G2). The basic statistics for the non-DEGs are as follows.
<<>>=
library(TCC)
data(hypoData)
nonDEG <- 201:1000
summary(hypoData[nonDEG, ])
@
From now on, we will display only the median values for simplicity, i.e.,
<<>>=
apply(hypoData[nonDEG, ], 2, median)
@
<<echo = false, result = hide>>=
hypoData.median <- apply(hypoData[nonDEG, ], 2, median)
hypoData.14.median <- apply(hypoData[nonDEG, c(1, 4)], 2, median)
@
In what follows, we show detailed examples using \Robject{hypoData}. Note, however, that the basic usage is simple.
<<eval = FALSE>>=
normalized.count <- getNormalizedData(tcc)
@

\subsubsection{Retrieving two-group DEGES/edgeR-normalized data with replicates}
\label{section-3-4-1}
The \Rfunction{getNormalizedData} function can be applied to the \Rclass{TCC} class object after the normalization factors have been calculated. The DEGES/edgeR-normalized data can be retrieved as follows.
<<>>=
library(TCC)
data(hypoData)
nonDEG <- 201:1000
group <- c(1, 1, 1, 2, 2, 2)
tcc <- new("TCC", hypoData, group)
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger",
                       iteration = 1, FDR = 0.1, floorPDEG = 0.05)
normalized.count <- getNormalizedData(tcc)
apply(normalized.count[nonDEG, ], 2, median)
range(apply(normalized.count[nonDEG, ], 2, median))
@
<<4390011292,echo = false, result = hide>>=
buff_1 <- range(apply(normalized.count[nonDEG, ], 2, median))
@
For comparison, the summary statistics for TMM-normalized data produced using the original normalization method (i.e., TMM) in \Rpackage{edgeR} can be obtained as follows.
<<2391104042>>=
library(TCC)
data(hypoData)
nonDEG <- 201:1000
group <- c(1, 1, 1, 2, 2, 2)
d <- DGEList(counts = hypoData, group = group)
d <- calcNormFactors(d)
norm.factors <- d$samples$norm.factors
effective.libsizes <- colSums(hypoData) * norm.factors
normalized.count <- sweep(hypoData, 2,
                          mean(effective.libsizes) / effective.libsizes, "*")
apply(normalized.count[nonDEG, ], 2, median)
range(apply(normalized.count[nonDEG, ], 2, median))
@
<<4330011292,echo = false, result = hide>>=
buff_2 <- range(apply(normalized.count[nonDEG, ], 2, median))
@
It is obvious that the summary statistics (ranging from $\Sexpr{sprintf("%.5f", min(buff_1))}$ to $\Sexpr{sprintf("%.5f", max(buff_1))}$) from DEGES/edgeR-normalized data are closer to the truth (i.e., ranging from $\Sexpr{sprintf("%.1f", min(hypoData.median))}$ to $\Sexpr{sprintf("%.1f", max(hypoData.median))}$) than those (ranging from $\Sexpr{sprintf("%.5f", min(buff_2))}$ to $\Sexpr{sprintf("%.5f", max(buff_2))}$) from TMM-normalized data.

\subsubsection{Retrieving two-group DEGES/DESeq-normalized data with replicates}
\label{section-3-4-2}
Similar to the DEGES/edgeR case, the DEGES/DESeq-normalized data can be retrieved as follows.
<<>>=
library(TCC)
data(hypoData)
group <- c(1, 1, 1, 2, 2, 2)
nonDEG <- 201:1000
tcc <- new("TCC", hypoData, group)
tcc <- calcNormFactors(tcc, norm.method = "deseq", test.method = "deseq",
                       iteration = 1, FDR = 0.1, floorPDEG = 0.05)
normalized.count <- getNormalizedData(tcc)
apply(normalized.count[nonDEG, ], 2, median)
range(apply(normalized.count[nonDEG, ], 2, median))
@
<<4330110292,echo = false, result = hide>>=
buff_1 <- range(apply(normalized.count[nonDEG, ], 2, median))
@
For comparison, the summary statistics for DESeq-normalized data produced using the original normalization method in \Rpackage{DESeq} can be obtained as follows.
<<3259021231>>=
library(TCC)
data(hypoData)
nonDEG <- 201:1000
group <- c(1, 1, 1, 2, 2, 2)
cds <- newCountDataSet(hypoData, group)
cds <- estimateSizeFactors(cds)
normalized.count <- counts(cds, normalized = TRUE)
apply(normalized.count[nonDEG, ], 2, median)
range(apply(normalized.count[nonDEG, ], 2, median))
@
<<1330110292,echo = false, result = hide>>=
buff_2 <- range(apply(normalized.count[nonDEG, ], 2, median))
@
It is obvious that the summary statistics (ranging from $\Sexpr{sprintf("%.5f", min(buff_1))}$ to $\Sexpr{sprintf("%.5f", max(buff_1))}$) from DEGES/DESeq-normalized data are closer to the truth (i.e., ranging from $\Sexpr{sprintf("%.1f", min(hypoData.median))}$ to $\Sexpr{sprintf("%.1f", max(hypoData.median))}$) than those (ranging from $\Sexpr{sprintf("%.5f", min(buff_2))}$ to $\Sexpr{sprintf("%.5f", max(buff_2))}$) from DESeq-normalized data.

\subsubsection{Retrieving two-group DEGES/DESeq-normalized data without replicates}
\label{section-3-4-3}
Similar to the case of count data with replicates, the DEGES/DESeq-normalized data without replicates can be retrieved as follows.
<<>>=
library(TCC)
data(hypoData)
nonDEG <- 201:1000
group <- c(1, 2)
tcc <- new("TCC", hypoData[, c(1, 4)], group)
tcc <- calcNormFactors(tcc, norm.method = "deseq", test.method = "deseq",
                       iteration = 1, FDR = 0.1, floorPDEG = 0.05)
normalized.count <- getNormalizedData(tcc)
apply(normalized.count[nonDEG, ], 2, median)
@
<<1664110292,echo = false, result = hide>>=
buff_1 <- range(apply(normalized.count[nonDEG, ], 2, median))
@
For comparison, the summary statistics for DESeq-normalized data produced using the original normalization method in \Rpackage{DESeq} can be obtained as follows.
<<238967389>>=
library(TCC)
data(hypoData)
nonDEG <- 201:1000
group <- c(1, 2)
cds <- newCountDataSet(hypoData[, c(1, 4)], group)
cds <- estimateSizeFactors(cds)
normalized.count <- counts(cds, normalized = TRUE)
apply(normalized.count[nonDEG, ], 2, median)
@
<<1664110293,echo = false, result = hide>>=
buff_2 <- range(apply(normalized.count[nonDEG, ], 2, median))
@
It is obvious that the summary statistics ($\Sexpr{sprintf("%.5f", min(buff_1))}$ and $\Sexpr{sprintf("%.5f", max(buff_1))}$) from DEGES/DESeq-normalized data are closer to the truth (i.e., ranging from $\Sexpr{sprintf("%.1f", min(hypoData.14.median))}$ to $\Sexpr{sprintf("%.1f", max(hypoData.14.median))}$) than those ($\Sexpr{sprintf("%.5f", min(buff_2))}$ and $\Sexpr{sprintf("%.5f", max(buff_2))}$) from DESeq-normalized data.

\subsubsection{Retrieving two-group DEGES/edgeR-normalized data without replicates (paired)}
\label{retriev_twogroup_paired_withoutreps}
We here analyze the \Robject{hypoData} object provided in \Rpackage{TCC}. As described in \ref{ss_norm_twogroup_without_reps}, we regard \Robject{hypoData} as hypothetical paired data. The DEGES/edgeR-normalized data can be retrieved as follows.
<<5647309237>>=
library(TCC)
data(hypoData)
colnames(hypoData) <- c("T_dogA", "T_dogB", "T_dogC",
                        "N_dogA", "N_dogB", "N_dogC")
nonDEG <- 201:1000
group <- c(1, 1, 1, 2, 2, 2)
pair <- c(1, 2, 3, 1, 2, 3)
cl <- data.frame(group = group, pair = pair)
tcc <- new("TCC", hypoData, cl)
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger",
                       iteration = 1, FDR = 0.1, floorPDEG = 0.05,
                       paired = TRUE)
normalized.count <- getNormalizedData(tcc)
head(normalized.count, n = 4)
apply(normalized.count[nonDEG, ], 2, median)
range(apply(normalized.count[nonDEG, ], 2, median))
@
<<1674110292,echo = false, result = hide>>=
buff_1 <- range(apply(normalized.count[nonDEG, ], 2, median))
@
For comparison, the summary statistics for TMM-normalized data produced using the original normalization method (i.e., TMM) in \Rpackage{edgeR} can be obtained as follows.
<<4387803948>>=
library(TCC)
data(hypoData)
colnames(hypoData) <- c("T_dogA", "T_dogB", "T_dogC",
                        "N_dogA", "N_dogB", "N_dogC")
nonDEG <- 201:1000
group <- factor(c(1, 1, 1, 2, 2, 2))
d <- DGEList(counts = hypoData, group = group)
d <- calcNormFactors(d)
norm.factors <- d$samples$norm.factors
effective.libsizes <- colSums(hypoData) * norm.factors
normalized.count <- sweep(hypoData, 2,
                          mean(effective.libsizes) / effective.libsizes, "*")
apply(normalized.count[nonDEG, ], 2, median)
range(apply(normalized.count[nonDEG, ], 2, median))
@
<<1674110992,echo = false, result = hide>>=
buff_2 <- range(apply(normalized.count[nonDEG, ], 2, median))
@
It is obvious that the summary statistics (ranging from $\Sexpr{sprintf("%.5f", min(buff_1))}$ to $\Sexpr{sprintf("%.5f", max(buff_1))}$) from DEGES/edgeR-normalized data are closer to the truth (i.e., ranging from $\Sexpr{sprintf("%.1f", min(hypoData.median))}$ to $\Sexpr{sprintf("%.1f", max(hypoData.median))}$) than those (ranging from $\Sexpr{sprintf("%.5f", min(buff_2))}$ to $\Sexpr{sprintf("%.5f", max(buff_2))}$) from TMM-normalized data.

\subsubsection{Retrieving two-group DEGES/DESeq-normalized data without replicates (paired)}
\label{ss_ret_twogroup_degesdeseq_without_reps}
We here analyze the \Robject{hypoData} object provided in \Rpackage{TCC}. As described in \ref{ss_norm_twogroup_without_reps}, we regard \Robject{hypoData} as hypothetical paired data. The DEGES/DESeq-normalized data can be retrieved as follows.
<<5904580011>>=
library(TCC)
data(hypoData)
colnames(hypoData) <- c("T_dogA", "T_dogB", "T_dogC",
                        "N_dogA", "N_dogB", "N_dogC")
nonDEG <- 201:1000
group <- c(1, 1, 1, 2, 2, 2)
pair <- c(1, 2, 3, 1, 2, 3)
cl <- data.frame(group = group, pair = pair)
tcc <- new("TCC", hypoData, cl)
tcc <- calcNormFactors(tcc, norm.method = "deseq", test.method = "deseq",
                       iteration = 1, FDR = 0.1, floorPDEG = 0.05,
                       paired = TRUE)
normalized.count <- getNormalizedData(tcc)
apply(normalized.count[nonDEG, ], 2, median)
range(apply(normalized.count[nonDEG, ], 2, median))
@
<<5490200292,echo = false, result = hide>>=
buff_1 <- range(apply(normalized.count[nonDEG, ], 2, median))
@
For comparison, the summary statistics for DESeq-normalized data produced using the original normalization method in \Rpackage{DESeq} can be obtained as follows.
<<4390055561>>=
library(TCC)
data(hypoData)
colnames(hypoData) <- c("T_dogA", "T_dogB", "T_dogC",
                        "N_dogA", "N_dogB", "N_dogC")
nonDEG <- 201:1000
group <- factor(c(1, 1, 1, 2, 2, 2))
pair <- factor(c(1, 2, 3, 1, 2, 3))
cl <- data.frame(group = group, pair = pair)
cds <- newCountDataSet(hypoData, cl)
cds <- estimateSizeFactors(cds)
normalized.count <- counts(cds, normalized = TRUE)
apply(normalized.count[nonDEG, ], 2, median)
range(apply(normalized.count[nonDEG, ], 2, median))
@
<<9600320992,echo = false, result = hide>>=
buff_2 <- range(apply(normalized.count[nonDEG, ], 2, median))
@
It is obvious that the summary statistics (ranging from $\Sexpr{sprintf("%.5f", min(buff_1))}$ to $\Sexpr{sprintf("%.5f", max(buff_1))}$) from DEGES/DESeq-normalized data are closer to the truth (i.e., ranging from $\Sexpr{sprintf("%.1f", min(hypoData.median))}$ to $\Sexpr{sprintf("%.1f", max(hypoData.median))}$) than those (ranging from $\Sexpr{sprintf("%.5f", min(buff_2))}$ to $\Sexpr{sprintf("%.5f", max(buff_2))}$) from DESeq-normalized data.

\subsubsection{Retrieving multi-group iDEGES/edgeR-normalized data with replicates}
\label{section-3-4-4}
Here, we analyze another hypothetical three-group count matrix, the \Robject{hypoData\_mg} object, provided in \Rpackage{TCC}. It consists of $1,000$ genes and a total of nine columns for testing any difference among three groups that each have triplicates. Similar to the \Robject{hypoData} object, the first $200$ genes are DEGs and the remaining $800$ genes are non-DEGs. The basic statistics for the non-DEGs are as follows.
<<>>=
library(TCC)
data(hypoData_mg)
nonDEG <- 201:1000
summary(hypoData_mg[nonDEG, ])
@
From now on, we will display only the median values for simplicity, i.e.,
<<>>=
apply(hypoData_mg[nonDEG, ], 2, median)
@
<<echo = false, result = hide>>=
hypoData_mg.median <- apply(hypoData_mg[nonDEG, ], 2, median)
@
The iDEGES/edgeR-normalized data can be retrieved as follows.
<<>>=
library(TCC)
data(hypoData_mg)
nonDEG <- 201:1000
group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
tcc <- new("TCC", hypoData_mg, group)
design <- model.matrix(~ as.factor(group))
coef <- 2:length(unique(group))
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger",
                       iteration = 3)
normalized.count <- getNormalizedData(tcc)
apply(normalized.count[nonDEG, ], 2, median)
range(apply(normalized.count[nonDEG, ], 2, median))
@
<<echo = false, result = hide>>=
normByiDEGES <- range(apply(normalized.count[nonDEG, ], 2, median))
@
For comparison, the summary statistics for TMM-normalized data produced using the original normalization method (i.e., TMM) in \Rpackage{edgeR} are obtained as follows.
<<>>=
library(TCC)
data(hypoData_mg)
nonDEG <- 201:1000
group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
d <- DGEList(counts = hypoData_mg, group = group)
d <- calcNormFactors(d)
norm.factors <- d$samples$norm.factors
effective.libsizes <- colSums(hypoData_mg) * norm.factors
normalized.count <- sweep(hypoData_mg, 2, mean(effective.libsizes) / effective.libsizes, "*")
apply(normalized.count[nonDEG, ], 2, median)
range(apply(normalized.count[nonDEG, ], 2, median))
@
<<echo = FALSE, results = hide>>=
normByTMM <- range(apply(normalized.count[nonDEG, ], 2, median))
@
It is obvious that the summary statistics (ranging from $\Sexpr{sprintf("%.5f", min(normByiDEGES))}$ to $\Sexpr{sprintf("%.5f", max(normByiDEGES))}$) from iDEGES/edgeR-normalized data are closer to the truth (i.e., ranging from $\Sexpr{sprintf("%.1f", min(hypoData_mg.median))}$ to $\Sexpr{sprintf("%.1f", max(hypoData_mg.median))}$) than those (ranging from $\Sexpr{sprintf("%.5f", min(normByTMM))}$ to $\Sexpr{sprintf("%.5f", max(normByTMM))}$) from TMM-normalized data.

\newpage

\section{Differential expression (DE)}
\label{s_differential_expression}
The particular feature of \Rpackage{TCC} is that it calculates robust normalization factors. In addition, end users often need accessory functions for the subsequent analyses. Here, we provide the \Rfunction{estimateDE} function for identifying DEGs. Specifically, the function internally uses the corresponding functions implemented in the three packages. Similar to the \Robject{test.method} argument of the \Rfunction{calcNormFactors} function in \Rpackage{TCC}, the DE methods in \Rpackage{edgeR}, \Rpackage{DESeq}, and \Rpackage{baySeq} can be performed by using the \Rfunction{estimateDE} function with \Robject{test.method = "edger"}, \Robject{"deseq"}, and \Robject{"bayseq"}, respectively. Here, we show some examples of DE analysis for two-group data with replicates (\ref{section-4-1}), two-group data without replicates (\ref{section-4-2}), paired two-group data without replicates (\ref{ss_de_2gnr_p}), and multi-group data with replicates (\ref{section-4-3}).

\subsection{DE analysis for two-group data with replicates}
\label{section-4-1}
\subsubsection{edgeR coupled with iDEGES/edgeR normalization}
\label{section-4-1-1}
We give a procedure for DE analysis using the exact test implemented in \Rpackage{edgeR} together with iDEGES/edgeR normalization factors (i.e., the iDEGES/edgeR-edgeR combination) for the hypothetical two-group count data with replicates (i.e., the \Robject{hypoData} object). If the user wants to determine the genes having an FDR threshold of $< 10\%$ as DEGs, one can do as follows.
<<>>=
library(TCC)
data(hypoData)
group <- c(1, 1, 1, 2, 2, 2)
tcc <- new("TCC", hypoData, group)
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger",
                       iteration = 3, FDR = 0.1, floorPDEG = 0.05)
tcc <- estimateDE(tcc, test.method = "edger", FDR = 0.1)
@
The results of the DE analysis are stored in the \Rpackage{TCC} class object. The summary statistics for top-ranked genes can be retrieved by using the \Rfunction{getResult} function.
<<>>=
result <- getResult(tcc, sort = TRUE)
head(result)
@
The DE results can be broken down as follows.
<<>>=
table(tcc$estimatedDEG)
@
This means $\Sexpr{sum(tcc$estimatedDEG == FALSE)}$ non-DEGs and $\Sexpr{sum(tcc$estimatedDEG == TRUE)}$ DEGs satisfy FDR $< 0.1$.
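If the gene IDs of the estimated DEGs themselves are required (e.g., for downstream annotation), they can be extracted from the 0/1 calls stored in the \Robject{tcc\$estimatedDEG} field. The following is a minimal sketch (not executed here), assuming that the row names of \Robject{tcc\$count} hold the gene IDs, as in \Robject{hypoData}:
\begin{Schunk}
\begin{Sinput}
> deg.ids <- rownames(tcc$count)[tcc$estimatedDEG == 1]
> head(deg.ids)
\end{Sinput}
\end{Schunk}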
The \Rfunction{plot} function generates an M-A plot, where "M" indicates the log-ratio (i.e., $\mbox{M} = \log_{2}\mbox{G2} - \log_{2}\mbox{G1}$) and "A" indicates the average read count (i.e., $\mbox{A} = (\log_{2}\mbox{G2} + \log_{2}\mbox{G1}) / 2$), from the normalized count data. The magenta points indicate the identified DEGs at FDR $< 0.1$.
\begin{center}
<<fig = TRUE, png = TRUE, pdf = FALSE, eps = FALSE>>=
plot(tcc)
@
\end{center}
Since we know the true information for \Robject{hypoData} (i.e., 200 DEGs and 800 non-DEGs), we can calculate the area under the ROC curve (i.e., AUC; $0 \le$ AUC $\le 1$) between the ranked gene list and the truth and thereby evaluate the sensitivity and specificity simultaneously. A well-ranked gene list should have a high AUC value (i.e., high sensitivity and specificity).
<<3498757592>>=
library(ROC)
truth <- c(rep(1, 200), rep(0, 800))
AUC(rocdemo.sca(truth = truth, data = -tcc$stat$rank))
@
For comparison, the DE results from the original procedure in \Rpackage{edgeR} can be obtained as follows.
<<5901287562>>=
library(TCC)
data(hypoData)
group <- c(1, 1, 1, 2, 2, 2)
d <- DGEList(counts = hypoData, group = group)
d <- calcNormFactors(d)
d <- estimateCommonDisp(d)
d <- estimateTagwiseDisp(d)
out <- exactTest(d)
result <- as.data.frame(topTags(out, n = nrow(hypoData), sort.by = "PValue"))
head(result)
@
This is the same as
<<1027518347>>=
library(TCC)
data(hypoData)
group <- c(1, 1, 1, 2, 2, 2)
tcc <- new("TCC", hypoData, group)
tcc <- calcNormFactors(tcc, norm.method = "tmm", iteration = 0)
tcc <- estimateDE(tcc, test.method = "edger", FDR = 0.1)
result <- getResult(tcc, sort = TRUE)
head(result)
AUC(rocdemo.sca(truth = truth, data = -tcc$stat$rank))
@

\subsubsection{DESeq2 coupled with iDEGES/DESeq2 normalization}
\label{section-4-1-2}
We give a procedure for DE analysis using the Wald test for GLMs implemented in \Rpackage{DESeq2} together with iDEGES/DESeq2 normalization factors (i.e., the iDEGES/DESeq2-DESeq2 combination) for the hypothetical two-group count data with replicates (i.e., the \Robject{hypoData} object). If the user wants to determine the genes having an FDR threshold of $< 10$\% as DEGs, one can do as follows.
<<5408289011>>=
library(TCC)
data(hypoData)
group <- c(1, 1, 1, 2, 2, 2)
tcc <- new("TCC", hypoData, group)
tcc <- calcNormFactors(tcc, norm.method = "deseq2", test.method = "deseq2",
                       iteration = 3, FDR = 0.1, floorPDEG = 0.05)
tcc <- estimateDE(tcc, test.method = "deseq2", FDR = 0.1)
@
The results of the DE analysis are stored in the \Robject{TCC} class object. The summary statistics for top-ranked genes can be retrieved by using the \Rfunction{getResult} function.
<<5029042481>>=
result <- getResult(tcc, sort = TRUE)
head(result)
@
The DE results can be broken down as follows.
<<7512913498>>=
table(tcc$estimatedDEG)
@
<<7512013498, echo = FALSE>>=
tb <- table(tcc$estimatedDEG)
@
This means $\Sexpr{tb[["0"]]}$ non-DEGs and $\Sexpr{tb[["1"]]}$ DEGs satisfy FDR $< 0.1$. The \Rfunction{plot} function generates an M-A plot, where "M" indicates the log-ratio (i.e., $\mbox{M} = \log_{2}\mbox{G2} - \log_{2}\mbox{G1}$) and "A" indicates the average read count (i.e., $\mbox{A} = (\log_{2}\mbox{G2} + \log_{2}\mbox{G1}) / 2$), from the normalized count data. The magenta points indicate the identified DEGs at FDR $< 0.1$.
\begin{center}
<<3489400103, fig = TRUE, size = 0.4, png = TRUE, pdf = FALSE, eps = FALSE>>=
plot(tcc)
@
\end{center}
Since we know the true information for \Robject{hypoData} (i.e., 200 DEGs and 800 non-DEGs), we can calculate the area under the ROC curve (i.e., AUC; $0 \le$ AUC $\le 1$) between the ranked gene list and the truth and thereby evaluate the sensitivity and specificity simultaneously. A well-ranked gene list should have a high AUC value (i.e., high sensitivity and specificity).
<<4012399910>>=
library(ROC)
truth <- c(rep(1, 200), rep(0, 800))
AUC(rocdemo.sca(truth = truth, data = -tcc$stat$rank))
@
For comparison, the DE results from the original procedure in \Rpackage{DESeq2} can be obtained as follows.
<<4930011190>>=
library(TCC)
data(hypoData)
group <- factor(c(1, 1, 1, 2, 2, 2))
cl <- data.frame(group = group)
design <- formula(~ group)
dds <- DESeqDataSetFromMatrix(countData = hypoData, colData = cl, design = design)
dds <- estimateSizeFactors(dds)
sizeFactors(dds) <- sizeFactors(dds) / mean(sizeFactors(dds))
dds <- estimateDispersions(dds)
dds <- nbinomWaldTest(dds)
result <- results(dds)
head(result[order(result$pvalue), ])
library(ROC)
truth <- c(rep(1, 200), rep(0, 800))
result$pvalue[is.na(result$pvalue)] <- 1
AUC(rocdemo.sca(truth = truth, data = -rank(result$pvalue)))
@
This is the same as
<<4390023901>>=
library(TCC)
data(hypoData)
group <- c(1, 1, 1, 2, 2, 2)
tcc <- new("TCC", hypoData, group)
tcc <- calcNormFactors(tcc, norm.method = "deseq2", iteration = 0)
tcc <- estimateDE(tcc, test.method = "deseq2", FDR = 0.1)
result <- getResult(tcc, sort = TRUE)
head(result)
library(ROC)
truth <- c(rep(1, 200), rep(0, 800))
AUC(rocdemo.sca(truth = truth, data = -tcc$stat$rank))
tcc$norm.factors
@
<<4090911011, echo = FALSE>>=
buff_1 <- AUC(rocdemo.sca(truth = truth, data = -tcc$stat$rank))
@
As described in \ref{s_retr_norm_data}, the size factors termed in \Rpackage{DESeq2} and \Rpackage{DESeq} are comparable to the {\it normalized} effective library sizes termed in \Rpackage{TCC} and \Rpackage{edgeR}. The effective library size in \Rpackage{edgeR} is calculated as the library size multiplied by the normalization factor. The normalization factors and effective library sizes in \Rpackage{DESeq2} can be retrieved as follows.
<<0309001992>>=
library(TCC)
data(hypoData)
group <- factor(c(1, 1, 1, 2, 2, 2))
cl <- data.frame(group = group)
design <- formula(~ group)
dds <- DESeqDataSetFromMatrix(countData = hypoData, colData = cl, design = design)
dds <- estimateSizeFactors(dds)
size.factors <- sizeFactors(dds)
size.factors
hoge <- size.factors / colSums(hypoData)
norm.factors <- hoge / mean(hoge)
norm.factors
ef.libsizes <- norm.factors * colSums(hypoData)
ef.libsizes
@
Note that the following commands should be the simplest procedure provided in \Rpackage{DESeq2}.
<<4328593702>>=
library(TCC)
data(hypoData)
group <- factor(c(1, 1, 1, 2, 2, 2))
cl <- data.frame(group = group)
design <- formula(~ group)
dds <- DESeqDataSetFromMatrix(countData = hypoData, colData = cl, design = design)
dds <- estimateSizeFactors(dds)
dds <- estimateDispersions(dds)
dds <- nbinomWaldTest(dds)
result <- results(dds)
head(result[order(result$pvalue), ])
library(ROC)
truth <- c(rep(1, 200), rep(0, 800))
result$pvalue[is.na(result$pvalue)] <- 1
AUC(rocdemo.sca(truth = truth, data = -rank(result$pvalue)))
@

\subsection{DE analysis for two-group data without replicates}
\label{section-4-2}
It is important to keep in mind that most R packages (including \Rpackage{edgeR}, \Rpackage{DESeq}, \Rpackage{DESeq2}, and \Rpackage{baySeq}) are primarily designed for analyzing data with biological replicates, because the biological variability has to be accurately estimated to avoid spurious DE calls (Glaus et al., 2012). We here describe two DE analysis pipelines using the subset of \Robject{hypoData} consisting of the first and the fourth columns, i.e.,
<<4893438912>>=
library(TCC)
data(hypoData)
group <- c(1, 2)
tcc <- new("TCC", hypoData[, c(1, 4)], group)
head(tcc$count)
tcc$group
@

\subsubsection{edgeR coupled with iDEGES/edgeR normalization}
We give a procedure for DE analysis using the exact test implemented in \Rpackage{edgeR} together with iDEGES/edgeR normalization factors (i.e., the iDEGES/edgeR-edgeR combination) for the hypothetical two-group count data without replicates (i.e., the subset of the \Robject{hypoData} object). If the user wants to determine the genes having an FDR threshold of $< 10$\% as DEGs, one can do as follows.
<<8439100943>>=
library(TCC)
data(hypoData)
group <- c(1, 2)
tcc <- new("TCC", hypoData[, c(1, 4)], group)
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger",
                       iteration = 3, FDR = 0.1, floorPDEG = 0.05)
tcc <- estimateDE(tcc, test.method = "edger", FDR = 0.1)
@
The results of the DE analysis are stored in the \Robject{TCC} class object. The summary statistics for top-ranked genes can be retrieved by using the \Rfunction{getResult} function.
<<4390087612>>=
result <- getResult(tcc, sort = TRUE)
head(result)
@
The DE results can be broken down as follows.
<<0238735032>>=
table(tcc$estimatedDEG)
@
<<0233731032, echo = FALSE>>=
tb <- table(tcc$estimatedDEG)
@
This means $\Sexpr{as.numeric(tb[["0"]])}$ non-DEGs and $\Sexpr{as.numeric(tb[["1"]])}$ DEGs satisfy FDR $< 0.1$. The \Rfunction{plot} function generates an M-A plot, where "M" indicates the log-ratio (i.e., $\mbox{M} = \log_{2}\mbox{G2} - \log_{2}\mbox{G1}$) and "A" indicates the average read count (i.e., $\mbox{A} = (\log_{2}\mbox{G2} + \log_{2}\mbox{G1}) / 2$), from the normalized count data. The magenta points indicate the identified DEGs at FDR $< 0.1$.
\begin{center}
<<3489320103, fig = TRUE, size = 0.4, png = TRUE, pdf = FALSE, eps = FALSE>>=
plot(tcc)
@
\end{center}
Since we know the true information for \Robject{hypoData} (i.e., 200 DEGs and 800 non-DEGs), we can calculate the area under the ROC curve (i.e., AUC; $0 \le$ AUC $\le 1$) between the ranked gene list and the truth and thereby evaluate the sensitivity and specificity simultaneously. A well-ranked gene list should have a high AUC value (i.e., high sensitivity and specificity).
<<5702784221>>=
library(ROC)
truth <- c(rep(1, 200), rep(0, 800))
AUC(rocdemo.sca(truth = truth, data = -tcc$stat$rank))
@
For comparison, the DE results from the original procedure in \Rpackage{edgeR} can be obtained as follows.
<<4890357892>>=
library(TCC)
data(hypoData)
group <- c(1, 2)
d <- DGEList(counts = hypoData[, c(1, 4)], group = group)
d <- calcNormFactors(d)
d <- estimateGLMCommonDisp(d, method = "deviance", robust = TRUE, subset = NULL)
out <- exactTest(d)
result <- as.data.frame(topTags(out, n = nrow(hypoData), sort.by = "PValue"))
head(result)
@
This is the same as
<<4741957232>>=
library(TCC)
data(hypoData)
group <- c(1, 2)
tcc <- new("TCC", hypoData[, c(1, 4)], group)
tcc <- calcNormFactors(tcc, norm.method = "tmm", iteration = 0)
tcc <- estimateDE(tcc, test.method = "edger", FDR = 0.1)
result <- getResult(tcc, sort = TRUE)
head(result)
library(ROC)
truth <- c(rep(1, 200), rep(0, 800))
AUC(rocdemo.sca(truth = truth, data = -tcc$stat$rank))
@

\subsubsection{DESeq2 coupled with iDEGES/DESeq2 normalization}
We give a procedure for DE analysis using the likelihood ratio test for GLMs implemented in \Rpackage{DESeq2} together with iDEGES/DESeq2 normalization factors (i.e., the iDEGES/DESeq2-DESeq2 combination) for the hypothetical two-group count data without replicates (i.e., the subset of the \Robject{hypoData} object). If the user wants to determine the genes having an FDR threshold of $< 10$\% as DEGs, one can do as follows.
<<7400039021>>=
library(TCC)
data(hypoData)
group <- c(1, 2)
tcc <- new("TCC", hypoData[, c(1, 4)], group)
tcc <- calcNormFactors(tcc, norm.method = "deseq2", test.method = "deseq2",
                       iteration = 3, FDR = 0.1, floorPDEG = 0.05)
tcc <- estimateDE(tcc, test.method = "deseq2", FDR = 0.1)
@
The results of the DE analysis are stored in the \Robject{TCC} class object. The summary statistics for top-ranked genes can be retrieved by using the \Rfunction{getResult} function.
<<4890184102>>=
result <- getResult(tcc, sort = TRUE)
head(result)
@
The DE results can be broken down as follows.
<<4901822901>>=
table(tcc$estimatedDEG)
@
<<4911822001, echo = FALSE>>=
tb <- table(tcc$estimatedDEG)
@
This means $1000$ non-DEGs and $0$ DEGs satisfy FDR $< 0.1$. The \Rfunction{plot} function generates an M-A plot, where "M" indicates the log-ratio (i.e., $\mbox{M} = \log_{2}\mbox{G2} - \log_{2}\mbox{G1}$) and "A" indicates the average read count (i.e., $\mbox{A} = (\log_{2}\mbox{G2} + \log_{2}\mbox{G1}) / 2$), from the normalized count data. The magenta points indicate the identified DEGs at FDR $< 0.1$.
\begin{center}
<<3387482103, fig = TRUE, size = 0.4, png = TRUE, pdf = FALSE, eps = FALSE>>=
plot(tcc)
@
\end{center}
Since we know the true information for \Robject{hypoData} (i.e., 200 DEGs and 800 non-DEGs), we can calculate the area under the ROC curve (i.e., AUC; $0 \le$ AUC $\le 1$) between the ranked gene list and the truth and thereby evaluate the sensitivity and specificity simultaneously. A well-ranked gene list should have a high AUC value (i.e., high sensitivity and specificity).
<<4919187780>>=
library(ROC)
truth <- c(rep(1, 200), rep(0, 800))
AUC(rocdemo.sca(truth = truth, data = -tcc$stat$rank))
@
For comparison, the DE results from the original procedure in \Rpackage{DESeq2} can be obtained as follows.
<<4929100782>>=
library(TCC)
data(hypoData)
group <- factor(c(1, 2))
cl <- data.frame(group = group)
design <- formula(~ group)
dds <- DESeqDataSetFromMatrix(countData = hypoData[, c(1, 4)], colData = cl, design = design)
dds <- estimateSizeFactors(dds)
sizeFactors(dds) <- sizeFactors(dds) / mean(sizeFactors(dds))
dds <- estimateDispersions(dds)
dds <- nbinomLRT(dds, reduced = ~ 1)
result <- results(dds)
head(result[order(result$pvalue), ])
library(ROC)
truth <- c(rep(1, 200), rep(0, 800))
result$pvalue[is.na(result$pvalue)] <- 1
AUC(rocdemo.sca(truth = truth, data = -rank(result$pvalue)))
@
This is the same as
<<8787372620>>=
library(TCC)
data(hypoData)
group <- c(1, 2)
tcc <- new("TCC", hypoData[, c(1, 4)], group)
tcc <- calcNormFactors(tcc, norm.method = "deseq2", iteration = 0)
tcc <- estimateDE(tcc, test.method = "deseq2", FDR = 0.1)
result <- getResult(tcc, sort = TRUE)
head(result)
library(ROC)
truth <- c(rep(1, 200), rep(0, 800))
AUC(rocdemo.sca(truth = truth, data = -tcc$stat$rank))
@

\subsection{DE analysis for two-group data without replicates (paired)}
\label{ss_de_2gnr_p}
\Rpackage{edgeR} and \Rpackage{DESeq} (and \Rpackage{DESeq2}) employ generalized linear models (GLMs), which can be applied to detect DEGs in paired two-group count data. The analysis for paired samples can be performed by indicating (1) the pair information when constructing the \Robject{TCC} class object and (2) \Robject{paired = TRUE} when performing the \Rfunction{calcNormFactors} and \Rfunction{estimateDE} functions. For analyzing paired data, we here use the hypothetical count data (\Robject{hypoData}; for details see \ref{ss_reading_the_count_data}) by changing the label information, i.e.,
<<0285012872>>=
data(hypoData)
colnames(hypoData) <- c("T_dogA", "T_dogB", "T_dogC", "N_dogA", "N_dogB", "N_dogC")
head(hypoData)
@
This count data consists of three individuals (or sibships): dogA, dogB, and dogC. "T" and "N" indicate tumor and normal tissues, respectively.

\subsubsection{edgeR coupled with iDEGES/edgeR normalization}
We give a procedure for DE analysis using the likelihood ratio test for GLMs implemented in \Rpackage{edgeR} together with iDEGES/edgeR normalization factors (i.e., the iDEGES/edgeR-edgeR combination) for the paired two-group count data without replicates (i.e., the above \Robject{hypoData} object). If the user wants to determine the genes having an FDR threshold of $< 10$\% as DEGs, one can do as follows.
<<4891209302>>=
library(TCC)
data(hypoData)
colnames(hypoData) <- c("T_dogA", "T_dogB", "T_dogC", "N_dogA", "N_dogB", "N_dogC")
group <- c(1, 1, 1, 2, 2, 2)
pair <- c(1, 2, 3, 1, 2, 3)
cl <- data.frame(group = group, pair = pair)
tcc <- new("TCC", hypoData, cl)
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger",
                       iteration = 3, FDR = 0.1, floorPDEG = 0.05, paired = TRUE)
tcc <- estimateDE(tcc, test.method = "edger", FDR = 0.1, paired = TRUE)
@
The results of the DE analysis are stored in the \Robject{TCC} class object. The summary statistics for top-ranked genes can be retrieved by using the \Rfunction{getResult} function.
<<4390911012>>=
result <- getResult(tcc, sort = TRUE)
head(result)
@
The DE results can be broken down as follows.
<<3909884931>>=
table(tcc$estimatedDEG)
@
<<3934884932, echo = FALSE>>=
tb <- table(tcc$estimatedDEG)
@
This means $\Sexpr{as.numeric(tb[["0"]])}$ non-DEGs and $\Sexpr{as.numeric(tb[["1"]])}$ DEGs satisfy FDR $< 0.1$.
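Conceptually, specifying \Robject{paired = TRUE} corresponds to an additive GLM in which the \Robject{pair} label acts as a blocking factor. A minimal sketch of such a design matrix is shown below; the same formula also appears in the original \Rpackage{edgeR} procedure given later in this subsection:
\begin{Schunk}
\begin{Sinput}
> group <- factor(c(1, 1, 1, 2, 2, 2))
> pair <- factor(c(1, 2, 3, 1, 2, 3))
> model.matrix(~ group + pair)
\end{Sinput}
\end{Schunk}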
The \Rfunction{plot} function generates an M-A plot, where "M" indicates the log-ratio (i.e., $\mbox{M} = \log_{2}\mbox{G2} - \log_{2}\mbox{G1}$) and "A" indicates the average read count (i.e., $\mbox{A} = (\log_{2}\mbox{G2} + \log_{2}\mbox{G1}) / 2$), from the normalized count data. The magenta points indicate the identified DEGs at FDR $< 0.1$.
\begin{center}
<<3482340103, fig = TRUE, size = 0.4, png = TRUE, pdf = FALSE, eps = FALSE>>=
plot(tcc)
@
\end{center}
Since we know the true information for \Robject{hypoData} (i.e., 200 DEGs and 800 non-DEGs), we can calculate the area under the ROC curve (i.e., AUC; $0 \le$ AUC $\le 1$) between the ranked gene list and the truth and thereby evaluate the sensitivity and specificity simultaneously. A well-ranked gene list should have a high AUC value (i.e., high sensitivity and specificity).
<<3490930192>>=
library(ROC)
truth <- c(rep(1, 200), rep(0, 800))
AUC(rocdemo.sca(truth = truth, data = -tcc$stat$rank))
@
For comparison, the DE results from the original procedure in \Rpackage{edgeR} can be obtained as follows.
<<8402288128>>=
library(TCC)
data(hypoData)
colnames(hypoData) <- c("T_dogA", "T_dogB", "T_dogC", "N_dogA", "N_dogB", "N_dogC")
group <- factor(c(1, 1, 1, 2, 2, 2))
pair <- factor(c(1, 2, 3, 1, 2, 3))
design <- model.matrix(~ group + pair)
coef <- 2
d <- DGEList(counts = hypoData, group = group)
d <- calcNormFactors(d)
d <- estimateGLMCommonDisp(d, design)
d <- estimateGLMTrendedDisp(d, design)
d <- estimateGLMTagwiseDisp(d, design)
fit <- glmFit(d, design)
lrt <- glmLRT(fit, coef = coef)
topTags(lrt, n = 6)
p.value <- lrt$table$PValue
library(ROC)
truth <- c(rep(1, 200), rep(0, 800))
AUC(rocdemo.sca(truth = truth, data = -rank(p.value)))
@
This is the same as
<<4209576561>>=
library(TCC)
data(hypoData)
colnames(hypoData) <- c("T_dogA", "T_dogB", "T_dogC", "N_dogA", "N_dogB", "N_dogC")
group <- c(1, 1, 1, 2, 2, 2)
pair <- c(1, 2, 3, 1, 2, 3)
cl <- data.frame(group = group, pair = pair)
tcc <- new("TCC", hypoData, cl)
tcc <- calcNormFactors(tcc, norm.method = "tmm", iteration = 0, paired = TRUE)
tcc <- estimateDE(tcc, test.method = "edger", FDR = 0.1, paired = TRUE)
result <- getResult(tcc, sort = TRUE)
head(result)
library(ROC)
truth <- c(rep(1, 200), rep(0, 800))
AUC(rocdemo.sca(truth = truth, data = -tcc$stat$rank))
@

\subsection{DE analysis for multi-group data with replicates}
\label{section-4-3}
Here, we give three examples of DE analysis coupled with DEGES/edgeR normalization for the hypothetical three-group data with replicates, i.e., the \Robject{hypoData\_mg} object. The use of the DEGES/edgeR normalization factors is simply for reducing the computation time.

\subsubsection{baySeq coupled with DEGES/edgeR normalization}
\label{section-4-3-1}
The empirical Bayesian method implemented in \Rpackage{baySeq} after executing the DEGES/edgeR normalization (i.e., the DEGES/edgeR-baySeq combination) can be performed as follows.
<<>>=
library(TCC)
data(hypoData_mg)
group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
tcc <- new("TCC", hypoData_mg, group)
### Normalization ###
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger", iteration = 1)
### DE analysis ###
set.seed(1000)
samplesize <- 100
tcc <- estimateDE(tcc, test.method = "bayseq", FDR = 0.1, samplesize = samplesize)
result <- getResult(tcc, sort = TRUE)
head(result)
table(tcc$estimatedDEG)
@
It can be seen that the \Rpackage{baySeq} method identified $\Sexpr{sum(tcc$estimatedDEG == TRUE)}$ DEGs having FDR $< 0.1$.
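As with the two-group analyses above, the resulting ranking can be evaluated against the known truth of \Robject{hypoData\_mg} (the first $200$ genes are DEGs). The following is a sketch following the same ROC-based pattern, assuming the gene ranks are stored in \Robject{tcc\$stat\$rank} as in the other pipelines:
\begin{Schunk}
\begin{Sinput}
> library(ROC)
> truth <- c(rep(1, 200), rep(0, 800))
> AUC(rocdemo.sca(truth = truth, data = -tcc$stat$rank))
\end{Sinput}
\end{Schunk}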
One can obtain the number of DEGs with another threshold (e.g., FDR $< 0.2$) from the result object as follows.
<<>>=
sum(result$q.value < 0.2)
@
For \Rpackage{baySeq} users, we provide commands, consisting of functions in \Rpackage{baySeq}, to perform the DEG identification without the wrapper function in \Rpackage{TCC}. The \Rfunction{estimateDE} function with \Robject{test.method = "bayseq"} can be regarded as a wrapper function for the following commands after the DEGES/edgeR normalization.
<<>>=
set.seed(1000)
samplesize <- 100
effective.libsizes <- colSums(tcc$count) * tcc$norm.factors
groups <- list(NDE = rep(1, length(group)), DE = group)
cD <- new("countData", data = tcc$count, replicates = group,
          libsizes = effective.libsizes, groups = groups)
cD <- getPriors.NB(cD, samplesize = samplesize, estimation = "QL", cl = NULL)
cD <- getLikelihoods(cD, pET = "BIC", cl = NULL)
tmp <- topCounts(cD, group = "DE", number = nrow(tcc$count))
tmp <- tmp[rownames(tcc$count), ]
p.value <- 1 - tmp$Likelihood
q.value <- tmp$FDR
result <- cbind(p.value, q.value)
rownames(result) <- rownames(tmp)
head(result)
sum(q.value < 0.1)
sum(q.value < 0.2)
@

\subsubsection{edgeR coupled with DEGES/edgeR normalization}
\label{section-4-3-2}
The likelihood ratio test for GLMs implemented in \Rpackage{edgeR} after executing the DEGES/edgeR normalization (i.e., the DEGES/edgeR-edgeR combination) can be performed as follows.
<<>>=
library(TCC)
data(hypoData_mg)
group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
tcc <- new("TCC", hypoData_mg, group)
### Normalization ###
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger", iteration = 1)
### DE analysis ###
tcc <- estimateDE(tcc, test.method = "edger", FDR = 0.1)
result <- getResult(tcc, sort = TRUE)
head(result)
table(tcc$estimatedDEG)
@
Note that these DEGs having FDR $< 0.1$ display DE between any of the groups because the two arguments indicated here (\Robject{design} and \Robject{coef}) correspond to an ANOVA-like test for any differences provided in \Rpackage{edgeR}, i.e.,
<<>>=
library(TCC)
data(hypoData_mg)
group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
design <- model.matrix(~ as.factor(group))
coef <- 2:length(unique(group))
tcc <- new("TCC", hypoData_mg, group)
### Normalization ###
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger", iteration = 1)
### DE analysis ###
d <- DGEList(tcc$count, group = group)
d$samples$norm.factors <- tcc$norm.factors
d <- estimateGLMCommonDisp(d, design)
d <- estimateGLMTrendedDisp(d, design)
d <- estimateGLMTagwiseDisp(d, design)
fit <- glmFit(d, design)
lrt <- glmLRT(fit, coef = coef)
tmp <- as.data.frame(topTags(lrt, n = nrow(tcc$count)))
p.value <- tmp$PValue
q.value <- tmp$FDR
result <- cbind(p.value, q.value)
head(result)
sum(q.value < 0.1)
sum(q.value < 0.2)
@
As described in the \Rpackage{edgeR} manual, the second and third columns in the \Robject{design} object are relative to the baseline (i.e., Group 1 or G1): \Robject{coef = 2} means G2 vs. G1 and \Robject{coef = 3} means G3 vs. G1. The above procedure with the \Robject{coef} object (i.e., \Robject{2:length(unique(group))}) indicates both comparisons (i.e., G2 vs. G1 and G3 vs. G1) and identifies DEGs between any of the three groups. In other words, one can do any two-group comparison of interest from multi-group data with replicates. For example, the DE analysis for G3 vs. G1 together with DEGES/edgeR normalization can be performed as follows.
<<>>=
library(TCC)
data(hypoData_mg)
group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
tcc <- new("TCC", hypoData_mg, group)
### Normalization ###
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger", iteration = 1)
### DE analysis ###
coef <- 3
tcc <- estimateDE(tcc, test.method = "edger", FDR = 0.1, coef = coef)
result <- getResult(tcc, sort = TRUE)
head(result)
table(tcc$estimatedDEG)
@

\subsubsection{DESeq coupled with DEGES/edgeR normalization}
\label{section-4-3-3}
The NB test implemented in \Rpackage{DESeq} after executing the DEGES/edgeR normalization (i.e., the DEGES/edgeR-DESeq combination) can be performed as follows.
<<>>=
library(TCC)
data(hypoData_mg)
group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
tcc <- new("TCC", hypoData_mg, group)
### Normalization ###
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger", iteration = 1)
### DE analysis ###
full <- count ~ condition
reduced <- count ~ 1
tcc <- estimateDE(tcc, test.method = "deseq", FDR = 0.1, full = full, reduced = reduced)
result <- getResult(tcc, sort = TRUE)
head(result)
table(tcc$estimatedDEG)
@
For \Rpackage{DESeq} users, we provide commands, consisting of functions in \Rpackage{DESeq}, to perform the DEG identification without the wrapper function in \Rpackage{TCC}. The \Rfunction{estimateDE} function with \Robject{test.method = "deseq"} can be regarded as a wrapper function for the following commands after the DEGES/edgeR normalization.
<<>>=
library(TCC)
data(hypoData_mg)
group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
tcc <- new("TCC", hypoData_mg, group)
### Normalization ###
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger", iteration = 1)
### DE analysis ###
full <- count ~ condition
reduced <- count ~ 1
cds <- newCountDataSet(tcc$count, group)
cds <- estimateSizeFactors(cds)
sizeFactors(cds) <- sizeFactors(cds) / mean(sizeFactors(cds))
cds <- estimateDispersions(cds)
full.model <- fitNbinomGLMs(cds, full)
reduced.model <- fitNbinomGLMs(cds, reduced)
p.value <- nbinomGLMTest(full.model, reduced.model)
p.value[is.na(p.value)] <- 1
q.value <- p.adjust(p.value, method = "BH")
tmp <- cbind(p.value, q.value)
rownames(tmp) <- tcc$gene_id
result <- tmp[order(p.value), ]
head(result)
sum(q.value < 0.1)
sum(q.value < 0.2)
@

\section{Generation of simulation data}
\subsection{Introduction and basic usage}
\label{section-5-1}
As demonstrated in our previous study (Kadota et al., 2012), the DEGES-based normalization methods implemented in \Rpackage{TCC} theoretically outperform the other normalization methods when the numbers of DEGs (G1 vs. G2) in the tag count data are biased. However, it is difficult to determine whether the up- and down-regulated DEGs in one of the groups are actually biased in their number when analyzing real data (Dillies et al., 2013). This means we have to evaluate the potential performance of our DEGES-based methods mainly using simulation data. The \Rfunction{simulateReadCounts} function generates simulation data under various conditions. This function can generate the simulation data analyzed in the TbT paper (Kadota et al., 2012), enabling other researchers to compare the methods they develop with our DEGES-based methods. For example, the \Robject{hypoData} object, a hypothetical count dataset provided in \Rpackage{TCC}, was generated by using this function. The output of the \Rfunction{simulateReadCounts} function is stored as a \Rpackage{TCC} class object and is therefore ready-to-analyze.
Note that different trials of simulation analysis generally yield different count data even under the same simulation conditions. We can call the \Rfunction{set.seed} function in order to obtain reproducible results (i.e., the \Robject{tcc\$count}) with the \Rfunction{simulateReadCounts} function.
<<>>=
set.seed(1000)
library(TCC)
tcc <- simulateReadCounts(Ngene = 1000, PDEG = 0.2,
                          DEG.assign = c(0.9, 0.1),
                          DEG.foldchange = c(4, 4),
                          replicates = c(3, 3))
dim(tcc$count)
head(tcc$count)
tcc$group
@
The simulation conditions for comparing two groups (G1 vs. G2) with biological replicates are as follows: (i) the number of genes is $1,000$ (i.e., \Robject{Ngene = 1000}), (ii) the first $20\%$ of genes are DEGs (\Robject{PDEG = 0.2}), (iii) the first $90\%$ of the DEGs are up-regulated in G1, and the remaining $10\%$ are up-regulated in G2 (\Robject{DEG.assign = c(0.9, 0.1)}), (iv) the levels of DE are four-fold in both groups (\Robject{DEG.foldchange = c(4, 4)}), and (v) there are a total of six samples (three biological replicates for G1 and three biological replicates for G2) (\Robject{replicates = c(3, 3)}). The variance of the NB distribution can be modeled as $V = \mu + \phi\mu^{2}$. The empirical distribution of the read counts for producing the mean ($\mu$) and dispersion ($\phi$) parameters of the model was obtained from {\it Arabidopsis} data (three biological replicates for each of the treated and non-treated groups) in \Rpackage{NBPSeq} (Di et al., 2011). The \Robject{tcc\$count} object is essentially the same as the \Robject{hypoData} object of \Rpackage{TCC}. The information about the simulation conditions can be viewed as follows.
<<>>=
str(tcc$simulation)
@
Specifically, the entries for $0$ and $1$ in the \Robject{tcc\$simulation\$trueDEG} object are for non-DEGs and DEGs, respectively. The breakdowns for individual entries are the same as stated above: $800$ entries are non-DEGs and $200$ entries are DEGs.
<<>>=
table(tcc$simulation$trueDEG)
@
This information can be used to evaluate the performance of the DEGES-based normalization methods in terms of the sensitivity and specificity of the results of their DE analysis. A good normalization method coupled with a DE method such as the exact test (Robinson and Smyth, 2008) or the empirical Bayesian method (Hardcastle and Kelly, 2010) should produce well-ranked gene lists in which the true DEGs are top-ranked and non-DEGs are bottom-ranked when all genes are ranked according to the degree of DE. The ranked gene list after performing the DEGES/edgeR-edgeR combination can be obtained as follows.
<<>>=
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger",
                       iteration = 1, FDR = 0.1, floorPDEG = 0.05)
tcc <- estimateDE(tcc, test.method = "edger", FDR = 0.1)
result <- getResult(tcc, sort = TRUE)
head(result)
@
We can now calculate the area under the ROC curve (i.e., AUC; $0 \le$ AUC $\le 1$) between the ranked gene list and the truth (i.e., DEGs or non-DEGs) and thereby evaluate the sensitivity and specificity simultaneously. A well-ranked gene list should have a high AUC value (i.e., high sensitivity and specificity). The \Rfunction{calcAUCValue} function calculates the AUC value based on the information stored in the \Rpackage{TCC} class object.
<<>>=
calcAUCValue(tcc)
@
<<echo = FALSE, results = hide>>=
auc.degesedger <- calcAUCValue(tcc)
@
This is essentially the same as
<<>>=
AUC(rocdemo.sca(truth = as.numeric(tcc$simulation$trueDEG != 0),
                data = -tcc$stat$rank))
@
The following classic \Rpackage{edgeR} procedure (i.e., the TMM-edgeR combination) makes it clear that the DEGES-based normalization method (i.e., the DEGES/edgeR pipeline) outperforms the default normalization method (i.e., TMM) implemented in \Rpackage{edgeR}.
<<>>=
tcc <- calcNormFactors(tcc, norm.method = "tmm", iteration = 0)
tcc <- estimateDE(tcc, test.method = "edger", FDR = 0.1)
calcAUCValue(tcc)
@
The following is an alternative procedure for \Rpackage{edgeR} users.
<<>>=
d <- DGEList(counts = tcc$count, group = tcc$group$group)
d <- calcNormFactors(d)
d$samples$norm.factors <- d$samples$norm.factors / mean(d$samples$norm.factors)
d <- estimateCommonDisp(d)
d <- estimateTagwiseDisp(d)
result <- exactTest(d)
result$table$PValue[is.na(result$table$PValue)] <- 1
AUC(rocdemo.sca(truth = as.numeric(tcc$simulation$trueDEG != 0),
                data = -rank(result$table$PValue)))
@
<<echo = FALSE, results = hide>>=
set.seed(1000)
samplesize <- 100
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "bayseq",
                       iteration = 1, samplesize = samplesize)
tcc <- estimateDE(tcc, test.method = "edger", FDR = 0.1)
auc.degestbt <- calcAUCValue(tcc)
@
As can be expected from the similarity of the normalization factors of DEGES/TbT (\ref{section-3-1-1}) and DEGES/edgeR (\ref{section-3-1-2}), the AUC value ($\Sexpr{sprintf("%.7f", auc.degesedger)}$) of DEGES/edgeR is quite similar to the AUC value ($\Sexpr{sprintf("%.7f", auc.degestbt)}$) of the original TbT method (i.e., DEGES/TbT):
<<>>=
set.seed(1000)
samplesize <- 100
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "bayseq",
                       iteration = 1, samplesize = samplesize)
tcc <- estimateDE(tcc, test.method = "edger", FDR = 0.1)
calcAUCValue(tcc)
@

\subsection{Two-group data without replicates}
\label{section-5-2}
Let us generate tag count data without replicates, such as those used in section \ref{section-3-2}. For simplicity, we first generate simulation data whose conditions are essentially the same as those in the previous section (i.e., \ref{section-5-1}), except for the number of replicates in each group: (i) the number of genes is $1,000$ (i.e., \Robject{Ngene = 1000}), (ii) the first $20\%$ of genes are DEGs (\Robject{PDEG = 0.2}), (iii) the first $90\%$ of the DEGs are up-regulated in G1, and the remaining $10\%$ are up-regulated in G2 (\Robject{DEG.assign = c(0.9, 0.1)}), (iv) the levels of DE are four-fold in both groups (\Robject{DEG.foldchange = c(4, 4)}), and (v) there are a total of two samples (one from G1 and the other from G2) (\Robject{replicates = c(1, 1)}).
<<>>=
set.seed(1000)
library(TCC)
tcc <- simulateReadCounts(Ngene = 1000, PDEG = 0.2,
                          DEG.assign = c(0.9, 0.1),
                          DEG.foldchange = c(4, 4),
                          replicates = c(1, 1))
dim(tcc$count)
head(tcc$count)
tcc$group
@
Now let us see how the DEGES/DESeq-DESeq combination performs compared with the original DESeq-DESeq combination. First, we calculate the AUC value for the ranked gene list obtained from the DEGES/DESeq-DESeq combination.
<<>>=
tcc <- calcNormFactors(tcc, norm.method = "deseq", test.method = "deseq",
                       iteration = 1, FDR = 0.1, floorPDEG = 0.05)
tcc <- estimateDE(tcc, test.method = "deseq")
calcAUCValue(tcc)
@
Next, we calculate the corresponding value using the original \Rpackage{DESeq} procedure (i.e., the DESeq-DESeq combination).
<<>>=
tcc <- calcNormFactors(tcc, norm.method = "deseq", iteration = 0)
tcc <- estimateDE(tcc, test.method = "deseq")
calcAUCValue(tcc)
@
It can be seen that the DEGES/DESeq-DESeq combination outperforms the original procedure under the given simulation conditions. The following is an alternative approach for \Rpackage{DESeq} users.
<<>>=
cds <- newCountDataSet(tcc$count, tcc$group$group)
cds <- estimateSizeFactors(cds)
sizeFactors(cds) <- sizeFactors(cds) / mean(sizeFactors(cds))
cds <- estimateDispersions(cds, method = "blind", sharingMode = "fit-only")
result <- nbinomTest(cds, 1, 2)
result$pval[is.na(result$pval)] <- 1
AUC(rocdemo.sca(truth = as.numeric(tcc$simulation$trueDEG != 0),
                data = -rank(result$pval)))
@
This procedure is essentially the same as the one performed in \Rpackage{TCC}: the size factors are rescaled so that the resulting normalization factors are comparable to those of \Rpackage{edgeR} across the different packages. However, the following commands from the \Rpackage{DESeq} manual are of practical value because they give approximately the same AUC value as above.
<<>>=
cds <- newCountDataSet(tcc$count, tcc$group$group)
cds <- estimateSizeFactors(cds)
cds <- estimateDispersions(cds, method = "blind", sharingMode = "fit-only")
result <- nbinomTest(cds, 1, 2)
result$pval[is.na(result$pval)] <- 1
AUC(rocdemo.sca(truth = as.numeric(tcc$simulation$trueDEG != 0),
                data = -rank(result$pval)))
@

\subsection{Multi-group data with and without replicates}
\label{section-5-3}
The \Rfunction{simulateReadCounts} function can generate simulation data with a more complex design. First, we generate a dataset consisting of three groups. The simulation conditions for this dataset are as follows: (i) the number of genes is $1,000$ (i.e., \Robject{Ngene = 1000}), (ii) the first $30\%$ of genes are DEGs (\Robject{PDEG = 0.3}), (iii) the breakdowns of the up-regulated DEGs are respectively $70\%$, $20\%$, and $10\%$ in Groups 1-3 (\Robject{DEG.assign = c(0.7, 0.2, 0.1)}), (iv) the levels of DE are $3$-, $10$-, and $6$-fold in individual groups (\Robject{DEG.foldchange = c(3, 10, 6)}), and (v) there are a total of nine libraries (2, 4, and 3 replicates for Groups 1-3) (\Robject{replicates = c(2, 4, 3)}).
<<>>=
set.seed(1000)
library(TCC)
tcc <- simulateReadCounts(Ngene = 1000, PDEG = 0.3,
                          DEG.assign = c(0.7, 0.2, 0.1),
                          DEG.foldchange = c(3, 10, 6),
                          replicates = c(2, 4, 3))
dim(tcc$count)
tcc$group
head(tcc$count)
@
The pseudo-color image for the generated simulation data regarding the DEGs can be obtained from the \Rfunction{plotFCPseudocolor} function. The right bar (from white to magenta) indicates the degree of fold-change (FC). As expected, it can be seen that the first $210$, $60$, and $30$ genes are up-regulated in G1, G2, and G3, respectively.
\begin{center}
<<fig = TRUE, png = TRUE, pdf = FALSE, eps = FALSE>>=
plotFCPseudocolor(tcc)
@
\end{center}
Now let us see how the DEGES/edgeR-edgeR combination performs compared with the original edgeR-edgeR combination. First we calculate the AUC value for the ranked gene list obtained from the DEGES/edgeR-edgeR combination.
<<>>=
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger", iteration = 1)
tcc <- estimateDE(tcc, test.method = "edger", FDR = 0.1)
calcAUCValue(tcc)
@
Next, we calculate the corresponding value using the original \Rpackage{edgeR} procedure for single factor experimental design (i.e., the edgeR-edgeR combination).
<<>>=
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger", iteration = 0)
tcc <- estimateDE(tcc, test.method = "edger", FDR = 0.1)
calcAUCValue(tcc)
@
It can be seen that the DEGES/edgeR-edgeR combination outperforms the original \Rpackage{edgeR} procedure under the given simulation conditions. Note that the \Robject{test.method} argument will be ignored when \Robject{iteration = 0} is specified.

Next, let us generate another dataset consisting of a total of eight groups. The simulation conditions for this dataset are as follows: (i) the number of genes is $10,000$ (i.e., \Robject{Ngene = 10000}), (ii) the first $34\%$ of genes are DEGs (\Robject{PDEG = 0.34}), (iii) the breakdowns of the up-regulated DEGs are respectively $10\%$, $30\%$, $5\%$, $10\%$, $5\%$, $21\%$, $9\%$, and $10\%$ in Groups 1-8 (\Robject{DEG.assign = c(0.1, 0.3, 0.05, 0.1, 0.05, 0.21, 0.09, 0.1)}), (iv) the levels of DE are $3.1$-, $13$-, $2$-, $1.5$-, $9$-, $5.6$-, $4$-, and $2$-fold in individual groups (\Robject{DEG.foldchange = c(3.1, 13, 2, 1.5, 9, 5.6, 4, 2)}), and (v) there are a total of nine libraries (except for G3, none of the groups have replicates) (\Robject{replicates = c(1, 1, 2, 1, 1, 1, 1, 1)}).
\begin{center}
<<fig = TRUE, png = TRUE, pdf = FALSE, eps = FALSE>>=
set.seed(1000)
library(TCC)
tcc <- simulateReadCounts(Ngene = 10000, PDEG = 0.34,
                          DEG.assign = c(0.1, 0.3, 0.05, 0.1, 0.05, 0.21, 0.09, 0.1),
                          DEG.foldchange = c(3.1, 13, 2, 1.5, 9, 5.6, 4, 2),
                          replicates = c(1, 1, 2, 1, 1, 1, 1, 1))
dim(tcc$count)
tcc$group
head(tcc$count)
plotFCPseudocolor(tcc)
@
\end{center}
This kind of simulation data may be useful for evaluating methods aimed at identifying tissue-specific (or tissue-selective) genes.

\subsection{Multi-factor data}
\label{section-5-3-p1}
The \Rfunction{simulateReadCounts} function can also generate simulation data with a multi-factor experimental design. Unlike the single-factor experimental designs above, the \Robject{group} argument should be used instead of \Robject{replicates} for specifying sample conditions (or factors) when generating simulation data with a multi-factor design. In relation to the \Robject{group} specification, the \Robject{DEG.foldchange} argument should also be specified as a data frame object. We generate a dataset consisting of two factors for comparing (i) two groups (i.e., "WT" vs. "KO") as the first factor, at (ii) two time points (i.e., "1d" vs. "2d") as the second factor, with all samples obtained from independent subjects. There are a total of four conditions ("WT\_1d", "WT\_2d", "KO\_1d", and "KO\_2d"), each of which has two biological replicates, comprising a total of eight samples. The \Robject{group} argument for this experimental design can be described as follows:
<<>>=
group <- data.frame(
    GROUP = c("WT", "WT", "WT", "WT", "KO", "KO", "KO", "KO"),
    TIME  = c("1d", "1d", "2d", "2d", "1d", "1d", "2d", "2d")
)
@
Next, we design the number of types of DEGs and the levels of fold-change by the \Robject{DEG.foldchange} argument. We here introduce three types of DEGs: (a) 2-fold up-regulation in the first four samples (i.e., "WT"), (b) 3-fold up-regulation in the last four samples (i.e., "KO"), and (c) 2-fold down-regulation at "2d" in "WT" and 4-fold up-regulation at "2d" in "KO". This implies that the first two types of DEGs are related to the first factor (i.e., "WT" vs. "KO") and the third type of DEG is related to the second factor (i.e., "1d" vs. "2d").
<<>>=
DEG.foldchange <- data.frame(
    FACTOR1.1 = c(2, 2, 2, 2, 1, 1, 1, 1),
    FACTOR1.2 = c(1, 1, 1, 1, 3, 3, 3, 3),
    FACTOR2   = c(1, 1, 0.5, 0.5, 1, 1, 4, 4)
)
@
The other simulation conditions for this dataset are as follows: (1) the number of genes is $10,000$ (i.e., \Robject{Ngene = 10000}), (2) the first 20\% of genes are DEGs (i.e., \Robject{PDEG = 0.2}), and (3) the breakdowns of the three types of DEGs are 50\%, 20\%, and 30\% (i.e., \Robject{DEG.assign = c(0.5, 0.2, 0.3)}).
<<>>=
set.seed(1000)
tcc <- simulateReadCounts(Ngene = 10000, PDEG = 0.2,
                          DEG.assign = c(0.5, 0.2, 0.3),
                          DEG.foldchange = DEG.foldchange,
                          group = group)
@
Since the first six rows in the dataset correspond to the first type of DEGs, we can see the 2-fold up-regulation in the first four columns (i.e., WT-related samples) compared to the last four columns (i.e., KO-related samples).
\begin{center}
<<fig = TRUE, png = TRUE, pdf = FALSE, eps = FALSE>>=
head(tcc$count)
tcc$group
plotFCPseudocolor(tcc)
@
\end{center}

\subsection{Other utilities}
\label{section-5-4}
Recall that the simulation framework can handle different levels of DE for DEGs in individual groups, and the shape of the distribution for these DEGs is the same as that of non-DEGs. Let us confirm those distributions by introducing more drastic simulation conditions for comparing two groups (G1 vs. G2) with biological replicates; i.e., (i) the number of genes is $20,000$ (i.e., \Robject{Ngene = 20000}), (ii) the first $30\%$ of genes are DEGs (\Robject{PDEG = 0.30}), (iii) the first $85\%$ of the DEGs are up-regulated in G1 and the remaining $15\%$ are up-regulated in G2 (\Robject{DEG.assign = c(0.85, 0.15)}), (iv) the levels of DE are eight-fold in G1 and sixteen-fold in G2 (\Robject{DEG.foldchange = c(8, 16)}), and (v) there are a total of four samples (two biological replicates for G1 and two biological replicates for G2) (\Robject{replicates = c(2, 2)}).
<<>>=
set.seed(1000)
library(TCC)
tcc <- simulateReadCounts(Ngene = 20000, PDEG = 0.30,
                          DEG.assign = c(0.85, 0.15),
                          DEG.foldchange = c(8, 16),
                          replicates = c(2, 2))
head(tcc$count)
@
An M-A plot for the simulation data can be viewed as follows; the points for DEGs are colored red and the non-DEGs are colored black:
\begin{center}
<<fig = TRUE, png = TRUE, pdf = FALSE, eps = FALSE>>=
plot(tcc)
@
\end{center}
This plot is generated from simulation data that has been scaled in such a way that the library sizes of each sample are the same as the mean library size of the original data. That is,
<<>>=
normalized.count <- getNormalizedData(tcc)
colSums(normalized.count)
colSums(tcc$count)
mean(colSums(tcc$count))
@
<<echo = FALSE, results = hide>>=
xy <- plot(tcc)
isnot.na <- as.logical(xy[, 1] != min(xy[, 1]))
upG1 <- as.numeric(xy$m.value < 0)
upG2 <- as.numeric(xy$m.value > 0)
median.G1 <- median(xy[((tcc$simulation$trueDEG == 1) & isnot.na) & upG1, 2])
median.G2 <- median(xy[((tcc$simulation$trueDEG == 1) & isnot.na) & upG2, 2])
median.nonDEG <- median(xy[(tcc$simulation$trueDEG == 0) & isnot.na, 2])
@
The summary statistics for non-DEGs and up-regulated DEGs in G1 and G2 are upshifted compared with the original intentions of the user (i.e., respective M values of $0$, $-3$, and $4$ for non-DEGs and up-regulated DEGs in G1 and G2; since $\mbox{M} = \log_{2}\mbox{G2} - \log_{2}\mbox{G1}$, the eight-fold up-regulation in G1 corresponds to $\mbox{M} = -\log_{2}8 = -3$ and the sixteen-fold up-regulation in G2 to $\mbox{M} = \log_{2}16 = 4$). Indeed, the median values, indicated as horizontal lines, are respectively $\Sexpr{sprintf("%.3f", median.nonDEG)}$, $\Sexpr{sprintf("%.3f", median.G1)}$, and $\Sexpr{sprintf("%.3f", median.G2)}$ for non-DEGs and up-regulated DEGs in G1 and G2.
\begin{center}
<<fig = TRUE, png = TRUE, pdf = FALSE, eps = FALSE>>=
plot(tcc, median.lines = TRUE)
@
\end{center}
<<echo = FALSE, results = hide>>=
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger",
                       iteration = 3, FDR = 0.1, floorPDEG = 0.05)
xy <- plot(tcc)
isnot.na <- as.logical(xy[, 1] != min(xy[, 1]))
median.nonDEG <- median(xy[(tcc$simulation$trueDEG == 0) & isnot.na, 2])
@
These upshifted M values for non-DEGs can be corrected by performing the iDEGES/edgeR normalization; e.g., the median M value ($= \Sexpr{sprintf("%.3f", median.nonDEG)}$) for non-DEGs based on the iDEGES/edgeR-normalized data is nearly zero.
\begin{center}
<<fig = TRUE, png = TRUE, pdf = FALSE, eps = FALSE>>=
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger",
                       iteration = 3, FDR = 0.1, floorPDEG = 0.05)
plot(tcc, median.lines = TRUE)
@
\end{center}
The comparison of those values obtained from different normalization methods might be another evaluation metric.

\newpage
\section{Session info}
<<2390118599>>=
sessionInfo()
@

\newpage
\section{References}
\renewcommand{\refname}{}
\renewcommand\refname{\vskip -1cm}
\begin{thebibliography}{99}

\bibitem{anders} Anders S, Huber W. 2010. Differential expression analysis for sequence count data. Genome Biol 11: R106.

\bibitem{di} Di Y, Schafer DW, Cumbie JS, Chang JH. 2011. The NBP negative binomial model for assessing differential gene expression from RNA-Seq. Stat Appl Genet Mol Biol 10.

\bibitem{dillies} Dillies MA, Rau A, Aubert J, Hennequet-Antier C, Jeanmougin M, Servant N, Keime C, Marot G, Castel D, Estelle J, Guernec G, Jagla B, Jouneau L, Lalo\"e D, Le Gall C, Scha\"effer B, Le Crom S, Guedj M, Jaffr\'ezic F; French StatOmique Consortium. 2013. A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Brief Bioinform 14: 671-683.

\bibitem{glaus} Glaus P, Honkela A, and Rattray M. 2012. Identifying differentially expressed transcripts from RNA-seq data with biological variation. Bioinformatics 28: 1721-1728.

\bibitem{hardcastle} Hardcastle TJ and Kelly KA. 2010. baySeq: empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics 11: 422.

\bibitem{wad} Kadota K, Nakai Y, Shimizu K. 2008. A weighted average difference method for detecting differentially expressed genes from microarray data. Algorithms Mol Biol 3: 8.

\bibitem{aicentropy} Kadota K, Nishimura SI, Bono H, Nakamura S, Hayashizaki Y, Okazaki Y, Takahashi K. 2003. Detection of genes with tissue-specific expression patterns using Akaike's Information Criterion (AIC) procedure. Physiol Genomics 12: 251-259.

\bibitem{kadota} Kadota K, Nishiyama T, and Shimizu K. 2012. A normalization strategy for comparing tag count data. Algorithms Mol Biol 7: 5.

\bibitem{roku} Kadota K, Ye J, Nakai Y, Terada T, Shimizu K. 2006. ROKU: a novel method for identification of tissue-specific genes. BMC Bioinformatics 7: 294.

\bibitem{mccarthy} McCarthy DJ, Chen Y, Smyth GK. 2012. Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Res 40: 4288-4297.

\bibitem{robinson} Robinson MD, McCarthy DJ, Smyth GK. 2010. edgeR: A Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26: 139-140.

\bibitem{robinson2} Robinson MD and Oshlack A. 2010. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol 11: R25.

\bibitem{robinson3} Robinson MD and Smyth GK. 2008. Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics 9: 321-332.
\bibitem{sun} Sun J, Nishiyama T, Shimizu K, and Kadota K. 2013. TCC: an R package for comparing tag count data with robust normalization strategies. BMC Bioinformatics 14: 219.

\bibitem{ueda} Ueda T. 1996. Simple method for the detection of outliers. Japanese J Appl Stat 25: 17-26.

\end{thebibliography}
\end{document}