%\VignetteIndexEntry{TCC} %\VignettePackage{TCC} \documentclass{article} \usepackage[a4paper]{geometry} \usepackage{color} \usepackage{Sweave} \SweaveOpts{pdf=TRUE} \definecolor{TccBlue}{cmyk}{0.74,0.26,0,0.14} \definecolor{TccRed}{cmyk}{0,1,0.78,0.1} \definecolor{TccGreen}{cmyk}{0.99,0,0.29,0.47} \definecolor{TccOrange}{cmyk}{0,0.27,1,0.03} \renewcommand{\floatpagefraction}{0.9} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rfunction}[1]{{\Robject{#1}}} \newcommand{\Rpackage}[1]{\textbf{\texttt{#1}}} \newcommand{\Rclass}[1]{{\texttt{#1}}} \DefineVerbatimEnvironment{Sinput}{Verbatim}{fontshape=n,formatcom=\color{TccRed}} \DefineVerbatimEnvironment{Soutput}{Verbatim}{fontshape=n,formatcom=\color{TccBlue}} \author {\small Jianqiang Sun$^{1\S}$, Tomoaki Nishiyama$^{2\S}$, Kentaro Shimizu$^1$, and Koji Kadota$^1$\\ \texttt{\footnotesize $^1$ The University of Tokyo, Tokyo, Japan}\\ \texttt{\footnotesize $^2$ Kanazawa University, Kanazawa, Japan}\\ \texttt{\footnotesize $^\S$ Maintainer: Jianqiang Sun (wukong@bi.a.u-tokyo.ac.jp),}\\ \texttt{\footnotesize Tomoaki Nishiyama (tomoakin@staff.kanazawa-u.ac.jp)} } \title{TCC: Differential expression analysis for tag count data with robust normalization strategies} \begin{document} \maketitle\thispagestyle{empty} \begin{abstract} The R/Bioconductor package, \Rpackage{TCC}, provides users with a robust and accurate framework to perform differential expression (DE) analysis of tag count data. We recently developed a multi-step normalization method (TbT; Kadota et al., 2012 \cite{kadota}) for two-group RNA-seq data. The strategy (called DEGES) is to remove data that are potential differentially expressed genes (DEGs) before performing the data normalization. DEGES in \Rpackage{TCC} is essential for accurate normalization of tag count data, especially when the up- and down-regulated DEGs in one of the groups are extremely biased in their number. A major characteristic of \Rpackage{TCC} is to provide the DEGES-based normalization methods for several kinds of count data (two-group with or without replicates, multi-group, and so on) by virtue of the use of combinations of functions in other sophisticated packages (especially \Rpackage{edgeR}, \Rpackage{DESeq}, and \Rpackage{baySeq}). The appropriate combination provided by \Rpackage{TCC} allows a more robust and accurate estimation to be performed more easily than directly using original packages and \Rpackage{TCC} provides a simple unified interface to perform the robust normalization. \end{abstract} \newpage \tableofcontents \newpage \section{Introduction} Differential expression analysis based on tag count data has become a fundamental task for identifying differentially expressed genes or transcripts (DEGs). The \Rpackage{TCC} package (Tag Count Comparison; Sun et al., 2013 \cite{sun}) provides users with a robust and accurate framework to perform differential expression analysis of tag count data. \Rpackage{TCC} provides integrated analysis pipelines with improved data normalization steps, compared with other packages such as \Rpackage{edgeR}, \Rpackage{DESeq}, and \Rpackage{baySeq}, by appropriately combining their functionalities. The package incorporates multi-step normalization methods whose strategy is to remove data that are potential DEGs before performing the data normalization. Kadota et al. 
(2012) \cite{kadota} recently reported that the normalization methods implemented in R packages (such as \Rpackage{edgeR} (Robinson et al., 2010 \cite{robinson}), \Rpackage{DESeq} (Anders and Huber, 2010 \cite{anders}), and \Rpackage{baySeq} (Hardcastle and Kelly, 2010 \cite{hardcastle})) for differential expression (DE) analysis between samples are inadequate when the up- and down-regulated DEGs in one of the samples are extremely biased in their number (i.e., biased DE). This is because the current methods implicitly assume a balanced DE, wherein the numbers of highly and lowly expressed DE entities in samples are (nearly) equal. As a result, methods assuming unbiased DE will not work well on data with biased DE. Although a major purpose of data normalization is to detect such DE entities, their very existence consequently interferes with their opportunity to be top-ranked. Conventional two-step procedures for identifying DEGs from tag count data (i.e., data normalization followed by identification of DEGs) cannot in principle eliminate the potential DE entities before data normalization. To normalize data that potentially include various scenarios (both unbiased and biased DE), we recently proposed a multi-step normalization strategy (called TbT, an acronym for the TMM-\Rpackage{baySeq}-TMM pipeline; Kadota et al., 2012 \cite{kadota}), in which the TMM normalization method (Robinson and Oshlack, 2010 \cite{robinson2}) is used in steps 1 and 3 and an empirical Bayesian method implemented in the \Rpackage{baySeq} package (Hardcastle and Kelly, 2010 \cite{hardcastle}) is used in step 2. Although this multi-step DEG elimination strategy (called "DEGES" for short) can successfully remove potential DE entities identified in step 2 prior to the estimation of the normalization factors using the TMM normalization method in step 3, the \Rpackage{baySeq} package used in step 2 of the TbT method is much more computationally intensive than competing packages like \Rpackage{edgeR} and \Rpackage{DESeq}. While the three-step TbT normalization method performed best on simulated and real tag count data, it is practically possible to make different choices for the methods in each step. A more comprehensive study regarding better choices for DEGES is needed.

This package provides tools to perform multi-step normalization methods based on DEGES and enables differential expression analysis of tag count data without having to worry much about biased distributions of DEGs. The DEGES-based normalization functions implemented in \Rpackage{TCC} include the TbT method for two-group data with or without replicates, much faster alternatives, and methods for multi-group comparison. \Rpackage{TCC} provides a simple unified interface to perform data normalization with combinations of functions provided by \Rpackage{baySeq}, \Rpackage{DESeq}, and \Rpackage{edgeR}. Functions to produce simulation data under various conditions and to plot the data are also provided.

\subsection{Installation} \label{section-1-1}
This package is available from the Bioconductor website (http://bioconductor.org/). To install the package, enter the following command after starting R:
\begin{Schunk}
\begin{Sinput}
> source("http://bioconductor.org/biocLite.R")
> biocLite("TCC")
\end{Sinput}
\end{Schunk}

\subsection{Citations} \label{section-1-2}
This package internally uses many of the functions implemented in the other packages.
This is because our normalization procedures consist, in part, of combinations of existing normalization methods and differential expression (DE) methods. For example, the TbT normalization method (Kadota et al., 2012 \cite{kadota}), which is a particular functionality of the \Rpackage{TCC} package (Sun et al., 2013 \cite{sun}), consists of the TMM normalization method (Robinson and Oshlack, 2010 \cite{robinson2}) implemented in the \Rpackage{edgeR} package (Robinson et al., 2010 \cite{robinson}) and the empirical Bayesian method implemented in the \Rpackage{baySeq} package (Hardcastle and Kelly, 2010 \cite{hardcastle}). Therefore, please cite the appropriate references when you publish your results.
\begin{Schunk}
\begin{Sinput}
> citation("TCC")
\end{Sinput}
\end{Schunk}

\subsection{Quick start} \label{section-1-3}
Let us begin by showing two examples (Cases 1 and 2) of identifying DEGs between two groups from tag count data consisting of $1,000$ genes and a total of six samples (each group has three biological replicates). The hypothetical count data (termed "\Robject{hypoData}") is stored in this package (for details, see section \ref{section-2-1}). We then describe the DE analysis of count data without replicates (i.e., two samples), using the data of the first and the fourth columns of \Robject{hypoData} (Case 3). We recommend the use of the commands in Cases 2 and 3.

Case 1: DE analysis of two-group count data with replicates by using the exact test (Robinson and Smyth, 2008 \cite{robinson3}) in \Rpackage{edgeR} coupled with TbT normalization (termed the TbT-\Rpackage{edgeR} combination). The \Rpackage{TCC} package was originally designed with the TbT normalization method, and the original study (Kadota et al., 2012 \cite{kadota}) recommended this analysis pipeline. Note that a smaller sampling size (i.e., \Robject{samplesize = 100}) is used here to reduce the computation time, but a larger sampling size of around $10,000$ (i.e., \Robject{samplesize = 10000}) is recommended (Hardcastle and Kelly, 2010 \cite{hardcastle}). Suggested citations are as follows: \Rpackage{TCC} (Sun et al., 2013 \cite{sun}), TbT (Kadota et al., 2012 \cite{kadota}), TMM (Robinson and Oshlack, 2010 \cite{robinson2}), \Rpackage{baySeq} (Hardcastle and Kelly, 2010 \cite{hardcastle}), and \Rpackage{edgeR} (Robinson et al., 2010 \cite{robinson}). For details, see section \ref{section-3-1-1}.
<>=
library(TCC)
data(hypoData)
samplesize <- 100
group <- c(1, 1, 1, 2, 2, 2)
tcc <- new("TCC", hypoData, group)
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "bayseq",
                       iteration = 1, samplesize = samplesize)
tcc <- estimateDE(tcc, test.method = "edger", FDR = 0.1)
result <- getResult(tcc, sort = TRUE)
head(result)
@

Case 2: DE analysis for two-group count data with replicates by using the exact test coupled with iterative DEGES/\Rpackage{edgeR} normalization (i.e., the iDEGES/\Rpackage{edgeR}-\Rpackage{edgeR} combination). This is an alternative pipeline designed to reduce the runtime (approx. $20$ sec.), yet its performance is comparable to that of the above pipeline. Accordingly, we recommend using this pipeline as a default when analyzing tag count data with replicates. A notable advantage of this pipeline is that the multi-step normalization strategy only needs the methods implemented in the \Rpackage{edgeR} package.
The suggested citations are as follows: \Rpackage{TCC} (Sun et al., 2013 \cite{sun}), TMM (Robinson and Oshlack, 2010 \cite{robinson2}), the exact test (Robinson and Smyth, 2008 \cite{robinson3}), and \Rpackage{edgeR} (Robinson et al., 2010 \cite{robinson}). For details, see section \ref{section-3-1-3}. <>= library(TCC) data(hypoData) group <- c(1, 1, 1, 2, 2, 2) tcc <- new("TCC", hypoData, group) tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger", iteration = 3, FDR = 0.1, floorPDEG = 0.05) tcc <- estimateDE(tcc, test.method = "edger", FDR = 0.1) result <- getResult(tcc, sort = TRUE) head(result) @ Case 3: DE analysis for two-group count data without replicates by using the negative binomial (NB) test in \Rpackage{DESeq} coupled with iDEGES/\Rpackage{DESeq} normalization (i.e., the iDEGES/\Rpackage{DESeq}-\Rpackage{DESeq} combination). A procedure using the data of the first and fourth columns of \Robject{hypoData} is shown here. Similar to Case 2, this pipeline entirely consists of methods implemented in the \Rpackage{DESeq} package. Suggested citations are as follows: \Rpackage{TCC} (Sun et al., 2013 \cite{sun}) and \Rpackage{DESeq} (Anders and Huber, 2010 \cite{anders}). For details, see section \ref{section-3-2}. <>= library(TCC) data(hypoData) group <- c(1, 2) tcc <- new("TCC", hypoData[,c(1,4)], group) tcc <- calcNormFactors(tcc, norm.method = "deseq", test.method = "deseq", iteration = 3, FDR = 0.1, floorPDEG = 0.05) tcc <- estimateDE(tcc, test.method = "deseq", FDR = 0.1) result <- getResult(tcc, sort = TRUE) head(result) @ \newpage \section{Preparations} \subsection{Reading the count data} \label{section-2-1} Similar to the other packages, \Rpackage{TCC} typically starts the DE analysis with a count table matrix where each row indicates a gene (or transcript), each column indicates a sample (or library), and each cell indicates the number of counts for a gene in a sample. Here, we assume a hypothetical count matrix consisting of 1,000 rows (or genes) and a total of six columns (the first three columns are produced from biological replicates of Group 1 and the remaining columns are from Group 2); i.e., \{G1\_rep1, G1\_rep2, G1\_rep3\} vs. \{G2\_rep1, G2\_rep2, G2\_rep3\}. We start by loading the hypothetical data (\Robject{hypoData}) from \Rpackage{TCC} and giving a numeric vector (\Robject{group}) indicating which group each sample belongs to. <>= library(TCC) data(hypoData) head(hypoData) dim(hypoData) group <- c(1, 1, 1, 2, 2, 2) @ If you want to analyze another count matrix consisting of nine columns (e.g., the first four columns are produced from biological replicates of G1, and the remaining five columns are from G2), the \Robject{group} vector should be indicated as follows. <>= group <- c(1, 1, 1, 1, 2, 2, 2, 2, 2) @ \subsection{Constructing TCC class object} \label{section-2-2} The \Rfunction{new} function has to be used to perform the main functionalities of \Rpackage{TCC}. This function constructs a \Rpackage{TCC} class object, and subsequent analyses are performed on this class object. The object is constructed from i) a count matrix (\Robject{hypoData}) and ii) the corresponding numeric vector (\Robject{group}) as follows. <>= library(TCC) data(hypoData) group <- c(1, 1, 1, 2, 2, 2) tcc <- new("TCC", hypoData, group) tcc @ The count matrix and group vector information can be retrieved from the stored class object by using \Robject{tcc\$count} and \Robject{tcc\$group}, respectively. 
<>=
head(tcc$count)
tcc$group
@
A subset of the \Rpackage{TCC} class object can be obtained with the \Rfunction{subset} or \Rfunction{"["} functions.
<>=
dim(tcc$count)
tcc.sub1 <- subset(tcc, c(rep(TRUE, 20), rep(FALSE, 980)))
dim(tcc.sub1$count)
tcc.sub2 <- tcc[1:20]
dim(tcc.sub2$count)
@

\subsection{Filtering low-count genes (optional)} \label{section-2-3}
The way to filter out genes with low-count tags across samples depends on the user's philosophy. Although we recommend removing tags with zero counts across samples as a minimum filtering, this step is optional. The \Rfunction{filterLowCountGenes} function performs this filtering.
<>=
library(TCC)
data(hypoData)
group <- c(1, 1, 1, 2, 2, 2)
tcc <- new("TCC", hypoData, group)
tcc <- filterLowCountGenes(tcc)
dim(tcc$count)
@
It can be seen that $\Sexpr{nrow(hypoData) - nrow(tcc$count)} (= \Sexpr{nrow(hypoData)} - \Sexpr{nrow(tcc$count)})$ genes were filtered out as non-expressed. The same procedure can be performed without the \Rfunction{filterLowCountGenes} function, in which case the filtering is performed before the \Rpackage{TCC} class object is constructed.
<>=
filter <- as.logical(rowSums(hypoData) > 0)
dim(hypoData[filter, ])
tcc <- new("TCC", hypoData[filter, ], group)
dim(tcc$count)
@

\newpage

\section{Normalization}
\subsection{Normalization of two-group count data with replicates} \label{section-3-1}
This package provides robust normalization methods based on DEGES proposed by Kadota et al. (2012) \cite{kadota}. When obtaining normalization factors from two-group data with replicates, users can select a total of six combinations (two normalization methods $\times$ three DEG identification methods) coupled with an arbitrary number of iterations ($n = 0, 1, 2, \dots, 100$) in our DEGES-based normalization pipeline. We show some of the practical combinations below. Since the three-step TbT normalization method was originally designed for normalizing tag count data with (biological) replicates, we will first explain the TbT method (\ref{section-3-1-1} DEGES/TbT). In relation to the other DEGES-based methods, we will call the method "DEGES/TbT" for convenience. As mentioned in the original study, DEGES/TbT needs a long computation time. Accordingly, we present three faster alternatives (\ref{section-3-1-2} DEGES/\Rpackage{edgeR}, \ref{section-3-1-3} iDEGES/\Rpackage{edgeR}, and \ref{section-3-1-4} DEGES/\Rpackage{DESeq}). Note that the purpose here is to obtain accurate normalization factors to be used with statistical models (e.g., the exact test or the empirical Bayesian method) for the DE analysis described in the next section (\ref{section-4} \textbf{Differential expression}).

\subsubsection{DEGES/TbT} \label{section-3-1-1}
The DEGES/TbT method (Kadota et al., 2012 \cite{kadota}) with default parameter settings can be performed as follows.
<>=
set.seed(1000)
library(TCC)
data(hypoData)
samplesize <- 100
group <- c(1, 1, 1, 2, 2, 2)
tcc <- new("TCC", hypoData, group)
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "bayseq",
                       iteration = 1, samplesize = samplesize)
@
Note that a smaller sampling size (i.e., \Robject{samplesize = 100}) is used here to reduce the computation time when performing the empirical Bayesian method in step 2, but a larger sampling size of around $10,000$ (i.e., \Robject{samplesize = 10000}) is recommended (Hardcastle and Kelly, 2010 \cite{hardcastle}). This method estimates an empirical distribution of the parameters of the NB distribution by bootstrapping from the input data.
While the sampling size can be made smaller to reduce the computation time (e.g., \Robject{samplesize = 40}), the resulting normalization factors will vary from trial to trial. In this vignette, we will call the \Rfunction{set.seed} function for obtaining reproducible results (i.e., the \Robject{tcc\$norm.factors} values) whenever a random function is involved. The calculated normalization factors and the computation time can be retrieved with the following commands.
<>=
tcc$norm.factors
tcc$DEGES$execution.time
@
Of course, the procedure can be performed by using functions in \Rpackage{edgeR} and \Rpackage{baySeq}, instead of using the \Rfunction{calcNormFactors} function in \Rpackage{TCC}. The \Rfunction{calcNormFactors} function together with the above parameter settings can be regarded as a wrapper function for the following commands.
<>=
set.seed(1000)
library(TCC)
data(hypoData)
samplesize <- 100
group <- c(1, 1, 1, 2, 2, 2)
### STEP 1 ###
d <- DGEList(count = hypoData, group = group)
d <- calcNormFactors(d)
norm.factors <- d$samples$norm.factors
norm.factors <- norm.factors / mean(norm.factors)
### STEP 2 ###
cD <- new("countData", data = hypoData, replicates = group,
          groups = list(NDE = rep(1, length = length(group)), DE = group),
          libsizes = colSums(hypoData) * norm.factors)
cD <- getPriors.NB(cD, samplesize = samplesize, estimation = "QL", cl = NULL)
cD <- getLikelihoods.NB(cD, pET = "BIC", cl = NULL)
is.DEG <- as.logical(rank(-cD@posteriors[, "DE"]) < (nrow(hypoData) * cD@estProps[2]))
### STEP 3 ###
d <- DGEList(count = hypoData[!is.DEG, ], group = group)
d <- calcNormFactors(d)
norm.factors <- d$samples$norm.factors * colSums(hypoData[!is.DEG, ]) / colSums(hypoData)
norm.factors <- norm.factors / mean(norm.factors)
norm.factors
@

\subsubsection{DEGES/edgeR} \label{section-3-1-2}
Now let us describe an alternative approach that is roughly $200$-$400$ times faster than DEGES/TbT, yet has comparable performance. The TMM-\Rpackage{edgeR}-TMM pipeline (called DEGES/\Rpackage{edgeR}) employs the exact test implemented in \Rpackage{edgeR} in step 2. To use this pipeline, we have to provide a reasonable threshold for defining potential DEGs in step 2. We define the threshold as an arbitrary false discovery rate (FDR) with a floor value of $P_{\mbox{\tiny DEG}}$. The default FDR is $< 0.1$, and the default floor $P_{\mbox{\tiny DEG}}$ is $5\%$, but different choices are of course possible. For example, with the default settings, $x\%$ ($x > 5\%$) of the top-ranked potential DEGs are eliminated in step 2 if the percentage ($= x\%$) of genes satisfying FDR $< 0.1$ is over $5\%$. The DEGES/\Rpackage{edgeR} pipeline has a clear advantage over DEGES/TbT in terms of computation time. It can be performed as follows:
<>=
library(TCC)
data(hypoData)
group <- c(1, 1, 1, 2, 2, 2)
tcc <- new("TCC", hypoData, group)
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger",
                       iteration = 1, FDR = 0.1, floorPDEG = 0.05)
tcc$norm.factors
tcc$DEGES$execution.time
@
The normalization factors calculated from DEGES/\Rpackage{edgeR} are very similar to those of DEGES/TbT with the default parameter settings (i.e., \Robject{samplesize = 10000}). For \Rpackage{edgeR} users, we provide commands, consisting of functions in \Rpackage{edgeR}, to perform the DEGES/\Rpackage{edgeR} pipeline without \Rpackage{TCC}. The \Rfunction{calcNormFactors} function together with the above parameter settings can be regarded as a wrapper function for the following commands.
<>=
library(TCC)
data(hypoData)
group <- c(1, 1, 1, 2, 2, 2)
FDR <- 0.1
floorPDEG <- 0.05
d <- DGEList(counts = hypoData, group = group)
### STEP 1 ###
d <- calcNormFactors(d)
### STEP 2 ###
d <- estimateCommonDisp(d)
d <- estimateTagwiseDisp(d)
result <- exactTest(d)
q.value <- p.adjust(result$table$PValue, method = "BH")
if (sum(q.value < FDR) > (floorPDEG * nrow(hypoData))) {
  is.DEG <- as.logical(q.value < FDR)
} else {
  is.DEG <- as.logical(rank(result$table$PValue, ties.method = "min") <=
                       nrow(hypoData) * floorPDEG)
}
### STEP 3 ###
d <- DGEList(counts = hypoData[!is.DEG, ], group = group)
d <- calcNormFactors(d)
norm.factors <- d$samples$norm.factors * colSums(hypoData[!is.DEG, ]) / colSums(hypoData)
norm.factors <- norm.factors / mean(norm.factors)
norm.factors
@

\subsubsection{iDEGES/edgeR} \label{section-3-1-3}
Our multi-step normalization can be repeated until the calculated normalization factors converge (Kadota et al., 2012 \cite{kadota}). An iterative version of DEGES/TbT (i.e., iDEGES/TbT) can be described as the TMM-(\Rpackage{baySeq}-TMM)$_{n}$ pipeline with $n \ge 2$. Although iDEGES/TbT would not be practical in terms of the computation time, the TMM-(\Rpackage{edgeR}-TMM)$_{n}$ pipeline (iDEGES/\Rpackage{edgeR}) is potentially superior to both DEGES/\Rpackage{edgeR} and DEGES/TbT. A suggested iDEGES/\Rpackage{edgeR} implementation ($n = 3$) consists of seven steps, as follows:
<>=
library(TCC)
data(hypoData)
group <- c(1, 1, 1, 2, 2, 2)
tcc <- new("TCC", hypoData, group)
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger",
                       iteration = 3, FDR = 0.1, floorPDEG = 0.05)
tcc$norm.factors
tcc$DEGES$execution.time
@

\subsubsection{DEGES/DESeq} \label{section-3-1-4}
The DEGES pipeline can also be performed by using only the functions in the \Rpackage{DESeq} package. Similar to the \Rpackage{edgeR} case above, this \Rpackage{DESeq}-\Rpackage{DESeq}-\Rpackage{DESeq} pipeline (DEGES/\Rpackage{DESeq}) simply changes the corresponding \Robject{norm.method} and \Robject{test.method} arguments as follows:
<>=
library(TCC)
data(hypoData)
group <- c(1, 1, 1, 2, 2, 2)
tcc <- new("TCC", hypoData, group)
tcc <- calcNormFactors(tcc, norm.method = "deseq", test.method = "deseq",
                       iteration = 1, FDR = 0.1, floorPDEG = 0.05)
tcc$norm.factors
tcc$DEGES$execution.time
@
For \Rpackage{DESeq} users, we also provide commands, consisting of functions in \Rpackage{DESeq}, to perform the DEGES/\Rpackage{DESeq} pipeline without \Rpackage{TCC}. The \Rfunction{calcNormFactors} function together with the above arguments can be regarded as a wrapper function for the following commands.
<>=
library(TCC)
data(hypoData)
group <- c(1, 1, 1, 2, 2, 2)
FDR <- 0.1
floorPDEG <- 0.05
cds <- newCountDataSet(hypoData, group)
### STEP 1 ###
cds <- estimateSizeFactors(cds)
### STEP 2 ###
cds <- estimateDispersions(cds)
result <- nbinomTest(cds, 1, 2)
result$pval[is.na(result$pval)] <- 1
result$padj[is.na(result$padj)] <- 1
q.value <- result$padj
if (sum(q.value < FDR) > (floorPDEG * nrow(hypoData))) {
  is.DEG <- as.logical(q.value < FDR)
} else {
  is.DEG <- as.logical(rank(result$pval, ties.method = "min") <=
                       nrow(hypoData) * floorPDEG)
}
### STEP 3 ###
cds <- newCountDataSet(hypoData[!is.DEG, ], group)
cds <- estimateSizeFactors(cds)
norm.factors <- sizeFactors(cds) / colSums(hypoData)
norm.factors <- norm.factors / mean(norm.factors)
norm.factors
@

\subsection{Normalization of two-group count data without replicates} \label{section-3-2}
It is important to keep in mind that most R packages (including \Rpackage{edgeR}, \Rpackage{DESeq}, and \Rpackage{baySeq}) are primarily designed for analyzing data with biological replicates because the biological variability has to be accurately estimated to avoid spurious DE calls (Glaus et al., 2012 \cite{glaus}). In fact, the functions for the DEG identification method implemented in \Rpackage{edgeR} (i.e., the exact test; ver. 3.0.4) do not allow analysis without replicates, though the TMM normalization method in the package can be applied to data regardless of whether it has replicates. Although the \Rpackage{edgeR} manual provides users with some ideas on how to perform the DE analysis in this situation, it is difficult to customize such an analysis with DEGES for data without replicates. When obtaining normalization factors from two-group count data without replicates, users can select a total of four combinations (two normalization methods $\times$ two DEG identification methods) coupled with an arbitrary number of iterations ($n = 0, 1, 2, \dots, 100$) in our DEGES-based normalization pipeline. That is, the \Rfunction{calcNormFactors} function can be used with \Robject{norm.method = "deseq"} or \Robject{"tmm"} and \Robject{test.method = "deseq"} or \Robject{"bayseq"}. Let us explain the procedure by retrieving the data of the first and the fourth columns of \Robject{hypoData}, i.e.,
<>=
library(TCC)
data(hypoData)
group <- c(1, 2)
tcc <- new("TCC", hypoData[, c(1, 4)], group)
head(tcc$count)
tcc$group
@
A DEGES pipeline (DEGES/\Rpackage{DESeq}) for obtaining normalization factors is as follows.
<>=
tcc <- calcNormFactors(tcc, norm.method = "deseq", test.method = "deseq",
                       iteration = 1, FDR = 0.1, floorPDEG = 0.05)
tcc$norm.factors
@
An advantage of this DEGES/\Rpackage{DESeq} pipeline is that the multi-step normalization strategy only needs the methods in the \Rpackage{DESeq} package. These factors should be the same as those produced by the following procedure consisting of functions implemented in \Rpackage{DESeq}.
<>=
library(TCC)
data(hypoData)
group <- c(1, 2)
FDR <- 0.1
floorPDEG <- 0.05
cds <- newCountDataSet(hypoData[, c(1, 4)], group)
### STEP 1 ###
cds <- estimateSizeFactors(cds)
### STEP 2 ###
cds <- estimateDispersions(cds, method = "blind", sharingMode = "fit-only")
result <- nbinomTest(cds, 1, 2)
result$pval[is.na(result$pval)] <- 1
result$padj[is.na(result$padj)] <- 1
q.value <- result$padj
if (sum(q.value < FDR) > (floorPDEG * nrow(hypoData))) {
  is.DEG <- as.logical(q.value < FDR)
} else {
  is.DEG <- as.logical(rank(result$pval, ties.method = "min") <=
                       nrow(hypoData) * floorPDEG)
}
### STEP 3 ###
cds <- newCountDataSet(hypoData[!is.DEG, c(1, 4)], group)
cds <- estimateSizeFactors(cds)
norm.factors <- sizeFactors(cds) / colSums(hypoData[, c(1, 4)])
norm.factors <- norm.factors / mean(norm.factors)
norm.factors
@

\subsection{Normalization of multi-group count data with replicates} \label{section-3-3}
Many R packages (including \Rpackage{edgeR}, \Rpackage{DESeq}, and \Rpackage{baySeq}) support DE analysis for multi-group tag count data. \Rpackage{TCC} provides some prototypes of DEGES-based pipelines for such data. Here, we analyze another hypothetical three-group count matrix, the \Robject{hypoData\_mg} object, provided in \Rpackage{TCC}. It consists of $1,000$ genes and a total of nine columns for testing any difference among three groups that each have triplicates.
<>=
library(TCC)
data(hypoData_mg)
group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
tcc <- new("TCC", hypoData_mg, group)
tcc
dim(tcc$count)
@
Of the $1,000$ genes, the first $200$ genes are DEGs and the remaining $800$ genes are non-DEGs. The breakdowns for the $200$ DEGs are as follows: $140$, $40$, and $20$ DEGs are up-regulated in Groups 1, 2, and 3, respectively. Below, we show some DEGES-based normalization pipelines for this multi-group data (\ref{section-3-3-1} DEGES/TbT, \ref{section-3-3-2} DEGES/\Rpackage{edgeR}, and \ref{section-3-3-3} DEGES/\Rpackage{DESeq}).

\subsubsection{DEGES/TbT} \label{section-3-3-1}
The DEGES/TbT pipeline for multi-group data is essentially the same as that for two-group data with or without replicates. Note that a smaller sampling size (i.e., \Robject{samplesize = 100}) is used here to reduce the computation time, but a larger sampling size of around $10,000$ (i.e., \Robject{samplesize = 10000}) is recommended (Hardcastle and Kelly, 2010 \cite{hardcastle}).
<>=
set.seed(1000)
library(TCC)
data(hypoData_mg)
samplesize <- 100
group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
tcc <- new("TCC", hypoData_mg, group)
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "bayseq",
                       iteration = 1, samplesize = samplesize)
tcc$norm.factors
@

\subsubsection{DEGES/edgeR} \label{section-3-3-2}
\Rpackage{edgeR} employs generalized linear models (GLMs) to find DEGs between any of the groups. The DEGES/\Rpackage{edgeR} normalization pipeline in \Rpackage{TCC} internally uses functions for the GLM approach that require two models (a full model and a null model). The full model corresponds to a design matrix that describes the sample groups. The null model corresponds to the full model with the specified model coefficients (\Robject{coef}) set to zero. The two models can be defined as follows:
<>=
library(TCC)
data(hypoData_mg)
group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
tcc <- new("TCC", hypoData_mg, group)
design <- model.matrix(~ as.factor(group))
coef <- 2:length(unique(group))
@
The design matrix (\Robject{design}) can be constructed by using the \Rfunction{model.matrix} function. For the model coefficients (\Robject{coef}), the user should specify all the coefficients except for the intercept term.
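For reference, in this three-group example the design matrix has an intercept column (the Group 1 baseline) and one indicator column for each of Groups 2 and 3, so \Robject{coef} evaluates to \Robject{c(2, 3)}, i.e., the two non-intercept coefficients. Inspecting the two objects created above makes the parameterization explicit (shown here as display-only commands; the comments indicate the expected contents):
\begin{Schunk}
\begin{Sinput}
> colnames(design)  # "(Intercept)" "as.factor(group)2" "as.factor(group)3"
> coef              # 2 3: the coefficients tested against the Group 1 baseline
\end{Sinput}
\end{Schunk}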
The two models (\Robject{design} and \Robject{coef}) will automatically be generated when performing the following \Rfunction{calcNormFactors} function if they are not explicitly indicated.
<>=
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger",
                       iteration = 1)
tcc$norm.factors
@
For \Rpackage{edgeR} users, we provide commands, consisting of functions in \Rpackage{edgeR}, to perform the DEGES/\Rpackage{edgeR} pipeline without \Rpackage{TCC}. The \Rfunction{calcNormFactors} function together with the above parameter settings can be regarded as a wrapper function for the following commands.
<>=
library(TCC)
data(hypoData_mg)
group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
tcc <- new("TCC", hypoData_mg, group)
FDR <- 0.1
floorPDEG <- 0.05
design <- model.matrix(~ as.factor(group))
coef <- 2:length(unique(group))
d <- DGEList(counts = hypoData_mg, group = group)
### STEP 1 ###
d <- calcNormFactors(d)
### STEP 2 ###
d <- estimateGLMCommonDisp(d, design)
d <- estimateGLMTrendedDisp(d, design)
d <- estimateGLMTagwiseDisp(d, design)
fit <- glmFit(d, design)
lrt <- glmLRT(fit, coef = coef)
result <- topTags(lrt, n = nrow(hypoData_mg))
result <- result$table[rownames(hypoData_mg), ]
if (sum(result$FDR < FDR) > (floorPDEG * nrow(hypoData_mg))) {
  is.DEG <- as.logical(result$FDR < FDR)
} else {
  is.DEG <- as.logical(rank(result$PValue, ties.method = "min") <=
                       nrow(hypoData_mg) * floorPDEG)
}
### STEP 3 ###
d <- DGEList(counts = hypoData_mg[!is.DEG, ], group = group)
d <- calcNormFactors(d)
norm.factors <- d$samples$norm.factors * colSums(hypoData_mg[!is.DEG, ]) / colSums(hypoData_mg)
norm.factors <- norm.factors / mean(norm.factors)
norm.factors
@

\subsubsection{DEGES/DESeq} \label{section-3-3-3}
\Rpackage{DESeq} also employs GLMs for analyzing multi-group experiments. Similar to the \Rpackage{edgeR} package, it requires two models (a full model and a reduced model). The full model (\Robject{fit1}) and reduced model (\Robject{fit0}) can be created as follows:
<>=
library(TCC)
data(hypoData_mg)
group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
tcc <- new("TCC", hypoData_mg, group)
fit1 <- count ~ condition
fit0 <- count ~ 1
@
The two models (\Robject{fit1} and \Robject{fit0}) will automatically be generated when performing the following \Rfunction{calcNormFactors} function if they are not explicitly indicated.
<>=
tcc <- calcNormFactors(tcc, norm.method = "deseq", test.method = "deseq",
                       iteration = 1)
tcc$norm.factors
@
For \Rpackage{DESeq} users, we provide commands, consisting of functions in \Rpackage{DESeq}, to perform the DEGES/\Rpackage{DESeq} pipeline without \Rpackage{TCC}. The \Rfunction{calcNormFactors} function together with the above parameter settings can be regarded as a wrapper function for the following commands.
<>=
library(TCC)
data(hypoData_mg)
group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
FDR <- 0.1
floorPDEG <- 0.05
tcc <- new("TCC", hypoData_mg, group)
fit1 <- count ~ condition
fit0 <- count ~ 1
cds <- newCountDataSet(hypoData_mg, group)
### STEP 1 ###
cds <- estimateSizeFactors(cds)
### STEP 2 ###
cds <- estimateDispersions(cds)
reduced.model <- fitNbinomGLMs(cds, fit0)
full.model <- fitNbinomGLMs(cds, fit1)
p.value <- nbinomGLMTest(full.model, reduced.model)
p.value[is.na(p.value)] <- 1
q.value <- p.adjust(p.value, method = "BH")
if (sum(q.value < FDR) > (floorPDEG * nrow(hypoData_mg))) {
  is.DEG <- as.logical(q.value < FDR)
} else {
  is.DEG <- as.logical(rank(p.value, ties.method = "min") <=
                       nrow(hypoData_mg) * floorPDEG)
}
### STEP 3 ###
cds <- newCountDataSet(hypoData_mg[!is.DEG, ], group)
cds <- estimateSizeFactors(cds)
norm.factors <- sizeFactors(cds) / colSums(hypoData_mg)
norm.factors <- norm.factors / mean(norm.factors)
norm.factors
@

\subsection{Retrieving normalized data} \label{section-3-4}
Functions similar to ours for calculating normalization factors are the \Rfunction{calcNormFactors} function in \Rpackage{edgeR} and the \Rfunction{estimateSizeFactors} function in \Rpackage{DESeq}. Note that the terminology used in \Rpackage{DESeq} (i.e., size factors) is different from that used in \Rpackage{edgeR} (i.e., effective library sizes) and ours. The effective library size in \Rpackage{edgeR} is calculated as the library size multiplied by the normalization factor. The size factors in the \Rpackage{DESeq} package are comparable to the {\it normalized} effective library sizes wherein the summary statistics for the effective library sizes are adjusted to one. Our normalization factors, which can be obtained from \Robject{tcc\$norm.factors}, are defined in the same way as those in \Rpackage{edgeR}. Accordingly, the normalization factors calculated from \Rpackage{TCC} with arbitrary options should be manipulated together with the library sizes when normalized read counts are to be obtained. Since biologists are often interested in such information (Dillies et al., 2012 \cite{dillies}), we provide the \Rfunction{getNormalizedData} function for retrieving normalized data.

Note that the \Robject{hypoData} consists of $1,000$ genes and a total of six samples (three biological replicates for G1 and three biological replicates for G2); i.e., \{G1\_rep1, G1\_rep2, G1\_rep3\} vs. \{G2\_rep1, G2\_rep2, G2\_rep3\}. These simulation data have basically the same conditions as shown in Fig. 1 of the TbT paper (Kadota et al., 2012 \cite{kadota}); i.e., (i) the first $200$ genes are DEGs ($P_{\mbox{\small DEG}} = 200/1000 = 20\%$), (ii) the first $180$ genes of the $200$ DEGs are higher in G1 ($P_{\mbox{\small G1}} = 180/200 = 90\%$) and the remaining $20$ DEGs are higher in G2, and (iii) the level of DE is four-fold. The last $800$ genes were designed to be non-DEGs. The different normalization strategies can roughly be evaluated in terms of how similar the summary statistics for {\it normalized} data labeled as non-DEGs in one group (e.g., G1) are to those of the other group (e.g., G2). The basic statistics for the non-DEGs are as follows.
<>=
library(TCC)
data(hypoData)
nonDEG <- 201:1000
summary(hypoData[nonDEG, ])
@
From now on, we will display only the median values for simplicity, i.e.,
<>=
apply(hypoData[nonDEG, ], 2, median)
@
<>=
hypoData.median <- apply(hypoData[nonDEG, ], 2, median)
@
In what follows, we show detailed examples using \Robject{hypoData}. Note, however, that the basic usage is simple.
<>=
normalized.count <- getNormalizedData(tcc)
@

\subsubsection{Retrieving two-group DEGES/edgeR-normalized data with replicates} \label{section-3-4-1}
The \Rfunction{getNormalizedData} function can be applied to the \Rpackage{TCC} class object after the normalization factors have been calculated.
<>=
library(TCC)
data(hypoData)
nonDEG <- 201:1000
group <- c(1, 1, 1, 2, 2, 2)
tcc <- new("TCC", hypoData, group)
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger",
                       iteration = 1, FDR = 0.1, floorPDEG = 0.05)
normalized.count <- getNormalizedData(tcc)
apply(normalized.count[nonDEG, ], 2, median)
@
The same procedure consisting of functions in \Rpackage{edgeR} is
<>=
library(TCC)
data(hypoData)
nonDEG <- 201:1000
group <- c(1, 1, 1, 2, 2, 2)
FDR <- 0.1
floorPDEG <- 0.05
d <- DGEList(counts = hypoData, group = group)
### Step 1 ###
d <- calcNormFactors(d)
### Step 2 ###
d <- estimateCommonDisp(d)
d <- estimateTagwiseDisp(d)
result <- exactTest(d)
q.value <- p.adjust(result$table$PValue, method = "BH")
if (sum(q.value < FDR) > (floorPDEG * nrow(hypoData))) {
  is.DEG <- as.logical(q.value < FDR)
} else {
  is.DEG <- as.logical(rank(result$table$PValue, ties.method = "min") <=
                       nrow(hypoData) * floorPDEG)
}
### Step 3 ###
d <- DGEList(counts = hypoData[!is.DEG, ], group = group)
d <- calcNormFactors(d)
norm.factors <- d$samples$norm.factors * colSums(hypoData[!is.DEG, ]) / colSums(hypoData)
norm.factors <- norm.factors / mean(norm.factors)
effective.libsizes <- colSums(hypoData) * norm.factors
normalized.count <- sweep(hypoData, 2, mean(effective.libsizes) / effective.libsizes, "*")
apply(normalized.count[nonDEG, ], 2, median)
@
It is obvious that the summary statistics (ranging from $\Sexpr{sprintf("%.5f", min(apply(normalized.count[nonDEG,],2,median)))}$ to $\Sexpr{sprintf("%.5f", max(apply(normalized.count[nonDEG,],2,median)))}$) from DEGES/\Rpackage{edgeR}-normalized data are close to the truth (i.e., ranging from $\Sexpr{sprintf("%.1f", min(hypoData.median))}$ to $\Sexpr{sprintf("%.1f", max(hypoData.median))}$). For comparison, the summary statistics for TMM-normalized data produced using the original normalization method (i.e., TMM) in \Rpackage{edgeR} are obtained as follows.
<>=
library(TCC)
data(hypoData)
nonDEG <- 201:1000
group <- c(1, 1, 1, 2, 2, 2)
d <- DGEList(count = hypoData, group = group)
d <- calcNormFactors(d)
norm.factors <- d$samples$norm.factors
norm.factors <- norm.factors / mean(norm.factors)
effective.libsizes <- colSums(hypoData) * norm.factors
normalized.count <- sweep(hypoData, 2, mean(effective.libsizes) / effective.libsizes, "*")
apply(normalized.count[nonDEG, ], 2, median)
@
This is the same as
<>=
library(TCC)
data(hypoData)
nonDEG <- 201:1000
group <- c(1, 1, 1, 2, 2, 2)
tcc <- new("TCC", hypoData, group)
tcc <- calcNormFactors(tcc, norm.method = "tmm", iteration = 0)
normalized.count <- getNormalizedData(tcc)
apply(normalized.count[nonDEG, ], 2, median)
@
From the viewpoint of the data distribution of non-DEGs, these statistics (ranging from $\Sexpr{sprintf("%.5f", min(apply(normalized.count[nonDEG,],2,median)))}$ to $\Sexpr{sprintf("%.5f", max(apply(normalized.count[nonDEG,],2,median)))}$) are not as good as those of DEGES/\Rpackage{edgeR}.

\subsubsection{Retrieving two-group DEGES/DESeq-normalized data with replicates} \label{section-3-4-2}
Similar to the DEGES/\Rpackage{edgeR} case, DEGES/\Rpackage{DESeq}-normalized data can be retrieved as follows.
<>= library(TCC) data(hypoData) group <- c(1, 1, 1, 2, 2, 2) nonDEG <- 201:1000 tcc <- new("TCC", hypoData, group) tcc <- calcNormFactors(tcc, norm.method = "deseq", test.method = "deseq", iteration = 1, FDR = 0.1, floorPDEG = 0.05) normalized.count <- getNormalizedData(tcc) apply(normalized.count[nonDEG, ], 2, median) @ The same procedure consisting of functions in \Rpackage{DESeq} is <>= library(TCC) data(hypoData) nonDEG <- 201:1000 group <- c(1, 1, 1, 2, 2, 2) FDR <- 0.1 floorPDEG <- 0.05 cds <- newCountDataSet(hypoData, group) ### Step 1 ### cds <- estimateSizeFactors(cds) ### Step 2 ### cds <- estimateDispersions(cds) result <- nbinomTest(cds, 1, 2) result$pval[is.na(result$pval)] <- 1 result$padj[is.na(result$padj)] <- 1 q.value <- result$padj if (sum(q.value < FDR) > (floorPDEG * nrow(hypoData))) { is.DEG <- as.logical(q.value < FDR) } else { is.DEG <- as.logical(rank(result$pval, ties.method = "min") <= nrow(hypoData) * floorPDEG) } ### Step 3 ### cds <- newCountDataSet(hypoData[!is.DEG, ], group) cds <- estimateSizeFactors(cds) norm.factors <- sizeFactors(cds) / colSums(hypoData) norm.factors <- norm.factors / mean(norm.factors) effective.libsizes <- colSums(hypoData) * norm.factors normalized.count <- sweep(hypoData, 2, mean(effective.libsizes) / effective.libsizes, "*") apply(normalized.count[nonDEG, ], 2, median) @ \subsubsection{Retrieving two-group DEGES/DESeq-normalized data without replicates} \label{section-3-4-3} Similar to the case of count data with replicates, the DEGES/\Rpackage{DESeq}-normalized data without replicates can be retrieved as follows. <>= library(TCC) data(hypoData) nonDEG <- 201:1000 group <- c(1, 2) tcc <- new("TCC", hypoData[, c(1, 4)], group) tcc <- calcNormFactors(tcc, norm.method = "deseq", test.method = "deseq", iteration = 1, FDR = 0.1, floorPDEG = 0.05) normalized.count <- getNormalizedData(tcc) apply(normalized.count[nonDEG, ], 2, median) @ The same procedure consisting of functions in \Rpackage{DESeq} is <>= library(TCC) data(hypoData) nonDEG <- 201:1000 group <- c(1, 2) FDR <- 0.1 floorPDEG <- 0.05 cds <- newCountDataSet(hypoData[,c(1, 4)], group) ### Step 1 ### cds <- estimateSizeFactors(cds) ### Step 2 ### cds <- estimateDispersions(cds, method = "blind", sharingMode = "fit-only") result <- nbinomTest(cds, 1, 2) result$pval[is.na(result$pval)] <- 1 result$padj[is.na(result$padj)] <- 1 q.value <- result$padj if (sum(q.value < FDR) > (floorPDEG * nrow(hypoData))) { is.DEG <- as.logical(q.value < FDR) } else { is.DEG <- as.logical(rank(result$pval, ties.method = "min") <= nrow(hypoData) * floorPDEG) } ### Step 3 ### cds <- newCountDataSet(hypoData[!is.DEG, c(1, 4)], group) cds <- estimateSizeFactors(cds) norm.factors <- sizeFactors(cds) / colSums(hypoData[, c(1, 4)]) norm.factors <- norm.factors / mean(norm.factors) effective.libsizes <- colSums(hypoData[, c(1, 4)]) * norm.factors normalized.count <- sweep(hypoData[, c(1, 4)], 2, mean(effective.libsizes) / effective.libsizes, "*") apply(normalized.count[nonDEG, ], 2, median) @ The above summary statistics from DEGES/\Rpackage{DESeq}-normalized data are closer to the truth (i.e., $\Sexpr{sprintf("%.1f", hypoData.median[1])}$ for G1\_rep1 and $\Sexpr{sprintf("%.1f",hypoData.median[4])}$ for G2\_rep1) than are the following summary statistics from data normalized using the original normalization method implemented in \Rpackage{DESeq}. 
<>= library(TCC) data(hypoData) nonDEG <- 201:1000 group <- c(1, 2) cds <- newCountDataSet(hypoData[, c(1, 4)], group) cds <- estimateSizeFactors(cds) normalized.count <- counts(cds, normalized = TRUE) apply(normalized.count[nonDEG, ], 2, median) @ \subsubsection{Retrieving multi-group iDEGES/edgeR-normalized data with replicates} \label{section-3-4-4} Here, we analyze another hypothetical three-group count matrix, the \Robject{hypoData\_mg} object, provided in \Rpackage{TCC}. It consists of $1,000$ genes and a total of nine columns for testing any difference among three groups that each have triplicates. Similar to the \Robject{hypoData} object, the first $200$ genes are DEGs and the remaining $800$ genes are non-DEGs. The basic statistics for the non-DEGs are as follows. <>= library(TCC) data(hypoData_mg) nonDEG <- 201:1000 summary(hypoData_mg[nonDEG, ]) @ From now on, we will display only the median values for simplicity, i.e., <>= apply(hypoData_mg[nonDEG, ], 2, median) @ <>= hypoData_mg.median <- apply(hypoData_mg[nonDEG, ], 2, median) @ The iDEGES/\Rpackage{edgeR}-normalized data can be retrieved as follows. <>= library(TCC) data(hypoData_mg) nonDEG <- 201:1000 group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3) tcc <- new("TCC", hypoData_mg, group) design <- model.matrix(~ as.factor(group)) coef <- 2:length(unique(group)) tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger", iteration = 3) normalized.count <- getNormalizedData(tcc) apply(normalized.count[nonDEG, ], 2, median) range(apply(normalized.count[nonDEG, ], 2, median)) @ <>= normByiDEGES <- range(apply(normalized.count[nonDEG, ], 2, median)) @ For comparison, the summary statistics for TMM-normalized data produced using the original normalization method (i.e., TMM) in \Rpackage{edgeR} are obtained as follows. <>= library(TCC) data(hypoData_mg) nonDEG <- 201:1000 group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3) tcc <- new("TCC", hypoData_mg, group) tcc <- calcNormFactors(tcc, norm.method = "tmm", iteration = 0) normalized.count <- getNormalizedData(tcc) apply(normalized.count[nonDEG, ], 2, median) range(apply(normalized.count[nonDEG, ], 2, median)) @ <>= normByTMM <- range(apply(normalized.count[nonDEG, ], 2, median)) @ It is obvious that the summary statistics (ranging from $\Sexpr{sprintf("%.5f", min(normByiDEGES))}$ to $\Sexpr{sprintf("%.5f", max(normByiDEGES))}$) from iDEGES/\Rpackage{edgeR}-normalized data are closer to the truth (i.e., ranging from $\Sexpr{sprintf("%.1f", min(hypoData_mg.median))}$ to $\Sexpr{sprintf("%.1f", max(hypoData_mg.median))}$) than those (ranging from $\Sexpr{sprintf("%.5f", min(normByTMM))}$ to $\Sexpr{sprintf("%.5f", max(normByTMM))}$) from TMM-normalized data. \newpage \section{Differential expression (DE)} \label{section-4} The particular feature of \Rpackage{TCC} is that it calculates robust normalization factors. Moreover, end users would like to have some accessory functions for subsequent analyses. Here, we provide the \Rfunction{estimateDE} function for identifying DEGs. Specifically, the function internally uses the corresponding functions implemented in three packages: \Rfunction{exactTest} in \Rpackage{edgeR}, \Rfunction{nbinomTest} in \Rpackage{DESeq}, and \Rfunction{getLikelihoods.NB} in \Rpackage{baySeq}. 
Similar to the usage of the \Rfunction{calcNormFactors} function with the \Robject{test.method} argument in \Rpackage{TCC}, those DE methods in \Rpackage{edgeR}, \Rpackage{DESeq}, and \Rpackage{baySeq} can be performed by using the \Rfunction{estimateDE} function with \Robject{test.method = "edger"}, \Robject{"deseq"}, and \Robject{"bayseq"}, respectively. Here, we show some examples of DE analysis for two-group data with replicates (\ref{section-4-1}), two-group data without replicates (\ref{section-4-2}), and multi-group data with replicates (\ref{section-4-3}).

\subsection{DE analysis for two-group data with replicates} \label{section-4-1}
\subsubsection{edgeR coupled with iDEGES/edgeR normalization} \label{section-4-1-1}
We give a procedure for DE analysis using the exact test implemented in \Rpackage{edgeR} together with iDEGES/\Rpackage{edgeR} normalization factors (i.e., the iDEGES/\Rpackage{edgeR}-\Rpackage{edgeR} combination) for the hypothetical two-group count data with replicates (i.e., the \Robject{hypoData} object). If the user wants to declare the genes satisfying an FDR threshold of $< 10\%$ as DEGs, this can be done as follows.
<>=
library(TCC)
data(hypoData)
group <- c(1, 1, 1, 2, 2, 2)
tcc <- new("TCC", hypoData, group)
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger",
                       iteration = 3, FDR = 0.1, floorPDEG = 0.05)
tcc <- estimateDE(tcc, test.method = "edger", FDR = 0.1)
@
The results of the DE analysis are stored in the \Rpackage{TCC} class object. The summary statistics for top-ranked genes can be retrieved by using the \Rfunction{getResult} function.
<>=
result <- getResult(tcc, sort = TRUE)
head(result)
@
The DE results can be broken down as follows.
<>=
table(tcc$estimatedDEG)
@
This means that $\Sexpr{sum(tcc$estimatedDEG == FALSE)}$ non-DEGs and $\Sexpr{sum(tcc$estimatedDEG == TRUE)}$ DEGs satisfy FDR $< 0.1$. The \Rfunction{plot} function generates an M-A plot, where "M" indicates the log-ratio (i.e., $\mbox{M} = \log_{2}\mbox{G2} - \log_{2}\mbox{G1}$) and "A" indicates the average read count (i.e., $\mbox{A} = (\log_{2}\mbox{G2} + \log_{2}\mbox{G1}) / 2$), from the normalized count data. The magenta points indicate the identified DEGs at FDR $< 0.1$.
<>=
plot(tcc)
@

\subsubsection{baySeq coupled with iDEGES/edgeR normalization} \label{section-4-1-2}
If the user wants to employ the empirical Bayesian method in \Rpackage{baySeq} together with iDEGES/\Rpackage{edgeR} normalization factors (i.e., the iDEGES/\Rpackage{edgeR}-\Rpackage{baySeq} combination), one can do as follows.
<>=
set.seed(1000)
library(TCC)
data(hypoData)
samplesize <- 100
group <- c(1, 1, 1, 2, 2, 2)
tcc <- new("TCC", hypoData, group)
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger",
                       iteration = 3, FDR = 0.1, floorPDEG = 0.05)
tcc <- estimateDE(tcc, test.method = "bayseq", FDR = 0.1, samplesize = samplesize)
result <- getResult(tcc, sort = TRUE)
head(result)
table(tcc$estimatedDEG)
@
Note that a smaller sampling size (i.e., \Robject{samplesize = 100}) is used here to reduce the computation time, but a larger sampling size of around $10,000$ (i.e., \Robject{samplesize = 10000}) is recommended (Hardcastle and Kelly, 2010 \cite{hardcastle}). Note also that \Rpackage{baySeq} outputs posterior likelihoods instead of the $p$-values obtained from \Rpackage{edgeR} and \Rpackage{DESeq}. The $p$-value column stores the ($1 - likelihood$) values when the \Rfunction{estimateDE} function is executed with the empirical Bayesian method in \Rpackage{baySeq}.
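As a small sketch of this convention (assuming the \Robject{result} object returned by \Rfunction{getResult} above and its \Robject{p.value} column), the posterior likelihoods can therefore be recovered simply by inverting the stored values:
\begin{Schunk}
\begin{Sinput}
> likelihood <- 1 - result$p.value  # undo the (1 - likelihood) transformation
> head(likelihood)
\end{Sinput}
\end{Schunk}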
Now let us describe an alternative procedure for \Rpackage{baySeq} users that corresponds to the \Rfunction{estimateDE} function. The $likelihood$ values and $p$-values (calculated as $1 - likelihood$) are retrieved as follows. <>= set.seed(1000) library(TCC) data(hypoData) samplesize <- 100 group <- c(1, 1, 1, 2, 2, 2) tcc <- new("TCC", hypoData, group) tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger", iteration = 3, FDR = 0.1, floorPDEG = 0.05) effective.libsizes <- colSums(tcc$count) * tcc$norm.factors groups <- list(NDE = rep(1, length(group)), DE = group) cD <- new("countData", data = tcc$count, replicates = group, libsizes = effective.libsizes, groups = groups) cD <- getPriors.NB(cD, samplesize = samplesize, estimation = "QL", cl = NULL) cD <- getLikelihoods.NB(cD, pET = "BIC", cl = NULL) tmp <- topCounts(cD, group = "DE", number = nrow(tcc$count)) tmp <- tmp[rownames(tcc$count), ] p.value <- 1 - tmp$Likelihood q.value <- tmp$FDR result <- cbind(p.value, q.value) rownames(result) <- rownames(tmp) head(result) @ \subsection{DE analysis for two-group data without replicates} \label{section-4-2} As described previously, the functions for the DEG identification method implemented in \Rpackage{edgeR} (i.e., the exact test; ver. 3.0.4) do not allow analysis without replicates. Currently, the \Rfunction{estimateDE} function only allows the \Robject{"deseq"} or \Robject{"bayseq"} options for the \Robject{test.method} argument. Here, we show a procedure for DE analysis using the NB test implemented in \Rpackage{DESeq} together with iDEGES/\Rpackage{DESeq} normalization factors (i.e., the iDEGES/\Rpackage{DESeq}-\Rpackage{DESeq} combination) for the hypothetical two-group count data without replicates (i.e., the \Robject{hypoData[, c(1, 4)]} object). If the user wants to determine the genes having an FDR threshold of $< 10\%$ as DEGs, one can do as follows. <>= library(TCC) data(hypoData) group <- c(1, 2) tcc <- new("TCC", hypoData[, c(1, 4)], group) head(tcc$count) tcc$group tcc <- calcNormFactors(tcc, norm.method = "deseq", test.method = "deseq", iteration = 3, FDR = 0.1, floorPDEG = 0.05) tcc$norm.factors tcc <- estimateDE(tcc, test.method = "deseq", FDR = 0.1) result <- getResult(tcc, sort = TRUE) head(result) table(tcc$estimatedDEG) @ It can be seen that there is no DEG having FDR $< 0.1$. \subsection{DE analysis for multi-group data with replicates} \label{section-4-3} Here, we give three examples of DE analysis coupled with DEGES/\Rpackage{edgeR} normalization for the hypothetical three-group data with replicates, i.e., the \Robject{hypoData\_mg} object. The use of the DEGES/\Rpackage{edgeR} normalization factors is simply for reducing the computation time. \subsubsection{baySeq coupled with DEGES/edgeR normalization} \label{section-4-3-1} The empirical Bayesian method implemented in \Rpackage{baySeq} after executing the DEGES/\Rpackage{edgeR} normalization (i.e., the DEGES/\Rpackage{edgeR}-\Rpackage{baySeq} combination) can be performed as follows. 
<>= library(TCC) data(hypoData_mg) group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3) tcc <- new("TCC", hypoData_mg, group) ### Normalization ### tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger", iteration = 1) ### DE analysis ### set.seed(1000) samplesize <- 100 tcc <- estimateDE(tcc, test.method = "bayseq", FDR = 0.1, samplesize = samplesize) result <- getResult(tcc, sort = TRUE) head(result) table(tcc$estimatedDEG) @ It can be seen that the \Rpackage{baySeq} method identified $\Sexpr{sum(tcc$estimatedDEG == TRUE)}$ DEGs having FDR $< 0.1$. One can obtain the number of DEGs with another threshold (e.g., FDR $< 0.2$) from the result object as follows. <>= sum(result$q.value < 0.2) @ For \Rpackage{baySeq} users, we provide commands, consisting of functions in \Rpackage{baySeq}, to perform the DEG identification without the function in \Rpackage{TCC}. The \Rfunction{estimateDE} function with \Robject{test.method = "bayseq"} can be regarded as a wrapper function for the following commands after the DEGES/\Rpackage{edgeR} normalization. <>= set.seed(1000) samplesize <- 100 effective.libsizes <- colSums(tcc$count) * tcc$norm.factors groups <- list(NDE = rep(1, length(group)), DE = group) cD <- new("countData", data = tcc$count, replicates = group, libsizes = effective.libsizes, groups = groups) cD <- getPriors.NB(cD, samplesize = samplesize, estimation = "QL", cl = NULL) cD <- getLikelihoods.NB(cD, pET = "BIC", cl = NULL) tmp <- topCounts(cD, group = "DE", number = nrow(tcc$count)) tmp <- tmp[rownames(tcc$count), ] p.value <- 1 - tmp$Likelihood q.value <- tmp$FDR result <- cbind(p.value, q.value) rownames(result) <- rownames(tmp) head(result) sum(q.value < 0.1) sum(q.value < 0.2) @ \subsubsection{edgeR coupled with DEGES/edgeR normalization} \label{section-4-3-2} The exact test implemented in \Rpackage{edgeR} after executing the DEGES/\Rpackage{edgeR} normalization (i.e., the DEGES/\Rpackage{edgeR}-\Rpackage{edgeR} combination) can be performed as follows. 
<>=
library(TCC)
data(hypoData_mg)
group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
tcc <- new("TCC", hypoData_mg, group)
### Normalization ###
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger",
                       iteration = 1)
### DE analysis ###
tcc <- estimateDE(tcc, test.method = "edger", FDR = 0.1)
result <- getResult(tcc, sort = TRUE)
head(result)
table(tcc$estimatedDEG)
@
Note that these DEGs having FDR $< 0.1$ display DE between any of the groups because the two arguments used here (\Robject{design} and \Robject{coef}) correspond to an ANOVA-like test for any differences, as provided in \Rpackage{edgeR}, i.e.,
<>=
library(TCC)
data(hypoData_mg)
group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
design <- model.matrix(~ as.factor(group))
coef <- 2:length(unique(group))
tcc <- new("TCC", hypoData_mg, group)
### Normalization ###
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger",
                       iteration = 1)
### DE analysis ###
d <- DGEList(tcc$count, group = group)
d$samples$norm.factors <- tcc$norm.factors
d <- estimateGLMCommonDisp(d, design)
d <- estimateGLMTrendedDisp(d, design)
d <- estimateGLMTagwiseDisp(d, design)
fit <- glmFit(d, design)
lrt <- glmLRT(fit, coef = coef)
tmp <- topTags(lrt, n = nrow(tcc$count))
p.value <- tmp$table$PValue
q.value <- tmp$table$FDR
result <- cbind(p.value, q.value)
rownames(result) <- rownames(tmp)
head(result)
sum(q.value < 0.1)
sum(q.value < 0.2)
@
As described in the \Rpackage{edgeR} manual, the second and third columns in the \Robject{design} object are relative to the baseline (i.e., Group 1 or G1): \Robject{coef = 2} means G2 vs. G1 and \Robject{coef = 3} means G3 vs. G1. The above procedure with the \Robject{coef} object (i.e., \Robject{2:length(unique(group))}) specifies both comparisons (i.e., G2 vs. G1 and G3 vs. G1) and identifies DEGs between any of the three groups. Of course, one can also perform any two-group comparison of interest from multi-group data with replicates. For example, the DE analysis for G3 vs. G1 together with DEGES/\Rpackage{edgeR} normalization can be performed as follows.
<>=
library(TCC)
data(hypoData_mg)
group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
tcc <- new("TCC", hypoData_mg, group)
### Normalization ###
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger",
                       iteration = 1)
### DE analysis ###
coef <- 3
tcc <- estimateDE(tcc, test.method = "edger", FDR = 0.1, coef = coef)
result <- getResult(tcc, sort = TRUE)
head(result)
table(tcc$estimatedDEG)
@

\subsubsection{DESeq coupled with DEGES/edgeR normalization} \label{section-4-3-3}
The NB test implemented in \Rpackage{DESeq} after executing the DEGES/\Rpackage{edgeR} normalization (i.e., the DEGES/\Rpackage{edgeR}-\Rpackage{DESeq} combination) can be performed as follows.
<>=
library(TCC)
data(hypoData_mg)
group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
tcc <- new("TCC", hypoData_mg, group)
### Normalization ###
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger",
                       iteration = 1)
### DE analysis ###
fit1 <- count ~ condition
fit0 <- count ~ 1
tcc <- estimateDE(tcc, test.method = "deseq", FDR = 0.1, fit0 = fit0, fit1 = fit1)
result <- getResult(tcc, sort = TRUE)
head(result)
table(tcc$estimatedDEG)
@
For \Rpackage{DESeq} users, we provide commands, consisting of functions in \Rpackage{DESeq}, to perform the DEG identification without the functions in \Rpackage{TCC}. The \Rfunction{estimateDE} function with \Robject{test.method = "deseq"} can be regarded as a wrapper function for the following commands after the DEGES/\Rpackage{edgeR} normalization.
<>= library(TCC) data(hypoData_mg) group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3) tcc <- new("TCC", hypoData_mg, group) ### Normalization ### tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger", iteration = 1) ### DE analysis ### fit1 <- count ~ condition fit0 <- count ~ 1 cds <- newCountDataSet(tcc$count, group) sizeFactors(cds) <- tcc$norm.factors * colSums(tcc$count) cds <- estimateDispersions(cds) reduced.model <- fitNbinomGLMs(cds, fit0) full.model <- fitNbinomGLMs(cds, fit1) p.value <- nbinomGLMTest(full.model, reduced.model) p.value[is.na(p.value)] <- 1 q.value <- p.adjust(p.value, method = "BH") tmp <- cbind(p.value, q.value) rownames(tmp) <- tcc$gene_id result <- tmp[order(p.value), ] head(result) sum(q.value < 0.1) sum(q.value < 0.2) @ \section{Generation of simulation data} \subsection{Introduction and basic usage} \label{section-5-1} As demonstrated in our previous study (Kadota et al., 2012 \cite{kadota}), the DEGES-based normalization methods implemented in \Rpackage{TCC} theoretically outperform the other normalization methods when the numbers of DEGs (G1 vs. G2) in the tag count data are biased. However, it is difficult to determine whether the up- and down-regulated DEGs in one of the groups are actually biased in their number when analyzing real data (Dillies et al., 2012 \cite{dillies}). This means we have to evaluate the potential performance of our DEGES-based methods using mainly simulation data. The \Rfunction{simulateReadCounts} function generates simulation data under various conditions. This function can generate simulation data analyzed in the TbT paper (Kadota et al., 2012 \cite{kadota}), and that means it enables other researchers to compare the methods they develop with our DEGES-based methods. For example, the \Robject{hypoData} object, a hypothetical count dataset provided in \Rpackage{TCC}, was generated by using this function. The output of the \Rfunction{simulateReadCounts} function is stored as a \Rpackage{TCC} class object and is therefore ready-to-analyze. Note that different trials of simulation analysis generally yield different count data even under the same simulation conditions. As mentioned in section \ref{section-3-1-1}, we can call the \Rfunction{set.seed} function in order to obtain reproducible results (i.e., the \Robject{tcc\$count}) with the \Rfunction{simulateReadCounts} function. <>= set.seed(1000) library(TCC) tcc <- simulateReadCounts(Ngene = 1000, PDEG = 0.2, DEG.assign = c(0.9, 0.1), DEG.foldchange = c(4, 4), replicates = c(3, 3)) dim(tcc$count) head(tcc$count) tcc$group @ The simulation conditions for comparing two groups (G1 vs. G2) with biological replicates are as follows: (i) the number of genes is $1,000$ (i.e., \Robject{Ngene = 1000}), (ii) the first $20\%$ of genes are DEGs (\Robject{PDEG = 0.2}), (iii) the first $90\%$ of the DEGs are up-regulated in G1, and the remaining $10\%$ are up-regulated in G2 (\Robject{DEG.assign = c(0.9, 0.1)}), (iv) the levels of DE are four-fold in both groups (\Robject{DEG.foldchange = c(4, 4)}), and (v) there are a total of six samples (three biological replicates for G1 and three biological replicates for G2) (\Robject{replicates = c(3, 3)}). The variance of the NB distribution can be modeled as $V = \mu + \phi\mu^{2}$. 
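For readers unfamiliar with this parameterization, the following sketch (not part of the analysis pipeline; the values of \Robject{mu} and \Robject{phi} are arbitrary) illustrates that counts drawn with \Rfunction{rnbinom} using \Robject{size = 1/phi} have a variance close to $\mu + \phi\mu^{2}$.
\begin{Schunk}
\begin{Sinput}
> mu <- 200                      # arbitrary mean
> phi <- 0.3                     # arbitrary dispersion
> mu + phi * mu^2                # theoretical variance: 200 + 0.3 * 200^2 = 12200
> set.seed(1)
> x <- rnbinom(100000, mu = mu, size = 1 / phi)
> var(x)                         # should be close to the theoretical variance
\end{Sinput}
\end{Schunk}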
The empirical distribution of the read counts for producing the mean ($\mu$) and dispersion ($\phi$) parameters of the model was obtained from {\it Arabidopsis} data (three biological replicates for each of the treated and non-treated groups) in \Rpackage{NBPSeq} (Di et al., 2011 \cite{di}). The \Robject{tcc\$count} object is essentially the same as the \Robject{hypoData} object of \Rpackage{TCC}. The information about the simulation conditions can be viewed as follows. <>= str(tcc$simulation) @ Specifically, the entries for $0, 1$, and $2$ in the \Robject{tcc\$simulation\$trueDEG} object are for non-DEGs, DEGs up-regulated in G1, and DEGs up-regulated in G2, respectively. The breakdowns for individual entries are the same as stated above: $800$ entries are non-DEGs, $180$ DEGs are up-regulated in G1, and $20$ DEGs are up-regulated in G2. <>= table(tcc$simulation$trueDEG) @ This information can be used to evaluate the performance of the DEGES-based normalization methods in terms of the sensitivity and specificity of the results of their DE analysis. A good normalization method coupled with a DE method such as the exact test (Robinson and Smyth, 2008 \cite{robinson3}) or the empirical Bayesian method (Hardcastle and Kelly, 2010 \cite{hardcastle}) should produce well-ranked gene lists in which the true DEGs are top-ranked and non-DEGs are bottom-ranked when all genes are ranked according to the degree of DE. The ranked gene list after performing the DEGES/\Rpackage{edgeR}-\Rpackage{edgeR} combination can be obtained as follows. <>= tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger", iteration = 1, FDR = 0.1, floorPDEG = 0.05) tcc <- estimateDE(tcc, test.method = "edger", FDR = 0.1) result <- getResult(tcc, sort = TRUE) head(result) @ We can now calculate the area under the \Rpackage{ROC} curve (i.e., AUC; $0 \le $AUC$ \le 1$) between the ranked gene list and the truth (i.e., DEGs or non-DEGs) and thereby evaluate the sensitivity and specificity simultaneously. A well-ranked gene list should have a high AUC value (i.e., high sensitivity and specificity). The \Rfunction{calcAUCValue} function calculates the AUC value based on the information stored in the \Rpackage{TCC} class object. <>= calcAUCValue(tcc) @ <>= auc.degesedger <- calcAUCValue(tcc) @ This is essentially the same as <>= AUC(rocdemo.sca(truth = as.numeric(tcc$simulation$trueDEG != 0), data = -tcc$stat$rank)) @ The following classic \Rpackage{edgeR} procedure (i.e., the TMM-\Rpackage{edgeR} combination) makes it clear that the DEGES-based normalization method (i.e., the DEGES/\Rpackage{edgeR} pipeline) outperforms the default normalization method (i.e., TMM) implemented in \Rpackage{edgeR}. <>= tcc <- calcNormFactors(tcc, norm.method = "tmm", iteration = 0) tcc <- estimateDE(tcc, test.method = "edger", FDR = 0.1) calcAUCValue(tcc) @ The following is an alternative procedure for \Rpackage{edgeR} users.
<>= d <- DGEList(counts = tcc$count, group = tcc$group$group) d <- calcNormFactors(d) d$samples$norm.factors <- d$samples$norm.factors / mean(d$samples$norm.factors) d <- estimateCommonDisp(d) d <- estimateTagwiseDisp(d) result <- exactTest(d) result$table$PValue[is.na(result$table$PValue)] <- 1 AUC(rocdemo.sca(truth = as.numeric(tcc$simulation$trueDEG != 0), data = -rank(result$table$PValue))) @ <>= set.seed(1000) samplesize <- 100 tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "bayseq", iteration = 1, samplesize = samplesize) tcc <- estimateDE(tcc, test.method = "edger", FDR = 0.1) auc.degestbt <- calcAUCValue(tcc) @ As can be expected from the similarity of the normalization factors of DEGES/TbT (\ref{section-3-1-1}) and DEGES/\Rpackage{edgeR} (\ref{section-3-1-2}), the AUC value ($\Sexpr{sprintf("%.7f", auc.degesedger)}$) of DEGES/\Rpackage{edgeR} is quite similar to the AUC value ($\Sexpr{sprintf("%.7f", auc.degestbt)}$) of the original TbT method (i.e., DEGES/TbT): <>= set.seed(1000) samplesize <- 100 tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "bayseq", iteration = 1, samplesize = samplesize) tcc <- estimateDE(tcc, test.method = "edger", FDR = 0.1) calcAUCValue(tcc) @ \subsection{Two-group data without replicates} \label{section-5-2} Let us generate tag count data without replicates, such as those used in section \ref{section-3-2}. For simplicity, we first generate simulation data whose conditions are essentially the same as those in the previous section (i.e., \ref{section-5-1}), except for the number of replicates in each group: (i) the number of genes is $1,000$ (i.e., \Robject{Ngene = 1000}), (ii) the first $20\%$ of genes are DEGs (\Robject{PDEG = 0.2}), (iii) the first $90\%$ of the DEGs are up-regulated in G1, and the remaining $10\%$ are up-regulated in G2 (\Robject{DEG.assign = c(0.9, 0.1)}), (iv) the levels of DE are four-fold in both groups (\Robject{DEG.foldchange = c(4, 4)}), and (v) there are a total of two samples (one from G1 and the other from G2) (\Robject{replicates = c(1, 1)}). <>= set.seed(1000) library(TCC) tcc <- simulateReadCounts(Ngene = 1000, PDEG = 0.2, DEG.assign = c(0.9, 0.1), DEG.foldchange = c(4, 4), replicates = c(1, 1)) dim(tcc$count) head(tcc$count) tcc$group @ Now let us see how the DEGES/\Rpackage{DESeq}-\Rpackage{DESeq} combination performs compared with the original \Rpackage{DESeq}-\Rpackage{DESeq} combination. First, we calculate the AUC value for the ranked gene list obtained from the DEGES/\Rpackage{DESeq}-\Rpackage{DESeq} combination. <>= tcc <- calcNormFactors(tcc, norm.method = "deseq", test.method = "deseq", iteration = 1, FDR = 0.1, floorPDEG = 0.05) tcc <- estimateDE(tcc, test.method = "deseq") calcAUCValue(tcc) @ Next, we calculate the corresponding value using the original \Rpackage{DESeq} procedure (i.e., the \Rpackage{DESeq}-\Rpackage{DESeq} combination). <>= tcc <- calcNormFactors(tcc, norm.method = "deseq", iteration = 0) tcc <- estimateDE(tcc, test.method = "deseq") calcAUCValue(tcc) @ It can be seen that the DEGES/\Rpackage{DESeq}-\Rpackage{DESeq} combination outperforms the original procedure under the given simulation conditions. The following is an alternative approach for \Rpackage{DESeq} users.
<>= cds <- newCountDataSet(tcc$count, tcc$group$group) cds <- estimateSizeFactors(cds) norm.factors <- sizeFactors(cds) / colSums(tcc$count) norm.factors <- norm.factors / mean(norm.factors) sizeFactors(cds) <- colSums(tcc$count) * norm.factors cds <- estimateDispersions(cds, method="blind", sharingMode="fit-only") result <- nbinomTest(cds, 1, 2) result$pval[is.na(result$pval)] <- 1 AUC(rocdemo.sca(truth = as.numeric(tcc$simulation$trueDEG != 0), data = -rank(result$pval))) @ This procedure is the same as the one used in \Rpackage{TCC}, which derives normalization factors comparable to those of \Rpackage{edgeR} regardless of the package used. However, the following commands from the \Rpackage{DESeq} manual are of practical value because they give approximately the same AUC value as above. <>= cds <- newCountDataSet(tcc$count, tcc$group$group) cds <- estimateSizeFactors(cds) cds <- estimateDispersions(cds, method="blind", sharingMode="fit-only") result <- nbinomTest(cds, 1, 2) result$pval[is.na(result$pval)] <- 1 AUC(rocdemo.sca(truth = as.numeric(tcc$simulation$trueDEG != 0), data = -rank(result$pval))) @ \subsection{Multi-group data with and without replicates} \label{section-5-3} The \Rfunction{simulateReadCounts} function can generate simulation data with a more complex design. First, we generate a dataset consisting of three groups. The simulation conditions for this dataset are as follows: (i) the number of genes is $1,000$ (i.e., \Robject{Ngene = 1000}), (ii) the first $30\%$ of genes are DEGs (\Robject{PDEG = 0.3}), (iii) the breakdowns of the up-regulated DEGs are respectively $70\%$, $20\%$, and $10\%$ in Groups 1-3 (\Robject{DEG.assign = c(0.7, 0.2, 0.1)}), (iv) the levels of DE are $3$-, $10$-, and $6$-fold in individual groups (\Robject{DEG.foldchange = c(3, 10, 6)}), and (v) there are a total of nine libraries (2, 4, and 3 replicates for Groups 1-3) (\Robject{replicates = c(2, 4, 3)}). <>= set.seed(1000) library(TCC) tcc <- simulateReadCounts(Ngene = 1000, PDEG = 0.3, DEG.assign = c(0.7, 0.2, 0.1), DEG.foldchange = c(3, 10, 6), replicates = c(2, 4, 3)) dim(tcc$count) tcc$group head(tcc$count) @ The pseudo-color image for the generated simulation data regarding the DEGs can be obtained from the \Rfunction{plotFCPseudocolor} function. The right bar (from white to magenta) indicates the degree of fold-change (FC). As expected, it can be seen that the first $210$, $60$, and $30$ genes are up-regulated in G1, G2, and G3, respectively. <>= plotFCPseudocolor(tcc) @ Now let us see how the DEGES/\Rpackage{edgeR}-\Rpackage{edgeR} combination performs compared with the original \Rpackage{edgeR}-\Rpackage{edgeR} combination. First, we calculate the AUC value for the ranked gene list obtained from the DEGES/\Rpackage{edgeR}-\Rpackage{edgeR} combination. <>= tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger", iteration = 1) tcc <- estimateDE(tcc, test.method = "edger", FDR = 0.1) calcAUCValue(tcc) @ Next, we calculate the corresponding value using the original \Rpackage{edgeR} procedure for a single-factor experimental design (i.e., the \Rpackage{edgeR}-\Rpackage{edgeR} combination). <>= tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger", iteration = 0) tcc <- estimateDE(tcc, test.method = "edger", FDR = 0.1) calcAUCValue(tcc) @ It can be seen that the DEGES/\Rpackage{edgeR}-\Rpackage{edgeR} combination outperforms the original \Rpackage{edgeR} procedure under the given simulation conditions.
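When several such pipelines are to be compared on the same simulation data, it may be convenient to wrap the normalization, DE analysis, and AUC calculation into one small helper. The sketch below is not part of \Rpackage{TCC}; the function name \Rfunction{calcAUCForPipeline} and the fixed \Robject{FDR = 0.1} threshold are our own choices for illustration.
\begin{Schunk}
\begin{Sinput}
> calcAUCForPipeline <- function(tcc, norm.method, test.method, iteration) {
+     # DEGES-based normalization when iteration >= 1, default normalization
+     # of the selected package when iteration = 0
+     tcc <- calcNormFactors(tcc, norm.method = norm.method,
+                            test.method = test.method, iteration = iteration)
+     # DE analysis followed by AUC calculation against the known truth
+     tcc <- estimateDE(tcc, test.method = test.method, FDR = 0.1)
+     calcAUCValue(tcc)
+ }
> calcAUCForPipeline(tcc, "tmm", "edger", iteration = 1)   # DEGES/edgeR-edgeR
> calcAUCForPipeline(tcc, "tmm", "edger", iteration = 0)   # original edgeR-edgeR
\end{Sinput}
\end{Schunk}
The two calls correspond to the DEGES/\Rpackage{edgeR}-\Rpackage{edgeR} and \Rpackage{edgeR}-\Rpackage{edgeR} procedures shown above and should reproduce the AUC values obtained there.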
Note that the \Robject{test.method} argument will be ignored when \Robject{iteration = 0} is specified. Next, let us generate another dataset consisting of a total of eight groups. The simulation conditions for this dataset are as follows: (i) the number of genes is $10,000$ (i.e., \Robject{Ngene = 10000}), (ii) the first $34\%$ of genes are DEGs (\Robject{PDEG = 0.34}), (iii) the breakdowns of the up-regulated DEGs are respectively $10\%$, $30\%$, $5\%$, $10\%$, $5\%$, $21\%$, $9\%$, and $10\%$ in Groups 1-8 (\Robject{DEG.assign = c(0.1, 0.3, 0.05, 0.1, 0.05, 0.21, 0.09, 0.1)}), (iv) the levels of DE are $3.1$-, $13$-, $2$-, $1.5$-, $9$-, $5.6$-, $4$-, and $2$-fold in individual groups (\Robject{DEG.foldchange = c(3.1, 13, 2, 1.5, 9, 5.6, 4, 2)}), and (v) there are a total of nine libraries (except for G3, none of the groups have replicates) (\Robject{replicates = c(1, 1, 2, 1, 1, 1, 1, 1)}). <>= set.seed(1000) library(TCC) tcc <- simulateReadCounts(Ngene = 10000, PDEG = 0.34, DEG.assign = c(0.1, 0.3, 0.05, 0.1, 0.05, 0.21, 0.09, 0.1), DEG.foldchange = c(3.1, 13, 2, 1.5, 9, 5.6, 4, 2), replicates = c(1, 1, 2, 1, 1, 1, 1, 1)) dim(tcc$count) tcc$group head(tcc$count) plotFCPseudocolor(tcc) @ This kind of simulation data may be useful for evaluating methods aimed at identifying tissue-specific (or tissue-selective) genes. \subsection{Multi-factor data} \label{section-5-3-p1} The \Rfunction{simulateReadCounts} function can also generate simulation data under a multi-factor experimental design. Unlike the single-factor experimental designs above, the \Robject{group} argument should be used instead of \Robject{replicates} to specify the sample conditions (or factors) when generating simulation data with a multi-factor design. In relation to the \Robject{group} specification, the \Robject{DEG.foldchange} argument should also be specified as a data frame object. We generate a dataset consisting of two factors for comparing (i) two groups (i.e., "WT" vs. "KO") as the first factor, at (ii) two time points (i.e., "1d" vs. "2d") as the second factor, with all samples obtained from independent subjects. There are a total of four conditions ("WT\_1d", "WT\_2d", "KO\_1d", and "KO\_2d"), each of which has two biological replicates, comprising a total of eight samples. The \Robject{group} argument for this experimental design can be described as follows: <>= group <- data.frame( GROUP = c("WT", "WT", "WT", "WT", "KO", "KO", "KO", "KO"), TIME = c("1d", "1d", "2d", "2d", "1d", "1d", "2d", "2d") ) @ Next, we specify the number of DEG types and their fold-change levels with the \Robject{DEG.foldchange} argument. We here introduce three types of DEGs: (a) 2-fold up-regulation in the first four samples (i.e., "WT"), (b) 3-fold up-regulation in the last four samples (i.e., "KO"), and (c) 2-fold down-regulation at "2d" in "WT" and 4-fold up-regulation at "2d" in "KO". This implies that the first two types of DEGs are related to the first factor (i.e., "WT" vs. "KO") and the third type of DEG is related to the second factor (i.e., "1d" vs. "2d"). <>= DEG.foldchange <- data.frame( FACTOR1.1 = c(2, 2, 2, 2, 1, 1, 1, 1), FACTOR1.2 = c(1, 1, 1, 1, 3, 3, 3, 3), FACTOR2 = c(1, 1, 0.5, 0.5, 1, 1, 4, 4) ) @ The other simulation conditions for this dataset are as follows: (1) the number of genes is 10,000 (i.e., \Robject{Ngene = 10000}), (2) the first 20\% of genes are DEGs (i.e., \Robject{PDEG = 0.2}), and (3) the breakdowns of the three types of DEGs are 50\%, 20\%, and 30\% (i.e., \Robject{DEG.assign = c(0.5, 0.2, 0.3)}).
<>= set.seed(1000) tcc <- simulateReadCounts(Ngene = 10000, PDEG = 0.2, DEG.assign = c(0.5, 0.2, 0.3), DEG.foldchange = DEG.foldchange, group = group) @ Since the first six rows in the dataset correspond to the first type of DEGs, we can see the 2-fold up-regulation in the first four columns (i.e., WT-related samples) compared to the last four columns (i.e., KO-related samples). <>= head(tcc$count) tcc$group plotFCPseudocolor(tcc) @ \subsection{Other utilities} \label{section-5-4} Recall that the simulation framework can handle different levels of DE for DEGs in individual groups, and the shape of the distribution for these DEGs is the same as that of non-DEGs. Let us confirm those distributions by introducing more drastic simulation conditions for comparing two groups (G1 vs. G2) with biological replicates; i.e., (i) the number of genes is $20,000$ (i.e., \Robject{Ngene = 20000}), (ii) the first $30\%$ of genes are DEGs (\Robject{PDEG = 0.30}), (iii) the first $85\%$ of the DEGs are up-regulated in G1 and the remaining $15\%$ are up-regulated in G2 (\Robject{DEG.assign = c(0.85, 0.15)}), (iv) the levels of DE are eight-fold in G1 and sixteen-fold in G2 (\Robject{DEG.foldchange = c(8, 16)}), and (v) there are a total of four samples (two biological replicates for G1 and two biological replicates for G2) (\Robject{replicates = c(2, 2)}). <>= set.seed(1000) library(TCC) tcc <- simulateReadCounts(Ngene = 20000, PDEG = 0.30, DEG.assign = c(0.85, 0.15), DEG.foldchange = c(8, 16), replicates = c(2, 2)) head(tcc$count) @ An M-A plot for the simulation data can be viewed as follows; the points for up-regulated DEGs in G1 and G2 are colored blue and red, respectively. The non-DEGs are in black: <>= plot(tcc) @ This plot is generated from simulation data that has been scaled in such a way that the library sizes of each sample are the same as the mean library size of the original data. That is, <>= normalized.count <- getNormalizedData(tcc) colSums(normalized.count) colSums(tcc$count) mean(colSums(tcc$count)) @ <>= xy <- plot(tcc) isnot.na <- as.logical(xy[, 1] != min(xy[, 1])) median.G1 <- median(xy[(tcc$simulation$trueDEG == 1) & isnot.na, 2]) median.G2 <- median(xy[(tcc$simulation$trueDEG == 2) & isnot.na, 2]) median.nonDEG <- median(xy[(tcc$simulation$trueDEG == 0) & isnot.na, 2]) @ The summary statistics for non-DEGs and up-regulated DEGs in G1 and G2 are upshifted compared with the original intentions of the user (i.e., respective M values of $0$, $-3$, and $4$ for non-DEGs and up-regulated DEGs in G1 and G2). Indeed, the median values, indicated as horizontal lines, are respectively $\Sexpr{sprintf("%.3f", median.nonDEG)}$, $\Sexpr{sprintf("%.3f", median.G1)}$, and $\Sexpr{sprintf("%.3f", median.G2)}$ for non-DEGs and up-regulated DEGs in G1 and G2. <>= plot(tcc, median.lines = TRUE) @ <>= tcc <- calcNormFactors(tcc, "tmm", "edger", iteration = 3, FDR = 0.1, floorPDEG = 0.05) xy <- plot(tcc) isnot.na <- as.logical(xy[, 1] != min(xy[, 1])) median.nonDEG <- median(xy[(tcc$simulation$trueDEG == 0) & isnot.na, 2]) @ These upshifted M values for non-DEGs can be corrected by performing the iDEGES/\Rpackage{edgeR} normalization, e.g., the median M value ($= \Sexpr{sprintf("%.3f", median.nonDEG)}$) for non-DEGs based on the iDEGES/\Rpackage{edgeR}-normalized data is nearly zero.
<>= tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger", iteration = 3, FDR = 0.1, floorPDEG = 0.05) plot(tcc, median.lines = TRUE) @ Comparing the median M values for non-DEGs obtained from different normalization methods might be another useful evaluation metric. \newpage \section{Session info} <>= sessionInfo() @ \newpage \section{References} \renewcommand{\refname}{} \renewcommand\refname{\vskip -1cm} \begin{thebibliography}{99} \bibitem{robinson} Robinson MD, McCarthy DJ, and Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 2010, 26(1): 139-140 \bibitem{dillies} Dillies MA, Rau A, Aubert J, Hennequet-Antier C, Jeanmougin M, Servant N, Keime C, Marot G, Castel D, Estelle J, Guernec G, Jagla B, Jouneau L, Lalo\"e D, Le Gall C, Scha\"effer B, Le Crom S, Guedj M, Jaffr\'ezic F; on behalf of The French StatOmique Consortium. A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Brief Bioinform, in press \bibitem{kadota} Kadota K, Nishiyama T, and Shimizu K. A normalization strategy for comparing tag count data. Algorithms Mol Biol. 2012, 7: 5 \bibitem{robinson2} Robinson MD and Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010, 11: R25 \bibitem{wad} Kadota K, Nakai Y, and Shimizu K. A weighted average difference method for detecting differentially expressed genes from microarray data. Algorithms Mol Biol. 2008, 3: 8 \bibitem{hardcastle} Hardcastle TJ and Kelly KA. baySeq: empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics 2010, 11: 422 \bibitem{aicentropy} Kadota K, Nishimura SI, Bono H, Nakamura S, Hayashizaki Y, Okazaki Y, and Takahashi K. Detection of genes with tissue-specific expression patterns using Akaike's Information Criterion (AIC) procedure. Physiol Genomics 2003, 12: 251-259 \bibitem{anders} Anders S and Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010, 11(10): R106 \bibitem{mccarthy} McCarthy DJ, Chen Y, and Smyth GK. Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Res. 2012, 40(10): 4288-4297 \bibitem{glaus} Glaus P, Honkela A, and Rattray M. Identifying differentially expressed transcripts from RNA-seq data with biological variation. Bioinformatics 2012, 28(13): 1721-1728 \bibitem{roku} Kadota K, Ye J, Nakai Y, Terada T, and Shimizu K. ROKU: a novel method for identification of tissue-specific genes. BMC Bioinformatics 2006, 7: 294 \bibitem{ueda} Ueda T. Simple method for the detection of outliers. Japanese J Appl Stat 1996, 25: 17-26 \bibitem{robinson3} Robinson MD and Smyth GK. Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics 2008, 9: 321-332 \bibitem{sun} Sun J, Nishiyama T, Shimizu K, and Kadota K. TCC: an R package for comparing tag count data with robust normalization strategies. BMC Bioinformatics 2013, 14: 219 \bibitem{di} Di Y, Schafer DW, Cumbie JS, and Chang JH. The NBP negative binomial model for assessing differential gene expression from RNA-Seq. Stat Appl Genet Mol Biol. 2011, 10: art24 \end{thebibliography} \end{document}