%\VignetteEngine{knitr::knitr}
%\VignetteIndexEntry{Engineering Evaluation by Gene Categorization  (eegc)}
%\VignettePackage{eegc}

% To compile this document
% library('knitr'); rm(list=ls()); knit('eegc/vignettes/eegc.Rnw')
% library('knitr'); rm(list=ls()); knit2pdf('eegc/vignettes/eegc.Rnw'); openPDF('eegc.pdf')
%

\documentclass[12pt]{article}

\newcommand{\eegc}{\textit{eegc}}
\usepackage{ dsfont }
\usepackage{multirow}
\usepackage[flushleft]{threeparttable}
\usepackage{pifont}
\usepackage[singlelinecheck=false]{caption}

<<knitr, echo=FALSE, results="hide">>=
library("knitr")
opts_chunk$set(tidy=FALSE,tidy.opts=list(width.cutoff=30),dev="pdf",
               fig.width=7,fig.hight=5,
               fig.show="hide",message=FALSE)
@

<<style, eval=TRUE, echo=FALSE, results="asis">>=
BiocStyle::latex()
@
\author{Xiaoyuan Zhou$^{1}$, Guofeng Meng$^{2}$, Christine Nardini$^{1,3}$ and Hongkang Mei$^{2}$ \\[1em]
  \small{$^{1}$ CAS-MPG Partner Institute for Computational Biology, Shanghai Institutes for Biological} \\
  \small{Sciences, Chinese Acadomy of Sciences; University of Chinese Academy of Sciences;} \\
  \small{$^{2}$ Computational and Modeling Science, PTS China, GSK R\&D China, Shanghai;} \\
  \small{$^{3}$ Personalgenomics, Strada Le Grazie 15 - 37134 Verona, Italy} \\
  \small{\texttt{$^*$Correspondence to zhouxiaoyuan (at) picb.ac.cn}}}

\bioctitle[Microarray and RNA-seq Data with \eegc{}]{Engineering Evaluation by Gene Categorization of Microarray and RNA-seq Data with \eegc{} package}

\begin{document}

\maketitle

\begin{abstract}
  \vspace{1em}
  This package has been developed to evaluate cellular engineering processes for direct differentiation of stem cells or conversion (transdifferentiation) of somatic cells to primary cells based on high throughput gene expression data screened either by DNA microarray or RNA sequencing. The package takes gene expression profiles as inputs from three types of samples: (i) somatic or stem cells to be (trans)differentiated (input of the engineering process), (ii) induced cells to be evaluated (output of the engineering process) and (iii) target primary cells (reference for the output). The package performs differential gene expression analysis for each pair-wise sample comparison to identify and evaluate the transcriptional differences among the 3 types of samples (input, output, reference). The ideal goal is to have induced and primary reference cell showing overlapping profiles, both very different from the original cells.

  Using the gene expression profile of original cells versus primary cells, a gene in the induced cells can either be successfully induced to the expression level of primary cells, remain inactive as in the somatic cells, or be insufficiently induced.  Based on such differences, we can categorizes differential genes into three intuitive categories: \emph{Inactive}, \emph{Insufficient}, \emph{Successful} and two additional extreme states: \emph{Over} and \emph{Reversed} representing genes being over (way above/below the expected level of in/activation) or reversely regulated. By further functional and gene regulatory network analyses for each of the five gene categories, the package evaluates the quality of engineered cells and highlights key molecules that represent transcription factors (TFs) whose (in)activation needs to be taken into account for improvement of the cellular engineering protocol, thus offering not only a quantification of the efficacy of the engineering process, but also workable information for its improvement.

  \vspace{1em}
\textbf{\eegc{} version:} \Sexpr{packageDescription("eegc")$Version} \footnote{This document used the vignette from \Bioconductor{} package \Biocpkg{DESeq2} as \CRANpkg{knitr} template}
\end{abstract}
<<options, results="hide", echo=FALSE>>=
options(digits=3, width=80, prompt=" ", continue=" ")
@


\newpage

\tableofcontents

\section{Introduction}

Cellular engineering is among the most promising and yet questioned cellular biology related approaches, and consists of the man-made differentiation of pluripotent undifferentiated stem cells into tissue-specific (primary) cells, mimicking the processes naturally occurring during human embryonic development, and of the direct conversion from somatic to primary cells, which is relatively efficient and rapid than differentiation but is limited by incomplete conversion. One of the earliest issues to be properly addressed in this area is the possibility to quantify and assess the quality of the induced cells, i.e. to measure if/how the differentiation process has been successful and the obtained cells are sufficiently similar to the target primary cells, for further applications in regenerative therapy, disease modeling and drug discovery.

The natural approach consists of comparing the transcriptional similarity of engineered cells to the target primary cells, with a focus on the marker genes that are specifically expressed in somatic and primary cells \cite{Sandler2014}. Indeed, not all marker genes are successfully induced to the expression level typical of primary cells, implying imperfections of the reprogrammed cells to an extent that needs to be quantified and evaluated.

Indeed, the focus on a selected number of transcription fators (TFs) may be limiting, and unable to offer alternative solutions to improve an engineering process. As cells represent a complex and tightly interconnected system, the failure to activate one or more target genes impacts on a variable number of interconnected and downstream genes, affecting in turn a number of biological functions.
For this reason the approach we propose is not target-specific but systemic and beyond offering a list of TF whose (in)activation needs attention, can also qualify and quantify the detrimental effects of an imperfect reprogramming on the topology of the gene network and the biological functions affected, thus offering information on the usability of the obtained cells and suggesting a potentially broader number of targets to be affected to improve the process.

Based on these observations, we propose -as briefly introduced above- to classify the genes into five different categories which describe the states of the genes upon (trans)differentiation. The \emph{Successful} category is represented by the genes whose expression in the engineered cells is successfully induced to a level similar to the target primary cells; conversely the \emph{Inactive} category is represented by the genes whose expression are unchanged from the level of initiating cells but should be induced to the expression level of primary cells. Between \emph{Successful} and \emph{Inactive}, genes can be defined \emph{Insufficient} when their expression were modified from the input cells but not enough to reach the level of the target cells.  Additionally, because of the induction of transcription factors, some genes can be over expressed in the engineered cells in comparison to the target primary cells, here defined as \emph{Over}; finally, some genes appear to be differentially expressed in a direction opposite to the expected one, these are defined as \emph{Reserved}.
By these definitions, a successful engineering cell process is expected to offer a significant number of \emph{Successful} genes, including the marker genes, and a minority of \emph{Inactive} genes. All other three categories contribute to the understanding of the deviations that gave rise to the unexpected outcome of the engineering process.
This is namely obtained with 3 outputs offered to the package users, in addition to the identification of inefficient TFs induction: (i) functional enrichment, (ii) tissue specific and (iii) gene regulatory network analyses. Each of those contribute to a better understanding of the role that each gene category plays in the engineering process and clarify the deficiencies of the engineered cells for potential improvements.

\section{Installing the \eegc{} package}

\eegc{} requires the following CRAN-R packages:  \CRANpkg{R.utils}, \CRANpkg{sna}, \CRANpkg{wordcloud}, \CRANpkg{igraph} and \CRANpkg{pheatmap}, and the \Bioconductor{} packages:\Biocpkg{limma}, \Biocpkg{edgeR}, \Biocpkg{DESeq2}, \Biocpkg{clusterProfiler}, \Biocpkg{org.Hs.eg.db}, \Biocpkg{org.Mm.eg.db}.

When \eegc{} is installed from \Bioconductor{}, all dependencies are installed.

<<install_eegc, eval=FALSE>>=
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("eegc")
@

Then package is loaded by:

<<init_eegc, cache=FALSE, eval=TRUE,warning=FALSE>>=
library(eegc)
@

\section{Preparing Input Data}

\eegc{} takes an input of gene expression data (with genes in rows and samples in columns) screened by microarray or RNA-seq (in FPKM -Fragments Per Kilobase of transcript per Million mapped reads- or counts). Samples belong to three types of samples: original cells (input), induced cells (ouput), and primary cells (output target).
In this vignette, we used human RNA-seq expression FPKMs data published by Sandler et Al. as example data. Here, human Dermal Microvascular Endothelial Cells (DMEC) were transduced with transcription factors and cultured in vascular niche to induce the growth of haematopoietic stem and multipotent progenitor cells (rEChMPP) compared with the target primary Cord Blood cells (CB).

<<load_data, eval=TRUE>>=
# load Sandler's data set:
data(SandlerFPKM)

#the column names of the data, representing the samples CB, DMEC, and rEChMPP
colnames(SandlerFPKM)
@


\section{Gene Differential Analysis}

The differential analysis is achieved by the \Rfunction{diffGene} function, and two packages  \Biocpkg{limma} \cite{Smyth2004} and \Biocpkg{DESeq2} \cite{Love2014} are selectively applied to microarray/FPKM data and counts data, respectively.  Before the gene differential analysis, the removal of low expressed genes is selectively performed (specifying \Rcode{TRUE} or \Rcode{FALSE} in the \Rcode{filter} parameter) by removal of the genes absent above a given percentage of samples and a log transformation is done on the (filtered) data. The significantly differential genes are identified for each pair-wise sample comparison (DMEC vs rEChMPP, DEMC vs CB, rEChMPP vs CB) and with a given corrected p-value cutoff.

<<diff_analysis, eval=TRUE, warning=FALSE>>=
# differential expression analysis:
diffgene = diffGene(expr = SandlerFPKM, array=FALSE,  fpkm=TRUE,  counts=FALSE,
                    from.sample="DMEC", to.sample="rEChMPP", target.sample="CB",
                    filter=TRUE, filter.perc =0.4, pvalue = 0.05 )
@

The function gives a list of outputs further used in the following analyses, including the differential result detailed below in \Rcode{diffgene.result}, the sole differential gene names in \Rcode{diffgene.genes} and the filtered gene expression values as in \Rcode{expr.filter}.

<<diffresults, eval=TRUE, warning=FALSE >>=
# differential analysis results
diffgene.result = diffgene[[1]]

# differential genes
diffgene.genes = diffgene[[2]]

#filtered expression data
expr.filter = diffgene[[3]]
dim(expr.filter)
dim(SandlerFPKM)
@


\section{Gene Categorization}
Gene (\Rcode{g}) categorization is achieved through the pair-wise comparisons (Expression Difference, ED) as defined in eq.~\ref{eq:eq1}, a difference of \Rcode{g} average expression in  A and B samples among the three types of samples as shown in Table \ref{tab:geneCategorization}, and the ratio of such differences (\Rcode{EDg} ratio, eq.~\ref{eq:eq2}) .

\begin{equation}\label{eq:eq1}
  ED_g(A,B) = \overline{E_g} \textrm{ in A} - \overline{E_g} \textrm{ in B}
\end{equation}

\begin{equation}\label{eq:eq2}
  ED_g \textrm{ ratio} = \frac{ED_g(rEChMPP, DMEC)}{ED_g(CB, DMEC)}
\end{equation}

The five gene categories are first defined by the ED patterns observed among three pair-wise comparisons as shown in the ED columns of Table \ref{tab:geneCategorization}.  Based on these definitions, categories \emph{Reserved} and \emph{Over} are undistinguishable, but become clearly distinct by evaluation of the ED ratios (eq.~\ref{eq:eq2}) in the corresponding column of Table \ref{tab:geneCategorization}.

At this stage, \emph{Inactive} and \emph{Successful} ED ratios are, conveniently, around 0 and 1, however, they cover a relatively wide range of values, with queues overlapping with the \emph{Over} and \emph{Insufficient} categories for \emph{Successful} genes, and with \emph{Reverse} and \emph{Insufficient} for \emph{Inactive} genes. To gain an accurate and practical categorization (operational in term of indications as to which genes need attention in the engineering process), \emph{Inactive} and \emph{Successful} genes boundaries were set more stringently around the intuitive peaks of 0 and 1, by shrinking the ED ratio boundaries to what we name operational ranges in Table \ref{tab:geneCategorization}, which correspond to the 5th and 95th quantile of the ED-ranked \emph{Successful} and \emph{Inactive} genes.

\newpage
\begin{table}[]
\centering
\caption{\label{tab:geneCategorization} Gene categorization base on differential analysis and ED ratio}
\begin{tabular}{|c|c|c|c|c|c|}
\hline
\multicolumn{1}{|c|}{\multirow{2}{*}{Category}} & \multicolumn{3}{c|}{ED Patterns}                                                                                                                                       & \multicolumn{1}{l|}{\multirow{2}{*}{ED Ratio}} & \multirow{2}{*}{Operational Ranges} \\ \cline{2-4}
\multicolumn{1}{|c|}{}                          & \begin{tabular}[c]{@{}c@{}}CB,\\ DMEC\end{tabular} & \begin{tabular}[c]{@{}c@{}}rEChMPP, \\ DMEC\end{tabular} & \begin{tabular}[c]{@{}c@{}}rEChMPP,\\  CB\end{tabular} & \multicolumn{1}{l|}{}                          &                                     \\ \hline
\textit{Reversed}                               & \multirow{2}{*}{\ding{51}/\ding{55}}                               & \multirow{2}{*}{\ding{55}}                                       & \multirow{2}{*}{\ding{51}}                                     & \textless0                                     &                                     \\ \cline{1-1} \cline{5-6}
\textit{Over}                                   &                                                    &                                                          &                                                        & \textgreater1                                  &                                     \\ \hline
\textit{Inactive}                               & \ding{51}                                                  & \ding{55}                                                        & \ding{51}                                                      &$\sim$0                                                & (Q$^{5th}$ ED ratio, Q$^{95th}$ ED ratio)         \\ \hline
\textit{Insufficient}                           & \ding{51}                                                  & \ding{51}                                                        & \ding{51}                                                      & 0$\sim$1                                            &                                     \\ \hline
\textit{Successful}                             & \ding{51}                                                  & \ding{51}                                                        & \ding{55}                                                      &$\sim$1                                                & (Q$^{5th}$ ED ratio, Q$^{95th}$ ED ratio)           \\ \hline
\end{tabular}
\begin{tablenotes}
\small
\item \ding{51} differential, \ding{55} nondifferential; Q$^{xth}$ ED ratio: xth quantile of the ranked ED ratios
\end{tablenotes}
\end{table}

Function \Rfunction{categorizeGene} performs the categorization and gives a list of outputs with five categories \emph{Reversed}, \emph{Inactive}, \emph{Insufficient}, \emph{Successful} and \emph{Over} with: 1) the gene symbols in each category, 2) corresponding \Rcode{ED} ratios.

<<gene_categorization, eval=TRUE>>=
# categorizate differential genes from differential analysis
category = categorizeGene(expr = expr.filter,diffGene = diffgene.genes,
                          from.sample="DMEC",
                          to.sample="rEChMPP",
                          target.sample="CB")
cate.gene = category[[1]]
cate.ratio = category[[2]]

# the information of cate.gene
class(cate.gene)
length(cate.gene)
names(cate.gene)

head(cate.gene[[1]])
head(cate.ratio[[1]])
@


\section{Gene Expression Pattern Visualization}

Expression profile of genes in each category can be visualized as different expression patterns with the markerScatter function that generates a scatter plot of gene expression for the five categories in paired arms and highlight the marker genes in the output (Figure~\ref{fig:markerScatter}). We also apply a linear model to fit the expression profiles of samples on the x- and y-axes, with the possibility to selectively add the regression lines on  the figure to show the sample correlation.
To avoid confusion and allow a distinct visualization of all expression profiles, \emph{Inactive}, \emph{Insufficient} and \emph{Successful} genes are plotted separately (Figure~\ref{fig:markerScatter} above) from the \emph{Reserved} and \emph{Over} genes (Figure~\ref{fig:markerScatter} below), with each of the 5 categories being  plotted from left to right to highlight the expression trend change from DMEC to rEChMPP.

<<markerScattera, eval = TRUE, echo=TRUE,fig.width =6, fig.height =6,dev="pdf">>=
#load the marker genes of somatic and primary cells
data(markers)

#scatterplot
col = c("#abd9e9", "#2c7bb6", "#fee090", "#d7191c", "#fdae61")
markerScatter(expr = expr.filter, log = TRUE, samples = c("CB", "DMEC"),
              cate.gene = cate.gene[2:4], markers = markers, col = col[2:4],
              xlab = expression('log'[2]*' expression in CB (target)'),
              ylab = expression('log'[2]*' expression in DMEC (input)'),
              main = "")
@

<<markerScatterb,eval = TRUE, echo=TRUE,fig.width =6, fig.height =6,dev="pdf">>=
markerScatter(expr = expr.filter, log = TRUE, samples = c("CB", "rEChMPP"),
              cate.gene = cate.gene[2:4], markers = markers, col = col[2:4],
              xlab = expression('log'[2]*' expression in CB (target)'),
              ylab = expression('log'[2]*' expression in rEC-hMPP (output)'),
              main = "")
@

<<markerScatterc,eval = TRUE, echo=TRUE,fig.width =6, fig.height =6,dev="pdf">>=
markerScatter(expr = expr.filter, log = TRUE, samples = c("CB", "DMEC"),
              cate.gene = cate.gene[c(1,5)], markers = markers, col = col[c(1,5)],
              xlab = expression('log'[2]*' expression in CB (target)'),
              ylab = expression('log'[2]*' expression in DMEC (input)'),
              main = "")
@

<<markerScatterd, eval = TRUE, echo=TRUE,fig.width =6,fig.height =6,dev="pdf">>=
markerScatter(expr = expr.filter, log = TRUE, samples = c("CB", "rEChMPP"),
              cate.gene = cate.gene[c(1,5)], markers = markers, col = col[c(1,5)],
              xlab = expression('log'[2]*' expression in CB (target)'),
              ylab = expression('log'[2]*' expression in rEC-hMPP (output)'),
              main = "")
@

\begin{figure}
\centering
\includegraphics[width=0.4\textwidth]{figure/markerScattera-1}
\includegraphics[width=0.4\textwidth]{figure/markerScatterb-1}
\includegraphics[width=0.4\textwidth]{figure/markerScatterc-1}
\includegraphics[width=0.4\textwidth]{figure/markerScatterd-1}
\caption{Expression profile (FPKM in log scale) of the five gene categories in CB primary cells against the endothelial cells (left) and rEChMPPs (right), respectively.}
\label{fig:markerScatter}
\end{figure}


\section{Quantifying Gene Categories}

One simple metric to quantify the success of the cellular engineering process is the proportion of genes in each category among all the categorized genes. A high proportion of \emph{Successful} genes reflects a good (trans)differentiation. Thus we produce a density plot to quantify the ED ratios of each gene category.  As proposed in Table \ref{tab:geneCategorization}, the ED ratios of \emph{Successful} genes are around 1, \emph{Inactive} genes around 0 and \emph{Insufficient} genes are between 0 and 1. Extreme higher or lower ratios are given by the \emph{Reserved} or \emph{Over} genes, respectively, to make the ratios on x axis readable, we suggest to narrow the ratios of \emph{Reserved} and \emph{Over} genes to a maximum of their median values (Figure~\ref{fig:densityPlot}).

<<densityPlot,eval=TRUE,fig.width=8,fig.height=5,dev="pdf">>=
# make the extreme ED ratios in Reversed and Over categories to the median values
reverse = cate.ratio[[1]]
over = cate.ratio[[5]]
reverse[reverse[,1] <= median(reverse[,1]), 1]  = median(reverse[,1])
over[over[,1] >= median(over[,1]),1] = median(over[,1])
cate.ratio[[1]] = reverse
cate.ratio[[5]] = over

# density plot with quantified proportions
densityPlot(cate.ratio, xlab = "ED ratio", ylab = "Density", proportion = TRUE)
@

\begin{figure}
\centering
\includegraphics[width=0.8\textwidth]{figure/densityPlot-1}
\caption{The proportion of genes in each category.}
\label{fig:densityPlot}
\end{figure}
\section{Functional Enrichment Analysis and Visualization}

The functional annotation for each gene category is performed by applying the R package \Biocpkg{clusterProfiler} \cite{Yu2012}.
Given an input genelist with five gene categories and having set the \Rcode{organism} parameter (optionally, \Rcode{human} or \Rcode{mouse}), \Rfunction{functionEnrich} performs the functional enrichment analysis for Gene Ontology (GO) \cite{Ashburner2000} and Kyoto Encyclopedia of genes and Genomes (KEGG) pathway \cite{Kanehisa2012} with either hypergeometric test or Gene Set Enrichment Analysis (GSEA).

<<functionEnrichment, eval=TRUE, warning = FALSE>>=
#  result in "enrichResult" class by specifying TRUE to enrichResult parameter
goenrichraw = functionEnrich(cate.gene, organism = "human", pAdjustMethod = "fdr",
                             GO = TRUE, KEGG = FALSE, enrichResult = TRUE)
class(goenrichraw[[1]])
@


<<functionEnrichment2, eval=FALSE, warning = FALSE>>=
# result of the summary of "enrichResult" by specifying FALSE to enrichResult parameter
# GO enrichment
goenrich = functionEnrich(cate.gene, organism = "human", pAdjustMethod = "fdr",
                          GO = TRUE, KEGG = FALSE, enrichResult = FALSE)
# KEGG enrichment
keggenrich = functionEnrich(cate.gene, organism = "human", pAdjustMethod = "fdr",
                            GO = FALSE, KEGG = TRUE, enrichResult = FALSE)
@

To describe the significantly enriched functional terms in each category (Figure~\ref{fig:barplotEnrich}) and for further comparison within the five gene categories (Figure~\ref{fig:heatmapPlot}), a bar plot function \Rfunction{barplotEnrich} (modified from the \Rfunction{barplot.enrichResult} function in \CRANpkg{DOSE} \cite{He2015} package) and heatmap plot function \Rfunction{heatmapPlot} are added in the package, outputting the most enriched terms selected by parameter \Rcode{top}.

<<barplotEnrich,eval=TRUE,dev="pdf",fig.width=6,fig.height=3>>=
# plot only the "enrichResult" of Inactive category
inactive = goenrichraw[[2]]
barplotEnrich(inactive, top =5, color ="#2c7bb6", title = "Inactive")
@

\begin{figure}
\centering
\includegraphics[width=0.6\textwidth]{figure/barplotEnrich-1}
\caption{Example of top 5 significantly enriched Gene Ontology terms in the \emph{Inactive} category. The shade of colors represents enrichment p-values and bar length represents the count of genes involved in each GO term.}
\label{fig:barplotEnrich}
\end{figure}

<<heatmapPlot,eval=TRUE,echo=TRUE,dev="pdf",fig.width=7,fig.heigth=7>>=
# plot the enrichment results by the five gene categories
data(goenrich)
heatmaptable = heatmapPlot(goenrich, GO = TRUE, top = 5, filter = FALSE,
                           main = "Gene ontology enrichment",
                           display_numbers =  FALSE)
@

\begin{figure}
\centering
\includegraphics[width=0.6\textwidth]{figure/heatmapPlot-1}
\caption{Top 5 significantly enriched Gene Ontology terms in each gene category based on the log transformed corrected p-values. The counts of genes involved in each functional term are displayed on the heatmap by specifying TRUE to the \Rcode{display\_numbers} parameter in heatmapPlot function.}
\label{fig:heatmapPlot}
\end{figure}

This analysis helps to evaluate the engineered cells at the functional level by identifying the non-activated functions that play important roles in primary cells but, being enriched in the \emph{Inactive} genes categories, lack in the engineered cells.

\section{Cell/Tissue-specific Enrichment Analysis}
\label{sec:tissueEnrich}

The five gene categories represent different reprogramming progresses or directions of the original cells towards primary cells. These progresses or directions can also be explored, in addition to the functional role investigated above, by cell/tissue (C/T)-specific analysis. Downstream of a successful cellular engineering process, the expression of somatic cell-specific genes will be down-regulated while the expression of primary cell-specific genes will be up-regulated.  So ideally each of the five gene categories is mainly composed by two gene types: somatic or primary genes.  By the C/T-specific analysis, assuming the most extreme values of expression represent effective up or down regulation and compute based on this, we can explore which C/T-specific genes results the success or failure of the engineering process.

The database Gene Enrichment Profiler, containing the expression profiles of ~12,000 genes with NCBI GeneID entries across 126 primary human cells/tissues in 30 C/T groups, is used for this C/T-specific analysis \cite{Benita2010}.

In Gene Enrichment Profiler, genes specificity to a given cell/tissue is ranked using a custom defined enrichment score.  For our usage in this package ranking is not sufficient as cell/tissue-genes sets are needed and therefore we applied \Biocpkg{SpeCond} \cite{Gavalli2009}, a method to detect condition-specific gene, to identify the C/T-specific gene sets . Then we apply the hypergeometric test to assess the specificity significance of gene categories in each tissue with the enrichment function and visualize the enrichment results by the heatmapPlot function (Figure~\ref{fig:tissueheatmap}).

<<tissueheatmap, eval=TRUE,echo=TRUE,dev="pdf",fig.width=7,fig.height=8>>=
#load the cell/tissue-specific genes
data(tissueGenes)
length(tissueGenes)
head(names(tissueGenes))

#load the mapping file of cells/tissues to grouped cells/tissues
data(tissueGroup)
head(tissueGroup)

#get the background genes
genes = rownames(expr.filter)
#enrichment analysis for the five gene categories
tissueenrich = enrichment(cate.gene = cate.gene, annotated.gene = tissueGenes,
                          background.gene = genes, padjust.method = "fdr")

#select a group of cells/tissues
tissueGroup.selec = c("stem cells","B cells","T cells","Myeloid","Endothelial CD105+")
tissues.selec = tissueGroup[tissueGroup[,"Group"] %in% tissueGroup.selec,c(2,3)]
tissuetable = heatmapPlot(tissueenrich, terms = tissues.selec, GO=FALSE,
                          annotated_row = TRUE,annotation_legend = TRUE,
                          main = "Tissue-specific enrichment")
@

\begin{figure}
\centering
\includegraphics[width=0.7\textwidth]{figure/tissueheatmap-1}
\caption{Cells/tissues-specific enrichment among five gene categories in selected cells/tissues.}
\label{fig:tissueheatmap}
\end{figure}

It is expected that the \emph{Successful} genes are enriched in both somatic and primary cells/tissues but not the \emph{Inactive} or \emph{Insufficient} genes.

\section{Evaluation Based on Gene Regulatory Network (GRN) Analysis}

The package also build the gene-gene regulation network for each category based on cell/tissue-specific gene regulatory networks (GRNs) constructed by the CellNet \cite{Morris2014} team through the analysis of 3419 published gene expression profiles in 16 human and mouse cells/tissues. The complete table of regulatory relationships for all genes (not limited to C/T-specific ones) and for C/T-specific ones are downloaded from the \href{http://cellnet.hms.harvard.edu/downloads/}{CellNet} website.

\subsection{CellNet-based Cell/Tissue-specific analysis}

Construction is done by checking the percentage of overlapping genes in each category with the genes involved in each  C/T-specific GRN by the dotPercentage function (Figure~\ref{fig:dotpercentage}) and then performing a C/T-specific enrichment analysis, as in the \hyperref[sec:tissueEnrich]{Cell/Tissue-specific Enrichment Analysis} section, based on these gene sets.

<<dotpercentage, eval = TRUE, echo=TRUE,dev="pdf",fig.width=7,fig.height=5>>=
#load the C/T-specific genes in 16 cells/tissues
data(human.gene)

# the 16 cells/tissues
head(names(human.gene))
perc = dotPercentage(cate.gene = cate.gene, annotated.gene = human.gene,
              order.by = "Successful")
@

\begin{figure}
\centering
\includegraphics[width=0.6\textwidth]{figure/dotpercentage-1}
\caption{Percentage of genes in each category overlapping in each cell- and tissue-specific gene set.}
\label{fig:dotpercentage}
\end{figure}

<<cellnetheatmap, eval = FALSE, echo=TRUE,dev="pdf">>=
# CellNet C/T-specific enrichment analysis
cellnetenrich = enrichment(cate.gene = cate.gene, annotated.gene = human.gene,
                           background.gene = genes, padjust.method ="fdr")
cellnetheatmap = heatmapPlot(cellnetenrich,
                             main = "CellNet tissue specific enrichment")
@

\subsection{Cell/Tissue-specific Transcription Factor Analysis}

In our specific example, the observation that some \emph{Inactive} genes are enriched in the primary cells, represents an important information regarding the transcription factors regulating these genes, TFs that are potentially necessary for a successful cellular engineering. Therefore in a second step, we extract the cell- and tissue-specific transcription factors and their down-stream regulated genes into gene sets, and apply the gene set enrichment analysis on the five gene categories.  A heatmap can be plotted to compare the C/T-specific transcription factors enriched by different categories just as shown in Figure~\ref{fig:tfheatmap}.

<<tfheatmap,eval=TRUE,echo=TRUE,fig.width=5,fig.height=6,dev="pdf">>=
# load transcription factor regulated gene sets from on CellNet data
data(human.tf)
tfenrich = enrichment(cate.gene = cate.gene,  annotated.gene = human.tf,
                      background.gene = genes, padjust.method ="fdr")
tfheatmap = heatmapPlot(tfenrich, top = 5,
                        main = "CellNet transcription factor enrichment")
@

\begin{figure}
\centering
\includegraphics[height=0.6\textwidth, width=0.5\textwidth]{figure/tfheatmap-1}
\caption{Top 5 significantly enriched cell/tissue-specific transcription factors by each gene category based on the log transformed corrected p-values.}
\label{fig:tfheatmap}
\end{figure}

\subsection{Network Topological Analysis and Visualization}

Finally, we introduce the topological analysis for the C/T-specific gene regulatory networks including genes in each of five categories, by calculating the degree, closeness, betweenness and stress centrality with the \CRANpkg{igraph} package \cite{Nepusz2006} and \CRANpkg{sna} \cite{Carter2014}. Given an input of genes and their centrality, \Rfunction{grnPlot} function plots the regulatory network with these genes as nodes and centrality as node size to represent their importance in the network (Figure~\ref{fig:grnPlot}).

<<networkanalyze,eval = TRUE,echo=TRUE>>=
# load the CellNet GRN
data(human.grn)

# specify a tissue-specifc network
tissue = "Hspc"
degree = networkAnalyze(human.grn[[tissue]], cate.gene = cate.gene,
                        centrality = "degree", mode ="all")
head(degree)
@

<<grnPlot,eval = TRUE,echo=TRUE,dev="pdf",fig.width=6,fig.height=6>>=
# select genes to shown their regulation with others
node.genes = c("ZNF641", "BCL6")

# enlarge the centrality
centrality.score = degree$centrality*100
names(centrality.score) = degree$Gene
par(mar = c(2,2,3,2))
grnPlot(grn.data = human.grn[[tissue]], cate.gene = cate.gene, filter = TRUE,
        nodes = node.genes, centrality.score = centrality.score,
        main = "Gene regulatory network")
@

\begin{figure}
\centering
\includegraphics[height=0.6\textwidth, width=0.6\textwidth]{figure/grnPlot-1}
\caption{The hematopoietic stem/progenitor cell (Hspc)-specifc gene regulatory network regulated by "ZNF641" and "BCL6" genes. Node size is represented by the degree centrality of node in a general Hspc-specific network.}
\label{fig:grnPlot}
\end{figure}

The analysis clarifies the importance of genes, especially the transcription factors, in terms of their ability to connect other genes in a network. Hence, this provides a way to predict the relevance of molecules in terms of their topological centrality, and offer information regarding potential transcription factors to be used for an improvement of cellular engineering.

\newpage

\begin{thebibliography}{1}
\bibitem{Sandler2014}
Sandler, V.M., et al.,
\newblock Reprogramming human endothelial cells to haematopoietic cells
  requires vascular induction.
\newblock {\em Nature}, 2014. 511(7509): p. 312-8.

\bibitem{Smyth2004}
Smyth, G.K.,
\newblock Linear models and empirical bayes methods for assessing
  differential expression in microarray experiments.
\newblock {\em Appl Genet Mol Biol}, 2004. 3: p. Article3.

\bibitem{Love2014}
Love, M.I., W. Huber, and S. Anders,
\newblock Moderated estimation of fold change and dispersion for
  RNA-seq data with DESeq2.
\newblock {\em Genome Biol}, 2014. 15(12): p. 550.

\bibitem{Yu2012}
Yu, G., et al.,
\newblock clusterProfiler: an R package for comparing biological
  themes among gene clusters.
\newblock {\em OMICS}, 2012. 16(5): p. 284-7.

\bibitem{Ashburner2000}
Ashburner, M., et al.,
\newblock  Gene ontology: tool for the unification of biology.
\newblock {\em The Gene Ontology Consortium.  Nat Genet}, 2000. 25(1): p. 25-9.

\bibitem{Kanehisa2012}
Kanehisa, M., et al.,
\newblock KEGG for integration and interpretation of large-scale molecular data sets.
\newblock {\em  Nucleic Acids Research}, 2012. 40(Database issue): p. D109-14.

\bibitem{He2015}
He, G.Y.a.L.-G.W.a.G.-R.Y.a.Q.-Y.,
\newblock DOSE: an R/Bioconductor package for Disease Ontology
  Semantic and Enrichment analysis.
\newblock {\em Bioinformatics}, 2015. 31(4): p. 608-609.

\bibitem{Benita2010}
Benita, Y., et al.,
\newblock Gene enrichment profiles reveal T-cell development,
  differentiation, and lineage-specific transcription factors including ZBTB25 as a novel NF-AT repressor.
\newblock {\em Blood}, 2010. 115(26): p. 5376-84.

\bibitem{Gavalli2009}
Cavalli, F.,  2009.
\newblock SpeCond: Condition specific detection from expression data.
\newblock 2009.

\bibitem{Morris2014}
Morris, S.A., et al.,
\newblock  Dissecting engineered cell types and enhancing cell fate
  conversion via CellNet.
\newblock {\em Cell}, 2014. 158(4): p. 889-902.

\bibitem{Nepusz2006}
Nepusz, G.C.a.T.,
\newblock The igraph software package for complex network research.
\newblock {\em InterJournal}, 2006. Complex Systems: p. 1695.

\bibitem{Carter2014}
Carter T. Butts,
\newblock sna: Tools for Social Network Analysis.
\newblock {\em R package version 2.3-2}, 2014.

\end{thebibliography}

% \bibliography{eegc/REFERENCES}

\section{Session Info}

<<session_info, eval=TRUE>>=
sessionInfo()
@

\end{document}