% NOTE -- ONLY EDIT THE .Rnw FILE!!!  The .tex file is
% likely to be overwritten.
%% Modified using Sweave2knitr().

%\VignetteIndexEntry{Pigengene: Computing and using eigengenes}
%\VignetteDepends{Pigengene}
%\VignetteKeywords{Gene expression, Network, Biomedical Informatics, Systems Biology}
%\VignettePackage{Pigengene}
%\VignetteEngine{knitr::knitr}

%% Commands:
\newcommand{\pipa}{\Biocpkg{Pigengene} }
\newcommand{\E}{\mathcal{E}}

\documentclass[12pt]{article}

<<style, eval=TRUE, echo=FALSE, results="asis">>=
BiocStyle::latex()
@

\bioctitle{\Biocpkg{Pigengene}: Computing and using eigengenes}
\author{Habil Zare}
\date{Modified: 26 April, 2016. Compiled: \today}

\begin{document}

\maketitle

\tableofcontents


\section{Introduction}
Gene expression profiling technologies such as microarray or RNA Seq provide 
valuable datasets, however, inferring biological information from these data 
remains cumbersome. \pipa address two challenges:
\begin{enumerate}
\item 
  \textbf{Curse of dimensionality:} The number of features in an expression profile is
  usually very high. For instance, there are about 20,000 genes in human. In contrast, the number
  of samples (patients) is often very limited in practice, and may not exceed a few
  hundreds. Yeung et al. have shown that standard data reduction methods such as principal 
  component analysis (PCA) are not appropriate to directly apply on gene expression data 
  \cite{yeung2001principal}.
  Instead, \pipa  addresses this challenge by applying PCA on gene modules.
\item
  \textbf{Normalization:} Data produced using different technologies, or in different labs,
  are not easily comparable. \pipa identifies {\em eigengenes}, informative biological signatures
  that are robust with respect to the profiling platform. For instance, it
  can identify the signatures (compute the eigengenes) on microarray data, and infer them
  on biologically-related RNA Seq data. The resulting  signatures are directly comparable even if the
  set of samples (patients)  are independent and disjoint in the two analyzed datasets.
\end{enumerate}


\section{How to run \pipa?}
\subsection{Installation}
\pipa is an \R{} package that can be downloaded and 
installed from \Bioconductor{} by the followig commands in R:
\newline\newline
\texttt{if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")}
\newline
\texttt{BiocManager::install("Pigengene")}

Alternatively, if the source code is already available, 
the package can be installed by the following command in Linux:
\newline
$\\$
\texttt{R CMD INSTALL Pigengene\_x.y.z.tar.gz}
\newline
$\\$
where x.y.z. determines the version. The second approach requires all 
the dependencies be installed manually, therefore, the first approach
is preferred.

\subsection{A quick overview}
\pipa identifies gene modules (clusters), computes an eigengene for each 
module, and uses these biological signatures as features for classification.
The main function is \Rfunction{one.step.pigengene} which requires a gene 
expression profile and the corresponding conditions (types).
Individual functions are also provided to facilitate running the pipeline in a 
customized way. The inferred biological signatures (eigengenes)
are useful for supervised or unsupervised analyses.

\subsection{What is an eigengene?}
In most functions of this package, eigenegenes are computed or used as robust
biological signatures. Briefly, each eigengene $\E$ is a weighted average of the
expression of all genes in a given set of $n$ genes
(also known as a gene module or a cluster of genes).

%%\begin{ceqn}
\begin{equation}
    \E=\alpha_1 g_1 + \alpha_2 g_2+ \dots +  \alpha_n g_n,
    \label{eq:eigen}
\end{equation}
%%\end{ceqn}
\noindent where $\alpha_i$ represents the weight corresponding to gene $g_i$.
The weights are adjusted  in a way that the explained variance is maximized. 
This guarantees that the loss in the biological information in minimized.

\subsection{A toy example}
For a quick start, the application of \pipa pipeline on some leukemia dataset is
demonstrated below \cite{mills2009microarray}. 
The first step is to load the package and data in \R{}:

<<loading, fig.width=6, fig.height=6, echo=TRUE>>=
library(Pigengene)
data(aml)
data(mds)
d1 <- rbind(aml,mds)
Labels <- c(rep("AML",nrow(aml)),rep("MDS",nrow(mds)))
names(Labels) <- rownames(d1)
Disease <- as.data.frame(Labels)
h1 <- pheatmap.type(d1[,1:20],annRow=Disease,show_rownames=FALSE)
@ 

Please note that the provided data in the package is
sub-sampled for a quicker demonstration. For real applications, the expression of
thousands of genes should be provided in order to the co-expression network analysis to 
be appropriate. It is common to first perform differential expression analysis, sort all the genes based
on p-values, and use the top-third as the input \cite{zhang2013integrated}.
Analyzing such input with \pipa can take a few hours and may require 5-10 GB of memory. 
The following command runs Pigengene pipeline on the toy data:
%%
<<oneStep, echo=TRUE>>=
p1 <- one.step.pigengene(Data=d1,saveDir='pigengene', bnNum=0, verbose=1,
      seed=1, Labels=Labels, toCompact=FALSE, doHeat=FALSE)
@ 

Results and figures are saved in \Rcode{pigengene} folder under the current directory.  
For more advanced applications, the user is encouraged to analyze the data step-by-step
and customize  the individual functions such as \Rfunction{compute.pigenegene} and 
\Rfunction{make.decision.tree}.

In addition to the provided decision trees, the user can also take alternative approaches to perform 
classification, clustering, survival analysis, etc. {\em using eigengenes as robust biological
signatures (informative features)}. Eigengenes and other useful objects can be retrieved from the output.
For instance, \Rcode{c5treeRes} is a list containing the results of fitting decision trees to
the eigengenes.  As shown above, a couple of trees were fitted, one per a value for 
\Rcode{minPerLeaf}. The following command plots the tree corresponding to 34, i.e.,
it was fitted requiring the minimum number of samples per every leaf to be at least 34.
%%
<<tree, fig.width=5, fig.height=5, echo=TRUE>>=
plot(p1$c5treeRes$c5Trees[["34"]])
@ 
The tree corresponding to other values are saved in \Rcode{pigengene} folder.
Of note, is the \Robject{pigenegene} object that contains the matrix of inferred eigenegenes.
Each row corresponds to a sample, and each column represents an eigengene.
%%
<<pigengene, fig.width=5, fig.height=5, echo=TRUE>>=
dim(p1$pigengene$eigengenes)
p1 <- pheatmap.type(p1$pigengene$eigengenes,annRow=Disease,show_rownames=FALSE)
@ 

\subsection{Running the \pipa pipeline step by step}
If you are curious about the specific steps in the \pipa pipeline, or you need to run some steps
with different settings, you can follow the steps below. The results will be similar to the output
of the \Rcode{one.step.pigengene} function. The first step is quality control to make sure that the
matrices have row and column names, and do not include too many \Rcode{NA} values:

<<qc, echo=TRUE>>=
## QC:
checked <- check.pigengene.input(Data=d1, Labels=Labels)
DataI <- checked$Data
LabelsI <- checked$Labels
@

We oversample the data such that the number of cases in each condition is almost balanced.
<<balance, echo=TRUE>>=
wData <- balance(Data=DataI, Labels=LabelsI, verbose=1)$balanced
@

Weighted gene co-expression network analysis (WGCNA) \cite{langfelder2008wgcna} does not assume any
cutoff (i.e., hard threshold) on the correlation values. Instead, it raises the correlation values
to a power so that the correlation values that are close to zero diminish. This hyperparameter is
called $\beta$, and can be estimated using as follows:
<<beta, echo=TRUE>>=
betaI <- calculate.beta(RsquaredCut=0.8, Data=wData, verbose=1)$power
saveDir <- "steps" ## Results will be saved in this folder.
dir.create(saveDir)
@

Once we have an estimate for $\beta$, WGCNA can be done in one step using the following function to
identify gene modules (i.e., clusters):
<<wgcna, echo=TRUE>>=
## WGCNA
wgRes <- wgcna.one.step(Data=wData, seed=1, power=betaI,
      saveDir=saveDir, verbose=1)
@
The output of \Rcode{wgcna.one.step} is a list of objects including \Rcode{modules},
which is a numeric vector named with genes. Genes that map to the same numeric value
are considered to be in the same module. We use this information to compute an eigengene for every module. 
<<eigengenes, echo=TRUE>>=
## Eigengenes:
pigengene <- compute.pigengene(Data=DataI, Labels=LabelsI,
	     	saveFile=file.path(saveDir, 'pigengene.RData'),
                modules=wgRes$modules, verbose=1)
class(pigengene)
print(dim(pigengene$eigengenes)) ##This is the eigengenes matrix.
print(pigengene$eigengenes[1:3,1:4])
@
The number of columns in the eigengene matrix is equal to the number of modules, and the number of
rows is equal to the number of samples. 

Eigengenes can be used as robust biological signatures (i.e., features in machine learning terms) for
classification, clustering, exploratory analysis, etc. For example, we can use them as random
variables to fit a Bayesian network to data \cite{agrahari2018applications}.
<<bn, echo=TRUE>>=
## Learning the BN structure:
library(bnlearn)
learnt <- learn.bn(pigengene=pigengene, bnPath=file.path(saveDir, "bn"),
       	  	bnNum=10, ## In real applications, at least 100-1000.
		seed=1, verbose=1)
BN <- learnt$consensus1$BN
@
The \Rcode{learn.bn} function has many arguments. See the corresponding documentation and 
publication  \cite{agrahari2018applications} for technical details.
Because usually thousands of individual networks are needed, it is wise to train many models in
parallel on a cluster, which can be done with appropriate settings for the \Rcode{learn.bn} input
arguments. We can replot the consensus Bayesian network, which is already saved on disk,
using the \Rcode{draw.bn} function.
<<draw, echo=TRUE, eval=FALSE>>=
drawn <- draw.bn(BN)
## By construction, the Disease node can have no parents. 
@

The \Rcode{learn.bn} function tries to find the best structure for the optimum Bayesian network. The
conditional probability tables are yet to be inferred from the data.
<<bnParam, echo=TRUE>>=
## Fit the parameters of the Bayesian network:
fit <- bn.fit(x=BN, data=learnt$consensus1$Data, method="bayes", iss=10)
##where learnt$consensus1$Data is the discretized data matrix.

## The conditional probability table for one of the children of the Disease node:
selectedFeatures <- children("Disease", x=BN)
print(fit[[selectedFeatures[1]]])
@

The fitted Bayesian network can be used for predicting the labels (i.e., values of the Disease node).
<<bnPrediction, echo=TRUE>>=
l2 <- predict(object=fit, node="Disease", data=learnt$consensus1$Data, method="bayes-lw")
table(LabelsI, l2)
@

Eigengenes can also be used in predictive models simpler than a Bayesian network. For example, a
decision tree can be fitted using the \Rcode{make.decision.tree} function  \cite{foroushani2016large}.
<<trees, echo=TRUE>>=
## Decision trees:
treePath <- file.path(saveDir, 'C5Trees')
dir.create(path=treePath)
treeRes <- make.decision.tree(pigengene=pigengene, Data=DataI,
	       selectedFeatures=selectedFeatures, saveDir=treePath,
       	       verbose=1, toCompact=FALSE)
@
If \Rcode{selectedFeatures="All"}, the \Rcode{make.decision.tree} function automatically selects a
``minimal'' subset of eigengenes in order to prevent overfitting.


\subsection{Citation}
The methodology and an interesting application of \pipa on studying
hematological malignancies is presented in the following reference \cite{foroushani2016large}.
<<citation, results='asis', eval=TRUE>>=
citation("Pigengene")
@ 

\section{Session Information}
The output of \Rfunction{sessionInfo} on the system that compiled 
this document is as follows:

<<sessionInfo, results='asis', eval=TRUE>>=
toLatex(sessionInfo())
@

\bibliography{pigengene}

\end{document}