\documentclass[12pt]{article}
%\documentclass[epsf]{siamltex}

\setlength{\topmargin}{0in}
\setlength{\topskip}{0in}
\setlength{\oddsidemargin}{0in}
\setlength{\evensidemargin}{0in}
\setlength{\headheight}{0in}
\setlength{\headsep}{0in}
%\setlength{\footheight}{0in}
%\setlength{\footskip}{0in}
\setlength{\textheight}{9in}
\setlength{\textwidth}{6.5in}
\setlength{\baselineskip}{20pt}
\setlength{\leftmargini}{1.5em}
\setlength{\leftmarginii}{1.5em}
\setlength{\leftmarginiii}{1.5em}
\setlength{\leftmarginiv}{1.5em}

\usepackage{epsfig,subfig,lscape}
\usepackage[lined,ruled]{algorithm2e}

\renewcommand\theequation{\thesection.\arabic{equation}}

\def\hb{\hfil\break}

\def\ib{$\bullet$}

%\VignetteIndexEntry{networkBMA}
 
\begin{document}
\pagestyle{plain}


%Goadrich&2004,KokDomingos2005,SinglaDomingos2005,

\nocite{Brem&2002,BremKruglyak2005,Yeung&2005,Yeung&2011,Lo&2012,YEASTRACT,Hoeting&1999,BMApackage,iterativeBMApackage,networkBMApackage,SCPD,YPD,ArrayExpress}

\begin{center}
{\bf Uncovering gene regulatory relationships from time-series
  expression data using {\tt networkBMA}}\\[5pt]
Ka Yee Yeung, Chris Fraley, Adrian E. Raftery, Wm. Chad Young\\
Departments of Microbiology (KYY) and Statistics (CF, AER, and WCY)\\
University of Washington
\end{center}

\bigskip

This document illustrates the use of the {\tt networkBMA} R package 
(Fraley et~al. 2012) to uncover regulatory relationships in yeast 
({\it Saccharomyces cerevisiae}) from
microarray data measuring time-dependent gene-expression levels in 95 
genotyped yeast segregants subjected to a drug (rapamycin) perturbation.

\section{Data}

The expression data for this vignette is provided in the {\tt networkBMA} 
package in the {\tt vignette} 
database, which consists of three R objects:
\begin{itemize}
\item[]{\tt timeSeries}:
       A 582 by 102 data frame in which the first two columns are factors 
       identifying the replicate and time (in minutes) after drug perturbation,
       and the remaining 100 columns are the expression measurements  for a
       subset of 100 genes from the yeast-rapamycin experiment described in 
       Yeung et~al. (2011).
       There are 582/6 = 97 replicates (the 95 segregants plus two 
       parental strains of the segregants), each with measurements at 
       6 time points.
The complete time series data is available from Array Express 
(Parkinson et~al. 2007) with accession number 
E-MTAB-412 \\
(http://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-412).
\item[] {\tt reg.known}:
 A 18 by 3 data frame giving known regulatory relationships among this
 subset of 100 genes. The first two columns
        give the regulator and target gene, respectively, 
        while the third encodes 
        the source of the regulatory information ({\tt `YPD'}
         for Yeast Proteome Database (Hodges et~al. 1999) and {\tt `SCPD'} for
          The Promoter Database of \emph{Saccharomyces cerevisiae}
            (Zhu and Zhang 1999).
        In this example, we constrained {\tt reg.known} to high-confidence 
        experimental results obtained from biochemical 
       (non-high-throughput) experiments.
\item[] {\tt reg.prob}:
A 100 by 100 matrix, giving probability estimates for regulatory 
relationships in which the $(i,j)$ entry 
gives the estimated probability that gene $i$ regulates gene $j$.
These were computed using the supervised framework integrating multiple 
data sources of Lo et~al. (2012).
\item[] {\tt referencePairs}:
 A 2-column data frame giving 287 regulator-gene pairs 
among the selected set of 100 genes reported from the literature.
In this yeast example, the reference network was extracted from the documented
evidence from the {\tt YEASTRACT} database (Teixeira et~al. 2006), which
includes curated regulatory relationships from the literature 
inferred from high-throughput experiments.
\item[] {\tt brem.data}: An 85 by 111 
subset of the data used for network inference in
yeast (Brem et~al. 2002, Brem and Kruglyak 2005).
The rows correspond to genes and the columns to experiments.
Provided courtesy of Rachel Brem.
\end{itemize}
<<vignette.data>>= 
library(networkBMA)
data(vignette)

dim(timeSeries)
dim(reg.prob)
dim(brem.data)
reg.known
@

\section{Network Modeling}
Given the yeast expression data from the Rapamycin experiments, the
{\tt networkBMA} function can be invoked to estimate the probabilities
of regulatory relationships using ScanBMA, fastBMA
or iterative Bayesian Model Averaging (Yeung et~al. 2005, 2011). The
ScanBMA algorithm, as shown in Algorithm 1, is preferred when
available. The parameter ''prior.prob'' indicates the prior
probabilities of regulatory relationships.  If the parameter
''prior.prob'' is set to a single positive fraction, it represents the
probability of a regulator-gene pair in the network ({\em i.e.} the
expected network density as defined in (Lo et~al. 2012)).  The
parameter ''prior.prob'' can also be set to a matrix in which the
(i,j) entry is the estimated prior probability that gene i regulates
gene j.  The default value of ''prior.prob'' is NULL, which implies
that no prior information will be used in modeling the network.
<<yeastScanBMA>>=
edges.ScanBMA <- networkBMA(data = timeSeries[,-(1:2)], 
                    nTimePoints = length(unique(timeSeries$time)), 
                    prior.prob = reg.prob,
                    nvar = 50,
                    ordering = "bic1+prior", diff100 = TRUE, diff0 = TRUE)
edges.ScanBMA[1:9,]
@ 
For each gene $g$, the observed gene expression of genes at time $t-1$ serve 
as linear predictors for modeling the observed expression of gene $g$ at time 
$t$. BMA modeling results in a weighted average of models consisting of 
relatively small numbers of predictors. The probability of gene $h$ being a 
linear predictor in the model for gene $g$ is taken as the probability that 
gene $h$ regulates gene $g$ in the network.

%% KY edits
There are options for ordering the variables (parameter ''ordering") and
specifying the number of ordered variables (parameter ''nvar") to be included in the
modeling.  In both algorithms ScanBMA and iBMA, all the candidate
variables (genes) are initially ranked using the method specified in
''ordering", and the top ''nvar" such variables will be used as input
in the BMA regression step.  Note that if ScanBMA is used, the
parameter ''ordering" will have no effect in the BMA regression step.

%% Kaiyuan Edits
Fast BMA is another algorithm option. It is a similar algorithm with 
Scan BMA but all coded in C++ at the backbends. It also used new optimize 
algorithm and abandoned some options in Scan BMA which would not help 
efficiently increase algorithm performance. As a result, several parameters 
will be ignored if fast BMA is used. They are "ordering", "maxreg", "diff0" 
and "diff100". And flags are controlled by fastBMAcontrol list. The optimize 
is either a positive integer indicates the accuracy of the optimize algorithm 
or 0 if no optimize used.

Differentiation can also be performed on edges returned with
0\% or 100\% posterior probability. To include known regulatory
relationships, the iBMA algorithm must be used. This can be done as
shown below, but the results shown subsequently will use the call with
ScanBMA as the algorithm, as above.

<<yeastiBMA>>=
edges.iBMA <- networkBMA(data = timeSeries[,-(1:2)], 
                          nTimePoints = length(unique(timeSeries$time)), 
                          prior.prob = reg.prob, known = reg.known,
                          nvar = 50, control = iBMAcontrolLM(),
                          ordering = "bic1+prior", diff100 = FALSE,
                          diff0 = FALSE)
edges.iBMA[1:9,]
@
%Kaiyuan Edit
<<yeastfastBMA>>=
edges.fastBMA <- networkBMA(data = timeSeries[,-(1:2)],
                            nTimePoints = length(unique(timeSeries$time)),
							prior.prob = reg.prob,
							control = fastBMAcontrol(edgeMin = 0.01,
							fastgCtrl = fastgControl(optimize = 4)))
edges.fastBMA[1:9,]
@

\begin{algorithm}
  \DontPrintSemicolon
  Initialize $\mathcal{M}_{keep}, \mathcal{M}_{next} = \{\}$\;
  Initialize $\mathcal{M}_{active} = \{null~model\}$, $bestScore$ = 0\;
  \While{$\mathcal{M}_{active}$ not empty} {
    \For{model $m_{new}$ in \textnormal{NeighborsOf}($\mathcal{M}_{active}$)} {
      $mScore$ = EvaluateModelScore($m_{new}$)\;
      \If{$mScore$ in \textnormal{OccamsWindow}($bestScore$)} {
        add $m_{new}$ to $\mathcal{M}_{next}$\;
        $bestScore$ = \textnormal{BestModelScore}($bestScore$, $mScore$)\;
      }
    }
    Trim models from $\mathcal{M}_{keep}$ according to $bestScore$\;
    Add good models from $\mathcal{M}_{active}$ to
    $\mathcal{M}_{keep}$\;
    $\mathcal{M}_{active} =$ good models from $\mathcal{M}_{next}$\;
  }
  return $\mathcal{M}_{keep}$\;
  \caption{ScanBMA}
\end{algorithm}

\section{Assessment of Network Models}

Although, except for synthetic data,  the true underlying network is unknown,
the results can be assessed 
using a set of regulator-target gene network edges reported in the literature.
The comparison is done as follows:
\begin{itemize}
\item Let $E$ be the set of  regulator-target
gene edges resulting from {\tt networkBMA}, possibly reduced using a
probability  threshold. In the context of the example in Section 2, 
$E$ corresponds to the set of edges represented in the object {\tt edges.ScanBMA}.
\item Let $K$ be the set of known regulator-target gene
edges hardcoded in the modeling. 
In the example in Section 2, $K$ corresponds to {\tt reg.known}.
\item Let $L$ be the set regulator-target gene edges reported in the 
literature. In the example in Section 2, $L$ corresponds to 
{\tt referencePairs}.
\item Let $E \backslash K$ and $L \backslash K$ be the set of pairs in $E$ 
and $L$, respectively, with any hardcoded edges removed. In the example 
of Section 2, $E$ represented by {\tt edges.ScanBMA} contains 483 pairs, 
and $L$ represented by {\tt referencePairs} contains 287 pairs.
Both $E$ and $L$ include all 18 of the known hardcoded edges $K$ 
represented by {\tt reg.known}.
Hence $E \backslash K$ contains 465 pairs,
and $L \backslash K$ contains 269 pairs.
\item Let $U$ be the set of all directed 
pairs $r$-$g$ such that $r$ is a regulator in
$L \backslash K$ and $g$ is a target gene in $L \backslash K$.
For the example of Section 2, $L \backslash K$ has 11 unique regulator
genes and 99 unique target genes. So there are 11 $\times$ 99 or 1089
pairs in $U$. Assume further that the linked pairs in $U$ are precisely
those pairs in $L \backslash K$, 
and that all other pairs are unlinked.
\item Let $U \backslash K$ be the set of pairs in $U$ with any hardcoded eges
removed (hardcoded edges may resurface in the unlinked pairs). 
For the example of Section 2, 17 of the 18 pairs in $K$ occur in
$U$, so there are 1089 - 17 = 1072 edges in $U \backslash K$.
\end{itemize}
The assessment is done using the contingency table of 
$(E \backslash K) \cap (U \backslash K)$
relative to $U \backslash K$. For the example of Section 2, the assessment 
would be done with the 57 of the 465
pairs in $E \backslash K$ that also belong to $U \backslash K$.

A function called {\tt contabs.netBMA} is provided to produce contingency
tables from a reference network according the procedure described above.
Here we compare the edges produced in Section 2 by {\tt networkBMA} modeling 
on the  yeast data with the reference network {\tt referencePairs} made up of
results reported in the literature:
<<contingency.tables>>=
ctables <- contabs.netwBMA( edges.ScanBMA, referencePairs, reg.known, 
                              thresh=c(.5,.75,.9))
ctables
@
Another function called `{\tt contabs}' is provided 
for computing contingency tables when the true underlying network is known.
The {\tt scores} function can be used to obtain common assessment statistics
from the contingency tables, including sensitivity, specificity, precision,
recall, and false discovery rate among other measures.
<<scores>>=
scores( ctables, what = c("FDR", "precision", "recall"))
@
Areas under the ROC and Precision-Recall curves covered by contingency
tables can also be estimated using functions {\tt roc} and {\tt prc},
with the option to plot the associated curves. 
%Precision-Recall curves,
%orginally developed in the context of information retrieval,
%have been considered a better assessment measure than ROC curves when
%there is considerable disparity in the sizes of the populations in question 
%(Goadrich et~al. 2004, Kok and Domingos 2004, Singla and Domingos 2005).
%Gene regulatory networks are sparse, so the number of gene pairs that are not
%edges in the network is much greater than those that are.
The following gives the ROC and Precision-Recall curvers 
associated with 
the default contingency tables, in which the
threshholds are all values for posterior probabilities that
appear in {\tt edges.ScanBMA}.
<<rocANDprc>>=
roc( contabs.netwBMA( edges.ScanBMA, referencePairs), plotit = TRUE)
title("ROC")

prc( contabs.netwBMA( edges.ScanBMA, referencePairs), plotit = TRUE)
title("Precision-Recall")
@

The resulting plots are shown in Figure \ref{fig:rocANDprc}.
\begin{figure}
\begin{center}
\begin{tabular}{ccc}
\includegraphics[width=.45\textwidth]{roc-update.pdf}&&
\includegraphics[width=.45\textwidth]{prc-update.pdf}
\end{tabular}
\end{center}
\caption{ROC and Precision-Recall curve sectors for a {\tt networkBMA}
         model of the yeast-rapamycin test data.
         The black lines delineate the estimated curves.
         The vertical red lines delineate the range of horizontal values
         covered by the contingency tables.  The dotted black lines
         are linear interpolants
         outside this range. The diagonal blue line on the
         ROC plot indicates the line betwween (0,0) and (1,1).
}
\label{fig:rocANDprc}
\end{figure}
The output components are as follows:
\begin{itemize}
\item {\tt area}: The estimated area under the curve for the horizontal
      sector ranging from 0 to 1. This should be used with caution 
      when the sector in which the data falls is small.
\item {\tt sector}: The estimated area under the horizontal sector covered
     by the contingency tables. 
\item {\tt width}: The width of the horizontal sector covered by the
      contingency tables.
\end{itemize}

\section{Linear Modeling for Static Gene Expression Data}

{\tt networkBMA} relies on sparse linear modeling via iterative
Bayesian model averaging (BMA).  BMA addresses uncertainty in model
selection, and builds a weighted--average model from plausible models.
The resulting model has better overall predictive ability than
constituent models, and tends to use few variables from among a larger
set.  BMA has been iteratively extended to data with more variables
that observations (Yeung at~al. 2005, 2009, 2011).  The {\tt
  networkBMA} package functions, {\tt ScanBMA} and {\tt iterateBMAlm},
for linear modeling via iterative BMA. We illustrate their use on a
static gene expression dataset (without any time points), {\tt
  brem.data}, to infer the regulators of a particular gene by
regressing it on the expression levels of the other genes.  Functions
{\tt ScanBMA} and {\tt iterateBMAlm} can be applied to each gene so as
to infer all edges in the network. For one gene, the procedure is as
follows:

<<ScanBMA>>=
gene <- "YNL037C"
variables <- which(rownames(brem.data) != gene)

control <- ScanBMAcontrol(OR = 20, useg = TRUE,
                          gCtrl = gControl(optimize = FALSE, g0 = 20))

ScanBMAmodel.YNL037C <- ScanBMA( x = t(brem.data[variables,]),
                  y = unlist(brem.data[gene,]), control = control)
@

Function {\tt ScanBMAcontrol} facilitates input of BMA control
parameters, including {\tt useg} for indicating whether to use
Zellner's \textit{g}-prior or BIC for model likelihood approximation
and {\tt OR} for defining the width of `Occam's window' for model
exclusion. {\tt gCtrl} allows specification of parameters related to
the use of Zellner's \textit{g}-prior, including whether to use a
static \textit{g} or optimize \textit{g} using an EM algorithm.  See
the R help documentation for {\tt ScanBMAcontrol} and {\tt gControl}
for detailed description of these parameters, and Hoeting
et~al. (1999) for a tutorial on the underlying BMA paradigm.  The
estimated posterior probabilities (in percent) for genes that regulate
{\tt YBL103C} can be seen as follows:

<<probne0>>=
ScanBMAmodel.YNL037C$probne0[ScanBMAmodel.YNL037C$probne0 > 0]
@

\section{Acknowledgements}
We would like to thank Dr. Roger Bumgarner for helpful discussion.

{\it Funding}: AER, KYY and WCY are supported by NIH grant 5R01GM084163. KYY
is also supported by 3R01GM084163-05S1. AER is also supported by NIH
grants R01 HD054511 and R01 HD070936, and by Science Foundation
Ireland ETS Walton visitor award 11/W.1/I207.

\bibliographystyle{plain}
\bibliography{networkBMA}

\end{document}
         For Precision-Recall,  the filled dots correspond to 
         the distinct ({\tt recall}, {\tt precision}) pairs derived from
         the contingency tables.