---
title: "Implementation of Bayesian mixture models for copy number estimation"
author: "Jacob Carey, Steven Cristiano, and Robert Scharpf"
date: \today
output: BiocStyle::pdf_document
bibliography: references.bib
vignette: >
  %\VignetteIndexEntry{Implementation of Bayesian mixture models for copy number estimation}
  %\VignetteEngine{knitr::rmarkdown}
  %\usepackage[utf8]{inputenc} 
---

# Introduction

CNPBayes models multi-modal densities via a hierarchical Bayesian
Gaussian mixture model.  The major application of this model is the
estimation of copy number at copy number polymorphic loci (CNPs). Two
versions of the mixture model are implemented.  A *standard* model,
referred to as a *marginal* model, that has one mean and standard
deviation for each component, and a *batch* model with batch-specific
means and standard deviations. Approximation of the posterior is by
Markov Chain Monte Carlo (MCMC) written in C++ using the Rcpp package
[@Rcpp].

For an EM-implementation of Gaussian mixture models for CNPs, see the
Bioconductor package CNVtools [@Barnes2008].  A Bayesian extension of
this model by some of the same authors was developed to automate the
analysis of the Welcome Trust Case Control Consortium (WTCCC) genotype
data [@cardin] and implemented in the R package CNVCALL
(http://niallcardin.com/CNVCALL).

<!-- how cnpbayes extends / differs from these tools -->

```{r lib} 
library(CNPBayes)
```

# Implementation

CNPBayes uses several S4 classes to encapsulate key parameters of the
MCMC simulation, reducing the need for functions with a large number
of arguments and providing an explicit contract for the arguments
passed to the C++ back-end. The three core classes are

- `McmcParams`: parameters for the number of burnin simulations - the
  number of chains to initialize, the number of simulations after
  burnin, and how often simulated values are to be saved.

- `Hyperparameters`: a virtual class extended by `HyperparametersMarginal`
  and `HyperparametersBatch` for the marginal and batch mixture model
  implementations, respectively.

- `MixtureModel`: a virtual class with slots for data and
  hyperparameters, as well as a slot for each parameter. The class is
  extended by `MarginalModel` and `BatchModel` for the *marginal* and
  *batch* implementations, respectively. S4 dispatch on these classes
  is used to handle MCMC updates that are specific to the marginal or
  batch models.

# Approximating the posterior of a mixture

## MarginalModel
### McmcParams


The class `McmcParams` specifies several MCMC options, including the
number of saved iterations (`iter`), length of burn-in (`burnin`), and
the thinning interval (`thin`).  The following code indicates that we
will run a burnin of 100 iterations saving the first 1000 iterations
after burnin, with no iterations discarded for thinning.

```{r McmcParams} 
mp <- McmcParams(iter=1000, burnin=100, thin=1)
```

If we instead specified `thin=10`, we would need to run 10,000 MCMC
iterations in order to have a chain of length 1000 (every 10th
observation is saved).

### Hyperparameters

Hyperparameters for the mixture model are specified as an instance of class
`Hyperparameters`. Hyperparameters include:  

- `k` the number of components (or copy number). Defaults to 2.  

- `mu.0` and `tau2.0` priors for $\mu \sim$ N(`mu.0`, `tau2.0`), the 
  overall mean of components. Default to 0 and 100 respectively.  

- `eta.0` and `m2.0` priors for 
  $\tau^2 \sim$ Ga(shape=`eta.0`, rate=`m2.0`), the overall variance 
  across components. Default to 1 and 0.1 respectively for *marginal* models 
  and 1800 and 1/60 respectively for *batch* models.  

- `alpha` the prior mixture probabilities. Does not have to sum to 1. By
  default, a noninformative prior of equal mixtures is used.  

- `beta` prior for $\nu_0$. Defaults to 0.1.  

- `a` and `b` priors for $\sigma^2_0 \sim$ Ga(shape=`a`, rate=`b`), the 
  rate parameter for $\sigma^2$, the variance for each batch and component.  

```{r graph-marginal, echo=FALSE}
library(grid)
bayesianGrob <- function(x, r=unit(0.25, "inches")){
  tg <- textGrob(x)
  cg <- circleGrob(r=r)
  boxedText <- gTree(children=gList(tg, cg))
}

grobY <- bayesianGrob(expression(y[i]))
grobThetas <- bayesianGrob(expression(theta[h]))
grobSigma2 <- bayesianGrob(expression(sigma[h]^2))
grobH <- bayesianGrob(expression(z[i]))
grobNu0 <- bayesianGrob(expression(nu[0]))
grobSigma02 <- bayesianGrob(expression(sigma[0]^2))
grobPi <- bayesianGrob(expression(pi[h]))
grobMu <- bayesianGrob(expression(mu[h]))
grobTau2 <- bayesianGrob(expression(tau[h]^2))

h <- unit(0.25, "inches")
e <- unit(0.05, "inches")
d <- unit(0.025, "inches")

ar <- arrow(ends="last", length=unit(0.075, "inches"), type="closed")
grid.newpage()
y.x <- 0.5; y.y <- 0.1
h.x <- 0.1; h.y <- 0.3
theta.x <- 0.5; theta.y <- 0.3
sigma2.x <- 0.7; sigma2.y <- 0.3
pi.x <- 0.1; pi.y <- 0.5
mu.x <- 0.45; mu.y <- 0.5
tau2.x <- 0.55; tau2.y <- 0.5
nu0.x <- 0.65; nu0.y <- 0.5
s20.x <- 0.75; s20.y <- 0.5
grid.draw(editGrob(grobY, vp=viewport(y.x, y.y), gp=gpar(fill="gray")))
grid.draw(editGrob(grobY, vp=viewport(y.x, y.y), gp=gpar(fill="transparent")))
grid.draw(editGrob(grobH, vp=viewport(h.x, h.y)))

grid.draw(editGrob(grobThetas, vp=viewport(theta.x, theta.y)))
grid.draw(editGrob(grobSigma2, vp=viewport(sigma2.x, sigma2.y)))

## theta -> y
grid.move.to(unit(theta.x, "npc"), unit(theta.y, "npc") - h)
grid.line.to(unit(y.x, "npc"), unit(y.y, "npc") + h, arrow=ar,
             gp=gpar(fill="black"))
## sigma2 -> y
grid.move.to(unit(sigma2.x, "npc") - e, unit(sigma2.y, "npc") -h)
grid.line.to(unit(y.x, "npc") + h, unit(y.y, "npc") + h, arrow=ar,
             gp=gpar(fill="black"))

## h -> theta
grid.move.to(unit(h.x, "npc") + h, unit(h.y, "npc") - h)
grid.line.to(unit(y.x, "npc") - h, unit(y.y, "npc") + h,  arrow=ar,
             gp=gpar(fill="black"))

##pi
grid.draw(editGrob(grobPi, vp=viewport(pi.x, pi.y)))
## pi -> h
grid.move.to(x=unit(pi.x, "npc"), y=unit(pi.y, "npc") - h)
grid.line.to(x=unit(h.x, "npc"),
             y=unit(h.y, "npc")+h, arrow=ar,
             gp=gpar(fill="black"))


## mu_h
grid.draw(editGrob(grobMu, vp=viewport(mu.x, mu.y)))
## mu_h -> theta
grid.move.to(x=unit(mu.x, "npc")+e, y=unit(mu.y, "npc") - h)
grid.line.to(x=unit(theta.x, "npc")-e, y=unit(theta.y, "npc")+h, arrow=ar,
             gp=gpar(fill="black"))


## sigma2_h
grid.draw(editGrob(grobTau2, vp=viewport(tau2.x, tau2.y)))
## sigma2_h -> theta_h
grid.move.to(x=unit(tau2.x, "npc")-e, y=unit(tau2.y, "npc") - h)
grid.line.to(x=unit(theta.x, "npc")+e, y=unit(theta.y, "npc")+h, arrow=ar,
             gp=gpar(fill="black"))

grid.draw(editGrob(grobNu0, vp=viewport(nu0.x, nu0.y)))
## nu_0 -> sigma2_0
grid.move.to(x=unit(nu0.x, "npc")+e, y=unit(nu0.y, "npc") - h)
grid.line.to(x=unit(sigma2.x, "npc")-e, y=unit(sigma2.y, "npc")+h, arrow=ar,
             gp=gpar(fill="black"))

grid.draw(editGrob(grobSigma02, vp=viewport(s20.x, s20.y)))
## nu_0 -> sigma2_0
grid.move.to(x=unit(s20.x, "npc")-e, y=unit(s20.y, "npc") - h)
grid.line.to(x=unit(sigma2.x, "npc")+e, y=unit(sigma2.y, "npc")+h, arrow=ar,
             gp=gpar(fill="black"))
```

Constructing an instance of class `Hyperparameters` for a `MarginalModel` can
be performed as follows.
```{r Hyperparameters}
hypp <- Hyperparameters(type="marginal", k=3)
```

### simulateData

CNPBayes allows for the simulation of test data. The number of observations, 
mixture proportions, means for each component, and standard deviations for 
each component must be specified. 
```{r simulateData}
sim.data <- simulateData(N=2500, p=rep(1/3, 3),
                         theta=c(-1, 0, 1),
                         sds=rep(0.1, 3))
```

### posteriorSimulation

To simulate the posterior distribution, an instance of class `MarginalModel`
must first be constructed. Note that when initializing objects of this
class, the parameters `data`, `k` (number of *a priori* components), 
`hypp`, (object of class `Hyperparameters`) and `mcmc.params` (object of class
`McmcParams`) should be specified. Default values for `hypp` and `mcmc.params`
will be used if not specified. The `y` method should be used to retrieve the 
data from an object. After construction of a model, the `posteriorSimulation` 
method should be used.

```{r model}
model <- MarginalModel(data=y(sim.data), k=3,
                       hypp=hypp,
                       mcmc.params=mp)

model <- posteriorSimulation(model)
```

The results of a `DensityModel` call returns an object of class
`DensityModel`. Use the data and a `DensityModel` for plotting.

```{r plot}
plot(DensityModel(model), y(sim.data), 
     main="Marginal Model posterior")
```

## BatchModel

In general, the construction and posterior simulation of a `BatchModel` is 
similar to that of a `MarginalModel`. The `BatchModel` is hierarchical over 
the batches, and thus requires information about the batches. Instances of 
`McmcParams` are equivalent between `BatchModel` and `MarginalModel`. 
However, an object of class `Hyperparameters` is constructed with a `type` 
of "batch". Additionally, simulated batch data is created using 
`simulateBatchData` which requires `theta` and `sds` to be specifed as 
$B \times K$ matrices, for $K$ components and $B$ batches. `simulateBatchData` 
also requires a `batch` parameter, labelling the batch of each observation. 
Finally, `BatchModel`s are constructed using the function `BatchModel` which 
operates similarly to `MarginalModel` but requires a `batch` parameter, 
similar to `Hyperparameters`. `plot`s of `BatchModel`s include batch specific 
density estimates.

```{r batch}
## Create McmcParams for batch model
mp <- McmcParams(iter=2000, burnin=1000, thin=1)

## Create Hyperparameters for batch model
hypp <- Hyperparameters(type="batch", k=3)

## simulate batch data
k <- 3
nbatch <- 3
means <- matrix(c(-1.2, -1.0, -0.8,
                -0.2, 0, 0.2,
                0.8, 1, 1.2), nbatch, k, byrow=FALSE)
sds <- matrix(0.1, nbatch, k)
N <- 1500
sim.data <- simulateBatchData(N=N,
                              batch=rep(letters[1:3], length.out=N),
                              theta=means,
                              sds=sds,
                              p=c(1/5, 1/3, 1-1/3-1/5))

# create BatchModel and run posteriorSimulation
model <- BatchModel(data=y(sim.data), k=3,
                    batch=batch(sim.data),
                    hypp=hypp,
                    mcmc.params=mp)

model <- posteriorSimulation(model)
plot(DensityModel(model), y(sim.data),
     breaks=100)
```

While aspects of the `BatchModel` look somewhat Gaussian, there is some 
evidence of departure from a normal distribution. Due to this skewness, a
`MarginalModel` would not capture the distribution well. However, the
`BatchModel` does a good job of capturing the shape.

# K unknown

Generally, the number of components is not known *a priori*. To
estimate the number of components *K*, the `marginalLikelihood`
function is used. It is important to note that a list of models passed to this
function must have converged. For ease of use, one can specify `k` in the 
posteriorSimulation to use the same data, `Hyperparameters`, and `mcmc.params`,
but with a different number of components.

```{r marginalLikelihood}
mp <- McmcParams(iter=2e3, burnin=1e3)
model <- MarginalModel(data=y(sim.data), k=2, mcmc.params=mp)

mlist <- list(posteriorSimulation(model),
              posteriorSimulation(model, k=3))

x <- marginalLikelihood(mlist)
x
```

`marginalLikelihood` uses Chib's Estimator [@chib] to run a reduced Gibbs
to estimate the marginal likelihood of each component size. Additionally, 
`marginalLikelihood` can be used in a similar manner for `BatchModel`s.

Alternative methods for selection of *k* are also included. As an example, the
`bic` method can be used for calculating the Bayesian Information Criterion. 

```{r bic}
## simulate k=2 model
sim.data <- simulateData(N=2500, p=rep(1/3, 3),
                         theta=c(-1, 0, 1),
                         sds=rep(0.1, 3))
hypp1 <- Hyperparameters(k=2)
m1 <- MarginalModel(data=y(sim.data), k=2,
                    hypp=hypp1,
                    mcmc.params=mp)
m1 <- posteriorSimulation(m1)

## simulate k=3 model
hypp2 <- Hyperparameters(k=3)
m2 <- MarginalModel(data=y(sim.data), k=3,
                    hypp=hypp2,
                    mcmc.params=mp)
m2 <- posteriorSimulation(m2)

bic(m1)
bic(m2)
```

As the data is simulated from a three component distribution, `m2` should be
chosen. Using the Bayesian Information Criterion, the correct model is chosen
in this simple example.

Finally, in an overfit model (one with too many components), merging components
is supported. 

```{r merge}
mp <- McmcParams(iter=5000, burnin=1000, thin=1)
model <- MarginalModel(data=y(sim.data), k=4,
                       hypp=Hyperparameters(k=4),
                       mcmc.params=mp)
model <- posteriorSimulation(model)
dm <- DensityModel(model)
modes(dm)
dm_merged <- DensityModel(model, merge=TRUE)
k(dm_merged)

par(mfrow=c(1,2), las=1)
plot(dm, y(sim.data))
plot(dm_merged, y(sim.data))
```

This merging procedure also ensures that there is no underfitting. For example,
merging a model with three true components will yield a model with three
components.

```{r merge-under}
model <- MarginalModel(data=y(sim.data), k=3,
                       hypp=Hyperparameters(k=3),
                       mcmc.params=mp)
model <- posteriorSimulation(model)
dm <- DensityModel(model)
dm_merged <- DensityModel(model, merge=TRUE)
k(dm)
k(dm_merged)
```

# Batch Estimation

Differences in the one-dimensional summaries between groups of samples
can arise due to technical sources of variation referred to as *batch
effects*.  For example, date, temperature, machine calibration, and
lab technician are all potentially important technical sources of
variation [@batch-effect].  Batch effects can vary by locus, with some
loci more susceptible to technical factors that may affect
measurement.  The 96 well chemistry plate on which the samples were
processed is often a useful surrogate for batch effects since this
tends to capture samples that were processed (e.g., PCR amplification)
and scanned at approximately the same time. However, in large studies
involving potentially hundreds of chemistry plates, it is not
computationally feasible to fit fully Bayesian plate-specific mixture
models.  Because chemistry plates are often processed at similar times
and may be comparable in terms of the distribution of a statistic of
interest, it may be more useful (and computationally scalable) to
*batch* the chemistry plates.  Here, we implement a simple two-sample
Kolmogorov-Smirnov (KS) test for each pairwise combination of
chemistry plates in the function `collapseBatch`. The two-sample KS
test is applied recursively until no two combinations of grouped
plates can be combined.  We illustrate with a trivial example.


```{r collapseBatch}
k <- 3
nbatch <- 3
means <- matrix(c(-1.2, -1.0, -1.0,
                -0.2, 0, 0,
                0.8, 1, 1), nbatch, k, byrow=FALSE)
sds <- matrix(0.1, nbatch, k)
N <- 1500
sim.data <- simulateBatchData(N=N,
                              batch=rep(letters[1:3], length.out=N),
                              theta=means,
                              sds=sds,
                              p=c(1/5, 1/3, 1-1/3-1/5))

plot(sim.data)

model <- BatchModel(data=y(sim.data), k=3,
                    batch=batch(sim.data))

model <- BatchModel(data=y(sim.data), k=3,
                    batch=collapseBatch(model))

model <- posteriorSimulation(model)
plot(model)
```

As is shown, the model correctly collapses batches two and three, since they
are drawn from the same distribution.

# References