---
title: "Overview of CNPBayes package"
author: "Jacob Carey, Steven Cristiano, and Robert Scharpf"
date: \today
output: BiocStyle::pdf_document
bibliography: references.bib
vignette: >
  %\VignetteIndexEntry{Overview of CNPBayes package}
  %\VignetteEngine{knitr::rmarkdown}
  %\usepackage[utf8]{inputenc} 
---

# Introduction

CNPBayes models multi-modal densities via a hierarchical Bayesian
Gaussian mixture model.  The major application of this model is the
estimation of copy number at copy number polymorphic loci (CNPs). Two
versions of the mixture model are implemented.  A *standard* model,
referred to as a *marginal* model, that has one mean and standard
deviation for each component, and a *batch* model with batch-specific
means and standard deviations. Approximation of the posterior is by
Markov Chain Monte Carlo (MCMC) written in C++ using the Rcpp package
[@Rcpp].  

For an EM-implementation of Gaussian mixture models for CNPs, see the
Bioconductor package CNVtools [@Barnes2008].  A Bayesian extension of
this model by some of the same authors was developed to automate the
analysis of the Welcome Trust Case Control Consortium (WTCCC) genotype
data [@cardin] and implemented in the R package CNVCALL
(http://niallcardin.com/CNVCALL).

This vignette provides a concise workflow for fitting mixture models
in large array-based genome-wide association studies.  We refer the
reader to other vignettes included with this package for details
regarding implementation.

```{r lib} 
library(CNPBayes)
library(GenomicRanges)
```

# Workflow

## Delineate CNPs for each ancestry group

Provided in the `CNPBayes` package is example `SnpArrayExperiment` and `GRangesList` data.

```{r find_cnps}
se <- readRDS(system.file("extdata", "simulated_se.rds", package="CNPBayes"))
grl <- readRDS(system.file("extdata", "grl_deletions.rds", package="CNPBayes"))
```

Using this data, we identify CNP loci and summarize within sample and locus by median.

```{r summary, message=FALSE}
cnv.region <- consensusCNP(grl, max.width=5e6)
i <- subjectHits(findOverlaps(cnv.region, rowRanges(se)))
med.summary <- matrixStats::colMedians(assays(se)[["cn"]][i, ], na.rm=TRUE)
```

See [Identifying Copy Number Polymorphisms](FindCNPs.pdf) for instructions on finding CNPs with a `SnpArrayExperiment` and `GRangesList`. 

A `MixtureModel` is constructed using this summarized data for a CNP locus with a call to `MarginalModel` or `BatchModel`, depending on whether the simulation should be *marginal* across batch effect, or *hierarchical*. The batch of a subject is determined by the chemistry plate and is specified using the `batch` parameter in the `BatchModel`.

```{r model_construction}
model.marginal <- MarginalModel(data=med.summary)
model.batch <- BatchModel(data=med.summary, 
                          batch=c(rep(1, 12), rep(2, 23)))
```

For details about model construction and other optional parameters, see [Bayesian mixture models for copy number estimation](Implementation.pdf).

## Fitting mixture models at each CNP

Independently for each CNP, we group the samples into batches using
the `collapseBatch` function and create a new model.

```{r collapse}
model.batch <- BatchModel(data=med.summary,
                          batch=collapseBatch(model.batch))
```

For array-based estimates and germline genomes, there are typically
between 1 and 4 copy number states at any given CNP.  Since we do not
know the number of components a priori, we fit a model for each k.
In addition, we fit models with and without a term for batch.

The `posteriorSimulation` function takes a constructed `MixtureModel` and `k`, a vector of the number of components for which to fit models. The posterior simulations of each model is held in two lists - one for `MarginalModel`'s and one for `BatchModel`'s.

```{r posteriorSimulation}
set.seed(1337)
mlist.marginal <- posteriorSimulation(model.marginal, k=1:4)
mlist.batch <- posteriorSimulation(model.batch, k=1:4)
```

## Selecting a model

Marginal likelihood is estimated for each of the 8 models and the
Bayes' factors is used to select one of these models for further
study.

```{r likelihood}
# marginal likelihood of each model
marginal.lik <- marginalLikelihood(mlist.marginal)
batch.lik <- marginalLikelihood(mlist.batch)

# bayes factor for comparing top two models
logBayesFactor(marginal.lik)
logBayesFactor(batch.lik)
```

## Extracting MAP estimates and posterior probabilities

For the selected model, we can view the *maximum a posteriori* estimates for component.
```{r map}
selected.model <- mlist.marginal[[which.max(marginal.lik)]]
map(selected.model)
```

# References