---
title: "Identifying Copy Number Polymorphisms"
author: "Jacob Carey, Steven Cristiano, and Robert Scharpf"
date: \today
output: BiocStyle::pdf_document
bibliography: references.bib
vignette: >
  %\VignetteIndexEntry{Identifying Copy Number Polymorphisms}
  %\VignetteEngine{knitr::rmarkdown}
  %\usepackage[utf8]{inputenc} 
---

# Introduction

Identify consensus start and stop coordinates of a copy number
polymorphism

The collection of copy number variants (CNVs) identified in a study
can be encapulated in a GRangesList, where each element is a GRanges
of the CNVs identified for an individual.  (For a study with 1000
subjects, the GRangesList object would have length 1000 if each
individual had 1 or more CNVs.)  For regions in which CNVs occur in
more than 2 percent of study participants, the start and end
boundaries of the CNVs may differ because of biological differences in
the CNV size as well as due to technical noise of the assay and the
uncertainty of the breakpoints identified by a segmentation of the
genomic data.  Among subjects with a CNV called at a given locus, the
`consensusCNP` function identifies the largest region that is copy
number variant in half of these subjects.

# Simulating CNP data

Included in the CNPBayes package are objects of class `SnpArrayExperiment`
and `GRangesList`. We begin by loading the necessary libraries and data.

```{r prelim}
suppressMessages(library(CNPBayes))
suppressMessages(library(SummarizedExperiment))
se <- readRDS(system.file("extdata", "simulated_se.rds", package="CNPBayes"))
grl <- readRDS(system.file("extdata", "grl_deletions.rds", package="CNPBayes"))
```

The object `se` contains log R ratios and B Allele Frequencies, and the object
`grl` is a `GRangesList` of simulated deletions. 

# Finding CNPs

After reading this saved data, we visualize the CNVs.

```{r plot-cnvs}
cnv.region <- consensusCNP(grl[1], max.width=5e6)
i <- subjectHits(findOverlaps(cnv.region, rowRanges(se)))
xlim <- c(min(start(se)), max(end(se)))
par(las=1)
plot(0, xlim=xlim, ylim=c(0, 26), xlab="Mb", ylab="sample index", type="n",
     xaxt="n")
at <- pretty(xlim, n=10)
axis(1, at=at, labels=round(at/1e6, 1), cex.axis=0.8)
rect(start(grl), seq_along(grl)-0.2, end(grl), seq_along(grl)+0.2,
     col="gray", border="gray")
```

Before further analysis can be performed, the log R ratios in a CNV region must
be summarized to a one dimensional object [@cardin]. Within each CNP locus, 
log R ratios at each SNP are summarized by sample. Below are examples of using 
the median and the first principal component to create a one dimensional 
summary.

## CNP Boundaries with Median Summary
To summarize samples within a CNV locus by the median log R ratios, we define 
the largest region that spans 50 percent of the samples using the function
`consensusCNP`.  Using log R ratios of SNPs contained in this region, the
median is taken across samples.  Because the deletions in this example are 
large ($>$ 2Mb), we specify a large value for `max.width` to avoid filtering
these CNVs.

```{r median}
cnv.region <- consensusCNP(grl, max.width=5e6)
i <- subjectHits(findOverlaps(cnv.region, rowRanges(se)))
med.summary <- matrixStats::colMedians(assays(se)[["cn"]][i, ], na.rm=TRUE)
par(las=1)
plot(0, xlim=xlim, ylim=c(0, 26), xlab="Mb", ylab="sample index", type="n",
     xaxt="n")
at <- pretty(xlim, n=10)
axis(1, at=at, labels=round(at/1e6, 1), cex.axis=0.8)
rect(start(grl), seq_along(grl)-0.2, end(grl), seq_along(grl)+0.2,
     col="gray", border="gray")
abline(v=c(start(cnv.region), end(cnv.region)), lty=2)
```

## CNP Boundaries with PCA Summary

Another method for summarizing the log R ratios is by the first principal
component on the markers for the entire region [@cardin]. Disadvantages of this
scale are interpretion of the copy number of the mixture components the
potential for picking up batch effects rather than biological differences in
copy number between groups of samples. The primary advantage of a principal
component analysis is for non-germline events where the CNVs may be partially
overlapping and do not share the same boundaries. We find that for germline copy
number analyses, a median summary is adequate, retains an interpretable scale,
and is less susceptible to batch effects than principal components.

```{r pca}
cnv.region2 <- reduce(unlist(grl))
i.pc <- subjectHits(findOverlaps(cnv.region2, rowRanges(se)))
x <- assays(se)[["cn"]][i.pc, ]
nas <- rowSums(is.na(x))
na.index <- which(nas > 0)
x <- x[-na.index, , drop=FALSE]
pc.summary <- prcomp(t(x))$x[, 1]
meds.for.pc <- matrixStats::colMedians(x, na.rm=TRUE)
if(cor(pc.summary, meds.for.pc) < 0) pc.summary <- -1*pc.summary

par(las=1)
plot(0, xlim=xlim, ylim=c(0, 26), xlab="Mb", ylab="sample index", type="n",
     xaxt="n")
at <- pretty(xlim, n=10)
axis(1, at=at, labels=round(at/1e6, 1), cex.axis=0.8)
rect(start(grl), seq_along(grl)-0.2, end(grl), seq_along(grl)+0.2,
     col="gray", border="gray")
abline(v=c(start(cnv.region2), 
           end(cnv.region2)), lty=2)
```

Note that boundaries of created by principal component summary method
are wider than those in the above median summary method. In this
example, the boundaries are not significantly different, but in samples
with ragged starts and ends, the PCA method will provide greater
coverage.

## Summary Plots
Finally, we plot the one dimensional summaries.

```{r summary-plots}
par(mfrow=c(1,2), las=1)
plot(med.summary, main="median summary of\nconsensus CNP", cex.main=0.7, pch=20)
plot(pc.summary, main="PC summary of\nreduced CNV ranges", cex.main=0.7, pch=20)
```

## Filtering
Loci with no duplications or deletions can be identified visually as a uni-modal normal distribution. A Shapiro-Wilk test test of normality [@shapiro] can be used to filter unimodal regions.  To be conservative, we suggest retaining only those loci which have a *p* value $< 0.1$. In practice, we prefer Bayes factors or estimates of the marginal likelihood for model selection (see vignette `Overview.Rmd`). 
```{r shapiro}
shapiro.test(med.summary)
```
# References