---
title: "The conumee vignette"
author: "Volker Hovestadt, Marc Zapatka (Division of Molecular Genetics, German Cancer Research Center, Heidelberg, Germany)"
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    toc: true
    number_sections: true
bibliography: conumee.bib
csl: biomed-central.csl
vignette: >
  %\VignetteIndexEntry{conumee}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

# Introduction

The `conumee` package provides tools for performing **co**py-**nu**mber variation (CNV) analysis using Illumina 450k or EPIC DNA **me**thylation arrays. Although the primary purpose of these arrays is the detection of genome-wide DNA methylation levels [@bibikovahigh2011], it can additionally be used to extract useful information about copy-number alterations, e.g. in clinical cancer samples. The approach was initially described in Sturm et al., 2012 [@sturmhotspot2012]. Other tools following a similar strategy include `ChAMP` [@feberusing2014] and `CopyNumber450k` (now deprecated).

CNV analysis is performed using a two-step approach. First, the combined intensity values of both 'methylated' and 'unmethylated' channel of each CpG are normalized using a set of normal controls (i.e. with a flat genome not showing any copy-number alterations). This step is required for correcting for probe and sample bias (e.g. caused by GC-content, type I/II differences or technical variability). Secondly, neighboring probes are combined in a hybrid approach, resulting in bins of a minimum size and a minimum number of probes. This step is required to reduce remaining technical variability and enable meaningful segmentation results.

The `conumee` package has been written for seamless integration with frequently-used Bioconductor packages. It is recommended that methylation array data is loaded using the `minfi` package [@aryeeminfi2014]. Segmentation is performed using the circular binary segmentation (CBS) algorithm implemented in the `DNAcopy` package [@olshencircular2004]. Processing time is typically less than a minute per sample.

Finally, the `conumee` package also provides a set of plotting methods to create publication-grade CNV plots of the whole genome, selected chromosomes or previously defined regions of interest. Writing of text-based output files for visualization in other tools (e.g. the IGV browser) or downstream processing (e.g. using GISTIC) is also supported. 


# Load data

The recommended input format are `Mset` objects generated from raw IDAT files using the `minfi` package. Depending on the analysis workflow, these objects might be already available. A good example 450k data-set for testing the `conumee` package can be downloaded from TCGA (971.6 MB):

~~https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/brca/cgcc/jhu-usc.edu/humanmethylation450/methylation/jhu-usc.edu_BRCA.HumanMethylation450.Level_1.8.8.0.tar.gz~~

UPDATE October 2016: The TCGA Data Portal is no longer operational. TCGA DNA methylation data can now be downloaded from the NCI GDC Legacy Archive: https://gdc-portal.nci.nih.gov/legacy-archive

This data-set comprises of 42 primary breast cancer samples, 2 breast cancer cell lines and 16 control samples. Make sure to unpack the archive. For the purpose of this vignette, only two illustrative examples are downloaded using the `read.450k.url` method.

```{r, echo = TRUE, collapse = TRUE, results = 'markup', warning = FALSE, message = FALSE}
library("minfi")
library("conumee")
#RGsetTCGA <- read.450k.exp(base = "~/conumee_analysis/jhu-usc.edu_BRCA.HumanMethylation450.Level_1.8.8.0")  # adjust path
RGsetTCGA <- read.450k.url()  # use default parameters for vignette examples
MsetTCGA <- preprocessIllumina(RGsetTCGA)
MsetTCGA
```


Alternatively, you can use the `minfiData` example data-set, comprising of 3 cancer samples and 3 normal controls.

```{r, echo = TRUE, collapse = TRUE, results = 'markup', warning = FALSE, message = FALSE}
library("minfiData")
data(MsetEx)
MsetEx
```


~~The `CopyNumber450k` package provides a large data-set of 52 control samples which can be used for normalization.~~

UPDATE April 2017: The `CopyNumber450k` and `CopyNumber450kData` packages are now depreacted. We will use the 3 normal controls from the `minfiData` package for now. Using a larger set of control samples would give better results.

```{r, echo = TRUE, collapse = TRUE, results = 'markup', warning = FALSE, message = FALSE}
# library("CopyNumber450kData")
# data(RGcontrolSetEx)
# MsetControls <- preprocessIllumina(RGcontrolSetEx)
```


If raw IDAT files are unavailable, data can also be supplied by importing text-based input files, e.g. as generated by GenomeStudio or downloaded from the GEO repository. More details are given later.


# Create annotation object

To begin with the CNV analysis, an annotation object, generated by the `CNV.create_anno` function, is required. This object holds information which only needs to be generated once, irrespective of the number of samples that are processed.

Arguments `bin_minprobes` and `bin_minsize` define the minimum number of probes per bin and the minimum bin size (default values that were optimized for 450k data are 15 and 50000, respectively). Argument `exclude_regions` defines regions to be excluded (e.g. polymorphic regions, an example is given in `data(exclude_regions)`). Please see `?CNV.create_anno` for more details.

Argument `detail_regions` defines regions to be examined in detail (e.g. dedicated detail plots or text output, see below). For example, detail regions can contain known oncogenes or tumor suppressor genes. These regions should either be supplied as a path to a BED file or as a GRanges object (an example is given in `data(detail_regions)`). The start and stop coordinates indicate the regions in which probes are analyzed in detail. The plotting range of detail plots are defined in a second set of start and stop coordinates in the GRanges object values (or the 7^th^ and 8^th^ column in the BED file).

```{r, echo = TRUE, collapse = TRUE, results = 'markup', warning = TRUE, message = TRUE}
data(exclude_regions)
data(detail_regions)
head(detail_regions, n = 2)

anno <- CNV.create_anno(array_type = "450k", exclude_regions = exclude_regions, detail_regions = detail_regions)

anno
```


# Combine intensity values

Intensity values of the 'methylated' and 'unmethylated' channel are combined using the `CNV.load` function. Input can be either an `Mset` object (recommended, see above), or a `data.frame` or `matrix` object containing intensities generated by GenomeStudio or downloaded from GEO (imported using e.g. `read.table`, should work without reformatting for most tables, please refer to `?CNV.load` for more details).

```{r, echo = TRUE, collapse = TRUE, results = 'markup', warning = FALSE, message = FALSE}
data(tcgaBRCA.sentrix2name)  # TCGA sample IDs are supplied with the conumee package
sampleNames(MsetTCGA) <- tcgaBRCA.sentrix2name[sampleNames(MsetTCGA)]
tcga.data <- CNV.load(MsetTCGA)
tcga.controls <- grep("-11A-", names(tcga.data))
names(tcga.data)
tcga.data

minfi.data <- CNV.load(MsetEx)
minfi.controls <- pData(MsetEx)$status == "normal"

# controls.data <- CNV.load(MsetControls)
```


# Perform CNV analysis

The main CNV analysis is separated into four parts:

* First, `CNV.fit` is used to normalize a single query sample to a set of control samples by multiple linear regression. For best results control samples of matched normal tissues that are profiled within the same experiment are used (which are likely to have the same technical bias). Essentially this regression analysis yields the linear combination of control samples that most closely fits the intensities of the query sample. Subsequently, the log2-ratio of probe intensities of the query sample versus the combination of control samples are calculated and used for further analysis. More details are given in the publication.
* Secondly, `CNV.bin` is used to combine probes within predefined genomic bins. Bins are previously generated using `CNV.create_anno`. Intensity values are shifted to minimize the median absolute deviation from all bins to zero to determine the copy-number neutral state.  
* Thirdly, `CNV.detail` is used to analyze detail regions in detail. This step is optional, but required if detail regions should be outputted in plots and text files. Detail regions are defined using `CNV.create_anno`.
* Finally, `CNV.segment` is used to segment the genome into regions of the same copy-number state. This function is a wrapper of the `CNA`, `segment`, `segments.summary` and `segments.p` functions of the `DNAcopy` package. Default parameters were optimized for 450k data, but can be changed. See `?CNV.segment` for more details.


```{r, echo = TRUE, collapse = TRUE, results = 'markup', warning = FALSE, message = FALSE}
x <- CNV.fit(minfi.data["GroupB_1"], minfi.data[minfi.controls], anno)
# y <- CNV.fit(tcga.data["TCGA-AR-A1AU-01A-11D-A12R-05"], controls.data, anno)  # CopyNumber450kData package is deprecated
# z <- CNV.fit(tcga.data["TCGA-AR-A1AY-01A-21D-A12R-05"], controls.data, anno)
y <- CNV.fit(tcga.data["TCGA-AR-A1AU-01A-11D-A12R-05"], minfi.data[minfi.controls], anno)  # minfiData control samples
z <- CNV.fit(tcga.data["TCGA-AR-A1AY-01A-21D-A12R-05"], minfi.data[minfi.controls], anno)

x <- CNV.bin(x)
x <- CNV.detail(x)
x <- CNV.segment(x)

y <- CNV.segment(CNV.detail(CNV.bin(y)))
z <- CNV.segment(CNV.detail(CNV.bin(z)))

x
```


# Output plots and text files

The `conumee` package supports two types of plots:

* The `CNV.genomeplot` method produces plots of the complete genome or of one or multiple chromosomes. Intensity values of each bin are plotted in colored dots. Segments are shown as blue lines. See `?CNV.genomeplot` for more details.
* The `CNV.detailplot` methods produces plots of individual detail regions, as defined in `CNV.create_anno`. Intensity values of individual probes are plotted in colored crosses. Bins are shown as blue lines. `CNV.detailplot_wrap` is a wrapper function that produces a single plot of all detail regions.

Text output is generated using the `CNV.write` method. Parameter `what` specifies if "probes", "bins", "detail" or "segments" should be returned. If parameter `file` is specified, the output is written into a file, otherwise a `data.frame` is returned. See `?CNV.write` for more details.


```{r, echo = TRUE, fig.show = 'hold', collapse = TRUE, results = 'markup', warning = FALSE, message = FALSE, fig.width = 14, fig.height = 7, out.width = "800px", fig.retina = 1}
#pdf("~/conumee_analysis/CNVplots.pdf", height = 9, width = 18)
CNV.genomeplot(x)

CNV.genomeplot(y)
CNV.genomeplot(y, chr = "chr6")

CNV.genomeplot(z)
CNV.genomeplot(z, chr = "chr10")
CNV.detailplot(z, name = "PTEN")
CNV.detailplot_wrap(z)
#dev.off()

head(CNV.write(y, what = "segments"), n = 5)

#CNV.write(y, what = "segments", file = "~/conumee_analysis/TCGA-AR-A1AU.CNVsegments.seg")  # adjust path
#CNV.write(y, what = "bins", file = "~/conumee_analysis/TCGA-AR-A1AU.CNVbins.igv")
#CNV.write(y, what = "detail", file = "~/conumee_analysis/TCGA-AR-A1AU.CNVdetail.txt")
#CNV.write(y, what = "probes", file = "~/conumee_analysis/TCGA-AR-A1AU.CNVprobes.igv")
```

# Contact & citation

For bug-reports, comments and feature requests please write to <conumee@hovestadt.bio>.

When using `conumee` in your work, please cite as:

```{r, echo = TRUE, collapse = TRUE, results = 'markup', warning = FALSE, message = FALSE}
citation("conumee")
```


# Session info

```{r, echo = TRUE, collapse = TRUE, results = 'markup', warning = FALSE, message = FALSE}
sessionInfo()
```


# References