%\VignetteIndexEntry{CopywriteR}
%\VignetteDepends{CopywriteR}
%\VignetteKeywords{Copy number detection from targeted sequencing using off-target reads.}
%\VignettePackage{CopywriteR}
%\VignetteEngine{utils::Sweave}

\documentclass{article}

<<style-Sweave, eval=TRUE, echo=FALSE, results=tex>>=
BiocStyle::latex()
@

\newcommand{\exitem}[3]{%
  \item \texttt{\textbackslash#1\{#2\}} #3 \csname#1\endcsname{#2}.%
}

\title{\bf CopywriteR: DNA copy number detection from off-target sequence data.}
\author{Thomas Kuilman}

\begin{document}

\maketitle

\begin{center}
  Department of Molecular Oncology \\*
  Netherlands Cancer Institute \\*
  The Netherlands \\*
\end{center}

\begin{center}
  \email{t.kuilman@nki.nl} or \email{thomaskuilman@yahoo.com}
\end{center}

\tableofcontents

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Overview}

To extract copy number information from targeted sequencing data while
circumventing problems related to the use of on-target reads, we developed
the CopywriteR package. Unlike other currently available tools, CopywriteR is
based on off-target reads. We have shown that this approach has several
advantages relative to current approaches \cite{s13059-015-0617-1} and it
constitutes a viable alternative for copy number detection from targeted
sequencing data.

\section{CopywriteR workflow}

CopywriteR uses .bam files as an input. These files are processed in
several steps to allow copy number detection from off-target reads of targeted
sequencing efforts. These steps include:

\begin{itemize}
    \item Removing of low-quality and anomalous reads
    \item Peak calling (in reference when available, otherwise in sample itself)
    \item Discarding of reads in peak regions
    \item Counting reads on bins of pre-defined size
    \item Compensating for the difference in effective bin size upon discarding
     peaks in peak regions
    \item Correcting for GC-content and mappability; applying a blacklist filter
     for CNV regions
\end{itemize}

\section{Analysis workflow}

The full analysis of copy number data with CopywriteR includes three sequential
steps, using the \Rcode{preCopywriteR()}, \Rcode{CopywriteR()} and
\Rcode{plotCNA()} functions respectively.

In this analysis workflow, we will use CopywriteR to extract copy number
information from whole-exome sequening data from a murine small-cell lung
cancer. The complete dataset can be downloaded from the European Nucleotide
Archive (\url{http://www.ebi.ac.uk/ena/home}) using the accession number
PRJEB6954. In this vignette, only the sequence reads on chromosome 4 are used.
The .bam file, which has been pre-filtered to minimize disk usage, is contained
in the \Biocpkg{SCLCBam} package. Before starting the analysis, GC-content and
mappability helper files need to be created for the desired bin size and
reference genome. This is done using the \Rcode{preCopywriteR()} function.

\subsection{\Rcode{preCopywriteR()}}

CopywriteR uses binned read count data as a basis for copy number detection. We
have pre-assembled 1 kb bin GC-content and mappability information for the hg18,
hg19, hg38, mm9 and mm10 reference genomes in the \Biocpkg{CopyhelpeR} package.
The \Rcode{preCopywriteR()} function allows to create the necessary helper
files. These files can be created for any custom bin size that is a multiple of
1000 bp, and for the above-mentioned reference genomes. The helper files are
saved in a new folder that is names after the relevant reference genome, the bin
size, and the prefix (all separated by '\symbol{95}'). The helper files are
required to run CopywriteR. If previously created helper files for a specific
bin size / reference genome / prefix combination are available, the
{\Rcode{preCopywriteR()} step can be omitted.

In this example, CopywriteR will be applied to a bin size of 20 kb. We
recommend to use a bin size of 20 kb when trying to analyze whole-exome
sequencing data, while we would start analyzing at 50 kb resolution for targeted
sequencing on smaller gene panels. As the .bam file contained in the
\Biocpkg{SCLCBam} package is mapped to the mm10 reference genome, we run
{\Rcode{preCopywriteR()} as follows:

<<load.CopywriteR, print=FALSE, eval=TRUE>>=
library(CopywriteR)
@

<<get.root.folder, print=FALSE, eval=TRUE>>=
data.folder <- tools::file_path_as_absolute(file.path(getwd()))
@

<<preCopywriteR, print=FALSE, eval=TRUE>>=
preCopywriteR(output.folder = file.path(data.folder),
              bin.size = 20000,
              ref.genome = "mm10_4")
@

Please not that the "mm10\symbol{95}4" reference genome exclusively contains
helper files for chromosome 4, and therefore should only be used when
replicating the analysis in this vignette. The custom helper files are placed in
a folder that is named after the reference genome and the bin size (in this
example: mm10\symbol{95}4\symbol{95}20kb). This folder, on its turn, is placed
inside the folder specified by the output.folder argument. The bin.size argument
specifies the custom bin size (in bp), and the ref.genome argument can be either
"hg18", "hg19", "hg38", "mm9" or "mm10". The default chromosome notation is "1",
"2", ..., "X", "Y". The prefix argument is optional and can be used to add a
prefix to these names.

If you are interested in getting helper files for other genomes, please contact
me by the email addresses supplied at the beginning of this vignette.

<<list.dirs, print=FALSE, eval=TRUE>>=
list.dirs(path = file.path(data.folder), full.names = FALSE)[2]
@

<<list.dirs, print=FALSE, eval=TRUE>>=
list.files(path = file.path(data.folder, "mm10_4_20kb"), full.names = FALSE)
@

The data are contained in two GRanges objects (from the \Biocpkg{GenomicRanges}
package) of which one (blacklist.rda) contains regions of copy number variation:

<<show.blacklist, print=FALSE, eval=TRUE>>=
load(file = file.path(data.folder, "mm10_4_20kb", "blacklist.rda"))
blacklist.grange
@

The other helper file (GC\symbol{95}mappability.rda) contains GC-content and
mappability information for a particular bin size:

<<show.GC.mappa, print=FALSE, eval=TRUE>>=
load(file = file.path(data.folder, "mm10_4_20kb", "GC_mappability.rda"))
GC.mappa.grange[1001:1011]
@

\subsection{\Biocpkg{BiocParallel} for parallel computing}

CopywriteR fully supports paralel computing and is implemented in such a way
that every sample is processed on a single core. CopywriteR uses the
\Biocpkg{BiocParallel} for parallel computing. This package requires the user to
specify which parallel environment is used by creating an instance of
\Rcode{BiocParallelParam}. In the example case, we are using the \Rcode{snow}
environment and specify a \Rcode{SnowParam}:

<<create.BiocParallelParam, print=FALSE, eval=TRUE>>=
bp.param <- SnowParam(workers = 1, type = "SOCK")
bp.param
@

For using other instances of the \Rcode{BiocParallelParam} parameter, please
refer to the \Biocpkg{BiocParallel} package vignette. The bp.param parameter can
now be passed on to \Rcode{CopywriteR()} to allow parallel computation.

\subsection{\Rcode{CopywriteR()}}

The \Rcode{CopywriteR()} function allows to calculate binned read count data
based on the helper files created by \Rcode{preCopywriteR()}. CopywriteR uses a
peak calling algorithm to remove 'on-target' reads. For every sample that is to
be analyzed by CopywriteR, the user can specify on which sample the peak regions
should be identified. The argument that controls this is sample.control. This
\Rcode{matrix} or \Rcode{data.frame} should contain path names to 'samples' in
the first column, and their corresponding 'controls', in the second column. The
peaks called in a 'control' are used to remove reads from the corresponding
'sample'. Please note that any sample that is to be used by the downstream
plotCNA function (including those that are only to be used as a reference) needs
to be analyzed by the CopywriteR function.

In the example used here, no matched germline control is available, and peaks
are called on the sample itself. Therefore the paths specified in the samples
and controls columns of the sample.control \Rcode{data.frame} below are
identical.

<<CopywriteR, print=FALSE, eval=TRUE>>=
path <- SCLCBam::getPathBamFolder()
samples <- list.files(path = path, pattern = ".bam$", full.names = TRUE)
controls <- samples
sample.control <- data.frame(samples, controls)

CopywriteR(sample.control = sample.control,
           destination.folder = file.path(data.folder),
           reference.folder = file.path(data.folder, "mm10_4_20kb"),
           bp.param = bp.param)
@

The destination.folder argument determines in which folder the output folder is
going to be placed, and the reference.folder points to the location of the
custom bin size helper files. The keep.intermediary.files is an optional
argument that determines whether intermediary .bam, .bai and peak .bed files are
kept. The default value is FALSE to limit disk space usage; it can be set to
TRUE however for troubleshooting purposes. Finally, the capture.regions.file
argument is optional, and can be used to define the location of a capture
regions bed file. If this file is specified, statistics of the overlap of these
regions with the called peaks is provided.

CopywriteR will create a new folder named CNAprofiles in the directory specified
by destination.folder. This folder contains the following files:

<<CNAprofiles.folder.contents, print=FALSE, eval=TRUE>>=
cat(list.files(path = file.path(data.folder, "CNAprofiles")), sep = "\n")
@

The read\symbol{95}counts.txt file contains both uncompensated and compensated
read counts for every sample, as well as the corresponding fraction.of.bin
values. The 'fraction of bin' indicates what fraction of the bin is not on
peaks, and therefore effectively contributes to the read count. In our example,
the read\symbol{95}counts.txt has the following content:

<<read.counts.example, print=FALSE, eval=TRUE>>=
read.table(file = file.path(data.folder, "CNAprofiles", "read_counts.txt"),
           header = TRUE)[1001:1006, ]
@

The log2\symbol{95}read\symbol{95}counts.igv file can be opened in the IGV
browser (\url{http://www.broadinstitute.org/igv/}), and contains
log2-transformed, normalized (ratios of) read counts. These data can be used for
further downstream analysis, and are required for \Rcode{plotCNA()} to allow
plotting of the copy number profiles:

<<log2.read.counts.example, print=FALSE, eval=TRUE>>=
read.table(file = file.path(data.folder, "CNAprofiles",
                            "log2_read_counts.igv"), header = TRUE)[817:822, ]
@

The input.Rdata file contains a number of variables that are required to run
the last function of the CopywriteR package, \Rcode{plotCNA()}. The
CopywriteR.log file contains log information of the R commands that have been
used to perform the various subfunctions, and specifications of the input
material. Finally, the qc folder contains two types of quality control plots.

<<CNAprofiles.folder.contents, print=FALSE, eval=TRUE>>=
cat(list.files(path = file.path(data.folder, "CNAprofiles", "qc")), sep = "\n")
@

The fraction.of.bin files contain the empirical cumulative distribution
function for the 'fractions of bins'. The read.counts.compensated files
contain the plots and the loesses that are used for GC-content and
mappability corrections.

\subsection{\Rcode{plotCNA()}}

The \Rcode{plotCNA()} function allows segmentation of the copy number data using
DNAcopy \cite{Olshen:2004ge}, and subsequent plotting. We run the plotting
function as follows:

<<CNAprofiles.folder.contents, print=FALSE, eval=TRUE>>=
plotCNA(destination.folder = file.path(data.folder))
@

The plotting function saves the DNAcopy object containing the segmentation
values in the segment.Rdata file. In addition, it creates the folder 'plots':

<<CNAprofiles.folder.contents.2, print=FALSE, eval=TRUE>>=
cat(list.files(path = file.path(data.folder, "CNAprofiles")), sep = "\n")
@

For every sample a separate directory is created:

<<plots.folder.contents, print=FALSE, eval=TRUE>>=
cat(list.files(path = file.path(data.folder, "CNAprofiles", "plots")), sep = "\n")
@

The samples are plotted per chromosome, as well as in a genome-wide fashion.
We provide here the result for chromosome 4:

\begin{center}
\includegraphics[width=\textwidth]{CNAprofiles/plots/log2.T43_4.bam.vs.none/chrom_4}
\end{center}

In addition to the plots, the raw segmented values can be obtained from the
DNAcopy object 'segment.Rdata'.

We are interested in further improving CopywriteR, so if CopywriteR fails to
analyze your samples, please don't hesitate to contact me. In this case, please
provide the full command line output and the log-file.

\section{Troubleshooting}

We maintain a troubleshooting GitHub page where various issues are discussed
that have been raised by CopywriteR users. Please follow this link:
\url{https://github.com/PeeperLab/CopywriteR#troubleshooting}.

\newpage

\section{Session Information}

The version number of \R{} and packages loaded for generating the vignette were:
<<sessionInfo, results=tex, print=TRUE, eval=TRUE>>=
toLatex(sessionInfo())
@

\bibliography{CopywriteR}

\end{document}