%\VignetteIndexEntry{Introduction to QDNAseq}
\documentclass{article}

<<style, eval=TRUE, echo=FALSE, results=tex>>=
BiocStyle::latex()
@

\begin{document}

\title{Introduction to QDNAseq}
\author{Ilari Scheinin}
\maketitle
\tableofcontents

\section{Running QDNAseq}

This is a short tutorial on how to use the \Biocpkg{QDNAseq} package. It covers
an example run using the included data set of chromosomes 7--10 of a low grade
glioma (LGG) sample. The first step is naturally to load the package.

<<>>=
library(QDNAseq)
@

<<echo=FALSE>>=
## Send verbose output to stdout instead of stderr
options("QDNAseq::verbose"=NA)
options(width=40)
@

\subsection{Bin annotations}

Then we need to obtain bin annotations. These are available pre-calculated for
genome build hg19 and bin sizes 1, 5, 10, 15, 30, 50, 100, 500, and 1000 kbp.
They are available in the \Biocpkg{QDNAseq.hg19} package, which has to be
installed from Bioconductor separately. With that package installed, the bin
annotations can be acquired as:

\begin{Schunk}
\begin{Sinput}
> bins <- getBinAnnotations(binSize=15)
Loaded bin annotations for genome 'hg19', bin size 15 kbp, and experiment
type 'SR50' from annotation package QDNAseq.hg19 v1.14.0
> bins
QDNAseq bin annotations for Hsapiens, build hg19.
Created by Ilari Scheinin with QDNAseq 0.7.5, 2014-02-06 12:48:04.
An object of class 'AnnotatedDataFrame'
  rowNames: 1:1-15000 1:15001-30000 ... Y:59370001-59373566 (206391 total)
  varLabels: chromosome start ... use (9 total)
  varMetadata: labelDescription
\end{Sinput}
\end{Schunk}

If you are working with another genome build (or another species), see the
section on generating the bin annotations.

\subsection{Processing BAM files}

The next step is to load the sequencing data from BAM files. This can be done,
for example, with one of the commands below.

<<eval=FALSE>>=
readCounts <- binReadCounts(bins)
# all files ending in .bam from the current working directory
# or
readCounts <- binReadCounts(bins, bamfiles="tumor.bam")
# file 'tumor.bam' from the current working directory
# or
readCounts <- binReadCounts(bins, path="tumors")
# all files ending in .bam from the subdirectory 'tumors'
@

This will return an object of class \Rclass{QDNAseqReadCounts}. If the same BAM
files will be used as input in future \R{} sessions, option \Rcode{cache=TRUE}
can be used to cache intermediate files, which will speed up future analyses.
Caching is done with package \CRANpkg{R.cache}.

For large BAM files it is advisable to use the \Rcode{chunkSize} parameter to
control memory usage. A non-\Rcode{NULL}, non-numeric value will use the length
of the longest chromosome, effectively chunking by chromosome. A numeric value
will use that many reads at a time. Note that total peak memory usage is
controlled both by the chunk size and the number of parallel workers. See
\autoref{sec:parallel}.
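For instance, when reading a large set of real BAM files, caching and chunked
processing can be combined. The following sketch is not evaluated here; the
directory name is the same hypothetical 'tumors' directory used above, and the
chunk size of ten million reads is only an illustration:

<<eval=FALSE>>=
# read counts in chunks of 1e7 reads at a time, caching intermediate results
readCounts <- binReadCounts(bins, path="tumors", cache=TRUE, chunkSize=1e7)
@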
For the purpose of this tutorial, we load an example data set of chromosomes
7--10 of low grade glioma sample LGG150.

<<>>=
data(LGG150)
readCounts <- LGG150
readCounts
@

Plot a raw copy number profile (read counts across the genome), and highlight
bins that will be removed with default filtering (Figure~\ref{fig:rawprofile}).

<<echo=FALSE, results=hide>>=
png("rawprofile.png")
@
<<>>=
plot(readCounts, logTransform=FALSE, ylim=c(-50, 200))
highlightFilters(readCounts, logTransform=FALSE,
    residual=TRUE, blacklist=TRUE)
@
<<echo=FALSE, results=hide>>=
dev.off()
@

\begin{figure}[h]
\centering
\caption{Read counts per bin.
Highlighted with red are bins that will be filtered out.}
\label{fig:rawprofile}
\includegraphics{rawprofile}
\end{figure}

Apply filters and plot median read counts as a function of GC content and
mappability (Figure~\ref{fig:isobar}). As the example data set only contains a
subset of the chromosomes, the distribution looks slightly less smooth than
expected for the entire genome.

<<>>=
readCountsFiltered <- applyFilters(readCounts, residual=TRUE, blacklist=TRUE)
@
<<echo=FALSE, results=hide>>=
png("isobar.png")
@
<<>>=
isobarPlot(readCountsFiltered)
@
<<echo=FALSE, results=hide>>=
dev.off()
@

\begin{figure}[h]
\centering
\caption{Median read counts per bin shown as a function of GC content and
mappability.}
\label{fig:isobar}
\includegraphics{isobar}
\end{figure}

Estimate the correction for GC content and mappability, and make a plot for
the relationship between the observed standard deviation in the data and its
read depth (Figure~\ref{fig:noise}). The theoretical expectation is a linear
relationship, which is shown in the plot with a black line. Samples with
low-quality DNA will be noisier than expected and appear further above the
line than good-quality samples.

<<>>=
readCountsFiltered <- estimateCorrection(readCountsFiltered)
@
<<echo=FALSE, results=hide>>=
png("noise.png")
@
<<>>=
noisePlot(readCountsFiltered)
@
<<echo=FALSE, results=hide>>=
dev.off()
@

\begin{figure}[h]
\centering
\caption{The relationship between sequence depth and noise.}
\label{fig:noise}
\includegraphics{noise}
\end{figure}

Next, we apply the correction for GC content and mappability. This will return
a \Rclass{QDNAseqCopyNumbers} object, which we then normalize, smooth
outliers, and plot the copy number profile (Figure~\ref{fig:profile}).

<<>>=
copyNumbers <- correctBins(readCountsFiltered)
copyNumbers
copyNumbersNormalized <- normalizeBins(copyNumbers)
copyNumbersSmooth <- smoothOutlierBins(copyNumbersNormalized)
@
<<echo=FALSE, results=hide>>=
png("profile.png")
@
<<>>=
plot(copyNumbersSmooth)
@
<<echo=FALSE, results=hide>>=
dev.off()
@

\begin{figure}[h]
\centering
\caption{Copy number profile after correcting for GC content and mappability.}
\label{fig:profile}
\includegraphics{profile}
\end{figure}

The data is now ready to be analyzed with a downstream package of choice. For
analysis with an external program or for visualizations in \software{IGV}, the
data can be exported to a file.

<<eval=FALSE>>=
exportBins(copyNumbersSmooth, file="LGG150.txt")
exportBins(copyNumbersSmooth, file="LGG150.igv", format="igv")
exportBins(copyNumbersSmooth, file="LGG150.bed", format="bed")
@

\subsection{Downstream analyses}

Segmentation with the \software{CBS} algorithm from \Biocpkg{DNAcopy}, and
calling of copy number aberrations with \Biocpkg{CGHcall} or cutoffs, have
been implemented for convenience. By default, segmentation uses a
$\log_2$-transformation, but a $\sqrt{x + 3/8}$ transformation can also be
used, as it stabilizes the variance of a Poisson distribution (Anscombe
transform):

<<>>=
copyNumbersSegmented <- segmentBins(copyNumbersSmooth, transformFun="sqrt")
copyNumbersSegmented <- normalizeSegmentedBins(copyNumbersSegmented)
@
<<echo=FALSE, results=hide>>=
png("segments.png")
@
<<>>=
plot(copyNumbersSegmented)
@
<<echo=FALSE, results=hide>>=
dev.off()
@

\begin{figure}[h]
\centering
\caption{Copy number profile after segmenting.}
\label{fig:segments}
\includegraphics{segments}
\end{figure}

Tune segmentation parameters and iterate until satisfied; a sketch of commonly
adjusted parameters is shown below.
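For example, assuming your version of \Rfunction{segmentBins()} exposes the
standard \software{CBS} parameters \Rcode{alpha}, \Rcode{undo.splits}, and
\Rcode{undo.SD} (please check \Rcode{?segmentBins}), the sensitivity of the
segmentation could be adjusted along these lines; the values shown are purely
illustrative:

<<eval=FALSE>>=
# use a stringent significance threshold for change points, and undo splits
# between segments that differ by less than 2 standard deviations
copyNumbersSegmented <- segmentBins(copyNumbersSmooth, transformFun="sqrt",
    alpha=1e-10, undo.splits="sdundo", undo.SD=2.0)
copyNumbersSegmented <- normalizeSegmentedBins(copyNumbersSegmented)
@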
Next, call aberrations, and plot the final results.

<<>>=
copyNumbersCalled <- callBins(copyNumbersSegmented)
@
<<echo=FALSE, results=hide>>=
png("calls.png")
@
<<>>=
plot(copyNumbersCalled)
@
<<echo=FALSE, results=hide>>=
dev.off()
@

\begin{figure}[h]
\centering
\caption{Copy number profile after calling gains and losses.}
\label{fig:calls}
\includegraphics{calls}
\end{figure}

Called data can be exported as a VCF or SEG file for further downstream
analysis.

<<eval=FALSE>>=
exportBins(copyNumbersCalled, format="vcf")
exportBins(copyNumbersCalled, format="seg")
@

It should be noted that \Biocpkg{CGHcall} (which \Rfunction{callBins()} uses
by default) was developed for the analysis of sets of cancer samples. It is
based on a mixture model, and when there are not enough aberrations present in
the data, model fitting can fail. This can happen especially with non-cancer
samples, and/or when analyzing individual cases instead of larger data sets.

If \Biocpkg{CGHcall} fails, \Rfunction{callBins()} can also perform simple
cutoff-based calling by setting parameter \Rcode{method="cutoff"}. The default
cutoff values are based on the assumption of uniform cell populations, and in
the case of cancer samples will most likely need calibration by adjusting
parameter \Rcode{cutoffs}.

Finally, for other downstream analyses, such as running \Biocpkg{CGHregions},
it might be useful to convert to a \Rclass{cghCall} object.

<<>>=
cgh <- makeCgh(copyNumbersCalled)
cgh
@

This command can also be used to generate \Rclass{cghRaw} or \Rclass{cghSeg}
objects by running it before segmentation or calling.

% caching
% 661 BAM files = 445G
% cache/reads = 13G
% cache/readCounts/15kbp = 139M
% cache/readCounts/30kbp = 82M
% cache/readCounts/100kbp = 32M
% cache/readCounts/1000kbp = 5.2M

\clearpage

\section{Parallel computation}
\label{sec:parallel}

\Biocpkg{QDNAseq} supports parallel computing via the \CRANpkg{future}
package. All that is required is to select an appropriate \textit{plan}.

The instructions below apply to all of \Biocpkg{QDNAseq}'s own functions that
support parallel processing. At the moment these include
\Rfunction{estimateCorrection()}, \Rfunction{segmentBins()},
\Rfunction{createBins()}, and \Rfunction{calculateBlacklist()}.
\Rfunction{binReadCounts()} parallelizes by chromosome when \Rcode{chunkSize}
is used.

However, when argument \Rcode{method="CGHcall"} is used (which is the
default), function \Rfunction{callBins()} calls function \Rfunction{CGHcall()}
from package \Biocpkg{CGHcall}, which uses another mechanism for parallel
computation. For that, the number of processes to use should be specified with
argument \Rcode{ncpus}, with something along the lines of:

<<eval=FALSE>>=
copyNumbers <- callBins(..., ncpus=4)
@

\subsection{Non-parallel processing}

The default is to use single-core processing via ``sequential'' futures. This
can be set explicitly with:

<<eval=FALSE>>=
future::plan("sequential")
@

\subsection{Parallel processing on the current machine}

To process data in parallel using multiple \R{} sessions on the current
machine, use the following:

<<eval=FALSE>>=
future::plan("multisession")
@

After that, all functions that support parallel processing will automatically
use it. The \Rpackage{future} framework attempts to play nice with the current
compute environment. It will automatically respect \R{} options and
environment variables that are used to limit the number of parallel workers,
such as the number of cores assigned to job scripts in high-performance
computing (HPC) clusters. If no such restrictions are set, the default is to
use all cores available.
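To see how many parallel workers the \Rpackage{future} framework would
actually use in the current environment, the number of available cores can be
queried before selecting a plan (a quick check only; it does not change any
settings):

<<eval=FALSE>>=
# number of cores considered available, honoring R options, environment
# variables, and HPC job allocations
future::availableCores()
@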
To explicitly set the number of parallel workers, overriding any other
settings, use argument \Rcode{workers}, e.g.:

<<eval=FALSE>>=
future::plan("multisession", workers=4)
@

For more details, see the \Rpackage{future} documentation.

\subsection{Parallel processing on an ad-hoc cluster}

To process data using multiple \R{} sessions running on different machines,
use something along the lines of:

<<eval=FALSE>>=
cl <- future::makeClusterPSOCK(...)
future::plan("cluster", workers=cl)
@

See package \Rpackage{future} for more details.

\clearpage

\section{Sex chromosomes}

By default, \Biocpkg{QDNAseq} ignores sex chromosomes. In order to include
them in the analysis, function \Rfunction{applyFilters()} should be run with
argument \Rcode{chromosomes=NA} to include both X and Y, or
\Rcode{chromosomes="Y"} to include X only.

However, this will also affect which chromosomes are used when calculating the
LOESS correction with \Rfunction{estimateCorrection()}. Unless the data set
consists of only females, this could be undesirable. The solution is to first
filter out the sex chromosomes, run \Rfunction{estimateCorrection()}, and then
reverse the filtering of sex chromosomes:

<<eval=FALSE>>=
readCounts <- binReadCounts(getBinAnnotations(binSize=15))
readCounts <- applyFilters(readCounts)
readCounts <- estimateCorrection(readCounts)
readCounts <- applyFilters(readCounts, chromosomes=NA)
copyNumbers <- correctBins(readCounts)
@

Running \Rfunction{estimateCorrection()} and \Rfunction{correctBins()} with a
different set of bins can have one side effect. This is caused by the fact
that there can be bins in the sex chromosomes with a combination of GC content
and mappability that is not found anywhere else in the genome. This will cause
those bins to miss a correction estimate altogether, and these bins will be
filtered out from subsequent steps by \Rfunction{correctBins()}. If this
happens, it will print out a message specifying the number of bins affected.

Another possible approach is to allow extrapolation while calculating the
LOESS correction. But please do note that the effect of extrapolation has not
been properly evaluated.

<<eval=FALSE>>=
readCounts <- estimateCorrection(readCounts,
    control=loess.control(surface="direct"))
@

\clearpage

\section{Generating bin annotations}

This section describes how the bin annotations were created for the hg19 build
of the human reference genome, and how the same process can be applied to
other genome builds and species.

The first step is to create the bins based on chromosome sizes, and calculate
their GC content and proportion of characterized nucleotides (non-N bases in
the reference sequence). For this, the corresponding \Biocpkg{BSgenome}
package is needed.

<<eval=FALSE>>=
# load required packages for human reference genome build hg19
library(QDNAseq)
library(Biobase)
library(BSgenome.Hsapiens.UCSC.hg19)

# set the bin size
binSize <- 15

# create bins from the reference genome
bins <- createBins(bsgenome=BSgenome.Hsapiens.UCSC.hg19, binSize=binSize)
@

The result is a \Rclass{data.frame} with columns \Robject{chromosome},
\Robject{start}, \Robject{end}, \Robject{gc}, and \Robject{bases}.
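The same call can be used for other genome builds and species simply by
loading the corresponding \Biocpkg{BSgenome} package. As an illustration (not
evaluated here; it assumes the \Biocpkg{BSgenome.Hsapiens.UCSC.hg38} package
is installed), bins for the hg38 build could be created with:

<<eval=FALSE>>=
# the same procedure for genome build hg38
library(BSgenome.Hsapiens.UCSC.hg38)
bins38 <- createBins(bsgenome=BSgenome.Hsapiens.UCSC.hg38, binSize=binSize)
@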
The next step is to calculate the average mappability per bin, which requires
a mappability file in the \Rcode{bigWig} format and the
\software{bigWigAverageOverBed} binary. The mappability file can be generated
with the \software{GEnomic Multi-Tool (GEM) Mapper}, part of the
\software{\href{https://sourceforge.net/projects/gemlibrary/}{GEM library}},
from the reference genome sequence. Or it might be available directly, as was
the case for hg19, for which file
\file{wgEncodeCrgMapabilityAlign50mer.bigWig} was downloaded from
\href{https://genome.ucsc.edu/cgi-bin/hgFileUi?db=hg19&g=wgEncodeMapability}
{ENCODE's download section of the UCSC Genome Browser}. The
\software{bigWigAverageOverBed} binary can also be downloaded from
\href{https://hgdownload.soe.ucsc.edu/admin/exe/}{UCSC Genome Browser's Other
utilities section}.

<<eval=FALSE>>=
# calculate mappability per bin from an ENCODE mappability track
bins$mappability <- calculateMappability(bins,
    bigWigFile="/path/to/wgEncodeCrgMapabilityAlign50mer.bigWig",
    bigWigAverageOverBed="/path/to/bigWigAverageOverBed")
@

If there are genomic regions that should be excluded from analyses, such as
ENCODE's Blacklisted Regions, the percentage overlap between the generated
bins and these regions can be calculated as follows. The regions to be
excluded need to be in the \Rcode{BED} format, like files
\file{wgEncodeDacMapabilityConsensusExcludable.bed} and
\file{wgEncodeDukeMapabilityRegionsExcludable.bed} that were downloaded from
\href{https://genome.ucsc.edu/cgi-bin/hgFileUi?db=hg19&g=wgEncodeMapability}{
ENCODE's download section of the UCSC Genome Browser} for hg19.

<<eval=FALSE>>=
# calculate overlap with ENCODE blacklisted regions
bins$blacklist <- calculateBlacklist(bins,
    bedFiles=c("/path/to/wgEncodeDacMapabilityConsensusExcludable.bed",
    "/path/to/wgEncodeDukeMapabilityRegionsExcludable.bed"))
@

For any list of regions, given as vectors of chromosome names and start and
end positions, the percentage of bin overlap can be calculated by using the
following command.

<<eval=FALSE>>=
# generic calculation of overlap with blacklisted regions
bins$blacklist <- calculateBlacklistByRegions(bins,
    cbind(chromosome, bpStart, bpEnd))
@

To calculate median residuals of the LOESS fit from a control data set, the
following command can be used. For the pre-generated annotations, the control
set used is 38 samples from the
\href{https://www.internationalgenome.org/}{1000 Genomes Project}. See the
next section on how those were downloaded.

<<eval=FALSE>>=
# load data for the 1000 Genomes (or similar) data set, and generate residuals
ctrl <- binReadCounts(bins, path="/path/to/control-set/bam/files")
ctrl <- applyFilters(ctrl, residual=FALSE, blacklist=FALSE,
    mappability=FALSE, bases=FALSE)
bins$residual <- iterateResiduals(ctrl)
@

The column \Robject{use} specifies whether each bin should be used for
subsequent analyses by default. The command \Rfunction{applyFilters()} will
change its value accordingly. By default, bins in the sex chromosomes, or with
only uncharacterized nucleotides (N's) in their reference sequence, are
flagged for exclusion.

<<eval=FALSE>>=
# by default, use all autosomal bins that have a reference sequence
# (i.e. not only N's)
bins$use <- bins$chromosome %in% as.character(1:22) & bins$bases > 0
@

Optionally, the resulting \Rclass{data.frame} can be converted to an
\Rclass{AnnotatedDataFrame} and metadata added for the columns.

<<eval=FALSE>>=
# convert to AnnotatedDataFrame and add metadata
bins <- AnnotatedDataFrame(bins,
    varMetadata=data.frame(labelDescription=c(
    "Chromosome name",
    "Base pair start position",
    "Base pair end position",
    "Percentage of non-N nucleotides (of full bin size)",
    "Percentage of C and G nucleotides (of non-N nucleotides)",
    "Average mappability of 50mers with a maximum of 2 mismatches",
    "Percent overlap with ENCODE blacklisted regions",
    "Median loess residual from 1000 Genomes (50mers)",
    "Whether the bin should be used in subsequent analysis steps"),
    row.names=colnames(bins)))
@
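The resulting object can then be used in place of pre-generated annotations
obtained with \Rfunction{getBinAnnotations()}; a minimal sketch (the path is
hypothetical):

<<eval=FALSE>>=
# use the newly created bin annotations to bin read counts from BAM files
readCounts <- binReadCounts(bins, path="/path/to/bam/files")
@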
For the pre-generated annotations, some additional descriptive metadata has
also been added.

<<eval=FALSE>>=
attr(bins, "QDNAseq") <- list(
    author="Ilari Scheinin",
    date=Sys.time(),
    organism="Hsapiens",
    build="hg19",
    version=packageVersion("QDNAseq"),
    md5=digest::digest(bins@data),
    sessionInfo=sessionInfo())
@

\clearpage

\section{Downloading 1000 Genomes samples}

This section defines the criteria that were used to download samples from the
1000 Genomes Project for the pre-generated bin annotations.

<<eval=FALSE>>=
# download table of samples
urlroot <- "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp"
g1k <- read.table(file.path(urlroot, "sequence.index"),
    header=TRUE, sep="\t", as.is=TRUE, fill=TRUE)

# keep cases that are Illumina, low coverage, single-read, and not withdrawn
g1k <- g1k[g1k$INSTRUMENT_PLATFORM == "ILLUMINA", ]
g1k <- g1k[g1k$ANALYSIS_GROUP == "low coverage", ]
g1k <- g1k[g1k$LIBRARY_LAYOUT == "SINGLE", ]
g1k <- g1k[g1k$WITHDRAWN == 0, ]

# keep cases with read lengths of more than 50 bp
g1k <- g1k[!g1k$BASE_COUNT %in% c("not available", ""), ]
g1k$BASE_COUNT <- as.numeric(g1k$BASE_COUNT)
g1k$READ_COUNT <- as.integer(g1k$READ_COUNT)
g1k$readLength <- g1k$BASE_COUNT / g1k$READ_COUNT
g1k <- g1k[g1k$readLength > 50, ]

# keep samples with a minimum of one million reads
readCountPerSample <- aggregate(g1k$READ_COUNT,
    by=list(sample=g1k$SAMPLE_NAME), FUN=sum)
g1k <- g1k[g1k$SAMPLE_NAME %in%
    readCountPerSample$sample[readCountPerSample$x >= 1e6], ]
g1k$fileName <- basename(g1k$FASTQ_FILE)

# download FASTQ files
for (i in rownames(g1k)) {
    sourceFile <- file.path(urlroot, g1k[i, "FASTQ_FILE"])
    destFile <- g1k[i, "fileName"]
    if (!file.exists(destFile))
        download.file(sourceFile, destFile, mode="wb")
}
@

Next, reads were trimmed to 50 bp, and the multiple files for each sample (as
defined by column SAMPLE\_NAME) were combined by concatenating the FASTQ files
together. Finally, they were aligned with \software{BWA} allowing two
mismatches and end-trimming of bases with qualities below 40 (options
\Rcode{-n 2 -q 40}).

\clearpage

\section{Session information}

The version numbers of \R{} and the packages loaded when generating this
vignette were:

<<echo=FALSE, results=tex>>=
toLatex(sessionInfo())
@

\end{document}

% EOF