%\VignetteKeywords{runAbsoluteCN}
%\VignetteEngine{knitr::knitr}
%\VignetteDepends{PureCN}
%\VignettePackage{PureCN}
%\VignetteIndexEntry{PureCN overview}

\documentclass{article}

<<style-knitr, eval=TRUE, echo=FALSE, results="asis">>=
BiocStyle::latex2()
@ 

\usepackage{booktabs} % book-quality tables

\title{Copy number calling and SNV classification using 
targeted short read sequencing}
\author[1]{Markus Riester}
\affil[1]{Novartis Institutes for BioMedical Research, Cambridge, MA}

\begin{document}

\maketitle

\begin{abstract}
\Biocpkg{PureCN} \cite{Riester2016} is a purity and ploidy aware copy number
caller for cancer samples inspired by the \software{ABSOLUTE} algorithm
\cite{Carter2012}.  It was designed for hybrid capture sequencing data,
especially with medium-sized targeted gene panels without matching normal
samples in mind (matched whole-exome data is of course supported). 

It can be used to supplement existing normalization and segmentation algorithms,
i.e. the software can start from BAM files, from target-level coverage data,
from copy number log-ratios or from already segmented data. After the correct
purity and ploidy solution was identified, \Biocpkg{PureCN} will accurately
classify variants as germline vs. somatic or clonal vs. sub-clonal.

\Biocpkg{PureCN} was further designed to integrate well with industry standard
pipelines \cite{McKenna2010}, but it is straightforward to generate input data
from other pipelines.
\end{abstract}

\packageVersion{\Sexpr{BiocStyle::pkg_ver("PureCN")}}

\tableofcontents
\newpage

<<load-purecn, echo=FALSE, message=FALSE>>=
library(PureCN)
set.seed(1234)
@

\section{Introduction}

This tutorial will demonstrate on a toy example how we recommend running
\Biocpkg{PureCN} on targeted sequencing data. To estimate tumor purity, we
jointly utilize both target-level\footnote{The captured genomic regions, e.g.
exons.} coverage data and allelic fractions of single nucleotide
variants (SNVs), inside - and optionally outside - the targeted regions.
Knowledge of purity will in turn allow us to accurately (i) infer integer copy
number and (ii) classify variants (somatic vs. germline, mono-clonal vs.
sub-clonal, heterozygous vs. homozygous etc.).

This requires 3 basic input files:

\begin{enumerate}
    \item A VCF file containing germline SNPs and somatic mutations. Somatic
status is not required in case the variant caller was run without matching
normal sample.
    \item The tumor BAM file.
    \item A BAM file from a normal control sample, either matched or
process-matched.
\end{enumerate}

In addition, we need to know a little bit more about the assay.  This is the
annoying step, since here the user needs to provide some information.  Most
importantly, we need to know the positions of all targets.  Then we need to
correct for GC-bias, for which we need GC-content for each target.  Optionally,
if gene-level calls are wanted, we also need for each target a gene symbol.  To
obtain best results, we can finally use a pool of normal samples to
automatically learn more about the assay and its biases and common artifacts. 

The next sections will show how to do all this with \Biocpkg{PureCN} alone or
with the help of \software{GATK} and/or existing copy number pipelines. 

\section{Basic input files}

\subsection{VCF}

Germline SNPs and somatic mutations are expected in a single VCF file. At the
bare minimum, this VCF should contain read depths of reference and alt alleles
in an AD genotype field and a DB info flag for dbSNP membership. Without DB
flag, variant ids starting with \Rcode{rs} are assumed to be in dbSNP. If a
matched normal is available, then somatic status information is currently
expected in a SOMATIC info flag in the VCF. The \Biocpkg{VariantAnnotation}
package provides examples how to add info fields to a VCF in case the used
variant caller does not add this flag. If the VCF contains a BQ genotype field
containing base quality scores, \Biocpkg{PureCN} can remove low quality calls.

VCF files generated by \software{MuTect} \cite{Cibulskis2013} should work well
and in general require no post-processing. \Biocpkg{PureCN} can handle
\software{MuTect} VCF files generated in both tumor-only and matched normal
mode.  Experimental support for \software{MuTect 2} and \software{FreeBayes}
VCFs generated in tumor-only mode is available.  

\subsection{Coverage data}
\label{gatkcoverage}
For the default segmentation function provided by \Biocpkg{PureCN}, the
algorithm first needs to calculate log-ratios of tumor vs. normal control
coverage.  Coverage data needs to be provided in \software{GATK
DepthOfCoverage} format\footnote{\software{GATK} will generate additional
columns which are not shown here and which \Biocpkg{PureCN} will ignore when
provided}:

\begin{verbatim}
    Target  total_coverage  average_coverage    
    chr1:69091-70009    0   0   0   
    chr1:367659-368598  6358    9.25    
    chr1:621096-622035  6294    9.16    
\end{verbatim}

The intervals define the captured genomic regions (targets) and are provided by
the manufacturer of your capture kit\footnote{While \Biocpkg{PureCN} can use a
pool of normal samples to learn which intervals are reliable and which not, it
is highly recommended to provide the correct intervals. Garbage in, garbage
out.}.  Default parameters assume that these intervals do NOT include a
"padding" to include flanking regions of targets.  \Biocpkg{PureCN} will
automatically include variants in the 50bp flanking regions if the variant
caller was either run without interval file or with interval padding (See
section~\ref{faqinput}). Please double check that the genome version of the
interval file matches the reference.

It is recommended to GC-normalize coverage (how this done is described later in
Section~\ref{secgcbias}). The GC-content for each target is expected in
\software{GATK GCContentByInterval} format:

\begin{verbatim}
    Target    gc_bias    Gene
    chr1:69091-70009    0.427638737758433    OR4F5
    chr1:367659-368598    0.459574468085106    OR4F29
    chr1:621096-622035    0.459574468085106    OR4F3
\end{verbatim}

The \Rcode{Gene} column is optional and only required for providing gene-level
copy number and LOH calls. This information should be available in the
technical documentation of your capture kit. Simplistic example code for
annotating target intervals with gene symbols is given in the Appendix of this
vignette for testing purposes.

\subsection{Generating coverage data without \software{GATK}}
\label{purecncoverage}

The \Rfunction{calculateBamCoverageByInterval} function can be used to generate
the required coverage data from BAM files:

<<examplecoverage>>=
bam.file <- system.file("extdata", "ex1.bam", package="PureCN", 
    mustWork = TRUE)
interval.file <- system.file("extdata", "ex1_intervals.txt", 
    package="PureCN", mustWork = TRUE)

calculateBamCoverageByInterval(bam.file=bam.file, 
 interval.file=interval.file, output.file="ex1_coverage.txt")
@

To calculate GC-content, \Biocpkg{PureCN} provides the 
\Rfunction{calculateGCContentByInterval} function:

<<examplegc>>=
interval.file <- system.file("extdata", "ex2_intervals.txt", 
        package = "PureCN", mustWork = TRUE)
reference.file <- system.file("extdata", "ex2_reference.fa", 
    package = "PureCN", mustWork = TRUE)
calculateGCContentByInterval(interval.file, reference.file, 
    output.file = "ex2_gc_file.txt")
@

\Rfunction{calculateGCContentByInterval} can also be used 
to create an interval file from a BED file for example:

<<examplegc2>>=
bed.file <- system.file("extdata", "ex2_intervals.bed", 
        package = "PureCN", mustWork = TRUE)

intervals <- import(bed.file)

calculateGCContentByInterval(intervals, reference.file, 
    output.file = "ex2_gc_file.txt")
@

\subsection{Third-party segmentation tools}

\Biocpkg{PureCN} integrates well with existing copy number pipelines. Instead of
coverage data, the user then needs to provide either already segmented data or a
wrapper function. This is described in Section~\ref{customseg}.

\subsection{Example data}

We now load a few example files that we will use throughout this tutorial:

<<example_files, message=FALSE, warning=FALSE, results='hide'>>=
library(PureCN)

normal.coverage.file <- system.file("extdata", "example_normal.txt",
    package="PureCN") 
normal2.coverage.file <- system.file("extdata", "example_normal2.txt", 
    package="PureCN") 
normal.coverage.files <- c(normal.coverage.file, normal2.coverage.file) 
tumor.coverage.file <- system.file("extdata", "example_tumor.txt", 
    package="PureCN") 
seg.file <- system.file("extdata", "example_seg.txt",
    package = "PureCN")
vcf.file <- system.file("extdata", "example_vcf.vcf", package="PureCN")
gc.gene.file <- system.file("extdata", "example_gc.gene.file.txt", 
    package="PureCN")
@

\section{GC-bias}
\label{secgcbias}
The algorithm works best when the coverage files are GC-normalized. We can
easily create GC-normalized coverage files
(Figure~\ref{figure/figuregccorrect-1}).

<<figuregccorrect, fig.show='hide', fig.width=7, fig.height=4, warning=FALSE>>=
correctCoverageBias(normal.coverage.file, gc.gene.file, 
    output.file="example_normal_loess.txt", plot.gc.bias=TRUE)
@
\incfig{figure/figuregccorrect-1}{0.95\textwidth}{Coverage before and after
GC-normalization.} 
{This plot shows coverage as a function of target GC-content before and after
normalization. Each dot is a target interval. The example data is already
GC-normalized; real data will show more dramatic differences.}

\textbf{All the following steps in this vignette assume that the coverage data
are GC-normalized.} The example coverage files are already GC-normalized.
Section~\ref{cmdcoverage} presents a convenient command line script for
generating GC-normalized coverage data from BAM files or from \software{GATK}
coverage files.


\section{Pool of normals}

\subsection{Selection of normals for log-ratio calculation}
\label{bestnormal}

For calculating copy number log-ratios of tumor vs. normal, \Biocpkg{PureCN}
requires coverage from a process-matched normal sample.  Using a normal that
was sequenced using a similar, but not identical assay, rarely works, since
differently covered genomic regions result in too many log-ratio outliers. This
section describes how to identify a good process-matched normal in case no
matched normal is available or in case the matched normal has low or uneven
coverage. 

The \Rfunction{createNormalDatabase} function builds a database of coverage
files (a command line script proving this functionality is described in
Section~\ref{cmdnormaldb}): 

<<normaldb>>=
normalDB <- createNormalDatabase(normal.coverage.files)
# serialize, so that we need to do this only once for each assay
saveRDS(normalDB, file="normalDB.rds")
@

Again, please make sure that all coverage files were GC-normalized prior to
building the database (Section~\ref{secgcbias}).  Internally,
\Rfunction{createNormalDatabase} determines the sex of the samples and trains a
PCA that is later used for clustering a tumor file with all normal samples in
the database. This clustering is performed by the \Rfunction{findBestNormal}
function:

<<normaldbpca>>=
normalDB <- readRDS("normalDB.rds")
# get the best normal
best.normal.coverage.file <- findBestNormal(tumor.coverage.file, 
    normalDB)
@

This function can also return multiple normal files that can be averaged into a
single pool:  

<<normaldbpcapool>>=
# get the best 2 normals and average them
pool <- findBestNormal(tumor.coverage.file, normalDB, 
    num.normals=2, pool=TRUE, remove.chrs=c("chrX", "chrY"))
@

This function will by default optimize averaging weights using the
\Rfunction{voomWithQualityWeights} function of the \Biocpkg{limma} package. The
\Rcode{num.normals} should be set to a value between 2 and 10. More than 10
usually results in long runtimes with no significant gain in accuracy.

Giving recommendations for optimal strategies is hard since they depend on the
size, coverage and variance of the pool of normals. It is worth experimenting
with different strategies and the \Rfunction{plotBestNormal} function might be
helpful. Starting with the most simple one, i.e. using the single best normal,
and then testing more complicated strategies, is a sound approach. 

Note that this example removes coverage from sex chromosomes; if the normal
database contains a sufficient number of samples with matching sex,
\Rfunction{findBestNormal} will return only normal samples with matching sex.

\subsection{Artifact filtering}
\label{artifactfiltering}

It is important to remove as many artifacts as possible, since low ploidy
solutions are typically punished more by artifacts than high ploidy solutions.
High ploidy solutions are complex and usually find ways of explaining artifacts
reasonably well. The following steps in this section are optional, but
recommended since they will reduce the number of samples requiring manual
curation, especially when matching normal samples are not available. 

\subsubsection{VCF}

We recommend running \software{MuTect} with a pool of normal samples to filter
common sequencing errors and alignment artifacts from the VCF.
\software{MuTect} requires a single VCF containing all normal samples, for
example generated by the \software{GATK CombineVariants} tool (see
Section~\ref{faqinput}). 

It is highly recommended to provide \Biocpkg{PureCN} this combined VCF as well;
it might help the software correcting non-reference read mapping biases. This
is described in the \Rfunction{setMappingBiasVcf} documentation.  To reduce
memory usuage, the normal panel VCF can be reduced to contain only variants
present in 5 or more samples (the VCF for \software{MuTect} should however
contain variants present in 2-3 samples).

\subsubsection{Coverage data}
We next use coverage data of normal samples to estimate the expected variance
in coverage per target:

<<targetweightfile1, message=FALSE>>=
target.weight.file <- "target_weights.txt"
createTargetWeights(tumor.coverage.file, normal.coverage.files, 
    target.weight.file)
@

This function calculates target-level copy number log-ratios using all normal
samples provided in the \Rcode{normal.coverage.files} argument.  Assuming that
all normal samples are in general diploid, a high variance in log-ratio
is indicative of an target with either common germline alterations or frequent
artifacts; high or low copy number log-ratios in these targets are unlikely
measuring somatic copy number events.  For the log-ratio calculation, we provide
a coverage file that is used as tumor in the log-ratio calculation. The
corresponding \Rcode{tumor.coverage.file} argument can also be an array of a
small number of coverage files, in which case the target coverage variance is
averaged over all provided tumor files. 

This \Rcode{target.weight.file} is automatically generated by the NormalDB.R
script described in Section~\ref{cmdnormaldb}. 

\subsection{Artifact filtering without a pool of normals}
\label{artifactfiltering2}

By default, \Biocpkg{PureCN} will exclude targets with coverage below 15X from
segmentation. For SNVs, the same 15X cutoff is applied. \software{MuTect}
applies more sophisticated artifact tests and flags suspicious variants. If
\software{MuTect} was run in matched normal mode, then both potential artifacts
and germline variants are rejected, that means we cannot just filter by the
PASS/REJECT \software{MuTect} flags. The \Rfunction{filterVcfMuTect} function
optionally reads the \software{MuTect 1.1.7} stats file and will keep germline
variants, while removing potential artifacts.  Without the stats file,
\Biocpkg{PureCN} will use only the filters based on read depths as defined in
\Rfunction{filterVcfBasic}.  Both functions are automatically called by
\Biocpkg{PureCN}, but can be easily modified and replaced if necessary.

Instead of using a pool of normals to find SNPs with extremely biased allelic
fractions, we can also use a BED file to blacklist regions of low mappability.
For example the simple repeats track from the UCSC.  This is recommended when
neither matching normals nor a pool of normal VCF
(Section~\ref{artifactfiltering}) is available. 

<<ucsc_segmental>>=
# Instead of using a pool of normals to find low quality regions,
# we use suitable BED files, for example from the UCSC genome browser.

# We do not download these in this vignette to avoid build failures 
# due to internet connectivity problems.
downloadFromUCSC <- FALSE
if (downloadFromUCSC) {
    library(rtracklayer)
    mySession <- browserSession("UCSC")
    genome(mySession) <- "hg19"
    simpleRepeats <- track( ucscTableQuery(mySession, 
        track="Simple Repeats", table="simpleRepeat"))
    export(simpleRepeats, "hg19_simpleRepeats.bed")
    
    # when off-target reads are used, we can provide one of the 
    # whole-genome blacklists tracks
    # hg19_DukeBlacklist <- track( ucscTableQuery(mySession,  
    #     track="Mappability", 
    #     table="wgEncodeDukeMapabilityRegionsExcludable"))
    # export(hg19_DukeBlacklist, "hg19_DukeBlacklist.bed")
}

snp.blacklist <- "hg19_simpleRepeats.bed"
@

\section{Recommended run}
\label{secrecommended}
Finally, we can run \Biocpkg{PureCN} with all that information:

<<runpurecn>>=
ret <-runAbsoluteCN(normal.coverage.file=pool,
# normal.coverage.file=normal.coverage.file, 
    tumor.coverage.file=tumor.coverage.file, vcf.file=vcf.file, 
    genome="hg19", sampleid="Sample1", 
    gc.gene.file=gc.gene.file, normalDB=normalDB,
# args.setMappingBiasVcf=list(normal.panel.vcf.file=normal.panel.vcf.file),
# args.filterVcf=list(snp.blacklist=snp.blacklist, 
# stats.file=mutect.stats.file), 
    args.segmentation=list(target.weight.file=target.weight.file), 
    post.optimize=FALSE, plot.cnv=FALSE, verbose=FALSE)
@

The \Rcode{normal.coverage.file} argument points to a coverage file obtained from
either a matched or a process-matched normal sample, but can be also a small
pool of best normals (Section~\ref{bestnormal}).  

The \Rcode{normalDB} argument (Section~\ref{bestnormal}) provides a pool of
normal samples and for example allows the segmentation function to skip targets
with low coverage in the pool of normals.  If available, a VCF containing all
variants from the normal samples should be provided via
\Rcode{args.setMappingBiasVcf} to correct read mapping biases.  The files
specified in \Rcode{args.filterVcf} help \Biocpkg{PureCN} filtering SNVs more
efficiently for artifacts as described in Sections~\ref{artifactfiltering} and
\ref{artifactfiltering2}.  The \Rcode{snp.blacklist} is only necessary if
neither a matched normal nor a pool of normals is available. 

The \Rcode{post.optimize} flag will increase the runtime by about a factor of
2-5, but might return slightly more accurate purity estimates. For high quality
whole-exome data, this is typically not necessary for copy number calling (but
might be for variant classification, see Section~\ref{predictsomatic}).  The
\Rcode{plot.cnv} argument allows the segmentation function to generate
additional plots if set to \Rcode{TRUE}.  Finally, \Rcode{verbose} outputs
important and helpful information about all the steps performed and is
therefore set to \Rcode{TRUE} by default.

\section{Output}

\subsection{Plots}

We now create a few output files:

<<createoutput>>=
file.rds <- "Sample1_PureCN.rds"
saveRDS(ret, file=file.rds)
pdf("Sample1_PureCN.pdf", width=10, height=11)
plotAbs(ret, type="all")
dev.off()
@

The RDS file now contains the serialized return object of the
\Rfunction{runAbsoluteCN} call. The PDF contains helpful plots for all local
minima, sorted by likelihood.  The first plot in the generated PDF is displayed
in Figure~\ref{figure/figureexample1-1} and shows the purity and ploidy local
optima, sorted by final likelihood score after fitting both copy number and
allelic fractions.

<<figureexample1, fig.show='hide', fig.width=6, fig.height=6>>=
plotAbs(ret, type="overview")
@
\incfig{figure/figureexample1-1}{0.75\textwidth}{Overview.}
{The colors visualize the copy number fitting score from low (blue) to high
(red). The numbers indicate the ranks of the local optima. Yellow fonts
indicate that the corresponding solutions were flagged, which does not
necessarily mean the solutions are wrong. The correct solution (number 1) of
this toy example was flagged due to large amount of LOH.} 

We now look at the main plots of the maximum likelihood solution in more detail.

<<figureexample2, fig.show='hide', fig.width=6, fig.height=6>>=
plotAbs(ret, 1, type="hist")
@
\incfig{figure/figureexample2-1}{0.75\textwidth}{Log-ratio histogram.}

Figure~\ref{figure/figureexample2-1} displays a histogram of tumor vs. normal
copy number log-ratios for the maximum likelihood solution (number 1 in
Figure~\ref{figure/figureexample1-1}). The height of a bar in this plot is
proportional to the fraction of the genome falling into the particular
log-ratio copy number range. The vertical dotted lines and numbers visualize
the, for the given purity/ploidy combination, expected log-ratios for all
integer copy numbers from 0 to 7. It can be seen that most of the log-ratios of
the maximum likelihood solution align well to expected values for copy numbers
of 0, 1, 2 and 4. 

<<figureexample3, fig.show='hide', fig.width=8, fig.height=8>>=
plotAbs(ret, 1, type="BAF")
@
\incfig{figure/figureexample3-1}{0.95\textwidth}{B-allele frequency plot.}{Each
dot is a (predicted) germline SNP. The first panel shows the allelic fractions
as provided in the VCF file.  The alternating blue and white background colors
visualize odd and even chromosome numbers, respectively. The black lines
visualize the expected (not the average!) allelic fractions in the segment.
These are calculated using the estimated purity and the total and minor segment
copy numbers. These are visualized in black and grey, respectively, in the
second and third panel.  The second panel shows the copy number log-ratios, the
third panel the integer copy numbers.}

Germline variant data are informative for calculating integer copy number,
because unbalanced maternal and paternal chromosome numbers in the tumor
portion of the sample lead to unbalanced germline allelic fractions.
Figure~\ref{figure/figureexample3-1} shows the allelic fractions of predicted
germline SNPs. The goodness of fit (GoF) is provided on an arbitrary scale in
which 100\% corresponds to a perfect fit and 0\% to the worst possible fit.
The latter is defined as a fit in which allelic fractions on average differ by
0.2 from their expected fractions. Note that this does not take purity into
account and low purity samples are expected to have a better fit. In the
middle panel, the corresponding copy number log-ratios are shown. The lower
panel displays the calculated integer copy numbers, corrected for purity and
ploidy. We can zoom into particular chromosomes 
(Figure~\ref{figure/figureexample3b-1}).

<<figureexample3b, fig.show='hide', fig.width=9, fig.height=8>>=
plotAbs(ret, 1, type="BAF", chr="chr19")
@
\incfig{figure/figureexample3b-1}{0.95\textwidth}{Chromosome plot.}{Similar
    to Figure~\ref{figure/figureexample3-1}, but zoomed into a particular 
    chromosome. The grey dots in the middle panel visualize copy number 
    log-ratios of targets without heterozygous SNPs, which are omitted from the 
    previous genome-wide plot. The x-axis now indicates genomic coordinates in
    kbps.}

<<figureexample4, fig.show='hide', fig.width=8, fig.height=8>>=
plotAbs(ret, 1, type="AF")
@
\incfig{figure/figureexample4-1}{0.95\textwidth}{Allele fraction plots.}{Each
dot is again a (predicted) germline SNP. The size of dots indicate quality,
defined here as the product of mapping bias and coverage. The shapes visualize
the different SNV groups based on prior and posterior probabilities.  The labels
show the expected values for all called states; 2m1 would be diploid,
heterozygous, 2m2 diploid, homozygous.  The relationship of allelic fraction and
coverage is shown in the top right panel.  This plot normally also shows somatic
mutations in two additional panels, with the left panel showing the same plot as
for germline SNPs and the bottom right panel a histogram of cellular fraction of
predicted somatic mutations. This toy example contains only germline SNPs
however.}

Finally, Figure~\ref{figure/figureexample4-1} provides more insight into how
well the variants fit the expected values.
\clearpage

\subsection{Data structures}

The R data file (\Rcode{file.rds}) contains gene-level copy number calls,
SNV status and LOH calls.  The purity/ploidy combinations are sorted by
likelihood and stored in \Rcode{ret\$results}.

<<output1>>=
names(ret)
@

We provide convenient functions to extract information from this data structure
and show their usage in the next sections. We recommend using these functions
instead of accessing the data directly since data structures might change in
future versions. 

\subsubsection{Prediction of somatic status and cellular fraction}
\label{predictsomatic}
To understand allelic fractions of particular SNVs, we must know the (i)
somatic status, the (ii) tumor purity, the (iii) local copy number, as well as
the (iv) number of chromosomes harboring the mutations or SNPs. One of
\Biocpkg{PureCN} main functions is to find the most likely combination of these
four values. We further assign posterior probabilities to all possible
combinations or states. Availability of matched normals reduces the search space
by already providing somatic status. 

The \Rfunction{predictSomatic} function provides access to these probabilities.
For predicted somatic mutations, this function also provides cellular fraction
estimates, i.e. the fraction of tumor cells with mutation. Fractions
significantly below 1 indicate sub-clonality\footnote{This number can be above 1
when the observed allelic fraction is higher than expected for a clonal
mutation. This may be due to random sampling, wrong copy number, sub-clonal copy
number events, or wrong purity/ploidy estimates.}:

<<output3>>=
head(predictSomatic(ret), 3)
@

The output columns are explained in Table~\ref{tbl:predictSomatic}.

To annotate the input VCF file with these values:
<<output4>>=
vcf <- predictSomatic(ret, return.vcf=TRUE)
writeVcf(vcf, file="Sample1_PureCN.vcf") 
@

For optimal classification results:
\begin{itemize}
    \item Set \Rcode{post.optimize=TRUE} since small inaccuracies in purity can
decrease the classification performance significantly
    \item Provide \Rcode{args.setMappingBias} a pool of normal VCF to
        obtain position-specific mapping bias information
    \item Exclude variants in regions of low mappability    
    \item Use a somatic posterior probability cutoff of 0.8 and 0.2 for somatic
        and germline variants, respectively.  This appears to be a good
        compromise of call rate and accuracy.
\end{itemize}        

\textbf{Note that the posterior probabilities assume that the purity and ploidy
combination is correct. Before classifying variants, it is thus important to
manually curate samples.} 

\begin{table*}
\caption{\Rfunction{predictSomatic} output columns.}
\begin{tabular}{lp{9.5cm}}
\toprule
Column name       & Description \\
\midrule
\Rcode{chr, start, end}   & Variant coordinates \\
\Rcode{SOMATIC.M*} & Posterior probabilities for all somatic states.  M0 to M7 
are multiplicity values, i.e. the number of chromosomes harboring the 
mutation (e.g.  1 heterozygous, 2 homozygous if copy number C is 2). SOMATIC.M0
represents a sub-clonal state (somatic mutations by definition have a 
multiplicity larger than 0). \\
\Rcode{GERMLINE.M*} & Posterior probabilities for all heterozygous germline 
states \\
\Rcode{GERMLINE.CONTHIGH} & Posterior probability for contamination.  This 
state corresponds to homozygous germline SNPs that were not filtered out 
because reference alleles from another individual were sequenced, resulting in 
allelic fractions smaller than 1. \\
\Rcode{GERMLINE.CONTLOW} & Posterior probability for contamination. This state
corresponds to non-reference alleles only present in the contamination. \\
\Rcode{GERMLINE.HOMOZYGOUS} & Posterior probability that SNP is homozygous in 
normal. Requires the \Rcode{model.homozygous} option in 
\Rfunction{runAbsoluteCN}. See Section~\ref{sec:celllines}. \\
\Rcode{ML.SOMATIC} & \Rcode{TRUE} if the maximum likelihood state is a somatic
    state \\
\Rcode{POSTERIOR.SOMATIC} & The posterior probability that the variant is 
somatic (sum of all somatic state posterior probabilities) \\
\Rcode{ML.M}  & Maximum likelihood multiplicity \\
\Rcode{ML.C}  & Maximum likelihood integer copy number \\
\Rcode{ML.M.SEGMENT} & Maximum likelihood minor segment copy number \\
\Rcode{ML.AR}  & Expected allelic fraction of the maximum likelihood state \\
\Rcode{AR}   & Observed allelic fraction (as provided in VCF) \\
\Rcode{AR.ADJUSTED} & Observed allelic fraction adjusted for mapping bias\\
\Rcode{ML.LOH} & \Rcode{TRUE} if variant is most likely in LOH \\
\Rcode{CN.SUBCLONAL} & \Rcode{TRUE} if copy number segment is sub-clonal \\
\Rcode{CELLFRACTION} & Fraction of tumor cells harboring the somatic mutation\\
\Rcode{FLAGGED} & Flag indicating that call is unreliable 
    (currently only due to high mapping bias and high pool of normal counts)\\
\Rcode{log.ratio} & The copy number log-ratio (tumor vs. normal) for this 
    variant \\
\Rcode{depth} & The total sequencing depth at this position \\
\Rcode{prior.somatic} & Prior probability of variant being somatic \\
\Rcode{prior.contamination} & Prior probability that variant is contamination 
    from another individual \\
\Rcode{on.target} & 1 for variants within intervals, 2 for variants in flanking
 regions, 0 for off-target SNVs \\
\Rcode{seg.id}       & Segment id  \\
\Rcode{pon.count}  & Number of hits in the \Rcode{normal.panel.vcf.file} \\
\Rcode{gene.symbol} & Gene symbol if available \\ 
\bottomrule
\end{tabular}
\label{tbl:predictSomatic}
\end{table*}

\subsubsection{Amplifications and deletions}

To call amplifications, we recommend using a cutoff of 6 for focal
amplifications and a cutoff of 7 otherwise. For homozygous deletions, a cutoff
of 0.5 is useful to allow some heterogeneity in copy number.

For samples that failed \Biocpkg{PureCN} calling we recommended using
common log-ratio cutoffs to call amplifications, for example 0.9.

This strategy is implemented in the \Rfunction{callAlterations} function:

<<calling2>>=
gene.calls <- callAlterations(ret)
head(gene.calls)
@

It is also often useful to filter the list further by known biology, for
example to exclude non-focal amplifications of tumor suppressor genes. The
Sanger Cancer Gene Census \cite{Futreal2004} for example provides such a list.

The output columns of \Rfunction{callAlterations} are explained in
Table~\ref{tbl:callAlterations}.

\begin{table*}
\caption{\Rfunction{callAlterations} output columns.}
\begin{tabular}{lp{9.5cm}}
\toprule
Column name       & Description \\
\midrule
\Rcode{chr, start, end} & Gene coordinates \\
\Rcode{C}            & Segment integer copy number \\
\Rcode{seg.mean}     & Segment mean of copy number log-ratios 
(not adjusted for purity/ploidy)  \\
\Rcode{seg.id}       & Segment id  \\
\Rcode{number.targets} & Number of targets annotated with this symbol \\
\Rcode{gene.*}       & Gene copy number log-ratios  
(not adjusted for purity/ploidy)  \\
\Rcode{focal}       & \Rcode{TRUE} for focal amplification, as defined by the 
\Rfunction{fun.focal} argument in \Rfunction{runAbsoluteCN} \\
\Rcode{breakpoints} & Number of breakpoints between \Rcode{start} and 
\Rcode{end} \\
\Rcode{pvalue} & Gene p-value against pool of normals. 
Requires \Rcode{normalDB}. Not adjusted for multiple testing. \\
\Rcode{num.snps.segment} & Number of SNPs in this segment informative for 
    LOH detection\\
\Rcode{loh}         & \Rcode{TRUE} if segment is in LOH, \Rcode{FALSE} if not 
    and \Rcode{NA} if number of SNPs is insufficient \\
\Rcode{type}        & Amplification or deletion \\
\bottomrule
\end{tabular}
\label{tbl:callAlterations}
\end{table*}


\subsubsection{Find genomic regions in LOH}

The \Rcode{gene.calls} data.frame described above provides gene-level LOH
information. To find the corresponding genomic regions in LOH, we can use the
\Rfunction{callLOH} function:

<<loh>>=
loh <- callLOH(ret)
head(loh)
@

The output columns are explained in Table~\ref{tbl:callLOH}.

\begin{table*}
\caption{\Rfunction{callLOH} output columns.}
\begin{tabular}{lp{9.5cm}}
\toprule
Column name & Description \\
\midrule
\Rcode{chr, start, end} & Segment coordinates \\
\Rcode{arm}         & Chromosome arm \\
\Rcode{C}           & Segment integer copy number \\
\Rcode{M}           & Minor integer copy number ($M+N=C,M\leq N$) \\
\Rcode{type}        & LOH type if $M=0$ \\
\bottomrule
\end{tabular}
\label{tbl:callLOH}
\end{table*}

\section{Curation}

For prediction of SNV status (germline vs. somatic, sub-clonal vs. clonal,
homozygous vs. heterozygous), it is important that both purity and ploidy are
correct. We provide functionality for curating results:

<<curationfile>>=
createCurationFile(file.rds) 
@

This will generate a CSV file in which the correct purity and ploidy values
can be manually entered. It also contains a column "Curated", which should be
set to \Rcode{TRUE}, otherwise the file will be overwritten when re-run.

Then in R, the correct solution (closest to the combination in the CSV file)
can be loaded with the \Rfunction{readCurationFile} function:

<<readcurationfile>>=
ret <- readCurationFile(file.rds)
@

This function has various handy features, but most importantly it will re-order
the local optima so that the curated purity and ploidy combination is ranked
first. This means \Rcode{plotAbs(ret,1,type="hist")} would show the plot for
the curated purity/ploidy combination, for example.

The default curation file will list the maximum likelihood solution:

<<curationfileshow>>=
read.csv("Sample1_PureCN.csv")
@

\Biocpkg{PureCN} currently only flags samples with warnings, it does not mark
any samples as failed. The \Rcode{Failed} column in the curation file can be
used to manually flag samples for exclusion in downstream analyses.

\section{Cell lines}
\label{sec:celllines}

Default parameters assume some normal contamination. In 100\% pure samples
without matching normal samples such as cell lines, we cannot distinguish
homozygous SNPs from LOH by looking at single allelic fractions alone. It is
thus necessary to keep homozygous variants and include this uncertainty in the
likelihood model. This is done by setting the \Rfunction{runAbsoluteCN} argument
\Rcode{model.homozygous=TRUE}. If matched normals are available, then variants
homozygous in normal are automatically removed since they are uninformative.

For technical reasons, the maximum purity \Biocpkg{PureCN} currently models is
0.99. We recommend setting \Rcode{test.purity=seq(0.9,0.99,by=0.01)} in
\Rfunction{runAbsoluteCN} for cell lines. 

\section{Maximizing the number of heterozygous SNPs}
\label{secmaxsnps}

It is possible to use SNPs in off-target reads in the SNV fitting step by
running \software{MuTect} without interval file and then setting the
\Rfunction{filterVcfBasic} argument \Rcode{remove.off.target.snvs} to
\Rcode{FALSE}. We recommend a large pool of normals in this case and then
generating SNP blacklists as described in Sections~\ref{artifactfiltering} and
\ref{artifactfiltering2}.  Remember to also run all the normals in
\software{MuTect} without interval file. 

An often better alternative to including all off-target reads is only including
SNVs in the flanking regions of targets (between 50-100bp). This will usually
significantly increase the number of heterozygous SNPs (see
Section~\ref{faqinput}). These SNPs are automatically added if the variant
caller was run without interval file or with interval padding.

\section{Advanced usage}
\subsection{Custom normalization and segmentation}
\label{customseg}
Copy number normalization and segmentation are crucial for obtaining good
purity and ploidy estimates. If you have a well-tested pipeline that produces
clean results for your data, you might want to use \Biocpkg{PureCN} as add-on
to your pipeline. By default, we will use \CRANpkg{DNAcopy}
\cite{Venkatraman2007} to segment normalized target-level coverage log-ratios.
It is straightforward to replace the default with other methods and the
\Rfunction{segmentationCBS} function can serve as an example.

The next section describes how to replace the default segmentation. For the
probably more uncommon case that only the coverage normalization is performed by
third-party tools, see Section~\ref{customnorm}.

\subsubsection{Custom segmentation}
It is possible to provide already segmented data, which is especially
recommended when matched SNP6 data are available or when third-party
segmentation tools are not written in R. Otherwise it is usually however better
to customize the default segmentation function, since the algorithm then has
access to the raw log-ratio distribution\footnote{If the third-party tool
provides target-level log-ratios, then these can be provided via the
\Rcode{log.ratio} argument in addition to \Rcode{seg.file} though. See also
Section~\ref{customnorm}.}. The expected file format for already segmented copy
number data is\footnote{This segmentation file can contain multiple samples, in
which case the provided \Rcode{sampleid} must match a sample in the column
\Rcode{ID}}:

\begin{verbatim}
    ID  chrom   loc.start   loc.end num.mark    seg.mean
    Sample1   1   61723   5773942 2681    0.125406444072723
    Sample1   1   5774674 5785170 10  -0.756511807441712
\end{verbatim}

Since its likelihood model is exon-based, \Biocpkg{PureCN} currently still
requires an interval file to generate simulated target-level log-ratios from a
segmentation file.  For simplicity, this interval file is expected either in
\software{GATK DepthOfCoverage} format and provided via the
\Rcode{tumor.coverage.file} argument or via the \Rcode{gc.gene.file} argument (see
Figure~\ref{figure/figurecustombaf-1}). Note that \Biocpkg{PureCN} will 
re-segment the simulated log-ratios using the default
\Rfunction{segmentationCBS} function, in particular to identify regions of
copy-number neutral LOH and to cluster segments with similar allelic imbalance
and log-ratio. The provided interval file should therefore cover
all significant copy number alterations\footnote{If this behaviour is not
wanted, because maybe the custom function already identifies CNNLOH reliably,
\Rfunction{segmentationCBS} can be replaced with a minimal version.}. Please
check that the log-ratios are similar to the ones obtained by the default
\Biocpkg{PureCN} segmentation and normalization.

<<customseg>>=
retSegmented <- runAbsoluteCN(seg.file=seg.file, 
    gc.gene.file=gc.gene.file, vcf.file=vcf.file, 
    max.candidate.solutions=1, genome="hg19",
    test.purity=seq(0.3,0.7,by=0.05), verbose=FALSE, 
    plot.cnv=FALSE)
@

The \Rcode{max.candidate.solutions} and \Rcode{test.purity} arguments are set
to non-default values to reduce the runtime of this vignette. 

<<figurecustombaf, fig.show='hide', fig.width=8, fig.height=8>>=
plotAbs(retSegmented, 1, type="BAF")
@
\incfig{figure/figurecustombaf-1}{0.95\textwidth}{B-allele frequency plot for
segmented data.}{This plot shows the maximum likelihood solution for an example
where segmented data are provided instead of coverage data. Note that the middle
panel shows no variance in log-ratios, since only segment-level log-ratios are
available.}
  

\subsubsection{Custom normalization}
\label{customnorm}
If third-party tools such as \software{GATK4} are used to calculate
target-level copy number log-ratios, and \Biocpkg{PureCN} should be used for
segmentation and purity/ploidy inference only, it is possible to provide these
log-ratios:

<<customlogratio, message=FALSE>>=
# We still use the log-ratio exactly as normalized by PureCN for this
# example
log.ratio <- calculateLogRatio(readCoverageFile(normal.coverage.file),
    readCoverageFile(tumor.coverage.file))

retLogRatio <- runAbsoluteCN(log.ratio=log.ratio,
    gc.gene.file=gc.gene.file, vcf.file=vcf.file, 
    max.candidate.solutions=1, genome="hg19",
    test.purity=seq(0.3,0.7,by=0.05), verbose=FALSE, 
    normalDB=normalDB, plot.cnv=FALSE)
@

Again, the \Rcode{max.candidate.solutions} and \Rcode{test.purity} arguments
are set to non-default values to reduce the runtime of this vignette. It is
highly recommended to compare the log-ratios obtained by \Biocpkg{PureCN} and
the third-party tool, since some pipelines automatically adjust log-ratios for
a default purity value. Note that this example uses a pool of normals to filter
low quality targets. Interval coordinates are again expected in either a
\Rcode{gc.gene.file} or a \Rcode{tumor.coverage.file}. If a tumor coverage file is
provided, then all targets below the coverage minimum are further excluded.

\subsection{COSMIC annotation}

If a matched normal is not available, it is also helpful to provide
\Rfunction{runAbsoluteCN} the COSMIC database \cite{Forbes2015} via
\Rcode{cosmic.vcf.file} (or via a \Rcode{Cosmic.CNT} INFO field in the VCF).
While this has limited effect on purity and ploidy estimation due the sparsity
of hotspot mutations, it often helps in the manual curation to compare how well
high confidence germline (dbSNP) vs. somatic (COSMIC) variants fit a particular
purity/ploidy combination. 


\subsection{Power to detect somatic mutations}

As final quality control step, we can test if coverage and tumor purity are
sufficent to detect mono-clonal or even sub-clonal somatic mutations. We
strictly follow the power calculation by Carter et al. \cite{Carter2012}. 

The following Figure~\ref{figure/power1-1} shows the power to detect mono-clonal
somatic mutations as a function of tumor purity and sequencing coverage
(reproduced from \cite{Carter2012}):

<<power1, fig.show='hide', fig.width=6, fig.height=6>>=
purity <- c(0.1,0.15,0.2,0.25,0.4,0.6,1)
coverage <- seq(5,35,1)
power <- lapply(purity, function(p) sapply(coverage, function(cv) 
    calculatePowerDetectSomatic(coverage=cv, purity=p, ploidy=2, 
    verbose=FALSE)$power))

# Figure S7b in Carter et al.
plot(coverage, power[[1]], col=1, xlab="Sequence coverage", 
    ylab="Detection power", ylim=c(0,1), type="l")

for (i in 2:length(power)) lines(coverage, power[[i]], col=i)
abline(h=0.8, lty=2, col="grey")     
legend("bottomright", legend=paste("Purity", purity),
    fill=seq_along(purity))
@
\incfig{figure/power1-1}{0.95\textwidth}{Power to detect mono-clonal somatic
mutations.}{Reproduced from \cite{Carter2012}.}

Figure~\ref{figure/power2-1} then shows the same plot for sub-clonal mutations
present in 20\% of all tumor cells: 

<<power2, fig.show='hide', fig.width=6, fig.height=6>>=
coverage <- seq(5,350,1)
power <- lapply(purity, function(p) sapply(coverage, function(cv) 
    calculatePowerDetectSomatic(coverage=cv, purity=p, ploidy=2, 
    cell.fraction=0.2, verbose=FALSE)$power))
plot(coverage, power[[1]], col=1, xlab="Sequence coverage", 
    ylab="Detection power", ylim=c(0,1), type="l")

for (i in 2:length(power)) lines(coverage, power[[i]], col=i)
abline(h=0.8, lty=2, col="grey")     
legend("bottomright", legend=paste("Purity", purity),
    fill=seq_along(purity))
@
\incfig{figure/power2-1}{0.95\textwidth}{Power to detect sub-clonal somatic
mutations present in 20\% of all tumor cells.}{Reproduced from \cite{Carter2012}.}

\clearpage

\section{Command line scripts}

The following section describe command line tools that provide a complete
pipeline from BAM files.  

\subsection{IntervalFile}

A simple helper script to convert a BED file containing target coordinates into 
a valid \Rcode{gc.gene.file} (without gene symbols):

\begin{verbatim}
 Rscript IntervalFile.R --infile ex_intervals.bed --fasta ex_reference.fa \ 
 --outfile ex_gcgene.txt
\end{verbatim}

In case third-party segmentation tools are used, you can now jump to
Section~\ref{cmdpurecn}.

\begin{table*}
\caption{IntervalFile command line script.}
\begin{tabular}{lll}
\toprule
Argument name  & Corresponding PureCN argument & PureCN function \\
\midrule
--help -h      & & \\
--version -v   & & \\
--force -f     & & \\
--fasta -a     & reference.file & \Rfunction{calculateGCContentByInterval} \\
--infile -i    & interval.file & \Rfunction{calculateGCContentByInterval} \\
--outfile -o   & & \\
\bottomrule
\end{tabular}
\end{table*}

\subsection{Coverage}
\label{cmdcoverage}

We provide a basic template for GC-normalizing BAM or \software{GATK
DepthOfCoverage} files from the command line. For example:

\begin{verbatim}
# From a BAM file (not working with provided toy example BAM file)
Rscript Coverage.R --outdir ~/tmp/ --bam ex1.bam \
     --gcgene ex1_gcgene.txt

# From a GATK DepthOfCoverage file
Rscript Coverage.R --outdir ~/tmp/ --gatkcoverage example_tumor.txt \ 
    --gcgene  example_gc.gene.file.txt     
\end{verbatim}

\begin{table*}
\caption{Coverage command line script.}
\begin{tabular}{lll}
\toprule
Argument name       & Corresponding PureCN argument & PureCN function \\
\midrule
--help -h           & & \\
--version -v        & & \\
--force -f          & & \\
--seed -S           & & \\
--outdir -o         & & \\
--bam -b            & bam.file & \Rfunction{calculateBamCoverageByInterval} \\
--bai -a            & index.file & \Rfunction{calculateBamCoverageByInterval} \\
--gatkcoverage -g   & coverage.file & \Rfunction{correctCoverageBias} \\
--gcgene -c         & gc.gene.file  & \Rfunction{correctCoverageBias} \\
--method -m         & method  & \Rfunction{correctCoverageBias} \\
\bottomrule
\end{tabular}
\end{table*}

\subsection{NormalDB}
\label{cmdnormaldb}

NormalDB.R is a convenient helper script that generates pool of normal
files given a list of coverage files. It can automatically GC-normalize
if necessary:
\begin{verbatim}
# From GATK DepthOfCoverage files, not GC-normalized
Rscript NormalDB.R --outdir ~/tmp --coveragefiles example_normal.list \
 --gcgene example_gc.gene.file.txt --genome hg19 --method LOESS 

# From already GC-normalized files
Rscript NormalDB.R --outdir ~/tmp --coveragefiles example_normal.list \
 --genome hg19 
\end{verbatim}    

\begin{table*}
\caption{NormalDB command line script.}
\begin{tabular}{lll}
\toprule
Argument name       & Corresponding PureCN argument & PureCN function \\
\midrule
--help -h           & & \\
--version -v        & & \\
--force -f          & & \\
--seed -S           & & \\
--outdir -o         & & \\
--coveragefiles -b  & coverage.file & \Rfunction{correctCoverageBias} \\
--gcgene -c         & gc.gene.file  & \Rfunction{correctCoverageBias} \\
--method -m         & method  & \Rfunction{correctCoverageBias} \\
--assay -a          & Optional assay name & Used in output file names. \\
--genome -g        & Optional genome version & Used in output file names. \\
\bottomrule
\end{tabular}
\end{table*}

\subsection{PureCN}
\label{cmdpurecn}

PureCN.R is an example script to run PureCN with basic parameters:

\begin{verbatim}
# From GC-normalized coverage data
Rscript PureCN.R --outdir ~/tmp/ --tumor example_tumor.txt \ 
    --normal example_normal.txt --vcf example_vcf.vcf -i Sample1 \
    --genome hg19 --gcgene  example_gc.gene.file.txt 

# Without a matched normal    
Rscript PureCN.R --outdir ~/tmp/ --tumor example_tumor.txt \ 
    --normaldb normalDB.rds --pool 5 --vcf example_vcf.vcf -i Sample1 \
    --genome hg19 --gcgene example_gc.gene.file.txt 

# From already segmented data
Rscript PureCN.R --outdir ~/tmp/ --segfile example_seg.txt \ 
    --vcf example_vcf.vcf -i Sample1 --genome hg19 \
    --funsegmentation none \
    --gcgene  example_gc.gene.file.txt 

# Two-pass loess normalization from NOT YET GC-normalized tumor coverage
# data (normal controls need to be GC-normalized by the LOESS method)
Rscript PureCN.R --outdir ~/tmp/ --tumor example_tumor.txt \ 
    --normaldb normalDB.rds --pool 5 --vcf example_vcf.vcf -i Sample1 \
    --genome hg19 --gcgene example_gc.gene.file.txt \
    --twopass --postoptimize

# Recreate output after manual curation of Sample_purecn.csv
Rscript PureCN.R --rds Sample1_purecn.rds
    
\end{verbatim}

\begin{table*}
\caption{PureCN command line script.}
\begin{tabular}{lll}
\toprule
Argument name       & Corresponding PureCN argument & PureCN function \\
\midrule
--help -h           & & \\
--version -v        & & \\
--force -f          & & \\
--seed -S           & & \\
--outdir -o         &          &                            \\
--normal -n         & normal.coverage.file   & \Rfunction{runAbsoluteCN}  \\
--tumor -t          & tumor.coverage.file     & \Rfunction{runAbsoluteCN} \\
--vcf -b            & vcf.file          & \Rfunction{runAbsoluteCN}   \\
--rds -r            & file.rds          & \Rfunction{readCurationFile}   \\
--genome -g         & genome             & \Rfunction{runAbsoluteCN}  \\
--gcgene -c         & gc.gene.file      & \Rfunction{runAbsoluteCN}   \\
--segfile -l        & seg.file           & \Rfunction{runAbsoluteCN}  \\
--snpblacklist -s   & snp.blacklist & \Rfunction{filterVcfBasic}     \\
--normal\_panel -p   & normal.panel.vcf.file & \Rfunction{setMappingBiasVcf} \\
--statsfile -a      & stats.file & \Rfunction{filterVcfMuTect}        \\
--targetweightfile -e & target.weight.file & \Rfunction{segmentationCBS} \\
--normaldb -d       & normalDB (serialized with \Rfunction{saveRDS}) & 
    \Rfunction{findBestNormal}, \Rfunction{filterTargets} \\
--pool -m           & pool, num.normals & \Rfunction{findBestNormal} \\    
--minaf -j          & af.range & \Rfunction{filterVcfBasic} \\
--minpurity -k      & test.purity & \Rfunction{runAbsoluteCN} \\
--maxpurity -x      & test.purity & \Rfunction{runAbsoluteCN} \\
--minploidy -K      & min.ploidy & \Rfunction{runAbsoluteCN} \\
--maxploidy -X      & max.ploidy & \Rfunction{runAbsoluteCN} \\
--postoptimize -z   & post.optimize & \Rfunction{runAbsoluteCN} \\    
--modelhomozygous -y & model.homozygous & \Rfunction{runAbsoluteCN} \\ 
--model -q          & model & \Rfunction{runAbsoluteCN} \\ 
--funsegmentation -w & fun.segmentation & \Rfunction{runAbsoluteCN} \\ 
--sampleid -i       & sampleid  & \Rfunction{runAbsoluteCN}  \\
--twopass -T        & purecn.output & \Rfunction{correctCoverageBias} \\
--outvcf -u         & return.vcf & \Rfunction{predictSomatic} \\
\bottomrule
\end{tabular}
\end{table*}
 
These R scripts can be found in the extdata directory of the installed package.

\section{Limitations}
\label{sec:limits}
\Biocpkg{PureCN} currently assumes a completely diploid normal genome. For
human samples, it tries to detect sex by calculating the coverage ratio of
chromosomes X and Y and will then remove sex chromosomes in male 
samples\footnote{Loss of Y chromosome (LOY) can result in wrong female calls,
especially in high purity samples or if LOY is in both tumor and contaminating
normal cells. The software throws a warning in this case when germline SNPs
are provided.}. For non-human samples, the user needs to manually remove all
non-diploid chromosomes from the coverage data and specify
\Rcode{sex="diploid"} in the \Biocpkg{PureCN} call.

While \Biocpkg{PureCN} supports and models sub-clonal somatic copy number
alterations, it currently assumes that the majority of alterations are
mono-clonal. For most clinical samples, this is reasonable, but very
heterogeneous samples are likely not possible to call without manual curation.
Poly-genomic tumors are often called as high ploidy or low purity. The former
usually happens when sub-clonal losses are called as 2 copies and mono-clonal
losses correctly as 1 copy. The latter when sub-clonal losses are called
mono-clonal, which only happens when there are far more sub-clonal than
mono-clonal losses. Please note however that unless purities are very high,
algorithms that model poly-genomic tumors do not necessarily have a higher call
rate, since they tend to overfit noisy samples or similarly confuse true
high-ploidy with poly-genomic tumors.  Due to the lack of signal, manual
curation is also recommended in low purity samples or very quiet genomes.

Finally, the software currently does not officially support VCF files containing
indels. Support for VCFs generated by \software{MuTect 2} that include both
single nucleotide variants (SNVs) and indels is planned for Bioconductor 3.5.

\section{Support}

If you encounter bugs or have problems running \Biocpkg{PureCN},
please post them at
\url{https://support.bioconductor.org/p/new/post/?tag\_val=PureCN}.

If \Biocpkg{PureCN} throws user errors, then there is likely a problem with the
input files.  If the error message is not self-explanatory, feel free to
seek help at the support site. In your report, please add the outputs of the
\Rfunction{runAbsoluteCN} call (with \Rcode{verbose=TRUE}) and
\Rfunction{sessionInfo()}. Please also check that your problem is not already
covered in the following sections.

For general feedback such as suggestions for improvements, feature requests,
complaints, etc. please do not hesitate to send us an email.

\subsection{Checklist}

\begin{itemize}
    \item Used the correct interval files provided by the manufacturer of the
        capture kit and the genome version of the interval file matches the
        reference.
    \item BAM files were generated following established best practices and
    tools finished successfully.
    \item Checked standard QC metrics such as AT dropout.
    \item Tumor and normal data were obtained using the same capture kit and
    pipeline.
    \item Coverage data of tumor and normal were GC-normalized.
    \item The VCF file contains germline variants (i.e. not only somatic calls).
    \item Maximized the number of high coverage heterozygous SNPs, for example
    by running \software{MuTect} with a 50bp interval padding
    (Section~\ref{secmaxsnps}).
    \item If a pool of normal samples is available, followed the steps in
    Section~\ref{artifactfiltering}.
    \item Read the output of \Rfunction{runAbsoluteCN} with
        \Rcode{verbose=TRUE}, fixed all warnings.
    \item If third-party segmentation tools are used, checked that normalized
    log-ratios are not biased, i.e. very similar compared to \Biocpkg{PureCN}
    log-ratios (some pipelines already adjust for a default normal 
    contamination).
\end{itemize}

\subsection{FAQ}

\paragraph{If the ploidy is frequently too high, please check:}

\begin{itemize}
    \item Does the log-ratio histogram (Figure~\ref{figure/figureexample2-1})
look noisy? If yes, then
    \begin{itemize}
    \item Is the coverage sufficient? Tumor coverages below 80X can be
difficult, especially in low purity samples. Normal coverages below 50X might
result in high variance of log-ratios. See Section~\ref{bestnormal} for finding
a good normal sample for log-ratio calculation.
    \item Is the coverage data of both tumor and normal GC-normalized? If not, 
    see \Rfunction{correctCoverageBias}.
    \item Is the quality of both tumor and normal sufficient? A high AT or
GC-dropout might result in high variance of log-ratios. Challenging FFPE samples
also might need parameter tuning of the segmentation function. See 
\Rfunction{segmentationCBS}. A high expected tumor purity allows more aggressive
segmentation parameters, such as \Rcode{prune.hclust.h=0.2} or higher.
    \item Was the correct target interval file used (genome version and capture
    kit, see Section~\ref{gatkcoverage})? If unsure, ask the help desk of your
    sequencing center.
    \item Were the normal samples run with the same assay and pipeline? 
    \item Did you provide \Rfunction{runAbsoluteCN} all the recommended files
    as described in Section~\ref{secrecommended}?  
    \item For whole-genome data, you will get better results using a
    specialized third-party segmentation method as described in
    section~\ref{customseg}, since our default is optimized for targeted
    sequencing.
    \item For smaller targeted panels, copy number events can interfere with
    the GC-normalization. Try either the \Rcode{POLYNOMIAL} method in
    \Rfunction{correctCoverageBias} or the two-pass loess normalization
    (Section~\ref{cmdpurecn}) if the data looks noisy. Third-party segmentation
    tools (Section~\ref{customseg}) that utilize off-target reads might be also
    worth trying.
    \end{itemize}
    \item Otherwise, if log-ratio peaks are clean as in 
Figure~\ref{figure/figureexample2-1}:
    \begin{itemize}
        \item Was MuTect run without a matched normal? If yes, then make sure
        to provide either a pool of normal VCF or a SNP blacklist (if no pool
        of normal samples is available) as described in
        Sections~\ref{artifactfiltering} and \ref{artifactfiltering2}.
        \item A high fraction of sub-clonal copy-number alterations might also
    result in a low ranking of correct low ploidy solutions (see
    Section~\ref{sec:limits}). 
    \end{itemize}
\end{itemize} 

\paragraph{If the ploidy is frequently too low:}
\begin{itemize}
    \item \Biocpkg{PureCN} with default parameters is conservative in calling
    genome duplications.
    \item This should only affect low purity samples (< 35\%), since in higher
    purity samples the duplication signal is usually strong enough to
    reliably detect it.  
    \item In whole-exome data, it is usually safe to decrease the
    \Rcode{max.homozygous.loss} default, since such large losses are rare.
\end{itemize}

\paragraph{Will PureCN work with my data?}

\begin{itemize}
    \item \Biocpkg{PureCN} was designed for medium-sized (>2-3Mb) targeted
    panels.  The more data, the better, best results are typically achieved
    in whole-exome data.
    \item The same is obviously true for coverage. Coverages below 80X are
    difficult unless purities are high and coverages are even.
    \item Samples with tumor purities below 20\% usually cannot be analyzed with this
    algorithm and \Biocpkg{PureCN} might return very wrong purity estimates. 
    \item The number of heterozygous SNPs is also important (>1000 per sample).  
    Copy number probes  enriched in SNPs are therefore very helpful 
    (see Section~\ref{secmaxsnps}).
    \item While \Biocpkg{PureCN} supports to some degree uneven tiling of
    targets, the more evenly distributed the better. Large gaps are problematic
    and might require third-party segmentation tools that support off-target
    reads (Section~\ref{customseg}). 
    \item Whole-genome data is not officially supported and specialized tools
    will likely provide better results.  Third-party segmentation tools
    designed for this data type would be again required.  
    \item Amplicon sequencing data is not supported. 
    \item \Biocpkg{PureCN} also needs process-matched normal samples, again,
    the more the better. 
\end{itemize}


\paragraph{If you have trouble generating input data PureCN accepts, please check:}
\label{faqinput}
\begin{itemize}
    \item For problems related to generating valid coverage data, either consult
    the \software{GATK} manual for the \software{DepthOfCoverage} tool or
    Section~\ref{purecncoverage} for the equivalent function in
    \Biocpkg{PureCN}.
    \item Currently only VCF files generated by \software{MuTect 1} are
        officially supported and well tested. A minimal example 
        \software{MuTect} call would be:
    \begin{verbatim}
    $JAVA -Xmx6g -jar $MUTECT \
    --analysis_type MuTect -R $REFERENCE \
    --dbsnp $DBSNP_VCF \
    --cosmic $COSMIC_VCF \
    -I:normal $BAM_NORMAL \
    -I:tumor $BAM_TUMOR  \
    -o $OUT/${ID}_bwa_mutect_stats.txt \
    -vcf $OUT/${ID}_bwa_mutect.vcf
    \end{verbatim}
    The normal needs to be matched; without matched normal, only provide the
    tumor BAM file (do NOT provide a process-matched normal here). The default
    output file is the stats or call-stats file; this can be provided in
    addition to the required VCF file via \Rcode{args.filterVcf} in
    \Rfunction{runAbsoluteCN}. If provided, it may help \Biocpkg{PureCN} filter
    artifacts. This requires \software{MuTect} in version 1.1.7.  
    
    Note that this \software {MuTect} VCF will contain SNVs in off-target reads.
    By default, \Biocpkg{PureCN} will only keep SNVs in the 50bp flanking
    regions of the provided targets. We highly recommend finding good values for
    each assay. A good cutoff will maximize the number of heterozygous SNPs and
    keep only an acceptable number of lower quality calls. This cutoff is set
    via \Rcode{interval.padding} in \Rcode{args.filterVcf}. See
    Section~\ref{secmaxsnps}. 
    \item For VCFs generated by other callers, the required dbSNP annotation 
    can be added for example with \software{bcftools}:
    \begin{verbatim}
    bcftools annotate --annotation $DBSNP_VCF \
        --columns ID --output $OUT/${ID}_bwa_dbsnp.vcf.gz --output-type z \
            $OUT/${ID}_bwa.vcf.gz
    \end{verbatim}
    \item To generate the normal panel VCF (\Rcode{normal.panel.vcf.file})
    with \software{GATK}:
    \begin{itemize}
        \item Run \software{MuTect} on the normal with
        \Rcode{-I:tumor \$BAM\_NORMAL} and optionally with the 
        \Rcode{--artifact\_detection\_mode} flag.
        \item Remove the empty \Rcode{none} sample from the VCF:
        \begin{verbatim}
    $JAVA -Xmx6g -jar $GATK \
    --analysis_type SelectVariants -R $REFERENCE \
    --exclude_sample_expressions none \
    -V $OUT/${ID}_bwa_mutect_artifact_detection_mode.vcf \
    -o $OUT/${ID}_bwa_mutect_artifact_detection_mode_no_none.vcf 
        \end{verbatim}
        \item Merge the VCFs:
        \begin{verbatim}
    CMD="java  -Xmx12g -jar $GATK -T CombineVariants --minimumN 5 -R $REFERENCE"
    for VCF in $OUT/*bwa_mutect_artifact_detection_mode_no_none.vcf;
    do
        CMD="$CMD --variant $VCF"
    done
    CMD="$CMD -o $OUT/normals_merged_min5.vcf"
    echo $CMD > $OUT/merge_normals_min5.sh
        \end{verbatim}
    \end{itemize}
\end{itemize}

\paragraph{Questions related to pool of normals.}

As described in detail in Sections~\ref{bestnormal} and
\ref{artifactfiltering}, a pool of normal samples is used in \Biocpkg{PureCN}
for coverage normalization (to adjust for target-specific capture biases) and
    for artifact filtering. A few recommendations:

\begin{itemize}
    \item Matched normals are obviously perfect for identifying most alignment
    artifacts and mapping biases. While not critical, we still recommend
    generating a \Rcode{normal.panel.vcf.file} for \software{MuTect} and
    \Rfunction{setMappingBiasVcf} using the available normals.
    \item Without matched normals, we need process-matched normals for coverage
    normalization. We recommend at least 2, ideally more than 5 
    from 2-3 different library preparation and sequencing batches.
    \item For artifact removal, the more normals available, the more rare
    artifacts are removed. We recommend at least 10, 50 or more are ideal.  
    \item For position-specific mapping bias correction, the more normals are
    available, the more rare SNPs will have reliable mapping bias estimates.
    This requires at least about 25 normals to be useful, 100 or more are
    ideal. 
    \item With smaller pool of normals, we additionally recommend filtering
    SNPs from low quality regions (Section~\ref{artifactfiltering2}).
    Additionally, it is worth trying the beta-binomial function instead of the
    default in the \Rcode{model} argument of \Rfunction{runAbsoluteCN}. This
    will incorporate uncertainty of observed variant allelic fractions in
    the variant fitting step.
\end{itemize}

\paragraph{Questions related to manual curation.}

\Biocpkg{PureCN}, like most other related tools, essentially finds the most
simple explanation of the data. There are three major problems with this
approach:
\begin{itemize}
    \item First, hybrid capture data can be noisy and the algorithm must
        distinguish signal from noise; if the algorithm mistakes noise for
        signal, then this often results in wrong high ploidy calls (see
        Sections~\ref{artifactfiltering} and \ref{artifactfiltering2}). If all
        steps in this vignette were followed, then \Biocpkg{PureCN} should
        ignore common artifacts. Noisy samples thus often have outlier ploidy
        values and are often automatically flagged by \Biocpkg{PureCN}. The
        correct solution is in most of these cases ranked second or third.
    \item The second problem is that signal can be sparse, i.e. when the tumor
        purity is very low or when there are only few somatic events.  Manual
        curation is often easy in the latter case. For example when small losses
        are called as homozygous, but corresponding germline allele-frequencies
        are unbalanced (a complete loss would result in balanced germline allele
        frequencies, since only normal DNA is left). Future versions might
        improve calling in these cases by underweighting uninformative genomic
        regions.
    \item The third problem is that tumor evolution is fast and complex and very
        difficult to incorporate into general likelihood models. Sometimes
        multiple solutions explain the data equally well, but one solution is
        then often clearly more consistent with known biology, for example LOH
        in tumor suppressor genes such as \textit{TP53}.  A basic understanding
        of both the algorithm and the tumor biology of the particular cancer
        type are thus important for curation.  Fortunately, in most cancer
        types, such ambiguity is rather rare. See also Section~\ref{sec:limits}.
\end{itemize}

\paragraph{If all or most of the samples are flagged as:}

\begin{description}
    \item[Noisy segmentation:] The default of 300 for \Rcode{max.segments} is
    calibrated for high quality and high coverage whole-exome data. For
    whole-genome data or lower coverage data, this value needs to be
    re-calibrated.  In case the copy number data looks indeed noisy, please see
    the first FAQ. Please be aware that \Biocpkg{PureCN} will apply more
    aggressive segmentation parameters when the number of segments exceeds this
    cutoff. If the high segment count is real, this might confound downstream
    analyses.
    \item[High AT/GC dropout:] If the data is GC-normalized, then there might
    be issues with either the target intervals or the provided GC content.
    Please double check that all files are correct and that all the coverage
    files are GC-normalized (Section~\ref{secgcbias}).
\end{description}

\appendix

\bibliography{PureCN}

\section*{Annotating intervals with gene symbols}

A simple function that annotates intervals with the \textbf{first}
overlapping interval from a given \software{TxDb} database:

<<annotatesymbols, message=FALSE, warning=FALSE>>=
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(org.Hs.eg.db) 

.annotateIntervals <- function(gc.gene.file, txdb, output.file = NULL) {
    gc <- read.delim(gc.gene.file, as.is=TRUE)
    # misuse this function to convert interval string into data.frame
    gc.data <- readCoverageFile(gc.gene.file)
    grGC <- GRanges(seqnames=gc.data$chr, 
        IRanges(start=gc.data$probe_start, end=gc.data$probe_end))
    id <- transcriptsByOverlaps(txdb, ranges=grGC, columns = "GENEID")
    id$SYMBOL <-select(org.Hs.eg.db, as.character(id$GENEID), "SYMBOL")[,2]
    gc$Gene <- id$SYMBOL[findOverlaps(grGC, id, select="first")]
    if (!is.null(output.file)) {
        write.table(gc, file = output.file, row.names = FALSE, 
            quote = FALSE, sep = "\t")
    }
    invisible(gc)
}

.annotateIntervals(gc.gene.file, TxDb.Hsapiens.UCSC.hg19.knownGene)
@
\section*{Session Info}
<<sessioninfo, results='asis', echo=FALSE>>=
toLatex(sessionInfo())
@

\end{document}