---
title: "PureCN best practices"
author:
      name: Markus Riester
      affiliation: Novartis Biomedical Research 
output:
  BiocStyle::html_document:
    toc_float: true
  BiocStyle::pdf_document: default  
package: PureCN
abstract: |
    This tutorial provides a quick overview of the command line tools shipping
    with _PureCN_. These tools implement highly recommended best practices. For
    the R package and more detailed information, see the main vignette. 
vignette: |
  %\VignetteIndexEntry{Best practices, quick start and command line usage}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---


```{r load-purecn, echo=FALSE, message=FALSE}
library(PureCN)
library(BiocStyle)
```

# Prerequisites

## Update from previous stable versions

`r Biocpkg("PureCN")` is backward compatible with input generated by
versions 1.16 and later. For versions 1.8 to 1.14, please re-run `NormalDB.R`
(see also below): 

```
$ Rscript $PURECN/NormalDB.R --out-dir $OUT_REF \
    --coverage-files example_normal_coverages.list \
    --genome hg19 --normal-panel $NORMAL_PANEL --assay agilent_v6
```

When using `--model betabin` in `PureCN.R`, we recommend for all previous
versions re-creating the mapping bias database by re-running `NormalDB.R`: 

```
# only re-creating the mapping bias file
$ Rscript $PURECN/NormalDB.R --out-dir $OUT_REF \
    --genome hg19 --normal-panel $NORMAL_PANEL --assay agilent_v6
```

For upgrades from version 1.6, we highly recommend starting from scratch
following this tutorial.


## Installation

For the command line scripts described in this tutorial, we will need to 
install `r Biocpkg("PureCN")` with suggested dependencies:

```
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("PureCN", dependencies = TRUE)
```

Alternatively, manually install the packages required by the command line
scripts:

```
BiocManager::install(c("PureCN", "optparse", "R.utils",
    "TxDb.Hsapiens.UCSC.hg19.knownGene", "org.Hs.eg.db"))
```

(Replace `hg19` with your genome version).

To use the alternative and in many cases recommended `r CRANpkg("PSCBS")`
segmentation:

```
# default PSCBS without support of interval weights
BiocManager::install("PSCBS")

# patched PSCBS with support of interval weights
BiocManager::install("lima1/PSCBS", ref="add_dnacopy_weighting")
```

To call mutational signatures, install the GitHub version of the
`r CRANpkg("deconstructSigs")` package:

```
BiocManager::install("raerose01/deconstructSigs")
```

For the experimental support of importing variant calls from 
_GATK4 GenomicsDB_, follow the installations instructions from
[GenomicsDB-R](https://github.com/nalinigans/GenomicsDB-R).

The _GATK4_ segmentation requires the `gatk` binary in path. Versions 4.1.7.0
and newer are supported.


# Prepare environment and assay-specific reference files


- Start R and enter the following to get the path to the command line
  scripts:

```{r path}
system.file("extdata", package = "PureCN")
```

- Exit R and store this path in an environment variable, for example in 
  BASH:

```
$ export PURECN="/path/to/PureCN/extdata"
$ Rscript $PURECN/PureCN.R --help
Usage: /path/to/PureCN/inst/extdata/PureCN.R [options] ...
```

- Generate an interval file from a BED file containing baits coordinates (not
  necessarily required with third-party segmentations, see in the corresponding
  Section \@ref(run-purecn-with-third-party-segmentation)):

```
# specify path where PureCN should store reference files
$ export OUT_REF="reference_files"
$ Rscript $PURECN/IntervalFile.R --in-file baits_hg19.bed \ 
    --fasta hg19.fa --out-file $OUT_REF/baits_hg19_intervals.txt \
    --off-target --genome hg19 \
    --export $OUT_REF/baits_optimized_hg19.bed \
    --mappability wgEncodeCrgMapabilityAlign100mer.bigWig \
    --reptiming wgEncodeUwRepliSeqK562WaveSignalRep1.bigWig
```
Internally, this script uses `r Biocpkg("rtracklayer")` to parse the
`--in-file`. Make sure that the file format matches the file extension.
See the `r Biocpkg("rtracklayer")` documentation for problems loading the file.
Check that the genome version of the baits file matches the reference.  Do not
include chrM baits in case the capture kit includes some.

We do not recommend padding the baits file manually unless the coverages are
very low (<30X) where the increased counts from the padded regions might decrease
sampling variance slightly. Note that we do however strongly recommend running the
variant caller with a padding of at least 50bp to increase the number of
informative SNPs, see below in the VCF section. Double check that the genome
version of the `--in-file` is correct - many assays are still designed using
older references and might need to be lifted over to the pipeline reference.
If possible, do NOT use a BED file that contains the targeted exons, instead use
the coordinates of the baits. These are optimized for GC-content and mappability and
will produce cleaner coverage profiles.

The `--off-target` flag will include off-target reads. Including them is
recommended except for Amplicon data. For whole-exome data, the benefit is
usually also limited unless the assay is inefficient with a high fraction of
off-target reads (>10-15%).

The `--genome` version is needed to annotate exons with gene symbols.  Use
hg19/hg38 for human genomes, not b37/b38. You might get a warning that an
annotation package is missing.  For hg19, install
`r Biocpkg("TxDb.Hsapiens.UCSC.hg19.knownGene")` in R.

The `--export` argument is optional. If provided, this script will store the
modified intervals as BED file for example (again every 
`r Biocpkg("rtracklayer")` format is supported). This is useful when the coverages
are calculated with third-party tools like GATK.

The `--mappability` argument should provide a `r Biocpkg("rtracklayer")`
parsable file with a mappability score in the first meta data column. If
provided, off-target regions will be restricted to regions specified in this
file.  On-target regions with low mappability will be excluded. For hg19,
download the file from the UCSC website. Choose the kmer size that best fits
your average mapped read length. For hg38, download recommended 76-kmer or
100-kmer mappability files through the courtesy of the Waldron lab from: 

   - [GCA_000001405.15_GRCh38_no_alt_analysis\_set_76.bw](https://s3.amazonaws.com/purecn/GCA_000001405.15_GRCh38_no_alt_analysis\_set_76.bw)
   - [GCA_000001405.15_GRCh38_no_alt_analysis_set_100.bw](https://s3.amazonaws.com/purecn/GCA_000001405.15_GRCh38_no_alt_analysis_set_100.bw)

See the FAQ section of the main vignette for instruction how to generate such a
file for other references.

Similarly, the `--reptiming` argument takes a replication timing score in the
same format.  If provided, GC-normalized and log-transformed coverage is tested
for a linear relationship with this score and normalized accordingly.  This is
optional and provides only a minor benefit for coverage normalization, but can
identify samples with high proliferation. Requires `--off-target` to be useful. 


# Create VCF files

`r Biocpkg("PureCN")` does not ship with a variant caller. Use a third-party
tool to generate a VCF for each sample. 

Important recommendations:

- Use _MuTect 1.1.7_ if possible; _Mutect 2_ from _GATK 4.1.7+_ is now out of
  alpha and VCFs generated following the best practices somatic workflow should
  work (earlier _Mutect 2_ versions are not supported and will not work).

- VCFs from most other tumor-only callers such as _VarScan2_ and
  _FreeBayes_ are supported, but only very limited artifact filtering will be
  performed for these callers.  Make sure to provide filtered VCFs. See the FAQ
  section in the main vignette for common problems and questions related to
  input data.

- Since germline SNPs are needed to infer allele-specific copy numbers, the
  provided VCF needs to contain both somatic and germline variants.  Make sure
  that upstream filtering does not remove high quality SNPs, in particular due
  to presence in germline databases. _Mutect 1.1.7_ automatically calls SNPs, 
  but _Mutect 2_ does not. Make sure to run _Mutect 2_ with
  `--genotype-germline-sites true --genotype-pon-sites true`. You will not get
  usuable output without those flags. Since _Mutect 2_ from _GATK 4.2.0+_,
  average base quality scores can be very low and variants will be too
  aggressively removed by _PureCN_.  You will need to set `--min-base-quality
  20` in _PureCN.R_ to keep them.

- Run the variant caller with a 50-75 base pair interval padding to increase
  the number of heterozygous SNPs (for example `--interval_padding` and
  `--interval-padding` in _Mutect 1.1.7_ and _Mutect 2_, respectively). For
  very high coverages beyond 1000X, it is safe to increase this value up to
  200bp.


# Run PureCN with internal segmentation

The following describes `r Biocpkg("PureCN")` runs with internal copy number
normalization and segmentation. 

What you will need:

- The interval file generated above

- BAM files of tumor samples. 

- BAM files of normal samples (see main vignette for recommendations). These
  normal samples are not required to be patient-matched to the tumor samples,
  but they need to be processed-matched (same assay run through the same
  alignment pipeline, ideally sequenced in the same lab)

- VCF files generated above for all tumor and normal BAM files


## Coverage

For each sample, tumor and normal, calculate GC-normalized coverages:

```
# Calculate and GC-normalize coverage from a BAM file 
$ Rscript $PURECN/Coverage.R --out-dir $OUT/$SAMPLEID \ 
    --bam ${SAMPLEID}.bam \
    --intervals $OUT_REF/baits_hg19_intervals.txt
```

Similar to GATK, this script also takes a text file containing a list of BAM or
coverage file names (one per line). The file extension must be `.list`:

```
# Calculate and GC-normalize coverage from a list of BAM files 
$ Rscript $PURECN/Coverage.R --out-dir $OUT/normals \ 
    --bam normals.list \
    --intervals $OUT_REF/baits_hg19_intervals.txt \
    --cores 4
```

Important recommendations:

- Only provide `--keep-duplicates` or `--remove-mapq0` if you know what you are
  doing and always use the same command line arguments for tumor and the
  normals

- It can be safe to skip the GC-normalization with `--skip-gc-norm` when tumor
  and normal samples are expected to exhibit similar biases and a sufficient
  number of normal samples is available. A good example would be plasma
  sequencing. In contrast, old FFPE samples normalized against blood controls
  will more likely benefit from GC-normalization.
  
- A potential negative impact of GC-normalization is much more
  likely in very small targeted panels (< 0.5Mb) and worth benchmarking.

- When supported third-party tools are used to calculate coverage (currently
  _CNVkit_, _GATK3_ and _GATK4_), it is possible to GC-normalize those
  coverages with a matching interval file:

```
# GC-normalize coverage from a GATK DepthOfCoverage file
Rscript $PURECN/Coverage.R --out-dir $OUT/$SAMPLEID \
    --coverage ${SAMPLEID}.coverage.sample_interval_summary \ 
    --intervals $OUT_REF/baits_hg19_intervals.txt
```

## NormalDB

To build a normal database for coverage normalization, copy the paths to all
(GC-normalized) normal coverage files in a single text file, line-by-line:

```
ls -a $OUT/normals/*_loess.txt.gz | cat > example_normal_coverages.list
# In case no GC-normalization is performed:
# ls -a $OUT/normals/*_coverage.txt.gz | cat > example_normal_coverages.list

$ Rscript $PURECN/NormalDB.R --out-dir $OUT_REF \
    --coverage-files example_normal_coverages.list \
    --genome hg19 --assay agilent_v6

# When normal panel VCF is available (highly recommended for
# unmatched samples)
$ Rscript $PURECN/NormalDB.R --out-dir $OUT_REF \
    --coverage-files example_normal_coverages.list \
    --normal-panel $NORMAL_PANEL \
    --genome hg19 \
    --assay agilent_v6

# For a Mutect2/GATK4 normal panel GenomicsDB (beta)
$ Rscript $PURECN/NormalDB.R --out-dir $OUT_REF \
    --coverage-files example_normal_coverages.list \
    --normal-panel $GENOMICSDB-WORKSPACE-PATH/pon_db \
    --genome hg19 \
    --assay agilent_v6
```

Important recommendations:

- Consider generating different databases when differences are significant, e.g.
  for samples with different read lengths or insert size distributions

- In particular, do not mix normal data obtained with different capture kits
  (e.g. _Agilent SureSelect v4_ and _v6_)

- Provide a normal panel VCF here to precompute mapping bias for faster
  runtimes. The only requirement for the VCF is an `AD` format field containing
  the number of reference and alt reads for all samples. See the example file
  `$PURECN/normalpanel.vcf.gz`.

- For ideal results, examine the `interval_weights.png` file to find good
  off-target bin widths. You will need to re-run `IntervalFile.R` with the
  `--average-off-target-width` parameter and re-calculate the coverages. `NormalDB.R`
  will also give a suggestion for a good minimum width. We do not recommend
  going lower than this estimate; setting `--average-off-target-width` to value larger
  than this value can decrease noise at the cost of loss in resolution.
  Setting it to 1.2-1.5x the minimum recommendation (that should be ideally <
  250kb) is a good starting point.

- The `--assay` argument is optional and is only used to add the provided assay
  name to all output files

- A warning pointing to the likely use of a wrong baits file means that more
  than 5% of targets have close to 0 coverage in all normal samples. A BED file
  with the low coverage targets will be generated in `--out-dir`.  If for any
  reason there is no access to the correct file, it is recommended to re-run the
  `IntervalFile.R` command and provide this BED file with `--exclude`.


## PureCN

Now that the assay-specific files are created and all coverages calculated, we
run `r Biocpkg("PureCN")` to normalize, segment and determine purity and ploidy:

```
mkdir $OUT/$SAMPLEID

# Without a matched normal (minimal test run)
$ Rscript $PURECN/PureCN.R --out $OUT/$SAMPLEID \
    --tumor $OUT/$SAMPLEID/${SAMPLEID}_coverage_loess.txt.gz \
    --sampleid $SAMPLEID \
    --vcf ${SAMPLEID}_mutect.vcf \
    --normaldb $OUT_REF/normalDB_hg19.rds \
    --intervals $OUT_REF/baits_hg19_intervals.txt \
    --genome hg19 

# Production pipeline run
$ Rscript $PURECN/PureCN.R --out $OUT/$SAMPLEID \
    --tumor $OUT/$SAMPLEID/${SAMPLEID}_coverage_loess.txt.gz \
    --sampleid $SAMPLEID \
    --vcf ${SAMPLEID}_mutect.vcf \
    --stats-file ${SAMPLEID}_mutect_stats.txt \
    --fun-segmentation PSCBS \
    --normaldb $OUT_REF/normalDB_hg19.rds \
    --mapping-bias-file $OUT_REF/mapping_bias_hg19.rds \
    --intervals $OUT_REF/baits_hg19_intervals.txt \
    --snp-blacklist hg19_simpleRepeats.bed \
    --genome hg19 \
    --model betabin \
    --force --post-optimize --seed 123

# With a matched normal (test run; for production pipelines we recommend the
# unmatched workflow described above)
$ Rscript $PURECN/PureCN.R --out $OUT/$SAMPLEID \
    --tumor $OUT/$SAMPLEID/${SAMPLEID}_coverage_loess.txt.gz \
    --normal $OUT/$SAMPLEID/${SAMPLEID_NORMAL}_coverage_loess.txt.gz \
    --sampleid $SAMPLEID \
    --vcf ${SAMPLEID}_mutect.vcf \
    --normaldb $OUT_REF/normalDB_hg19.rds \
    --intervals $OUT_REF/baits_hg19_intervals.txt \
    --genome hg19

# Recreate output after manual curation of ${SAMPLEID}.csv
$ Rscript $PURECN/PureCN.R --rds $OUT/$SAMPLEID/${SAMPLEID}.rds
```

Important recommendations:

- Even if matched normals are available, it is often better to use the normal
  database for coverage normalization. **When a matched normal coverage is
  provided with `--normal` then the pool of normal coverage normalization and
  denoising steps are skipped!**

- Always provide the normal coverage database to ignore low quality regions in
  the segmentation and to increase the sensitivity for homozygous deletions in
  high purity samples.

- Double check that in `--tumor` and `--normaldb`, GC-normalization is either
  used in both  (`*_loess.txt.gz`) or skipped in both (`*_coverage.txt.gz`).

- The normal panel VCF file is useful for mapping bias correction and especially
  recommended without matched normals. See the FAQ of the main vignette how to
  generate this file. It is not essential for test runs.  

- The _MuTect 1.1.7_ stats file (the main output file besides the VCF) should be
  provided for better artifact filtering. If the VCF was generated by a pipeline
  that performs good artifact filtering, this file is not needed. Do NOT 
  provide this file for _Mutect 2_.

- The `--post-optimize` flag defines that purity should be optimized using both
  variant allelic fractions and copy number instead of copy number only.  This
  results in a significant runtime increase for whole-exome data.

- If `--out` is a directory, it will use the sample id as file prefix for all
  output files. Otherwise `r Biocpkg("PureCN")` will use `--out` as prefix.   
    
- The `--parallel` flag will enable the parallel fitting of local optima. See 
  `r Biocpkg("BiocParallel")` for details. This script will use the default
  backend. `--cores` is a short cut to use the specified number of CPUs instead
  of the default backend. Only specify one of the two arguments. **Note that
  memory usage can increase linearly with number of cores and insufficient
  memory can result in random crashes.**

- `--fun-segmentation PSCBS` is the new recommendation in 1.22.  Support for
  interval weights currently requires a patch (see Section
  \@ref(installation)). See below for some more details on the best choice of
  the method.

- `--model betabin` is the new recommendation in 1.22 with larger panel of
  normals (more than 10-15 normal samples). 

- Defaults are well calibrated and should produce close to ideal results for
  most samples. A few common cases where changing defaults makes sense:

    - High purity and high quality: For cancer types with a high expected
      purity, such as ovarian cancer, AND when quality is expected to be very
      good (high coverage, young samples), `--max-copy-number 8`. 
      (`r Biocpkg("PureCN")` reports copy numbers greater than this value, but
      will stop fitting SNP allelic fractions to the exact allele-specific copy
      number because this will get impossible very quickly with high copy
      numbers - and computationally expensive.) 

    - Small panels with high coverage: `--interval-padding 100` (or higher), requires
      running the variant caller with this padding or without interval file. Use
      the same settings for the panel of normals VCF so that SNPs in the
      flanking regions have reliable mapping bias estimates. The
      `--max-homozygous-loss` parameter might also need some adjustment for very
      small panels with large gaps around captured deletions.

    - Cell lines: Safely skip the search for low purity solutions in cell lines:
      `--max-copy-number 8`, `--min-purity 0.9`, `--max-purity 0.99`.  Add
      `--model-homozygous` to find regions of LOH in samples without normal
      contamination (do not provide this flag when matched normal data are
      available in the VCF).

    - cfDNA: `--min-purity 0.1`, `--min-af 0.01` (or lower) and `--error 0.0005`
      (or lower, when there is UMI-based error correction). Note that the
      estimated purity can be very wrong when the true purity is below 5-7%;
      these samples are usually flagged as non-aberrant.

    - All assays: `--max-segments` should set to a value so that with few
      exceptions only poor quality samples exceed this cutoff. For cancer types
      with high heterogenity, it is also recommended to increase
      `--max-non-clonal` to 0.3-0.4 (this will increase the runtime significantly
      for whole-exome data).
    
    - The choice of the segmentation function can also make a significant
      difference and unfortunately there is not yet a universal method that works
      best in all scenarios.

        - PSCBS: A good and safe starting point, especially with off-target
          regions that might exhibit different noise profiles compared to
          on-target.
        
        - GATK4: Most recent addition. Not yet well tested in
          `r Biocpkg("PureCN")`, but theoretically best choice with a larger
          number of SNPs per intervals, for example assays with copy number
          backbones. We appreciate feedback.

        - CBS: Simple, fast and well tested. Does not fully support SNP
          information, so only recommended for settings with a very small
          SNPs/intervals ratio, for example small targeted panels (<1Mb) with
          healthy off-target coverage (<150kb resolution and similar log ratio
          noise compared to on-target).

        - copynumber: For cases with multiple time points or biopsies. This is   
          automatically chosen with `--additional-tumors` and currently not
          supported in a single-sample analysis.

        - Hclust/none: For third-party segmentations. `Hclust` clusters segments
          in an attempt to calibrate log-ratios across chromosomes, `none`
          largely keeps everything as provided.

- A few recommendations for checks whether the `r Biocpkg("PureCN")` setup is
  correct:

    - The "Mean standard deviation of log-ratios" reported in the log file
      should be fairly low for high quality data. Older FFPE data can be around
      0.4, but high coverage, relatively recent samples should approach the
      0.15 minimum. If off-target is consistently noisier than on-target, it is
      probably worth increasing the off-target bin size and start from scratch
      (or in case of whole-exome sequencing, ignore off-target reads since they
      do not provide much additional information when bins are large and/or
      noisy).
    
    - Related to that, a warning is thrown when less than 10% of all intervals
      passing filters are off-target intervals. Whole-exome sequencing is
      usually around that value. If the log-ratio standard deviation is similar
      or even lower than the one for on-target, it is worth keeping off-target
      regions. Otherwise off-target might add more noise than signal.
      Off-target information is automatically ignored when the passing rate
      falls below 5% of all intervals. 

    - The fraction of targets with SNPs should be between 10 and 15 percent.
      If it is significantly lower, make sure that the variant caller was used
      with 50-100bp interval padding or no interval file at all. Also check
      that the interval file was generated using the baits coordinates, not the
      targets (the baits BED file should have a more even size distribution,
      e.g. 120bp and multiples of it). If many variants are removed by the
      default 25 base quality feature, you might be using _Mutect 2_ and need
      to re-run _PureCN.R_ with `--min-base-quality 20`.
    
    - "Initial testing for significant sample cross-contamination" in the log
      file should not have many false positives, i.e. should be "unlikely" for
      most samples, not "maybe". Insufficient artifact removal can result in
      too many false SNPs calls with low allelic fractions, confusing the
      contamination caller.
    
    - Read all warnings. 
        

# Run PureCN with third-party segmentation

Our internal `r Biocpkg("PureCN")` normalization combined with the _PSCBS_ or
_GATK4_ segmentation should produce highly competitive results and we encourage
users to try it and compare it to their existing pipelines. However, we realize
that often it is not an option to change tools in production pipelines and we
therefore made it relatively easy to use `r Biocpkg("PureCN")` with third-party
tools. We provide examples for _CNVkit_ and _GATK4_ and it should be
straightforward to adapt those for other tools.

What you will need:

- Output of third-party tools (see details below)

- VCF files for all tumor samples and some normal files (see main vignette for
  questions related to required normal samples) 


## General usage

If you already have a segmentation from third-party tools (for example _CNVkit_,
_GATK4_, _EXCAVATOR2_). For a minimal test run:

```
Rscript $PURECN/PureCN.R --out $OUT/$SAMPLEID  \
    --sampleid $SAMPLEID \
    --seg-file $OUT/$SAMPLEID/${SAMPLEID}.cnvkit.seg \
    --vcf ${SAMPLEID}_mutect.vcf \
    --intervals $OUT_REF/baits_hg19_intervals.txt \
    --genome hg19 
```

See the main vignette for more details and file formats.


## Recommended _CNVkit_ usage

For a production pipeline run we provide again more information about the assay
and genome. Here an _CNVkit_ example: 

```
# Recommended: Provide a normal panel VCF to remove mapping biases, pre-compute
# position-specific bias for much faster runtimes with large panels
# This needs to be done only once for each assay
Rscript $PURECN/NormalDB.R --out-dir $OUT_REF --normal-panel $NORMAL_PANEL \
    --assay agilent_v6 --genome hg19 --force

# Export the segmentation in DNAcopy format
cnvkit.py export seg $OUT/$SAMPLEID/${SAMPLEID}_cnvkit.cns --enumerate-chroms \
    -o $OUT/$SAMPLEID/${SAMPLEID}_cnvkit.seg

# Run PureCN by providing the *.cnr and *.seg files 
Rscript $PURECN/PureCN.R --out $OUT/$SAMPLEID  \
    --sampleid $SAMPLEID \
    --tumor $OUT/$SAMPLEID/${SAMPLEID}_cnvkit.cnr \
    --seg-file $OUT/$SAMPLEID/${SAMPLEID}_cnvkit.seg \
    --mapping-bias-file $OUT_REF/mapping_bias_agilent_v6_hg19.rds \
    --vcf ${SAMPLEID}_mutect.vcf \
    --stats-file ${SAMPLEID}_mutect_stats.txt \
    --snp-blacklist hg19_simpleRepeats.bed \
    --genome hg19 \
    --fun-segmentation Hclust \
    --force --post-optimize --seed 123
```

Important recommendations:

- The `--fun-segmentation` argument controls if the data should to be
  re-segmented using germline BAFs (default). Set this value to `none` if the
  provided segmentation should be used as is.  The recommended `Hclust` will
  only cluster provided segments.

- Since _CNVkit_ provides all necessary information in the `*.cnr` output files,
  the `--intervals` argument is not required.

- In test runs, especially when the input VCF contains matched normal
  information, `--mapping-bias-file` can be skipped

- _CNVkit_ runs without normal reference samples are not recommended

- The `--stats-file` is only supported for _Mutect 1.1.7_. _Mutect 2_ provides
  the filter flags directly in the VCF.

## Recommended _GATK4_ usage

```
# Recommended: Provide a normal panel GenomicsDB to remove mapping
# biases, pre-compute position-specific bias for much faster runtimes
# with large panels. This needs to be done only once for each assay. 
Rscript $PURECN/NormalDB.R --out-dir $OUT_REF \
    --normal-panel $GENOMICSDB-WORKSPACE-PATH/pon_db \
    --assay agilent_v6 --genome hg19 --force

Rscript $PURECN/PureCN.R --out $OUT/$SAMPLEID  \
    --sampleid $SAMPLEID \
    --tumor $OUT/$SAMPLEID/${SAMPLEID}.hdf5 \
    --log-ratio-file $OUT/$SAMPLEID/${SAMPLEID}.denoisedCR.tsv \
    --seg-file $OUT/$SAMPLEID/${SAMPLEID}.modelFinal.seg \
    --mapping-bias-file $OUT_REF/mapping_bias_agilent_v6_hg19.rds \
    --vcf ${SAMPLEID}_mutect2_filtered.vcf \
    --snp-blacklist hg19_simpleRepeats.bed \
    --genome hg19 \
    --fun-segmentation Hclust \
    --force --post-optimize --seed 123
```

Important recommendations:

- The `--fun-segmentation` can be set to none in most cases. This will keep the
  segmentation largely as provided. `Hclust` clusters segments to avoid
  over-segmentation and to calibrate log-ratios across chromosomes.  This will
  thus alter the GATK4 segmentation, which might not be desired.

- Beta support for providing _CollectAllelicCounts_ output instead of _Mutect_
  is available. Use `--vcf ${SAMPLEID}.allelicCounts.tsv` to automatically
  import the SNP counts and convert them into a supported VCF. Note that this
  will not use any somatic SNV and indel information available in _Mutect_ VCFs
  and thus will also not provide any clonality annotation.
  
# Biomarkers

`Dx.R` provides copy number and mutation metrics commonly used as biomarkers,
most importantly tumor mutational burden (TMB), chromosomal instability (CIN)
and mutational signatures.

```
# Provide a BED file with callable regions, for examples obtained by
# GATK CallableLoci. Useful to calculate mutations per megabase and
# to exclude low quality regions.
grep CALLABLE ${SAMPLEID}_callable_status.bed > \ 
    ${SAMPLEID}_callable_status_filtered.bed

# Only count mutations in callable regions, also subtract what was
# ignored in PureCN.R via --snp-blacklist, like simple repeats, from the
# mutation per megabase calculation
# Also search for the COSMIC mutation signatures
# (http://cancer.sanger.ac.uk/cosmic/signatures)
Rscript $PureCN/Dx.R --out $OUT/$SAMPLEID/$SAMPLEID \
    --rds $OUT/SAMPLEID/${SAMPLEID}.rds \
    --callable ${SAMPLEID}_callable_status_filtered.bed \
    --exclude hg19_simpleRepeats.bed \
    --signatures 

# Restrict mutation burden calculation to coding sequences
Rscript $PureCN/FilterCallableLoci.R --genome hg19 \
    --in-file ${SAMPLEID}_callable_status_filtered.bed \
    --out-file ${SAMPLEID}_callable_status_filtered_cds.bed \
    --exclude '^HLA'
 
Rscript $PureCN/Dx.R --out $OUT/$SAMPLEID/${SAMPLEID}_cds \
    --rds $OUT/SAMPLEID/${SAMPLEID}.rds \
    --callable ${SAMPLEID}_callable_status_filtered_cds.bed \
    --exclude hg19_simpleRepeats.bed
```

Important recommendations:

- Run _GATK CallableLoci_ with `--minDepth N` where N is roughly 20% of the mean
  target coverage of all samples.

- If `--callable` is missing, all intervals passing filters are assumed to be
  callable.


# Reference

Argument name          | Corresponding PureCN argument | PureCN function 
-----------------------|-------------------------------|----------------
`--fasta`              | `reference.file` | `preprocessIntervals` 
`--in-file`             | `interval.file` | `preprocessIntervals` 
`--off-target`          | `off.target`    | `preprocessIntervals` 
`--average-target-width`        | `average.target.width` | `preprocessIntervals` 
`--min-target-width`     | `min.target.width` | `preprocessIntervals` 
`--small-targets`       | `small.targets` | `preprocessIntervals` 
`--average-off-target-width`     | `average.off.target.width` | `preprocessIntervals` 
`--off-target-seqlevels` | `off.target.seqlevels` | `preprocessIntervals` 
`--mappability`        | `mappability`   | `preprocessIntervals` 
`--min-mappability`     | `min.mappability`   | `preprocessIntervals` 
`--reptiming`          | `reptiming`   | `preprocessIntervals` 
`--average-reptiming-width`     | `average.reptiming.width` | `preprocessIntervals`  
`--genome`             | `txdb`, `org` | `annotateTargets` 
`--out-file`            | | 
`--export`             | | `rtracklayer::export`  
`--version -v`         | | 
`--force -f`           | | 
`--help -h`            | | 

: (\#tab:intervalfile) IntervalFile


Argument name      | Corresponding PureCN argument | PureCN function 
-------------------|-------------------------------|----------------
`--bam`            | `bam.file` | `calculateBamCoverageByInterval`
`--bai`            | `index.file` | `calculateBamCoverageByInterval` 
`--coverage`       | `coverage.file` | `correctCoverageBias`
`--intervals`      | `interval.file`  | `correctCoverageBias` 
`--method`         | `method`  | `correctCoverageBias`
`--keep-duplicates` | `keep.duplicates`  | `calculateBamCoverageByInterval`
`--chunks`         | `chunks`  | `calculateBamCoverageByInterval`
`--remove-mapq0`   | `mapqFilter`  | `ScanBamParam`
`--skip-gc-norm`   | | `correctCoverageBias`
`--out-dir`        | | 
`--cores`          | | Number of CPUs to use when multiple BAMs are provided 
`--parallel`       | | Use default `r Biocpkg("BiocParallel")` backend when multiple BAMs are provided 
`--seed`           | | 
`--version -v`     | | 
`--force -f`       | | 
`--help -h`        | | 

: (\#tab:coverage) Coverage


Argument name          | Corresponding PureCN argument | PureCN function 
-----------------------|-------------------------------|----------------
`--coverage-files`   | `normal.coverage.files` | `createNormalDatabase`
`--normal-panel`    | `normal.panel.vcf.file` | `calculateMappingBiasVcf` 
`--assay -a`          | Optional assay name | Used in output file names. 
`--genome -g`        | Optional genome version | Used in output file names. 
`--genomicsdb-af-field` | For GenomicsDB import, allelic fraction field | `calculateMappingBiasGatk4` 
`--min-normals-position-specific-fit` | `min.normals.position.specific.fit` | `calculateMappingBiasVcf`, `calculateMappingBiasGatk4` 
`--out-dir -o`         | | 
`--version -v`        | | 
`--force -f`          | | 
`--help -h`           | | 

: (\#tab:normaldb) NormalDB

Argument name          | Corresponding PureCN argument | PureCN function 
-----------------------|-------------------------------|----------------
`--sampleid -i`      | `sampleid`  | `runAbsoluteCN`  
`--normal`           | `normal.coverage.file`   | `runAbsoluteCN`  
`--tumor`            | `tumor.coverage.file`     | `runAbsoluteCN` 
`--vcf`              | `vcf.file`          | `runAbsoluteCN`   
`--rds`              | `file.rds`          | `readCurationFile`   
`--mapping-bias-file`  | `mapping.bias.file` | `setMappingBiasVcf` 
`--normaldb`         | `normalDB` (serialized with `saveRDS`) | `calculateTangentNormal`, `filterTargets` 
`--seg-file`          | `seg.file`           | `runAbsoluteCN`  
`--log-ratio-file`     | `log.ratio`          | `runAbsoluteCN`  
`--additional-tumors` | `tumor.coverage.files` | `processMultipleSamples`  
`--sex`              | `sex`                | `runAbsoluteCN`  
`--genome`           | `genome`             | `runAbsoluteCN`  
`--intervals`        | `interval.file`      | `runAbsoluteCN`   
`--stats-file`        | `stats.file` | `filterVcfMuTect`        
`--min-af`            | `af.range` | `filterVcfBasic` 
`--snp-blacklist`     | `snp.blacklist` | `filterVcfBasic`     
`--error`            | `error` | `runAbsoluteCN`     
`--db-info-flag`       | `DB.info.flag` | `runAbsoluteCN`     
`--popaf-info-field`   | `POPAF.info.field` | `runAbsoluteCN`     
`--cosmic-cnt-info-field`   | `Cosmic.CNT.info.field` | `runAbsoluteCN`     
`--min-cosmic-cnt`     | `min.cosmic.cnt` | `setPriorVcf`     
`--interval-padding`          | `interval.padding` | `filterVcfBasic`  
`--min-total-counts`   | `min.total.counts` | `filterIntervals`  
`--min-fraction-offtarget`   | `min.fraction.offtarget` | `filterIntervals`  
`--fun-segmentation`  | `fun.segmentation` | `runAbsoluteCN`  
`--alpha`            | `alpha` | `segmentationCBS` 
`--undo-sd`           | `undo.SD` | `segmentationCBS` 
`--changepoints-penalty`| `changepoints.penalty` | `segmentationGATK4` 
`--additional-cmd-args`| `additional.cmd.args` | `segmentationGATK4` 
`--max-segments`      | `max.segments` | `runAbsoluteCN` 
`--min-logr-sdev`      | `min.logr.sdev` | `runAbsoluteCN` 
`--min-purity`        | `test.purity` | `runAbsoluteCN` 
`--max-purity`        | `test.purity` | `runAbsoluteCN` 
`--min-ploidy`        | `min.ploidy` | `runAbsoluteCN` 
`--max-ploidy`        | `max.ploidy` | `runAbsoluteCN` 
`--max-copy-number`   | `test.num.copy` | `runAbsoluteCN` 
`--post-optimize`     | `post.optimize` | `runAbsoluteCN`     
`--bootstrap-n`       | `n` | `bootstrapResults` 
`--speedup-heuristics` | `speedup.heuristics` | `runAbsoluteCN`     
`--model-homozygous`  | `model.homozygous` | `runAbsoluteCN`  
`--model`            | `model` | `runAbsoluteCN`  
`--log-ratio-calibration` | `log.ratio.calibration` | `runAbsoluteCN`  
`--max-non-clonal`     | `max.non.clonal` | `runAbsoluteCN`  
`--max-homozygous-loss` | `max.homozygous.loss` | `runAbsoluteCN`  
`--out-vcf`           | `return.vcf` | `predictSomatic` 
`--out -o`           | |                            
`--parallel`         | `BPPARAM` | `runAbsoluteCN` 
`--cores`            | `BPPARAM` | `runAbsoluteCN` 
`--seed`             | | 
`--version -v`       | | 
`--force -f`         | | 
`--help -h`          | | 

: (\#tab:purecn) PureCN

Argument name   | Corresponding PureCN argument | PureCN function 
----------------|-------------------------------|----------------
`--rds`         | `file.rds` | `readCurationFile`   
`--callable`    | `callable` | `callMutationBurden` 
`--exclude`     | `exclude`  | `callMutationBurden` 
`--max-prior-somatic` | `max.prior.somatic`  | `callMutationBurden` 
`--signatures` | | `deconstructSigs::whichSignatures` 
`--signature-databases` | | `deconstructSigs::whichSignatures` 
`--out`         | | 
`--version -v`  | | 
`--force -f`    | | 
`--help -h`     | | 

: (\#tab:dx) Dx