%\VignetteKeywords{runAbsoluteCN}
%\VignetteEngine{knitr::knitr}
%\VignetteDepends{PureCN}
%\VignettePackage{PureCN}
%\VignetteIndexEntry{Quick start and command line usage}

\documentclass{article}

<>=
BiocStyle::latex()
@

\usepackage{booktabs} % book-quality tables

\begin{document}

<>=
library(PureCN)
set.seed(1234)
@

\section*{PureCN - Quick Start}

This tutorial provides a quick overview of the command line tools shipping
with \Biocpkg{PureCN}. For the R package and more detailed information, see
the main vignette.

\subsection*{Update from previous stable versions}

\Biocpkg{PureCN} is in general backward compatible with input generated by the
previous stable version 1.8. \Biocpkg{PureCN} 1.10 introduced a new normal
database format with several important improvements such as not splitting the
database by sample sex for the normalization of autosomes. It is therefore
necessary to at least re-run the NormalDB.R script described below. For
upgrades from version 1.6, we highly recommend starting from scratch following
this tutorial.

\subsection*{Prepare environment and files}

\begin{itemize}

\item Start R and enter the following to get the path to the command line
scripts:

<>=
system.file("extdata", package="PureCN")
@

\item Exit R and store this path in an environment variable, for example in
BASH:

\begin{verbatim}
$ export PURECN="/path/to/PureCN/extdata"
$ Rscript $PURECN/PureCN.R --help
Usage: /path/to/PureCN/inst/extdata/PureCN.R [options] ...
\end{verbatim}

\item Generate an interval file from a BED file containing baits coordinates:

\begin{verbatim}
# specify path where PureCN should store reference files
$ export OUT_REF="reference_files"
$ Rscript $PURECN/IntervalFile.R --infile baits_hg19.bed \
    --fasta hg19.fa --outfile $OUT_REF/baits_hg19_intervals.txt \
    --offtarget --genome hg19 \
    --export $OUT_REF/baits_optimized_hg19.bed \
    --mappability wgEncodeCrgMapabilityAlign100mer.bigWig \
    --reptiming wgEncodeUwRepliSeqK562WaveSignalRep1.bigWig
\end{verbatim}

Internally, this script uses \Biocpkg{rtracklayer} to parse the
\Rcode{infile}. Make sure that the file format matches the file extension.
Consult the \Biocpkg{rtracklayer} documentation if the file fails to load.
Check that the genome version of the baits file matches the reference. If the
capture kit includes chrM baits, do not include them.

The \Rcode{--offtarget} flag will include off-target reads. Including them is
recommended except for Amplicon data. The \Rcode{--genome} version is needed
to annotate exons with gene symbols.

The \Rcode{--export} argument is optional. If provided, this script will store
the modified intervals in the given file, for example as a BED file (again,
every \Biocpkg{rtracklayer} format is supported). This is useful when the
coverages are calculated with third-party tools like GATK.

The \Rcode{--mappability} argument should provide a \Biocpkg{rtracklayer}
parsable file with a mappability score in the first meta data column. If
provided, off-target regions will be restricted to regions specified in this
file. On-target regions with low mappability will be excluded. For hg19,
download the file from the UCSC website. See the FAQ section of the main
vignette for instructions on how to generate such a file for other references.

Similarly, the \Rcode{--reptiming} argument takes a replication timing score
in the same format.
If provided, GC-normalized and log-transformed coverage is tested for a linear relationship with this score and normalized accordingly. Generation of this interval file is not needed when CNVkit is used to generate coverage data (covered at the end of this tutorial). \end{itemize} \subsection*{Create VCF files} \Biocpkg{PureCN} does not ship with a variant caller. Use a third-party tool to generate a VCF for each sample. Important recommendations: \begin{itemize} \item Use \software{MuTect} 1.1.7 if possible \item Support for \software{MuTect 2} and \software{FreeBayes} is available for tumor-only VCFs, but currently poorly tested and only very limited artifact filtering will be performed for these callers. See the FAQ section in the main vignette for common problems and questions related to input data. \item Since germline SNPs are needed to infer allele-specific copy numbers, the provided VCF needs to contain both somatic and germline variants. Make sure that upstream filtering does not remove high quality SNPs, in particular due to presence in germline databases. \item Run the variant caller with a 50-75 base pair interval padding to increase the number of heterozygous SNPs \end{itemize} \subsection*{Run PureCN with internal segmentation} The following describes \Biocpkg{PureCN} runs with internal copy number normalization and segmentation. 
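
The commands in the following subsections refer to the shell variables
\Rcode{\$OUT} and \Rcode{\$SAMPLEID}, which this tutorial does not define. A
minimal sketch of one possible setup in BASH, assuming hypothetical
placeholder values (the directory name \Rcode{results} and the sample id
\Rcode{Sample1} are illustrative only and not part of \Biocpkg{PureCN}):

\begin{verbatim}
# Hypothetical placeholder values; adjust to your project layout
$ export OUT="results"
$ export SAMPLEID="Sample1"
# Create the per-sample output directory if it does not exist yet
$ mkdir -p "$OUT/$SAMPLEID"
\end{verbatim}

In a pipeline, these variables would typically be set once per sample before
the coverage step.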
\subsubsection*{Coverage}

For each sample, tumor and normal, calculate GC-normalized coverages:

\begin{verbatim}
# Calculate and GC-normalize coverage from a BAM file
$ Rscript $PURECN/Coverage.R --outdir $OUT/$SAMPLEID \
    --bam ${SAMPLEID}.bam \
    --intervals $OUT_REF/baits_hg19_intervals.txt

# GC-normalize coverage from a GATK DepthOfCoverage file
$ Rscript $PURECN/Coverage.R --outdir $OUT/$SAMPLEID \
    --coverage ${SAMPLEID}.coverage.sample_interval_summary \
    --intervals $OUT_REF/baits_hg19_intervals.txt
\end{verbatim}

Similar to GATK, this script also takes a text file containing a list of BAM
or coverage file names (one per line). The file extension must be .list:

\begin{verbatim}
# Calculate and GC-normalize coverage from a list of BAM files
$ Rscript $PURECN/Coverage.R --outdir $OUT \
    --bam normals.list \
    --intervals $OUT_REF/baits_hg19_intervals.txt \
    --cpu 4
\end{verbatim}

Important recommendations:

\begin{itemize}
\item Only provide \Rcode{--keepduplicates} or \Rcode{--removemapq0} if you
know what you are doing, and always use the same command line arguments for
tumor and the normals
\end{itemize}

\subsubsection*{NormalDB}

To build a normal database for coverage normalization, copy the paths to all
GC-normalized normal coverage files into a single text file, one path per
line:

\begin{verbatim}
$ ls -a normal*loess.txt | cat > example_normal.list

# From already GC-normalized files
$ Rscript $PURECN/NormalDB.R --outdir $OUT_REF \
    --coveragefiles example_normal.list \
    --genome hg19 --assay agilent_v6

# When normal panel VCF is available (highly recommended for unmatched samples)
$ Rscript $PURECN/NormalDB.R --outdir $OUT_REF \
    --coveragefiles example_normal.list \
    --genome hg19 --normal_panel $NORMAL_PANEL --assay agilent_v6
\end{verbatim}

Important recommendations:

\begin{itemize}

\item The resulting \Rcode{normalDB\_hg19.rds} file contains absolute paths to
the coverage files. It is thus necessary to re-run this command when the
coverage files are moved.
\item Consider generating different databases when differences are significant, e.g. for samples with different read lengths or insert size distributions \item In particular, do not mix normal data obtained with different capture kits (e.g. Agilent SureSelect v4 and v6) \item Provide a normal panel VCF here to precompute mapping bias for faster runtimes. The only requirement for the VCF is an \Rcode{AD} format field containing the number of reference and alt reads for all samples. See the example file \Rcode{\$PURECN/normalpanel.vcf.gz}. \item For ideal results, examine the target\_weights.png file to find good off-target bin widths. You will need to re-run IntervalFile.R with the \Rcode{--offtargetwidth} parameter and re-calculate the coverages. \item The \Rcode{--assay} argument is optional and is only used to add the provided assay name to all output files \item A warning pointing to the likely use of a wrong baits file means that more than 5\% of targets have close to 0 coverage in all normal samples. A BED file with the low coverage targets will be generated in \Rcode{--outdir}. If for any reason there is no access to the correct file, it is recommended to re-run the IntervalFile.R command and provide this BED file with \Rcode{--exclude}. 
\end{itemize}

\subsubsection*{PureCN}

Now that the assay-specific files are created and all coverages calculated, we
run PureCN.R to normalize, segment and determine purity and ploidy:

\begin{verbatim}
$ mkdir -p $OUT/$SAMPLEID

# Without a matched normal (minimal test run)
$ Rscript $PURECN/PureCN.R --out $OUT/$SAMPLEID \
    --tumor $OUT/$SAMPLEID/${SAMPLEID}_coverage_loess.txt \
    --sampleid $SAMPLEID \
    --vcf ${SAMPLEID}_mutect.vcf \
    --normaldb $OUT_REF/normalDB_hg19.rds \
    --intervals $OUT_REF/baits_hg19_intervals.txt \
    --genome hg19

# Production pipeline run
$ Rscript $PURECN/PureCN.R --out $OUT/$SAMPLEID \
    --tumor $OUT/$SAMPLEID/${SAMPLEID}_coverage_loess.txt \
    --sampleid $SAMPLEID \
    --vcf ${SAMPLEID}_mutect.vcf \
    --statsfile ${SAMPLEID}_mutect_stats.txt \
    --normaldb $OUT_REF/normalDB_hg19.rds \
    --normal_panel $OUT_REF/mapping_bias_hg19.rds \
    --intervals $OUT_REF/baits_hg19_intervals.txt \
    --targetweightfile $OUT_REF/target_weights_hg19.txt \
    --snpblacklist hg19_simpleRepeats.bed \
    --genome hg19 \
    --force --postoptimize --seed 123

# With a matched normal (test run; for production pipelines we recommend the
# unmatched workflow described above)
$ Rscript $PURECN/PureCN.R --out $OUT/$SAMPLEID \
    --tumor $OUT/$SAMPLEID/${SAMPLEID}_coverage_loess.txt \
    --normal $OUT/$SAMPLEID/${SAMPLEID_NORMAL}_coverage_loess.txt \
    --sampleid $SAMPLEID \
    --vcf ${SAMPLEID}_mutect.vcf \
    --normaldb $OUT_REF/normalDB_hg19.rds \
    --intervals $OUT_REF/baits_hg19_intervals.txt \
    --genome hg19

# Recreate output after manual curation of ${SAMPLEID}.csv
$ Rscript $PURECN/PureCN.R --rds $OUT/$SAMPLEID/${SAMPLEID}.rds
\end{verbatim}

Important recommendations:

\begin{itemize}

\item Even if matched normals are available, it is often better to use the
normal database for coverage normalization

\item Providing the normal database in addition to a matched normal is
optional, but helps to ignore low-quality regions in the segmentation

\item The normal panel VCF file is useful for mapping bias correction and
especially recommended without matched normals. See the FAQ of the main
vignette for how to generate this file. It is not essential for test runs.

\item The \software{MuTect} 1.1.7 stats file (the main output file besides the
VCF) should be provided for better artifact filtering. If the VCF was
generated by a pipeline that performs good artifact filtering, this file is
not needed.

\item The \Rcode{--postoptimize} flag specifies that purity should be
optimized using both variant allelic fractions and copy number instead of copy
number only. This results in a significant runtime increase for whole-exome
data.

\item If \Rcode{--out} is a directory, it will use the sample id as file
prefix for all output files. Otherwise \Biocpkg{PureCN} will use \Rcode{--out}
as prefix.

\item The \Rcode{--parallel} flag will enable the parallel fitting of local
optima. See \Biocpkg{BiocParallel} for details. This script will use the
default backend.

\item Defaults are well calibrated and should produce close to ideal results
for most samples. A few common cases where changing defaults makes sense:

\begin{description}

\item[High purity and high quality:] For cancer types with a high expected
purity, such as ovarian cancer, AND when quality is expected to be very good
(high coverage, young samples), \Rcode{--maxcopynumber 8}

\item[Small panels with high coverage:] \Rcode{--padding 100} (or higher);
this requires running the variant caller with this padding or without an
interval file. Use the same settings for the panel of normals VCF so that SNPs
in the flanking regions have reliable mapping bias estimates.

\item[Cell lines:] Safely skip the search for low purity solutions in cell
lines: \Rcode{--maxcopynumber 8}, \Rcode{--minpurity 0.9},
\Rcode{--maxpurity 0.99}. Add \Rcode{--modelhomozygous} to find regions of LOH
in samples without normal contamination.
\item[cfDNA:] \Rcode{--minpurity 0.1}, \Rcode{--minaf 0.01} (or lower) and
\Rcode{--error 0.0005} (or lower, when there is UMI-based error correction)

\item[Amplicon data:] \Rcode{--model betabin} (Amplicon data is not officially
supported)

\end{description}

\end{itemize}

\subsection*{Run PureCN with third-party segmentation}

If you already have a segmentation from third-party tools (for example CNVkit
or EXCAVATOR2), provide it with \Rcode{--segfile}. For a test run:

\begin{verbatim}
Rscript $PURECN/PureCN.R --out $OUT/$SAMPLEID \
    --sampleid $SAMPLEID \
    --segfile $OUT/$SAMPLEID/${SAMPLEID}.cnvkit.seg \
    --vcf ${SAMPLEID}_mutect.vcf \
    --intervals $OUT_REF/baits_hg19_intervals.txt \
    --genome hg19
\end{verbatim}

See the main vignette for more details and file formats.

For a production pipeline run, we again provide more information about the
assay and genome. Here is a CNVkit example (CNVkit runs without normal
reference samples are not recommended):

\begin{verbatim}
# Provide a normal panel VCF to remove mapping biases, pre-compute
# position-specific bias for much faster runtimes with large panels
# This needs to be done only once for each assay
Rscript $PURECN/NormalDB.R --outdir $OUT_REF --normal_panel $NORMAL_PANEL \
    --assay agilent_v6 --genome hg19 --force

# Export the segmentation in DNAcopy format
cnvkit.py export seg $OUT/$SAMPLEID/${SAMPLEID}_cnvkit.cns --enumerate-chroms \
    -o $OUT/$SAMPLEID/${SAMPLEID}_cnvkit.seg

# Run PureCN by providing the *.cnr and *.seg files
Rscript $PURECN/PureCN.R --out $OUT/$SAMPLEID \
    --sampleid $SAMPLEID \
    --tumor $OUT/$SAMPLEID/${SAMPLEID}_cnvkit.cnr \
    --segfile $OUT/$SAMPLEID/${SAMPLEID}_cnvkit.seg \
    --normal_panel $OUT_REF/mapping_bias_agilent_v6_hg19.rds \
    --vcf ${SAMPLEID}_mutect.vcf \
    --statsfile ${SAMPLEID}_mutect_stats.txt \
    --snpblacklist hg19_simpleRepeats.bed \
    --genome hg19 \
    --funsegmentation none \
    --force --postoptimize --seed 123
\end{verbatim}

Important recommendations:

\begin{itemize}

\item The \Rcode{--funsegmentation} argument controls if the data
should be re-segmented using germline BAFs (default). Set this value to
\Rcode{none} if the provided segmentation should be used as is.

\item Since CNVkit provides all necessary information in the *.cnr output
files, the \Rcode{--intervals} argument is not required.

\end{itemize}

\subsection*{Dx}

Dx.R extracts copy number and mutation metrics from PureCN.R output.

\begin{verbatim}
# Provide a BED file with callable regions, for example obtained by
# GATK CallableLoci. Useful to calculate mutations per megabase and
# to exclude low quality regions.
grep CALLABLE ${SAMPLEID}_callable_status.bed > \
    ${SAMPLEID}_callable_status_filtered.bed

# Only count mutations in callable regions, also subtract what was ignored
# in PureCN.R via --snpblacklist, like simple repeats, from the mutation per
# megabase calculation
# Also search for the COSMIC mutation signatures
# (http://cancer.sanger.ac.uk/cosmic/signatures)
Rscript $PURECN/Dx.R --out $OUT/$SAMPLEID/$SAMPLEID \
    --rds $OUT/$SAMPLEID/${SAMPLEID}.rds \
    --callable ${SAMPLEID}_callable_status_filtered.bed \
    --exclude hg19_simpleRepeats.bed \
    --signatures

# Restrict mutation burden calculation to coding sequences
Rscript $PURECN/FilterCallableLoci.R --genome hg19 \
    --infile ${SAMPLEID}_callable_status_filtered.bed \
    --outfile ${SAMPLEID}_callable_status_filtered_cds.bed
Rscript $PURECN/Dx.R --out $OUT/$SAMPLEID/${SAMPLEID}_cds \
    --rds $OUT/$SAMPLEID/${SAMPLEID}.rds \
    --callable ${SAMPLEID}_callable_status_filtered_cds.bed \
    --exclude hg19_simpleRepeats.bed
\end{verbatim}

Important recommendations:

\begin{itemize}

\item The \Rcode{--signatures} argument does not work well with small numbers
of somatic mutations; thus do not use it on CDS filtered data.

\item Run GATK CallableLoci with \Rcode{--minDepth N} where N is roughly 20\%
of the mean target coverage of all samples.
\end{itemize}

\clearpage

\section*{Reference}

\begin{table*}
\caption{IntervalFile}
\begin{tabular}{lll}
\toprule
Argument name & Corresponding PureCN argument & PureCN function \\
\midrule
--fasta & reference.file & \Rfunction{preprocessIntervals} \\
--infile & interval.file & \Rfunction{preprocessIntervals} \\
--offtarget & off.target & \Rfunction{preprocessIntervals} \\
--targetwidth & average.target.width & \Rfunction{preprocessIntervals} \\
--offtargetwidth & average.off.target.width & \Rfunction{preprocessIntervals} \\
--offtargetseqlevels & off.target.seqlevels & \Rfunction{preprocessIntervals} \\
--mappability & mappability & \Rfunction{preprocessIntervals} \\
--minmappability & min.mappability & \Rfunction{preprocessIntervals} \\
--reptiming & reptiming & \Rfunction{preprocessIntervals} \\
--reptimingbinsize & & \\
--genome & txdb, org & \Rfunction{annotateTargets} \\
--outfile & & \\
--export & & \Rfunction{rtracklayer::export} \\
--version -v & & \\
--force -f & & \\
--help -h & & \\
\bottomrule
\end{tabular}
\end{table*}

\begin{table*}
\caption{Coverage}
\begin{tabular}{lll}
\toprule
Argument name & Corresponding PureCN argument & PureCN function \\
\midrule
--bam & bam.file & \Rfunction{calculateBamCoverageByInterval} \\
--bai & index.file & \Rfunction{calculateBamCoverageByInterval} \\
--coverage & coverage.file & \Rfunction{correctCoverageBias} \\
--intervals & interval.file & \Rfunction{correctCoverageBias} \\
--method & method & \Rfunction{correctCoverageBias} \\
--keepduplicates & keep.duplicates & \Rfunction{calculateBamCoverageByInterval} \\
--removemapq0 & mapqFilter & \Rfunction{ScanBamParam} \\
--outdir & & \\
--cpu & & Number of CPUs to use\\
--seed & & \\
--version -v & & \\
--force -f & & \\
--help -h & & \\
\bottomrule
\end{tabular}
\end{table*}

\begin{table*}
\caption{NormalDB}
\begin{tabular}{lll}
\toprule
Argument name & Corresponding PureCN argument & PureCN function \\
\midrule
--coveragefiles & normal.coverage.files &
\Rfunction{createNormalDatabase} \\
--normal\_panel & normal.panel.vcf.file & \Rfunction{calculateMappingBiasVcf} \\
--maxmeancoverage & max.mean.coverage & \Rfunction{createNormalDatabase} \\
--assay -a & Optional assay name & Used in output file names. \\
--genome -g & Optional genome version & Used in output file names. \\
--outdir -o & & \\
--version -v & & \\
--force -f & & \\
--help -h & & \\
\bottomrule
\end{tabular}
\end{table*}

\begin{table*}
\caption{PureCN}
\begin{tabular}{lll}
\toprule
Argument name & Corresponding PureCN argument & PureCN function \\
\midrule
--sampleid -i & sampleid & \Rfunction{runAbsoluteCN} \\
--normal & normal.coverage.file & \Rfunction{runAbsoluteCN} \\
--tumor & tumor.coverage.file & \Rfunction{runAbsoluteCN} \\
--vcf & vcf.file & \Rfunction{runAbsoluteCN} \\
--rds & file.rds & \Rfunction{readCurationFile} \\
--normal\_panel & normal.panel.vcf.file & \Rfunction{setMappingBiasVcf} \\
--normaldb & normalDB (serialized with \Rfunction{saveRDS}) & \Rfunction{findBestNormal}, \Rfunction{filterTargets} \\
--segfile & seg.file & \Rfunction{runAbsoluteCN} \\
--sex & sex & \Rfunction{runAbsoluteCN} \\
--genome & genome & \Rfunction{runAbsoluteCN} \\
--intervals & interval.file & \Rfunction{runAbsoluteCN} \\
--statsfile & stats.file & \Rfunction{filterVcfMuTect} \\
--minaf & af.range & \Rfunction{filterVcfBasic} \\
--snpblacklist & snp.blacklist & \Rfunction{filterVcfBasic} \\
--error & error & \Rfunction{runAbsoluteCN} \\
--dbinfoflag & DB.info.flag & \Rfunction{runAbsoluteCN} \\
--funsegmentation & fun.segmentation & \Rfunction{runAbsoluteCN} \\
--alpha & alpha & \Rfunction{segmentationCBS} \\
--undosd & undo.SD & \Rfunction{segmentationCBS} \\
--maxsegments & max.segments & \Rfunction{runAbsoluteCN} \\
--targetweightfile & target.weight.file & \Rfunction{segmentationCBS} \\
--minpurity & test.purity & \Rfunction{runAbsoluteCN} \\
--maxpurity & test.purity & \Rfunction{runAbsoluteCN} \\
--minploidy & min.ploidy &
\Rfunction{runAbsoluteCN} \\
--maxploidy & max.ploidy & \Rfunction{runAbsoluteCN} \\
--maxcopynumber & test.num.copy & \Rfunction{runAbsoluteCN} \\
--postoptimize & post.optimize & \Rfunction{runAbsoluteCN} \\
--bootstrapn & n & \Rfunction{bootstrapResults} \\
--modelhomozygous & model.homozygous & \Rfunction{runAbsoluteCN} \\
--model & model & \Rfunction{runAbsoluteCN} \\
--logratiocalibration & log.ratio.calibration & \Rfunction{runAbsoluteCN} \\
--maxnonclonal & max.non.clonal & \Rfunction{runAbsoluteCN} \\
--outvcf & return.vcf & \Rfunction{predictSomatic} \\
--out -o & & \\
--parallel & BPPARAM & \Rfunction{runAbsoluteCN} \\
--seed & & \\
--version -v & & \\
--force -f & & \\
--help -h & & \\
\bottomrule
\end{tabular}
\end{table*}

\begin{table*}
\caption{Dx}
\begin{tabular}{lll}
\toprule
Argument name & Corresponding PureCN argument & PureCN function \\
\midrule
--rds & file.rds & \Rfunction{readCurationFile} \\
--callable & callable & \Rfunction{callMutationBurden} \\
--exclude & exclude & \Rfunction{callMutationBurden} \\
--maxpriorsomatic & max.prior.somatic & \Rfunction{callMutationBurden} \\
--signatures & & \Rfunction{deconstructSigs::whichSignatures} \\
--out & & \\
--version -v & & \\
--force -f & & \\
--help -h & & \\
\bottomrule
\end{tabular}
\end{table*}

\clearpage

\subsection*{Session Info}

<>=
toLatex(sessionInfo())
@

\end{document}