%\VignetteIndexEntry{CopywriteR} %\VignetteDepends{CopywriteR} %\VignetteKeywords{Copy number detection from targeted sequencing using off-target reads.} %\VignettePackage{CopywriteR} %\VignetteEngine{utils::Sweave} \documentclass{article} <>= BiocStyle::latex() @ \newcommand{\exitem}[3]{% \item \texttt{\textbackslash#1\{#2\}} #3 \csname#1\endcsname{#2}.% } \title{\bf CopywriteR: DNA copy number detection from off-target sequence data.} \author{Thomas Kuilman} \begin{document} \maketitle \begin{center} Department of Molecular Oncology \\* Netherlands Cancer Institute \\* The Netherlands \\* \end{center} \begin{center} \email{t.kuilman@nki.nl} or \email{thomaskuilman@yahoo.com} \end{center} \tableofcontents %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Overview} To extract copy number information from targeted sequencing data while circumventing problems related to the use of on-target reads, we developed the CopywriteR package. Unlike other currently available tools, CopywriteR is based on off-target reads. We have shown that this approach has several advantages relative to current approaches \cite{s13059-015-0617-1} and it constitutes a viable alternative for copy number detection from targeted sequencing data. \section{CopywriteR workflow} CopywriteR uses .bam files as an input. These files are processed in several steps to allow copy number detection from off-target reads of targeted sequencing efforts. These steps include: \begin{itemize} \item Removing of low-quality and anomalous reads \item Peak calling (in reference when available, otherwise in sample itself) \item Discarding of reads in peak regions \item Counting reads on bins of pre-defined size \item Compensating for the difference in effective bin size upon discarding peaks in peak regions \item Correcting for GC-content and mappability; applying a blacklist filter for CNV regions \end{itemize} \section{Analysis workflow} The full analysis of copy number data with CopywriteR includes three sequential steps, using the \Rcode{preCopywriteR()}, \Rcode{CopywriteR()} and \Rcode{plotCNA()} functions respectively. In this analysis workflow, we will use CopywriteR to extract copy number information from whole-exome sequening data from a murine small-cell lung cancer. The complete dataset can be downloaded from the European Nucleotide Archive (\url{http://www.ebi.ac.uk/ena/home}) using the accession number PRJEB6954. In this vignette, only the sequence reads on chromosome 4 are used. The .bam file, which has been pre-filtered to minimize disk usage, is contained in the \Biocpkg{SCLCBam} package. Before starting the analysis, GC-content and mappability helper files need to be created for the desired bin size and reference genome. This is done using the \Rcode{preCopywriteR()} function. \subsection{\Rcode{preCopywriteR()}} CopywriteR uses binned read count data as a basis for copy number detection. We have pre-assembled 1 kb bin GC-content and mappability information for the hg18, hg19, hg38, mm9 and mm10 reference genomes in the \Biocpkg{CopyhelpeR} package. The \Rcode{preCopywriteR()} function allows to create the necessary helper files. These files can be created for any custom bin size that is a multiple of 1000 bp, and for the above-mentioned reference genomes. The helper files are saved in a new folder that is names after the relevant reference genome, the bin size, and the prefix (all separated by '\symbol{95}'). The helper files are required to run CopywriteR. If previously created helper files for a specific bin size / reference genome / prefix combination are available, the {\Rcode{preCopywriteR()} step can be omitted. In this example, CopywriteR will be applied to a bin size of 20 kb. We recommend to use a bin size of 20 kb when trying to analyze whole-exome sequencing data, while we would start analyzing at 50 kb resolution for targeted sequencing on smaller gene panels. As the .bam file contained in the \Biocpkg{SCLCBam} package is mapped to the mm10 reference genome, we run {\Rcode{preCopywriteR()} as follows: <>= library(CopywriteR) @ <>= data.folder <- tools::file_path_as_absolute(file.path(getwd())) @ <>= preCopywriteR(output.folder = file.path(data.folder), bin.size = 20000, ref.genome = "mm10_4") @ Please not that the "mm10\symbol{95}4" reference genome exclusively contains helper files for chromosome 4, and therefore should only be used when replicating the analysis in this vignette. The custom helper files are placed in a folder that is named after the reference genome and the bin size (in this example: mm10\symbol{95}4\symbol{95}20kb). This folder, on its turn, is placed inside the folder specified by the output.folder argument. The bin.size argument specifies the custom bin size (in bp), and the ref.genome argument can be either "hg18", "hg19", "hg38", "mm9" or "mm10". The default chromosome notation is "1", "2", ..., "X", "Y". The prefix argument is optional and can be used to add a prefix to these names. If you are interested in getting helper files for other genomes, please contact me by the email addresses supplied at the beginning of this vignette. <>= list.dirs(path = file.path(data.folder), full.names = FALSE)[2] @ <>= list.files(path = file.path(data.folder, "mm10_4_20kb"), full.names = FALSE) @ The data are contained in two GRanges objects (from the \Biocpkg{GenomicRanges} package) of which one (blacklist.rda) contains regions of copy number variation: <>= load(file = file.path(data.folder, "mm10_4_20kb", "blacklist.rda")) blacklist.grange @ The other helper file (GC\symbol{95}mappability.rda) contains GC-content and mappability information for a particular bin size: <>= load(file = file.path(data.folder, "mm10_4_20kb", "GC_mappability.rda")) GC.mappa.grange[1001:1011] @ \subsection{\Biocpkg{BiocParallel} for parallel computing} CopywriteR fully supports paralel computing and is implemented in such a way that every sample is processed on a single core. CopywriteR uses the \Biocpkg{BiocParallel} for parallel computing. This package requires the user to specify which parallel environment is used by creating an instance of \Rcode{BiocParallelParam}. In the example case, we are using the \Rcode{snow} environment and specify a \Rcode{SnowParam}: <>= bp.param <- SnowParam(workers = 1, type = "SOCK") bp.param @ For using other instances of the \Rcode{BiocParallelParam} parameter, please refer to the \Biocpkg{BiocParallel} package vignette. The bp.param parameter can now be passed on to \Rcode{CopywriteR()} to allow parallel computation. \subsection{\Rcode{CopywriteR()}} The \Rcode{CopywriteR()} function allows to calculate binned read count data based on the helper files created by \Rcode{preCopywriteR()}. CopywriteR uses a peak calling algorithm to remove 'on-target' reads. For every sample that is to be analyzed by CopywriteR, the user can specify on which sample the peak regions should be identified. The argument that controls this is sample.control. This \Rcode{matrix} or \Rcode{data.frame} should contain path names to 'samples' in the first column, and their corresponding 'controls', in the second column. The peaks called in a 'control' are used to remove reads from the corresponding 'sample'. Please note that any sample that is to be used by the downstream plotCNA function (including those that are only to be used as a reference) needs to be analyzed by the CopywriteR function. In the example used here, no matched germline control is available, and peaks are called on the sample itself. Therefore the paths specified in the samples and controls columns of the sample.control \Rcode{data.frame} below are identical. <>= path <- SCLCBam::getPathBamFolder() samples <- list.files(path = path, pattern = ".bam$", full.names = TRUE) controls <- samples sample.control <- data.frame(samples, controls) CopywriteR(sample.control = sample.control, destination.folder = file.path(data.folder), reference.folder = file.path(data.folder, "mm10_4_20kb"), bp.param = bp.param) @ The destination.folder argument determines in which folder the output folder is going to be placed, and the reference.folder points to the location of the custom bin size helper files. The keep.intermediary.files is an optional argument that determines whether intermediary .bam, .bai and peak .bed files are kept. The default value is FALSE to limit disk space usage; it can be set to TRUE however for troubleshooting purposes. Finally, the capture.regions.file argument is optional, and can be used to define the location of a capture regions bed file. If this file is specified, statistics of the overlap of these regions with the called peaks is provided. CopywriteR will create a new folder named CNAprofiles in the directory specified by destination.folder. This folder contains the following files: <>= cat(list.files(path = file.path(data.folder, "CNAprofiles")), sep = "\n") @ The read\symbol{95}counts.txt file contains both uncompensated and compensated read counts for every sample, as well as the corresponding fraction.of.bin values. The 'fraction of bin' indicates what fraction of the bin is not on peaks, and therefore effectively contributes to the read count. In our example, the read\symbol{95}counts.txt has the following content: <>= read.table(file = file.path(data.folder, "CNAprofiles", "read_counts.txt"), header = TRUE)[1001:1006, ] @ The log2\symbol{95}read\symbol{95}counts.igv file can be opened in the IGV browser (\url{http://www.broadinstitute.org/igv/}), and contains log2-transformed, normalized (ratios of) read counts. These data can be used for further downstream analysis, and are required for \Rcode{plotCNA()} to allow plotting of the copy number profiles: <>= read.table(file = file.path(data.folder, "CNAprofiles", "log2_read_counts.igv"), header = TRUE)[817:822, ] @ The input.Rdata file contains a number of variables that are required to run the last function of the CopywriteR package, \Rcode{plotCNA()}. The CopywriteR.log file contains log information of the R commands that have been used to perform the various subfunctions, and specifications of the input material. Finally, the qc folder contains two types of quality control plots. <>= cat(list.files(path = file.path(data.folder, "CNAprofiles", "qc")), sep = "\n") @ The fraction.of.bin files contain the empirical cumulative distribution function for the 'fractions of bins'. The read.counts.compensated files contain the plots and the loesses that are used for GC-content and mappability corrections. \subsection{\Rcode{plotCNA()}} The \Rcode{plotCNA()} function allows segmentation of the copy number data using DNAcopy \cite{Olshen:2004ge}, and subsequent plotting. We run the plotting function as follows: <>= plotCNA(destination.folder = file.path(data.folder)) @ The plotting function saves the DNAcopy object containing the segmentation values in the segment.Rdata file. In addition, it creates the folder 'plots': <>= cat(list.files(path = file.path(data.folder, "CNAprofiles")), sep = "\n") @ For every sample a separate directory is created: <>= cat(list.files(path = file.path(data.folder, "CNAprofiles", "plots")), sep = "\n") @ The samples are plotted per chromosome, as well as in a genome-wide fashion. We provide here the result for chromosome 4: \begin{center} \includegraphics[width=\textwidth]{CNAprofiles/plots/log2.T43_4.bam.vs.none/chrom_4} \end{center} In addition to the plots, the raw segmented values can be obtained from the DNAcopy object 'segment.Rdata'. We are interested in further improving CopywriteR, so if CopywriteR fails to analyze your samples, please don't hesitate to contact me. In this case, please provide the full command line output and the log-file. \section{Troubleshooting} We maintain a troubleshooting GitHub page where various issues are discussed that have been raised by CopywriteR users. Please follow this link: \url{https://github.com/PeeperLab/CopywriteR#troubleshooting}. \newpage \section{Session Information} The version number of \R{} and packages loaded for generating the vignette were: <>= toLatex(sessionInfo()) @ \bibliography{CopywriteR} \end{document}