% \VignetteIndexEntry{CNVPanelizer} % \VignetteKeywords{CNVPanelizer R} % \VignetteKeywords{Bootstrapping Panel Sequencing} % \VignettePackage{CNVPanelizer} %\VignetteCompiler{knitr} %\VignetteEngine{knitr::knitr} \documentclass{article} % For squares with numbers % (it was not showing the list of references at the end..) %\usepackage[style=numeric,backend=bibtex]{biblatex} \usepackage[backend=bibtex]{biblatex} % For squares with text % For name and year %\usepackage[style=authoryear,natbib=true,backend=bibtex]{biblatex} \usepackage[T1]{fontenc} \usepackage[sc]{mathpazo} \usepackage{graphics} \usepackage{float} % Required to present the reference list \usepackage{lscape} % Because landscape pages.. \usepackage{rotating} \usepackage{adjustbox} \usepackage{pdflscape} % to present pages in landscape in horizontal view %\usepackage{capt-of} %\usepackage{floatrow} % atempt to display plot without resizing \renewcommand{\sfdefault}{lmss} \renewcommand{\ttdefault}{lmtt} \setcounter{secnumdepth}{5} % Enable level 4-5 \setcounter{tocdepth}{4} % this needs to be after the usepackage otherwise % the references do no show at the end ?!?! %\addbibresource{CNVPanelizer.bib} % Specifying the packages.bib file \title{CNVPanelizer: Reliable CNV detection in target sequencing applications} \author{Cristiano Oliveira {\tt }} \author{ Oliveira, Cristiano\\ \texttt{cristiano.oliveira@med.uni-heidelberg.de} \and Wolf, Thomas\\ \texttt{thomas\_Wolf71@gmx.de} } \usepackage[margin=1in, a4paper]{geometry} \usepackage{hyperref} \hypersetup{ colorlinks, citecolor=black, filecolor=black, linkcolor=black, urlcolor=black } \begin{document} \maketitle %\renewcommand\contentsname{Table of contents} %Just to change title of TOC %\tableofcontents Amplicon based targeted sequencing, over the last few years, has become a mainstay in the clinical use of next generation sequencing technologies. For the detection of somatic and germline SNPs this has been proven to be a highly robust methodology. One area of genomic analysis which is usually not covered by targeted sequencing, is the detection of copy number variations (CNVs). While a large number of available algorithms and software address the problem of CNV detection in whole genome or whole exome sequencing, there are no such established tools for amplicon based targeted \mbox{sequencing}. We introduced a novel algorithm for the reliable detection of CNVs from targeted sequencing. \section{Introduction} To assess if a region specific change in read counts correlates with the presence of CNVs, we implemented an algorithm that uses a subsampling strategy similar to Random Forest to reliable predict the presence of CNVs. We also introduce a novel method to correct for the background noise introduced by sequencing genes with a low number of amplicons. To make it available to the community we implemented the algorithm as an R package. \section{Using} This section provides an overview of the package functions. \subsection{Installing and Loading the package} The package is available through the Bioconductor repository and can be installed and loaded using the following R commands: <>= # To install from Bioconductor source("http://bioconductor.org/biocLite.R") biocLite("CNVPanelizer") @ <>= # To load the package library(CNVPanelizer) @ \newpage \subsection{Reading data} \subsubsection{BED file} % \subparagraph{} % level 5 %\subsubsubsection{Loading BED data} The BED file is required to obtain amplicon and gene name information associated with the panel. <>= # Bed file defining the amplicons bedFilepath <- "/somePath/someFile.bed" # The column number of the gene Names in the BED file. amplColumnNumber <- 4 # Extract the information from a bed file genomicRangesFromBed <- BedToGenomicRanges(bedFilepath, ampliconColumn = amplColumnNumber, split = "_") metadataFromGenomicRanges <- elementMetadata(genomicRangesFromBed) geneNames = metadataFromGenomicRanges["geneNames"][, 1] ampliconNames = metadataFromGenomicRanges["ampliconNames"][, 1] @ \subsubsection{Selecting files} Two sets of data are required. The samples of interest and the set of reference bam files to compare against. <>= # Directory with the test data sampleDirectory <- "/somePathToTestData" # Directory with the reference data referenceDirectory <- "/somePathToReferenceData" # Vector with test filenames sampleFilenames <- list.files(path = sampleDirectory, pattern = ".bam$", full.names = TRUE) # Vector with reference filenames referenceFilenames <- list.files(path = referenceDirectory, pattern = ".bam$", full.names = TRUE) @ %\paragraph{Counting Reads} \subsubsection{Counting reads} The reads were counted using a wrapper function around the \textbf{ExomeCopy} package from the Bioconductor project. All reads overlapping with the region of an amplicon were counted for this amplicon. Only reads with a mapping quality $\geq 20$ were counted and the function allows to remove PCR Duplicates. For reads with the same start site, end site and chromosomal orientation only one is kept. PCR duplicates might cause a bias for the ratio between reference and the samples of interest. Thus this serves as an additional quality control step in the CNV detection pipeline. It is only recommended for Ion Torrent generated data. For Illumina data this step is not recommended. \newpage <>= # Should duplicated reads (same start, end site and strand) be removed removePcrDuplicates <- FALSE # TRUE is only recommended for Ion Torrent data # Read the Reference data set referenceReadCounts <- ReadCountsFromBam(referenceFilenames, genomicRangesFromBed, sampleNames = referenceFilenames, ampliconNames = ampliconNames, removeDup = removePcrDuplicates) # Read the sample of interest data set sampleReadCounts <- ReadCountsFromBam(sampleFilenames, genomicRangesFromBed, sampleNames = sampleFilenames, ampliconNames = ampliconNames, removeDup = removePcrDuplicates) @ \subsubsection{Using Synthetic Data} We also make available synthetic data to test the functions. The following examples make use of two generated data sets, one for the reference and the other as a testing set <>= data(sampleReadCounts) data(referenceReadCounts) # Gene names should be same size as row columns # For real data this is a vector of the genes associated # with each row/amplicon. For example Gene1, Gene1, Gene2, Gene2, Gene3, ... geneNames <- row.names(referenceReadCounts) # Not defined for synthetic data # For real data this gives a unique name to each amplicon. # For example Gene1:Amplicon1, Gene1:Amplicon2, Gene2:Amplicon1, # Gene2:Amplicon2, Gene3:Amplicon1 ... ampliconNames <- NULL @ \subsection{Normalization} To account for sample and sequencing run specific variations the counts obtained for each sample were normalized using a wrapper function around the \textbf{tmm} normalization function from the \textbf{NOISeq} package. <>= normalizedReadCounts <- CombinedNormalizedCounts(sampleReadCounts, referenceReadCounts, ampliconNames = ampliconNames) # After normalization data sets need to be splitted again to perform bootstrap samplesNormalizedReadCounts = normalizedReadCounts["samples"][[1]] referenceNormalizedReadCounts = normalizedReadCounts["reference"][[1]] @ \subsection{Bootstrap based CNV} This aproach is similar to the Random Forest method, which bootstraps the samples and subsamples the features. In our case features would be equivalent to amplicons. The subsampling procedure is repeated n times to generate a large set of randomized synthetic references $B = b_1,\ldots,b_n$ by selecting with replacement (bootstrapping) from the set of reference samples. The ratio between the sample of interest and each randomized reference is calculated for each gene, using only a randomly selected subset of amplicons. To get an estimate of significance, the 95 percent confidence interval of the bootstrapping/subsampling ratio distribution was calculated. A significant change is considered as an amplification for a lower bound ratio > 1 and a deletion for an upper bound ratio < 1. The confidence interval was calculated using the 0.025 and 0.975 percent quantiles of the bootstrapping/subsampling ratio distribution. <>= # Number of bootstrap replicates to be used replicates <- 10000 @ <>= # To speed up vignette generation while debugging #replicates <- 10 @ <>= # Perform the bootstrap based analysis bootList <- BootList(geneNames, samplesNormalizedReadCounts, referenceNormalizedReadCounts, replicates = replicates) @ \subsection{Background Estimation} Not all genes have the same number of amplicons $|A_G|$ and it has been shown that sequencing genes with a higher number of amplicons yields better sensitivity and specifity when detecting putative copy number variations. Still genes sequenced with a low number of amplicons might still show significant changes in observed read counts. While normalization makes the comparison of read counts comparable between samples, genes with a small number of amplicons might still show a bias. To quantify the effect of a low number of amplicons on the calling of CNVs we introduced a background noise estimation procedure. Using the ratio between the mean reference and the sample used for calling we subsample for each unique number of amplicons. In the case of two amplicons we repeatedly sample two random amplicons from the set of all amplicons, and average the ratios. Amplicons that belong to genes showing significant copy number variations $G_{sig}$ are not included in the subsampling pool. Each amplicon is weighted according to the number of amplicons the respective gene has $w_A = \frac{1}{|A_g|}$. Thus the probablity of sampling from a gene is the same regardless of the number amplicons. For each number of amplicons a background noise distribution is estimated. The reported background is defined by the lower noise $meanNoise + qnorm(0.025) * standardDeviationNoise$ and the upper noise $meanNoise + qnorm(0.975) * standardDeviationNoise$ of the respective distribution. This defines the 95 percent confidence interval. The significance level can also be passed as parameter. Less conservative detection can be achieved by setting the significanceLevel = 0.1. To calculate the mean and standard deviation (sd) of ratios the log ratios were used. If setting robust = TRUE, median is used instead of mean and mad (median absolute deviation) replaces sd. The upper bound ratio has to be below the lower noise (deletion) or the lower bound ratio above the upper noise (amplification) to be considered reliable. <>= # Estimate the background noise left after normalization backgroundNoise <- Background(geneNames, samplesNormalizedReadCounts, referenceNormalizedReadCounts, bootList, replicates = replicates, significanceLevel = 0.1, robust = TRUE) @ \subsection{Results} To analyse the results we provide two outputs. A plot which shows the detected variations, and a report table with more detailed information about those variations. \subsubsection{Report} The report describes the bootstrap distribution for each gene. An example can be found in figure~\ref{fig:reportTable}. The final report is genewise, and is based on the bootstrapping. <>= # Build report tables reportTables <- ReportTables(geneNames, samplesNormalizedReadCounts, referenceNormalizedReadCounts, bootList, backgroundNoise) @ At the figure~\ref{fig:reportTable} we can see an example of the report table for a single sample. \newpage \begin{landscape} \begin{figure}[H] <>= options(width=500) # to show the entire table.. # to avoid have to print to other page.. numberOfGenesViewport = 20 #if (nrow(reportTables[[1]]) > numberOfGenes) { # numberOfGenes = numberOfGenesViewport #} else { # numberOfGenes = nrow(reportTables[[1]]) #} #reportTables[[1]][1:numberOfGenes, ] # index of the sample to show sampleIndexToShow = 2 # TODO improve this.. remove the column.. # but for now just hide "Signif." column... no space available #indexToHide <- which(colnames(reportTables[[1]])=="Signif.") #indexToHide <- which(colnames(reportTables[[1]])=="MeanRatio") indexToHide <- which(colnames(reportTables[[1]]) %in% c("MeanRatio", "Passed")) reportTables[[sampleIndexToShow]][1:numberOfGenesViewport, -c(indexToHide)] #reportTables[[sampleIndexToShow]][1:numberOfGenesViewport, ] @ \caption{Sample report table.} \label{fig:reportTable} \end{figure} \begin{table}[h] \begin{tabular}{| l | p{15cm} |} \hline % Mean Ratio & The mean amplicon ratio between reference and sample % of interest for the gene\\ \hline % LBBtp | MeanBtp | UBBtp Ratio & The Lower, Mean and Upper bounds of the % Bootstrap Ratio distribution. \\ \hline LowerBoundBoot | MeanBoot | UpperBoundBoot & The confidence interval of the Bootstrap distribution \\ \hline % 5\% | 50\% | 95\% Quantile & The quantiles of the bootstrap % distribution\\ \hline Lower | Mean | Upper Noise & The Lower/Mean/Upper background noise bounds \\ \hline Signif. & Boolean value representing if read counts differences are Significant \\ \hline AboveNoise & Boolean value representing if LowerBoundBoot is above UpperNoise (UpperBoundBoot below LowerNoise) \\ \hline Amplicons & Number of Amplicons associated with the gene \\ \hline PutativeStatus & The detected level of change \\ \hline Passed & if both Signif. and AboveNoise are TRUE then 2, if only one is TRUE then 1 and if both are FALSE then 0 \\ \hline \end{tabular} \caption{Report table Column Description} \label{table:reportTableColumnDescription} \end{table} \end{landscape} \newpage \subsubsection{Plots} The generated plots (one per test sample) show the bootstrap distribution for each gene. The function \textbf{PlotBootstrapDistributions} generates a list of plots for all test samples (A plot per sample). At the figure~\ref{fig:BootstrapPlot} we can see an example of the plot for a single sample. <>= # Directory where the generated files with the analysis results will be saved. outputDirectory <- "./tmp" dir.create(outputDirectory, recursive = TRUE) # Export the report tables to excel format reportTablesFilepath <- file.path(outputDirectory, "report_tables.xlsx") WriteListToXLSX(reportTables, reportTablesFilepath) # # Export read counts to excel format readCountsFilepath <- file.path(outputDirectory, "readCounts.xlsx") normalizedReadCountsFilepath <- file.path(outputDirectory, "normalizedReadCounts.xlsx") WriteListToXLSX(list(samplesReadCount = sampleReadCounts, referenceReadCounts = referenceNormalizedReadCounts), readCountsFilepath) WriteListToXLSX(list(samplesReadCount = samplesNormalizedReadCounts, referenceReadCounts = referenceNormalizedReadCounts), normalizedReadCountsFilepath) justToHide = PlotBootstrapDistributions(bootList, reportTables, outputFolder=outputDirectory, save=TRUE) save.image(file = file.path(outputDirectory, "RSession.Rdata")) @ <>= PlotBootstrapDistributions(bootList, reportTables) @ % \begin{figure}[H] % <>= % PlotBootstrapDistributions(bootList, reportTables, outputDirectory) %[[sampleIndexToShow]] % @ % \caption{Sample plot of a sample.} % \label{fig:bootstrapPlot} % \end{figure} \subsection{Shiny App} To allow a more user friendly interaction and quick testing of CNVPanelizer, a shiny app is made available through the following command: <>= # default port is 8100 RunCNVPanelizerShiny(port = 8080) @ \begin{landscape} <>= PlotBootstrapDistributions(bootList, reportTables)[[sampleIndexToShow]] @ \begin{table}[hb] \centering \begin{tabular}{| l | p{15cm} |} \hline No Change & There is no change in the read count \\ \hline Non Reliable Change & The change is significant but below the noise level for the number of amplicons associated with this gene\\ \hline Reliable Change & The change is above the noise level and significant \\ \hline \end{tabular} \label{table:plotLabelDescription} \caption{Description of the CNV detection levels} \end{table} % % \begin{center} % \begin{tabular}{| l | l | l | p{15cm} |} % \hline % Day & Min Temp & Max Temp & Summary \\ \hline % Monday & 11C & 22C & A clear day with lots of sunshine. % However, the strong breeze will bring down the temperatures. \\ \hline % Tuesday & 9C & 19C & Cloudy with rain, across many northern regions. % Clear spells % across most of Scotland and Northern Ireland, % but rain reaching the far northwest. \\ \hline % Wednesday & 10C & 21C & Rain will still linger for the morning. % Conditions will improve by early afternoon and continue % throughout the evening. \\ % \hline % \end{tabular} % \end{center} % % With width specified: % \begin{center} % \begin{tabular}{ | l | l | l | p{5cm} |} % \hline % Day & Min Temp & Max Temp & Summary \\ \hline % Monday & 11C & 22C & A clear day with lots of sunshine. % However, the strong breeze will bring down the temperatures. \\ \hline % Tuesday & 9C & 19C & Cloudy with rain, across many northern regions. % Clear spells % across most of Scotland and Northern Ireland, % but rain reaching the far northwest. \\ \hline % Wednesday & 10C & 21C & Rain will still linger for the morning. % Conditions will improve by early afternoon and continue % throughout the evening. \\ % \hline % \end{tabular} % \end{center} \end{landscape} %\begin{figure}[H] %\centering % \includegraphics[keepaspectratio=true,width=\textwidth]{./imgs/some.png} % \caption{Sample plot per sample.} % \label{fig:verticalcell} %\end{figure} %\listoffigures %\listoftables %\printbibliography \end{document}