%\VignetteIndexEntry{Detection of de novo copy number alterations in case-parent trios} %\VignetteDepends{VanillaICE} %\VignetteKeywords{MinimumDistance, copy number, SNP, case-parent trios, de novo} %\VignettePackage{MinimumDistance} \documentclass{article} \usepackage{graphicx} %\usepackage[utf8]{inputenc} %\usepackage[T1]{fontenc} \usepackage{natbib} \usepackage{url} \usepackage{amsmath} \usepackage{amssymb} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rcode}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textsf{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\R}{\textsf{R}} \newcommand{\md}{\Rpackage{MinimumDistance}} \newcommand{\F}{\mathrm{F}} \newcommand{\M}{\mathrm{M}} \newcommand{\Of}{\mathrm{O}} \newcommand{\RR}{\mathrm{LR}} \newcommand{\LR}{\mathrm{LR}} \newcommand{\Blrr}{\mbox{\boldmath $R$}} \newcommand{\Bbaf}{\mbox{\boldmath $B$}} \newcommand{\logRratio}{$\log_2$ R ratio} \newcommand{\lrrlong}{$\log_2$ R ratio} \newcommand{\blrr}{\mbox{\boldmath $r$}} %\newcommand{\bbaf}{\mbox{\boldmath $b$}} \newcommand{\baf}{B allele frequency} \newcommand{\bafs}{B allele frequencies} \newcommand{\tsl}{trioSetList} \newcommand{\ts}{trioSet object} \newcommand{\mindist}{\ensuremath{\mbox{\boldmath $d$}}} \DeclareMathOperator{\I}{\mathbb{I}} \usepackage[margin=1in]{geometry} \title{Detection of de novo copy number alterations in case-parent trios using the \R{} package \md{}} \date{\today} \author{Moiz Bootwalla and Rob Scharpf} \begin{document} \maketitle <>= options(prompt="R> ", continue=" ", device=pdf, width=65) @ \begin{abstract} For the analysis of case-parent trio genotyping arrays, copy number variants (CNV) appearing in the offspring that differ from the parental copy numbers are often of interest (de novo CNV). This package defines a statistic, referred to as the minimum distance, that can be useful for identifying de novo copy number alterations in the offspring. We smooth the minimum distance using the circular binary segmentation algorithm implemented in the Bioconductor package \Rpackage{DNAcopy}. Trio copy number states are inferred from the maximum a posteriori probability of the segmented data, incorporating information from the log R ratios and B allele frequencies. As both log R ratios and B allele frequencies can be estimated from Illumina and Affymetrix arrays, this package supports de novo copy number inference in both platforms. \end{abstract} \section{Introduction} There are numerous \R{} packages available from Bioconductor for smoothing copy number alterations. For example, the biocview \texttt{CopyNumberVariants} in the 2.9 release of Bioconductor lists 27 packages. For the analysis of germline diseases, hidden Markov models have emerged as a popular tool as inference regarding the latent copy number state incorporates information from both the estimates of copy number (or relative copy number) and the allelic frequencies, such as the B allele frequency \citep{Peiffer2006} or genotype calls \citep{Colella2007, Wang2008, Scharpf2008}. For the analysis of somatic cell diseases such as cancer, algorithms that segment the genome into regions of constant copy number (referred to here as segmentation algorithms) may be more preferable for the detection of copy number aberrations (CNA) as a mixture of cells with different copy numbers give rise to mosaic (non-integer) copy numbers. Examples include circular binary segmentation implemented in the \R{} package \Rpackage{DNAcopy} and the \Rpackage{GLAD}, both of which were orginally developed for array CGH platforms \cite{Olshen2004,Hupe2004,Venkat2007}. One disadvantage of segmentation algorithms is that inference regarding duplication and deletions is not directly available. More recently, HMMs and segmentation algorithms have been developed for inferring common regions of copy number alterations in multiple samples. However, relatively few algorithms are available for inferring copy number alterations, especially de novo copy number alterations, in family-based study designs involving case-parent trios. Instead, a common strategy has been a two-step approach of identifying copy number alterations in the individual samples and then comparing the results across samples to infer whether an alteration observed in the offspring is de novo. A disadvantage of the two-step approach is that unadjusted sources of technical variation, such as waves, can contribute to false positives. To our knowledge, the joint HMM implemented in the PennCNV software is the only software to date that provides direct inference regarding de novo alterations in case parent study designs. This package develops an alternative framework for inferring regions of de novo copy number alterations in case-parent trios. Like PennCNV, inference regarding de novo alterations integrates information from both the log R ratios and B allele frequencies. Differences in the two approaches are most apparent in non-HapMap experimental datasets in which technical and experimental sources of variation appear to contribute to a large number of false positives. This vignette describes the analysis pipeline from preprocessed and normalized estimates of copy number and allele frequencies to inference regarding de novo copy number alterations. The workflow is illustrated using publicly available HapMap trios assayed on the Illumina 610quad array. Section \ref{smoothing} outlines two approaches for processing the data: Section \ref{option1} describes a pipeline in which a list of files are provided as input and the output is a \Robject{RangedDataCBS} object of genomic intervals and the inference for the trio copy number state; Section \ref{option2} provides a more detailed workflow in which intermediate data structures are created to facilitate visualization of the low-level copy number summaries along with inference from the \Rpackage{MinimumDistance} algorithm. Details regarding the intermediate data structures and visualization are described in Sections \ref{option2} and \ref{viz}, respectively. Where possible, we have adopted standard data structures for encapsulating the low-level data (\Robject{eSet} extensions) and the segmented data (\Robject{RangedData} extensions). \section{Large data applications and parallel processing} A separate vignette for using \Rpackage{MinimumDistance} with large datasets is forthcoming. For now, we remark that the discussion regarding the infrastructure for large data support and parallelization in the \R{} package \Rpackage{oligoClasses} is similar. In particular, to use a cluster we check for three requirements: 1) 'ff' is loaded; 2) 'snow' is loaded; and 3) the 'cluster' option is set. For example, in the following unevaluated code chunk would enable large data support and parallelization with 22 worker nodes. Finally, a check that the three requirements are satisfied is obtained with the function \Rfunction{parStatus} in the \Rpackage{oligoClasses} package. <>= library(snow) library(doSNOW) library(oligoClasses) cl <- makeCluster(22, type="SOCK") registerDoSNOW(cl) parStatus() @ Had we executed the preceding code chunk, the output from this vignette would be more or less the same. However, the containers for storing the low level copy number summaries and allele frequencies would contain \Rpackage{ff}-objects and many of the chromosome-specific tasks that can easily be parallelized such as segmentation and the posterior calling steps would be split to multiple workers. When parallelization is not enabled, one may want to 'register' the backend so that annoying warning messages do not appear during processing. This is accomplished using the function \Rfunction{registerDoSEQ} in the \Rpackage{foreach} package. <>= library(SNPchip) library(GenomicRanges) library(oligoClasses) library(MinimumDistance) foreach::registerDoSEQ() @ \section{Detection of de novo copy number} \label{smoothing} There are two options for processing. Section \label{option1} provides a simple pipeline for reading a list of files containing log R ratios and B allele frequencies and returning the segmented data with the trio copy number calls. The second option in Section \label{option2} provides a more detailed workflow of de novo copy number analysis with intermediate data structures useful for storing the log R ratios and B allele frequencies in a manner than can be easily accessed for subsequent plotting. The latter can be particularly useful for visual inspection of the trio copy number calls. \subsection{Simple Usage: generating ranged data with posterior calls} \label{option1} Here, we illustrate a simple pipeline for reading in BeadStudio files and returning a list of genomic ranges. There are 3 BeadStudio files provided with the \Rpackage{MinimumDistance} package -- one for the father (``F.txt''), mother (``M.txt''), and offspring (``O.txt''). In order to keep the size of the \Rpackage{MinimumDistance} package small, these files contain data for only 1000 markers. <>= path <- system.file("extdata", package="MinimumDistance") fnames <- list.files(path, pattern=".txt") fnames @ Before reading the data, we create a \Robject{data.frame} indicating the filenames for the father, mother, and offspring. In the following codechunk, we have a single file for the father, mother, and offspring (one trio). For processing multiple trios, the arguments to \Rfunction{data.frame} can be vectors where the ith element in each vector are the files for one trio. <>= pedigreeInfo <- data.frame(F="F.txt", M="M.txt", O="O.txt") @ \noindent Note that the sample identifiers for father and mother can be duplicated, but the offspring sample identifiers must be unique. We encapsulate this information more formally in an S4 class called \Rclass{Pedigree} and use the \Rfunction{Pedigree} constructor to generate an instance of the class: <>= ped <- Pedigree(pedigreeInfo) ped @ %The function \Rfunction{callDenovoSegments} reads the BeadStudio files %using information contained in the \Robject{ped} object, computes the %minimum distance (see Section \ref{option2}), segments the minimum %distance using circular binary segmentation, and calls the trio copy %number state using the maximum posterior probability. % %<>= %trace(callDenovoSegments, browser) %map.segs <- callDenovoSegments(path=path, % pedigreeData=ped, % cdfname="human610quadv1b", % chromosome=1, % segmentParents=FALSE, % genome="hg18") %%@ % %The object returned by \Rfunction{callDenovoSegments} is an object of %class \Robject{RangedDataHMM}. The numeric $xyz$ trio state symbol %indicates that the father has $x$ copies, the mother has $y$ copies, %and the offspring has $z$ copies. For example, the symbol for state %'321' would indicate a single copy duplication for the father, diploid %copy number for the mother, and a hemizygous deletion in the %offspring. In the above example, de novo copy number alterations were %not detected and the state '222' indicates that the father, mother, %and offspring appear to be diploid for the interval beginning at %{start(map.segs)} bp and ending at {end(map.segs)}. %\subsection{Pipeline using intermediate data strutures for storing log %R ratios and B allele frequencies} %\label{option2} % %This section describes a pipeline that permits the user to manipulate %and plot intermediate data structures. \paragraph{Instantiating a \Robject{TrioSetList} object}. A key design consideration for data structures that capture information on the low-level copy number and allele frequencies is that the `unit' of our analysis is a trio. A common query is to access the low level statistical summaries for a set of markers for all members in a trio. By organizing statistical summaries in arrays with dimension \texttt{feature x trios x family member} such queries are straightforward. As the number of trios in a typical genome-wide association study may exceed 1000, it may be impractical to store the low-level summaries in a single array. To make the size of the arrays more manageable and to facilitate parallelization, we organize the low-level summaries by chromosome using the \Rclass{TrioSetList} class. The \Rclass{TrioSetList} class has a slot for storing an object of class \Robject{Pedigree} (discussed previously), as well as slots for storing sample-level covariates on the offspring (\Robject{phenoData}), mother (\Robject{motherPhenoData}), and the father (\Robject{fatherPhenoData}). Often technical, experimental, and phenotypic covariates for the samples in an experiment are stored in a single tab-delimited file in which columns are covariates and rows are samples. Such a spreadsheed can be read into \R{} as a \Robject{data.frame}. Supplying such a \Robject{data.frame} to the argument \Robject{sample.sheet} in the \Rfunction{TrioSetList} constructor function will build \Robject{AnnotatedDataFrame} objects for father, mother, and offspring, extracting the appropriate rows (samples) from the \Robject{data.frame}. For example, in the following codechunk we load a \Robject{data.frame} called `sample.sheet' and pass the \Robject{sample.sheet} object to the \Rfunction{TrioSetList} constructor. In addition, we load a matrix of log R ratios and a matrix of B allele frequencies. The matrices for the log R ratios and B allele frequences contain only 25 markers for each chromosome as the intent is to illustrate the steps required for initializing a data structure of the \Robject{TrioSetList} class. Using the information stored in the \Robject{Pedigree} class, the matrices are transformed to 3-dimensional arrays and stored in the \Robject{assayDataList} slot of the \Rclass{TrioSetList} class. It is important that the column identifiers for the log R ratio and B allele frequency matrices correspond to sample identifiers in the \Robject{Pedigree} object. Accessors \Rfunction{lrr} and \Rfunction{baf} are available for extracting the list of arrays for the log R ratios and B allele frequencies. <>= library(human610quadv1bCrlmm) path <- system.file("extdata", package="MinimumDistance") load(file.path(path, "pedigreeInfo.rda")) load(file.path(path, "sample.sheet.rda")) load(file.path(path, "logRratio.rda")) load(file.path(path, "baf.rda")) stopifnot(colnames(logRratio) %in% allNames(Pedigree(pedigreeInfo))) nms <- paste("NA",substr(sample.sheet$Sample.Name, 6, 10),sep="") trioSetList <- TrioSetList(lrr=logRratio, ## must provide row.names baf=baf, pedigree=Pedigree(pedigreeInfo), sample.sheet=sample.sheet, row.names=nms, cdfname="human610quadv1bCrlmm", genome="hg18") show(trioSetList) @ \paragraph{Segmentation and posterior calls.} To keep the size of the \Rpackage{MinimumDistance} package small, the \Robject{trioSetList} object created in the previous section contained only 25 markers for each of the 22 autosomes. While useful for illustrating construction of the \Rclass{TrioSetList} class, there are too few markers included in the example to illustrate the smoothing and posterior calling algorithm for detecting de novo copy number events. To demonstrate these steps, we load a \Robject{TrioSetList} object containing several thousand markers for chromosomes 7 and 22 for 2 HapMap trios: <>= data(trioSetListExample) @ For a given trio, the signed minimum absolute difference of the offspring and parental \lrrlong{}s (\blrr{}) is defined as \begin{eqnarray} \label{eq:distance} \mindist &\equiv& \left(\blrr_\Of - \blrr_\M\right) \times \I_{[\left|\blrr_\Of - \blrr_\F\right| > \left|\blrr_\Of - \blrr_\M\right|]} + \left(\blrr_\Of - \blrr_\F\right) \times \I_{[\left|\blrr_\Of - \blrr_\F\right| \leq \left|\blrr_\Of - \blrr_\M\right| ]}. \end{eqnarray} Computation of $\mindist$ by the function \Rfunction{calculateMindist} is vectorized in \R{}: <>= md <- calculateMindist(lrr(trioSetList)) @ The circular binary segmentation algorithm implemented in the \R{} package \Rpackage{DNAcopy} can be used to smooth the minimum distance estimates. The method \Rfunction{segment2} is a wrapper to the \Rfunction{CNA.smooth} and \Rfunction{segment} functions defined in \Rpackage{DNAcopy}. Additional arguments to the \Rfunction{segment} function can be passed through the \texttt{...} argument in \Rfunction{segment2}, allowing users to modify the segmentation from the default values in \Rpackage{DNAcopy}. <>= md.segs <- segment2(object=trioSetList, md=md) @ The object returned by the \Rfunction{segment2} method is an object of class \Robject{RangedDataCBS}. <>= head(md.segs) @ If the offspring copy number changes within a minimum distance segment (as determined by the segmentation of the offspring copy number), the start and stop position of the minimum distance segments may be edited. The approach currently implemented is to define a new start and stop position if a breakpoint for the offspring segmentation occurs in the minimum distance interval. To illustrate, the following diagram uses vertical dashes (\texttt{|}) to denote breakpoints: \begin{verbatim} 1 ...--|--------------|--... ## minimum distance segment (before editing) 2 ...----|--------|------... ## segmenation of log R ratios for offspring -> 3 ...--|-|--------|---|--... ## after editing \end{verbatim} \noindent In the above illustration, posterior calls are provided for the 3 segments indicated in line 3 instead of the single segment in line 1. Two additional steps are therefore required: (1) segmentation of the offspring log R ratios and (2) editing of the minimum distance breakpoints when appropriate. The following codechunk, segments the log R ratios for all chromosomes in the \Robject{trioSetList} object. For purposes of visualization, we also segment the parental log R ratios: <>= lrr.segs <- segment2(trioSetList, segmentParents=TRUE) @ Minimum distance segments with a mean absolute value greater than 1 median absolute deviation from zero are edited using the \Rfunction{narrowRanges}: <>= mads.md <- mad2(md, byrow=FALSE) md.segs2 <- narrowRanges(md.segs, lrr.segs, thr=0.75, mad.minimumdistance=mads.md, fD=featureData(trioSetList)) @ To each range (segment) the maximum posterior probability and the posterior probability of the normal range are computed using the function \Rfunction{MAP}. <>= gr <- MAP(object=trioSetList, ranges=md.segs2, nupdates=5) show(gr) @ Note the \Rfunction{MAP} can be applied to a single interval. For example, <>= MAP(trioSetList, md.segs2[1, ]) @ The ratio of posterior probabilities can be used to rank genomic ranges by the evidence for a de novo copy number alteration. Note that physically adjacent genomic ranges in a sample that are assigned the same copy number state will be collapsed into a single range by the \Rfunction{computeBayesFactor} function. \subsection{Visualizations} \label{viz} %The \R{} package \Rpackage{VanillaICE} contains methods for %visualizing \Rclass{RangedData} objects that build on the %infrastructure in the \Rpackage{lattice} package. Visualizations for %trios will extend these methods and are currently under %development. Here, we provide a quick example using a function and a %lattice-style panel function that are not currently exported. To plot the log R ratios and B allele frequencies for one of the de novo deletions, we extract these low-level summaries from the \Robject{TrioSetList} object into a \Rclass{data.frame} that can be used by the \Rfunction{xyplot} function in the \Rpackage{lattice} package. <>= denovo.range <- GRanges("chr22", IRanges(20.8e6, 21.4e6)) i <- subjectHits(findOverlaps(denovo.range, gr)) gr.denovo <- gr[i, ] gr.denovo <- gr.denovo[state(gr.denovo)=="221", ] library(lattice) library(foreach) trioSet <- trioSetList[[2]][, match(sampleNames(gr.denovo), sampleNames(trioSetList))] mindist(trioSet) <- md[[2]][, match(sampleNames(gr.denovo), sampleNames(trioSetList)), drop=FALSE] fig <- xyplotTrio(gr.denovo, trioSet, frame=500e3, ylab="log R ratio and BAFs", xlab="physical position (Mb)", panel=xypanelTrio, scales=list(x="same", y=list(alternating=1, at=c(-1, 0, log2(3/2), log2(4/2)), labels=expression(-1, 0, log[2](3/2), log[2](4/2)))), lrr.segments=lrr.segs, md.segments=md.segs, layout=c(1, 4), col.hom="grey50", col.het="grey50", col.np="grey20", state.cex=0.8, cex=0.3, ylim=c(-3, 1.5), par.strip.text=list(lines=0.8, cex=0.6), key=list(text=list(c(expression(log[2]("R ratios")), expression("B allele freqencies")), col=c("black", "blue")), columns=2)) @ %se <- as(tList, "SummarizedExperiment") %df <- .to_data.frame(gr.denovo, se, maxgap=500e3) %panelFun <- function(x,y, granges, id, chrom, subscripts, mapcalls, ...){ % id <- id[subscripts][1] % chr <- chrom[subscripts][1] % granges <- granges[sampleNames(granges)==id & chromosome(granges)==chr,] % if(panel.number()==1) % mapcalls <- mapcalls[sampleNames(mapcalls)==id & chromosome(mapcalls)==chr,] % panel.grid() % panel.xyplot(x,y,...) % x0 <- start(granges)/1e6 % x1 <- end(granges)/1e6 % lsegments(x0, % values(granges)$seg.mean, % x1, % values(granges)$seg.mean, lwd=2,col="black") % if(panel.number()==1){ % mapcalls <- mapcalls[state(mapcalls) != "222", ] % st <- start(mapcalls)/1e6 % en <- end(mapcalls)/1e6 % y0 <- rep(1.0, length(st)) % y1 <- rep(1.3, length(st)) % lrect(st, y0, en, y1, col=as.integer(as.factor(state(mapcalls))), % border=as.integer(as.factor(state(mapcalls)))) % mns <- apply(cbind(st, en), 1, mean) % ltext(mns, 1.15, state(mapcalls), col="white", cex=1.2) % } %} %lfig <- xyplot(lrr~x/1e6|interval, df, % pch=20, % col="gray", % lwd=2, % scales=list(x=list(relation="free", axs="i")), % layout=c(3,1), % xlab="Mb", ylab="log R ratios", % granges=lrr.segs, % id=df$id, % chrom=df$chrom, % mapcalls=gr, % panel=panelFun, % ylim=c(-2, 1.5)) %bfig <- xyplot(baf~x/1e6|interval, df, % pch=20, % col="gray", % lwd=2, % scales=list(x=list(relation="free",axs="i")), % layout=c(3,1), % xlab="Mb", ylab="BAFs") %arrangeFigs(list(lrr=lfig, baf=bfig)) %@ %<>= %gr2 <- gr[sampleNames(gr)==sampleNames(gr.denovo) & chromosome(gr)=="chr22", ] %@ The final graphic is displayed in Figure \ref{fig:trioplot}. <>= pdf("triofig.pdf", width=10, height=6) print(fig) dev.off() @ \begin{figure}[t] \centering \includegraphics[width=\textwidth]{triofig} \caption{\label{fig:trioplot} The log R ratios (grey) and B allele frequencies (blue) for the parent-offspring trio, as well as the minimum distance (bottom panel). The solid vertical lines in the bottom panel indicate the start and stop positions of an inferred de novo hemizygous deletion in the offspring (state '221'). } \end{figure} \section{Session information} <>= toLatex(sessionInfo()) @ \bibliography{MinimumDistance}{} \bibliographystyle{plain} \end{document}