%\VignetteIndexEntry{Detection of de novo copy number alterations in case-parent trios} %\VignetteDepends{oligoClasses, VanillaICE, MinimumDistance} %\VignetteKeywords{MinimumDistance, copy number, SNP, case-parent trios, de novo} %\VignettePackage{MinimumDistance} \documentclass{article} \usepackage{graphicx} %\usepackage[utf8]{inputenc} %\usepackage[T1]{fontenc} \usepackage{natbib} \usepackage{url} \usepackage{amsmath} \usepackage{amssymb} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rcode}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textsf{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\R}{\textsf{R}} \newcommand{\md}{\Rpackage{MinimumDistance}} \newcommand{\F}{\mathrm{F}} \newcommand{\M}{\mathrm{M}} \newcommand{\Of}{\mathrm{O}} \newcommand{\RR}{\mathrm{LR}} \newcommand{\LR}{\mathrm{LR}} \newcommand{\Blrr}{\mbox{\boldmath $R$}} \newcommand{\Bbaf}{\mbox{\boldmath $B$}} \newcommand{\logRratio}{$\log_2$ R ratio} \newcommand{\lrrlong}{$\log_2$ R ratio} \newcommand{\blrr}{\mbox{\boldmath $r$}} %\newcommand{\bbaf}{\mbox{\boldmath $b$}} \newcommand{\baf}{B allele frequency} \newcommand{\bafs}{B allele frequencies} \newcommand{\tsl}{trioSetList} \newcommand{\ts}{trioSet object} \newcommand{\mindist}{\ensuremath{\mbox{\boldmath $d$}}} \DeclareMathOperator{\I}{\mathbb{I}} \usepackage[margin=1in]{geometry} \title{Detection of de novo copy number alterations in case-parent trios using the \R{} package \md{}} \date{\today} \author{Rob Scharpf} \begin{document} \maketitle <>= options(prompt="R> ", continue=" ", device=pdf, width=65) @ \begin{abstract} For the analysis of case-parent trio genotyping arrays, copy number variants (CNV) appearing in the offspring that differ from the parental copy numbers are often of interest (de novo CNV). This package defines a statistic, referred to as the minimum distance, for identifying de novo copy number alterations in the offspring. We smooth the minimum distance using the circular binary segmentation algorithm implemented in the Bioconductor package \Rpackage{DNAcopy}. Trio copy number states are inferred from the maximum a posteriori probability of the segmented data, incorporating information from the log R ratios and B allele frequencies. As both log R ratios and B allele frequencies are estimable from Illumina and Affymetrix arrays, this package supports de novo copy number inference in both platforms. \end{abstract} \section{Introduction} There are numerous \R{} packages available from Bioconductor for smoothing copy number alterations. For example, the biocview \texttt{CopyNumberVariants} in the 2.9 release of Bioconductor lists 27 packages. For the analysis of germline diseases, hidden Markov models have emerged as a popular tool as inference regarding the latent copy number state incorporates information from both the estimates of copy number (or relative copy number) and the allelic frequencies, such as the B allele frequency \citep{Peiffer2006} or genotype calls \citep{Colella2007, Wang2008, Scharpf2008}. For the analysis of somatic cell diseases such as cancer, algorithms that segment the genome into regions of constant copy number (referred to here as segmentation algorithms) may be more preferable for the detection of copy number aberrations (CNA) as a mixture of cells with different copy numbers give rise to mosaic (non-integer) copy numbers. Examples include circular binary segmentation implemented in the \R{} package \Rpackage{DNAcopy} and the \Rpackage{GLAD}, both of which were orginally developed for array CGH platforms \cite{Olshen2004,Hupe2004,Venkat2007}. One disadvantage of segmentation algorithms is that inference regarding duplication and deletions is not directly available. More recently, HMMs and segmentation algorithms have been developed for inferring common regions of copy number alterations in multiple samples. However, relatively few algorithms are available for inferring copy number alterations, especially de novo copy number alterations, in family-based study designs involving case-parent trios. Instead, a common strategy has been a two-step approach of identifying copy number alterations in the individual samples and then comparing the results across samples to infer whether an alteration observed in the offspring is de novo. A disadvantage of the two-step approach is that unadjusted sources of technical variation, such as waves, can contribute to false positives. The joint HMM implemented in the PennCNV software is one of the few algorithms to date that provides direct inference regarding de novo alterations in case parent study designs. This package develops an alternative framework for inferring regions of de novo copy number alterations in case-parent trios. Like the PennCNV joint HMM, inference regarding de novo alterations integrates information from both the log R ratios and B allele frequencies. Differences in the two approaches are most apparent in non-HapMap experimental datasets in which technical and experimental sources of variation appear to contribute to a large number of false positives. This vignette describes the analysis pipeline from preprocessed and normalized estimates of copy number and allele frequencies to inference of de novo copy number alterations. The workflow is illustrated using a publicly available HapMap trio and a publicly available oral cleft trio, each assayed on the Illumina 610quad array. \section{Data input} <>= library(oligoClasses) library(VanillaICE) library(SummarizedExperiment) library(MinimumDistance) foreach::registerDoSEQ() @ \subsection{Reading and organizing the annotation on the markers} We require that marker-level annotation is represented as a \Rclass{GRanges}-derived class. To read the plain text annotation file, we use the \Rfunction{fread} function provided in the \Rpackage{data.table} package. In addition, we define an indicator for whether the marker is polymophic using the 'Intensity Only' flag. The \Rclass{SnpGRanges} class created from the \Robject{fgr} object in the following code-chunk ensures that this binary indicator is created and can be reliably accessed. <>= library(data.table) extdir <- system.file("extdata", package="VanillaICE") features <- suppressWarnings(fread(file.path(extdir, "SNP_info.csv"))) fgr <- GRanges(paste0("chr", features$Chr), IRanges(features$Position, width=1), isSnp=features[["Intensity Only"]]==0) fgr <- SnpGRanges(fgr) names(fgr) <- features[["Name"]] @ Ideally, one should include the genome build and the chromosome lengths appropriate to the build. Here, we extract the metadata on the chromosomes using the BSgenome Bioconductor package for the hg18 build. Finally, we sort the \Robject{fgr} object such that the chromosomes are ordered by their \texttt{seqlevels} and the markers are ordered by their genomic position along the chromosome. <>= library(BSgenome.Hsapiens.UCSC.hg18) sl <- seqlevels(BSgenome.Hsapiens.UCSC.hg18) seqlevels(fgr) <- sl[sl %in% seqlevels(fgr)] seqinfo(fgr) <- seqinfo(BSgenome.Hsapiens.UCSC.hg18)[seqlevels(fgr),] fgr <- sort(fgr) @ \subsection{Organizing the marker-level summaries} The abbreviated plain text files included with this package that contain the log R ratios and B allele frequencies from 2 trios are listed below. <>= files <- list.files(extdir, full.names=TRUE, recursive=TRUE, pattern="FinalReport") @ We parse these files into a specific format such that all downstream steps for copy number estimation no longer depend on the format of the source files. At this point, we encapsulate the names of the source files ('sourcePaths'), the location of where we intend to store the parsed data ('parsedPath'), and the genomic marker annnotation created in the preceding section ('rowRanges') in a single object called an \Rclass{ArrayViews}. <>= ## ## Where to keep parsed files ## parsedDir <- "ParsedFiles" if(!file.exists(parsedDir)) dir.create(parsedDir) views <- ArrayViews(rowRanges=fgr, sourcePaths=files, parsedPath=parsedDir) show(views) @ Because the format of the source files depends on upstream software, we read one file and store information regarding its parsing so that subsequent files can be parsed in a similar fashion. We use the \Rfunction{fread} to read in the first file. <>= ## read the first file dat <- fread(files[1], skip="[Data]") head(dat,n=3) @ Next, we select which columns we plan to keep. Again, the required data for downstream processing is the name of the SNP identifier, the log R ratios, and B allele frequencies. <>= ## information to store on the markers select_columns <- match(c("SNP Name", "Allele1 - AB", "Allele2 - AB", "Log R Ratio", "B Allele Freq"), names(dat)) @ We also specify the order in which we will store the marker-level summaries by matching the \Rfunction{rownames} of the \Robject{views} object with the names of the markers in the source file: <>= index_genome <- match(names(fgr), dat[["SNP Name"]]) @ Similar to the parameter classes defined in \Rpackage{Rsamtools}, we encapsulate the information for parsing the columns and rows of the source files in a class. In addition, we specify which variable names in the source file refers to log R ratios ('cnvar'), B allele frequences ('bafvar'), and genotypes ('gtvar'). <>= scan_params <- CopyNumScanParams(index_genome=index_genome, select=select_columns, cnvar="Log R Ratio", bafvar="B Allele Freq", gtvar=c("Allele1 - AB", "Allele2 - AB")) @ The \Rfunction{parseSourceFile} will parse a single file in the \Robject{views} object (by default, the first file) according to the parameters for reading the data in the \Robject{scan\_params} object and write the output to the \Rfunction{parsedPath} directory. In particular, the \Rfunction{parseSourceFile} returns \Rclass{NULL}. <>= parsedPath(views) parseSourceFile(views[, 1], scan_params) @ To apply the function \Rfunction{parseSourceFile} to all arrays in the \Robject{views} object, one can use the functional \Rfunction{sapply}. <>= invisible(sapply(views, parseSourceFile, param=scan_params)) @ Apart from confirming their existance, the user should not have a need to directly access the parsed files. Utilities for querying these files are provided through the \Robject{views} object. <>= head(list.files(parsedPath(views)), n=3) @ \subsection{Accessors for the parsed data} The preference for writing the parsed data to disk rather than keeping the data in RAM is simply that the latter does not scale to projects involving thousands of samples. For the former, slices of the parsed data easily be accessed from the \Rfunction{parsedPath} directory via methods defined for the \Rclass{ArrayViews} class. For example, one can use accessors for the low-level summaries directly: \Rfunction{lrr}, \Rfunction{baf}, and \Rfunction{genotypes} for log R ratios, B allele frequencies, and genotypes, respectively. The user has the option of either subsetting the views object or subsetting the matrix returned by the accessor to extract the appropriate data slice. In the following examples, we access data on the first 2 markers and sample indices 2-4. <>= lrr(views)[1:2, 2:4] ## or lrr(views[1:2, 2:4]) ## B allele frequencies baf(views[1:2, 2:4]) ## potentially masked by function of the same name in crlmm VanillaICE::genotypes(views)[1:2, 2:4] @ More often, it is useful to extract the low-level data in a \Rclass{RangedSummarizedExperiment}-derived class such that meta-data on the samples remains bound to the columns of the assay data (log R ratios / B allele frequencies) and meta-data on the rows remains bound to the rows of the assay data. This is accomplished by applying the \Rfunction{MinDistExperiment} function to a \Robject{views} object, as described in the following section. \section{Binding the genomic, marker, and individual-level data} \subsection{Pedigree annotation} The container for a single pedigree is the \Rclass{ParentOffspring} class. <>= ped_hapmap <- ParentOffspring(id = "hapmap", father="12287_03", mother="12287_02", offspring="12287_01", parsedPath=parsedPath(views)) @ For families with multiple affected offspring, the argument to offspring can be a character-vector with length greater than 1. We can store any number of \Rclass{ParentOffspring} objects in a list using the \Rclass{ParentOffspringList} class. In particular, for the 2 trios provided in the \Rpackage{VanillaICE} package <>= ped_list <- ParentOffspringList(pedigrees=list( ParentOffspring(id = "hapmap", father="12287_03", mother="12287_02", offspring="12287_01", parsedPath=parsedPath(views)), ParentOffspring(id = "cleft", father="22169_03", mother="22169_02", offspring="22169_01", parsedPath=parsedPath(views)))) pedigreeName(ped_list) @ For a single pedigree, the \Rclass{MinDistExperiment} encapsulates the annotation on the pedigree, the genomic annotation of the markers, and the marker-level summary statistics. Note, however, that the sample identifiers in the pedigree are not the same as the file identifiers used by default to create the \R{} object \Robject{views}. To resolve the naming issue, we read in separate file provided in the \Rpackage{VanillaICE} package: <>= sample_info <- read.csv(file.path(extdir, "sample_data.csv"), stringsAsFactors=FALSE) ind_id <- setNames(gsub(" ", "", sample_info$IndividualID), sample_info$File) colnames(views) <- ind_id[gsub(".txt", "", colnames(views))] @ \subsection{MinDistExperiment class} The constructor function for \Rclass{MinDistExperiment} extracts from the views object only the data relevant for the provided pedigree. <>= me <- MinDistExperiment(views, pedigree=ped_list[[2]]) colnames(me) me @ \section{Detection of de novo copy number variants} \label{smoothing} A container for the various parameters used to segment and call de novo copy number variants is provided by the \Rclass{MinDistParam}. A constructor function of the same name provides a default parametrization that works well in most instances: <>= params <- MinDistParam() show(params) @ The parameters in this class are organized according to function. For example, the argument \Robject{dnacopy} takes an argument of class \Rclass{DNAcopyParam} that contains setting that are passed to the \Rfunction{segment} function for the implementation of circular binary segmentation in the \R{} package \Rpackage{DNAcopy}. Changing the parameters for the segmentation is most easily accomplished by a call to the constructor. E.g., <>= segment_params <- DNAcopyParam(alpha=0.01) params <- MinDistParam(dnacopy=segment_params) @ Several of the parameters relate to a priori assumptions of the probability of a non-Mendelian transmission. These a priori assumptions are defined in PennCNV \cite{Wang2008} and encapsulated in the \Rclass{PennParam} class. <>= penn_param <- PennParam() show(penn_param) @ \noindent Finally, parameters for the emission probabilities computed by the \R{} package \Rpackage{VanillaICE} are encapsulated in the \Rclass{EmissionParam} class. Again, default values for the emission probabilities will be created automatically as part of the \Robject{params} object if none is specified. \paragraph{Segmentation and posterior calls.} %To keep the size of the \Rpackage{MinimumDistance} package small, the %\Robject{trioSetList} object created in the previous section contained %only 25 markers for each of the 22 autosomes. While useful for %illustrating construction of the \Rclass{TrioSetList} class, there are %too few markers included in the example to illustrate the smoothing %and posterior calling algorithm for detecting de novo copy number %events. To demonstrate these steps, we load a \Robject{TrioSetList} %object containing several thousand markers for chromosomes 7 and 22 %for 2 HapMap trios: % %<>= %data(trioSetListExample) %me <- as(trioSetList, "MinDistExperiment") %@ For a given trio, the signed minimum absolute difference of the offspring and parental \lrrlong{}s (\blrr{}) is defined as \begin{eqnarray} \label{eq:distance} \mindist &\equiv& \left(\blrr_\Of - \blrr_\M\right) \times \I_{[\left|\blrr_\Of - \blrr_\F\right| > \left|\blrr_\Of - \blrr_\M\right|]} + \left(\blrr_\Of - \blrr_\F\right) \times \I_{[\left|\blrr_\Of - \blrr_\F\right| \leq \left|\blrr_\Of - \blrr_\M\right| ]}. \end{eqnarray} If the offspring copy number changes within a minimum distance segment (as determined by the segmentation of the offspring copy number), the start and stop position of the minimum distance segments may be edited. The approach currently implemented is to define a new start and stop position if a breakpoint for the offspring segmentation occurs in the minimum distance interval. To illustrate, the following diagram uses vertical dashes (\texttt{|}) to denote breakpoints: \begin{verbatim} 1 ...--|--------------|--... ## minimum distance segment (before editing) 2 ...----|--------|------... ## segmenation of log R ratios for offspring -> 3 ...--|-|--------|---|--... ## after editing \end{verbatim} \noindent In the above illustration, posterior calls are provided for the 3 segments indicated in line 3 instead of the single segment in line 1. Two additional steps are therefore required: (1) segmentation of the offspring log R ratios and (2) editing of the minimum distance breakpoints when appropriate. The following codechunk, segments the log R ratios for all chromosomes in the \Robject{me} object and edits the ranges for the minimum distance when conflicts with the offspring segmentation boundaries arise (as illustrated above). <>= mdgr <- segment2(me, params) @ For each minimum distance segment in which the mean minimum distance is above a user-specificed cutoff in absolute value (specified in terms of the median absolute deviations of the minimum distance), a trio copy number state is assigned from the maximum a posteriori estimate. <>= ## the threshold in terms of the number of median absolute deviations from zero nMAD(params) md_g <- MAP2(me, mdgr, params) show(md_g) @ \section{Inspecting, Filtering, and plotting de novo CNV inference} \label{viz} \subsection{Filtering} There are several meta-data columns stored in the \Rclass{GRanges}-derived summary useful for filtering the set of genomic intervals. All the available parameters for filtering the \Robject{fit} object are stored in the parameter class \Rclass{FilterParam} of which an instance can be created by its constructor of the same name without any arguments. <>= filter_param <- FilterParamMD() show(filter_param) @ To apply the default filter parameters to the \Robject{fit} object, we use the \Rfunction{cnvFilter} function. The \Rfunction{cnvFilter} function returns only the set of genomic ranges satisfying the filters. For example, <>= cnvFilter(md_g, filter_param) @ The helper functions \Rfunction{denovoHemizygous} and \Rfunction{denovoHomozygous} create commonly used filters. <>= denovoHemizygous(md_g) denovoHomozygous(md_g) @ Equivalently, one could customize the filter parameteters to acheive the same result: <>= select_cnv <- FilterParamMD(state=c("220", "221", "223"), seqnames="chr22") cnvs <- cnvFilter(md_g, select_cnv) cnvs @ \subsection{Visualization} This produces a somewhat unsatisfactory result in that there are 2 additional denovo hemizygous deletions nearby that likely comprise one denovo segment. <>= denovoHemizygous(md_g) @ In the following code-chunk, we reduce the denovo hemizygous deletions and update the posterior probabilities for the MAP estimates. We use a combination of lattice and grid to visualize the marker-level summaries and copy number inference for the trio. <>= library(grid) g2 <- reduce(denovoHemizygous(md_g), min.gapwidth=500e3) post <- MAP2(me, g2, params) g2 <- denovoHemizygous(post) vps <- pedigreeViewports() grid.params <- HmmTrellisParam() p <- plotDenovo(me, g2, grid.params) pedigreeGrid(g=g2, vps=vps, figs=p) @ \section{Acknowledgements} Moiz Bootwalla contributed to early versions of this vignette. \section{Session information} <>= toLatex(sessionInfo()) @ \bibliography{MinimumDistance}{} \bibliographystyle{plain} \end{document}