%\VignetteIndexEntry{crlmmDownstream} %\VignetteKeywords{copy number, genotype, SNP} %\VignettePackage{VanillaICE} \documentclass{article} \usepackage{amsmath} \usepackage{graphicx} \usepackage[numbers]{natbib} \usepackage{color} \usepackage[margin=1in]{geometry} \newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle} \newcommand{\Rpackage}[1]{\textit{#1}} \newcommand{\Rfunction}[1]{\texttt{#1}} \newcommand{\Robject}[1]{\texttt{#1}} \newcommand{\Rclass}[1]{\texttt{#1}} \newcommand{\R}{\textsf{R}} \newcommand{\hmmoptions}{\Robject{HmmOptions}} \newcommand{\hmmparam}{\Robject{HmmParameter}} \newcommand{\crlmm}{\Rpackage{crlmm}} \newcommand{\oligo}{\Rpackage{oligo}} \title{Integration with the crlmm package for copy number inference} \author{Robert Scharpf} \begin{document} \maketitle <>= library(oligoClasses) library(VanillaICE) library(crlmm) library(SNPchip) library(IRanges) library(foreach) @ We load a portion of chromosome 8 from 2 HapMap samples that were processed using the \Rpackage{crlmm} package. <>= data(cnSetExample, package="crlmm") @ In the following unevaluated code chunk, we could enable parallelization of the hidden Markov models using the package \Rpackage{snow}. <>= ##registerDoSEQ() library(snow) library(doSNOW) cl <- makeCluster(2, type="SOCK") registerDoSNOW(cl) ocSamples(2) @ <>= registerDoSEQ() @ The data \Robject{cnSetExample} is an object of class \Rclass{CNSet}. We coerce the \Rclass{CNSet} object to a list class that contains information on copy number (log R ratios), genotypes, genotype probabilities, and B allele frequencies. If the \Rclass{CNSet} object contained assay data for multiple chromosomes, each element in the following list container would contain the assay data for one chromosome. <>= oligoList <- BafLrrSetList(cnSetExample) oligoList[[1]] @ \section*{Wave correction} To correct for genomic waves that correlate with GC content [refs], we use the \R{} package \Rpackage{ArrayTV}. Currently, the \Rpackage{ArrayTV} is available from github (\texttt{https://github.com/rscharpf/ArrayTV}). As this package is not yet available from Bioconductor, the code chunks in this section are not evaluated. In the following code-chunk, we first select a subset of the samples in the study to evaluate the genomic window for wave correction. See the ArrayTV vignette for details. For large datasets, one could randomly select 20 or 25 samples to compute the window, and then use a pre-selected window for wave correction on the remaining samples. We illustrate this process, even though our dataset only contains two samples. We process the dataset 20 samples at a time by setting \texttt{ocSamples()} to 20, thereby controlling the amount of RAM required (use smaller batch sizes for systems with less RAM). <>= library(ArrayTV) i <- seq_len(ncol(oligoList)) ocSamples(20) increms <- c(10,1000,100e3) wins <- c(100,10e3,1e6) tvScores <- gcCorrect(oligoList[, 1], increms=increms, maxwins=wins, returnOnlyTV=TRUE, verbose=TRUE) order(tvScores[[1]][,2], decreasing=TRUE) @ Next, we select a small window of 10 bp and a larger window of 10,000 bp and pre-compute the gc composition: <>= gc.matrix <- computeGC(oligoList, c(10, 10e3), c(10,10e3)) @ We use the gc content for the two windows in \texttt{gc.matrix} to correct the log R ratios for all samples in the dataset. If the assay data elements in the \texttt{oligoList} object were \texttt{ff} objects, NULL is returned and the log R ratios are updated on disk. If the assay data elements are matrices, a \Rclass{BafLrrSetList} object is returned with the correct log R ratios. <>= oligoList2 <- gcCorrect(oligoList, increms=c(10,10e3), maxwins=c(10,10e3), providedGC=gc.matrix) @ \section*{HMM} To identify CNVs, we fit a 6-state hidden markov model from estimates of the B allele frequency and log R ratios. A \Rfunction{hmm} method is defined for the \Rclass{BafLrrSetList} class, and we apply the method directly with a few parameters that change the arguments from their default values. For example, the \texttt{TAUP} parameter scales the transition probability matrix. Larger values of \texttt{TAUP} makes it more expensive to transition from the normal copy number state to states with altered copy number. <>= res <- hmm(oligoList, p.hom=0.1, nupdates=5, TAUP=1e10) @ \section*{Commonly used accessors for the BafLrrSetList class} The \verb+[[+ method can be used to extract a \Rclass{BafLrrSet} object for one chromosome. This object is similar to an \Rclass{ExpressionSet} in the Biobase package, but with assay data elements for the log R ratios and B allele frequencies. <>= oligoSet <- oligoList[[1]] @ Accessors for the log R ratios and B allele frequencies are given by \Rfunction{lrr} and \Rfunction{baf}, respectively. <>= r <- lrr(oligoSet) b <- baf(oligoSet) @ Note that the above objects are stored as integers. <>= range(r, na.rm=TRUE) range(b, na.rm=TRUE) @ The log R ratios and B allele frequencies can be transformed back to the original scale by dividing by 100 and 1000, respectively. <>= r <- r/100 b <- b/1000 @ \section*{Visualization of genomic intervals (e.g., CNVs) with log R ratios and B allele frequencies} In this section, we use lattice-style plots to visualize the genomic intevals of the altered copy number states. We begin by subsetting the \Robject{GRanges} object from the HMM to contain only the altered copy number states. <>= gr <- unlist(res) gr <- gr[state(gr) %in% c(1,2,5,6) & sampleNames(gr) == "NA19007",] @ Next, we create a \Robject{SummarizedExperiment} object containing the log R ratios, B allele frequencies, and physical position (in Mb) that are within \texttt{maxgap} basepairs of the CNV intervals stored in the \texttt{gr} object. Because creation of a \Robject{SummarizedExperiment} will read in all of the relevant low level data from disk, it is useful to first select only the relevant chromosome and samples. While these steps are not really necessary in our toy example with only one chromosome and 2 samples, it can increase computational speed substantially in large studies. Finally, we create coerce the \Robject{brList} object to a \Robject{SummarizedExperiment}. <>= chr <- paste("chr", chromosome(oligoList), sep="") brList <- oligoList[chr %in% chromosome(gr)] brList <- brList[, match(sampleNames(gr)[1], sampleNames(brList))] se <- as(brList, "SummarizedExperiment") @ Next, we use the lattice function \texttt{xyplot} to plot the log R ratios and B allele frequencies for each genomic interval in the \Robject{gr} object. Lattice plots require a \Robject{data.frame}, and so we create a \Robject{data.frame} object using the method \Rfunction{dataFrame}. Additional arguments such as \texttt{maxgap} are passed to the \Rfunction{findOverlaps} function in the \Rpackage{GenomicRanges} package. Here, we frame each alteration by a genomic interval of 200kb by specifying \texttt{maxgap=200e3}. <>= df <- dataFrame(gr, se, maxgap=500e3) head(df) @ Finally, we create separate \Rclass{trellis} objects for the log R ratios and B allele frequencies using the wrapper function \Rfunction{latticeFigs}. <>= colors <- c("red", "orange", "white", "white", "lightblue", "blue")[state(gr)] figs <- latticeFigs(gr, df, colors=colors) @ We can arrange the two trellis objects using the function \Rfunction{arrangeFigs} defined in the \Rpackage{SNPchip} package. %<>= %print(fig) %@ % % <>= pdf("crlmmDownstream-latticeFig.pdf", width=10, % height=7) @ % <>= library(grid) library(lattice) arrangeFigs(figs) @ % %<>= %dev.off() %@ \begin{figure}[t] \centering \includegraphics[width=\textwidth]{crlmmDownstream-arrangeFig} \caption{\label{fig:chr8} Plot of log R ratios for altered copy number states. Each panel displays one region with a copy number alteration predicted from the 6-state HMM with padding on each side. The light blue and orange shaded rectangles indicate duplications and hemizygous deletions, respectively, identified by the HMM.} \end{figure} \section*{Parallelization} As the HMM is fit independently to each sample, parallelization is straightforward and will be performed automatically when multiple CPUs are available. For large datasets in which the assay data elements of the \Rclass{BafLrrSetList} object are stored as \Rclass{ff} objects on disk, only \texttt{ocSamples()} will be processed at a time. In the following unevaluated code chunk, we set up a parallel environment using the \R{} packages \Rpackage{snow} and \Rpackage{foreach} and specify that we want to process the samples one at a time. Automatically, the two samples in the \Robject{oligoList} object are processed on separate CPUs. <>= library(foreach) library(snow) library(doSNOW) cl <- makeCluster(2, type="SOCK") registerDoSNOW(cl) ocSamples(2) @ <>= res2 <- hmm(oligoList, p.hom=0, TAUP=1e10, nupdates=5) @ <>= stopCluster(cl) @ \section*{Session Information} The version number of R and packages loaded for generating the vignette were: <>= toLatex(sessionInfo()) @ \end{document}