% \VignetteIndexEntry{VanillaICE Vignette} % \VignetteKeywords{copy number, genotype, SNP} % \VignettePackage{VanillaICE} \documentclass{article} \usepackage{amsmath} \usepackage{graphicx} \newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle} \newcommand{\Rpackage}[1]{\textit{#1}} \newcommand{\Rfunction}[1]{\texttt{#1}} \newcommand{\Robject}[1]{\texttt{#1}} \newcommand{\R}{\textsf{R}} \newcommand{\hmmoptions}{\Robject{HmmOptions}} \newcommand{\hmmparam}{\Robject{HmmParameter}} \newcommand{\cne}{\widehat{\text{CN}}} \newcommand{\gte}{\widehat{\text{GT}}} \newcommand{\gtehom}{\widehat{\text{HOM}}} \newcommand{\gtehet}{\widehat{\text{HET}}} \newcommand{\pgte}{\text{S}_{\widehat{\text{\tiny GT}}}} \newcommand{\pcne}{\text{S}_{\widehat{\text{\tiny CN}}}} \newcommand{\pgtehom}{\text{S}_{\widehat{\text{\tiny HOM}}}} \newcommand{\pgtehet}{\text{S}_{\widehat{\text{\tiny HET}}}} \newcommand{\thom}{\text{HOM}} \newcommand{\thet}{\text{HET}} \newcommand{\bDelta}{\mbox{\boldmath $\Delta$}} \newcommand{\real}{\mbox{$\mathbb R$}} % real numbers \newcommand{\bnu}{\mbox{\boldmath $\nu$}} \newcommand{\ice}{\Rpackage{VanillaICE}} \textwidth=6.2in \textheight=8.5in \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \begin{document} \title{\ice{}: Hidden Markov Models for the Assessment of Chromosomal Alterations using High-throughput SNP Arrays} \author{Robert Scharpf} \maketitle <>= options(width=69) @ \section{Introduction} Chromosomal DNA is characterized by variation between individuals at the level of entire chromosomes (e.g. aneuploidy in which the chromosome copy number is altered), segmental changes (including insertions, deletions, inversions, and translocations), and changes to small genomic regions (including single nucleotide polymorphisms). A variety of alterations that occur in chromosomal DNA, many of which can be detected using high density single nucleotide polymorphism (SNP) microarrays, are linked to normal variation as well as disease and therefore of particular interest. These include changes in copy number (deletions and duplications) and genotype (e.g. the occurrence of regions of homozygosity). Hidden Markov models (HMM) are particularly useful for detecting such abnormalities, modeling the spatial dependence between neighboring SNPs. Here, we extend previous approaches that utilize HMM frameworks for inference in high throughput SNP arrays by integrating copy number, genotype calls, and the corresponding measures of uncertainty when available. Using simulated and real data, we demonstrate how confidence scores control smoothing in a probabilistic framework. The goal of this vignette is to provide a simple interface for fitting HMMs and plotting functions to help visualize the predicted states alongside the experimental data. \section{Simple Usage} \label{sec:simpleUsage} \subsection{\label{sec:locusCN} Locus-level estimates of copy number} To do: describe how to obtain locus-level estimates of copy number. \subsection{Hidden Markov model to smooth locus-level estimates} <>= library(VanillaICE) @ %Before fitting the HMM, the data must be organized into one of the %following classes for high-throughput SNP data: %\begin{itemize} %\item \Robject{SnpCallSet} (genotype calls) %\item \Robject{SnpCopyNumberSet} (copy number estimates) %\item \Robject{oligoSnpSet} (genotype calls and copy number estimates) %\item \Robject{RatioSnpSet} (genotype calls and ratios of A and B allele intensities) %\end{itemize} %When pre-processing Affymetrix SNP chips with the \R{} package %\Rpackage{oligo}, an object of one of the above classes is created and %can be used directly with the HMMs described in this vignette. These %classes can also be created from Illumina data as described in the %IlluminaHowTo vignette. The simulated data provided with this package %is an instance of class \Robject{oligoSnpSet} <>= data(chromosome1) annotation(chromosome1) chromosome1 @ A visualization of the data: <>= gp <- plot(chromosome1) show(gp) @ The HMM for total copy number assumes that the copy number estimates, conditional on the hidden state, are approximately Gaussian. Fitting a HMM requires the following components: % A log transformation will be performed automatically provided that % the copy number estimates are all positive. See the documentation % pages in the R package \ice{} for more information about the % \Robject{chromosome1} example dataset. \begin{itemize} \item the hidden states \item the emission probabilities \item transition probabilities \end{itemize} \paragraph{Hidden states} For the hidden Markov model, it is important that the SNPs (within a chromosome) are ordered by physical position. <>= ann <- fData(chromosome1)[, c("chromosome", "position")] ann[, "chromosome"] <- chromosome2integer(ann[, "chromosome"]) chromosome1 <- chromosome1[order(ann[, "chromosome"], ann[, "position"]), ] @ Next we specify the hidden states and the corresponding number for each state. <>= states <- c("homozygousDeletion", "hemizygousDeletion", "normal", "LOH", "3copyAmp", "4copyAmp") mu <- c(0.05, 1, 2, 2, 3, 4) @ SNP-specific estimates of the uncertainty of the total copy number will be available from methods in Section \ref{sec:locusCN} shortly. Below, we simply obtain a robust estimate of the copy number standard deviation across SNPs and use this estimate for all of the SNPs. Because we the log-transformed copy number estimates are more nearly Gaussian, we calculate a robust estimate of the standard deviation on the log scale, and use this estimate for all of the SNPs. This section will be updated. <>= library(genefilter) CT <- copyNumber(chromosome1) sample.sd <- matrix(rowSds(t(log2(CT))), nrow(CT), ncol(CT)) ##robustSD <- function(X) diff(quantile(X, probs=c(0.16, (1-0.16)), na.rm=TRUE))/2 ##sample.sd <- matrix(robustSD(log2(CT)), nrow(CT), ncol(CT)) @ \paragraph{Emission probabilities} Locus-level estimates of copy number, when suitably transformed, are assumed to be approximately Gaussian-distributed. We may obtain emission probabilities as follows: <>= logCT <- array(log2(CT), dim=c(nrow(CT), ncol(CT), length(states))) dimnames(logCT) <- list(rownames(CT), colnames(CT), states) logMu <- aperm(array(log2(mu), dim=c(length(states), ncol(CT), nrow(CT)))) logSd <- aperm(array(sample.sd, dim=c(length(states), ncol(CT), nrow(CT)))) dimnames(logMu) <- dimnames(logSd) <- dimnames(logCT) k <- which(!is.na(as.vector(logCT))) emission.logCT <- dnorm(as.vector(logCT)[k], as.vector(logMu)[k], as.vector(logSd)[k]) emission.logCT <- array(emission.logCT, dim=dim(logCT)) logemission.logCT <- log(emission.logCT) @ Alternatively, one may use the function copynumberEmission (does nearly the same as above, but with a few additional checks). <>= logemission.logCT2 <- copynumberEmission(copynumber=CT, states=states, mu=mu, uncertainty=sample.sd, takeLog=TRUE, verbose=TRUE) dimnames(logemission.logCT) <- dimnames(logemission.logCT2) identical(logemission.logCT, logemission.logCT2) rm(logemission.logCT2); gc() @ Adding a small positive value to the log emission probabilities prevents breaks in the predicted states due to extreme values. This step is less important if SNP-specific estimates of uncertainty are available. <>= logemission.logCT[logemission.logCT < -10] <- -10 @ Uncertainty estimates for the genotype calls can also be integrated. The genotype calls must be represented as integers (1 = AA, 2 = AB, 3 = BB, 4 = missing). We recommend using CRLMM to genotype -- CRLMM provides genotypes for all SNPs on the array (no missing values) and confidence estimates for the call that takes into consideration the signal to noise ratio of the sample as well as a SNP-specific estimate of the uncertainty. Di-allelic genotype calls from CRLMM are useful, though incorrect for CNV. In particular, di-allelic genotype calls can help pinpoint hemizygous deletions (genotype calls of AA or BB correspond to 'A' and 'B') and copy-neutral regions of loss of heterozygosity (LOH). For the HMM, one must specify the probability of a homozygous genotype call (\texttt{probHomCall}) for each of the hidden states. The goal, then, is to have genotype emission probabilities that are informative for distinguishing the normal state versus hemizygous deletions or LOH, and somewhat agnostic for distinguishing between homozygous deletions/amplifications from normal regions. To illustrate this idea, consider two scenarios where the true state is amplification of copy number (e.g., 3 copies) over a regions spanning 10 SNPs. Scenario 1: 7 of the 10 di-allelic genotype calls are homozygous. Scenario 2: all 10 of the di-allelic genotype calls are homozygous. For the hidden Markov model, we assume that the probability of a homozygous genotype call is 0.99 for normal, homozygous deletion and amplification. For hemizygous deletion and copy-neutral LOH, we specify a probability of 0.9999. Note the following: \begin{itemize} \item The emission probabilities for the di-allelic genotype calls in scenarios 1 and 2 will be the same for normal, homozygous deletion, and amplification. The copy number emission probabilities will discriminate between these states. \item Hemizygous deletion and copy-neutral LOH are penalized severely in scenario 1. This penalization reflects our belief that the probability of observing 3 genotype errors (AB must be a genotype error for these states) in a small region. \item In scenario 2, hemizgyous deletion and copy-neutral LOH have a higher emission probability for the genotypes. However, because the probability of observing homozygous genotype calls for amplification and normal states is also high, most of the information for discriminating between and amplification containing 10 homozygous genotype and a copy-neutral region of LOH will be driven by the emission probability for copy number. \end{itemize} In summary, specifiying high probabilities for observing a homozygous genotype call in normal and copy number alterered states (as indicated below) allows fairly long stretches of homozygous genotype calls to occur by chance in any of the states. In our experience, sequences of 70 - 100 homozygous genotypes are fairly common and likely represent normal regions of the genome with fairly uniform haplotype structure. If any of the genotype calls are missing and missingness is not independent of the underlying hidden state, one may specify the probability of a missing genotype calls for each hidden state (\texttt{probMissing}). By default, the HMM will assume that missing genotype calls are independent of the underlying hidden state. <>= probs <- c(0.99, 0.9999, 0.99, 0.9999, 0.99, 0.99) probMissing <- c(0.999, rep(0.01, 5)) names(probs) <- states GT <- calls(chromosome1) genotypeEmission <- function(genotypes, states, probHomCall, probMissing, verbose=TRUE){ if(!is.numeric(genotypes)) stop("genotypes must be integers (1=AA, 2=AB, 3=BB, 4=missing") emissionForGenotypes <- function(probHomGenotype, genotypes){ isHom <- which(as.vector(genotypes) == 1 | as.vector(genotypes) == 3) isHet <- which(as.vector(genotypes) == 2) isMissing <- which(as.vector(genotypes) == 4 | is.na(as.vector(genotypes))) emission.gt <- rep(NA, length(genotypes)) emission.gt[isHom] <- probHomGenotype emission.gt[isHet] <- 1-probHomGenotype emission.gt[isMissing] <- NA emission.gt } emission.gt <- array(NA, dim=c(nrow(GT), ncol(GT), length(states))) for(j in 1:ncol(GT)){ emission.gt[, j, ] <- sapply(probs, emissionForGenotypes, genotypes=GT[, j]) if(any(is.na(emission.gt[, j, 1]))){ missing <- is.na(emission.gt[, j, 1]) if(!missing(probMissing)){ if(length(probMissing) != length(states)) stop("probMissing must be a numeric vector equal to the number of states") emission.gt[missing, j, ] <- matrix(probMissing, sum(missing), length(states), byrow=TRUE) } else{ if(verbose) message("Argument probMissing is not specified. Assume that missing genotype calls are independent of the underling hidden state") emission.gt[missing, j, ] <- 1 } } } dimnames(emission.gt) <- list(rownames(genotypes), colnames(genotypes), states) return(suppressWarnings(log(emission.gt))) } logemission.gt <- genotypeEmission(genotypes=calls(chromosome1), states=states, probHomCall=probs) @ Conditional on the hidden state, we assume that the copy number and genotype are independent. Therefore, the emission probabilities for an HMM that models the copy number and genotypes jointly are computed by adding the emission probabilities (log-scale) for copy number and genotype: <>= logemission <- logemission.gt + logemission.logCT @ \paragraph{Transition probabilities} We transform the physical distance between adjacent loci to an estimate of the genomic distance. <>= tau <- exp(-2 * diff(ann[, "position"])/(100*1e6)) @ Note that the above transition probabilities can be scaled by specifying a matrix of dimension S x S, where S is the number of hidden states. For instance, we define loss of heterozgosity in a single sample as a sequence of homozygous genotypes that is longer than what one would expect to observe by chance. One way to control the size (and number) of LOH regions detected is to scale the probability of transitioning to and from this state. A transition scale matrix of 1's would not modify the probability of transitioning between states. Example with the first SNP. <>= epsilon <- 1-tau[1] M <- matrix(epsilon/(length(states)-1), length(states), length(states)) dimnames(M) <- list(states, states) diag(M) <- tau[1] all(rowSums(M) == 1) @ Note that by default, the transition probability matrix distributes the epsilon equally to the remaining states. We may alter how the epsilon is distributed by scaling this matrix. For instance, <>= ##Make it harder to transition to LOH tau.scale <- matrix(1, length(states), length(states)) dimnames(tau.scale) <- list(states, states) tau.scale["normal", "LOH"] <- 1e-4 @ The other states must be rescaled (here, we rescale the other states by a factor \texttt{scale}) <>= S <- length(states) scale <- (S-1)/(S-2 + 1e-4) tau.scale["normal", c("homozygousDeletion", "hemizygousDeletion", "3copyAmp", "4copyAmp")] <- scale all(round(rowSums(M*tau.scale), 5) == 1) @ \paragraph{Initial state probabilities} <>= initialStateProb <- rep(1e-4, length(states)) initialStateProb[states=="normal"] <- 1-(length(states)-1)*1e-4 @ \paragraph{Fitting the HMM} We use the Viterbi algorithm to find the sequence of hidden states that maximizes the probability of the observed data. <>= fit <- viterbi(initialStateProbs=log(initialStateProb), emission=logemission[, 1, ], tau=tau) fit2 <- viterbi(initialStateProbs=log(initialStateProb), emission=logemission[, 1, ], tau=tau, tau.scale=tau.scale) @ Note that in the above example, scaling the transition probability matrix did not affect the predicted sequence of hidden states. <>= table(fit) results <- findBreaks(x=fit, states=states, position=ann[, "position"], chromosome=ann[, "chromosome"], sample=colnames(CT)) results[results$state != "normal", ] @ \section{Using S4 classes/methods} The objective of developing classes is to facilitate the process of fitting an HMM and to more effectively keep the assaydata and metadata together in one object. The following code uses classes to do pretty much the same as above. We begin by reproducing the \Robject{chromosome1} object, an object of class \Robject{oligoSnpSet}, from scratch. <>= ann[, "chromosome"] <- integer2chromosome(ann[, "chromosome"]) fD <- new("AnnotatedDataFrame", data=ann, varMetadata=data.frame(labelDescription=colnames(ann))) pD <- annotatedDataFrameFrom(CT, byrow=FALSE) GT <- matrix(as.integer(GT), nrow(GT), ncol(GT)) dimnames(GT) <- dimnames(CT) chromosome1 <- new("oligoSnpSet", copyNumber=CT, calls=GT, featureData=fD, phenoData=pD, annotation="pd.mapping50kHind.240,pd.mapping50kXba.240") validObject(chromosome1) @ <>= options <- new("HmmOptions", snpset=chromosome1, states=states, copyNumber.location=mu, copyNumber.scale=sample.sd[1], probHomCall=c(0.5, 0.999, 0.7, 0.999, 0.7, 0.7)) params <- new("HmmParameter", states=states(options), initialStateProbability=0.999) cn.emission <- copyNumber.emission(options) gt.emission <- calls.emission(options) emission(params) <- cn.emission + gt.emission ##log scale genomicDistance(params) <- exp(-2 * diff(position(chromosome1))/(100*1e6)) transitionScale(params) <- matrix(1, length(states), length(states)) class(params) hmmpredict <- hmm(options, params) class(hmmpredict) breaks <- findBreaks(predictions(hmmpredict), states=states, position=ann[, "position"], chromosome=ann[, "chromosome"], sample=colnames(CT)) @ See \cite{Scharpf2007a} for a more complete description of the simulated dataset and the features detected by this HMM. We may plot the data along with the predictions as follows: <>= gp <- plot(snpset(options), hmmpredict) gp$abline.v <- TRUE ##plots vertical dashed lines allParameters <- unlist(snpPar(gp)) gp$col.predict[3] <- "white" gp$hmm.ycoords <- c(0.7,0.9) show(gp) @ \subsection{Integrating Confidence Estimates (ICE)} FIXME: integrating confidence estimates of the genotype calls, the probability of a missing genotype, epsilon probabiity for extreme observations In this section, we illustrate how one may fit an HMM that incorporates confidence esimates of the SNP-level summaries for genotype calls and copy number. Confidence scores (inverse of standard errors) are available for this object (see Section \ref{sec:confidenceScores} for how confidence scores were derived). This information is incorporated into the HMM emission probabilities. Probably the easiest way to do that is recreate the options object, and then recalculate the emission probabilities. <>= data(chromosome1) range(cnConfidence(chromosome1)) options <- new("HmmOptions", snpset=chromosome1, states=c("D", "N", "L", "A"), copyNumber.location=c(1, 2, 2, 3), copyNumber.ICE=TRUE, probHomCall=c(0.99, 0.75, 0.99, 0.75)) params <- new("HmmParameter", states=states(options), initialStateProbability=0.999) cn.emission <- copyNumber.emission(options) gt.emission <- calls.emission(options) emission(params) <- cn.emission + gt.emission genomicDistance(params) <- exp(-2*diff(position(chromosome1))/(100*1e6)) transitionScale(params) <- matrix(1, length(states(options)), length(states(options))) fit.ice <- hmm(options, params) calculateBreakpoints(fit.ice) @ We may also incorporate the confidence scores for the genotype calls by specifying \Robject{calls.ICE=TRUE} (FIX ME). The slot \Robject{probHomCall} stores user-specified probabilities of P(call is AA or BB | state is LOH) and P(call is AA or BB | state is normal). These probabilities must be specified in this order. The emission probabilities for the genotype calls will only be calculated for the states LOH (LOH is defined as a stretch of homozygous genotype calls longer than what one would expect by chance) and Normal refers to typical ratios of heterozygous to homozygous genotype calls. The slot \Robject{term5} contains user-specified probabilities for the P(true genotype is HET | genotype call is AB, hidden state is Normal) and P(true genotype is HET | genotype call is AA or BB, hidden state is Normal), respectively. Default values are provided when not specified, as the following example illustrates. <>= ##FIXME options <- new("HmmOptions", snpset=chromosome1, states=c("D", "N", "L", "A"), copyNumber.location=c(1, 2, 2, 3), copyNumber.ICE=TRUE, calls.ICE=TRUE, probHomCall=c(0.99, 0.75)) params <- new("HmmParameter", states=states(options), initialStateProbability=0.99) cn.emission <- copyNumber.emission(options) genomicDistance(params) <- exp(-2 * physicalDistance(options)/(100*1e6)) transitionScale(params) <- scaleTransitionProbability(states(options)) gt.emit <- calls.emission(options) gt.emission <- array(NA, dim(cn.emission)) gt.emission[, , 1:2] <- gt.emit gt.emission[, , 3:4] <- gt.emit emission(params) <- cn.emission + gt.emission fit.ice <- hmm(options, params) @ <>= ##gp <- plot(snpset(options), fit.ice) ##gp$abline.v <- TRUE ##plots vertical dashed lines ##gp$col.predict[2] <- "white" ##show(gp) gp <- plot(snpset(options), fit.ice) gp$abline.v <- TRUE gp$col.predict <- c("darkblue", "white", "yellow", "orange") gp$hmm.coords show(gp) legend(-0.05, 10, fill=gp$col.predict[c(1, 3, 4)], legend=c("< 2 copies", "copy-neutral LOH", "> 2 copies"), bty="n", title="predicted states") legend(0, 0.8, legend="predictions", bty="n", cex=0.8, adj=0) @ Note that the ICE HMM correctly identifies the simulated normal segments in features B and C (the normal segments were simulated to have high confidence scores). Additionally, the ICE HMM detects the micro-amplification in region E (also simulated to have high confidence scores). \subsection{Confidence scores}\label{sec:confidenceScores} \paragraph{Confidence scores for genotype calls} We suggest using the \Rfunction{CRLMM} algorithm \cite{Carvalho2007} for genotype calls. \Rfunction{CRLMM} (in the R package \Rpackage{oligo}) provides confidence scores ($\pgte$) of the genotype estimates ($\gte$). From 269 HapMap samples assayed on the Affymetrix 50k Xba and Hind chips, we have a gold standard of the true genotype defined by the consensus of the HapMap centers. We use kernal based density estimates to obtain {\scriptsize \begin{eqnarray} \label{ingo2} f\left\{\ \pgtehom\ |\ \gtehom,\thom\ \right\},\ f\left\{\ \pgtehom\ |\ \gtehom,\thet\ \right\},\ f\left\{\ \pgtehet\ |\ \gtehet,\thom\ \right\},\ \mbox{~ and~} f\left\{\ \pgtehet\ |\ \gtehet,\thet\ \right\} \end{eqnarray} } \noindent separately for the Xba and Hind 50k chips. The first term in (\ref{ingo2}), for example, denotes the density of the scores when the genotype is correctly called homozygous ($\gtehom$) and the true genotype is homozygous ($\thom$). See \cite{Scharpf2007a} for a more complete description of the methods. The data needed to estimate these densities is stored in the experiment data package \Rpackage{callsConfidence}. \Rpackage{callsConfidence} is available from the author's website. %\texttt{http://biostat.jhsph.edu/~rscharpf/software/index.html}. \paragraph{Confidence scores for copy number estimates} To illustrate how standard errors of the copy number estimate could be integrated in the HMM, the R object \Robject{chromosome1} contains standard errors simulated from a shifted Gamma: $\mbox{Gamma}(1, 2) + 0.3$, where 1 is the shape parameter and 2 is the rate parameter. To ascertain the effect of qualitatively high confidence scores on the ICE HMM, we scaled a robust estimate of the copy number standard deviation by $\frac{1}{2}$. Similarly, to simulate less precise $\cne$ we scaled $\epsilon$ by 2. For more detailed information about how the data in the \Robject{chromosome1} was generated, see the documentation for this object in the R package \ice. \section{The HmmParameter class} An instance of the class is created by the method \Rfunction{new}: <>= new("HmmParameter") @ The object \Robject{params} contains all of the parameters needed for fitting the HMM, including an estimate of the genomic distance between SNPs (used for calculating SNP-specific transition probabilities), emission probabilities (slot: emission), and initial state probabilities. \paragraph{Emission probabilities.} The emission probabilities are stored as an array in the \Robject{params} object. The emission probability array has dimension R $\times$ C $\times$ S, where $S$ is the number of hidden states, $R$ is the number of rows (SNPs), and $C$ is the number of samples. One may use $[$ to subset object of class \Robject{HmmParameter}. <<>>= params[5, 1, ] @ \paragraph{Transition probabilities.} The probability of remaining in the same state, $P(S_t = S_{t+1})$ (the diagonal of the transition probability matrix) is a function of the distance (d) between SNPs: $e^{-2d(100 * 1e6)}$. This value is stored in the slot \Robject{tau} of the \Robject{params} object. The probability of leaving a state is $\epsilon$, where $\epsilon = 1 - P(S_t = S_{t+1})$. The $\epsilon$ is split among $S-1$ states. The SNP-specific transition probabilities can be scaled by specifying a matrix. No scaling of the transition probabilities between states, the default, occurs when the scaling matrix is all 1's: <>= transitionScale(params) @ For illustration, one could make the probability of transitioning from an altered state to a normal state 10 times as likely as the probability of transitioning between two altered states by the following commmand: <>= ##FIXME transitionScale(params) <- scaleTransitionProbability(states(params), SCALE=10) transitionScale(params) @ \section{The HmmOptions Class} To be completed \ldots \section{The HmmPredict Class} The output from the HMM is an instance of the \Robject{HmmPredict} class and contains the predicted states as well as the breakpoints for the different states. <>= fit.ice @ The breakpoints are stored in slot \Robject{breakpoints} of an object of class \Robject{HmmPredict}. These functions do roughly the same thing: <>= breaks <- breakpoints(fit.ice) predict <- predictions(fit.ice) breaks <- findBreaks(x=predict, states=states(fit.ice), position=position(fit.ice), chromosome=chromosome(fit.ice), sample=sampleNames(fit.ice)) breaks <- calculateBreakpoints(fit.ice) @ %The \Robject{summary} method returns a list where each element in the %list provides statistics for an altered states. For each chromosome, %the mean, median, and standard deviation of the size of the features %and number of SNPs involved are reported. For instance, in this HMM %the altered states were loss of heterozygosity (L), amplification of %copy number (A), and deletion of copy number (D). % <>= ##FIXME summary(fit.ice) @ \section{HMMs for different classes of data} \subsection{Copy number} The method \Robject{hmm} has a different set of underlying hidden states depending on whether copy number estimates, genotype calls, or both are available. When only copy number estimates are available, the hidden states (for autosomes) are hemizygous or homozygous deletion (one or fewer copies), normal (two copies), and amplification (three or more copies). The corresponding data class is {SnpCopyNumberSet}. To illustrate, we convert the \Robject{chromosome1} example to an object of this class and fit the HMM. <>= chr1.cn <- as(chromosome1, "SnpCopyNumberSet") options <- new("HmmOptions", snpset=chr1.cn, states=c("D", "N", "A"), copyNumber.location=1:3) params.cn <- new("HmmParameter", states=c("D", "N", "A")) emission(params.cn) <- copyNumber.emission(options) transitionScale(params.cn) <- scaleTransitionProbability(states=c("D", "N", "A"), normalLabel="N") genomicDistance(params.cn) <- exp(-2*physicalDistance(options)/(100*1e6)) fit.cn <- hmm(options, params.cn) breakpoints(fit.cn) graph.par <- plot(snpset(options), fit.cn) graph.par$abline.v <- FALSE @ <>= print(graph.par) legend(0, 0.8, fill=graph.par$col.predict[c(1, 3)], legend=c("< 2 copes", "> 2 copies"), bty="n", cex=0.8) @ <>= rm(chr1.cn, fit.cn, params.cn) @ \subsection{Genotype calls} When only genotype calls are available, the hidden states are loss and retention (ret) of heterozygosity. We define \textit{loss} to be a sequence of homozygous SNPs longer than what we would expect to observe by chance. Note that many long stretches of homozygosity may occur as a result of a population sharing a common underlying haplotype structure; loss predictions from an HMM fit to an indvidual do not necessarily reflect the 'loss' of an allele in that individual. For illustration, we convert the \Robject{chromosome1} example to an object of class \Robject{HmmSnpCallSet} and refit the HMM. <>= chr1.calls <- as(chromosome1, "SnpCallSet") options.calls <- new("HmmOptions", snpset=chr1.calls, states=c("L", "N"), probHomCall=c(0.99, 0.7)) params.calls <- new("HmmParameter", states=states(options.calls)) transitionScale(params.calls) <- scaleTransitionProbability(states(options.calls)) genomicDistance(params.calls) <- exp(-2*physicalDistance(options.calls)/(100*1e6)) emission(params.calls) <- calls.emission(options.calls) fit.calls <- hmm(options.calls, params.calls) breakpoints(fit.calls) gp <- plot(snpset(options.calls), fit.calls) gp$col.predict <- c("black", "white") gp$ylim <- c(-0.5, 1) gp$add.centromere <- FALSE gp$abline.v <- TRUE gp$cytoband.ycoords <- c(-0.45, -0.4) gp$hmm.ycoords <- c(-0.2, -0.05) @ <>= print(gp) legend(0, -0.1, legend="LOH", fill="black", title="predictions", bty="o", cex=0.8) @ \subsection{Genotype calls and copy number} Section \ref{sec:simpleUsage} illustrates how one may fit the HMM to objects of class \Robject{oligoSnpSet}. More documentation about the classes can be found in the documentation for the R package \ice. \section{Session Information} The version number of R and packages loaded for generating the vignette were: <>= toLatex(sessionInfo()) @ \bibliography{ice}{} \bibliographystyle{plain} \end{document}