% \VignetteIndexEntry{TitanCNA} % \VignetteDepends{TitanCNA, foreach} % \VignetteKeywords{Subclonal copy number CNA LOH clonal genome sequencing tumours cancer} % \VignettePackage{HMMcopy} \documentclass[letterpaper]{article} \usepackage{natbib} \usepackage{fullpage} \usepackage{amsmath, amsthm, latexsym} \usepackage{url} \usepackage[utf8]{inputenc} \title{TITAN: Subclonal copy number and LOH prediction from whole genome sequencing of tumours} \author{Gavin Ha} \date{\today} \begin{document} \SweaveOpts{concordance=TRUE} \setkeys{Gin}{width=\textwidth} \maketitle \tableofcontents \section{Introduction} TITAN is a probabilistic framework for predicting regions of copy number alterations (LOH) and loss of heterozygosity (LOH) events in tumour whole genome sequencing (WGS) data. The model simultaneously estimates the cellular prevalence, proportion of tumour sample containing the event, and clonal cluster memberships. The statistical framework was designed based on the principles that the observed sequencing signal is an aggregated measure of multiple heterogeneous cellular populations including normal and tumour subpopulations and sets of genetic aberrations are observed at similar cellular prevalence if these events co-occurred in the same clone due to punctuated clonal expansions \cite{Greaves2012}. The TITAN software is part of an accompanying manuscript where mathematical details are presented \cite{Ha2013}. The observed measurements are modeled as generated from a composite of three cell populations \cite{Yau2010}, consisting normal cells at global proportion $n$; tumour cells with normal genotype at proportion $(1-n)*s_{z}$; and tumour cells with aberrant genotype at proportion $(1-n)*(1-s_{z})$. $s_{z}$ is the proportion of tumour cells that is diploid heterozygous (and therefore normal) at the locus. Thus, $(1-s_{z})$ is the proportion of the tumour population containing the event, or we define as the cellular prevalence. We assume multiple somatic events share similar cellular prevalence and thus can be assigned to one of a finite number of clonal clusters, $z\in Z$. The input data to TITAN are tumour read depth and allele counts. The read depth is corrected for GC content and mappability biases using the Bioconductor R package HMMcopy. Then, the log ratio between tumour and normal corrected depths is computed. The log ratio is model using a mixture of Gaussians were the mean parameter is a funtion of the cellular prevalence, normal proportion, and tumour ploidy; the variance is also unknown and estimated. The allele counts are transformed to allelic ratios, which is defined as the reference read count divided by depth or proportion of reference reads at a locus. Allelic ratios are modeled using a binomial distribution \cite{Ha2012} with the success parameter being a function of cellular prevalence and normal contamination. TITAN is implemented as a two-factor hidden Markov model (HMM) where the hidden genotypes and the hidden clonal cluster memberships are the two Markov chains. The state space is expanded as the joint genotype and clonal cluster states. The output of TITAN is a list of CNA and LOH events, with accompanying parameter estimates of normal proportion, the cellular prevalence and clonal population cluster membership for each event. TITAN estimates the cellular prevalence (proportion of tumour cells with the tumour genotype) based on a fixed number of clonal clusters. Users are required to run TITAN on a tumour sample for a range of clonal clusters in parallel. We advise users to use a range of 1 to 5. When results for the 5 jobs are generated, one for each fixed number of clusters, each parameter output file will contain an S\_Dbw Validity Index score. The run for a tumour sample that has the lowest S\_Dbw score is the optimal result and the corresponding number of clonal clusters is the optimal one. This vignette will detail an example run of TITAN for only 2 clonal clusters of chromosome 2 of a triple negative breast cancer sample \cite{Ha2012,Ha2013}. Once again, for real data, users should repeat the \textit{TITAN} for additional clonal clusters and then select the optimal run using the minimum S\_Dbw Validity index. \section{Analysis Workflow Overview} \begin{enumerate} \item The data points of interest are the germline heterozygous SNP loci identified in the matched normal WGS sample. This is a pre-processing step that is completed by the user, and is outside of the scope of this R package. \item At these loci, the read depth, reference read count and non-reference read count are extracted from the tumour WGS sample. This step is completed outside of the R package \item The read depth are normalized for GC content and mappability biases using the \textit{HMMcopy} R package. \textit{TITAN} uses a wrapper for the function in \textit{HMMcopy} to accomplish this. \item \textit{TITAN} requires as input, the read counts and normalized read depth to estimate model parameters and infer subclonal CNA/LOH. \item For real data, repeat \textit{TITAN} analysis for a range of clonal clusters. \end{enumerate} \section{Model training and parameter estimation} TITAN uses the Expectiation-Maximization algorithm to train the model and estimate parameters for cellular prevalence, normal proportion and tumour ploidy. In the E-step, the forwards-backwards algorithm is employed, making a call to a C implementation for faster computation. In the M-step, parameters are estimated using maximum a posteriori (MAP). The various parameters TITAN uses the Viterbi algorithm to find the optimal state sequence path of genotype and clonal cluster membership for each data point. \section{Extracting tumour and normal read depth} TITAN requires the read depth from the tumour and normal WGS samples. This is done using a tool, outside of R, distributed as part of the \textit{HMMcopy Suite} available at \url{http://compbio.bccrc.ca/software/hmmcopy/}. In short, the suite has tools to: \begin{itemize} \item Obtain high resolution bin counts for large ($\approx$ 250GB) BAM files within a few hours. \item Obtain GC content for bins from standard FASTA files within minutes for a human genome. \item Obtain average mappability for bins from BigWig within minutes, or FASTA files within a day for a human genome. \end{itemize} For TITAN, the user will need to generate the read depth files for the tumour and normal samples in the form of WIG files. Also required, and is provided in the \textit{HMMcopy suite} for the GRCh37-lite reference genome, are the GC content and mappability scores WIG files. Please refer to the instructions on the \textit{HMMcopy suite} website for preparing these files. \section{TITAN analysis} Users should run TITAN once for each setting of the number of clonal clusters, ranging from 1 to 5. Although higher number of clonal clusters can be used, in practice, the sequencing coverage of 30-50X, analyzing more than 5 clonal clusters may not produce accurate results. For each run of TITAN on real data with approximately 2 million loci across all chromosomes and for 5 clonal clusters, the maximum memory requirement is approximately 24Gb. \subsection{Loading the data} Load the provided text file for the allele counts for chromosome 2 of a breast cancer sample \cite{Ha2012}. The format of the file must be 6 columns: chromosome, position, reference base, reference read counts, non-reference base, non-reference read counts. A list object with 6 components of equal size/lengths is returned. <<>>= library(TitanCNA) infile <- system.file("extdata", "test_alleleCounts_chr2.txt", package = "TitanCNA") data <- loadAlleleCounts(infile, genomeStyle = "NCBI") names(data) @ Users can specify the desired chromosome naming convention by using the \verb|genomeStyle| argument. Using \verb|NCBI| will use chromosome names such as 1, 2, 3, ..., X, while using \verb|UCSC| will use names such as chr1, chr2, chr3, ..., chrX. Note that it does not matter what the chromosome convention was originally; using the \verb|genomeStyle| argument will return the desired convention. \subsection{Correct GC content and mappability biases} Correct GC content and mappability biases using a wrapper function that calls uses a function similar to the \textit{HMMcopy} package R function, \textit{correctReadcount}. Here, 4 wig files are required: tumour, normal, GC content, and mappability score. The wrapper will apply the correction sub-routines and compute the log ratio as log2(tumour/normal). <<>>= tumWig <- system.file("extdata", "test_tum_chr2.wig", package = "TitanCNA") normWig <- system.file("extdata", "test_norm_chr2.wig", package = "TitanCNA") gc <- system.file("extdata", "gc_chr2.wig", package = "TitanCNA") map <- system.file("extdata", "map_chr2.wig", package = "TitanCNA") cnData <- correctReadDepth(tumWig, normWig, gc, map, genomeStyle = "NCBI") head(cnData) @ For real data, users can use the genome-wide GC content and mappability score for 1kb windows. The two files can be found compressed in data/GRCh37-lite.tar.gz of the git repository. These WIG files were generated from the reference GRCh37-lite.fasta; the tumour and normal BAM files used for generating the tumour and normal WIG files in the preprocessing steps MUST also use this same reference. Otherwise, new GC and map wig files will need to be created as per instructions on \url{http://compbio.bccrc.ca/software/hmmcopy/} Then, find the log ratio at each position of interest (germline heterozygous SNPs) given in the object "data". Then transform the log ratios to natural logs instead of log base 2. Remove \verb|logR| and \verb|cnData| objects to save memory. <<>>= logR <- getPositionOverlap(data$chr, data$posn, cnData) data$logR <- log(2^logR) #transform the log ratio to natural logs @ \subsection{Filter the data} Filter the data for low and high depth positions and positions with NA's. For ease of analysis, TITAN converts chromosome X->23 and Y->24. Do not worry, in the final output, "X" and "Y" will be in the output. Optionally, \begin{itemize} \item \verb|positionList|: data.frame containing a list of positions can be passed as an argument, specifying which positions to use. \item \verb|centromeres|: data.frame with 3 columns - chr, start, stop that correspond to centromere regions. This argument can actually accept any regions that you wish to exclude. \item \verb|centromere.flankLength|: the number of base pairs to the left and right of each region in \verb|centromeres| that you which to further exclude. \end{itemize} <<>>= data <- filterData(data, c(1:22, "X", "Y"), minDepth = 10, maxDepth = 200, positionList = NULL, centromere = NULL, centromere.flankLength = 10000) @ \subsection{Loading the initial parameters} Load the default parameters using a specified maximum copy number \textit{TITAN} will consider. Here, 5 is used but up to 8 copies can be specified, however, running time and memory performance hits will be incurred. Here, you will also specify the number of clonal clusters to use for the TITAN run. This is where the number of clusters will be specified when running for 1,2,3,4, or 5 clonal cluster for real data. In general, the default parameters will work well with real, full datasets of whole genome sequencing of tumour samples. Including the loaded \verb|data| from the previous step will help inform the baseline allelic ratio parameters; this is recommended when using \verb|symmetric| genotypes. \verb|hetBaselineSkew| is the allelic reference skew for heterozygous states (e.g. 1:1, 2:2, 3:3). This value is the additive to baseline allelic ratios, for example, when \verb|hetBaselineSkew=0.05|, then the heterozygous allelic ratio is expected to be 0.55. The \verb|alleleEmissionModel| is specified as \verb|"binomial"| or \verb|"Gaussian"| (not fully implemented yet; please use binomial). <<>>= numClusters <- 2 params <- loadDefaultParameters(copyNumber = 5, numberClonalClusters = numClusters, symmetric = TRUE, hetBaselineSkew = 0, alleleEmissionModel = "binomial", data = data) params @ \verb|params| is a list object containing 4 sets of parameters, each as a component: \begin{itemize} \item \verb|genotypeParams|: Parameters for copy number and allelic ratios genotype states \item \verb|normalParams|: Parameters for normal contamination \item \verb|ploidyParams|: Parameters for average tumour ploidy \item \verb|cellPrevParams|: Parameters for modeling subclonality: clonal clusters and cellular prevalence \item \verb|data|: Input data frame to help determine the prior diploid heterozygous baseline allelic ratio noise \end{itemize} Analysis of copy number by \textit{TITAN} is subject to tumour ploidy. TITAN attempts to estimate this value globally and is accurate in most cases. However, the issue is that, from the signals observed in the data, \textit{TITAN} cannot distinguish between diploid (2 global copies) and tetraploidy (4 global copies) because the data can explain both situations. Therefore, in general, if the user is aware that a sample is ployploid, then TITAN should be run with the ploidy initialization of 2 and also for initialization of 4. If the user is unaware of the ploidy status of the sample, then inspection of a TITAN run at ploidy initialization of 2 can help. If this run has many large, prominent regions of inferred homozygous deletions, then it is likely that this sample is triploid or tetraploid. <>= params$ploidyParams$phi_0 <- 2 # for diploid or params$ploidyParams$phi_0 <- 4 # for tetraploid/ployploid @ Because this vignette is an example involving only one chromosome, the fewer loci in the analysis will give different results than for a sample with all chromosomes. For this example, we will modify the Gaussian variance hyperparameter (prior used in modeling the log ratios) to a less influential setting, making the analysis more suitable for one chromosome. Also, because we know the average tumour ploidy of the genome-wide sample is slightly less than 2, we will initialize the \verb|phi_0| to 1.5 so that it does not overfit to this one chromosome. <<>>= K <- length(params$genotypeParams$alphaKHyper) params$genotypeParams$alphaKHyper <- rep(500, K) params$ploidyParams$phi_0 <- 1.5 @ \subsection{Model training and parameter estimation} Parameter estimation in \textit{TITAN} uses the Expectation Maximization algorithm (EM) and forwards-backwards algorithm. This step generally requires parallelization to increase time performance. The \textit{TITAN} package uses the \textit{foreach} package which conveniently enables the user to use libraries, such as \textit{doMC} and \textit{doMPI}, to parallelize the training by chromosome. It is up to the user to decide which is the best library to use depending on the current system. For example, if one wishes to use multiple cores via forking on a single machine, <>= library(doMC) registerDoMC(cores = 4) #use 4 cores on a single machine @ Here is a little more detail regarding some important arguments: \begin{itemize} \item \verb|maxiter|: maximum number of EM iterations allowed. In practice, users should set it at 20 since it does not exceed 20. For this vignette to finish in a shorter time, we use 3. \item \verb|maxiterUpdate|: maximum number of coordinate descent iterations during the M-step when parameters are estimated. In practice, 1500 iterations should be used. \item \verb|txnExpLen|: influences prior probability of genotype transitions in the HMM. The higher, the lower tendency to change state. \item \verb|txnZstrength|: influences prior probability of clonal cluster transitions in the HMM. Larger values means lower tendency to change clonal cluster state. \item \verb|useOutlierState|: logical indicating whether an additional outlier state should be used. In practice, this usually is not necessary. \item \verb|normalEstimateMethod|: specifies how to handle normal proportion estimation. To estimate, use "map" which is maximum a posteriori. If you wish to not estimate this parameter, then use "fixed". This will default the normal proportion to whatever is specified in \verb|params$normalParams$n_0|. For example, if you know that this sample has absolutely no normal contamination, then use \verb|params$normalParams$n_0 <- 0| and set \verb|normalEstimateMethod="fixed"|. \item \verb|estimateS|: logical indicating whether to account for clonality and estimate subclonal events \item \verb|estimatePloidy|: logical indicating whether to estimate and account for tumour ploidy \end{itemize} <<>>= convergeParams <- runEMclonalCN(data, params, maxiter = 3, maxiterUpdate = 50, useOutlierState = FALSE, txnExpLen = 1e15, txnZstrength = 5e5, normalEstimateMethod = "map", estimateS = TRUE, estimatePloidy = TRUE) names(convergeParams) @ \verb|convergeParams| is a list object with components containing the converged parameters from the EM training, including posterior marginal responsibilities, log likelihood, and original parameter settings. \begin{itemize} \item \verb|n|: Converged estimate for normal contamination parameter. \verb|numeric array| containing estimates at each EM iteration. \item \verb|s|: Converged estimate(s) for cellular prevalence parameter(s). This value is defined as the proportion of tumour sample that does \emph{not} contain the aberrant genotype. This will contrast what is output in \verb|outputTitanResults|. \verb|numeric array| containing estimates at each EM iteration. If more than one cluster is specified, then \verb|s| is a \verb|numeric matrix|. \item \verb|var|: Converged estimates for variance parameter of the Gaussian mixtures used to model the log ratio data. \verb|numeric matrix| containing estimates at each EM iteration. \item \verb|phi|: Converged estimate for tumour ploidy parameter. \verb|numeric array| containing estimates at each EM iteration. \item \verb|piG|: Converged estimate for initial genotype state distribution. \verb|numeric matrix| containing estimates at each EM iteration. \item \verb|piZ|: Converged estimate for initial clonal cluster state distribution. \verb|numeric matrix| containing estimates at each EM iteration. \item \verb|muR|: Mean of binomial mixtures computed as a function of \verb|s| and \verb|n|. \verb|numeric matrix| containing estimates at each EM iteration. See References for mathematical details. \item \verb|muC|: Mean of Gaussian mixtures computed as a function of \verb|s|, \verb|n|, and \verb|phi|. \verb|numeric matrix| containing estimates at each EM iteration. See References for mathematical details. \item \verb|loglik|: Posterior Log-likelihood that includes data likelihood and the priors. \verb|numeric array| containing estimates at each EM iteration. \end{itemize} \subsection{Find the genotype and clonal cluster state paths} Viterbi algorithm is used to return the optimal genotype/clonal cluster state path. After running EM, use the converge parameters and the input data to compute the optimal segmentation results. <<>>= optimalPath <- viterbiClonalCN(data, convergeParams) head(optimalPath) @ It is difficult to interpret the output of this function directly. The user should use the function \verb|outputTitanResults| to format the results. \subsection{Format and print results to file} \subsubsection{Position-specific Copy Number Results} Position-specific results that includes genotype, clonal cluster, and cellular prevalence estimates. The posterior probabilities at each position can optionally be returned. The results can be output to a file if a file name is specified for the \verb|filename| argument. <<>>= results <- outputTitanResults(data, convergeParams, optimalPath, filename = NULL, posteriorProbs = FALSE, subcloneProfiles = TRUE, correctResults = TRUE, proportionThreshold = 0.05, proportionThresholdClonal = 0.05, is.haplotypeData = FALSE) names(results) head(results$corrResults) ## corrected results convergeParams <- results$convergeParam ## use corrected parameters results <- results$corrResults ## use corrected results @ The elements of this list contains the following: \begin{enumerate} \item \verb|results|: TITAN results, uncorrected for cluster number and parameters \item \verb|corrResults|: TITAN results, corrected by removing empty clusters and parameters adjusted accordingly. \item \verb|convergeParams|: Corrected parameter object \end{enumerate} When \verb|correctResults=TRUE|, the results will be post-processed to exclude clonal clusters that have fewer genomic alterations. Users can specify the minimum proportion of the genome altered for any subclonal cluster using \verb|proportionThreshold| and the minimum proportion of clonal events for the clonal (highest cellular prevalence) cluster using \verb|proportionThresholdClonal|. The function returns the post-processed \verb|convergeParams| and \verb|corrResults| objects in a 2 element list. \verb|results| and \verb|CorrResults| have the following format: \begin{enumerate} \item \verb|Chr| \item \verb|Position| \item \verb|RefCount|: number of reads matching the reference base \item \verb|NRefCount|: number of reads matching the non-reference base \item \verb|Depth|: total read depth at the position \item \verb|AllelicRatio|: RefCount/Depth \item \verb|LogRatio|: log2 ratio between normalized tumour and normal read depths \item \verb|CopyNumber|: predicted TITAN copy number \item \verb|TITANstate|: internal state number used by TITAN; see supplementary table 2 in manuscript \item \verb|TITANcall|: interpretable TITAN state; string \{HOMD,DLOH,HET,NLOH,ALOH,ASCNA,BCNA,UBCNA\}, see supplementary table 2 in manuscript \item \verb|ClonalCluster|: predicted TITAN clonal cluster; lower cluster numbers represent clusters with higher cellular prevalence \item \verb|CellularPrevalence|: proportion of tumour cells containing event; not to be mistaken as proportion of sample (including normal) \item \verb|Subclone1.CopyNumber|: Copy number profile for Subclone 1 \item \verb|Subclone1.TITANcall|: TITAN state for Subclone 1 \item \verb|Subclone1.Prevalence|: Subclonal prevalence for Subclone 1 \item \verb|Subclone2.CopyNumber|: Copy number profile for Subclone 2 \item \verb|Subclone2.TITANcall|: TITAN state for Subclone 2 \item \verb|Subclone2.Prevalence|: Subclonal prevalence for Subclone 2 \end{enumerate} %For haplotype-based copy number using phasing information, use \verb|is.haplotypeData=TRUE|. Then, additional columns are included: %\begin{enumerate} % \item \verb|HaplotypeRatio|: haplotype-based fraction if analyzing haplotype-based %copy number % \item \verb|HaplotypeCount|: % \item \verb|HaplotypeDepth|: Total coverage % \item \verb|PhaseSet|: An identifier indicating the original haplotype phase block %\end{enumerate} \subsubsection{Copy number segments} Since version 1.10.1, users can directly generate segments and output these to files. There are two output files: 1) Segments with detailed information, and 2) Segments compatible for loading into the Integrative Genomics Viewer (IGV) application. These filenames can be specified in arguments \verb|filename| and \verb|igvfilename| <>= segs <- outputTitanSegments(results, id = "test", convergeParams, filename = NULL, igvfilename = NULL) head(segs) @ The definitions of the columns of the segment object \verb|segs| are the following: \begin{enumerate} \item \verb|Sample|: Name of sample \item \verb|Chromosome|, \verb|Start_Position.bp.|, \verb|End_Position.bp.|: Coordinates of segment \item \verb|Length.snps.|: Number of SNPs in the segment \item \verb|Median_Ratio|: Median allelic ratio across SNPs in the segment \item \verb|Median_logR|: Median log ratio across SNPs in the segment \item \verb|TITAN_state|: Same as defined above \item \verb|TITAN_call|: Same as defined above \item \verb|Copy_Number|: Same as defined above \item \verb|MinorCN|: Copy number of minor allele \item \verb|MajorCN|: Copy number of major allele \item \verb|Clonal_Cluster|: Same as defined above \item \verb|Cellular_Frequency|: Same as defined above \end{enumerate} \subsubsection{Adjusted Copy Number Results} New in version 1.17.1, copy number results can be further adjusted to reflect total integer and allelic copy number by correcting the HMM copy number states. The HMM normally supports a maximum copy number value of 8, but this correction will output an adjusted copy number based on the log ratio values. Users can specify the initial copy number state to correct using \verb|maxCNtoCorrect.autosomes|. For example, if \verb|maxCNtoCorrect.autosomes=8|, then all positions and segments with copy number \verb|>=8| will be corrected. If \verb|correctHOMD=TRUE|, then homozygous deletion (copy number of 0) calls will be adjusted to account for potential label-switching (i.e. missed HOMD call) and false positives (i.e. incorrect HOMD call). Finally, users can specify the minimum tumor content required to performed the tumor purity and ploidy correction. Samples with lower tumor conent will likely have noisier adjusted values since tumor purity is used in the correction. The correction is cmoputed based on the following formula from Ref\cite{Ha2013} using observed log ratio $l_t$ at position or segment $t$ and tumor purity $(1-n)$ and tumor ploidy $\phi$: \begin{eqnarray} l_t=\log\left(\frac{nc_{N}+\left(1-n\right)s_{z}c_{N}+\left(1-n\right)\left(1-s_{z}\right)c_t}{nc_{N}+\left(1-n\right)\phi}\right)\\ \hat{c_T}=\frac{2^{l_T}\left[nc_{N}+\left(1-n\right)\phi\right]-nc_{N}+\left(1-n\right)s_{z}c_{N}}{\left(1-n\right)\left(1-s_{z}\right)} \end{eqnarray} where $\hat{c_{t}}$ is the adjusted tumor copy number at position or segment $t$. <>= # get the estimated tumor ploidy ploidy <- tail(convergeParams$phi, 1) # get the estimated normal normal <- tail(convergeParams$n, 1) # apply tumor purity and ploidy correction corrIntCN.results <- correctIntegerCN(results, segs, 1 - normal, ploidy, maxCNtoCorrect.autosomes = 8, maxCNtoCorrect.X = NULL, correctHOMD = FALSE, minPurityToCorrect = 0.2, gender = "female", chrs = 2) head(corrIntCN.results$segs) # re-assign to results and segs objects results <- corrIntCN.results$cn segs <- corrIntCN.results$segs @ Both position-level (\verb|results|) and segment-level (\verb|segs|) results will have additional columns appended for the adjusted copy number results. The additional columns are the following: \begin{enumerate} \item \verb|logR_Copy_Number|: Purity and ploidy corrected log ratios that have been converted to a decimal-based copy number value. \item \verb|Corrected_Copy_Number|: Purity and ploidy corrected total copy number rounded to the nearest integer. \item \verb|Corrected_Call|: Copy number status of the total corrected copy number. \item \verb|Corrected_MajorCN|: Purity and ploidy corrected integer (rounded) major copy number value. \item \verb|Corrected_MinorCN|: Purity and ploidy corrected integer (rounded) minor copy number value. \end{enumerate} \subsubsection{Estimated Model Parameters} Model parameters and summary values such as the number of clonal clusters and their corresponding cellular prevalence, normal contamination, ploidy, and the S\_Dbw validity index for model selection later. <>= outparam <- paste("test_cluster02_params.txt", sep = "") outputModelParameters(convergeParams, results, outparam, S_Dbw.scale = 1) @ The format of the parameter output file is the following: \begin{enumerate} \item Normal contamination estimate: proportion of normal content in the sample; tumour content is 1 minus this number \item Average tumour ploidy estimate: average number of estimated copies in the genome; 2 represents diploid \item Clonal cluster cellular prevalence: Z denotes the number of clonal clusters; each value (space-delimited) following are the cellular prevalence estimates for each cluster \item Genotype binomial means for clonal cluster Z: set of 21 binomial estimated parameters for each specified cluster \item Genotype Gaussian means for clonal cluster Z: set of 21 Gaussian estimated means for each specified cluster \item Genotype Gaussian variance: set of 21 Gaussian estimated variances; variances are shared for across all clusters \item Number of iterations: number of EM iterations needed for convergence \item Log likelihood: complete data log-likelihood for current cluster run \item S\_Dbw dens.bw: density component of S\_Dbw index \item S\_Dbw scat: scatter component of S\_Dbw index \item S\_Dbw validity index: used for model selection; choose run with optimal number of clusters based on lowest S\_Dbw index \end{enumerate} Users may alter the \verb|S_Dbw.scale| argument to penalize higher number of clonal clusters. <>= outputModelParameters(convergeParams, results, outparam, S_Dbw.scale = 10) @ \subsection{Plotting the results} Plot the results using 3 built in functions. For each chromosome, there is a separate plot function for the log ratios (CNA), allelic ratios (LOH), and cellular prevalence. Each function adds to an existing plot. \subsubsection{Copy number alterations (log ratio)} Generate the figure of copy number alteration results by plotting the log ratios. Optionally, the data can be corrected by using the estimated \verb|ploidy|. Otherwise, use \verb|ploidy=NULL|. <>= ploidy <- tail(convergeParams$phi, 1) ploidy normal <- tail(convergeParams$n, 1) normal plotCNlogRByChr(results, segs = segs, chr = 2, ploidy = ploidy, normal = normal, ylim = c(-2, 2), cex = 0.25, xlab = "", main = "Chr 2") @ The Y-axis is based on log ratios. Log ratios are computed ratios between normalized tumour and normal read depths. Data points close to 0 represent diploid, above 0 are copy gains, below 0 are deletions. Bright Green - HOMD Green - DLOH Blue - HET, NLOH Dark Red - GAIN Red - ASCNA, UBCNA, BCNA \subsubsection{Loss of heterozygosity (allelic ratio)} The loss of heterozygosity (LOH) results is shown by plotting the allelic ratio. <>= plotAllelicRatio(results, chr = 2, ylim = c(0, 1), cex = 0.25, xlab = "", main = "Chr 2") @ The Y-axis is based on allelic ratios. Allelic ratios are computed as RefCount/Depth. Data points close to 1 represent homozygous reference base, close to 0 represent homozygous non-reference base, and close to 0.5 represent heterozygous. Normal contamination influences the divergence away from 0.5 for LOH events. Grey - HET, BCNA Bright Green - HOMD Green - DLOH, ALOH Blue - NLOH Dark Red - GAIN Red - ASCNA, UBCNA \subsubsection{Cellular prevalence and clonal clusters} One of the key features of \textit{TITAN} is the estimation of cellular prevalence and the inference of clonal cluster membership. The estimated normal proportion is required for this plot. This can be obtained from \verb|convergeParams|. <>= norm <- tail(convergeParams$n, 1) norm # estimated normal contamination 1 - convergeParams$s[, ncol(convergeParams$s)] # estimated cellular prevalence plotClonalFrequency(results, chr = 2, normal = norm, ylim = c(0, 1), cex = 0.25, xlab = "", main = "Chr 2") @ The Y-axis is the cellular prevalence that includes the normal proportion. Therefore, the cellular prevalence here refers to the proportion in the sample (including normal). Lines are drawn for each data point indicating the cellular prevalence. Heterozygous diploid are not shown because it is a normal genotype and is not categorized as being subclonal (this means 100\% of cells are normal). The black horizontal line represents the tumour content labeled as "T". Each horizontal grey line represents the cellular prevalence of the clonal clusters labeled as Z1, Z2, etc. Colours are the sames for allelic ratio plots. \subsubsection{Subclone profiles} New since TitanCNA v1.2.0, users can plot the copy number profiles for the predicted subclones. This function only works for solutions containing 1 or 2 clonal clusters. <>= plotSubcloneProfiles(results, chr = 2, cex = 1, spacing = 2, main = "Chr 2") @ Colours have the same definition as for the allelic ratio plots. \subsubsection{Segment medians} For a cleaner visualization of copy number results, users can plot segment means of total copy number and allelic copy number results <>= plotSegmentMedians(segs, chr=2, resultType = "LogRatio", plotType = "CopyNumber", plot.new = TRUE, ylim = c(0, 4), main="Chr 2") @ \section{Session Information} The version number of R and packages loaded for generating the vignette <>= toLatex(sessionInfo()) @ \bibliographystyle{plain} \bibliography{TitanCNA} \end{document}