%\VignetteIndexEntry{Rtreemix} %\VignetteDepends{methods, graph, Biobase, Hmisc, Rgraphviz} %\VignetteKeywords{Rtreemix, mtreemix, disease progression, HIV} %\VignettePackage{Rtreemix} %\documentclass[a4paper, oneside, 10pt]{article} \documentclass[a4paper,12pt,twoside]{article} \usepackage[pdftex]{graphicx} \usepackage{calc} \usepackage{sectsty} \usepackage{caption} \renewcommand{\captionfont}{\it\sffamily} \renewcommand{\captionlabelfont}{\bf\sffamily} \allsectionsfont{\sffamily} % page style %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \usepackage[a4paper, left=23mm, right=23mm, top=20mm, bottom=20mm, nohead]{geometry} \setlength{\parskip}{1.5ex} \setlength{\parindent}{0cm} \pagestyle{empty} \usepackage{Sweave} \SweaveOpts{prefix.string = Rtreemix} \title{\vspace*{-6ex} {\bf Rtreemix}: a package for estimating mutagenetic trees mixture models and genetic progression scores} \author{Jasmina Bogojeska, J\"org Rahnenf\"uhrer} \date{\today \\% \texttt{http://www.mpi-sb.mpg.de/$\sim$jasmina}} \begin{document} \maketitle <>= options(width = 70) @ \section{Introduction} The mixture of mutagenetic trees introduced in \cite{Beer12005} is an evolutionary model that provides an interpretable probabilistic framework for modeling multiple paths of ordered accumulation of permanent genetic changes that can be used for describing many disease processes. Each path captures a possible route of disease development. These models are used to model \texttt{HIV} progression characterized by accumulation of resistance mutations in the viral genome under drug pressure \cite{Beer12005} and cancer progression by accumulation of chromosomal aberrations in tumor cells \cite{Rahn2005}. From the mixture model, a genetic progression score (\texttt{GPS}) can be computed for each patient \cite{Rahn2005} that gives an estimate of the disease progression and can be used for specifying therapies or estimating survival times of the patients. Both the mixture model itself and the derived \texttt{GPS} values are shown to improve the interpretation of disease progression and to have predictive power for estimating the drug resistance in \texttt{HIV} \cite{Beer12005} or the survival time in cancer \cite{Rahn2005}. Beerenwinkel \textit{et al.} in \cite{Beer22005} introduced the \texttt{Mtreemix} package implemented in \texttt{C/C++} that provides an efficient code for estimating the mutagenetic trees mixture models from cross-sectional data and using them for various predictions. Building up on the \texttt{Mtreemix} software and using the \texttt{C/C++} API that \texttt{R} provides, we created the \texttt{Rtreemix} package. By reusing already existing \texttt{C} functions our package provides the users with \texttt{R} functions as efficient as the programs available in the \texttt{Mtreemix}. Similar to \texttt{Mtreemix}, the \texttt{Rtreemix} package provides functions for learning the mixture model from given data, simulation, likelihood computation and estimation of the GPS values. Furthermore, it introduces new functionality for estimating genetic progression scores with corresponding bootstrap confidence intervals and for performing stability analysis of the mixture models \cite{Bogo2008}. \section{Structure and Functionality} The class structure of the \texttt{Rtreemix} package is given in Figure~\ref{fig:classes}. \begin{figure}[!h] \centering \includegraphics[width=0.7\linewidth]{ClassDiagram.pdf} \caption{Class diagram of the package \texttt{Rtreemix}. The diagram illustrates the classes with their attributes and the relationships among them.} \label{fig:classes} \end{figure} In this way, all data structures and information connected to some entity, like the mixture model or the data for the model estimation, are packed together, which makes the code compact and easy to understand and work with. In what follows we will present a working scenario in which we will use and briefly discuss most of the functions available in \texttt{Rtreemix}. More detailed information about the parameters of all the functions used in the text bellow and their default values can be found in the help files of the package. First, we load the package. <>= library(Rtreemix) @ \subsection{The Dataset and the \textit{RtreemixData} class} The datasets used for estimating the mixture models consist of binary patterns that describe the occurrence of a set of genetic events in a set of patients. Each pattern corresponds to a single patient. The set of genetic events comprises genetic changes relevant for the disease taken into consideration. In \texttt{Rtreemix} the data used for estimating a mutagenetic trees mixture model is represented as an object of the class \texttt{RtreemixData}. We consider the dataset from the Stanford \texttt{HIV} Drug Resistance Database \cite{Rhee2003} that comprises genetic measurements of 364 \texttt{HIV} patients treated only with the drug \textit{zidovudine}. This dataset is given as an \texttt{RtreemixData} object \textit{hiv.data.RData} in the \textit{data} folder of the package and can be loaded and displayed as follows. <>= data(hiv.data) show(hiv.data) ## show the RtreemixData object @ It should be also pointed out that a text file with a specific format can be used for creating an object of class \texttt{RtreemixData}. An example of such file is the file \textit{treemix.pat} in the \textit{examples} directory of the package. The text files used to create an \texttt{RtreemixData} object should follow the format of this file. Using \textit{treemix.pat} the \texttt{RtreemixData} object is created as follows. <>= ex.data <- new("RtreemixData", File = paste(path.package(package = "Rtreemix"), "/examples/treemix.pat", sep = "")) show(ex.data) ## show the RtreemixData object @ One can also create an \texttt{RtreemixData} object by specifying the set of patient profiles as a binary matrix as shown in the code below. <>= bin.mat <- cbind(c(1, 0, 0, 1, 1), c(0, 1, 0, 0, 1), c(1, 1, 0, 1, 0)) toy.data <- new("RtreemixData", Sample = bin.mat) show(toy.data) @ Additionally, there are functions for listing the set of profiles, the genetic events, the patient IDs, the number of events and the number of patients in the dataset. <>= Sample(hiv.data) Events(hiv.data) Patients(hiv.data) eventsNum(hiv.data) sampleSize(hiv.data) @ \subsection{Learning mutagenetic trees mixture models} Having a set of patterns that indicate the occurence of genetic events for a group of patients we can learn a mutagenetic trees mixture model. The model is an object of the \texttt{RtreemixModel} class that extends the \texttt{RtreemixData} class. We fit a 2-trees mixture model for the \texttt{HIV} data \cite{Rhee2003}. <>= mod <- fit(data = hiv.data, K = 2, equal.edgeweights = TRUE, noise = TRUE) show(mod) @ The tree components of the fitted model can be visualized as follows. \setkeys{Gin}{width=.9\linewidth} \begin{figure}[!t] \centering <>= plot(mod, fontSize = 15) @ \caption{The mutagenetic trees mixture model for the \texttt{HIV} dataset.} \label{fig:hiv} \end{figure} When the mixture model contains a large number of tree components it is convenient to be able to plot a specific tree component. The following code plots the second tree component of the mixture model learned from the \texttt{HIV} dataset. <>= plot(mod, k=2, fontSize = 14) @ It is assumed that mixture models with at least two components always have the noise (star) component as a first component. When only one tree component is fitted to the given data it can be either a star or a nontrivial tree component based on the choice of the parameter \texttt{noise}. The mixture components comprising the model are represented as a list of directed \texttt{graphNEL} objects, and their weights (the mixture parameters) are given as a numeric vector. There are functions for getting the mixture parameters of the model, the number of tree components, the dataset used for estimating the model, etc. <>= Weights(mod) Trees(mod) getTree(object = mod, k = 2) ## Get a specific tree component k edgeData(getTree(object = mod, k = 2), attr = "weight") ## Conditional probabilities assigned to edges of the 2nd tree component numTrees(mod) getData(mod) @ This class can also contain other useful information connected with the mixture model: an indicator for the presence of the star component, a matrix of the responsibilities of each tree component for each pattern of the data used for learning the model, a matrix of the complete dataset in case of missing data, etc. <>= Star(mod) Resp(mod) CompleteMat(mod) @ The mutagenetic trees mixture model encodes a probability distribution on the set of all possible patterns \cite{Beer12005}. <>= distr <- distribution(model = mod) distr$probability @ One can also generate a random mutagenetic mixture model. In this case each tree component from the model is drawn uniformly at random from the tree topology space by using the Pr\"ufer encoding of trees. The number of tree components and the number of genetic events have to be specified. Additionally, one can specify the range from which the edge weights of the tree components are randomly drawn ($[0.2, 0.8]$ in the example bellow). <>= rand.mod <- generate(K = 3, no.events = 9, noise.tree = TRUE, prob = c(0.2, 0.8)) show(rand.mod) @ It is also possible to fit a mixture model and analyze its variance by deriving confidence intervals for the mixture parameters and the edge weights (resulting from a bootstrap analysis). An example for this is given in the PDF file \textit{ExtendedVignette} in the \textit{doc} folder of the package. \subsection{The \texttt{likelihood} method} The package \texttt{Rtreemix} implements the function \texttt{likelihoods} which calculates the (log, weighted) likelihoods for the patterns in a given dataset (\texttt{RtreemixData} object) derived with respect to a given \texttt{RtreemixModel}. The likelihoods are contained in an object of class \texttt{RtreemixStats} which extends the \texttt{RtreemixData} class. The number of the genetic events in the patterns from the given dataset has to be equal to the number of genetic events in the branchings from the given mixture model. In the code that follows we calculate the likelihoods of the \texttt{HIV} dataset with respect to the fitted mixture model. <>= mod.stat <- likelihoods(model = mod, data = hiv.data) Model(mod.stat) getData(mod.stat) LogLikelihoods(mod.stat) WLikelihoods(mod.stat) @ When having the weighted likelihoods, one can easily derive the responsibilities of the model components for generating the patterns in the specified dataset. <>= getResp(mod.stat) @ \subsection{The \texttt{sim} method} The mutagenetic trees mixture model encodes a probability distribution on the set of all possible patterns for a specified set of genetic events. The \texttt{sim} method provides the possibility of simulating (drawing) patterns from a given \texttt{RtreemixModel}. The simulated patterns are then returned as an \texttt{RtreemixData} object. Let's draw a specified number of patterns from our randomly generated model. <>= data <- sim(model = rand.mod, no.draws = 300) show(data) @ When besides the mixture model also the sampling mode and its respective sampling parameter are specified, this function simulates patterns together with their waiting and sampling times from the respective model. The waiting and sampling times result from a waiting time simulation along the branchings of the mixture model. The \texttt{sim} method presents the results in an \texttt{RtreemixSim} object. An example illustrating this is given in the PDF file \textit{ExtendedVignette}. \subsection{The genetic progression score (\texttt{GPS})} When assuming independent Poisson processes for the occurrence of events on the edges of each mixture component and for the sampling time of the disease, waiting times can be mapped on the tree edges of the branchings. Consequently, a genetic progression score (\texttt{GPS}) that incorporates correlations among events and time intervals among occurrences of events can be associated to the mixture model as proposed in \cite{Rahn2005}. The \texttt{GPS} estimates the stage of the disease and is important for estimating survival times and giving patients proper therapies. In the \texttt{Rtreemix} package the method \texttt{gps} calculates the \texttt{GPS} of a given set of patterns with respect to a specified mixture model (an \texttt{RtreemixModel} object). The GPS values are derived in a waiting time simulation for a given large number of simulation iterations, and for a specified sampling mode ("constant" or "exponential") and its corresponding sampling parameter. The result of the \texttt{GPS} calculation is given as an object of the class \texttt{RtreemixGPS} which extends the class \texttt{RtreemixData}. Moreover, with the function \texttt{confIntGPS} the package gives the possibility to analyze the variance of the \texttt{GPS} values. This function first calculates the \texttt{GPS} for the patterns in a given dataset based on a fitted \texttt{K}-mutagenetic trees mixture model. Then, it derives a 95\% confidence intervals for the \texttt{GPS} values with bootstrap analysis. The data and the number of tree components \texttt{K} have to be specified. The confidence intervals reflect the variability of the estimated \texttt{GPS} values. The results are again given as an \texttt{RtreemixGPS} object. The usage of the functions \texttt{gps} and \texttt{confIntGPS} is demonstrated in the PDF file \textit{ExtendedVignette}. \section{Stability analysis with the \texttt{Rtreemix} package} The \texttt{Rtreemix} package implements functions that can be used for assessing the quality and analyzing the stability of different attributes of the mutagenetic trees mixture model (the probability distributions induced by the model, the number of tree components, the topologies of the tree components, the GPS values and so on). This is usually done by first choosing a model attribute of interest and its appropriate similarity measure, and then inspecting its stability in a simulation setting. In the simulation study typically attributes from simulated true mixture models are compared with the corresponding attributes from models fitted to observations drawn from these true models. The similarity between the true and the fitted model is then compared to the similarities between the true and a sufficient number of random models sampled uniformly from the mixture model space. The quality of a fitted attribute is then assessed by estimating a p-value as the percentage of cases in which the true model is closer to a random model than to the fitted model with respect to the chosen model attribute. More details about performing stability analysis of the mutagenetic tree mixture modes and interpreting the results can be found in \cite{Bogo2008}. Simple code examples for the stability analysis of a mutagenetic trees mixture models are given in the \textit{doc} folder of the package in the PDF file \textit{ExtendedVignette}. \bibliography{Rtreemix} \bibliographystyle{unsrt} \end{document}