%\VignetteIndexEntry{The genomic STate ANnotation package} %\VignetteDepends{} %\VignetteKeywords{Genome Annotation, Hidden Markov Model} %\VignettePackage{STAN} % name of package %\VignetteEngine{knitr::knitr} \documentclass{article} \usepackage{subfig} %cp STAN.Rnw ../STAN.Rnw %R CMD Sweave --engine=knitr::knitr --pdf STAN.Rnw <>= BiocStyle::latex() @ <>= library("knitr") opts_chunk$set(tidy=FALSE,dev="png",fig.show="hide", fig.width=4,fig.height=4.5, message=FALSE, highlight=FALSE) @ \author{Benedikt Zacher$^{1,2,*}$, Julia Ertl$^{1}$, Julien Gagneur$^{1}$, Achim Tresch$^{1,2,3}$ \\[1em] \small{$^{1}$ Gene Center and Department of Biochemistry, LMU, Munich, Germany} \\ \small{$^{2}$ Institute for Genetics, University of Cologne, Cologne, Germany} \\ \small{$^{3}$ Max Planck Institute for Plant Breeding Research, Cologne, Germany} \\ \\ \small{\texttt{$^*$zacher (at) genzentrum.lmu.de}}} \title{The genomic STate ANnotation package} \begin{document} \maketitle \vspace{1cm} \begin{center} \begin{tabular}{ | l | } \hline \\ If you use \Biocpkg{STAN} in published research, please cite: \\ \\ Zacher, B. and Lidschreiber, M. and Cramer, P. and Gagneur, J. and Tresch, A. (2014): \\ \textbf{Annotation of genomics data using bidirectional hidden Markov models unveils} \\ \textbf{variations in Pol II transcription cycle} \emph{Mol. Syst. Biol.} \textbf{10}:768 \\ \\ \hline \end{tabular} \end{center} \tableofcontents \vspace{1cm} \newpage \section{Quick start} \vspace{0.25cm} <>= ## Loading library and data library(STAN) data(trainRegions) data(pilot.hg19) ## Model initialization hmm_nb = initHMM(trainRegions[1:3], nStates=10, "NegativeBinomial") ## Model fitting hmm_fitted_nb = fitHMM(trainRegions[1:3], hmm_nb, maxIters=10) ## Calculate state path viterbi_nb = getViterbi(hmm_fitted_nb, trainRegions[1:3]) ## Convert state path to GRanges object viterbi_nb_gm12878 = viterbi2GRanges(viterbi_nb, pilot.hg19, 200) @ \vspace{0.25cm} \section{Introduction} Genome segmentation with hidden Markov models has become a useful tool to annotate genomic elements, such as promoters and enhancers by data integration. \Biocpkg{STAN} (genomic \textbf{ST}ate \textbf{AN}notation) implements (bidirectional) Hidden Markov Models (HMMs) using a variety of different probability distributions, which can be used to model a wide range of current genomic data: \begin{itemize} \item Multivariate gaussian: e.g. conutinuous microarray and transformed sequencing data. \item Poisson: e.g. discrete count data from sequencing experiments. \item (Zero-Inflated) Negative Binomial: e.g. discrete count data from sequencing experiments. \item Poisson Log-Normal: e.g. discrete count data from sequencing experiments. \item Negative Multinomial: e.g. discrete count data from sequencing experiments. \item Multinomial: e.g. methylation rates from bisulfite sequencing (in this case it reduces to a Binomial) or binned nuceotide frequencies. \item Bernoulli: Initially proposed by \cite{ernst2010} to model presence/absence of chromatin marks (another example: transcription factor binding). \item Nucleotide distribution: e.g. nucleotide frequencies in the DNA sequence. \end{itemize} The use of these distributions enables integrating a wide range of genomic data types (e.g. continuous, discrete, binary) to \textit{de novo} learn and annotate the genome into a given number of 'genomic states'. The 'genomic states' may for instance reflect distinct genome-associated protein complexes or describe recurring patterns of chromatin features, i.e. the 'chromatin state'. Unlike other tools, \Biocpkg{STAN} also allows for the integration of strand-specific (e.g. RNA) and non-strand-specific data (e.g. ChIP) \cite{zacher2014}. In this vignette, we illustrate the use of \Biocpkg{STAN} by inferring 'chromatin states' from a small example data set of two Roadmap Epigenomics cell lines. Moreover, we show how to use strand-specific RNA expression with non-strand-specific ChIP data to infer 'transcription states' in yeast. Before getting started the package needs to be loaded: \vspace{0.25cm} <>= library(STAN) @ \vspace{0.25cm} \section{Genomic state annotation of Roadmap Epigenomics Sequencing data} The data (or observations) provided to \Biocpkg{STAN} may consist of one or more observation sequences (e.g. chromosomes), which are contained in a list of (position x experiment) matrices. \Robject{trainRegions} is a list containing one three ENCODE pilot regions (stored in \Robject{pilot.hg19} as \Rclass{GRanges} object) with data for two cell types (K562: E123, Gm12878: E116) from the Roadmap Epigenomics project. The data set contains ChIP-Seq experiments of seven histone modifications (H3K4me1, H3K4me3, H3K36me3, H3K27me3, H3K27ac and H3K9ac), as well as DNase-Seq and genomic input. \vspace{0.25cm} <>= data(trainRegions) names(trainRegions) str(trainRegions[c(1,4)]) @ \vspace{0.25cm} The genomic regions for each cell type in \Robject{trainRegions} are stored as a GRanges object in \Robject{pilot.hg19}: \vspace{0.25cm} <>= data(pilot.hg19) pilot.hg19 @ \vspace{0.25cm} Before model fitting, we calculate size factors to correct for the different different sequencing depths between cell lines. \vspace{0.25cm} <>= celltypes = list("E123"=grep("E123", names(trainRegions)), "E116"=grep("E116", names(trainRegions))) sizeFactors = getSizeFactors(trainRegions, celltypes) sizeFactors @ \vspace{0.25cm} Genome segmentation is carried out in \Biocpkg{STAN} using three functions: \Rfunction{initHMM}, \Rfunction{fitHMM} and \Rfunction{getViterbi}. \Rfunction{initHMM} initializes a model with \Robject{nStates} states for a given probability/emission distribution, which we set to 'PoissonLogNormal' in this example. \Rfunction{fitHMM} then optimizes model parameters using the Expectation-Maximization algorithm. Model parameters can be accessed with the \Rfunction{EmissionParams} function. Note that in this example, we set the maximal number of iteration to 10 in this case for speed reason. To ensure convergence this number should be higher in real world applications. After HMM fitting, the state annotation is calculated using the \Rfunction{getViterbi} function. \vspace{0.25cm} <>= nStates = 10 hmm_poilog = initHMM(trainRegions, nStates, "PoissonLogNormal", sizeFactors) hmm_fitted_poilog = fitHMM(trainRegions, hmm_poilog, sizeFactors=sizeFactors, maxIters=10) viterbi_poilog = getViterbi(hmm_fitted_poilog, trainRegions, sizeFactors) str(viterbi_poilog) @ \vspace{0.25cm} In order to ease the use of other genomic applications and Bioconductor packages, the viterbi path can be converted into a \Rclass{GRanges} object. \vspace{0.25cm} <>= viterbi_poilog_gm12878 = viterbi2GRanges(viterbi_poilog[1:3], regions=pilot.hg19, binSize=200) viterbi_poilog_gm12878 @ \vspace{0.25cm} Before giving some more details about further analysis and visualization of the models we repeat above segementations using the 'NegativeBinomial' emission functions. \vspace{0.25cm} <>= hmm_nb = initHMM(trainRegions, nStates, "NegativeBinomial", sizeFactors) hmm_fitted_nb = fitHMM(trainRegions, hmm_nb, sizeFactors=sizeFactors, maxIters=10) viterbi_nb = getViterbi(hmm_fitted_nb, trainRegions, sizeFactors=sizeFactors) viterbi_nb_gm12878 = viterbi2GRanges(viterbi_nb[1:3], pilot.hg19, 200) @ \vspace{0.25cm} In order to assign biologically meaningful roles to the inferred states we calculate the mean number of reads per 200 base pair bin for both segmentations. \vspace{0.25cm} <>= avg_cov_nb = getAvgSignal(viterbi_nb, trainRegions) avg_cov_poilog = getAvgSignal(viterbi_poilog, trainRegions) @ \vspace{0.25cm} <>= library(gplots) heat = c("dark blue", "dodgerblue4", "darkred", "red", "orange", "gold", "yellow") colfct = colorRampPalette(heat) colpal_statemeans = colfct(200) ord_nb = order(apply(avg_cov_nb,1,max), decreasing=TRUE) statecols_nb = rainbow(nStates) names(statecols_nb) = ord_nb pdf("nb_avg_cov.pdf") heatmap.2(log(avg_cov_nb+1)[as.character(ord_nb),], margins=c(8,7),srtCol=45, RowSideColors=statecols_nb[as.character(ord_nb)], dendrogram="none", Rowv=FALSE, Colv=FALSE, col=colpal_statemeans, trace="none", cellnote=round(avg_cov_nb,1)[as.character(ord_nb),], notecol="black") dev.off() ord_poilog = order(apply(avg_cov_poilog,1,max), decreasing=TRUE) statecols_poilog = rainbow(nStates) names(statecols_poilog) = ord_poilog pdf("poilog_avg_cov.pdf") heatmap.2(log(avg_cov_poilog+1)[ord_poilog,], RowSideColors=statecols_poilog[as.character(ord_poilog)], margins=c(8,7),srtCol=45, dendrogram="none", Rowv=FALSE, Colv=FALSE, col=colpal_statemeans, trace="none", cellnote=round(avg_cov_poilog,1)[ord_poilog,], notecol="black") dev.off() @ These are then plotted using the \Rfunction{heatmap.2} function (see Figure \ref{fig:mean1}). \vspace{0.25cm} <>= ## specify color palette library(gplots) heat = c("dark blue", "dodgerblue4", "darkred", "red", "orange", "gold", "yellow") colfct = colorRampPalette(heat) colpal_statemeans = colfct(200) ## define state order and colors ord_nb = order(apply(avg_cov_nb,1,max), decreasing=TRUE) statecols_nb = rainbow(nStates) names(statecols_nb) = ord_nb heatmap.2(log(avg_cov_nb+1)[as.character(ord_nb),], margins=c(8,7), srtCol=45, RowSideColors=statecols_nb[as.character(ord_nb)], dendrogram="none", Rowv=FALSE, Colv=FALSE, col=colpal_statemeans, trace="none", cellnote=round(avg_cov_nb,1)[as.character(ord_nb),], notecol="black") ## define state order and colors ord_poilog = order(apply(avg_cov_poilog,1,max), decreasing=TRUE) statecols_poilog = rainbow(nStates) names(statecols_poilog) = ord_poilog heatmap.2(log(avg_cov_poilog+1)[as.character(ord_poilog),], margins=c(8,7), srtCol=45, RowSideColors=statecols_poilog[as.character(ord_poilog)], dendrogram="none", Rowv=FALSE, Colv=FALSE, col=colpal_statemeans, trace="none", cellnote=round(avg_cov_poilog,1)[as.character(ord_poilog),], notecol="black") @ \vspace{0.25cm} \begin{figure}[!htp] \centering \subfloat[]{{\includegraphics[width=8cm]{nb_avg_cov.pdf} }}% \qquad \subfloat[]{{\includegraphics[width=8cm]{poilog_avg_cov.pdf} }}% \caption{Mean read counts of the (a) 'NegativeBinomial' and (b) 'PoissonLogNormal' state annotation}% \label{fig:mean1}% \end{figure} In order to visualize both \Biocpkg{STAN} segmentations, we convert the viterbi paths and the data to \Biocpkg{Gviz} objects. \vspace{0.25cm} <>= library(Gviz) from = start(pilot.hg19)[3] to = from+300000 gvizViterbi_nb = viterbi2Gviz(viterbi_nb_gm12878, "chr11", "hg19", from, to, statecols_nb) gvizViterbi_poilog = viterbi2Gviz(viterbi_poilog_gm12878, "chr11", "hg19", from, to, statecols_poilog) gvizData = data2Gviz(trainRegions[3], pilot.hg19[3], 200, "hg19", col="black") @ \vspace{0.25cm} Then, we use the \Rfunction{plotTracks} function to plot everything (see Figure \ref{fig:gviz1}). \vspace{0.25cm} <>= gaxis = GenomeAxisTrack() data(ucscGenes) mySize = c(1,rep(1.2,9), 0.5,0.5,3) plotTracks(c(list(gaxis), gvizData,gvizViterbi_nb,gvizViterbi_poilog,ucscGenes["chr11"]), from=from, to=to, showFeatureId=FALSE, featureAnnotation="id", fontcolor.feature="black", cex.feature=0.7, background.title="darkgrey", lwd=2, sizes=mySize) @ \vspace{0.25cm} <>= gaxis = GenomeAxisTrack() data(ucscGenes) mySize = c(1,rep(1.2,9), 0.5,0.5,3) pdf("gviz_example.pdf", width=7*1.5) plotTracks(c(list(gaxis), gvizData,gvizViterbi_nb,gvizViterbi_poilog,ucscGenes["chr11"]), from=from, to=to, showFeatureId=FALSE, featureAnnotation="id", fontcolor.feature="black", cex.feature=0.7, background.title="darkgrey", lwd=2, sizes=mySize)#, stacking="dense")#, ylim=c(0,100)) dev.off() @ \begin{figure}[!htp] \centering \includegraphics[width=18cm]{gviz_example.pdf} \caption{Genome Browser showing the 10 data tracks used for model learning together with the 'Negativebinomial' (top) and 'PoissonLogNormal' (bottom) segmentations and known UCSC gene annotations.}% \label{fig:gviz1}% \end{figure} \subsection*{Modeling Sequencing data using other emission functions} In this section we illustrate the use of other distributions to annotate the the Roadmap Epigenomics example data set, namely the 'Poisson', 'NegativeMultinomial', 'Gaussian' and 'Bernoulli' models. The 'Poisson' model is an obvious choice when dealing with count data. However since the variance of the Poisson is equal to its mean it might not be an ideal choice for modeling Sequencing experiments, which have been shown to be overdispersed \cite{anders2010}. \vspace{0.25cm} <>= hmm_pois = initHMM(trainRegions, nStates, "Poisson") hmm_fitted_pois = fitHMM(trainRegions, hmm_pois, maxIters=10) viterbi_pois = getViterbi(hmm_fitted_pois, trainRegions) @ \vspace{0.25cm} The 'NegativeMultinomial' distribution for genome segmentation with HMMs was first proposed in the EpicSeg model \cite{mammana2015}. The Negative Multinomial can be understood as a Multinomial distribution, where its overdispersion of is modeled by a Negative Binomial distribution. However, this assumes a shared overdispersion across data tracks within a state as opposed to the 'NegativeBinomial' and 'PoissonLogNormal' models which model the variance for each state and data track separately. In order to use the 'NegativeMultinomial' in \Biocpkg{STAN} an additional data track - the sum of counts - for each bin needs to be added to the data. Internally the 'NegativeMultinomial' is modeled as a product of a 'NegativeBinomial' and a 'Multinomial' emission (see section 'Combining different emission functions' for further details): \vspace{0.25cm} <>= simData_nmn = lapply(trainRegions, function(x) cbind(apply(x,1,sum), x)) hmm_nmn = initHMM(simData_nmn, nStates, "NegativeMultinomial") hmm_fitted_nmn = fitHMM(simData_nmn, hmm_nmn, maxIters=10) viterbi_nmn = getViterbi(hmm_fitted_nmn, simData_nmn) @ \vspace{0.25cm} In order to model the data using Gaussian distributions, it needs to be log-transformed and smoothed. This approach is implementd in Segway, a method used by the ENCODE Consortium for chromatin state annotation \cite{hoffman2012}. However, to overcome singularity of the (diagonal) covariance matrix due to the zero-inflated distribution of the transformed read counts, it uses a shared variance over states for each data track. To use gaussian distributions with Sequencing data in \Biocpkg{STAN}, we transform the data (with the hyperbole sine function \cite{hoffman2012}) and model it using the emission 'IndependentGaussian' with a shared covariance, i.e. \Robject{sharedCov=TRUE}. \vspace{0.25cm} <>= trainRegions_smooth = lapply(trainRegions, function(x) apply(log(x+sqrt(x^2+1)), 2, runningMean, 2)) hmm_gauss = initHMM(trainRegions_smooth, nStates, "IndependentGaussian", sharedCov=TRUE) hmm_fitted_gauss = fitHMM(trainRegions_smooth, hmm_gauss, maxIters=10) viterbi_gauss = getViterbi(hmm_fitted_gauss, trainRegions_smooth) @ \vspace{0.25cm} Another approach was proposed in ChromHMM, which models binarized data using an independent Bernoulli model \cite{ernst2010}. Note, that the performance of the model highly depends on the non-trivial choice of a proper cutoff and quantitative information is lost. The latter is especially important when predicting promoters and enhancers since these elements are both marked H3K4me1 and H3K4me3, but at different ratios. The function \Rfunction{binarizeData} binarizes the data using the default approach by ChromHMM \cite{ernst2010}. The model can then be fit by specifying the 'Bernoulli' model. Note however, that initialization and model fitting are carried out differently than in the ChromHMM implementation. In particular \Biocpkg{STAN} uses the EM algorithm while ChromHMM uses online EM. For details on the initialization, please see the \Rfunction{initHMM} manual. \vspace{0.25cm} <>= trainRegions_binary = binarizeData(trainRegions) hmm_ber = initHMM(trainRegions_binary, nStates, "Bernoulli") hmm_fitted_ber = fitHMM(trainRegions_binary, hmm_ber, maxIters=10) viterbi_ber = getViterbi(hmm_fitted_ber, trainRegions_binary) @ \vspace{0.25cm} We calculate the mean read coverage for each method and segmentation: \vspace{0.25cm} <>= avg_cov_gauss = getAvgSignal(viterbi_gauss, trainRegions) avg_cov_nmn = getAvgSignal(viterbi_nmn, trainRegions) avg_cov_ber = getAvgSignal(viterbi_ber, trainRegions) avg_cov_pois = getAvgSignal(viterbi_pois, trainRegions) @ \vspace{0.25cm} These are again plotted using the \Rfunction{heatmap.2} function (see Figure \ref{fig:mean_counts_other}). \vspace{0.25cm} <>= heatmap.2(log(avg_cov_gauss+1), margins=c(8,7),srtCol=45, dendrogram="row", Rowv=TRUE, Colv=FALSE, col=colpal_statemeans, trace="none", notecex=0.7, cexRow=0.75, cexCol=1, cellnote=round(avg_cov_gauss,1), notecol="black") heatmap.2(log(avg_cov_nmn+1), margins=c(8,7),srtCol=45, dendrogram="row", Rowv=TRUE, Colv=FALSE, col=colpal_statemeans, trace="none", notecex=0.7, cexRow=0.75, cexCol=1, cellnote=round(avg_cov_nmn,1), notecol="black") heatmap.2(log(avg_cov_ber+1), margins=c(8,7),srtCol=45, dendrogram="row", Rowv=TRUE, Colv=FALSE, col=colpal_statemeans, trace="none", notecex=0.7, cexRow=0.75, cexCol=1, cellnote=round(avg_cov_ber,1), notecol="black") heatmap.2(log(avg_cov_pois+1), margins=c(8,7),srtCol=45, dendrogram="row", Rowv=TRUE, Colv=FALSE, col=colpal_statemeans, trace="none", notecex=0.7, cexRow=0.75, cexCol=1, cellnote=round(avg_cov_pois,1), notecol="black") @ \vspace{0.25cm} <>= pdf("avg_cov_gauss.pdf") heatmap.2(log(avg_cov_gauss+1), margins=c(8,7),srtCol=45, dendrogram="row", Rowv=TRUE, Colv=FALSE, col=colpal_statemeans, trace="none", notecex=0.7, cexRow=0.75, cexCol=1, cellnote=round(avg_cov_gauss,1), notecol="black") dev.off() pdf("avg_cov_nmn.pdf") heatmap.2(log(avg_cov_nmn+1), margins=c(8,7),srtCol=45, dendrogram="row", Rowv=TRUE, Colv=FALSE, col=colpal_statemeans, trace="none", notecex=0.7, cexRow=0.75, cexCol=1, cellnote=round(avg_cov_nmn,1), notecol="black") dev.off() pdf("avg_cov_ber.pdf") heatmap.2(log(avg_cov_ber+1), margins=c(8,7),srtCol=45, dendrogram="row", Rowv=TRUE, Colv=FALSE, col=colpal_statemeans, trace="none", notecex=0.7, cexRow=0.75, cexCol=1, cellnote=round(avg_cov_ber,1), notecol="black") dev.off() pdf("avg_cov_pois.pdf") heatmap.2(log(avg_cov_pois+1), margins=c(8,7),srtCol=45, dendrogram="row", Rowv=TRUE, Colv=FALSE, col=colpal_statemeans, trace="none", notecex=0.7, cexRow=0.75, cexCol=1, cellnote=round(avg_cov_pois,1), notecol="black") dev.off() @ \begin{figure}[!htp] \centering \subfloat[]{{\includegraphics[width=8cm]{avg_cov_gauss.pdf} }}% \qquad \subfloat[]{{\includegraphics[width=8cm]{avg_cov_nmn.pdf} }} \\% \caption{2 Figures side by side}% \subfloat[]{{\includegraphics[width=8cm]{avg_cov_ber.pdf} }}% \qquad \subfloat[]{{\includegraphics[width=8cm]{avg_cov_pois.pdf} }} \caption{Mean read counts of the (a) 'IndependentGaussian' (b) 'NegativeMultinomial' (c) 'Bernoulli' and (d) 'Poisson' state annotation.} \label{fig:mean_counts_other}% \end{figure} \section{Integrating strand-specific and non-strand-specific data with STAN} \Biocpkg{STAN} also allows for the integration of strand-specific (e.g. RNA) and non-strand-specific data (e.g. ChIP). This is donw using bidirectional hidden Markov models (bdHMMs) which were proposed in \cite{zacher2014}. A bdHMM models a directed process using the concept of twin states, where each genomic state is split up into a pair of twin states, one for each direction (e.g. sense and antisense in context of transcription). Those twin state pairs are identical in terms of their emissions (i.e. they model the same genomic state). Currently the following models are available for bdHMMs: 'IndependentGaussian', 'Gaussian', 'NegativeBinomial', 'ZINegativeBinomial' and 'PoissonLogNormal'. We now illustrate the use of bdHMMs in \Biocpkg{STAN} at an example data set of yeast transcription factors measured by ChIP-chip and RNA expression measured with a tiling array which was used to model the transcription cycle as a sequence of 'transcription states' in \cite{zacher2014}. \\ The \Rfunction{initBdHMM} function is used to initialize a bdHMM with 6 twin states. Note that the overall number of states in the bdHMM is 12 (6 identical twin state pairs). \Robject{dirobs} defines the directionality (or strand-specificity) of the data tracks. In \Robject{dirobs}, the first 10 data tracks are non-strand-specific ChIP-chip measurments, indicated by '0' and data track 11 and 12 are strand-specific RNA expression measurements, indicated by '1'. Note that strand-specific data tracks must be labeled as increasing pairs of integers. Thus and additional strand-specific data track pair would be labeled as a pair of '2'. Model fitting and calculation of the state annotation are carried out as for standard HMMs: \vspace{0.25cm} <>= data(yeastTF_databychrom_ex) nStates = 6 dirobs = as.integer(c(rep(0,10), 1, 1)) bdhmm_gauss = initBdHMM(yeastTF_databychrom_ex, nStates, "Gaussian", directedObs=dirobs) bdhmm_fitted_gauss = fitHMM(yeastTF_databychrom_ex, bdhmm_gauss) viterbi_bdhmm_gauss = getViterbi(bdhmm_fitted_gauss, yeastTF_databychrom_ex) @ \vspace{0.25cm} We plot the means of the multivariate gaussian distrbutions for each state (see Figure \ref{fig:means_yeast}): \vspace{0.25cm} <>= statecols_yeast = rep(rainbow(nStates), 2) names(statecols_yeast) = StateNames(bdhmm_fitted_gauss) means_fitted = EmissionParams(bdhmm_fitted_gauss)$mu heatmap.2(means_fitted, col=colpal_statemeans, RowSideColors=statecols_yeast[rownames(means_fitted)], trace="none", cexCol=0.9, cexRow=0.9, cellnote=round(means_fitted,1), notecol="black", dendrogram="row", Rowv=TRUE, Colv=FALSE, notecex=0.9) @ \vspace{0.25cm} \begin{figure}[!htp] \centering \includegraphics[width=10cm]{yeast_means_gauss.pdf} \caption{Mean signal 6 bdHMM twin state pairs. 'F' and 'R' indicate forward and reverse direction of state pairs.}% \label{fig:means_yeast}% \end{figure} <>= pdf("yeast_means_gauss.pdf") statecols_yeast = rep(rainbow(nStates), 2) names(statecols_yeast) = StateNames(bdhmm_fitted_gauss) means_fitted = EmissionParams(bdhmm_fitted_gauss)$mu heatmap.2(means_fitted, col=colpal_statemeans, RowSideColors=statecols_yeast[rownames(means_fitted)], trace="none", cexCol=0.9, cexRow=0.9, cellnote=round(means_fitted,1), notecol="black", dendrogram="row", Rowv=TRUE, Colv=FALSE, notecex=0.9) dev.off() @ \vspace{0.25cm} We convert the viterbi path into a \Robject{GRanges} object. Note that the directionaliy of bdHMM states is indicated by 'F' (forward) and 'R' (reverse). \vspace{0.25cm} <>= yeastGRanges = GRanges(IRanges(start=1214616, end=1225008), seqnames="chrIV") names(viterbi_bdhmm_gauss) = "chrIV" viterbi_bdhmm_gauss_gr = viterbi2GRanges(viterbi_bdhmm_gauss, yeastGRanges, 8) viterbi_bdhmm_gauss_gr @ \vspace{0.25cm} Next, we visualize the data, state annotation and together with SGD genes using \Biocpkg{Gviz} (see Figure \ref{fig:gviz_yeast}): \vspace{0.25cm} <>= chr = "chrIV" gen = "sacCer3" gtrack <- GenomeAxisTrack() from=1217060 to=1225000 forward_segments = grep("F", viterbi_bdhmm_gauss_gr$name) reverse_segments = grep("R", viterbi_bdhmm_gauss_gr$name) gvizViterbi_yeast = viterbi2Gviz(viterbi_bdhmm_gauss_gr[forward_segments], "chrIV", "sacCer3", from, to, statecols_yeast) gvizViterbi_yeast2 = viterbi2Gviz(viterbi_bdhmm_gauss_gr[reverse_segments], "chrIV", "sacCer3", from, to, statecols_yeast) gvizData_yeast = data2Gviz(yeastTF_databychrom_ex, yeastGRanges, 8, "sacCer3", col="black") gaxis = GenomeAxisTrack() data(yeastTF_SGDGenes) mySize = c(1,rep(1,12), 0.5,0.5,3) plotTracks(c(list(gaxis), gvizData_yeast,gvizViterbi_yeast,gvizViterbi_yeast2, list(yeastTF_SGDGenes)), cex.feature=0.7, background.title="darkgrey", lwd=2, sizes=mySize, from=from, to=to, showFeatureId=FALSE, featureAnnotation="id", fontcolor.feature="black", cex.feature=0.7, background.title="darkgrey", showId=TRUE) @ \vspace{0.25cm} \vspace{0.25cm} <>= yeastGRanges = GRanges(IRanges(start=1214616, end=1225008), seqnames="chrIV") names(viterbi_bdhmm_gauss) = "chrIV" viterbi_bdhmm_gauss_gr = viterbi2GRanges(viterbi_bdhmm_gauss, yeastGRanges, 8) chr = "chrIV" gen = "sacCer3" gtrack <- GenomeAxisTrack() from=1217060 to=1225000 forward_segments = grep("F", viterbi_bdhmm_gauss_gr$name) reverse_segments = grep("R", viterbi_bdhmm_gauss_gr$name) gvizViterbi_yeast = viterbi2Gviz(viterbi_bdhmm_gauss_gr[forward_segments], "chrIV", "sacCer3", from, to, statecols_yeast) gvizViterbi_yeast2 = viterbi2Gviz(viterbi_bdhmm_gauss_gr[reverse_segments], "chrIV", "sacCer3", from, to, statecols_yeast) gvizData_yeast = data2Gviz(yeastTF_databychrom_ex, yeastGRanges, 8, "sacCer3", col="black") gaxis = GenomeAxisTrack() data(yeastTF_SGDGenes) mySize = c(1,rep(1,12), 0.5,0.5,3) pdf("gviz_example_yeast.pdf", width=7*1.5) plotTracks(c(list(gaxis), gvizData_yeast,gvizViterbi_yeast,gvizViterbi_yeast2,list(yeastTF_SGDGenes)), cex.feature=0.7, background.title="darkgrey", lwd=2, sizes=mySize, from=from, to=to, showFeatureId=FALSE, featureAnnotation="id", fontcolor.feature="black", cex.feature=0.7, background.title="darkgrey", showId=TRUE)#, stacking="dense")#, ylim=c(0,100)) dev.off() @ \vspace{0.25cm} \begin{figure}[!htp] \centering \includegraphics[width=18cm]{gviz_example_yeast.pdf} \caption{Genome Browser showing the 12 data tracks used for model learning together with the segmentations and known SGD gene annotations.}% \label{fig:gviz_yeast}% \end{figure} %\subsection*{Collapsing two directed twin states into one undirected states} %There are two states in the yeast bdHMM that seem to have low directional %information, since they frequently switch between forward and reverse direction %(Figure \ref{fig:gviz_yeast}): The promoter and low signal state. It is % possible to collapse these twin states into one undirected state and ... %\section{Combining different emission functions} %TODO \section{Concluding Remarks} This vignette was generated using the following package versions: <>= toLatex(sessionInfo()) @ \newpage \bibliography{refs} \end{document}