%\VignetteIndexEntry{An Introduction to the BPR method}
%\VignetteKeywords{BPR, methylation, gene expression}
%\VignettePackage{BPRMeth}
%\VignetteEngine{knitr::knitr}

\documentclass{article}

<<style-knitr, eval=TRUE, echo=FALSE, results="asis">>=
BiocStyle::latex()
@

\bioctitle[BPRMeth: Methylation features for clustering and prediction]{BPRMeth: Higher order methylation features for clustering and prediction in epigenomic studies}
\author{Chantriolnt-Andreas Kapourani\footnote{C.A.Kapourani@ed.ac.uk or kapouranis.andreas@gmail.com}}
\date{Modified: 1 August, 2016. Compiled: \today}

\begin{document}

\maketitle

\tableofcontents


%%
% Introduction section
%%
\section{Introduction}
DNA methylation is an intensely studied epigenetic mark, yet its functional role is incompletely understood. Attempts to quantitatively associate average DNA methylation to gene expression yield poor correlations outside of the well-understood methylation-switch at CpG islands.

Here we use probabilistic machine learning to extract higher order features associated with the methylation profile across a defined region. These features quantitate precisely notions of shape of a methylation profile, capturing spatial correlations in DNA methylation across genomic regions.  Using these higher order features across promoter-proximal regions, we are able to construct a powerful machine learning predictor of gene expression.


%%
% Background section
%%
\section{Background}
DNA methylation data produced by High-Throughput Sequencing (HTS) technology can be modelled with a Binomial distribution:
\begin{equation}
  m \sim \mathcal{B}inom(t, p)
\end{equation}

In practical studies we are interested in learning the methylation patterns of genomic regions which can be represented by an observation vector $\mathbf{y}$. Let $f(x) = \Phi \big(g(x)\big)$ be a latent function representing the methylation profiles and $g(x)$ be of the form:
\begin{equation}\label{eq-basis}
	 g(x) = \sum\limits_{j=0}^{M-1} w_{j} h_{j}(x)
\end{equation}
where $h_{j}(\cdot)$ can be any basis function, e.g. Radial Basis Function (RBF), and $w_{j}$ are the coefficients for each basis.

Given $f(x)$ the observations $y_{l}$ for each CpG site are i.i.d. Binomial variables, so we can define the joint log-likelihood in factorised form:
\begin{equation} \label{eq:bpr-likelihood}
 \log p(\mathbf{y} | f) = \sum\limits_{l = 1}^{L} \log \bigg( \mathcal{B}inom\big(m_{l} | t_{l}, \Phi(g(x_{l}))\big) \bigg)
\end{equation}
We refer to this observation model as the Binomial Probit Regression (BPR) likelihood function. \emph{Figure \ref{fig:model-performance}} shows the process of learning the methylation profile for a specific promoter region using the BPR model. For a more detailed explanation of the statistical method, see \cite{Kapourani2016}.

\begin{figure}[!ht]
\centerline{\includegraphics[width=0.66\textwidth]{model-bpr}}
\caption{\small{Illustration of the process for learning methylation profiles using the BPR model. The inputs to the model are the observed methylation levels of the CpGs across the promoter region, plus the number, with their corresponding centres, of the Radial Basis Functions (RBFs), which in this example are chosen to be four. Using this information the BPR model will learn the optimal coefficients for each RBF using maximum likelihood. Finally, we obtain the underlying methylation profile by a linear combination of the fitted RBFs. One should note that the higher the number of RBFs the better the resolution for the methylation profile.}}
\label{fig:model-performance}
\end{figure}


%%
% Analysis Pipeline section
%%
\section{Analysis Pipeline}

%%
% Sample data section
%%
\subsection{Sample data}
To illustrate the functions of the \verb|BPRMeth| package we will use real datasets that are publicly available from the ENCODE project consortium \cite{Dunham2012}. More specifically we will focus on the K562 immortalized cell line, with GEO: GSE27584 for the RRBS data and GEO: GSE33480 for the RNA-Seq data. We will use directly the preprocessed files, however, we should note that we have converted the RNA-Seq data from \verb|.gtf| to \verb|.bed| format using the \verb|bedops| tool (\href{http://bedops.readthedocs.io} {http://bedops.readthedocs.io}). We have kept only the protein coding genes, and for the purpose of this vignette we focus only on \verb|chr12| and \verb|chr13|. Full details of where to download the data and how to preprocess them are included in the Supplementary Material of \cite{Kapourani2016}.


%%
% Read HTS data section
%%
\subsection{Import and read HTS files}
Due to its general approach, the BPR method can be applied both in RRBS and WGBS methylation datasets, provided that we have information about the methylated and unmethylated reads in each CpG location.

The \verb|BPRMeth| package provides methods for reading files generated from HTS experiments with specific formats. For the formats provided by the ENCODE datasets described above; we have implemented \Rfunction{process\_haib\_caltech\_wrap}, which is a wrapper function for performing the preprocessing and obtaining the final objects for downstream analysis. The user can implement his own methods for reading files with different formats, provided that he can create an object similar to what is described below. First, we load and attach the package and then we obtain the paths for the sample RRBS and RNA-Seq files with the following commands:
<<eval=TRUE, echo=TRUE, message=FALSE, results="asis">>=
library(BPRMeth)
rrbs_file <- system.file("extdata", "rrbs.bed", package = "BPRMeth")
rnaseq_file <- system.file("extdata", "rnaseq.bed", package = "BPRMeth")
@

Then, we process both files to obtain a \verb|processHTS| object which will be used for downstream analysis:
<<eval=TRUE, echo=TRUE, message=FALSE, results="asis">>=
# Preprocess both RRBS and RNA-Seq files
HTS_data <- process_haib_caltech_wrap(rrbs_file, rnaseq_file)
@

Among other information, the \verb|processHTS| object contains the following important slots:
\begin{enumerate}
\item{A \Rfunction{list} where each entry corresponds to a different promoter methylation region, accessible with \textbf{methyl\_region}. \\ More specifically, each methylation promoter region is an $L_{i} \times 3$ dimensional matrix, where $L_{i}$ denotes the number of CpGs found in region $i$. The columns contain the following information:
\begin{itemize}
\item{ 1st column: Contains the locations of CpGs relative to TSS. Note that the actual locations are scaled to the (-1, 1) region. }
\item{ 2nd column: Contains the total reads of each CpG in the corresponding location.}
\item{ 3rd column: Contains the methylated reads each CpG in the corresponding location.}
\end{itemize} }
\item{A \Rfunction{vector} with the corresponding log2 transformed gene expression levels for each promoter region, accessible with \textbf{gex}.}
\item{A \Biocpkg{GRanges} object storing the RNA-Seq data together with annotation information for each promoter region, accessible with \textbf{rna\_data}.}
\end{enumerate}

Now, we give examples on how to access these information on the sample data described in the previous section. Initially, we can access the $16th$ promoter methylation region as follows:
<<eval=TRUE, echo=TRUE>>=
HTS_data$methyl_region[[16]]
@

Below we show the log2 transformed gene expression levels for the first 10 promoter regions:
<<eval=TRUE, echo=TRUE>>=
head(HTS_data$gex, 10)
@

Finally, the RNA-Seq data which are stored in a \Biocpkg{GRanges} object, can be accessed as follows:
<<eval=TRUE, echo=TRUE>>=
HTS_data$rna_data
@

The package provides implementation for specific parts of the preprocessing steps described above which can be seen by typing \Rfunction{process\_haib\_caltech\_wrap} on the R console. If the user has files with different formats, he can implement his own functions for reading the data and then combine them with the simple functions present in the BPRMeth package, in order to obtain an object similar to \verb|processHTS|, which should have the slots described above.


%%
% Predict expression section
%%
\subsection{Predict gene expression}
After preprocessing the HTS data, we have the following number of unique protein-coding genes belonging in chr12 and chr13:
<<eval=TRUE, echo=TRUE>>=
# Obtain the number of gene promoters
length(HTS_data$gex)
@

Learning the methylation profiles is equivalent to optimizing the model parameters $\mathbf{w}$ described in Eq. \ref{eq-basis}. These parameters can be considered as the extracted features which quantitate precisely notions of shape of a methylation profile.

\subsubsection{Create basis objects}
For each promoter region, we will learn its methylation profile using the BPR model with a specified number of Radial Basis Functions (RBFs) $M$. For a single input variable $x$, the RBF takes the form $h_{j}(x) = exp(-\gamma || x - \mu_{j} ||^2)$, where $\mu_{j}$ denotes the location of the $j^{th}$ basis function in the input space and $\gamma$ controls the spatial scale. The case when $M = 0$ is equivalent to learning the average methylation level for the given region (i.e. learn a constant function).

For our running example, we will create two RBF objects, one with 9 basis functions and the other with 0 basis functions denoting the mean methylation level approach:
<<eval=TRUE, echo=TRUE>>=
# Create basis object with 9 RBFs
basis_profile <- create_rbf_object(M = 9)

# Create basis object with 0 RBFs, i.e. constant function
basis_mean <- create_rbf_object(M = 0)
@

The \Rfunction{rbf} object contains information such as the centre locations $\mu_{j}$ and the value of the spatial scale parameter $\gamma$:
<<eval=TRUE, echo=TRUE>>=
# Show the slots of the 'rbf' object
basis_profile
@
The $\gamma$ is computed by the number of basis M, however the user can tune it according to his liking. Except from RBF basis, the \verb|BPRMeth| pacakge provides polynomial basis which can be created with the \Rfunction{create\_polynomial\_object} function.

\subsubsection{Learn methylation profiles and make predictions}
We can now optimize the BPR likelihood function and extract the features $\mathbf{w}_{i}$ for each promoter region. To quantitatively predict expression at each region, we construct a regression model by taking as input the higher-order methylation features learned from the BPR model. In addition to these features, we consider two supplementary sources of information: (1) the goodness of fit in RMSE and (2) the CpG density. For our analysis an SVM regression model is considered. We will use $70\%$ of the data for training, and we will test the model's performance on the remaining $30\%$.

All the aforementioned steps are assembled in the \Rfunction{bpr\_predict\_wrap} wrapper function, which returns a \verb|bpr_predict| object.
<<eval=TRUE, message=FALSE, echo=TRUE>>=

# Set seed for reproducible results
set.seed(1234)


# Perform predictions using methylation profiles
res_profile <- bpr_predict_wrap(x = HTS_data$methyl_region, y = HTS_data$gex,
                                basis = basis_profile, fit_feature = "RMSE",
                                cpg_dens_feat = TRUE, is_parallel = FALSE,
                                is_summary = FALSE)

# Perform predictions using mean methylation level
res_mean <- bpr_predict_wrap(x = HTS_data$methyl_region, y = HTS_data$gex,
                             basis = basis_mean, is_parallel = FALSE,
                             is_summary = FALSE)
@

We can now compare the Pearson's correlation coefficient $r$ for both models and observe that the higher-order methylation features achieve test correlations twice as large as previously reported when using average methylation levels.
<<eval=TRUE, echo=TRUE>>=
# Test errors for methylation profiles, PCC = Pearson's r
res_profile$test_errors$pcc

# Test errors for mean methylation levels
res_mean$test_errors$pcc
@

Figure \ref{figure/figureexample-1} shows an example promoter region together with the fitted methylation profiles.
<<figureexample, fig.show='hide', fig.width=7.8, fig.height=5>>=
# Choose promoter region 21 -> i.e. LEPREL2 gene
gene_name <- HTS_data$rna_data$gene_name[21]
plot_fitted_profiles(region = 21, X = HTS_data$methyl_region, fit_prof = res_profile,
                     fit_mean = res_mean, title = paste0("Gene ", gene_name))
@
\incfig{figure/figureexample-1}{0.8\textwidth}{Methylation pattern for the LEPREL2 gene over $\pm7kb$ promoter region.}
{The points represent the DNA methylation level of each CpG site. The shape of the methylation profiles is captured by the red curve, whereas the orange dashed line denotes the mean methylation level.}

Figure \ref{figure/figureexample2-1} shows a scatter plot of the predicted and measured expression values for the \verb|chr12| and \verb|chr13| of the K562 cell line.
<<figureexample2, fig.show='hide', fig.width=12.8, fig.height=5.8>>=
par(mfrow=c(1,2))
plot_scatter_gex(bpr_predict_obj = res_profile)
plot_scatter_gex(bpr_predict_obj = res_mean, main_lab = "Mean Methylation")
@
\incfig{figure/figureexample2-1}{0.98\textwidth}{Quantitative relationship between DNA methylation patterns and gene expression.}{Scatter plots of predicted versus measured (log2-transformed) gene expression values: using the BPR model and extracting higher-order features \emph{(left)}, and using the average methylation level \emph{right} as input to the SVM regression model. Each shaded blue dot represents a different gene. The red dashed line indicates the linear fit between the predicted and measured expression values.}

%%
% Cluster profiles section
%%
\subsection{Cluster methylation profiles}
Another application of the BPR model is to use the higher-order methylation features to cluster DNA methylation patterns across promoter-proximal regions and examine whether distinct methylation profiles are associated to different gene expression levels. To cluster methylation profiles, we consider a mixture modelling approach and we apply the EM algorithm to estimate the model parameters.

The \verb|BPRMeth| package provides the \Rfunction{bpr\_cluster\_wrap} function for performing the clustering process, where the user needs to provide the number of clusters $K$, the methylation regions and a basis object. Since we are interested in capturing broader similarities between profiles rather than fine details, we will model the methylation profiles at a slightly lower resolution:
<<eval=TRUE, message=FALSE, warning=FALSE, echo=TRUE>>=
# Set seed for reproducible results
set.seed(1234)
# Create basis object with 4 RBFs
basis_obj <- create_rbf_object(M = 4)
# Set number of clusters K = 5
K <- 5
# Perform clustering
res <- bpr_cluster_wrap(x = HTS_data$methyl_region, K = K, basis = basis_obj,
                        em_max_iter = 15, opt_itnmax = 30, is_parallel = FALSE)
@

Figure \ref{figure/figureexample3-1} shows the fitted methylation profiles for each cluster.
<<figureexample3, warning=FALSE, fig.show='hide', fig.width=13.5, fig.height=5>>=
par(mfrow=c(1,2))
plot_cluster_prof(bpr_cluster_obj = res)
boxplot_cluster_gex(bpr_cluster_obj = res, gex = HTS_data$gex)
@
\incfig{figure/figureexample3-1}{0.99\textwidth}{Clustering methylation profiles across promoter-proximal regions.}{\emph{(Left)} Five clustered methylation profiles over $\pm 7kb$ promoter region w.r.t. TSS in the direction of transcription. Each methylation profile is modelled using four RBFs. \emph{(Right)} Boxplots with the corresponding expression levels of the protein-coding genes assigned to each cluster. The colors match with the clustered methylation profiles shown on the left.}


%%
% SessionInfo section
%%
%\clearpage
\section{Session Info}
This vignette was compiled using:
<<>>=
sessionInfo()
@

%%
% Acknowledgements section
%%
\section{Acknowledgements}
This package was developed at the University of Edinburgh in the School of Informatics, with support from Guido
Sanguinetti.

This study was supported in part by the EPSRC Centre for Doctoral Training in Data Science, funded by the UK Engineering and Physical Sciences Research Council (grant EP/L016427/1) and the University of Edinburgh, and by the European Research Council through grant MLCS306999.

\bibliography{BPRMeth_vignette}

\end{document}