---
title: "*betaHMM* package: Quick-start guide"
author: "Koyel Majumdar, Isobel Claire Gormley, Thomas Brendan Murphy, 
Romina Silva, Antoinette Sabrina Perry, Ronald William Watson, 
Florence Jaffrézic, Andrea Rau"
output: 
  BiocStyle::html_document:
    toc_float: true
  BiocStyle::pdf_document: default
always_allow_html: true
vignette: >
  %\VignetteIndexEntry{betaHMM}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}{inputenc}
---

```{r setup, include = FALSE}
knitr::include_graphics("betaHMM_hex.png")
options(rmarkdown.html_vignette.check_title = FALSE)
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(eval = TRUE)

#knitr::opts_chunk$set(dev = 'png')
#knitr::opts_chunk$set(dpi=100)
```

# Installation
To install this package, start R (version "4.3") and enter:

```{r biocsetup, eval=FALSE}
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# The following initializes usage of Bioc devel
BiocManager::install(version='devel')

BiocManager::install("betaHMM")
```

# Introduction

DNA methylation, the addition of a methyl group to a cytosine-guanine
dinucleotide (CpG) site, is influenced by environmental factors and serves as a
disease biomarker. In diploid individuals, CpG sites are hypermethylated (both
strands methylated), hypomethylated (neither strand methylated), or
hemimethylated (one strand methylated). Identifying differentially methylated
CpG sites (DMC) and regions (DMR) can reveal the impact of environmental
stressors.

Methylation levels, called beta values, represent the proportion of methylated
probes, usually modeled with beta distributions. They are often logit
transformed into M-values for modeling. To directly model beta values and
account for spatial correlation between CpG sites, we propose a homogeneous
hidden Markov model (HMM) approach called betaHMM. It identifies DMCs and DMRs
while considering spatial dependency and sample relationships. Simulation and
prostate cancer data demonstrate its effectiveness. Through our submission of
betaHMM to Bioconductor, we intend to contribute to the open-source software
ecosystem, supporting robust and reproducible data analysis
for both established and emerging biological assays.

This document gives a quick tour of the functionalities in **betaHMM**. See
`help(package="betaHMM")` for further details and references provided by
`citation("betaHMM")`.

# Walk through

## Prerequisites
Before starting the **betaHMM** walk through, the user should have a working
R software environment installed on their machine. The **betaHMM** package has
the following dependencies which, if not already installed on the machine will
automatically be installed along with the package:
**stats, ggplot2, utils,  scales, methods, pROC, foreach, doParallel, cowplot,
dplyr, tidyr, stringr**.


Assuming that the user has the **betaHMM** package installed,
the user first needs to load the package:

```{r package, include=TRUE, echo=TRUE, message=FALSE,warning=FALSE}
library(betaHMM)
```

## Loading the data


## Loading the methylation and annotation data
The **betaHMM** software package offers a pre-processed methylation dataset
containing beta values obtained from DNA samples collected from four patients
with high-grade prostate cancer. These samples encompass both benign and tumor
prostate tissues and were subjected to methylation profiling using Infinium
MethylationEPIC Beadchip technology. The dataset comprises DNA samples from
R = 2 treatment conditions for each of N = 4 patients, with each DNA sample
providing beta values for C = 694,820 CpG sites. This data collection was
part of a study on prostate cancer methylomics (Silva et al. 2020). For
testing purposes, a subset of CpG sites from chromosome 7 has been included
in the package.

This package also provides a subset of the EPIC annotation file, which
users need to input into the **betaHMM** function. Users can load these two
datasets from the package and inspect the first six rows in the dataframes
using the following procedure:

```{r data,include=TRUE, echo=TRUE}
data(pca_methylation_data)
head(pca_methylation_data)
data(annotation_data)
head(annotation_data)


```

## The betaHMM workflow

The betaHMM model, which is employed to identify DMCs (differentially
methylated CpG sites) and DMRs (differentially methylated regions), consists of
three crucial functions. The entire process is carried out separately for each
chromosome and involves the following steps:
\begin{itemize}
    \item the \textbf{betaHMM} function: This function handles model parameter
    and hidden state estimation.It comprises three main steps: initialization,
    the Baum-Welch algorithm, and the Viterbi algorithm.
    \item The \textbf{DMC identification} function: This function utilizes the
    output from the betaHMM and identifies the hidden states that are
    differentially methylated, along with the DMCs themselves. The selection of
    DMCs is based on the area-under-curve (AUC) method.
    \item The \textbf{DMR identification} function: This function operates on
    the output from the DMC identification function. It includes a user-defined
    parameter specifying the minimum number of adjacent CpGs required to form a
    DMR. The output of this function is a dataframe that contains information
    about the number of DMCs within a DMR, the CpG sites involved, and the
    starting and ending locations of the DMR.
\end{itemize}

## Model parameter estimation

The initial phase of the workflow involves employing the Baum-Welch algorithm
to estimate the model parameters. Subsequently, we apply the Viterbi algorithm
to determine the most probable sequence of hidden states.

```{r betaHMM,include=TRUE, echo=TRUE}
M <- 3 ## No. of methylation states in a DNA sample type
N <- 4 ## No. of patients
R <- 2 ## No. of treatment conditions
my.seed <- 321 ## set seed for reproducibility

betaHMM_out <- betaHMM(pca_methylation_data,
                                annotation_data,
                                M = 3,
                                N = 4,
                                R = 2,
                                parallel_process = FALSE,
                                seed = my.seed,
                                treatment_group = c("Benign","Tumour"))
```

## Summary of model parameters
The resulting output of a call to betaHMM is an S4 object of class
betaHMMResults.

```{r betaHMMclass,include=TRUE, echo=TRUE}
class(betaHMM_out)
```

The parameters estimated can be displayed using the following S4 methods:

```{r betaHMMaccessor,include=TRUE, echo=TRUE}
## transition matrix estimated for all chromosomes
A(betaHMM_out)

## Shape parameters estimated for a certain chromosome
phi(betaHMM_out)

## Hidden states assigned to all CpG sites for a certain chromosome
head(hidden_states(betaHMM_out)[["chr 7"]])
```


A summary of the model parameters estimated for each chromosome can be obtained
as below:

```{r betaHMMsummary,include=TRUE, echo=TRUE}
summary(betaHMM_out)
```


## DMC identification

After estimating the parameters and hidden states, we proceed to calculate the
AUC metric, which quantifies the dissimilarities between the cumulative
distributions estimated for each hidden state within each chromosome. A
user-defined threshold for the AUC metric is then applied to identify the most
differentially methylated hidden states. Additionally, we utilize a
user-defined threshold for the measure of uncertainty regarding membership in
these highly differentially methylated hidden states to pinpoint the most
differentially methylated CpG sites.

To access the dataframe containing information about CpG site locations,
methylation values, hidden state assignments, and a flag indicating DMC status,
you can use the S4 `assay` command.

```{r dmc,include=TRUE, echo=TRUE}
dmc_out <- dmc_identification(betaHMM_out)
dmc_df <- assay(dmc_out)
head(dmc_df)
```

## Summary of the DMCs identified

```{r dmcsummary,include=TRUE, echo=TRUE}
summary(dmc_out)
```

## Plot the density estimates of the model parameter estimates


The fitted density estimates, kernel density estimates, and the uncertainty in
the hidden state assignment can be observed using the plot function for the
betaHMM output. Since the parameters are estimated individually for each
chromosome, one can generate plots for each chromosome separately. To specify
the chromosome of interest, the user should utilize the `chromosome` parameter
within the function.

Additionally, the AUC metrics calculated for the hidden states can also be
displayed using the `AUC` parameter. By providing the AUC values obtained
through the `dmc_identification` function as an input parameter, the plot will
depict the AUC metrics corresponding to the selected chromosome in the plot
panels.

```{r betaHMMplot,include=TRUE,echo=TRUE,fig.width=8,fig.height=5,dev='png'}
AUC_chr <- AUC(dmc_out)
plot(betaHMM_out, chromosome = "7", what = "fitted density", AUC = AUC_chr)
```


The visualization of uncertainties in hidden state estimation is accomplished
using a boxplot. Additionally, users have the option to input the threshold of
uncertainty intended for DMC identification into the plotting function. This
allows for the incorporation of the specified threshold into the visualization
of uncertainties.

```{r betaHMMplot2,include=TRUE,echo=TRUE,fig.width=6,fig.height=5,dev='png'}
plot(betaHMM_out, chromosome = "7", what = "uncertainty",
        uncertainty_threshold = 0.2)
```

## DMR identification from DMCs identified


Spatially correlated CpG sites give rise to clusters of CpG sites with similar
methylation states, leading to the formation of biologically significant
regions known as differentially methylated regions (DMRs). To define the number
of adjacent DMCs required to identify a DMR, users can utilize the user-defined
parameter `DMC_count`, which is set to a default value of 2. If no value is
specified, the function automatically identifies regions within a chromosome
containing two or more adjacent DMCs as DMRs.

The DMR location information, along with the DMCs within each DMR, is presented
as an S4 output of class `dmrResults`. Accessing this output can be achieved
using the S4 assay method.

```{r dmr,include=TRUE, echo=TRUE}
dmr_out <- dmr_identification(dmc_out, parallel_process = FALSE)
dmr_df <- assay(dmr_out)
head(dmr_df)
```

## Summary of the DMRs identified

```{r dmrsummary,include=TRUE, echo=TRUE}
summary(dmr_out)
```


## Plot to visualise the DMCs and DMRs

The S4 plot method has the capability to utilize the output generated by the
`dmc_identification` function for plotting methylation values along with the
uncertainty associated with being identified as a DMC. To create such plots,
users must provide the starting CpG site's IlmnID as an input to the
`start_CpG` parameter. Additionally, the `end_CpG` parameter can either take
the IlmnID of the ending CpG site to be plotted or the number of CpG sites
to be plotted, excluding the starting CpG site.

It is worth noting that even though a CpG site may exhibit a very low
uncertainty in being associated with the hidden state identified as the most
differentially methylated, it might not be selected as a DMC if it falls below
the threshold value set by the user.

```{r dmrplot,include=TRUE,echo=TRUE,fig.width=7,fig.height = 5, dev = 'png'}
plot(dmc_out, start_CpG = "cg17750844", end_CpG = 15)
```

## Threshold identification in DNA samples belonging to a specific condition

The initialization of the intricate \eqn{K} hidden states betaHMM model,
aimed at identifying differentially methylated hidden states across DNA
samples collected from multiple biological conditions, begins with the
utilization of a simpler 3-state betaHMM model for parameter estimation
within a single treatment condition for identifying the 3 known methylation
states (hypomethylation, hemimethylation and hypermethylation). Subsequently,
these estimated parameters are amalgamated to construct the \eqn{3^R} hidden
states parameter model.

The function employed to estimate the 3-state betaHMM also provides an
objective estimation of the threshold methylation value distinguishing the
three methylation states. This function can be executed independently to
analyze the data distribution within DNA samples originating from a single
biological condition.

```{r threshold,include=TRUE, echo=TRUE}
threshold_out <- threshold_identification(pca_methylation_data[,1:5],
                                            package_workflow = FALSE,
                                            annotation_file = annotation_data,
                                            M = 3,
                                            N = 4,
                                            parameter_estimation_only = FALSE,
                                            seed = my.seed)
threshold(threshold_out)
```

## Plotting the results from threshold identification function

```{r thresholdplot,include=TRUE,echo=TRUE,fig.width=5,fig.height=4,dev='png'}
plot(threshold_out, plot_threshold = TRUE, what = "fitted density")
```

```{r}
sessionInfo()
```