---
title: "The ramr User's Guide"
author: "Oleksii Nikolaienko"
date: "`r format(Sys.time(), '%d %B, %Y')`"
abstract: |
  A comprehensive guide to using the ramr package for detection
  of rare aberrantly methylated regions (epimutations).
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{ramr}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  width = 100
)
options(width=100)
```

# Introduction

`ramr` is an R package for detection of low-frequency aberrant methylation events (epimutations) in large data sets obtained by methylation profiling using array or high-throughput methylation sequencing. In addition, package provides functions to visualize found aberrantly methylated regions (AMRs), to generate sets of all possible regions to be used as reference sets for enrichment analysis, and to generate biologically relevant test data sets for performance evaluation of AMR/DMR search algorithms.

### Current Features

-   Identification of aberrantly methylated regions (AMRs, i.e., epimutations)
-   AMR visualization
-   Generation of reference sets for third-party analyses (e.g., enrichment)
-   Generation of test data sets for performance evaluation of algorithms for search of differentially (DMR) or aberrantly (AMR) methylated regions

### Major improvements

##### v1.16 [BioC 3.21]

-   Major rewrite of `getAMR` and `simulateData` functions, which are now much faster (C/C++, OpenMP threads) and more robust (correctly deal with methylation sequencing data that often contains 0 and 1 values).
-   Old functions `getAMR` and `simulateData` as they were described in the `ramr` paper are now obsolete, but kept under different names (`getAMR.obsolete` and `simulateData.obsolete`, respectively) for consistency.
-   Cleaner and more robust AMR plotting.

# Reading data

`ramr` methods operate on objects of the class `GRanges`. The input object for AMR search must in addition contain metadata columns with sample beta values. A typical input object looks like this:

```         
GRanges object with 383788 ranges and 845 metadata columns:
             seqnames    ranges strand |         GSM1235534         GSM1235535         GSM1235536 ...
                <Rle> <IRanges>  <Rle> |          <numeric>          <numeric>          <numeric> ...
  cg13869341     chr1     15865      * |  0.801634776091808  0.846486905008704   0.86732154737116 ...
  cg24669183     chr1    534242      * |  0.834138820071765  0.861974610731835  0.832557979806823 ...
  cg15560884     chr1    710097      * |  0.711275180750356   0.70461945838556  0.699487225634589 ...
  cg01014490     chr1    714177      * | 0.0769098196182058 0.0569443780518647 0.0623154673389864 ...
  cg17505339     chr1    720865      * |  0.876413362222415  0.885593263385521  0.877944732153869 ...
         ...      ...       ...    ... .                ...                ...                ... ...
  cg05615487    chr22  51176407      * |   0.84904178467798  0.836538383875097   0.81568519870099 ...
  cg22122449    chr22  51176711      * |  0.882444486059592  0.870804215405886  0.859269224277308 ...
  cg08423507    chr22  51177982      * |  0.886406345093286  0.882430879852752  0.887241923657461 ...
  cg19565306    chr22  51222011      * | 0.0719084295670266 0.0845209871264646 0.0689074604483659 ...
  cg09226288    chr22  51225561      * |  0.724145303755024  0.696281176451351  0.711459675603635 ...
```

`ramr` package is supplied with a sample data, which was simulated using GSE51032 data set as described in the `ramr` reference paper. Sample data set `ramr.data` contains beta values for 10000 CpGs and 100 samples (`ramr.samples`), and carries 6 unique (`ramr.tp.unique`) and 15 non-unique (`ramr.tp.nonunique`) true positive AMRs containing at least 10 CpGs with their beta values increased/decreased by 0.5

```{r, fig.width=10, fig.height=4, out.width="100%", out.height="100%"}
library(GenomicRanges)
library(ggplot2)
library(ramr)
data(ramr)

head(ramr.samples)
ramr.data[1:10,ramr.samples[1:3]]
plotAMR(data.ranges=ramr.data, amr.ranges=ramr.tp.unique[1])
plotAMR(data.ranges=ramr.data, amr.ranges=ramr.tp.nonunique[c(1,6,11)])
```

The input (or template) object may be obtained using data from various sources. Here we provide two examples:

### Using data from NCBI GEO

The following code pulls (NB: very large) raw files from NCBI GEO database, performes normalization and creates `GRanges` object for further analysis using `ramr` (system requirements: 22GB of disk space, 64GB of RAM)

```         
library(minfi)
library(GEOquery)
library(GenomicRanges)
library(IlluminaHumanMethylation450kanno.ilmn12.hg19)

# destination for temporary files
dest.dir <- tempdir()

# downloading and unpacking raw IDAT files
suppl.files <- getGEOSuppFiles("GSE51032", baseDir=dest.dir, makeDirectory=FALSE, filter_regex="RAW")
untar(rownames(suppl.files), exdir=dest.dir, verbose=TRUE)
idat.files  <- list.files(dest.dir, pattern="idat.gz$", full.names=TRUE)
sapply(idat.files, gunzip, overwrite=TRUE)

# reading IDAT files
geo.idat <- read.metharray.exp(dest.dir)
colnames(geo.idat) <- gsub("(GSM\\d+).*", "\\1", colnames(geo.idat))

# processing raw data
genomic.ratio.set <- preprocessQuantile(geo.idat, mergeManifest=TRUE, fixOutliers=TRUE)

# creating the GRanges object with beta values
data.ranges <- granges(genomic.ratio.set)
data.betas  <- getBeta(genomic.ratio.set)
sample.ids  <- colnames(geo.idat)
mcols(data.ranges) <- data.betas

# data.ranges and sample.ids objects are now ready for AMR search using ramr
```

### Using Bismark cytosine report files

```         
library(methylKit)
library(GenomicRanges)

# file.list is a user-defined character vector with full file names of Bismark cytosine report files
file.list

# sample.ids is a user-defined character vector holding sample names
sample.ids

# methylation context string, defines if the reads covering both strands will be merged
context <- "CpG"

# fitting beta distribution (filtering using ramr.method "beta" or "wbeta") requires
# that most of the beta values are not equal to 0 or 1
min.beta <- 0.001
max.beta <- 0.999

# reading and uniting methylation values
meth.data.raw <- methRead(as.list(file.list), as.list(sample.ids), assembly="hg19", header=TRUE,
                          context=context, resolution="base", treatment=rep(0,length(sample.ids)),
                          pipeline="bismarkCytosineReport")
meth.data.utd <- unite(meth.data.raw, destrand=isTRUE(context=="CpG"))

# creating the GRanges object with beta values
data.ranges <- GRanges(meth.data.utd)
data.betas  <- percMethylation(meth.data.utd)/100
data.betas[data.betas<min.beta] <- min.beta
data.betas[data.betas>max.beta] <- max.beta
mcols(data.ranges) <- data.betas

# data.ranges and sample.ids objects are now ready for AMR search using ramr
```

# Simulating data

`ramr` provides methods to create sets of random AMRs and to generate biologically relevant methylation beta values using real data sets as templates. The following code provides an example, however it is recommended to use a real experimental data (e.g. GSE51032) to create a test data set for assessing the performance of `ramr` or other AMR/DMR search engines. The results of data simulation are fully reproducible when the same seed has been set (at a cost of serial random number generation).

```{r, fig.width=10, fig.height=6, out.width="100%", out.height="100%"}
# set the seed if reproducible results required
  set.seed(999)

# unique random AMRs
  amrs.unique <-
    simulateAMR(template.ranges=ramr.data, nsamples=25, regions.per.sample=2,
                min.cpgs=5, merge.window=1000, dbeta=0.2)

# non-unique AMRs outside of regions with unique AMRs
  amrs.nonunique <-
    simulateAMR(template.ranges=ramr.data, nsamples=4, exclude.ranges=amrs.unique,
                sample.names=sprintf("sample%02i", c(2, 4, 8, 16)),
                samples.per.region=2, min.cpgs=5, merge.window=1000)
  
# random noise outside of AMR regions
  noise <-
    simulateAMR(ramr.data, nsamples=25, regions.per.sample=20,
                exclude.ranges=c(amrs.unique, amrs.nonunique),
                min.cpgs=1, max.cpgs=1, merge.window=1, dbeta=0.5)

# "smooth" methylation data without AMRs (negative control)
  smooth.data <-
    simulateData(template.ranges=ramr.data, nsamples=25)
  
# methylation data with AMRs and noise
  noisy.data <-
    simulateData(template.ranges=ramr.data, nsamples=25,
                 amr.ranges=c(amrs.unique, amrs.nonunique, noise))
  
# that's how regions look like
  library(gridExtra)

  do.call("grid.arrange", c(plotAMR(data.ranges=noisy.data, amr.ranges=amrs.unique[1:4]), ncol=2))
  do.call("grid.arrange", c(plotAMR(data.ranges=noisy.data, amr.ranges=sort(amrs.nonunique)[1:8]), ncol=2))
  do.call("grid.arrange", c(plotAMR(data.ranges=noisy.data, amr.ranges=noise[1:4]), ncol=2))
  
# can we find them?
  system.time(
    found <- getAMR(
      data.ranges=noisy.data,
      compute="beta+binom", compute.estimate="amle", compute.weights="logInvDist",
      combine.min.cpgs=5, combine.threshold=1e-2, combine.window=1000
    )
  )
  
# all possible regions
  all.ranges <- getUniverse(noisy.data, min.cpgs=5, merge.window=1000)
  
# true positives
  tp <- sum(found %over% c(amrs.unique, amrs.nonunique))
  
# false positives
  fp <- sum(found %outside% c(amrs.unique, amrs.nonunique))

# true negatives
  tn <- length(all.ranges %outside% c(amrs.unique, amrs.nonunique))
  
# false negatives
  fn <- sum(c(amrs.unique, amrs.nonunique) %outside% found)
  
# accuracy, MCC
  acc <- (tp+tn) / (tp+tn+fp+fn)
  mcc <- (tp*tn - fp*fn) / (sqrt(tp+fp)*sqrt(tp+fn)*sqrt(tn+fp)*sqrt(tn+fn))
  setNames(c(tp, fp, tn, fn), c("TP", "FP", "TN", "FN"))
  setNames(c(acc, mcc), c("accuracy", "MCC"))
```

# AMR identification

This code shows how to do basic analysis with `ramr` using provided data files:

```{r, fig.width=10, fig.height=6, out.width="100%", out.height="100%"}
# identify AMRs
amrs <- getAMR(
  data.ranges=ramr.data, data.samples=ramr.samples, compute="beta+binom",
      combine.min.cpgs=5, combine.threshold=1e-2, combine.window=1000
)

# inspect
sort(amrs)

do.call("grid.arrange", c(plotAMR(data.ranges=ramr.data, amr.ranges=amrs[1:10]), ncol=2))
```

The results of parallel AMR search are fully reproducible (do not depend on the seed).

# AMR annotation and enrichment analysis

If necessary, AMRs can be annotated to known genomic elements using R library `annotatr` [^1] or tested for potential enrichment in epigenetic or other marks using R library `LOLA` [^2]

[^1]: Raymond G Cavalcante, Maureen A Sartor, annotatr: genomic regions in context, Bioinformatics, Volume 33, Issue 15, 01 August 2017, Pages 2381–2383, <https://doi.org/10.1093/bioinformatics/btx183>

[^2]: Nathan C. Sheffield, Christoph Bock, LOLA: enrichment analysis for genomic region sets and regulatory elements in R and Bioconductor, Bioinformatics, Volume 32, Issue 4, 15 February 2016, Pages 587–589, <https://doi.org/10.1093/bioinformatics/btv612>

```{r}
# annotating AMRs using R library annotatr
library(annotatr)
annotation.types <- c("hg19_cpg_inter", "hg19_cpg_islands", "hg19_cpg_shores",
                      "hg19_cpg_shelves", "hg19_genes_intergenic", "hg19_genes_promoters",
                      "hg19_genes_5UTRs", "hg19_genes_firstexons", "hg19_genes_3UTRs")
annotations <- build_annotations(genome='hg19', annotations=annotation.types)
amrs.annots <- annotate_regions(regions=amrs, annotations=annotations, ignore.strand=TRUE, quiet=FALSE)
summarize_annotations(annotated_regions=amrs.annots, quiet=FALSE)
```

```         
# generate the set of all possible genomic regions using sample data set and
# the same parameters as for AMR search
universe <- getUniverse(ramr.data, min.cpgs=5, merge.window=1000)

# enrichment analysis of AMRs using R library LOLA
library(LOLA)
# prepare the core database as described in vignettes
vignette("usingLOLACore")
# load the core database and perform the enrichment analysis
hg19.coredb <- loadRegionDB(system.file("LOLACore", "hg19", package="LOLA"))
runLOLA(amrs, universe, hg19.coredb, cores=1, redefineUserSets=TRUE)
```

# Other information

### Citing the `ramr` package

Oleksii Nikolaienko, Per Eystein Lønning, Stian Knappskog, *ramr*: an R/Bioconductor package for detection of rare aberrantly methylated regions, Bioinformatics, 2021;, btab586, <https://doi.org/10.1093/bioinformatics/btab586>

### The data underlying `ramr` manuscript

Replication Data for: "ramr: an R package for detection of rare aberrantly methylated regions, <https://doi.org/10.18710/ED8HSD>

### Session Info

```{r session}
sessionInfo()
```