---
title: "recountmethylation User's Guide"
author:
- Sean K. Maden
- Reid F. Thompson
- Kasper D. Hansen
- Abhinav Nellore
date: "`r format(Sys.time(), '%d %B, %Y')`"
bibliography: bibliography.bib
csl: cell-numeric.csl
package: recountmethylation
vignette: > 
  %\VignetteIndexEntry{recountmethylation User's Guide}
  %\VignetteDepends{RCurl}
  %\usepackage[UTF-8]{inputenc} 
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
output:
  BiocStyle::html_document:
    code_folding: show
    toc: yes
    tocfloat: yes
  BiocStyle::pdf_document: 
    toc: yes
    toc_depth: 2
---

```{r setup, echo = FALSE, warning = FALSE}
suppressMessages(library(knitr))
suppressMessages(library(GenomicRanges))
suppressMessages(library(limma))
suppressMessages(library(minfi))
suppressMessages(library(ExperimentHub))
suppressMessages(library(recountmethylation))
knitr::opts_chunk$set(echo = TRUE, eval = TRUE, warning = FALSE, 
  message = FALSE)
```

# Introduction and overview

The `recountmethylation` package provides access to databases of DNA 
methylation (DNAm) data from over 62,000 cumulative sample records with 
IDATs in the Gene Expression Omnibus (GEO, available by November, 2020). 
Samples were run using either of 2 Illumina BeadArray platform types, 
either the older HM450K platform or the newer EPIC/HM850K platform. The
database compilation files include mined, mapped, and model-based sample 
metadata, and DNAm data in the form of either raw/unnormalized red and 
green signals, raw/unnormalized methylated and unmethylated signals, or 
normalized DNAm fractions (a.k.a. "Beta-values") @maden_human_2020. 
Normalization was performed using the out-of-band signal correction (a.k.a. 
"noob") method, a type of within-sample normalization @triche_low-level_2013.

This User's Guide shows how to use the `recountmethylation` package to obtain, 
load, and query the DNAm databases with 2 small example files. Background about 
DNAm arrays, DNAm measurement, `SummarizedExperiment` objects, database file 
types, and samples metadata is also provided. Further analysis examples are 
contained in the `data_analyses` vignette.

## Disclaimer

Databases accessed with `recountmethylation` contain data from GEO 
(ncbi.nlm.nih.gov/geo/), a live public database where alterations to 
online records can cause discrepancies with stored data over time. 
We cannot guarantee the accuracy of stored data, and advise users 
cross-check their findings with latest available records.

## Database files and access

Database compilation file download and access is managed by the `get_db` 
functions, where the DNAm array platform type using the `platform` argument
(see `?get_db` for details). Both HM450K and EPIC/HM850K platforms are 
currently supported (see below for platform details). Note you will need 
between 50-180 Gb of disk space to store a single database file. Files pair 
sample metadata and assay data in various formats, including `HDF5-SummarizedExperiment` database directories, and `HDF5` database files 
with the `.h5` extension. 

The databases are located at 
[https://methylation.recount.bio/](https://methylation.recount.bio/), 
and file details are viewable as follows:

```{r, echo = TRUE, message = TRUE}
sm <- as.data.frame(smfilt(get_servermatrix()))
if(is(sm, "data.frame")){knitr::kable(sm, align = "c")}
```

## ExperimentHub integration

The DNAm array database files are indexed on `ExperimentHub`, and are 
viewable as follows.

```{r}
hub = ExperimentHub::ExperimentHub() # connect to the hubs
rmdat <- AnnotationHub::query(hub, "recountmethylation") # query the hubs
rmdat
```

In addition to using the `getdb` functions, the `HDF5` (".h5"" extension) 
files may alternatively be downloaded from the hubs, as follows.

```{r}
eid <- "EH3778" # h5 test file id
fpath <- rmdat[[eid]] # download with default caching
rhdf5::h5ls(fpath)
```

Note that whether downloads use the hubs or `getdb` functions, caching 
is implemented to check for previously downloaded database files.

# Background

This section includes essential background about DNAm array platforms, assays 
and file types, and sample metadata.

## DNAm arrays

Databases include human samples run on the Illumina Infinium HM450K BeadArray 
platform. HM450K is a popular 2-channel platform that probes over 480,000 CpG 
loci genome-wide, with enriched coverage at CG islands, genes, and enhancers 
@sandoval_validation_2011. The more recently released EPIC/HM850K platform 
contains an expanded probe set targeting over 850,000 CpGs, including 
more than 90% of the HM450K probes, with greater coverage of potential intergenic 
regulatory regions @pidsley_critical_2016.

Array processing generates 2 intensity files (IDATs) per sample, one each for 
the red and green color channels. These raw files also contain control signals 
useful for quality evaluations @noauthor_illumina_2010. The BeadArray probes use 
either of 2 bead technologies, known as Type I and Type II, where the majority 
(72%) of probes use the latter. For Type II probes, a single bead assay informs 
a single probe, while Type I probes use 2 beads each. Practically, this means 
the bead-specific matrices found in `RGChannelSet` objects are larger than the 
probe-specific matrices found in derived object types (e.g. for HM450K samples, 
622,399 assays for red/green signal matrices versus 485,512 assays for 
methylated/unmethylated signal, DNAm fractions matrices, see below).

## `SummarizedExperiment` object classes

DNAm array sample IDATs can be read into an R session as an object of class 
`RGChannelSet`, a type of `SummarizedExperiment`. These objects support 
analyses of high-throughput genomics datasets, and they include slots for 
assay matrices, sample metadata, and experiment metadata. During a typical 
workflow, normalization and preprocessing convert `RGChannelSet` objects into 
new types like `MethylSet` and `RatioSet`. While not all IDAT information is 
accessible from every object type (e.g. only `RGChannelSet`s can contain 
control assays), derived objects like `MethylSet`s and `RatioSet`s may be 
smaller and/or faster to access.

Three `SummarizedExperiment` databases are provided as 
`HDF5-SummarizedExperiment` files, including an unnormalized `RGChannelSet` 
(red/green signals), an unnormalized `MethylSet` (methylated/unmethylated 
signals) and a normalized `GenomicRatioSet` (DNAm fractions). For the latter, 
DNAm fractions (logit2 Beta-values, or M-values) were normalized using the 
out-of-band signal or "noob" method, an effective within-sample normalization 
that removes signal artifacts @triche_low-level_2013.

## Database file types

Database files are stored as either `HDF5` or `HDF5-SummarizedExperiment`. For 
most R users, the latter files will be most convenient to work with. `HDF5`, or 
hierarchical data format 5, combines compression and chunking for convenient 
handling of large datasets. `HDF5-SummarizedExperiment` files combine the 
benefits of `HDF5` and `SummarizedExperiment` entities using a 
DelayedArray-powered backend. Once an `HDF5-SummarizedExperiment` file is 
loaded, it can be treated similarly to a `SummarizedExperiment` object in 
active memory. That is, summary and subset operations execute rapidly, and 
realization of large data chunks in active memory is delayed until called for 
by the script (see examples).

## Sample metadata

Sample metadata are included with DNAm assays in the database files. Currently, 
metadata variables include GEO record IDs for samples (GSM) and studies (GSE), 
sample record titles, learned labels for tissue and disease, sample type 
predictions from the MetaSRA-pipeline, and DNAm model-based predictions for 
age, sex, and blood cell types. Access sample metadata from 
`SummarizedExperiment` objects using the `pData` minfi function (see examples). 
Examples in the `data_analyses` vignette illustrate some ways to utilize the 
provided sample metadata.

Provided metadata derives from the GSE-specific SOFT files, which contain 
experiment, sample, and platform metadata. Considerable efforts were made to 
learn, harmonize, and predict metadata labels. Certain types of info lacking 
in the `recountmethylation` metadata may be available in the SOFT files, 
especially if it is sample non-specific (e.g. methods text, PubMed ID, etc.) 
or redundant with DNAm-derived metrics (e.g. DNAm summaries, predicted sex, 
etc.).

It is good practice to validate the harmonized metadata with original metadata 
records, especially where labels are ambiguous or there is insufficient 
information for a given query. GEO GSM and GSE records can be viewed from a 
browser, or SOFT files may be downloaded directly. Packages like GEOmetadb and 
GEOquery are also useful to query and summarize GEO metadata.

# `HDF5-SummarizedExperiment` example

This example shows basic handling for `HDF5-SummarizedExperiment` (a.k.a. 
"h5se") files. For these files, the `getdb` function returns the loaded file. 
Thanks to a `DelayedArray` backend, even full-sized `h5se` databases can be 
treated as if they were fully loaded into active memory.

## Obtain the test database

The test `h5se` dataset includes sample metadata and noob-normalized 
DNAm fractions (Beta-values) for chromosome 22 probes for 2 samples. 
Datasets can be downloaded using the `getdb` series of functions 
(see `?getdb` for details), where the `dfp` argument specifies the 
download destination. The test `h5se` file is included in the package 
"inst" directory, and can be loaded as follows.

```{r}
dn <- "remethdb-h5se_gr-test_0-0-1_1590090412"
path <- system.file("extdata", dn, package = "recountmethylation")
h5se.test <- HDF5Array::loadHDF5SummarizedExperiment(path)
```

## Inspect and summarize the database

Common characterization functions can be used on the dataset after it has been 
loaded. These include functions for `SummarizedExperiment`-like objects, such 
as the `getBeta`, `pData`, and `getAnnotation` minfi functions. First, inspect 
the dataset using standard functions like `class`, `dim`, and `summary` as 
follows.

```{r}
class(h5se.test)
```

```{r}
dim(h5se.test)
```

```{r}
summary(h5se.test)
```

Access the sample metadata for the 2 available samples using `pData`. 

```{r}
h5se.md <- minfi::pData(h5se.test)
dim(h5se.md)
```
```{r}
colnames(h5se.md)
```

Next get CpG probe-specific DNAm fractions, or "Beta-values", with `getBeta` 
(rows are probes, columns are samples).

```{r}
h5se.bm <- minfi::getBeta(h5se.test)
dim(h5se.bm)
```
```{r}
colnames(h5se.bm) <- h5se.test$gsm
knitr::kable(head(h5se.bm), align = "c")
```

Access manifest information for probes with `getAnnotation`. This includes the 
bead addresses, probe type, and genome coordinates and regions. For full details 
about the probe annotations, consult the minfi and Illumina platform documentation.

```{r}
an <- minfi::getAnnotation(h5se.test)
dim(an)
```

```{r}
colnames(an)
```

```{r}
ant <- as.matrix(t(an[c(1:4), c(1:3, 5:6, 9, 19, 24, 26)]))
knitr::kable(ant, align = "c")
```

# `HDF5` database and example

To provide more workflow options, bead-specific red and green signal data have 
been provided with sample metadata in an `HDF5`/`h5` file. This example shows 
how to handle objects of this type with `recountmethylation`.

## Obtain the test database

The test `h5` file includes metadata and bead-specific signals from 
chromosome 22 for the same 2 samples as in the `h5se` test file. 
Note `getdb` functions for `h5` files simply return the database path.
Since the test `h5` file has also been included in the package "inst" folder,
get the path to load the file as follows.

```{r}
dn <- "remethdb-h5_rg-test_0-0-1_1590090412.h5"
h5.test <- system.file("extdata", "h5test", dn, 
                    package = "recountmethylation")
```

## Inspect and summarize the database

Use the file path to read data into an `RGChannelSet` with the `getrg` 
function. Setting `all.gsm = TRUE` obtains data for all samples in the
database files, while passing a vector of GSM IDs to `gsmv` argument 
will query a subset of available samples. Signals from all available 
probes are retrieved by default, and probe subsets can be obtained by 
passing a vector of valid bead addresses to the `cgv` argument.

```{r}
h5.rg <- getrg(dbn = h5.test, all.gsm = TRUE)
```

To avoid exhausting active memory with the full-sized `h5` dataset, provide 
either `gsmv` or `cgv` to `getrg`, and set either `all.cg` or `all.gsm` to 
FALSE (see `?getrg` for details).

As in the previous example, use `pData` and `getAnnotation` to get sample 
metadata and array manifest information, respectively. Access the green and 
red signal matrices in the `RGChannelSet` with the `getRed` and `getGreen` 
minfi functions.

```{r}
h5.red <- minfi::getRed(h5.rg)
h5.green <- minfi::getGreen(h5.rg)
dim(h5.red)
```
```{r}
knitr::kable(head(h5.red), align = "c")
```
```{r}
knitr::kable(head(h5.green), align = "c")
```
```{r}
identical(rownames(h5.red), rownames(h5.green))
```

Rows in these signal matrices map to bead addresses rather than probe IDs. 
These matrices have more rows than the `h5se` test Beta-value matrix because 
any type I probes use data from 2 beads each.

# Validate DNAm datasets

This section demonstrates validation using the test databases. Full code
to reproduce this section is provided but not evaluated, as it involves a 
download from the GEO servers. As the disclaimer notes, it is good practice
to validate data against the latest available GEO files. This step may be 
most useful for newer samples published close to the end compilation date 
(through November 7, 2020 for current version), which may be more prone to 
revisions at initial publication.

## Download and read IDATs from the GEO database server

Use the `gds_idat2rg` function to download IDATs for the 2 test samples 
and load these into a new `RGChannelSet` object. Do this by passing a vector
of GSM IDs to `gsmv` and the download destination to `dfp`. (note, chunks in
this section are fully executable, but not evaluated for this vignette).

```{r, eval = FALSE}
# download from GEO
dlpath <- tempdir()
gsmv <- c("GSM1038308", "GSM1038309")
geo.rg <- gds_idat2rg(gsmv, dfp = dlpath)
colnames(geo.rg) <- gsub("\\_.*", "", colnames(geo.rg))
```

## Compare DNAm signals

Extract the red and green signal matrices from `geo.rg`.

```{r, eval = FALSE}
geo.red <- minfi::getRed(geo.rg)
geo.green <- minfi::getGreen(geo.rg)
```

Match indices and labels between the GEO and `h5` test signal matrices.

```{r, eval = FALSE}
int.addr <- intersect(rownames(geo.red), rownames(h5.red))
geo.red <- geo.red[int.addr,]
geo.green <- geo.green[int.addr,]
geo.red <- geo.red[order(match(rownames(geo.red), rownames(h5.red))),]
geo.green <- geo.green[order(match(rownames(geo.green), rownames(h5.green))),]
identical(rownames(geo.red), rownames(h5.red))
identical(rownames(geo.green), rownames(h5.green))
class(h5.red) <- "integer"
class(h5.green) <- "integer"
```

Finally, compare the signal matrix data.

```{r, eval = FALSE}
identical(geo.red, h5.red)
```
```{r, eval = FALSE}
identical(geo.green, h5.green)
```

## Compare DNAm Beta-values

Before comparing the GEO-downloaded data to data from the `h5se.test` database, 
normalize the data using the same out-of-band or "noob" normalization technique 
that was used to generate data in the `h5se` database.

```{r, eval = FALSE}
geo.gr <- minfi::preprocessNoob(geo.rg)
```

Next, extract the Beta-values.

```{r, eval = FALSE}
geo.bm <- as.matrix(minfi::getBeta(geo.gr))
```

Now match row and column labels and indices.

```{r, eval = FALSE}
h5se.bm <- as.matrix(h5se.bm)
int.cg <- intersect(rownames(geo.bm), rownames(h5se.bm))
geo.bm <- geo.bm[int.cg,]
geo.bm <- geo.bm[order(match(rownames(geo.bm), rownames(h5se.bm))),]
```

Finally, compare the two datasets.

```{r, eval = FALSE}
identical(summary(geo.bm), summary(h5se.bm))
```
```{r, eval = FALSE}
identical(rownames(geo.bm), rownames(h5se.bm))
```

# Troubleshooting and tips

This section describes how to address potential issues with accessing the
database files or working with the `DelayedArray` based objects locally.

## Issue: large file downloads don't complete

If repeated attempts to download the database compilation files fail, you
may try the following:

* First ensure your internet connection is stable and there is sufficient 
space at the download destination for the database file. 

* Second, try increasing your timeout duration beyond the default before 
repeating the download attempt with `getdb`. Check the current timeout 
for an R session with `getOptions('timeout')`, then manually increase 
the timeout duration with `options(timeout = new.time)`.

* Finally, you may attempt to download a server file using command line 
calls to your system terminal or console. For instance, on a Mac you 
might try `wget -r <file_url>`. If this doesn't work, you can again 
attempt to increase the timeout duration and repeat the download attempt.

## Issue: unexpected function behaviors for `DelayedArray` inputs

Unexpected function behaviors may arise when using `DelayedArray`-based inputs.
These essentially arise from lacking interoperativity between normal matrices 
and the `DelayedArray`-based matrices. Known examples include:

* `minfi::detectionP()`: 

Throws error for specific subsets of data, such as for queries of exactly 
50 samples. 

```{r, eval = FALSE}
detectionP(rg[,1:50])
"Error in .local(Red, Green, locusNames, controlIdx, TypeI.Red, TypeI.Green, dim(Red_grid) == dim(detP_sink_grid) are not all TRUE"
```

* `minfi::preprocessFunnorm()`: 

Throws error when called for an `RGChannelSet` of type `HDF5-SummarizedExperiment`.

```{r, eval = FALSE}
preprocessFunnorm(rg)
"Error: 'preprocessFunnorm()' only supports matrix-backed minfi objects.""
```

These and other related errors may be addressed by instantiating the data query, 
or the data chunk, as a new non-`DelayedArray` object. For example, remake a 
subset of the full `h5se` dataset, `rg`, as follows.

```{r, eval = FALSE}
rg.h5se <- loadHDF5SummarizedExperiment(rg.path) # full h5se RGChannelSet
rg.sub <- rg.h5se[,c(1:20)] # subset samples of interest
# re-make as non-DA object
rg.new <- RGChannelSet(Red = getRed(rg.sub), 
                       Green = getGreen(rg.sub),
                       annotation = annotation(rg.sub))
gr <- preprocessFunnorm(rg.new) # repeat preprocessing
```

Alternatively, non-`DelayedArray` `RGChannelSet` objects can be readily generated from
the full `h5` `RGChannelSet` database with the provided function `getrg()`.

# Get more help

Consult the Data Analyses [vignette](link.url) and main [manuscript](link.url) 
for analysis examples and details about data compilations.

# Session info

```{r get_sessioninfo}
sessionInfo()
```

# Works Cited