---
title: "SeSAMe Basic Usage"
date: "`r BiocStyle::doc_date()`"
package: sesame
output: BiocStyle::html_document
fig_width: 6
fig_height: 6
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{"0. Basic Usage"}
  %\VignetteEncoding{UTF-8}
---

```{r message=FALSE, warning=FALSE, include=FALSE}
library(sesame)
library(dplyr)
options(rmarkdown.html_vignette.check_title = FALSE)
```

# Installation

From Bioconductor
```{r, eval=FALSE}
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("sesame")
```

Development version can be installed from github
```{r, eval=FALSE}
BiocManager::install('zwdzwd/sesameData')
BiocManager::install('zwdzwd/sesame')
```

**CRITICAL:** After a new installation, one needs to cache the associated
annotation data using the following command.
This needs to be done only once per SeSAMe installation.
```{r eval=FALSE}
sesameDataCacheAll()
```
This function caches the needed SeSAMe annotation for all the supported
platforms. One optionally caches only one specific platform (e.g., EPIC)
by calling
```{r eval=FALSE}
sesameDataCache("EPIC")
```

SeSAMe annotation data is managed by the
[sesameData](https://bioconductor.org/packages/release/
data/experiment/html/sesameData.html) package which is based on the
[ExperimentHub](https://bioconductor.org/packages/release/
bioc/html/ExperimentHub.html) package. You can find the path of
the annotation data storage on your local computer using
`tools::R_user_dir("ExperimentHub", which="cache")`.

# Beta Value Extraction

The raw Infinium BeadChip data are stored in IDAT files. Each sample has
two IDAT files and they correspond to the red and green signal respectively.
Green and red files for the same samples should always share the same sample
name prefix.
For example, `204529320035_R06C01_Red.idat` and
`204529320035_R06C01_Grn.idat` correspond to the red and green signal
of one sample. In SeSAMe, we will use the common prefix,
i.e. 204529320035_R06C01, to refer to that sample.
Sesame recognize both the raw idat as well as gzipped idats which are common
in data stored in [GEO](https://www.ncbi.nlm.nih.gov/geo/).
For example, in addition to the example above, SeSAMe also recognizes
`GSM2178224_Red.idat.gz` and `GSM2178224_Grn.idat.gz`.

The function `readIDATpair` function reads in the signal intensity data from
the IDAT pairs. The function takes the **common prefix** as input and outputs
a `SigSet`. Using the two examples above, one would run the following
commands.
```{r eval=FALSE}
sset = readIDAT("GSM2178224")
# or
sset = readIDAT("204529320035_R06C01")
```
Note that SeSAMe automatically detects and matches up the green and red signal
files for the same sample.
We will get back to the structure of `SigSet` below.

DNA methylation level (aka beta values) are defined as
`methylated signal`/(`methylated signal` + `unmethylated signal`).
It can be retrieved using the `getBetas` function on the `SigSet`. 
The output is a named vector with probe ID as names.
For example, the following commands reads in one
sample and convert it to beta values.
```{r eval=FALSE}
sset = readIDAT("GSM2178224")
betas = getBetas(sset)
```
Please note that `getBetas` takes a single `SigSet`
object instead of a list of them. To have it process many `SigSet`s, you
should combine that with looping functions `lapply` or `mclapply`s, or using
the `openSesame` pipeline (see below).

## Search IDAT Prefixes

Most often we will be working with a folder that contains many IDATs. Here is
where the `searchIDATprefixes` function comes in handy. It lets us search all
the IDATs in a folder and its subfolders recursively. Combine this with the
R looping functions lets you process many IDATs without having to specify
all IDAT names. `searchIDATprefixes` returns a named vector of prefixes with
associated `Red` and `Grn` files, for `readIDATpair`:
```{r}
ssets = lapply(
    searchIDATprefixes(system.file("extdata/", package = "sesameData")),
    readIDATpair)
```
A simple list of "SigSet"s are returned.

## Custom-made Array

If you need to deal with a custom-made array instead of the standard
array (MM285, EPIC, HM450 etc) supported natively by SeSAMe, you would need
to provide a manifest that describe the probe information. You should be
able to obtain that from the Illumina support website.
The manifest should be made into a simple data frame with a minimum of
four columns: `Probe_ID`, `M`, `U` and `col`.
The easiest way to make the the manifest compatible with SeSAMe is by
following internal manifests for a SeSAME-supported platform. They can be
retrieved with the `sesameDataGet` function:
```{r}
mft = sesameDataGet("MM285.address")$ordering
```

The `col` is either `G` (stand for Green) or `R` (stand for Red) or `NA` (
stand for both in the case of Infinium II design).
For Infinium-II probes, the `M` column and `col` column is left to be `NA`.
For example, one can check that both `M` and `col` columns are filled with
the Infinium-I probes:
```{r}
head(mft[!is.na(mft$col),])
```
The last column `mask` is a logical vector that defines the default masking
behavior of SeSAMe for the platform (see below for discussion of NA-masking).

With the manifest, your data can be processed using
```{r eval = FALSE}
sset = readIDATpair("your_sample_name", manifest = mft)
betas = getBetas(sset)
```

## Sum Two-channels

Beta values for Infinium-I probes can also be obtained by summing up the two
in-band channel and out-of-band channel. This rescues probes with SNP hitting
the extension base and hence switching color channel. More details can be
found in
[Zhou et al 2017](https://academic.oup.com/nar/article/45/4/e22/2290930).
```{r}
library(sesame)
## show case using an example without mask, then add mask with qualityMask
sset = qualityMask(sesameDataGet('EPIC.1.LNCaP')$sset)
betas = getBetas(sset, sum.TypeI = TRUE)
```

For color-channel-switching probes, extra SNP allele frequencies can be obtained
by summing up methylated and umethylated alleles:
```{r}
extraSNPAFs = getAFTypeIbySumAlleles(sset)
```

# NA-Masking

If you call getBetas as is, you may notice that some of the beta values
show up having NA values. This NA-masking is controlled internally using
a `mask` attribute in `SigSet`. E.g.,
```{r message = FALSE}
betas = getBetas(sset)
head(betas)
```

To check probes to be NA-masked in a `SigSet`, one can use the `mask`
function
```{r}
head(mask(sset))
sum(mask(sset))   # number of probes to be NA-masked
sum(is.na(betas)) # should be the same as above
```

Please be noted that `mask` in `SigSet` does not actually remove the probe
reading but only specify how SeSAMe currently views the measurement. One can
add more probes to the mask with the `addMask` function. Other functions
such as the detection p-value calculation, also modifies `mask`. NA-masking
influences other normalization and preprocessing functions. Therefore one
should set it carefully. If one does not do any explicit masking, one simply
gets the masking specified in the `mask` column of the manifest. This defines
the default masking behavior of SeSAMe. For more details of masking, one can
refer to
[Zhou et al 2017](https://academic.oup.com/nar/article/45/4/e22/2290930).
One can override this masking by the `resetMask` function
```{r}
sum(mask(sset))  # before resetting
sum(mask(resetMask(sset)))  # after resetting, expect 0
```
The `getBetas` function can also ignore NA-masking while extracting beta
values:
```{r}
sum(is.na(getBetas(sset, mask=FALSE)))  # expect 0
```

We recommend two types of probe masking:

1) Experiment-dependent masking based on signal detection p-value
([Zhou et al. 2018](https://www.ncbi.nlm.nih.gov/pubmed/30085201)). Probes with
p-value higher than a threshold (default: 0.05) are masked (see following for
detection p-value calculation using the `pOOBAH` method).

2) Experiment-independent probe masking due to design issues. This is typically
designated in the `mask` column of the manifest (see 
[Zhou et al. 2017](https://www.ncbi.nlm.nih.gov/pubmed/27924034)): This masking
supports EPIC, MM285, HM450 and HM27 and is turned on by default and can also be
explicitly added by the function `qualityMask`.

# Preprocessing

## Detection p-value

As mentioned above, experiment-dependent masking based on signal detection
p-values is effective in excluding artifactual methylation level reading
and probes with too much influence from signal background. We recommend the
`pOOBAH` algorithm that was based on Infinium-I probe out-of-band signal for
calibrating the distribution of the signal background:
```{r}
sum(mask(sset))  # before pOOBAH
sum(mask(pOOBAH(sset))) # after pOOBAH, expect more probes
sum(mask(pOOBAH(resetMask(sset)))) # pOOBAH-only masking
```

Sometimes one would want to calculation detection p-value without creating
masking like in the case of having to upload the p-value to GEO. In those cases
one can use the `return.pval` option and add pvalue-based mask later.
```{r}
pvals = pOOBAH(sset, return.pval=TRUE)
sset2 = addMask(sset, pvals > 0.05) # default, recommend between 0.05 and 0.2
```

## Background Subtraction

SeSAMe implements the
background subtraction based on normal-exponential deconvolution using
out-of-band probes `noob`
([Triche et al. 2013](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3627582/))
and optionally with extra bleed-through subtraction. Signal bleed-through
happens when measurement from one channel affects the measurement in the other
channel. SeSAMe's `noob` further removes residual background by regressing
out the green-to-red and red-to-green relationship using Type-I probes.
```{r}
sset.nb = noob(sset)          # noob background subtraction
sset.sb = scrub(sset)         # more aggressive background subtraction
```

One can use following beta to total signal intensity to check the effect of
background subtraction.
```{r}
sesamePlotIntensVsBetas(sset)
sesamePlotIntensVsBetas(sset.nb) # with noob
sesamePlotIntensVsBetas(sset.sb) # with scrub
```

## Dye Bias Correction

Dye bias refers to the difference in signal intensity between the two color
channel. SeSAMe offers two flavors of dye bias correction: linear scaling
(`dyeBiasCorr`) and nonlinear scaling (`dyeBiasCorrTypeINorm`). Linear scaling
equalize the mean of all probes from the two color channel.
```{r}
library(sesame)
sset.dbLinear = dyeBiasCorr(sset)
sesamePlotRedGrnQQ(sset.dbLinear)
```

Residual dye bias can be corrected using nonlinear quantile interpolation with
Type-I probes.
```{r}
sset.dbNonlinear = dyeBiasCorrTypeINorm(sset)
```
Under this correction, Infinium-I Red probes and
Infinium-I Grn probes have the same distribution of signal.
```{r}
sesamePlotRedGrnQQ(sset.dbNonlinear)
```

Note that linear scaling does not shift beta values of Type-I probes while
nonlinear scaling does shift beta values of Type-I probes.

## Channel Inference
Sometimes Infinium-I channel spec is inaccurate in the manifest. We can
infer the channel using data.
```{r}
sset.InfICorrected = inferTypeIChannel(sset, verbose=TRUE)
```
# The openSesame Pipeline

We have discussed noob, nonlinear dye bias correction and pOOBAH. Put
together, this can be implemented as follows
```{r eval = FALSE}
idat_dir = system.file("extdata/", package = "sesameData")
betas = do.call(cbind, mclapply(searchIDATprefixes(idat_dir), function(pfx) {
    getBetas(dyeBiasNL(noob(pOOBAH(readIDATpair(pfx)))))
}), mc.cores=2)
```
Here we use two cores with `mc.cores=2`.

Equivalently, sesame provides the **openSesame** pipeline
```{r eval = FALSE}
idat_dir = system.file("extdata/", package = "sesameData")
betas = openSesame(idat_dir, BPPARAM=BiocParallel::MulticoreParam(2))
```
as a quickstart default. Here `idat_dir` is the directory containing all
the IDAT files.
Like readIDATpair, `openSesame` also works with custom-made array with a
manifest file (see above):

```{r eval = FALSE}
betas = openSesame(idat_dir, manifest = manifest)
```

# The SigSet Class

SeSAMe design includes alight-weight full exposure of internal signal
intensities (essential information for users of Illumina methylation array data,
as demonstrated in Zhou et al 2018), which permits sensitive and specific joint
inference on copy number and DNA methylation.

Central to the SeSAMe platform is the `SigSet` data structure,
an S4 class with slots containing signals for six different classes of probes:

1) `II` - Type-II probes;
2) `IR` - Type-I Red channel probes;
3) `IG` - Type-I Grn channel probes;
4) `oobG` - Out-of-band Grn channel probes (matching Type-I Red channel probes
in number);
5) `oobR` - Out-of-band Red channel probes (matching Type-I Grn channel probes
in number);
6) `ctl` - control probes.

For all save control probes, signal intensities are stored as an `Nx2` numeric
matrix, with `N` representing the number of probes in the class. The two
columns of the matrix represent the methylated probe intensity and the 
unmethylated probe intensity. (Previously, this was implemented in an R6
Reference class, `SignalSet`. The current S4 implementation in `SigSet`
complies with Bioconductor guidelines, and for backwards compatibility, the
`signalR6toS4` function transforms a `SignalSet` to a `SigSet`.

For example, printing the SigSet directly shows its content
```{r}
sset
head(II(sset)) # or sset@II
head(ctl(sset)) # the last column: the type of the control.
```

Similarly, signals for Type-I probes can be accessed from `sset@IR` and
`sset@IG` and out-of-band signals from `sset@oobG` and `sset@oobR`.
As one can see the probe names (row names) of `IR` always coincide with the
probe names (row names) of `oobG` (and vice versa). This is because the
out-of-band probe signal for red channel probes is in green channel
(and vice versa). 

## Interaction with Minfi

`SigSet` can be converted back and forth from Minfi `RGChannelSet` in multiple
ways. One can sesamize a minfi `RGChannelSet` which returns a `GenomicRatioSet`.
Here we are illustrating using the `FlowSorted.Blood.450k` object, which is 
distributed in the `minfi::RGChannelSet`.
```{r message = FALSE}
library(sesame)
library(FlowSorted.Blood.450k)
options(rmarkdown.html_vignette.check_title = FALSE)

```

```{r}
grSet <- sesamize(FlowSorted.Blood.450k[,1:4])
grSet
```