Installation

To install SeSAMe from Bioconductor,

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("sesame")

To install the development version directly from github

BiocManager::install('zwdzwd/sesameData')
BiocManager::install('zwdzwd/sesame')

CRITICAL: After a new installation, one needs to cache the associated annotation data using the following command. This needs to be done only once per SeSAMe installation.

sesameDataCacheAll()

This function caches the needed SeSAMe annotation for all the supported platforms. One optionally caches only one specific platform (e.g., EPIC) by calling

sesameDataCache("EPIC")

SeSAMe annotation data is managed by the sesameData package which is based on the ExperimentHub package. You can find the path of the annotation data storage on your local computer using

tools::R_user_dir("ExperimentHub", which="cache")

## [1] "/home/biocbuild/.cache/R/ExperimentHub"

Beta Value Extraction

The raw Infinium BeadChip data are stored in IDAT files. Each sample has two IDAT files and they correspond to the red and green signal respectively. Green and red files for the same samples should always share the same sample name prefix. For example, 204529320035_R06C01_Red.idat and 204529320035_R06C01_Grn.idat correspond to the red and green signal of one sample. In SeSAMe, we will use the common prefix, i.e. 204529320035_R06C01, to refer to that sample. Sesame recognize both the raw IDAT as well as gzipped IDATs which are common in data stored in GEO. For example, in addition to the example above, SeSAMe also recognizes GSM2178224_Red.idat.gz and GSM2178224_Grn.idat.gz.

The function readIDATpair function reads in the signal intensity data from the IDAT pairs. The function takes the common prefix as input and outputs a SigDF object. Using the two examples above, one would run the following commands.

sdf = readIDAT("204529320035_R06C01")  # Example 1
sdf = readIDAT("GSM2178224")           # Example 2

Note that SeSAMe automatically detects and matches up the green and red signal files for the same sample. We will get back to the structure of SigDF below.

DNA methylation level (aka beta values) are defined as
β = M/(M + U)
with M representing the signal from methylated allele and U representing the unmethylated signal. It can be retrieved calling the getBetas function on the SigDF. The output is a named vector with probe ID as names. For example, the following commands read in one sample and convert it to β values.

sdf = readIDAT("GSM2178224")

print(sdf)

## SigDF - EPIC
##  - 142158 Infinium-I Probes
##  - 723760 Infinium-II Probes
##  - 635 Control Probes
##  - 105454 Number of Masked Probes
##  -    Row   Probe_ID  MG   MR   UG    UR col  mask
##  -      1 cg07881041  NA   NA 6543  1011   2 FALSE
##  -      2 cg18478105 332 1013 7094   910   G FALSE
##  - 865917 cg10633746  NA   NA  652 12346   2 FALSE
##  - 865918 cg12623625  NA   NA 7267  3217   2 FALSE

head(as.data.frame(sdf)) # peek into the internals

betas = getBetas(sdf)
head(betas)

## cg07881041 cg18478105 cg23229610 cg03513874 cg09835024 cg05451842 
## 0.86616362 0.04470778         NA 0.52479062 0.02888713 0.06215903

CRITICAL: getBetas takes a single SigDF object as input instead of a list of them. A common mistake is to c-merge multiple SigDFs. To combine multiple SigDFs, one can use list() instead. To have it process many SigDFs, we should combine that with looping functions lapply or mclapplys, or using the openSesame pipeline (see below).

Search IDAT Prefixes

Most often we will be working with a folder that contains many IDATs. Here is where the searchIDATprefixes function comes in handy. It lets us search all the IDATs in a folder and its subfolders recursively. Combine this with the R looping functions lets you process many IDATs without having to specify all IDAT names. searchIDATprefixes returns a named vector of prefixes with associated Red and Grn files, for readIDATpair:

sdfs = lapply(
    searchIDATprefixes(system.file("extdata/", package = "sesameData")),
    readIDATpair)
sdfs

## $`4207113116_A`
## SigDF - HM27
##  - 27578 Infinium-I Probes
##  - 0 Infinium-II Probes
##  - 144 Control Probes
##  - 2263 Number of Masked Probes
##  -   Row   Probe_ID   MG   MR    UG   UR col  mask
##  -     1 cg00000292 1072 2840   507  752   R FALSE
##  -     2 cg00002426  423  325   628 3867   R FALSE
##  - 27577 cg27662877  476  241  5874  507   G FALSE
##  - 27578 cg27665659  961  465 17831  661   G FALSE
## 
## $`4207113116_B`
## SigDF - HM27
##  - 27578 Infinium-I Probes
##  - 0 Infinium-II Probes
##  - 144 Control Probes
##  - 2263 Number of Masked Probes
##  -   Row   Probe_ID  MG   MR    UG   UR col  mask
##  -     1 cg00000292 513  570   377 1373   R FALSE
##  -     2 cg00002426 656 2349   507  293   R FALSE
##  - 27577 cg27662877 504  319  3759  119   G FALSE
##  - 27578 cg27665659 818  204 17821  429   G FALSE

A simple list of “SigSet”s are returned.

Custom-made Array

If you need to deal with a custom-made array instead of the standard array (MM285, EPIC, HM450 etc) supported natively by SeSAMe, you would need to provide a manifest that describe the probe information. You should be able to obtain that from the Illumina support website. The manifest should be made into a simple data frame with a minimum of four columns: Probe_ID, M, U and col. The easiest way to make the the manifest compatible with SeSAMe is by following internal manifests for a SeSAME-supported platform. They can be retrieved with the sesameDataGet function:

mft = sesameDataGet("MM285.address")$ordering

The col is either G (stand for Green) or R (stand for Red) or 2 ( stand for both in the case of Infinium II design). For Infinium-II probes, the M column and col column is left to be NA. For example, one can check that both M and col columns are filled with the Infinium-I probes:

head(mft[!is.na(mft$col),])

The last column mask is a logical vector that defines the default masking behavior of SeSAMe for the platform (see below for discussion of NA-masking).

With the manifest, your data can be processed using

sdf = readIDATpair("your_sample_name", manifest = mft)
betas = getBetas(sdf)

Sum Two-channels

β values for Infinium-I probes can also be obtained by summing up the two in-band channel and out-of-band channel. This rescues probes with SNP hitting the extension base and hence switching color channel. More details can be found in Zhou et al 2017.

library(sesame)
## show case using an example without mask, then add mask with qualityMask
sdf = qualityMask(sesameDataGet('EPIC.1.SigDF'))
betas = getBetas(sdf, sum.TypeI = TRUE)

For color-channel-switching probes, extra SNP allele frequencies can be obtained by summing up methylated and umethylated alleles:

extraSNPAFs = getAFTypeIbySumAlleles(sdf)

NA-Masking

If you call getBetas as is, you may notice that some of the beta values show up having NA values. This NA-masking is controlled internally using the mask column in SigDF. E.g.,

betas = getBetas(sdf)
head(betas)

## cg07881041 cg18478105 cg23229610 cg03513874 cg09835024 cg05451842 
## 0.86616362 0.04470778         NA 0.52479062 0.02888713 0.06215903

To check probes to be NA-masked in a SigSet, one can use the mask function

head(sdf$mask)

## [1] FALSE FALSE  TRUE FALSE FALSE FALSE

sum(sdf$mask)   # number of probes to be NA-masked

## [1] 105455

sum(is.na(betas)) # should be the same as above

## [1] 105455

Please be noted that mask in SigDF does not actually remove the probe reading but only specify how SeSAMe currently views the measurement. One can add more probes to the mask with the addMask function. Other functions such as the detection p-value calculation, also modifies mask. NA-masking influences other normalization and preprocessing functions. Therefore one should set it carefully. If one does not do any explicit masking, one simply gets the masking specified in the mask column of the manifest. This defines the default masking behavior of SeSAMe. For more details of masking, one can refer to Zhou et al 2017. One can override this masking by the resetMask function

sum(sdf$mask)  # before resetting

## [1] 105455

sum(resetMask(sdf)$mask)  # after resetting, expect 0

## [1] 0

The getBetas function can also ignore NA-masking while extracting beta values:

sum(is.na(getBetas(sdf, mask=FALSE)))  # expect 0

## [1] 0

We recommend two types of probe masking:

Experiment-dependent Probe Masking based on signal detection p-value (Zhou et al. 2018). Probes with p-value higher than a threshold (default: 0.05) are masked (see following for detection p-value calculation using the pOOBAH method).
Experiment-independent Probe Masking due to design issues. This is typically designated in the mask column of the manifest (see Zhou et al. 2017): This masking supports EPIC, MM285, HM450 and HM27 and is turned on by default and can also be explicitly added by the function qualityMask.

Preprocessing

Detection p-value

As mentioned above, experiment-dependent masking based on signal detection p-values is effective in excluding artifactual methylation level reading and probes with too much influence from signal background. We recommend the pOOBAH algorithm that was based on Infinium-I probe out-of-band signal for calibrating the distribution of the signal background:

sum(sdf$mask)  # before pOOBAH

## [1] 105455

sum(pOOBAH(sdf)$mask) # after pOOBAH, expect more probes

## [1] 137610

sum(pOOBAH(resetMask(sdf))$mask) # pOOBAH-only masking

## [1] 37548

Sometimes one would want to calculation detection p-value without creating masking like in the case of having to upload the p-value to GEO. In those cases one can use the return.pval option and add pvalue-based mask later.

pvals = pOOBAH(sdf, return.pval=TRUE)
sdf2 = addMask(sdf, pvals > 0.05) # default, recommend between 0.05 and 0.2

Background Subtraction

SeSAMe implements the background subtraction based on normal-exponential deconvolution using out-of-band probes noob (Triche et al. 2013) and optionally with extra bleed-through subtraction. Signal bleed-through happens when measurement from one channel affects the measurement in the other channel. SeSAMe’s noob further removes residual background by regressing out the green-to-red and red-to-green relationship using Type-I probes.

sdf.nb = noob(sdf)          # noob background subtraction
sdf.sb = scrub(sdf)         # more aggressive background subtraction

One can use following beta to total signal intensity to check the effect of background subtraction.

sesamePlotIntensVsBetas(sdf)

sesamePlotIntensVsBetas(sdf.nb) # with noob

sesamePlotIntensVsBetas(sdf.sb) # with scrub

Dye Bias Correction

Dye bias refers to the difference in signal intensity between the two color channel. SeSAMe offers two flavors of dye bias correction: linear scaling (dyeBiasCorr) and nonlinear scaling (dyeBiasCorrTypeINorm). Linear scaling equalize the mean of all probes from the two color channel.

library(sesame)
sdf.dbLinear = dyeBiasCorr(sdf)
sesamePlotRedGrnQQ(sdf.dbLinear)

Residual dye bias can be corrected using nonlinear quantile interpolation with Type-I probes.

sdf.dbNonlinear = dyeBiasCorrTypeINorm(sdf)

Under this correction, Infinium-I Red probes and Infinium-I Grn probes have the same distribution of signal.

sesamePlotRedGrnQQ(sdf.dbNonlinear)

Note that linear scaling does not shift beta values of Type-I probes while nonlinear scaling does shift beta values of Type-I probes.

Channel Inference

Sometimes Infinium-I channel spec is inaccurate in the manifest. We can infer the channel using data.

sdf.InfICorrected = inferInfiniumIChannel(sdf, verbose=TRUE)

## Infinium-I color channel reset:
## R>R: 92148
## G>G: 49039
## R>G: 64
## G>R: 907

The openSesame Pipeline

We have discussed noob, nonlinear dye bias correction and pOOBAH. Put together, this can be implemented as follows

idat_dir = system.file("extdata/", package = "sesameData")
betas = do.call(cbind, mclapply(searchIDATprefixes(idat_dir), function(pfx) {
    getBetas(dyeBiasNL(noob(pOOBAH(readIDATpair(pfx)))))
}), mc.cores=2)

Here we use two cores with mc.cores=2.

Equivalently, sesame provides the openSesame pipeline

idat_dir = system.file("extdata/", package = "sesameData")
betas = openSesame(idat_dir)

as a quickstart default. Here idat_dir is the directory containing all the IDAT files. Multi-core processing can be invoked using the BiocParallel:

betas = openSesame(idat_dir, BPPARAM=BiocParallel::MulticoreParam(2))

Like readIDATpair, openSesame also works with custom-made array with a manifest file (see above):

betas = openSesame(idat_dir, manifest = manifest)

The SigDF Class

SeSAMe design includes alight-weight full exposure of internal signal intensities (essential information for users of Illumina methylation array data, as demonstrated in Zhou et al 2018), which permits sensitive and specific joint inference on copy number and DNA methylation.

Central to the SeSAMe platform is the SigDF data structure, a data.frame subclass with the following column names

colnames(sdf)

## [1] "Probe_ID" "MG"       "MR"       "UG"       "UR"       "col"      "mask"

The col column specifies the color channel and takes G, R and 2. The Infinium-I probes carry G and R

head(sdf[sdf$col != "2",])      # Infinium-I probes

head(sdf[sdf$col == "2",])      # Infinium-II probes

The controls attributes may contain the control probe information. This is not a mandatory field for valid SigDF

## same as head(attr(sdf, "controls"))
head(controls(sdf))  # the last column: the type of the control.

For control probes, signal intensities are stored as an Nx2 numeric matrix, with N representing the number of probes in the class. The two columns of the matrix represent the methylated probe intensity and the unmethylated probe intensity.

IO with tsv/csv files

SigDF can be written as and read from plain text file with the compliant column names (i.e., Probe_ID, MG, MR, UG, UR, col, mask).

txt_file_path = sprintf("%s/sigdf.txt", tempdir())
sdf_write_table(sdf, file=txt_file_path) # save as space-delimited txt
sdf2 = sdf_read_table(txt_file_path)     # read back

tsv_file_path = sprintf("%s/sigdf.tsv", tempdir())
sdf_write_table(sdf, file=tsv_file_path, sep="\t", quote=FALSE) # save as tsv
sdf2 = sdf_read_table(tsv_file_path)                            # read back

csv_file_path = sprintf("%s/sigdf.csv", tempdir())
sdf_write_table(sdf, file=csv_file_path, sep=",") # save as csv
sdf2 = sdf_read_table(csv_file_path, sep=",")     # read back

Interaction with SigSet

Previously, the signal was implemented an S4 implementation in SigSet complies with Bioconductor guidelines, and for backwards compatibility, SigSet can be transformed to a SigDF using the SigSetToSigDF function sesame:::SigSetToSigDF(sset).

Interaction with Minfi

SigSet can be converted back and forth from Minfi RGChannelSet in multiple ways. One can sesamize a minfi RGChannelSet which returns a GenomicRatioSet. Here we are illustrating using the FlowSorted.Blood.450k object, which is distributed in the minfi::RGChannelSet.

library(sesame)
library(FlowSorted.Blood.450k)
options(rmarkdown.html_vignette.check_title = FALSE)

grSet <- sesamize(FlowSorted.Blood.450k[,1:4])

## Sesamizing WB_105...

## Sesamizing WB_218...

## Sesamizing WB_261...

## Sesamizing PBMC_105...

## Loading required package: IlluminaHumanMethylation450kanno.ilmn12.hg19

## Loading required package: IlluminaHumanMethylation450kmanifest

grSet

## class: GenomicRatioSet 
## dim: 485512 4 
## metadata(1): SNPs
## assays(2): Beta CN
## rownames(485512): cg13869341 cg14008030 ... cg08265308 cg14273923
## rowData names(0):
## colnames(4): WB_105 WB_218 WB_261 PBMC_105
## colData names(8): Sample_Name Slide ... CellType Sex
## Annotation
##   array: IlluminaHumanMethylation450k
##   annotation: ilmn12.hg19
## Preprocessing
##   Method: SeSAMe (type I)
##   minfi version: 1.40.0
##   Manifest version: 0.6.0

Session Info

sessionInfo()

## R version 4.1.2 (2021-11-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] IlluminaHumanMethylation450kmanifest_0.4.0        
##  [2] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.6.0
##  [3] FlowSorted.Blood.450k_1.32.0                      
##  [4] minfi_1.40.0                                      
##  [5] bumphunter_1.36.0                                 
##  [6] locfit_1.5-9.4                                    
##  [7] iterators_1.0.14                                  
##  [8] foreach_1.5.2                                     
##  [9] Biostrings_2.62.0                                 
## [10] XVector_0.34.0                                    
## [11] e1071_1.7-9                                       
## [12] tidyr_1.2.0                                       
## [13] dplyr_1.0.7                                       
## [14] knitr_1.37                                        
## [15] SummarizedExperiment_1.24.0                       
## [16] Biobase_2.54.0                                    
## [17] MatrixGenerics_1.6.0                              
## [18] matrixStats_0.61.0                                
## [19] scales_1.1.1                                      
## [20] DNAcopy_1.68.0                                    
## [21] GenomicRanges_1.46.1                              
## [22] GenomeInfoDb_1.30.1                               
## [23] IRanges_2.28.0                                    
## [24] S4Vectors_0.32.3                                  
## [25] wheatmap_0.1.0                                    
## [26] ggplot2_3.3.5                                     
## [27] sesame_1.12.9                                     
## [28] sesameData_1.12.0                                 
## [29] rmarkdown_2.11                                    
## [30] ExperimentHub_2.2.1                               
## [31] AnnotationHub_3.2.1                               
## [32] BiocFileCache_2.2.1                               
## [33] dbplyr_2.1.1                                      
## [34] BiocGenerics_0.40.0                               
## 
## loaded via a namespace (and not attached):
##   [1] fastmatch_1.1-3               plyr_1.8.6                   
##   [3] splines_4.1.2                 BiocParallel_1.28.3          
##   [5] digest_0.6.29                 htmltools_0.5.2              
##   [7] fansi_1.0.2                   magrittr_2.0.2               
##   [9] memoise_2.0.1                 tzdb_0.2.0                   
##  [11] limma_3.50.0                  readr_2.1.2                  
##  [13] annotate_1.72.0               siggenes_1.68.0              
##  [15] askpass_1.1                   prettyunits_1.1.1            
##  [17] colorspace_2.0-2              blob_1.2.2                   
##  [19] rappdirs_0.3.3                ggrepel_0.9.1                
##  [21] xfun_0.29                     crayon_1.4.2                 
##  [23] RCurl_1.98-1.5                jsonlite_1.7.3               
##  [25] genefilter_1.76.0             GEOquery_2.62.2              
##  [27] survival_3.2-13               glue_1.6.1                   
##  [29] gtable_0.3.0                  zlibbioc_1.40.0              
##  [31] DelayedArray_0.20.0           Rhdf5lib_1.16.0              
##  [33] HDF5Array_1.22.1              DBI_1.1.2                    
##  [35] rngtools_1.5.2                Rcpp_1.0.8                   
##  [37] xtable_1.8-4                  progress_1.2.2               
##  [39] mclust_5.4.9                  bit_4.0.4                    
##  [41] proxy_0.4-26                  preprocessCore_1.56.0        
##  [43] httr_1.4.2                    fgsea_1.20.0                 
##  [45] RColorBrewer_1.1-2            ellipsis_0.3.2               
##  [47] reshape_0.8.8                 pkgconfig_2.0.3              
##  [49] XML_3.99-0.8                  farver_2.1.0                 
##  [51] sass_0.4.0                    utf8_1.2.2                   
##  [53] tidyselect_1.1.1              labeling_0.4.2               
##  [55] rlang_1.0.1                   reshape2_1.4.4               
##  [57] later_1.3.0                   AnnotationDbi_1.56.2         
##  [59] munsell_0.5.0                 BiocVersion_3.14.0           
##  [61] tools_4.1.2                   cachem_1.0.6                 
##  [63] cli_3.1.1                     generics_0.1.2               
##  [65] RSQLite_2.2.9                 evaluate_0.14                
##  [67] stringr_1.4.0                 fastmap_1.1.0                
##  [69] yaml_2.2.2                    bit64_4.0.5                  
##  [71] beanplot_1.2                  scrime_1.3.5                 
##  [73] purrr_0.3.4                   randomForest_4.7-1           
##  [75] KEGGREST_1.34.0               sparseMatrixStats_1.6.0      
##  [77] nlme_3.1-155                  doRNG_1.8.2                  
##  [79] mime_0.12                     nor1mix_1.3-0                
##  [81] xml2_1.3.3                    biomaRt_2.50.3               
##  [83] BiocStyle_2.22.0              compiler_4.1.2               
##  [85] filelock_1.0.2                curl_4.3.2                   
##  [87] png_0.1-7                     interactiveDisplayBase_1.32.0
##  [89] tibble_3.1.6                  bslib_0.3.1                  
##  [91] stringi_1.7.6                 highr_0.9                    
##  [93] GenomicFeatures_1.46.4        lattice_0.20-45              
##  [95] Matrix_1.4-0                  multtest_2.50.0              
##  [97] vctrs_0.3.8                   rhdf5filters_1.6.0           
##  [99] pillar_1.7.0                  lifecycle_1.0.1              
## [101] BiocManager_1.30.16           jquerylib_0.1.4              
## [103] data.table_1.14.2             bitops_1.0-7                 
## [105] rtracklayer_1.54.0            httpuv_1.6.5                 
## [107] BiocIO_1.4.0                  R6_2.5.1                     
## [109] promises_1.2.0.1              KernSmooth_2.23-20           
## [111] gridExtra_2.3                 codetools_0.2-18             
## [113] MASS_7.3-55                   assertthat_0.2.1             
## [115] rhdf5_2.38.0                  rjson_0.2.21                 
## [117] openssl_1.4.6                 withr_2.4.3                  
## [119] GenomicAlignments_1.30.0      Rsamtools_2.10.0             
## [121] GenomeInfoDbData_1.2.7        mgcv_1.8-38                  
## [123] hms_1.1.1                     quadprog_1.5-8               
## [125] grid_4.1.2                    base64_2.0                   
## [127] class_7.3-20                  DelayedMatrixStats_1.16.0    
## [129] illuminaio_0.36.0             shiny_1.7.1                  
## [131] restfulr_0.0.13

SeSAMe Basic Usage

8 February 2022