---
title: "Other SeSAMe Features"
date: "`r BiocStyle::doc_date()`"
package: sesame
output: BiocStyle::html_document
fig_width: 8
fig_height: 6
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{5. Other Features}
  %\VignetteEncoding{UTF-8}
---

# Genomic Privacy

## Purpose 

Probe masking is important to prevent privacy data leakage. The goal of
data sanitization is to modifiy IDAT files in place, so they can be released
to public domain without privacy leak. This will be achieved by
deIdentification. The following function requires the R package seSAMe. 

```{r message=FALSE, warning=FALSE, include=FALSE}
library(sesame)
```

Let's take DNA methylation data from the HM450 platform for example.
```{r, eval=TRUE}
dest_dir = tempdir()
res_grn = sesameDataDownload("3999492009_R01C01_Grn.idat", dest_dir=dest_dir)
res_red = sesameDataDownload("3999492009_R01C01_Red.idat", dest_dir=dest_dir)
```
                                                   
## De-identify by Masking

This first method of deIdentification masks SNP probe intensity mean by zero.
As a consequence, the allele frequency will be 0.5. 

```{r, eval=TRUE}

deIdentify(res_grn$dest_file, sprintf("%s/deidentified_Grn.idat", dest_dir))
deIdentify(res_red$dest_file, sprintf("%s/deidentified_Red.idat", dest_dir))

betas1 = getBetas(readIDATpair(sprintf("%s/3999492009_R01C01", dest_dir)))
betas2 = getBetas(readIDATpair(sprintf("%s/deidentified", dest_dir)))

head(betas1[grep('rs',names(betas1))]) 
head(betas2[grep('rs',names(betas2))])
```

Note that before deIdentify, the rs values will all be different. After
deIdentify, the rs values will all be masked at an intensity of 0.5. 

## De-identify by Scrambling

This second method of deIdentification will scramble the intensities using
a secret key to help formalize a random number. Therefore, randomize needs
to be set to TRUE. 

```{r, eval=TRUE}

my_secret <- 13412084
set.seed(my_secret)

deIdentify(res_grn$dest_file,
    sprintf("%s/deidentified_Grn.idat", dest_dir), randomize=TRUE)

my_secret <- 13412084
set.seed(my_secret)
deIdentify(res_red$dest_file,
    sprintf("%s/deidentified_Red.idat", dest_dir), randomize=TRUE)

betas1 = getBetas(readIDATpair(sprintf("%s/3999492009_R01C01", dest_dir)))
betas2 = getBetas(readIDATpair(sprintf("%s/deidentified", dest_dir)))

head(betas1[grep('rs',names(betas1))]) 
head(betas2[grep('rs',names(betas2))]) 

```
Note that the rs values are scrambled after deIdentify.  

## Re-identify

To restore order of the deIdentified intensities, one can re-identify IDATs.
The reIdentify function can thus restore the scrambled SNP intensities. 

```{r, eval=TRUE}

my_secret <- 13412084
set.seed(my_secret)

reIdentify(sprintf("%s/deidentified_Grn.idat", dest_dir),
    sprintf("%s/reidentified_Grn.idat", dest_dir))

my_secret <- 13412084
set.seed(my_secret)
reIdentify(sprintf("%s/deidentified_Red.idat", dest_dir),
    sprintf("%s/reidentified_Red.idat", dest_dir))

betas1 = getBetas(readIDATpair(sprintf("%s/3999492009_R01C01", dest_dir)))
betas2 = getBetas(readIDATpair(sprintf("%s/reidentified", dest_dir)))

head(betas1[grep('rs',names(betas1))]) 
head(betas2[grep('rs',names(betas2))]) 

```
Note that reIdentify restored the values. Subsequently, they are the same as
betas1. 

# Extract Genotypes

SeSAMe can output explicit and Infinium-I-derived SNP to VCF.
This information can be used to identify sample swaps.

```{r}
sset <- sesameDataGet('EPIC.1.LNCaP')$sset
annoS = sesameDataGetAnno("EPIC/EPIC.hg19.snp_overlap_b151.rds")
annoI = sesameDataGetAnno("EPIC/EPIC.hg19.typeI_overlap_b151.rds")
head(formatVCF(sset, annoS=annoS, annoI=annoI)) # output to console
```

One can output to actual VCF file with a header by `formatVCF(sset,
vcf=path_to_vcf)`.

# The FileSet

## Preprocessing IDATs to FileSets

When a large number of samples are being analyzed, it is desirable to have
random access to specific CpG methylation without loading all the data.
SeSAMe provides such interface through the `fileSet` object which is 
in essence an indexed file-based numeric matrix.

The one function to generate a `fileSet` is through the `openSesameToFile`
function. In this case, there is no concrete output from the function. The 
consequence is the generation of a file at the given path. One can operate
on the `fileSet` by referencing the path to the file.

```{r message = FALSE}
library(sesame)
options(rmarkdown.html_vignette.check_title = FALSE)
```

The following `openSesameToFile` call does three things
- generates a file called `mybetas`. 
- generates an index file called `mybetas_idx.rds`
- returns a `fileSet` object which serves as an interface to the two files.
```{r}
fset <- openSesameToFile('mybetas',
    system.file('extdata',package='sesameData'))
```

## Introduction to fileSet

When printed to console, the number of samples and the number of probes are 
shown.
```{r}
fset
```

One can obtain the samples and probes information with the `$` operator.
```{r}
head(fset$samples) # sample IDs
head(fset$probes) # probe IDs
```

## Query fileSet
One can query the specific CpG by probe name(s) and sample name(s). 
Note that every query to fset is a disk read. Therefore it can be slower than
in-memory processing. Here we only retrieve the beta values for the two probes
_cg00006414_ and _cg00007981_ in the sample *4207113116_B*.
```{r}
sliceFileSet(fset, '4207113116_B', c('cg00006414','cg00007981'))
```

## Read Existing fileSet

In the previous example, we preprocessed IDATs directly to `fileSet`. We can
also read a pre-existing `fileSet` using the file path using `readFileSet`
function.
```{r}
fset <- readFileSet('mybetas')
sliceFileSet(fset, '4207113116_A', 'cg00000292')
```

## Write fileSet by Allocation and Filling

`fileSet` size is always fixed. One cannot dynamically expand or shrink a
fileSet. We can write a fileSet by filling the space one sample by one sample.
This is achieved by first allocating the space given the number of samples
and the probe IDs (optional if platform is one if HM27, HM450 or EPIC).
```{r}
fset2 <- initFileSet('mybetas2', 'HM450', c('sample1', 'sample2'))
```
Then one can fill in the beta values by `mapFileSet`. Here I am 
illustrating using a randomly generated beta values.
```{r}
hypothetical_betas <- setNames(runif(fset2$n), fset2$probes)
mapFileSet(fset2, 'sample2', hypothetical_betas)
```

The mapped value should be equal to the generated beta value. Let's 
spot-check.
```{r}
abs(sliceFileSet(fset2,'sample2','cg00000108') -
        hypothetical_betas['cg00000108']) < 1e-7
```