---
title: "Analyzing Data from Other Methylation Platforms or Data Types"
date: "`r Sys.Date()`"
output:
    BiocStyle::html_document:
        toc: true # table of content true
vignette: >
    %\VignetteIndexEntry{5. Analyzing data from other sources}
    %\VignetteEngine{knitr::rmarkdown}
editor_options:
    chunk_output_type: console
---


```{r loadPackages, echo=FALSE, warning=FALSE, message=FALSE}
library(knitr)
library(pander)
suppressPackageStartupMessages(library(ramwas))
panderOptions("digits", 3)
opts_chunk$set(fig.width = 6, fig.height = 6)
# opts_chunk$set(eval=FALSE)
# dr = "D:/temp/"
```

# Using RaMWAS for with other methylation platforms or data types

[RaMWAS](https://bioconductor.org/packages/ramwas/)
is primarily designed for studies of methylation measurements
from enrichment platforms.

However, RaMWAS can also be useful for the analysis of 
methylation measurements from other  platforms 
(e.g. Illumina HumanMethylation450K array) or
other data types such as gene expression levels or genotype information.
RaMWAS can perform several analysis steps on such data including: 
principal component analysis (PCA), 
association testing (MWAS, TWAS, GWAS), 
and multimarker analysis with cross validation using the elastic net.

## Import data from other sources

Without external data source at hand,
we show how to create and fill data matrices
with artificial data.
Importing real data can be done in a similar way,
with random data generation replaced with reading data from
existing sources.

We create data files in the same format as produced by
[Step 3](RW1_intro.html#step3) of RaMWAS.

These files include

*   `CpG_locations.*` -- filematrix with the location of 
    the CpGs / SNPs / gene trascription start sites.\
    It must have two columns with integer values -- 
    chromosome number and location
    (`chr` and `position`).
*   `CpG_chromosome_names.txt` -- file with chromosome names (factor levels)
    for the integer column `chr` in the location filematrix.
*   `Coverage.*` -- filematrix with the data for all samples and all locations.\
    Each row has data for a single sample.
    Row names must be sample names.\
    Each column has data for a single location
    (CpG / SNP / gene trascription start site).
    Columns must match rows of the location filematrix.\

First, we load the package and set up a working directory.
The project directory `dr` can be set to
a more convenient location when running the code.
```{r generateData}
library(ramwas)

# work in a temporary directory
dr = paste0(tempdir(), "/simulated_matrix_data")
dir.create(dr, showWarnings = FALSE)
cat(dr,"\n")
```


Let the sample data matrix have 200 samples and 100,000 variables.
```{r dims, eval=TRUE}
nsamples = 200
nvariables = 100000
```

For these `r nsamples` samples we generate a data frame with
age and sex phenotypes and a batch effect covariate.
```{r setseed1, echo=FALSE}
set.seed(18090212)
```
```{r genCovar}
covariates = data.frame(
    sample = paste0("Sample_",seq_len(nsamples)),
    sex = seq_len(nsamples) %% 2,
    age = runif(nsamples, min = 20, max = 80),
    batch = paste0("batch",(seq_len(nsamples) %% 3))
)
pander(head(covariates))
```

Next, we create the genomic locations for 100,000 variables.
```{r setseed2, echo=FALSE}
set.seed(18090212)
```
```{r genLocs}
temp = cumsum(sample(20e7 / nvariables, nvariables, replace = TRUE) + 0)
chr      = as.integer(temp %/% 1e7) + 1L
position = as.integer(temp %% 1e7)

locmat = cbind(chr = chr, position = position)
chrnames = paste0("chr", 1:10)
pander(head(locmat))
```


Now we save locations in a filematrix
and create a text file with chromosome names.\
```{r locSave}
fmloc = fm.create.from.matrix(
            filenamebase = paste0(dr,"/CpG_locations"),
            mat = locmat)
close(fmloc)
writeLines(
        con = paste0(dr,"/CpG_chromosome_names.txt"),
        text = chrnames)
```

Finally, we create data matrix.
We include sex effect in 225 variables and 
age effect in 16 variables out of each 2000.
Each variable is also affected by noise and batch effects.

```{r setseed3, echo=FALSE}
set.seed(18090212)
```
```{r fillDataMat}
fm = fm.create(paste0(dr,"/Coverage"), nrow = nsamples, ncol = nvariables)

# Row names of the matrix are set to sample names
rownames(fm) = as.character(covariates$sample)

# The matrix is filled, 2000 variables at a time
byrows = 2000
for( i in seq_len(nvariables/byrows) ){ # i=1
    slice = matrix(runif(nsamples*byrows), nrow = nsamples, ncol = byrows)
    slice[,  1:225] = slice[,  1:225] + covariates$sex / 30 / sd(covariates$sex)
    slice[,101:116] = slice[,101:116] + covariates$age / 10 / sd(covariates$age)
    slice = slice + ((as.integer(factor(covariates$batch))+i) %% 3) / 40
    fm[,(1:byrows) + byrows*(i-1)] = slice
}
close(fm)
```


## Principal Component Analysis (PCA)

To run PCA with RaMWAS we specify three parameters:

*   `dircoveragenorm` -- directory with the data matrix
*   `covariates` -- data frame with covariates
*   `modelcovariates` -- names of covariates to regress out

```{r param1}
param = ramwasParameters(
    dircoveragenorm = dr,
    covariates = covariates,
    modelcovariates = NULL
)
```

```{r threads, echo=FALSE}
# Bioconductor requires limit of 2 parallel jobs
param$cputhreads = 2
```

Now we run PCA.

```{r pcaNULL, warning=FALSE, message=FALSE}
ramwas4PCA(param)
```

The top several PCs are marginally distinct from the rest.
```{r plotPCA, warning=FALSE, message=FALSE}
pfull = parameterPreprocess(param)
eigenvalues = fm.load(paste0(pfull$dirpca, "/eigenvalues"));
eigenvectors = fm.open(
                filenamebase = paste0(pfull$dirpca, "/eigenvectors"),
                readonly = TRUE);
plotPCvalues(eigenvalues)
plotPCvectors(eigenvectors[,1], 1)
plotPCvectors(eigenvectors[,2], 2)
plotPCvectors(eigenvectors[,3], 3)
plotPCvectors(eigenvectors[,4], 4)
close(eigenvectors)
```

There are strong correlations between top PCs with
sex, age, and batch covariates.\
Note, for the categorical covariate (batch)
the table shows R^2^ instead of correlations.

```{r topCorNULL}
# Get the directory with PCA results
pfull = parameterPreprocess(param)
tblcr = read.table(
            file = paste0(pfull$dirpca, "/PC_vs_covs_corr.txt"),
            header = TRUE,
            sep = "\t")
pander(head(tblcr, 5))
```

The p-values for these correlations and R^2^
show that the top two PCs are correlated with
sex and age while a number of other PCs are affected by sample batch effects.
```{r topPvNULL}
pfull = parameterPreprocess(param)
tblpv = read.table(
            file = paste0(pfull$dirpca, "/PC_vs_covs_pvalue.txt"),
            header = TRUE,
            sep = "\t")
pander(head(tblpv, 5))
```


## PCA with batch regressed out

It is common to regress out batch and lab-technical effects
from the data in the analysis.

Let's regress out batch in our example
by changing `modelcovariates` parameter.
```{r pcaBatch, warning=FALSE, message=FALSE}
param$modelcovariates = "batch"

ramwas4PCA(param)
```

The p-values for association between PCs and covariates changed slightly:
```{r topPvBatch}
# Get the directory with PCA results
pfull = parameterPreprocess(param)
tblpv = read.table(
            file = paste0(pfull$dirpca, "/PC_vs_covs_pvalue.txt"),
            header = TRUE,
            sep = "\t")
pander(head(tblpv, 5))
```
Note that the PCs are now orthogonal to the batch effects and thus
the corresponding p-values all equal to 1.

## Association testing

Let us test for association between
variables in the data matrix and the sex covariate
(`modeloutcome` parameter)
correcting for batch effects (`modelcovariates` parameter).
Save top 20 results (`toppvthreshold` parameter) in a text file.

```{r paramGWAS, warning=FALSE, message=FALSE}
param$modelcovariates = "batch"
param$modeloutcome = "sex"
param$toppvthreshold = 20

ramwas5MWAS(param)
```

The QQ-plot shows mild enrichment among a large number of variables,
which is consistent with how the data was generated --
22\% of variables are affected by sex.
```{r tableMWAS, warning=FALSE, message=FALSE}
mwas = getMWAS(param)
qqPlotFast(mwas$`p-value`)
title(pfull$qqplottitle)
```

The top finding saved in the text file are:
```{r topPvMWAS}
# Get the directory with testing results
pfull = parameterPreprocess(param)
toptbl = read.table(
            file = paste0(pfull$dirmwas,"/Top_tests.txt"),
            header = TRUE,
            sep = "\t")
pander(head(toptbl, 5))
```


## Further steps of RaMWAS pipeline

Steps 6 and 7 of RaMWAS pipeline can also be applied
to the data matrix exactly as described in the
[overview vignette](RW1_intro.html#annotation-of-top-results).

## Cleanup

Here we remove all the files created by the code above.
```{r clean}
unlink(paste0(dr,"/*"), recursive=TRUE)
```

# Version information
```{r version}
sessionInfo()
```