%\VignetteIndexEntry{An Introduction to the REMP Package}
%\VignetteKeywords{DNA Methylation, repetitive element, prediction}
%\VignettePackage{REMP}
\documentclass[10pt]{article}

\usepackage{listings}
\usepackage{times}
\usepackage{hyperref}
\usepackage{biblatex}

\textwidth=7in
\textheight=9in
%\parskip=.3cm
\oddsidemargin=-.1in
\evensidemargin=-.1in
\headheight=-.3in

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpkg}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\Rcode}[1]{{\texttt{#1}}}

<<echo=FALSE>>=
library(knitr)
options(width=60)

listing <- function(x, options) {
  paste("\\begin{lstlisting}[basicstyle=\\ttfamily,breaklines=true]\n",
    x, "\\end{lstlisting}\n", sep = "")
}
knit_hooks$set(source=listing, output=listing)
@

\title{An Introduction to the \Rpkg{REMP} Package}
\author{Yinan Zheng}
\date{\today}

\begin{document}

\maketitle

\tableofcontents


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Introduction}

\Rpkg{REMP} predicts DNA methylation of locus-specific repetitive elements (RE) by learning surrounding genetic and epigenetic information. \Rpkg{REMP} provides genomewide single-base resolution of DNA methylation on RE that are difficult to measure using array-based or sequencing-based platforms, which enables epigenome-wide association study (EWAS) and differentially methylated region (DMR) analysis on RE. 

\section{Installation}
Install \Rpkg{REMP} (release version):

<<bioconductorREMPrelease, eval=FALSE>>=
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("REMP")
@

To install devel version:

<<bioconductorREMPdev, eval=FALSE>>=
library(devtools)
install_github("YinanZheng/REMP")
@

Load \Rpkg{REMP} into the workspace:

<<loadREMP, message=FALSE, eval=TRUE>>=
library(REMP)
@

\section{REMP: Repetitive Element Methylation Prediction}
Currently \Rpkg{REMP} supports Human (hg19, build 37) Alu and LINE-1 (L1) repetitive element (RE) methylation prediction using Illumina 450k or EPIC array.

\subsection{Groom methylation data}
Appropriate data preprocessing including quality control and normalization of methylation data are recommended before running \Rpkg{REMP}. Many packages are available to carry out these data preprocessing steps, for example, \Rpkg{minfi}, \Rpkg{wateRmelon}, and \Rpkg{methylumi}.

\Rpkg{REMP} is trying to minimize the requirement of the methylation data format. User can maintain the methylation data in \Rclass{RatioSet} or \Rclass{GenomicRatioSet} object offered by \Rpkg{minfi}, \Rclass{data.table}, \Rclass{data.frame}, \Rclass{DataFrame}, or \Rclass{matrix}. User can input either beta value or M-value. There are only two basic requirements of the methylation data:
\begin{enumerate}
  \item Each row should represent CpG probe and each column should represent sample.
  \item The row names should indicate Illumina probe ID (i.e. cg00000029).
\end{enumerate}

However, there are some other common data issues that may prevent \Rpkg{REMP} from running correctly. For example, if the methylation data are in beta value and contain zero methylation values, logit transformation (to create M-value) will create negative infinite value; or the methylation data contain \Rcode{NA}, \Rcode{Inf}, or \Rcode{NaN} data. To tackle these potential issues, \Rpkg{REMP} includes a handy function \Rfunction{grooMethy} which can help detect and fix these issues. We highly recommend to take advantage of this function:

<<grooMethy, eval=TRUE>>=
# Get GM12878 methylation data (450k array)
GM12878_450k <- getGM12878('450k') 
GM12878_450k <- grooMethy(GM12878_450k, verbose = TRUE)
GM12878_450k
@

For zero beta values, \Rfunction{grooMethy} will replace them with smallest non-zero beta value. For \Rcode{NA}/\Rcode{NaN}/\Rcode{Inf} values, \Rfunction{grooMethy} will treat them as missing values and then apply KNN-imputation to complete the dataset. If the imputed value is out of the original range (which is possible when \Rcode{imputebyrow = FALSE}), mean value will be used instead. Warning: imputed values for multimodal distributed CpGs (across samples) may not be correct. Please check package \Rpkg{ENmix} to identify the CpGs with multimodal distribution. 

\subsection{Prepare annotation data}
To run \Rpkg{REMP} for RE methylation prediction, user first needs to prepare some annotation datasets. The function \Rfunction{initREMP} is designed to do the job.

Suppose user will predict Alu methylation using Illumina 450k array data:

<<remparcel, eval=TRUE>>=
data(Alu.demo)
remparcel <- initREMP(arrayType = "450k", REtype = "Alu", 
                      RE = Alu.demo, ncore = 1)
remparcel
@

For demonstration, we only use 500 selected Alu sequence dataset which comes along with the package (\Rcode{Alu.demo}). We specify \Rfunarg{RE = Alu.demo}, so that the annotation dataset will be generated for the 500 selected Alu sequences. Most of the time, specifying \Rfunarg{RE} is not necessary, as the function will fetch the complete RE sequence dataset from package \Rpkg{AnnotationHub} using \Rfunction{fetchRMSK}. User can also use this argument \Rfunarg{RE} to provide customized RE dataset.

All data are stored in the \Rclass{REMParcel} object:

<<saveParcel, eval=TRUE>>=
saveParcel(remparcel)
@

It is recommended to specify a working directory using argument \Rfunarg{work.dir} so that the data generated can be preserved for later use. Without specifying working directory , the annotation dataset will be created under the temporal directory \Rcode{tempdir()} by default. User can also turn on the \Rfunarg{export} parameter in \Rfunction{initREMP} to save the data automatically.

\subsection{Run prediction}

Once the annotation data are ready, user can pass the annotation data parcel to \Rfunction{remp} for prediction:

<<rempredict, eval=TRUE>>=
remp.res <- remp(GM12878_450k, REtype = 'Alu', 
                 parcel = remparcel, ncore = 1, seed = 777)
@

If \Rfunarg{parcel} is missing, \Rfunction{remp} will then try to search the REMParcel data file in the directory indicated by \Rfunarg{work.dir}. If \Rfunarg{work.dir} is also missing, \Rfunction{remp} will try to search the REMParcel data file in the temporal directory \Rcode{tempdir()}.

By default, \Rfunction{remp} uses Random Forest (\Rfunarg{method = 'rf'}) model (package \Rpkg{randomForest}) for prediction. Random Forest model is recommended because it offers more accurate prediction results and it automatically enables Quantile Regression Forest (Nicolai Meinshausen, 2006) for prediction reliability evaluation. \Rfunction{remp} constructs 19 predictors to carry out the prediction. For Random Forest model, the tuning parameter \Rfunarg{param = 6} (i.e. \Rfunarg{mtry} in \Rpkg{randomForest}) indicates how many predictors will be randomly selected for building the individual trees. The performance of random forest model is often relatively insensitive to the choice of \Rfunarg{mtry}. Therefore, auto-tune will be turned off using random forest and \Rfunarg{mtry} will be set to one third of the total number of predictors. It is recommended to specify a seed for reproducible prediction results.

\Rfunction{remp} will return a \Rclass{REMPset} object, which inherits Bioconductor's \Rclass{RangedSummarizedExperiment} class:

<<rempprint, eval=TRUE>>=
remp.res

# Display more detailed information
details(remp.res)
@

Prediction results can be obtained by accessors: 

<<rempaccessors, eval=TRUE>>=
# Predicted RE-CpG methylation value (Beta value)
rempB(remp.res)

# Predicted RE-CpG methylation value (M value)
rempM(remp.res)

# Genomic location information of the predicted RE-CpG
# Function inherit from class 'RangedSummarizedExperiment'
rowRanges(remp.res)

# Standard error-scaled permutation importance of predictors
rempImp(remp.res)

# Retrive seed number used for the reesults
metadata(remp.res)$Seed
@

Trim off less reliable predicted results:

<<remptrim, eval=TRUE>>=
# Any predicted CpG values with quality score less than
# threshold (default = 1.7) will be replaced with NA. 
# CpGs contain more than missingRate * 100% (default = 20%) 
# missing rate across samples will be discarded. 
remp.res <- rempTrim(remp.res, threshold = 1.7, missingRate = 0.2)
details(remp.res)
@

Aggregate the predicted methylation of CpGs in RE by averaging them to obtain the RE-specific methylation level:

<<rempaggregate, eval=TRUE>>=
remp.res <- rempAggregate(remp.res, NCpG = 2)
details(remp.res)
@

It is recommended to aggregate the predicted CpG methylation by RE to obtain RE-wise methylation level. This is beneficial because 1) it greatly reduces the data dimension for downstream analysis and 2) it produces more robust RE methylation estimation. Note that by default, RE with 2 or more predicted CpG sites will be aggregated. Therefore, the downside of doing this is the reduced coverage of RE. The assumption of doing this is the CpG methylation level within each RE are similar.

To add genomic regions annotation of the predicted REs:

<<rempdecodeAnnot, eval=TRUE>>=
# By default gene symbol annotation will be added 
remp.res <- decodeAnnot(remp.res)
rempAnnot(remp.res)
@

Seven genomic region indicators will be added to the annotation data in the input \Rclass{REMProduct} object:

\begin{itemize}
\item InNM: in protein-coding genes (overlap with refSeq gene's "NM" transcripts + 2000 bp upstream of the transcription start site (TSS))
\item InNR: in noncoding RNA genes (overlap with refSeq gene's "NR" transcripts + 2000 bp upstream of the TSS)
\item InTSS: in flanking region of 2000 bp upstream of the TSS. Default upstream limit is 2000 bp, which can be modified globally using \Rfunction{remp\_options}
\item In5UTR: in 5'untranslated regions (UTRs)
\item InCDS: in coding DNA sequence regions
\item InExon: in exon regions
\item In3UTR: in 3'UTRs
\end{itemize}

Note that intron region and intergenic region information can be derived from the above genomic region indicators: if "InNM" and/or "InNR" is not missing but "InTSS", "In5UTR", "InExon", and "In3UTR" are missing, then the RE is strictly located within intron region; if all indicators are missing, then the RE is strictly located in intergenic region.

\subsection{Plot prediction}

Make a density plot of the predicted methylation (beta values):

\begin{figure}[h!]
\centering
<<rempplot, eval=TRUE, fig.width=4.5, fig.height=4>>=
plot(remp.res, main = "Alu methylation (GM12878)", col = "blue")
@
\end{figure}

\pagebreak

\section{Extract RE-CpG methylation profiled by Illumina BeadChip array}

\Rpkg{REMP} offers a handy tool to extract methylation data of CpGs that are located in RE.

<<remprofile, eval=TRUE>>=
# Use Alu.demo for demonstration
remp.res <- remprofile(GM12878_450k, REtype = "Alu", RE = Alu.demo)
details(remp.res)

# All accessors and utilites for REMProduct are applicable
remp.res <- rempAggregate(remp.res)
details(remp.res)
@

\section{Session Information}

<<SessionInfo, echo=FALSE>>=
sessionInfo()
@

\end{document}