%\VignetteIndexEntry{epigenomix package vignette}
%\VignetteDepends{}
%\VignetteKeywords{}
%\VignettePackage{epigenomix}

\documentclass[10pt]{article}

\usepackage{times}
\usepackage{hyperref}
\usepackage{verbatim} %TEMP: only for block comments
\usepackage{float}

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\Rcode}[1]{{\texttt{#1}}}

\SweaveOpts{prefix.string=epigenomix}


\begin{document}
\SweaveOpts{concordance=TRUE}

\title{epigenomix --- Epigenetic and gene transcription data normalization and integration with mixture models}
\author{Hans-Ulrich Klein, Martin Sch\"{a}fer}
\date{April 9, 2017}

\maketitle

\tableofcontents

<<echo=FALSE>>=
options(width=60)
options(continue=" ")
@

\newpage

\section{Introduction}
This package provides methods for an integrative analysis of gene transcription and epigenetic data, especially histone ChIP-seq data \cite{Klein2014}. Histone modifications are an epigenetic key mechanism to activate or repress the transcription of genes. Several data sets consisting of matched transcription data and histone modification data localized by ChIP-seq have been published. However, both data types are often analysed separately and results are compared afterwards. The methods implemented here are designed to detect transcripts that are differentially transcribed between two conditions due to an altered histone modification and are suitable for very small sample sizes. Transcription data may be obtained by microarrays or RNA-seq.\\
Briefly, the following workflow is described in this document:
\begin{enumerate}
  \item Matching of both data types by assigning the number of ChIP-seq reads aligning within the promoter region to the respective transcription value
  \item Normalization of ChIP-seq values
  \item Calculation of a correlation score for each gene by multiplying the standardized difference of ChIP-seq values by the standardized difference of transcription values
  \item Fitting a (Bayesian) mixture model to this score: The implicit assignment of transcripts to mixture components is used to classify transcripts into one of the following groups: (i) Transcripts with equally directed differences in both data sets, (ii) transcripts with reversely directed differences in both data sets and (iii) transcripts with no differences in at least one of the two data sets. Group (iii) is represented by centred normal components whereas an exponential component is used for group (i) and a mirrored exponential component for group (ii).
\end{enumerate}
In addition to this vignette, a manuscript published in \emph{Current Protocols in Human Genetics} provides detailed documentation of a typical workflow for integrating RNA-seq and ChIP-seq data with epigenomix \cite{Klein2016}. The manuscript covers data preprocessing steps before reading data into R, mapping strategies for different histone marks, and assessment and troubleshooting of the Bayesian mixture model.
\pagebreak

\section{Data preprocessing and normalization}
\subsection{Microarray gene expression data}
First, we load an example microarray gene expression data set. The data set consists of four samples. Two wild type replicates and two \textit{CEBPA} knock-out replicates. The differences between \textit{CEBPA} knock-down and wild type samples are of interest. The data set is stored as an \Rclass{ExpressionSet} object and was reduced to a few probesets on chromosome 1.
<<label=GEdata>>=
library(epigenomix)
data(eSet)
pData(eSet)
@
Data was measured using Affymetrix Mouse Gene 1.0 ST arrays and RMA normalization was applied. See packages \Rpackage{affy} and \Rpackage{Biobase} how to process affymetrix gene expression data.

\subsection{RNA-seq data}
Using RNA-seq instead of microarrays has the advantage that the abundance of individual transcript can be estimated. For this task, software like Cufflinks \cite{Trapnell2010} can be employed. Moreover, the Cuffdiff method (part of the Cufflinks software package) allows to summarize the estimated transcript abundances over all transcripts that share the same transcriptional start site (TSS) and offers several normalization methods, e.g. scaling based on the observed quartiles \cite{Trapnell2013}. Grouping all transcripts sharing the same TSS is favourable for the later matching task. Importing the Cuffdiff output as data frame gives us the FPKM (fragments per kilobase of transcript per million fragments mapped) values.
<<label=RNAseqImport>>=
data(fpkm)
head(fpkm[c(-2,-8), ])
@
The last six columns were not included in the Cuffdiff output, but were extracted from the annotation file given as input to Cuffdiff. Next, we construct an \Rclass{ExpressionSet} object so that we can handle RNA-seq data in the same way as microarray data:
<<label=RNAseqToESet>>=
mat <- log2(as.matrix(fpkm[, c("CEBPA_WT", "CEBPA_KO")]))
rownames(mat) <- fpkm$tss_id
eSet.seq <- ExpressionSet(mat)
pData(eSet.seq)$CEBPA <- factor(c("wt", "ko"))
fData(eSet.seq)$chr <- fpkm$chr
fData(eSet.seq)$tss <- fpkm$tss
@

\subsection{Histone ChIP-seq data}
The example histone ChIP-seq data is stored as \Rclass{GRangesList} object:
<<label=ChIPdata>>=
data(mappedReads)
names(mappedReads)
@
There are two elements within the list. One \textit{CEBPA} wild type and one knock-out sample. Most of the originally obtained reads were removed to reduce storage space. Further, the reads were extended towards the 3 prime end to the mean DNA fragment size of 200bps and duplicated reads were removed. See R packages \Rpackage{Rsamtools} and \Rpackage{GenomicAlignments} how to read in and process sequence reads

\subsection{Data matching}
The presented ChIP-seq data lozalized H3K4me3 histone modifications. This modification primarily occures at promoter regions. Hence, we assign ChIP-seq values to transcription values by counting the number of reads lying wihtin the promoter of the measured transcript.

\subsubsection{Microarray gene expression data}
Depending on the array design, probes often measure more than one transcripts simultaneously. These transcripts may have different TSS/promoters. This makes data matching in case of arrays somewhat tricky. We first create a list with one element for each probe that stores the Ensemble transcript IDs of all transcripts measured by that probeset:
<<label=ProbeToTranscript>>=
probeToTrans <- fData(eSet)$transcript
probeToTrans <- strsplit(probeToTrans, ",")
names(probeToTrans) <- featureNames(eSet)
@
Next, we need the transcriptional start sites for each transcript.
<<label=TranscriptToTSS>>=
data(transToTSS)
head(transToTSS)
@
Such a data frame can be obtained e.g. using \Rpackage{biomaRt}:
<<label=BiomaRt,eval=FALSE>>=
library("biomaRt")
transcripts <- unique(unlist(probeToTrans))
mart <- useMart("ENSEMBL_MART_ENSEMBL",dataset="mmusculus_gene_ensembl", host="www.ensembl.org")
transToTSS <- getBM(attributes=c("ensembl_transcript_id",
    "chromosome_name", "transcript_start",
    "transcript_end", "strand"),
    filters="ensembl_transcript_id",
    values=transcripts, mart=mart)
indNeg <- transToTSS$strand == -1
transToTSS$transcript_start[indNeg] <- transToTSS$transcript_end[indNeg]
transToTSS$transcript_end <- NULL
@
Having these information, the promoter region for each probe can be calculated unsing \Rmethod{matchProbeToPromoter}. Argument \Rfunarg{mode} defines how probes with multiple transcripts should be handled.
<<label=dataMatching>>=
promoters <- matchProbeToPromoter(probeToTrans,
    transToTSS, promWidth=6000, mode="union")
promoters[["10345616"]]
@
Note that some promoter regions, like for probeset \Rcode{"10345616"}, may consist of more than one interval.\\
Finally, \Rmethod{summarizeReads} is used to count the number of reads within the promoter regions:
<<label=makeChIPseqSet>>=
chipSetRaw <- summarizeReads(mappedReads, promoters, summarize="add")
chipSetRaw
head(chipVals(chipSetRaw))
@
The method returns an object of class \Rclass{ChIPseqSet}, which is derived from class \Rclass{RangedSummarizedExperiment}.

\subsubsection{RNA-seq data}
In case of RNA-seq data, we have one transcription value for each group of transcripts sharing the same TSS. Hence, a promoter region can be simply assigned to each transcription value:
<<label=makeChIPseqSet_rna>>=
promoters.seq <- GRanges(seqnames=fData(eSet.seq)$chr,
                        ranges=IRanges(start=fData(eSet.seq)$tss, width=1),
                        probe=featureNames(eSet.seq))
promoters.seq <- resize(promoters.seq, width=3000, fix="center")
promoters.seq <- split(promoters.seq, elementMetadata(promoters.seq)$probe)
@

Next, we can count the number of reads falling into our promoters:
<<label=makeChIPseqSet>>=
chipSetRaw.seq <- summarizeReads(mappedReads, promoters.seq, summarize="add")
chipSetRaw.seq
head(chipVals(chipSetRaw.seq))
@
From now on, we do not distinguish between microarray and RNA-seq any more. \Rcode{eSet} can be substituted by \Rcode{eSet.ser} and \Rcode{chipSetRaw} by \Rcode{chipSetRaw.seq}. In the following, the microarray data is used, since the RNA-seq data was not obtained from the same samples as the ChIP-seq data (actually, not even the same organism).

\pagebreak


\section{ChIP-seq data normalization}
It may be necessary to normalize ChIP-seq data due to different experimental conditions during ChIP.
<<label=normalizeData>>=
chipSet <- normalize(chipSetRaw, method="quantile")
@
In addition to quantile normalization, other methods like the method presented by \cite{Anders2010} are available.
\begin{figure}[H]
\begin{center}
<<label=normalizationPlot,fig=TRUE,width=10,height=5>>=
par(mfrow=c(1,2))
plot(chipVals(chipSetRaw)[,1], chipVals(chipSetRaw)[,2],
     xlim=c(1,600), ylim=c(1,600), main="Raw")
plot(chipVals(chipSet)[,1], chipVals(chipSet)[,2],
     xlim=c(1,600), ylim=c(1,600), main="Quantile")
@
\end{center}
\caption{Raw and quantile normalized ChIP-seq data.}
\end{figure}
\pagebreak


\section{Data integration}
In order to integrate both data types, a correlation score $Z$ (motivated by the work of \cite{Schaefer2012}) can be calculated by multiplying the standardized difference of gene expression values with the standardized difference of ChIP-seq values. Prior to this, pheno type information must be added to the \Robject{chipSet} object.
<<label=integrateData>>=
eSet$CEBPA
colnames(chipSet)
chipSet$CEBPA <- factor(c("wt", "ko"))
colData(chipSet)

intData <- integrateData(eSet, chipSet, 
    factor="CEBPA", reference="wt")
head(intData)
@
\pagebreak

\section{Classification by mixture models}
\subsection{Maximum likelihood approach}
We now fit a mixture model to the correlation score $Z$. The model consists of two normal components with fixed $\mu=0$. These two components should capture $Z$ values close to zero, i.e. genes that show no differences between wild type and knock-out in at least one of the two data sets. The positive (negative) $Z$ scores are represented by a (mirrored) exponential component. Parameters are estimated using the EM-algorithm as implemented in the method \Rmethod{mlMixModel}.
<<label=mlMixModel1,results=hide>>=
mlmm = mlMixModel(intData[,"z"],
    normNull=c(2, 3), expNeg=1, expPos=4, 
    sdNormNullInit=c(0.5, 1), rateExpNegInit=0.5, rateExpPosInit=0.5,
    pi=rep(1/4, 4))
@
<<label=mlMixModel2>>=
mlmm
@
The method returns an object of class \Rclass{MixModelML}, a subclass of \Rclass{MixModel}. We now plot the model fit and the classification results:
\begin{figure}[H]
\begin{center}
<<label=mlMixModelPlot,fig=TRUE,width=8,height=4>>=
par(mfrow=c(1,2))
plotComponents(mlmm, xlim=c(-2, 2), ylim=c(0, 3))
plotClassification(mlmm)
@
\end{center}
\caption{Model fit and classification results of the maximum likelihood approach.}
\end{figure}


\subsection{Bayesian approach}
Alternatively, an Bayesian approach can be used.
<<label=bayesMixModel1,results=hide>>=
set.seed(1515)
bayesmm = bayesMixModel(intData[,"z"],
    normNull=c(2, 3), expNeg=1, expPos=4, 
    sdNormNullInit=c(0.5, 1), rateExpNegInit=0.5, rateExpPosInit=0.5,
    shapeNorm0=c(10, 10), scaleNorm0=c(10, 10), shapeExpNeg0=0.01,
    scaleExpNeg0=0.01, shapeExpPos0=0.01, scaleExpPos0=0.01,
    pi=rep(1/4, 4), sdAlpha=1, itb=2000, nmc=8000, thin=5)
@
\Rmethod{bayesMixModel} returns an object of class \Rclass{MixModelBayes}, which is also a subclass of \Rclass{MixModel}.
<<label=bayesMixModel2>>=
bayesmm
@
The same methods for plotting the model fit and classification can be applied.
\begin{figure}[H]
\begin{center}
<<label=bayesMixModelPlot,fig=TRUE,width=10,height=5>>=
par(mfrow=c(1,2))
plotComponents(bayesmm, xlim=c(-2, 2), ylim=c(0, 3))
plotClassification(bayesmm, method="mode")
@
\end{center}
\caption{Model fit and classification results of the Bayesian approach.}
\end{figure}
Note, that the parameters 'burn in' (\Rfunarg{itb}) and 'number of iterations' (\Rfunarg{nmc}) have to be chosen carefully. The method \Rmethod{plotChains} should be used to assess the convergence of the markov chains for each parameter. The settings here lead to a short runtime, but are unsuitable for real applications.\\

Both models tend to classify more genes to the positive component (component 4) than to the negative one (component 1):
<<label=compMixModels>>=
table(classification(mlmm, method="maxDens"),
      classification(bayesmm, method="mode"))
@
This is in line with the fact that H3K4me3 occurs in the promoters of active genes. Since each $z$ corresponds to a probeset (and so to at least one transcript), the corresponding microarray annotation packages can be used to obtain e.g. the gene symbols of all positivly classified $z$ scores.
<<label=annotation,eval=FALSE>>=
posProbes <- rownames(intData)[classification(bayesmm, method="mode") == 4]
library("mogene10sttranscriptcluster.db")
unlist(mget(posProbes, mogene10sttranscriptclusterSYMBOL))
@

\newpage

\bibliographystyle{unsrturl}
\bibliography{epigenomix}

\end{document}