%\VignetteIndexEntry{r3Cseq} %\VignetteKeywords{r3Cseq} %\VignettePackage{r3Cseq} %documentclass[12pt, a4paper]{article} \documentclass[12pt]{article} \usepackage{amsmath} \usepackage{hyperref} \usepackage[authoryear,round]{natbib} \textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rcode}[1]{{\texttt{#1}}} \newcommand{\Rparameter}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \author{Supat Thongjuea \footnote{Bergen Center for Computational Science, Bergen, Norway}} \title{\textsf{r3Cseq: an R package for the discovery of long-range genomic interactions with chromosome conformation capture and next-generation sequencing data}} \date{October 14, 2011} \begin{document} <>= options(width = 60) olocale=Sys.setlocale(locale="C") @ \maketitle \tableofcontents %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Abstract} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% The advent of chromosome conformation capture (3C)-based and next-generation sequencing (NGS) technologies has led to the detection of many long-range genomic interactions via the generation of novel ligation products between DNA sequences that are closely juxtaposed in vivo. These interactions may involve promoter regions, transcription factor binding sites, and enhancers, and play key roles in the regulation of gene expression. To facilitate the identification of genomic regions that physically interact with the given viewpoints of interest, we have developed an R package called "r3Cseq". The package can be used to perform 3C-seq data analysis both in the presence or absence of a control experiment. It can read in a variety of mapped read input formats such as BAM, ELAND, and Bowtie, and it allows the visualization of candidate interaction regions as well as statistical analysis of each separate interaction region, greatly facilitating hypothesis generation and the interpretation of experimental results. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Introduction} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% The document briefly describes how to use the package \Rpackage{r3Cseq}. \Rpackage{r3Cseq} is a Bioconductor-compliant R package designed to facilitate the identification of interaction regions generated by chromosome conformation capture and next-generation sequencing (3C-seq).The fundamental principles of 3C-seq are briefly described in the following \cite{Soler:2010} (Figure~\ref{fig:3CseqProcedures}), isolated cells are crosslinked to preserve in vivo nuclear proximity between DNA sequences. Nuclei prepared from these cells are then digested using a primary restriction enzyme. HindIII (6-cutter) is used as the primary restriction endonuclease.The cut DNA fragments are then ligated under dilute conditions. This results in the specific ligation of DNA sequences in close nuclear proximity. After de-crosslinking, DNA fragments are digested again using either NlaIII or DpnII as secondary restriction endonucleases to decrease fragment size. After that they are ligated again under dilute conditions, creating small circular fragments. These fragments are then PCR amplified using primers specific for the viewpoint fragment of choice. The viewpoint is the region of interest, which can be a promoter region of a gene of interest, an enhancer, and a transcrition factor binding region. Then, those amplified fragments are sequenced using massive parallel high-throughput sequencing technology. The primer sequences will be removed from the reads and then reads will be mapped against the reference genome. A mapped read file generated by the mapping software such as ELAND and Bowtie is then analyzed by using \Rpackage{r3Cseq} package. \begin{figure} \centering \includegraphics{images/3CseqProcedures.png} \caption{3C-seq procedures} \label{fig:3CseqProcedures} \end{figure} \Rpackage{r3Cseq} package has been developed to facilitate the discovery of interaction regions, which physically interact with the given viewpoints (for example a promoter region of the interested gene). The package is built on, and extends the functionality of Bioconductor package such as \Rpackage{IRanges}, \Rpackage{BSgenome}, \Rpackage{ShortRead}, and \Rpackage{rtracklayer}. The package provides classes and methods to facilitate the single-end reads, which are generated by the next-generation sequencing. The package can perform data analysis on both single input file (single lane from one experiment only) and two input files from both experiment and control. The input of \Rpackage{r3Cseq} is the mapped read file. The package can facilitate a variety of mapped read input formats; ELAND, Bowtie, and those, which are supported by the \Rpackage{ShortRead} package \cite{Morgan:2009}. The default input format is the Binary Sequence Alignment MAP (BAM). \Rpackage{r3Cseq} package can identify candidate interactions regions and it also provides the statistical analysis result. The result reports a number of reads, reads per million (RPM), p-value and fold change (experiment Vs control) per each restriction fragment. The p-value is calculated by using empirical cumulative distribution function. The package contains the function to export data into a UCSC track 'bedgraph' format that is simply uploading to the UCSC genome browser. The package also provides functionalities for plotting to show data analysis result of interaction regions. Plotting functions will be described in the visualization of 3C-seq section. More details about the functions, method and statistical testing are described in this \cite{Supat:2011}. Here is a list of some of its most important functions. \begin{enumerate} \item \Rfunction{getCoverage}: a function to read in input formats ELAND, Bowtie, and any other formats supported by Bioconductor's ShortRead package. The default input is the BAM file. \item \Rfunction{getReadCountPerRestrictionFragment} : a function to count the number of reads per restriction fragment. A user has to specify the name of restriction enzyme. The package will then automatically generate the genome-wide restriction fragment and counts how many 3C-seq reads are mapped into that particular restriction fragment. \item \Rfunction{calculateRPM} : a function to calcuate reads per million (RPM) per each restriction fragment \item \Rfunction{getInteractions} : a function to estimate p-value generated by the empirical cumulative distribution function and to calculate the fold change of experiment compared with the control. \item \emph{Visualiztion} : the package contains functions for visualizing the interaction regions with the powerful plotting facilities. These functions are \Rfunction{plotOverviewInteractions}, \Rfunction{plotInteractionsNearViewpoint}, \Rfunction{plotInteractionsPerChromosome}, and \Rfunction{plot3Cecdf}. \item \emph{Exporting the data} : the package contains functions to export the data into tab-delimited text format, which can be easily uploaded to the UCSC genome browser for further visualization and exploration. Currently it supports the bedGraph format. The package can also generate a summary report in PDF format. These functions are \Rfunction{export3Cseq2bedGraph}, \Rfunction{exportInteractions2text}, and \Rfunction{generate3CseqReport} \end{enumerate} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Preparation input files for r3Cseq} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% The required input file for \Rpackage{r3Cseq} package is the output mapped read file, obtained as an output from the mapping software. The represented identifier for a reference genome shown in each input mapped file is important. In order to run \Rpackage{r3Cseq} properly, the represented identifier for each chromosome must be in "chr[1..19XYM]" format for the mouse reference genome and "chr[1..22XYM]" format for the human reference genome. Therefore, before using \Rpackage{r3Cseq} package, a user has to check the identifier for the reference genome. If the identifier for each chromosome found in the mapped file is not in the proper format for example 'mm9\_ref\_chr01.fa', the Unix command like 'sed' might be used to replace 'mm9\_ref\_chr01.fa' to 'chr1'. The example Unix command for 'sed' shows below. \newline > sed 's/mm9\_ref\_chr01.fa'/chr1\/' file \newline %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Getting started} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% The subset of 3C-seq data generated by \cite{Ralph:2011} will be used for demonstration. To load the \Rpackage{r3Cseq} package, type {\tt library(r3Cseq)}. <<>>= library(r3Cseq) @ There are 2 data sets found in the package. <<>>= Aligned3CseqMybFetalBrain<-system.file("extdata", "alignedReads.fetal.brain.subset.bam",package="r3Cseq") Aligned3CseqMybFetalLiver<-system.file("extdata", "alignedReads.fetal.liver.subset.bam",package="r3Cseq") @ \begin{enumerate} \item alignedReads.fetal.liver.subset.bam, the 3C-seq data contains the aligned reads of Myb's promoter interactions regions in fetal liver. \item alignedReads.fetal.brain.subset.bam, the 3C-seq data contains the aligned reads of the fetal brain. \end{enumerate} In the following, we will perform \Rpackage{r3Cseq} to discover interaction regions, which possibly interact with the promoter region of Myb gene in both fetal liver and brain \cite{Ralph:2011}. \subsection{r3Cseq object creation} In this section, we will analyze 3C-seq data, which was generated by \cite{Ralph:2011}. 3C-seq data were derived from fetal liver and fetal brain. We would like to see the interaction regions of these two tissue types against the Myb's promoter (viewpoint). Myb gene is highly expressed in the fetal liver and it is lowly expressed in the fetal brain. We expect to see that the Myb's promoter highly interacts with the long-range genomic regions, which are mostly candidate enhancers and those drive the expression of gene in the fetal liver. In contrast, there is low level of interaction will be found in the fetal brain. In order to see the interaction region between Myb promoter and the long-range regulatory elements, we defined the data from fetal liver as an experiment and data from fetal brain as a control with new variables called 'expFile' ane 'contrFile' respectively. <<>>= expFile<-Aligned3CseqMybFetalLiver contrFile<-Aligned3CseqMybFetalBrain @ Then, r3Cseq object will be created. <<>>= my3Cseq.obj<-new("r3Cseq",organismName='mm9',alignedReadsBamExpFile=expFile, alignedReadsBamContrFile=contrFile,isControlInvolved=TRUE, isBamInputFile=TRUE,expLabel="fetal_liver", contrLabel="fetal_brain",restrictionEnzyme='HindIII') @ Definition of input parameters is described in the \Robject{r3Cseq} object help page. Type \Robject{my3Cseq.obj} to see the r3Cseq object: <<>>= my3Cseq.obj @ \subsection{Getting reads per restriction fragments} In order to get number of reads per restriction fragement, function \Rfunction{getReadCountPerRestrictionFragment} will be performed. \Rfunction{getReadCountPerRestrictionFragment} counts the number of reads for each restriction fragment across the genome. <<>>= getReadCountPerRestrictionFragment(my3Cseq.obj) @ \subsection{Normalization} Library sizes of the experiment and the control are usually different. In order to make them comparable, normalization is required. Reads per million per restriction fragment size (RPM) can be used as normalized values, and allow for comparison of experiment versus control. \Rpackage{r3Cseq} package provides \Rfunction{calculateRPM} function to calculate RPM. <<>>= calculateRPM(my3Cseq.obj) @ \subsection{Getting interaction regions} After normalization, the \Rfunction{getInteractions} function will be used to assign the p-value and calculate fold change for each candidate interaction regions. <<>>= getInteractions(my3Cseq.obj) @ In order to see the result of interaction regions, Two functions \Rfunction{expInteractionRegions} and \Rfunction{contrInteractionRegions} need to be used to access the r3Cseq object. To get the result of interaction regions for the experiment, \Rfunction{expInteractionRegions} will be performed. <<>>= fetal.liver.interactions <-expInteractionRegions(my3Cseq.obj) fetal.liver.interactions @ And, to get the result of interaction regions for the control, \Rfunction{contrInteractionRegions} will be performed. <<>>= fetal.brain.interactions <- contrInteractionRegions(my3Cseq.obj) fetal.brain.interactions @ \subsection{Getting the viewpoint information} In order to see the viewpoint information, in this case study the viewpoint is the promoter region of Myb's promoter, \Rfunction{getViewpoint} function can be used. \Rfunction{getViewpoint} will return the RangedData object of the viewpoint information. <<>>= viewpoint<-getViewpoint(my3Cseq.obj) viewpoint @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Visualization of 3C-seq data} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \Rpackage{r3Cseq} package provides visualization functions. Those functions are \Rfunction{plotOverviewInteractions}, \Rfunction{plotInteractionsNearViewpoint}, \Rfunction{plotInteractionsPerChromosome}, and \Rfunction{plot3Cecdf}. \subsection{The overview plot of interactions} \Rfunction{plotOverviewInteractions} shows the overview of interaction regions distributed across genome. <>= plotOverviewInteractions(my3Cseq.obj) @ \begin{figure} \centering \includegraphics{r3Cseq-plotOverviewInteractions} \caption{Distribution of interaction regions across genome} \end{figure} \subsection{Empirical cumulative distribution plot} \Rfunction{plot3Cecdf} shows the empirical cumulative distribution of interaction regions. <>= plot3Cecdf(my3Cseq.obj) @ \begin{figure} \centering \includegraphics{r3Cseq-plot3Cecdf} \caption{Empirical cumulative distribution of interaction regions} \end{figure} \subsection{Zoom in interactions near the viewpoint} \Rfunction{plotInteractionsNearViewpoint} shows the zoom in of interaction regions located close to the viewpoint. <>= plotInteractionsNearViewpoint(my3Cseq.obj) @ \begin{figure} \centering \includegraphics{r3Cseq-plotInteractionsNearViewpoint} \caption{Zoom in interaction regions near the viewpoint} \end{figure} \subsection{Visualize interactions in each selected Chromosome} \Rfunction{plotInteractionsPerChromosome} shows the interaction regions found in the chromosome10. <>= plotInteractionsPerChromosome(my3Cseq.obj,"chr10") @ \begin{figure} \centering \includegraphics{r3Cseq-plotInteractionsPerChromosome} \caption{Distribution of interaction regions across chromosome 10} \end{figure} \subsection{Export interaction regions to the 'bedGraph' format} \Rfunction{export3Cseq2bedGraph} export interaction regions from RagedData to the bedGraph format, which suitable for uploading to the UCSC genome browser. <<>>= export3Cseq2bedGraph(my3Cseq.obj) @ \subsection{Summary report} \Rfunction{generate3CseqReport} generates the summary report from \Rpackage{r3Cseq} analysis results. The report contains the pdf file for all plots and the text file of interaction regions. <<>>= generate3CseqReport(my3Cseq.obj) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Session Info} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% <<>>= sessionInfo() @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %\newpage \bibliographystyle{apalike} \bibliography{r3Cseq} \end{document}