%\VignetteIndexEntry{r3Cseq} %\VignetteKeywords{r3Cseq} %\VignettePackage{r3Cseq} %documentclass[12pt, a4paper]{article} \documentclass[12pt]{article} \usepackage{amsmath} \usepackage{hyperref} \usepackage[authoryear,round]{natbib} \textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rcode}[1]{{\texttt{#1}}} \newcommand{\Rparameter}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rdata}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \author{Supat Thongjuea \footnote{The Weatherall Institute of Molecular Medicine, University of Oxford, UK}} \title{\textsf{r3Cseq: an R package for the discovery of long-range genomic interactions with chromosome conformation capture and next-generation sequencing data}} \date{January 26, 2015} \begin{document} <>= options(width = 60) olocale=Sys.setlocale(locale="C") @ \maketitle \tableofcontents %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \begin{abstract} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% The coupling of chromosome conformation capture (3C)-based and next-generation sequencing (NGS) enable high-throughput detection of long-range genomic interactions via the generation of novel ligation products between DNA sequences that are closely juxtaposed in vivo. These interactions may involve promoter regions, enhacers and other regulatory and structural elements of chromosomes, and can reveal key details in the regulation of gene expression. 3C-seq is a a variant of the method for the detection of interactions between one chosen genomic element (viewpoint) and the rest of the genome. We present an R/Bioconductor package called \Rpackage{r3Cseq}, designed to perform 3C-seq data analysis in a number of different experimental designs, with or without a control experiment. The package can also be used to perform data analysis for the experiment with replicates. The package provides functions to perform 3C-seq data normalization, statistical analysis for cis/trans interactions and visualization to facilitate the identification of genomic regions that physically interact with the given viewpoints of interest. The r3Cseq package greatly facilitates hypothesis generation and the interpretation of experimental results. \end{abstract} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Introduction} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% This vignette describes how to use the \Rpackage{r3Cseq} package. \Rpackage{r3Cseq} is a Bioconductor-compliant R package designed to facilitate the identification of interaction regions generated by chromosome conformation capture and next-generation sequencing (3C-seq). The fundamental principles of 3C-seq briefly described in the following \cite{Soler:2010} (Figure~\ref{fig:3CseqProcedures}), isolated cells are treated with a cross-linking agent to preserve in vivo nuclear proximity between DNA sequences. The DNA isolated from these cells is then digested using a primary restriction enzyme, typically a 6-base pairs cutting enzyme such as HindIII, EcoRI or BamHI. The digested product is then ligated under dilute conditions to favor intra-molecular over inter-molecular ligation events. This digested and ligated chromatin yields composite sequences representing (distal) genomic regions that are in close physical proximity in the cell nucleus. The digested and ligated chromatin is then de-crosslinked and subjected to a second restriction digest using either Nla III or Dpn II (a 4-cutter) as a secondary restriction enzyme to decrease the fragment sizes. The resulting digested DNA is then ligated again under diluted conditions, creating small circular fragments. These fragments are inverse PCR-amplified using primers specific for a genomic region of interest (eg. promoter, enhancer, or any other element potentially involved in long-range interactions), termed the "viewpoint". The amplified fragments are then sequenced using massively parallel high-throughput sequencing. Because the 3C-seq procedure hybrid DNA molecules being a combination of viewpoint-specific primers followed by sequences dervied from the ligated interaction fragments. As such, these composite sequences are unmappable and need to be trimmed to removed the viewpoint sequences, thus leaving only the capture sequence fragments for mapping. After trimming, reads are mapped against a reference genome using alignment software such as Bowtie. A mapped read file generated by the mapping software is then transformed to the BAM file and analyzed by using \Rpackage{r3Cseq} package. \begin{figure} \centering \includegraphics{images/3CseqProcedures.png} \caption{3C-seq procedures} \label{fig:3CseqProcedures} \end{figure} \Rpackage{r3Cseq} package is built on, and extends the functionality of Bioconductor package such as \Rpackage{GenomicRanges}, \Rpackage{BSgenome}, \Rpackage{Rsamtools}, and \Rpackage{rtracklayer}. The package provides classes and methods to facilitate single-end reads, which are generated by the next-generation sequencing. The package can perform data analysis on both single input file (single lane from one experiment) and two input files from an experiment and a control. The package also provides a class and functions to perform 3C-seq data analysis from replicates (see working with replicates section). The key features workflow of \Rpackage{r3Cseq} depicts in the following figure. (Figure~\ref{fig:r3CseqFlowChart}) \begin{figure} \centering \includegraphics{images/r3CseqFlowChart.png} \caption{r3Cseq key features workflow} \label{fig:r3CseqFlowChart} \end{figure} \Rpackage{r3Cseq} analysis workflow starts from the class initialization. There are two classes found in this package. One is \Robject{r3Cseq} class that is designed to support a single experiment in both with and without a control experiment. Another is \Robject{r3CseqInBatch} class that is designed to support analysis with replicates. To initialize the class both \Robject{r3Cseq} and \Robject{r3CseqInBatch}, a user gives the input parameters for example the input file name, genome assembly version, primary restriction fragment name and so on (see more details in the manual page of \Robject{r3Cseq} and \Robject{r3CseqInBatch}). The class is then will be created and it is ready to perform 3C-seq analysis. The \Rfunction{getRawReads} and \Rfunction{getRawReadsInBatch} functions can be next used to read in the BAM files. It may take a few minutes for data processing depending on the size of the input BAM files and the speed of CPU and the size of the RAM of a computer that performs analysis. To run the \Rfunction{getRawReadsInBatch} function for replicates, a user might have a powerful computer server. \Rfunction{getRawReads} function reads in aligned reads from input BAM files and transforms aligned reads to the \Robject{GRanged} objects that can be stored in the r3Cseq object, whereas \Rfunction{getRawReadsInBatch} processes the data in batch and stores the aligned reads \Robject{GRanges} in the R files (.rdata). To count number of reads preparing for downstream analysis, \Robject{r3Cseq} provides two ways to count number of reads per region; 1) count number of reads per resitrction fragments, using the function \Rfunction{getReadCountPerRestrictionFragment} and 2) count the number of reads per non-overlapping window size, using function \Rfunction{getReadCountPerWindow}, whereas \Rfunction{getBatchReadCountPerRestrictionFragment} and \Rfunction{getBatchReadCountPerWindow} can do the same for replicates. \Rpackage{r3Cseq} provides \Rfunction{calculateRPM} and \Rfunction{calculateBatchRPM} functions to calculate reads per million per restriction fragment size (RPM) as normalized interaction frequency values. There are two methods to calculate RPM. which are described in the r3Cseq paper \cite{Supat:2012}. After data normalization, \Rfunction{getInteractions} and \Rfunction{getBatchInteractions} will be performed to identify candidate interactions. Statistical analysis for both cis and trans interactions is described in the r3Cseq paper \cite{Supat:2012}. \Rpackage{r3Cseq} provides functions \Rfunction{export3CseqRawReads2BedGraph}, \Rfunction{export3Cseq2bedGraph}, \Rfunction{exportInteractions2text}, and \Rfunction{exportBatchInteractions2text} to export raw reads and all identified interactions to the bedGraph file format, which is simply uploaded to the UCSC genome browser. The package also provides functionalities for plotting to show data analysis result of interaction regions. Plotting functions consist of \Rfunction{plotOverviewInteractions}, \Rfunction{plotInteractionPerChromosome}, \Rfunction{plotInteractionNearViewpoint}, and \Rfunction{plotDomainogramNearViewpoint}. These functions will be demonstrated in the visualization of 3C-seq data section. Here is a list of some of its most important functions. \begin{enumerate} \item \Rfunction{getRawReads}: a function to read in BAM files. \item \Rfunction{getBatchRawReads}: a function to read in multiple BAM files for replicates. \item \Rfunction{getReadCountPerRestrictionFragment} : a function to count the number of reads per restriction fragment. A user has to specify the name of restriction enzyme. The package will then automatically generate the genome-wide restriction fragments and counts how many 3C-seq reads are mapped into that particular restriction fragments. \item \Rfunction{getBatchReadCountPerRestrictionFragment} : Similar to \Rfunction{getReadCountPerRestrictionFragment} using it for replicates \item \Rfunction{getReadCountPerWindow} : a function to count the number of reads per defined non-overalapping window size. A user has to specify the window size of interest. The package will then automatically generate the genome-wide windows and counts how many 3C-seq reads are mapped into that particular windows. \item \Rfunction{getBatchReadCountPerWindow} : similar to \Rfunction{getReadCountPerWindow} using for replicates \item \Rfunction{calculateRPM} : a function to calcuate reads per million (RPM) per each restriction fragment \item \Rfunction{calculateBatchRPM} : similar to \Rfunction{calculateRPM} using for replicates \item \Rfunction{getInteractions} : a function to perform statistical analysis to identify candidate interactions \item \Rfunction{getBatchInteractions} : similar to \Rfunction{getInteractions} using for replicates \item \emph{Visualiztion} : the package contains functions for visualizing the interaction regions with the powerful plotting facilities. These functions are \Rfunction{plotOverviewInteractions}, \Rfunction{plotInteractionsNearViewpoint},\Rfunction{plotInteractionsPerChromosome}, and \Rfunction{plotDomainogramNearViewpoint}. \item \emph{Data export} : the package contains functions to export the data into tab-delimited text format, which can be easily uploaded to the UCSC genome browser for further visualization and exploration. Currently it supports the bedGraph format. These functions are \Rfunction{export3CseqRawReads2bedGraph}, \Rfunction{export3Cseq2bedGraph}, \Rfunction{exportInteractions2text}, \Rfunction{exportBatchInteractions2text}. The package can also generate a summary report in PDF format by \Rfunction{generate3CseqReport} function. \end{enumerate} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Preparation input files for r3Cseq} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% The required input file for \Rpackage{r3Cseq} package is the BAM file, obtained as an output from the mapping software. The represented identifier for a reference genome shown in each input BAM file is important to run \Rpackage{r3Cseq} properly. The represented identifier for each chromosome must be in "chr[1..19XYM]" format for the mouse reference genome and "chr[1..22XYM]" format for the human reference genome. Therefore, before using \Rpackage{r3Cseq} package, a user has to check the identifier for the reference genome. If the identifier for each chromosome found in the mapped file is not in a proper format for example 'mm9\_ref\_chr01.fa', the Unix command like 'sed' might be used to replace 'mm9\_ref\_chr01.fa' to 'chr1'. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Getting started} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% The 3C-seq data generated by \cite{Ralph:2011} will be used for the demonstration. The current version of r3Cseq supports mouse, human, and rat genomes. Therefore, the package requires one of the followings \Rpackage{BSgenome} packages to be installed;\Rpackage{BSgenome.Mmusculus.UCSC.mm9.masked}, \Rpackage{BSgenome.Mmusculus.UCSC.mm10.masked}, \Rpackage{BSgenome.Hsapiens.UCSC.hg18.masked}, \Rpackage{BSgenome.Hsapiens.UCSC.hg19.masked, and \Rpackage{BSgenome.Rnorvegicus.UCSC.rn5.masked}}. Loading the \Rpackage{r3Cseq} package into R. <<>>= library(r3Cseq) @ There are 2 data sets found in the package. <<>>= data(Myb_prom_FL) data(Myb_prom_FB) @ \begin{enumerate} \item \Rdata{Myb\_prom\_FL}, the 3C-seq data contains the aligned reads of the Myb promoter interactions signal in fetal liver. It was stored in the \Robject{GRanges} object processed by the \Rpackage{Rsamtools} package. \item \Rdata{Myb\_prom\_FB}, the 3C-seq data contains the aligned reads of Myb promoter interactions signal in fetal brain. \end{enumerate} We will perform \Rpackage{r3Cseq} to discover interaction regions, which possibly interact with the promoter region of Myb gene in both fetal liver and brain \cite{Ralph:2011}. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{r3Cseq object initialization} In this section, we will analyze 3C-seq data, which were derived from fetal liver (expressing high levels of Myb) and fetal brain (expression low level of Myb). The latter will be used as a negative control. More examples of r3Cseq data analysis can be found at the official r3Cseq website \url{http://r3cseq.genereg.net}. We firstly initialized the r3Cseq object. <<>>= my3Cseq.obj<-new("r3Cseq",organismName='mm9',isControlInvolved=TRUE, viewpoint_chromosome='chr10',viewpoint_primer_forward='TCTTTGTTTGATGGCATCTGTT', viewpoint_primer_reverse='AAAGGGGAGGAGAAGGAGGT',expLabel="Myb_prom_FL", contrLabel="MYb_prom_FB",restrictionEnzyme='HindIII') @ Definition of input parameters is described in the \Robject{r3Cseq} help page. We next add raw reads from Myb\_prom\_FL and Myb\_prom\_FB to the existing \Robject{my3Cseq.obj}. <<>>= expRawData(my3Cseq.obj)<-exp.GRanges contrRawData(my3Cseq.obj)<-contr.GRanges @ Type \Robject{my3Cseq.obj} to see the r3Cseq object: <<>>= my3Cseq.obj @ \subsection{Getting reads per restriction fragments/user defined window size} To get number of reads per restriction fragement, function \Rfunction{getReadCountPerRestrictionFragment} will be performed. <<>>= getReadCountPerRestrictionFragment(my3Cseq.obj) @ The package provides the function \Rfunction{getReadCountPerWindow} to count number of reads per non-overlapping window size defined by a user. \subsection{Normalization} We next perform normalization. <<>>= calculateRPM(my3Cseq.obj) @ \subsection{Getting interaction regions} After normalization, the \Rfunction{getInteractions} function will be performed. <<>>= getInteractions(my3Cseq.obj,fdr=0.05) @ In order to see the result of interaction regions, Two functions \Rfunction{expInteractionRegions} and \Rfunction{contrInteractionRegions} need to be used to access the slot of r3Cseq object. To get the result of interaction regions for the experiment, \Rfunction{expInteractionRegions} will be performed. <<>>= fetal.liver.interactions<-expInteractionRegions(my3Cseq.obj) fetal.liver.interactions @ To get the result of interaction regions for the control, \Rfunction{contrInteractionRegions} will be performed. <<>>= fetal.brain.interactions<-contrInteractionRegions(my3Cseq.obj) fetal.brain.interactions @ \subsection{Getting the viewpoint information} To see the viewpoint information, \Rfunction{getViewpoint} function can be used. \Rfunction{getViewpoint} will return the RangedData object of the viewpoint information. <<>>= viewpoint<-getViewpoint(my3Cseq.obj) viewpoint @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Visualization of 3C-seq data} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \Rpackage{r3Cseq} package provides visualization functions. These functions are \Rfunction{plotOverviewInteractions}, \Rfunction{plotInteractionsNearViewpoint}, \Rfunction{plotInteractionsPerChromosome}, and \Rfunction{PlotDomainogramNearViewpoint}. \subsection{The overview plot of interactions} \Rfunction{plotOverviewInteractions} function shows the overview of interaction regions distributed across genome. <>= plotOverviewInteractions(my3Cseq.obj) @ \begin{figure} \centering \includegraphics{r3Cseq-plotOverviewInteractions} \caption{Distribution of interaction regions across genome} \end{figure} \subsection{Plot of interactions in cis} \Rfunction{plotInteractionsNearViewpoint} function shows the zoom in of interaction regions located close to the viewpoint. <>= plotInteractionsNearViewpoint(my3Cseq.obj) @ \begin{figure} \centering \includegraphics{r3Cseq-plotInteractionsNearViewpoint} \caption{Zoom in interaction regions near the viewpoint} \end{figure} \subsection{Plot of interactions in each selected chromosome} \Rfunction{plotInteractionsPerChromosome} function shows the interaction regions found in the chromosome10. <>= plotInteractionsPerChromosome(my3Cseq.obj,"chr10") @ \begin{figure} \centering \includegraphics{r3Cseq-plotInteractionsPerChromosome} \caption{Distribution of interaction regions across chromosome 10} \end{figure} \subsection{Domainogram of interactions} \Rfunction{plotDomainogramNearViewpoint} function shows the domainogram of interactions found in cis. This function may takes several minutes to produce domainograms. We therefore, skip this command to produce plots for the vignette. You can see the example of the plots and find more details at \url{http://r3Cseq.genereg.net}. <<>>= #plotDomainogramNearViewpoint(my3Cseq.obj) @ \subsection{Associate interaction signals to the Refseq genes} \Rfunction{getExpInteractionsInRefseq} and \Rfunction{getContrInteractionsInRefseq} functions can be used to detect the list of genes that contain significant interaction signals in their proximity. <<>>= detected_genes<-getExpInteractionsInRefseq(my3Cseq.obj) head(detected_genes) @ \subsection{Export interactions to the bedGraph format} \Rfunction{export3Cseq2bedGraph} function exports all interactions from the \Robject{RangedData} to the bedGraph format, which simply upload to the UCSC genome browser. <<>>= #export3Cseq2bedGraph(my3Cseq.obj) @ \subsection{Summary report} \Rfunction{generate3CseqReport} function generates the summary report from \Rpackage{r3Cseq} analysis results. The report contains a pdf file for all plots and text files of interaction regions. This function may takes several minutes to produce the report. We therefore, skip this command during the vignette creation. <<>>= #generate3CseqReport(my3Cseq.obj) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Working with replicates} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% The example of how to work with replicats can be found at \url{http://r3cseq.genereg.net/}. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{r3Cseq website} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% We have developed the website \url{http://r3cseq.genereg.net}. The website provides more details of \Rpackage{r3Cseq} analysis pipeline. The example data sets and the current version of \Rpackage{r3Cseq} package can be downloaded from the website. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Session Info} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% <<>>= sessionInfo() @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %\newpage \bibliographystyle{apalike} \bibliography{r3Cseq} \end{document}