\documentclass[a4paper]{article} \title{GOTHiC - Genome Organisation Through HiC User Manual} \author{Bori Mifsud and Robert Sugar} \usepackage{datetime} \newdate{date}{19}{08}{2013} \date{\displaydate{date}} \usepackage[utf8]{inputenc} \usepackage{amsmath,amssymb} \usepackage{cite} % Make references as [1-4], not [1,2,3,4] \usepackage{url} % Formatting web addresses \usepackage{graphicx} \usepackage{indentfirst} % \VignetteIndexEntry{GOTHiC} \begin{document} \maketitle \section{Introduction} Chromosome conformation capture (3C) is a cross-link based technique to detect spatial proximity of specific genomic distant loci \cite{Dekker2002}, which has sparked a number of large-scale methods to study multiple interactions simultaneously \cite{DeWit2012}. Hi-C is a genome wide, unbiased version of 3C \cite{Lieberman-Aiden2009}, in which all interactions are sampled by paired-end high-throughput sequencing. In an ideal Hi-C library all read-pairs would reflect real interactions and the number of read-pairs between any two genomic loci would be proportional to the frequency of the interaction in the studied cell population. Unfortunately, Hi-C has been shown to suffer from a multitude of systematic biases that confound the true signal, and complex approaches have been designed to minimize their impact \cite{Yaffe2011,Imakaev2012}. \section{Description} \subsection{Data} The data were generated by Lieberman-Aiden \textit{et al.} and published in 2009 \cite{Lieberman-Aiden2009}. In this study, the authors used lymphoblastoid human cells and two different restriction enzymes (HindIII and NcoI) in the HiC experiments. The examples in this package use the mapped reads (from \textit{Bowtie} or \textit{HiCUP} \cite{Langmead2009}, \cite{Wingett}) deposited in the HiCDataLymphoblastoid package, and paired reads (from our \textit{pairReads} function) of the HindIII replicate SRR027956 for chromosome 20. The paired reads are in a \textit{GRangesList} object. \\ <>= library(GOTHiC) data(lymphoid_chr20_paired_filtered) @ \subsection{A binomial statistic for accurately modelling Hi-C data} \subsubsection{Method} First, self-ligation read-pairs and those mapping to adjacent fragments are filtered out by removing any pair of reads whose middle points are closer than 10kb (the median size of the fragments is about 2130bp in our experiment, therefore the 10kb cut-off allows for removal of read pairs resulting from incomplete digestion as well). PCR duplicates are filtered by removing read-pairs with both ends mapping to identical coordinates (HiCUP already filters invalid read-pairs and removes duplicates). Having done so, true interactions can be separated from spurious ligations using a binomial test. Briefly, the probability that a read pair is the consequence of a spurious ligation between two sites can be estimated as: \begin{equation} p_{j,h} = 2 * rel\_coverage_j * rel\_coverage_h \end{equation} where the relative coverage of a given site or region is \begin{equation} rel\_coverage_j = reads_j / reads_{total} \end{equation} and the noise fraction represents the fraction of the ligations corresponding to spurious inter-molecular ligations (having previously removed self-ligations). The observed coverage is a complex function of a multitude of biases, including the density of restriction sites, cleavage efficiency, ligation efficiency, amplification and sequencing biases and mappability. The probability of seeing a random read-pair between two regions depends on the relative coverages of the interacting regions in a multiplicative manner since the biases affect both interacting partners independently. Given a total number of read-pairs in the experiment of N (after excluding self-ligation read-pairs), the probability of observing $n_{j,h}$ read-pairs between two loci is given by the binomial density: \begin{equation} pval_{jh}=P(x \geq n_{jh})=1-\sum_{i=0}^{n_{jh}-1} \binom{N}{i} (P_{randomintermol_{jh}})^i(1-P_{randomintermol_{jh}})^{N-i} \end{equation} This allows calculating a p-value of the chance of seeing the observed number of read-pairs by chance, as a simple function of the coverage of both sites and the total number of reads. Correcting for multiple testing using the Benjamini-Hochberg multiple testing correction we obtain a q-value that can be used directly to infer significant interactions with a desired false discovery rate from a Hi-C dataset. In addition to a p-value, an observed-over-expected ratio can be easily calculated, what can be used as a measure of effect size or as an unbiased measure of interaction frequency or strength. \begin{equation} Obs\_exp\_ratio = n_{j,h} / (p_{j,h}*N) \end{equation} The binomial method does not explicitly assume that different biases are multiplicative since it does not model the total effect of all biases as a product of their relative contribution. However, the method assumes that biases affect both read ends independently. This is a reasonable assumption given our understanding of known biases and has been used in previous methods (e.g. \cite{Yaffe2011,Imakaev2012}). Thus for any two genomic fragments, the binomial test provides two statistics: (i) a p-value of observing an interaction between the two loci given a random distribution of ligations; and (ii) a measure of the strength of the interaction compared with random expectation. \subsection{Function description} \subsubsection{Pairing aligned paired NGS reads} The \textit{pairReads} function takes bowtie output files, pairs the reads, only keeps those where both ends mapped, filters for perfect duplicates to avoid PCR bias, and saves and returns a \textit{GenomicRangesList} object that contains the paired\_reads\_1 and paired\_reads\_2 \textit{GenomicRanges} with the genomic coordinates of each end of the read pairs respectively. \\ <>= dirPath = system.file("extdata", package="HiCDataLymphoblast") fileName1 = list.files(dirPath, full.names=TRUE)[1] fileName2 = list.files(dirPath, full.names=TRUE)[2] paired = pairReads(fileName1, fileName2, sampleName="lymphoid_chr20", DUPLICATETHRESHOLD = 1, fileType="Table") @ \subsubsection{Mapping aligned and paired reads to the restriction fragments} The \textit{mapReadsToRestrictionSites} function takes mapped paired NGS reads in the format of a GenomicRangesList object where the two end of the reads are in the \textit{GenomicRanges}paired\_reads\_1 and paired\_reads\_2. It prepares the digestion file from the genome supplied to it with the given restriction enzyme and specificity and maps the reads to the fragments. It returns a \textit{GenomicRangesList} containing two textit{GenomicRanges} objects containing the start of the fragment where each end of the read pairs is mapped respectively (parallel version of findOverlaps from the GenomicRanges was adapted from Kasper Daniel Hansen). \\ <>= data(lymphoid_chr20_paired_filtered) mapped=mapReadsToRestrictionSites(filtered, sampleName="lymphoid_chr20", BSgenomeName="BSgenome.Hsapiens.UCSC.hg18", genome=BSgenome.Hsapiens.UCSC.hg18, restrictionSite="A^AGCTT", enzyme="HindIII", parallel=FALSE, cores=1) @ \subsubsection{Binomial Test for detecting significant interactions in Hi-C data} The \textit{GOTHiC} or \textit{GOTHiChicup} functions perform a cumulative binomial test to detect interactions between distal genomic loci that have significantly more reads than expected by chance in Hi-C experiments. They take either mapped paired NGS reads, or the mapped and filtered reads from HiCUP as input and give back the list of significant interactions for a given bin size in the genome. They return a data.frame where each line contains the genomic coordinates of each en of a pair (chromosome and start of the bin), the observed number of reads, the relative coverage of each region, the expected frequency to observe this interaction by random, the binomial p-value and q-value (Benjamini correction of the p-value). They also generate a density plot from the p-value distribution Figure~\ref{fig1}, which helps with assessing the quality of the HiC experiments. Poor quality experiments tend to have a p-value distribution that is close to the one from a random sample (uniform 0 to 1). Good quality experiments have a clear peak around 0 (very low coverage samples can also have bimodal distribution even if the quality is lower, the number of identified significant interaction is an additional indicator of quality that would be low in low coverage samples). \\ <>= dirPath = system.file("extdata", package="HiCDataLymphoblast") fileName1 = list.files(dirPath, full.names=TRUE)[1] fileName2 = list.files(dirPath, full.names=TRUE)[2] binom=GOTHiC(fileName1,fileName2, sampleName="lymphoid_chr20", res=1000000, BSgenomeName="BSgenome.Hsapiens.UCSC.hg18", genome=BSgenome.Hsapiens.UCSC.hg18, restrictionSite="A^AGCTT", enzyme="HindIII" ,cistrans="all", filterdist=10000, DUPLICATETHRESHOLD=1, fileType="Table", parallel=FALSE, cores=NULL) @ <>= dirPath <- system.file("extdata", package="HiCDataLymphoblast") fileName <- list.files(dirPath, full.names=TRUE)[4] restrictionFile <- list.files(dirPath, full.names=TRUE)[3] binom=GOTHiChicup(fileName, sampleName='lymphoid_chr20', res=1000000, restrictionFile, cistrans='all', parallel=FALSE, cores=NULL) @ \clearpage \begin{figure}[htb] \begin{center} \includegraphics[scale=0.6]{lymphoid_chr20_pvalue_distribution} \caption{\small {\bf P-value distribution of human chromosome 20 interactions in lymphoblastoid cells.}} \label{fig1} \end{center} \end{figure} \subsection{Aknowledgements} We are grateful to Elodie Darbo for helping with putting the script together in a package format, and Philip East for testing and providing useful comments. \\ \begin{thebibliography}{10} \providecommand{\url}[1]{[#1]} \providecommand{\urlprefix}{} \bibitem{Dekker2002} Dekker J, Rippe K and Dekker M and Kleckner N: \textbf{{Capturing chromosome conformation.}} \emph{Science} 2002, \textbf{295}:1306--11. \bibitem{DeWit2012} de Wit E and de Laat W: \textbf{{A decade of 3C technologies: insights into nuclear organization.}} \emph{Genes \& development} 2012 \textbf{26} (1):11--24. \bibitem{Lieberman-Aiden2009} Lieberman-Aiden E, van Berkum NL, Williams L, Imakaev M, Ragoczy T, Telling A, Amit I, Lajoie BR, Sabo P, Dorschner MO, Sandstrom R, Bernstein B, Bender MA, Groudine M, Gnirke Andreas, Stamatoyannopoulos J, Mirny LA, Lander ES and Dekker J: \textbf{{Comprehensive mapping of long-range interactions reveals folding principles of the human genome.}} \emph{Science} 2009 \textbf{326}:289--93. \bibitem{Yaffe2011} Yaffe E and Tanay A: \textbf{{Probabilistic modeling of Hi-C contact maps eliminates systematic biases to characterize global chromosomal architecture.}} \emph{Nature genetics} 2011 \textbf{43} (11):1059--65. \bibitem{Imakaev2012} Imakaev M, Fudenberg G, McCord RP, Naumova N, Goloborodko A, Lajoie BR, Dekker J and Mirny LA: \textbf{{Iterative correction of Hi-C data reveals hallmarks of chromosome organization.}} \emph{Nature methods} 2012 \textbf{9} (10):999--1003. \bibitem{Langmead2009} Langmead B, Trapnell C, Pop M and Salzberg SL: \textbf{{Ultrafast and memory-efficient alignment of short DNA sequences to the human genome.}} \emph{Genome Biology} 2009 \textbf{10}:R25. \bibitem{Wingett} Wingett et al.: \textbf{{http://www.bioinformatics.babraham.ac.uk/projects/hicup/overview/}}. \end{thebibliography} \end{document}