--- title: "From reads to regions: a Bioconductor workflow to detect differential binding in ChIP-seq data" author: - name: Aaron T. L. Lun affiliation: - &WEHI The Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, VIC 3052, Melbourne, Australia - Department of Medical Biology, The University of Melbourne, Parkville, VIC 3010, Melbourne, Australia - name: Gordon K. Smyth affiliation: - *WEHI - Department of Mathematics and Statistics, The University of Melbourne, Parkville, VIC 3010, Melbourne, Australia date: "`r Sys.Date()`" vignette: > %\VignetteIndexEntry{From reads to regions: a Bioconductor workflow to detect differential binding in ChIP-seq data} %\VignetteEngine{knitr::rmarkdown} output: BiocStyle::html_document: fig_caption: yes bibliography: ref.bib --- ```{r style, echo=FALSE, results='hide', message=FALSE} library(BiocStyle) library(knitr) opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE) opts_chunk$set(fig.asp=1) ``` # Workflow version information **R version**: `r R.version.string` **Bioconductor version**: `r BiocManager::version()` **Package**: `r packageVersion("chipseqDB")` # Introduction Chromatin immunoprecipitation with sequencing (ChIP-seq) is a popular technique for identifying the genomic binding sites of a target protein. Conventional analyses of ChIP-seq data aim to detect absolute binding (i.e., the presence or absence of a binding site) based on peaks in the read coverage. However, a number of recent studies have focused on the detection of changes in the binding profile between conditions [@rossinnes2012differential; @pal2013global]. These differential binding (DB) analyses involve counting reads into genomic intervals, and then testing those counts for significant differences between conditions. This defines a set of putative DB regions for further examination. DB analyses are easier to perform than their conventional counterparts, as the effect of genomic biases is largely mitigated when counts for different libraries are compared at the same genomic region. DB regions may also be more relevant as the change in binding can be associated with the biological difference between conditions. The key step in the DB analysis is the manner in which reads are counted. The most obvious strategy is to count reads into pre-defined regions of interest, like promoters or gene bodies [@pal2013global]. This is simple but will not capture changes outside of those regions. In contrast, *de novo* analyses do not depend on pre-specified regions, instead using empirically defined peaks or sliding windows for read counting. Peak-based methods are implemented in the `r Biocpkg("DiffBind")` and `r Biocpkg("DBChIP")` software packages [@rossinnes2012differential; @liang2012detecting], which count reads into peak intervals that have been identified with software like MACS [@zhang2008macs]. This requires some care to maintain statistical rigour, as peaks are called with the same data used to test for DB. Alternatively, window-based approaches count reads into sliding windows across the genome. This is a more direct strategy that avoids problems with data re-use and can provide increased DB detection power [@lun2014denovo]. However, its correct implementation is not straightforward due to the subtleties with interpretation of the false discovery rate (FDR). Here, we describe a computational workflow for performing a DB analysis with sliding windows. It is primarily based on the `r Biocpkg("csaw")` software package but also uses a number of other packages from the open-source Bioconductor project [@huber2015orchestrating]. The aim is to facilitate the practical implementation of window-based DB analyses by providing detailed code and expected output. The workflow described here applies to any ChIP-seq experiment with multiple experimental conditions and with multiple biological samples within one or more of the conditions. It detects and summarizes DB regions between conditions in a *de novo* manner, i.e., without making any prior assumptions about the location or width of bound regions. Detected regions are then annotated according to their proximity to annotated genes. In addition, the code can be easily adapted to accommodate batch effects, covariates and multiple experimental factors. ***Note:*** *to cite this article, please refer to https://f1000research.com/articles/4-1080/v2 for instructions.* # Links to individual workflows We use two datasets to demonstrate the application of sliding windows to detect DB events: - The first dataset studies changes in H3K9ac marking between pro-B and mature B cells [@domingo2012bcell]. The DB workflow for this dataset can be found at https://bioconductor.org/help/workflows/h3k9ac/. - The second dataset studies changes in CREB-binding protein (CBP) binding between wild-type and CBP knock-out cells [@kasper2014genomewide]. The DB workflow for this dataset can be found at https://bioconductor.org/help/workflows/cbp/. # Author information ## Author contributions A.T.T.L. developed and tested the workflow on the H3K9ac and CBP data sets. G.K.S. provided direction on the design of the workflow. Both A.T.T.L. and G.K.S. wrote the article. ## Competing interests No competing interests were disclosed. ## Grant information National Health and Medical Research Council (Program Grant 1054618 to G.K.S., Fellowship to G.K.S.); Victorian State Government Operational Infrastructure Support; Australian Government NHMRC IRIIS. ## Acknowledgements The authors would like to thank Prof. Stephen Nutt for his valuable insights on B-cell biology. # References