--- title: "extraChIPs: Range-Based operations" author: - name: Stevie Pederson affiliation: - Black Ochre Data Laboratories, Telethon Kids Institute, Adelaide, Australia - Dame Roma Mitchell Cancer Researc Laboratories, University of Adelaide - John Curtin School of Medical Research, Australian National University email: stephen.pederson.au@gmail.com package: extraChIPs resource_files: - vignettes/stitched.png output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{Range-Based Operations} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, echo=FALSE} knitr::opts_chunk$set(message = FALSE, warning = FALSE) ``` # Introduction This package is designed to enable the Gene Regulatory Analysis using Variable IP ([GRAVI](https://github.com/smped/GRAVI)) workflow, as a method for detecting differential binding in ChIP-Seq datasets. As a workflow focussed on data integration, most functions provided by the package `extraChIPs` are designed to enable comparison across datasets. This vignette looks primarily at functions which work with `GenomicRanges` objects. # Installation In order to use the package `extraChIPs` and follow this vignette, we recommend using the package `BiocManager` hosted on CRAN. Once this is installed, the additional packages required for this vignette (`tidyverse`, `plyranges` and `Gviz`) can also be installed. ```{r install, eval = FALSE} if (!"BiocManager" %in% rownames(installed.packages())) install.packages("BiocManager") BiocManager::install(c("tidyverse", "plyranges", "Gviz")) BiocManager::install("extraChIPs") ``` # Coercion The advent of the `tidyverse` has led to `tibble` objects becoming a common alternative to `data.frame` or `DataFrame` objects. Simple functions within `extraChIP` enable coercion from `GRanges`, `GInteractions` and `DataFrame` objects to tibble objects, with list columns correctly handled. By default these coercion functions will coerce `GRanges` elements to a character vector. Similarly, `GRanges` represented as a character column can be coerced to the ranges element of a `GRanges` object. First let's coerce from a `tibble` (or S3 `data.frame`) to a `GRanges` ```{r gr} library(tidyverse) library(extraChIPs) set.seed(73) df <- tibble( range = c("chr1:1-10:+", "chr1:5-10:+", "chr1:5-6:+"), gene_id = "gene1", tx_id = paste0("transcript", 1:3), score = runif(3) ) df gr <- colToRanges(df, "range") gr ``` Coercion back to a `tibble` will place the ranges as a character column by default. However, this can be turned off and the conventional coercion from `as.data.frame` will instead be applied, internally wrapped in `as_tibble()` ```{r as-tibble} as_tibble(gr) as_tibble(gr, rangeAsChar = FALSE) ``` A simple feature which may be useful for printing gene names using `rmarkdown` is contained in `collapseGenes()`. Here a character vector of gene names is collapsed into a `glue` object of length `, with gene names rendered in italics by default. ```{r collapse-gene, results='asis'} gn <- c("Gene1", "Gene2", "Gene3") collapseGenes(gn) ``` # Formation of Consensus Peaks The formation of consensus peaks incorporating ranges from multiple replicates is a key part of many ChIP-Seq analyses. A common format returned by tools such as `mcas2 callpeak` is the `narrowPeak` (or `broadPeak`) format. Sets of these across multiple replicates can imported using the function `importPeaks()`, which returns a `GRangesList()` ```{r load-peaks} fl <- system.file( c("extdata/ER_1.narrowPeak", "extdata/ER_2.narrowPeak"), package = "extraChIPs" ) peaks <- importPeaks(fl) names(peaks) <- c("ER_1", "ER_2") ``` In the above we have loaded the peaks from two replicates from the Estrogen Receptor (ER) in T-47D cells. To form consensus peaks we can place a restriction on the number of replicates an overlapping peak needs to appear in. By default, the function `makeConsensus()` sets the proportion to be zero `(p = 0)` so all peaks are retained. Note that a logical vector/column is returned for each replicate, along with the number of replicates the consensus peak is derived from. ```{r make-consensus1} makeConsensus(peaks) ``` However, we may wish for peaks to appear in all replicates, so we can set the argument `p = 1` (or any other value $0 \leq p \leq 1$). ```{r make-consensus2} makeConsensus(peaks, p = 1) ``` In addition, we may wish to keep the values from the `mcols` as part of our consensus peaks, such as the `qValue`. This can be specified using the `var` argument, and will return a `CompressedList` as returned by `reduceMC()`, seen below. ```{r make-consensus-var} makeConsensus(peaks, p = 1, var = "qValue") ``` We could even identify the peak centre and pass these to the set of consensus peaks for downstream peak re-centreing. ```{r make-consensus-mutate} library(plyranges) peaks %>% endoapply(mutate, centre = start + peak) %>% makeConsensus(p = 1, var = "centre") ``` We could then find the mean of the peak centres for an averaged centre. ```{r make-consensus-centre} peaks %>% endoapply(mutate, centre = start + peak) %>% makeConsensus(p = 1, var = "centre") %>% mutate(centre = vapply(centre, mean, numeric(1))) ``` # Simple Operations Retaining `mcols()` The standard set operations implemented in the package `GenomicRanges` will always drop the `mcols` element by default. The `extraChIPs` functions `reduceMC()`, `setdiffMC()`, `intersectMC()` and `unionMC()` all produce the same output as their similarly-named functions, however, the `mcols()` elements in the query object are also returned. Where required, columns are coerced into `CompressedList` columns. This can particularly useful when needed to propagate the information contained in the initial ranges through to subsequent analytic steps ## Simplifying single `GRanges` objects The classical approach to defining TSS regions for a set of transcripts would be to use the function `resize`()`, setting the width as 1. ```{r tss} tss <- resize(gr, width = 1) tss ``` As we can see, two transcripts start at position 5, so we may choose to reduce this, which would lose the `mcols` element. The alternative `reduceMC()` will retain all `mcols`. ```{r reduce} GenomicRanges::reduce(tss) reduceMC(tss) ``` By default, this function will attempt to coerce `mcols` to a new `mcol` of the appropriate type, however, when multiple values are inevitable such as for the `tx_id` column above, these will be coerced to a `CompressedList`. The simplification of the multiple values seen in the `gene_id` can also be turned off if desired should repeated values be important for downstream analysis. ```{r reduce-no-simplify} reduceMC(tss, simplify = FALSE) ``` This allows for simple integration with `tidyverse` nesting strategies. ```{r reduce-unnest} reduceMC(tss, simplify = FALSE) %>% as_tibble() %>% unnest(everything()) ``` Whilst `reduceMC` relies on the range-reduction as implemented in `GenomicRanges::reduce()`, some alternative approaches are included, such as `chopMC()`, which finds identical ranges and nests the mcols element as `CompressedList` objects. ```{r chop} chopMC(tss) ``` In the case of the object `tss`, this output is identical to `reduceMC()`, however, given that there are no identical ranges in `gr` the two functions would behave very differently on that object. A final operation for simplifying `GRanges` objects would be `distinctMC()` which is a wrapper to `dplyr]::distinct` incorporating both the range and `mcols`. The columns to search can be called using `` approaches as detailed in the manual. ```{r distinct} distinctMC(tss) distinctMC(tss, gene_id) ``` ## Set Operations with Two `GRanges` Objects Whilst `reduce/reduceMC` is applied to a single `GRanges` object, the set operation functions `intersect`, `setdiff` and `union` are valuable approaches for comparing ranges. Using the `*MC()` functions will retain `mcols` elements from the *query range*. ```{r setopts} peaks <- GRanges(c("chr1:1-5", "chr1:9-12:*")) peaks$peak_id <- c("peak1", "peak2") GenomicRanges::intersect(gr, peaks, ignore.strand = TRUE) intersectMC(gr, peaks, ignore.strand = TRUE) setdiffMC(gr, peaks, ignore.strand = TRUE) unionMC(gr, peaks, ignore.strand = TRUE) ``` There is a performance overhead to preparation of mcols as `CompressedList` objects and all `mcols` in the query object will be returned. If only wishing to retain a subset of `mcols`, these should be selected prior to passing to these functions. ```{r setopts-plyranges} gr %>% select(tx_id) %>% intersectMC(peaks, ignore.strand = TRUE) ``` ## Overlapping proportions Whilst the functions `findOverlaps()` and `overlapsAny()` are extremely useful, the addition of `propOverlap()` returns a numeric vector with the proportion of each query range (`x`) which overlaps any range in the subject range (`y`) ```{r prop-overlaps} propOverlap(gr, peaks) ``` This is also extended to enable comparison across multiple features and to classify each peak by which features that it overlaps the most strongly. ```{r best-overlap} bestOverlap(gr, peaks, var = "peak_id") ``` # More Complex Operations ## Range Partitioning In addition to standard set operations, one set of ranges can be used to partition another set of ranges, returning `mcols` from both ranges. Ranges from the query range (`x`) are returned after being partitioned by the ranges in the subject range (`y`). Subject ranges used for partitioning *must be non-overlapping*, and if overlapping ranges are provided, these will be reduced prior to partitioning. This enables the identification of specific ranges from the query range (`x`) which overlap ranges from the subject range (`y`) Under this approach, `mcols` from both `query` and `subject` ranges will be returned to enable the clear ranges which are common and distinct within the two sets of ranges. ```{r partition-ranges} partitionRanges(gr, peaks) ``` Whilst this shares some similarity with `intersectMC()` the additional capabilities provide greater flexibility. ```{r partition-ranges-subset} partitionRanges(gr, peaks) %>% subset(is.na(peak_id)) ``` ## Stitching Ranges Using the function `stitchRanges()` we are able to join together sets of nearby ranges, but with the option of placing clear barriers between ranges, across which ranges cannot be joined. This may be useful if joining enhancers to form putative super-enhancers, but explicitly omitting defined promoter regions. ```{r stitch-ranges} enh <- GRanges(c("chr1:1-10", "chr1:101-110", "chr1:181-200")) prom <- GRanges("chr1:150:+") se <- stitchRanges(enh, exclude = prom, maxgap = 100) se ``` As a visualisation (below) ranges within 100bp were stitched together, however the region defined as a 'promoter' acted as a barrier and ranges were not stitched together across this barrier. ```{r plot-stitched, echo = FALSE, fig.height=3, fig.width=8} knitr::include_graphics("stitched.png") ``` # Session Info ```{r} sessionInfo() ```