--- title: "Matching case study II: CTCF orientation" author: "Eric S. Davis" date: "`r format(Sys.Date(), '%m/%d/%Y')`" output: rmarkdown::html_document: highlight: tango toc: true toc_float: true fig_width: 5 fig_height: 3 vignette: | %\VignetteIndexEntry{4. Matching case study II: CTCF orientation} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: chunk_output_type: console --- In this vignette, we demonstrate the generation of covariate-matched null ranges by using the `matchRanges()` function to test the "covergence rule" of CTCF-bound chromatin loops, first described in Rao et al. 2014. ## Background In the human genome, chromatin loops play an important role in regulating gene expression by connecting regulatory loci to gene promoters. The anchors of these loops are bound by the protein CTCF, which is required for loop formation and maintenance. CTCF binds a specific non-palindromic DNA sequence motif. And the direction of this motif at loop anchors is non-random. That is, CTCF binding motifs found at the anchors of loops tend to be oriented towards the center of the loop more often then would be expected by chance. This phenomenon, first described by Rao et al in 2014, is known as the “convergence rule”. It was initially discovered by first filtering for loops that had a single CTCF binding site at each end of the loop. Here we use matchRanges to reinvestigate the convergence using all loops. We will use `hg19_10kb_ctcfBoundBinPairs` data from the `nullrangesData` package which contains features from the GM12878 cell line aligned to hg19 as well as CTCF motif data from the [CTCF data package](https://bioconductor.org/packages/devel/data/annotation/html/CTCF.html) available on Bioconductor. `hg19_10kb_ctcfBoundBinPairs` is a [`GInteractions`](https://bioconductor.org/packages/release/bioc/html/InteractionSet.html) object with all pairwise combinations of CTCF-bound 10Kb bins within 1Mb annotated with the following features: - The total CTCF signal in each bin. - The number of CTCF sites in each bin. - The distance between bin pairs. - Whether at least one CTCF site is convergent between each bin pair. - The presence or absence of a loop between each bin pair. Using these annotations and the `matchRanges()` function, we can compare CTCF motif orientations between pairs of genomic regions that are 1) connected by loops, 2) not connected by loops, 3) randomly chosen, or 4) not connected by loops, but matched for the strength of CTCF sites and distance between loop anchors. ```{r, message=FALSE, warning=FALSE, echo=FALSE, fig.width=8.5, fig.height=6.5, eval=FALSE} ## Define colors colors <- c("#e19995", "#adaf64", "#4fbe9b", "#6eb3d9", "#d098d7") ## Create artificial GInteractions library(InteractionSet) set.seed(5) pool <- GInteractions( anchor1 = GRanges(seqnames = "chr1", ranges = IRanges(start = sample(1:990, 120, replace = TRUE), width = 10)), anchor2 = GRanges(seqnames = "chr1", ranges = IRanges(start = sample(1:990, 120, replace = TRUE), width = 10)), mode = "strict", color = sample(1:5, 120, replace = TRUE) ) focal <- GInteractions( anchor1 = GRanges(seqnames = "chr1", ranges = IRanges(start = sample(1:990, 16, replace = TRUE), width = 10)), anchor2 = GRanges(seqnames = "chr1", ranges = IRanges(start = sample(1:990, 16, replace = TRUE), width = 10)), mode = "strict", color = sample(1:5, 16, replace = TRUE) ) ## Add distance to metadata pool$distance <- pairdist(pool) focal$distance <- pairdist(focal) ## Match ranges library(nullranges) set.seed(123) x <- matchRanges(focal = focal, pool = pool, covar = ~color + distance, method = 'n', replace = TRUE) ## Visualize sets library(plotgardener) library(grid) set.seed(123) pageCreate(width = 8.5, height = 6.5, showGuides = FALSE, xgrid = 0, ygrid = 0) ## Define common parameters p <- pgParams(chrom = "chr1", chromstart = 1, chromend = 1000) ## Pool set poolSet <- plotPairs(data = pool, params = p, fill = colors[pool$color], x = 1, y = 1.25, width = 2.5, height = 2.25) annoGenomeLabel(plot = poolSet, x = 1, y = 3.55) plotText(label = "Pool Set", x = 2.25, y = 0.9, just = c("center", "bottom"), fontcolor = "#33A02C", fontface = "bold", fontfamily = 'mono') ## Focal set focalSet <- plotPairs(data = focal, params = p, fill = colors[focal$color], x = 5, y = 1.1, width = 2.5, height = 0.9) annoGenomeLabel(plot = focalSet, x = 5, y = 2.05) plotText(label = "Focal Set", x = 6.25, y = 0.9, just = c("center", "bottom"), fontcolor = "#1F78B4", fontface = "bold", fontfamily = 'mono') ## Matched set matchedSet <- plotPairs(data = x, params = p, fill = colors[x$color], x = 5, y = 2.6, width = 2.5, height = 0.9) annoGenomeLabel(plot = matchedSet, x = 5, y = 3.55) plotText(label = "Matched Set", x = 6.25, y = 2.50, just = c("center", "bottom"), fontcolor = "#A6CEE3", fontface = "bold", fontfamily = 'mono') ## Arrow and matchRanges label plotSegments(x0 = 3.5, y0 = 3, x1 = 5, y1 = 3, arrow = arrow(type = "closed", length = unit(0.1, "inches")), fill = "black", lwd = 2) plotText(label = "matchRanges()", fontfamily = 'mono', x = 4.25, y = 2.9, just = c("center", "bottom")) ## Matching plots library(ggplot2) smallText <- theme(legend.title = element_text(size=8), legend.text=element_text(size=8), title = element_text(size=8), axis.title.x = element_text(size=8), axis.title.y = element_text(size=8)) plot1 <- plotPropensity(x, sets=c('f','m','p')) + smallText + theme(legend.key.size = unit(0.5, 'lines'), title = element_blank()) plot2 <- plotCovariate(x=x, covar=covariates(x)[1], sets=c('f','m','p')) + smallText + theme(legend.text = element_blank(), legend.position = 'none') plot3 <- plotCovariate(x=x, covar=covariates(x)[2], sets=c('f','m','p'))+ smallText + theme(legend.key.size = unit(0.5, 'lines')) ## Propensity scores plotText(label = "plotPropensity()", x = 1.10, y = 4.24, just = c("left", "bottom"), fontface = "bold", fontfamily = 'mono') plotText(label = "~color + distance", x = 1.25, y = 4.5, just = c("left", "bottom"), fontsize = 10, fontfamily = "mono") plotGG(plot = plot1, x = 1, y = 4.5, width = 2.5, height = 1.5, just = c("left", "top")) ## Covariate balance plotText(label = "plotCovariate()", x = 3.75, y = 4.24, just = c("left", "bottom"), fontface = "bold", fontfamily = "mono") plotText(label = covariates(x), x = c(4, 5.9), y = 4.5, just = c("left", "bottom"), fontsize = 10, fontfamily = "mono") plotGG(plot = plot2, x = 3.50, y = 4.5, width = 1.8, height = 1.5, just = c("left", "top")) plotGG(plot = plot3, x = 5.30, y = 4.5, width = 2.75, height = 1.5, just = c("left", "top")) ``` ## Matching with `matchRanges()` Before we generate our null ranges, let's take a look at our example dataset: ```{r, message=FALSE, warning=FALSE} library(nullrangesData) ## Load example data binPairs <- hg19_10kb_ctcfBoundBinPairs() binPairs ``` Let's start by defining our focal set (i.e. looped bin-pairs), our pool set (i.e un-looped bin-pairs), and our covariates of interest (i.e. `ctcfSignal` and `distance`): ```{r, message=FALSE, warning=FALSE} library(nullranges) set.seed(123) mgi <- matchRanges(focal = binPairs[binPairs$looped], pool = binPairs[!binPairs$looped], covar = ~ctcfSignal + distance + n_sites, method = 'stratified') mgi ``` When the focal and pool arguments are `GInteractions` objects, `matchRanges()` returns a `MatchedGInteractions` object. The `MatchedGInteractions` class extends `GInteractions` so all of the same operations can be applied: ```{r, message=FALSE, warning=FALSE} library(plyranges) library(ggplot2) ## Summarize ctcfSignal by n_sites mgi %>% regions() %>% group_by(n_sites) %>% summarize(ctcfSignal = mean(ctcfSignal)) %>% as.data.frame() %>% ggplot(aes(x = n_sites, y = ctcfSignal)) + geom_line() + geom_point(shape = 21, stroke = 1, fill = 'white') + theme_minimal() + theme(panel.border = element_rect(color = 'black', fill = NA)) ``` ## Assessing quality of matching We can get a quick summary of the matching quality with `overview()`: ```{r} ov <- overview(mgi) ov ``` In addition to providing a printed overview, the overview data can be extracted for convenience. For example, the `quality` property shows the absolute value of the mean difference between focal and matched sets. Therefore, the lower this value, the better the matching quality: ```{r} ov$quality ``` ### Visualizing matching results Let's visualize overall matching quality by plotting propensity scores for the focal, pool, and matched sets: ```{r, message=FALSE} plotPropensity(mgi, sets = c('f', 'p', 'm')) ``` Log transformations can be applied to 'x', 'y', or both (`c('x', 'y')`) for plotting functions to make it easier to assess quality. It is clear that the matched set is very well matched to the focal set: ```{r} plotPropensity(mgi, sets = c('f', 'p', 'm'), log = 'x') ``` We can ensure that covariate distributions have been matched appropriately by using the `covariates()` function to extract matched covariates along with `patchwork` and `plotCovarite` to visualize all distributions: ```{r, message=FALSE, warning=FALSE, fig.height=8, fig.width=5} library(patchwork) plots <- lapply(covariates(mgi), plotCovariate, x=mgi, sets = c('f', 'm', 'p')) Reduce('/', plots) ``` ## Compare CTCF site orientation Using our matched ranges, we can compare the percent of looped pairs with at least one convergent CTCF site against unlooped pairs, randomly selected pairs, and pairs that are unlooped but have been matched for our covariates. The accessor function `focal()` and `pool()` can be used to conveniently extract these matched sets: ```{r, fig.width=6, fig.height=5} ## Generate a randomly selected set from all binPairs all <- c(focal(mgi), pool(mgi)) set.seed(123) random <- all[sample(1:length(all), length(mgi), replace = FALSE)] ## Calculate the percent of convergent CTCF sites for each group g1 <- (sum(focal(mgi)$convergent) / length(focal(mgi))) * 100 g2 <- (sum(pool(mgi)$convergent) / length(pool(mgi))) * 100 g3 <- (sum(random$convergent) / length(random)) * 100 g4 <- (sum(mgi$convergent) / length(mgi)) * 100 ## Visualize barplot(height = c(g1, g2, g3, g4), names.arg = c('looped\n(focal)', 'unlooped\n(pool)', 'all pairs\n(random)', 'unlooped\n(matched)'), col = c('#1F78B4', '#33A02C', 'orange2', '#A6CEE3'), ylab = "Convergent CTCF Sites (%)", main = "Testing the Convergence Rule", border = NA, las = 1) ``` As shown above the convergent rule holds true even when controlling for CTCF signal strength and bin pair distance. Greater than 90% of looped pairs contain convergent CTCF sites, compared to only ~55% among non-looped subsets. # Session information ```{r} sessionInfo() ```