--- title: "Hi-C arithmetic with plyinteractions" output: BiocStyle::html_document: self_contained: yes toc: true toc_float: true toc_depth: 4 code_folding: show date: "`r doc_date()`" package: "`r pkg_ver('plyinteractions')`" vignette: > %\VignetteIndexEntry{HiCarithmetic} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", crop = NULL ## Related to https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html ) options(width = 9999) ``` The `r Biocpkg("plyinteractions")` package facilitates data aggregation, for up to hundreds of thousands and even millions of genomic interactions. In this vignette, we explore several use cases which can arise when exploring Hi-C data stored in `pairs` files. We will use a real-life `pairs` file provided by the `4DN` Consortium. This file has been generated from processing Hi-C performed in mouse from brain cell primary culture during neural development (Bonev et al., Cell 2017). Pairs have been filtered to only those mapped over `chr13`. ```{r importPairs} library(tidyverse) library(plyinteractions) ## Importing it in R pairs_file <- HiContactsData::HiContactsData('mESCs', 'pairs.gz') pairs_df <- read.delim( pairs_file, sep = "\t", header = FALSE, comment.char = "#" ) |> set_names(c( "ID", "seqnames1", "start1", "seqnames2", "start2", "strand1", "strand2" )) pairs <- as_ginteractions( pairs_df, end1 = start1, end2 = start2, keep.extra.columns = TRUE ) pairs ``` # Estimating pairs filtering thresholds We can first *in silico* digest the mouse genome to obtain the coordinates of each genomic fragment after digestion by **DpnII and HinfI**. ```{r cutGenome} ## Prepare DpnII/HinfI-digested genomic fragments library(GenomicRanges) library(Biostrings) library(plyranges) genome <- BSgenome.Mmusculus.UCSC.mm10::BSgenome.Mmusculus.UCSC.mm10 cutter <- DNAStringSet(c("GATC", "GANTC")) ## DpnII/HinfI cutting site fragments <- BiocParallel::bplapply(BPPARAM = BiocParallel::MulticoreParam(workers = 8), names(genome), function(.x) { seq <- genome[[.x]] mids <- lapply( cutter, function(cutsite) { hits <- matchPattern(cutsite, seq, fixed = "subject") start(hits) + {end(hits) - start(hits)} } ) |> unlist() |> sort() GRanges(seqnames = .x, IRanges( start = c(1, mids), end = c(mids-1, length(seq)) )) } ) |> set_names(names(genome)) |> GRangesList() |> unlist() fragments$binID <- seq_along(fragments) ``` We can then use the `annotate()` function from `r Biocpkg("plyinteractions")` to recover, for each interaction, which restriction enzyme fragment each anchor overlaps with, and how many restriction enzyme cutting sites are found between them. ```{r annotatePairs} ## Annotate for each anchor set which genomic fragment it overlaps with annotated_pairs <- pairs |> plyinteractions::annotate(fragments, by = "binID") |> mutate(n_fragments = binID.2 - binID.1, group = paste0(strand1, strand2)) annotated_pairs ``` Next, we can plot the distribution of `strand1` and `strand2` cominations as a function of the number of restriction enzyme cutting sites between anchors of each interaction. ```{r getDistrib} df <- annotated_pairs |> head(n = 1e6) |> group_by(strand1, strand2, n_fragments) |> count() |> as_tibble() |> mutate(group = paste0(strand1, strand2)) |> select(group, n_fragments, n) ggplot(df, aes(x = n_fragments, y = n, group = group, col = group)) + geom_line() + geom_point() + xlim(c(0, 15)) + annotation_logticks(sides = 'l') + theme_bw() + labs( x = "Number of restriction sites between anchors", y = "Number of pairs" ) ``` From this distribution, we can see that `--` and `++` pairs have a decreasing frequency over increasing numbers of cut sites between anchors of each interaction. These pairs are unambiguous, as the orientation of each sequenced end can only come from true cutting and religation event, (except the set of `--` and `++` pairs which have `0` cut sites between each anchor, which cannot be explained); all these pairs can be kept. The over-representation of `+-` pairs at short distance likely represent uncut fragments subsequently sequenced on each end. The under-representation of `-+` pairs at short distance likely represent self-religated fragments. We can estimate a threshold for each of these pairs sets by computing the MAD and expected , as described in [Cournac et al., 2012](https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-13-436). ```{r getThresholds} filters <- df |> filter(n_fragments <= 50) |> arrange(n_fragments) |> group_by(n_fragments) |> mutate(median = median(n)) |> ungroup() |> mutate(MAD = median(abs(n - median))) |> mutate(withinMAD = abs(n - median) <= MAD / 0.67449) |> filter(withinMAD) |> slice_head(by = group, n = 1) |> select(group, n_fragments) |> dplyr::rename(threshold = n_fragments) filters ``` # Filtering pairs using appropriate thresholds ```{r filterPairs} annotated_pairs <- annotated_pairs |> mutate(threshold = left_join(as_tibble(mcols(annotated_pairs)), filters)$threshold) |> mutate(type = case_when( group %in% c('--', '++') & n_fragments < threshold ~ "excluded", group == '+-' & n_fragments < threshold ~ "uncut", group == '-+' & n_fragments < threshold ~ "religated", .default = "kept" )) mcols(annotated_pairs) |> as_tibble() |> count(type) |> mutate(n = scales::percent(n/sum(n))) filtered_pairs <- filter(annotated_pairs, type == 'kept') ``` # Computing distance law from pairs Another typical step when analyzing Hi-C processed data is the modeling of a so-called "distance law", (a.k.a "P(s)"), which describes the genomic distance-dependent contact frequency between pairs of genomic loci from a Hi-C experiment. We can easily recover the distance between the two anchors of each interaction (noted *s*) and plot the interaction frequency (noted *P(s)*) as a function of this genomic distance. ## Plotting distance law: first try ```{r Ps1} dat <- filtered_pairs |> mutate(s = abs(end2 - start1)) |> group_by(s) |> count(name = "n") |> as_tibble() |> mutate(Ps = n/sum(n)) p <- ggplot(dat, aes(x = s, y = Ps)) + geom_line() p ``` This is not very informative, as the distances span several orders of magnitude in both dimensions. ## Second try: switching to logarithmic scale Switching to a `log` scale in `r CRANpkg("ggplot2")` is very easy. ```{r Ps2} p + scale_x_log10() + scale_y_log10() + annotation_logticks() ``` ## Third try: aggregating data before plotting The previous P(s) plot is precise at the base-pair resolution. We can aggregate counts by binned distances: ```{r Ps3} # Calculate distance breaks evenly spaced on a log scale (base 1.1) x <- 1.1^(1:200-1) lmc <- coef(lm(c(1,1161443398)~c(x[1], x[200]))) bins_breaks <- unique(round(lmc[2]*x + lmc[1])) bins_widths <- lead(bins_breaks) - bins_breaks # Bin distances dat <- filtered_pairs |> mutate(s = abs(end2 - start1)) |> mutate( binned_s = bins_breaks[as.numeric(cut(s, bins_breaks))], bin_width = bins_widths[as.numeric(cut(s, bins_breaks))] ) |> group_by(binned_s, bin_width) |> count(name = "n") |> as_tibble() |> mutate(Ps = n / sum(n) / bin_width) # Plot results ggplot(dat, aes(x = binned_s, y = Ps)) + geom_line() + scale_x_log10() + scale_y_log10() + annotation_logticks() ``` ## With some polishing ```{r Ps4} ggplot(dat, aes(x = binned_s, y = Ps)) + geom_line() + scale_x_log10(limits = c(1e3, 1e8)) + ## This changes x axis to log scale scale_y_log10() + ## This changes y axis to log scale annotation_logticks() + ## This adds log ticks labs( x = "Genomic distance (s)", y = "P(s)", title = "Distance-dependent genomic frequency P(s) in mESC (chr. 13)" ) + ## This fixes axes titles theme_bw() ## This changes default plot theme ``` # Reproducibility `R` session information: ```{r sessioninfo, echo=FALSE} ## Session info library("sessioninfo") options(width = 120) session_info() ```