---
title: "Introduction to bootRanges"
author: "Wancen Mu"
date: "`r format(Sys.Date(), '%m/%d/%Y')`"
bibliography: library.bib
output:
  rmarkdown::html_document:
    highlight: tango
    toc: true
    toc_float: true	
vignette: |
  %\VignetteIndexEntry{1. Introduction to bootRanges}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  markdown: 
    wrap: 72
---

# Introduction 

```{r setup, include=FALSE}
knitr::opts_chunk$set(fig.width=5, fig.height=5, message=FALSE, warning=FALSE)
```

In this vignette, we demonstrate the block bootstrap functionality
implemented in *nullranges*. See the main nullranges vignette for an
overview of the idea of bootstrapping, or the diagram below.

*nullranges* contains an implementation of a block bootstrap for
genomic data, as proposed by @bickel_2010,
such that features (ranges) are sampled from the genome in
blocks. The original block bootstrapping algorithm for genomic data is
implemented in a python software called Genome Structure Correlation,
[GSC](https://github.com/ParkerLab/encodegsc).

Our description of the *bootRanges* methods is described in @bootRanges.

# Quick start

Minimal code for running `bootRanges()` is shown below. Genome
segmentation `seg` and excluded regions `exclude` are optional.

```{r eval=FALSE}
eh <- ExperimentHub()
ah <- AnnotationHub()
seg <- eh[["EH7307"]] # genome segmentation for hg38
exclude <- ah[["AH107305"]] # Kundaje excluded regions for hg38, see below
set.seed(5) # set seed for reproducibility
blockLength <- 5e5 # size of blocks to bootstrap
R <- 10 # number of iterations of the bootstrap
boots <- bootRanges(ranges, blockLength=blockLength, R=R,
                    seg=seg, exclude=exclude)
# `boots` can then be used with plyranges commands
```

# Method overview

Several algorithms are implemented in `bootRanges()`, including
segmented or not, where in the segmented version, blocks are sampled
with respect to a particular genome segmentation.
Overall, we recommend segmented block bootstrap given the
heterogeneity of structure across the entire genome. If the purpose is
block bootstrapping ranges within a smaller set of sequences, such as
motifs within transcript sequence, then the unsegmented algorithm
would be sufficient. 

In a segmented block bootstrap, the blocks are sampled and placed within
regions of a genome *segmentation*. That is, for a genome segmented into
states $1,2, \dots, S$, blocks from state *s* will be used to tile the ranges
of state *s* in each bootstrap sample. The process can be visualized in
(A), a block with length $L_b$ is randomly sampled with replacement
from state "red" and the features (ranges) that overlap this block are
then copied to the first tile (which is in the "red" state). The
sampling is allowed across chromosome (as shown here), as long as the
two blocks are in the same state.

An example workflow of `bootRanges()` used in combination with
*plyranges* [@Lee2019] is diagrammed in (B), and can be summarized as: 

1.  Compute statistics of interest between *GRanges* of feature $x$
    and *GRanges* of feature $y$ to assess association in the original
    data. This could be an enrichment (amount of overlap) or other
    possible statistics making use of covariates associated with each
    range 
2.  `bootRanges()` with optional arguments `seg` (segmentation) and
    `exclude` (excluded regions as compiled by @excluderanges) to
    create a *BootRanges* object ($y'$)
3.  Compute bootstrap distribution of test statistics between
    *GRanges* of feature $x$ and $y'$ using *plyranges*
4.  A bootstrap p-value or $z$ test can be performed for testing the
    null hypothesis that there is no true biological enrichment of the
    original data (that bootstrap data often has as high an enrichment
    as the observed data)

```{r, echo=FALSE}
knitr::include_graphics("images/bootRanges.jpeg")
```

In this vignette, we give an example of segmenting the hg38 genome by
Ensembl gene density, create bootstrapped peaks and evaluate overlaps
for observed peaks and bootstrap peaks in two types of statistics,
then we profile the time of different algorithms to generate a single
block bootstrap sample. Finally, we use toy datasets to visualize what
a segmented and unsegmented block bootstrap sample looks like.

A finally consideration is whether the blocks should scale
proportionally to the segment state length, with the default setting
of `proportionLength=TRUE`. When blocks scale proportionally,
`blockLength` provides the maximal length of a block, while the actual
block length used for a segmentation state is proportional to the
fraction of genomic basepairs covered by that state. It is
theoretically motivated to have the blocks scale with the overall
extent of the segment state. However, in practice, if the genome
segmentation states are very heterogeneous in size (e.g. orders of
magnitude differences), then the blocks constructed via the
proportional length method for the smaller segmentation states can be
too short to effectively capture inter-range distances. We therefore
recommend proportional length blocks unless some segmentation states
have a much smaller extent than others, in which case fixed length
blocks can be used. This option is visualized on toy data at the end
of this vignette.

# Segmented block bootstrap

## Case study I: DHS

### Import excluded regions

To avoid placing bootstrap features into regions
of the genome that don’t typically have features, we import
excluded regions including ENCODE-produced excludable
regions[@encode_exclude], telomeres from UCSC, centromeres. These, and
other excludable sets, are assembled in the
[excluderanges](https://dozmorovlab.github.io/excluderanges/) package
[@excluderanges].

```{r excluderanges}
suppressPackageStartupMessages(library(AnnotationHub))
ah <- AnnotationHub()
# hg38.Kundaje.GRCh38_unified_Excludable
exclude_1 <- ah[["AH107305"]]
# hg38.UCSC.centromere
exclude_2 <- ah[["AH107354"]]
# hg38.UCSC.telomere
exclude_3 <- ah[["AH107355"]]
# hg38.UCSC.short_arm
exclude_4 <- ah[["AH107356"]]
# combine them
suppressWarnings({
  exclude <- trim(c(exclude_1, exclude_2, exclude_3, exclude_4))
})
exclude <- sort(GenomicRanges::reduce(exclude))
```

### Segmentations choices

Most of the datasets we examine, the density of ranges of interest
(e.g. ChIP- or ATAC-seq peaks) is often correlated to other
large-scale patterns of other genomic features, such as
genes. @bickel_2010 
motivate the idea of bootstrapping with respect to
a segmented genome given known, large-scale genomic structures such as
isochores ("larger than 300kb").

$\color{brown}{\text{A genomic segmentation can be considered if}}$
1) it defines large (e.g. on the order of ∼1 Mb), relatively
homogeneous segments and 2) the variance of the distribution of the
test statistics become stable as block length increases (See
@bootRanges Fig 2 A). 

There are two options to build segmentation *GRanges*, either:

- Use exiting segmentation (e.g. ChromHMM, etc.) downloaded from
  AnnotationHub or external to Bioconductor (BED files imported with
  *rtracklayer*)
- Perform a de-novo segmentation of the genome using feature density,
  e.g. gene density 

**Pre-built segmentations**

Given that these evaluations take time and involve consideration of
multiple criteria, we have provided our recommended segmentation for
hg38. *nullranges* has generated pre-built segmentations for easy use
by following below section *Segmentation by gene density*
Either pre-built segmentations using *CBS* or *HMM* methods with
$L_s=2e6$ considering excludable regions can be selected from
*ExperimentHub*. We find that the segmentation and block length shown
here could be used for most analysis of hg38.

```{r, message=FALSE, warning=FALSE}
suppressPackageStartupMessages(library(ExperimentHub))
eh <- ExperimentHub()
seg_cbs <- eh[["EH7307"]]
seg_hmm <- eh[["EH7308"]]
seg <- seg_cbs
```

**Segmentation by gene density**

This section describes how we generate pre-built segmentations. First
we obtain the Ensembl genes [@ensembl2021] for segmenting by gene 
density. We obtain these using the *ensembldb* package [@ensembldb].

```{r}
suppressPackageStartupMessages(library(ensembldb))
suppressPackageStartupMessages(library(EnsDb.Hsapiens.v86))
edb <- EnsDb.Hsapiens.v86
filt <- AnnotationFilterList(GeneIdFilter("ENSG", "startsWith"))
g <- genes(edb, filter = filt)
```

We perform some processing to align the sequences (chromosomes) of `g`
with our excluded regions and our features of interest (DNase
hypersensitive sites, or DHS, defined below).

```{r}
library(GenomeInfoDb)
g <- keepStandardChromosomes(g, pruning.mode = "coarse")
seqlevels(g, pruning.mode="coarse") <- setdiff(seqlevels(g), "MT")
# normally we would assign a new style, but for recent host issues
## seqlevelsStyle(g) <- "UCSC" 
seqlevels(g) <- paste0("chr", seqlevels(g))
genome(g) <- "hg38"
g <- sortSeqlevels(g)
g <- sort(g)
table(seqnames(g))
```

**CBS segmentation**

We first demonstrate the use a CBS segmentation as implemented in
*DNAcopy* [@dnacopy].

We load the *nullranges* and *plyranges* packages, and *patchwork* in
order to produce grids of plots.

```{r}
library(nullranges)
suppressPackageStartupMessages(library(plyranges))
library(patchwork)
```

We subset the excluded ranges to those which are 500 bp or larger. The
motivation for this step is to avoid segmenting the genome into many
small pieces due to an abundance of short excluded regions. Note that we
re-save the excluded ranges to `exclude`.

Here, and below, we need to specify `plyranges::filter` as it conflicts
with `filter` exported by *ensembldb*.

```{r seg-cbs, fig.width=5, fig.height=4}
set.seed(5)
exclude <- exclude %>%
  plyranges::filter(width(exclude) >= 500)
L_s <- 1e6
seg_cbs <- segmentDensity(g, n = 3, L_s = L_s,
                          exclude = exclude, type = "cbs")
plots <- lapply(c("ranges","barplot","boxplot"), function(t) {
  plotSegment(seg_cbs, exclude, type = t)
})
plots[[1]]
plots[[2]] + plots[[3]]
```

Note here, the default *ranges* plot gives whole genome and every
fractured bind regions represents state transformations happens.
However, some transformations within small ranges cannot be visualized,
e.g 1kb. If user want to look into specific ranges of segmentation
state, the *region* argument is flexible to support.

```{r, fig.width=4, fig.height=3}
region <- GRanges("chr16", IRanges(3e7,4e7))
plotSegment(seg_cbs, exclude, type="ranges", region=region)
```

**Alternatively: HMM segmentation**

Here we use an alternative segmentation implemented in the *RcppHMM*
CRAN package, using the `initGHMM`, `learnEM`, and `viterbi` functions.

```{r seg-hmm, fig.width=5, fig.height=4}
seg_hmm <- segmentDensity(g, n = 3, L_s = L_s,
                          exclude = exclude, type = "hmm")
plots <- lapply(c("ranges","barplot","boxplot"), function(t) {
  plotSegment(seg_hmm, exclude, type = t)
})
plots[[1]]
plots[[2]] + plots[[3]]
```

### Running bootranges

We use a set of DNase hypersensitivity sites (DHS) from the ENCODE
project [@encode] in A549 cell line (ENCSR614GWM). Here, for speed, we
work with a pre-processed data object from ExperimentHub, created using
the following steps:

-   Download ENCODE DNase hypersensitive peaks in A549 from
    *AnnotationHub*
-   Subset to standard chromosomes and remove mitochondrial DNA
-   Use a chain file from UCSC to lift ranges from hg19 to hg38
-   Sort the DHS features to be bootstrapped

These steps are included in *nullrangesData* in the
`inst/scripts/make-dhs-data.R` script. 

For speed of the vignette, we restrict to a smaller number of DHS,
filtering by the signal value. We also remove unrelated metadata
columns that we don't need for the bootstrap analysis. Because we are
interested in signal value for DHS peaks later, we only keep it
here. Consider, when creating bootstrapped data, that you will be
creating an object many times larger than your original features, so
$\color{brown}{\text{filtering and trimming}}$ extra metadata can help
make the analysis more efficient.

```{r}
suppressPackageStartupMessages(library(nullrangesData))
dhs <- DHSA549Hg38()
dhs <- dhs %>% plyranges::filter(signalValue > 100) %>%
  mutate(id = seq_along(.)) %>%
  plyranges::select(id, signalValue)
length(dhs)
table(seqnames(dhs))
```

Now we apply a segmented block bootstrap with blocks of size 500kb, to
the peaks. Here we show generation of 50 iterations of a block bootstrap
followed by a typical overlap analysis using *plyranges* [@Lee2019]. (We
might normally do 100 iterations or more, depending on the granularity
of the bootstrap distribution that is needed.)

```{r}
set.seed(5) # for reproducibility
R <- 50
blockLength <- 5e5
boots <- bootRanges(dhs, blockLength, R = R, seg = seg, exclude=exclude)
boots
```

What is returned here? The `bootRanges` function returns a
*bootRanges* object, which is a simple sub-class of *GRanges*. The
iteration (`iter`) and (optionally) the block length (`blockLength`)
are recorded as metadata columns, accessible via `mcols`. We return
the bootstrapped ranges as *GRanges* rather than *GRangesList*, as the
former is more compatible with downstream overlap joins with
*plyranges*, where the iteration column can be used with `group_by` to
provide per bootstrap summary statistics, as shown below.

Note that we use the `exclude` object from the previous step, which does
not contain small ranges. If one wanted to also avoid generation of
bootstrapped features that overlap small excluded ranges, then omit this
filtering step (use the original, complete `exclude` feature set).

### Assessing quality of bootstrap samples

We can examine properties of permuted y over iterations, and compare
to the original y. To do so, we first add the original features as
iter=0. Then compute summaries: 

```{r}
suppressPackageStartupMessages(library(tidyr))
combined <- dhs %>% 
  mutate(iter=0) %>%
  bind_ranges(boots) %>% 
  plyranges::select(iter)
stats <- combined %>% 
  group_by(iter) %>%
  summarize(n = n()) %>%
  as_tibble()
head(stats)
```

We can also look at distributions of various aspects, e.g. here the
inter-feature distance of features, across a few of the bootstraps and
the original feature set y. 

```{r, warning=FALSE}
suppressPackageStartupMessages(library(ggridges))
suppressPackageStartupMessages(library(purrr))
suppressPackageStartupMessages(library(ggplot2))
interdist <- function(dat) {
    x <- dat[-1,]
    y <- dat[-nrow(dat),]
    ifelse(x$seqnames == y$seqnames,
           x$start + floor((x$width - 1)/2) -
           y$start - floor((y$width - 1)/2), NA)
}
combined %>% plyranges::filter(iter %in% 0:3) %>%
  mutate(iter = droplevels(iter)) %>%
  plyranges::select(iter) %>%
  as.data.frame() %>% 
  nest(data = !iter) %>%
  mutate(interdist = map(data, interdist)) %>% 
  dplyr::select(iter, interdist) %>% 
  unnest(interdist) %>% 
  mutate(type = ifelse(iter == 0, "original", "boot"),
         interdist = pmax(interdist, 0)) %>%
  filter(!is.na(interdist)) %>%
  ggplot(aes(log10(interdist + 1), iter, fill=type)) +
  geom_density_ridges(alpha = 0.75)
  geom_text(data = head(stats, 4),
            aes(x=1.5, y=iter, label=paste0("n=",n), fill=NULL),
            vjust=1.5)
```

### Derive statistics of interest

Suppose we have a set of features `x` and we are interested in
evaluating the enrichment of this set with the
DHS. We can calculate for example the sum observed number of
overlaps for features in `x`  with DHS in whole genome (or something
more complicated, e.g. the maximum log fold change or signal value for
DHS peaks within a `maxgap` window of `x`). 

```{r}
x <- GRanges("chr2", IRanges(1 + 50:99 * 1e6, width=1e6), x_id=1:50)
```

### Statistic I: the total number of overlaps

```{r}
x <- x %>% mutate(n_overlaps = count_overlaps(., dhs))
sum( x$n_overlaps )
```

We can repeat this with the bootstrapped features using a `group_by`
command, a `summarize`, followed by a second `group_by` and `summarize`.
It may help to step through these commands one by one to understand what
the intermediate output is.

Note that we need to use `tidyr::complete` in order to fill in
combinations of `x` and `iter` where the overlap was 0.

```{r}
boot_stats <- x %>% join_overlap_inner(boots) %>%
  group_by(x_id, iter) %>%
  summarize(n_overlaps = n()) %>%
  as.data.frame() %>%
  complete(x_id, iter, fill=list(n_overlaps = 0)) %>%
  group_by(iter) %>%
  summarize(sumOverlaps = sum(n_overlaps))
```

The above code, first grouping by `x_id` and `iter`, then subsequently
by `iter` is general and allows for more complex analysis than just mean
overlap (e.g. how many times an `x` range has 1 or more overlap, what is
the mean or max signal value for peaks overlapping ranges in `x`, etc.).

If one is interested in assessing $\color{brown}{\text{feature-wise}}$ 
statistics instead of $\color{brown}{\text{genome-wise}}$ statistics, 
eg.,the mean observed number of overlaps per feature or mean base pair 
overlap in `x`, one can also group by both (`block`,`iter`). 10,000 total 
blocks may therefore be sufficient to derive a bootstrap distribution, 
avoiding the need to generate many bootstrap genomes of data.

Finally we can plot a histogram. In this case, as the `x` features were
arbitrary, our observed value falls within the distribution of sum
number of overlap bootstrapped peaks with $x$.

```{r boot-histI}
suppressPackageStartupMessages(library(ggplot2))
ggplot(boot_stats, aes(sumOverlaps)) +
  geom_histogram(binwidth=5)+
  geom_vline(xintercept = sum(x$n_overlaps), linetype = "dashed")
```

### Statistic II: the sum of signal value for nearby peaks

```{r}
x_obs <- x %>% join_overlap_inner(dhs,maxgap=1e3)
sum(x_obs$signalValue )
boot_stats <- x %>% join_overlap_inner(boots,maxgap=1e3)  %>%
  group_by(x_id, iter) %>%
  summarize(Signal = sum(signalValue)) %>%
  as_tibble() %>% 
  complete(x_id, iter, fill=list(Signal = 0)) %>%
  group_by(iter) %>%
  summarize(sumSignal = sum(Signal))
```

Still in this case, our observed value falls within the distribution
of bootstrapped statistics. 

```{r boot-histII}
ggplot(boot_stats, aes(sumSignal)) +
  geom_histogram()+
  geom_vline(xintercept = sum(x_obs$signalValue), linetype = "dashed")
```

## Case study II: Single cell multi-omics

Instead of deriving statistics of interest directly from *GRanges*
simple metadata column or genome position, count matrix from
*SummerizedExperiment* or *SingleCellExperiment* could also be
used. In this way, we have to extract count matrix from
*SingleCellExperiment* and save it in GRanges's metadata column as a
`NumericList()` format. Then perform the block bootstrap. One case
study is to assess the correlation of the all pairs of genes and
promoter peaks from Chromium Single Cell Multiome ATAC + Gene
Expression.

We demonstrate usage by loading 10X
[Multiome](https://raw.githack.com/bioFAM/MOFA2_tutorials/master/R_tutorials/10x_scRNA_scATAC.html). The
first two pre-processed steps are included in *nullrangesData* in the
`inst/scripts/make-multiome-data.R` script. Here, we aggregate 14
cell types across cluster-sample groups.

```{r eval=FALSE}
## split sparse count matrix into NumericList
# sc_rna <- rna_Granges[-which(rna.sd==0)] %>%
#   mutate(counts1 = NumericList(asplit(rna.scaled, 1))) %>% sort()
# sc_promoter <- promoter_Granges[-which(promoter.sd==0)] %>%
#   mutate(counts2 = NumericList(asplit(promoter.scaled, 1))) %>% sort()
suppressPackageStartupMessages(library(nullrangesData))
data("sc_rna")
sc_rna
sc_rna$counts1
nct <- length(sc_rna$counts1[[1]])
print(paste("There are", nct, "cell types"))
data("sc_promoter")

## bootstrap promoters
library(BSgenome.Hsapiens.UCSC.hg38)
genome <- BSgenome.Hsapiens.UCSC.hg38

seqlengths(sc_promoter) <- seqlengths(genome)[1:22] # pull chrom lens from USCS
R=50
bootranges <- bootRanges(sc_promoter,blockLength = 5e5,R=R,type = "bootstrap", withinChrom = F)
```

There are two option downstream pipelines to follow after generating a
bootranges object. One is use *Plyranges* in downstream analysis as
previous sections, another one is creating a
*tidySummarizedExperiment* object where each row save derived
interested statistics.

* Plyranges: 

We can compute the mean correlation of the all pairs of genes and promoter peaks over iterations and compare to the observed statistic. The histogram below shows that our observed value falls far from the distribution of bootstrapped statistics, as we expected for this real data.

```{r eval=FALSE}
cor_gr <- sc_rna %>% join_overlap_inner(sc_promoter, maxgap=1000) %>%
  mutate(rho = 1/(nct-1) * sum(counts1 * counts2)) %>%
  summarise(meanCor = mean(rho))
## mean correlation distribution
cor_boot <- sc_rna %>%
  join_overlap_inner(bootranges, maxgap=1000) %>%
  # vectorized code are 10 times faster than cor(counts1, counts2) for plyranges pipeline
  mutate(rho = 1/(nct-1) * sum(counts1 * counts2)) %>%
  select(rho, iter) %>%
  group_by(iter) %>%
  summarise(meanCor = mean(rho)) %>%
  as.data.frame()

ggplot(cor_boot, aes(meanCor)) +
  geom_histogram(binwidth=0.01)+
  geom_vline(xintercept = cor_gr$meanCor,linetype = "dashed")+
  theme_classic()
```

If one is interested in assessing significance of gene-promoter pairs per gene, one can also do `plyranges::select` (*gene*, *peak*, *rho*) and `plyranges::group_by` *gene*.

* tidySummarizedExperiment:

Create an RangedSE for each modality, then find the overlap using the below helper function.

```{r eval=FALSE}
library(tidySummarizedExperiment)
library(purrr)
# make an SE where each row is an overlap
makeOverlapSE <- function(se_rna, se_promoter) {
  idx <- rowRanges(se_rna) %>% join_overlap_inner(rowRanges(se_promoter),maxgap = 1000)
  assay_x <- assay(se_rna, "rna")[ idx$gene, ]
  assay_y <- assay(se_promoter, "promoter")[ idx$peak, ]
  # this is needed to build the SE
  rownames(assay_x) <- rownames(assay_y) <- seq_along( idx$gene )
  names(idx) <- seq_along( idx$gene )
  SummarizedExperiment(
    assays=list(x=assay_x, y=assay_y),
    rowRanges=idx
  )
}

# create SE for observed data
se_rna <- SummarizedExperiment(
  assays=list(rna=do.call(rbind,sc_rna$counts1)),
  rowRanges=sc_rna)
se_promoter <- SummarizedExperiment(
  assays=list(promoter=do.call(rbind,sc_promoter$counts2)),
  rowRanges=sc_promoter)

se <- makeOverlapSE(se_rna, se_promoter)
se <- se %>%
  as_tibble() %>%
  nest(data = -.feature) %>%
  mutate(rho = map(data,
                   function(data) data %>% summarize(rho = cor(x, y))
  )) %>%
  unnest(rho) %>%
  select(-data)

print(paste("mean correlation is", mean(se$rho)))

# create SE for bootranges data
se_promoter_boots <- SummarizedExperiment(
  assays=list(promoter=do.call(rbind,bootranges$counts2)),
  rowRanges=bootranges)
se_boots <- makeOverlapSE(se_rna, se_promoter_boots)

se_boots <- se_boots %>%
  as_tibble() %>%
  nest(data = -c(.feature,iter)) %>%
  # vectorized code similar to cor(counts1, counts2) for tidySE pipeline
  mutate(rho = map(data,
                   function(data) data %>% summarize(rho = cor(x, y))
  )) %>%
  unnest(rho) %>%
  select(-data)

cor_boot <- se_boots %>% group_by(iter) %>% 
  summarise(meanCor = mean(rho)) 
ggplot(cor_boot, aes(meanCor)) +
  geom_histogram(binwidth=0.01)+
  geom_vline(xintercept = mean(se$rho),linetype = "dashed")+
  theme_classic()
```

For more examples of combining `bootRanges` from *nullranges* with
*plyranges* piped operations, see the relevant chapter in the
[tidy-ranges-tutorial](https://nullranges.github.io/tidy-ranges-tutorial/bootstrap-overlap.html)
book.

## Case study III: Block bootstrap one region

Generally, it makes sense to block bootstrap the entire genome at
once. This is motivated by the "tidy analysis" paradigm where loops
are avoided by stacking data into a longer format. This makes
computation more efficient in our case (as a single overlap call can
be made with all regions of interest at once, across multiple
bootstrap iterations), and it also can simplify code and avoid
repetition. 

However, in some cases, there is a single region of interest, and it
is desired to generate bootstrap data within this one region. For
this, we have a convenience function that enables bootstrap
computation.

Suppose we have data in the following region of chromosome 1:

```{r}
suppressPackageStartupMessages(library(nullrangesData))
dhs <- DHSA549Hg38()
region <- GRanges("chr1", IRanges(10e6 + 1, width=1e6))
x <- GRanges("chr1", IRanges(10e6 + 0:9 * 1e5 + 1, width=1e4))
y <- dhs %>% filter_by_overlaps(region) %>% select(NULL)
x %>% mutate(num_overlaps = count_overlaps(., y))
```

We can easily bootstrap data just in this region using the following
code:

```{r}
seg <- oneRegionSegment(region, seqlength=248956422)
y <- keepSeqlevels(y, "chr1")
set.seed(1)
boot <- bootRanges(y, blockLength=1e5, R=1, seg=seg,
                   proportionLength=FALSE)
boot
x %>% mutate(num_overlaps = count_overlaps(., boot))
```

Here it is important to use `proportionLength=FALSE` so that the
blocks will be of the size specified and not smaller (they would
otherwise be scaled down proportional to the fraction of `region`
compared to the chromosome).

# Visualizing bootstrap types

Below we present a toy example for visualizing the segmented block
bootstrap. First, we define a helper function for plotting *GRanges*
using *plotgardener* [@Kramer2022]. A key aspect here is that we color
the original and bootstrapped ranges by the genomic state (the state of
the segmentation that the original ranges fall in).

```{r}
suppressPackageStartupMessages(library(plotgardener))
my_palette <- function(n) {
  head(c("red","green3","red3","dodgerblue",
         "blue2","green4","darkred"), n)
}
plotGRanges <- function(gr) {
  pageCreate(width = 5, height = 5, xgrid = 0,
                ygrid = 0, showGuides = TRUE)
  for (i in seq_along(seqlevels(gr))) {
    chrom <- seqlevels(gr)[i]
    chromend <- seqlengths(gr)[[chrom]]
    suppressMessages({
      p <- pgParams(chromstart = 0, chromend = chromend,
                    x = 0.5, width = 4*chromend/500, height = 2,
                    at = seq(0, chromend, 50),
                    fill = colorby("state_col", palette=my_palette))
      prngs <- plotRanges(data = gr, params = p,
                          chrom = chrom,
                          y = 2 * i,
                          just = c("left", "bottom"))
      annoGenomeLabel(plot = prngs, params = p, y = 0.1 + 2 * i)
    })
  }
}
```

Create a toy genome segmentation:

```{r}
library(GenomicRanges)
seq_nms <- rep(c("chr1","chr2"), c(4,3))
seg <- GRanges(
  seqnames = seq_nms,
  IRanges(start = c(1, 101, 201, 401, 1, 201, 301),
          width = c(100, 100, 200, 100, 200, 100, 100)),
  seqlengths=c(chr1=500,chr2=400),
  state = c(1,2,1,3,3,2,1),
  state_col = factor(1:7)
)
```

We can visualize with our helper function:

```{r toysegments, fig.width=5, fig.height=4}
plotGRanges(seg)
```

Now create small ranges distributed uniformly across the toy genome:

```{r toyrangesI, fig.width=5, fig.height=4}
set.seed(1)
n <- 200
gr <- GRanges(
  seqnames=sort(sample(c("chr1","chr2"), n, TRUE)),
  IRanges(start=round(runif(n, 1, 500-10+1)), width=10)
)
suppressWarnings({
  seqlengths(gr) <- seqlengths(seg)
})
gr <- gr[!(seqnames(gr) == "chr2" & end(gr) > 400)]
gr <- sort(gr)
idx <- findOverlaps(gr, seg, type="within", select="first")
gr <- gr[!is.na(idx)]
idx <- idx[!is.na(idx)]
gr$state <- seg$state[idx]
gr$state_col <- factor(seg$state_col[idx])
plotGRanges(gr)
```

**Scaling vs. not scaling by segment length**

We can visualize block bootstrapped ranges when the blocks do not scale
to segment state length:

```{r toy-no-prop, fig.width=5, fig.height=4}
set.seed(1)
gr_prime <- bootRanges(gr, blockLength = 25, seg = seg,
                       proportionLength = FALSE)
plotGRanges(gr_prime)
```

This time the blocks scale to the segment state length. Note that in
this case `blockLength` is the *maximal* block length possible, but the
actual block lengths per segment will be smaller (proportional to the
fraction of basepairs covered by that state in the genome segmentation).

```{r toy-prop, fig.width=5, fig.height=4}
set.seed(1)
gr_prime <- bootRanges(gr, blockLength = 50, seg = seg,
                       proportionLength = TRUE)
plotGRanges(gr_prime)
```

Note that some ranges from adjacent states are allowed to be placed
within different states in the bootstrap sample. This is because, during
the random sampling of blocks of original data, a block is allowed to
extend beyond the segmentation region of the state being sampled, and
features from adjacent states are not excluded from the sampled block.

# Session information

```{r}
sessionInfo()
```

# References