--- title: "Getting started with the plyranges package" author: "Stuart Lee" package: plyranges date: "`r Sys.Date()`" output: BiocStyle::html_document: toc_float: true vignette: > %\VignetteIndexEntry{Introduction} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} --- # `Ranges` revisted In Bioconductor there are two classes, `IRanges` and `GRanges`, that are standard data structures for representing genomics data. Throughout this document I refer to either of these classes as `Ranges` if an operation can be performed on either class, otherwise I explicilty mention if a function is approrpriate for an `IRanges` or `GRanges`. `Ranges` objects can either represent sets of integers as `IRanges` (which have start, end and width attributes) or represent genomic intervals (which have additional attributes, sequence name, and strand) as `GRanges`. In addition, both types of `Ranges` can store information about their intervals as metadata columns (for example GC content over a genomic interval). `Ranges` objects follow the tidy data principle: each row of a `Ranges` object corresponds to an interval, while each column will represent a variable about that interval, and generally each object will represent a single unit of observation (like gene annotations). Consequently, `Ranges` objects provide a powerful representation for reasoning about genomic data. In this vignette, you will learn more about `Ranges` objects and how via grouping, restriction and summarisation you can perform common data tasks. # Constructing `Ranges` To construct an `IRanges` we require that there are at least two columns that represent at either a starting coordinate, finishing coordinate or the width of the interval. ```{r} suppressPackageStartupMessages(library(plyranges)) set.seed(100) df <- data.frame(start=c(2:-1, 13:15), width=c(0:3, 2:0)) # produces IRanges rng <- df %>% as_iranges() rng ``` To construct a `GRanges` we require a column that represents that sequence name ( contig or chromosome id), and an optional column to represent the strandedness of an interval. ```{r} # seqname is required for GRanges, metadata is automatically kept grng <- df %>% transform(seqnames = sample(c("chr1", "chr2"), 7, replace = TRUE), strand = sample(c("+", "-"), 7, replace = TRUE), gc = runif(7)) %>% as_granges() grng ``` # Arithmetic on Ranges Sometimes you want to modify a genomic interval by altering the width of the interval while leaving the start, end or midpoint of the coordinates unaltered. This is achieved with the `mutate` verb along with `anchor_*` adverbs. The act of anchoring fixes either the start, end, center coordinates of the `Range` object, as shown in the figure and code below and anchors are used in combination with either `mutate` or `stretch`. ```{r, echo = FALSE, out.width="400px", fig.align="center"} knitr::include_graphics("anchors.png", dpi = 150) ``` ```{r} rng <- as_iranges(data.frame(start=c(1, 2, 3), end=c(5, 2, 8))) grng <- as_granges(data.frame(start=c(1, 2, 3), end=c(5, 2, 8), seqnames = "seq1", strand = c("+", "*", "-"))) mutate(rng, width = 10) mutate(anchor_start(rng), width = 10) mutate(anchor_end(rng), width = 10) mutate(anchor_center(rng), width = 10) mutate(anchor_3p(grng), width = 10) # leave negative strand fixed mutate(anchor_5p(grng), width = 10) # leave positve strand fixed ``` Similarly, you can modify the width of an interval using the `stretch` verb. Without anchoring, this function will extend the interval in either direction by an integer amount. With anchoring, either the start, end or midpoint are preserved. ```{r} rng2 <- stretch(anchor_center(rng), 10) rng2 stretch(anchor_end(rng2), 10) stretch(anchor_start(rng2), 10) stretch(anchor_3p(grng), 10) stretch(anchor_5p(grng), 10) ``` `Ranges` can be shifted left or right. If strand information is available we can also shift upstream or downstream. ```{r} shift_left(rng, 10) shift_right(rng, 10) shift_upstream(grng, 10) shift_downstream(grng, 10) ``` # Grouping `Ranges` `plyranges` introduces a new class of `Ranges` called `RangesGrouped`, this is a similiar idea to the grouped `data.frame\tibble` in `dplyr`. Grouping can act on either the core components or the metadata columns of a `Ranges` object. It is most effective when combined with other verbs such as `mutate()`, `summarise()`, `filter()`, `reduce_ranges()` or `disjoin_ranges()`. ```{r} grng <- data.frame(seqnames = sample(c("chr1", "chr2"), 7, replace = TRUE), strand = sample(c("+", "-"), 7, replace = TRUE), gc = runif(7), start = 1:7, width = 10) %>% as_granges() grng_by_strand <- grng %>% group_by(strand) grng_by_strand ``` # Restricting `Ranges` The verb `filter` can be used to restrict rows in the `Ranges`. Note that grouping will cause the `filter` to act within each group of the data. ```{r} grng %>% filter(gc < 0.3) # filtering by group grng_by_strand %>% filter(gc == max(gc)) ``` We also provide the convience methods `filter_by_overlaps` and `filter_by_non_overlaps` for restricting by any overlapping `Ranges`. ```{r} ir0 <- data.frame(start = c(5,10, 15,20), width = 5) %>% as_iranges() ir1 <- data.frame(start = 2:6, width = 3:7) %>% as_iranges() ir0 ir1 ir0 %>% filter_by_overlaps(ir1) ir0 %>% filter_by_non_overlaps(ir1) ``` # Summarising `Ranges` The `summarise` function will return a `DataFrame` because the information required to return a `Ranges` object is lost. It is often most useful to use `summarise()` in combination with the `group_by()` family of functions. ```{r} ir1 <- ir1 %>% mutate(gc = runif(length(.))) ir0 %>% group_by_overlaps(ir1) %>% summarise(gc = mean(gc)) ``` # Joins, or another way at looking at overlaps between `Ranges` A join acts on two GRanges objects, a query and a subject. ```{r} query <- data.frame(seqnames = "chr1", strand = c("+", "-"), start = c(1, 9), end = c(7, 10), key.a = letters[1:2]) %>% as_granges() subject <- data.frame(seqnames = "chr1", strand = c("-", "+"), start = c(2, 6), end = c(4, 8), key.b = LETTERS[1:2]) %>% as_granges() ``` ```{r, echo = FALSE, fig.cap = "Query and Subject Ranges", message = FALSE} library(ggplot2) query_df <- as.data.frame(query)[, -6] query_df$key <- "Query" subject_df <- as.data.frame(subject)[, -6] subject_df$key <- "Subject" melted_ranges <- rbind(query_df, subject_df) ggplot(melted_ranges, aes(xmin = start, xmax = end, ymin = 1, ymax = 3)) + geom_rect() + facet_grid(key ~ .) + scale_x_continuous(breaks = seq(1, 10, by = 1)) + xlab("Position") + theme(axis.text.y = element_blank(), axis.ticks.y = element_blank(), axis.title.y = element_blank()) ``` The join operator is relational in the sense that metadata from the query and subject ranges is retained in the joined range. All join operators in the `plyranges` DSL generate a set of hits based on overlap or proximity of ranges and use those hits to merge the two datasets in different ways. There are four supported matching algorithms: _overlap_, _nearest_, _precede_, and _follow_. We can further restrict the matching by whether the query is completely _within_ the subject, and adding the _directed_ suffix ensures that matching ranges have the same direction (strand). ```{r echo = FALSE, out.width="600px"} knitr::include_graphics("olaps.png") ``` The first function, `join_overlap_intersect()` will return a `Ranges` object where the start, end, and width coordinates correspond to the amount of any overlap between the left and right input `Ranges`. It also returns any metadatain the subject range if the subject overlaps the query. ```{r} intersect_rng <- join_overlap_intersect(query, subject) intersect_rng ``` ```{r echo = FALSE, fig.cap="Intersect Join"} intersect_df <- as.data.frame(intersect_rng)[, -c(6,7)] intersect_df$key <- "Intersect Join" melted_ranges <- rbind(query_df, subject_df, intersect_df) melted_ranges$key <- factor(melted_ranges$key, levels = c("Query", "Subject", "Intersect Join")) ggplot(melted_ranges, aes(xmin = start, xmax = end, ymin = 1, ymax = 3)) + geom_rect() + facet_grid(key ~ .) + scale_x_continuous(breaks = seq(1, 10, by = 1)) + xlab("Position") + theme(axis.text.y = element_blank(), axis.ticks.y = element_blank(), axis.title.y = element_blank()) ``` The `join_overlap_inner()` function will return the `Ranges` in the query that overlap any `Ranges` in the subject. Like the `join_overlap_intersect()` function metadata of the subject `Range` is returned if it overlaps the query. ```{r} inner_rng <- join_overlap_inner(query, subject) inner_rng ``` ```{r, echo = FALSE, fig.cap = "Inner Join", message = FALSE} inner_df <- as.data.frame(inner_rng)[, -c(6,7)] inner_df$ymin <- c(1,4) inner_df$ymax <- c(3,6) inner_df$key <- "Inner Join" melted_ranges <- rbind(query_df, subject_df) melted_ranges$ymin <- 1 melted_ranges$ymax <- 3 melted_ranges <- rbind(melted_ranges, inner_df) melted_ranges$key <- factor(melted_ranges$key, levels = c("Query", "Subject", "Inner Join")) ggplot(melted_ranges, aes(xmin = start, xmax = end, ymin = ymin, ymax = ymax)) + geom_rect() + facet_grid(key ~ ., scales = "free_y") + scale_x_continuous(breaks = seq(1, 10, by = 1)) + xlab("Position") + theme(axis.text.y = element_blank(), axis.ticks.y = element_blank(), axis.title.y = element_blank()) ``` We also provide a convienence method called `find_overlaps` that computes the same result as `join_overlap_inner()`. ```{r} find_overlaps(query, subject) ``` The `join_overlap_left()` method will perform an outer left join. First any overlaps that are found will be returned similar to `join_overlap_inner()`. Then any non-overlapping ranges will be returned, with missing values on the metadata columns. ```{r} left_rng <- join_overlap_left(query, subject) left_rng ``` ```{r, echo = FALSE, fig.cap = "Left Join", message = FALSE} left_df <- as.data.frame(left_rng)[, -c(6,7)] left_df$ymin <- c(1,4, 1) left_df$ymax <- c(3,6, 3) left_df$key <- "Left Join" melted_ranges <- rbind(query_df, subject_df) melted_ranges$ymin <- 1 melted_ranges$ymax <- 3 melted_ranges <- rbind(melted_ranges, left_df) melted_ranges$key <- factor(melted_ranges$key, levels = c("Query", "Subject", "Left Join")) ggplot(melted_ranges, aes(xmin = start, xmax = end, ymin = ymin, ymax = ymax)) + geom_rect() + facet_grid(key ~ ., scales = "free_y") + scale_x_continuous(breaks = seq(1, 10, by = 1)) + xlab("Position") + theme(axis.text.y = element_blank(), axis.ticks.y = element_blank(), axis.title.y = element_blank()) ``` Compared with `filter_by_overlaps()` above, the overlap left join expands the `Ranges` to give information about each interval on the query `Ranges` that overlap those on the subject `Ranges` as well as the intervals on the left that do not overlap any range on the right. ## Finding your neighbours We also provide methods for finding nearest, preceding or following `Ranges`. Conceputally this is identical to our approach for finding overlaps, except the semantics of the join are different. ```{r echo = FALSE, out.width="600px"} knitr::include_graphics("neighbours.png") ``` ```{r} join_nearest(ir0, ir1) join_follow(ir0, ir1) join_precede(ir0, ir1) # nothing precedes returns empty `Ranges` join_precede(ir1, ir0) ``` ## Example: dealing with multimapping This example is taken from the Bioconductor support [site](https://support.bioconductor.org/p/100046/). We have two `Ranges` objects. The first contains single nucleotide positions corresponding to an intensity measurement such as a ChiP-seq experiment, while the other contains coordinates for two genes of interest. We want to identify which positions in the `intensties` `Ranges` overlap the genes, where each row corresponds to a position that overlaps a single gene. First we create the two `Ranges` objects ```{r ex1} intensities <- data.frame(seqnames = "VI", start = c(3320:3321,3330:3331,3341:3342), width = 1) %>% as_granges() intensities genes <- data.frame(seqnames = "VI", start = c(3322, 3030), end = c(3846, 3338), gene_id=c("YFL064C", "YFL065C")) %>% as_granges() genes ``` Now to find where the positions overlap each gene, we can perform an overlap join. This will automatically carry over the gene_id information as well as their coordinates (we can drop those by only selecting the gene_id). ```{r} olap <- join_overlap_inner(intensities, genes) %>% select(gene_id) olap ``` Several positions match to both genes. We can count them using `summarise` and grouping by the `start` position: ```{r} olap %>% group_by(start) %>% summarise(n = n()) ``` ## Grouping by overlaps It's also possible to group by overlaps. Using this approach we can count the number of overlaps that are greater than 0. ```{r} grp_by_olap <- ir0 %>% group_by_overlaps(ir1) grp_by_olap grp_by_olap %>% mutate(n_overlaps = n()) ``` Of course we can also add overlap counts via the `count_overlaps()` function. ```{r} ir0 %>% mutate(n_overlaps = count_overlaps(., ir1)) ``` # Data Import/Output We provide convienence functions via `rtracklayer` and `GenomicAlignments` for reading/writing the following data formats to/from `Ranges` objects. | `plyranges` functions | File Format | |-----------------------|-------------| | `read_bam()` | BAM | | `read_bed()`/`write_bed()` | BED | | `read_bedgraph()`/ `write_bedgraph()` | BEDGraph | | `read_narrowpeaks()`/ `write_narrowpeaks()` | narrowPeaks | | `read_gff()` / `write_gff()` | GFF(1-3) / GTF | | `read_bigwig()` / `write_bigwig()` | BigWig | | `read_wig()` / `write_wig()` | Wig | # Mapping to GenomicRanges/IRanges For users already familiar with the `IRanges` and `GenomicRanges` we provide mappings to the `plyranges` API. ## Operations on range width For `G`Ranges`` objects all functions ignore any strandedness, unless the strand of the range is anchored. | `plyranges` functions | Description | `GenomicRanges/IRanges` command | |-----------------------|-------------|---------------------------------| | `anchor_(start/end/center/3p/5p)` | Fix the `start/end/center/` coordinates or positive/negative strand of range. Can be used in combination with any of the following | Available in functions that have a `fix` argument. | | `mutate(x, width = width + modifier)` | Modify the width of a `Ranges` | `resize` | | `stretch(x, extend)` | Extend the start and end coordinates in opposite directions by a fixed amount. | `start(x)<- start(x) - extend%/%2`, `end(x) <- end(x) + extend%/%2` | ## Operations on range width (invariant) | `plyranges` functions | Description | `GenomicRanges/IRanges` command | |-----------------------|-------------|---------------------------------| | `shift_[left/right/downstream/upstream](x, shift)` | Shift the coordinates of the interval (left/right/downstream/upstream) by an integer amount. | `shift_right` corresponds to `shift` | | `flank_[left/right/downstream/upstream](x, width)` | Generates flanking regions of size width `left/right/downstream/upstream/` | corresponds to `flank` | ## Set operations (vector wise) These are usual set-operations that act on the sets of the `Ranges` represented in x and y. By default these operations will ignore any strand information. The directed versions of these functions will take into account strand. | `plyranges` functions | Description | `GenomicRanges/IRanges` command | |-----------------------|-------------|---------------------------------| | `[intersect/setdiff/union/]_Ranges` | Set operations between two `Ranges`, ignoring strand. | `intersect/setdiff/union/` with `ignore.strand = FALSE` | | `[intersect/setdiff/union/]_anchored_Ranges` | As above taking into account strandedness. | ## Set operations (element wise) We provide infix operators and the verbs between and span to the represent element wise range operations. These map to the `pintersect/punion/psetdiff/pgap/punion(fill.gap = FALSE)` functions. ## Restrictions The verb `filter` corresponds to `subset`, while `filter_by_[overlaps/non_overlaps]` corresponds to `subsetByOverlaps`. ## Aggregation The `summarise` verb is most similar to the `aggregate` methods defined in `GenomicRanges/IRanges`. The `reduce_ranges/disjoin_ranges` correspond to the `reduce/disjoin` methods. However, the former methods allow additional summarisation. The `compute_coverage(x)` method corresponds to `[I/G]Ranges(coverage(x))`. ## Overlaps For `GRanges` objects all functions ignore any strandedness, unless the suffix `directed` is added to the funciton call | `plyranges` function | Description | `GenomicRanges/IRanges` command | |------------------------------------------|---------------------------------------------------------------------------------------------|---------------------------------| | `find_overlaps(x, y, maxgap, minoverlap)`| Returns a `Ranges` object with any range in `y` that overlaps `x`. Appends the metadata in `y` and its genomic intervals to the returning `Ranges`. | `findOverlaps(x,y, maxgap, minoverlap, type = "any")` with expanding `x` and `y` by their hits and appending the `mcols` in `y`. | | `group_by_overlaps(x, y, maxgap, minoverlap)` | Returns a GroupedRanges object grouped by the query hits. | Same as above with an additional column called `query` which contains the queryHits. | | `count_overlaps(x, y, maxgap, minoverlap)` | Returns an integer vector (used with `mutate`) | `countOverlaps(x, y, maxgap, minoverlap, type = "any")` | | `join_overlap_self(x, maxgap, minoverlap)` | Returns a `Ranges` object with any range that overlaps itself. | `findOverlaps(x,x, maxgap, minoverlap, type = "any")`| | `join_overlap_inner(x, y, maxgap, minoverlap)` | Finds the intersecting `Ranges` that overlap in `x` and `y`. Returns a `Ranges` object with the metadata from `x` and `y`. | `findOverlapsPairs(x,y, maxgap, minoverlap, type = "any")` + `pintersect`. | | `join_overlap_left(x, y, maxgap, minoverlap)` | Identical to `find_overlaps` | Identical to `find_overlaps`. | | `*_within` | Adding suffix `within` will find overlaps | Makes `type = "within"` | | `*_includes` | inverse of within functions | - | | `join_nearest[_left/right/up/downstream](x,y)` | Finds nearest neighbour `Ranges` between `x` and `y`. | `nearest` + reindexing to return a `Ranges` object. | |`join_precede[_left/right/up/downstream](x,y)` | Finds `Ranges` in `x` that preced `y` | `precedes` + reindexing to return a `Ranges` object. | |`join_follow[_left/right/up/downstream](x,y)` | Finds `Ranges` in `x` that follow `y` | `precedes` + reindexing to return a `Ranges` object. | # Appendix ```{r} sessionInfo() ```