%\VignetteIndexEntry{An Overview of the IRanges package} %\VignetteDepends{IRanges} %\VignetteKeywords{Ranges,IntegerRanges,IRanges,IRangesList,Views,AtomicList} %\VignettePackage{IRanges} \documentclass{article} \usepackage[authoryear,round]{natbib} <>= BiocStyle::latex(use.unsrturl=FALSE) @ \title{An Overview of the \Biocpkg{IRanges} package} \author{Patrick Aboyoun, Michael Lawrence, Herv\'e Pag\`es} \date{Edited: February 2018; Compiled: \today} \begin{document} \maketitle \tableofcontents <>= options(width=72) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Introduction} When analyzing sequences, we are often interested in particular consecutive subsequences. For example, the \Sexpr{letters} vector could be considered a sequence of lower-case letters, in alphabetical order. We would call the first five letters (\textit{a} to \textit{e}) a consecutive subsequence, while the subsequence containing only the vowels would not be consecutive. It is not uncommon for an analysis task to focus only on the geometry of the regions, while ignoring the underlying sequence values. A list of indices would be a simple way to select a subsequence. However, a sparser representation for consecutive subsequences would be a range, a pairing of a start position and a width, as used when extracting sequences with \Rfunction{window}. Two central classes are available in Bioconductor for representing ranges: the \Rclass{IRanges} class defined in the \Biocpkg{IRanges} package for representing ranges defined on a single space, and the \Rclass{GRanges} class defined in the \Biocpkg{GenomicRanges} package for representing ranges defined on multiple spaces. In this vignette, we will focus on \Rclass{IRanges} objects. We will rely on simple, illustrative example datasets, rather than large, real-world data, so that each data structure and algorithm can be explained in an intuitive, graphical manner. We expect that packages that apply \Biocpkg{IRanges} to a particular problem domain will provide vignettes with relevant, realistic examples. The \Biocpkg{IRanges} package is available at bioconductor.org and can be downloaded via \Rfunction{biocLite}: <>= source("http://bioconductor.org/biocLite.R") biocLite("IRanges") @ <>= library(IRanges) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{\Rclass{IRanges} objects} To construct an \Rclass{IRanges} object, we call the \Rfunction{IRanges} constructor. Ranges are normally specified by passing two out of the three parameters: start, end and width (see \Rcode{help(IRanges)} for more information). % <>= ir1 <- IRanges(start=1:10, width=10:1) ir1 ir2 <- IRanges(start=1:10, end=11) ir3 <- IRanges(end=11, width=10:1) identical(ir1, ir2) && identical(ir1, ir3) ir <- IRanges(c(1, 8, 14, 15, 19, 34, 40), width=c(12, 6, 6, 15, 6, 2, 7)) ir @ % All of the above calls construct the same \Rclass{IRanges} object, using different combinations of the \Rcode{start}, \Rcode{end} and \Rcode{width} parameters. Accessing the starts, ends and widths is supported via the \Rfunction{start}, \Rfunction{end} and \Rfunction{width} getters: <>= start(ir) @ <>= end(ir) @ <>= width(ir) @ Subsetting an \Rclass{IRanges} object is supported, by numeric and logical indices: <>= ir[1:4] @ <>= ir[start(ir) <= 15] @ In order to illustrate range operations, we'll create a function to plot ranges. <>= plotRanges <- function(x, xlim=x, main=deparse(substitute(x)), col="black", sep=0.5, ...) { height <- 1 if (is(xlim, "IntegerRanges")) xlim <- c(min(start(xlim)), max(end(xlim))) bins <- disjointBins(IRanges(start(x), end(x) + 1)) plot.new() plot.window(xlim, c(0, max(bins)*(height + sep))) ybottom <- bins * (sep + height) - height rect(start(x)-0.5, ybottom, end(x)+0.5, ybottom + height, col=col, ...) title(main) axis(1) } @ <>= plotRanges(ir) @ \begin{figure}[tb] \begin{center} \includegraphics[width=0.5\textwidth]{IRangesOverview-ir-plotRanges} \caption{\label{fig-ir-plotRanges}% Plot of original ranges.} \end{center} \end{figure} \subsection{Normality} Sometimes, it is necessary to formally represent a subsequence, where no elements are repeated and order is preserved. Also, it is occasionally useful to think of an \Rclass{IRanges} object as a set of integers, where no elements are repeated and order does not matter. The \Rclass{NormalIRanges} class formally represents a set of integers. By definition an \Rclass{IRanges} object is said to be \textit{normal} when its ranges are: (a) not empty (i.e. they have a non-null width); (b) not overlapping; (c) ordered from left to right; (d) not even adjacent (i.e. there must be a non empty gap between 2 consecutive ranges). There are three main advantages of using a \textit{normal} \Rclass{IRanges} object: (1) it guarantees a subsequence encoding or set of integers, (2) it is compact in terms of the number of ranges, and (3) it uniquely identifies its information, which simplifies comparisons. The \Rfunction{reduce} function reduces any \Rclass{IRanges} object to a \Rclass{NormalIRanges} by merging redundant ranges. <>= reduce(ir) plotRanges(reduce(ir)) @ \begin{figure}[tb] \begin{center} \includegraphics[width=0.5\textwidth]{IRangesOverview-ranges-reduce} \caption{\label{fig-ranges-reduce}% Plot of reduced ranges.} \end{center} \end{figure} \subsection{Lists of \Rclass{IRanges} objects} It is common to manipulate collections of \Rclass{IRanges} objects during an analysis. Thus, the \Biocpkg{IRanges} package defines some specific classes for working with multiple \Rclass{IRanges} objects. The \Rclass{IRangesList} class asserts that each element is an \Rclass{IRanges} object and provides convenience methods, such as \Rfunction{start}, \Rfunction{end} and \Rfunction{width} accessors that return \Rclass{IntegerList} objects, aligning with the \Rclass{IRangesList} object. Note that \Rclass{IntegerList} objects will be covered later in more details in the ``Lists of Atomic Vectors'' section of this document. To explicitly construct an \Rclass{IRangesList}, use the \Rfunction{IRangesList} function. <>= rl <- IRangesList(ir, rev(ir)) @ % <>= start(rl) @ \subsection{Vector Extraction} As the elements of an \Rclass{IRanges} object encode consecutive subsequences, they may be used directly in sequence extraction. Note that when a \textit{normal} \Rclass{IRanges} is given as the index, the result is a subsequence, as no elements are repeated or reordered. If the sequence is a \Rclass{Vector} subclass (i.e. not an ordinary \Rclass{vector}), the canonical \Rfunction{[} function accepts an \Rclass{IRanges} object. % <>= set.seed(0) lambda <- c(rep(0.001, 4500), seq(0.001, 10, length=500), seq(10, 0.001, length=500)) xVector <- rpois(1e7, lambda) yVector <- rpois(1e7, lambda[c(251:length(lambda), 1:250)]) xRle <- Rle(xVector) yRle <- Rle(yVector) irextract <- IRanges(start=c(4501, 4901) , width=100) xRle[irextract] @ % \subsection{Finding Overlapping Ranges} The function \Rfunction{findOverlaps} detects overlaps between two \Rclass{IRanges} objects. <>= ol <- findOverlaps(ir, reduce(ir)) as.matrix(ol) @ \subsection{Counting Overlapping Ranges} The function \Rfunction{coverage} counts the number of ranges over each position. <>= cov <- coverage(ir) plotRanges(ir) cov <- as.vector(cov) mat <- cbind(seq_along(cov)-0.5, cov) d <- diff(cov) != 0 mat <- rbind(cbind(mat[d,1]+1, mat[d,2]), mat) mat <- mat[order(mat[,1]),] lines(mat, col="red", lwd=4) axis(2) @ \begin{figure}[tb] \begin{center} \includegraphics[width=0.5\textwidth]{IRangesOverview-ranges-coverage} \caption{\label{fig-ranges-coverage}% Plot of ranges with accumulated coverage.} \end{center} \end{figure} \subsection{Finding Neighboring Ranges} The \Rfunction{nearest} function finds the nearest neighbor ranges (overlapping is zero distance). The \Rfunction{precede} and \Rfunction{follow} functions find the non-overlapping nearest neighbors on a specific side. \subsection{Transforming Ranges} Utilities are available for transforming an \Rclass{IRanges} object in a variety of ways. Some transformations, like \Rfunction{reduce} introduced above, can be dramatic, while others are simple per-range adjustments of the starts, ends or widths. \subsubsection{Adjusting starts, ends and widths} Perhaps the simplest transformation is to adjust the start values by a scalar offset, as performed by the \Rfunction{shift} function. Below, we shift all ranges forward 10 positions. % <>= shift(ir, 10) @ There are several other ways to transform ranges. These include \Rfunction{narrow}, \Rfunction{resize}, \Rfunction{flank}, \Rfunction{reflect}, \Rfunction{restrict}, and \Rfunction{threebands}. For example \Rfunction{narrow} supports the adjustment of start, end and width values, which should be relative to each range. These adjustments are vectorized over the ranges. As its name suggests, the ranges can only be narrowed. % <>= narrow(ir, start=1:5, width=2) @ The \Rfunction{restrict} function ensures every range falls within a set of bounds. Ranges are contracted as necessary, and the ranges that fall completely outside of but not adjacent to the bounds are dropped, by default. % <>= restrict(ir, start=2, end=3) @ The \Rfunction{threebands} function extends \Rfunction{narrow} so that the remaining left and right regions adjacent to the narrowed region are also returned in separate \Rclass{IRanges} objects. % <>= threebands(ir, start=1:5, width=2) @ The arithmetic operators \Rfunction{+}, \Rfunction{-} and \Rfunction{*} change both the start and the end/width by symmetrically expanding or contracting each range. Adding or subtracting a numeric (integer) vector to an \Rclass{IRanges} causes each range to be expanded or contracted on each side by the corresponding value in the numeric vector. <>= ir + seq_len(length(ir)) @ % The \Rfunction{*} operator symmetrically magnifies an \Rclass{IRanges} object by a factor, where positive contracts (zooms in) and negative expands (zooms out). % <>= ir * -2 # double the width @ WARNING: The semantic of these arithmetic operators might be revisited at some point. Please restrict their use to the context of interactive visualization (where they arguably provide some convenience) but avoid to use them programmatically. \subsubsection{Making ranges disjoint} A more complex type of operation is making a set of ranges disjoint, \textit{i.e.} non-overlapping. For example, \Rfunction{threebands} returns a disjoint set of three ranges for each input range. The \Rfunction{disjoin} function makes an \Rclass{IRanges} object disjoint by fragmenting it into the widest ranges where the set of overlapping ranges is the same. <>= disjoin(ir) plotRanges(disjoin(ir)) @ \begin{figure}[tb] \begin{center} \includegraphics[width=0.5\textwidth]{IRangesOverview-ranges-disjoin} \caption{\label{fig-ranges-disjoin}% Plot of disjoined ranges.} \end{center} \end{figure} A variant of \Rfunction{disjoin} is \Rfunction{disjointBins}, which divides the ranges into bins, such that the ranges in each bin are disjoint. The return value is an integer vector of the bins. <>= disjointBins(ir) @ \subsubsection{Other transformations} Other transformations include \Rfunction{reflect} and \Rfunction{flank}. The former ``flips'' each range within a set of common reference bounds. <>= reflect(ir, IRanges(start(ir), width=width(ir)*2)) @ % The \Rfunction{flank} returns ranges of a specified width that flank, to the left (default) or right, each input range. One use case of this is forming promoter regions for a set of genes. <>= flank(ir, width=seq_len(length(ir))) @ % \subsection{Set Operations} Sometimes, it is useful to consider an \Rclass{IRanges} object as a set of integers, although there is always an implicit ordering. This is formalized by \Rclass{NormalIRanges}, above, and we now present versions of the traditional mathematical set operations \textit{complement}, \textit{union}, \textit{intersect}, and \textit{difference} for \Rclass{IRanges} objects. There are two variants for each operation. The first treats each \Rclass{IRanges} object as a set and returns a \textit{normal} value, while the other has a ``parallel'' semantic like \Rfunction{pmin}/\Rfunction{pmax} and performs the operation for each range pairing separately. The \textit{complement} operation is implemented by the \Rfunction{gaps} and \Rfunction{pgap} functions. By default, \Rfunction{gaps} will return the ranges that fall between the ranges in the (normalized) input. It is possible to specify a set of bounds, so that flanking ranges are included. <>= gaps(ir, start=1, end=50) plotRanges(gaps(ir, start=1, end=50), c(1,50)) @ \begin{figure}[tb] \begin{center} \includegraphics[width=0.5\textwidth]{IRangesOverview-ranges-gaps} \caption{\label{fig-ranges-gap}% Plot of gaps from ranges.} \end{center} \end{figure} \Rfunction{pgap} considers each parallel pairing between two \Rclass{IRanges} objects and finds the range, if any, between them. Note that the function name is singular, suggesting that only one range is returned per range in the input. <>= @ The remaining operations, \textit{union}, \textit{intersect} and \textit{difference} are implemented by the \Rfunction{[p]union}, \Rfunction{[p]intersect} and \Rfunction{[p]setdiff} functions, respectively. These are relatively self-explanatory. <>= @ <>= @ <>= @ <>= @ <>= @ <>= @ % \subsection{Mapping Ranges Between Vectors} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Vector Views} The \Biocpkg{IRanges} package provides the virtual \Rclass{Views} class, which stores a vector-like object, referred to as the ``subject'', together with an \Rclass{IRanges} object defining ranges on the subject. Each range is said to represent a \textit{view} onto the subject. Here, we will demonstrate the \Rclass{RleViews} class, where the subject is of class \Rclass{Rle}. Other \Rclass{Views} implementations exist, such as \Rclass{XStringViews} in the \Biocpkg{Biostrings} package. \subsection{Creating Views} There are two basic constructors for creating views: the \Rfunction{Views} function based on indicators and the \Rfunction{slice} based on numeric boundaries. <>= xViews <- Views(xRle, xRle >= 1) xViews <- slice(xRle, 1) xRleList <- RleList(xRle, 2L * rev(xRle)) xViewsList <- slice(xRleList, 1) @ Note that \Rclass{RleList} objects will be covered later in more details in the ``Lists of Atomic Vectors'' section of this document. \subsection{Aggregating Views} While \Rfunction{sapply} can be used to loop over each window, the native functions \Rfunction{viewMaxs}, \Rfunction{viewMins}, \Rfunction{viewSums}, and \Rfunction{viewMeans} provide fast looping to calculate their respective statistical summaries. <>= head(viewSums(xViews)) viewSums(xViewsList) head(viewMaxs(xViews)) viewMaxs(xViewsList) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Lists of Atomic Vectors} In addition to the range-based objects described in the previous sections, the \Biocpkg{IRanges} package provides containers for storing lists of atomic vectors such as \Rclass{integer} or \Rclass{Rle} objects. The \Rclass{IntegerList} and \Rclass{RleList} classes represent lists of \Rclass{integer} vectors and \Rclass{Rle} objects respectively. They are subclasses of the \Rclass{AtomicList} virtual class which is itself a subclass of the \Rclass{List} virtual class defined in the \Biocpkg{S4Vectors} package. <>= showClass("RleList") @ As the class definition above shows, the \Rclass{RleList} class is virtual with subclasses \Rclass{SimpleRleList} and \Rclass{CompressedRleList}. A \Rclass{SimpleRleList} class uses an ordinary \R{} list to store the underlying elements and the \Rclass{CompressedRleList} class stores the elements in an unlisted form and keeps track of where the element breaks are. The former ``simple list" class is useful when the \Rclass{Rle} elements are long and the latter ``compressed list" class is useful when the list is long and/or sparse (i.e. a number of the list elements have length 0). In fact, all of the atomic vector types (logical, integer, numeric, complex, character, raw, and factor) have similar list classes that derive from the \Rclass{List} virtual class. For example, there is an \Rclass{IntegerList} virtual class with subclasses \Rclass{SimpleIntegerList} and \Rclass{CompressedIntegerList}. Each of the list classes for atomic sequences, be they stored as vectors or \Rclass{Rle} objects, have a constructor function with a name of the appropriate list virtual class, such as \Rclass{IntegerList}, and an optional argument \Rcode{compress} that takes an argument to specify whether or not to create the simple list object type or the compressed list object type. The default is to create the compressed list object type. <>= args(IntegerList) cIntList1 <- IntegerList(x=xVector, y=yVector) cIntList1 sIntList2 <- IntegerList(x=xVector, y=yVector, compress=FALSE) sIntList2 ## sparse integer list xExploded <- lapply(xVector[1:5000], function(x) seq_len(x)) cIntList2 <- IntegerList(xExploded) sIntList2 <- IntegerList(xExploded, compress=FALSE) object.size(cIntList2) object.size(sIntList2) @ The \Rfunction{length} function returns the number of elements in a \Rclass{Vector}-derived object and, for a \Rclass{List}-derived object like ``simple list" or ``compressed list", the \Rfunction{lengths} function returns an integer vector containing the lengths of each of the elements: <>= length(cIntList2) Rle(lengths(cIntList2)) @ Just as with ordinary \R{} \Rclass{list} objects, \Rclass{List}-derived object support the \Rfunction{[[} for element extraction, \Rfunction{c} for concatenating, and \Rfunction{lapply}/\Rfunction{sapply} for looping. When looping over sparse lists, the ``compressed list" classes can be much faster during computations since only the non-empty elements are looped over during the \Rfunction{lapply}/\Rfunction{sapply} computation and all the empty elements are assigned the appropriate value based on their status. <>= system.time(sapply(xExploded, mean)) system.time(sapply(sIntList2, mean)) system.time(sapply(cIntList2, mean)) identical(sapply(xExploded, mean), sapply(sIntList2, mean)) identical(sapply(xExploded, mean), sapply(cIntList2, mean)) @ Unlike ordinary \R{} \Rclass{list} objects, \Rclass{AtomicList} objects support the \Rfunction{Ops} (e.g. \Rfunction{+}, \Rfunction{==}, \Rfunction{\&}), \Rfunction{Math} (e.g. \Rfunction{log}, \Rfunction{sqrt}), \Rfunction{Math2} (e.g. \Rfunction{round}, \Rfunction{signif}), \Rfunction{Summary} (e.g. \Rfunction{min}, \Rfunction{max}, \Rfunction{sum}), and \Rfunction{Complex} (e.g. \Rfunction{Re}, \Rfunction{Im}) group generics. <>= xRleList > 0 yRleList <- RleList(yRle, 2L * rev(yRle)) xRleList + yRleList sum(xRleList > 0 | yRleList > 0) @ Since these atomic lists inherit from \Rclass{List}, they can also use the looping function \Rfunction{endoapply} to perform endomorphisms. <>= safe.max <- function(x) { if(length(x)) max(x) else integer(0) } endoapply(sIntList2, safe.max) endoapply(cIntList2, safe.max) endoapply(sIntList2, safe.max)[[1]] @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Session Information} Here is the output of \Rcode{sessionInfo()} on the system on which this document was compiled: <>= sessionInfo() @ \end{document}