%\VignetteIndexEntry{An Introduction to GenomicRanges} %\VignetteDepends{Rsamtools} %\VignetteKeywords{sequence, sequencing, alignments} %\VignettePackage{GenomicRanges} \documentclass[10pt]{article} \usepackage{times} \usepackage{hyperref} \textwidth=6.5in \textheight=8.5in %\parskip=.3cm \oddsidemargin=-.1in \evensidemargin=-.1in \headheight=-.3in \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rcode}[1]{{\texttt{#1}}} \newcommand{\software}[1]{\textsf{#1}} \newcommand{\R}{\software{R}} \newcommand{\Bioconductor}{\software{Bioconductor}} \newcommand{\GenomicRanges}{\Rpackage{GenomicRanges}} \title{An Introduction to Genomic Ranges Classes} \author{Marc Carlson \and Patrick Aboyoun \and Herv\'{e} Pag\`{e}s} \date{\today} \begin{document} \maketitle <>= options(width=72) @ \tableofcontents \section{Introduction} The \Rpackage{GenomicRanges} package serves as the foundation for representing genomic locations within the \software{Bioconductor} project. In the \software{Bioconductor} package hierarchy, it builds upon the \Rpackage{IRanges} (infrastructure) package and provides support for the \Rpackage{BSgenome} (infrastructure), \Rpackage{Rsamtools} (I/O), \Rpackage{ShortRead} (I/O \& QA), \Rpackage{rtracklayer} (I/O), and \Rpackage{GenomicFeatures} (infrastructure) packages. This package lays a foundation for genomic analysis by introducing three classes (\Rclass{GRanges}, \Rclass{GRangesList}, and \Rclass{GAlignments}), which are used to represent single interval range features, multiple interval range features, and genomic alignments respectively. This vignette focuses on these classes and their associated methods. The \Rpackage{GenomicRanges} package is available at bioconductor.org and can be downloaded via \Rfunction{biocLite}: <>= source("http://bioconductor.org/biocLite.R") biocLite("GenomicRanges") @ <>= library(GenomicRanges) @ \section{\Rclass{GRanges}: Single Interval Range Features} The \Rclass{GRanges} class represents a collection of genomic features that each have a single start and end location on the genome. This includes features such as contiguous binding sites, transcripts, and exons. These objects can be created by using the \Rfunction{GRanges} constructor function. For example, <>= gr <- GRanges(seqnames = Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)), ranges = IRanges(1:10, end = 7:16, names = head(letters, 10)), strand = Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)), score = 1:10, GC = seq(1, 0, length=10)) gr @ \noindent creates a \Rclass{GRanges} object with 10 single interval features. The output of the \Rclass{GRanges} \Rmethod{show} method separates the information into a left and right hand region that are separated by \Rcode{|} symbols. The genomic coordinates (seqnames, ranges, and strand) are located on the left-hand side and the metadata columns (annotation) are located on the right. For this example, the metadata is comprised of \Rcode{score} and \Rcode{GC} information, but almost anything can be stored in the metadata portion of a \Rclass{GRanges} object. The components of the genomic coordinates within a \Rclass{GRanges} object can be extracted using the \Rmethod{seqnames}, \Rmethod{ranges}, and \Rmethod{strand} accessor functions. <>= seqnames(gr) ranges(gr) strand(gr) @ Stored annotations for these coordinates can be extracted as a \Rclass{DataFrame} object using the \Rmethod{mcols} accessor. <>= mcols(gr) mcols(gr)$score @ Finally, the total lengths of the various sequences that the ranges are aligned to can also be stored in the \Rclass{GRanges} object. So if this is data from \textit{Homo sapiens}, we can set the values as: <>= seqlengths(gr) <- c(249250621,243199373,198022430) @ And then retrieves as: <>= seqlengths(gr) @ Methods for accessing the \Rmethod{length} and \Rmethod{names} have also been defined. <>= names(gr) length(gr) @ \subsection{Splitting and combining \Rclass{GRanges} objects} \Rclass{GRanges} objects can be devided into groups using the \Rmethod{split} method. This produces a \Rclass{GRangesList} object, a class that will be discussed in detail in the next section. <>= sp <- split(gr, rep(1:2, each=5)) sp @ If you then grab the components of this list, they can also be merged by using the \Rmethod{c} and \Rmethod{append} methods. <>= c(sp[[1]], sp[[2]]) @ \subsection{Subsetting \Rclass{GRanges} objects} The expected subsetting operations are also available for \Rclass{GRanges} objects. <>= gr[2:3] @ A second argument to the \Rmethod{[} subset operator can be used to specify which metadata columns to extract from the \Rclass{GRanges} object. For example, <>= gr[2:3, "GC"] @ You can also assign into elements of the \Rclass{GRanges} object. Here is an example where the 2nd row of a \Rclass{GRanges} object is replaced with the 1st row of \Robject{gr}. <>= singles <- split(gr, names(gr)) grMod <- gr grMod[2] <- singles[[1]] head(grMod, n=3) @ Here is a second example where the metadata for score from the 3rd element is replaced with the score from the 2nd row etc. <>= grMod[2,1] <- singles[[3]][,1] head(grMod, n=3) @ There are also methods to repeat, reverse, or select specific portions of \Rclass{GRanges} objects. <>= rep(singles[[2]], times = 3) rev(gr) head(gr,n=2) tail(gr,n=2) window(gr, start=2,end=4) gr[IRanges(start=c(2,7), end=c(3,9))] @ \subsection{Basic interval operations for \Rclass{GRanges} objects} Basic interval characteristics of \Rclass{GRanges} objects can be extracted using the \Rmethod{start}, \Rmethod{end}, \Rmethod{width}, and \Rmethod{range} methods. <>= g <- gr[1:3] g <- append(g, singles[[10]]) start(g) end(g) width(g) range(g) @ The \Rclass{GRanges} class also has many methods for manipulating the intervals. For example, the \Rmethod{flank} method can be used to recover regions flanking the set of ranges represented by the \Rclass{GRanges} object. So to get a \Rclass{GRanges} object containing the ranges that include the 10 bases upstream of the ranges: <>= flank(g, 10) @ And to include the downstream bases: <>= flank(g, 10, start=FALSE) @ Similar to \Rmethod{flank}, there are also operations to \Rmethod{resize} and \Rmethod{shift} our \Rclass{GRanges} object. The \Rmethod{shift} method will move the ranges by a specific number of base pairs, and the \Rmethod{resize} method will extend the ranges by a specified width. <>= shift(g, 5) resize(g, 30) @ The \Rmethod{reduce} will align the ranges and merge overlapping ranges to produce a simplified set. <>= reduce(g) @ Sometimes you may be interested in the spaces or the qualities of the spaces between the ranges represented by your \Rclass{GRanges} object. The \Rmethod{gaps} method will help you calculate the spaces between a reduced version of your ranges: <>= gaps(g) @ And sometimes you also may want to know how many quantitatively unique fragments your ranges could possibly represent. For this task there is the \Rmethod{disjoin} method. <>= disjoin(g) @ One of the most powerful methods for looking at \Rclass{GRanges} objects is the \Rmethod{coverage} method. The \Rmethod{coverage} method quantifies the degree of overlap for all the ranges in a \Rclass{GRanges} object. <>= coverage(g) @ \subsection{Interval set operations for \Rclass{GRanges} objects} There are also operations for calculating relationships between different \Rclass{GRanges} objects. Here are a some examples for how you can calculate the \Rmethod{union}, the \Rmethod{intersect} and the asymmetric difference (using \Rmethod{setdiff}). <>= g2 <- head(gr, n=2) union(g, g2) intersect(g, g2) setdiff(g, g2) @ In addition, there is similar set of operations that act at the level of the individual ranges within each \Rclass{GRanges}. These operations all begin with a ``p", which is short for parallel. A requirement for this set of operations is that the number of elements in each \Rclass{GRanges} object has to be the same, and that both of the objects have to have the same seqnames and strand assignments throughout. <>= g3 <- g[1:2] ranges(g3[1]) <- IRanges(start=5, end=12) punion(g2, g3) pintersect(g2, g3) psetdiff(g2, g3) @ For even more information on the \Rcode{GRanges} classes be sure to consult the manual page. <>= ?GRanges @ \section{\Rclass{GRangesList}: Multiple Interval Range Features} Some important genomic features, such as spliced transcripts that are are comprised of exons, are inherently compound structures. Such a feature makes much more sense when expressed as a compound object such as a \Rclass{GRangesList}. Whenever genomic features consist of multiple ranges that are grouped by a parent feature, they can be represented as \Rclass{GRangesList} object. Consider the simple example of the two transcript \Rfunction{GRangesList} below created using the \Rfunction{GRangesList} constructor. <>= gr1 <- GRanges(seqnames = "chr2", ranges = IRanges(3, 6), strand = "+", score = 5L, GC = 0.45) gr2 <- GRanges(seqnames = c("chr1", "chr1"), ranges = IRanges(c(7,13), width = 3), strand = c("+", "-"), score = 3:4, GC = c(0.3, 0.5)) grl <- GRangesList("txA" = gr1, "txB" = gr2) grl @ The \Rmethod{show} method for a \Rclass{GRangesList} object displays it as a named list of \Rclass{GRanges} objects, where the names of this list are considered to be the names of the grouping feature. In the example above, the groups of individual exon ranges are represented as separate \Rclass{GRanges} objects which are further organized into a list structure where each element name is a transcript name. Many other combinations of grouped and labeled \Rclass{GRanges} objects are possible of course, but this example is expected to be a common arrangement. \subsection{Basic \Rclass{GRangesList} accessors} Just as with \Rclass{GRanges} object, the components of the genomic coordinates within a \Rclass{GRangesList} object can be extracted using simple accessor methods. Not surprisingly, the \Rclass{GRangesList} objects have many of the same accessors as \Rclass{GRanges} objects. The difference is that many of these methods return a list since the input is now essentially a list of \Rclass{GRanges} objects. Here are a few examples: <>= seqnames(grl) ranges(grl) strand(grl) @ The \Rmethod{length} and \Rmethod{names} methods will return the length or names of the list and the \Rmethod{seqlengths} method will return the set of sequence lengths. <>= length(grl) names(grl) seqlengths(grl) @ The \Rmethod{elementLengths} method returns a list of integers corresponding to the result of calling \Rmethod{length} on each individual \Rclass{GRanges} object contained by the \Rclass{GRangesList}. This is a faster alternative to calling \Rmethod{lapply} on the \Rclass{GRangesList}. <>= elementLengths(grl) @ You can also use \Rmethod{isEmpty} to test if a \Rclass{GRangesList} object contains anything. <>= isEmpty(grl) @ Finally, in the context of a \Rclass{GRangesList} object, the \Rmethod{mcols} method performs a similar operation to what it does on a \Rclass{GRanges} object. However, this metadata now refers to information at the list level instead of the level of the individual \Rclass{GRanges} objects. <>= mcols(grl) <- c("Transcript A","Transcript B") mcols(grl) @ \subsection{Combining \Rclass{GRangesList} objects} \Rclass{GRangesList} objects can be unlisted to combine the separate \Rclass{GRanges} objects that they contain as an expanded \Rclass{GRanges}. <>= ul <- unlist(grl) @ You can also append values together useing \Rmethod{append} or \Rmethod{c}. % <>= % grl2 <- shift(grl,10) % names(grl2) <- c("shiftTxA","shiftTxB") % append(grl, grl2) % c(grl, grl2) % @ \subsection{Basic interval operations for \Rclass{GRangesList} objects} For interval operations, many of the same methods exist for \Rclass{GRangesList} objects that exist for \Rclass{GRanges} objects. <>= start(grl) end(grl) width(grl) @ And as with \Rclass{GRanges} objects, you can also shift all the \Rclass{GRanges} objects in a \Rclass{GRangesList} object, or calculate the coverage. Both of these operations are also carried out across each \Rclass{GRanges} list member. <>= shift(grl, 20) coverage(grl) @ \subsection{Subsetting \Rclass{GRangesList} objects} As you might guess, the subsetting of a \Rclass{GRangesList} object is quite different from subsetting on a \Rclass{GRanges} object in that it acts as if you are subseting a list. If you try out the following you will notice that the standard conventions have been followed. <>= grl[1] grl[[1]] grl["txA"] grl$txB @ But in addition to this, when subsetting a \Rclass{GRangesList}, you can also pass in a second parameter (as with a \Rclass{GRanges} object) to again specify which of the metadata columns you wish to select. <>= grl[1, "score"] grl["txB", "GC"] @ The \Rmethod{head}, \Rmethod{tail}, \Rmethod{rep}, \Rmethod{rev}, and \Rmethod{window} methods all behave as you would expect them to for a list object. For example, the elements referred to by \Rmethod{window} are now list elements instead of \Rclass{GRanges} elements. <>= rep(grl[[1]], times = 3) rev(grl) head(grl, n=1) tail(grl, n=1) window(grl, start=1, end=1) grl[IRanges(start=2, end=2)] @ \subsection{Looping over \Rclass{GRangesList} objects} For \Rclass{GRangesList} objects there is also a family of \Rmethod{apply} methods. These include \Rmethod{lapply}, \Rmethod{sapply}, \Rmethod{mapply}, \Rmethod{endoapply}, \Rmethod{mendoapply}, \Rmethod{Map}, and \Rmethod{Reduce}. The different looping methods defined for \Rclass{GRangesList} objects are useful for returning different kinds of results. The standard \Rmethod{lapply} and \Rmethod{sapply} behave according to convention, with the \Rmethod{lapply} method returning a list and \Rmethod{sapply} returning a more simplified output. <>= lapply(grl, length) sapply(grl, length) @ As with \Rclass{IRanges} objects, there is also a multivariate version of \Rmethod{sapply}, called \Rmethod{mapply}, defined for \Rclass{GRangesList} objects. And, if you don't want the results simplified, you can call the \Rmethod{Map} method, which does the same things as \Rmethod{mapply} but without simplifying the output. <>= grl2 <- shift(grl, 10) names(grl2) <- c("shiftTxA", "shiftTxB") mapply(c, grl, grl2) Map(c, grl, grl2) @ Sometimes, you may not want to get back a simplified output or a list. Sometimes you will want to get back a modified version of the \Rclass{GRangesList} that you originally passed in. This is conceptually similar to the mathematical notion of an endomorphism. This is achieved using the \Rmethod{endoapply} method, which will return the results as a \Rclass{GRangesList} object. <>= endoapply(grl,rev) @ And, there is also a multivariate version of the \Rmethod{endoapply} method in the form of the \Rmethod{mendoapply} method. <>= mendoapply(c,grl,grl2) @ Finally, the \Rmethod{Reduce} method will allow the \Rclass{GRanges} objects to be collapsed across the whole of the \Rclass{GRangesList} object. % Again, this seems like a sub-optimal example to me. <>= Reduce(c,grl) @ For even more information on the \Rcode{GRangesList} classes be sure to consult the manual page. <>= ?GRangesList @ \section{Interval overlaps involving \Rclass{GRanges} and \Rclass{GRangesList} objects} Interval overlapping is the process of comparing the ranges in two objects to determine if and when they overlap. As such, it is perhaps the most common operation performed on \Rclass{GRanges} and \Rclass{GRangesList} objects. To this end, the \Rpackage{GenomicRanges} package provides a family of interval overlap functions. The most general of these functions is \Rfunction{findOverlaps}, which takes a query and a subject as inputs and returns a \Rclass{Hits} object containing the index pairings for the overlapping elements. <>= mtch <- findOverlaps(gr, grl) as.matrix(mtch) @ \noindent As suggested in the sections discussing the nature of the \Rclass{GRanges} and \Rclass{GRangesList} classes, the index in the above matrix of hits for a \Rclass{GRanges} object is a single range while for a \Rclass{GRangesList} object it is the set of ranges that define a "feature". Another function in the overlaps family is \Rfunction{countOverlaps}, which tabulates the number of overlaps for each element in the query. <>= countOverlaps(gr, grl) @ A third function in this family is \Rfunction{subsetByOverlaps}, which extracts the elements in the query that overlap at least one element in the subject. <>= subsetByOverlaps(gr,grl) @ Finally, you can use the \Rcode{select} argument to get the index of the first overlapping element in the subject for each element in the query. <>= findOverlaps(gr, grl, select="first") findOverlaps(grl, gr, select="first") @ \section{Genomic Alignments} In addition to \Rclass{GRanges} and \Rclass{GRangesList} classes, the \GenomicRanges{} package defines the \Rclass{GAlignments} class, which is a more specialized container for storing a set of genomic alignments. The class is intended to support alignments in general, not only those coming from a 'Binary Alignment Map' or 'BAM' files. Also alignments with gaps in the reference sequence (a.k.a. \emph{gapped alignments}) are supported which, for example, makes the class suited for storing junction reads from an RNA-seq experiment. More precisely, a \Rclass{GAlignments} object is a vector-like object where each element describes an \emph{alignment}, that is, how a given sequence (called \emph{query} or \emph{read}, typically short) aligns to a reference sequence (typically long). As shown later in this document, a \Rclass{GAlignments} object can be created from a 'BAM' file. In that case, each element in the resulting object will correspond to a record in the file. One important thing to note though is that not all the information present in the BAM/SAM records is stored in the object. In particular, for now, we discard the query sequences (SEQ field), the query ids (QNAME field), the query qualities (QUAL), the mapping qualities (MAPQ) and any other information that is not needed in order to support the basic set of operations described in this document. This also means that multi-reads (i.e. reads with multiple hits in the reference) don't receive any special treatment i.e. the various SAM/BAM records corresponding to a multi-read will show up in the \Rclass{GAlignments} object as if they were coming from different/unrelated queries. Also paired-end reads will be treated as single-end reads and the pairing information will be lost. This might change in the future. \subsection{Load a `BAM' file into a \Rclass{GAlignments} object} First we use the \Rfunction{readGAlignments} function from the \Rpackage{GenomicAlignments} package to load a toy `BAM' file into a \Rclass{GAlignments} object: <>= library(GenomicAlignments) aln1_file <- system.file("extdata", "ex1.bam", package="Rsamtools") aln1 <- readGAlignments(aln1_file) aln1 length(aln1) @ 3271 `BAM' records were loaded into the object. Note that \Rfunction{readGAlignments} would have discarded any `BAM' record describing an unaligned query (see description of the field in the SAM Format Specification \footnote{\url{http://samtools.sourceforge.net/SAM1.pdf}} for more information). The reader interested in tracking down these events can always use the \Rfunction{scanBam} function but this goes beyond the scope of this document. \subsection{Simple accessor methods} There is one accessor per field displayed by the \Rmethod{show} method and it has the same name as the field. All of them return a vector or factor of the same length as the object: <>= head(seqnames(aln1)) seqlevels(aln1) head(strand(aln1)) head(cigar(aln1)) head(qwidth(aln1)) head(start(aln1)) head(end(aln1)) head(width(aln1)) head(njunc(aln1)) @ \subsection{More accessor methods} \end{document}