%\VignetteIndexEntry{Streamer: A simple example} %\VignetteDepends{} %\VignetteKeywords{Stream} %\VignettePackage{Streamer} \documentclass{article} \usepackage{times} \usepackage{hyperref} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rcode}[1]{{\texttt{#1}}} \newcommand{\software}[1]{\textsf{#1}} \newcommand{\R}{\software{R}} \newcommand{\Bioconductor}{\software{Bioconductor}} \SweaveOpts{keep.source=TRUE} \title{Using the Streamer classes to count genomic overlaps with \Rmethod{summarizeOverlaps}} \author{Nishant Gopalakrishnan, Martin Morgan} \date{\today} \begin{document} \maketitle \section{Introduction} This vignette illustrates how users can make use of the functionality provided by the \Rclass{Producer}, \Rclass{Consumer} and \Rclass{Stream} classes in the \Rpackage{Streamer} package to process data in a streaming fashion. The users have the option of quickly being able to create their own class to stream process data by inheriting from the classes provided by the \Rpackage{Streamer} package. This example illustrates a simple \Rclass{BamInput} class that inherits from the \Rclass{Producer} class and a \Rclass{CountGOverlap} class that inherits from the \Rclass{Consumer} class. These classes allows us to count the number of hits in a BAM file corresponding to the ranges specified by the user and return the hits in a streaming manner on a per sequence basis. Finally, the results for each sequence is collated and reordered using a helper function so they appear in the same order as the ranges provided by the user. The classes that we are going to develop in this example make use of the reference class system available in \R{}. We first load the \Rpackage{GenomicAlignments} and \Rpackage{Streamer} packages. <>= library(GenomicAlignments) library(Streamer) @ \section{\Rclass{BAMInput} class} The \Rclass{BAMInput} class will be used to read gapped alignments from a file specified by the user in a streaming manner. i.e reads will be read one sequence at a time. The two inputs specified by the user are \begin{itemize} \item file: a character string specifying the file from which alignments are to be read. \item ranges: the ranges from which alingments are to be retrieved. \end{itemize} Like the design of the other classes in the \Rpackage{Streamer} package, the \Rclass{BamInput} class will have an \Rmethod{initialize} and a \Rmethod{yield} method. The \Rmethod{initialize} method will be used to initialize the fields of the \Rclass{BamInput} class and is called automatically when objects are instantiated from this class. The \Rmethod{yield} method does not take any inputs. Each call to the \Rmethod{yield} method returns a \Rclass{GAlignments} object for a single sequence within the ranges specified by the user until all the sequences have been read from the BAM file at which point, an empty \Rclass{GAlignments} object will be returned. <>= .BamInput <- setRefClass("BamInput", contains="Producer", fields=list( file="character", ranges="GRanges", .seqNames="character")) .BamInput$methods( yield=function() { "yield data from .bam file" if (verbose) msg("BamInput$.yield()") if(length(.self$.seqNames)) { seq <- .self$.seqNames[1] .self$.seqNames <- .self$.seqNames[-1] idx <- as.character(seqnames(.self$ranges)) == seq param <- ScanBamParam(which=.self$ranges[idx], what=character()) aln <- readGAlignments(.self$file, param=param) seqlevels(aln) <- seq } else { aln <- GAlignments() } list(aln) }) @ The constructor for the \Rclass{BamInput} class takes the file and ranges as input and returns and instance of the \Rclass{BamInput} class. <>= BamInput <- function(file, ranges,...) { .seqNames <- names(scanBamHeader(file)[[1]]$target) .BamInput$new(file=file, ranges=ranges, .seqNames=.seqNames, ...) } @ \section{\Rclass{CountGOverlap} class} The second class we are going to develop is a \Rclass{Consumer} class that processes the data obtained from the \Rclass{BamInput} class. The class calls the \Rmethod{summarizeOverlaps} method with the \Rclass{GAlignments} object, user supplied ranges and additional arguments to control the behaviour of the \Rmethod{summarizeOverlaps} method. The \Rclass{CountGOverlap} class has an \Rmethod{initialize} method and a \Rmethod{yield} method. The \Rmethod{initialize} method initializes the class with the options to be passed in to the \Rmethod{countGenomicOverlaps} method as well as some variables for keeping track of the order of the hits to be returned by the \Rclass{CountGOverlap} class. The \Rmethod{yield} method returns a \Rclass{DataFrame} with the number of hits. The rownames of the result returned correspond to the order of the results in the original ranges supplied by the user. (These are subsequently used to reorder the results for the hits after collating results for all the sequences) <>= .CountGOverlap <- setRefClass("CountGOverlap", contains="Consumer", fields=list(ranges="GRanges", mode="character", ignore.strand="logical")) .CountGOverlap$methods( yield=function() { "return number of hits" if (verbose) msg(".CountGOt$yield()") aln <- callSuper()[[1]] df <- DataFrame(hits=numeric(0)) if(length(aln)) { idx <- as.character(seqnames(.self$ranges)) == levels(rname(aln)) which <- .self$ranges[idx] olap <- summarizeOverlaps(which, aln, mode=.self$mode, ignore.strand=.self$ignore.strand) df <- as(assays(olap)[[1]], "DataFrame") dimnames(df) <- list(rownames(olap), seqlevels(aln)) } df }) CountGOverlap <- function(ranges, mode = c("Union", "IntersectionStrict", "IntersectionNotEmpty"), ignore.strand = FALSE, ...) { values(ranges)$pos <- seq_len(length(ranges)) .CountGOverlap$new(ranges=ranges, mode=mode, ignore.strand=ignore.strand, ...) } @ \section{Stream with \Rclass{BamInput} and \Rclass{CountGOverlap}} Instances of the \Rclass{BamInput} and \Rclass{CountGOverlap} classes can be created using their respective constructors and can subsequently be hooked up to form a stream using the \Rmethod{Stream} function provided by the \Rpackage{Streamer} package. For our example we shall make use of a BAM file available in the \Rpackage{Rsamtools} package and create a \Rclass{GenomicRanges} object for the ranges that we are interested. A \Rclass{Stream} can then be created by passing these objects as the arguments to the \Rmethod{Stream} function. A call to the \Rmethod{yield} function of the \Rclass{Stream} class will yield the results obtained by calling yield first on the \Rclass{BamInput} class and subsequently on the \Rclass{CountGOverlap} class for the first sequence in the ranges provided. <>= galn_file <- system.file("extdata", "ex1.bam", package="Rsamtools") gr <- GRanges(seqnames = Rle(c("seq2", "seq2", "seq2", "seq1"), c(1, 3, 2, 4)), ranges = IRanges(rep(10,1), width = 1:10, names = head(letters,10)), strand = Rle(strand(rep("+", 5)), c(1, 2, 2, 3, 2)), score = 1:10, GC = seq(1, 0, length=10)) bam <- BamInput(file = galn_file, ranges = gr) olap <- CountGOverlap(ranges=gr, mode="IntersectionNotEmpty") s <- Stream(bam, olap) yield(s) @ \section{Collate results} Each call to the \Rmethod{yield} function of the stream process data for one sequence. It would be convenient to have a function that processed data for all the sequences in the ranges provided and collated the results so that they are ordered correctly. (same order as the ranges provided). We proceed to create this helper \Rmethod{overlapCounter} function that takes a \Rclass{BAMInput} and \Rclass{CountGOverlap} class objects as inputs. <>= overlapCounter <- function(pr, cs) { s <- Stream(pr, cs) len <- length(levels(seqnames(pr$ranges))) lst <- vector("list", len) for(i in 1:len) { lst[[i]] <- yield(s) names(lst[[i]]) <- "Count" } do.call(rbind, lst)[names(cs$ranges), ,drop=FALSE ] } bam <- BamInput(file = galn_file, ranges = gr) olap <- CountGOverlap(ranges=gr, mode="IntersectionNotEmpty") overlapCounter(bam, olap) @ \end{document}