%\VignetteIndexEntry{Genome intervals and read alignments for functional exploration} %\VignetteDepends{girafe, RColorBrewer} %\VignetteKeywords{next-generation sequencing, read alignment} %\VignettePackage{girafe} % name of package %%%% HEAD SECTION: START EDITING BELOW %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \documentclass[11pt, a4paper, fleqn]{article} \usepackage{geometry} \usepackage{color} \definecolor{darkblue}{rgb}{0.0,0.0,0.75} \usepackage[% baseurl={http://www.bioconductor.org},% pdftitle={Introduction to package girafe},% pdfauthor={Joern Toedling},% pdfsubject={girafe Vignette},% pdfkeywords={Bioconductor},% pagebackref,bookmarks,colorlinks,linkcolor=darkblue,citecolor=darkblue,% filecolor=darkblue,urlcolor=darkblue,pagecolor=darkblue,% raiselinks,plainpages,pdftex]{hyperref} \usepackage{verbatim} % for multi-line comments \usepackage{amsmath,a4,t1enc, graphicx} \usepackage{natbib} \bibpunct{(}{)}{;}{a}{,}{,} \parindent0mm \parskip2ex plus0.5ex minus0.3ex \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\textit{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\phead}[1]{{\flushleft \sf \small \textbf{#1} \quad}} \newcommand{\myincfig}[3]{% \begin{figure}[h!tb] \begin{center} \includegraphics[width=#2]{#1} \caption{\label{#1}\textit{#3}} \end{center} \end{figure} } \addtolength{\textwidth}{2cm} \addtolength{\oddsidemargin}{-1cm} \addtolength{\evensidemargin}{-1cm} \addtolength{\textheight}{2cm} \addtolength{\topmargin}{-1cm} \addtolength{\skip\footins}{1cm} %%%%%%% START EDITING HERE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \begin{document} \SweaveOpts{eps=false, keep.source=TRUE} % produce no 'eps' figures \title{An overview of the \Rpackage{girafe} package} \author{J.\ Toedling, C.\ Ciaudo, O.\ Voinnet, E.\ Heard, E.\ Barillot} \maketitle <>= options(length=60, stringsAsFactors=FALSE) set.seed(123) options(SweaveHooks=list( along=function() par(mar=c(2.5,4.2,4,1.5), font.lab=2), pie=function() par(mar=c(0, 0, 0, 3.7), font=2))) @ \section{Introduction} The intent of package \Rpackage{girafe} is to facilitate the functional exploration of the alignments of multiple reads\footnote{ The package has been developed for analysing single-end reads (fragment libraries) and does not support mate-pair reads yet.} from next-generation sequencing~(NGS) data to a genome. It extends the functionality of the Bioconductor~\citep{Gentleman2004} packages \Rpackage{ShortRead}~\citep{Morgan2009} and \Rpackage{genomeIntervals}. % <>= library("girafe") library("RColorBrewer") @ If you use \Rpackage{girafe} for analysing your data, please cite: \begin{itemize} \item{J Toedling, C Ciaudo, O Voinnet, E Heard and E Barillot (2010) girafe -- an {R/Bioconductor} package for functional exploration of aligned next-generation sequencing reads. \textsl{Bioinformatics}, 26(22):2902-3.} \end{itemize} \nocite{Toedling2010} \subsection*{Getting help} If possible, please send questions about \Rpackage{girafe} to the Bioconductor mailing list.\\ See \url{http://www.bioconductor.org/docs/mailList.html} \\ Their archive of questions and responses may prove helpful, too. \section{Workflow} We present the functionality of the package \Rpackage{girafe} using example data that was downloaded from the Gene Expression Omnibus (GEO) database~\citep[GSE10364]{Edgar2002}. The example data are Solexa reads of 26~nt length derived from small RNA profiling of mouse oocytes. The data has previously been described in \citet{Tam2008}. <>= exDir <- system.file("extdata", package="girafe") ### load object describing annotated mouse genome features: load(file.path(exDir, "mgi_gi.RData")) @ \subsection{Adapter trimming} We load reads that include parts of the adapter sequence. % <>= ra23.wa <- readFastq(dirPath=exDir, pattern= "aravinSRNA_23_plus_adapter_excerpt.fastq") @ <>= show(ra23.wa) @ To removing adapter sequences, we use the function \Rfunction{trimAdapter}, which relies on the \Rfunction{pairwiseAlignment} function from the \Rpackage{Biostrings} package. The adapter sequence was obtained from the GEO page of the data. % <>= adapter <- "CTGTAGGCACCATCAAT" ra23.na <- trimAdapter(ra23.wa, adapter) show(ra23.na) @ \subsection{Importing aligned reads} The reads have been mapped to the mouse genome (assembly \textit{mm9}) using the alignment tool \textit{Bowtie} alignment tool~\citep{Langmead2009}. The resulting tab-delimited map file can be read into an object of class \Rclass{AlignedRead} using the function \Rfunction{readAligned}. Both, this class and this function, are defined in the Bioconductor package \Rpackage{ShortRead}. <>= exA <- readAligned(dirPath=exDir, type="Bowtie", pattern="aravinSRNA_23_no_adapter_excerpt_mm9_unmasked.bwtmap") show(exA) @ The object of class \Rclass{AlignedRead} can be converted into an object of class \Rclass{AlignedGenomeIntervals}, which is the main class of the \Rpackage{girafe} package. % <>= exAI <- as(exA, "AlignedGenomeIntervals") organism(exAI) <- "Mm" @ For alignments in BAM format~\citep{Li2009}, there is an alternative way of importing the data. The function \Rfunction{agiFromBam} can be used to directly create \Rclass{AlignedGenomeIntervals} objects from indexed and sorted BAM files, making use of functionalities in the \Rpackage{Rsamtools} package. \subsection{Exploring the aligned reads} <>= show(exAI) @ Which chromosomes are the intervals located on? <>= table(seqnames(exAI)) @ A subset of the intervals on a specific chromosome can be obtained using subsetting via \verb='['=. <>= detail(exAI[seqnames(exAI)=="chrMT"]) @ Finally, what is the number of aligned \underline{reads} per chromosome? <>= summary(exAI) @ \subsection{Processing the aligned intervals} Sometimes, users may wish to combine certain aligned intervals. One intention could be to combine aligned reads at exactly the same position, which only differ in their sequence due to sequencing errors. Another objective could be to combine overlapping short reads that may be (degradation) products of the same primary transcript. The function \Rfunction{reduce} combines a set of aligned intervals into a single aligned interval, if the intervals: \begin{itemize} \item are on the same strand, \item are overlapping (or contained in each other) or directly adjacent to each other AND \item have the same \emph{read match specificity} (see below) \end{itemize} \phead{Read match specificity} By the \emph{read match specificity}~$r(I_i)$ of an interval~$I_i$, we refer to the total number of valid alignments of reads that have been aligned to $I_i$, \textsl{i.e.} the total numbers of intervals with the same reads aligned in the whole genome (or other set of reference sequences). If $r(I_i)=1$, the reads that were aligned to the interval~$I_i$ have no other valid alignments in the whole genome, i.e. the interval~$I_i$ is the unique match position of these reads. If $r(I_i)=2$, the reads that were aligned to the interval~$I_i$ have exactly one other valid alignment to another interval~\mbox{$I_j,~j \neq i$}. The match specificity is stored in the \Robject{matches} slot of objects of the class \Rclass{AlignedGenomeIntervals}. We first demonstrate the \Rfunction{reduce} using a toy example data object. <>= D <- AlignedGenomeIntervals( start=c(1,3,4,5,8,10,11), end=c(5,5,6,8,9,11,13), chromosome=rep(c("chr1","chr2","chr3"), c(2,2,3)), strand=c("-","-","+","+","+","+","+"), sequence=c("ACATT","ACA","CGT","GTAA","AG","CT","TTT"), reads=rep(1,7), matches=c(rep(1,6),3)) detail(D) @ Calling the \Rfunction{reduce} method on these example data results in the following: <>= detail(reduce(D)) @ Note that the two last intervals still show overlap. However, the last interval is a non-unique match position of the respective reads (\mbox{\Robject{matches}$=3$}), in contrast to the other intervals. Here is another example using the data introduced above. <>= S <- exAI[seqnames(exAI)=="chrX" & matches(exAI)==1L & exAI[,1]>1e8] detail(S) @ Calling the \Rfunction{reduce} method on these data leads to the following result: % <>= detail(reduce(S)) @ Notice that the reads that match the same segment of the X~chromosome differ in their last base. However, since most of the reads have a 'G' as final letter, the combined aligned interval als has a 'G' %an 'R', the IUPAC code for 'A' or 'G', %an 'N' as the last letter. The additional argument~\Rfunarg{method="exact"} can be specified if you want to solely combine intervals that have exactly the same start and stop position (but may have reads of slightly different sequence aligned to them). Consider the following example: <>= S2 <- exAI[seqnames(exAI)=="chr11" & matches(exAI)==1L & exAI[,1]>8e7] detail(S2) detail(reduce(S2, method="exact")) @ Notice that the 6th aligned interval in~\Robject{S2} is only shifted by 1~nt from the 5th one. By default, the function \Rfunction{reduce} would merge them into one aligned genome interval. However, when \Rfunarg{method="exact"} is specified, these two intervals are not merged since they are not at exactly the same position. There are additional methods for restricting the merging to intervals with the same 5'- and \mbox{3'-ends} (specify \Rfunarg{method="same5"} and \Rfunarg{method="same3"}, respectively). \subsection{Visualising the aligned genome intervals} The package~\Rpackage{girafe} contains functions for visualising genomic regions with aligned reads. % <>= plot(exAI, mgi.gi, chr="chrX", start=50400000, end=50410000, show="minus") @ See the result in Figure~\ref{girafe-plotAI}. \myincfig{girafe-plotAI}{0.8\textwidth}{A 10-kb region on the mouse X~chromosome. Reads aligned to the Watson strand in this region would be shown above the chromosome coordinate axis with the annotation of genome elements in this region, while reads aligned to the Crick strand are shown below. In the region shown, there are only intervals with aligned reads on the Crick strand, and these four intervals overlap with annotated microRNA positions.} Note that the annotation of genome elements (as shown in Figure~\ref{girafe-plotAI}) has to be supplied to the function. Here the object \Robject{mgi.gi} contains most annotated genes and ncRNAs for the mouse genome (assembly: \textit{mm9}). This object has been created beforehand\footnote{See the script \texttt{prepareAnnotation.R} in the package scripts directory for an example of how to create such an object.} and it is of class \Rclass{Genome\_intervals\_stranded}, a class defined in package \Rpackage{genomeIntervals}. \subsection{Summarising the data using sliding windows} The data can be searched for regions of defined interest using a sliding-window approach implemented in the function~\Rfunction{perWindow}. For each window, the number of intervals with aligned reads, the total number of reads aligned, the number of unique reads aligned, the fraction of intervals on the Plus-strand, and the higher number of aligned reads at a single interval within the window are reported. <>= exPX <- perWindow(exAI, chr="chrX", winsize=1e5, step=0.5e5) head(exPX[order(exPX$n.overlap, decreasing=TRUE),]) @ \subsection{Exporting the data} The package \Rpackage{girafe} also contains methods for exporting the data into tab-delimited text files, which can be uploaded to the UCSC genome browser\footnote{\url{http://genome.ucsc.edu}} as 'custom tracks'. Currently, there are methods for exporting the data in 'bed' format and 'bedGraph' format, either writing intervals from both strands into one file or into two separate files. Details about these track formats can be found at the UCSC genome browser web pages. <>= export(exAI, con="export.bed", format="bed", name="example_reads", description="Example reads", color="100,100,255", visibility="pack") @ Additional arguments to the export function, besides \Rfunarg{object}, \Rfunarg{con}, and \Rfunarg{format}, are treated as attributes for the track definition line, which specifies details concerning how the data should be visualised in the genome browser. Users may also wish to consult the Bioconductor package~\Rpackage{rtracklayer} for data transfer and direct interaction between R and the UCSC genome browser. \subsection{Overlap with annotated genome features} Next, we determine the degree of overlap of the aligned reads with annotated genomic elements. In this example, the annotated genome elements are provided as an object of class \Rclass{Genome\_intervals\_stranded}\footnote{% Objects of class \Rclass{Genome\_intervals} and \Rclass{AlignedGenomeIntervals} are also allowed.}. This objects needs to be created beforehand. See the script \texttt{prepareAnnotation.R} in the package scripts directory\footnote{\texttt{system.file("scripts", package="girafe")}} for an example of how to create such an object. % <>= exOv <- interval_overlap(exAI, mgi.gi) @ How many elements do read match positions generally overlap? % <>= table(listLen(exOv)) @ What are the \Sexpr{max(listLen(exOv))}~elements overlapped by a single match position? % <>= mgi.gi$ID[exOv[[which.max(listLen(exOv))]]] @ And in general, what kinds of annotated genome elements are overlapped by reads? % <>= (tabOv <- table(as.character(mgi.gi$type)[unlist(exOv)])) @ We display these overlap classes using a pie chart. % <>= my.cols <- brewer.pal(length(tabOv), "Set3") pie(tabOv, col=my.cols, radius=0.88) @ See the result in Figure~\ref{girafe-displayPie}. \myincfig{girafe-displayPie}{0.6\textwidth}{Pie chart showing what kinds of genome elements are overlapped by aligned reads. Note that the proportions of the pie chart are given by the proportions among all annotated genome elements that \mbox{have $\geq 1$ reads} mapped to them and not by the total numbers of reads mapped to elements of that class, in which case the proportion of the miRNA class would be significantly larger.} Note that function \Rfunction{interval.overlap} only determines whether two intervals are overlapping by at least one base. For restricting the result to intervals overlapping by at least a certain number of bases or by a fraction of the interval's length, see the function \Rfunction{fracOverlap}. \section{Memory usage} At the moment, \Rpackage{girafe} and the packages that it depends on, retain all the information concerning the read alignments in memory. This allows quick access to and swift operations on the data, but may limit the package's usability on machines with low amounts of RAM. The step with the highest RAM requirements is importing the alignments and saving them as objects of the \Rclass{AlignedRead} class using the functionality in package \Rpackage{ShortRead}. Usually, objects of the \Rclass{AlignedGenomeIntervals} class are created starting from \Rclass{AlignedRead} objects and the \Rclass{AlignedRead} objects can safely be discarded after this step. Since the data is summarised in that process, \Rclass{AlignedGenomeIntervals} objects require about 10--100 times less memory than the original \Rclass{AlignedRead} object\footnote{% e.g., an \Rclass{AlignedRead} object for holding $10^6$~reads of length 36~bp aligned to the mouse reference genome occupies about 1.4~Gb in RAM but is processed into an \Rclass{AlignedGenomeIntervals} object of size 66.7~Mb}. We recommend that the import of the alignments and the generation of the \Rclass{AlignedGenomeIntervals} are performed using a separate script which only needs to be called once on a machine with sufficient RAM. A suggestion for limiting memory usage is to perform the read alignments and import of the results in batches of a few million reads each. The batch-wise result \Rclass{AlignedGenomeIntervals} objects can later be combined using the basic R function ''c'', the standard way of combining objects, optionally followed by calls of the \Rfunction{reduce} function. For alignments in SAM/BAM format, the \emph{Samtools} software suite~\citep{Li2009} as well as the Bioconductor package \Rpackage{Rsamtools} allow the user to access and import only selected subsets of the data, which also leads to a lower memory footprint. For details, please refer to the documentation of these packages. Finally, while the processing of \Rclass{AlignedRead} objects is the principal way of generating \Rclass{AlignedGenomeIntervals} objects, there is also a convenience function called\\ \Rfunction{AlignedGenomeIntervals}, which can be used to create these objects from simpler objects in the work space, such as data read in using basic R functions such as \Rfunction{scan}. This convenience function may be easier to use for importing and processing the data in manageable chunks. When following these suggestions, most operations with the \Rpackage{girafe} package should be possible on a machine with 4~Gb of RAM, and we have not so far encountered a situation that requires more than 12~Gb (state as of the end of 2009). However, increased throughput of sequencing machines and longer reads will lead to increased memory requirements. Future developments of this and other NGS-related Bioconductor packages will therefore likely concern ways to reduce the memory footprint. One idea is to make use of packages like \Rpackage{ff}, which provide ways of swapping data from RAM to flat files on the hard disk, while still allowing fast and direct access to the data. \section{A word about speed} For improving the run time on machines with multiple processors, some of the functions in the \Rpackage{girafe} package have been implemented to make use of the functionality in the \Rpackage{parallel} package. If \Rpackage{parallel} has been attached and initialised before calling these functions, the functions will make use of \Rfunction{mclapply} instead of the normal \Rfunction{lapply} function. The number of cores to be used in parallel is determined by the \texttt{mc.cores} option (see the example below). For example, if \Rpackage{parallel} is functional on a given system\footnote{The \Rfunction{mclapply} function currently does not support Windows operating systems.}, there should be an obvious speed improvement in the following code example. <>= library("parallel") options("mc.cores"=4) # adjust to your machine exAI.R <- reduce(exAI, mem.friendly=TRUE) @ \section{Links to other Bioconductor packages} The \Rpackage{girafe} package is mostly built upon the interval notation and implementation provided by the packages \Rpackage{intervals} and \Rpackage{genomeIntervals}. Functions from the \Rpackage{ShortRead} package~\citep{Morgan2009} are used for importing the data, and \Rpackage{Biostrings} provides help for working with the read nucleotide sequences. \Rpackage{girafe} also makes limited use of the \Rclass{Rle} and \Rclass{IRanges} classes defined in the \Rpackage{IRanges} package. Furthermore, the data can be converted between object classes defined in \Rpackage{girafe} and \Rpackage{IRanges}. We note that many of the interval operations in \Rpackage{girafe} can also be performed using classes and functions defined in the \Rpackage{IRanges} package. However, the scope of the packages is slightly different. While \Rpackage{IRanges} is meant to be a generic infrastructure package of the Bioconductor project, the aim of \Rpackage{girafe} is to provide a single, comparatively lightweight, object class for working with reads aligned to the genome, the \Rclass{AlignedGenomeIntervals}. This class and its methods allow easy access to such data and facilitate standard operations and workflows. There is some overlap in functionality between \Rpackage{girafe}, \Rpackage{IRanges}, \Rpackage{GenomicRanges} and \Rpackage{tracklayer}. The range of interactions between these packages and new Bioconductor packages related to next-generation sequencing is likely to increase over the releases. Our aim is to provide users with a broad range of alternatives for selecting the classes and functions that are most suited for their workflows. \small \section*{Package versions} This vignette was generated using the following package versions: % <>= toLatex(sessionInfo(), locale=FALSE) @ \section*{Acknowledgements} Many thanks to Nicolas Servant, Val\'erie Cognat, Nicolas Delhomme, and especially Patrick Aboyoun for suggestions and feedback on the package. Special thanks to Julien Gagneur and Richard Bourgon for writing \Rpackage{genomeIntervals} and for rapidly answering all my questions regarding the package.\\ The plotting functions in package \Rpackage{girafe} are largely based on the function \Rfunction{plotAlongChrom} and its auxiliary functions from package \Rpackage{tilingArray}, most of which were written by Wolfgang Huber.\\ \textbf{Funding:} This work was supported by the Institut Curie, INCa "GepiG". \vspace{-0.2cm} \small %%% BIBLIOGRAPHY STARTS HERE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \bibliographystyle{abbrvnat} \bibliography{ngs} \end{document}