%\VignetteIndexEntry{Splicing graphs and RNA-seq data} %\VignetteDepends{SplicingGraphs,TxDb.Hsapiens.UCSC.hg19.knownGene,RNAseqData.HNRNPC.bam.chr14,Gviz} %\VignetteKeywords{Annotation} %\VignettePackage{SplicingGraphs} \documentclass{article} \usepackage{url} \usepackage{float} \usepackage{xcolor} \usepackage[margin=1in]{geometry} \setlength{\parindent}{0pt} \usepackage{enumitem} \setlist{nolistsep} \usepackage[margin=0.75in]{caption} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rcode}[1]{{\texttt{#1}}} \newcommand{\software}[1]{\textsf{#1}} \newcommand{\R}{\software{R}} \newcommand{\Bioconductor}{\software{Bioconductor}} \bibliographystyle{plainnat} \SweaveOpts{keep.source=TRUE} \title{Splicing graphs and RNA-seq data} \author{Herv\'e Pag\`es\and Daniel Bindreither\and Marc Carlson\and Martin Morgan} \date{Last modified: January 2015; Compiled: \today} \begin{document} \maketitle <>= options(width=90) .precomputed_results_path <- "precomputed_results" .loadPrecomputed <- function(objname) { filename <- paste0(objname, ".rda") path <- file.path(.precomputed_results_path, filename) tempenv <- new.env(parent=emptyenv()) load(path, envir=tempenv) get(objname, envir=tempenv) } @ \tableofcontents \setlength{\parskip}{\baselineskip} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Introduction} The \Rpackage{SplicingGraphs} package allows the user to create and manipulate splicing graphs \cite{Sammeth:2010} based on annotations for a given organism. Annotations must describe a \textit{gene model}, that is, they need to contain the following information: \begin{itemize} \item The exact exon/intron structure (i.e., genomic coordinates) for the known transcripts. \item The grouping of transcripts by gene. \end{itemize} Annotations need to be provided as a \Rclass{TxDb} or \Rclass{GRangesList} object, which is how a gene model is typically represented in \Bioconductor. The \Rpackage{SplicingGraphs} package defines the \Rclass{SplicingGraphs} container for storing the splicing graphs together with the gene model that they are based on. Several methods are provided for conveniently access the information stored in a \Rclass{SplicingGraphs} object. Most of these methods are described in this document. The package also allows the user to assign RNA-seq reads to the edges of a \Rclass{SplicingGraphs} object and to summarize them. This requires that the reads have been previously aligned to the exact same reference genome that the gene model is based on. RNA-seq data from an already published study is used to illustrate this functionality. In that study \cite{Zarnack:2012}, the authors performed transcription profiling by high throughput sequencing of HNRNPC knockdown and control HeLa cells (\textit{Human}). %Although there exist several software solutions for RNA-seq %based transcriptome analysis, like RSEM \cite{RSEM2011}, %Cufflinks \cite{Roberts:2011} or the R package \Rpackage{DEXSeq} %\cite{Anders:2010}, the approach of analyzing splicing events as %complete complex structure is still new. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Splicing graphs} Alternative splicing is a frequently observed complex biological process which modifies the primary RNA transcript and leads to transcript variants of genes. Those variants can be plentiful. Especially for large genes it is often difficult to describe their structure in a formal, logical, short and convenient way. To capture the full variety of splicing variants of a certain gene in one single data structure, Heber at al \cite{Heber:2002} introduced the term \textit{splicing graph} and provided a formal framework for representing the different choices of the splicing machinery. \subsection{Definitions and example} For a comprehensive explanation of the splicing graph theory, please refer to \cite{Heber:2002, Sammeth:2010}. A splicing graph is a directed acyclic graph (DAG) where: \begin{itemize} \item Vertices (a.k.a. \textit{nodes}) represent the splicing sites for a given gene. Splicing sites are ordered by their position from 5' to 3' and numbered from \Rcode{1} to \Rcode{n}. This number is the \textit{Splicing Site id}. Splicing graphs are only defined for genes that have all the exons of all their transcripts on the same chromosome and strand. In particular, in its current form, the splicing graph theory cannot describe \textit{trans-splicing} events. \item Edges are the exons and introns between splicing sites. Their orientation follows the 5' to 3' direction, i.e., they go from low to high \textit{Splicing Site ids}. \end{itemize} Two artificial nodes that don't correspond to any splicing site are added to the graph: the \textit{root} (\Rcode{R}) and the \textit{leaf} (\Rcode{L}). Also artificial edges are added so that all the transcripts are represented by a path that goes from \Rcode{R} to \Rcode{L}. That way the graph is connected. Not all paths in the graph are necessarily supported by a transcript. Figure~\ref{img:CIB3} shows the splicing graph representation of all known transcript variants of Human gene CIB3 (Entrez ID 117286). \begin{figure}[!htbp] \begin{center} \includegraphics[width=4in,height=1.5in,angle=90,origin=lB]{images/CIB3-transcripts.pdf} \hskip 0.5in \raisebox{-0.6in}{\includegraphics[width=2in,height=5.5in]{images/CIB3-sg.pdf}} \end{center} \caption{Splicing graph representation of the four transcript variants of gene CIB3 (Entrez ID 117286). Left: transcript representation. Right: splicing graph representation. Orange arrows are edges corresponding to exons. } \label{img:CIB3} \end{figure} \subsection{Details about nodes and edges} Splicing sites can be of the following types: \begin{itemize} \item \textit{Acceptor} site (symbol: \Rcode{-}); \item \textit{Donor} site (symbol: \Rcode{\^}); \item \textit{Transcription start} site (symbol: \Rcode{[}); \item \textit{Transcription end} site (symbol: \Rcode{]}). \end{itemize} The symbols associated with individual types of nodes shown in braces above are used for the identification of complete alternative splicing events. For more information about gathering the alternative splicing events and their associated codes, please refer to \cite{Heber:2002, Sammeth:2010}. In some cases a given splicing site can be associated with 2 types. The combinations of types for such sites is limited. Only \textit{acceptor} (\Rcode{-}) and \textit{transcription start} (\Rcode{[}), or \textit{donor} (\Rcode{\^}) and \textit{transcription end} (\Rcode{]}), can occur as a pair of types. Such a dual-type node occurs when the start of an exon, which is not the first exon of a transcript, falls together with the transcription start site of another transcript of the same gene. An example of such a splicing graph is shown in Figure~\ref{img:ZNF813}. \begin{figure}[!htbp] \begin{center} \includegraphics[width=4in,height=1.5in,angle=-90,origin=rB]{images/ZNF813-transcripts.pdf} \hskip 0.5in \raisebox{-0.6in}{\includegraphics[width=3in,height=5.5in]{images/ZNF813-sg.pdf}} \end{center} \caption{Splicing graph representation of the two transcript variants of Human gene ZNF813 (Entrez ID 126017). Node $3$ is a (\Rcode{-},\Rcode{[}) dual-type node: \textit{acceptor} (\Rcode{-}) for transcript {\tt uc002qbu.2}, and \textit{transcription start} (\Rcode{[}) for transcript {\tt uc010eqq.1}. } \label{img:ZNF813} \end{figure} A similar case occurs when the end of an exon, which is not the last exon of a transcript, falls together with the transcription end site of another transcript of the same gene. Whether a non-artificial edge represents an exon or an intron is determined by the types of its two flanking nodes: \begin{itemize} \item \textbf{Exon} if going from \textit{acceptor} (\Rcode{-}), or (\Rcode{-},\Rcode{[}) dual-type, to \textit{donor} (\Rcode{\^}), or (\Rcode{\^},\Rcode{]}) dual-type; \item \textbf{Intron} if the otherway around. \end{itemize} \subsection{Uninformative nodes} Not all splicing sites (nodes) on the individual transcripts are alternative splicing sites. Therefore the initial splicing graph outlined above can be simplified by removing nodes of in- and out- degree equal to one because they are supposed to be non-informative in terms of alternative splicing. Edges associated with such nodes get sequentially merged with the previous ones and result in longer edges capturing adjacent exons/introns. The final splicing graph only contains nodes involved in alternative splicing. The result of removing uninformative nodes is illustrated in Figure~\ref{img:CIB3-simplified}. \begin{figure}[!htbp] \begin{center} \includegraphics[width=2in,height=5.5in]{images/CIB3-sg.pdf} \hskip 0.5in \includegraphics[width=2in,height=5.5in]{images/CIB3-sg2.pdf} \end{center} \caption{Nodes \Rcode{1}, \Rcode{3}, \Rcode{5}, \Rcode{6}, \Rcode{9}, \Rcode{11}, \Rcode{12}, and \Rcode{13}, are uninformative and can be removed. Left: before removal of the uninformative nodes. Right: after their removal. } \label{img:CIB3-simplified} \end{figure} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Computing splicing graphs from annotations} \label{sec:computing_splicing_graphs} \subsection{Choosing and loading a gene model} The starting point for computing splicing graphs is a set of annotations describing a \textit{gene model} for a given organism. In \Bioconductor, a gene model is typically represented as a \Rclass{TxDb}. A few prepackaged \Rclass{TxDb} objects are available as \Rpackage{TxDb.*} annotation packages in the \Bioconductor{} package repositories. Here the \Rpackage{TxDb.Hsapiens.UCSC.hg19.knownGene} package is used. If there is no prepackaged \Rclass{TxDb} object for the genome/track that the user wants to use, such objects can easily be created by using tools from the \Rpackage{GenomicFeatures} package. First we load the \Rpackage{TxDb.*} package: <>= library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene @ Creating the splicing graphs for all genes in the \Rclass{TxDb} object can take a long time (up to 20 minutes or more). In order to keep things running in a reasonable time in this vignette, we restrict the gene model to the genes located on chromosome 14. Let's use the \Rfunction{isActiveSeq} getter/setter for this. By default, all the chromosomes in a \Rclass{TxDb} object are ``active'': \begin{small} <>= isActiveSeq(txdb)[1:25] @ \end{small} Next we set all the values in this named logical vector to \Rcode{FALSE}, except the value for \Rcode{chr14}: <>= isActiveSeq(txdb)[-match("chr14", names(isActiveSeq(txdb)))] <- FALSE names(which(isActiveSeq(txdb))) @ \subsection{Generating a \Rclass{SplicingGraphs} object} Splicing graphs are computed by calling the \Rfunction{SplicingGraphs} function on the gene model. This is the constructor function for \Rclass{SplicingGraphs} objects. It will compute information about all the splicing graphs (1 per gene) and store it in the returned object. By default, only genes with at least 2 transcripts are considered (this can be changed by using the \Rcode{min.ntx} argument of \Rfunction{SplicingGraphs}): <>= library(SplicingGraphs) sg <- SplicingGraphs(txdb) sg @ <>= ## We load a precomputed 'sg' that contains all the bubbles in ## 'sg@.bubbles_cache'. sg_with_bubbles <- .loadPrecomputed("sg_with_bubbles") sg@.bubbles_cache <- sg_with_bubbles@.bubbles_cache ## Replace NAs with FALSE in circularity flag (because having the flag set ## to NA instead of FALSE (or vice-versa) is not considered a significant ## difference between the 2 objects). isCircular(sg) <- isCircular(sg) %in% TRUE isCircular(sg_with_bubbles) <- isCircular(sg_with_bubbles) %in% TRUE if (!identical(sg, sg_with_bubbles)) stop("'sg' is not identical to precomputed version") @ \Rcode{sg} is a \Rclass{SplicingGraphs} object. It has 1 element per gene and \Rcode{names(sg)} gives the gene ids: <>= names(sg)[1:20] @ \Rcode{seqnames(sg)} and \Rcode{strand(sg)} return the chromosome and strand of the genes: <>= seqnames(sg)[1:20] strand(sg)[1:20] table(strand(sg)) @ The number of transcripts per gene can be obtained with \Rcode{elementNROWS(sg)}: <>= elementNROWS(sg)[1:20] @ \Rfunction{elementNROWS} is a core accessor for list-like objects that returns the lengths of the individual list elements. At this point you might wonder why \Rfunction{elementNROWS} works on \Rclass{SplicingGraphs} objects. Does this mean that those objects are list-like objects? The answer is yes. What do the list elements look like, and how can you access them? This is answered in the next subsection. \subsection{Basic manipulation of a \Rclass{SplicingGraphs} object} The list elements of a list-like object can be accessed \textbf{one at a time} by subsetting the object with \Rfunction{[[} (a.k.a. double-bracket subsetting). On a \Rclass{SplicingGraphs} object, this will extract the transcripts of a given gene. More precisely it will return an \textbf{unnamed} \Rclass{GRangesList} object containing the exons of the gene grouped by transcript: \begin{small} <>= sg[["3183"]] @ \end{small} The exon-level metadata columns are: \begin{itemize} \item \Rcode{exon\_id}: The original internal exon id as stored in the \Rclass{TxDb} object. This id was created and assigned to each exon when the \Rclass{TxDb} object was created. It's not a public id like, say, an Ensembl, RefSeq, or GenBank id. Furthermore, it's only guaranteed to be unique within a \Rclass{TxDb} object. \item \Rcode{exon\_name}: The original exon name as provided by the annotation resource (e.g., UCSC, Ensembl, or GFF file) and stored in the \Rclass{TxDb} object when it was created. Set to \Rcode{NA} if no exon name was provided. \item \Rcode{exon\_rank}: The rank of the exon in the transcript. \item \Rcode{start\_SSid}, \Rcode{end\_SSid}: The \textit{Splicing Site ids} corresponding to the \textit{start} and \textit{end} coordinates of the exon. (Please be cautious to not misinterpret the meaning of \textit{start} and \textit{end} here.) Those ids were assigned by the \Rfunction{SplicingGraphs} constructor. \end{itemize} IMPORTANT NOTE: Please be aware that the \textit{start} and \textit{end} coordinates of an exon, like the \textit{start} and \textit{end} coordinates of a genomic range in general, are following the almost universal convention that \textit{start} is $<=$ \textit{end}, and this \textbf{regardless of the direction of transcription}. As mentioned previously, the \textit{Splicing Site ids} are assigned based on the order of the site positions from 5' to 3'. This means that, for a gene on the plus (resp. minus) strand, the ids in the \Rcode{start\_SSid} metadata column are always lower (resp. greater) than those in the \Rcode{end\_SSid} metadata column. However, on both strands, the \textit{Splicing Site id} increases with the rank of the exon. The \Rfunction{show} method for \Rclass{GRangesList} objects only displays the inner metadata columns (which are at the exon level for an object like \Rcode{sg[["3183"]]}). To see the outer metadata columns (transcript-level metadata columns for objects like \Rcode{sg[["3183"]]}), we need to extract them explicitely: <>= mcols(sg[["3183"]]) @ The transcript-level metadata columns are: \begin{itemize} \item \Rcode{tx\_id}: The original transcript id as provided by the annotation resource (e.g. UCSC, Ensembl, or GFF file) and stored in the \Rclass{TxDb} object when it was created. \item \Rcode{txpath}: A named list-like object with one list element per transcript in the gene. Each list element is an integer vector that describes the \textit{path} of the transcript, i.e., the \textit{Splicing Site ids} that it goes thru. \end{itemize} <>= mcols(sg[["3183"]])$txpath @ A more convenient way to extract this information is to use the \Rfunction{txpath} accessor: <>= txpath(sg[["3183"]]) @ The list elements of the \Rcode{txpath} metadata column always consist of an even number of \textit{Splicing Site ids} in ascending order. The transcripts in a \Rclass{GRangesList} object like \Rcode{sg[["3183"]]} can be plotted with \Rfunction{plotTranscripts}: <>= plotTranscripts(sg[["3183"]]) @ The resulting plot is shown in figure~\ref{img:3183-transcripts}. <>= pdf("3183-transcripts.pdf", width=6, height=3) plotTranscripts(sg[["3183"]]) dev.off() @ \begin{figure}[!htbp] \begin{center} \includegraphics[width=6in,height=3in]{3183-transcripts.pdf} \end{center} \caption{The 11 transcripts of gene HNRNPC (Entrez ID 3183).} \label{img:3183-transcripts} \end{figure} \Rclass{SplicingGraphs} objects, like most list-like objects, can be unlisted with \Rfunction{unlist}. This will extract the transcripts of all the genes and return them as a \textbf{named} \Rclass{GRangesList} object. The names on the object are the gene ids: <>= ex_by_tx <- unlist(sg) head(names(ex_by_tx)) @ Because each element in the object represents a transcript (and not a gene), the names are not unique! This means that trying to subset the object by name (e.g. with \Rcode{ex\_by\_tx["3183"]} or \Rcode{ex\_by\_tx[["3183"]]}) is probably a bad idea because this will only select the first element with that name. When the names on a vector-like object \Rcode{x} are not unique, a safe way to select the elements with some given names is to do something like \Rcode{x[names(x) \%in\% c("name1", "name2")]}. For example, to select transcripts from genes 10001 and 100129075: \begin{small} <>= ex_by_tx[names(ex_by_tx) %in% c("10001", "100129075")] @ \end{small} \subsection{Extracting and plotting graphs from a \Rclass{SplicingGraphs} object} The edges (resp. nodes) of the splicing graph of a given gene can be extracted with the \Rfunction{sgedges} (resp. \Rfunction{sgnodes}) function. An important caveat is that this can only be done for one gene at a time, or, said otherwise, these functions only work on a \Rclass{SplicingGraphs} object of length 1. Here is where subsetting with \Rfunction{[} (a.k.a. single-bracket subsetting) comes into play. Using \Rfunction{[} on a \Rclass{SplicingGraphs} object returns a \Rclass{SplicingGraphs} object containing only the selected genes: <>= sg[strand(sg) == "-"] sg[1:20] tail(sg) # equivalent to 'sg[tail(seq_along(sg))]' sg["3183"] @ Let's extract the splicing graph edges for gene HNRNPC (Entrez ID 3183): <>= sgedges(sg["3183"]) @ The \Rclass{DataFrame} object returned by \Rfunction{sgedges} has 1 row per edge. Its columns are explained below. Let's extract the splicing graph nodes for that gene: <>= sgnodes(sg["3183"]) @ The character vector returned by \Rfunction{sgnodes} contains the node ids, that is, \Rcode{R} and \Rcode{L} for the \textit{root} and \textit{leaf} nodes, and the \textit{Splicing Site ids} for the other nodes. The node ids are always returned in ascending order with \Rcode{R} and \Rcode{L} being always the first and last nodes, respectively. The \Rclass{DataFrame} object returned by \Rfunction{sgedges} has the following columns: \begin{itemize} \item \Rcode{from}, \Rcode{to}: The 2 nodes connected by the edge. \item \Rcode{sgedge\_id}: A \textit{global edge id} of the form \Rcode{gene\_id:from,to}. \item \Rcode{ex\_or\_in}: The type of the edge, i.e., exon, intron, or no type if it's an artificial edge. \item \Rcode{tx\_id}: The ids of the transcripts that support the edge. \end{itemize} Alternatively the edges and ranges of all the genes can be extracted with \Rfunction{sgedgesByGene}: <>= edges_by_gene <- sgedgesByGene(sg) @ In this case the edges are returned in a \Rclass{GRangesList} object where they are grouped by gene. \Rcode{edges\_by\_gene} has the length and names of \Rcode{sg}, that is, the names on it are the gene ids and are guaranteed to be unique. Let's look at the edges for gene 3183: \begin{small} <>= edges_by_gene[["3183"]] @ \end{small} The edge-level metadata columns are the same as the columns of the \Rclass{DataFrame} object returned by \Rfunction{sgedges}. An important difference though is that the artificial edges (i.e., edges starting from the root node (\Rcode{R}) or ending at the leaf node (\Rcode{L})) are omitted! Finally, to plot a given splicing graph: <>= plot(sg["3183"]) plot(sgraph(sg["3183"], tx_id.as.edge.label=TRUE)) @ The resulting plots are shown in figure~\ref{img:3183-sgraph}. <>= pdf("3183-sgraph.pdf", width=3, height=5) plot(sgraph(sg["3183"])) dev.off() pdf("3183-sgraph-labelled.pdf", width=3, height=5) plot(sgraph(sg["3183"], tx_id.as.edge.label=TRUE)) dev.off() @ \begin{figure}[!htbp] \begin{center} \includegraphics[width=3in,height=5in]{3183-sgraph.pdf} \hskip 0.25in \includegraphics[width=3in,height=5in]{3183-sgraph-labelled.pdf} \end{center} \caption{Splicing graph representation of gene HNRNPC (Entrez ID 3183). Left: unlabelled edges. Right: edge labelled with the transcript ids (or names). } \label{img:3183-sgraph} \end{figure} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Splicing graph bubbles} \subsection{Some definitions} The definition of a bubble provided in Sammeth's paper \cite{Sammeth:2010} is quite obfuscated. Here we try to give a simpler one. We start by introducing the notion of a valid path between 2 nodes of a splicing graph: a ``valid path'' between 2 nodes $s$ and $t$ ($s < t$ ) of a splicing graph is a path followed by at least 1 transcript (see the \Rfunction{txpath} function in previous section for how to extract the transcript paths). Let's illustrate this by looking at the splicing graph for gene 100309464 (Entrez ID). The graph is shown in figure~\ref{img:100309464-sgraph}. <>= pdf("100309464-sgraph-labelled.pdf", width=5, height=5) plot(sgraph(sg["100309464"], tx_id.as.edge.label=TRUE)) dev.off() @ \begin{figure}[!htbp] \begin{center} \includegraphics[width=5in,height=5in]{100309464-sgraph-labelled.pdf} \end{center} \caption{Splicing graph representation of gene 100309464 (Entrez ID). Valid paths between nodes \Rcode{R} and \Rcode{14} are: \Rcode{\{R,3,5,6,7,14\}} (followed by transcript \Rcode{uc021rto.1}); \Rcode{\{R,3,5,8,9,10,11,14\}} (followed by transcript \Rcode{uc001xcr.2}). Paths \Rcode{\{R,1,2,4,5,6,7,14\}} and \Rcode{\{R,1,2,4,5,8,9,10,11,14\}} are not followed by any known transcript and thus are not valid paths. } \label{img:100309464-sgraph} \end{figure} Path \Rcode{\{R,3,5,6,7,14\}} in that graph is followed by transcript \Rcode{uc021rto.1} and thus is valid. However path \Rcode{\{R,1,2,4,5,6,7,14\}} is not followed by any known transcript and thus is not valid. Now the definition of a bubble: there is a ``bubble'' between 2 nodes $s$ and $t$ ($s < t$ ) of a splicing graph if (a) there are at least 2 distinct valid paths between $s$ and $t$, and (b) no other nodes than $s$ and $t$ are shared by all the valid paths between them. Nodes $s$ and $t$ are said to be the ``source'' and ``sink'' of the bubble. The number of distinct valid paths between $s$ and $t$ is the ``dimension'' of the bubble. For example, there is a bubble between nodes \Rcode{5} and \Rcode{L} of gene 100309464 (see figure~\ref{img:100309464-sgraph}) because the 3 valid paths between them don't share other nodes. The dimension of this bubble is 3. However, there is no bubble between nodes \Rcode{R} and \Rcode{14} because the 2 valid paths between them also share nodes \Rcode{3} and \Rcode{5}. The set of all valid paths that form a bubble describes a ``complete alternative splicing event'' (a.k.a. ``complete AS event'', or simply ``complete event''). The individual valid paths that form the bubble are sometimes called the ``variants'' of the bubble or event. \subsection{Computing the bubbles of a splicing graph} Like with the \Rfunction{sgedges} and \Rfunction{sgnodes} functions, the bubbles can only be computed for one gene (or graph) at a time. Let's compute the bubbles for gene 100309464 (Entrez ID) (represented on figure~\ref{img:100309464-sgraph}): <>= bubbles(sg["100309464"]) @ The \Rclass{DataFrame} object returned by \Rfunction{bubbles} has 1 row per bubble in the graph and the following columns: \begin{itemize} \item \Rcode{source}, \Rcode{sink}: The source and sink of the bubble. \item \Rcode{d}: The dimension of the bubble. \item \Rcode{partitions}, \Rcode{paths}: The latter contains the valid paths between the source and sink of the bubble (note that the source and sink nodes are not reported in the paths). The former contains the list of transcripts associated with each of these paths. \item \Rcode{AScode}, \Rcode{description}: The AS code (Alternative Splicing code) and its verbose description. More on these in the next subsection. \end{itemize} Bubbles can be nested, i.e., contain (or be contained in) other bubbles of smaller (or higher) dimensions. For example, in gene 100309464, the bubble between nodes \Rcode{5} and \Rcode{L} (3rd bubble returned by \Rfunction{bubbles}) is of dimension 3 and contains the smaller bubble between nodes \Rcode{5} and \Rcode{14} (2nd bubble returned by \Rfunction{bubbles}) which is of dimension 2. Bubbles can also overlap without one containing any of the others. \subsection{AS codes} See \cite{Sammeth:2010} for how AS codes are generated. Let's illustrate this by looking at how the AS code for the 2nd bubble in gene 100309464 is obtained. In 3 steps: \begin{enumerate} \item The paths reported for this bubble are \Rcode{\{6,7\}} and \Rcode{\{8,9,10,11\}}. All the nodes involved in these paths are renumbered starting at \Rcode{1} so nodes \Rcode{6}, \Rcode{7}, \Rcode{8}, \Rcode{9}, \Rcode{10}, \Rcode{11} become \Rcode{1}, \Rcode{2}, \Rcode{3}, \Rcode{4}, \Rcode{5}, \Rcode{6}, respectively. \item Then, for each path, a string is generated by putting together the new node numbers and node types in that path. By convention, this string is set to \Rcode{0} if the path is empty. So path \Rcode{\{6,7\}} and \Rcode{\{8,9,10,11\}} generate strings \Rcode{1-2\^} and \Rcode{3-4\^5-6\^}, respectively. \item Finally, the strings obtained previously are sorted lexicographically and concatenated together separated by commas (\Rcode{,}). This gives the final code: \Rcode{1-2\^,3-4\^5-6\^}. \end{enumerate} Verbose descriptions of the AS codes are obtained by looking them up through the \Rcode{ASCODE2DESC} predefined object. \Rcode{ASCODE2DESC} is a named character vector that contains the verbose descriptions for the AS codes of the 50 most frequent patterns of internal complete events found in Human as reported in Table $1$ of Sammeth's paper \cite{Sammeth:2010}. <>= codes <- bubbles(sg["10202"])$AScode data.frame(AScode=codes, description=ASCODE2DESC[codes], row.names=NULL) @ \Rcode{ASCODE2DESC} only contains the 50 code descriptions reported in Table $1$ of Sammeth's paper \cite{Sammeth:2010}. Therefore, looking up codes through \Rcode{ASCODE2DESC} will inevitably return an \Rcode{NA} for some codes. For example, even simple codes like \Rcode{0,1[2\^} or \Rcode{0,1-2\^{}3-4]} (which mean "skip the first exon" or "skip the last 2 exons", respectively) are not in Sammeth's table, because, as its title indicates, only \textbf{internal} complete events are reported in that table (a complete event, or bubble, being considered ``internal'' when its source and sink nodes are not \Rcode{R} or \Rcode{L}, or, equivalently, when its code has no \Rcode{[} or \Rcode{]} in it).\footnote{Despite its title (``50 Most Frequent Patterns of Internal Complete Events Found in Human''), Table $1$ of Sammeth's paper \cite{Sammeth:2010} actually contains AS codes that seem to correspond to complete events that are not internal. For example, the 16th AS code in the table is \Rcode{1*2\^{},3*4\^} and its description is \Rcode{"2 alternative first exons"}, which indicates that the event is internal. This raises at least 2 questions: why is it in the table? and why does the code contain \Rcode{*} instead of \Rcode{[}? A third question could be: why exclude complete events that are not internal in the first place?} \subsection{Tabulating all the AS codes for Human chromosome 14} To extract all the bubbles from all the genes on Human chromosome 14 and tabulate their codes, we do the following: % NOTE: The lapply() loop below should take less than 1 min on a modern laptop % when the vignette is passed thru Sweave(), because in that case, 'sg' already % contains all the precomputed bubbles in 'sg@.bubble_cache'. With a fresh 'sg' % however, it would take between 6 and 7 min on a modern laptop. \begin{small} <>= AScode_list <- lapply(seq_along(sg), function(i) bubbles(sg[i])$AScode) names(AScode_list) <- names(sg) AScode_table <- table(unlist(AScode_list)) AScode_table <- sort(AScode_table, decreasing=TRUE) AScode_summary <- data.frame(AScode=names(AScode_table), NbOfEvents=as.vector(AScode_table), Desciption=ASCODE2DESC[names(AScode_table)]) nrow(AScode_summary) @ \end{small} \begin{small} <>= head(AScode_summary, n=10) @ \end{small} Amongst the \Sexpr{nrow(AScode_summary)} distinct complete events we observe on Human chromosome 14, the most frequent one is named ``skip 1 exon''. At this point it should be mentioned that not all splicing events have names since there are a variety of different events. Another interesting fact we can gain from the splicing graph is the number of complete events each gene has. To answer this we have to count the number of bubbles within each gene. <>= nb_bubbles_per_gene <- elementNROWS(AScode_list) @ Below the genes with the most observed splicing events are shown. <>= head(sort(nb_bubbles_per_gene, decreasing=TRUE)) @ Next we want to see which genes show the highest heterogeneity of splicing events. We check for the number of unique splicing events for each gene. <>= nb_unique_bubbles_per_gene <- elementNROWS(unique(CharacterList(AScode_list))) @ Below the genes with the highest heterogeneity in observed splicing events are shown <>= head(sort(nb_unique_bubbles_per_gene, decreasing=TRUE)) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Counting reads} In this section, we illustrate how RNA-seq reads can be assigned to the exonic and intronic edges of a \Rclass{SplicingGraphs} object, and how these assignments can be summarized in a table of counts. The reads we're going to assign are stored as BAM files available in the \Rpackage{RNAseqData.HNRNPC.bam.chr14} package. They've been aligned to the \textit{hg19} reference genome from UCSC, which is also the reference genome of the gene model in our \Rclass{SplicingGraphs} object \Rcode{sg}. As a general rule, the aligned reads and the annotations should always be based on the same reference genome. Let's start by loading the \Rpackage{RNAseqData.HNRNPC.bam.chr14} package: <>= library(RNAseqData.HNRNPC.bam.chr14) bam_files <- RNAseqData.HNRNPC.bam.chr14_BAMFILES names(bam_files) # the names of the runs @ Another important question about RNA-seq reads is whether they are single- or paired-end. This can be checked with the \Rfunction{quickBamFlagSummary} utility from the \Rpackage{Rsamtools} package: <>= quickBamFlagSummary(bam_files[1], main.groups.only=TRUE) @ Doing this for the 8 BAM files confirms that all the reads in the files are paired-end. This means that they should be loaded with the \Rfunction{readGAlignmentPairs} function from the \Rpackage{GenomicAlignments} package. As a quick example, we load the reads located on the first 20 million bases of chromosome 14: <>= param <- ScanBamParam(which=GRanges("chr14", IRanges(1, 20000000))) galp <- readGAlignmentPairs(bam_files[1], param=param) length(galp) # nb of alignment pairs galp @ The result is a \Rclass{GAlignmentPairs} object with 1 element per alignment pair (alignment pair = pair of BAM records). Although paired-end reads could also be loaded with \Rfunction{readGAlignments} (also from the \Rpackage{GenomicRanges} package), they should be loaded with the \Rfunction{readGAlignmentPairs}: this actually pairs the BAM records and returns the pairs in a \Rclass{GAlignmentPairs} object. Otherwise the records won't be paired and will be returned in a \Rclass{GAlignments} object. \subsection{Assigning reads to the edges of a \Rclass{SplicingGraphs} object} We can use the \Rcode{assignReads} function to assign reads to the the exonic and intronic edges of \Rcode{sg}. The same read can be assigned to more than one exonic or intronic edge. For example, a junction read with 1 junction (i.e., 1 \Rcode{N} in its CIGAR) can be assigned to an intron and its 2 flanking exons, and this can happen for one or more transcripts, from the same gene or from different genes. We're going to loop on the BAM files to assign the reads from 1 run at a time. When we load the files, we want to filter out secondary alignments, reads not passing quality controls, and PCR or optical duplicates. This is done by preparing a \Rclass{ScanBamParam} object that we'll pass to \Rfunction{readGAlignmentPairs} (see \Rfunction{ScanBamParam} in the \Rpackage{Rsamtools} package for more information about this): <>= flag0 <- scanBamFlag(isSecondaryAlignment=FALSE, isNotPassingQualityControls=FALSE, isDuplicate=FALSE) param0 <- ScanBamParam(flag=flag0) @ Loop: <>= ## The following loop takes about 7 minutes on a modern laptop/desktop... for (i in seq_along(bam_files)) { bam_file <- bam_files[i] cat("Processing run ", names(bam_file), " ... ", sep="") galp <- readGAlignmentPairs(bam_file, use.names=TRUE, param=param0) sg <- assignReads(sg, galp, sample.name=names(bam_file)) cat("OK\n") } @ <>= sg_with_reads <- .loadPrecomputed("sg_with_reads") ## Remove the reads from 'sg_with_reads' and compare with 'sg'. if (!isTRUE(all.equal(removeReads(sg_with_reads), sg))) stop("after removal of the hits, precomputed version of 'sg_with_reads' ", "is not identical to 'sg'") sg <- sg_with_reads @ The assignments to the exonic and intronic edges of a given gene can be retrieved by extracting the edges grouped by transcript for that gene. This is achieved by calling \Rfunction{sgedgesByTranscript} on the subsetted \Rclass{SplicingGraphs} object. \Rfunction{sgedgesByTranscript} is similar to \Rfunction{sgedgesByGene} (introduced in Section \ref{sec:computing_splicing_graphs}), except that it groups the edges by transcript instead of by gene. By default the edge-level metadata columns are the same as with \Rfunction{sgedgesByGene} so we use \Rcode{with.hits.mcols=TRUE} to get the additional metadata columns containing the assignments: <>= edges_by_tx <- sgedgesByTranscript(sg["3183"], with.hits.mcols=TRUE) edge_data <- mcols(unlist(edges_by_tx)) colnames(edge_data) @ There is 1 \textit{hits} column per run (or sample). They appear in the order they were created. Each of them is a \Rclass{CharacterList} object that contains the reads from that run (or sample) that were assigned to the individual edges. Let's look at the \textit{hits} column for the first run (\Rcode{ERR127306}): <>= head(edge_data[ , c("sgedge_id", "ERR127306.hits")]) @ \subsection{How does read assignment work?} [TODO: Complete this subsection.] For the purpose of explaining how reads from run \Rcode{ERR127306} were assigned to the edges of gene 3183, we load the reads that are in the region of that gene and contain at least 1 junction: <>= param <- ScanBamParam(flag=flag0, which=range(unlist(sg[["3183"]]))) reads <- readGAlignmentPairs(bam_files[1], use.names=TRUE, param=param) junction_reads <- reads[njunc(first(reads)) + njunc(last(reads)) != 0L] @ and we plot the genomic region chr14:21675000-21702000: <>= plotTranscripts(sg[["3183"]], reads=junction_reads, from=21675000, to=21702000) @ The resulting plot is shown on figure \ref{img:3183-transcripts-and-reads}. <>= pdf("3183-transcripts-and-reads.pdf", width=12, height=12) plotTranscripts(sg[["3183"]], reads=junction_reads, from=21675000, to=21702000) dev.off() @ \begin{figure}[!htbp] \begin{center} \includegraphics[width=6in,height=4in]{3183-transcripts-and-reads.pdf} \end{center} \caption{Plot of the genomic region chr14:21675000-21702000 showing the 11 transcripts of gene HNRNPC (Entrez ID 3183) and the reads in that region. Reads that are ``compatible'' with the splicing of a transcript are plotted in dark blue. Incompatible reads are in light blue.} \label{img:3183-transcripts-and-reads} \end{figure} Figure \ref{img:3183-transcripts-and-reads-zoom} is a zoom on genomic region chr14:21698400-21698600 that was obtained with: <>= plotTranscripts(sg[["3183"]], reads=junction_reads, from=21698400, to=21698600) @ <>= pdf("3183-transcripts-and-reads-zoom.pdf", width=12, height=12) plotTranscripts(sg[["3183"]], reads=junction_reads, from=21698400, to=21698600) dev.off() @ \begin{figure}[!htbp] \begin{center} \includegraphics[width=6in,height=4in]{3183-transcripts-and-reads-zoom.pdf} \end{center} \caption{Zoom on genomic region chr14:21698400-21698600 of figure \ref{img:3183-transcripts-and-reads}.} \label{img:3183-transcripts-and-reads-zoom} \end{figure} [TODO: To be continued.] \subsection{Counting and summarizing the assigned reads} We use \Rfunction{countReads} to count the number of reads assigned to each (non-artificial) edge in \Rcode{sg} for each run: <>= sg_counts <- countReads(sg) @ The returned object is a \Rclass{DataFrame} with one row per unique splicing graph edge and one column of counts per run. Two additional columns contain the splicing graph edge ids and the type of edge (exon or intron): <>= dim(sg_counts) head(sg_counts[1:5]) @ Total number of hits per run: <>= sapply(sg_counts[-(1:2)], sum) @ [TODO: Suggest some possible downstream statistical analysis that take the output of countReads() as a starting point.] %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Session Information} All of the output in this vignette was produced under the following conditions: \begin{small} <>= sessionInfo() @ \end{small} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \begin{thebibliography}{} \bibitem{Sammeth:2010} {Michael Sammeth}. \newblock Complete Alternative Splicing Events Are Bubbles in Splicing Graphs \newblock {\em Computational Biology}, \textbf{16}, 1117-1140, 2010. \bibitem{Zarnack:2012} {Kathi Zarnack} and {Julian Konig} and {Mojca Tajnik} and {Inigo Martincorena} and {Sebastian Eustermann} and {Isabelle Stevant} and {Alejandro Reyes} and {Simon Anders} and {Nicholas M. Luscombe} and {Jernej Ule} \newblock Direct Competition between hnRNP C and U2AF65 Protects the Transcriptome from the Exonization of Alu Elements \newblock {\em Cell}, 2012, Europe PMC 23374342. \bibitem{RSEM2011} {LI, Bo} and {Dewey, Colin}. \newblock RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome \newblock {\em BMC Bioinformatics}, \textbf{12}(1), 323, 2011. \bibitem{Roberts:2011} {Roberts, Adam} and {Pimentel, Harold} and {Trapnell, Cole} and {Pachter, Lior}. \newblock Identification of novel transcripts in annotated genomes using RNA-Seq \newblock {\em BIOINFORMATICS}, 2011 \bibitem{Anders:2010} {Simon Anders} and {Wolfgang Huber}. \newblock {D}ifferential expression analysis for sequence count data \newblock {\em Genome Biology}, \textbf{11}, R106, 2010. \bibitem{Heber:2002} {Heber, Steffen} and {Alekseyev, Max} and {Sze, Sing-Hoi} and {Tang, Haixu} and {Pevzner, Pavel A.} \newblock Splicing graphs and EST assembly problem \newblock {\em Bioinformatics}, \textbf{18}(suppl 1), S181-S188, 2002. \end{thebibliography} \end{document}