%\VignetteIndexEntry{Splicing graphs and RNA-seq data}
%\VignetteDepends{SplicingGraphs,TxDb.Hsapiens.UCSC.hg19.knownGene,RNAseqData.HNRNPC.bam.chr14,Gviz}
%\VignetteKeywords{Annotation}
%\VignettePackage{SplicingGraphs}
\documentclass{article}

\usepackage{url}
\usepackage{float}
\usepackage{xcolor}

\usepackage[margin=1in]{geometry}

\setlength{\parindent}{0pt}

\usepackage{enumitem}
\setlist{nolistsep}

\usepackage[margin=0.75in]{caption}

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\Rcode}[1]{{\texttt{#1}}}
\newcommand{\software}[1]{\textsf{#1}}
\newcommand{\R}{\software{R}}
\newcommand{\Bioconductor}{\software{Bioconductor}}

\bibliographystyle{plainnat}

\SweaveOpts{keep.source=TRUE}

\title{Splicing graphs and RNA-seq data}

\author{Herv\'e Pag\`es\and
  Daniel Bindreither\and
  Marc Carlson\and
  Martin Morgan}

\date{Last modified: January 2015; Compiled: \today}

\begin{document}

\maketitle

<<settings,echo=FALSE>>=
options(width=90)
.precomputed_results_path <- "precomputed_results"
.loadPrecomputed <- function(objname)
{
    filename <- paste0(objname, ".rda")
    path <- file.path(.precomputed_results_path, filename)
    tempenv <- new.env(parent=emptyenv())
    load(path, envir=tempenv)
    get(objname, envir=tempenv)
}
@ 

\tableofcontents

\setlength{\parskip}{\baselineskip}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}

The \Rpackage{SplicingGraphs} package allows the user to create and
manipulate splicing graphs \cite{Sammeth:2010} based on annotations
for a given organism. Annotations must describe a \textit{gene model},
that is, they need to contain the following information:
\begin{itemize}
  \item The exact exon/intron structure (i.e., genomic coordinates)
        for the known transcripts.
  \item The grouping of transcripts by gene.
\end{itemize}

Annotations need to be provided as a \Rclass{TxDb} or
\Rclass{GRangesList} object, which is how a gene model is typically
represented in \Bioconductor.

The \Rpackage{SplicingGraphs} package defines the
\Rclass{SplicingGraphs} container for storing the splicing graphs
together with the gene model that they are based on.  Several methods
are provided for conveniently access the information stored in a
\Rclass{SplicingGraphs} object. Most of these methods are described in
this document.

The package also allows the user to assign RNA-seq reads to
the edges of a \Rclass{SplicingGraphs} object and to summarize them.
This requires that the reads have been previously aligned to the exact
same reference genome that the gene model is based on. RNA-seq data
from an already published study is used to illustrate this functionality.
In that study \cite{Zarnack:2012}, the authors performed transcription
profiling by high throughput sequencing of HNRNPC knockdown and control
HeLa cells (\textit{Human}).

%Although there exist several software solutions for RNA-seq 
%based transcriptome analysis, like RSEM \cite{RSEM2011},
%Cufflinks \cite{Roberts:2011} or the R package \Rpackage{DEXSeq} 
%\cite{Anders:2010}, the approach of analyzing splicing events as
%complete complex structure is still new. 


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Splicing graphs}

Alternative splicing is a frequently observed complex biological 
process which modifies the primary RNA transcript and leads to 
transcript variants of genes. Those variants can be plentiful. 
Especially for large genes it is often difficult to describe their
structure in a formal, logical, short and convenient way. To capture 
the full variety of splicing variants of a certain gene in one single 
data structure, Heber at al \cite{Heber:2002} introduced the term 
\textit{splicing graph} and provided a formal framework for representing
the different choices of the splicing machinery. 

\subsection{Definitions and example}

For a comprehensive explanation of the splicing graph theory,
please refer to \cite{Heber:2002, Sammeth:2010}.

A splicing graph is a directed acyclic graph (DAG) where:
\begin{itemize}
  \item Vertices (a.k.a. \textit{nodes}) represent the splicing sites
        for a given gene. Splicing sites are ordered by their position from
        5' to 3' and numbered from \Rcode{1} to \Rcode{n}. This number
        is the \textit{Splicing Site id}.
        Splicing graphs are only defined for genes that have all
        the exons of all their transcripts on the same chromosome and
        strand. In particular, in its current form, the splicing graph
        theory cannot describe \textit{trans-splicing} events.
  \item Edges are the exons and introns between splicing sites.
        Their orientation follows the 5' to 3' direction, i.e., they go
        from low to high \textit{Splicing Site ids}.
\end{itemize}

Two artificial nodes that don't correspond to any splicing site
are added to the graph: the \textit{root} (\Rcode{R}) and the \textit{leaf}
(\Rcode{L}). Also artificial edges are added so that all the transcripts are
represented by a path that goes from \Rcode{R} to \Rcode{L}. That way the
graph is connected. Not all paths in the graph are necessarily
supported by a transcript.

Figure~\ref{img:CIB3} shows the splicing graph representation
of all known transcript variants of Human gene CIB3 (Entrez ID 117286).

\begin{figure}[!htbp]
  \begin{center}
  \includegraphics[width=4in,height=1.5in,angle=90,origin=lB]{images/CIB3-transcripts.pdf}
  \hskip 0.5in
  \raisebox{-0.6in}{\includegraphics[width=2in,height=5.5in]{images/CIB3-sg.pdf}}
  \end{center}
  \caption{Splicing graph representation of the four transcript variants
    of gene CIB3 (Entrez ID 117286).
    Left: transcript representation. Right: splicing graph representation.
    Orange arrows are edges corresponding to exons.
  }
  \label{img:CIB3}
\end{figure}

\subsection{Details about nodes and edges}

Splicing sites can be of the following types:
\begin{itemize}
  \item \textit{Acceptor} site (symbol: \Rcode{-});
  \item \textit{Donor} site (symbol: \Rcode{\^});
  \item \textit{Transcription start} site (symbol: \Rcode{[});
  \item \textit{Transcription end} site (symbol: \Rcode{]}).
\end{itemize}

The symbols associated with individual types of nodes shown in braces
above are used for the identification of complete alternative splicing
events. For more information about gathering the alternative splicing
events and their associated codes, please refer to \cite{Heber:2002,
  Sammeth:2010}.

In some cases a given splicing site can be associated with 2 types.
The combinations of types for such sites is limited.
Only \textit{acceptor} (\Rcode{-}) and \textit{transcription start}
(\Rcode{[}), or \textit{donor} (\Rcode{\^}) and \textit{transcription end}
(\Rcode{]}), can occur as a pair of types. Such a dual-type node occurs
when the start of an exon, which is not the first exon of a transcript,
falls together with the transcription start site of another transcript
of the same gene.
An example of such a splicing graph is shown in Figure~\ref{img:ZNF813}.

\begin{figure}[!htbp]
  \begin{center}
  \includegraphics[width=4in,height=1.5in,angle=-90,origin=rB]{images/ZNF813-transcripts.pdf}
  \hskip 0.5in
  \raisebox{-0.6in}{\includegraphics[width=3in,height=5.5in]{images/ZNF813-sg.pdf}}
  \end{center}
  \caption{Splicing graph representation of the two transcript
    variants of Human gene ZNF813 (Entrez ID 126017). Node $3$ is a
    (\Rcode{-},\Rcode{[}) dual-type node: \textit{acceptor} (\Rcode{-})
    for transcript {\tt uc002qbu.2}, and \textit{transcription start}
    (\Rcode{[}) for transcript {\tt uc010eqq.1}.
  }
  \label{img:ZNF813}
\end{figure}

A similar case occurs when the end of an exon, which is not the last
exon of a transcript, falls together with the transcription end site
of another transcript of the same gene.

Whether a non-artificial edge represents an exon or an intron is
determined by the types of its two flanking nodes:
\begin{itemize}
  \item \textbf{Exon} if going from \textit{acceptor} (\Rcode{-}),
        or (\Rcode{-},\Rcode{[}) dual-type, to \textit{donor} (\Rcode{\^}),
        or (\Rcode{\^},\Rcode{]}) dual-type;
  \item \textbf{Intron} if the otherway around.
\end{itemize}

\subsection{Uninformative nodes}

Not all splicing sites (nodes) on the individual transcripts are alternative 
splicing sites. Therefore the initial splicing graph outlined above can be
simplified by removing nodes of in- and out- degree equal to one because
they are supposed to be non-informative in terms of alternative splicing.
Edges associated with such nodes get sequentially merged with
the previous ones and result in longer edges capturing adjacent
exons/introns. The final splicing graph only contains nodes involved in
alternative splicing. The result of removing uninformative nodes is
illustrated in Figure~\ref{img:CIB3-simplified}.

\begin{figure}[!htbp]
  \begin{center}
  \includegraphics[width=2in,height=5.5in]{images/CIB3-sg.pdf}
  \hskip 0.5in
  \includegraphics[width=2in,height=5.5in]{images/CIB3-sg2.pdf}
  \end{center}
  \caption{Nodes \Rcode{1}, \Rcode{3}, \Rcode{5}, \Rcode{6}, \Rcode{9},
    \Rcode{11}, \Rcode{12}, and \Rcode{13}, are uninformative and can be
    removed. Left: before removal of the uninformative nodes. Right: after
    their removal.
  }
  \label{img:CIB3-simplified}
\end{figure}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Computing splicing graphs from annotations}
\label{sec:computing_splicing_graphs}

\subsection{Choosing and loading a gene model}

The starting point for computing splicing graphs is a set of annotations
describing a \textit{gene model} for a given organism. In \Bioconductor,
a gene model is typically represented as a \Rclass{TxDb}.
A few prepackaged \Rclass{TxDb} objects are available as
\Rpackage{TxDb.*} annotation packages in the \Bioconductor{} package
repositories. Here the \Rpackage{TxDb.Hsapiens.UCSC.hg19.knownGene}
package is used. If there is no prepackaged \Rclass{TxDb}
object for the genome/track that the user wants to use, such objects can
easily be created by using tools from the \Rpackage{GenomicFeatures} 
package.

First we load the \Rpackage{TxDb.*} package:

<<txdb>>=
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
@

Creating the splicing graphs for all genes in the \Rclass{TxDb}
object can take a long time (up to 20 minutes or more). In order to keep
things running in a reasonable time in this vignette, we restrict the gene
model to the genes located on chromosome 14. Let's use the
\Rfunction{isActiveSeq} getter/setter for this. By default, all the
chromosomes in a \Rclass{TxDb} object are ``active'':

\begin{small}
<<isActiveSeq>>=
isActiveSeq(txdb)[1:25]
@
\end{small}

Next we set all the values in this named logical vector to \Rcode{FALSE},
except the value for \Rcode{chr14}:

<<keep_chr14_only>>=
isActiveSeq(txdb)[-match("chr14", names(isActiveSeq(txdb)))] <- FALSE
names(which(isActiveSeq(txdb)))
@

\subsection{Generating a \Rclass{SplicingGraphs} object}

Splicing graphs are computed by calling the \Rfunction{SplicingGraphs}
function on the gene model. This is the constructor function for
\Rclass{SplicingGraphs} objects. It will compute information about all
the splicing graphs (1 per gene) and store it in the returned object.
By default, only genes with at least 2 transcripts are considered (this
can be changed by using the \Rcode{min.ntx} argument of
\Rfunction{SplicingGraphs}):

<<SplicingGraphs_constructor>>=
library(SplicingGraphs)
sg <- SplicingGraphs(txdb)
sg
@
<<load_sg_with_bubbles,echo=FALSE>>=
## We load a precomputed 'sg' that contains all the bubbles in
## 'sg@.bubbles_cache'.
sg_with_bubbles <- .loadPrecomputed("sg_with_bubbles")
sg@.bubbles_cache <- sg_with_bubbles@.bubbles_cache
## Replace NAs with FALSE in circularity flag (because having the flag set
## to NA instead of FALSE (or vice-versa) is not considered a significant
## difference between the 2 objects).
isCircular(sg) <- isCircular(sg) %in% TRUE
isCircular(sg_with_bubbles) <- isCircular(sg_with_bubbles) %in% TRUE
if (!identical(sg, sg_with_bubbles))
    stop("'sg' is not identical to precomputed version")
@

\Rcode{sg} is a \Rclass{SplicingGraphs} object. It has 1 element per
gene and \Rcode{names(sg)} gives the gene ids:

<<names_sg>>=
names(sg)[1:20]
@

\Rcode{seqnames(sg)} and \Rcode{strand(sg)} return the chromosome and
strand of the genes:

<<seqnames_sg>>=
seqnames(sg)[1:20]
strand(sg)[1:20]
table(strand(sg))
@

The number of transcripts per gene can be obtained with
\Rcode{elementNROWS(sg)}:

<<elementNROWS_sg>>=
elementNROWS(sg)[1:20]
@

\Rfunction{elementNROWS} is a core accessor for list-like objects
that returns the lengths of the individual list elements.

At this point you might wonder why \Rfunction{elementNROWS} works on
\Rclass{SplicingGraphs} objects. Does this mean that those objects are
list-like objects? The answer is yes. 
What do the list elements look like, and how can you access them?
This is answered in the next subsection.

\subsection{Basic manipulation of a \Rclass{SplicingGraphs} object}

The list elements of a list-like object can be accessed \textbf{one
at a time} by subsetting the object with \Rfunction{[[} (a.k.a.
double-bracket subsetting). On a \Rclass{SplicingGraphs} object,
this will extract the transcripts of a given gene. More precisely
it will return an \textbf{unnamed} \Rclass{GRangesList} object
containing the exons of the gene grouped by transcript:

\begin{small}
<<sg_double_bracket_3183>>=
sg[["3183"]]
@
\end{small}

The exon-level metadata columns are:
\begin{itemize}
  \item \Rcode{exon\_id}: The original internal exon id as stored in the
        \Rclass{TxDb} object. This id was created and assigned to
        each exon when the \Rclass{TxDb} object was created.
        It's not a public id like, say, an Ensembl, RefSeq, or GenBank id.
        Furthermore, it's only guaranteed to be unique within a
        \Rclass{TxDb} object.
  \item \Rcode{exon\_name}: The original exon name as provided by the
        annotation resource (e.g., UCSC, Ensembl, or GFF file) and stored
        in the \Rclass{TxDb} object when it was created.
        Set to \Rcode{NA} if no exon name was provided.
  \item \Rcode{exon\_rank}: The rank of the exon in the transcript.
  \item \Rcode{start\_SSid}, \Rcode{end\_SSid}: The \textit{Splicing Site
        ids} corresponding to the \textit{start} and \textit{end} coordinates
        of the exon. (Please be cautious to not misinterpret the meaning
        of \textit{start} and \textit{end} here.)
        Those ids were assigned by the \Rfunction{SplicingGraphs} constructor.
\end{itemize}

IMPORTANT NOTE: Please be aware that the \textit{start} and \textit{end}
coordinates of an exon, like the \textit{start} and \textit{end}
coordinates of a genomic range in general, are following the almost
universal convention that \textit{start} is $<=$ \textit{end}, and this
\textbf{regardless of the direction of transcription}.

As mentioned previously, the \textit{Splicing Site ids} are assigned based
on the order of the site positions from 5' to 3'. This means that, for a
gene on the plus (resp. minus) strand, the ids in the \Rcode{start\_SSid}
metadata column are always lower (resp. greater) than those in the
\Rcode{end\_SSid} metadata column.

However, on both strands, the \textit{Splicing Site id} increases with the
rank of the exon.

The \Rfunction{show} method for \Rclass{GRangesList} objects only
displays the inner metadata columns (which are at the exon level for
an object like \Rcode{sg[["3183"]]}). To see the outer metadata columns
(transcript-level metadata columns for objects like \Rcode{sg[["3183"]]}),
we need to extract them explicitely:

<<mcols_sg_double_bracket_3183>>=
mcols(sg[["3183"]])
@

The transcript-level metadata columns are:
\begin{itemize}
  \item \Rcode{tx\_id}: The original transcript id as provided by the
        annotation resource (e.g. UCSC, Ensembl, or GFF file) and stored
        in the \Rclass{TxDb} object when it was created.
  \item \Rcode{txpath}: A named list-like object with one list element
        per transcript in the gene. Each list element is an integer vector
        that describes the \textit{path} of the transcript, i.e., the
        \textit{Splicing Site ids} that it goes thru.
\end{itemize}

<<mcols_sg_double_bracket_3183>>=
mcols(sg[["3183"]])$txpath
@

A more convenient way to extract this information is to use the
\Rfunction{txpath} accessor:

<<txpath_sg_3183>>=
txpath(sg[["3183"]])
@

The list elements of the \Rcode{txpath} metadata column
always consist of an even number of \textit{Splicing Site ids} in
ascending order.

The transcripts in a \Rclass{GRangesList} object like \Rcode{sg[["3183"]]}
can be plotted with \Rfunction{plotTranscripts}:

<<plotTranscripts_sg_3183,eval=FALSE>>=
plotTranscripts(sg[["3183"]])
@

The resulting plot is shown in figure~\ref{img:3183-transcripts}.

<<plotTranscripts_sg_3183_as_pdf,echo=FALSE,results=hide>>=
pdf("3183-transcripts.pdf", width=6, height=3)
plotTranscripts(sg[["3183"]])
dev.off()
@
\begin{figure}[!htbp]
  \begin{center}
  \includegraphics[width=6in,height=3in]{3183-transcripts.pdf}
  \end{center}
  \caption{The 11 transcripts of gene HNRNPC (Entrez ID 3183).}
  \label{img:3183-transcripts}
\end{figure}

\Rclass{SplicingGraphs} objects, like most list-like objects, can be
unlisted with \Rfunction{unlist}. This will extract the transcripts of
all the genes and return them as a \textbf{named} \Rclass{GRangesList}
object. The names on the object are the gene ids:

<<unlist_sg>>=
ex_by_tx <- unlist(sg)
head(names(ex_by_tx))
@

Because each element in the object represents a transcript
(and not a gene), the names are not unique! This means that trying to
subset the object by name (e.g. with \Rcode{ex\_by\_tx["3183"]}
or \Rcode{ex\_by\_tx[["3183"]]}) is probably a bad idea because this
will only select the first element with that name.
When the names on a vector-like object \Rcode{x} are not unique, a safe
way to select the elements with some given names is to do something like
\Rcode{x[names(x) \%in\% c("name1", "name2")]}.
For example, to select transcripts from genes 10001 and 100129075:

\begin{small}
<<unlist_sg>>=
ex_by_tx[names(ex_by_tx) %in% c("10001", "100129075")]
@
\end{small}

\subsection{Extracting and plotting graphs from a \Rclass{SplicingGraphs}
            object}

The edges (resp. nodes) of the splicing graph of a given gene can be
extracted with the \Rfunction{sgedges} (resp. \Rfunction{sgnodes})
function. An important caveat is that this can only be done for one
gene at a time, or, said otherwise, these functions only work on a
\Rclass{SplicingGraphs} object of length 1. Here is where subsetting
with \Rfunction{[} (a.k.a. single-bracket subsetting) comes into play.

Using \Rfunction{[} on a \Rclass{SplicingGraphs} object returns a
\Rclass{SplicingGraphs} object containing only the selected genes:

<<single_bracket_subsetting>>=
sg[strand(sg) == "-"]
sg[1:20]
tail(sg)  # equivalent to 'sg[tail(seq_along(sg))]'
sg["3183"]
@

Let's extract the splicing graph edges for gene HNRNPC (Entrez ID 3183):

<<sgedges_sg_3183>>=
sgedges(sg["3183"])
@

The \Rclass{DataFrame} object returned by \Rfunction{sgedges} has 1 row
per edge. Its columns are explained below.

Let's extract the splicing graph nodes for that gene:

<<sgnodes_sg_3183>>=
sgnodes(sg["3183"])
@

The character vector returned by \Rfunction{sgnodes} contains the node
ids, that is, \Rcode{R} and \Rcode{L} for the \textit{root} and \textit{leaf}
nodes, and the \textit{Splicing Site ids} for the other nodes.
The node ids are always returned in ascending order with \Rcode{R} and
\Rcode{L} being always the first and last nodes, respectively.

The \Rclass{DataFrame} object returned by \Rfunction{sgedges} has the
following columns:
\begin{itemize}
  \item \Rcode{from}, \Rcode{to}: The 2 nodes connected by the edge.
  \item \Rcode{sgedge\_id}: A \textit{global edge id} of the form
        \Rcode{gene\_id:from,to}.
  \item \Rcode{ex\_or\_in}: The type of the edge, i.e., exon, intron, or
        no type if it's an artificial edge.
  \item \Rcode{tx\_id}: The ids of the transcripts that support the edge.
\end{itemize}

Alternatively the edges and ranges of all the genes can be extracted
with \Rfunction{sgedgesByGene}:

<<sgedgesByGene_sg>>=
edges_by_gene <- sgedgesByGene(sg)
@

In this case the edges are returned in a \Rclass{GRangesList} object where they
are grouped by gene. \Rcode{edges\_by\_gene} has the length and names of
\Rcode{sg}, that is, the names on it are the gene ids and are guaranteed
to be unique. Let's look at the edges for gene 3183:

\begin{small}
<<edges_by_gene_3183>>=
edges_by_gene[["3183"]]
@
\end{small}

The edge-level metadata columns are the same as the columns of the
\Rclass{DataFrame} object returned by \Rfunction{sgedges}.
An important difference though is that the artificial edges (i.e., edges
starting from the root node (\Rcode{R}) or ending at the leaf node
(\Rcode{L})) are omitted!

Finally, to plot a given splicing graph:

<<plot_sg_3183,eval=FALSE>>=
plot(sg["3183"])
plot(sgraph(sg["3183"], tx_id.as.edge.label=TRUE))
@

The resulting plots are shown in figure~\ref{img:3183-sgraph}.

<<plot_sg_3183_as_pdf,echo=FALSE,results=hide>>=
pdf("3183-sgraph.pdf", width=3, height=5)
plot(sgraph(sg["3183"]))
dev.off()
pdf("3183-sgraph-labelled.pdf", width=3, height=5)
plot(sgraph(sg["3183"], tx_id.as.edge.label=TRUE))
dev.off()
@
\begin{figure}[!htbp]
  \begin{center}
  \includegraphics[width=3in,height=5in]{3183-sgraph.pdf}
  \hskip 0.25in
  \includegraphics[width=3in,height=5in]{3183-sgraph-labelled.pdf}
  \end{center}
  \caption{Splicing graph representation of gene HNRNPC (Entrez ID 3183).
    Left: unlabelled edges. Right: edge labelled with the transcript
    ids (or names).
  }
  \label{img:3183-sgraph}
\end{figure}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Splicing graph bubbles}

\subsection{Some definitions}

The definition of a bubble provided in Sammeth's paper \cite{Sammeth:2010}
is quite obfuscated. Here we try to give a simpler one. 
We start by introducing the notion of a valid path between 2 nodes of a
splicing graph: a ``valid path'' between 2 nodes $s$ and $t$ ($s < t$ )
of a splicing graph is a path followed by at least 1 transcript (see the
\Rfunction{txpath} function in previous section for how to extract the
transcript paths).

Let's illustrate this by looking at the splicing graph for gene 100309464
(Entrez ID). The graph is shown in figure~\ref{img:100309464-sgraph}.

<<plot_sg_100309464_as_pdf,echo=FALSE,results=hide>>=
pdf("100309464-sgraph-labelled.pdf", width=5, height=5)
plot(sgraph(sg["100309464"], tx_id.as.edge.label=TRUE))
dev.off()
@
\begin{figure}[!htbp]
  \begin{center}
  \includegraphics[width=5in,height=5in]{100309464-sgraph-labelled.pdf}
  \end{center}
  \caption{Splicing graph representation of gene 100309464 (Entrez ID).
    Valid paths between nodes \Rcode{R} and \Rcode{14} are:
    \Rcode{\{R,3,5,6,7,14\}} (followed by transcript \Rcode{uc021rto.1});
    \Rcode{\{R,3,5,8,9,10,11,14\}} (followed by transcript \Rcode{uc001xcr.2}).
    Paths \Rcode{\{R,1,2,4,5,6,7,14\}} and \Rcode{\{R,1,2,4,5,8,9,10,11,14\}}
    are not followed by any known transcript and thus are not valid paths.
  }
  \label{img:100309464-sgraph}
\end{figure}

Path \Rcode{\{R,3,5,6,7,14\}} in that graph is followed by transcript
\Rcode{uc021rto.1} and thus is valid. However path \Rcode{\{R,1,2,4,5,6,7,14\}}
is not followed by any known transcript and thus is not valid.

Now the definition of a bubble: there is a ``bubble'' between 2 nodes $s$
and $t$ ($s < t$ ) of a splicing graph if (a) there are at least 2 distinct
valid paths between $s$ and $t$, and (b) no other nodes than $s$ and $t$
are shared by all the valid paths between them.
Nodes $s$ and $t$ are said to be the ``source'' and ``sink'' of the bubble.
The number of distinct valid paths between $s$ and $t$ is the ``dimension''
of the bubble.

For example, there is a bubble between nodes \Rcode{5} and \Rcode{L} of
gene 100309464 (see figure~\ref{img:100309464-sgraph}) because the 3 valid
paths between them don't share other nodes. The dimension of this bubble is 3.
However, there is no bubble between nodes \Rcode{R} and \Rcode{14} because
the 2 valid paths between them also share nodes \Rcode{3} and \Rcode{5}.

The set of all valid paths that form a bubble describes
a ``complete alternative splicing event'' (a.k.a. ``complete AS event'',
or simply ``complete event''). The individual valid paths that form
the bubble are sometimes called the ``variants'' of the bubble or event.

\subsection{Computing the bubbles of a splicing graph}

Like with the \Rfunction{sgedges} and \Rfunction{sgnodes} functions,
the bubbles can only be computed for one gene (or graph) at a time.
Let's compute the bubbles for gene 100309464 (Entrez ID) (represented on
figure~\ref{img:100309464-sgraph}):

<<bubbles_sg_100309464>>=
bubbles(sg["100309464"])
@

The \Rclass{DataFrame} object returned by \Rfunction{bubbles} has 1 row
per bubble in the graph and the following columns:
\begin{itemize}
  \item \Rcode{source}, \Rcode{sink}: The source and sink of the bubble.
  \item \Rcode{d}: The dimension of the bubble.
  \item \Rcode{partitions}, \Rcode{paths}: The latter contains the valid
        paths between the source and sink of the bubble (note that the
        source and sink nodes are not reported in the paths). The former
        contains the list of transcripts associated with each of these paths.
  \item \Rcode{AScode}, \Rcode{description}: The AS code (Alternative Splicing
        code) and its verbose description. More on these in the next
        subsection.
\end{itemize}

Bubbles can be nested, i.e., contain (or be contained in) other
bubbles of smaller (or higher) dimensions. For example, in gene 100309464,
the bubble between nodes \Rcode{5} and \Rcode{L} (3rd bubble returned
by \Rfunction{bubbles}) is of dimension 3 and contains the smaller bubble
between nodes \Rcode{5} and \Rcode{14} (2nd bubble returned by
\Rfunction{bubbles}) which is of dimension 2.

Bubbles can also overlap without one containing any of the others.

\subsection{AS codes}

See \cite{Sammeth:2010} for how AS codes are generated.

Let's illustrate this by looking at how the AS code for the 2nd bubble
in gene 100309464 is obtained. In 3 steps:
\begin{enumerate}
  \item The paths reported for this bubble are \Rcode{\{6,7\}} and
        \Rcode{\{8,9,10,11\}}. All the nodes involved in these paths are
        renumbered starting at \Rcode{1} so nodes \Rcode{6}, \Rcode{7},
        \Rcode{8}, \Rcode{9}, \Rcode{10}, \Rcode{11} become \Rcode{1},
        \Rcode{2}, \Rcode{3}, \Rcode{4}, \Rcode{5}, \Rcode{6},
        respectively.
  \item Then, for each path, a string is generated by putting together
        the new node numbers and node types in that path. By convention,
        this string is set to \Rcode{0} if the path is empty. So path
        \Rcode{\{6,7\}} and \Rcode{\{8,9,10,11\}} generate strings
        \Rcode{1-2\^} and \Rcode{3-4\^5-6\^}, respectively.
  \item Finally, the strings obtained previously are sorted
        lexicographically and concatenated together separated
        by commas (\Rcode{,}). This gives the final code:
        \Rcode{1-2\^,3-4\^5-6\^}.
\end{enumerate}

Verbose descriptions of the AS codes are obtained by looking them up
through the \Rcode{ASCODE2DESC} predefined object. \Rcode{ASCODE2DESC} is a
named character vector that contains the verbose descriptions for the AS
codes of the 50 most frequent patterns of internal complete events found
in Human as reported in Table $1$ of Sammeth's paper \cite{Sammeth:2010}.

<<ASCODE2DESC>>=
codes <- bubbles(sg["10202"])$AScode
data.frame(AScode=codes, description=ASCODE2DESC[codes], row.names=NULL)
@

\Rcode{ASCODE2DESC} only contains the 50 code descriptions
reported in Table $1$ of Sammeth's paper \cite{Sammeth:2010}. Therefore,
looking up codes through \Rcode{ASCODE2DESC} will inevitably return an
\Rcode{NA} for some codes. For example, even simple codes like
\Rcode{0,1[2\^} or \Rcode{0,1-2\^{}3-4]} (which mean "skip the
first exon" or "skip the last 2 exons", respectively) are not
in Sammeth's table, because, as its title indicates, only \textbf{internal}
complete events are reported in that table (a complete event, or bubble,
being considered ``internal'' when its source and sink nodes are not
\Rcode{R} or \Rcode{L}, or, equivalently, when its code has no \Rcode{[}
or \Rcode{]} in it).\footnote{Despite its title
(``50 Most Frequent Patterns of Internal Complete Events Found in Human''),
Table $1$ of Sammeth's paper \cite{Sammeth:2010} actually contains AS codes
that seem to correspond to complete events that are not internal. For
example, the 16th AS code in the table is \Rcode{1*2\^{},3*4\^} and
its description is \Rcode{"2 alternative first exons"}, which indicates
that the event is internal. This raises at least 2 questions: why is it
in the table? and why does the code contain \Rcode{*} instead of \Rcode{[}?
A third question could be: why exclude complete events that are not
internal in the first place?}

\subsection{Tabulating all the AS codes for Human chromosome 14}

To extract all the bubbles from all the genes on Human chromosome 14
and tabulate their codes, we do the following:

% NOTE: The lapply() loop below should take less than 1 min on a modern laptop
% when the vignette is passed thru Sweave(), because in that case, 'sg' already
% contains all the precomputed bubbles in 'sg@.bubble_cache'. With a fresh 'sg'
% however, it would take between 6 and 7 min on a modern laptop.
\begin{small}
<<AScode_summary>>=
AScode_list <- lapply(seq_along(sg), function(i) bubbles(sg[i])$AScode)
names(AScode_list) <- names(sg)
AScode_table <- table(unlist(AScode_list))
AScode_table <- sort(AScode_table, decreasing=TRUE)
AScode_summary <- data.frame(AScode=names(AScode_table),
                             NbOfEvents=as.vector(AScode_table),
                             Desciption=ASCODE2DESC[names(AScode_table)])
nrow(AScode_summary)
@
\end{small}

\begin{small}
<<head_AScode_summary>>=
head(AScode_summary, n=10)
@
\end{small}

Amongst the \Sexpr{nrow(AScode_summary)} distinct complete events we observe
on Human chromosome 14, the most frequent one is named ``skip 1 exon''.
At this point it should be mentioned that not all splicing
events have names since there are a variety of different events.

Another interesting fact we can gain from the splicing
graph is the number of complete events each gene has.
To answer this we have to count the number of bubbles within
each gene.

<<nb_bubbles_per_gene>>=
nb_bubbles_per_gene <- elementNROWS(AScode_list)
@ 

Below the genes with the most observed splicing events are shown.

<<head_sort_nb_bubbles_per_gene>>=
head(sort(nb_bubbles_per_gene, decreasing=TRUE))
@ 

Next we want to see which genes show the highest heterogeneity of splicing
events.
We check for the number of unique splicing events for each gene.

<<nb_unique_bubbles_per_gene>>=
nb_unique_bubbles_per_gene <- elementNROWS(unique(CharacterList(AScode_list)))
@ 

Below the genes with the highest heterogeneity in observed splicing events
are shown

<<head_sort_nb_unique_bubbles_per_gene>>=
head(sort(nb_unique_bubbles_per_gene, decreasing=TRUE))
@ 


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Counting reads}

In this section, we illustrate how RNA-seq reads can be assigned
to the exonic and intronic edges of a \Rclass{SplicingGraphs} object, and
how these assignments can be summarized in a table of counts.
The reads we're going to assign are stored as BAM files available in the
\Rpackage{RNAseqData.HNRNPC.bam.chr14} package. They've been aligned
to the \textit{hg19} reference genome from UCSC, which is also the reference
genome of the gene model in our \Rclass{SplicingGraphs} object \Rcode{sg}.
As a general rule, the aligned reads and the annotations should always be
based on the same reference genome.

Let's start by loading the \Rpackage{RNAseqData.HNRNPC.bam.chr14} package:

<<bam_files>>=
library(RNAseqData.HNRNPC.bam.chr14)
bam_files <- RNAseqData.HNRNPC.bam.chr14_BAMFILES
names(bam_files)  # the names of the runs
@

Another important question about RNA-seq reads is whether they are
single- or paired-end. This can be checked with the
\Rfunction{quickBamFlagSummary} utility from the \Rpackage{Rsamtools}
package:

<<quickBamFlagSummary>>=
quickBamFlagSummary(bam_files[1], main.groups.only=TRUE)
@

Doing this for the 8 BAM files confirms that all the reads in the
files are paired-end. This means that they should be loaded with the
\Rfunction{readGAlignmentPairs} function from the \Rpackage{GenomicAlignments}
package. As a quick example, we load the reads located on the first 20
million bases of chromosome 14:

<<readGAlignmentPairs>>=
param <- ScanBamParam(which=GRanges("chr14", IRanges(1, 20000000)))
galp <- readGAlignmentPairs(bam_files[1], param=param)
length(galp)  # nb of alignment pairs
galp
@

The result is a \Rclass{GAlignmentPairs} object with 1 element per
alignment pair (alignment pair = pair of BAM records). Although paired-end
reads could also be loaded with \Rfunction{readGAlignments} (also from the
\Rpackage{GenomicRanges} package), they should be loaded with the
\Rfunction{readGAlignmentPairs}: this actually pairs the BAM records
and returns the pairs in a \Rclass{GAlignmentPairs} object. Otherwise the
records won't be paired and will be returned in a \Rclass{GAlignments} object.

\subsection{Assigning reads to the edges of a \Rclass{SplicingGraphs} object}

We can use the \Rcode{assignReads} function to assign reads to the
the exonic and intronic edges of \Rcode{sg}. The same read
can be assigned to more than one exonic or intronic edge. For example,
a junction read with 1 junction (i.e., 1 \Rcode{N} in its CIGAR) can be
assigned to an intron and its 2 flanking exons, and this can happen for
one or more transcripts, from the same gene or from different genes.

We're going to loop on the BAM files to assign the reads from 1 run
at a time. When we load the files, we want to filter out secondary
alignments, reads not passing quality controls, and PCR or optical
duplicates. This is done by preparing a \Rclass{ScanBamParam} object
that we'll pass to \Rfunction{readGAlignmentPairs} (see
\Rfunction{ScanBamParam} in the \Rpackage{Rsamtools} package for more
information about this):

<<ScanBamParam>>=
flag0 <- scanBamFlag(isSecondaryAlignment=FALSE,
                     isNotPassingQualityControls=FALSE,
                     isDuplicate=FALSE)
param0 <- ScanBamParam(flag=flag0)
@

Loop:

<<assignReads,eval=FALSE>>=
## The following loop takes about 7 minutes on a modern laptop/desktop...
for (i in seq_along(bam_files)) {
    bam_file <- bam_files[i]
    cat("Processing run ", names(bam_file), " ... ", sep="")
    galp <- readGAlignmentPairs(bam_file, use.names=TRUE, param=param0)
    sg <- assignReads(sg, galp, sample.name=names(bam_file))
    cat("OK\n")
}
@
<<load_sg_with_reads,echo=FALSE>>=
sg_with_reads <- .loadPrecomputed("sg_with_reads")
## Remove the reads from 'sg_with_reads' and compare with 'sg'.
if (!isTRUE(all.equal(removeReads(sg_with_reads), sg)))
    stop("after removal of the hits, precomputed version of 'sg_with_reads' ",
         "is not identical to 'sg'")
sg <- sg_with_reads
@

The assignments to the exonic and intronic edges of a given gene
can be retrieved by extracting the edges grouped by transcript for
that gene. This is achieved by calling \Rfunction{sgedgesByTranscript}
on the subsetted \Rclass{SplicingGraphs} object.
\Rfunction{sgedgesByTranscript} is similar to \Rfunction{sgedgesByGene}
(introduced in Section \ref{sec:computing_splicing_graphs}), except that
it groups the edges by transcript instead of by gene.
By default the edge-level metadata columns are the same as with
\Rfunction{sgedgesByGene} so we use \Rcode{with.hits.mcols=TRUE} to
get the additional metadata columns containing the assignments:

<<sg_3183_assignments>>=
edges_by_tx <- sgedgesByTranscript(sg["3183"], with.hits.mcols=TRUE)
edge_data <- mcols(unlist(edges_by_tx))
colnames(edge_data)
@

There is 1 \textit{hits} column per run (or sample). They appear in the
order they were created. Each of them is a \Rclass{CharacterList}
object that contains the reads from that run (or sample) that were assigned
to the individual edges. Let's look at the \textit{hits} column for
the first run (\Rcode{ERR127306}):

<<head_edge_data_sgedge_id_ERR127306.hits>>=
head(edge_data[ , c("sgedge_id", "ERR127306.hits")])
@

\subsection{How does read assignment work?}

[TODO: Complete this subsection.]

For the purpose of explaining how reads from run \Rcode{ERR127306}
were assigned to the edges of gene 3183, we load the reads that are
in the region of that gene and contain at least 1 junction:

<<load_reads_in_3183_region>>=
param <- ScanBamParam(flag=flag0, which=range(unlist(sg[["3183"]])))
reads <- readGAlignmentPairs(bam_files[1], use.names=TRUE, param=param)
junction_reads <- reads[njunc(first(reads)) + njunc(last(reads)) != 0L]
@

and we plot the genomic region chr14:21675000-21702000:

<<plotTranscripts_sg_3183_and_reads,eval=FALSE>>=
plotTranscripts(sg[["3183"]], reads=junction_reads, from=21675000, to=21702000)
@

The resulting plot is shown on figure \ref{img:3183-transcripts-and-reads}.

<<plotTranscripts_sg_3183_and_reads_as_pdf,echo=FALSE,results=hide>>=
pdf("3183-transcripts-and-reads.pdf", width=12, height=12)
plotTranscripts(sg[["3183"]], reads=junction_reads, from=21675000, to=21702000)
dev.off()
@
\begin{figure}[!htbp]
  \begin{center}
  \includegraphics[width=6in,height=4in]{3183-transcripts-and-reads.pdf}
  \end{center}
  \caption{Plot of the genomic region chr14:21675000-21702000 showing the 11
           transcripts of gene HNRNPC (Entrez ID 3183) and the reads in that
           region. Reads that are ``compatible'' with the splicing of a
           transcript are plotted in dark blue. Incompatible reads are in
           light blue.}
  \label{img:3183-transcripts-and-reads}
\end{figure}

Figure \ref{img:3183-transcripts-and-reads-zoom} is a zoom on genomic region
chr14:21698400-21698600 that was obtained with:

<<plotTranscripts_sg_3183_and_reads_zoom,eval=FALSE>>=
plotTranscripts(sg[["3183"]], reads=junction_reads, from=21698400, to=21698600)
@

<<plotTranscripts_sg_3183_and_reads_zoom_as_pdf,echo=FALSE,results=hide>>=
pdf("3183-transcripts-and-reads-zoom.pdf", width=12, height=12)
plotTranscripts(sg[["3183"]], reads=junction_reads, from=21698400, to=21698600)
dev.off()
@
\begin{figure}[!htbp]
  \begin{center}
  \includegraphics[width=6in,height=4in]{3183-transcripts-and-reads-zoom.pdf}
  \end{center}
  \caption{Zoom on genomic region chr14:21698400-21698600 of
           figure \ref{img:3183-transcripts-and-reads}.}
  \label{img:3183-transcripts-and-reads-zoom}
\end{figure}

[TODO: To be continued.]

\subsection{Counting and summarizing the assigned reads}

We use \Rfunction{countReads} to count the number of reads assigned to
each (non-artificial) edge in \Rcode{sg} for each run:

<<countReads_sg>>=
sg_counts <- countReads(sg)
@

The returned object is a \Rclass{DataFrame} with one row per unique
splicing graph edge and one column of counts per run. Two additional
columns contain the splicing graph edge ids and the type of edge (exon
or intron):

<<head_sg_counts>>=
dim(sg_counts)
head(sg_counts[1:5])
@

Total number of hits per run:

<<sapply_sg_counts>>=
sapply(sg_counts[-(1:2)], sum)
@

[TODO: Suggest some possible downstream statistical analysis that take
 the output of countReads() as a starting point.]


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Session Information}

All of the output in this vignette was produced under the following
conditions:

\begin{small}
<<SessionInfo>>=
sessionInfo()
@
\end{small}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{thebibliography}{}

\bibitem{Sammeth:2010}
{Michael Sammeth}.
\newblock Complete Alternative Splicing Events Are Bubbles in Splicing Graphs
\newblock {\em Computational Biology}, \textbf{16}, 1117-1140, 2010.

\bibitem{Zarnack:2012}
{Kathi Zarnack} and {Julian Konig} and {Mojca Tajnik} and {Inigo Martincorena} and {Sebastian Eustermann} and {Isabelle Stevant} and {Alejandro Reyes} and {Simon Anders} and {Nicholas M. Luscombe} and {Jernej Ule}
\newblock Direct Competition between hnRNP C and U2AF65 Protects the Transcriptome from the Exonization of Alu Elements
\newblock {\em Cell}, 2012, Europe PMC 23374342.

\bibitem{RSEM2011}
{LI, Bo} and {Dewey, Colin}.
\newblock RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome
\newblock {\em BMC Bioinformatics}, \textbf{12}(1), 323, 2011.

\bibitem{Roberts:2011}
{Roberts, Adam} and {Pimentel, Harold} and {Trapnell, Cole} and {Pachter, Lior}.
\newblock Identification of novel transcripts in annotated genomes using RNA-Seq
\newblock {\em BIOINFORMATICS}, 2011

\bibitem{Anders:2010}
{Simon Anders} and {Wolfgang Huber}.
\newblock {D}ifferential expression analysis for sequence count data
\newblock {\em Genome Biology}, \textbf{11}, R106, 2010.

\bibitem{Heber:2002}
{Heber, Steffen} and {Alekseyev, Max} and {Sze, Sing-Hoi} and {Tang, Haixu} and {Pevzner, Pavel A.}
\newblock Splicing graphs and EST assembly problem
\newblock {\em Bioinformatics}, \textbf{18}(suppl 1), S181-S188, 2002.

\end{thebibliography}

\end{document}