% -*- mode: noweb; noweb-default-code-mode: R-mode; -*-
%\VignetteIndexEntry{rnaSeqMap primer}
%\VignetteKeywords{}
%\VignetteDepends{}
%\VignettePackage{rnaSeqMap}
%documentclass[12pt, a4paper]{article}
\documentclass[12pt]{article}
\usepackage{amsmath,pstricks}
\usepackage{hyperref}
\usepackage[numbers]{natbib}
\usepackage{color}
\usepackage[scaled]{helvet}
\renewcommand*\familydefault{\sfdefault}
\definecolor{NoteGrey}{rgb}{0.96,0.97,0.98}

\textwidth=6.2in
\textheight=9.5in
@
\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-1in

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\code}[1]{{\texttt{#1}}}

\author{Anna Lesniewska, Michal Okoniewski}

\begin{document}

\title{\code{rnaSeqMap}: RNASeq analyses using xmap database back-end}
\maketitle

\fcolorbox{black}{NoteGrey}{
\begin{minipage}{13.5cm}
\begin{center}
\textbf{Vignette for v.2.7.11 - created without XMAP database access.}
\end{center}
\end{minipage}
}

\tableofcontents
\newpage

\section{Recent changes and updates}

14.05.2011 - added parseGff3() and bam2sig() - MO, AL

\section{Introduction}

\code{rnaSeqMap} is a ``middleware'' library for RNAseq secondary analyses.
It constitutes an API for such operations as:

\begin{itemize}
\item{access to any of the reads of the experiment, as quickly as possible, according to any chromosome coordinates}
\item{accessing sets of reads according to genomic annotation in Ensembl}
\item{calculation of coverage and numbers of reads, and transformations of those values}
\item{creating input for significance analysis algorithms such as edgeR and DESeq}
\item{precisely finding significant and consistent regions of expression}
\item{splicing analyses}
\item{visualizations of genes and expression regions}
\end{itemize}

The library is independent of the sequencing technology and read-mapping software.
It needs either reads described as genome coordinates in the extended xmap database, or it can alternatively read as much data as fits in the operating memory.
Use with the modified xmap database is recommended, as it overcomes memory limitations - thus the library can be run efficiently on machines that are not very powerful.

The features of \code{rnaSeqMap} that distinguish it from other software are:

\begin{itemize}
\item{sequencing reads and annotations in one common database - the extended XMAP \cite{Yates07}}
\item{an algorithm for finding irreducible regions of genomic expression - according to Aumann and Lindell \cite{Lindell03}}
\item{nucleotide-level splicing analysis}
\item{connectors for further gene- and region-level expression processing in DESeq \cite{Anders10} and edgeR \cite{Robinson09}}
\item{routines for coverage, the splicing index and the region mining algorithm implemented in C for speed}
\end{itemize}

The area of application of the library may be adjusted to the type of experiment performed.
It is mainly intended for analyzing RNAseq data with depth-of-coverage estimation of expression, but it can also be used for SAGE, ChIP-Seq, copy number analysis, etc.
In an appropriate context, the functions may be used not only for gene-level and region-level expression estimation, but also, e.g., for quality control of sequencing data.

The library follows the paradigms of the exonmap package (see \cite{Okon08}), continued currently as \code{xmapcore}.
With the use of the annotation database, it gives a framework to analyze genes and intergenic regions one by one, to filter the genes according to their properties and to visualize the internals of region expression.
Owing to the nature of sequencing data, \code{rnaSeqMap} may be applied not only to genes, but also to any region of the genome.
The approach does not do any summarization in the initial phases, so fine-grained, per-nucleotide access is possible (e.g. as in the \code{NucleotideDistr} class).

\section{Database preparation}

The database used in the analysis is the xmapcore database, extended by additional relational tables (defined with indexing and partitioning) and stored procedures.
First, an instance of the xmapcore database has to be installed, then the script adding the tables \code{seq\_reads} and \code{seq\_experiments} has to be run (see the \code{rnaSeqMap.sql} script in \code{inst\\scripts}).

The table \code{seq\_reads} is filled with the reads from the sequencers, which may be, for example, data from SAM files transformed by the sam2xmap.awk script:

\code{awk -f sam2xmap.awk input.sam > seq\_reads.txt}

The content of the awk script is as follows:

\begin{verbatim}
# AWK script to convert the SAM file to seq_read table input for X:MAP
# For mouse 19 chromosomes + 20=X,21=Y,22=MT
{
OFS = "\t"
strand = rshift($2,4) % 2
if (strand==0) strand = -1
cc = substr($3,4)
if (cc!=cc+0) {
  if (cc=="X") cc=20
  else if (cc=="Y") cc=21
  else if (cc=="M") cc=22
  else cc=30;
}
print(6, cc, $4, $4 + length($10), strand)
#remember to change first param to the number of sample!!
}
\end{verbatim}

As chromosomes are identified in the database by numbers, the X, Y and mitochondrial chromosomes are coded as consecutive numbers after the last numbered chromosome - as in the mouse example above.
In \code{rnaSeqMap} in R, they are recoded according to the species information, set by selecting the xmapcore database or by \code{setSpecies()}.

After such preparation, all the data files must be loaded into the database using \code{mysqlimport}, e.g.:

\code{mysqlimport -u mysqlexampleuser -p xmapcore\_homo\_sapiens\_58 --local seq\_reads}

The \code{seq\_reads} table then has to be indexed as in the script \code{inst\\scripts\\indices.sql}.

The table \code{bio\_sample} also has to be created. It contains the experiment description: all the samples in \code{seq\_reads} are identified by an integer number, while in \code{bio\_sample} the numbers have the treatments and other experimental attributes assigned to them. The content of this table is assigned to the \code{phenoData} slot of the \code{NucleotideDistr} objects that are processed.

\section{Analysis}

After loading the rnaSeqMap library, connecting to the xmap database is done with \code{xmap.connect}:

<<>>=
library(rnaSeqMap)
@

<<eval=FALSE>>=
xmap.connect("human-24")
@

Sample objects, used for the analyses described in this vignette, may be found in the dataset:

<<>>=
data(sample_data_rnaSeqMap)
ls()
class(rs.list)
length(rs.list)
names(rs.list)
@

The list contains 17 example genes with their coverage from a real experiment with 6 samples.
Alternatively, the objects can be fed with data directly from BAM files, as described in the next sections.

\section{\code{SeqReads} and \code{NucleotideDistr} classes}

\code{SeqReads} is an initial container for sequencing reads.
Using the database filled with reads, \code{SeqReads} objects are created by specifying their genome coordinates and filling them with read information:

<<eval=FALSE>>=
rs <- newSeqReads(g.ch,st,en,str)
rs <- addExperimentsToReadset(rs,1:6)
@

There are functions that create \code{SeqReads} objects for genes:

<<eval=FALSE>>=
rs <- newSeqReadsFromGene(g)
rs <- addExperimentsToReadset(rs,1:6)
@

<<>>=
setSpecies("homo_sapiens")
@

<<>>=
# the three newSeqReads() calls below show alternative ways of defining the
# region of interest - only the last one takes effect
rs <- rs.list[[1]] # first gene from sample data
rs <- newSeqReads('chr2',220238268,220264744,-1)
rs <- newSeqReads(rnaSeqMap:::.chr.convert(2),220143989,220151622,1)
#rs <- newSeqReads('chr2',220114433,220142892,-1)
f <- c("test1.bam","test2.bam","test3.bam","test4.bam","test5.bam")
ff <- sapply(f, function(x) system.file("extdata", x, package="rnaSeqMap"))
rs <- getBamData(rs, 1:5, files=ff)
nd.cov <- getCoverageFromRS(rs, 1:5)
dd <- RleList2matrix(distribs(nd.cov))
dd[990:1000, ] # getting a fragment of the gene
@

The distributions in an object may be normalized by the sums of reads for each sample, taken from the database or given as any other scaling vector:

<<eval=FALSE>>=
nsums <- getSumsExp()
@

<<>>=
nsums <- c(25918,32577,63250,123445,126977)
@

<<>>=
nsums
nd.cov <- normalizeBySum(nd.cov, nsums)
RleList2matrix(distribs(nd.cov))[990:1000,]
@

\section{Connecting to data from BAM files}

To use the \code{SeqReads} and \code{NucleotideDistr} objects with data taken directly from BAM files, a covdesc file is used as the experiment description. The covdesc content is assigned as the \code{phenoData} table of the objects.
Here, sample BAM files including data from chromosome 2 and the example genes TUBA4A, PTPRN, MIR153-1, DNPEP, CHPF, PAX3 and FARSB will be used.

\section{Region Mining algorithm}

The region mining algorithm is an adaptation of the sliding-and-combining two-window algorithm by Lindell and Aumann \cite{Lindell03}.
Originally, it was used to find quantitative association rules in datasets with two numerical attributes.
In the case of genomic data, especially RNAseq, the X dimension is the genome coordinate in bp and the Y dimension (the decisive attribute) is the level of coverage for a given nucleotide.
By specifying the $\mu$ and \textit{minsup} parameters, we may tune it to find regions of a given magnitude of coverage and minimal length.
The regions found by this algorithm are proven to be irreducible.
This means here that if a region has mean coverage over $\mu$, then when it is divided into any two sub-regions, both sub-regions also have mean coverage higher than $\mu$.
This follows the intuition of a region with biologically consistent expression - namely an exon.
So if one applies a properly tuned Lindell algorithm to RNAseq data, one should be able to find the precise boundaries of real transcription.
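As an aside, the irreducibility condition is easy to check on a toy coverage vector. The sketch below is plain R written only for this vignette: the helper \code{is.irreducible} and the numbers are invented for illustration and are not part of the package.

<<>>=
# A region is irreducible for a given mu if its mean coverage exceeds mu and
# every split into two sub-regions leaves both sub-regions with mean coverage
# above mu as well (toy helper, not a package function).
is.irreducible <- function(cov, mu) {
  n <- length(cov)
  if (mean(cov) <= mu) return(FALSE)
  all(sapply(1:(n - 1), function(k)
        mean(cov[1:k]) > mu && mean(cov[(k + 1):n]) > mu))
}
is.irreducible(c(7, 9, 12, 11, 8, 10), mu = 6)  # TRUE: every split stays above mu
is.irreducible(c(7, 9, 0, 0, 11, 12), mu = 6)   # FALSE: one split isolates the low stretch
@

The package functions shown below perform this kind of search directly on \code{NucleotideDistr} objects.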
From a \code{NucleotideDistr} object, another object may be created whose distributions equal 0 where a nucleotide is not within an irreducible region, or the appropriate $\mu$ level where the coverage of the nucleotide in a given sample makes it part of an irreducible region:

<<>>=
nd.reg <- findRegionsAsND(nd.cov,6,10)
RleList2matrix(distribs(nd.reg))[990:1000,]
@

Alternatively, regions from a single distribution may also be presented as an \code{RangedData} object:

<<>>=
findRegionsAsIR(nd.cov,6,10,1)
@

\section{Interesting gene criteria}

The genes may be regarded as interesting or significant according to many features that can be calculated from their \code{NucleotideDistr} values.

\subsection{Coverage}

Coverage is included in the object (accessed with \code{distribs()}); however, in the heuristics for finding genes, we may use various statistical measures of the coverage for each sample:

<<>>=
dd <- RleList2matrix(distribs(nd.cov))
apply(dd,2,max)
apply(dd,2,mean)
apply(dd,2,sum)
apply(dd,2,sd)
@

\subsection{Splicing}

We use a modified version of the splicing index from \cite{Gard07}.
The value of the index is calculated for each nucleotide in the investigated region.
The gene-level proportion is calculated using the sum of the coverage functions in the region for both treatments.
However, many nucleotides have no coverage in one or both conditions, and often the coverage in one condition is much higher than in the other.
Thus the value of the splicing index is set to 1 or -1 when the second or the first condition, respectively, has zero coverage, and when the log2 value exceeds the range $(-1,1)$ it is limited to $-1$ or $1$.

<<>>=
nd.si <- getSIFromND(nd.cov, c(2,3))
RleList2matrix(distribs(nd.si))[990:1000,]
@

\begin{center}
<<fig=TRUE>>=
distrCOVPlot(nd.cov, 1)
@
\end{center}

The splicing index may be plotted for both genes and any genomic regions:

\begin{center}
<<fig=TRUE>>=
distrSIPlot(nd.cov, 1,3,10,10)
@
\end{center}

\begin{center}
<<eval=FALSE>>=
plotSI(2,223436000,223437000,-1,2,4)
@
\end{center}

\subsection{Fold change in regions}

For the genes that have enough coverage, it is possible to calculate the nucleotide-level fold change using:

<<>>=
nd.fc <- getFCFromND(nd.cov, c(2,3))
RleList2matrix(distribs(nd.fc))[990:1000,]
@

\subsection{Region-based coverage}

Region-based coverage discretizes the \code{NucleotideDistr} object according to a sequence of Lindell algorithm searches.
It assigns to the resulting \code{NucleotideDistr} the value of the maximal $\mu$ from the sequence for which there is a region covering the given nucleotide:

<<>>=
nd.rbcov <- regionBasedCoverage(nd.cov, seqq=1:10, maxexp=40, minsup=10)
RleList2matrix(distribs(nd.rbcov))[990:1000,]
@

Such a distribution is useful for removing artifacts from the distributions: high peaks and sudden drops of signal.
The plots below depict the coverage before and after the region-based transformation.

<<fig=TRUE>>=
distrCOVPlot(nd.cov, c(1,3))
@

<<fig=TRUE>>=
distrCOVPlot(nd.rbcov, c(1,3))
@

\subsection{Utilities for transforming \code{NucleotideDistr} objects}

Additional utilities that operate on \code{NucleotideDistr} objects are functions for combining two objects, averaging them, and getting the sum or log2 of all the elements.

\section{Analyzing whole chromosomes}

The functions \code{spaceInChromosome} and \code{geneInChromosome} find all the genes and intergenic spaces in a given region.
Such a table of genes or spaces may be used to build a genome-wide pipeline for finding interesting gene expression or novel areas of intergenic expression (possibly novel genes).
<<eval=FALSE>>=
my.genes <- geneInChromosome(22, 20000000, 20400000, 1)
my.genes
my.spaces <- spaceInChromosome(22, 20000000, 20400000, 1)
my.spaces
@

An example of functions used to evaluate genes or spaces:

<<eval=FALSE>>=
test.gene <- function(g, exps, nsums, mi=5, minsup=15)
{
  rs <- newSeqReadsFromGene(g)
  rs <- addExperimentsToReadset(rs,exps)
  cat("added reads..\n")
  nd.cov <- getCoverageFromRS(rs,exps)
  cat("calculated coverage...\n")
  nd.cov <- normalizeBySum(nd.cov, nsums)
  cat("normalized...\n")
  nd.reg <- findRegionsAsND(nd.cov,as.integer(mi),minsup=minsup)
  ir.reg <- findRegionsAsIR(nd.cov,as.integer(mi),minsup=minsup)
  cat("ran region search algorithm...\n")
  print(ir.reg)
  out <- g
  out <- c(out, apply(distribs(nd.cov),2,max))
  out <- c(out, apply(distribs(nd.cov),2,mean))
  out <- c(out, apply(distribs(nd.reg),2,max))
  out
}

test.space <- function(exps, ch, st, en, str, nsums, mi=5, minsup=15)
{
  g.ch <- rnaSeqMap:::.chromosome.number(ch)
  rs <- newSeqReads(g.ch,st,en,str)
  rs <- addExperimentsToReadset(rs,exps)
  nd.cov <- getCoverageFromRS(rs,exps)
  nd.cov <- normalizeBySum(nd.cov, nsums)
  cat("Running ch ",g.ch," start",st,"\n")
  nd.reg <- findRegionsAsND(nd.cov,as.integer(mi), minsup=minsup)
  out <- c(ch,st, en, str)
  out <- c(out, apply(distribs(nd.cov),2,max))
  out <- c(out, apply(distribs(nd.cov),2,mean))
  out <- c(out, apply(distribs(nd.reg),2,max))
  out
}

interesting.genes <- NULL
for (i in 1:length(my.genes))
{
  cat("Running gene ", i , "----------------------\n")
  interesting.genes <- rbind(interesting.genes,
                             test.gene(my.genes[i], 1:6, nsums))
}

interesting.spaces <- NULL
for (i in 104:(dim(my.spaces))[1])
{
  cat("Running space ", i , "----------------------\n")
  interesting.spaces <- rbind(interesting.spaces,
                              test.space(1:2, 22, my.spaces[i,1],
                                         my.spaces[i,2], my.spaces[i,3], nsums))
}
@

\section{Producing output for \code{DESeq} and \code{edgeR}}

The functions \code{buildDESeq} and \code{buildDGEList} are the connectors between \code{rnaSeqMap} and the gene-level expression analyses implemented in DESeq \cite{Anders10} and edgeR \cite{Robinson09}.
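As a point of reference only, the kind of objects these connectors produce can also be built with the downstream packages' own constructors. The sketch below is not the \code{rnaSeqMap} interface: the count table and condition labels are invented for illustration, and \code{newCountDataSet} and \code{DGEList} belong to DESeq and edgeR themselves; see \code{?buildDESeq} and \code{?buildDGEList} for the connectors' actual arguments.

<<eval=FALSE>>=
# Hypothetical gene-by-sample count table and condition labels, for illustration only.
tab   <- matrix(rpois(17 * 6, lambda = 50), nrow = 17,
                dimnames = list(names(rs.list), paste("s", 1:6, sep = "")))
conds <- rep(c("control", "treated"), each = 3)
library(DESeq)
cds <- newCountDataSet(tab, conds)           # DESeq's CountDataSet
library(edgeR)
dge <- DGEList(counts = tab, group = conds)  # edgeR's DGEList
@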
\section{Visualization}

\code{rnaSeqMap} has a set of chromosome-based plots that may be used for the distributions contained in a \code{NucleotideDistr} object, or just for a region defined by chromosome coordinates - taking the data from the database and not creating any objects.

Examples of the plots follow. A histogram-based distribution of coverage:

<<eval=FALSE>>=
plotCoverageHistogram(2,223435255,223521056,-1,1,500)
@

Coverage plots for a gene, with the official exons of the gene plotted below:

<<eval=FALSE>>=
distrCOVPlotg(names(rs.list)[16], 1:3)
@

\begin{thebibliography}{99}

\bibitem{Lindell03} Aumann Y and Lindell Y, A Statistical Theory for Quantitative Association Rules, Journal of Intelligent Information Systems, 2003
\bibitem{Yates07} Yates T, Okoniewski M and Miller C, X:Map: annotation and visualization of genome structure for Affymetrix exon array analysis, Nucleic Acids Research, 2007
\bibitem{Gard07} Gardina P et al, Alternative splicing and differential gene expression in colon cancer detected by a whole genome exon array, BMC Genomics, 2006
\bibitem{Okon08} Okoniewski M et al, An annotation infrastructure for the analysis and interpretation of Affymetrix exon array data, Genome Biology, 2007
\bibitem{Anders10} Anders S and Huber W, Differential expression analysis for sequence count data, Nature Precedings, 2010
\bibitem{Robinson09} Robinson M et al, edgeR: a Bioconductor package for differential expression analysis of digital gene expression data, Bioinformatics, 2009

\end{thebibliography}

\end{document}