%\VignetteIndexEntry{oneChannelGUI Next generation Sequencing}
%\VignetteDepends{}
%\VignetteKeywords{one channel microarray,extended Affymetrix GUI, limma, quality control, data filtering, time course, alternative splicing}
%\VignettePackage{oneChannelGUI}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\documentclass[12pt]{article}
\usepackage{times}
\usepackage{hyperref}

\textwidth=6.2in
\textheight=8.5in
\oddsidemargin=0.2in
\evensidemargin=0.2in
\headheight=0in
\headsep=0in
 
\begin{document}
\title{oneChannelGUI Package: NGS secondary data analysis}
\author{Raffaele A Calogero, Francesca Cordero, Remo Sanges, Cristina Della Beffa}
\date{November 03, 2010}
\maketitle

\section{Introduction}
OneChannelGUI was initially developed to provide a new set of functions extending the capability of 
affylmGUI package. oneChannelGUI was designed specifically for life scientists who are not familiar
with R language but do wish to capitalize on the vast analysis opportunities of Bioconductor.
oneChannelGUI offers a comprehensive microarray analysis for single channel pplatforms.
Since Next Generation Sequencing (NGS) is becoming more and more used in the genomic area, 
Bioconductor is extending the number of tools for NGS analysis. Therefore, we are also extending the functionalities of
oneChannelGUI to handle RNA-seq data with a specific focus on non-coding RNA quantitative and mRNA splicing qualitative analysis. 
In the developing of NGS functionalities in oneChannelGUI our attention was focused on secondary analysis, 
i.e. data mapped on reference genome.  

\section{Non-coding RNA-seq}
The present goal of oneChannelGUI is to provide a graphical interface for the secondary analysis of ncRNA.
Secondary analysis of ncRNA can be done on a common 64 bits laptop with at least 4 Gb RAM.

\subsection{Input data}
We are constantly increasing the number of outputs derived from primary mapping tools that can be loaded on oneChannleGUI
At the present time oneChannelGUI allows the loading of data produced from various primary tools specifically designed for 
microRNA analysis, fig \ref{fig:ngs0}.
\begin{figure}[htbp]
  \begin{center}
    \includegraphics{ngs0}
    \caption{\label{fig:ngs0} Primary mapping tools supported by oneChannelGUI.}
  \end{center}
\end{figure}.

\subsubsection{SHRIMP}
SHRIMP mapping tool information can be obtained in:
\begin{Schunk}
\begin{Sinput}
 Rumble SM,  Lacroute P,  Dalca AV,  Fiume M,  Sidow A,  et al. 
 SHRiMP: Accurate Mapping of Short Color-space Reads. 
 PLoS Comput Biol 2009 5(5): e1000386. 
 doi:10.1371/journal.pcbi.1000386
 
 The program can be retrieved at:
 http://compbio.cs.toronto.edu/shrimp/  
\end{Sinput}
\end{Schunk}

To run SHRIMP version 2 the reference genome need to be indexed.
It is possible to use a subset of the genome of interest encompassing only non-coding RNAs:
it can be easily obtained using the function 
\textit{"oneChannelGUI: Export non-coding RNA fasta reference file for ncRNA-seq quantitative analysis"}.
The above function retrieves from oneChannelGUI a fasta file from the data of oneChannleGUI.
The fasta file is generated by the standalone function \textit{"ncScaffold"}, for its usage please refer to the 
standalone vignette.
Primary alingment data provided by shrimp using the above reference sequences need to be reformatted using
the function provided in the file menu: \textit{"oneChannelGUI: Reformat NGS primary mapping output"}, fig. \ref{fig:ngs10} .
For this reformatting it is necessary to have perl or active perl installed in the computer.
\begin{figure}[htbp]
  \begin{center}
    \includegraphics{ngs10}
    \caption{\label{fig:ngs10} Reformatting data produced using SHRIMP and the nc-RNA reference set.}
  \end{center}
\end{figure}.

It is also possible to use as reference the full unmasked genome available on NCBI, also in this case it is necessary to 
reformat the aligned data using the function \textit{"oneChannelGUI: Reformat NGS primary mapping output"}.
Finally it is also possible to use as reference the microRNA precursors set of mirBase, actually is implemented the version 15.0
The fasta file are available in the ngsperl dir located in the oneChannelGUI path. This folder
is created using the function \textit{"oneChannelGUI: Set NGS perl scripts folder"}.
In case miRbase precursors fasta file is used as reference the mapped data produced by SHRIMP can be directly loaded in oneChannelGUI 
without reformatting.

To run SHRIMP 
The reference genome needs to be indexed by SHRIMP with the following code:
\begin{Schunk}
\begin{Sinput}
 $SHRIMP_FOLDER/utils/project-db.py \
 --seed \
 00111111001111111100, \
 00111111110011111100, \
 00111111111100111100, \
 00111111111111001100, \
 00111111111111110000  \
 --h-flag --shrimp-mode cs /folder.where.fasta.file.is.located/ncRNAs.mm.fa

\end{Sinput}
\end{Schunk}

The flag for SOLID data is:
\begin{Schunk}
\begin{Sinput}
--shrimp-mode cs
\end{Sinput}
\end{Schunk}

The code to format the reference fasta file for SOLID data is:
\begin{Schunk}
\begin{Sinput}
$SHRIMP_FOLDER/bin/gmapper-cs -L /folder.where.fasta.file.is.located/ncRNAs.mm-cs \
-Y -V /dev/null >/dev/null
\end{Sinput}
\end{Schunk}

To run SHRIMP the code line is the following for solid data:
\begin{Schunk}
\begin{Sinput}
$SHRIMP_FOLDER/bin/gmapper-cs -L /folder.where.fasta.file.is.located/ncRNAs.mm-cs \
/folder.where.reads.file.is.located/myreads.file \
-N number.of.cores.used.by.shrimp -n 1 -U -o 1 -V -w 170% \
>myreads.file.out 2>myreads.file.log  &
\end{Sinput}
\end{Schunk}

The flag for Illumina data is:
\begin{Schunk}
\begin{Sinput}
--shrimp-mode ls
\end{Sinput}
\end{Schunk}

The code to format the reference fasta file for Illumina data is:
\begin{Schunk}
\begin{Sinput}
$SHRIMP_FOLDER/bin/gmapper-ls -L /folder.where.fasta.file.is.located/ncRNAs.mm-ls \
-Y -V /dev/null >/dev/null
\end{Sinput}
\end{Schunk}

To run SHRIMP the code line is the following for illumina data:
\begin{Schunk}
\begin{Sinput}
$SHRIMP_FOLDER/bin/gmapper-ls -L /folder.where.fasta.file.is.located/ncRNAs.mm-ls \
/folder.where.reads.file.is.located/myreads.file \
-N number.of.cores.used.by.shrimp -n 1 -U -o 1 -V -w 170% \
>myreads.file.out 2>myreads.file.log  &
\end{Sinput}
\end{Schunk}

The output of SHRIMP must be reformated to produce two files with the extension .bed and .logos.
The output of SHRIMP has the following columns:
\begin{Schunk}
\begin{Sinput}
readname	  Read tag name
contigname	Genome (Contig/Chromosome) name
strand	    Genome strand ('+' or '-')
contigstart	Start of alignment in genome (beginning with 1, not 0).
contigend	  End of alignment in genome (inclusive).
readstart	  Start of alignment in read (beginning with 1, not 0).
readend	    End of alignment in read (inclusive).
readlength	Length of the read in bases/colours.
score		    Alignment score
editstring	Edit string.
\end{Sinput}
\end{Schunk}

It is possible to copy in the oneChannelGUI path the perl scripts needed for SHRIMP data reformatting
using the general menu function \textit{oneChannelGUI: Set NGS perl scripts folder}.
It is also necessary that perl is installed in the system. For windows user the best will be
the installation of active perl, which is very strait forward.
The function \textit{oneChannelGUI: Reformat SHRIMP output} present in the File menu will allow data reformatting using as input a Target file,
which will be also used subsequently to load the reformatted .bed and .logos files in oneChannelGUI.

The reformat routine, produces two files: one with the extension .bed and an other with the extension .logos.
.bed file has three columns all represented by numbers. First column is chromosome number. 
The mitocondrial genome represented by 77, X by 88 and Y by 89.
Second column is strand given by 1 and -1
Third column is first position of the read mapping over the reference chromosome.
An example of the .bed file structure is given in figure \ref{fig:ngs1}.
\begin{figure}[htbp]
  \begin{center}
    \includegraphics[width=100mm]{ngs1}
    \caption{\label{fig:ngs1} Structure of mapping data that can be imported in oneChannelGUI.}
  \end{center}
\end{figure}

.logos file has four columns, first is the ENSEMBL gene id second column is the description of the mapping
given by Edit string:
\begin{Schunk}
\begin{Sinput}
The edit  string consists  of numbers, characters   and the following additional
symbols: '-', '(' and ')'. It is constructed as follows:
     <number> = size of a matching substring
     <letter> = mismatch, value is the tag letter
    (<letters>) = gap in the reference, value shows the letters in the tag
     - = one-base gap in the tag (i.e. insertion in the reference)
     x = crossover (inserted between the appropriate two bases)

For example:
    A perfect match for 25-bp tags is: 25
    A SNP at the 16th base of the tag is: 15A9
    A four-base insertion in the reference: 3(TGCT)20
    A four-base deletion in the reference: 5----20
    Two sequencing errors: 4x15x6	(i.e. 25 matches with 2 crossovers)
\end{Sinput}
\end{Schunk}
Third column is the position of the beginning of the alignment on the ENSENBL gene
Fourth column is the position of the first alignment on the read.

In case the mirBase precursors are used as reference there is no need of reformatting 
the output of SHRIMP.  Output files produced by SHRIMPS are directly loaded and only alignments with one SNP
or with perfect march are kept for the analysis.  

\subsubsection{miRanalyser}
miRanalyser mapping tool information can be obtained in:
\begin{Schunk}
\begin{Sinput}
Nucleic Acids Res. 2009 Jul 1;37(Web Server issue):W68-76. 
miRanalyzer: a microRNA detection and analysis tool for 
next-generation sequencing experiments.
Hackenberg et al.
 
The web program can be used at:
http://web.bioinformatics.cicbiogune.es/microRNA/miRanalyser.php 
\end{Sinput}
\end{Schunk}

The interesting point of this tool, figure \ref{fig:ngs6} is that user need only to load the reads to be
mapped as a multi-fasta file.

\begin{figure}[htbp]
  \begin{center}
    \includegraphics[width=100mm]{ngs6}
    \caption{\label{fig:ngs6} web interface of miRanalyser}
  \end{center}
\end{figure}

\begin{Schunk}
\begin{Sinput}
>ID 49862
GAGGTAGTAGGTTGTA
>ID 15490
ACCCGTAGAACCGACC
>ID 13762
GGAGCATCTCTCGGTC
\end{Sinput}
\end{Schunk}

This web tool facilitates the generation of primary data for unexperienced users.
The output is generated in few hours and can be retrieved by the user after bookmarking each of the job pages.
The output is a very complete, but we will focus only on the retrieval of the
mapping data referring to the mature miRNAs, figure \ref{fig:ngs7}
\begin{figure}[htbp]
  \begin{center}
    \includegraphics[width=100mm]{ngs7}
    \caption{\label{fig:ngs7} The arrow indicate the link that allows to retrieve the tab delimited file 
    referring to mapping to the mature subset of microRNAs}
  \end{center}
\end{figure}

\subsubsection{miRExpress}
miRExpress mapping tool information can be obtained in:
\begin{Schunk}
\begin{Sinput}
BMC Bioinformatics 2009, 10:328. 
miRExpress: Analyzing high-throughput sequencing data for
profiling microRNA expression.
Wei-Chi Wang, Feng-Mao Lin, Wen-Chi Chang, Kuan-Yu Lin,
Hsien-Da Huang and Na-Sheng Lin
 
The tool can be retrieved at:
\url{http://miRExpress.mbc.nctu.edu.tw} 
\end{Sinput}
\end{Schunk}
The alignment files can be directly loaded in oneChannelGUI. Below an example 
of the structure of the alignment files generated with miRExpress.
\begin{Schunk}
\begin{Sinput}
hsa-mir-520a
CUCAGGCUGUGACCCUCCAGAGGGAAGUACUUUCUGUUGUCUGAGAGAAAAGAAAGUGCUUCCCUUUGGACUGUUUCGGUUUGAG
Others*******************************************************************************
              CTCCAGAGGGAAGTACTTTCT 3
//

hsa-mir-99a
CCCAUUGGCAUAAACCCGUAGAUCCGAUCUUGUGGUGAAGUGGACCGCACAAGCUCGCUUCUAUGGGUCUGUGUCAGUGUG
Others***************************************************************************
                                                 AACCCGTAGATCCGATCTTGTG 30
                                                 CAAGCTCGCTTCTATGGGTCTG 87
                                                 CAAGCTCGCTTCTATGGGTCT 64
//
\end{Sinput}
\end{Schunk}

\subsubsection{miRProf}
miRProf is web mapping tool. The web tool can be used at:
\url{http://srna-tools.cmp.uea.ac.uk/animal/cgi-bin/srna-tools.cgi} 
The tab delimited files provided by miRProf can be directly loaded in oneChannelGUI.


\section{File menu}
In this menu selecting the option RNA-seq it is possible to load NGS data. 

\subsection{Data loading}
Data are loaded in oneChannelGUI using the \textit{New function} in the File Menu selecting from the menu of the available platforms
NGS, figure \ref{fig:ngs2}.
\begin{figure}[htbp]
  \begin{center}
    \includegraphics[width=100mm]{ngs2}
    \caption{\label{fig:ngs2} Data loading menu.}
  \end{center}
\end{figure}
oneChannelGUI will require a target file which is a tab delimited file with three columns: Names, FileNames and Target, figure \ref{fig:ngs3}.
\begin{figure}[htbp]
  \begin{center}
    \includegraphics[width=100mm]{ngs3}
    \caption{\label{fig:ngs3} Target file structure.}
  \end{center}
\end{figure}
The FileName column must contain the names of the files, each one belonging to a mapping experiment, with the structure previously indicated.
IMPORTANT: target column contains, togeter with the covariate, also the total number of reads and the total number of mapped reads 
separated by an underscore. In case the above data are not known they need to be sostituted by NA.
The name in the FileName column of the target file are those produced by the primary mapping tool.
The loaded files are those produced by the reformatting tools available in the FileMenu  \textit{Reformat NGS primary mapping output}


\subsection{Reformatting NGS data}
The raw data derived from a NGS experiment cannot be used directly for statistical analysis.
At the present time are supported the output files produced by SHRIMP, MicroRazerS, miRanalyzer, miRExpress and miRProf.
When user loads the set of NGS runs, described by the target file, those data are reoganized in an ExpressionSet format,
which is quite common for microarray data.
The are two options, one is the use of Genominator package and the other the use of a segmentation approach based on chipseq package.
A pop-up will ask the user to decide between the two approaches.
The Genominator based approach allows a quick and efficient reorganization of the data
based on the availability of a specific set of annotation on which the reads are mapped.
In oneChannelGUI the set of annotation is dinamically generated from ENSEMBL and allow to use Genominator
for microRNA, exons and TSS, i.e. Transcription Start Sites.

An alternative option to Genominator based data loading is a segmentation approach based on chipseq package.
In this case to reduce RAM consumption and to allow the analysis also on conventional 32 bit computers, 
the raw data are loaded and stored in files on the bases of their belonging to a specific chromosome.
Subsequently, the union of all counts over the all samples for a specific chromosome are used to define the
peaks where the reads are clustering on the plus and minus strand. 
To define chromosome peaks user need to indicate the size of extension of the reads.
We usually extend the reads to the real length, e.g. 35.
However, extension size for ncRNA quantification values are also between 100 and 200 nts, if  the size of cDNA library is considered. 
Library size is between 108 and 130 nts  if a fractionation of small RNAs (10-40 nts) was used. 
In case cDNA was prepared from total RNA, which is better for quantification, the size
is between 150 and 200 nts. For quantification the preparation of the library from total RNA is preferred since it is 
characterized by a lower inter-experiment variability. 
User needs also to define the number of reads that are mapped as random event, for a library between 10-20 milion 35-mer tags this 
value is about 8.
It is important to remember that using long extension value an high number of peaks will be created since more peaks will be 
characterized by having a number of mapped reads greater than 8. 
The name of each peak is made by:
\begin{Schunk}
\begin{Sinput}
chr name.strand.start position-end position
e.g. chr1.plus.100000-105000
     chr2.minus.500000-500630
\end{Sinput}
\end{Schunk}
After reformatting the counts are saved in an ExpressionSet object which is stored in the affylmGUIenvironment and it is
ready to be further analyzed.

In the case of miRanalyser, miRExpress, miRProf and data generated using SHRIMP, with miRBase precursors as reference,
 the loading procedure is much faster since the data produced by the primary mapping tools need only to be reorganized in a matrix.

\section{Reformatting-Normalizing NGS data menu}
The NGS data stored in the ExpressionSet can be log2 transformed.
Since it is possible that some peaks are subset of the same peak, e.g. two peaks located less then 50 bases to each other, the function 
Refining peaks allows to merge peaks located near to each other, given a user define threshold. In figure \ref{fig:ngs4}
\begin{figure}[htbp]
  \begin{center}
    \includegraphics[width=100mm]{ngs4}
    \caption{\label{fig:ngs4} A) Plot of frequence of two nearby peaks versus inter-peaks distance. The dashed lines indicate a max
    distance defined by user, in this case 50 nts. B) Plot of the refined peaks after recursive merging of nearby peaks.}
  \end{center}
\end{figure}

It is now possible to scale/normalize the NGS data using the \textit{oneChannelGUI: Scale/normalize NGS data},
which used the method described by Robinson and Oshlack in Genome Biology 2010, 11:R25 and it is implemented in edgeR package.
The outplot plot descrive the sample organization before and after normalization using Multidimensional scaling plot
which also plot the variation in the common dispersion \ref{fig:ngs5}.
\begin{figure}[htbp]
  \begin{center}
    \includegraphics[width=100mm]{ngs5}
    \caption{\label{fig:ngs5} A) Multidimensional scaling plot before normalization. B) Multidimensional scaling plot after normalization.
    it is clear that there in an outlier, i.e. the sample that get in the first dimension very far away form the others.}
  \end{center}
\end{figure}



\section{QC menu}
The section QC allows to visualized the box plot of the NGS samples after reformatting in peaks as well as 
samples PCA and hierarchical clustering.
It is also possible to visualize data using a Multidimensional scaling plot, using the function  
\textit{oneChannelGUI: Multidimensional scaling plot (edgeR package)}.
This plot is a variation on the usual multdimensional scaling (or principle coordinate) plot,
in that a distance measure particularly appropriate for the digital gene expression (DGE) context
is used. The distance between each pair of samples (columns) is the square root of the common
dispersion for the top n (default is n = 500) genes which best distinguish that pair of samples.
These top n genes are selected according to the tagwise dispersion of all the samples.  

\section{Filtering menu}
The section filtering allows remove those peaks that are little informatives, i.e. those with too little counts. 
It also allows to filter the dataset on a list of peaks identifiers, e.g. those derived from a list of differentially expressed peaks
Furthermore, tab delimited files containing the loaded counts can be exported.

\section{Statistics menu}
In this section are implemented interfaces to edgeR and baySeq package. Both for edgeR and baySeq the inteface
uses the negative binomial distribution model to detect differential expression. P-value adjustment can be made also used. 

\section{Biological Interpretation menu}
Under development
\end{document}