%\VignetteIndexEntry{oneChannelGUI Next generation Sequencing} %\VignetteDepends{} %\VignetteKeywords{one channel microarray,extended Affymetrix GUI, limma, quality control, data filtering, time course, alternative splicing} %\VignettePackage{oneChannelGUI} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \documentclass[12pt]{article} \usepackage{times} \usepackage{hyperref} \textwidth=6.2in \textheight=8.5in \oddsidemargin=0.2in \evensidemargin=0.2in \headheight=0in \headsep=0in \begin{document} \title{oneChannelGUI Package: NGS secondary data analysis} \author{Raffaele A Calogero, Francesca Cordero, Remo Sanges, Cristina Della Beffa} \date{November 03, 2010} \maketitle \section{Introduction} OneChannelGUI was initially developed to provide a new set of functions extending the capability of affylmGUI package. oneChannelGUI was designed specifically for life scientists who are not familiar with R language but do wish to capitalize on the vast analysis opportunities of Bioconductor. oneChannelGUI offers a comprehensive microarray analysis for single channel pplatforms. Since Next Generation Sequencing (NGS) is becoming more and more used in the genomic area, Bioconductor is extending the number of tools for NGS analysis. Therefore, we are also extending the functionalities of oneChannelGUI to handle RNA-seq data with a specific focus on non-coding RNA quantitative and mRNA splicing qualitative analysis. In the developing of NGS functionalities in oneChannelGUI our attention was focused on secondary analysis, i.e. data mapped on reference genome. \section{Non-coding RNA-seq} The present goal of oneChannelGUI is to provide a graphical interface for the secondary analysis of ncRNA. Secondary analysis of ncRNA can be done on a common 64 bits laptop with at least 4 Gb RAM. \subsection{Input data} We are constantly increasing the number of outputs derived from primary mapping tools that can be loaded on oneChannleGUI At the present time oneChannelGUI allows the loading of data produced from various primary tools specifically designed for microRNA analysis, fig \ref{fig:ngs0}. \begin{figure}[htbp] \begin{center} \includegraphics{ngs0} \caption{\label{fig:ngs0} Primary mapping tools supported by oneChannelGUI.} \end{center} \end{figure}. \subsubsection{SHRIMP} SHRIMP mapping tool information can be obtained in: \begin{Schunk} \begin{Sinput} Rumble SM, Lacroute P, Dalca AV, Fiume M, Sidow A, et al. SHRiMP: Accurate Mapping of Short Color-space Reads. PLoS Comput Biol 2009 5(5): e1000386. doi:10.1371/journal.pcbi.1000386 The program can be retrieved at: http://compbio.cs.toronto.edu/shrimp/ \end{Sinput} \end{Schunk} To run SHRIMP version 2 the reference genome need to be indexed. It is possible to use a subset of the genome of interest encompassing only non-coding RNAs: it can be easily obtained using the function \textit{"oneChannelGUI: Export non-coding RNA fasta reference file for ncRNA-seq quantitative analysis"}. The above function retrieves from oneChannelGUI a fasta file from the data of oneChannleGUI. The fasta file is generated by the standalone function \textit{"ncScaffold"}, for its usage please refer to the standalone vignette. Primary alingment data provided by shrimp using the above reference sequences need to be reformatted using the function provided in the file menu: \textit{"oneChannelGUI: Reformat NGS primary mapping output"}, fig. \ref{fig:ngs10} . For this reformatting it is necessary to have perl or active perl installed in the computer. \begin{figure}[htbp] \begin{center} \includegraphics{ngs10} \caption{\label{fig:ngs10} Reformatting data produced using SHRIMP and the nc-RNA reference set.} \end{center} \end{figure}. It is also possible to use as reference the full unmasked genome available on NCBI, also in this case it is necessary to reformat the aligned data using the function \textit{"oneChannelGUI: Reformat NGS primary mapping output"}. Finally it is also possible to use as reference the microRNA precursors set of mirBase, actually is implemented the version 15.0 The fasta file are available in the ngsperl dir located in the oneChannelGUI path. This folder is created using the function \textit{"oneChannelGUI: Set NGS perl scripts folder"}. In case miRbase precursors fasta file is used as reference the mapped data produced by SHRIMP can be directly loaded in oneChannelGUI without reformatting. To run SHRIMP The reference genome needs to be indexed by SHRIMP with the following code: \begin{Schunk} \begin{Sinput} $SHRIMP_FOLDER/utils/project-db.py \ --seed \ 00111111001111111100, \ 00111111110011111100, \ 00111111111100111100, \ 00111111111111001100, \ 00111111111111110000 \ --h-flag --shrimp-mode cs /folder.where.fasta.file.is.located/ncRNAs.mm.fa \end{Sinput} \end{Schunk} The flag for SOLID data is: \begin{Schunk} \begin{Sinput} --shrimp-mode cs \end{Sinput} \end{Schunk} The code to format the reference fasta file for SOLID data is: \begin{Schunk} \begin{Sinput} $SHRIMP_FOLDER/bin/gmapper-cs -L /folder.where.fasta.file.is.located/ncRNAs.mm-cs \ -Y -V /dev/null >/dev/null \end{Sinput} \end{Schunk} To run SHRIMP the code line is the following for solid data: \begin{Schunk} \begin{Sinput} $SHRIMP_FOLDER/bin/gmapper-cs -L /folder.where.fasta.file.is.located/ncRNAs.mm-cs \ /folder.where.reads.file.is.located/myreads.file \ -N number.of.cores.used.by.shrimp -n 1 -U -o 1 -V -w 170% \ >myreads.file.out 2>myreads.file.log & \end{Sinput} \end{Schunk} The flag for Illumina data is: \begin{Schunk} \begin{Sinput} --shrimp-mode ls \end{Sinput} \end{Schunk} The code to format the reference fasta file for Illumina data is: \begin{Schunk} \begin{Sinput} $SHRIMP_FOLDER/bin/gmapper-ls -L /folder.where.fasta.file.is.located/ncRNAs.mm-ls \ -Y -V /dev/null >/dev/null \end{Sinput} \end{Schunk} To run SHRIMP the code line is the following for illumina data: \begin{Schunk} \begin{Sinput} $SHRIMP_FOLDER/bin/gmapper-ls -L /folder.where.fasta.file.is.located/ncRNAs.mm-ls \ /folder.where.reads.file.is.located/myreads.file \ -N number.of.cores.used.by.shrimp -n 1 -U -o 1 -V -w 170% \ >myreads.file.out 2>myreads.file.log & \end{Sinput} \end{Schunk} The output of SHRIMP must be reformated to produce two files with the extension .bed and .logos. The output of SHRIMP has the following columns: \begin{Schunk} \begin{Sinput} readname Read tag name contigname Genome (Contig/Chromosome) name strand Genome strand ('+' or '-') contigstart Start of alignment in genome (beginning with 1, not 0). contigend End of alignment in genome (inclusive). readstart Start of alignment in read (beginning with 1, not 0). readend End of alignment in read (inclusive). readlength Length of the read in bases/colours. score Alignment score editstring Edit string. \end{Sinput} \end{Schunk} It is possible to copy in the oneChannelGUI path the perl scripts needed for SHRIMP data reformatting using the general menu function \textit{oneChannelGUI: Set NGS perl scripts folder}. It is also necessary that perl is installed in the system. For windows user the best will be the installation of active perl, which is very strait forward. The function \textit{oneChannelGUI: Reformat SHRIMP output} present in the File menu will allow data reformatting using as input a Target file, which will be also used subsequently to load the reformatted .bed and .logos files in oneChannelGUI. The reformat routine, produces two files: one with the extension .bed and an other with the extension .logos. .bed file has three columns all represented by numbers. First column is chromosome number. The mitocondrial genome represented by 77, X by 88 and Y by 89. Second column is strand given by 1 and -1 Third column is first position of the read mapping over the reference chromosome. An example of the .bed file structure is given in figure \ref{fig:ngs1}. \begin{figure}[htbp] \begin{center} \includegraphics[width=100mm]{ngs1} \caption{\label{fig:ngs1} Structure of mapping data that can be imported in oneChannelGUI.} \end{center} \end{figure} .logos file has four columns, first is the ENSEMBL gene id second column is the description of the mapping given by Edit string: \begin{Schunk} \begin{Sinput} The edit string consists of numbers, characters and the following additional symbols: '-', '(' and ')'. It is constructed as follows: = size of a matching substring = mismatch, value is the tag letter () = gap in the reference, value shows the letters in the tag - = one-base gap in the tag (i.e. insertion in the reference) x = crossover (inserted between the appropriate two bases) For example: A perfect match for 25-bp tags is: 25 A SNP at the 16th base of the tag is: 15A9 A four-base insertion in the reference: 3(TGCT)20 A four-base deletion in the reference: 5----20 Two sequencing errors: 4x15x6 (i.e. 25 matches with 2 crossovers) \end{Sinput} \end{Schunk} Third column is the position of the beginning of the alignment on the ENSENBL gene Fourth column is the position of the first alignment on the read. In case the mirBase precursors are used as reference there is no need of reformatting the output of SHRIMP. Output files produced by SHRIMPS are directly loaded and only alignments with one SNP or with perfect march are kept for the analysis. \subsubsection{miRanalyser} miRanalyser mapping tool information can be obtained in: \begin{Schunk} \begin{Sinput} Nucleic Acids Res. 2009 Jul 1;37(Web Server issue):W68-76. miRanalyzer: a microRNA detection and analysis tool for next-generation sequencing experiments. Hackenberg et al. The web program can be used at: http://web.bioinformatics.cicbiogune.es/microRNA/miRanalyser.php \end{Sinput} \end{Schunk} The interesting point of this tool, figure \ref{fig:ngs6} is that user need only to load the reads to be mapped as a multi-fasta file. \begin{figure}[htbp] \begin{center} \includegraphics[width=100mm]{ngs6} \caption{\label{fig:ngs6} web interface of miRanalyser} \end{center} \end{figure} \begin{Schunk} \begin{Sinput} >ID 49862 GAGGTAGTAGGTTGTA >ID 15490 ACCCGTAGAACCGACC >ID 13762 GGAGCATCTCTCGGTC \end{Sinput} \end{Schunk} This web tool facilitates the generation of primary data for unexperienced users. The output is generated in few hours and can be retrieved by the user after bookmarking each of the job pages. The output is a very complete, but we will focus only on the retrieval of the mapping data referring to the mature miRNAs, figure \ref{fig:ngs7} \begin{figure}[htbp] \begin{center} \includegraphics[width=100mm]{ngs7} \caption{\label{fig:ngs7} The arrow indicate the link that allows to retrieve the tab delimited file referring to mapping to the mature subset of microRNAs} \end{center} \end{figure} \subsubsection{miRExpress} miRExpress mapping tool information can be obtained in: \begin{Schunk} \begin{Sinput} BMC Bioinformatics 2009, 10:328. miRExpress: Analyzing high-throughput sequencing data for profiling microRNA expression. Wei-Chi Wang, Feng-Mao Lin, Wen-Chi Chang, Kuan-Yu Lin, Hsien-Da Huang and Na-Sheng Lin The tool can be retrieved at: \url{http://miRExpress.mbc.nctu.edu.tw} \end{Sinput} \end{Schunk} The alignment files can be directly loaded in oneChannelGUI. Below an example of the structure of the alignment files generated with miRExpress. \begin{Schunk} \begin{Sinput} hsa-mir-520a CUCAGGCUGUGACCCUCCAGAGGGAAGUACUUUCUGUUGUCUGAGAGAAAAGAAAGUGCUUCCCUUUGGACUGUUUCGGUUUGAG Others******************************************************************************* CTCCAGAGGGAAGTACTTTCT 3 // hsa-mir-99a CCCAUUGGCAUAAACCCGUAGAUCCGAUCUUGUGGUGAAGUGGACCGCACAAGCUCGCUUCUAUGGGUCUGUGUCAGUGUG Others*************************************************************************** AACCCGTAGATCCGATCTTGTG 30 CAAGCTCGCTTCTATGGGTCTG 87 CAAGCTCGCTTCTATGGGTCT 64 // \end{Sinput} \end{Schunk} \subsubsection{miRProf} miRProf is web mapping tool. The web tool can be used at: \url{http://srna-tools.cmp.uea.ac.uk/animal/cgi-bin/srna-tools.cgi} The tab delimited files provided by miRProf can be directly loaded in oneChannelGUI. \section{File menu} In this menu selecting the option RNA-seq it is possible to load NGS data. \subsection{Data loading} Data are loaded in oneChannelGUI using the \textit{New function} in the File Menu selecting from the menu of the available platforms NGS, figure \ref{fig:ngs2}. \begin{figure}[htbp] \begin{center} \includegraphics[width=100mm]{ngs2} \caption{\label{fig:ngs2} Data loading menu.} \end{center} \end{figure} oneChannelGUI will require a target file which is a tab delimited file with three columns: Names, FileNames and Target, figure \ref{fig:ngs3}. \begin{figure}[htbp] \begin{center} \includegraphics[width=100mm]{ngs3} \caption{\label{fig:ngs3} Target file structure.} \end{center} \end{figure} The FileName column must contain the names of the files, each one belonging to a mapping experiment, with the structure previously indicated. IMPORTANT: target column contains, togeter with the covariate, also the total number of reads and the total number of mapped reads separated by an underscore. In case the above data are not known they need to be sostituted by NA. The name in the FileName column of the target file are those produced by the primary mapping tool. The loaded files are those produced by the reformatting tools available in the FileMenu \textit{Reformat NGS primary mapping output} \subsection{Reformatting NGS data} The raw data derived from a NGS experiment cannot be used directly for statistical analysis. At the present time are supported the output files produced by SHRIMP, MicroRazerS, miRanalyzer, miRExpress and miRProf. When user loads the set of NGS runs, described by the target file, those data are reoganized in an ExpressionSet format, which is quite common for microarray data. The are two options, one is the use of Genominator package and the other the use of a segmentation approach based on chipseq package. A pop-up will ask the user to decide between the two approaches. The Genominator based approach allows a quick and efficient reorganization of the data based on the availability of a specific set of annotation on which the reads are mapped. In oneChannelGUI the set of annotation is dinamically generated from ENSEMBL and allow to use Genominator for microRNA, exons and TSS, i.e. Transcription Start Sites. An alternative option to Genominator based data loading is a segmentation approach based on chipseq package. In this case to reduce RAM consumption and to allow the analysis also on conventional 32 bit computers, the raw data are loaded and stored in files on the bases of their belonging to a specific chromosome. Subsequently, the union of all counts over the all samples for a specific chromosome are used to define the peaks where the reads are clustering on the plus and minus strand. To define chromosome peaks user need to indicate the size of extension of the reads. We usually extend the reads to the real length, e.g. 35. However, extension size for ncRNA quantification values are also between 100 and 200 nts, if the size of cDNA library is considered. Library size is between 108 and 130 nts if a fractionation of small RNAs (10-40 nts) was used. In case cDNA was prepared from total RNA, which is better for quantification, the size is between 150 and 200 nts. For quantification the preparation of the library from total RNA is preferred since it is characterized by a lower inter-experiment variability. User needs also to define the number of reads that are mapped as random event, for a library between 10-20 milion 35-mer tags this value is about 8. It is important to remember that using long extension value an high number of peaks will be created since more peaks will be characterized by having a number of mapped reads greater than 8. The name of each peak is made by: \begin{Schunk} \begin{Sinput} chr name.strand.start position-end position e.g. chr1.plus.100000-105000 chr2.minus.500000-500630 \end{Sinput} \end{Schunk} After reformatting the counts are saved in an ExpressionSet object which is stored in the affylmGUIenvironment and it is ready to be further analyzed. In the case of miRanalyser, miRExpress, miRProf and data generated using SHRIMP, with miRBase precursors as reference, the loading procedure is much faster since the data produced by the primary mapping tools need only to be reorganized in a matrix. \section{Reformatting-Normalizing NGS data menu} The NGS data stored in the ExpressionSet can be log2 transformed. Since it is possible that some peaks are subset of the same peak, e.g. two peaks located less then 50 bases to each other, the function Refining peaks allows to merge peaks located near to each other, given a user define threshold. In figure \ref{fig:ngs4} \begin{figure}[htbp] \begin{center} \includegraphics[width=100mm]{ngs4} \caption{\label{fig:ngs4} A) Plot of frequence of two nearby peaks versus inter-peaks distance. The dashed lines indicate a max distance defined by user, in this case 50 nts. B) Plot of the refined peaks after recursive merging of nearby peaks.} \end{center} \end{figure} It is now possible to scale/normalize the NGS data using the \textit{oneChannelGUI: Scale/normalize NGS data}, which used the method described by Robinson and Oshlack in Genome Biology 2010, 11:R25 and it is implemented in edgeR package. The outplot plot descrive the sample organization before and after normalization using Multidimensional scaling plot which also plot the variation in the common dispersion \ref{fig:ngs5}. \begin{figure}[htbp] \begin{center} \includegraphics[width=100mm]{ngs5} \caption{\label{fig:ngs5} A) Multidimensional scaling plot before normalization. B) Multidimensional scaling plot after normalization. it is clear that there in an outlier, i.e. the sample that get in the first dimension very far away form the others.} \end{center} \end{figure} \section{QC menu} The section QC allows to visualized the box plot of the NGS samples after reformatting in peaks as well as samples PCA and hierarchical clustering. It is also possible to visualize data using a Multidimensional scaling plot, using the function \textit{oneChannelGUI: Multidimensional scaling plot (edgeR package)}. This plot is a variation on the usual multdimensional scaling (or principle coordinate) plot, in that a distance measure particularly appropriate for the digital gene expression (DGE) context is used. The distance between each pair of samples (columns) is the square root of the common dispersion for the top n (default is n = 500) genes which best distinguish that pair of samples. These top n genes are selected according to the tagwise dispersion of all the samples. \section{Filtering menu} The section filtering allows remove those peaks that are little informatives, i.e. those with too little counts. It also allows to filter the dataset on a list of peaks identifiers, e.g. those derived from a list of differentially expressed peaks Furthermore, tab delimited files containing the loaded counts can be exported. \section{Statistics menu} In this section are implemented interfaces to edgeR and baySeq package. Both for edgeR and baySeq the inteface uses the negative binomial distribution model to detect differential expression. P-value adjustment can be made also used. \section{Biological Interpretation menu} Under development \end{document}