%\VignetteIndexEntry{oneChannelGUI miRNA and RNA-seq data analysis overview} %\VignetteDepends{} %\VignetteKeywords{one channel microarray,extended Affymetrix GUI, limma, quality control, data filtering, time course, alternative splicing} %\VignettePackage{oneChannelGUI} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \documentclass[12pt]{article} \usepackage{times} \usepackage{hyperref} \textwidth=6.2in \textheight=8.5in \oddsidemargin=0.2in \evensidemargin=0.2in \headheight=0in \headsep=0in \begin{document} \title{oneChannelGUI Package: NGS secondary data analysis} \author{Raffaele A Calogero, Francesca Cordero, Remo Sanges, Cristina Della Beffa} \date{August 27, 2011} \maketitle \section{Introduction} OneChannelGUI was initially developed to provide a new set of functions extending the capability of affylmGUI package. oneChannelGUI was designed specifically for life scientists who are not familiar with R language but do wish to capitalize on the vast analysis opportunities of Bioconductor. oneChannelGUI offers a comprehensive microarray analysis for single channel pplatforms. Since Second Generation Sequencing (NGS) is becoming more and more used in the genomic area, Bioconductor is extending the number of tools for NGS analysis. Therefore, we are also extending the functionalities of oneChannelGUI to handle RNA-seq data, fig. \ref{fig:ngs11}. \begin{figure}[htbp] \begin{center} \includegraphics{ngs11} \caption{\label{fig:ngs11} NGS Menu.} \end{center} \end{figure} \section{General Tools} In this menu there are all the functions necessary to install the external software required to analyse RNA-seq data. Perl or active perl should be already installed by the user. The function \textit{Installing all external software needed for RNA-seq} installs all the external software needed for RNA-seq data analysis. It is also possible to install part of the software using the functions: \begin{enumerate} \item \textit{Set NGS perl scripts folder} installs the perl scripts needed for secondary analysis of data produced with external software. \item \textit{Install Bowtie and Picard tools} allows the installation of bowtie and picard tools, as well as a set of precompiled reference sequences to be used to map short-reads. In the present implementation miRNA precursors for human, mouse rat and bovine are available. \end{enumerate} \section{File (RNA-seq)} The present goal of oneChannelGUI is to provide a graphical interface for the analysis of RNAseq, fig. \ref{fig:ngs12}. Analysis can be done on a common 64 bits laptop with at least 4 Gb RAM. \begin{figure}[htbp] \begin{center} \resizebox{100mm}{!}{ \includegraphics{ngs12} } \caption{\label{fig:ngs12} File (RNA-seq) Menu.} \end{center} \end{figure} \section{New RNA-seq project} In this menu selecting the option RNA-seq it is possible to load NGS data. oneChannelGUI will require a target file which is a tab delimited file with three columns: Names, FileNames and Target, figure \ref{fig:ngs3}. \begin{figure}[htbp] \begin{center} \resizebox{100mm}{!}{ \includegraphics{ngs3} } \caption{\label{fig:ngs3} Target file structure.} \end{center} \end{figure} The FileName column must contain the names of the files, each one belonging to a mapping experiment, with the structure previously indicated. IMPORTANT: target column contains, togeter with the covariate, also the total number of reads and the total number of mapped reads separated by an underscore. In case the total number of reads and the total number of mapped reads are not known, they will be defined on the basis of the mapped reads in the final counts data loaded in oneChannelGUI. The name in the FileName column of the target file are those produced by the primary mapping tool. The loadable files are those produced by the reformatting tools shown in figure \ref{fig:ngs16}. \begin{figure}[htbp] \begin{center} \resizebox{100mm}{!}{ \includegraphics{ngs16} } \caption{\label{fig:ngs16} Reformatting functions available.} \end{center} \end{figure} \subsection{Primary mapping with Bowtie/Picard} In \textit{New RNA-seq project menu} it is available the function for the primary mapping of short reads: \textit{Primary mapping with Bowtie/Picard}. oneChannelGUI uses bowtie for primary mapping of Illumina short-reads, figure \ref{fig:ngs14}. \begin{figure}[htbp] \begin{center} \resizebox{100mm}{!}{ \includegraphics{ngs14} } \caption{\label{fig:ngs14} Primary mapping with Bowtie} \end{center} \end{figure} The output of bowtie is a SAM file subsequently converted in a BAM file using picard java tools. Data are loaded using a oneChannelGUI target file, e.g. mytarget.txt: \begin{Schunk} \begin{Sinput} Name FileName Target C1 C1.fq ctrl C2 C2.fq ctrl T1 T1.fq trt T2 T1.fq trt \end{Sinput} \end{Schunk} Where filenames are the short reads generated by an Illumina machine in fastq format. Upon primary mapping and conversion of output to BAM file a new target file is produced, e.g. mytarget.bam.txt: Name FileName Target C1 C1.bam ctrl C2 C2.bam ctrl T1 T1.bam trt T2 T1.bam trt BAM files are subsequently imported in oneChannelGUI using Rsamtools and converted in an ExpressionSet for further statistical analyses. The function to load bam files in oneChannelGUI is available in the File menu as one of the \textit{new} function options. \subsection{miRNAs fq linker trimming} The function \textit{miRNAs fq linker trimming} is used to remove the 5 and 3 prime end linkers from single-end (SE) FASTQ formated illumina data using a perl script. The script need still some optimizations. This trimming is strongly suggested in case Bowtie mapping is performed. The scripts trims one milion reads in aprox. 5 minutes. The function loads the fastq using a conventional target file: \begin{Schunk} \begin{Sinput} Name FileName Target C1 C1.fq ctrl C2 C2.fq ctrl T1 T1.fq trt T2 T1.fq trt \end{Sinput} \end{Schunk} The output of the function are files with the addition of trim.fq to the initial filename. A new target file is also generated: \begin{Schunk} \begin{Sinput} Name FileName Target C1 C1.fqtrim.fq ctrl C2 C2.fqtrim.fq ctrl T1 T1.fqtrim.fq trt T2 T1.fqtrim.fq trt \end{Sinput} \end{Schunk} Furthermore, a pdf file with statistics of the trimmed data is produced. Each page of the pdf contains two histograms with the statistics of 3 and 5 prime end trimming, fig. \ref{fig:ngs13}. \begin{figure}[htbp] \begin{center} \resizebox{100mm}{!}{ \includegraphics{ngs13} } \caption{\label{fig:ngs13} Trimming statistics.} \end{center} \end{figure} \subsection{Primary data reformatting (excluding Bowtie/Picard)} At the present time the output files produced by SHRIMP, MicroRazerS, miRanalyzer, miRExpress and miRProf are supported as input in oneChannelGUI. However, the output generated by SHRIMP, MicroRazerS, miRanalyzer need to be reformatted before loading: figure \ref{fig:ngs16}. \begin{figure}[htbp] \begin{center} \resizebox{100mm}{!}{ \includegraphics{ngs16} } \caption{\label{fig:ngs16} Reformatting tools available.} \end{center} \end{figure} In the case of miRExpress and miRProf the loading procedure is much faster since the data can be directly loaded, see next subsection. \subsection{Secondary analysis: loading primary mapped data} We are constantly increasing the number of outputs derived from primary mapping tools that can be loaded on oneChannleGUI At the present time oneChannelGUI allows the loading of data produced from various primary tools specifically designed for microRNA analysis, fig \ref{fig:ngs15}. \begin{figure}[htbp] \begin{center} \resizebox{100mm}{!}{ \includegraphics{ngs15} } \caption{\label{fig:ngs15} Primary mapping tools supported by oneChannelGUI.} \end{center} \end{figure}. \subsubsection{Loading BAM files generated using oneChannelGUI primary mapping of miRNAs fastq on miRBase precursors} The function \textit{Loading BAM files generated using oneChannelGUI primary mapping of miRNAs fastq on miRBase precursors} is used to handle the BAM files generated by the Bowtie primary mapping of miRNA sequencing data, using one of the following reference sets: \begin{enumerate} \item miRBase human precursor set: hs. \item miRBase mouse precursor set: mm. \item miRBase rat precursor set: rn. \item miRBase bovine precursor set: bo. \item NB: other reference sets can be requested to oneChannelGUI developers for implementation \end{enumerate} With the above mentioned function BAM data are loaded in oneChannelGUI as an expression set. Subsequently data can be filtered and/or analysed to detect differentially expressed genes with the other functions implemented in oneChannelGUI. \subsubsection{Loading SHRIMP reformatted output data and Loading SHRIMP data generated using miRBase as reference} SHRIMP mapping tool information can be obtained in: \begin{Schunk} \begin{Sinput} Rumble SM, Lacroute P, Dalca AV, Fiume M, Sidow A, et al. SHRiMP: Accurate Mapping of Short Color-space Reads. PLoS Comput Biol 2009 5(5): e1000386. doi:10.1371/journal.pcbi.1000386 The program can be retrieved at: http://compbio.cs.toronto.edu/shrimp/ \end{Sinput} \end{Schunk} To run SHRIMP version 2 the reference genome need to be indexed. It is possible to use a subset of the genome of interest encompassing only non-coding RNAs: it can be easily obtained using the function \textit{"oneChannelGUI: Export non-coding RNA fasta reference file for ncRNA-seq quantitative analysis"}. The above function retrieves from oneChannelGUI a fasta file from the data of oneChannleGUI. The fasta file is generated by the standalone function \textit{"ncScaffold"}, for its usage please refer to the standalone vignette. Primary alingment data provided by shrimp using the above reference sequences need to be reformatted using the function provided in the file menu: \textit{"oneChannelGUI: Reformat NGS primary mapping output"}, fig. \ref{fig:ngs10} . For this reformatting it is necessary to have perl or active perl installed in the computer. \begin{figure}[htbp] \begin{center} \resizebox{100mm}{!}{ \includegraphics{ngs10} } \caption{\label{fig:ngs10} Reformatting data produced using SHRIMP and the nc-RNA reference set.} \end{center} \end{figure}. It is possible to use as reference the full unmasked genome available on NCBI, also in this case it is necessary to reformat the aligned data using the function \textit{"oneChannelGUI: Reformat NGS primary mapping output"}. It is also possible to use as reference the microRNA precursors set of mirBase, actually is implemented the version 17.0 The fasta file are available in the ngsperl dir located in the oneChannelGUI path. This folder is created using the function \textit{"oneChannelGUI: Set NGS perl scripts folder"}. In case miRbase precursors fasta file is used as reference the mapped data produced by SHRIMP can be directly loaded in oneChannelGUI without reformatting. To run SHRIMP The reference genome needs to be indexed by SHRIMP with the following code: \begin{Schunk} \begin{Sinput} $SHRIMP_FOLDER/utils/project-db.py \ --seed \ 00111111001111111100, \ 00111111110011111100, \ 00111111111100111100, \ 00111111111111001100, \ 00111111111111110000 \ --h-flag --shrimp-mode cs /folder.where.fasta.file.is.located/ncRNAs.mm.fa \end{Sinput} \end{Schunk} The flag for SOLID data is: \begin{Schunk} \begin{Sinput} --shrimp-mode cs \end{Sinput} \end{Schunk} The code to format the reference fasta file for SOLID data is: \begin{Schunk} \begin{Sinput} $SHRIMP_FOLDER/bin/gmapper-cs -L /folder.where.fasta.file.is.located/ncRNAs.mm-cs \ -Y -V /dev/null >/dev/null \end{Sinput} \end{Schunk} To run SHRIMP the code line is the following for solid data: \begin{Schunk} \begin{Sinput} $SHRIMP_FOLDER/bin/gmapper-cs -L /folder.where.fasta.file.is.located/ncRNAs.mm-cs \ /folder.where.reads.file.is.located/myreads.file \ -N number.of.cores.used.by.shrimp -n 1 -U -o 1 -V -w 170% \ >myreads.file.out 2>myreads.file.log & \end{Sinput} \end{Schunk} The flag for Illumina data is: \begin{Schunk} \begin{Sinput} --shrimp-mode ls \end{Sinput} \end{Schunk} The code to format the reference fasta file for Illumina data is: \begin{Schunk} \begin{Sinput} $SHRIMP_FOLDER/bin/gmapper-ls -L /folder.where.fasta.file.is.located/ncRNAs.mm-ls \ -Y -V /dev/null >/dev/null \end{Sinput} \end{Schunk} To run SHRIMP the code line is the following for illumina data: \begin{Schunk} \begin{Sinput} $SHRIMP_FOLDER/bin/gmapper-ls -L /folder.where.fasta.file.is.located/ncRNAs.mm-ls \ /folder.where.reads.file.is.located/myreads.file \ -N number.of.cores.used.by.shrimp -n 1 -U -o 1 -V -w 170% \ >myreads.file.out 2>myreads.file.log & \end{Sinput} \end{Schunk} The output of SHRIMP must be reformated to produce two files with the extension .bed and .logos. The output of SHRIMP has the following columns: \begin{Schunk} \begin{Sinput} readname Read tag name contigname Genome (Contig/Chromosome) name strand Genome strand ('+' or '-') contigstart Start of alignment in genome (beginning with 1, not 0). contigend End of alignment in genome (inclusive). readstart Start of alignment in read (beginning with 1, not 0). readend End of alignment in read (inclusive). readlength Length of the read in bases/colours. score Alignment score editstring Edit string. \end{Sinput} \end{Schunk} It is possible to copy in the oneChannelGUI path the perl scripts needed for SHRIMP data reformatting using the general menu function \textit{oneChannelGUI: Set NGS perl scripts folder}. It is also necessary that perl is installed in the system. For windows user the best will be the installation of active perl, which is very strait forward. The function \textit{oneChannelGUI: Reformat SHRIMP output} present in the File menu will allow data reformatting using as input a Target file, which will be also used subsequently to load the reformatted .bed and .logos files in oneChannelGUI. The reformat routine, produces two files: one with the extension .bed and an other with the extension .logos. .bed file has three columns all represented by numbers. First column is chromosome number. The mitocondrial genome represented by 77, X by 88 and Y by 89. Second column is strand given by 1 and -1 Third column is first position of the read mapping over the reference chromosome. An example of the .bed file structure is given in figure \ref{fig:ngs1}. \begin{figure}[htbp] \begin{center} \resizebox{50 mm}{!}{ \includegraphics{ngs1} } \caption{\label{fig:ngs1} Structure of mapping data that can be imported in oneChannelGUI.} \end{center} \end{figure} .logos file has four columns, first is the ENSEMBL gene id second column is the description of the mapping given by Edit string: \begin{Schunk} \begin{Sinput} The edit string consists of numbers, characters and the following additional symbols: '-', '(' and ')'. It is constructed as follows: = size of a matching substring = mismatch, value is the tag letter () = gap in the reference, value shows the letters in the tag - = one-base gap in the tag (i.e. insertion in the reference) x = crossover (inserted between the appropriate two bases) For example: A perfect match for 25-bp tags is: 25 A SNP at the 16th base of the tag is: 15A9 A four-base insertion in the reference: 3(TGCT)20 A four-base deletion in the reference: 5----20 Two sequencing errors: 4x15x6 (i.e. 25 matches with 2 crossovers) \end{Sinput} \end{Schunk} Third column is the position of the beginning of the alignment on the ENSENBL gene Fourth column is the position of the first alignment on the read. In case the mirBase precursors are used as reference there is no need of reformatting the output of SHRIMP. Output files produced by SHRIMPS are directly loaded and only alignments with one SNP or with perfect march are kept for the analysis. \subsubsection{Loading mature miRNAs form miRanalyser} miRanalyser mapping tool information can be obtained in: \begin{Schunk} \begin{Sinput} Nucleic Acids Res. 2009 Jul 1;37(Web Server issue):W68-76. miRanalyzer: a microRNA detection and analysis tool for next-generation sequencing experiments. Hackenberg et al. The web program can be used at: http://web.bioinformatics.cicbiogune.es/microRNA/miRanalyser.php \end{Sinput} \end{Schunk} The interesting point of this tool, figure \ref{fig:ngs6} is that user need only to load the reads to be mapped as a multi-fasta file. \begin{figure}[htbp] \begin{center} \resizebox{100mm}{!}{ \includegraphics{ngs6} } \caption{\label{fig:ngs6} web interface of miRanalyser} \end{center} \end{figure} \begin{Schunk} \begin{Sinput} >ID 49862 GAGGTAGTAGGTTGTA >ID 15490 ACCCGTAGAACCGACC >ID 13762 GGAGCATCTCTCGGTC \end{Sinput} \end{Schunk} This web tool facilitates the generation of primary data for unexperienced users. The output is generated in few hours and can be retrieved by the user after bookmarking each of the job pages. The output is a very complete, but we will focus only on the retrieval of the mapping data referring to the mature miRNAs, figure \ref{fig:ngs7} \begin{figure}[htbp] \begin{center} \resizebox{100mm}{!}{ \includegraphics{ngs7} } \caption{\label{fig:ngs7} The arrow indicate the link that allows to retrieve the tab delimited file referring to mapping to the mature subset of microRNAs} \end{center} \end{figure} \subsubsection{Loading mature miRNAs from miRExpress} miRExpress mapping tool information can be obtained in: \begin{Schunk} \begin{Sinput} BMC Bioinformatics 2009, 10:328. miRExpress: Analyzing high-throughput sequencing data for profiling microRNA expression. Wei-Chi Wang, Feng-Mao Lin, Wen-Chi Chang, Kuan-Yu Lin, Hsien-Da Huang and Na-Sheng Lin The tool can be retrieved at: \url{http://miRExpress.mbc.nctu.edu.tw} \end{Sinput} \end{Schunk} The alignment files can be directly loaded in oneChannelGUI. Below an example of the structure of the alignment files generated with miRExpress. \begin{Schunk} \begin{Sinput} hsa-mir-520a CUCAGGCUGUGACCCUCCAGAGGGAAGUACUUUCUGUUGUCUGAGAGAAAAGAAAGUGCUUCCCUUUGGACUGUUUCGGUUUGAG Others******************************************************************************* CTCCAGAGGGAAGTACTTTCT 3 // hsa-mir-99a CCCAUUGGCAUAAACCCGUAGAUCCGAUCUUGUGGUGAAGUGGACCGCACAAGCUCGCUUCUAUGGGUCUGUGUCAGUGUG Others*************************************************************************** AACCCGTAGATCCGATCTTGTG 30 CAAGCTCGCTTCTATGGGTCTG 87 CAAGCTCGCTTCTATGGGTCT 64 // \end{Sinput} \end{Schunk} \subsubsection{Loading mature miRNAs from miRProf} miRProf is web mapping tool. The web tool can be used at: \url{http://srna-tools.cmp.uea.ac.uk/animal/cgi-bin/srna-tools.cgi} The tab delimited files provided by miRProf can be directly loaded in oneChannelGUI. \section{Reformatting-Normalizing NGS data menu} The NGS data stored in the ExpressionSet can be log2 transformed. Since it is possible that some peaks are subset of the same peak, e.g. two peaks located less then 50 bases to each other, the function Refining peaks allows to merge peaks located near to each other, given a user define threshold. In figure \ref{fig:ngs4} \begin{figure}[htbp] \begin{center} \resizebox{100mm}{!}{ \includegraphics{ngs4} } \caption{\label{fig:ngs4} A) Plot of frequence of two nearby peaks versus inter-peaks distance. The dashed lines indicate a max distance defined by user, in this case 50 nts. B) Plot of the refined peaks after recursive merging of nearby peaks.} \end{center} \end{figure} It is now possible to scale/normalize the NGS data using the \textit{oneChannelGUI: Scale/normalize NGS data}, which used the method described by Robinson and Oshlack in Genome Biology 2010, 11:R25 and it is implemented in edgeR package. The outplot plot descrive the sample organization before and after normalization using Multidimensional scaling plot which also plot the variation in the common dispersion \ref{fig:ngs5}. \begin{figure}[htbp] \begin{center} \resizebox{100mm}{!}{ \includegraphics{ngs5} } \caption{\label{fig:ngs5} A) Multidimensional scaling plot before normalization. B) Multidimensional scaling plot after normalization. it is clear that there in an outlier, i.e. the sample that get in the first dimension very far away form the others.} \end{center} \end{figure} \section{QC menu} The section QC allows to visualized the box plot of the NGS samples after reformatting in peaks as well as samples PCA and hierarchical clustering. It is also possible to visualize data using a Multidimensional scaling plot, using the function \textit{oneChannelGUI: Multidimensional scaling plot (edgeR package)}. This plot is a variation on the usual multdimensional scaling (or principle coordinate) plot, in that a distance measure particularly appropriate for the digital gene expression (DGE) context is used. The distance between each pair of samples (columns) is the square root of the common dispersion for the top n (default is n = 500) genes which best distinguish that pair of samples. These top n genes are selected according to the tagwise dispersion of all the samples. \section{Filtering menu} The section filtering allows remove those peaks that are little informatives, i.e. those with too little counts. It also allows to filter the dataset on a list of peaks identifiers, e.g. those derived from a list of differentially expressed peaks Furthermore, tab delimited files containing the loaded counts can be exported. \section{Statistics menu} In this section are implemented interfaces to edgeR and baySeq package. Both for edgeR and baySeq the inteface uses the negative binomial distribution model to detect differential expression. P-value adjustment can be made also used. \section{Biological Interpretation menu} Under development \end{document}