%\VignetteIndexEntry{EBSEA: Exon Based Strategy for Expression Analysis of genes}
%\VignetteDepends{EBSEA}
%\VignetteKeywords{Preprocessing, statistics}

\documentclass{article}
\usepackage{cite, hyperref}
\usepackage{graphicx}

\hypersetup{
    colorlinks = true, %Colours links instead of ugly boxes
    urlcolor = blue, %[rgb]{0,0.125,0.376}, %Colour for external hyperlinks
    linkcolor = blue, %[rgb]{0,0.125,0.376}, %Colour of internal links
    citecolor = red %Colour of citations
}

\title{
\begin{center}
EBSEA: Exon Based Strategy for Expression Analysis of genes
\end{center}
}
\author{Arfa Mehmood \\[1em] {\texttt{arfa.mehmood (at) utu.fi}}}
\date{March 4, 2016}
\setlength\parindent{0pt}

\begin{document}
\SweaveOpts{concordance=TRUE}
\maketitle
\textnormal{\normalfont}
\tableofcontents
\newpage

\section{Introduction}
In the conventional RNA-seq pipeline, gene counts are used to find genes that are differentially expressed between conditions. EBSEA takes a different approach and determines differential expression of genes from their exon counts. EBSEA calculates the statistical significance of each exon in a gene separately. The exon-level results are then aggregated to find the differentially expressed (upregulated/downregulated) genes. The user provides the exon count data, which can be generated, for instance, using the Python scripts in the DEXSeq R/Bioconductor package.\\[1em]
The count data are first normalized using the trimmed mean of M-values (TMM) method from the Bioconductor edgeR package. The normalized counts are then used to calculate the statistical significance of each exon with the linear modelling approach in the Bioconductor limma package. The exon results are aggregated to obtain gene-level estimates. The p-values are determined by comparing the median score to the null distribution and are further corrected using the Benjamini-Hochberg method.
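
For reference, the Benjamini-Hochberg correction mentioned above can be written as a short formula. This is the standard textbook definition rather than a transcription of the EBSEA source code, and the symbols $m$, $p_{(i)}$ and $\tilde{p}_{(i)}$ are notation introduced here for illustration only. Let $p_{(1)} \le p_{(2)} \le \dots \le p_{(m)}$ be a set of $m$ p-values sorted in increasing order. The adjusted value for the $i$-th smallest p-value is
\[
\tilde{p}_{(i)} = \min_{j \ge i} \, \min\left(1, \frac{m}{j}\, p_{(j)}\right).
\]
As a small worked example, take $m = 4$ and sorted p-values $0.01,\ 0.02,\ 0.03,\ 0.5$. The scaled values $\frac{m}{j}\, p_{(j)}$ are $0.04,\ 0.04,\ 0.04,\ 0.5$, so the adjusted p-values are $0.04,\ 0.04,\ 0.04,\ 0.5$.
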
\section{Data}
The input to EBSEA is a data frame of exon counts with one column per sample. The column names should be the sample names. Each row name should consist of a gene identifier followed by an exon number, separated by a colon (GeneName:Exonnumber), as in the example data.\\[1em]
The origCounts data is a subset containing the first 1000 rows of the exon count data from the pasilla package in Bioconductor. It consists of seven samples, each of which is either treated or untreated. The exon count data look as follows:

<<>>=
library(EBSEA)
data("origCounts")
head(origCounts)
@

\section{Analysis}
To run EBSEA, load the package and define the sample groups to be compared. The group vector should follow the order of the columns in the exon count table, and its length must equal the number of columns. If the samples are paired, the user should also provide the pairing information. EBSEA can then be called as follows:

<<>>=
# One group label per column of origCounts, in column order
group <- c('Group1', 'Group1', 'Group1', 'Group2', 'Group2', 'Group2', 'Group2')
result <- EBSEA(origCounts, group, plot = TRUE)
@

The result is a list of two data frames:
\begin{itemize}
\item Exon statistic table
\item Gene statistic table
\end{itemize}

<<>>=
# Sort the exon table by its gene and exon identifier
result$ExonTable <- result$ExonTable[order(result$ExonTable$GeneExon), ]
@

The exon statistics are as follows:

<<>>=
head(result$ExonTable)
@

The column names represent the following:
\begin{itemize}
\item \textbf{Gene and Exon:} Gene with its respective exon
\item \textbf{AveExpr:} Average expression
\item \textbf{FC:} Fold change
\item \textbf{logFC:} Log fold change
\item \textbf{FDR:} False discovery rate
\item \textbf{P.Value:} p-value
\end{itemize}

The gene statistics are as follows:

<<>>=
head(result$GeneTable)
@

The column names represent the following:
\begin{itemize}
\item \textbf{Gene:} Gene name
\item \textbf{Median:} Median value of the exon p-values
\item \textbf{ExonCount:} The number of exons in a gene
\item \textbf{FC:} Fold change
\item \textbf{logFC:} Log fold change
\item \textbf{FDR:} False discovery rate
\item \textbf{P.Value:} p-value
\end{itemize}

The results can be viewed, stored or processed further.\\[1em]
The user can visualize the information for a gene of interest by providing its identifier:\\[1em]

<<>>=
visualizeGenes('FBgn0000064', result)
@

\section{References}
Laiho, A. et al., \emph{A note on an exon-based strategy to identify differentially expressed genes in RNA-seq experiments}. PLoS ONE, 2014.

\end{document}