%\VignetteIndexEntry{metaSeq} \documentclass[11pt]{article} \usepackage{Sweave} \usepackage{amsmath} \usepackage{hyperref} \usepackage{cite} \usepackage[numbers]{natbib} \usepackage{amssymb} % keep.source: T => With comment out line in each chunk \SweaveOpts{keep.source=TRUE,eps=FALSE,pdf=TRUE,png=TRUE,include=FALSE,concordance=TRUE} \usepackage{url} \usepackage[utf8]{inputenc} \setlength{\textheight}{8.5in} \setlength{\textwidth}{6in} \setlength{\topmargin}{-0.25in} \setlength{\oddsidemargin}{0.25in} \setlength{\evensidemargin}{0.25in} \newcommand{\Rpackage}[1]{{\textit{#1}}} \usepackage{Sweave} \begin{document} \SweaveOpts{concordance=TRUE} \SweaveOpts{concordance=TRUE} \SweaveOpts{concordance=TRUE} \SweaveOpts{concordance=TRUE} \SweaveOpts{concordance=TRUE} \SweaveOpts{concordance=TRUE} \SweaveOpts{concordance=TRUE} \SweaveOpts{concordance=TRUE} \SweaveOpts{concordance=TRUE} \SweaveOpts{concordance=TRUE} \SweaveOpts{concordance=TRUE} \SweaveOpts{concordance=TRUE} \SweaveOpts{concordance=TRUE} \SweaveOpts{concordance=TRUE} \SweaveOpts{concordance=TRUE} \SweaveOpts{concordance=TRUE} \SweaveOpts{concordance=TRUE} \SweaveOpts{concordance=TRUE} \SweaveOpts{concordance=TRUE} \SweaveOpts{concordance=TRUE} \SweaveOpts{concordance=TRUE} \SweaveOpts{concordance=TRUE} \SweaveOpts{concordance=TRUE} \SweaveOpts{concordance=TRUE} \SweaveOpts{concordance=TRUE} \SweaveOpts{concordance=TRUE} \SweaveOpts{concordance=TRUE} \SweaveOpts{concordance=TRUE} \SweaveOpts{concordance=TRUE} \SweaveOpts{concordance=TRUE} \SweaveOpts{concordance=TRUE} \SweaveOpts{concordance=TRUE} \SweaveOpts{concordance=TRUE} \SweaveOpts{concordance=TRUE} \SweaveOpts{concordance=TRUE} \title{\bf metaSeq: Meta-analysis of RNA-seq count data} \author{Koki Tsuyuzaki$^1$, and Itoshi Nikaido$^2$.} \maketitle \begin{center} \noindent $^1$Department of Medical and Life Science, Tokyo University of Science.\\ \noindent $^2$Bioinformatics Research Unit, Advanced Center for Computing and Communication, RIKEN.\\ \end{center} \begin{center} {\tt k.t.the-answer@hotmail.co.jp} \end{center} \tableofcontents %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %\clearpage <>= options(width=70) @ \newpage \section{Introduction} This document provides the way to perform meta-analysis of RNA-seq data using \Rpackage{metaSeq} package. Meta-analysis is a attempt to integrate multiple data in different studies and retrieve much reliable and reproducible result. In transcriptome study, the goal of analysis may be differentially expressed genes (DEGs). In our package, the probability of one-sided \Rpackage{NOISeq} \cite{NOISeq} is applied in each study. This is because the numbers of reads are often different depending on its study and \Rpackage{NOISeq} is robust method against its difference (see the next section). By meta-analysis, genes which differentially expressed in many studies are detected as DEGs. \section{RSE: Read-Size Effect} In many cases, the number of reads are depend on study. For example, here we prepared multiple RNA-Seq count data designed as Breast Cancer cell lines vs Normal cells measured in 4 different studies (this data is also accessible by $\bf{data(BreastCancer)}$). \begin{center} \begin{table}[htbp] \begin{tabular}{|c|c|c|}\hline ID in this vignette & Accession (SRA / ERA Accession) & Experimental Design \\ \hline \hline StudyA & SRP008746 & Breast Cancer (n=3) vs Normal (n=2) \\ \hline StudyB & SRP006726 & Breast Cancer (n=1) vs Normal (n=1) \\ \hline StudyC & SRP005601 & Breast Cancer (n=7) vs Normal (n=1) \\ \hline StudyD & ERP000992 & Breast Cancer (n=2) vs Normal (n=1) \\ \hline \end{tabular} \end{table} \end{center} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \begin{figure}[ht] \centering \includegraphics[width=80mm]{Fig1.jpeg} \caption{Difference of the number of reads} \label{fig1} \end{figure} \newpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% As shown in the figure 1, the number of reads in StudyA, B, C, and D are relatively different. Generally, statistical test is influenced by the number of reads; the more the number of reads is large, the more the statistical tests are tend to be significant (see the next section). Therefore, in meta-analysis of RNA-seq data, data may be suffered from this bias. Here we call this bias as RSE (Read Size Effect). \section{Robustness against RSE} In the point of view of robustness against RSE, we evaluated five widely used method in RNA-seq; \Rpackage{DESeq} \cite{DESeq}, \Rpackage{edgeR} \cite{edgeR}, \Rpackage{baySeq} \cite{baySeq}, and \Rpackage{NOISeq} \cite{NOISeq}. Here we used only StudyA data. All counts in the matrix are repeatedly down-sampled in accordance with distributions of binomial (the probability equals 0.5). 1 (original), 1/2, 1/4, 1/8, 1/16, and 1/32-fold data are prepared as low read size situation. In each read size, four methods are conducted (figure 2.A, this data is also accessible by $\bf{data(StudyA)}$ and $\bf{data(pvals)}$), then we focussed on how top500 genes of original data in order of significance will change its members, influenced by low read size (figure 2.B). %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \begin{figure}[ht] \centering \includegraphics[width=\linewidth]{Fig2.png} \caption{A(left): RSE in each RNA-Seq method, B(right): Top 500 genes in order of significance} \label{fig2} \end{figure} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Ideal method will returns same result regardless of read size, because same data was used. As shown in figure 2, \Rpackage{NOISeq} is not almost affected by the number of reads and robustlly detects same genes as DEGs. Therefore, we concluded that \Rpackage{NOISeq} is suitable method at least in the point of view of meta-analysis. Note that probability of \Rpackage{NOISeq} is not equal to p-value; it is the probability that a gene is differentially expressed \cite{NOISeq}. Our package integrates its probability by Fisher's method \cite{Fisher} or Stouffer's method (inverse normal method) \cite{Stouffer}. In regard to Stouffer's method, weighting by the number of replicates (sample size) is used. \clearpage \section{Getting started} At first, install and load the \Rpackage{metaSeq} and \Rpackage{snow}. \begin{center} <>= library("metaSeq") library("snow") @ \end{center} The RNA-seq expression data in breast cancer cell lines and normal cells is prepared. The data is measured from 4 different studies.The data is stored as a matrix (23368 rows $\times$ 18 columns). \begin{center} <>= data(BreastCancer) @ \end{center} We need to prepare two vectors. First vector is for indicating the experimental condition (e.g., 1: Cancer, 2: Normal) and second one is for indicating the source of data (e.g., A: StudyA, B: StudyB, C: StudyC, D: StudyD). \begin{center} <>= flag1 <- c(1,1,1,0,0, 1,0, 1,1,1,1,1,1,1,0, 1,1,0) flag2 <- c("A","A","A","A","A", "B","B", "C","C","C","C","C","C","C","C", "D","D","D") @ \end{center} Then, we use $\bf{meta.readData}$ to create R object for $\bf{meta.oneside.noiseq}$. \begin{center} <>= cds <- meta.readData(data = BreastCancer, factor = flag1, studies = flag2) @ \end{center} Onesided-NOISeq is performed in each studies and each probabilities are summalized as a member of list object. \begin{center} <>= ## This is very time consuming step. # cl <- makeCluster(4, "SOCK") # result <- meta.oneside.noiseq(cds, k = 0.5, norm = "tmm", replicates = "biological", # factor = flag1, conditions = c(1, 0), studies = flag2, cl = cl) # stopCluster(cl) ## Please load pre-calculated result (Result.Meta) ## by data function instead of scripts above. data(Result.Meta) result <- Result.Meta @ \end{center} Fisher's method and Stouffer's method can be applied to the result of $\bf{meta.oneside.noiseq}$. \begin{center} <>= F <- Fisher.test(result) S <- Stouffer.test(result) @ \end{center} These outputs are summalized as list whose length is 3. First member is the probability which means a gene is upper-regulated genes, and Second member is lower-regulated genes. Weight in each study is also saved as its third member (weight is used only by Stouffer's method). \clearpage \begin{center} <>= head(F$Upper) head(F$Lower) F$Weight head(S$Upper) head(S$Lower) S$Weight @ \end{center} \clearpage Generally, by meta-analysis, detection power will improved and much genes are detected as DEGs. \begin{center} \begin{table}[htbp] \begin{tabular}{|c|c|c|}\hline Method & Study & Number of DEGs \\ \hline \hline NOISeq & A & 86 \\ \hline NOISeq & B & 563 \\ \hline NOISeq & C & 99 \\ \hline NOISeq & D & 210 \\ \hline NOISeq & A, B, C, D (not meta-analysis) & 21 \\ \hline metaSeq (Fisher, Upper) & A, B, C, D & 407 \\ \hline metaSeq (Fisher, Lower) & A, B, C, D & 1483 \\ \hline metaSeq (Stouffer, Upper) & A, B, C, D & 116 \\ \hline metaSeq (Stouffer, Lower) & A, B, C, D & 2271 \\ \hline \end{tabular} \end{table} \end{center} \clearpage \section{Meta-analysis by non-NOISeq method} For some reason, we may want to use non-NOISeq method like \Rpackage{DESeq}, \Rpackage{edgeR}, or even cuffdiff \cite{cuffdiff}. We prepared other.oneside.noiseq as optional function for such methods. Returned object can be directlly applied to $\bf{Fisher.test}$ and $\bf{Stouffer.test}$. We have to prepare at least 2 matrix filled with p-value or probability. First matrix is for upper-regulated genes between control group and treatment group. On the other hand, second matrix is for lower-regulated genes. As optional parameter, weight in each study is also avilable. Weight is need for Stouffer's method but not necessary for Fisher's method. \begin{center} <>= ## Assume this matrix as one-sided p-values ## generated by non-NOISeq method (e.g., cuffdiff) upper <- matrix(runif(300), ncol=3, nrow=100) lower <- 1 - upper rownames(upper) <- paste0("Gene", 1:100) rownames(lower) <- paste0("Gene", 1:100) weight <- c(3,6,8) @ \end{center} Next, $\bf{other.oneside.pvalues}$ will return a list object for $\bf{Fisher.test}$ or $\bf{Stouffer.test}$ by upper, lower, and weight. \begin{center} <>= ## other.oneside.pvalues function return a matrix ## which can input Fisher.test or Stouffer.test result <- other.oneside.pvalues(upper, lower, weight) @ \end{center} $\bf{result}$ above can be applied to $\bf{Fisher.test}$ and $\bf{Stouffer.test}$. \begin{center} <>= F <- Fisher.test(result) str(F) head(F$Upper) head(F$Lower) F$Weight S <- Stouffer.test(result) str(S) head(S$Upper) head(S$Lower) S$Weight @ \end{center} \newpage \section{Setup} This vignette was built on: <>= sessionInfo() @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \newpage \vspace{2cm} \begin{thebibliography}{9} % \providecommand{\natexlab}[1]{#1} % \providecommand{\url}[1]{\texttt{#1}} % \expandafter\ifx\csname urlstyle\endcsname\relax % \providecommand{\doi}[1]{doi: #1}\else % \providecommand{\doi}{doi: \begingroup \urlstyle{rm}\Url}\fi \bibitem{NOISeq} Tarazona, S. and Garcia-Alcalde, F. and Dopazo, J. and Ferrer, A. and Conesa, A. \newblock {Genome Research} \newblock \emph {Differential expression in RNA-seq: A matter of depth}, 21(12): 2213-2223, 2011. \bibitem{DESeq} Simon Anders and Wolfgang Huber \newblock {Genome Biology} \newblock \emph {Differential expression analysis for sequence count data.}, 11: R106, 2010. \bibitem{edgeR} Robinson, M. D. and McCarthy, D. J. and Smyth, G. K. \newblock {Bioinformatics} \newblock \emph {edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.}, 26: 139-140, 2010 % \bibitem{DEGseq} % Likun Wang and Xi Wang. % \newblock{R package version 1.14.0.} % \newblock \emph {DEGseq: Identify Differentially Expressed Genes from RNA-seq data.} \bibitem{baySeq} Thomas J. Hardcastle \newblock{R package version 1.14.1.} \newblock \emph {baySeq: Empirical Bayesian analysis of patterns of differential expression in count data.}, 2012. \bibitem{Fisher} Fisher, R. A. \newblock {Statistical Methods for Research Workers, 4th edition,} \newblock Oliver and Boyd, London, 1932. \bibitem{Stouffer} Stouffer, S. A. and Suchman, E. A. and DeVinney, L. C. and Star, S. A. and Williams, R. M. Jr. \newblock {The American Soldier, Vol. 1 - Adjustment during Army Life.} \newblock Princeton, Princeton University Press, 1949 \bibitem{cuffdiff} Trapnell, C. and Williams, B. A. and Pertea, G. and Mortazavi, A. and Kwan, G. and Baren, M. J. and Salzberg, S. L. and Wold, B. J. and Pachter, L. \newblock {Nature biotechnology} \newblock \emph {Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiaiton}, 28: 511-515, 2010. \end{thebibliography} \end{document}