%\VignetteIndexEntry{GAPS/CoGAPS Users Manual} %\VignettePackage{CoGAPS} \documentclass{report} \usepackage{Sweave,fullpage,./chicago,graphicx,subfigure,amssymb,amsmath,color,hyperref,wrapfig} \author{Elana J. Fertig \\ email: \texttt{ejfertig@jhmi.edu}} \title{GAPS/CoGAPS Users Manual} \begin{document} \maketitle \tableofcontents \chapter{Introduction} \par Gene Association in Pattern Sets (GAPS) infers underlying patterns in gene expression a matrix of microarray measurements. This Markov chain Monte Carlo (MCMC) matrix decomposition which infers these patterns also infers the extent to which individual genes belong to these patterns. The CoGAPS algorithm extends GAPS to infer the coordinated activity in sets of genes for each of the inferred patterns based upon \cite{Ochs2009} and refine set membership based upon \cite{Fertig2012}. \par The GAPS algorithm is implemented in a module of the open source, C++ MCMC software Just Another Gibbs Sampler (JAGS; \url{http://www-fis.iarc.fr/~martyn/software/jags/}; \cite{Plummer2003}). We call the software package containing this implementation of the GAPS algorithm GAPS-JAGS. As an extension including a redistribution of JAGS, GAPS-JAGS is also licensed under the GNU General Public License version 2. You may freely modify and redistribute GAPS under certain conditions that are described in the top level source directory file \nolinkurl{COPYING}. \par The R package CoGAPS is designed to facilitate the corresponding analysis of microarray measurements by calling libraries in the GAPS-JAGS package. The installation instructions provided in Chapter \ref{install} will ensure proper interaction between the CoGAPS R package and GAPS-JAGS libraries. Running instructions for the GAPS and CoGAPS analyses are provided in Sections \ref{GAPSRun} and \ref{CoGAPSRun}, respectively. CoGAPS and GAPS-JAGS are freely available at \url{http://sourceforge.net/p/cogapscpp/wiki/Home/}, \url{http://www.rits.onc.jhmi.edu/dbb/custom/A6/CoGAPS.cfm}, and \url{http://astor.som.jhmi.edu/~ejfertig/ejfertig/Software.html}. \par If you use the \texttt{CoGAPS} package for your analysis please cite: \cite{Fertig2010} EJ Fertig, J Ding, AV Favorov, G Parmigiani, and MF Ochs (2010) CoGAPS: an R/C++ package to identify patterns and biological process activity in transcriptomic data. \textit{Bioinformatics} \textbf{26}: 2792-2793. \par To cite the CoGAPS algorithm use: \cite{Ochs2003} MF Ochs (2003) Bayesian Decomposition in \textit{The Analysis of Gene Expression Data: Methods and Software} G Parmigiani, E Garrett, R Irizarry, and S Zeger, ed. New York: Springer Verlag. \par To cite the gene set statistic use: \cite{Ochs2009} MF Ochs, L Rink, C Tarn, S Mburu, T Taguchi, B Eisenberg, and AK Godwin (2009) Detection of treatment-induced changes in signaling pathways in gastrointestinal stromal tumors using transcriptomic data. \textit{Cancer Research} \textbf{69}: 9125-9132. \par To site the set-membership refinement statistic use: \cite{Fertig2012} EJ Fertig, AV Favorov, and MF Ochs (2012) Identifying context-specific transcription factor targets from prior knowledge and gene expression data. \textit{2012 IEEE International Conference on Bioinformatics and Biomedicine}, B310, \textit{in press}. \par Please contact Elana J. Fertig \url{ejfertig@jhmi.edu} or Michael F. Ochs \url{mfo@jhu.edu} for assistance. \chapter{Installation Instructions} \label{install} \par The GAPS and CoGAPS algorithms are implemented in an open source C++ software based upon JAGS version 2.1.0 (GAPS-JAGS) and an R package to facilitate the running (CoGAPS, available through Bioconductor). It is important to link CoGAPS to our distribution of GAPS-JAGS to ensure proper interfacing between the GAPS algorithm and R. \par The installation instructions provided in this section describe procedures to compile GAPS-JAGS (Section \ref{GAPSJAGSInstall}) and install the CoGAPS Bioconductor package and link this package to the GAPS-JAGS libraries (Section \ref{CoGAPSInstall}). We recommend that users proceed with installation in this order; i.e., first install GAPS-JAGS according to Section \ref{GAPSJAGSInstall} and then install the CoGAPS Bioconductor package according to Section \ref{CoGAPSInstall}. \section{GAPS-JAGS} \label{GAPSJAGSInstall} \par GAPS-JAGS is currently distributed from source only. To use it, it must be compiled. In this section, we provide installation instructions for GAPS-JAGS on Unix, MAC, and Windows. We note that we describe only standard installation processes for GAPS-JAGS. More detailed installation instructions can be found in the JAGS installation manual available at \url{http://sourceforge.net/projects/mcmc-jags/files/}. \subsection{Unix / MAC} \label{GAPSJAGSUnix} \par Successful installation of GAPS-JAGS requires the primarily on the following dependencies: \begin{description} \item[\textbf{automake}:]{Available from \url{http://www.gnu.org/software/automake/}. Installation instructions are provided in the \texttt{INSTALL} file included with the package.} \item[\textbf{autoconf}:]{Available from \url{http://www.gnu.org/software/autoconf/}. Installation instructions are provided in the \texttt{INSTALL} file included with the package.} \item[\textbf{Fortran compiler}:]{gfortran can be obtained from \url{http://gcc.gnu.org/fortran}. Binaries for Mac OS 10.5 and earlier are available at \url{http://r.research.att.com/tools/}.} \item[\textbf{BLAS and LAPACK}:]{These libraries are typically available by default on most platforms. If not provided on your machine, installation instructions are provided in the JAGS user manual available at \url{http://sourceforge.net/projects/mcmc-jags/files/}.} \end{description} \par To install GAPS-JAGS, download the source from \url{http://sourceforge.net/p/cogapscpp/wiki/Home/} and enter the top directory of the downloaded source. Then, GAPS-JAGS follows the typical GNU installation procedure of \begin{verbatim} ./configure make sudo make install \end{verbatim} These commands will install GAPS-JAGS and its associated libraries into default path (typically \nolinkurl{/usr/local}). \par The installation procedure above requires administrative privileges. To install GAPS-JAGS locally into the directory \$GAPSJAGS\_PATH, the following commands can be used. \begin{verbatim} ./configure --prefix=${GAPSJAGS_PATH} make make install \end{verbatim} \par More detailed installation instructions are provided in the file \nolinkurl{INSTALL} in the top-level source directory or the JAGS installation manual available at \url{http://sourceforge.net/projects/mcmc-jags/files/}. \par On MAC, we recommend using the most recent version of \texttt{gcc} available through \texttt{xtools} to ensure proper interaction between the GAPS-JAGS libraries and R. \subsection{Windows} \par We provide an executable which installs GAPS-JAGS in Windows called \texttt{gaps-jags-1.0.0-setup.exe} at \url{http://sourceforge.net/p/cogapscpp/wiki/Home/}. To install GAPS-JAGS, download and run this executable, following the installation instructions noted on the screen. Keep note of the directory to which GAPS-JAGS was installed for the installation of rjags (by default \nolinkurl{C:\\Program Files\\GAPS-JAGS\\GAPS-JAGS-1.0.2}). If you wish to compile GAPS-JAGS yourself, follow the instructions in the JAGS installation manual (\url{http://sourceforge.net/projects/mcmc-jags/files/}). \section{CoGAPS} \label{CoGAPSInstall} \par Throughout this section, we will assume that GAPS-JAGS was successfully installed into the directory \$GAPSJAGS\_PATH. If this package was installed into the default directory using specified in the configure file (usually \texttt{/usr/local} on Unix / Mac), then the standard Bioconductor installation for CoGAPS as follows \begin{verbatim} source("http://www.bioconductor.org/biocLite.R") biocLite("CoGAPS") \end{verbatim} \par Otherwise, the following subsections contain instructions to install the CoGAPS package on the Unix / Mac and the Windows operating systems. \subsection{Unix / Mac} \par On Unix or Mac, use the following command inside R to install CoGAPS and integrate the GAPS-JAGS libraries in \$GAPSJAGS\_PATH: \begin{verbatim} source("http://www.bioconductor.org/biocLite.R") bioCLite("CoGAPS", configure.args="--with-jags-include=${GAPSJAGS_PATH}/include/GAPS-JAGS --with-jags-lib=${GAPSJAGS_PATH}/lib --with-jags-modules=${GAPSJAGS_PATH}/lib/JAGS/modules-1.0.2") \end{verbatim} If this installation fails, try installing rjags from another mirror. Installation on Mac may require adding the flag \texttt{type="source"} to \texttt{install.packages()}. \par Alternatively, the CoGAPS package can be obtained through Bioconductor. In this case, download the CoGAPS source and install using the following command line argument: \par \noindent \texttt{R CMD INSTALL --configure-args=} \par \texttt{"--with-jags-include=\${GAPSJAGS\_PATH}/include/GAPS-JAGS} \par \texttt{ --with-jags-lib=\${GAPSJAGS\_PATH}/lib} \par \texttt{ --with-jags-modules=\${GAPSJAGS\_PATH}/lib/JAGS/modules-1.0.2"} \par \texttt{CoGAPS\_\Sexpr{packageVersion('CoGAPS')}.tar.gz} \par As before, this installation procedure requires administrative privileges to install the CoGAPS package in R. If you do not have administrative privileges, follow standard R procedures to install the package locally using the lib.loc option in \texttt{install.packages} or \texttt{-l} flag in \texttt{R CMD INSTALL}. \par In some platforms, the dynamic libraries may not be properly linked for loading the CoGAPS package, leading to an error message such as \begin{verbatim} Error in dyn.load(file, DLLpath=DLLpath, ...) : unable to load shared library 'rjags.so' libjags.so.1 cannot open shared object file: No such file or directory Error : .onLoad failed in 'loadNamespace' for 'CoGAPS' Error: package 'CoGAPS' could not be loaded \end{verbatim} In this case, the user should either set the environment variable \texttt{LD\_LIBRARY\_PATH} to \texttt{GAPSJAGS\_PATH/lib} or load in the dynamic libraries \texttt{libjags} and \texttt{libjrmath} manually as follows: \begin{verbatim} > dyn.load('${GAPSJAGS_PATH}/lib/libjags.so') > dyn.load('${GAPSJAGS_PATH}/lib/libjrmath.so') \end{verbatim} \subsection{Windows} \par Before installing or running CoGAPS, the user must specify an environment variables \texttt{JAGS\_HOME} and \texttt{JAGS\_ROOT} specifying the location of GAPS-JAGS. By default, \texttt{gaps-jags-1.0.2-setup.exe} will install GAPS-JAGS into \nolinkurl{C:\\Program Files\\GAPS-JAGS\\GAPS-JAGS-1.0.2}. The corresponding environment variable can be set globally in Windows through the following steps \begin{enumerate} \item Open the \texttt{start} menu. \item Right click on the \texttt{My Computer} icon and select \texttt{properties}. \item Go to the \texttt{Advanced} tab. \item Click on the \texttt{Environment Variables} button. \item Select the \texttt{new} button under the \texttt{System variables} section. \item Set the variable name to be \texttt{JAGS\_HOME} and variable value to be \texttt{\$GAPSJAGS\_PATH}. \item Click on the \texttt{OK} button. \item Select the \texttt{new} button under the \texttt{System variables} section. \item Set the variable name to be \texttt{JAGS\_ROOT} and variable value to be \texttt{\$GAPSJAGS\_PATH}. \item Click on the \texttt{OK} button. \item Click on the \texttt{OK} button in the environment variables window. \item Click on the \texttt{OK} button in the \texttt{Advanced} pane and exit system properties. \end{enumerate} \par Alternatively, the user can set the environment variables \texttt{JAGS\_HOME} and \texttt{JAGS\_ROOT} locally through R using the following command \begin{verbatim} > Sys.setenv("JAGS_HOME"="${GAPSJAGS_PATH}") > Sys.setenv("JAGS_ROOT"="${GAPSJAGS_PATH}") \end{verbatim} In this case, the user must reenter these commands in each session of R in which CoGAPS will be installed or run. \par Once the environment variables \texttt{JAGS\_HOME} and \texttt{JAGS\_ROOT} have been set, use the following commands inside R to install CoGAPS and integrate the GAPS-JAGS libraries in \$GAPSJAGS\_PATH: \begin{verbatim} > install.packages("CoGAPS") \end{verbatim} \chapter{Running Instructions} \par In this chapter, we describe how to run both the GAPS and CoGAPS algorithms. We note that GAPS-JAGS will create temporary files in the working directory in your R session. As a result, the user must change to a directory with write permissions before running GAPS-JAGS. \section{GAPS} \label{GAPSRun} \par GAPS seeks a pattern matrix (${\bf{P}}$) and the corresponding distribution matrix of weights (${\bf{A}}$) whose product forms a mock data matrix (${\bf{M}}$) that represents the expression data ${\bf{D}}$ within noise limits ($\boldsymbol{\varepsilon}$). That is, \begin{equation} {\bf{D}} = {\bf{M}} + \boldsymbol{\varepsilon} = {\bf{A}}{\bf{P}} + \boldsymbol{\varepsilon}. \label{eq:matrixDecomp} \end{equation} The number of rows in ${\bf{P}}$ (columns in ${\bf{A}}$) defines the number of biological patterns that GAPS will infer from the measured microarray data. As in the Bayesian Decomposition algorithm \cite{Ochs2006}, the matrices ${\bf{A}}$ and ${\bf{P}}$ in GAPS are assumed to have the atomic prior described in \cite{Sibisi1997}. In the GAPS implementation, $\alpha_{A}$ and $\alpha_{P}$ are corresponding parameters for the expected number of atoms which map to each matrix element in ${\bf{A}}$ and ${\bf{P}}$, respectively. The corresponding matrices ${\bf{A}}$ and ${\bf{P}}$ are found with MCMC sampling implemented within JAGS \cite{Plummer2003}. \par The GAPS algorithm is run by calling the \texttt{GAPS} function in the CoGAPS R package as follows: \begin{verbatim} > GAPS(data, unc, outputDir, outputBase="", sep="\t", isPercentError=FALSE, numPatterns, MaxAtomsA=2^32, alphaA=0.01, MaxAtomsP=2^32, alphaP=0.01, SAIter=1000000000, iter = 500000000, thin=-1, verbose=TRUE, keepChain=FALSE) \end{verbatim} \par \noindent \textbf{\underline{Input Arguments}} \begin{description} \item[data]{The matrix of m genes by n arrays of expression data. The input can be either the data matrix itself or the file containing this data. If the latter, GAPS will read in the data using \texttt{read.table(data, sep=sep, header=T, row.names=1)}.} \item[unc]{The matrix of m genes by n arrays of uncertainty (standard deviation) for the expression data. The input can be either a file containing the uncertainty (using the format from data), a matrix containing the uncertainty, or a constant value. If unc is a constant value, it can represent either a constant uncertainty or a constant percentage of the values in data as determined by isPercentError.} \item[numPatterns]{Number of patterns into which the data will be decomposed. Must be less than the number of genes and number of arrays in the data.} \item[outputDir]{Directory to which to output result and diagnostic files created by GAPS. (Use "" to output results to the current directory).} \item[outputBase]{Prefix for all result and diagnostic files created by GAPS (optional; default="")} \item[sep]{Text delimiter for tables in data and unc (if specified in file) and any output tables (optional; default="\t")} \item[isPercentError]{Boolean indicating whether constant value in unc is the value of the uncertainty or the percentage of the data that is the uncertainty.} \item[MaxAtomsA]{Maximum number of atoms in the atomic domain used for the prior of the amplitude matrix in the decomposition \cite{Sibisi1997}. The default value will typically be sufficient for most applications (optional; default=$2^{32}$).} \item[alphaA]{Sparsity parameter reflecting the expected number of atoms per element of the amplitude matrix in the decomposition. To enforce sparsity, this parameter should typically be less than one. (optional; default=0.01)} \item[MaxAtomsP]{Maximum number of atoms in the atomic domain used for the prior of the pattern matrix in the decomposition \cite{Sibisi1997}. The default value will typically be sufficient for most applications (optional; default=$2^{32}$).} \item[alphaP]{Sparsity parameter reflecting the expected number of atoms per element of the pattern matrix in the decomposition. To enforce sparsity, this parameter should typically be less than one. (optional; default=0.01)} \item[SAIter]{Number of burn-in iterations for the MCMC matrix decomposition (optional; default=1000000000)} \item[iter]{Number of iterations to represent the distribution of amplitude and pattern matrices with the MCMC matrix decomposition (optional; default=500000000)} \item[thin]{Double whose integer part represents the number of iterations at which the samples are kept and decimal part provides an identifier for the output files from this implementation of GAPS. If thin is an integer or not specified, this decimal file identifier is assigned randomly. (optional; default=-1; code assigns number of iterations kept to be iter/10000 and file identifier to be runif(1)) } \item[verbose]{Boolean which specifies the amount of output to the user about the progress of the program. (optional; default=TRUE)} \item[keepChain]{Boolean which specifies if chain values of ${{\bf{A}}}$ and ${{\bf{P}}}$ are saved in outputDir (optional; default=FALSE).} \end{description} \par \noindent \textbf{\underline{List Items in Function Output}} \begin{description} \item[D]{Microarray data matrix.} \item[Sigma]{Data matrix with uncertainty of D.} \item[Amean]{Sampled mean value of the amplitude matrix ${{\bf{A}}}$.} \item[Asd]{Sampled standard deviation of the amplitude matrix ${{\bf{A}}}$.} \item[Pmean]{Sampled mean value of the pattern matrix ${\bf{P}}$.} \item[Psd]{Sampled standard deviation of the pattern matrix ${\bf{P}}$.} \item[meanMock]{Mock data obtained from matrix decomposition for sampled mean values (= ${\bf{Amean}}{\bf{Pmean}}$).} \item[meanChi2]{${\chi^2}$ value for the sampled mean values (${\bf{Amean}}$ and ${\bf{Pmean}}$) of the matrix decomposition.} \end{description} \par \noindent \textbf{\underline{Side Effects}} \begin{itemize} \item Makes the folder \texttt{outputDir} in which to put the results. \item Create diagnostic files with $\chi^{2}$ and number of atoms in \texttt{outputDir} \item Create files containing the mean and standard deviation of ${\bf{A}}$ and ${\bf{P}}$ estimated with MCMC in \texttt{outputDir}. \item Create files with values of ${\bf{A}}$ and ${\bf{P}}$ from the MCMC chain stored in \texttt{outputDir} if the input parameter \texttt{keepChain} is true. \end{itemize} \par Once the GAPS algorithm has been run, the inferred patterns and corresponding amplitude can be displayed using the \texttt{plotGAPS} function as follows: \begin{verbatim} > plotGAPS(A, P, outputPDF="") \end{verbatim} \par \noindent \textbf{\underline{Input Arguments}} \begin{description} \item[A]{The amplitude matrix ${\bf{Amean}}$ obtained from GAPS.} \item[P]{The pattern matrix ${\bf{Pmean}}$ obtained from GAPS.} \item[outputPDF]{Name of an \texttt{pdf} file to which the results will be output. (Optional; default="" will output plots to the screen.)} \end{description} \par \noindent \textbf{\underline{Side Effects}} \begin{itemize} \item Save the plots of ${\bf{Amean}}$ and ${\bf{Pmean}}$ to the \texttt{pdf} file \texttt{outputPDF}. \end{itemize} \subsection{Example} \par In this example, we perform the GAPS matrix decomposition on a simulated data set with known underlying patterns (ModSim) as follows. <>= library('CoGAPS') data('ModSim') nIter <- 500000 results <- GAPS(data=ModSim.D, unc=0.01, isPercentError=FALSE, numPatterns=3, SAIter=2*nIter, iter = nIter, outputDir='ModSimResults') plotGAPS(results$Amean, results$Pmean, 'ModSimFigs') message('Deleting analysis results from GAPS for Vignette') unlink('ModSimResults', recursive=T) @ Figure \ref{fig:ModSim} shows the results from plotting the GAPS estimates of ${\bf{A}}$ and ${\bf{P}}$ using \texttt{plotGAPS}, which has a fit to ${\bf{D}}$ of $\chi^{2}=\Sexpr{results$meanChi2}$. \begin{figure}[ht] \begin{center} \subfigure[Inferred amplitude matrix]{ \includegraphics[width=0.45\linewidth]{ModSimFigs_Amplitude} } \subfigure[Inferred patterns]{ \includegraphics[width=0.45\linewidth]{ModSimFigs_Patterns} } \end{center} \caption{Results from GAPS on simulated data set with known true patterns.} \label{fig:ModSim} \end{figure} Figure \ref{fig:ModSimPtrue} displays the true patterns used to create the ModSim data, stored in \texttt{ModSim.P.true}. \begin{figure}[ht] \begin{center} <>= arrayIdx <- 1:ncol(ModSim.P.true) matplot(arrayIdx, t(ModSim.P.true), type='l', lwd=10) @ \end{center} \caption{Known true patterns used to generate ModSim data.} \label{fig:ModSimPtrue} \end{figure} \section{CoGAPS} \label{CoGAPSRun} \par CoGAPS infers coordinated activity in gene sets active in each row of the pattern matrix ${\bf{P}}$ found by GAPS. Specifically, CoGAPS computes a $Z$-score based statistic on each column of the ${\bf{A}}$ matrix developed in \cite{Ochs2009}. The resulting $Z$-score for pattern $p$ and gene set $i$, $\mathcal{G}_{i}$, with $G$ elements is given by \begin{equation} Z_{i,p} = \frac{1}{G} \sum\limits_{g \in \mathcal{G}_{i}} {\frac{{\bf{A}}_{gp}}{{\bf{Asd}}_{gp}}} \label{eq:avgZ} \end{equation} where $g$ indexes the genes in the set and ${\bf{Asd}}_{gp}$ is the standard deviation of ${\bf{A}}_{gp}$ obtained from the MCMC sampling in GAPS. CoGAPS then uses random sample tests to convert the Z-scores from eq. (\ref{eq:avgZ}) to $p$ values for each gene set. \par The CoGAPS algorithm is run by calling the \texttt{CoGAPS} function in the CoGAPS R package as follows: \begin{verbatim} > CoGAPS(data, unc, GStoGenes, outputDir, outputBase="", sep="\t", isPercentError=FALSE, numPatterns, MaxAtomsA=2^32, alphaA=0.01, MaxAtomsP=2^32, alphaP=0.01, SAIter=1000000000, iter = 500000000, thin=-1, nPerm=500, verbose=TRUE, plot=FALSE, keepChain=FALSE) \end{verbatim} \par \noindent \textbf{\underline{Input Arguments}} \begin{description} \item[$\ldots$]{Input arguments from GAPS.} \item[GStoGenes]{List or data frame containing the genes in each gene set. If a list, gene set names are the list names and corresponding elements are the names of genes contained in each set. If a data frame, gene set names are in the first column and corresponding gene names are listed in rows beneath each gene set name.} \item[nPerm]{Number of permutations used for the null distribution in the gene set statistic. (optional; default=500).} \item[plot]{Use \texttt{plotGAPS} to plot results from the run of \texttt{GAPS} within \texttt{CoGAPS}.} \end{description} \par \noindent \textbf{\underline{List Items in Function Output}} \begin{description} \item[$\ldots$]{Output list from GAPS.} \item[GSUpreg]{p-values for upregulation of each gene set in each pattern.} \item[GSDownreg]{p-values for downregulation of each gene set in each pattern.} \item[GSActEst]{p-values for activity of each gene set in each pattern.} \end{description} \par \noindent \textbf{\underline{Side Effects}} \begin{itemize} \item Side effects from the \texttt{GAPS} algorithm. \item Creates files from \texttt{GSUpreg}, \texttt{GSDownreg}, and \texttt{GSActEst} into \texttt{outputDir}. \end{itemize} \par The CoGAPS algorithm can also be run manually by first running the GAPS algorithm described in Section \ref{GAPSRun} and then calling the function \texttt{calcCoGAPSStat} as follows: \begin{verbatim} > calcCoGAPSStat(Amean, Asd, GStoGenes, numPerm=500) \end{verbatim} The input arguments for \texttt{calcCoGAPSStat} are as described in the previous sections. This function will output a list containing \texttt{GSUpreg}, \texttt{GSDownreg}, and \texttt{GSActEst}. \subsection{Examples} \subsubsection{Simulated data} \par In this example, we have simulated data in EasySimGS (DGS) with three known patterns (PGS) and corresponding amplitude (AGS) with specified activity in two gene sets (gs). In this data set, each gene set is overexpressed in of the simulated patterns and underexpressed in one. <>= library('CoGAPS') data('EasySimGS') nIter <- 5e+05 results <- CoGAPS(data=DGS, unc=0.01, isPercentError=FALSE, GStoGenes=gs, numPatterns=3, SAIter = 2*nIter, iter = nIter, outputDir='GSResults', plot=FALSE) plotGAPS(results$Amean, results$Pmean, 'GSFigs') message('Deleting analysis results from CoGAPS for Vignette') unlink('GSResults', recursive=T) @ Figure \ref{fig:GS} shows the results from running CoGAPS on the GIST data in \cite{Ochs2009} with the option plot set to true. \begin{figure}[ht] \begin{center} \subfigure[Inferred amplitude matrix]{ \includegraphics[width=0.45\linewidth]{GSFigs_Amplitude} } \subfigure[Inferred patterns]{ \includegraphics[width=0.45\linewidth]{GSFigs_Patterns} } \end{center} \caption{Results from GAPS on data of simulated gene set data.} \label{fig:GS} \end{figure} \par Moreover, the gene set activity is provided in \texttt{results\$GSActEst} including p-values for upregulation in \texttt{results\$GSUpreg} and downregulation in \texttt{results\$GSDownreg}. \subsubsection{GIST data} \par We also provide the code that would be used for the CoGAPS analysis of GIST data (GIST\_TS\_20084) with gene sets defined by transcription factors (TFGSList), as in the DESIDE analysis of \cite{Ochs2009}. To enable quick package installation, we do not evaluate this code in the vignette, but leave it for the user to compare to the results of \cite{Ochs2009}. <>= library('CoGAPS') data('GIST_TS_20084') data('TFGSList') nIter <- 50000000 results <- CoGAPS(data=GIST.D, unc=GIST.S, GStoGenes=tf2ugFC, numPatterns=5, SAIter = 2*nIter, iter = nIter, outputDir='GISTResults', plot=FALSE) plotGAPS(results$Amean, results$Pmean, 'GISTFigs') message('Deleting analysis results from CoGAPS for Vignette') unlink('GISTResults', recursive=T) @ \section{Using CoGAPS-based statistics to infer gene membership in annotated gene sets} \par As we describe in the previous section, the GAPS matrix factorization can be used to infer gene set activity in each pattern the function \texttt{calcCoGAPSStat} \cite{Ochs2009}. The \texttt{computeGeneGSProb} function extends this gene set statistic to compute a statistic quantifying the likely membership of each gene annotated to a set based upon its inferred activity \cite{Fertig2012}. This statistic is formulated by comparing the expression pattern computed with CoGAPS of a given gene $g$ annotated as a member of $\mathcal{G}$ to the common expression pattern of all annotated members of $\mathcal{G}$. This similarity is quantified with the following summary statistic \begin{equation} S_{g,\mathcal{G}} = \frac{\sum_{p} -log\left(\mbox{Pr}_{\mathcal{G}p}\right){\bf{A}}_{gp}/{\bf{Asd}}_{gp}}{\sum_{p} -log\left(\mbox{Pr}_{\mathcal{G}p}\right)}, \label{eq:summaryGSStat} \end{equation} where $\mbox{Pr}_{\mathcal{G}p}$ is the probability of upregulation of the geneset, returned from \texttt{calcCoGAPSStat} as \texttt{GSActEst} based upon eq.~(\ref{eq:avgZ}). Similar to the gene set statistics, p-values for the set membership are computed with permutation tests that compare the value of $S_{g,\mathcal{G}}$ from eq.~(\ref{eq:summaryGSStat}) to the statistic formulated for a random gene set of the same size that also contains gene $g$. \par The set membership statistic is computed from the results from the GAPS matrix factorization, computed with either the \texttt{GAPS} function described in Section \ref{GAPSR un} or the \texttt{CoGAPS} function described in Section \ref{CoGAPSRun} as follows: \begin{verbatim} > computeGeneGSProb(Amean, Asd, GStoGenes, numPerm=500) \end{verbatim} \par \noindent \textbf{\underline{Input Arguments}} \begin{description} \item[Amean]{Mean amplitude matrix estimated from the GAPS matrix factorization} \item[Asd]{Standard deviation of the amplitude matrix estimated from the GAPS matrix factorization} \item[GStoGenes]{List or data frame containing the genes in each gene set. If a list, gene set names are the list names and corresponding elements are the names of genes contained in each set. If a data frame, gene set names are in the first column and corresponding gene names are listed in rows beneath each gene set name.} \item[nPerm]{Number of permutations used for the null distribution in the gene set and set membership statistics. (optional; default=500).} \end{description} \par \noindent \textbf{\underline{Function Output}} \begin{description} \item p-value of set membership for each gene specified in \texttt{GStoGenes}. \end{description} \subsection{Examples} \par Although not run in the interest of installation time, the following examples were used to generate some of the results in \cite{Fertig2012}, with the complete analysis code available from \url{http://astor.som.jhmi.edu/~ejfertig/ejfertig/Publications.html}. \subsubsection{Simulated data} \par In this example, we have simulated data in TFSimData (TFGeneReg\$D) with four known patterns (TFGeneReg\$P) and corresponding amplitude (TFGeneReg\$A) with specified activity in four gene sets (TFGeneReg\$TFGeneReg) from \cite{Fertig2012}. The following code will generate the corresponding probability of membership for each gene in TFGeneReg\$TFGeneReg using the \texttt{computeGeneGSProb} statistic. <>= data(TFSimData) results <- GAPS(data=TFGeneReg$D, unc=0.1*pmax(TFGeneReg$M,1), isPercentError=FALSE, numPatterns=4, SAIter = 2*nIter, iter = nIter, outputDir='GSResults') TFtargets <- lapply(TFGeneReg$TFGeneReg,names) TFGenesP <- lapply(TFtargets, function(x){ computeGeneGSProb(Amean=results$Amean, Asd=results$Asd, GSGenes=x)}) @ \subsubsection{GIST data} \par This example refines transcription factor targets annotated in TRANSFAC (TFGSList) to identify context-specific targets from gene expression data (GIST\_TS\_20084)from \cite{Ochs2009}. <>= # load the data data('GIST_TS_20084') data('TFGSList') # define transcription factors of interest based on Ochs et al. (2009) TFs <- c("c.Jun", 'NF.kappaB', 'Smad4', "STAT3", "Elk.1", "c.Myc", "E2F.1", "AP.1", "CREB", "FOXO", "p53", "Sp1") # run the GAPS matrix factorization nIter <- 5e7 GISTResults <- GAPS(data=GIST.D, unc=GIST.S, numPatterns=5, outputDir = 'GISTGSCoGAPS', isPercentError=F, SAIter=2*nIter, iter=nIter) # set membership statistics permTFStats <- list() for (tf in TFs) { genes <- levels(tf2ugFC[,tf]) genes <- genes[2:length(genes)] permTFStats[[tf]] <- computeGeneTFProb(Amean = GISTResults$Amean, Asd = GistResults$Asd, genes) } @ \chapter{Feedback} \par Please send feedback to Elana Fertig \texttt{ejfertig@jhmi.edu} or Michael Ochs \texttt{mfo@jhu.edu}. \par If you want to send a bug report, it must be reproducible. Send the data, describe what you think should happen, and what did happen. \chapter{Acknowledgments} \par We would like to thank Martyn Plummer for the JAGS package and speedy feedback to bugs which facilitated our development of the GAPS-JAGS software. Special thanks to paper co-authors Jie Ding, Alexander V. Favorov, and Giovanni Parmigiani for statistical advice in developing the GAPS / CoGAPS algorithms. Additional thanks to Simina M. Boca, Ludmila V. Danilova, Jeffrey Leek, and Svitlana Tyekcheva for their useful feedback. \par This work was funded by NLM LM009382 and NSF Grant 0342111. \bibliographystyle{plain} \bibliography{AppNote} \end{document}