% -*- mode: noweb; noweb-default-code-mode: R-mode; -*- \documentclass[11pt]{article} %% Set my margins \setlength{\oddsidemargin}{0.0truein} \setlength{\evensidemargin}{0.0truein} \setlength{\textwidth}{6.5truein} \setlength{\topmargin}{0.0truein} \setlength{\textheight}{9.0truein} \setlength{\headsep}{0.0truein} \setlength{\headheight}{0.0truein} \setlength{\topskip}{0pt} %% End of margins \usepackage{subfigure} %%\pagestyle{myheadings} %%\markboth{$Date$\hfil$Revision$}{\thepage} \usepackage[pdftex, bookmarks, bookmarksopen, pdfauthor={Dongjun Chung}, pdftitle={GGPA Vignette}] {hyperref} \title{Genetic Analysis and Investigating Pleiotropic Architecture with `\texttt{GGPA}' Package} \author{Dongjun Chung$~^1$, Hang J. Kim$~^2$, and Hongyu Zhao$~^{3,4,5,6}$\\ $~^1$Department of Public Health Sciences, Medical University of South Carolina,\\ Charleston, SC, USA.\\ $~^2$ Department of Mathematical Sciences, University of Cincinnati, Cincinnati, OH, USA.\\ $~^3$ Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA.\\ $~^4$ Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA.\\ $~^5$ Department of Genetics, Yale School of Medicine, New Haven, CT, USA.\\ $~^6$ VA Cooperative Studies Program Coordinating Center, West Haven, CT, USA. } \date{\today} \SweaveOpts{engine=R, echo=TRUE, pdf=TRUE} \begin{document} \SweaveOpts{concordance=TRUE} %\VignetteIndexEntry{GGPA} %\VignetteKeywords{GGPA} %\VignettePackage{GGPA} \maketitle \tableofcontents \section{Overview} This vignette provides an introduction to the genetic analysis using the `\texttt{GGPA}' package. R package `\texttt{GGPA}' implements graph-GPA, a flexible statistical framework for the joint analysis of multiple genome-wide association studies (GWAS) using a hidden Markov random field architecture. We encourage questions or requests regarding `\texttt{GGPA}' package to be posted on our Google group for the GPA Suite \url{https://groups.google.com/d/forum/gpa-user-group}. Users can find the most up-to-date versions of `\texttt{GGPA}' package in our GitHub webpage (\url{http://dongjunchung.github.io/GGPA/}). The package can be loaded with the command: <>= options(prompt = "R> ") @ <>= library("GGPA") @ \noindent This vignette is organized as follows. Section \ref{workflow} describes the overall graph-GPA analysis workflow \cite{GGPA} including model fitting (Section \ref{fitting}), association mapping (Section \ref{association}), and visualization of an estimate phenotype graph (Section \ref{plot}). %Section \ref{fitting} discusses how to fit graph-GPA model. %Section \ref{association} explains command lines for association mapping using graph-GPA. %Section \ref{plot} discusses command lines to generate a graph describing the genetic relationship among phenotypes. Section \ref{ddnet} illustrates how a prior disease graph can be queried and downloaded from the graph-GPA companion webiste and incorporated to the graph-GPA analysis workflow \cite{LGGPA}. \section{Workflow}\label{workflow} \textbf{[Note]} \textbf{All the results below are based on the 200 burn-in and 200 main MCMC iterations for quick testing and building of the R package. These results are provided here only for the illustration purpose and should not be considered as real results. We recommend users to use sufficient number of burn-in and main MCMC iterations, as we use 10,000 burn-in and 40,000 main MCMC iterations for all the results in our manuscript \cite{GGPA}.}\\ In this vignette, we use the simulated GWAS data for $20,000$ SNPs and seven phenotypes for the illustration purpose. Users can find a $p$-value matrix of size $20,000 \times 7$ in the `\texttt{simulation\$pmat}' object. <>= data(simulation) dim(simulation$pmat) head(simulation$pmat) @ \noindent In this simulation studies, we assume the three strongly correlated phenotypes (n1, n2, n3), two weakly correlated phenotypes (n4, n5), and two independent phenotypes (n6, n7), as illustrated in Figure \ref{fig:pgraph}. Parameters used to generate simulation data can be found in the list object `\texttt{simulation}'. More details about simulation data generation procedure can be found in our manuscript \cite{GGPA}. <>= adjmat <- simulation$true_G diag(adjmat) <- 0 ggnet2( adjmat, label=TRUE, size=15 ) @ \begin{figure}[tb] \begin{center} <>= adjmat <- simulation$true_G diag(adjmat) <- 0 plot( adjmat, size=15 ) @ \caption{\label{fig:pgraph} True phenotype graph for simulation studies.} \end{center} \end{figure} \subsection{Fitting the graph-GPA Model}\label{fitting} We are now ready to fit a graph-GPA model using the GWAS $p$-value data described above (\texttt{simulation\$pmat}). R package \texttt{GGPA} provides flexible analysis framework and automatically adjusts its model structure based on the provided data. Users can fit the graph-GPA model with the command: <>= set.seed(12345) fit <- GGPA( simulation$pmat ) @ <>= set.seed(12345) fit <- GGPA( simulation$pmat, nBurnin=200, nMain=200 ) @ The following command prints out a summary of graph-GPA model fit, including data summary, proportion of SNPs associated with each phenotype, parameter estimates, and their standard errors. <>= fit @ Parameter estimates and their standard errors can be extracted using methods `\texttt{estimate}'. <>= str(estimates(fit)) @ \subsection{Association Mapping}\label{association} Now, based on the fitted graph-GPA model, we implement association mapping with the command: <>= assoc.marg <- assoc( fit, FDR=0.10, fdrControl="global" ) dim(assoc.marg) apply( assoc.marg, 2, table ) @ `\texttt{assoc}' method returns a binary matrix indicating association of each SNP, where one indicates that a SNP is associated with the phenotype and zero otherwise. Its rows and columns match those of input $p$-value matrix for `\texttt{GGPA}' method. `\texttt{assoc}' method allows both local (`\texttt{fdrControl="local"}') and global FDR controls (`\texttt{fdrControl="global"}'), and users can control nominal FDR level using the argument `\texttt{FDR}'. The association mapping results above indicate that about 300 $\sim$ 1400 SNPs are estimated to be associated with these phenotypes under the global FDR control at 0.10 level. `\texttt{fdr}' method for the output of `\texttt{GGPA}' method (`\texttt{fit}' in this example) further provides the matrix of local FDR that a SNP is not associated with each phenotype, where its rows and columns match those of input $p$-value matrix for `\texttt{GGPA}' method. This method will be useful when users want to scrutinize association of each SNP more closely. <>= fdr.marg <- fdr(fit) dim(fdr.marg) head(fdr.marg) @ When users are interested in the association of a SNP for certain pair of phenotypes, users can specify it using `\texttt{i}' and `\texttt{j}' arguments in both `\texttt{assoc}' and `\texttt{fdr}' methods, where `\texttt{i}' and `\texttt{j}' indicate indices of phenotypes of interest. For example, if users are interested in SNPs associated with both the first and the second phenotypes, we can specify this by setting `\texttt{i=1, j=1}'. If the `\texttt{i}' and `\texttt{j}' arguments are specified, `\texttt{assoc}' and `\texttt{fdr}' methods return a corresponding vector instead of a matrix. The association mapping results below indicate that there are 591 SNPs associated with both the first and the second phenotypes under the global FDR control at 0.10 level. <>= assoc.joint <- assoc( fit, FDR=0.10, fdrControl="global", i=1, j=2 ) length(assoc.joint) head(assoc.joint) table(assoc.joint) @ \section{Investigation of Pleiotropic Architecture Using the Phenotype Graph}\label{plot} In the joint analysis of multiple GWAS data, it is of interest to investigate the genetic relationship among the phenotypes. The graph-GPA framework allows users to check this using a phenotype graph. This phenotype graph can be generated by applying `\texttt{plot}' method to the output of `\texttt{GGPA}' method (`\texttt{fit}' in this example). <>= plot(fit) @ \noindent Figure \ref{fig:pgraph-est} shows the phenotype graph estimated using graph-GPA for the simulation data and users can see that it is identical to the true phenotype graph shown in Figure \ref{fig:pgraph}. \begin{figure}[tb] \begin{center} <>= plot(fit) @ \caption{\label{fig:pgraph-est} Phenotype graph estimated using graph-GPA.} \end{center} \end{figure} \section{graph-GPA Analysis Using a Prior Disease Graph}\label{ddnet} The graph-GPA was initially based on an uninformative prior disease graph \cite{GGPA} but later extended by allowing users to incorporate an informative prior disease graph \cite{LGGPA}. Specifically, we proposed to generate a prior disease graph based on the gene sharing pattern between diseases in the literature mining. While we showed that this approach effectively improves the graph-GPA analysis in the sense of estimation accuracy, robustness against the collinearity, and reproducibilities between independent validation dataasets \cite{LGGPA}, it still remains burdensome for most users to implement this literature mining. Hence, in order to facilitate users' convenience, we developed \textit{DDNet} (\url{http://www.chunglab.io/ddnet/}), a web interface that allows users to query diseases of interest, investigate relationships among them visually, and download the adjency matrix for the graph-GPA analysis. \begin{figure}[tb] \begin{center} \includegraphics{ddnet_step1.png} \caption{\label{fig:ddnet1} DDNet web interface: Step 1. Enter \url{http://www.chunglab.io/ddnet/} in your web browser.} \end{center} \end{figure} \begin{figure}[tb] \begin{center} \includegraphics{ddnet_step2.png} \caption{\label{fig:ddnet2} DDNet web interface: Step 2. Enter a list of diseases. Click ``Try Example'' for an example list of diseases.} \end{center} \end{figure} \begin{figure}[tb] \begin{center} \includegraphics{ddnet_step3.png} \caption{\label{fig:ddnet3} DDNet web interface: Step 3. Investigate a disease-disease network visually.} \end{center} \end{figure} \begin{figure}[tb] \begin{center} \includegraphics{ddnet_step4.png} \caption{\label{fig:ddnet4} DDNet web interface: Step 4. Download an adjacency matrix for the \texttt{graph-GPA} analysis.} \end{center} \end{figure} First, if you open the web address \url{http://www.chunglab.io/ddnet/} in your web browser, you can see the web interface that looks like Figure \ref{fig:ddnet1}. In the left side, you can a box and you can query diseases of interest. If you want to try an example list of diseases, just click ``Try Example'' on the top (Figure \ref{fig:ddnet2}). Alternatively, you can upload a text file of disease names of interest using the ``Upload'' button. Note that we constructed our disease dictionary using the Disease Ontology database (\url{http://disease-ontology.org/}). Hence, if you cannot find a disease of your interest, please check the Disease Ontology database. Then, please click the ``Submit'' button. Upon clicking the ``Submit'' button, you will see a network of the diseases you queried in the right side, as depcited in Figure \ref{fig:ddnet3}. By either using a bar of typing a value below the ``Cut-Off Value'' section, you can dynamically investigate disease network structures. Here, an edge is connected between a pair of diseases if the corresponding partial correlation coefficient is larger than the specified cut-off. If you click ``Download'' button, you can also download the disease network plot in PNG file format. If you click the ``Table'' tab above the disease graph, you can check the adjency matrix corresponding to the disease network for the specified cut-off (Figure \ref{fig:ddnet4}). You can also check the raw partial correlation coefficient matrix by clicking the ``Raw Matrix'' tab below the ``Table'' tab. By clicking ``Download'' button, you can download the adjacency matrix in the CSV file format and this can be used as a direct input for the \texttt{GGPA} package. Supposed that the downloaded CSV file is loaded to the R environment with the object name \texttt{pgraph} while the \texttt{pmat} has the corresponding genotype-phenotype association $p$-value matrix. Note that it is assumed that objects \texttt{pgraph} and \texttt{pmat} have the same number of columns and also share the same column names. Then, you can fit a graph-GPA model using the downloaded disease network as a prior distribution using the following command line. Other functions will work exactly in the same way as described in Section \ref{workflow}. <>= fit <- GGPA( pmat, pgraph ) @ \begin{thebibliography}{99} \bibitem{GGPA} Chung D, Kim H, and Zhao H (2017), ``graph-GPA: A graphical model for prioritizing GWAS results and investigating pleiotropic architecture,'' \textit{PLOS Computational Biology}, 13(2): e1005388. \bibitem{LGGPA} Kim H, Yu Z, Lawson A, Zhao H, and Chung D (2018), ``Improving SNP prioritization and pleiotropic architecture estimation by incorporating prior knowledge using graph-GPA,'' Bioinformatics, bty061. \end{thebibliography} \end{document}