% \VignetteIndexEntry{LiquidAssociation Vignette} % \VignetteKeywords{Liquid Association, conditional normal model, GEE} % \VignettePackage{LiquidAssociation} \documentclass{article} \usepackage[margin=1in]{geometry} \usepackage{graphics} \usepackage{natbib} \newcommand{\R}{\textsf{R}} \newcommand{\Rpackage}[1]{\textit{#1}} \newcommand{\Rfunction}[1]{\texttt{#1}} \newcommand{\Robject}[1]{\texttt{#1}} \title{The LiquidAssociation Package} \author{Yen-Yi Ho} \begin{document} \maketitle %\tableofcontents \section{Introduction} The LiquidAssociation package provides analytical methods to study three-way interactions. It incorporates methods to examine a particular kind of three-way interaction called liquid association (LA). The term liquid association was first proposed by \citet{Li2002}. It describes the extend to which the correlation of a pair of variables depends on the value of a third variable. The term ``liquid" was used in contrast with ``solid" to emphasize that the association between a pair of variables changes according to the value of a third variable. Li first reported the existence of liquid association in the yeast cell cycle gene expression data set by \citet{Spellman1998}. Furthermore, the liquid association phenomena could potentially be found in other kind of data as well. Building on the ground-breaking work by Li, \citet{Ho2009} extended liquid association to accommodate more intricate co-dependencencies among the 3 variables, calling the expanded statistic {\it generalized liquid association}, or GLA for short. \\ \indent This software package provides functions to implement the estimation of liquid association through two approaches: the direct and the model-based estimation approach. For the model-based approach, we introduce the conditional normal model (CNM) and provided a generalized estimating equations (GEE)-based estimation procedure. In addition, we provide functions to perform hypothesis testing using direct estimate (sGLA) and model-based estimates (GEEb5) in this package. \section{Simple Usage} Here we present a typical work flow to investigate liquid association using a gene triplet data. We start with load the \R{} package and the example data. In this package, we use the yeast cell cycle gene expression data by Spellman (1998). The data can be obtained through package \textit{yeastCC}. The annotation package for the yeast experiment \textit{org.Sc.sgd.db} can be obtained through Bioconductor. \\ <>= library(LiquidAssociation) library(yeastCC) library(org.Sc.sgd.db) data(spYCCES) lae <- spYCCES[,-(1:4)] ### get rid of the NA elements lae <- lae[apply(is.na(exprs(lae)),1,sum) < ncol(lae)*0.3,] probname<-rownames(exprs(lae)) genes <- c("HTS1","ATP1","CYT1") geneMap <- unlist(mget(genes,revmap(org.Sc.sgdGENENAME),ifnotfound=NA)) whgene<-match(geneMap,probname) @ \indent After removing genes with high missing percentage, we keep \Sexpr{nrow(exprs(lae))} genes for further analysis. We are interested in three genes: HTS1, ATP1, and CYT1 that are involved in the Yeast electron transport pathway. We would like to know whether the correlation of HTS1 and ATP1 gene can be modulated by the level of CYT1 gene. <>= data<-t(exprs(lae[whgene,])) eSetdata<-lae[whgene,] data<-data[!is.na(data[,1]) & !is.na(data[,2]) & !is.na(data[,3]),] colnames(data)<-genes str(data) @ \indent We notice that after removing missing observations, there are \Sexpr{nrow(data)} observations left. We can use the \textit{plotGLA} function to examine whether the correlation of the first two genes changes according to the level of the third gene. The \textit{plotGLA} function produces scatter plot conditioning on the level of a third gene, $X_3$. We can specify which column in the data to be the third gene using the argument \textit{dim}. For example, we can specify HTS1 and ATP1 gene as the first two modulated genes, and CYT1 as the third controller gene by setting \textit{dim=3}. The \textit{cut} argument is used to specify the number of grid points over the third controller gene. In the \textit{plotGLA} function, the input \textit{data} can be in ExpressionSet class as well. <>= plotGLA(data, cut=3, dim=3, pch=16, filen="GLAplot", save=TRUE) @ \begin{figure}[ht] \begin{center} \includegraphics[height=3in, width=3in]{GLAplot31.pdf} \includegraphics[height=3in, width=3in]{GLAplot32.pdf} \caption{Conditional distributions of (HTS1, ATP1 | CYT1) according to the gene expression level of CYT1.} \label{GLAplot} \end{center} \end{figure} \indent According to Figure \ref{GLAplot}, we find the evidence for the existence of liquid association among the triplet since the correlation of HTS1, and ATP1 gene changes from -0.21 to 0.63. We can further perform estimation and hypothesis testing to quantify the strength of liquid association. As described in Li (2002) and Ho et al (2009), the calculation of the liquid association measures assumes all three variable are standardized with mean 0 and variance 1 and the third gene follows normal distribution. Hence in this package, within the \textit{GLA}, \textit{LA}, \textit{CNM.full}, \textit{CNM.simple}, \textit{getsGLA}, \textit{getsLA} functions the standardization and normalization steps are performed internally (so that the three variables are with mean 0 and variance 1, and the third variable is normally distributed through quantile normalization). \\ \indent We now use GLA static to robustly measure the magnitude of liquid association. The measure ranges from $-\sqrt{2/\pi} \approx -0.798$ to $\sqrt{2/\pi}$. When GLA=0, it means that the correlation of the two modulated genes does not change according to the level of the third controller gene, hence there is no evidence of LA. In addition, when GLA >0, it indicates that the correlation of the first two genes increases with increasing value of the third gene and vise versa. We use the function \textit{GLA} to calculate the GLA estimate for a given triplet data as follows: \\ <>= LAest<-LA(data) GLAest<-rep(0,3) for ( dim in 1:3){ GLAest[dim]<-GLA(data, cut=4, dim=dim) } LAest GLAest @ The data argument in these function can also be in ExpressionSet class as follows: <>= eSetGLA<-GLA(eSetdata, cut=4, dim=3, geneMap=geneMap) eSetGLA @ \indent In the above example, we calculate three GLA estimates by sequentially changing the third controller gene. In addition, the three-product-moment estimator proposed by Li (2002) can also be calculated using the \textit{LA} function as shown above. In the example, we find noticeable differences between the three-prodcut-moment and GLA estimates, LAest and GLAest, respectively. As described in Ho (2009), the GLA estimator is more robust than LA in the sense that it could still correctly capture liquid association even when the marginal mean and variance depend on the third variable. \\ \indent The second approach to estimate liquid association is using the model-based estimator. We fit the CNM using the function \textit{CNM.full} as shown below. Furthermore, the CNM model is written as: \begin{eqnarray*} \label{larho} X_3 &\sim& N(\mu_3, \sigma_3^2) \\ X_1, X_2|X_3 &\sim& N(\left( \begin{array}{c} \mu_1 \\ \mu_2 \end{array} \right), \Sigma). \end{eqnarray*} where $\Sigma=\left( \begin{array}{cc} \sigma_1^2 & \rho \sigma_1 \sigma_2 \\ \rho \sigma_1 \sigma_2 & \sigma_2^2 \end{array} \right)$. The mean vector ($\mu_1, \mu_2$) and variance matrix $\Sigma$ depend on the level of $X_3$ as written below: \begin{eqnarray*} \mu_1 &=& \beta_1 X_3, \\ \mu_2 &=& \beta_2 X_3, \\ \log{\sigma_{1}^2} &=& \alpha_{3} + \beta_{3} X_3, \\ \log{\sigma_{2}^2} &=& \alpha_4 + \beta_4 X_3, \\ \log{[\frac{1+\rho}{1-\rho}]} &=& \alpha_5 + \beta_5 X_3. \end{eqnarray*} <>= FitCNM.full<-CNM.full(data) FitCNM.full @ The main parameter of interest in the CNM for examining the existence of liquid association is $b_5$. As shown in the result above, the GEEb5 Wald test statistic is displayed in the 8th row 3rd column (GEEb5=\Sexpr{round(FitCNM.full@output[8,3],2)}) with p value \Sexpr{round(FitCNM.full@output[8,4],3)}. We conclude there is statistically significant evidence that indicates the existence of liquid association for (HTS1, ATP1 | CYT1). We can also perform hypothesis testing using the direct estimate approach using the \textit{getsGLA} function as follows: <>= sGLAest<-getsGLA(data, boots=20, perm=50, cut=4, dim=3) sGLAest sLAest<-getsLA(data,boots=20, perm=50) sLAest @ where the argument \textit{boots} specify the number of bootstrap iteration for estimating the bootstrap standard error. For demonstration purpose, we set boots to 20. The \textit{perm} argument specifies the number of iterations to calculate permuted p value. In addition, we can also perform hypothesis testing using sLA statistic based on the three-product-moment estimator proposed by Li (2002). In real applications, sGLA is likely to be more robust than sLA. We draw similar conclusion using sGLA test statistic and GEEb5 with p value less than 0.01. \section{Extended Examples} In this section, we demonstrate an typical analysis procedure using the entire gene expression data set. In this example, we first filter out genes with small variances and keep 150 genes for further analysis. <>= lae<-t(exprs(lae)) V<-apply(lae, 2, var, na.rm=TRUE) ibig<-V > 0.5 sum(ibig) big<-which(ibig) bigtriplet<-lae[,big] dim(bigtriplet) @ We can annotation the ORF ID in the data set to their gene names as follows: <>== x <- org.Sc.sgdGENENAME mappedgenes <- mappedkeys(x) xx <- as.list(x[mappedgenes]) mapid<-names(xx) orfid<-colnames(bigtriplet) genename1<-xx[match(orfid, mapid)] imap<-which(sapply(genename1, length) !=1) bigtriplet<-bigtriplet[,-imap] colnames(bigtriplet)<-genename1[-imap] @ After removing ORF probe set that can not be mapped to genes, a total number of \Sexpr{choose(ncol(bigtriplet),3)} possible triplet combinations that can be generated by the \Sexpr{ncol(bigtriplet)} genes. We now calculate the GLA estimates for all possible triplet combinations. Here, we demonstrate calculating GLA estimates for triplet combination $\#1$ to $\# 100$. <>= num<-choose(ncol(bigtriplet),3) pick<-t(combn(1:ncol(bigtriplet),3)) GLAout<-matrix(0, nrow=100, 3) for ( i in 1:100){ dat1<-bigtriplet[,pick[i,]] for ( dim in 1:3 ){ GLAout[i,dim]<-GLA(dat1, cut=4, dim=dim) } } @ We would like to find triplets with large GLA estimates. For example, we can choose triplet with GLA greater than 0.2 as follows: <>= GLAmax<-apply(abs(GLAout),1, max) imax<-which(GLAmax > 0.20) pickmax<-pick[imax,] GLAmax<-GLAout[imax,] trip.order<-t(apply(t(abs(GLAmax)), 2, order, decreasing=TRUE)) @ Furthermore, we perform hypothesis testing using the first triplet as demonstration. Based on the analysis result, we can not reject the null hypothesis that the liquid association is 0. We draw the same conclusion using GEEb5 and sGLA statistics. <>= whtrip<-1 data<-bigtriplet[,pickmax[whtrip,]] data<-data[,trip.order[whtrip,]] data<-data[!is.na(data[,1]) & !is.na(data[,2]) & !is.na(data[,3]),] data<-apply(data,2,qqnorm2) data<-apply(data,2, stand) FitCNM1<-CNM.full(data) FitCNM1 sGLA1<-getsGLA(data, boots=20, perm=50, cut=4, dim=3) sGLA1 @ \section{Reference} \bibliographystyle{biom} \bibliography{NGBayes3} \section{Session Information} \begin{itemize} \item R version 2.8.1 (2008-12-22), \verb|x86_64-unknown-linux-gnu| \item Locale: \verb|LC_CTYPE=en_US.iso885915;LC_NUMERIC=C;LC_TIME=en_US.iso885915;LC_COLLATE=en_US.iso885915;LC_MONETARY=C;LC_MESSAGES=en_US.iso885915;LC_PAPER=en_US.iso885915;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.iso885915;LC_IDENTIFICATION=C| \item Base packages: base, datasets, graphics, grDevices, methods, stats, tools, utils \item Other packages: AnnotationDbi~1.4.2, Biobase~2.2.2, DBI~0.2-4, geepack~1.0-13, LiquidAssociation~1.0.4, org.Sc.sgd.db~2.2.6, RSQLite~0.7-1, yeastCC~1.2.6 \end{itemize} \end{document}