--- title: "Usage of AMOUNTAIN" author: "Dong Li dxl466@cs.bham.ac.uk" date: "14 November 2016" output: BiocStyle::html_document bibliography: Bibliography.bib vignette: > %\VignetteIndexEntry{Vignette Title} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` # Motivation There are various ways to detect modules from weighted networks. Conventional approaches such as clustering or graph partitioning purely use the network topology to define modules [@fortunato2010community]. But we may need additional information such as increasing high-throughput omics data. On the one hand, the construction of reliable networks, especially for specific tissues, is relatively slow. On the other hand, integrating omics data with network topology has become a paradigm in system biology community in the past decade [@mitra2013integrative]. Weighted gene co-expression network (WGCN) is a pure data-driven gene network, which only relies on gene expression profiles. There is no rigorous definition of *active modules* in WGCN, but the module itself should be more *compact* and *informative* compared with random subnetworks. AMOUNTAIN [@Li056952] provides a convex optimization based approach to identify such modules. Here we embed parts of the examples from the corresponding package [AMOUNTAIN](https://bioconductor.org/packages/AMOUNTAIN) help pages into a single document. # Network simulation We follow [@li2011integrative] to construct gene co-expression networks for simulation study. Let $n$ be the number of genes, and edge weights $W$ as well as node score $z$ follow the uniform distribution in range $[0,1]$. A module contains $k$ genes inside which the edge weights as well as node score follow the uniform distribution in range $[\theta,1]$, where $\theta=\{0.5,0.6,0.7,0.8,0.9\}$. ```{r} library(AMOUNTAIN) n = 100 k = 20 theta = 0.5 pp <- networkSimulation(n, k, theta) moduleid <- pp[[3]] netid <- 1:100 restp <- netid[-moduleid] groupdesign <- list(moduleid,restp) names(groupdesign) <- c('module','background') ``` The following figure shows the weighted co-expression network when $n=100,k=20$ and red nodes indicate module members and wider edges mean larger similarities. Visualization is based on [qgraph](https://cran.r-project.org/web/packages/qgraph/index.html). ```{r} require(qgraph) pg <- qgraph(pp[[1]],groups=groupdesign,legend=TRUE) ``` When simulating a two-layer network, the basic method is to connect two independent networks with an inter-layer weight matrix $A$, which is designed to have larger weights between two modules. ```{r} n1 = 100 k1 = 20 theta1 = 0.5 n2 = 80 k2 = 10 theta2 = 0.5 ppresult <- twolayernetworkSimulation(n1,k1,theta1,n2,k2,theta2) A <- ppresult[[3]] pp <- ppresult[[1]] moduleid <- pp[[3]] netid <- 1:n1 restp <- netid[-moduleid] pp2 <- ppresult[[2]] moduleid2 <- pp2[[3]] netid2 <- 1:n2 restp2 <- netid2[-moduleid2] library(qgraph) ## labelling the groups groupdesign <- list(moduleid,restp,(moduleid2+n1),(restp2+n1)) names(groupdesign) <- c('module1','background1','module2', 'background2') twolayernet <- matrix(0,nrow=(n1+n2),ncol=(n1+n2)) twolayernet[1:n1,1:n1] <- pp[[1]] twolayernet[(n1+1):(n1+n2),(n1+1):(n1+n2)] <- pp2[[1]] twolayernet[1:n1,(n1+1):(n1+n2)] <- A twolayernet[(n1+1):(n1+n2),1:n1] <- t(A) ``` The following figure shows the the two-layer weighted co-expression network based on above simulation. ```{r} g <- qgraph(twolayernet,groups=groupdesign,legend=TRUE) ``` # Module identification for single layer network Given the network $G$ with edges weight matrix $W\in\mathbb{R}^{n\times n}$ and nodes weight vector ${\bf z}\in\mathbb{R}^{n}$, where $n$ is the number of nodes, we formulate the active modules identification on WGCN as a elastic net contrained optmization problem: $$\min_{{\bf x}\in \mathbb{R}_+^n}\ F({\bf x})=-{\bf x}^TW{\bf x}-\lambda{\bf z}^T{\bf x}\quad s.t.\quad \alpha\|{\bf x}\|_1+(1-\alpha)\|{\bf x}\|_2^2=1$$ where the module membership vector ${\bf x}\in\mathbb{R}_+^n$ is relaxed from a $0-1$ vector in which $x_i\neq0$ means node $i$ is in the module. And $\alpha$ is the parameter to balance $\ell_1$-norm and $\ell_2$-norm which actually controls the module size. Larger $\alpha$ means a more sparse vector, corresponding smaller module in this case. We adopt the euclidean projection based technique [@gong2011efficient] to solve the problem. Here we show how to use the following function in the package to find an active module for above simulated single layer network. With groundtruth in hand, we can evaluate the quality of identified modules by F-score. In order to get higher quality, we need to tune parameter $\alpha$ in the elastic net penalty and $\lambda = 1$ in the objective function. The common way to select two optimal parameters is grid search. ```{r} n = 100 k = 20 theta = 0.5 pp <- networkSimulation(n,k,theta) moduleid <- pp[[3]] alphaset <- seq(0.1,0.9,by=0.1) lambdaset <- 2^seq(-5,5) ## using a grid search to select lambda and alpha Fscores <- matrix(0,nrow = length(alphaset),ncol = length(lambdaset)) for (j in 1:length(alphaset)) { for (k in 1:length(lambdaset)) { x <- moduleIdentificationGPFixSS(pp[[1]],pp[[2]],rep(1/n,n),maxiter = 500, a=alphaset[j],lambda = lambdaset[k]) predictedid<-which(x[[2]]!=0) recall <- length(intersect(predictedid,moduleid))/length(moduleid) precise <- length(intersect(predictedid,moduleid))/length(predictedid) Fscores[j,k] <- 2*precise*recall/(precise+recall) } } ``` We can show $gridFscore$ by 3-D plot to see how these parameters affect the performance. By certain combination of these two parameters, we can almost exactly find the target model nodes with $F-score=1$. ```{r} persp(Fscores,theta = 45,phi = 30,col = "gray",scale = FALSE,xlab = 'alpha',ylab = 'lambda', zlab = 'F-score',main = 'Fscores of identified module',box = TRUE) ``` # Module identification for two-layer network The basic idea to identification modules on a two-layer network is to find two active modules on each layer, at the same time with maximal inter-later links. we have function $\texttt{moduleIdentificationGPFixSSTwolayer}$ in the package. Following the two-layer network simulation in section 1, we call the method. ```{r} ## network simulation is the same as before modres <- moduleIdentificationGPFixSSTwolayer(pp[[1]],pp[[2]],rep(1/n1,n1),pp2[[1]],pp2[[2]],rep(1/n2,n2),A) predictedid <- which(modres[[1]]!=0) recall <- length(intersect(predictedid,moduleid))/length(moduleid) precise <- length(intersect(predictedid,moduleid))/length(predictedid) F1 <- 2*precise*recall/(precise+recall) predictedid2 <- which(modres[[2]]!=0) recall2 <- length(intersect(predictedid2,moduleid2))/length(moduleid2) precise2 <- length(intersect(predictedid2,moduleid2))/length(predictedid2) F2 <- 2*precise2*recall2/(precise2+recall2) ``` And we can also select optimal parameters combination in a more sophisticated way based on the example in section 2. # Module identification for multi-layer network A general multi-layer network is a natural extension of two-layer networks. Here we consider a specific form of multi-layer network that we could conduct simple operations. The basic idea to identification modules on a two-layer network is to find two active modules on each layer, at the same time with maximal inter-later links. we have function $\texttt{moduleIdentificationGPFixSSMultilayer}$ in the package. Following the multi-layer network simulation, we call the method as: ```{r} ## network simulation n = 100 k = 20 L = 5 theta = 0.5 cpl <- multilayernetworkSimulation(n,k,theta,L) listz <- list() for (i in 1:L){ listz[[i]] <- cpl[[i+2]] } moduleid <- cpl[[2]] ## use default parameters here x <- moduleIdentificationGPFixSSMultilayer(cpl[[1]],listz,rep(1/n,n)) predictedid <- which(x[[2]]!=0) recall <- length(intersect(predictedid,moduleid))/length(moduleid) precise <- length(intersect(predictedid,moduleid))/length(predictedid) Fscore <- (2*precise*recall/(precise+recall)) ``` And we can also select optimal parameters combination in a more sophisticated way based on the example in section 2. # Module identification for real-world data The usage of the package functions is the same for real-world data, but we need to be aware of two aspects. First of all the way to calculate edges score and nodes score in a weighted network can make an impact on the performance. Different input $W$ and $\bf z$ in the objective function may lead to different modules. Secondly, we do not have groundtruth about module membership for real world data. In this case, we may need to select the proper parameter so that the desired module size can be archived. When fixing $\lambda=0.01$, we use a binary search method to select $\alpha$ for the elastic net penalty which controls the sparsity of the module. ```{r,eval=FALSE} ## binary search parameter to fix module size to 100~200 abegin = 0.01 aend = 0.9 maxsize = 200 minsize = 100 for (i in 1:100) { x <- moduleIdentificationGPFixSS(W,z,rep(1/n,n),a=(abegin+aend)/2,lambda = 0.001,maxiter = 500) predictedid <- which(x[[2]]!=0) if(length(predictedid) > maxsize){ abegin <- (abegin+aend)/2 }else if (length(predictedid) < minsize){ aend <- (abegin+aend)/2 }else break } ``` # High-performance considerations When dealing with large scale networks, pure R is proven to be slow. In the [developing version](https://github.com/fairmiracle/AMOUNTAIN) we reimplement the core functions of AMOUNTAIN by C, in which the matrix operations are based on open source [GSL](https://www.gnu.org/software/gsl/). Currently we tested it on Linux platform. Here is a table of C-version functions and pure R functions: | C-version| Pure R | Brief description | |:----------|:-------------|:--------------------------------------------| | $\texttt{CGPFixSS}$ | $\texttt{moduleIdentificationGPFixSS}$ | Module identification on single network | | $\texttt{CGPFixSSTwolayer}$ | $\texttt{moduleIdentificationGPFixSSTwolayer}$ | Module identification on two-layer network | | $\texttt{CGPFixSSMultiLayer}$ | $\texttt{moduleIdentificationGPFixSSMultilayer}$ | Module identification on multi-layer network | We found that the most efficient way to call C functions is to compile .c file into shared libraries (For Linux .so and Windows .dll) and to use $\texttt{.C(xxx)}$ in R. Although we could follow the standard way to use [Rcpp](http://www.rcpp.org/) and even [RcppGSL](http://dirk.eddelbuettel.com/code/rcpp.gsl.html) to make use of GSL, the data format transformation makes it slower. For instance, you have to transform an array $\texttt{double *}$ into $\texttt{gsl_vector}$ or even $\texttt{RcppGSL::Vector}$ object by filling each entry. It would cause additional overhead especially for large scale data. # Biological explanation Finally, we can do gene annotation enrichment analysis with interactive tools like DAVID\footnote{https://david.ncifcrf.gov} or Enrichr\footnote{http://amp.pharm.mssm.edu/Enrichr}, to see whether a module gene list can be explained by existing biological process, pathways or even diseases. # Developer page Please visit [AMOUNTAIN](https://github.com/fairmiracle/AMOUNTAIN) for new features. # Session Information Here is the output of `sessionInfo()` on the system on which this document was compiled: ```{r echo=FALSE} sessionInfo() ``` # References