--- title: "netReg" author: "Simon Dirmeier" date: "`r Sys.Date()`" output: BiocStyle::html_document: toc: true toc_depth: 2 toc_float: true theme: lumen vignette: > %\VignetteIndexEntry{netReg} %\VignetteEngine{knitr::rmarkdown} %\usepackage[utf8]{inputenc} bibliography: netReg.bib --- # Introduction Modelling biological associations or dependencies using linear regression models is often complicated when the analysed data-sets are high-dimensional and less observations than variables are available ($n \ll p$). For these scenarios methods utilizing a priori knowledge, e.g. in the form of biological networks, have been proposed, arguing that this information might provide better estimates for regression coefficients. Recently several network-based regularization techniques have been proposed [@li2008network, @kim2013network, @cheng2014graph]. `netReg` provides an efficient implementation of a graph-penalized regression model. The model introduces a priori generated biological graph information into generalized linear models yielding sparse or smooth solutions for regression coefficients. `netReg` computes coefficients using \textit{cyclic coordinate descent} as previously introduced [@fu1998penalized, @friedman2007pathwise], [@friedman2010regularization]. The package is an `R`-wrapper to an external `C++` library that uses `RcppArmadillo` [@eddelbuettel2014rcpparmadillo] for fast matrix calculations and `Dlib` [@king2009dlib] for gradient-free convex optimization for model selection. # Edgenet tutorial This section explains how to fit a linear model and do parameter estimation using `edgenet`-regularization. The model is a truncated version from [@cheng2014graph] that is able to introduce prior graphs for the design and response matrices for penalization. At first we generate some toy data randomly: ```{r} set.seed(23) X <- matrix(rnorm(1000*5), 1000) Y <- matrix(rnorm(1000*5), 1000) ``` Then we load the `netReg` library: ```{r} library(netReg) ``` For `edgenet` we need to create an affinity matrix for the co-variables first. We also can create a graph for the responses, but this is not necessary to demonstrate the method. We could create a random graph like this: ```{r} aff.mat <- matrix(rbeta(25, 1, 5), 5) aff.mat <- (t(aff.mat) + aff.mat) / 2 diag(aff.mat) <- 0 ``` We created the affinity matrix absolutely random, although in practice a *real* (biological) observed affinity matrix should be used, because in the end the affinity matrix decides the shrinkage of the coefficients. ## Model fitting Fitting a model using edge-based regularization with `netReg` is easy: ```{r} fit <- edgenet(X=X, Y=Y, G.X=aff.mat, lambda=1, psigx=1, family="gaussian") print(fit) ``` In this case we used a single affinity matrix `G.X` which represents the relationship of the covariables $\mathit{X}$. If the design matrix has $p$ variables, `G.X` has to be an $(p \times p)$-dimensional symmetric matrix. We can also include a matrix for the response matrix `Y` with $q$ dependent variables. In that case the affinity matrix `G.Y` has to be $(q \times q)$-dimensional (and also symmetric). The `fit` object contains information about coefficients, intercepts, residuals, etc. Having the coefficients estimated we are able to predict novel data-sets: ```{r} X.new <- matrix(rnorm(10*5),10) pred <- predict(fit, X) ``` The `pred` objects contains the predicted values for the responses. ## Model selection In most cases we do not have the optimal shrinkage parameters $\lambda$, $\psi_{gx}$ and $\psi_{gy}$. 
The `fit` object contains information about coefficients, intercepts, residuals, etc. Having estimated the coefficients, we can predict the responses for novel data sets:

```{r}
X.new <- matrix(rnorm(10*5), 10)
pred <- predict(fit, X.new)
```

The `pred` object contains the predicted values for the responses.

## Model selection

In most cases we do not know the optimal shrinkage parameters $\lambda$, $\psi_{gx}$ and $\psi_{gy}$. For these settings you can use the model selection included in `netReg`. We use Powell's BOBYQA algorithm [@powell2009bobyqa] for gradient-free optimization, which is included in the `C++` library `Dlib`. Model selection only requires calling `cv.edgenet`:

```{r}
cv <- cv.edgenet(X=X, Y=Y, G.X=aff.mat, family="gaussian", maxit=1000)
print(cv)
```

You can then use the estimated parameters in the normal `edgenet` function. In this scenario $\lambda$, $\psi_{gx}$ and $\psi_{gy}$ should be roughly $0$ for three reasons:

* we had enough data and a small number of covariables ($n > p$), so we can find the *BLUE* estimator,
* we created `X` and `Y` independently of each other,
* our prior graph `aff.mat` carried little weight.

Let's look at a scenario where we need to shrink some coefficients, i.e. $n \ll p$. We choose a small $p$ such that the computation does not take too long.

```{r}
p <- 25

X <- matrix(rnorm(10*p), 10)
Y <- matrix(rnorm(10*p), 10)

aff.mat <- matrix(rgamma(p * p, 5, 1), p)
aff.mat <- (t(aff.mat) + aff.mat)
diag(aff.mat) <- 0

cv <- cv.edgenet(X=X, Y=Y, G.X=aff.mat, family="gaussian", maxit=1000)
print(cv)
```

In the above example $\lambda$ should have changed quite a bit, while $\psi_{gy}$ should still be $0$. Since we generated `aff.mat` randomly, $\psi_{gx}$ should be roughly (or exactly) zero as well. This makes sense intuitively, since we did not put any biological relationships into the affinity matrices.

# References