---
title: "netReg"
author: "Simon Dirmeier"
date: "`r Sys.Date()`"
output: 
    BiocStyle::html_document:
        toc: true
        toc_depth: 2
        toc_float: true
        theme: lumen
vignette: >
 %\VignetteIndexEntry{netReg}
 %\VignetteEngine{knitr::rmarkdown}
 %\usepackage[utf8]{inputenc}
bibliography: netReg.bib
---

# Introduction

Modelling biological associations or dependencies using linear regression 
models is often complicated when the analysed data-sets are high-dimensional 
and less observations than variables are available ($n \ll p$). 
For these scenarios methods utilizing a priori knowledge, e.g. in the form of 
biological networks, have been proposed, arguing that this information might 
provide better estimates for regression coefficients. Recently several 
network-based regularization techniques have been proposed 
[@li2008network, @kim2013network, @cheng2014graph].

`netReg` provides an efficient implementation of a graph-penalized 
regression model. The model introduces a priori generated biological graph 
information into generalized linear models yielding sparse or smooth solutions 
for regression coefficients.  

`netReg` computes coefficients using \textit{cyclic coordinate descent} as 
previously introduced [@fu1998penalized, @friedman2007pathwise], 
[@friedman2010regularization]. The package is an `R`-wrapper to an external 
`C++` library that uses `RcppArmadillo` [@eddelbuettel2014rcpparmadillo] for 
fast matrix calculations and `Dlib` [@king2009dlib] for gradient-free 
convex optimization for model selection.

# Edgenet tutorial

This section explains how to fit a linear model and do parameter estimation 
using `edgenet`-regularization. The model is a truncated version from 
[@cheng2014graph] that is able to introduce prior graphs for the design and 
response matrices for penalization.

At first we generate some toy data randomly:

```{r}
    set.seed(23)
    X <- matrix(rnorm(1000*5), 1000)
    Y <- matrix(rnorm(1000*5), 1000)
```

Then we load the `netReg` library:

```{r}
    library(netReg)
```

For `edgenet` we need to create an affinity matrix for the co-variables first. 
We also can create a graph for the responses, but this is not necessary to
demonstrate the method. We could create a random graph like this:

```{r}
    aff.mat <- matrix(rbeta(25, 1, 5),  5)
    aff.mat <- (t(aff.mat) + aff.mat) / 2
    diag(aff.mat) <- 0
```

We created the affinity matrix absolutely random, although in practice a *real*
(biological) observed affinity matrix should be used, because in the end the 
affinity matrix decides the shrinkage of the coefficients.

## Model fitting

Fitting a model using edge-based regularization with `netReg` is easy:

```{r}
    fit <- edgenet(X=X, Y=Y, G.X=aff.mat, lambda=1, psigx=1, family="gaussian")
    print(fit)
```

In this case we used a single affinity matrix `G.X` which represents the
relationship of the covariables $\mathit{X}$. If the design matrix has $p$ 
variables, `G.X` has to be an $(p \times p)$-dimensional symmetric matrix. 
We can also include a matrix for the response matrix `Y` with $q$ dependent 
variables. In that case the affinity matrix `G.Y` has to be 
$(q \times q)$-dimensional (and also symmetric).

The `fit` object contains information about coefficients, intercepts, 
residuals, etc. Having the coefficients estimated we are able to
predict novel data-sets:

```{r}
    X.new <- matrix(rnorm(10*5),10)
    pred  <- predict(fit, X)
```

The `pred` objects contains the predicted values for the responses.

## Model selection

In most cases we do not have the optimal shrinkage parameters $\lambda$, 
$\psi_{gx}$ and $\psi_{gy}$. For these settings you can use `netReg`'s 
included model selection. We use Powell's BOBYQA algorithm 
([@powell2009bobyqa]) for gradient-free optimization that is included in the
`C++` library `Dlib`. Doing the model selection only requires calling 
`cv.edgenet`:

```{r}
    cv <- cv.edgenet(X=X, Y=Y, G.X=aff.mat, family="gaussian", maxit=1000)
    print(cv)
```

You can use the fitted parameters for the normal `edgenet` function. 
In this scenario $\lambda$, $\psi_{gx}$ and $\psi_{gy}$ should be 
roughly $0$ for three reasons:

* we had enough data and a small number of covariables ($n > p$), so we can 
find the *BLUE* estimator,
* we created `X` and `Y` independent of each other,
* our prior graph `aff.mat` had little weight.

Let's do a scenario where we need to shrink some coefficients, i.e. $n \ll p$.
We choose a small $p$, such that the computation does not take too long.

```{r}
    p <- 25
    X <- matrix(rnorm(10*p), 10)
    Y <- matrix(rnorm(10*p), 10)
    aff.mat <- matrix(rgamma(p * p, 5, 1), p)
    aff.mat <- (t(aff.mat) + aff.mat)
    diag(aff.mat) <- 0
    cv <- cv.edgenet(X=X, Y=Y, G.X=aff.mat, family="gaussian", maxit=1000)
    print(cv)
```

In the above example $\lambda$ should have changed quite a bit, while
$\psi_{gy}$ should still be $0$. Since we generated `aff.mat` randomly 
$\psi_{gx}$ should be roughly (or exact) zero as well. This makes sense 
intuitively since we did not put any biological relationships into the 
affinity matrices.

# References