---
title: "KBoost"
author: "Luis F. Iglesias-Martinez, Barbara De Kegel and Walter Kolch"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{KBoost}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

## Introduction

KBoost is a gene regulatory network inference algorithm. It builds several models unsing kernel principal components regression and boosting and from them estimates the probability that each transcription factor regulates a gene.
KBoost has one main function: `kboost` and a case specific function: `kboost_human_symbol`.

## Quickstart

The function `kboost` infers gene regulatory networks from gene expression data. The gene expression data (`D4_multi_1` in the examples below) needs to be a numerical matrix where the columns are the genes and the rows are the observations, patients or experiments.

#### Without Prior knowledge:

```{r message=FALSE}
library(KBoost)

data(D4_multi_1)

grn = kboost(D4_multi_1)

grn$GRN[91:93,2:5]
```


#### With Prior knowledge:

KBoost has a Bayesian formulation which allows the user to include prior knowledge. In biology this is commonly the case, particularly with well studied organisms. If the user knows of transcription factors that are well known to regulate certain genes, they can include this in a matrix `prior_weights` of size GxK, where G is the number of genes and K the number of transcription factors.

Take for example the well-known TP53-MDM2 interaction. The user would need fist to build the matrix `prior_weights` and in the column that corresponds to TP53 and the row that corresponds to MDM2 type a number that represents the prior probability of this interaction. We recommend using values higher that 0.5 but lower than 1 to avod nummerical errors in these cases. On the other hand, for interactions where no prior knowledge is available, the user can simply set 0.5.


```{r message=FALSE}
library(KBoost)

data(D4_multi_1)

# Matrix of size 100x100 with all values set to 0.5
prior_weights = matrix(0.5,100,100)

# For this example assume we know from previous experiments that TF2 regulates the gene in row 91
prior_weights[91,2] = 0.8

grn = kboost(X=D4_multi_1, prior_weights=prior_weights)

# Note that the first entry now has a slightly higher probability than in the 
# previous example, as a result of adding the prior
grn$GRN[91:93,2:5]
```


#### With gene symbols

```{r message=FALSE}
library(KBoost)

# A random 10x5 numerical matrix
X = rnorm(50,0,1)
X = matrix(X,10,5)

# Gene names corresponding to the columns of X
gen_names = c("TP53","YY1","CTCF","MDM2","ESR1")

grn = KBoost_human_symbol(X,gen_names,pos_weight=0.6, neg_weight=0.4)

# TFs are taken from Lambert et al., 4 columns in the output network indicates 4 of the genes are TFs.
grn$GRN

# Look at the prior weights based on the Gerstein network.
# Output indicates the YY1-TP53 edge is present in the Gerstein network.
grn$prior_weights
```


## Main Functions


### KBoost(X, TFs, prior_weights, g, v, ite)

Function to infer gene regulatory network from gene expression data.

Input:

* `X`: an NxG matrix where N is the number of observations and G the number of genes.
* `TFs`: a vector of numerical indexes of the K genes in X that are TFs (default 1:G).
* `prior_weights`: a GxK matrix with the prior probabilities of each interaction (default is 0.5 for all values).
* `g`: a positive scalar that corresponds to the width parameter in the RBF Kernel (default 40).
* `v`: a positive scalar lower than 1 that is the shrinkage parameter for each boosting iteration (default 0.1).
* `ite`: an integer that represents the maximum number of iterations (default 3).

Output:  
List with the following fields:

* `GRN`:  A matrix with the gene regulatory network.
* `GRN_UP`: A matrix with the gene regulatory network before the heuristic step of multiplying each column by its variance.
* `prior`: The prior for the best model at each iteration.
* `model`: the transcription factors with the highest posteriors at each iteration per gene.
* `prior_weights`: a GxK matrix with the prior probabilities of each interaction.
* `g`: a positive scalar that corresponds to the width parameter in the RBF Kernel.
* `v`: a positive scalar lower than 1 that is the shrinkage parameter for each boosting iteration.
* `ite`: an integer that represents the maximum number of iterations.


### KBoost_human_symbol(X, gen_names, g, v, ite, pos_weight, neg_weight)

Function to infer gene regulatory network from human cell lines or patient samples. This function automatically builds a prior from *Gerstein et al. (2012)* and uses the list of TFs from *Lambert et al. (2018)*. The gene expression data needs to be a numerical matrix. 

Input:

* `X`: an NxG numeric matrix with the expression values of G genes and N obersvations. The gene names can be specified as column names.
* `gen_names`: a set of SYMBOL gene names that correspond to the names of the columns of X. Not required if column names of X are already gene names.
* `g`: a positive scalar with the width parameter for the RBF kernel. (default = 40).
* `v`: a number between 0 and 1 with the shrinkage parameter. (default = 0.1).
* `ite`: an integer with the number of iterations (default = 3).
* `pos_weight`: the prior weight for edges that were previously found in the *Gerstein et al.* network (default = 0.6).
* `neg_weight`: the prior weight for edges that were not found in the *Gerstein et al.* network (default = 0.5).

Output:   
List with the following fields:

* `GRN`:  A matrix with the gene regulatory network.
* `GRN_UP`: A matrix with the gene regulatory network before the heuristic step of multiplying each column by its variance.
* `prior`: The prior for the best model at each iteration.
* `model`: the transcription factors with the highest posteriors at each iteration per gene.
* `prior_weights`: a GxK matrix with the prior probabilities of each interaction.
* `g`: a positive scalar that corresponds to the width parameter in the RBF Kernel.
* `v`: a positive scalar smaller than 1 that is the shrinkage parameter for each boosting iteration.
* `ite`: an integer that represents the maximum number of iterations.


### AUPR_AUROC_matrix(Net, G_mat, auto_remove, TFs, upper_limit)
Function to calculate the AUROC and AUPR of a known network.

Input:

* `Net`: An inferred network with the predictive probabilities that each transcription factor regulates each gene.
* `G_mat`: A matrix with the gold standard network.
* `auto_remove`: TRUE if the auto-regulation is to be discarded.
* `TFs`: the indexes of the rows of Net that are TFs.
* `upper_limit`: Max number of edges to use (default = all possible edges).

Output:  
List with the following fields:

* `AUPR`: the area under the precision-recall (PR) curve.
* `AUROC`: the area under the receiver operator characteristic (ROC) curve.
* `th`: All the unique values of Net.
* `Prec`: The precision at each value of th.
* `Rec`: The recall at each value of th.
* `FPR`: The false positive rate at each value of th.
* `TP`: The true positives at each value of th.
* `FP`: The false positives at each value of th.
* `TN`: The true negatives at each value of th.
* `FN`: The false negatives at each value of th.


### d4_mfac(v, g, ite)

Function to produce the KBoost AUPR and AUROC results on the DREAM4 Multifactorial Challenge.

Input:

* `g`: a number larger than 0 that is the width parameter for the RBF Kernel
* `v`: a number between 0 and 1 that is the shrinkage parameter
* `ite`: an integer with number of iterations.

Output:

* `auprs`: a matrix with the AUPR per D4 multifactorial dataset.
* `aurocs`: a matrix with the AUROC per D4 multifactorial dataset.


### `get_prior_Gerstein(gen_names, TFs, pos_weight, neg_weight)`

Function to build a prior from a previously built Network on ChIP-Seq from *Gerstein et al. (2012)*.

Input:

* `gen_names`: the gene names  of the G genes in the user's subset in Symbol nomenclature.
* `TFs`: the indexes of the K genes in the user's subset which are TFs.
* `pos_weight`: the prior weight for edges that were previously found in the *Gerstein et al.* network
* `neg_weight`: the prior weight for edges that were not found in the *Gerstein et al.* network

Output:

* `prior_weights`: a GxK matrix with prior weights that a TF regulates a gene given the network published by *Gerstein et al.*


### grid_search_kboost(dataset, vs, gs, ite)

Function to perform a grid search and find the best hyperparameters.

Input:

* `dataset`: One of the three datasets in the package, 1 for IRMA, 2 for DREAM4 multifactorial and 3 for DREAM5.
* `vs`: The range of values of v. All values need to be between 0 and 1.
* `gs`: The range of values of g. All values need to be larger than 0.
* `ite`: An integer that is the number of iterations.

Output:  
List with the following fields:

* `aurocs`: a 3 dimensional marray with the AUROCs. Columns are the gs, the rows the datasets, vs, and the last dimension is the different datasets within a dataset.
* `auprs`: a 3 dimensional matrix with the AUPRs. Columns are the gs, the rows the datasets, vs, and the last dimension is the different datasets within a dataset.


### irma_check(g, v, ite)
Function to produce the AUPR and AUROC Results on the DREAM4 Multifactorial Challenge.

Input: 

* `g`: a number larger than 0 that is the width parameter for the RBF Kernel
* `v`: a number between 0 and 1 that is the shrinkage parameter
* `ite`: an integer with number of iterations.

Output: 

* `auprs`: a matrix with the AUPR per IRMA dataset.
* `aurocs`: a matrix with the AUROC per IRMA dataset.


### net_dist_bin(GRN,TFs,thr)

Function to calculate the shortest distance between nodes.

Input:

* `GRN`: An inferred networks with the predictive probabilities that a transcription factor regulates a gene.
* `TFs`: A vector with indexes of the rows of GRN which correspond to TFs.
* `thr`: A scalar between 0 and 1 that is used select the edges with large posterior probabilities.

Output:

* `dist_mat`: A matrix with the shortest distances between TFs (columns) and all genes (rows).

Example:
```{r message=FALSE}
library(KBoost)
data(D4_multi_1)
Net = kboost(D4_multi_1)
dist = net_dist_bin(Net$GRN,Net$TFs,0.1)
```


### net_summary_bin(GRN,TFs,thr,a,b)

Function to summarize the GRN filtered with a threshold.

Input:

* `GRN`: An inferred networks with the predictive probabilities that a transcription facor regulates a gene.
* `TFs`: A vector with indexes of the rows of GRN which correspond to TFs.
* `thr`: a scalar between 0 and 1, edges with posterior probabilities lower than thr will be discarded.
* `a`: a scalar for the Katz and PageRank centrality measures. Default the inverse of the largest eigenvalue of GRN.
* `b`: a scalar for the Katz and PageRank centrality measures. Default is 1.


Output:
List with the following fields:

* `GRN_table`: a sorted table version of the GRN.
* `Outdegree`: the outdegree of each TF.
* `Indegree`: the indegree of each gene.    
* `Close_centr`: A matrix with the closeness centrality measure per TF.

Example:

```{r message=FALSE}
library(KBoost)
data(D4_multi_1)
Net = kboost(D4_multi_1)
Net_Summary = net_summary_bin(Net$GRN)
```


### net_refine(Net)

Function to do a heuristic post-processing suggested by Slawek and Arodz that improves accuracy. Each column is multiplied by its variance.

Input:

* `Net`: a GRN with TFs in the columns.

Output:

* `Net`: a refined GRN.


### write_GRN_D4(GRN,TFs, filename)
Function to write output in DREAM4 Challenge Format.

Input:

* `GRN`: a GxK gene regulatory network.
* `TFs`: a K set of indixes of G that are TFs.
* `filename`: a string with the name of the file to store the GRN.


## Datasets

### DREAM 4 Multifactorial Perturbation Challenge Datasets

#### D4_multi_1, D4_multi_2, D4_multi_3, D4_multi_4 and D4_multi_5

The gene expression datasets from the DREAM4 multifactorial perturbation challenge.
https://www.synapse.org/#!Synapse:syn3049712/wiki/74628
Marbach D, Prill RJ, Schaffter T, Mattiussi C, Floreano D, and Stolovitzky G. Revealing strengths and weaknesses of methods for gene network inference. PNAS, 107(14):6286-6291, 2010. Pubmed

To use:
```{r message=FALSE}
data(D4_multi_1)
```


#### G_D4_multi_1, G_D4_multi_2, G_D4_multi_3, G_D4_multi_4 and G_D4_multi_5

The gold standard networks from gene expression datasets from the DREAM4 multifactorial perturbation challenge.
https://www.synapse.org/#!Synapse:syn3049712/wiki/74628
Marbach D, Prill RJ, Schaffter T, Mattiussi C, Floreano D, and Stolovitzky G. Revealing strengths and weaknesses of methods for gene network inference. PNAS, 107(14):6286-6291, 2010. Pubmed

To use:
```{r message=FALSE}
data(G_D4_multi_1)
```


### Gerstein_Prior_ENET_2

Gene Regulatory Network from the ChIPSeq Dataset Encode in Human Cell-Lines. A matrix with two columns
The fist column is a transcription factor and the second is a gene.

*Gerstein, M.B., et al. Architecture of the human regulatory network derived from ENCODE data. Nature 2012;489(7414):91-100.*

To use:
```{r message=FALSE}
data(Gerstein_Prior_ENET_2)
```


### Human_TFs

Set of Genes that are Transcription Factors in Symbol nomenclature.

*Lambert, S.A., et al. The Human Transcription Factors. Cell 2018;172(4):650-665.*

To use:
```{r message=FALSE}
data(Human_TFs)
```