---
title: "Building PPIs from StringDB"
author:
- name: Jonathan Ronen
  affiliation: &id Berlin Institute for Medical Systems Biology, Max Delbrück Center
- name: Altuna Akalin
  affiliation: *id

date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    toc_float: true
vignette: >
  %\VignetteIndexEntry{Generation of PPI graph}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Introduction
This vignette demonstrates how a Protein-Protein interaction (PPI) graph may be
constructed from the database [stringDB][stringdb].

# Obtaining network data from stringDB
Other networks can also be used with _netSmooth_. We mostly rely on networks
from stringDB. StringDB has multiple species available such as human, mouse,
zebrafish, _C.elengas_ and _D.melanogaster_. It is also possible to prune the
network differently. For our purposes we use the edges that have highest
confidence score. Below, we are showing how to obtain and prune human network
from stringDB. Specifically, we use the work flow below.

1. Get human network/graph from STRINGdb.
2. Prune the network to get only high-confidence edges
3. Create adjacency matrix
4. Map protein ids in the network to Ensembl Gene ids in the adjacency matrix

```{r , echo=TRUE,eval=FALSE}
require(STRINGdb)
require(igraph)
require(biomaRt)

# 1. getSTRINGdb for human
string_db <- STRINGdb$new(species=9606)
human_graph <- string_db$get_graph()

# 2. get edges with high confidence score
edge.scores <- E(human_graph)$combined_score
ninetyth.percentile <- quantile(edge.scores, 0.9)
thresh <- data.frame(name='90th percentile',
                     val=ninetyth.percentile)
human_graph <- subgraph.edges(human_graph,
                              E(human_graph)[combined_score > ninetyth.percentile])

# 3. create adjacency matrix
adj_matrix <- as_adjacency_matrix(human_graph)


# 4. map gene ids to protein ids

### get gene/protein ids via Biomart
mart=useMart(host = 'grch37.ensembl.org',
             biomart='ENSEMBL_MART_ENSEMBL',
             dataset='hsapiens_gene_ensembl')

### extract protein ids from the human network
protein_ids <- sapply(strsplit(rownames(adj_matrix), '\\.'),
                      function(x) x[2])

### get protein to gene id mappings
mart_results <- getBM(attributes = c("ensembl_gene_id",
                                     "ensembl_peptide_id"),
                      filters = "ensembl_peptide_id", values = protein_ids,
                      mart = mart)

### replace protein ids with gene ids
ix <- match(protein_ids, mart_results$ensembl_peptide_id)
ix <- ix[!is.na(ix)]

newnames <- protein_ids
newnames[match(mart_results[ix,'ensembl_peptide_id'], newnames)] <-
    mart_results[ix, 'ensembl_gene_id']
rownames(adj_matrix) <- newnames
colnames(adj_matrix) <- newnames

ppi <- adj_matrix[!duplicated(newnames), !duplicated(newnames)]
nullrows <- Matrix::rowSums(ppi)==0
ppi <- ppi[!nullrows,!nullrows] ## ppi is the network with gene ids
```


-------


```{r}
sessionInfo()
```

[stringdb]: https://string-db.org "string-db.org"