---
title: "Basic checks of BioPlex PPI data"
output:
  BiocStyle::html_document:
    self_contained: yes 
    toc: true
    toc_float: true
    toc_depth: 2
    code_folding: show
date: "`r doc_date()`"
package: "`r pkg_ver('BioPlex')`"
vignette: >
  % \VignetteIndexEntry{2. Data checks}
  % \VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    crop = NULL ## Related to https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html
)

cache.dir <- tools::R_user_dir("BioPlex", which = "cache")
bfc <- BiocFileCache::BiocFileCache(cache.dir)
for(b in BiocFileCache::bfcinfo(bfc)$rid) BiocFileCache::bfcremove(bfc, b)
```

# Setup

```{r, message = FALSE}
library(BioPlex)
library(AnnotationDbi)
library(AnnotationHub)
library(graph)
```

Connect to 
[AnnotationHub](http://bioconductor.org/packages/AnnotationHub):

```{r ahub, message = FALSE}
ah <- AnnotationHub::AnnotationHub()
```

Connect to 
[ExperimentHub](http://bioconductor.org/packages/ExperimentHub):

```{r ehub, message = FALSE}
eh <- ExperimentHub::ExperimentHub()
```

OrgDb package for human:
```{r orgdb, message = FALSE}
orgdb <- AnnotationHub::query(ah, c("orgDb", "Homo sapiens"))
orgdb <- orgdb[[1]]
orgdb
keytypes(orgdb)
```

# Check: identify CORUM complexes that have a subunit of interest

Get core set of complexes:
```{r corumCore}
core <- getCorum(set = "core", organism = "Human")
```

Turn the CORUM complexes into a list of graph instances,
where all nodes of a complex are connected to all other nodes of that complex
with undirected edges.

```{r corum2glist}
core.glist <- corum2graphlist(core, subunit.id.type = "UNIPROT")
```

Identify complexes that have a subunit of interest:
```{r corum-subunit}
has.cdk2 <- hasSubunit(core.glist, 
                       subunit = "CDK2",
                       id.type = "SYMBOL")
```

Check the answer:
```{r corum-subunit2}
table(has.cdk2)
cdk2.glist <- core.glist[has.cdk2]
lapply(cdk2.glist, function(g) unlist(graph::nodeData(g, attr = "SYMBOL")))
```

We can then also inspect the graph with plotting utilities from the 
[Rgraphviz](https://bioconductor.org/packages/Rgraphviz)
package:

```{r corum-subunit3, message = FALSE, eval = FALSE}
plot(cdk2.glist[[1]], main = names(cdk2.glist)[1])
```

# Check: extract BioPlex PPIs for a CORUM complex

Get the latest version of the 293T PPI network:

```{r bioplex293T}
bp.293t <- getBioPlex(cell.line = "293T", version = "3.0")
```

Turn the BioPlex PPI network into one big graph where bait and prey relationship
are represented by directed edges from bait to prey.

```{r bpgraph}
bp.gr <- bioplex2graph(bp.293t)
```

Now we can also easily pull out a BioPlex subnetwork for a CORUM complex
of interest:

```{r}
n <- graph::nodes(cdk2.glist[[1]])
bp.sgr <- graph::subGraph(n, bp.gr)
bp.sgr
```

# Check: identify interacting domains for a PFAM domain of interest

Add PFAM domain annotations to the node metadata:

```{r}
bp.gr <- BioPlex::annotatePFAM(bp.gr, orgdb)
```

Create a map from PFAM to UNIPROT:

```{r}
unip2pfam <- graph::nodeData(bp.gr, graph::nodes(bp.gr), "PFAM")
pfam2unip <- stack(unip2pfam)
pfam2unip <- split(as.character(pfam2unip$ind), pfam2unip$values)
head(pfam2unip, 2)
```

Let's focus on [PF02023](http://pfam.xfam.org/family/PF02023), corresponding to the
zinc finger-associated SCAN domain. For each protein containing the SCAN domain,
we now extract PFAM domains connected to the SCAN domain by an edge in the BioPlex network.

```{r}
scan.unip <- pfam2unip[["PF02023"]]
getIAPfams <- function(n) graph::nodeData(bp.gr, graph::edges(bp.gr)[[n]], "PFAM")
unip2iapfams <- lapply(scan.unip, getIAPfams)
unip2iapfams <- lapply(unip2iapfams, unlist)
names(unip2iapfams) <- scan.unip
```

Looking at the top 5 PFAM domains most frequently connected to the SCAN domain 
by an edge in the BioPlex network ...

```{r}
pfam2iapfams <- unlist(unip2iapfams)
sort(table(pfam2iapfams), decreasing = TRUE)[1:5]
```

... we find [PF02023](http://pfam.xfam.org/family/PF02023), the SCAN domain itself,
and [PF00096](http://pfam.xfam.org/family/PF00096), a C2H2 type zinc finger domain.
This finding is consistent with results reported in the
[BioPlex 3.0 publication](https://doi.org/10.1016/j.cell.2021.04.011).


See also the 
[PFAM domain-domain association analysis vignette](https://ccb-hms.github.io/BioPlexAnalysis/articles/PFAM.html)
for a more comprehensive analysis of PFAM domain associations in the BioPlex network.

# Check: expressed genes are showing up as prey (293T cells)

Get RNA-seq data for HEK293 cells from GEO: 
[GSE122425](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE122425)

```{r gse122425}
se <- getGSE122425()
se
```

Inspect expression of prey genes:

```{r prey-expression}
bait <- unique(bp.293t$SymbolA)
length(bait)
prey <- unique(bp.293t$SymbolB)
length(prey)
```

```{r}
ind <- match(prey, rowData(se)$SYMBOL)
par(las = 2)
boxplot(log2(assay(se, "rpkm") + 0.5)[ind,], 
        names = se$title, 
        ylab = "log2 RPKM")
```

How many prey genes are expressed (raw read count > 0) in all 3 WT reps:

```{r prey-expression2}
# background: how many genes in total are expressed in all three WT reps
gr0 <- rowSums(assay(se)[,1:3] > 0)
table(gr0 == 3)
# prey: expressed in all three WT reps
table(gr0[ind] == 3)
# prey: expressed in at least one WT rep
table(gr0[ind] > 0)
```

Are prey genes overrepresented in the expressed genes?
```{r prey-expression-ora}
exprTable <-
     matrix(c(9346, 1076, 14717, 32766),
            nrow = 2,
            dimnames = list(c("Expressed", "Not.expressed"),
                            c("In.prey.set", "Not.in.prey.set")))
exprTable
```

Test using hypergeometric test (i.e. one-sided Fisher's exact test):
```{r prey-expression-ora2}
fisher.test(exprTable, alternative = "greater")
```

Alternatively: permutation test, i.e. repeatedly sample number of prey genes 
from the background, and assess how often we have as many or more than 9346 genes
expressed:

```{r prey-expression-293T-perm}
permgr0 <- function(gr0, nr.genes = length(prey)) 
{
    ind <- sample(seq_along(gr0), nr.genes)
    sum(gr0[ind] == 3)
}
```

```{r prey-expression-perm2}
perms <- replicate(permgr0(gr0), 1000)
summary(perms)
(sum(perms >= 9346) + 1) / 1001
```

# Check: is there a relationship between prey frequency and prey expression level?

Check which genes turn up most frequently as prey:

```{r prey-freq}
prey.freq <- sort(table(bp.293t$SymbolB), decreasing = TRUE)
preys <- names(prey.freq)
prey.freq <- as.vector(prey.freq)
names(prey.freq) <- preys
head(prey.freq)
summary(prey.freq)
hist(prey.freq, breaks = 50, main = "", xlab = "Number of PPIs", ylab = "Number of genes")
```

Prey genes are involved `r round(mean(as.vector(prey.freq)))` PPIs on average.

There doesn't seem to be a strong correlation between expression level and the 
frequency of gene to turn up as prey: 

```{r}
ind <- match(names(prey.freq), rowData(se)$SYMBOL)
rmeans <- rowMeans(assay(se, "rpkm")[ind, 1:3])
log.rmeans <- log2(rmeans + 0.5)
par(pch = 20)
plot( x = prey.freq,
      y = log.rmeans,
      xlab = "prey frequency",
      ylab = "log2 RPKM")
cor(prey.freq, 
    log.rmeans,
    use = "pairwise.complete.obs")
```

See also the [BioNet maximum scoring subnetwork analysis vignette](https://ccb-hms.github.io/BioPlexAnalysis/articles/BioNet.html) 
for a more comprehensive analysis of the 293T transcriptome data from GSE122425
when mapped onto BioPlex PPI network.

# Check: differential protein expression (HEK293 vs. HCT116)  

Get the relative protein expression data comparing 293T and HCT116 cells
from Supplementary Table S4A of the BioPlex 3 paper:

```{r bp.prot}
bp.prot <- getBioplexProteome()
bp.prot
rowData(bp.prot)
```

A couple of quick sanity checks:

1. The relative abundances are scaled to sum up to 100% for each protein:

```{r}
rowSums(assay(bp.prot)[1:5,]) 
```

2. The `rowData` column `log2ratio` corresponds to the mean of the five HEK samples, 
divided by the mean of the five HCT samples (and then taking log2 of it):

```{r}
ratio <- rowMeans(assay(bp.prot)[1:5, 1:5]) / rowMeans(assay(bp.prot)[1:5, 6:10])
log2(ratio)
```

3. The `rowData` column `adj.pvalue` stores Benjamini-Hochberg adjusted *p*-values
from a *t*-test between the five HEK samples and the five HCT samples:

```{r}
t.test(assay(bp.prot)[1, 1:5], assay(bp.prot)[1, 6:10])
```

The [Transcriptome-Proteome analysis vignette](https://ccb-hms.github.io/BioPlexAnalysis/articles/TranscriptomeProteome.html) also explores the agreement between differential gene expression and 
differential protein expression when comparing HEK293 against HCT116 cells.

# SessionInfo

```{r sessionInfo}
sessionInfo()
```