--- title: "Basic checks of BioPlex PPI data" output: BiocStyle::html_document: self_contained: yes toc: true toc_float: true toc_depth: 2 code_folding: show date: "`r doc_date()`" package: "`r pkg_ver('BioPlex')`" vignette: > % \VignetteIndexEntry{2. Data checks} % \VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", crop = NULL ## Related to https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html ) cache.dir <- tools::R_user_dir("BioPlex", which = "cache") bfc <- BiocFileCache::BiocFileCache(cache.dir) for(b in BiocFileCache::bfcinfo(bfc)$rid) BiocFileCache::bfcremove(bfc, b) ``` # Setup ```{r, message = FALSE} library(BioPlex) library(AnnotationDbi) library(AnnotationHub) library(graph) ``` Connect to [AnnotationHub](http://bioconductor.org/packages/AnnotationHub): ```{r ahub, message = FALSE} ah <- AnnotationHub::AnnotationHub() ``` Connect to [ExperimentHub](http://bioconductor.org/packages/ExperimentHub): ```{r ehub, message = FALSE} eh <- ExperimentHub::ExperimentHub() ``` OrgDb package for human: ```{r orgdb, message = FALSE} orgdb <- AnnotationHub::query(ah, c("orgDb", "Homo sapiens")) orgdb <- orgdb[[1]] orgdb keytypes(orgdb) ``` # Check: identify CORUM complexes that have a subunit of interest Get core set of complexes: ```{r corumCore} core <- getCorum(set = "core", organism = "Human") ``` Turn the CORUM complexes into a list of graph instances, where all nodes of a complex are connected to all other nodes of that complex with undirected edges. ```{r corum2glist} core.glist <- corum2graphlist(core, subunit.id.type = "UNIPROT") ``` Identify complexes that have a subunit of interest: ```{r corum-subunit} has.cdk2 <- hasSubunit(core.glist, subunit = "CDK2", id.type = "SYMBOL") ``` Check the answer: ```{r corum-subunit2} table(has.cdk2) cdk2.glist <- core.glist[has.cdk2] lapply(cdk2.glist, function(g) unlist(graph::nodeData(g, attr = "SYMBOL"))) ``` We can then also inspect the graph with plotting utilities from the [Rgraphviz](https://bioconductor.org/packages/Rgraphviz) package: ```{r corum-subunit3, message = FALSE, eval = FALSE} plot(cdk2.glist[[1]], main = names(cdk2.glist)[1]) ``` # Check: extract BioPlex PPIs for a CORUM complex Get the latest version of the 293T PPI network: ```{r bioplex293T} bp.293t <- getBioPlex(cell.line = "293T", version = "3.0") ``` Turn the BioPlex PPI network into one big graph where bait and prey relationship are represented by directed edges from bait to prey. ```{r bpgraph} bp.gr <- bioplex2graph(bp.293t) ``` Now we can also easily pull out a BioPlex subnetwork for a CORUM complex of interest: ```{r} n <- graph::nodes(cdk2.glist[[1]]) bp.sgr <- graph::subGraph(n, bp.gr) bp.sgr ``` # Check: identify interacting domains for a PFAM domain of interest Add PFAM domain annotations to the node metadata: ```{r} bp.gr <- BioPlex::annotatePFAM(bp.gr, orgdb) ``` Create a map from PFAM to UNIPROT: ```{r} unip2pfam <- graph::nodeData(bp.gr, graph::nodes(bp.gr), "PFAM") pfam2unip <- stack(unip2pfam) pfam2unip <- split(as.character(pfam2unip$ind), pfam2unip$values) head(pfam2unip, 2) ``` Let's focus on [PF02023](http://pfam.xfam.org/family/PF02023), corresponding to the zinc finger-associated SCAN domain. For each protein containing the SCAN domain, we now extract PFAM domains connected to the SCAN domain by an edge in the BioPlex network. ```{r} scan.unip <- pfam2unip[["PF02023"]] getIAPfams <- function(n) graph::nodeData(bp.gr, graph::edges(bp.gr)[[n]], "PFAM") unip2iapfams <- lapply(scan.unip, getIAPfams) unip2iapfams <- lapply(unip2iapfams, unlist) names(unip2iapfams) <- scan.unip ``` Looking at the top 5 PFAM domains most frequently connected to the SCAN domain by an edge in the BioPlex network ... ```{r} pfam2iapfams <- unlist(unip2iapfams) sort(table(pfam2iapfams), decreasing = TRUE)[1:5] ``` ... we find [PF02023](http://pfam.xfam.org/family/PF02023), the SCAN domain itself, and [PF00096](http://pfam.xfam.org/family/PF00096), a C2H2 type zinc finger domain. This finding is consistent with results reported in the [BioPlex 3.0 publication](https://doi.org/10.1016/j.cell.2021.04.011). See also the [PFAM domain-domain association analysis vignette](https://ccb-hms.github.io/BioPlexAnalysis/articles/PFAM.html) for a more comprehensive analysis of PFAM domain associations in the BioPlex network. # Check: expressed genes are showing up as prey (293T cells) Get RNA-seq data for HEK293 cells from GEO: [GSE122425](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE122425) ```{r gse122425} se <- getGSE122425() se ``` Inspect expression of prey genes: ```{r prey-expression} bait <- unique(bp.293t$SymbolA) length(bait) prey <- unique(bp.293t$SymbolB) length(prey) ``` ```{r} ind <- match(prey, rowData(se)$SYMBOL) par(las = 2) boxplot(log2(assay(se, "rpkm") + 0.5)[ind,], names = se$title, ylab = "log2 RPKM") ``` How many prey genes are expressed (raw read count > 0) in all 3 WT reps: ```{r prey-expression2} # background: how many genes in total are expressed in all three WT reps gr0 <- rowSums(assay(se)[,1:3] > 0) table(gr0 == 3) # prey: expressed in all three WT reps table(gr0[ind] == 3) # prey: expressed in at least one WT rep table(gr0[ind] > 0) ``` Are prey genes overrepresented in the expressed genes? ```{r prey-expression-ora} exprTable <- matrix(c(9346, 1076, 14717, 32766), nrow = 2, dimnames = list(c("Expressed", "Not.expressed"), c("In.prey.set", "Not.in.prey.set"))) exprTable ``` Test using hypergeometric test (i.e. one-sided Fisher's exact test): ```{r prey-expression-ora2} fisher.test(exprTable, alternative = "greater") ``` Alternatively: permutation test, i.e. repeatedly sample number of prey genes from the background, and assess how often we have as many or more than 9346 genes expressed: ```{r prey-expression-293T-perm} permgr0 <- function(gr0, nr.genes = length(prey)) { ind <- sample(seq_along(gr0), nr.genes) sum(gr0[ind] == 3) } ``` ```{r prey-expression-perm2} perms <- replicate(permgr0(gr0), 1000) summary(perms) (sum(perms >= 9346) + 1) / 1001 ``` # Check: is there a relationship between prey frequency and prey expression level? Check which genes turn up most frequently as prey: ```{r prey-freq} prey.freq <- sort(table(bp.293t$SymbolB), decreasing = TRUE) preys <- names(prey.freq) prey.freq <- as.vector(prey.freq) names(prey.freq) <- preys head(prey.freq) summary(prey.freq) hist(prey.freq, breaks = 50, main = "", xlab = "Number of PPIs", ylab = "Number of genes") ``` Prey genes are involved `r round(mean(as.vector(prey.freq)))` PPIs on average. There doesn't seem to be a strong correlation between expression level and the frequency of gene to turn up as prey: ```{r} ind <- match(names(prey.freq), rowData(se)$SYMBOL) rmeans <- rowMeans(assay(se, "rpkm")[ind, 1:3]) log.rmeans <- log2(rmeans + 0.5) par(pch = 20) plot( x = prey.freq, y = log.rmeans, xlab = "prey frequency", ylab = "log2 RPKM") cor(prey.freq, log.rmeans, use = "pairwise.complete.obs") ``` See also the [BioNet maximum scoring subnetwork analysis vignette](https://ccb-hms.github.io/BioPlexAnalysis/articles/BioNet.html) for a more comprehensive analysis of the 293T transcriptome data from GSE122425 when mapped onto BioPlex PPI network. # Check: differential protein expression (HEK293 vs. HCT116) Get the relative protein expression data comparing 293T and HCT116 cells from Supplementary Table S4A of the BioPlex 3 paper: ```{r bp.prot} bp.prot <- getBioplexProteome() bp.prot rowData(bp.prot) ``` A couple of quick sanity checks: 1. The relative abundances are scaled to sum up to 100% for each protein: ```{r} rowSums(assay(bp.prot)[1:5,]) ``` 2. The `rowData` column `log2ratio` corresponds to the mean of the five HEK samples, divided by the mean of the five HCT samples (and then taking log2 of it): ```{r} ratio <- rowMeans(assay(bp.prot)[1:5, 1:5]) / rowMeans(assay(bp.prot)[1:5, 6:10]) log2(ratio) ``` 3. The `rowData` column `adj.pvalue` stores Benjamini-Hochberg adjusted *p*-values from a *t*-test between the five HEK samples and the five HCT samples: ```{r} t.test(assay(bp.prot)[1, 1:5], assay(bp.prot)[1, 6:10]) ``` The [Transcriptome-Proteome analysis vignette](https://ccb-hms.github.io/BioPlexAnalysis/articles/TranscriptomeProteome.html) also explores the agreement between differential gene expression and differential protein expression when comparing HEK293 against HCT116 cells. # SessionInfo ```{r sessionInfo} sessionInfo() ```