Topology-based pathway analysis of RNA-seq data

Ivana Ihnatova1* and Ludwig Geistlinger2**

1Institute of Biostatistics and Analyses, Masarykova University Brno
2School of Public Health, City University of New York

*ihnatova@iba.muni.cz
**ludwig.geistlinger@sph.cuny.edu

29 October 2019

Abstract

The ToPASeq package implements methods for topology-based pathway analysis of RNA-seq data. This includes Topological Analysis of Pathway Phenotype Association (TAPPA; Gao and Wang, 2007), PathWay Enrichment Analysis (PWEA; Hung et al., 2010), and the Pathway Regulation Score (PRS; Ibrahim et al., 2012).

Package

ToPASeq 1.20.0

1 Setup
2 Preparing the data
3 Differential expression
4 Pathway analysis
- 4.1 Pathway Regulation Score (PRS)

1 Setup

Note: the ToPASeq package currently undergoes a major rework due to the change of the package maintainer. It is recommended to use the topology-based methods implemented in the EnrichmentBrowser or the graphite package instead.

We start by loading the package.

library(ToPASeq)

2 Preparing the data

For RNA-seq data, we consider transcriptome profiles of four primary human airway smooth muscle cell lines in two conditions: control and treatment with dexamethasone Himes et al., 2014.

We load the airway dataset

library(airway)
data(airway)

For further analysis, we only keep genes that are annotated to an ENSEMBL gene ID.

airSE <- airway[grep("^ENSG", rownames(airway)),]
dim(airSE)

## [1] 63677     8

assay(airSE)[1:4,1:4]

##                 SRR1039508 SRR1039509 SRR1039512 SRR1039513
## ENSG00000000003        679        448        873        408
## ENSG00000000005          0          0          0          0
## ENSG00000000419        467        515        621        365
## ENSG00000000457        260        211        263        164

3 Differential expression

The EnrichmentBrowser package incorporates established functionality from the limma package for differential expression analysis. This involves the voom transformation when applied to RNA-seq data. Alternatively, differential expression analysis for RNA-seq data can also be carried out based on the negative binomial distribution with edgeR and DESeq2.

This can be performed using the function EnrichmentBrowser::deAna and assumes some standardized variable names:

GROUP defines the sample groups being contrasted,
BLOCK defines paired samples or sample blocks, as e.g. for batch effects.

For more information on experimental design, see the limma user’s guide, chapter 9.

For the airway dataset, the GROUP variable indicates whether the cell lines have been treated with dexamethasone (1) or not (0).

airSE$GROUP <- ifelse(airway$dex == "trt", 1, 0)
table(airSE$GROUP)

## 
## 0 1 
## 4 4

Paired samples, or in general sample batches/blocks, can be defined via a BLOCK column in the colData slot. For the airway dataset, the sample blocks correspond to the four different cell lines.

airSE$BLOCK <- airway$cell
table(airSE$BLOCK)

## 
## N052611 N061011 N080611  N61311 
##       2       2       2       2

For RNA-seq data, the deAna function can be used to carry out differential expression analysis between the two groups either based on functionality from limma (that includes the voom transformation), or alternatively, the frequently used edgeR or DESeq2 package. Here, we use the analysis based on edgeR.

library(EnrichmentBrowser)
airSE <- deAna(airSE, de.method="edgeR")

## Excluding 49453 genes not satisfying min.cpm threshold

rowData(airSE, use.names=TRUE)

## DataFrame with 14224 rows and 4 columns
##                                 FC          edgeR.STAT                 PVAL
##                          <numeric>           <numeric>            <numeric>
## ENSG00000000003 -0.400831866750167    34.0274066005961 0.000216432282687383
## ENSG00000000419  0.187085102783581    6.04980215402107   0.0352937489311501
## ENSG00000000457 0.0184725952974607  0.0381476546950004    0.849333808212269
## ENSG00000000460 -0.136734267771027   0.475212812615912    0.507377185432325
## ENSG00000000971  0.406378441992468    27.8716659010763  0.00045037126529647
## ...                            ...                 ...                  ...
## ENSG00000273329  -0.21921465034743    1.43041469563266    0.261202451389649
## ENSG00000273344 0.0193231195014678 0.00903927021496614    0.926290333382998
## ENSG00000273373 -0.055473520645829  0.0729809190374344    0.792925024187221
## ENSG00000273382 -0.887239966221599    9.01498059318473   0.0143161906210786
## ENSG00000273486 -0.113268125908014   0.178437330942989    0.682286869640572
##                            ADJ.PVAL
##                           <numeric>
## ENSG00000000003 0.00210677511808889
## ENSG00000000419  0.0887742325016232
## ENSG00000000457   0.902504414164897
## ENSG00000000460   0.633532642632778
## ENSG00000000971  0.0035808165889195
## ...                             ...
## ENSG00000273329   0.394824061911495
## ENSG00000273344   0.952403766231008
## ENSG00000273373   0.862143826940761
## ENSG00000273382  0.0452317848498938
## ENSG00000273486   0.778386596888115

4 Pathway analysis

Pathways are typically represented as graphs, where the nodes are genes and edges between the nodes represent interaction between genes.

The graphite package provides pathway collections from major pathway databases such as KEGG, Biocarta, Reactome, and NCI.

Here, we retrieve human KEGG pathways.

library(graphite)
pwys <- pathways(species="hsapiens", database="kegg")
pwys

## KEGG pathways for hsapiens
## 311 entries, retrieved on 23-10-2019

As the airway dataset uses ENSEMBL gene IDs, but the nodes of the pathways are based on NCBI Entrez Gene IDs,

nodes(pwys[[1]])

##  [1] "ENTREZID:10327"  "ENTREZID:124"    "ENTREZID:125"    "ENTREZID:126"   
##  [5] "ENTREZID:127"    "ENTREZID:128"    "ENTREZID:130"    "ENTREZID:130589"
##  [9] "ENTREZID:131"    "ENTREZID:160287" "ENTREZID:1737"   "ENTREZID:1738"  
## [13] "ENTREZID:2023"   "ENTREZID:2026"   "ENTREZID:2027"   "ENTREZID:217"   
## [17] "ENTREZID:218"    "ENTREZID:219"    "ENTREZID:220"    "ENTREZID:2203"  
## [21] "ENTREZID:221"    "ENTREZID:222"    "ENTREZID:223"    "ENTREZID:224"   
## [25] "ENTREZID:226"    "ENTREZID:229"    "ENTREZID:230"    "ENTREZID:2538"  
## [29] "ENTREZID:2597"   "ENTREZID:26330"  "ENTREZID:2645"   "ENTREZID:2821"  
## [33] "ENTREZID:3098"   "ENTREZID:3099"   "ENTREZID:3101"   "ENTREZID:387712"
## [37] "ENTREZID:3939"   "ENTREZID:3945"   "ENTREZID:3948"   "ENTREZID:441531"
## [41] "ENTREZID:501"    "ENTREZID:5105"   "ENTREZID:5106"   "ENTREZID:5160"  
## [45] "ENTREZID:5161"   "ENTREZID:5162"   "ENTREZID:5211"   "ENTREZID:5213"  
## [49] "ENTREZID:5214"   "ENTREZID:5223"   "ENTREZID:5224"   "ENTREZID:5230"  
## [53] "ENTREZID:5232"   "ENTREZID:5236"   "ENTREZID:5313"   "ENTREZID:5315"  
## [57] "ENTREZID:55276"  "ENTREZID:55902"  "ENTREZID:57818"  "ENTREZID:669"   
## [61] "ENTREZID:7167"   "ENTREZID:80201"  "ENTREZID:83440"  "ENTREZID:84532" 
## [65] "ENTREZID:8789"   "ENTREZID:92483"  "ENTREZID:92579"  "ENTREZID:9562"

we first map the gene IDs in the airway dataset from ENSEMBL to ENTREZ IDs.

airSE <- idMap(airSE, org="hsa", from="ENSEMBL", to="ENTREZID")

##

## Encountered 83 from.IDs with >1 corresponding to.ID

## (the first to.ID was chosen for each of them)

## Excluded 1697 from.IDs without a corresponding to.ID

## Encountered 6 to.IDs with >1 from.ID (the first from.ID was chosen for each of them)

Next, we define all genes with adjusted p-value below 0.01 as differentially expressed, and collect their log2 fold change for further analysis.

all <- names(airSE)
de.ind <- rowData(airSE)$ADJ.PVAL < 0.01
de <- rowData(airSE)$FC[de.ind]
names(de) <- all[de.ind]

This results in 2,426 DE genes - out of 11,780 genes in total.

length(all)

## [1] 12521

length(de)

## [1] 2483

4.1 Pathway Regulation Score (PRS)

The Pathway Regulation Score (PRS) incorporates the pathway topology by weighting the indiviudal gene-level log2 fold changes by the number of downstream DE genes. The weighted absolute fold changes are summed across the pathway and statistical significance is assessed by permutation of genes. Ibrahim et al., 2012

res <- prs(de, all, pwys[1:100], nperm=100)

## 3302 node labels mapped to the expression data
## Average coverage 45.19421 %
## 0 (out of 100) pathways without a mapped node

## 9 pwys were filtered out

head(res)

##                                              nPRS    p.value
## Histidine metabolism                     9.493928 0.00990099
## Cysteine and methionine metabolism       9.025157 0.00990099
## Tyrosine metabolism                      8.034960 0.00990099
## Endocrine resistance                     7.155402 0.00990099
## Glycine, serine and threonine metabolism 6.291857 0.00990099
## Drug metabolism - cytochrome P450        5.903873 0.00990099

Corresponding gene weights (number of downstream DE genes) can be obtained for a pathway of choice via

ind <- grep("ErbB signaling pathway", names(pwys))
weights <- prsWeights(pwys[[ind]], de, all)

## 66 node labels mapped to the expression data
## Average coverage 75 %
## 0 (out of 1) pathways without a mapped node

weights

##    572   5601   5291   6416   8440   5609   5063    369   2932     25   1399 
##      0      0      0      0      0      0      1      0      0      0      0 
##   5594   6655   5595    208   5599   6198   5602   2549    867   1027    868 
##      0      0      0      0      0      0      3      5      0      0      0 
##   6654  10000   8503   5290   5335   1026   6776   2002   5605  25759  10298 
##      0      0      1      1      0      1      1      1      0      0      0 
##   5894   3845   4609   2064    207     27    817   5295   1956  53358    818 
##      2      0      1      0      0      1      0      1      0      1      0 
##   5058   5578    673   4690   6464   1398   5604   5747 145957   5293   6777 
##      0      0      0      0      0      0      1      0      0      1      1 
##   3265    685   6199   3725   2885   5062 399694   1978   6714   2475   4893 
##      0      1      0      1      0      0      1      0      0      1      0

Inspecting the genes with maximum number of downstream DE genes

weights[weights == max(weights)]

## 2549 
##    5

reveals important upstream regulators including several G protein subunits such as subunit beta 2 (Entrez Gene ID 2783).