GSVA 1.42.0
License: GPL (>= 2)
GSVA is an R package distributed as part of the Bioconductor project. To install the package, start R and enter:
install.packages("BiocManager")
BiocManager::install("GSVA")
Once GSVA is installed, it can be loaded with the following command.
library(GSVA)
Given a gene expression data matrix, which we shall call X
, with rows
corresponding to genes and columns to samples, such as this one simulated from
random Gaussian data:
p <- 10000 ## number of genes
n <- 30 ## number of samples
## simulate expression values from a standard Gaussian distribution
X <- matrix(rnorm(p*n), nrow=p,
dimnames=list(paste0("g", 1:p), paste0("s", 1:n)))
X[1:5, 1:5]
s1 s2 s3 s4 s5
g1 0.90559851 0.5829776 -0.2746646 -0.4537617 -0.5464220
g2 -0.01726376 0.2759739 0.2511800 0.1426546 -0.3434284
g3 -0.09981949 -0.4422418 -1.9256850 -0.8391432 -0.2683976
g4 0.99802918 1.2075807 -0.9672491 -1.2010650 1.8151629
g5 -0.08165344 -1.0245883 0.1329302 0.9030261 -0.2503627
Given a collection of gene sets stored, for instance, in a list
object, which
we shall call gs
, with genes sampled uniformly at random without replacement
into 100 different gene sets:
## sample gene set sizes
gs <- as.list(sample(10:100, size=100, replace=TRUE))
## sample gene sets
gs <- lapply(gs, function(n, p)
paste0("g", sample(1:p, size=n, replace=FALSE)), p)
names(gs) <- paste0("gs", 1:length(gs))
We can calculate GSVA enrichment scores as follows:
gsva.es <- gsva(X, gs, verbose=FALSE)
dim(gsva.es)
[1] 100 30
gsva.es[1:5, 1:5]
s1 s2 s3 s4 s5
gs1 0.05554084 0.0879225878 -0.003136382 0.08184313 0.142096062
gs2 -0.02454630 0.0986869655 0.083657296 -0.07588916 0.059950740
gs3 -0.08568490 0.0215598562 -0.326250712 0.26162707 0.004584937
gs4 -0.03292368 -0.0007826325 -0.047700508 -0.01512064 0.113713161
gs5 0.20741285 -0.0253447489 -0.237033763 -0.23344755 -0.143545309
The first argument to the gsva()
function is the gene expression data matrix
and the second the collection of gene sets. The gsva()
function can take the
input expression data and gene sets using different specialized containers that
facilitate the access and manipulation of molecular and phenotype data, as well
as their associated metadata. Another advanced features include the use of
on-disk and parallel backends to enable, respectively, using GSVA on large
molecular data sets and speed up computing time. You will find information on
these features in this vignette.
Gene set variation analysis (GSVA) provides an estimate of pathway activity by transforming an input gene-by-sample expression data matrix into a corresponding gene-set-by-sample expression data matrix. This resulting expression data matrix can be then used with classical analytical methods such as differential expression, classification, survival analysis, clustering or correlation analysis in a pathway-centric manner. One can also perform sample-wise comparisons between pathways and other molecular data types such as microRNA expression or binding data, copy-number variation (CNV) data or single nucleotide polymorphisms (SNPs).
The GSVA package provides an implementation of this approach for the following methods:
plage (Tomfohr, Lu, and Kepler 2005). Pathway level analysis of gene expression (PLAGE) standardizes expression profiles over the samples and then, for each gene set, it performs a singular value decomposition (SVD) over its genes. The coefficients of the first right-singular vector are returned as the estimates of pathway activity over the samples. Note that, because of how SVD is calculated, the sign of its singular vectors is arbitrary.
zscore (Lee et al. 2008). The z-score method standardizes expression profiles over the samples and then, for each gene set, combines the standardized values as follows. Given a gene set \(\gamma=\{1,\dots,k\}\) with standardized values \(z_1,\dots,z_k\) for each gene in a specific sample, the combined z-score \(Z_\gamma\) for the gene set \(\gamma\) is defined as: \[ Z_\gamma = \frac{\sum_{i=1}^k z_i}{\sqrt{k}}\,. \]
ssgsea (Barbie et al. 2009). Single sample GSEA (ssGSEA) is a
non-parametric method that calculates a gene set enrichment score per sample
as the normalized difference in empirical cumulative distribution functions
(CDFs) of gene expression ranks inside and outside the gene set. By default,
the implementation in the GSVA package follows the last step described in
(Barbie et al. 2009, online methods, pg. 2) by which pathway scores are
normalized, dividing them by the range of calculated values. This normalization
step may be switched off using the argument ssgsea.norm
in the call to the
gsva()
function; see below.
gsva (Hänzelmann, Castelo, and Guinney 2013). This is the default method of the package and similarly to ssGSEA, is a non-parametric method that uses the empirical CDFs of gene expression ranks inside and outside the gene set, but it starts by calculating an expression-level statistic that brings gene expression profiles with different dynamic ranges to a common scale.
The interested user may find full technical details about how these methods work in their corresponding articles cited above. If you use any of them in a publication, please cite them with the given bibliographic reference.
The workhorse of the GSVA package is the function gsva()
, which requires
the following two input arguments:
matrix
of expression values with genes corresponding to rows and samples
corresponding to columns.ExpressionSet
object; see package Biobase.SummarizedExperiment
object, see package
SummarizedExperiment.list
object where each element corresponds to a gene set defined by a
vector of gene identifiers, and the element names correspond to the names of
the gene sets.GeneSetCollection
object; see package GSEABase.One advantage of providing the input data using specialized containers such as
ExpressionSet
, SummarizedExperiment
and GeneSetCollection
is that the
gsva()
function will automatically map the gene identifiers between the
expression data and the gene sets (internally calling the function
mapIdentifiers()
from the package GSEABase), when they come
from different standard nomenclatures, i.e., Ensembl versus Entrez, provided
the input objects contain the appropriate metadata; see next section.
If either the input gene expression data is provided as a matrix
object or
the gene sets are provided in a list
object, or both, it is then the
responsibility of the user to ensure that both objects contain gene identifiers
following the same standard nomenclature.
Before the actual calculations take place, the gsva()
function will apply
the following filters:
Discard genes in the input expression data matrix with constant expression.
Discard genes in the input gene sets that do not map to a gene in the input gene expression data matrix.
Discard gene sets that, after applying the previous filters, do not meet a minimum and maximum size, which by default is one for the minimum size and has no limit for the maximum size.
If, as a result of applying these three filters, either no genes or gene sets
are left, the gsva()
function will prompt an error. A common cause for such
an error at this stage is that gene identifiers between the expression data
matrix and the gene sets do not belong to the same standard nomenclature and
could not be mapped. This may happen because either the input data were not
provided using some of the specialized containers described above or the
necessary metadata in those containers that allows the software to successfully
map gene identifiers, is missing.
By default, the gsva()
function employs the method described by
Hänzelmann, Castelo, and Guinney (2013) but this can be changed using the argument
method
, which can take values gsva
(default), zscore
, plage
or
ssgsea
, corresponding to the methods briefly described in the introduction.
When method="gsva"
(default), the user can additionally tune the following
parameters:
kcdf
: The first step of the GSVA algorithm brings gene expression
profiles to a common scale by calculating an expression statistic through
a non-parametric estimation of the CDF across samples. Such a non-parametric
estimation employs a kernel function and the kcdf
parameter allows the
user to specify three possible values for that function: (1) "Gaussian"
,
the default value, which is suitable for continuous expression data, such as
microarray fluorescent units in logarithmic scale and RNA-seq log-CPMs,
log-RPKMs or log-TPMs units of expression; (2) "Poisson"
, which is
suitable for integer counts, such as those derived from RNA-seq alignments; (3)
"none"
, which will enforce a direct estimation of the CDF without a kernel
function.
mx.diff
: The last step of the GSVA algorithm calculates the gene set
enrichment score from two Kolmogorov-Smirnov random walk statistics. This
parameter is a logical flag that allows the user to specify two possible ways
to do such calculation: (1) TRUE
, the default value, where the enrichment
score is calculated as the magnitude difference between the largest positive
and negative random walk deviations; (2) FALSE
, where the enrichment score
is calculated as the maximum distance of the random walk from zero.
abs.ranking
: Logical flag used only when mx.diff=TRUE
. By default,
abs.ranking=FALSE
and it implies that a modified Kuiper statistic is used
to calculate enrichment scores, taking the magnitude difference between the
largest positive and negative random walk deviations. When abs.ranking=TRUE
the original Kuiper statistic is used, by which the largest positive and
negative random walk deviations are added together. In this case, gene sets
with genes enriched on either extreme (high or low) will be regarded as
highly activated.
tau
: Exponent defining the weight of the tail in the random walk. By
default tau=1
. When method="ssgsea"
, this parameter is also used and its
default value becomes then tau=0.25
to match the methodology described in
(Barbie et al. 2009).
In general, the default values for the previous parameters are suitable for most analysis settings, which usually consist of some kind of normalized continuous expression values.
Gene sets constitute a simple, yet useful, way to define pathways because we use pathway membership definitions only, neglecting the information on molecular interactions. Gene set definitions are a crucial input to any gene set enrichment analysis because if our gene sets do not capture the biological processes we are studying, we will likely not find any relevant insights in our data from an analysis based on these gene sets.
There are multiple sources of gene sets, the most popular ones being The Gene Ontology (GO) project and The Molecular Signatures Database (MSigDB). Sometimes gene set databases will not include the ones we need. In such a case we should either curate our own gene sets or use techniques to infer them from data.
The most basic data container for gene sets in R is the list
class of objects,
as illustrated before in the quick start section, where we defined a toy collection
of three gene sets stored in a list object called gs
:
class(gs)
[1] "list"
length(gs)
[1] 100
head(lapply(gs, head))
$gs1
[1] "g7954" "g9788" "g7319" "g8773" "g6139" "g1393"
$gs2
[1] "g9781" "g2040" "g8357" "g4052" "g8933" "g8023"
$gs3
[1] "g418" "g6148" "g5551" "g9539" "g1543" "g8600"
$gs4
[1] "g1342" "g5230" "g7674" "g6211" "g7840" "g5280"
$gs5
[1] "g1433" "g1530" "g1771" "g231" "g8879" "g5328"
$gs6
[1] "g3549" "g3247" "g6218" "g6083" "g1878" "g9938"
Using a Bioconductor organism-level package such as org.Hs.eg.db we can easily build a list object containing a collection of gene sets defined as GO terms with annotated Entrez gene identifiers, as follows:
library(org.Hs.eg.db)
goannot <- select(org.Hs.eg.db, keys=keys(org.Hs.eg.db), columns="GO")
head(goannot)
ENTREZID GO EVIDENCE ONTOLOGY
1 1 GO:0003674 ND MF
2 1 GO:0005576 HDA CC
3 1 GO:0005576 IDA CC
4 1 GO:0005576 TAS CC
5 1 GO:0005615 HDA CC
6 1 GO:0008150 ND BP
genesbygo <- split(goannot$ENTREZID, goannot$GO)
length(genesbygo)
[1] 18593
head(genesbygo)
$`GO:0000002`
[1] "291" "1890" "4205" "4358" "4976" "9361" "10000" "55186" "80119"
[10] "84275" "92667"
$`GO:0000003`
[1] "2796" "2797" "8510" "286826"
$`GO:0000009`
[1] "55650" "79087"
$`GO:0000010`
[1] "23590" "57107"
$`GO:0000012`
[1] "1161" "2074" "3981" "7141" "7515" "23411"
[7] "54840" "54840" "55775" "55775" "55775" "200558"
[13] "100133315"
$`GO:0000014`
[1] "2021" "2067" "2072" "4361" "4361" "5932" "6419" "9941" "10111"
[10] "64421"
A more sophisticated container for gene sets is the GeneSetCollection
object class defined in the GSEABase package, which also
provides the function getGmt()
to import
gene matrix transposed (GMT) files such as those
provided by MSigDB into a
GeneSetCollection
object. The experiment data package
GSVAdata provides one such object with the old (3.0) version
of the C2 collection of curated genesets from MSigDB,
which can be loaded as follows.
library(GSEABase)
library(GSVAdata)
data(c2BroadSets)
class(c2BroadSets)
[1] "GeneSetCollection"
attr(,"package")
[1] "GSEABase"
c2BroadSets
GeneSetCollection
names: NAKAMURA_CANCER_MICROENVIRONMENT_UP, NAKAMURA_CANCER_MICROENVIRONMENT_DN, ..., ST_PHOSPHOINOSITIDE_3_KINASE_PATHWAY (3272 total)
unique identifiers: 5167, 100288400, ..., 57191 (29340 total)
types in collection:
geneIdType: EntrezIdentifier (1 total)
collectionType: BroadCollection (1 total)
The documentation of GSEABase contains a description of the
GeneSetCollection
class and its associated methods.
Here we illustrate how GSVA provides an analogous quantification of pathway activity in both microarray and RNA-seq data by using two such datasets that have been derived from the same biological samples. More concretely, we will use gene expression data of lymphoblastoid cell lines (LCL) from HapMap individuals that have been profiled using both technologies (Huang et al. 2007, @pickrell_understanding_2010). These data form part of the experimental package GSVAdata and the corresponding help pages contain details on how the data were processed. We start loading these data and verifying that they indeed contain expression data for the same genes and samples, as follows:
library(Biobase)
data(commonPickrellHuang)
stopifnot(identical(featureNames(huangArrayRMAnoBatchCommon_eset),
featureNames(pickrellCountsArgonneCQNcommon_eset)))
stopifnot(identical(sampleNames(huangArrayRMAnoBatchCommon_eset),
sampleNames(pickrellCountsArgonneCQNcommon_eset)))
Next, for the current analysis we use the subset of canonical pathways from the C2 collection of MSigDB Gene Sets. These correspond to the following pathways from KEGG, REACTOME and BIOCARTA:
canonicalC2BroadSets <- c2BroadSets[c(grep("^KEGG", names(c2BroadSets)),
grep("^REACTOME", names(c2BroadSets)),
grep("^BIOCARTA", names(c2BroadSets)))]
canonicalC2BroadSets
GeneSetCollection
names: KEGG_GLYCOLYSIS_GLUCONEOGENESIS, KEGG_CITRATE_CYCLE_TCA_CYCLE, ..., BIOCARTA_ACTINY_PATHWAY (833 total)
unique identifiers: 55902, 2645, ..., 8544 (6744 total)
types in collection:
geneIdType: EntrezIdentifier (1 total)
collectionType: BroadCollection (1 total)
Additionally, we extend this collection of gene sets with two formed by genes
with sex-specific expression, which also form part of the GSVAdata
experiment data package. Here we use the constructor function GeneSet
from the
GSEABase package to build the objects that we add to the
GeneSetCollection
object canonicalC2BroadSets
.
data(genderGenesEntrez)
MSY <- GeneSet(msYgenesEntrez, geneIdType=EntrezIdentifier(),
collectionType=BroadCollection(category="c2"), setName="MSY")
MSY
setName: MSY
geneIds: 266, 84663, ..., 353513 (total: 34)
geneIdType: EntrezId
collectionType: Broad
bcCategory: c2 (Curated)
bcSubCategory: NA
details: use 'details(object)'
XiE <- GeneSet(XiEgenesEntrez, geneIdType=EntrezIdentifier(),
collectionType=BroadCollection(category="c2"), setName="XiE")
XiE
setName: XiE
geneIds: 293, 8623, ..., 1121 (total: 66)
geneIdType: EntrezId
collectionType: Broad
bcCategory: c2 (Curated)
bcSubCategory: NA
details: use 'details(object)'
canonicalC2BroadSets <- GeneSetCollection(c(canonicalC2BroadSets, MSY, XiE))
canonicalC2BroadSets
GeneSetCollection
names: KEGG_GLYCOLYSIS_GLUCONEOGENESIS, KEGG_CITRATE_CYCLE_TCA_CYCLE, ..., XiE (835 total)
unique identifiers: 55902, 2645, ..., 1121 (6810 total)
types in collection:
geneIdType: EntrezIdentifier (1 total)
collectionType: BroadCollection (1 total)
We calculate now GSVA enrichment scores for these gene sets using first the
microarray data and then the RNA-seq integer count data. Note that the only
requirement to do the latter is to set the argument kcdf="Poisson"
, which is
"Gaussian"
by default. Note, however, that if our RNA-seq derived expression
levels would be continuous, such as log-CPMs, log-RPKMs or log-TPMs, the default
value of the kcdf
argument should remain unchanged.
esmicro <- gsva(huangArrayRMAnoBatchCommon_eset, canonicalC2BroadSets, min.sz=5, max.sz=500)
esrnaseq <- gsva(pickrellCountsArgonneCQNcommon_eset, canonicalC2BroadSets, min.sz=5, max.sz=500,
kcdf="Poisson")
We are going to assess how gene expression profiles correlate between microarray
and RNA-seq data and compare those correlations with the ones derived at pathway
level. To compare gene expression values of both technologies, we will transform
first the RNA-seq integer counts into log-CPM units of expression using the
cpm()
function from the edgeR package.
library(edgeR)
lcpms <- cpm(exprs(pickrellCountsArgonneCQNcommon_eset), log=TRUE)
We calculate Spearman correlations between gene expression profiles of the previous log-CPM values and the microarray RMA values.
genecorrs <- sapply(1:nrow(lcpms),
function(i, expmicro, exprnaseq) cor(expmicro[i, ], exprnaseq[i, ],
method="spearman"),
exprs(huangArrayRMAnoBatchCommon_eset), lcpms)
names(genecorrs) <- rownames(lcpms)
Now calculate Spearman correlations between GSVA enrichment scores derived from the microarray and the RNA-seq data.
pwycorrs <- sapply(1:nrow(esmicro),
function(i, esmicro, esrnaseq) cor(esmicro[i, ], esrnaseq[i, ],
method="spearman"),
exprs(esmicro), exprs(esrnaseq))
names(pwycorrs) <- rownames(esmicro)
Figure 1 below shows the two distributions of these correlations and we can see that GSVA enrichment scores provide an agreement between microarray and RNA-seq data comparable to the one observed between gene-level units of expression.
par(mfrow=c(1, 2), mar=c(4, 5, 3, 2))
hist(genecorrs, xlab="Spearman correlation", main="Gene level\n(RNA-seq log-CPMs vs microarray RMA)",
xlim=c(-1, 1), col="grey", las=1)
hist(pwycorrs, xlab="Spearman correlation", main="Pathway level\n(GSVA enrichment scores)",
xlim=c(-1, 1), col="grey", las=1)
Finally, in Figure 2 we compare the actual GSVA enrichment scores for two gene sets formed by genes with sex-specific expression. Concretely, one gene set (XIE) formed by genes that escape chromosome X-inactivation in females (Carrel and Willard 2005) and another gene set (MSY) formed by genes located on the male-specific region of chromosome Y (Skaletsky et al. 2003).
par(mfrow=c(1, 2))
rmsy <- cor(exprs(esrnaseq)["MSY", ], exprs(esmicro)["MSY", ])
plot(exprs(esrnaseq)["MSY", ], exprs(esmicro)["MSY", ], xlab="GSVA scores RNA-seq",
ylab="GSVA scores microarray", main=sprintf("MSY R=%.2f", rmsy), las=1, type="n")
abline(lm(exprs(esmicro)["MSY", ] ~ exprs(esrnaseq)["MSY", ]), lwd=2, lty=2, col="grey")
points(exprs(esrnaseq["MSY", pickrellCountsArgonneCQNcommon_eset$Gender == "Female"]),
exprs(esmicro)["MSY", huangArrayRMAnoBatchCommon_eset$Gender == "Female"],
col="red", pch=21, bg="red", cex=1)
points(exprs(esrnaseq)["MSY", pickrellCountsArgonneCQNcommon_eset$Gender == "Male"],
exprs(esmicro)["MSY", huangArrayRMAnoBatchCommon_eset$Gender == "Male"],
col="blue", pch=21, bg="blue", cex=1)
legend("topleft", c("female", "male"), pch=21, col=c("red", "blue"), pt.bg=c("red", "blue"), inset=0.01)
rxie <- cor(exprs(esrnaseq)["XiE", ], exprs(esmicro)["XiE", ])
plot(exprs(esrnaseq)["XiE", ], exprs(esmicro)["XiE", ], xlab="GSVA scores RNA-seq",
ylab="GSVA scores microarray", main=sprintf("XiE R=%.2f", rxie), las=1, type="n")
abline(lm(exprs(esmicro)["XiE", ] ~ exprs(esrnaseq)["XiE", ]), lwd=2, lty=2, col="grey")
points(exprs(esrnaseq["XiE", pickrellCountsArgonneCQNcommon_eset$Gender == "Female"]),
exprs(esmicro)["XiE", huangArrayRMAnoBatchCommon_eset$Gender == "Female"],
col="red", pch=21, bg="red", cex=1)
points(exprs(esrnaseq)["XiE", pickrellCountsArgonneCQNcommon_eset$Gender == "Male"],
exprs(esmicro)["XiE", huangArrayRMAnoBatchCommon_eset$Gender == "Male"],
col="blue", pch=21, bg="blue", cex=1)
legend("topleft", c("female", "male"), pch=21, col=c("red", "blue"), pt.bg=c("red", "blue"), inset=0.01)
We can see how microarray and RNA-seq single-sample GSVA enrichment scores correlate very well in these gene sets, with \(\rho=0.80\) for the male-specific gene set and \(\rho=0.79\) for the female-specific gene set. Male and female samples show higher GSVA enrichment scores in their corresponding gene set.
In (Verhaak et al. 2010) four subtypes of glioblastoma multiforme (GBM) -proneural, classical, neural and mesenchymal- were identified by the characterization of distinct gene-level expression patterns. Using four gene set signatures specific to brain cell types (astrocytes, oligodendrocytes, neurons and cultured astroglial cells), derived from murine models by Cahoy et al. (2008), we replicate the analysis of Verhaak et al. (2010) by using GSVA to transform the gene expression measurements into enrichment scores for these four gene sets, without taking the sample subtype grouping into account. We start by having a quick glance to the data, which forms part of the GSVAdata package:
data(gbm_VerhaakEtAl)
gbm_eset
ExpressionSet (storageMode: lockedEnvironment)
assayData: 11861 features, 173 samples
element names: exprs
protocolData: none
phenoData
rowNames: TCGA.02.0003.01A.01 TCGA.02.0010.01A.01 ...
TCGA.12.0620.01A.01 (173 total)
varLabels: subtype
varMetadata: labelDescription channel
featureData: none
experimentData: use 'experimentData(object)'
Annotation:
head(featureNames(gbm_eset))
[1] "AACS" "FSTL1" "ELMO2" "CREB3L1" "RPS11" "PNMA1"
table(gbm_eset$subtype)
Classical Mesenchymal Neural Proneural
38 56 26 53
data(brainTxDbSets)
lengths(brainTxDbSets)
astrocytic_up astroglia_up neuronal_up oligodendrocytic_up
85 88 98 70
lapply(brainTxDbSets, head)
$astrocytic_up
[1] "GRHL1" "GPAM" "PAPSS2" "MERTK" "BTG1" "SLC46A1"
$astroglia_up
[1] "BST2" "SERPING1" "ACTA2" "C9orf167" "C1orf31" "ANXA4"
$neuronal_up
[1] "STXBP1" "JPH4" "CACNG3" "BRUNOL6" "CLSTN2" "FAM123C"
$oligodendrocytic_up
[1] "DCT" "ZNF536" "GNG8" "ELOVL6" "NR2C1" "RCBTB1"
GSVA enrichment scores for the gene sets contained in brainTxDbSets
are calculated, in this case using mx.diff=FALSE
, as follows:
gbm_es <- gsva(gbm_eset, brainTxDbSets, mx.diff=FALSE)
Figure 3 shows the GSVA enrichment scores obtained for the up-regulated gene sets across the samples of the four GBM subtypes. As expected, the neural class is associated with the neural gene set and the astrocytic gene sets. The mesenchymal subtype is characterized by the expression of mesenchymal and microglial markers, thus we expect it to correlate with the astroglial gene set. The proneural subtype shows high expression of oligodendrocytic development genes, thus it is not surprising that the oligodendrocytic gene set is highly enriched for ths group. Interestingly, the classical group correlates highly with the astrocytic gene set. In summary, the resulting GSVA enrichment scores recapitulate accurately the molecular signatures from Verhaak et al. (2010).
library(RColorBrewer)
subtypeOrder <- c("Proneural", "Neural", "Classical", "Mesenchymal")
sampleOrderBySubtype <- sort(match(gbm_es$subtype, subtypeOrder),
index.return=TRUE)$ix
subtypeXtable <- table(gbm_es$subtype)
subtypeColorLegend <- c(Proneural="red", Neural="green",
Classical="blue", Mesenchymal="orange")
geneSetOrder <- c("astroglia_up", "astrocytic_up", "neuronal_up",
"oligodendrocytic_up")
geneSetLabels <- gsub("_", " ", geneSetOrder)
hmcol <- colorRampPalette(brewer.pal(10, "RdBu"))(256)
hmcol <- hmcol[length(hmcol):1]
heatmap(exprs(gbm_es)[geneSetOrder, sampleOrderBySubtype], Rowv=NA,
Colv=NA, scale="row", margins=c(3,5), col=hmcol,
ColSideColors=rep(subtypeColorLegend[subtypeOrder],
times=subtypeXtable[subtypeOrder]),
labCol="", gbm_es$subtype[sampleOrderBySubtype],
labRow=paste(toupper(substring(geneSetLabels, 1,1)),
substring(geneSetLabels, 2), sep=""),
cexRow=2, main=" \n ")
par(xpd=TRUE)
text(0.23,1.21, "Proneural", col="red", cex=1.2)
text(0.36,1.21, "Neural", col="green", cex=1.2)
text(0.47,1.21, "Classical", col="blue", cex=1.2)
text(0.62,1.21, "Mesenchymal", col="orange", cex=1.2)
mtext("Gene sets", side=4, line=0, cex=1.5)
mtext("Samples ", side=1, line=4, cex=1.5)
We illustrate here how to conduct a differential expression analysis at
pathway level. We will use an example gene expression microarray data from
Armstrong et al. (2002), which consists of 37 different individuals with human
acute leukemia, where 20 of them have conventional childhood acute
lymphoblastic leukemia (ALL) and the other 17 are affected with the MLL
(mixed-lineage leukemia gene) translocation. This leukemia data set is stored as
an ExpressionSet
object called leukemia
in the GSVAdata
package and and details on how the data was pre-processed can be found in the
corresponding help page. Jointly with the RMA expression values, we
provide some metadata including the main phenotype corresponding to the leukemia
sample subtype.
data(leukemia)
leukemia_eset
ExpressionSet (storageMode: lockedEnvironment)
assayData: 12626 features, 37 samples
element names: exprs
protocolData
sampleNames: CL2001011101AA.CEL CL2001011102AA.CEL ...
CL2001011152AA.CEL (37 total)
varLabels: ScanDate
varMetadata: labelDescription
phenoData
sampleNames: CL2001011101AA.CEL CL2001011102AA.CEL ...
CL2001011152AA.CEL (37 total)
varLabels: subtype
varMetadata: labelDescription channel
featureData: none
experimentData: use 'experimentData(object)'
Annotation: hgu95a
Next, we calculate GSVA enrichment scores setting the minimum and maximum gene set size to 10 and 500 genes, respectively.
leukemia_es <- gsva(leukemia_eset, c2BroadSets, min.sz=10, max.sz=500)
The object returned by the function gsva()
is always of the same class
as the input object with the expression data. Therefore, in this case,
we obtain an ExpressionSet
object with features corresponding to gene
sets.
class(leukemia_es)
[1] "ExpressionSet"
attr(,"package")
[1] "Biobase"
leukemia_es
ExpressionSet (storageMode: lockedEnvironment)
assayData: 2701 features, 37 samples
element names: exprs
protocolData: none
phenoData
sampleNames: CL2001011101AA.CEL CL2001011102AA.CEL ...
CL2001011152AA.CEL (37 total)
varLabels: subtype
varMetadata: labelDescription channel
featureData: none
experimentData: use 'experimentData(object)'
Annotation:
head(featureNames(leukemia_es))
[1] "NAKAMURA_CANCER_MICROENVIRONMENT_UP"
[2] "NAKAMURA_CANCER_MICROENVIRONMENT_DN"
[3] "WEST_ADRENOCORTICAL_TUMOR_MARKERS_UP"
[4] "WEST_ADRENOCORTICAL_TUMOR_MARKERS_DN"
[5] "WINTER_HYPOXIA_UP"
[6] "WINTER_HYPOXIA_DN"
We will perform now a differential expression analysis using limma (Smyth 2004) between the two different leukemia subtypes.
library(limma)
mod <- model.matrix(~ factor(leukemia_es$subtype))
colnames(mod) <- c("ALL", "MLLvsALL")
fit <- lmFit(leukemia_es, mod)
fit <- eBayes(fit)
res <- decideTests(fit, p.value=0.01)
summary(res)
ALL MLLvsALL
Down 2 40
NotSig 2697 2605
Up 2 56
There are 96 MSigDB C2 differentially expressed pathways with FDR < 1%. Figure @(fig:leukemiavolcano) below shows a volcano plot of the expression changes.
tt <- topTable(fit, coef=2, n=Inf)
DEpwys <- rownames(tt)[tt$adj.P.Val <= 0.01]
plot(tt$logFC, -log10(tt$P.Value), pch=".", cex=4, col=grey(0.75),
main="", xlab="GSVA enrichment score difference", ylab=expression(-log[10]~~Raw~P-value))
abline(h=-log10(max(tt$P.Value[tt$adj.P.Val <= 0.01])), col=grey(0.5), lwd=1, lty=2)
points(tt$logFC[match(DEpwys, rownames(tt))],
-log10(tt$P.Value[match(DEpwys, rownames(tt))]), pch=".", cex=5, col="darkred")
text(max(tt$logFC)*0.85, -log10(max(tt$P.Value[tt$adj.P.Val <= 0.01])), "1% FDR", pos=3)
Figure @(fig:leukemiaheatmap} below shows a heatmap of GSVA enrichment scores for the 96 differentially expressed pathways.
DEpwys_es <- exprs(leukemia_es[DEpwys, ])
colorLegend <- c("darkred", "darkblue")
names(colorLegend) <- c("ALL", "MLL")
sample.color.map <- colorLegend[pData(leukemia_es)[, "subtype"]]
names(sample.color.map) <- colnames(DEpwys_es)
sampleClustering <- hclust(as.dist(1-cor(DEpwys_es, method="spearman")),
method="complete")
geneSetClustering <- hclust(as.dist(1-cor(t(DEpwys_es), method="pearson")),
method="complete")
heatmap(DEpwys_es, ColSideColors=sample.color.map, xlab="samples",
ylab="Pathways", margins=c(2, 20),
labRow=substr(gsub("_", " ", gsub("^KEGG_|^REACTOME_|^BIOCARTA_", "",
rownames(DEpwys_es))), 1, 35),
labCol="", scale="row", Colv=as.dendrogram(sampleClustering),
Rowv=as.dendrogram(geneSetClustering))
legend("topleft", names(colorLegend), fill=colorLegend, inset=0.01, bg="white")
The gsva()
function can be also used through an interactive web app developed
with shiny. To start it just type on the R console:
res <- igsva()
It will open your browser with the web app shown here below. The button
SAVE & CLOSE
will close the app and return the resulting object on the
R console. Hence, the need to call igsva() on the right-hand side of an
assignment if you want to store the result in your workspace. Alternatively,
you can use the DOWNLOAD
button to download the result in a CSV file.
In the starting window of the web app, after running GSVA, a non-parametric
kernel density estimation of sample profiles of GSVA scores will be shown.
By clicking on one of the lines, the cumulative distribution of GSVA scores
for the corresponding samples will be shown in the GeneSets
tab, as
illustrated in the image below.
Here is the output of sessionInfo()
on the system on which this document was
compiled running pandoc 2.5:
sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS
Matrix products: default
BLAS: /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] RColorBrewer_1.1-2 edgeR_3.36.0 limma_3.50.0
[4] GSVAdata_1.29.0 hgu95a.db_3.13.0 GSEABase_1.56.0
[7] graph_1.72.0 annotate_1.72.0 XML_3.99-0.8
[10] org.Hs.eg.db_3.14.0 AnnotationDbi_1.56.0 IRanges_2.28.0
[13] S4Vectors_0.32.0 Biobase_2.54.0 BiocGenerics_0.40.0
[16] GSVA_1.42.0 BiocStyle_2.22.0
loaded via a namespace (and not attached):
[1] MatrixGenerics_1.6.0 httr_1.4.2
[3] sass_0.4.0 BiocSingular_1.10.0
[5] bit64_4.0.5 jsonlite_1.7.2
[7] DelayedMatrixStats_1.16.0 bslib_0.3.1
[9] highr_0.9 BiocManager_1.30.16
[11] blob_1.2.2 GenomeInfoDbData_1.2.7
[13] yaml_2.2.1 RSQLite_2.2.8
[15] lattice_0.20-45 beachmat_2.10.0
[17] digest_0.6.28 GenomicRanges_1.46.0
[19] XVector_0.34.0 htmltools_0.5.2
[21] Matrix_1.3-4 pkgconfig_2.0.3
[23] magick_2.7.3 bookdown_0.24
[25] zlibbioc_1.40.0 xtable_1.8-4
[27] ScaledMatrix_1.2.0 HDF5Array_1.22.0
[29] BiocParallel_1.28.0 KEGGREST_1.34.0
[31] cachem_1.0.6 SummarizedExperiment_1.24.0
[33] magrittr_2.0.1 crayon_1.4.1
[35] memoise_2.0.0 evaluate_0.14
[37] tools_4.1.1 matrixStats_0.61.0
[39] stringr_1.4.0 Rhdf5lib_1.16.0
[41] locfit_1.5-9.4 DelayedArray_0.20.0
[43] irlba_2.3.3 Biostrings_2.62.0
[45] compiler_4.1.1 jquerylib_0.1.4
[47] GenomeInfoDb_1.30.0 rsvd_1.0.5
[49] rlang_0.4.12 rhdf5_2.38.0
[51] grid_4.1.1 RCurl_1.98-1.5
[53] rhdf5filters_1.6.0 SingleCellExperiment_1.16.0
[55] bitops_1.0-7 rmarkdown_2.11
[57] DBI_1.1.1 R6_2.5.1
[59] knitr_1.36 fastmap_1.1.0
[61] bit_4.0.4 stringi_1.7.5
[63] parallel_4.1.1 Rcpp_1.0.7
[65] vctrs_0.3.8 png_0.1-7
[67] xfun_0.27 sparseMatrixStats_1.6.0
Armstrong, Scott A, Jane E Staunton, Lewis B Silverman, Rob Pieters, Monique L den Boer, Mark D Minden, Stephen E Sallan, Eric S Lander, Todd R Golub, and Stanley J Korsmeyer. 2002. “MLL Translocations Specify a Distinct Gene Expression Profile That Distinguishes a Unique Leukemia.” Nature Gen. 30 (1): 41–47. https://doi.org/10.1038/ng765.
Barbie, David A., Pablo Tamayo, Jesse S. Boehm, So Young Kim, Susan E. Moody, Ian F. Dunn, Anna C. Schinzel, et al. 2009. “Systematic RNA Interference Reveals That Oncogenic KRAS-driven Cancers Require TBK1.” Nature 462 (7269): 108–12. https://doi.org/10.1038/nature08460.
Cahoy, John D., Ben Emery, Amit Kaushal, Lynette C. Foo, Jennifer L. Zamanian, Karen S. Christopherson, Yi Xing, et al. 2008. “A Transcriptome Database for Astrocytes, Neurons, and Oligodendrocytes: A New Resource for Understanding Brain Development and Function.” J. Neurosci. 28 (1): 264–78. https://doi.org/10.1523/JNEUROSCI.4178-07.2008.
Carrel, Laura, and Huntington F Willard. 2005. “X-Inactivation Profile Reveals Extensive Variability in X-Linked Gene Expression in Females.” Nature 434 (7031): 400–404.
Hänzelmann, Sonja, Robert Castelo, and Justin Guinney. 2013. “GSVA: Gene Set Variation Analysis for Microarray and RNA-Seq Data.” BMC Bioinformatics 14: 7. https://doi.org/10.1186/1471-2105-14-7.
Huang, R Stephanie, Shiwei Duan, Wasim K Bleibel, Emily O Kistner, Wei Zhang, Tyson A Clark, Tina X Chen, et al. 2007. “A Genome-Wide Approach to Identify Genetic Variants That Contribute to Etoposide-Induced Cytotoxicity.” Proceedings of the National Academy of Sciences of the United States of America 104 (23): 9758–63.
Lee, Eunjung, Han-Yu Chuang, Jong-Won Kim, Trey Ideker, and Doheon Lee. 2008. “Inferring Pathway Activity Toward Precise Disease Classification.” PLOS Comput Biol 4 (11): e1000217. https://doi.org/10.1371/journal.pcbi.1000217.
Pickrell, Joseph K., John C. Marioni, Athma A. Pai, Jacob F. Degner, Barbara E. Engelhardt, Everlyne Nkadori, Jean-Baptiste Veyrieras, Matthew Stephens, Yoav Gilad, and Jonathan K. Pritchard. 2010. “Understanding Mechanisms Underlying Human Gene Expression Variation with RNA Sequencing.” Nature 464 (7289): 768–72.
Skaletsky, Helen, Tomoko Kuroda-Kawaguchi, Patrick J. Minx, Holland S. Cordum, LaDeana Hillier, Laura G. Brown, Sjoerd Repping, et al. 2003. “The Male-Specific Region of the Human Y Chromosome Is a Mosaic of Discrete Sequence Classes.” Nature 423 (6942): 825–37.
Smyth, Gordon K. 2004. “Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments.” Stat Appl Genet Mol Biol 3: Article3. https://doi.org/10.2202/1544-6115.1027.
Tomfohr, John, Jun Lu, and Thomas B Kepler. 2005. “Pathway Level Analysis of Gene Expression Using Singular Value Decomposition.” BMC Bioinformatics 6 (September): 225. https://doi.org/10.1186/1471-2105-6-225.
Verhaak, Roel G W, Katherine A Hoadley, Elizabeth Purdom, Victoria Wang, Yuan Qi, Matthew D Wilkerson, C Ryan Miller, et al. 2010. “Integrated Genomic Analysis Identifies Clinically Relevant Subtypes of Glioblastoma Characterized by Abnormalities in PDGFRA, IDH1, EGFR, and NF1.” Cancer Cell 17 (1): 98–110. https://doi.org/10.1016/j.ccr.2009.12.020.