This tutorial aims to illustrate iClusterBayes2, an updated iClusterBayes function for identifying cancer molecular subtypes based on multi-omics data. We use the TCGA glioblastoma (GBM) multi-omics data generated by The Cancer Genome Atlas (TCGA) project (TCGA 2018) as an example. The level-3 GBM multi-omics data are available at the TCGA Public Portal and the cBio Cancer Genomics Portal. The GBM dataset consists of somatic mutation, DNA copy number, and microarray gene expression data for 84 GBM primary samples. The somatic mutation data are available for 306 selected genes in which 1 and 0 are used to indicate a gene with mutation and no mutation, respectively. The level-3 DNA copy number data are made of 16295 chromosome segment means (output of R package DNAcopy) for the 84 samples. The mRNA expression data are made of 1740 unified most variable genes across three microarray platforms (Affymetrix Human Exon 1.0 ST GeneChips, Affymetrix HT-HG-U133A GeneChips, and custom designed Agilent 244K array), which are detailed in Verhaak et al. 2010.
In general, the iCluster analysis of multi-omics data can be carried out in the following steps.
library(iClusterPlus)
library(patchwork)
library(dplyr)
library(lattice)
library(umap)
library(ggplot2)
library(gplots)
library(knitr)
For the somatic mutation data (1: mutation, 0: no mutation), genes with low-frequency mutations contribute little to sample clustering. Therefore, we recommend filtering out genes with too few mutations for iCluster analysis (e.g., < 2%).
data(gbm)
gbm.mut2 <- gbm.mut[,apply(gbm.mut,2,mean) > 0.02]
For copy number data, we use function CNregions to merge the copy number regions with similar neighboring segment means. We suggest reducing the GBM copy number regions to 5,000-6,000 regions.
gbm.cn <- CNregions(seg=gbm.seg,epsilon=0,frac.overlap=0.5, rmSmallseg=TRUE,nProbes=5)
dim(gbm.cn)
## [1] 84 6860
# order the samples
gbm.cn <- gbm.cn[order(rownames(gbm.cn)),]
# the sample names for the three data sets should be in the same order
all(rownames(gbm.cn)==rownames(gbm.exp))
## [1] TRUE
all(rownames(gbm.cn)==rownames(gbm.mut2))
## [1] TRUE
For gene expression data, we recommend performing variance (or coefficient of variation) filtering to select the top most variable genes (e.g. 1000 - 5000 genes) for iCluster analysis. For example, we use the top 1740 unified most variable genes as described above for our iCluster analysis.
The iCluster methods are based on latent factor model. We suggest using PCA variance plot to determine the number of latent factors used for iCluster modeling.
# PCA variance plot for continuous multi-omics data
pcaVarPlot(xList=list(gbm.cn,gbm.exp),K=50)
From the PCA variance plot, it can be seen that the rate of decrease of the explained variance starts to slow down significantly around the 5th PC. An optimal number of latent factors used for iCluster modeling can be chosen around the change point where the rate of decrease of the explained variance of the PCs starts to slow down significantly. As an example, here, we choose K=5 as the number of latent factors for iCluster modeling of the GBM multi-omics data. It can be found that a few latent variables (e.g., K=3-7) are usually sufficient for identification of cancer subtypes. In addition, the iClusterBayes method is a Bayesian integrative clustering method. A Markov chain Monte Carlo (MCMC) algorithm is used for the model parameter estimation and feature selection, which are computationally intensive. To reduce computation time and make the MCMC sampling more efficient, it is optimal to use a small number of factors (e.g., K < 8) for modeling. Otherwise, the user will need to tune the iClusterBayes parameters (e.g., sdev and beta.var.scale) to make the MCMC chains mix well.
The iCluster modeling generates a latent factor matrix (meanZ), which capture the major sources of variation of the multi-omics data. As a result, the latent factor matrix can be used for sample clustering and visualization.
set.seed(321)
date()
## [1] "Wed Mar 11 16:11:59 2026"
icfit <- iClusterBayes2(xList=list(gbm.mut2,gbm.cn,gbm.exp),type=c("binomial","gaussian","gaussian"),
K=5,n.burnin=2000,n.draw=12000,prior.gamma=c(0.5,0.5,0.5),sdev=0.05,thin=2)
date()
## [1] "Wed Mar 11 16:16:56 2026"
After the modeling, we need to check the distributions of the posterior probabilities (PPs) of the model parameters being selected. In general, if the PPs spread in the region of [0, 1], it indicates that the MCMC sampling work. if the PPs concentrate around 0, it indicates that the MCMC sampling may not work. The user will need to tune the sampling parameters sdev and beta.var.scale for the MCMC sampling.
par(mfrow=c(1,3))
hist(icfit$beta.pp[[1]], col="yellow")
hist(icfit$beta.pp[[2]], col="blue")
hist(icfit$beta.pp[[3]], col="green")
Before clustering analysis, we project the samples to a 2-dimensional space to estimate the number of clusters.
## clustering of cells using Louvain method and the latent factor matrix generated by iCluster
gbm.umap <- umap(icfit$meanZ)
colnames(gbm.umap$layout) <- c("UMAP1","UMAP2")
plot(gbm.umap$layout)
From the UMAP plot, it seems that there are 3-4 major clusters. Therefore, we perform k-means clustering for the GBM samples using the latent factor matrix meanZ and let the number of clusters be 3-4. The user can also try other clustering methods (e.g., k-medoids) on the latent factor matrix meanZ.
km3 <- kmeans(icfit$meanZ, 3, nstart = 100)
km4 <- kmeans(icfit$meanZ, 4, nstart = 100)
gbm.clusters <- data.frame(gbm.umap$layout,ic3=as.factor(km3$cluster), ic4=as.factor(km4$cluster))
p1 <- ggplot(gbm.clusters, aes(x=UMAP1, y=UMAP2, col=ic3)) + geom_point(size = 3) + theme_bw()
p2 <- ggplot(gbm.clusters, aes(x=UMAP1, y=UMAP2, col=ic4)) + geom_point(size = 3) + theme_bw()
p1 + p2
It is necessary to make heatmaps of the multi-omics data to check if the clusters are biologically meaningful. We will make the heatmaps for the 3 and 4 clusters separately. On the following heatmaps, the columns are the samples ordered by the clusters and the rows are the features ordered by the clustering methods specified by the plotHMBayes function.
Figure A: Heatmap of the 3 iClusters.
##### plot heatmaps #######
chr <- unlist(strsplit(colnames(gbm.cn),"\\."))
chr <- chr[seq(1,length(chr),by=2)]
chr <- gsub("chr","",chr)
chr <- as.numeric(chr)
bw.col <- colorpanel(2,low="white",high="black")
col.scheme <- alist()
col.scheme[[1]] <- bw.col
col.scheme[[2]] <- bluered(256)
col.scheme[[3]] <- bluered(256)
icfit$clusters <- gbm.clusters$ic3
hm3c <- plotHMBayes(fit=icfit, xList=list(gbm.mut2,gbm.cn,gbm.exp), type=c("binomial","gaussian","gaussian"),
sample.order=NULL, feature.order=c(F,F,T),dist.method="euclidean",hclust.method="ward.D",
sparse=c(T,T,T), threshold=c(0.5,0.5,0.5), feature.scale=c(F,F,T),
col.scheme=col.scheme,chr=chr,plot.chr=c(F,T,F),cap=c(0,0.99,0.99))
Figure B: Heatmap of the 4 iClusters.
icfit$clusters <- gbm.clusters$ic4
hm4c <- plotHMBayes(fit=icfit, xList=list(gbm.mut2,gbm.cn,gbm.exp), type=c("binomial","gaussian","gaussian"),
sample.order=NULL, feature.order=c(F,F,T), dist.method="euclidean", hclust.method="ward.D",
sparse=c(T,T,T), threshold=c(0.5,0.5,0.5), feature.scale=c(F,F,T),
col.scheme=col.scheme,chr=chr,plot.chr=c(F,T,F),cap=c(0,0.99,0.99))
Last, we compare the 3 and 4 iClusters.
table(gbm.clusters$ic3, gbm.clusters$ic4)
##
## 1 2 3 4
## 1 13 0 0 0
## 2 0 0 9 23
## 3 0 13 26 0
We can see that the samples characterized by normal chromosome 7 are grouped into the same cluster in the 3-cluster solution (Figure A) and the 4-cluster solution (Figure B). The samples characterized by amplified chromosome 7 are divided into 2 clusters (Figure A) and 3 clusters in Figure B, respectively.
sessionInfo()
## R version 4.5.2 (2025-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.7.2
##
## Matrix products: default
## BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] knitr_1.50 gplots_3.2.0 ggplot2_4.0.1
## [4] umap_0.2.10.0 lattice_0.22-7 dplyr_1.1.4
## [7] patchwork_1.3.2 iClusterPlus_1.47.2 irlba_2.3.5.1
## [10] Matrix_1.7-4
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 jsonlite_2.0.0 compiler_4.5.2 gtools_3.9.5
## [5] tidyselect_1.2.1 Rcpp_1.1.0 bitops_1.0-9 jquerylib_0.1.4
## [9] png_0.1-8 scales_1.4.0 yaml_2.3.10 fastmap_1.2.0
## [13] reticulate_1.44.1 R6_2.6.1 labeling_0.4.3 generics_0.1.4
## [17] tibble_3.3.0 openssl_2.3.4 bslib_0.9.0 pillar_1.11.1
## [21] RColorBrewer_1.1-3 rlang_1.1.6 cachem_1.1.0 xfun_0.54
## [25] caTools_1.18.3 sass_0.4.10 S7_0.2.1 cli_3.6.5
## [29] withr_3.0.2 magrittr_2.0.4 digest_0.6.39 grid_4.5.2
## [33] rstudioapi_0.17.1 askpass_1.2.1 lifecycle_1.0.4 vctrs_0.6.5
## [37] KernSmooth_2.23-26 RSpectra_0.16-2 evaluate_1.0.5 glue_1.8.0
## [41] farver_2.1.2 rmarkdown_2.30 tools_4.5.2 pkgconfig_2.0.3
## [45] htmltools_0.5.8.1