Contents

1 Introduction

The Self-training Gene Clustering Pipeline (SGCP) is a robust framework designed for constructing and analyzing gene co-expression networks. Its primary objective is to organize genes into clusters, or modules, based on their similar expression patterns within these networks. SGCP introduces several innovative steps that facilitate the identification of highly enriched modules through an unsupervised approach. What sets SGCP apart from existing frameworks is its incorporation of a unique semi-supervised clustering method that integrates Gene Ontology (GO) information. This additional step significantly enhances the quality of the identified modules. SGCP ultimately produces modules that are notably enriched with relevant biological insights. For a comprehensive overview of SGCP, including detailed methodology and results, refer to the preprint of the manuscript available here.

1.1 SGCP installation

To install the package SGCP package and its dependencies, follow these steps. For more information please visit here). In this manual guide, SGCP also relies on two more packages; SummarizedExperiment and org.Hs.eg.db.

library(BiocManager)
BiocManager::install('SGCP')

Let’s start.

library(SGCP)

1.2 SGCP General Input

SGCP requires three main inputs: gene expression data, gene identifiers, and a genome-wide annotation database.

  • expData: A matrix or dataframe of size m*n, where m is the number of genes and n is the number of samples. It can be either DNA-microarray or RNA-seq data. In expData, rows represent genes, columns represent samples, and the entry at position (i, j) is the expression value for gene i in sample j. SGCP does not perform normalization or batch effect correction; these preprocessing steps should be completed beforehand.

  • geneID: A vector of size m, where each entry at index i denotes the gene identifier for gene i. There is a one-to-one correspondence between the rows in expData and the entries in the geneID vector, meaning index i in geneID corresponds to row i in expData.

  • annotation_db: The name of a genome-wide annotation package for the organism of interest, used for the gene ontology (GO) enrichment step. The annotation_db must be installed by the user before using SGCP.

Here, are some important annotation_db along with its corresponding identifiers.

organism annotation_db gene identifier
Homo sapiens (Hs) org.Hs.eg.db Entrez Gene identifiers
Drosophila melanogaster (Dm) org.Dm.eg.db Entrez Gene identifiers
Rattus norvegicus (Rn) org.Rn.eg.db Entrez Gene identifiers
Mus musculus (Mm) org.Mm.eg.db Entrez Gene identifiers
Arabidopsis thaliana (At) org.At.tair.db TAIR identifiers

Note that genes: * must have expression values across all samples (i.e. no missing value). * must have non-zero variance across all the samples. * must have exactly one unique identifier, say geneID. * must have GO annotation.

Note that SGCP depends on GOstats for GO enrichment, thus (geneID) and (annotation_db) must be compatible to this package standard.

1.2.1 SGCP Input Example

For illustrative purposes we will use an example of a gene expression provided with SGCP. This dataset originally represent 5000 genes in 57 samples for Homo sapiens, for more information see (Cheng et al.) Here, to ease the computation, 1000 genes with the highest variance and 5 samples have been selected from the original dataset.

For this input example we need to install the following packages.

The input example is organized as an object of SummarizedExperiment. The assay field contains the expression matrix, where rows correspond to genes and columns correspond to samples. The rowData field denotes the gene Entrez IDs. There is a one-to-one correspondence between rows in the assay and rowData fields; thus, rowData at index i shows the gene Entrez ID for the gene at row i in the assay field. We call this object cheng.

The annotation_db used is org.Hs.eg.db, which must be installed if it is not already. For installation, see link. Let’s look at the input example.

library(SummarizedExperiment)
data(cheng)
cheng
## class: SummarizedExperiment 
## dim: 1500 10 
## metadata(0):
## assays(1): Normalized gene expression Of ischemic cardiomyopathy (ICM)
## rownames(1500): 7503 100462977 ... 1152 11082
## rowData names(2): SYMBOL ENTREZID
## colnames(10): SRR7426784 SRR7426785 ... SRR7426792 SRR7426793
## colData names(1): sampleName
print("gene expression...")
## [1] "gene expression..."
print("rownames and colnames correspond to gene Entrez ids and sample names")
## [1] "rownames and colnames correspond to gene Entrez ids and sample names"
head(assay(cheng))
##           SRR7426784 SRR7426785 SRR7426786 SRR7426787 SRR7426788 SRR7426789
## 7503        4.608516   4.822920  11.477016  11.164324  10.631023   5.372338
## 100462977   6.566498  11.170815  10.721761   7.120300  10.991193  12.889641
## 4878        6.925341   9.098948   9.550381   7.742856   7.539139   6.818301
## 4879        6.020677  10.545173   6.398164   8.415020   7.936110   5.492805
## 6192        9.256628   9.511192   4.276300   3.705296   3.392204   9.581542
## 9086        9.098932   9.110217   3.858846   3.077543   2.923131   9.426101
##           SRR7426790 SRR7426791 SRR7426792 SRR7426793
## 7503        4.618841   4.628351   4.398288   4.686936
## 100462977  12.024684  12.295130   7.481746   6.661691
## 4878        6.181450   6.083118   9.985733   7.622827
## 4879        5.255798   6.266800   8.812797   8.678046
## 6192        9.211812   9.109689   9.309379   9.158489
## 9086        9.567030   9.557452   9.316588   9.059718
print(" \n gene ids...")
## [1] " \n gene ids..."
print("rownames are the gene Entrez ids")
## [1] "rownames are the gene Entrez ids"
head(rowData(cheng))
## DataFrame with 6 rows and 2 columns
##                SYMBOL    ENTREZID
##           <character> <character>
## 7503             XIST        7503
## 100462977    MTRNR2L1   100462977
## 4878             NPPA        4878
## 4879             NPPB        4879
## 6192           RPS4Y1        6192
## 9086           EIF1AY        9086

This data has the dimension of 1500 genes and 10 samples.

Now, we are ready to create the three main inputs. Using assay and rowData functions we can have access to expression matrix and gene Entrez IDs respectively.

message("expData")
## expData
expData <- assay(cheng)
head(expData)
##           SRR7426784 SRR7426785 SRR7426786 SRR7426787 SRR7426788 SRR7426789
## 7503        4.608516   4.822920  11.477016  11.164324  10.631023   5.372338
## 100462977   6.566498  11.170815  10.721761   7.120300  10.991193  12.889641
## 4878        6.925341   9.098948   9.550381   7.742856   7.539139   6.818301
## 4879        6.020677  10.545173   6.398164   8.415020   7.936110   5.492805
## 6192        9.256628   9.511192   4.276300   3.705296   3.392204   9.581542
## 9086        9.098932   9.110217   3.858846   3.077543   2.923131   9.426101
##           SRR7426790 SRR7426791 SRR7426792 SRR7426793
## 7503        4.618841   4.628351   4.398288   4.686936
## 100462977  12.024684  12.295130   7.481746   6.661691
## 4878        6.181450   6.083118   9.985733   7.622827
## 4879        5.255798   6.266800   8.812797   8.678046
## 6192        9.211812   9.109689   9.309379   9.158489
## 9086        9.567030   9.557452   9.316588   9.059718
dim(expData)
## [1] 1500   10
message(" \n geneID")
##  
##  geneID
geneID <- rowData(cheng)
geneID <- geneID$ENTREZID
head(geneID)
## [1] "7503"      "100462977" "4878"      "4879"      "6192"      "9086"
length(geneID)
## [1] 1500
library(org.Hs.eg.db)
## Loading required package: AnnotationDbi
annotation_db <- "org.Hs.eg.db"

1.3 SGCP Pipeline Parameters and Workflow

SGCP operates through five main steps to produce the final modules. Each step provides parameters that can be adjusted by the user. Each step is implemented as a single function.

  1. Network Construction step (adjacencyMatrix function)
    • calibration: Boolean, default FALSE. If TRUE SGCP performs calibration step for more information see link.
    • norm: Boolean, default TRUE. If TRUE SGCP divides each genes (rows) by its norm2.
    • tom: Boolean, default TRUE. If TRUE SGCP adds TOM to the network.
    • saveAdja: Boolean, default FALSE. If TRUE, the adjacency matrix will be saved .
    • adjaNameFile: string indicating the name of file for saving adjacency matrix.
    • hm: string indicates the name of file for saving adjacency matrix heat map.
  2. Network Clustering step (clustering function)
    • kopt: An integer, default is NULL. Denotes the optimal number of clusters k by the user.
    • method: Either “relativeGap”, “secondOrderGap”, “additiveGap”, or NULL (defualt), Defines the method for number of cluster.
    • func.GO: A function for gene ontology validation, default is sum.
    • func.conduct: A function for conductance validation, default is min.
    • maxIter: An integer identifying the maximum number of iteration in kmeans.
    • numStart: An integer identifying the number of starts in kmeans.
    • saveOrig: Boolean, default TRUE. If TRUE, keeps the transformed matrix.
    • n_egvec: Either “all” or an integer, default is 100. Indicates the number of columns of transformed matrix to be kept.
    • sil: Boolean, default FALSE. If TRUE, calculates silhouette index per gene. at the end of this step, initial clusters are produced.
  3. Gene Ontology Enrichment step (geneOntology function)
    • direction: Test direction, default is c(“over”, “under”), for over-represented, or under-represented GO terms
    • ontology: GO ontologies, default c(“BP”, “CC”, “MF”), where BP stands for Biological Process, CC for Cellular Component, and MF for Molecular Function.
    • hgCutoff: A numeric value in 0 and 1, default is 0.05. GO terms with p-values smaller than hgCutoffare retained.
    • cond: Boolean, default is TRUE. If TRUE, performs conditional hypergeometric test.
  4. Gene Semi-labeling step (semiLabeling function)
    • cutoff: a numeric in 0 and 1, default is NULL. This baseline is used to determined the significance of GO terms, distinguishing between remarkable and unremarkable genes.
    • percent: a number in 0 and 1, default is 0.1. Indicates the percentile used to identify top GO terms.
    • stp: a number in 0 and ,1 default is 0.01. Specifies the increment added to the percent parameter.
  5. Semi-supervised Classification step (semiSupevised function)
    • model: Either “knn” for k-nearest neighbors or “lr” for logistic regression, specifying the classification model.
    • kn: An integer, default is NULL, indicating the number of neighbors in k-nearest neighbors (knn). If kn is NULL, then kn is determined as follow kn = 20 if 2 * k < 30 otherwise kn = 20 : 30,

At the end of this step, final modules are produced. Subsequently, SGCP performs an additional step of Gene Ontology Enrichment on those final modules.

Detailed of the steps are available in the manuscript in SGCP.

In Network Construction, user can identify of any of steps to be added to the network. In Network Clustering, SGCP behaves as follows; If kopt is not NULL, SGCP finds clusters based on kopt. If kopt is NULL and method is not NULL, SGCP picks k based on the selected method. If geneID and annotation_db are NULL, SGCP determines the optimal method and its corresponding number of cluster based on conductance validation. It selects the method that func.conduct (default min) on its cluster is minimum. Otherwise, SGCP will use gene ontology validation (default) to find the optimal method and its corresponding number of clusters. It performs gene ontology enrichment on the cluster with minimum conductance index per method and selects the method that has the maximum func.GO (default sum) over -log10 of p-values. In Gene Semi-labeling step, if cutoff is not NULL, SGCP considers genes associated to GO terms more significant than\(~\)cutoff\(~\) as remarkable. Otherwise, SGCP collects all GO terms from all clusters and selects the top percent (default 0.1) most significant GO terms among them. If genes associated to these terms come from more than a single cluster, SGCP these genes as remarkable. Otherwise, it increases\(~\)percent\(~\)to\(~\)stp\(~\)and repeat this process until remarkable genes come from at least two clusters. In Semi-supervised Classification, remarkable clusters are the clusters that have at least one remarkable gene.

2 Automatic Run, ezSGCP Function

The ezSGCP function implements the SGCP pipeline in one function. Parameters align with those specified in the here with specific mappings for clarity.

*__Network Construction__: Parameter `calib` corresponds to `calibration`.
*__Network Clustering__: Parameters `method_k`, `f.GO`, `f.conduct`, `maxIteration`, and `numberStart` correspond to `method`, `func.GO`, `func.conduct`, `maxIter`, and `numStart`, respectively.
*__Gene Ontology Enrichment__: Parameters `dir`, `onto`, `hgCut`, and `condTest` correspond to `direction`, `ontology`, `hgCutoff`, and `cond`, respectively.
* __Gene Ontology Enrichment__: The boolean parameter `semilabel` controls whether `SGCP` continues to the Semi-supervised Classification step (`TRUE`) or stops after initial clusters (`FALSE`).

For a comprehensive description of each parameter, refer to the help documentation.

Below is an example of how to call this function. Due to its computation time (up to 15 minutes), the result has been precomputed and stored as sgcp, with the function call commented out. To run it yourself, uncomment the function call. In this example, the sil parameter is set to TRUE to compute the gene silhouette index.

# sgcp <- ezSGCP(expData = expData, geneID = geneID, annotation_db = annotation_db,
#               eff.egs = FALSE , saveOrig = FALSE, sil = TRUE, hm = NULL)
data(sgcp)
summary(sgcp, show.data=TRUE)
##                Length Class      Mode   
## semilabel       1     -none-     logical
## clusterLabels   3     data.frame list   
## clustering     13     -none-     list   
## initial.GO      2     -none-     list   
## semiLabeling    2     -none-     list   
## semiSupervised  3     -none-     list   
## final.GO        2     -none-     list

If you run the ezSGCP function, it may provide status updates during execution. Here’s an example of what you might see:

##  starting network construction step...
## normalization...
## Gaussian kernel...
## it may take time...
## TOM...
##  it may take time...
## network is created, done!...

##  starting network clustering step...
## calculating normalized Laplacian 
##  it may take time...
## calculating eigenvalues/vectors 
##  it may take time...
## n_egvec is 100
## number of clusters for relativeGap method is 2
## number of clusters for secondOrderGap method is 5
## number of clusters for additiveGap method is 3

##  Gene Ontology Validation...
## method relativeGap....
## calling GOstats for overBP...
## identifying  genes for each GOTerm...
## calling GOstats for overCC...
## identifying  genes for each GOTerm...
## calling GOstats for overMF...
## identifying  genes for each GOTerm...
## calling GOstats for underBP...
## identifying  genes for each GOTerm...
## calling GOstats for underCC...
## identifying  genes for each GOTerm...
## calling GOstats for underMF...
## identifying  genes for each GOTerm...

##  method secondOrderGap...
## calling GOstats for overBP...
## identifying  genes for each GOTerm...
## calling GOstats for overCC...
## identifying  genes for each GOTerm...
## calling GOstats for overMF...
## identifying  genes for each GOTerm...
## calling GOstats for underBP...
## identifying  genes for each GOTerm...
## calling GOstats for underCC...
## identifying  genes for each GOTerm...
## calling GOstats for underMF...
## identifying  genes for each GOTerm...

##  method additiveGap....
## calling GOstats for overBP...
## identifying  genes for each GOTerm...
## calling GOstats for overCC...
## identifying  genes for each GOTerm...
## calling GOstats for overMF...
## identifying  genes for each GOTerm...
## calling GOstats for underBP...
## identifying  genes for each GOTerm...
## calling GOstats for underCC...
## identifying  genes for each GOTerm...
## calling GOstats for underMF...
## identifying  genes for each GOTerm...

##  method relativeGap is selected using GO validation and k is 2
## calculating the Silhouette index 
##  it may take time...
## network clustering is done...

## starting initial gene ontology enrichment step...
## GOenrichment for cluster 2
## calling GOstats for overBP...
## identifying  genes for each GOTerm...
## calling GOstats for overCC...
## identifying  genes for each GOTerm...
## calling GOstats for overMF...
## identifying  genes for each GOTerm...
## calling GOstats for underBP...
## identifying  genes for each GOTerm...
## calling GOstats for underCC...
## identifying  genes for each GOTerm...
## calling GOstats for underMF...
## identifying  genes for each GOTerm...
## GOenrichment for cluster 1
## calling GOstats for overBP...
## identifying  genes for each GOTerm...
## calling GOstats for overCC...
## identifying  genes for each GOTerm...
## calling GOstats for overMF...
## identifying  genes for each GOTerm...
## calling GOstats for underBP...
## identifying  genes for each GOTerm...
## calling GOstats for underCC...
## identifying  genes for each GOTerm...
## calling GOstats for underMF...
## identifying  genes for each GOTerm...
## gene ontology done!..

## starting semi-labeling stage...
## cutoff value is 9.28130152105493e-05
## semiLabeling done!..

## starting semi-supervised step...
## creating train and test sets based on remarkable and unremarkable genes...
## number of remarkable genes is 1165
## number of unremarkable genes is 335
## performing knn...
##  it may take time
## Loading required package: ggplot2
## Loading required package: lattice
##             Length Class      Mode     
## learn       2      -none-     list     
## k           1      -none-     numeric  
## theDots     0      -none-     list     
## xNames      4      -none-     character
## problemType 1      -none-     character
## tuneValue   1      data.frame list     
## obsLevels   2      -none-     character
## param       0      -none-     list     
## 24-nearest neighbor model
## Training set outcome distribution:

##   1   2 
## 498 667 

## class assignment for unremarkable genes...
## top class labels, and bottom number of predicted genes
## prediction
##   1   2 
## 130 205 
## semi-supervised done!..

## starting final gene ontology enrichment step...
## GOenrichment for cluster 2
## calling GOstats for overBP...
## identifying  genes for each GOTerm...
## calling GOstats for overCC...
## identifying  genes for each GOTerm...
## calling GOstats for overMF...
## identifying  genes for each GOTerm...
## calling GOstats for underBP...
## identifying  genes for each GOTerm...
## calling GOstats for underCC...
## identifying  genes for each GOTerm...
## calling GOstats for underMF...
## identifying  genes for each GOTerm...
## GOenrichment for cluster 1
## calling GOstats for overBP...
## identifying  genes for each GOTerm...
## calling GOstats for overCC...
## identifying  genes for each GOTerm...
## calling GOstats for overMF...
## identifying  genes for each GOTerm...
## calling GOstats for underBP...
## identifying  genes for each GOTerm...
## calling GOstats for underCC...
## identifying  genes for each GOTerm...
## calling GOstats for underMF...
## identifying  genes for each GOTerm...
## gene ontology done!..

## ezSGCP done!..

As observed, SGCP provides detailed information at each step of its execution. In this example, the analysis considers three possible numbers of clusters: 2, 3, and 5. Following gene ontology validation, the optimal number of clusters is determined to be 2.

While SGCP generally performs well with default parameters, users have the option to customize settings according to their specific needs. For example, enabling calib can enhance the enrichment of the final modules in certain cases. Similarly, the addition of TOM may not always be beneficial.

By default, SGCP operates under the assumption that users do not have prior knowledge of the exact number of clusters or the optimal method for clustering. Therefore, it utilizes gene ontology validation to establish initial clusters. We highly recommend that users evaluate SGCP using all three potential methods: “relativeGap”, “secondOrderGap”, and “additiveGap”.

SGCP offers visualization capabilities to users. The SGCP_ezPLOT function accepts the sgcp result from ezSGCP, along with expData and geneID inputs. It generates two files, an Excel spreadsheet and a PDF, depending on the initial call. Users can also retain the plotting object by setting keep = TRUE. In this example, we enable silhouette_in to visualize the silhouette index plot.

message("PCA of expression data without label")
SGCP_ezPLOT(sgcp = sgcp, expreData = expData, silhouette_index = TRUE, keep = FALSE)

Note that in order to store files in xlsx format, Excel must be installed on your system.

For a detailed explanation of parameters, please refer to the help document.

3 Step-By-Step Run, SGCP In Detail

The ezSGCP function described earlier acts as a wrapper, sequentially calling several other SGCP functions in the following order. Let’s explore each step in detail using the example data.

  1. adjacencyMatrix function, performs Network Construction step
  2. clustering function, performs Network Clustering step
  3. geneOntology function, performs Gene Ontology Enrichment step (produces initial clusters)
  4. semiLabeling function, performs Semi-labeling step
  5. semiSupervised function, performs Semi-supervised step
  6. geneOntology function, performs Gene Ontology Enrichment step (produces final modules)

To conserve space, let’s omit the details about sgcp.

rm(sgcp)

Here, we’ll skip the details of the function’s input and output, as they are described in the help document. SGCP enables users to visualize the Principal Component Analysis (PCA) of the input gene expression data using the SGCP_plot_pca function.

message("Plotting PCA of expression data")
## Plotting PCA of expression data
pl <- SGCP_plot_pca(m = expData, clusLabs = NULL, tit = "PCA plot", ps = .5)
print(pl)

3.1 adjacencyMatrix function

The Network Construction step utilizes the adjacencyMatrix function. By default, this function normalizes each gene vector using its L2 norm and employs a Gaussian kernel metric for similarity computation. Additionally, it integrates the second-order neighborhood information of genes into the network as Topological Overlap Matrix (TOM). The code snippet below demonstrates its usage:

resAdja <- adjacencyMatrix(expData = expData, hm = NULL)
## normalization...
## Gaussian kernel...
## it may take time...
## TOM...
##  it may take time...
## network is created, done!...
resAdja[0:10, 0:5]
##               1            2            3            4            5
## 1  0.000000e+00 1.561704e-23 1.684533e-18 5.096042e-21 2.304447e-34
## 2  1.561704e-23 0.000000e+00 3.073427e-08 9.883634e-11 1.560804e-14
## 3  1.684533e-18 3.073427e-08 0.000000e+00 2.990119e-04 3.639579e-13
## 4  5.096042e-21 9.883634e-11 2.990119e-04 0.000000e+00 2.664746e-14
## 5  2.304447e-34 1.560804e-14 3.639579e-13 2.664746e-14 0.000000e+00
## 6  1.880926e-37 1.859059e-16 1.732464e-15 6.205653e-17 6.756836e-01
## 7  7.700277e-16 3.234158e-06 1.891155e-02 3.050233e-04 7.223951e-11
## 8  4.549591e-34 2.956936e-14 5.344484e-13 1.850700e-14 6.616417e-01
## 9  2.052374e-34 1.424767e-14 2.918285e-13 2.113249e-14 7.306804e-01
## 10 5.767117e-34 3.294000e-14 6.190725e-13 3.546475e-14 7.471861e-01

resAdja represents the adjacency matrix of the gene co-expression network. To visualize the heatmap of this adjacency matrix, you can use the SGCP_plot_heatMap function. Uncomment the following lines of code to view the heatmap.

#message("Plotting adjacency heatmap")
#pl <- SGCP_plot_heatMap(m = resAdja, tit = "Adjacency Heatmap", 
#                    xname = "genes", yname = "genes")
#print(pl)

3.2 clustering function

The Network Clustering step is implemented using clustering function, which takes the adjacency matrix, geneID, and annotation_db as inputs. Depending on the initial call, it returns a list of results. In this example, the sil parameter is set to TRUE to compute the silhouette index per cluster

  • dropped.indices: A vector of indices corresponding to dropped genes, which are considered noise and removed during processing.

  • geneID: A vector containing gene identifiers.

  • method: Indicates the method used to determine the number of clusters.

  • k: The number of clusters identified.

  • Y: Transformed matrix with dimensions (number of genes) x (2*k).

  • X: Eigenvalues corresponding to the first 2 * k columns in Y.

  • cluster: An object of class “kmeans” representing the clustering results.

  • clusterLabels: A vector containing the cluster labels assigned to each gene. There is a one-to-one correspondence between geneID and clusterLabels.

  • conductance: A list containing mean and median conductance indices, along with individual cluster conductance indices for each method. The clusterConductance field indexes denote cluster labels, and their values indicate the conductance index.

  • cvGOdf: A dataframe used for gene ontology validation, presenting enrichment results for each method on the cluster with the minimum conductance index.

  • cv: A string indicating the validation method for the number of clusters; “cvGO”:gene ontology validation, “cvConductance”: conductance validation, “userMethod”: user-defined method, “userkopt”: user-defined kopt.

  • clusterNumberPlot: A plotting object displaying results for methods such as “relativeGap”, “secondOrderGap”, and “additiveGap”.

  • silhouette: A dataframe showing silhouette indices for genes.

  • original:A list containing matrix transformation details, including eigenvalues (n_egvec), with the top columns of transformation retained.

Uncomment and run the following code snippet to execute the function. Since it may take some time, the results are stored in resClus.

# resClus <- clustering(adjaMat = resAdja, geneID = geneID, annotation_db = annotation_db,
#                       eff.egs = FALSE , saveOrig = FALSE, sil = TRUE)

data(resClus)
summary(resClus)
##                   Length Class      Mode     
## dropped.indices      0   -none-     numeric  
## geneID            1500   -none-     character
## method               1   -none-     character
## k                    1   -none-     numeric  
## Y                 6000   -none-     numeric  
## X                    4   -none-     numeric  
## cluster              9   kmeans     list     
## clusterLabels     1500   -none-     character
## conductance          3   -none-     list     
## cvGOdf               9   data.frame list     
## cv                   1   -none-     character
## clusterNumberPlot    3   -none-     list     
## silhouette           3   data.frame list
# lets drop noisy genes from expData

geneID <- resClus$geneID
if(length(resClus$dropped.indices)>0 ){ expData <- expData[-resClus$dropped.indices, ]}

# removing the adjacency matrix
rm(resAdja)

If you execute the code, the following information will be printed out for you.

## calculating normalized Laplacian 
##  it may take time...
## calculating eigenvalues/vectors 
##  it may take time...
## n_egvec is 100
## number of clusters for relativeGap method is 2
## number of clusters for secondOrderGap method is 5
## number of clusters for additiveGap method is 3

##  Gene Ontology Validation...
## method relativeGap....
## calling GOstats for overBP...
## identifying  genes for each GOTerm...
## calling GOstats for overCC...
## identifying  genes for each GOTerm...
## calling GOstats for overMF...
## identifying  genes for each GOTerm...
## calling GOstats for underBP...
## identifying  genes for each GOTerm...
## calling GOstats for underCC...
## identifying  genes for each GOTerm...
## calling GOstats for underMF...
## identifying  genes for each GOTerm...

##  method secondOrderGap...
## calling GOstats for overBP...
## identifying  genes for each GOTerm...
## calling GOstats for overCC...
## identifying  genes for each GOTerm...
## calling GOstats for overMF...
## identifying  genes for each GOTerm...
## calling GOstats for underBP...
## identifying  genes for each GOTerm...
## calling GOstats for underCC...
## identifying  genes for each GOTerm...
## calling GOstats for underMF...
## identifying  genes for each GOTerm...

##  method additiveGap....
## calling GOstats for overBP...
## identifying  genes for each GOTerm...
## calling GOstats for overCC...
## identifying  genes for each GOTerm...
## calling GOstats for overMF...
## identifying  genes for each GOTerm...
## calling GOstats for underBP...
## identifying  genes for each GOTerm...
## calling GOstats for underCC...
## identifying  genes for each GOTerm...
## calling GOstats for underMF...
## identifying  genes for each GOTerm...

##  method relativeGap is selected using GO validation and k is 2
## calculating the Silhouette index 
##  it may take time...
## network clustering is done...

We observed that three methods (“relativeGap”, “secondOrderGap”, “additiveGap”) produced three potentially distinct numbers of clusters. Following gene ontology validation, SGCP selected the “secondOrderGap” method, resulting in 2 clusters. The cv field under “cvGO” indicates the number of clusters determined through gene ontology validation. The original field provides the first n_egvec columns of eigenvectors and their corresponding eigenvalues from the transformed matrix, which can be useful for further investigation.

To visualize PCA plots of the expression and transformed data with and without labels, use SGCP_plot_pca. Uncomment the following lines to view the resulting PCAs.

message("Plotting PCA of expression data with label")
## Plotting PCA of expression data with label
# pl <- SGCP_plot_pca(m = expData, clusLabs = NULL, tit = "PCA plot without label", ps = .5)
# print(pl)
message("Plptting PCA of transformed data without label")
## Plptting PCA of transformed data without label
# pl <- SGCP_plot_pca(m = resClus$Y, clusLabs = NULL, tit = "PCA plot without label", ps = .5)
# print(pl)
message("Plotting PCA of transformed data with label")
## Plotting PCA of transformed data with label
# pl <- SGCP_plot_pca(m = resClus$Y, clusLabs = resClus$clusterLabels, tit = "PCA plot with label", ps = .5)
# print(pl)

In resClus, you can find plotting objects for each method used to determine the number of clusters.

if(resClus$cv == "cvGO" || resClus$cv == "cvConductance"){
    message("plotting relativeGap, secondOrderGap, additiveGap methods ...")
    print(resClus$clusterNumberPlot$relativeGap)
}
## plotting relativeGap, secondOrderGap, additiveGap methods ...

if(resClus$cv == "cvGO" || resClus$cv == "cvConductance"){
    message("plotting relativeGap, secondOrderGap, additiveGap methods ...")
    print(resClus$clusterNumberPlot$secondOrderGap)
}
## plotting relativeGap, secondOrderGap, additiveGap methods ...

if(resClus$cv == "cvGO" || resClus$cv == "cvConductance"){
    message("plotting relativeGap, secondOrderGap, additiveGap methods ...")
    print(resClus$clusterNumberPlot$additiveGap)
}
## plotting relativeGap, secondOrderGap, additiveGap methods ...

In SGCP, you can visualize the conductance index per cluster using the SGCP_plot_conductance function. The conductance index is calculated only when both kopt and method are NULL. If either parameter is specified, SGCP skips conductance computation as it does not need to validate the number of clusters.

# checking if conductance index is calculated
if(resClus$cv == "cvGO" || resClus$cv == "cvConductance" ){
    message("plotting cluster conductance index...")
    pl <- SGCP_plot_conductance(conduct = resClus$conductance, 
                        tit = "Cluster Conductance Index", 
                        xname = "clusterLabel", yname = "conductance index")
    print(pl)}
## plotting cluster conductance index...

In here We observed that clusters with lower conductance indices tend to exhibit higher enrichment. This analysis confirms our observation. In the clustering function, gene ontology enrichment is compared among clusters with the minimum conductance for the methods “relativeGap” (cluster rg1), “secondOrderGap” (cluster sg4), and “additiveGap” (cluster ag1). Among them, cluster rg1 shows higher enrichment. Interestingly, this cluster also demonstrates a lower conductance index compared to ag1 and sg4.

In SGCP, you can also plot the silhouette index per gene if the sil parameter is set to TRUE in the clustering function.

# checking if silhouette index is calculated
if(resClus$cv == "cvGO" || resClus$cv == "cvConductance" ){
    message("plotting gene silhouette index...")
    pl <- SGCP_plot_silhouette(resClus$silhouette, tit = "Gene Silhouette Index",
                            xname = "genes", yname = "silhouette index")
    print(pl)}
## plotting gene silhouette index...

The silhouette index ranges between -1 and 1, where values closer to 1 indicate better clustering fit for genes. As observed, the majority of genes have a silhouette index very close to 1, suggesting they are well-described by the clusters. Interestingly, in the worst-case scenario, some genes have a silhouette index of zero, with no negative indices observed for any gene.

3.3 geneOntology function

The Gene Ontology Enrichment step is executed using the geneOntology function. This function calculates gene ontology enrichment on the clusters and returns two outputs; GOresults, results of gene ontology enrichment and FinalGOTermGenes, a list that maps gene IDs to the corresponding GO terms identified in GOresults. r Please uncomment the code above to execute the function and store the results in resInitialGO, keeping in mind that it may take some time to complete.

# resInitialGO <- geneOntology(geneUniv = geneID, clusLab = resClus$clusterLabels, 
#                              annotation_db = annotation_db)

data(resInitialGO)

head(resInitialGO$GOresults)
##   clusterNum  GOtype       GOID       Pvalue OddsRatio  ExpCount Count Size
## 1          1 underMF GO:0005201 5.280629e-15 0.1084677  41.36170    10   72
## 2          1 underMF GO:0005198 2.081705e-11 0.2448824  62.04255    29  108
## 3          1 underCC GO:0031012 1.276834e-10 0.3605085 107.53125    67  186
## 4          1 underCC GO:0030312 2.159404e-10 0.3663565 108.10938    68  187
## 5          1 underCC GO:0005576 1.076738e-09 0.5227524 350.34375   294  606
## 6          1 underCC GO:0071944 1.653199e-09 0.5292955 446.31250   390  772
##                                          Term
## 1 extracellular matrix structural constituent
## 2                structural molecule activity
## 3                        extracellular matrix
## 4            external encapsulating structure
## 5                        extracellular region
## 6                              cell periphery

If you execute the code, the following information will be printed out for you.

## GOenrichment for cluster 1
## calling GOstats for overBP...
## identifying  genes for each GOTerm...
## calling GOstats for overCC...
## identifying  genes for each GOTerm...
## calling GOstats for overMF...
## identifying  genes for each GOTerm...
## calling GOstats for underBP...
## identifying  genes for each GOTerm...
## calling GOstats for underCC...
## identifying  genes for each GOTerm...
## calling GOstats for underMF...
## identifying  genes for each GOTerm...
## GOenrichment for cluster 2
## calling GOstats for overBP...
## identifying  genes for each GOTerm...
## calling GOstats for overCC...
## identifying  genes for each GOTerm...
## calling GOstats for overMF...
## identifying  genes for each GOTerm...
## calling GOstats for underBP...
## identifying  genes for each GOTerm...
## calling GOstats for underCC...
## identifying  genes for each GOTerm...
## calling GOstats for underMF...
## identifying  genes for each GOTerm...
## gene ontology done!..

Above shows the enrichment dataframe returned from GOstats (link)[https://bioconductor.org/packages/release/bioc/html/GOstats.html] package. clusterNum indicates the cluster label.

SGCP offers four functions to visualize gene ontology enrichment results: SGCP_plot_jitter, SGCP_plot_density, SGCP_plot_bar, and SGCP_plot_pie. It’s important to note that p-values are log-transformed in these visualizations. Therefore, higher points indicate more significant GO terms.

message("plotting initial GO p-values jitters...")
## plotting initial GO p-values jitters...
pl <- SGCP_plot_jitter(df = resInitialGO$GOresults, 
                    tit = "Initial GO p-values",
                    xname = "cluster", yname = "-log10 p-value", ps = 3)
print(pl)

message("plotting initial GO p-values density")
## plotting initial GO p-values density
pl <- SGCP_plot_density(df = resInitialGO$GOresults, 
                    tit = "Initial GO p-values Density",
                    xname = "cluster", yname = "-log10 p-value")

print(pl)
## Picking joint bandwidth of 0.199

message("plotting initial GO p-values mean")
## plotting initial GO p-values mean
pl <- SGCP_plot_bar(df = resInitialGO$GOresults, tit = "Cluster Performance",
                    xname = "cluster", yname = "mean -log10 p-value")
print(pl)

message("plotting initial GO p-values pie chart...")
## plotting initial GO p-values pie chart...
pl <- SGCP_plot_pie(df = resInitialGO$GOresults, tit = "Initial GO Analysis",
                xname = "cluster", yname = "count", posx = 1.8)
print(pl)

Each segment in the plot is labeled with the log-transformed p-value of the most significant term found in that segment.

3.4 semiLabeling function

The Gene Semi-labeling step utilizes the semiLabeling function. This function takes the output from the geneOntology function and returns a dataframe that identifies remarkable and unremarkable genes, along with the associated cutoff value.

resSemiLabel <- semiLabeling(geneID = geneID, 
                            df_GO = resInitialGO$GOresults, 
                            GOgenes = resInitialGO$FinalGOTermGenes)
## cutoff value is 9.28130152105493e-05
## semiLabeling done!..
print(head(resSemiLabel$geneLabel))
##      geneID label
## 1      7503     1
## 2 100462977  <NA>
## 3      4878     2
## 4      4879     2
## 5      6192     1
## 6      9086  <NA>
table(resSemiLabel$geneLabel$label)
## 
##   1   2 
## 667 498

In the table above, genes are listed with their corresponding cluster labels if they are remarkable; otherwise, they are labeled as NA. A cluster is considered remarkable if it contains at least one remarkable gene. Therefore, both clusters shown are remarkable here. In our discussion here, we highlighted that if the number of clusters exceeds 2, SGCP will eliminate unremarkable clusters through the Semi-labeling and Semi-supervised steps. Genes from these clusters are reassigned to remarkable clusters. Consequently, the number of modules may decrease while module enrichment increases. These steps modify the original clusters produced by the clustering function, resulting in a different set of clusters. Therefore, SGCP returns two sets of results: (i) immediately after the clustering function, referred to as clusters, and (ii) after the semiSupervised function, referred to as modules. In our manuscript, we distinguish these as pSGCP and SGCP, respectively.

3.5 semiSupervised function

The Gene Semi-supervised step utilizes the semiSupervised function. The final module labeling is provided in FinalLabeling, which is a dataframe containing geneIDs with their corresponding semi-labels and final labels. Uncomment and run the following code to perform gene semi-supervised labeling.

resSemiSupervised <- semiSupervised(specExp = resClus$Y, 
                                geneLab = resSemiLabel$geneLabel,
                                model = "knn", kn = NULL)
## creating train and test sets based on remarkable and unremarkable genes...
## number of remarkable genes is 1165
## number of unremarkable genes is 335
## performing knn...
##  it may take time
## Loading required package: ggplot2
## Loading required package: lattice
##             Length Class      Mode     
## learn       2      -none-     list     
## k           1      -none-     numeric  
## theDots     0      -none-     list     
## xNames      4      -none-     character
## problemType 1      -none-     character
## tuneValue   1      data.frame list     
## obsLevels   2      -none-     character
## param       0      -none-     list     
## 24-nearest neighbor model
## Training set outcome distribution:
## 
##   1   2 
## 667 498
## class assignment for unremarkable genes...
## top class labels, and bottom number of predicted genes
## prediction
##   1   2 
## 205 130
## semi-supervised done!..
print(head(resSemiSupervised$FinalLabeling))
##      geneID semiLabel FinalLabel
## 1      7503         1          1
## 2 100462977      <NA>          2
## 3      4878         2          2
## 4      4879         2          2
## 5      6192         1          1
## 6      9086      <NA>          1
print(table(resSemiSupervised$FinalLabeling$FinalLabel))
## 
##   1   2 
## 872 628

In this step, the k-nearest neighbor model is selected, and its hyper-parameters are chosen through cross-validation based on the accuracy metric. The table above displays the gene semi-labeling and final labeling.

You can visualize the PCA plot with the final labeling using the SGCP_plot_pca function. Uncomment the following code to view the resulting PCAs:

# message("Plotting PCA of transformed data with label")
# pl <- SGCP_plot_pca(m = expData, 
#                clusLabs = resSemiSupervised$FinalLabeling$FinalLabel, 
#                tit = "PCA plot with label", ps = .5)
print(pl)

# message("Plotting PCA of transformed data with label")
# pl <- SGCP_plot_pca(m = resClus$Y, 
#                clusLabs = resSemiSupervised$FinalLabeling$FinalLabel, 
#                tit = "PCA plot with label", ps = .5)
#print(pl)

3.6 geneOntology function

Finally, the Gene Ontology Enrichment step is performed once more on the final modules to enhance module enrichment. Again, due to its time-consuming nature, the code is commented out, and the output is stored in resFinalGO. Uncomment and run the following code for practice:

# resFinalGO <- geneOntology(geneUniv = geneID, clusLab = resSemiSupervised$FinalLabeling$FinalLabel, 
#                              annotation_db = annotation_db)

data(resFinalGO)
head(resFinalGO$GOresults)
##   clusterNum  GOtype       GOID       Pvalue OddsRatio  ExpCount Count Size
## 1          1 underMF GO:0005201 4.792355e-15 0.1081310  41.41277    10   72
## 2          1 underMF GO:0005198 1.872816e-11 0.2440998  62.11915    29  108
## 3          1 underCC GO:0031012 1.118311e-10 0.3593320 107.65761    67  186
## 4          1 underCC GO:0030312 1.893638e-10 0.3651603 108.23641    68  187
## 5          1 underCC GO:0005576 8.138271e-10 0.5201268 350.75543   294  606
## 6          1 underCC GO:0071944 1.165513e-09 0.5259400 446.83696   390  772
##                                          Term
## 1 extracellular matrix structural constituent
## 2                structural molecule activity
## 3                        extracellular matrix
## 4            external encapsulating structure
## 5                        extracellular region
## 6                              cell periphery

Upon running the code, you will observe the following output:

## GOenrichment for cluster 1
## calling GOstats for overBP...
## identifying  genes for each GOTerm...
## calling GOstats for overCC...
## identifying  genes for each GOTerm...
## calling GOstats for overMF...
## identifying  genes for each GOTerm...
## calling GOstats for underBP...
## identifying  genes for each GOTerm...
## calling GOstats for underCC...
## identifying  genes for each GOTerm...
## calling GOstats for underMF...
## identifying  genes for each GOTerm...
## GOenrichment for cluster 2
## calling GOstats for overBP...
## identifying  genes for each GOTerm...
## calling GOstats for overCC...
## identifying  genes for each GOTerm...
## calling GOstats for overMF...
## identifying  genes for each GOTerm...
## calling GOstats for underBP...
## identifying  genes for each GOTerm...
## calling GOstats for underCC...
## identifying  genes for each GOTerm...
## calling GOstats for underMF...
## identifying  genes for each GOTerm...
## gene ontology done!..

Now, let’s examine the corresponding plots for final module enrichment.

message("plotting final GO p-values jitters...")
## plotting final GO p-values jitters...
pl <- SGCP_plot_jitter(df = resFinalGO$GOresults, tit = "Final GO p-values",
                    xname = "module", yname = "-log10 p-value", ps = 3)
print(pl)

message("plotting final GO p-values density...")
## plotting final GO p-values density...
pl <- SGCP_plot_density(df = resFinalGO$GOresults, 
                    tit = "Final GO p-values Density",
                    xname = "module", yname = "-log10 p-value")
print(pl)
## Picking joint bandwidth of 0.197

rm(pl)
message("plotting final GO p-values mean...")
## plotting final GO p-values mean...
pl <- SGCP_plot_bar(df = resFinalGO$GOresults, tit = "Module Performance",
                xname = "module", yname = "mean -log10 p-value")
print(pl)

rm(pl)
message("plotting final GO p-values pie xhart...")
## plotting final GO p-values pie xhart...
pl <- SGCP_plot_pie(df = resFinalGO$GOresults, tit = "Final GO Analysis",
                xname = "module", yname = "count", posx = 1.9)
print(pl)

rm(pl)

Comparing this result with the initial clusters, we observe a slight increase in gene ontology enrichment. We have found that when all initial clusters are remarkable, as observed here, the Semi-labeling and *Semi-supervised** steps do not fundamentally change the gene cluster labels, resulting in convergence of initial clusters towards final clusters. However, if initial clusters include non-significant ones, SGCP eliminates these clusters, enhancing the enrichment of remarkable clusters. Consequently, the final modules differ from the initial clusters. Further details are available in the manuscript. SGCP manuscript.

rm(resClus, resInitialGO, resSemiLabel, resSemiSupervised, resFinalGO, pl)
sI <- sessionInfo()
sI
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] caret_6.0-94                lattice_0.22-6             
##  [3] ggplot2_3.5.1               org.Hs.eg.db_3.20.0        
##  [5] AnnotationDbi_1.68.0        SummarizedExperiment_1.36.0
##  [7] Biobase_2.66.0              GenomicRanges_1.58.0       
##  [9] GenomeInfoDb_1.42.0         IRanges_2.40.0             
## [11] S4Vectors_0.44.0            BiocGenerics_0.52.0        
## [13] MatrixGenerics_1.18.0       matrixStats_1.4.1          
## [15] SGCP_1.6.0                  knitr_1.48                 
## [17] BiocStyle_2.34.0           
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3      rstudioapi_0.17.1       jsonlite_1.8.9         
##   [4] magrittr_2.0.3          magick_2.8.5            farver_2.1.2           
##   [7] rmarkdown_2.28          zlibbioc_1.52.0         vctrs_0.6.5            
##  [10] memoise_2.0.1           RCurl_1.98-1.16         tinytex_0.53           
##  [13] htmltools_0.5.8.1       S4Arrays_1.6.0          cellranger_1.1.0       
##  [16] pROC_1.18.5             SparseArray_1.6.0       parallelly_1.38.0      
##  [19] sass_0.4.9              bslib_0.8.0             plyr_1.8.9             
##  [22] lubridate_1.9.3         rootSolve_1.8.2.4       cachem_1.1.0           
##  [25] lifecycle_1.0.4         iterators_1.0.14        pkgconfig_2.0.3        
##  [28] Matrix_1.7-1            R6_2.5.1                fastmap_1.2.0          
##  [31] GenomeInfoDbData_1.2.13 future_1.34.0           digest_0.6.37          
##  [34] Exact_3.3               colorspace_2.1-1        RSpectra_0.16-2        
##  [37] RSQLite_2.3.7           labeling_0.4.3          timechange_0.3.0       
##  [40] fansi_1.0.6             httr_1.4.7              abind_1.4-8            
##  [43] compiler_4.4.1          proxy_0.4-27            bit64_4.5.2            
##  [46] withr_3.0.2             DBI_1.2.3               highr_0.11             
##  [49] MASS_7.3-61             lava_1.8.0              DelayedArray_0.32.0    
##  [52] gld_2.6.6               ModelMetrics_1.2.2.2    Category_2.72.0        
##  [55] tools_4.4.1             zip_2.3.1               future.apply_1.11.3    
##  [58] nnet_7.3-19             glue_1.8.0              nlme_3.1-166           
##  [61] grid_4.4.1              reshape2_1.4.4          generics_0.1.3         
##  [64] recipes_1.1.0           gtable_0.3.6            class_7.3-22           
##  [67] data.table_1.16.2       lmom_3.2                utf8_1.2.4             
##  [70] XVector_0.46.0          foreach_1.5.2           pillar_1.9.0           
##  [73] stringr_1.5.1           genefilter_1.88.0       splines_4.4.1          
##  [76] dplyr_1.1.4             survival_3.7-0          bit_4.5.0              
##  [79] annotate_1.84.0         tidyselect_1.2.1        RBGL_1.82.0            
##  [82] GO.db_3.20.0            Biostrings_2.74.0       bookdown_0.41          
##  [85] xfun_0.48               expm_1.0-0              hardhat_1.4.0          
##  [88] timeDate_4041.110       stringi_1.8.4           UCSC.utils_1.2.0       
##  [91] yaml_2.3.10             boot_1.3-31             evaluate_1.0.1         
##  [94] codetools_0.2-20        tibble_3.2.1            Rgraphviz_2.50.0       
##  [97] BiocManager_1.30.25     graph_1.84.0            cli_3.6.3              
## [100] rpart_4.1.23            xtable_1.8-4            DescTools_0.99.57      
## [103] munsell_0.5.1           jquerylib_0.1.4         Rcpp_1.0.13            
## [106] readxl_1.4.3            globals_0.16.3          png_0.1-8              
## [109] parallel_4.4.1          XML_3.99-0.17           gower_1.0.1            
## [112] blob_1.2.4              GOstats_2.72.0          AnnotationForge_1.48.0 
## [115] bitops_1.0-9            listenv_0.9.1           mvtnorm_1.3-1          
## [118] GSEABase_1.68.0         ipred_0.9-15            scales_1.3.0           
## [121] prodlim_2024.06.25      e1071_1.7-16            ggridges_0.5.6         
## [124] purrr_1.0.2             openxlsx_4.2.7.1        crayon_1.5.3           
## [127] rlang_1.1.4             KEGGREST_1.46.0