Contents

0.1 Preamble

library(COTAN)
library(zeallot)

# necessary to solve precedence of overloads
# (makes sure `%<-%` resolves to zeallot's multiple-assignment operator)
conflicted::conflict_prefer("%<-%", "zeallot")

# enable multi-processing (not on Windows)
# the previous options state is saved so it can be restored at the end
prevOptState <- options(parallelly.fork.enable = TRUE)

1 Introduction

This tutorial shows the available tools in COTAN to help with cells’ clustering

1.0.1 Setup

Define a directory where the data and the output will be stored

# temporary directory where the data and all outputs will be stored
dataDir <- file.path(tempdir(), "COTAN_vignette_data")
dir.create(dataDir, recursive = TRUE, showWarnings = FALSE)

print(dataDir)
#> [1] "/tmp/RtmpRABiog/COTAN_vignette_data"

# outputs (logs, intermediate clusterization files) go alongside the data
outDir <- dataDir

# Log-level 2 was chosen to showcase better how the package works
# In normal usage a level of 0 or 1 is more appropriate
setLoggingLevel(2L)
#> Setting new log level to 2

# This file will contain all the logs produced by the package
# as if at the highest logging level
setLoggingFile(file.path(outDir, "vignette_uniform_clustering.log"))
#> Setting log file to be: /tmp/RtmpRABiog/COTAN_vignette_data/vignette_uniform_clustering.log

1.1 Retrieving the data-set

Download the data-set for mouse cortex E17.5 from Yuzwa et al. (2017)

GEO <- "GSM2861514"

dir.create(file.path(dataDir, GEO), showWarnings = FALSE)

datasetFileName <-
  file.path(dataDir, GEO, "GSM2861514_E175_All_Cells_DGE.txt.gz")


# Try to fetch the data-set from GEO, with up to `maxAttempts` retries
maxAttempts <- 5L
attempts <- 0L
ok <- FALSE

for (attempt in seq_len(maxAttempts)) {
  attempts <- attempt

  if (!file.exists(datasetFileName)) {
    # errors from the download are swallowed here on purpose:
    # success is established via the file-existence check below
    res <- try(
      GEOquery::getGEOSuppFiles(
        GEO = GEO,
        makeDirectory = TRUE,
        baseDir = dataDir,
        filter_regex = basename(datasetFileName),
        fetch_files = TRUE
      ),
      silent = TRUE
    )
  }

  ok <- file.exists(datasetFileName)

  if (ok) {
    break
  }

  if (attempt < maxAttempts) {
    # brief pause before re-trying the download
    Sys.sleep(1)
  }
}

assertthat::assert_that(
  ok,
  msg = paste0(
    "Failed to retrieve file '", datasetFileName,
    "' after ", maxAttempts, " attempts."
  )
)
#> [1] TRUE

# the data-set is a tab-separated genes x cells counts table
rawDataset <- read.csv(datasetFileName, sep = "\t", row.names = 1L)

print(dim(rawDataset))
#> [1] 17085  2000

1.1.1 Create the COTAN object

Initialize the COTAN object with the row count table and the metadata from the experiment.

# tag identifying the sample/condition throughout the analysis
cond <- "mouse_cortex_E17.5"

# create the `COTAN` object from the raw counts matrix and
# attach the experiment's meta-data
obj <- COTAN(raw = rawDataset)
obj <-
  initializeMetaDataset(
    obj,
    GEO = GEO,
    sequencingMethod = "Drop_seq",
    sampleCondition = cond
  )
#> Initializing `COTAN` meta-data

logThis(paste0("Condition ", getMetadataElement(obj, datasetTags()[["cond"]])),
        logLevel = 1L)
#> Condition mouse_cortex_E17.5

Assign to each cell its origin

# mark cells origin
# `vignette.cells.origin` is a factor shipped with the package labelling
# each cell as either "Cortical" or "Other"
data(vignette.cells.origin)
head(vignette.cells.origin, 18)
#> GCGAATTGTGAA ACCGATAACTGA CCCCGGGTGCGA CTGTAGATGTTA TCATCGAAGCGC TTCTACCGAGTC 
#>        Other        Other        Other        Other        Other        Other 
#> CTGTTCCCGGCG GCGTGTTAGTTC CTCGCGCGTTTA GATGTATAACTT GCGCTATGATTT CGTTTAGTTTAC 
#>        Other        Other        Other        Other        Other     Cortical 
#> GTGGAGGCCCAT TGTCACTACATC TCTAGAACAACG ACCTTTGTTCGT TTGTCTTCTTCG TAAAATATCGCC 
#>        Other        Other     Cortical     Cortical     Cortical     Cortical 
#> Levels: Cortical Other

# store the origin labels as a named condition inside the `COTAN` object
obj <- 
  addCondition(obj, condName = "origin", conditions = vignette.cells.origin)

rm(vignette.cells.origin)

1.1.2 Drop low quality cells

# use previously established results to determine
# which cells were dropped in the cleaning stage
data(vignette.split.clusters)

# cells not present in the stored clusterization were discarded by cleaning
cellsToDrop <-
  getCells(obj)[!(getCells(obj) %in% names(vignette.split.clusters))]

obj <- dropGenesCells(obj, cells = cellsToDrop)

# Log the remaining number of cells

logThis(paste("n cells", getNumCells(obj)), logLevel = 1L)
#> n cells 1783

# cross-check: how many of the kept cells come from each origin
table(getCondition(obj, condName = "origin"))
#> 
#> Cortical    Other 
#>      844      939

rm(vignette.split.clusters)

2 Establish Uniform Transcript cells’ clusters

2.1 COTAN analysis

In this part, the COTAN model is calibrated: all the contingency tables are computed and used to get the statistics necessary to COEX evaluation and storing

The COEX coefficient between two genes measures the deviation between the 2×2 contingency table of the joint occurrence of the genes in the cells, and their expected co-expression assuming independence.

# calibrate the COTAN model: genes/cells selection plus estimation of the
# model parameters (note: the genes' `COEX` matrix itself is NOT computed
# here, as `calcCoex = FALSE`)
obj <-
  proceedToCoex(
    obj,
    calcCoex = FALSE,
    optimizeForSpeed = TRUE, # use the faster `torch`-based implementation
    cores = 3L,
    deviceStr = "cuda" # requested `torch` device
  )
#> COTAN dataset analysis: START
#> Genes/cells selection done: dropped [4330] genes and [0] cells
#> Working on [12755] genes and [1783] cells
#> Estimate `dispersion`: START
#> Total calculations elapsed time: 13.3878927230835
#> Estimate `dispersion`: DONE
#> Estimate `dispersion`: DONE
#> `dispersion` | min: -0.0349260648055935 | max: 760.574661197634 | % negative: 12.5362602900823
#> COTAN genes' COEX estimation not requested
#> COTAN dataset analysis: DONE

2.2 Uniform Clustering

It is possible to obtain a cells’ clusterization based on the concept of uniformity of expression of the genes across the cells in each cluster. That is, the clusters satisfy the null hypothesis of the COTAN model: the genes’ expression is not dependent on the cell in consideration.

2.2.1 First UT clusterization

There are two functions involved in obtaining a proper clusterization: the first is cellsUniformClustering. This function applies standard clusterization methods to a COTAN-specific data input and reduction. The function then discards and re-clusters any non-uniform cluster.

Please note that the most important parameters for the users are the GDIThresholds inside the Uniform Transcript checkers: they determine how strict the check is. Default-constructed advanced checks give a pretty strong guarantee of uniformity for the cluster.

# This code is too computationally heavy to be used in a vignette [~20 min.]
# So the result was stored and will be re-loaded in the next chunk


initialClusters <- NULL

# Restart facility: re-use partial results saved by a previous (interrupted)
# run of `cellsUniformClustering()`, if any are present on disk
if (TRUE) {
  # read from the last file among those named partial_clusterization_XX.csv
  mergeDir <- file.path(outDir, cond, "reclustering")

  resFiles <-
    list.files(
      path = mergeDir,
      pattern = "partial_clusterization_.*csv",
      full.names = TRUE
    )

  if (length(resFiles) != 0L) {
    prevClustRes <-
      read.csv(resFiles[length(resFiles)], header = TRUE, row.names = 1L)
    initialClusters <- getColumnFromDF(prevClustRes, 1L)
    # cells not yet assigned to any cluster are marked as "-1"
    initialClusters[is.na(initialClusters)] <- "-1"
    initialClusters <- niceFactorLevels(initialClusters)
  }
}


# default constructed checker is OK
advChecker <- new("AdvancedGDIUniformityCheck")

c(splitClusters, splitCoexDF) %<-%
  cellsUniformClustering(
    obj,
    initialResolution = 0.8,
    checker = advChecker,
    dataMethod = "LogLikelihood", # of observed against expected data
    useCoexEigen = TRUE, # project on `COEX` eigenvectors instead doing a `PCA`
    numReducedComp = 50L,
    minimumUTClusterSize = 50L, # too large values imply many un-clustered cells
    initialClusters = initialClusters,
    initialIteration = 1L,
    optimizeForSpeed = TRUE,
    deviceStr = "cuda",
    cores = 3L,
    saveObj = TRUE, # save intermediate data for log and restart
    outDir = outDir
  )

# add the newly found clusterization to the `COTAN` object
obj <-
  addClusterization(
    obj,
    clName = "split",
    clusters = splitClusters,
    coexDF = splitCoexDF
  )

table(splitClusters)

Loading pre-calculated clusterization from data

# load the pre-calculated `split` clusterization shipped with the package
data("vignette.split.clusters", package = "COTAN")
splitClusters <- vignette.split.clusters

# explicitly calculate the DEA of the clusterization to store it
splitCoexDF <- DEAOnClusters(obj, clusters = splitClusters)
#> Differential Expression Analysis - START
#> ****************
#> Total calculations elapsed time: 1.55403685569763
#> Differential Expression Analysis - DONE

obj <-
  addClusterization(
    obj,
    clName = "split",
    clusters = splitClusters,
    coexDF = splitCoexDF,
    override = FALSE # fails if a clusterization named "split" already exists
  )

2.2.2 Extract split clusters’ statistics

It is possible to extract some statistics about a stored clusterization:

# extract per-cluster (and per-origin) summary statistics plus a summary plot
c(summaryData, summaryPlot) %<-%
  clustersSummaryPlot(
    obj,
    clName = "split",
    plotTitle = "split summary",
    condName = "origin"
  )

head(summaryData, n = 10L)
#>    split   origin CellNumber CellPercentage MeanUDE MedianUDE ExpGenes25
#> 1     -1 Cortical         48            2.7    1.14      0.63       1342
#> 2     01 Cortical        134            7.5    0.90      0.79       1155
#> 3     02 Cortical         85            4.8    0.87      0.82       1136
#> 4     03 Cortical         56            3.1    1.32      1.29       1755
#> 5     04 Cortical        124            7.0    1.75      1.76       2348
#> 6     05 Cortical        158            8.9    1.30      1.28       1745
#> 7     06 Cortical        152            8.5    0.80      0.77       1037
#> 8     07 Cortical         64            3.6    1.14      0.99       1417
#> 9     08 Cortical          1            0.1    1.16      1.16       1495
#> 10    09 Cortical          8            0.4    1.27      1.15       1583
#>    ExpGenes
#> 1     10349
#> 2     10890
#> 3     10237
#> 4     10063
#> 5     11601
#> 6     11375
#> 7     10640
#> 8     10654
#> 9      1495
#> 10     5689

The ExpGenes column contains the number of genes that are expressed in any cell of the relevant cluster, while the ExpGenes25 column contains the number of genes expressed in at least 25% of the cells of the relevant cluster

# display the summary bar-plot
plot(summaryPlot)
#> Warning: Removed 4 rows containing missing values or values outside the scale range
#> (`geom_bar()`).
#> Warning: Removed 4 rows containing missing values or values outside the scale range
#> (`geom_text()`).

2.2.3 Relevance of marker genes for the split clusters

It is possible to visualize how relevant are some marker genes for the clusters comprising a given clusterization

# marker genes associated to each cortical layer (plus progenitors)
layersGenes <- list()
layersGenes[["L1"]]   <- c("Reln", "Lhx5")
layersGenes[["L2/3"]] <- c("Cux1", "Satb2")
layersGenes[["L4"]]   <- c("Rorb", "Sox5")
layersGenes[["L5/6"]] <- c("Bcl11b", "Fezf2")
layersGenes[["Prog"]] <- c("Hes1", "Vim")

# score the marker genes against the `split` clusters and build the heat-map
c(splitHeatmap, splitScoreDF, splitPValueDF) %<-%
  clustersMarkersHeatmapPlot(
    obj,
    groupMarkers = layersGenes,
    clName = "split",
    kCuts = 2L,
    adjustmentMethod = "holm"
  )

ComplexHeatmap::draw(splitHeatmap)

In the above graph, it is possible to see that the found clusters align well to the expression of the layers’ genes.

2.2.4 UMAP of the split clusterization

It is also possible to check how the clusters are spread in a UMAP plot that uses the same data reduction methods as the one used in the creation of the clusterization

# UMAP of the `split` clusterization, using the same data reduction that
# produced it (projection onto the `COEX` eigenvectors)
c(umapPlot, cellsRDM) %<-%
  cellsUMAPPlot(
    obj,
    clName = "split",
    clusters = NULL,
    useCoexEigen = TRUE, # Aligning these to `split` generation gives best res.
    dataMethod = "LogLikelihood", # "AdjBinarized" is also a good choice
    numComp = 50L, # number of components of the data reduced matrix
    colors = NULL, # a list of user defined colors for the various clusters
    numNeighbors = 30L, # UMAP related parameters
    minPointsDist = 0.3 # UMAP related parameters
  )

plot(umapPlot)

Unfortunately, due to build-timeout limits, the above plot cannot be displayed, as no COEX matrix was calculated in the COTAN object, but we can show that the clusterization is good even changing the data reduction method, in this case the one from Seurat.

# alternative UMAP: Seurat's highly-variable-genes selection with a
# log-normalized reduction instead of the `COEX` eigenvectors
c(umapPlot, cellsRDM) %<-%
  cellsUMAPPlot(
    obj,
    clName = "split",
    clusters = NULL,
    useCoexEigen = FALSE,
    dataMethod = "LogNormalized", 
    numComp = 50L, # number of components of the data reduced matrix
    genesSel = "HVG_Seurat",
    numGenes = 500L,
    colors = NULL, # a list of user defined colors for the various clusters
    numNeighbors = 30L, # UMAP related parameters
    minPointsDist = 0.3 # UMAP related parameters
  )
#> Elaborating Reduced dimensionality Data Matrix - START
#> Running genes' selection: START
#> Warning: Feature names cannot have underscores ('_'), replacing with dashes
#> ('-')
#> Normalizing layer: counts
#> Finding variable features for layer counts
#> Running genes' selection: DONE
#> Total calculations elapsed time: 5.71826553344727
#> Elaborating Reduced dimensionality Data Matrix - DONE
#> UMAP plot: START
#> 17:11:16 UMAP embedding parameters a = 0.9922 b = 1.112
#> 17:11:16 Read 1783 rows and found 50 numeric columns
#> 17:11:16 Using Annoy for neighbor search, n_neighbors = 30
#> 17:11:16 Building Annoy index with metric = cosine, n_trees = 50
#> 0%   10   20   30   40   50   60   70   80   90   100%
#> [----|----|----|----|----|----|----|----|----|----|
#> **************************************************|
#> 17:11:17 Writing NN index file to temp file /tmp/RtmpRABiog/file33d43933e811da
#> 17:11:17 Searching Annoy index using 1 thread, search_k = 3000
#> 17:11:17 Annoy recall = 100%
#> 17:11:18 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
#> 17:11:20 Initializing from normalized Laplacian + noise (using RSpectra)
#> 17:11:20 Commencing optimization for 500 epochs, with 76514 positive edges
#> 17:11:20 Using rng type: pcg
#> 17:11:24 Optimization finished
#> Total calculations elapsed time: 7.33055400848389
#> UMAP plot: DONE

plot(umapPlot)
#> Warning: No shared levels found between `names(values)` of the manual scale and the
#> data's fill values.

2.2.5 Merged UT clusterization

Since the algorithm that creates the clusters is not directly geared to achieve cluster uniformity, there might be some clusters that can be merged back together and still be Uniform Transcript.

This is the purpose of the function mergeUniformCellsClusters() that takes an existing clusterization and tries to merge pairs of clusters. If a pair forms a uniform cluster (via the uniformity check) the pair is actually merged.

Since the number of possible pairs grows quadratically with the number of clusters, in order to avoid running the totality of the possible checks, the function relies on a related cluster distance to select the pairs a-priori most likely to be merged.

Again, here, the most important parameters for the users are the GDIThresholds inside the Uniform Transcript checkers: they define how strict the check is. A default-constructed advanced check gives a pretty strong guarantee of uniformity for the cluster.

If one wants to achieve a more relaxed merge, and thus less clusters overall, it is possible to define a sequence of checkers, each less stringent than the previous, that will be applied one after the other: this allows to execute first the merges that result in the most uniform clusters and only later merge those that satisfy only the relaxed checks.


# sequence of increasingly relaxed checkers: merges are attempted with the
# strictest checker first, then re-attempted with the more relaxed ones
checkersList <-
  list(
    advChecker,
    shiftCheckerThresholds(advChecker, 0.01), # Sensible threshold shifts really
    shiftCheckerThresholds(advChecker, 0.03)  # depend on the dataset
  )

prevCheckRes <- data.frame()

# The following code shows the case when one wants to re-use the already
# calculated checks from the saved intermediate outputs.
# This can apply in cases one wants to add a new checker to the list above or
# in the cases the execution is interrupted prematurely.
#
# Note that a similar facility exists for the `cellsUniformClustering()`
# function above.
#
if (TRUE) {
  # read from the last file among those named all_check_results_XX.csv
  mergeDir <- file.path(outDir, cond, "leafs_merge")

  resFiles <-
    list.files(
      path = mergeDir,
      pattern = "all_check_results_.*csv",
      full.names = TRUE
    )

  if (length(resFiles) != 0L) {
    prevCheckRes <-
      read.csv(resFiles[length(resFiles)], header = TRUE, row.names = 1L)
  }
}

# merge pairs of clusters whose union still passes the uniformity checks
c(mergedClusters, mergedCoexDF) %<-%
  mergeUniformCellsClusters(
    obj,
    clusters = splitClusters,
    checkers = checkersList,
    allCheckResults = prevCheckRes, # re-use checks loaded from disk, if any
    optimizeForSpeed = TRUE,
    deviceStr = "cuda",
    cores = 6L,
    useDEA = TRUE,
    saveObj = TRUE,
    outDir = outDir
  )

# add the newly found clusterization to the `COTAN` object
obj <-
  addClusterization(
    obj,
    clName = "merge",
    override = TRUE,
    clusters = mergedClusters,
    coexDF = mergedCoexDF
  )

table(mergedClusters)

Loading pre-calculated clusterization from data

# load the pre-calculated `merge` clusterization shipped with the package,
# re-ordered to match the cells stored in the `COTAN` object
data("vignette.merge.clusters", package = "COTAN")
mergedClusters <- vignette.merge.clusters[getCells(obj)]

# explicitly calculate the DEA of the clusterization to store it
mergedCoexDF <- DEAOnClusters(obj, clusters = mergedClusters)
#> Differential Expression Analysis - START
#> *********
#> Total calculations elapsed time: 1.38874626159668
#> Differential Expression Analysis - DONE

obj <-
  addClusterization(
    obj,
    clName = "merge",
    clusters = mergedClusters,
    coexDF = mergedCoexDF,
    override = FALSE # fails if a clusterization named "merge" already exists
  )

# this gives immediate visual information about which clusters were merged
table(splitClusters, mergedClusters)
#>              mergedClusters
#> splitClusters  -1   2   3   4   5   6   7   8   9
#>            -1  76   0   0   0   0   0   0   0   0
#>            01   0 140   0   0   0   0   0   0   0
#>            02   0  86   0   0   0   0   0   0   0
#>            03   0   0  57   0   0   0   0   0   0
#>            04   0   0   0 126   0   0   0   0   0
#>            05   0   0 159   0   0   0   0   0   0
#>            06   0 152   0   0   0   0   0   0   0
#>            07   0   0   0   0  64   0   0   0   0
#>            08   0   0   0   0   0  75   0   0   0
#>            09   0   0   0   0   0  61   0   0   0
#>            10   0   0   0   0   0   0 197   0   0
#>            11   0   0  50   0   0   0   0   0   0
#>            12   0   0   0   0   0   0 137   0   0
#>            13   0   0   0   0   0   0 134   0   0
#>            14   0   0   0   0   0   0   0 145   0
#>            15   0   0   0   0   0   0   0   0 124

2.2.6 Extract merge clusters’ statistics

It is possible to extract some statistics about a stored clusterization:

# extract per-cluster (and per-origin) summary statistics plus a summary plot
c(summaryData, summaryPlot) %<-%
  clustersSummaryPlot(
    obj,
    clName = "merge",
    plotTitle = "merge summary",
    condName = "origin"
  )

head(summaryData, n = 10L)
#>    merge   origin CellNumber CellPercentage MeanUDE MedianUDE ExpGenes25
#> 1     -1 Cortical         48            2.7    1.14      0.63       1342
#> 2      2 Cortical        371           20.8    0.85      0.79       1101
#> 3      3 Cortical        215           12.1    1.30      1.28       1747
#> 4      4 Cortical        124            7.0    1.75      1.76       2348
#> 5      5 Cortical         64            3.6    1.14      0.99       1417
#> 6      6 Cortical          9            0.5    1.26      1.15       1797
#> 7      7 Cortical          7            0.4    0.96      0.87       2049
#> 8      8 Cortical          3            0.2    1.11      0.61       3037
#> 9      9 Cortical          3            0.2    1.46      1.66       3524
#> 10    -1    Other         28            1.6    1.26      1.06       1329
#>    ExpGenes
#> 1     10349
#> 2     12016
#> 3     11804
#> 4     11601
#> 5     10654
#> 6      6035
#> 7      4694
#> 8      3037
#> 9      3524
#> 10     9330

# display the summary bar-plot
plot(summaryPlot)
#> Warning: Removed 2 rows containing missing values or values outside the scale range
#> (`geom_bar()`).
#> Warning: Removed 2 rows containing missing values or values outside the scale range
#> (`geom_text()`).

2.2.7 Relevance of marker genes for the merge clusters

Again, it is possible to visualize how relevant are the marker genes from above:

# score the layers' marker genes against the `merge` clusters,
# annotating the heat-map with the cells' "origin" condition
c(mergeHeatmap, mergeScoresDF, mergePValuesDF) %<-%
  clustersMarkersHeatmapPlot(
    obj,
    groupMarkers = layersGenes,
    clName = "merge",
    condNameList = list("origin")
  )

ComplexHeatmap::draw(mergeHeatmap)

It is also possible to use any clusterization as if it were a condition:

# annotate the `merge` heat-map with the `split` clusterization used as if
# it were a condition
# NOTE(review): `..` appears to absorb the remaining, unused outputs of the
# zeallot destructuring — confirm against zeallot's `%<-%` semantics
c(mergeVsSplitHeatmap, ..) %<-%
  clustersMarkersHeatmapPlot(
    obj,
    clName = "merge",
    condNameList = list("split"),
    conditionsList = list(splitClusters)
  )

ComplexHeatmap::draw(mergeVsSplitHeatmap)

2.2.8 UMAP of the merge clusterization

It is also possible to check how the clusters are spread in a UMAP plot that uses the same (or other) data reduction methods as the one used in the creation of the clusterization

Again it is possible to plot the clusterization on a UMAP plot

# UMAP of the `merge` clusterization via the `COEX` eigenvectors reduction;
# the `.` placeholder skips the second returned value (the reduced matrix)
c(umapPlot, .) %<-%
  cellsUMAPPlot(
    obj,
    clName = "merge",
    clusters = NULL,
    useCoexEigen = TRUE,
    dataMethod = "LogLikelihood", # "AdjBinarized" is also a good choice
    numComp = 50L, # number of components of the data reduced matrix
    colors = NULL, # a list of user defined colors for the various clusters
    numNeighbors = 30L, # UMAP related parameters
    minPointsDist = 0.3 # UMAP related parameters
  )

plot(umapPlot)

Again, due to build-timeout limits, the above plot cannot be displayed, as no COEX matrix was calculated in the COTAN object, but we can show that the clusterization is good even changing the data reduction method, in this case the one from scanpy.

# alternative UMAP: scanpy-style highly-variable-genes selection with a
# log-normalized reduction instead of the `COEX` eigenvectors
c(umapPlot, .) %<-%
  cellsUMAPPlot(
    obj,
    clName = "merge",
    clusters = NULL,
    useCoexEigen = FALSE,
    dataMethod = "LogNormalized", # "AdjBinarized" is also a good choice
    numComp = 50L, # number of components of the data reduced matrix
    genesSel = "HVG_Scanpy",
    numGenes = 500L,
    colors = NULL, # a list of user defined colors for the various clusters
    numNeighbors = 30L, # UMAP related parameters
    minPointsDist = 0.3 # UMAP related parameters
  )
#> Elaborating Reduced dimensionality Data Matrix - START
#> Running genes' selection: START
#> Warning: Feature names cannot have underscores ('_'), replacing with dashes
#> ('-')
#> Normalizing layer: counts
#> Finding variable features for layer data
#> Running genes' selection: DONE
#> Total calculations elapsed time: 5.08531308174133
#> Elaborating Reduced dimensionality Data Matrix - DONE
#> UMAP plot: START
#> 17:11:38 UMAP embedding parameters a = 0.9922 b = 1.112
#> 17:11:38 Read 1783 rows and found 50 numeric columns
#> 17:11:38 Using Annoy for neighbor search, n_neighbors = 30
#> 17:11:38 Building Annoy index with metric = cosine, n_trees = 50
#> 0%   10   20   30   40   50   60   70   80   90   100%
#> [----|----|----|----|----|----|----|----|----|----|
#> **************************************************|
#> 17:11:38 Writing NN index file to temp file /tmp/RtmpRABiog/file33d43991c796c
#> 17:11:38 Searching Annoy index using 1 thread, search_k = 3000
#> 17:11:38 Annoy recall = 100%
#> 17:11:39 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
#> 17:11:41 Initializing from normalized Laplacian + noise (using RSpectra)
#> 17:11:41 Commencing optimization for 500 epochs, with 73286 positive edges
#> 17:11:41 Using rng type: pcg
#> 17:11:44 Optimization finished
#> Total calculations elapsed time: 6.68681311607361
#> UMAP plot: DONE

plot(umapPlot)
#> Warning: No shared levels found between `names(values)` of the manual scale and the
#> data's fill values.

2.3 Vignette clean-up stage

The next few lines are just to clean.

# Remove the saved `COTAN` object, if present
cotanFile <- file.path(outDir, paste0(cond, ".cotan.RDS"))
if (file.exists(cotanFile)) {
  file.remove(cotanFile)
}

# Remove the timings file, if present
timesFile <- file.path(outDir, paste0(cond, "_times.csv"))
if (file.exists(timesFile)) {
  file.remove(timesFile)
}

# Remove the whole per-condition output folder, if present
condDir <- file.path(outDir, cond)
if (dir.exists(condDir)) {
  unlink(condDir, recursive = TRUE)
}

# The downloaded data-set folder is kept; uncomment to remove it as well
# if (dir.exists(file.path(outDir, GEO))) {
#   unlink(file.path(outDir, GEO), recursive = TRUE)
# }

# stop logging to file
setLoggingFile("")
#> Closing previous log file - Setting log file to be:
file.remove(file.path(outDir, "vignette_uniform_clustering.log"))
#> [1] TRUE

# restore the options state saved in the preamble
options(prevOptState)
Sys.time()
#> [1] "2026-03-29 17:11:45 EDT"

# record the package environment used to build this vignette
sessionInfo()
#> R Under development (unstable) (2026-03-05 r89546)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] zeallot_0.2.0    COTAN_2.11.4     BiocStyle_2.39.0
#> 
#> loaded via a namespace (and not attached):
#>   [1] matrixStats_1.5.0           spatstat.sparse_3.1-0      
#>   [3] httr_1.4.8                  RColorBrewer_1.1-3         
#>   [5] doParallel_1.0.17           tools_4.6.0                
#>   [7] sctransform_0.4.3           R6_2.6.1                   
#>   [9] lazyeval_0.2.2              uwot_0.2.4                 
#>  [11] ggdist_3.3.3                GetoptLong_1.1.0           
#>  [13] withr_3.0.2                 sp_2.2-1                   
#>  [15] gridExtra_2.3               parallelDist_0.2.7         
#>  [17] progressr_0.18.0            cli_3.6.5                  
#>  [19] Biobase_2.71.0              textshaping_1.0.5          
#>  [21] Cairo_1.7-0                 spatstat.explore_3.8-0     
#>  [23] fastDummies_1.7.5           labeling_0.4.3             
#>  [25] sass_0.4.10                 Seurat_5.4.0               
#>  [27] S7_0.2.1                    spatstat.data_3.1-9        
#>  [29] readr_2.2.0                 proxy_0.4-29               
#>  [31] ggridges_0.5.7              pbapply_1.7-4              
#>  [33] systemfonts_1.3.2           rentrez_1.2.4              
#>  [35] dichromat_2.0-0.1           parallelly_1.46.1          
#>  [37] limma_3.67.0                torch_0.16.3               
#>  [39] generics_0.1.4              shape_1.4.6.1              
#>  [41] ica_1.0-3                   spatstat.random_3.4-5      
#>  [43] dplyr_1.2.0                 distributional_0.7.0       
#>  [45] dendextend_1.19.1           Matrix_1.7-5               
#>  [47] S4Vectors_0.49.0            abind_1.4-8                
#>  [49] lifecycle_1.0.5             yaml_2.3.12                
#>  [51] SummarizedExperiment_1.41.1 SparseArray_1.11.11        
#>  [53] Rtsne_0.17                  grid_4.6.0                 
#>  [55] promises_1.5.0              crayon_1.5.3               
#>  [57] miniUI_0.1.2                lattice_0.22-9             
#>  [59] beachmat_2.27.3             cowplot_1.2.0              
#>  [61] magick_2.9.1                pillar_1.11.1              
#>  [63] knitr_1.51                  ComplexHeatmap_2.27.1      
#>  [65] GenomicRanges_1.63.1        rjson_0.2.23               
#>  [67] future.apply_1.20.2         codetools_0.2-20           
#>  [69] glue_1.8.0                  spatstat.univar_3.1-7      
#>  [71] data.table_1.18.2.1         vctrs_0.7.1                
#>  [73] png_0.1-9                   spam_2.11-3                
#>  [75] gtable_0.3.6                assertthat_0.2.1           
#>  [77] cachem_1.1.0                zigg_0.0.2                 
#>  [79] xfun_0.57                   S4Arrays_1.11.1            
#>  [81] mime_0.13                   Rfast_2.1.5.2              
#>  [83] Seqinfo_1.1.0               survival_3.8-6             
#>  [85] SingleCellExperiment_1.33.2 iterators_1.0.14           
#>  [87] tinytex_0.59                statmod_1.5.1              
#>  [89] fitdistrplus_1.2-6          ROCR_1.0-12                
#>  [91] nlme_3.1-169                bit64_4.6.0-1              
#>  [93] RcppAnnoy_0.0.23            bslib_0.10.0               
#>  [95] irlba_2.3.7                 KernSmooth_2.23-26         
#>  [97] otel_0.2.0                  colorspace_2.1-2           
#>  [99] BiocGenerics_0.57.0         tidyselect_1.2.1           
#> [101] processx_3.8.6              bit_4.6.0                  
#> [103] compiler_4.6.0              curl_7.0.0                 
#> [105] httr2_1.2.2                 xml2_1.5.2                 
#> [107] DelayedArray_0.37.0         plotly_4.12.0              
#> [109] bookdown_0.46               scales_1.4.0               
#> [111] lmtest_0.9-40               callr_3.7.6                
#> [113] rappdirs_0.3.4              stringr_1.6.0              
#> [115] digest_0.6.39               goftest_1.2-3              
#> [117] spatstat.utils_3.2-2        rmarkdown_2.31             
#> [119] GEOquery_2.79.0             XVector_0.51.0             
#> [121] coro_1.1.0                  htmltools_0.5.9            
#> [123] pkgconfig_2.0.3             MatrixGenerics_1.23.0      
#> [125] fastmap_1.2.0               rlang_1.1.7                
#> [127] GlobalOptions_0.1.3         htmlwidgets_1.6.4          
#> [129] ggthemes_5.2.0              shiny_1.13.0               
#> [131] farver_2.1.2                jquerylib_0.1.4            
#> [133] zoo_1.8-15                  jsonlite_2.0.0             
#> [135] BiocParallel_1.45.0         BiocSingular_1.27.1        
#> [137] magrittr_2.0.4              dotCall64_1.2              
#> [139] patchwork_1.3.2             Rcpp_1.1.1                 
#> [141] viridis_0.6.5               reticulate_1.45.0          
#> [143] stringi_1.8.7               MASS_7.3-65                
#> [145] plyr_1.8.9                  parallel_4.6.0             
#> [147] listenv_0.10.1              ggrepel_0.9.8              
#> [149] deldir_2.0-4                splines_4.6.0              
#> [151] tensor_1.5.1                hms_1.1.4                  
#> [153] circlize_0.4.17             ps_1.9.1                   
#> [155] igraph_2.2.2                spatstat.geom_3.7-3        
#> [157] RcppHNSW_0.6.0              reshape2_1.4.5             
#> [159] stats4_4.6.0                ScaledMatrix_1.19.0        
#> [161] XML_3.99-0.23               evaluate_1.0.5             
#> [163] SeuratObject_5.3.0          RcppParallel_5.1.11-2      
#> [165] BiocManager_1.30.27         tzdb_0.5.0                 
#> [167] foreach_1.5.2               httpuv_1.6.17              
#> [169] RANN_2.6.2                  tidyr_1.3.2                
#> [171] purrr_1.2.1                 polyclip_1.10-7            
#> [173] future_1.70.0               clue_0.3-68                
#> [175] scattermore_1.2             ggplot2_4.0.2              
#> [177] conflicted_1.2.0            rsvd_1.0.5                 
#> [179] xtable_1.8-8                RSpectra_0.16-2            
#> [181] later_1.4.8                 viridisLite_0.4.3          
#> [183] ragg_1.5.2                  tibble_3.3.1               
#> [185] memoise_2.0.1               IRanges_2.45.0             
#> [187] cluster_2.1.8.2             globals_0.19.1