---
title: "Guided tutorial to Uniform Transcript cells' clustering using COTAN"
author: 
- name: "Silvia Giulia Galfrè"
  affiliation: "Department of Computer Science, University of Pisa"
- name: "Marco Fantozzi"
  affiliation: "University of Parma"

package: COTAN
output:
  BiocStyle::html_document:
    keep_md: true
vignette: >
  %\VignetteIndexEntry{Uniform clustering using COTAN}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

## Preamble

```{r knitr options, echo = FALSE, include = FALSE}
plotDev <- if (requireNamespace("ragg", quietly = TRUE)) {
  "ragg_png"
} else if (requireNamespace("Cairo", quietly = TRUE)) {
  "cairo_png"
} else {
  "png"
}

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7L,
  fig.height = 5L,
  dev = plotDev,
  fig.crop = FALSE,
  dpi = 100L
)
```


```{r preamble, message=FALSE, warning=FALSE}
library(COTAN)
library(zeallot)

# necessary to solve precedence of overloads
conflicted::conflict_prefer("%<-%", "zeallot")

# enable multi-processing (not on Windows)
prevOptState <- options(parallelly.fork.enable = TRUE)
```

# Introduction

This tutorial shows the available tools in COTAN to help with genes' clustering

### Setup

Define a directory where the data and the output will be stored

```{r setup-and-log}
dataDir <- file.path(tempdir(), "COTAN_vignette_data")
dir.create(dataDir, recursive = TRUE, showWarnings = FALSE)

print(dataDir)

outDir <- dataDir

# Log-level 2 was chosen to showcase better how the package works
# In normal usage a level of 0 or 1 is more appropriate
setLoggingLevel(2L)

# This file will contain all the logs produced by the package
# as if at the highest logging level
setLoggingFile(file.path(outDir, "vignette_uniform_clustering.log"))
```


## Retrieving the data-set

Download the data-set for `r "mouse cortex E17.5"` from
`r "Yuzwa et al."` (2017)

```{r download-data}
GEO <- "GSM2861514"

dir.create(file.path(dataDir, GEO), showWarnings = FALSE)

datasetFileName <-
  file.path(dataDir, GEO, "GSM2861514_E175_All_Cells_DGE.txt.gz")


# retries up to 5 times to get the dataset
attempts <- 0L
maxAttempts <- 5L
ok <- FALSE

while (attempts < maxAttempts && !ok) {
  attempts <- attempts + 1L

  if (!file.exists(datasetFileName)) {
    res <- try(
      GEOquery::getGEOSuppFiles(
        GEO = GEO,
        makeDirectory = TRUE,
        baseDir = dataDir,
        filter_regex = base::basename(datasetFileName),
        fetch_files = TRUE
      ),
      silent = TRUE
    )
  }

  ok <- file.exists(datasetFileName)

  if (!ok && attempts < maxAttempts) {
    Sys.sleep(1)
  }
}

assertthat::assert_that(
  ok,
  msg = paste0(
    "Failed to retrieve file '", datasetFileName,
    "' after ", maxAttempts, " attempts."
  )
)

rawDataset <- read.csv(datasetFileName, sep = "\t", row.names = 1L)

print(dim(rawDataset))
```

### Create the COTAN object

Initialize the `COTAN` object with the row count table and
the metadata from the experiment.

```{r create-cotan-object}
cond <- "mouse_cortex_E17.5"

obj <- COTAN(raw = rawDataset)
obj <-
  initializeMetaDataset(
    obj,
    GEO = GEO,
    sequencingMethod = "Drop_seq",
    sampleCondition = cond
  )

logThis(paste0("Condition ", getMetadataElement(obj, datasetTags()[["cond"]])),
        logLevel = 1L)
```

Assign to each cell its origin

```{r add-origin-to-the-cotan-object}
# mark cells origin
data(vignette.cells.origin)
head(vignette.cells.origin, 18)

obj <- 
  addCondition(obj, condName = "origin", conditions = vignette.cells.origin)

rm(vignette.cells.origin)
```


### Drop low quality cells

```{r drop-low-quality-cells}
# use previously established results to determine
# which cells were dropped in the cleaning stage
data(vignette.split.clusters)

cellsToDrop <-
  getCells(obj)[!(getCells(obj) %in% names(vignette.split.clusters))]

obj <- dropGenesCells(obj, cells = cellsToDrop)

# Log the remaining number of cells

logThis(paste("n cells", getNumCells(obj)), logLevel = 1L)

table(getCondition(obj, condName = "origin"))

rm(vignette.split.clusters)
```

# Establish Uniform Transcript cells' clusters

## COTAN analysis

In this part, the `COTAN` model is calibrated:
all the contingency tables are computed and used to get
the statistics necessary to `COEX` evaluation and storing

The `COEX` coefficient between the two genes, measures the deviation between
the 2×2 contingency table of the joint occurrence of the genes in the cells,
and their expected co-expression assuming independence.

```{r calculate-and-store-coex-matrix}
obj <-
  proceedToCoex(
    obj,
    calcCoex = FALSE,
    optimizeForSpeed = TRUE,
    cores = 3L,
    deviceStr = "cuda"
  )
```

## Uniform Clustering

It is possible to obtain a cells' *clusterization* based on the concept of 
*uniformity* of expression of the genes across the cells in each cluster.
That is the clusters satisfy the null hypothesis of the `COTAN` model:
the genes expression is not dependent on the cell in consideration.

### First UT clusterization

There are two functions involved into obtaining a proper *clusterization*:
the first is `cellsUniformClustering`. This function uses standard tools
*clusterization* methods via a `COTAN` specific data input and reduction.
The function then discards and re-clusters any *non-uniform* cluster.

Please note that the most important parameters for the users are the
`GDIThreshold`s inside the **Uniform Transcript** checkers: they determine how 
strict is the check. Default constructed advance checks give a pretty strong
guarantee of uniformity for the *cluster*.

```{r create-ut-clusterization, eval=FALSE, include=TRUE}
# This code is a too computationally heavy to be used in a vignette [~20 min.]
# So the result was stored and will be re-loaded in the next chunk


initialClusters <- NULL

if (TRUE) {
  # read from the last file among those named all_check_results_XX.csv
  mergeDir <- file.path(outDir, cond, "reclustering")

  resFiles <-
    list.files(
      path = mergeDir,
      pattern = "partial_clusterization_.*csv",
      full.names = TRUE
    )

  if (length(resFiles) != 0L) {
    prevClustRes <-
      read.csv(resFiles[length(resFiles)], header = TRUE, row.names = 1L)
    initialClusters <- getColumnFromDF(prevClustRes, 1L)
    initialClusters[is.na(initialClusters)] <- "-1"
    initialClusters <- niceFactorLevels(initialClusters)
  }
}


# default constructed checker is OK
advChecker <- new("AdvancedGDIUniformityCheck")

c(splitClusters, splitCoexDF) %<-%
  cellsUniformClustering(
    obj,
    initialResolution = 0.8,
    checker = advChecker,
    dataMethod = "LogLikelihood", # of observed against expected data
    useCoexEigen = TRUE, # project on `COEX` eigenvectors instead doing a `PCA`
    numReducedComp = 50L,
    minimumUTClusterSize = 50L, # too large values imply many un-clustered cells
    initialClusters = initialClusters,
    initialIteration = 1L,
    optimizeForSpeed = TRUE,
    deviceStr = "cuda",
    cores = 3L,
    saveObj = TRUE, # save intermediate data for log and restart
    outDir = outDir
  )

# add the newly found clusterization to the `COTAN` object
obj <-
  addClusterization(
    obj,
    clName = "split",
    clusters = splitClusters,
    coexDF = splitCoexDF
  )

table(splitClusters)
```

Loading pre-calculated *clusterization* from data

```{r load-pre-calculated-clusterization_split}
data("vignette.split.clusters", package = "COTAN")
splitClusters <- vignette.split.clusters

# explicitly calculate the DEA of the clusterization to store it
splitCoexDF <- DEAOnClusters(obj, clusters = splitClusters)

obj <-
  addClusterization(
    obj,
    clName = "split",
    clusters = splitClusters,
    coexDF = splitCoexDF,
    override = FALSE
  )
```

### Extract `split` clusters' statistics

It is possible to extract some statistics about a stored *clusterization*:

```{r extract-clusterization-summary_split}
c(summaryData, summaryPlot) %<-%
  clustersSummaryPlot(
    obj,
    clName = "split",
    plotTitle = "split summary",
    condName = "origin"
  )

head(summaryData, n = 10L)
```
The `ExpGenes` column contains the number of genes that are expressed in any
cell of the relevant *cluster*, while the `ExpGenes25` column contains the
number of genes expressed in at the least 25% of the cells of the relevant
*cluster*

```{r summary-as-plot}
plot(summaryPlot)
```

### Relevance of marker genes for the `split` clusters

It is possible to visualize how relevant are some *marker genes* for the
*clusters* comprising a given *clusterization*

```{r define-layers-specific-genes}
# these are some genes associated to each cortical layer
layersGenes <- list(
  "L1"   = c("Reln",   "Lhx5"),
  "L2/3" = c("Cux1",   "Satb2"),
  "L4"   = c("Rorb",   "Sox5"),
  "L5/6" = c("Bcl11b", "Fezf2"),
  "Prog" = c("Hes1",    "Vim")
)
```

```{r clusterization-data-visualization_split}
c(splitHeatmap, splitScoreDF, splitPValueDF) %<-%
  clustersMarkersHeatmapPlot(
    obj,
    groupMarkers = layersGenes,
    clName = "split",
    kCuts = 2L,
    adjustmentMethod = "holm"
  )

ComplexHeatmap::draw(splitHeatmap)
```

In the above graph, it is possible to see that the found clusters align
well to the expression of the layers' genes.


### `UMAP` of the `split` clusterization

It is also possible to check how the *clusters* are spread in a `UMAP` plot
that uses the same data reduction methods as the one used in the creation of
the *clusterization*

```{r clusterization-umap-plot-via-drm_split, eval=FALSE, include=TRUE}
c(umapPlot, cellsRDM) %<-%
  cellsUMAPPlot(
    obj,
    clName = "split",
    clusters = NULL,
    useCoexEigen = TRUE, # Aligning these to `split` generation gives best res.
    dataMethod = "LogLikelihood", # "AdjBinarized" is also a good choice
    numComp = 50L, # number of components of the data reduced matrix
    colors = NULL, # a list of user defined colors for the various clusters
    numNeighbors = 30L, # UMAP related parameters
    minPointsDist = 0.3 # UMAP related parameters
  )

plot(umapPlot)
```

Unfortunately, due to build-timeout limits, the above plot cannot be displayed,
as no `COEX` matrix was calculated in the `COTAN` object,
but we can show that the *clusterization* is good even changing the
data reduction method, in this case the one from `Seurat`.

```{r clusterization-umap-plot-via-drm_split_as-seurat}
c(umapPlot, cellsRDM) %<-%
  cellsUMAPPlot(
    obj,
    clName = "split",
    clusters = NULL,
    useCoexEigen = FALSE,
    dataMethod = "LogNormalized", 
    numComp = 50L, # number of components of the data reduced matrix
    genesSel = "HVG_Seurat",
    numGenes = 500L,
    colors = NULL, # a list of user defined colors for the various clusters
    numNeighbors = 30L, # UMAP related parameters
    minPointsDist = 0.3 # UMAP related parameters
  )

plot(umapPlot)
```

### Merged UT clusterization

Since the algorithm that creates the *clusters* is not directly geared to
achieve cluster uniformity, there might be some *clusters* that can be merged
back together and still be **Uniform Transcript**.

This is the purpose of the function `mergeUniformCellsClusters()` that takes a
existing *clusterization* and tries to merge pairs of cluster.
If a pair forms a uniform cluster (via the uniformity check) the pair is
actually merged.

Since the number of possible pairs grows quadratically with the number of
clusters, in order to avoid running the totality of the possible checks,
the function relies on a *related cluster distance* to select the *a-priori*
most likely to be merged.

Again, here, the most important parameters for the users are the `GDIThreshold`s
inside the **Uniform Transcript** checkers: they define how strict is the
check. Default constructed advance check gives a pretty strong guarantee of
uniformity for the *cluster*.

If one wants to achieve a more *relaxed* merge, and thus less clusters overall,
it is possible to define a *sequence* of checkers, each less stringent than the
previous, that will be applied one after the other: this allows to execute first
the merges that result in the *most* uniform clusters and only later merge those
that satisfy only the relaxed checks.

```{r merge-clusters-if-ut, eval=FALSE, include=TRUE}

checkersList <-
  list(
    advChecker,
    shiftCheckerThresholds(advChecker, 0.01), # Sensible threshold shifts really
    shiftCheckerThresholds(advChecker, 0.03)  # depend on the dataset
  )

prevCheckRes <- data.frame()

# The following code shows the case when one wants to re-use the already
# calculated checks from the saved intermediate outputs.
# This can apply in cases one wants to add a new checker to the list above or
# in the cases the execution is interrupted prematurely.
#
# Note that a similar facility exists for the `cellsUniformClustering()`
# function above.
#
if (TRUE) {
  # read from the last file among those named all_check_results_XX.csv
  mergeDir <- file.path(outDir, cond, "leafs_merge")

  resFiles <-
    list.files(
      path = mergeDir,
      pattern = "all_check_results_.*csv",
      full.names = TRUE
    )

  if (length(resFiles) != 0L) {
    prevCheckRes <-
      read.csv(resFiles[length(resFiles)], header = TRUE, row.names = 1L)
  }
}

c(mergedClusters, mergedCoexDF) %<-%
  mergeUniformCellsClusters(
    obj,
    clusters = splitClusters,
    checkers = checkersList,
    allCheckResults = prevCheckRes,
    optimizeForSpeed = TRUE,
    deviceStr = "cuda",
    cores = 6L,
    useDEA = TRUE,
    saveObj = TRUE,
    outDir = outDir
  )

# add the newly found clusterization to the `COTAN` object
obj <-
  addClusterization(
    obj,
    clName = "merge",
    override = TRUE,
    clusters = mergedClusters,
    coexDF = mergedCoexDF
  )

table(mergedClusters)
```


Loading pre-calculated *clusterization* from data

```{r load-pre-calculated-clusterization_merge}
data("vignette.merge.clusters", package = "COTAN")
mergedClusters <- vignette.merge.clusters[getCells(obj)]

# explicitly calculate the DEA of the clusterization to store it
mergedCoexDF <- DEAOnClusters(obj, clusters = mergedClusters)

obj <-
  addClusterization(
    obj,
    clName = "merge",
    clusters = mergedClusters,
    coexDF = mergedCoexDF,
    override = FALSE
  )

# this gives immediate visual information about which clusters were merged
table(splitClusters, mergedClusters)
```

### Extract `merge` clusters' statistics

It is possible to extract some statistics about a stored *clusterization*:

```{r extract-clusterization-summary_merge}
c(summaryData, summaryPlot) %<-%
  clustersSummaryPlot(
    obj,
    clName = "merge",
    plotTitle = "merge summary",
    condName = "origin"
  )

head(summaryData, n = 10L)

plot(summaryPlot)
```

### Relevance of marker genes for the `merge` clusters

Again, it is possible to visualize how relevant are the *marker genes* from
above:

```{r  clusterization-data-visualization_merge}
c(mergeHeatmap, mergeScoresDF, mergePValuesDF) %<-%
  clustersMarkersHeatmapPlot(
    obj,
    groupMarkers = layersGenes,
    clName = "merge",
    condNameList = list("origin")
  )

ComplexHeatmap::draw(mergeHeatmap)
```

It is also possible to use any *clusterization* as if a condition:

```{r clusterization-data-visualization_merge-vs-split}}
c(mergeVsSplitHeatmap, ..) %<-%
  clustersMarkersHeatmapPlot(
    obj,
    clName = "merge",
    condNameList = list("split"),
    conditionsList = list(splitClusters)
  )

ComplexHeatmap::draw(mergeVsSplitHeatmap)
```

### `UMAP` of the `merge` clusterization

It is also possible to check how the *clusters* are spread in a `UMAP` plot
that uses the same (or other) data reduction methods as the one used in the
creation of the *clusterization*

Again it is possible to plot the *clusterization* on a `UMAP` plot

```{r clusterization-umap-plot-via-drm_merge, eval=FALSE, include=TRUE}
c(umapPlot, .) %<-%
  cellsUMAPPlot(
    obj,
    clName = "merge",
    clusters = NULL,
    useCoexEigen = TRUE,
    dataMethod = "LogLikelihood", # "AdjBinarized" is also a good choice
    numComp = 50L, # number of components of the data reduced matrix
    colors = NULL, # a list of user defined colors for the various clusters
    numNeighbors = 30L, # UMAP related parameters
    minPointsDist = 0.3 # UMAP related parameters
  )

plot(umapPlot)
```

Again, due to build-timeout limits, the above plot cannot be displayed,
as no `COEX` matrix was calculated in the `COTAN` object,
but we can show that the *clusterization* is good even changing the
data reduction method, in this case the one from `scanpy`.

```{r clusterization-umap-plot-via-drm_merge_as-seurat}
c(umapPlot, .) %<-%
  cellsUMAPPlot(
    obj,
    clName = "merge",
    clusters = NULL,
    useCoexEigen = FALSE,
    dataMethod = "LogNormalized", # "AdjBinarized" is also a good choice
    numComp = 50L, # number of components of the data reduced matrix
    genesSel = "HVG_Scanpy",
    numGenes = 500L,
    colors = NULL, # a list of user defined colors for the various clusters
    numNeighbors = 30L, # UMAP related parameters
    minPointsDist = 0.3 # UMAP related parameters
  )

plot(umapPlot)
```


## Vignette clean-up stage

The next few lines are just to clean.

```{r clean-up}
if (file.exists(file.path(outDir, paste0(cond, ".cotan.RDS")))) {
  # delete file if it exists
  file.remove(file.path(outDir, paste0(cond, ".cotan.RDS")))
}
if (file.exists(file.path(outDir, paste0(cond, "_times.csv")))) {
  # delete file if it exists
  file.remove(file.path(outDir, paste0(cond, "_times.csv")))
}
if (dir.exists(file.path(outDir, cond))) {
  unlink(file.path(outDir, cond), recursive = TRUE)
}
# if (dir.exists(file.path(outDir, GEO))) {
#   unlink(file.path(outDir, GEO), recursive = TRUE)
# }

# stop logging to file
setLoggingFile("")
file.remove(file.path(outDir, "vignette_uniform_clustering.log"))

options(prevOptState)
```

```{r closure}
Sys.time()

sessionInfo()
```