---
title: "Guided tutorial to datasets cleaning using COTAN"
author:
- name: "Silvia Giulia Galfrè"
  affiliation: "Department of Computer Science, University of Pisa"
- name: "Marco Fantozzi"
  affiliation: "University of Parma"
package: COTAN
output:
  BiocStyle::html_document:
    keep_md: true
vignette: >
  %\VignetteIndexEntry{Cleaning tutorial using COTAN}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

## Preamble

```{r knitr options, echo = FALSE, include = FALSE}
plotDev <- if (requireNamespace("ragg", quietly = TRUE)) {
  "ragg_png"
} else if (requireNamespace("Cairo", quietly = TRUE)) {
  "cairo_png"
} else {
  "png"
}

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7L,
  fig.height = 5L,
  dev = plotDev,
  fig.crop = FALSE,
  dpi = 100L
)
```

```{r preamble, message=FALSE, warning=FALSE}
library(COTAN)
library(zeallot)

# necessary to solve precedence of overloads
conflicted::conflict_prefer("%<-%", "zeallot")

# enable multi-processing (not on Windows)
prevOptState <- options(parallelly.fork.enable = TRUE)
```

# Introduction

This tutorial shows the tools available in COTAN to help with dataset
cleaning.

### Setup

Define a directory where the data and the output will be stored.
```{r setup-and-log}
dataDir <- file.path(tempdir(), "COTAN_vignette_data")
dir.create(dataDir, recursive = TRUE, showWarnings = FALSE)
print(dataDir)

outDir <- dataDir

# Log-level 2 was chosen to showcase better how the package works.
# In normal usage a level of 0 or 1 is more appropriate.
setLoggingLevel(2L)

# This file will contain all the logs produced by the package,
# as if at the highest logging level
setLoggingFile(file.path(outDir, "vignette_dataset_cleaning.log"))
```

## Retrieving the data-set

Download the data-set for mouse cortex E17.5 from Yuzwa et al. (2017).

```{r download-data}
GEO <- "GSM2861514"
dir.create(file.path(dataDir, GEO), showWarnings = FALSE)

datasetFileName <- file.path(dataDir, GEO,
                             "GSM2861514_E175_All_Cells_DGE.txt.gz")

# retry up to 5 times to get the dataset
attempts <- 0L
maxAttempts <- 5L
ok <- FALSE
while (attempts < maxAttempts && !ok) {
  attempts <- attempts + 1L
  if (!file.exists(datasetFileName)) {
    res <- try(
      GEOquery::getGEOSuppFiles(
        GEO = GEO,
        makeDirectory = TRUE,
        baseDir = dataDir,
        filter_regex = base::basename(datasetFileName),
        fetch_files = TRUE
      ),
      silent = TRUE
    )
  }
  ok <- file.exists(datasetFileName)
  if (!ok && attempts < maxAttempts) {
    Sys.sleep(1)
  }
}

assertthat::assert_that(
  ok,
  msg = paste0("Failed to retrieve file '", datasetFileName,
               "' after ", maxAttempts, " attempts.")
)

rawDataset <- read.csv(datasetFileName, sep = "\t", row.names = 1L)
print(dim(rawDataset))
```

### Create the COTAN object

Initialize the `COTAN` object with the raw count table and the metadata from
the experiment.
```{r create-cotan-object}
cond <- "mouse_cortex_E17.5"

obj <- COTAN(raw = rawDataset)
obj <- initializeMetaDataset(
  obj,
  GEO = GEO,
  sequencingMethod = "Drop_seq",
  sampleCondition = cond
)

logThis(paste0("Condition ", getMetadataElement(obj, datasetTags()[["cond"]])),
        logLevel = 1L)
```

#### Alternative ways to create a COTAN object

It is also possible to create a `COTAN` object starting from a
`SingleCellExperiment` object (and vice-versa). Note that the default
patterns try to look up the relevant columns in the most common locations.

```{r conversions-to-and-from-sce}
sceObj <- convertToSingleCellExperiment(obj)

obj2 <- convertFromSingleCellExperiment(
  sceObj,
  assayName = "counts",   # the assay that contains the raw counts
  genesNamesPattern = "", # reg-exp to identify genes' names column
  cellsIDsPattern = "",   # reg-exp to identify cells' ID column
  clNamesPattern = "",    # reg-exp to identify clusterizations columns
  condNamesPattern = ""   # reg-exp to identify cells' conditions columns
)

rm(sceObj, obj2)
```

### Assign to each cell its origin

```{r add-origin-to-the-cotan-object}
# mark cells origin
data(vignette.cells.origin)
head(vignette.cells.origin, 18)

obj <- addCondition(obj, condName = "origin",
                    conditions = vignette.cells.origin)

rm(vignette.cells.origin)

# Sometimes the condition is encoded in the cells' IDs using some separator,
# so it is useful to use the following function to retrieve it
if (FALSE) {
  # This is not the case with the dataset currently in use
  batch <- COTAN::conditionsFromNames(
    names = getCells(obj),
    splitPattern = "_",
    fragmentNum = 2L
  )
  names(batch) <- getCells(obj)
  obj <- addCondition(obj, condName = "batch", conditions = batch)
}
```

# Data cleaning

Before proceeding with the analysis, it might be necessary to clean the data.
The analysis uses a matrix of raw `UMI` counts as the input.
However, the given raw data might not satisfy the needed minimal quality
requirements: it is better to remove any potential cell doublets or
multiplets, as well as any low-quality or dying cells.

## Plots from the raw counts

We can check the library size (`UMI` number) with an *Empirical Cumulative
Distribution* function, discriminating the cells by their origin.

Please note that in the following plots it is **not** necessary to specify a
*condition*.

```{r ecd-plot}
plot(ECDPlot(obj, condName = "origin"))
```

Then we can also plot directly the distribution of the library size.

```{r cells-size-plot}
plot(cellSizePlot(obj, condName = "origin"))
```

We can also plot the distribution of the number of detected genes in each
cell; this is useful to identify possible doublets.

```{r number-detected-genes}
plot(genesSizePlot(obj, condName = "origin"))
```

As a last plot, we can check the scatter plot between the library size and
the number of detected genes: cells far away from the main group are usually
problematic.

```{r scatter-plot}
plot(scatterPlot(obj, condName = "origin"))
```

## Dropping problematic cells

During the cleaning, every time we want to remove cells or genes we can use
the `dropGenesCells()` function.

Drop cells with too many reads, as they are probably doublets or multiplets,
and show the updated plot.

```{r cull-probable-doublets}
# Store the used threshold in the COTAN object for later reference
cellsSizeThr <- 6000L
obj <- addElementToMetaDataset(obj, "Cells size threshold", cellsSizeThr)

cellsToRem <- getCells(obj)[getCellsSize(obj) > cellsSizeThr]
obj <- dropGenesCells(obj, cells = cellsToRem)

plot(cellSizePlot(obj, condName = "origin"))
```

We can also drop cells by their number of expressed genes: a high gene count
might likewise indicate doublets.
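The `6000L` threshold above was chosen by inspecting the plot. As a purely
illustrative alternative (not evaluated here, and not a COTAN facility), a
comparable dataset-specific cut-off could be derived from a high quantile of
the library-size distribution, reusing the `getCellsSize()` accessor shown
above:

```{r cells-size-threshold-sketch, eval = FALSE}
# Sketch: data-driven alternative to the hard-coded threshold.
# Cells above the 99th percentile of the library-size distribution
# are flagged as probable doublets/multiplets.
libSizes <- getCellsSize(obj)
altCellsSizeThr <- as.integer(quantile(libSizes, probs = 0.99))
print(altCellsSizeThr)
```

Any quantile-based cut-off should still be validated against the plots, as a
fixed quantile always drops the same fraction of cells regardless of whether
they are truly problematic.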
```{r cull-probable-doublets_2}
# Store the used threshold in the COTAN object for later reference
genesSizeHighThr <- 3000L
obj <- addElementToMetaDataset(obj, "Num genes high threshold",
                               genesSizeHighThr)

cellsToRem <- getCells(obj)[getNumExpressedGenes(obj) > genesSizeHighThr]
obj <- dropGenesCells(obj, cells = cellsToRem)

plot(genesSizePlot(obj, condName = "origin"))
```

Drop cells with too few expressed genes, as they are probably dead.

```{r cull-probable-dead-cells}
# Store the used threshold in the COTAN object for later reference
genesSizeLowThr <- 500L
obj <- addElementToMetaDataset(obj, "Num genes low threshold",
                               genesSizeLowThr)

cellsToRem <- getCells(obj)[getNumExpressedGenes(obj) < genesSizeLowThr]
obj <- dropGenesCells(obj, cells = cellsToRem)

plot(genesSizePlot(obj, condName = "origin"))
```

Cells with too high a percentage of mitochondrial genes are likely dead (or
at least problematic) cells, so we drop them!

```{r calculate-mitochondrial-genes-percentage}
c(mitPlot, mitSizes) %<-%
  mitochondrialPercentagePlot(obj, genePrefix = "^mt-", condName = "origin")

plot(mitPlot)
```

```{r cull-probable-dead-cells_2}
# Store the used threshold in the COTAN object for later reference
mitPercThr <- 3.5
obj <- addElementToMetaDataset(obj, "Mitoc. perc. threshold", mitPercThr)

cellsToRem <- rownames(mitSizes)[mitSizes[["mit.percentage"]] > mitPercThr]
obj <- dropGenesCells(obj, cells = cellsToRem)

c(mitPlot, mitSizes) %<-%
  mitochondrialPercentagePlot(obj, genePrefix = "^mt-", condName = "origin")

plot(mitPlot)
```

We can check again the scatter plot between the library size and the number
of detected genes: cells far away from the main group are usually
problematic.

```{r scatter-plot_2}
plot(scatterPlot(obj, condName = "origin"))
```

We also want to log the current status.
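As an optional cross-check (not evaluated here), the mitochondrial
percentages can be recomputed with base R directly on the unfiltered
`rawDataset` matrix (genes on the rows, cells on the columns), using the same
`^mt-` prefix convention for mouse mitochondrial genes assumed above:

```{r mitochondrial-percentage-sketch, eval = FALSE}
# Sketch: per-cell fraction of counts from mitochondrial genes,
# computed on the raw matrix with base R only.
mitoRows <- grepl("^mt-", rownames(rawDataset))
mitoPerc <- 100.0 * colSums(rawDataset[mitoRows, , drop = FALSE]) /
  colSums(rawDataset)
summary(mitoPerc)
```

Values computed this way refer to the matrix before any cell was dropped, so
they will not match `mitSizes` exactly after the filtering steps above.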
```{r log-current-dataset-status}
logThis(paste("n cells", getNumCells(obj)), logLevel = 1L)

getDims(obj)
```

The `clean()` function estimates the `nu` parameters from the data and drops
genes/cells that do not satisfy the minimal expression thresholds. Therefore,
it needs to be run again every time any genes or cells are dropped from the
data.

```{r calculate-b-group-from-pca}
obj <- addElementToMetaDataset(obj, "Num drop B group", 0L)

obj <- clean(obj)

c(pcaCellsPlot, pcaCellsData, genesPlot,
  UDEPlot, nuPlot, zoomedNuPlot) %<-% cleanPlots(obj)

plot(pcaCellsPlot)
```

The above plot shows a sub-group of cells `B` that is somewhat distinct from
the other cells.

With the plot below it is possible to see the most expressed genes in the
cells of the `B` group.

```{r show-most-expressed-genes-b-group}
plot(genesPlot)
```

To color the PCA based on `nu` (that is, the cells' efficiency):

```{r cells-efficiency-pca-plot}
plot(UDEPlot)
```

`UDE` (the color) should **not** correlate with the principal components!
This is very important.

The next part is used to remove the cells with too low an efficiency.

```{r cells-efficiency-plot}
plot(nuPlot)
```

We can zoom on the smallest values and, if COTAN detects a clear elbow, we
can decide to remove the cells.

```{r cells-efficiency-plot_low-tail}
plot(zoomedNuPlot)
```

We can observe here that the `B` group cells are really enriched in
hemoglobin genes, so we prefer to remove them. They can be extracted from the
`pcaCellsData` object and removed.
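When the elbow in the zoomed plot is not obvious by eye, a crude knee
heuristic on the sorted `nu` values can suggest a candidate cut-off: take the
point of maximum gap between the curve and the straight line joining its two
extremes. This is only an illustrative sketch (not evaluated here, and not a
COTAN facility); the threshold used below was still chosen by inspecting the
plot.

```{r nu-elbow-sketch, eval = FALSE}
# Sketch: crude knee detection on the sorted efficiency values.
nuSorted <- sort(getNu(obj))
n <- length(nuSorted)

# straight line from the smallest to the largest value
lineVals <- seq(from = nuSorted[[1L]], to = nuSorted[[n]], length.out = n)

# the knee is where the curve deviates most below the line
kneeIdx <- which.max(lineVals - nuSorted)
print(nuSorted[[kneeIdx]])
```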
```{r cull-b-group-and-recalculate}
cellsToRem <- rownames(pcaCellsData)[pcaCellsData[["groups"]] == "B"]
obj <- dropGenesCells(obj, cells = cellsToRem)

# Store how many times the B group was dropped
obj <- addElementToMetaDataset(obj, "Num drop B group", 1L)
```

We also save the defined threshold in the metadata and re-run the estimation.

```{r cull-low-efficiency-cells}
udeLowThr <- 0.34
obj <- addElementToMetaDataset(obj, "Low UDE cells' threshold", udeLowThr)

obj <- estimateNuLinear(obj)

cellsToRem <- getCells(obj)[getNu(obj) < udeLowThr]
obj <- dropGenesCells(obj, cells = cellsToRem)
```

### Check all good

We repeat all the plots now that the *low-quality* cells are removed.

In many practical cases the last few steps are repeated multiple times until
deemed appropriate.

```{r raw-plots}
plot(ECDPlot(obj, condName = "origin"))

plot(cellSizePlot(obj, condName = "origin"))

plot(genesSizePlot(obj, condName = "origin"))

plot(scatterPlot(obj, condName = "origin"))

c(mitPlot, mitSizes) %<-%
  mitochondrialPercentagePlot(obj, genePrefix = "^mt-", condName = "origin")

plot(mitPlot)
```

Then we also rerun the cleaning and the associated plots, and notice that all
is essentially fine.

```{r recalculate-clean-plots}
obj <- clean(obj)

c(pcaCellsPlot, pcaCellsData, genesPlot,
  UDEPlot, nuPlot, zoomedNuPlot) %<-% cleanPlots(obj)
```

```{r show-clean-plots}
plot(pcaCellsPlot)
plot(genesPlot)
plot(UDEPlot)
plot(nuPlot)
plot(zoomedNuPlot)

# It is also possible to directly plot the PCA results
plot(pcaCellsData)
```

Log the remaining number of cells.

```{r log-updated-dataset-status}
logThis(paste("n cells", getNumCells(obj)), logLevel = 1L)

table(getCondition(obj, condName = "origin"))
```

## Vignette clean-up stage

The next few lines are just to clean up.
```{r final-clean-up}
if (file.exists(file.path(outDir, paste0(cond, ".cotan.RDS")))) {
  # delete the file if it exists
  file.remove(file.path(outDir, paste0(cond, ".cotan.RDS")))
}

if (file.exists(file.path(outDir, paste0(cond, "_times.csv")))) {
  # delete the file if it exists
  file.remove(file.path(outDir, paste0(cond, "_times.csv")))
}

if (dir.exists(file.path(outDir, cond))) {
  unlink(file.path(outDir, cond), recursive = TRUE)
}

# if (dir.exists(file.path(outDir, GEO))) {
#   unlink(file.path(outDir, GEO), recursive = TRUE)
# }

# stop logging to file
setLoggingFile("")
file.remove(file.path(outDir, "vignette_dataset_cleaning.log"))

options(prevOptState)
```

```{r closure}
Sys.time()

sessionInfo()
```