---
title: "scDataviz: single cell dataviz and downstream analyses"
author: "Kevin Blighe"
date: "`r Sys.Date()`"
package: "`r packageVersion('scDataviz')`"
output:
  html_document:
    toc: true
    toc_depth: 2
    number_sections: true
    theme: united
    highlight: tango
fig_width: 7
always_allow_html: true
bibliography: library.bib
vignette: >
    %\VignetteIndexEntry{scDataviz: single cell dataviz and downstream analyses}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
    %\usepackage[utf8]{inputenc}
---

# Introduction

In the single cell World, which includes flow cytometry, mass cytometry,
single-cell RNA-seq (scRNA-seq), and others, there is a need to improve
data visualisation and to bring analysis capabilities to researchers even
from non-technical backgrounds. *scDataviz* [@scDataviz] attempts to fit
into this space, while also catering for advanced users. Additonally, due
to the way that *scDataviz* is designed, which is based on
*SingleCellExperiment* [@Lun], it has a 'plug and play' feel, and immediately
lends itself as flexibile and compatibile with studies that go beyond
*scDataviz*. Finally, the graphics in *scDataviz* are generated via the
*ggplot* [@Wickham] engine, which means that users can 'add on' features
to these with ease.

This package just provides some additional functions for dataviz and clustering,
and provides another way of identifying cell-types in clusters. It is not strictly
intended as a standalone analysis package. For a comprehensive high-dimensional
cytometry workflow, it is recommended to check out the work by Nowicka
*et al.* [CyTOF workflow: differential discovery in high-throughput high-dimensional cytometry datasets](https://f1000research.com/articles/6-748).
For a more comprehensive scRNA-seq workflow, please check out [OSCA](https://osca.bioconductor.org)
and [Analysis of single cell RNA-seq data](https://scrnaseq-course.cog.sanger.ac.uk/website/index.html).


```{r, echo = FALSE, message = FALSE}

  suppressWarnings(library(knitr))

  suppressWarnings(library(kableExtra))

  opts_chunk$set(tidy = FALSE, message = FALSE, warning = FALSE)

```

# Installation

## 1. Download the package from Bioconductor

```{r getPackage, eval = FALSE}

  if (!requireNamespace('BiocManager', quietly = TRUE))
    install.packages('BiocManager')

  BiocManager::install('scDataviz')

```

Note: to install development version:

```{r getPackageDevel, eval = FALSE}

  devtools::install_github('kevinblighe/scDataviz')

```

## 2. Load the package into R session

```{r Load, message = FALSE}

  library(scDataviz)

```

# Tutorial 1: CyTOF FCS data

Here, we will utilise some of the flow cytometry data from
[Deep phenotyping detects a pathological CD4+ T-cell complosome signature in systemic sclerosis](https://www.nature.com/articles/s41423-019-0360-8).

This can normally be downloadedd via `git clone` from your command prompt:

```{r, engine = 'bash', eval = FALSE}

  git clone https://github.com/kevinblighe/scDataviz_data/ ;

```

In a practical situation, we would normally read in this data from the raw FCS
files and then QC filter, normalise, and transform them. This can be achieved
via the `processFCS` function, which, by default, also removes variables based
on low variance and downsamples [randomly] your data to 100000 variables. The
user can change these via the `downsample` and `downsampleVar` parameters. An
example (*not run*) is given below:

```{r readFCS, eval = FALSE}

  filelist <- list.files(
    path = "scDataviz_data/FCS/",
    pattern = "*.fcs|*.FCS",
    full.names = TRUE)
  filelist

  metadata <- data.frame(
    sample = gsub('\\ [A-Za-z0-9]*\\.fcs$', '',
      gsub('scDataviz_data\\/FCS\\/\\/', '', filelist)),
    group = c(rep('Healthy', 7), rep('Disease', 11)),
    treatment = gsub('\\.fcs$', '',
      gsub('scDataviz_data\\/FCS\\/\\/[A-Z0-9]*\\ ', '', filelist)),
    row.names = filelist,
    stringsAsFactors = FALSE)
  metadata

  inclusions <- c('Yb171Di','Nd144Di','Nd145Di',
    'Er168Di','Tm169Di','Sm154Di','Yb173Di','Yb174Di',
    'Lu175Di','Nd143Di')

  markernames <- c('Foxp3','C3aR','CD4',
    'CD46','CD25','CD3','Granzyme B','CD55',
    'CD279','CD45RA')

  names(markernames) <- inclusions
  markernames

  exclusions <- c('Time','Event_length','BCKG190Di',
    'Center','Offset','Width','Residual')

  sce <- processFCS(
    files = filelist,
    metadata = metadata,
    transformation = TRUE,
    transFun = function (x) asinh(x),
    asinhFactor = 5,
    downsample = 10000,
    downsampleVar = 0.7,
    colsRetain = inclusions,
    colsDiscard = exclusions,
    newColnames = markernames)

```

In flow and mass cytometry, getting the correct marker names in the FCS files can
be surprisingly difficult. In many cases, from experience, a facility may label
the markers by their metals, such as Iridium (Ir), Ruthenium (Ru), Terbium (Tb),
*et cetera* - this is the case for the data used in this tutorial. The true marker
names may be held as pData encoded within each FCS, accessible via:

```{r readFCSparameters, eval = FALSE}

  library(flowCore)
  pData(parameters(
    read.FCS(filelist[[4]], transformation = FALSE, emptyValue = FALSE)))

```

Whatever the case, it is important to sort out marker naming issues prior to the experiment
being conducted in order to avoid any confusion.

For this vignette, due to the fact that the raw FCS data is > 500 megabytes,
we will work with a smaller pre-prepared dataset that has been downsampled to
10000 cells using the above code. This data comes included with the package.

Load the pre-prepared complosome data.

```{r load}

  load(system.file('extdata/', 'complosome.rdata', package = 'scDataviz'))
  
```

One can also create a new *SingleCellExperiment* object manually using any
type of data, including any data from scRNA-seq produced elsewhere. Import
functions for data deriving from other sources is covered in Tutorials 2 and
3 in this vignette. All functions in *scDataviz* additionally accept
data-frames or matrices on their own, de-necessitating the reliance on the
*SingleCellExperiment* class.


## Perform principal component analysis (PCA)

We can use the *PCAtools* [@PCAtools] package for the purpose of performing
PCA.

```{r ex1, fig.height = 7, fig.width = 8, fig.cap = 'Perform PCA'}

  library(PCAtools)
  p <- pca(assay(sce, 'scaled'), metadata = metadata(sce))

  biplot(p,
    x = 'PC1', y = 'PC2',
    lab = NULL,
    xlim = c(min(p$rotated[,'PC1'])-1, max(p$rotated[,'PC1'])+1),
    ylim = c(min(p$rotated[,'PC2'])-1, max(p$rotated[,'PC2'])+1),
    pointSize = 1.0,
    colby = 'treatment',
    legendPosition = 'right',
    title = 'PCA applied to CyTOF data',
    caption = paste0('10000 cells randomly selected after ',
      'having filtered for low variance'))

```

We can add the rotated component loadings as a new reduced dimensional
component to our dataset.

```{r addPCAdim}

  reducedDim(sce, 'PCA') <- p$rotated

```

For more functionality via *PCAtools*, check the vignette:
[PCAtools: everything Principal Component Analysis](https://bioconductor.org/packages/release/bioc/vignettes/PCAtools/inst/doc/PCAtools.html)

## Perform UMAP

UMAP can be performed on the entire dataset, if your computer's memory will
permit. Currently it's default is to use the data contained in the 'scaled'
assay component of your *SingleCellExperiment* object.

```{r performUMAP}

  sce <- performUMAP(sce)

```

UMAP can also be stratified based on a column in your metadata, e.g.,
(treated versus untreated samples); however, to do this, I recommend
creating separate *SingleCellExperiment* objects from the very start,
i.e., from the the data input stage, and processing the data separately
for each group.

**Nota bene** - advanced users may want to change the default configuration for UMAP. *scDataviz* currently performs UMAP via the *umap* package. In order to modify the default configuration, one can pull in the default config separately from the *umap* package and then modify these config values held in the *umap.defaults* variable, as per the [umap vignette](https://cran.r-project.org/web/packages/umap/vignettes/umap.html) (see 'Tuning UMAP' section). For example:

```{r eval = FALSE, echo = TRUE}

  config <- umap::umap.defaults
  config$min_dist <- 0.5
  performUMAP(sce, config = config)

```

We can also perform UMAP on a select number of PC eigenvectors.
*PCAtools* [@PCAtools] can be used to infer ideal number of dimensions
to use via the elbow method and Horn's parallel analysis.

```{r elbowHorn}

  elbow <- findElbowPoint(p$variance)
  horn <- parallelPCA(assay(sce, 'scaled'))

  elbow
  horn$n

```

For now, let's just use 5 PCs.

```{r performUMAP_PCA}

  sce <- performUMAP(sce, reducedDim = 'PCA', dims = c(1:5))

```

## Create a contour plot of the UMAP layout

This and the remaining sections in this tutorial are about producing great
visualisations of the data and attempting to make sense of it, while not
fully overlapping with functionalioty provided by other programs that operate
in tis space.

With the contour plot, we are essentially looking at celluar density. It can
provide for a beautiful viusualisation in a manuscript while also serving as
a useful QC tool: if the density is 'scrunched up' into a single area in the
plot space, then there are likely issues with your input data distribution.
We want to see well-separated, high density 'islands', or, at least, gradual
gradients that blend into one another across high density 'peaks'.

```{r ex2, fig.height = 7.5, fig.width = 16, fig.cap = 'Create a contour plot of the UMAP layout'}

  ggout1 <- contourPlot(sce,
    reducedDim = 'UMAP',
    bins = 150,
    subtitle = 'UMAP performed on expression values',
    legendLabSize = 18,
    axisLabSize = 22,
    titleLabSize = 22,
    subtitleLabSize = 18,
    captionLabSize = 18)

  ggout2 <- contourPlot(sce,
    reducedDim = 'UMAP_PCA',
    bins = 150,
    subtitle = 'UMAP performed on PC eigenvectors',
    legendLabSize = 18,
    axisLabSize = 22,
    titleLabSize = 22,
    subtitleLabSize = 18,
    captionLabSize = 18)

  cowplot::plot_grid(ggout1, ggout2,
    labels = c('A','B'),
    ncol = 2, align = "l", label_size = 24)

```

## Show marker expression across the layout

Here, we randomly select some markers and then plot their expression profiles
across the UMAP layouts.

```{r ex3, fig.height = 12, fig.width = 20, fig.cap = 'Show marker expression across the layout'}

  markers <- sample(rownames(sce), 6)
  markers

  ggout1 <- markerExpression(sce,
    markers = markers,
    subtitle = 'UMAP performed on expression values',
    nrow = 1, ncol = 6,
    legendKeyHeight = 1.0,
    legendLabSize = 18,
    stripLabSize = 22,
    axisLabSize = 22,
    titleLabSize = 22,
    subtitleLabSize = 18,
    captionLabSize = 18)

  ggout2 <- markerExpression(sce,
    markers = markers,
    reducedDim = 'UMAP_PCA',
    subtitle = 'UMAP performed on PC eigenvectors',
    nrow = 1, ncol = 6,
    col = c('white', 'darkblue'),
    legendKeyHeight = 1.0,
    legendLabSize = 18,
    stripLabSize = 22,
    axisLabSize = 22,
    titleLabSize = 22,
    subtitleLabSize = 18,
    captionLabSize = 18)

  cowplot::plot_grid(ggout1, ggout2,
    labels = c('A','B'),
    nrow = 2, align = "l", label_size = 24)

```

## Shade cells by metadata

Shading cells by metadata can be useful for identifying any batch effects,
but also useful for visualising, e.g., differences across treatments.

First, let's take a look inside the metadata that we have.

```{r metadataPlot}

  head(metadata(sce))

  levels(metadata(sce)$group)

  levels(metadata(sce)$treatment)

```

```{r ex4, fig.height = 12, fig.width = 14, fig.cap = 'Shade cells by metadata', message = FALSE}

  ggout1 <- metadataPlot(sce,
    colby = 'group',
    colkey = c(Healthy = 'royalblue', Disease = 'red2'),
    title = 'Disease status',
    subtitle = 'UMAP performed on expression values',
    legendLabSize = 16,
    axisLabSize = 20,
    titleLabSize = 20,
    subtitleLabSize = 16,
    captionLabSize = 16)

  ggout2 <- metadataPlot(sce,
    reducedDim = 'UMAP_PCA',
    colby = 'group',
    colkey = c(Healthy = 'royalblue', Disease = 'red2'),
    title = 'Disease status',
    subtitle = 'UMAP performed on PC eigenvectors',
    legendLabSize = 16,
    axisLabSize = 20,
    titleLabSize = 20,
    subtitleLabSize = 16,
    captionLabSize = 16)

  ggout3 <- metadataPlot(sce,
    colby = 'treatment',
    title = 'Treatment type',
    subtitle = 'UMAP performed on expression values',
    legendLabSize = 16,
    axisLabSize = 20,
    titleLabSize = 20,
    subtitleLabSize = 16,
    captionLabSize = 16)

  ggout4 <- metadataPlot(sce,
    reducedDim = 'UMAP_PCA',
    colby = 'treatment',
    title = 'Treatment type',
    subtitle = 'UMAP performed on PC eigenvectors',
    legendLabSize = 16,
    axisLabSize = 20,
    titleLabSize = 20,
    subtitleLabSize = 16,
    captionLabSize = 16)

  cowplot::plot_grid(ggout1, ggout3, ggout2, ggout4,
    labels = c('A','B','C','D'),
    nrow = 2, ncol = 2, align = "l", label_size = 24)

```

## Find ideal clusters in the UMAP layout via k-nearest neighbours

This function utilises the k nearest neighbours (k-NN) approach from Seurat,
which works quite well on flow cytometry and CyTOF UMAP layouts, from my
experience.

```{r ex5, message = FALSE, fig.height = 8, fig.width = 14, fig.cap = 'Find ideal clusters in the UMAP layout via k-nearest neighbours'}

  sce <- clusKNN(sce,
    k.param = 20,
    prune.SNN = 1/15,
    resolution = 0.01,
    algorithm = 2,
    verbose = FALSE)

  sce <- clusKNN(sce,
    reducedDim = 'UMAP_PCA',
    clusterAssignName = 'Cluster_PCA',
    k.param = 20,
    prune.SNN = 1/15,
    resolution = 0.01,
    algorithm = 2,
    verbose = FALSE)

  ggout1 <- plotClusters(sce,
    clusterColname = 'Cluster',
    labSize = 7.0,
    subtitle = 'UMAP performed on expression values',
    caption = paste0('Note: clusters / communities identified via',
      '\nLouvain algorithm with multilevel refinement'),
    axisLabSize = 20,
    titleLabSize = 20,
    subtitleLabSize = 16,
    captionLabSize = 16)

  ggout2 <- plotClusters(sce,
    clusterColname = 'Cluster_PCA',
    reducedDim = 'UMAP_PCA',
    labSize = 7.0,
    subtitle = 'UMAP performed on PC eigenvectors',
    caption = paste0('Note: clusters / communities identified via',
      '\nLouvain algorithm with multilevel refinement'),
    axisLabSize = 20,
    titleLabSize = 20,
    subtitleLabSize = 16,
    captionLabSize = 16)

  cowplot::plot_grid(ggout1, ggout2,
    labels = c('A','B'),
    ncol = 2, align = "l", label_size = 24)

```

## Plot marker expression per identified cluster

```{r ex6a, eval = FALSE}

  markerExpressionPerCluster(sce,
    caption = 'Cluster assignments based on UMAP performed on expression values',
    stripLabSize = 22,
    axisLabSize = 22,
    titleLabSize = 22,
    subtitleLabSize = 18,
    captionLabSize = 18)

```

```{r ex6b, fig.height = 7, fig.width = 12, fig.cap = 'Plot marker expression per identified cluster2'}

  clusters <- unique(metadata(sce)[['Cluster_PCA']])
  clusters

  markers <- sample(rownames(sce), 5)
  markers

  markerExpressionPerCluster(sce,
    clusters = clusters,
    clusterAssign = metadata(sce)[['Cluster_PCA']],
    markers = markers,
    nrow = 2, ncol = 5,
    caption = 'Cluster assignments based on UMAP performed on PC eigenvectors',
    stripLabSize = 22,
    axisLabSize = 22,
    titleLabSize = 22,
    subtitleLabSize = 18,
    captionLabSize = 18)

```

Try all markers across a single cluster:

```{r ex6c, fig.height = 6, fig.width = 8, fig.cap = 'Plot marker expression per identified cluster3'}

  cluster <- sample(unique(metadata(sce)[['Cluster']]), 1)
  cluster

  markerExpressionPerCluster(sce,
    clusters = cluster,
    markers = rownames(sce),
    stripLabSize = 20,
    axisLabSize = 20,
    titleLabSize = 20,
    subtitleLabSize = 14,
    captionLabSize = 12)

```



## Determine enriched markers in each cluster and plot the expression signature

This method also calculates metacluster abundances across a chosen phenotype.
The function returns a data-frame, which can then be exported to do other
analyses.

### Disease vs Healthy metacluster abundances

```{r echo = TRUE, eval = FALSE}

  markerEnrichment(sce,
    method = 'quantile',
    studyvarID = 'group')

```

`r knitr::kable(markerEnrichment(sce, method = 'quantile', studyvarID = 'group'), auto = TRUE) %>% kable_styling(full_width = FALSE)`

.

### Treatment type metacluster abundances

```{r echo = TRUE, eval = FALSE}

  markerEnrichment(sce,
    sampleAbundances = FALSE,
    method = 'quantile',
    studyvarID = 'treatment')

```

`r knitr::kable(markerEnrichment(sce, sampleAbundances = FALSE, method = 'quantile', studyvarID = 'treatment'), auto = TRUE) %>% kable_styling(full_width = FALSE)`

.

### Expression signature

The expression signature is a quick way to visualise which markers are more or
less expressed in each identified cluster of cells.

```{r ex7, fig.height = 8, fig.width = 6, fig.cap = 'Determine enriched markers in each cluster and plot the expression signature'}

  plotSignatures(sce,
    labCex = 1.2,
    legendCex = 1.2,
    labDegree = 40)

```


# Tutorial 2: Import from Seurat

Due to the fact that *scDataviz* is based on *SingleCellExperiment*, it has
increased interoperability with other packages, including the popular
*Seurat* [@satijalab]. Taking the data produced from the
[Seurat Tutorial](https://satijalab.org/seurat/v3.1/pbmc3k_tutorial.html) on
Peripheral Blood Mononuclear Cells (PBMCs), we can convert this to a
*SingleCellExperiment* object recognisable by *scDataviz* via
`as.SingleCellExperiment()`.

When deriving from the Seurat route, be sure to manually assign the metadata slot,
which is required for some functions. Also be sure to modify the default values for
`assay`, `reducedDim`, and `dimColnames`, as these are assigned differently in Seurat.

```{r import_Seurat1, eval = FALSE}

  sce <- as.SingleCellExperiment(pbmc)

  metadata(sce) <- data.frame(colData(sce))

  markerExpression(sce,
    assay = 'logcounts',
    reducedDim = 'UMAP',
    dimColnames = c('UMAP_1','UMAP_2'),
    markers = c('CD79A', 'Cd79B', 'MS4A1'))

```

For `markerEnrichment()`, a typical command using an ex-Seurat object could be:

```{r import_Seurat2, eval = FALSE}

  markerEnrichment(sce,
    assay = 'logcounts',
    method = 'quantile',
    sampleAbundances = TRUE,
    sampleID = 'orig.ident',
    studyvarID = 'ident',
    clusterAssign = as.character(colData(sce)[['seurat_clusters']]))

```


# Tutorial 3: Import any numerical data

*scDataviz* will work with any numerical data, too. Here, we show a quick
example of how one can import a data-matrix of randomly-generated numbers
that follow a negative binomial distribution, comprising 2500 cells and 20
markers:

```{r importRandomData1}

  mat <- jitter(matrix(
    MASS::rnegbin(rexp(50000, rate=.1), theta = 4.5),
    ncol = 20))
  colnames(mat) <- paste0('CD', 1:ncol(mat))
  rownames(mat) <- paste0('cell', 1:nrow(mat))

  metadata <- data.frame(
    group = rep('A', nrow(mat)),
    row.names = rownames(mat),
    stringsAsFactors = FALSE)
  head(metadata)

  sce <- importData(mat,
    assayname = 'normcounts',
    metadata = metadata)
  sce

```

This will also work without any assigned metadata; however, having no metadata limits
the functionality of the package.

```{r importRandomData2}

  sce <- importData(mat,
    assayname = 'normcounts',
    metadata = NULL)
  sce

```



# Acknowledgments

* Jessica Timms
* James Opzoomer
* Shahram Kordasti
* Marcel Ramos (Bioconductor)
* Lori Shepherd (Bioconductor)
* Bioinformatics CRO
* Henrik Bengtsson


# Session info

```{r}

sessionInfo()

```

# References

@scDataviz

@PCAtools

@Lun

@satijalab

@Wickham