---
title: Computing RNA velocity in a Bioconductor framework
author:
- name: Kevin Rue-Albrecht
- name: Aaron Lun
  email: infinite.monkeys.with.keyboards@gmail.com
- name: Charlotte Soneson
  email: charlottesoneson@gmail.com
package: velociraptor
bibliography: velociraptor.bib
output:
  BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{User's guide}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}    
---

```{r, echo=FALSE, results="hide", message=FALSE}
require(knitr)
library(BiocStyle)
opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)
```

# Overview

This package provides a lightweight interface between the Bioconductor `SingleCellExperiment` data structure
and the `r basilisk::PyPiLink("scvelo")` Python package for RNA velocity calculations.
The interface is comparable to that of many other `SingleCellExperiment`-compatible functions,
allowing users to plug in RNA velocity calculations into the existing Bioconductor analysis framework.
To demonstrate, we will use a data set from @Hermann2018-spermatogenesis, provided via the `r Biocpkg("scRNAseq")` package.
This data set contains gene-wise estimates of spliced and unspliced UMI counts for 2,325 mouse spermatogenic cells.

```{r}
library(scRNAseq)
sce <- HermannSpermatogenesisData()
sce
```

# Downsampling for demonstration

The full data set requires up to 12 GB of memory for the example usage presented in this vignette.
For demonstration purposes, we downsample the data set to the first 500 cells.
Feel free to skip this downsampling step if you have access to sufficient memory.

```{r}
sce <- sce[, 1:500]
```

# Basic workflow

We assume that feature selection has already been performed by the user using any method 
(see [here](https://osca.bioconductor.org/feature-selection.html) for some suggestions).
We will use the log-expression method from the `r Biocpkg("scran")` method here.

```{r}
library(scuttle)
sce <- logNormCounts(sce, assay.type=1)

library(scran)
dec <- modelGeneVar(sce)
top.hvgs <- getTopHVGs(dec, n=2000)
```

We can plug these choices into the `scvelo()` function with our `SingleCellExperiment` object.
Here, we will use the `"spliced"` count matrix as a proxy for the typical exonic count matrix.
While this count matrix is not required for the velocity estimation in itself, it is used internally to retrieve the principal component space that is used by `r basilisk::PyPiLink("scvelo")` to find nearest neighbors.
There are some subtle differences between the spliced count matrix and the typical exonic count matrix - see `?scvelo` for some commentary about this -
but the spliced counts are generally a satisfactory replacement.
By default, `scvelo()` uses the steady-state approach to estimate velocities. 
The stochastic and dynamical models implemented in `r basilisk::PyPiLink("scvelo")` are accessible as well, by modifying the `mode` argument. 

```{r}
library(velociraptor)
velo.out <- scvelo(sce, subset.row=top.hvgs, assay.X="spliced")
velo.out
```

The `scvelo()` function produces a `SingleCellExperiment` containing all of the outputs of the calculation in Python.
Of particular interest is the `velocity_pseudotime` vector that captures the relative progression of each cell 
along the biological process driving the velocity vectors.
We can visualize this effect below in a $t$-SNE plot generated by `r Biocpkg('scater')` on the top HVGs.

```{r}
library(scater)

set.seed(100)
sce <- runPCA(sce, subset_row=top.hvgs)
sce <- runTSNE(sce, dimred="PCA")

sce$velocity_pseudotime <- velo.out$velocity_pseudotime
plotTSNE(sce, colour_by="velocity_pseudotime")
```

It is also straightforward to embed the velocity vectors into our desired low-dimensional space, 
as shown below for the $t$-SNE coordinates.
This uses a grid-based approach to summarize the per-cell vectors into local representatives for effective visualization.

```{r}
embedded <- embedVelocity(reducedDim(sce, "TSNE"), velo.out)
grid.df <- gridVectors(reducedDim(sce, "TSNE"), embedded)

library(ggplot2)
plotTSNE(sce, colour_by="velocity_pseudotime") +
    geom_segment(data=grid.df, mapping=aes(x=start.1, y=start.2, 
        xend=end.1, yend=end.2), arrow=arrow(length=unit(0.05, "inches")))
```

And that's it, really.

# Advanced options

`scvelo()` interally performs a PCA step that we can bypass by supplying our own PC coordinates.
Indeed, it is often the case that we have already performed PCA in the earlier analysis steps,
so we can just re-use those results to (i) save time and (ii) improve consistency with the other steps.
Here, we computed the PCA coordinates in `runPCA()` above, so let's just recycle that:

```{r}
# Only setting assay.X= for the initial AnnData creation,
# it is not actually used in any further steps.
velo.out2 <- scvelo(sce, assay.X=1, subset.row=top.hvgs, use.dimred="PCA") 
velo.out2
```

We also provide an option to use the `r basilisk::PyPiLink("scvelo")` pipeline without modification,
i.e., relying on their normalization and feature selection.
This sacrifices consistency with other Bioconductor workflows but enables perfect mimicry of a pure Python-based analysis. 
In this case, arguments like `subset.row=` are simply ignored. 

```{r}
velo.out3 <- scvelo(sce, assay.X=1, use.theirs=TRUE)
velo.out3
```

Advanced users can tinker with the settings of individual `r basilisk::PyPiLink("scvelo")` steps
by setting named lists of arguments in the `scvelo.params=` argument.
For example, to tinker with the behavior of the `recover_dynamics` step, we could do:

```{r}
velo.out4 <- scvelo(sce, assay.X=1, subset.row=top.hvgs,
    scvelo.params=list(recover_dynamics=list(max_iter=20)))
velo.out4
```

# Session information

```{r}
sessionInfo()
```

# References