--- 
title: "TF activity inference from scRNA-seq data with DoRothEA as regulon 
    resource."
author:
- name: Christian H. Holland 
  affiliation: Institute for Computational Biomedicine, Heidelberg University
  email: christian.holland@bioquant.uni-heidelberg.de
- name: Alberto Valdeolivas
  affiliation: Institute for Computational Biomedicine, Heidelberg University
  email: alberto.valdeolivas@bioquant.uni-heidelberg.de
- name: Julio Saez-Rodriguez 
  affiliation: Institute for Computational Biomedicine, Heidelberg University 
   
package: dorothea
output:
  BiocStyle::html_document
bibliography: references.bib
abstract: |
  This vignette describes how to infer transcription factor activity from 
  scRNA-seq data by running Viper on DoRothEA's curated regulons. 
license: GNU-GLPv3, please check http://www.gnu.org/licenses/
vignette: |
  %\VignetteIndexEntry{TF activity inference from scRNA-seq data with DoRothEA as regulon resource.}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Introduction

**DoRothEA** is a comprehensive resource containing a curated collection of 
transcription factors (TFs) and its transcriptional targets. The set of genes
regulated by a specific transcription factor is known as regulon. DoRothEA's 
regulons were gathered from different types of evidence. Each TF-target 
interaction is defined by a confidence level based on the number of 
supporting evidence. The confidence levels ranges from A (highest confidence) 
to E (lowest confidence) [@GarciaAlonso2019]. 

**DoRothEA** regulons are usually coupled with the statistical method **VIPER** 
[@Alvarez2016]. In this context, TF activities are computed based on the mRNA 
expression levels of its targets. We therefore can consider TF activity as a 
proxy of a given transcriptional state [@Dugourd2019]. 

@Holland2020 evaluated the performance of **DoRothEA** in 
combination with **VIPER** when applied to scRNA-seq data. We showed that, in 
spite of the current limitations of scRNA-seq technologies, their approach can 
provide meaningful results in this context. Indeed, this vignette shows an 
example on how to apply **DoRothEA** regulons coupled with **VIPER** in a well 
known single-cell dataset. 

# Installation

First of all, you need a current version of R (http://www.r-project.org). 
`r Biocpkg("DoRothEA")` is a freely available annotation package deposited on 
http://bioconductor.org/ and https://github.com/saezlab/dorothea. 

You can install it by running the following commands on an R console:

```{r "installation", eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("dorothea")
```

We also load here the packages required  to run this script. 

```{r "load packages", message = FALSE}
## We load the required packages
library(dorothea)
library(dplyr)
library(Seurat)
library(tibble)
library(pheatmap)
library(tidyr)
library(viper)
```

# Example of usage

In the following paragraphs, we provide examples describing how to run **VIPER** 
on **DoRothEA** regulons in a scRNA-seq dataset. In particular, we use the 
`r CRANpkg("Seurat")` toolkit for single cell genomics [@Stuart2019]. For the 
sake of simplicity, we follow the example provided in the following 
`r CRANpkg("Seurat")` vignette:  

https://satijalab.org/seurat/v3.1/pbmc3k_tutorial.html 

The dataset contains 2700 Peripheral Blood Mononuclear Cells (PBMC) that were 
sequenced on the Illumina NextSeq 500. This dataset is freely available in 
10X Genomics: 

https://cf.10xgenomics.com/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz

```{r "load data", eval=FALSE}
## Load the PBMC dataset
pbmc.data <- Read10X(data.dir = "filtered_gene_bc_matrices/hg19/")

## Initialize the Seurat object with the raw (non-normalized data).
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", 
                           min.cells = 3, min.features = 200)
```

```{r "load data2", eval=TRUE , include=FALSE}
load(file="../inst/extdata/for_vignette/seurat_object.RData", 
     verbose = FALSE)
```

## Pre-processing, normalization and identification of highly variable features

We follow the standard pre-processing steps as described in the aforementioned 
`r CRANpkg("Seurat")` vignette before going deeper into the data analysis. 
These steps carry out the selection and filtration of cells based on quality
control metrics, the data normalization and scaling, and the detection of highly
variable features (see https://satijalab.org/seurat/v3.1/pbmc3k_tutorial.html). 

```{r "preprocessing", message=FALSE}
## Identification of mithocondrial genes
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")

## Filtering cells following standard QC criteria.
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & 
    percent.mt < 5)

## Normalizing the data
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", 
    scale.factor = 10000)

pbmc <- NormalizeData(pbmc)

## Identify the 2000 most highly variable genes
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)

## In addition we scale the data
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
```

## Clustering cells

One of the most relevant steps in scRNA-seq data analysis is clustering. Cells
are grouped based on the similarity of their transcriptomic profiles. We first 
apply the `r CRANpkg("Seurat")` v3 classical approach as described in their 
aforementioned vignette. We visualize the cell clusters using UMAP: 

```{r "clustering", message=FALSE, warning=FALSE}
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc), 
               verbose = FALSE)
pbmc <- FindNeighbors(pbmc, dims = 1:10, verbose = FALSE)
pbmc <- FindClusters(pbmc, resolution = 0.5, verbose = FALSE)
pbmc <- RunUMAP(pbmc, dims = 1:10, umap.method = "uwot", metric = "cosine")

pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, 
                               logfc.threshold = 0.25, verbose = FALSE)

## Assigning cell type identity to clusters
new.cluster.ids <- c("Naive CD4 T", "Memory CD4 T", "CD14+ Mono", "B", "CD8 T",
                     "FCGR3A+ Mono", "NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
```

```{r "pca", message = FALSE, warning = FALSE}
DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()
```

## Clustering cells with TF activity

@Holland2020 showed that clustering the cells based on their TF activity 
profiles can also be very interesting. Indeed, clustering cells using TF 
activity computed with **VIPER** and **DoRothEA** performs better than 
using the expression level of the same TFs. In addition, it brings 
complementary information to the clusters based on transcriptomics profiles.

Here, we first run **VIPER** on **DoRothEA's** regulons to obtain TFs activity,
by using the wrapper function `run_viper()`. This function can deal with 
different input types such as matrix, dataframe, ExpressionSet or even Seurat 
objects. In case of a seurat object the function returns the same seurat object 
with an additonal assay called `dorothea` containing the TF activities in the 
slot `data`.
```{r "viper", message=FALSE}
## We read Dorothea Regulons for Human:
dorothea_regulon_human <- get(data("dorothea_hs", package = "dorothea"))

## We obtain the regulons based on interactions with confidence level A, B and C
regulon <- dorothea_regulon_human %>%
    dplyr::filter(confidence %in% c("A","B","C"))

## We compute Viper Scores 
pbmc <- run_viper(pbmc, regulon,
                  options = list(method = "scale", minsize = 4, 
                                 eset.filter = FALSE, cores = 1, 
                                 verbose = FALSE))
```

We then apply `r CRANpkg("Seurat")` to cluster the cells following the same 
protocol than above but using TF activity scores.

```{r "tf clustering", message=FALSE}
## We compute the Nearest Neighbours to perform cluster
DefaultAssay(object = pbmc) <- "dorothea"
pbmc <- ScaleData(pbmc)
pbmc <- RunPCA(pbmc, features = rownames(pbmc), verbose = FALSE)
pbmc <- FindNeighbors(pbmc, dims = 1:10, verbose = FALSE)
pbmc <- FindClusters(pbmc, resolution = 0.5, verbose = FALSE)

pbmc <- RunUMAP(pbmc, dims = 1:10, umap.method = "uwot", metric = "cosine")

pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, 
                               logfc.threshold = 0.25, verbose = FALSE)

## Assigning cell type identity to clusters
new.cluster.ids <- c("Naive CD4 T", "Memory CD4 T", "CD14+ Mono", "B", "CD8 T", 
                     "FCGR3A+ Mono", "NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
```

```{r "tf_umap", message = FALSE, warning = FALSE}
DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()
```

## TF activity per cell population

Finally, we characterise the different cell populations based on their 
TF activities thanks to the previously computed **VIPER** scores on 
**DoRothEA's** regulons. 

```{r "tf activity", message = FALSE}
## We transform Viper scores, scaled by seurat, into a data frame to better 
## handling the results
viper_scores_df <- GetAssayData(pbmc, slot = "scale.data", 
                                    assay = "dorothea") %>%
  data.frame() %>%
  t()

## We create a data frame containing the cells and their clusters
CellsClusters <- data.frame(cell = names(Idents(pbmc)), 
                            cell_type = as.character(Idents(pbmc)),
                            stringsAsFactors = FALSE)

## We create a data frame with the Viper score per cell and its clusters
viper_scores_clusters <- viper_scores_df  %>%
  data.frame() %>% 
  rownames_to_column("cell") %>%
  gather(tf, activity, -cell) %>%
  inner_join(CellsClusters)

## We summarize the Viper scores by cellpopulation
summarized_viper_scores <- viper_scores_clusters %>% 
  group_by(tf, cell_type) %>%
  summarise(avg = mean(activity),
            std = sd(activity))

```

For visualization purposes, we select the 20 most variable TFs across clusters
according to our scores. 

```{r "highly variable tfs", message=FALSE}
## We select the 20 most variable TFs. (20*9 populations = 180)
highly_variable_tfs <- summarized_viper_scores %>%
  group_by(tf) %>%
  mutate(var = var(avg))  %>%
  ungroup() %>%
  top_n(180, var) %>%
  distinct(tf)

## We prepare the data for the plot
summarized_viper_scores_df <- summarized_viper_scores %>%
  semi_join(highly_variable_tfs, by = "tf") %>%
  dplyr::select(-std) %>%   
  spread(tf, avg) %>%
  data.frame(row.names = 1, check.names = FALSE, stringsAsFactors = FALSE) 
```

```{r "tf_heatmap"}
palette_length = 100
my_color = colorRampPalette(c("Darkblue", "white","red"))(palette_length)

my_breaks <- c(seq(min(summarized_viper_scores_df), 0, 
                   length.out=ceiling(palette_length/2) + 1),
               seq(max(summarized_viper_scores_df)/palette_length, 
                   max(summarized_viper_scores_df), 
                   length.out=floor(palette_length/2)))

viper_hmap <- pheatmap(t(summarized_viper_scores_df),fontsize=14, 
                       fontsize_row = 10, 
                       color=my_color, breaks = my_breaks, 
                       main = "DoRothEA (ABC)", angle_col = 45,
                       treeheight_col = 0,  border_color = NA) 
```

# Session info

```{r sessionInfo, echo=FALSE}
sessionInfo()
```

# References