---
title: "Retrofit Simulation Vignette"
author: "Adam Keebum Park, Roopali Singh"
output: BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{Retrofit Simulation Vignette}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, comment = '##', results = 'markup', warning = FALSE)
```

# Introduction

RETROFIT is a statistical method for reference-free deconvolution of spatial transcriptomics data to estimate cell type mixtures. In this Vignette, we will estimate cell type composition of a sample synthetic dataset. We will annotate cell types using an annotated single cell reference.

# Package Installation

Install and load the package using the following steps:
```{r, eval=FALSE}
if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("retrofit")
```

The main functionalities of RETROFIT are covered in this tutorial.
```{r, load_library, echo = FALSE}
library(retrofit)
```

# Spatial Transcriptomics Data

First load the ST data, using the following command:
```{r, data}
data("vignetteSimulationData")
x = vignetteSimulationData$n10m3_x
```
The ST data matrix will consist of G = 500 genes and S = 1000 spots i.e., a matrix of order G x S.

# Deconvolution

Initialize the following parameters for deconvolution:
- iterations: Number of iterations (default = 4000)
- L: Number of components required
```{r, decompose initialization}
iterations  = 10
L           = 20
```

After initialization, run RETROFIT on the data (X) as follows:
```{r, decompose}
result  = retrofit::decompose(x, L=L, iterations=iterations, verbose=TRUE)
H   = result$h
W   = result$w
Th  = result$th
```
After deconvolution of ST data, we have our estimates of W (a matrix of order G x L), H (a matrix of order L x S) and Theta (a vector of L components). Here, we are using number of iterations as 10 for demonstration purposes. For reproducing results in the paper, we need to run RETROFIT for iterations = 4000. The whole computation is omitted here due to time complexity (> 10min). We will load the results from 4000 iterations for the rest of the analysis.

```{r, load_retrofit_results}
H   = vignetteSimulationData$results_4k_iterations$decompose$h
W   = vignetteSimulationData$results_4k_iterations$decompose$w
Th  = vignetteSimulationData$results_4k_iterations$decompose$th
```

The above results are obtained by running the code below.

```{r, retrofit reproducibility, eval = FALSE}
iterations = 4000
set.seed(12)
result = retrofit::decompose(x, L=L, iterations=iterations)
```
# Cell-type Annotation

Next, we need to annotate the components, to get the proportion of, say K, cell types. We can do this in two ways: (a) using an annotated single cell reference or (b) using the known marker genes. Here, we will annotate using single cell reference.

Get the single cell reference data:
```{r, reference}
sc_ref_w = vignetteSimulationData$sc_ref_w
```

This file contains average gene expression values for G = 500 genes in K = 10 cell types i.e., a matrix of order G x K. Run the following command to get K cell-type mixtures from the ST data X:
```{r, annotate}
K             = 10
result        = retrofit::annotateWithCorrelations(sc_ref=sc_ref_w, K=K, 
                                                   decomp_w=W, decomp_h=H)
H_annotated   = result$h
W_annotated   = result$w
ranked_cells  = result$ranked_cells
```
Above code assigns components to the cell type it has maximum correlation with.

# Deconvolution and annotation in one step (Optional)

We can also deconvolve the ST data matrix along with cell-type annotation all in one step, as follows:
```{r, retrofit}
iterations  = 10
L           = 20
K           = 10 
result      = retrofit::retrofit(x, 
                                 sc_ref=sc_ref_w,
                                 iterations=iterations, 
                                 L=L, 
                                 K=K)
```

# Results 

Visualize the correlation values between each cell type and the chosen component, as follows:
```{r, visualize_correlations}
# correlation between true and estimated cell-type proportions
correlations        = stats::cor(sc_ref_w[ranked_cells], W_annotated[,ranked_cells])
ranked_correlations = sort(diag(correlations), decreasing=TRUE)
df                  = data.frame(x=1:length(ranked_correlations), 
                                 y=ranked_correlations, 
                                 label_x1=1, 
                                 label_x2=2, 
                                 label_y=seq(from=0.5, by=-0.05, length.out=10),
                                 label_cell=ranked_cells,
                                 label_corr=format(round(ranked_correlations, digits=4)))

gg <- ggplot2::ggplot(df,ggplot2::aes(x=x, y=y, group=1)) + 
  ggplot2::geom_line(ggplot2::aes(x=x, y=y)) + 
  ggplot2::geom_point(ggplot2::aes(x=x, y=y)) + 
  ggplot2::theme_bw() + 
  ggplot2::theme(axis.text.x=ggplot2::element_blank()) +
  ggplot2::ylim(0, 1.05) +
  ggplot2::ylab(expression(paste("Correlation (",W^0,",",widetilde(W),")"))) +
  ggplot2::geom_text(data=df, ggplot2::aes(x=label_x1, y=label_y, label=label_corr), size=4, hjust=0) +
  ggplot2::geom_text(data=df, ggplot2::aes(x=label_x2, y=label_y, label=label_cell), size=4, hjust=0)
plot(gg)
```

All mapped cell types have greater than 0.75 correlation with the component they are matched with.

Since this is synthetic data, true cell type proportions are known. We can visualize the RMSE between the true and the estimated cell-type proportions as follows:

```{r, visualize_rmse}
H_true  = vignetteSimulationData$sc_ref_h
H_est   = H_annotated
corrH   = sort(diag(stats::cor(H_true,H_est)), decreasing=TRUE, na.last=TRUE)
df      = data.frame(x=seq(0,1,length.out = 1000), 
                     y=corrH)
df_text = data.frame(x=0.2,
                     y=0.6,
                     label = c(paste("RETROFIT:", round(DescTools::AUC(x=seq(0,1,length.out = 1000), y=corrH),digits=3))))

gg   <- ggplot2::ggplot(df, ggplot2::aes(x=x,y=y)) +
  ggplot2::geom_line() + 
  ggplot2::scale_color_manual('gray30') + 
  ggplot2::xlab("Normalized Rank") +
  ggplot2::ylab(expression(paste("Correlation (",H,",",widetilde(H),")"))) +
  ggplot2::theme_bw() +
  ggplot2::geom_text(data = df_text, ggplot2::aes(label = label)) 
plot(gg)
```

In case of real data, where ground truth is not available, we can do certain diagnostics on the quality of results. For example, compute the correlation between marker expression (if known) and cell type proportion in ST data and visualize the agreement of the spatial pattern between markers and estimated proportion (read the paper for more information).

# Session information

```{r}
sessionInfo()
```