---
title: "A quick start guide to the standR package"
author: "Ning Liu, Dharmesh Bhuva, Ahmed Mohamed, Chin Wee Tan, Melissa Davis"
date: "`r Sys.Date()`"
output: 
  html_document:
    toc: true
    number_sections: true
    theme: cosmo
    highlight: tango
    code_folding: show
vignette: >
  %\VignetteIndexEntry{standR_introduction}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

<style type="text/css">
  body{
  font-size: 14pt;
}
</style>

```{r message=FALSE, warning=FALSE, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
```



# Installation

```{r eval=FALSE}
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("standR")
```


The development version of `standR` can be installed from GitHub:

```{r eval=FALSE}
devtools::install_github("DavisLaboratory/standR")
```


# Quick start


```{r message = FALSE, warning = FALSE}
library(standR)
library(SpatialExperiment)
library(limma)
library(ExperimentHub)
```


## Load data for this guide

This is the background for the data:

NanoString GeoMx DSP dataset of diabetic kidney disease (DKD) vs healthy kidney 
tissue.  **Seven slides** were analyzed, **4 DKD and 3 healthy**. Regions of 
Interest (ROI) were focused two different parts of a kidneyâ€™s structure: 
**tubules or glomeruli**. Individual glomeruli were identified by a pathologist 
as either **relatively healthy or diseased** regardless if the tissue was DKD 
or healthy. 
Tubule ROIs were segmented into **distal (PanCK) and proximal (neg) tubules**. 
While both distal and proximal tubules are called tubules, they perform very 
different functions in the kidney.


```{r message = FALSE, warning = FALSE}
eh <- ExperimentHub()

query(eh, "standR")

countFile <- eh[["EH7364"]]
sampleAnnoFile <- eh[["EH7365"]]
featureAnnoFile <- eh[["EH7366"]]

spe <- readGeoMx(countFile, sampleAnnoFile, featureAnnoFile = featureAnnoFile, rmNegProbe = TRUE)

```


## QC

### metadata visualization 

Based on the description of the data, we know that all glomerulus are classified 
as abnormal and healthy, and tubule are classified as neg and PanCK. 

We therefore merge the region-related annotations to avoid collinearity, which 
can affect the process of batch correction.


```{r}
colData(spe)$regions <- paste0(colData(spe)$region,"_",colData(spe)$SegmentLabel) |> 
  (\(.) gsub("_Geometric Segment","",.))() |>
  paste0("_",colData(spe)$pathology) |>
  (\(.) gsub("_NA","_ns",.))()

library(ggalluvial)

plotSampleInfo(spe, column2plot = c("SlideName","disease_status","regions"))
```


### Gene level QC



```{r}
spe <- addPerROIQC(spe, rm_genes = TRUE)
```



```{r}
plotGeneQC(spe, ordannots = "regions", col = regions, point_size = 2)
```

Using the `plotGeneQC` function, we can have a look at which were the genes 
removed and the overall distribution of percentage of non-expressed genes in 
all ROIs. By default, top 9 genes are plotted here (arranging by mean 
expression), user can increase the number of plotted genes by changing 
the `top_n` parameter.


In this case we don't see any specific biological pattern for the samples 
expressing this genes (figure above).


### ROI level QC

In the ROI level QC, we first aim to identify (if any) ROI(s) that have 
relatively low library size and low cell count because they are considered as 
low quality samples due to insufficient sequencing depth or lack of RNA in the 
chosen region. 

In this case, looking at the distribution plots of library size and nuclei 
count, we don't see any particular spike in the low ends, rather the 
distributions are relatively smooth. Looking at the dot plot, library sizes are 
mostly positively correlate with the nuclei count, with some data have 
relatively low library size while the nuclei count is reasonable. We therefore 
can try to draw an filtering threshold at the low end of the library size, in 
this case 50000. By coloring the dot with their slide names, we find that the 
ROIs below the threshold are all from slide disease1B, suggesting the reason 
for this might be some technical issues of slide disease1B.


```{r}
plotROIQC(spe, y_threshold = 50000, col = SlideName)
```

Since library size of 50000 seems to be a reasonable threshold, here we subset 
the spatial experiment object based on the library size in `colData`.

```{r}
spe <- spe[,rownames(colData(spe))[colData(spe)$lib_size > 50000]]
```


## Inspection of variations on ROI level

### RLE

Here we can see obvious variation from slides to slides, and small variations 
are also observed within each slide.

```{r}
plotRLExpr(spe, ordannots = "SlideName", assay = 2, col = SlideName)
```


### PCA


Here we color the PCA with slide information, and shape by regions (tissue). 
We can see that PC1 is mainly spread out by regions, especially glomerulus and 
tubule. And grouping based on slide within each tissue are observed. 
The subtypes in tubule are clearly separated, but different subtypes of 
glomerulus is still grouping together. Moreover, diseased tissues and control 
tissues are mixed as well (disease slides and normal slides).


```{r}
drawPCA(spe, assay = 2, col = SlideName, shape = regions)
```



# Data normalization

As we observed the technical variations in the data in both RLE and PCA plots. 
It is necessary to perform normalization in the data.

In the `standR` package, we offer normalization options including TMM, RPKM, 
TPM, CPM, upperquartile and sizefactor. Among them, RPKM and TPM required gene 
length information (add `genelength` column to the `rowData` of the object). 
For TMM, upperquartile and sizefactor, their normalized factor will be stored 
their `metadata`.

Here we used TMM to normalize the data.

```{r}
colData(spe)$biology <- paste0(colData(spe)$disease_status, "_", colData(spe)$regions)

spe_tmm <- geomxNorm(spe, method = "TMM")
```


# Batch correction

In the Nanostring's GeoMX DSP protocol, due to the fact that one slide is only 
big enough for a handful of tissue segments (ROIs), it is common that we see 
the DSP data being confounded by the batch effect introduced by different 
slides. In order to establish fair comparison between ROIs later on, it is 
necessary to remove this batch effect from the data.

To run RUV4 batch correction, we need to provide a list of "negative control 
genes (NCGs)".

The function `findNCGs` allows identifying the NCGs from the data. In this case, 
since the batch effect is mostly introduced by slide, we therefore want to 
identify NCGs across all slides, so here we set the `batch_name` to "SlideName", 
and select the top 500 least variable genes across different slides as NCGs. 


```{r}
spe <- findNCGs(spe, batch_name = "SlideName", top_n = 500)

metadata(spe) |> names()
```

Here we use k of 5 to perform RUV-4 normalization.

```{r}
spe_ruv <- geomxBatchCorrection(spe, factors = "biology", 
                   NCGs = metadata(spe)$NCGs, k = 5)
```


We can then inspect the PCA of the corrected data with annotations, to inspect 
the removal of batch effects, and the retaining of the biological factors.


```{r}
plotPairPCA(spe_ruv, assay = 2, color = disease_status, shape = regions, title = "RUV4")
```


Moreover, we can also have a look at the RLE plots of the normalized count.

```{r}
plotRLExpr(spe_ruv, assay = 2, color = SlideName) + ggtitle("RUV4")

```


**For more detailed analysis pipeline and usage of the standR package, please see https://github.com/DavisLaboratory/GeoMXAnalysisWorkflow**


# SessionInfo

```{r}
sessionInfo()
```