--- title: "A quick start guide to the standR package" author: "Ning Liu, Dharmesh Bhuva, Ahmed Mohamed, Chin Wee Tan, Melissa Davis" date: "`r Sys.Date()`" output: html_document: toc: true number_sections: true theme: cosmo highlight: tango code_folding: show vignette: > %\VignetteIndexEntry{standR_introduction} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r message=FALSE, warning=FALSE, include=FALSE} knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE) ``` # Installation ```{r eval=FALSE} if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("standR") ``` The development version of `standR` can be installed from GitHub: ```{r eval=FALSE} devtools::install_github("DavisLaboratory/standR") ``` # Quick start ```{r message = FALSE, warning = FALSE} library(standR) library(SpatialExperiment) library(limma) library(ExperimentHub) ``` ## Load data for this guide This is the background for the data: NanoString GeoMx DSP dataset of diabetic kidney disease (DKD) vs healthy kidney tissue. **Seven slides** were analyzed, **4 DKD and 3 healthy**. Regions of Interest (ROI) were focused two different parts of a kidney’s structure: **tubules or glomeruli**. Individual glomeruli were identified by a pathologist as either **relatively healthy or diseased** regardless if the tissue was DKD or healthy. Tubule ROIs were segmented into **distal (PanCK) and proximal (neg) tubules**. While both distal and proximal tubules are called tubules, they perform very different functions in the kidney. ```{r message = FALSE, warning = FALSE} eh <- ExperimentHub() query(eh, "standR") countFile <- eh[["EH7364"]] sampleAnnoFile <- eh[["EH7365"]] featureAnnoFile <- eh[["EH7366"]] spe <- readGeoMx(countFile, sampleAnnoFile, featureAnnoFile = featureAnnoFile, rmNegProbe = TRUE) ``` ## QC ### metadata visualization Based on the description of the data, we know that all glomerulus are classified as abnormal and healthy, and tubule are classified as neg and PanCK. We therefore merge the region-related annotations to avoid collinearity, which can affect the process of batch correction. ```{r} colData(spe)$regions <- paste0(colData(spe)$region,"_",colData(spe)$SegmentLabel) |> (\(.) gsub("_Geometric Segment","",.))() |> paste0("_",colData(spe)$pathology) |> (\(.) gsub("_NA","_ns",.))() library(ggalluvial) plotSampleInfo(spe, column2plot = c("SlideName","disease_status","regions")) ``` ### Gene level QC ```{r} spe <- addPerROIQC(spe, rm_genes = TRUE) ``` ```{r} plotGeneQC(spe, ordannots = "regions", col = regions, point_size = 2) ``` Using the `plotGeneQC` function, we can have a look at which were the genes removed and the overall distribution of percentage of non-expressed genes in all ROIs. By default, top 9 genes are plotted here (arranging by mean expression), user can increase the number of plotted genes by changing the `top_n` parameter. In this case we don't see any specific biological pattern for the samples expressing this genes (figure above). ### ROI level QC In the ROI level QC, we first aim to identify (if any) ROI(s) that have relatively low library size and low cell count because they are considered as low quality samples due to insufficient sequencing depth or lack of RNA in the chosen region. In this case, looking at the distribution plots of library size and nuclei count, we don't see any particular spike in the low ends, rather the distributions are relatively smooth. Looking at the dot plot, library sizes are mostly positively correlate with the nuclei count, with some data have relatively low library size while the nuclei count is reasonable. We therefore can try to draw an filtering threshold at the low end of the library size, in this case 50000. By coloring the dot with their slide names, we find that the ROIs below the threshold are all from slide disease1B, suggesting the reason for this might be some technical issues of slide disease1B. ```{r} plotROIQC(spe, y_threshold = 50000, col = SlideName) ``` Since library size of 50000 seems to be a reasonable threshold, here we subset the spatial experiment object based on the library size in `colData`. ```{r} spe <- spe[,rownames(colData(spe))[colData(spe)$lib_size > 50000]] ``` ## Inspection of variations on ROI level ### RLE Here we can see obvious variation from slides to slides, and small variations are also observed within each slide. ```{r} plotRLExpr(spe, ordannots = "SlideName", assay = 2, col = SlideName) ``` ### PCA Here we color the PCA with slide information, and shape by regions (tissue). We can see that PC1 is mainly spread out by regions, especially glomerulus and tubule. And grouping based on slide within each tissue are observed. The subtypes in tubule are clearly separated, but different subtypes of glomerulus is still grouping together. Moreover, diseased tissues and control tissues are mixed as well (disease slides and normal slides). ```{r} drawPCA(spe, assay = 2, col = SlideName, shape = regions) ``` # Data normalization As we observed the technical variations in the data in both RLE and PCA plots. It is necessary to perform normalization in the data. In the `standR` package, we offer normalization options including TMM, RPKM, TPM, CPM, upperquartile and sizefactor. Among them, RPKM and TPM required gene length information (add `genelength` column to the `rowData` of the object). For TMM, upperquartile and sizefactor, their normalized factor will be stored their `metadata`. Here we used TMM to normalize the data. ```{r} colData(spe)$biology <- paste0(colData(spe)$disease_status, "_", colData(spe)$regions) spe_tmm <- geomxNorm(spe, method = "TMM") ``` # Batch correction In the Nanostring's GeoMX DSP protocol, due to the fact that one slide is only big enough for a handful of tissue segments (ROIs), it is common that we see the DSP data being confounded by the batch effect introduced by different slides. In order to establish fair comparison between ROIs later on, it is necessary to remove this batch effect from the data. To run RUV4 batch correction, we need to provide a list of "negative control genes (NCGs)". The function `findNCGs` allows identifying the NCGs from the data. In this case, since the batch effect is mostly introduced by slide, we therefore want to identify NCGs across all slides, so here we set the `batch_name` to "SlideName", and select the top 500 least variable genes across different slides as NCGs. ```{r} spe <- findNCGs(spe, batch_name = "SlideName", top_n = 500) metadata(spe) |> names() ``` Here we use k of 5 to perform RUV-4 normalization. ```{r} spe_ruv <- geomxBatchCorrection(spe, factors = "biology", NCGs = metadata(spe)$NCGs, k = 5) ``` We can then inspect the PCA of the corrected data with annotations, to inspect the removal of batch effects, and the retaining of the biological factors. ```{r} plotPairPCA(spe_ruv, assay = 2, color = disease_status, shape = regions, title = "RUV4") ``` Moreover, we can also have a look at the RLE plots of the normalized count. ```{r} plotRLExpr(spe_ruv, assay = 2, color = SlideName) + ggtitle("RUV4") ``` **For more detailed analysis pipeline and usage of the standR package, please see https://github.com/DavisLaboratory/GeoMXAnalysisWorkflow** # SessionInfo ```{r} sessionInfo() ```