---
title: Starting work with scRepertoire.
author: 
  name: Nick Borcherding
  email: ncborch@gmail.com
  affiliation: Washington University in St. Louis, School of Medicine, St. Louis, MO, USA
date: "October 16, 2020"
output:
  BiocStyle::html_document:
    toc_float: true
package: scRepertoire
vignette: >
  %\VignetteIndexEntry{Using scRepertoire}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8} 
---

```{r, echo=FALSE, results="hide", message=FALSE}
knitr::opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)
library(BiocStyle)
```

# Introduction

scRepertoire is designed to take filter contig outputs from the 10x Genomics Cell Ranger pipeline, processes that data to assign clonotype based on two TCR or Ig chains, and analyze the clonotype dynamics. The latter can be separated into 1) clonotype-only analysis functions, such as unique clonotypes or clonal space quantification and 2) interaction with mRNA expression data using Seurat, SingleCellExperiment or Monocle 3 packages. 

## Loading Libraries

```{r}
suppressMessages(library(scRepertoire))
```

## Loading and Processing Contig Data

scRepertoire comes with a data set derived from T cells derived from three patients with renal clear cell carcinoma in order to demonstrate the functionality of the R package. More information on the data set can be found at [preprint 1](https://www.biorxiv.org/content/10.1101/478628v1.abstract) and [preprint 2](https://www.biorxiv.org/content/10.1101/824482v1.abstract). The samples consist of paired peripheral-blood and tumor-infiltrating runs, effectively creating 6 distinct runs for T cell receptor (TCR) enrichment. We can preview the elements in the list by using the head function and looking at the first contig annotation. Here notice the barcode is labeled as PX_P_############# - this refers to Patient X (PX) and Peripheral Blood (P).

The data, contig_list, is 6 filtered_contig.csv file outputs from Cell Ranger that made into a list.

```{r}
data("contig_list") #the data built into scRepertoire

head(contig_list[[1]])
```

Some workflows will have the additional labeling of the standard barcode. Before we proceed, we will use the function *stripBarcode()* in order to avoid any labeling issues down the line. Importantly, *stripBarcode()* is for *removing prefixes* on barcodes that have resulted from other pipelines. 

**No need for stripBarcode function**, if the barcodes look like:
+ AAACGGGAGATGGCGT-1
+ AAACGGGAGATGGCGT

In terms of using *stripBarcode()*, please think about the following parameters. 

*column*
+ The column in which the barcodes are present

*connector*
+ The character that is connecting the barcode with the prefix

*num_connects*
+ the levels of barcode prefix, where X_X_AAACGGGAGATGGCGT-1 == 3, X_AAACGGGAGATGGCGT-1 = 2.

```{r eval=FALSE}
for (i in seq_along(contig_list)) {
    contig_list[[i]] <- stripBarcode(contig_list[[i]], column = 1, connector = "_", num_connects = 3)
}
```

You can see now the barcode in column 1, we have removed the P#_#_ prefixes. 

# Combining the Contigs

As the output of CellRanger are quantifications of both the TCRA and TCRB chains, the next step is to create a single list object with the TCR gene and CDR3 sequences by cell barcode. This is performed using the *combineTCR()*, where the input is the stripped contig_list. There is also the relabeling of the barcodes by sample and ID information to prevent duplicates.

*cells*
+  T-AB - T cells, alpha-beta TCR
+  T-GD - T cells, gamma-delta TCR

*removeNA*
+  TRUE - this is a stringent filter to remove any cell barcode with an NA value in at least one of the chains
+  FALSE - the default setting to include and incorporate cells with 1 NA value

*removeMulti*
+  TRUE - this is a stringent filter to remove any cell barcode with more than 2 immune receptor chains
+  FALSE - the default setting to include and incorporate cells with > 2 chains

*filterMulti*
+  TRUE - Isolated the top 2 expressed chains in cell barcodes with multiple chains
+  FALSE - the default setting to include and incorporate cells with > 2 chains

```{r}
combined <- combineTCR(contig_list, 
                        samples = c("PY", "PY", "PX", "PX", "PZ","PZ"), 
                        ID = c("P", "T", "P", "T", "P", "T"), cells ="T-AB")
```

The output of *combineTCR()* will be a list of contig data frames that will be reduced to the reads associated with a single cell barcode. It will also combine the multiple reads into clonotype calls by either the nucleotide sequence (CTnt), amino acid sequence (CTaa), the gene sequence (CTgene) or the combination of the nucleotide and gene sequence (CTstrict). The analogous function for B cells, *combineBCR()* functions similarly with 2 major caveats: 1) Each barcode can only have a maximum of 2 sequences, if greater exists, the 2 with the highest reads are selected. 2) The strict definition of clonotype (CTstrict) is based on the v gene and >85% normalized hamming distance of the nucleotide sequence. The hamming distance is calculated across all BCR sequences recovered, regardless of the run.

# Other Processing Functions

## Adding Additional Variables

What if there are more variables to add than just sample and ID? We can add them by using the *addVariable()* function. All we need is the **name** of the variable you'd like to add and the specific character or numeric values (**variables**). As an example, here we add the batches in which the samples were processed and sequenced. 

```{r}
example <- addVariable(combined, name = "batch", 
                        variables = c("b1", "b1", "b2", "b2", "b2", "b2"))

example[[1]][1:5,ncol(example[[1]])] # This is showing the first 5 values of the new column added
```

## Subsetting Contigs

Likewise we can remove specific list elements after *combineTCR()* using the *subsetContig()* function. In order to subset, we need to identify the vector we would like to use for subsetting (**name**) and also the variable values to subset (**variables**). Below you can see us isolate just the 4 sequencing results from PX and PY. 

```{r}
subset <- subsetContig(combined, name = "sample", 
                        variables = c("PX", "PY"))
```

***

# Visualizing Contigs

*cloneCall*
+  "gene" - use the genes comprising the TCR/Ig
+  "nt" - use the nucleotide sequence of the CDR3 region
+  "aa" - use the amino acid sequence of the CDR3 region
+  "gene+nt" - use the genes comprising the TCR/Ig + the nucleotide sequence of the CDR3 region. This is the [proper definition of clonotype](https://www.ncbi.nlm.nih.gov/pubmed/19568742).

Important to note, that the clonotype is called using essentially the combination of genes or nt/aa CDR3 sequences for both loci. As of this implementation of scRepertoire, clonotype calling is not incorporating small variations within the CDR3 sequences. As such the *gene* approach will be the most sensitive, while the use of *nt* or *aa* moderately so, and the most specific for clonotypes being *gene+nt*. Additionally, the clonotype call is trying to incorporate both loci, *i.e*, both *TCRA* and *TCRB* chains and if a single cell barcode has multiple sequences identified (*i.e.*, 2 TCRA chains expressed in one cell). Using the 10x approach, there is a subset of barcodes that only return one of the immune receptor chains, the unreturned chain is assigned an *NA* value.

The first function to explore the clonotypes is *quantContig()* to return the total or relative numbers of unique clonotypes. 
*scale*
+  TRUE - relative percent of unique clonotypes scaled by total size of the size of the clonotype repertoire
+  FALSE - Report the total number of unique clonotypes

```{r Figure2A_1}
quantContig(combined, cloneCall="gene+nt", scale = TRUE)
```

Within each of the general analysis functions, there is the ability to export the data frame used to create the visualization. To get the exported values, use **exportTable** = TRUE. It will return the data frame used to make the graph, instead of the visual output.  

```{r}
quantContig_output <- quantContig(combined, cloneCall="gene+nt", 
                                    scale = TRUE, exportTable = TRUE)
quantContig_output
```

The other option here is to be able to define the visualization by data classes. Here we used the *combineTCR()* to define the **ID** variable as part of the naming structure. We can the **group** to specifically use a column in the data set to organize the visualization.

```{r Figure2A_2}
quantContig(combined, cloneCall="gene", group = "ID", scale = TRUE)
```

We can also examine the relative distribution of clonotypes by abundance. Here *abundanceContig()* will produce a line graph with a total number of clonotypes by the number of instances within the sample or run. Like above, we can also group this by vectors within the contig object using the **group** variable in the function

```{r Figure2B}
abundanceContig(combined, cloneCall = "gene", scale = FALSE)
abundanceContig(combined, cloneCall = "gene", group = "ID", scale = FALSE)
```

As you can see the peripheral blood sample derived from patient 1 is a relative extreme outlier. Another method to examine the relative abundance is to look at the density by using the **scale** call in the function.

```{r Figure2C}
abundanceContig(combined, group = "ID", scale = TRUE)
```

Lastly on the basic visualization side, we can look at the length distribution of the CDR3 sequences by calling the *lengtheContig()* function. Importantly, unlike the other basic visualizations, the **cloneCall** can only be "nt" or "aa". Due to the method of calling clonotypes as outlined above, the length should reveal a multimodal curve, this is a product of using the **NA** for the unreturned chain sequence and multiple chains within a single barcode. 

```{r Figure2D}
lengthContig(combined, cloneCall="aa", chains = "combined") 
```

Or we can visualize the individual chains of the immune receptors by selecting **chains** = "single". Notably this will remove the NA component of combined clonotypes, so visualize is only the sequences recovered in the filtered contig annotation file from Cell Ranger. 

```{r}
lengthContig(combined, cloneCall="nt", chains = "single") 
```

We can also look at clonotypes between samples and changes in dynamics by using the *compareClonotypes()* function.

*samples*
+  Can be used to isolate specific samples based on the name of the list element

*graph*
+  "alluvial" - graph imaged below
+  "area" - graph by area of the respective clonotype

*number*
+  The top number of clonotypes to graph, this will be calculated based on the frequency of the individual sample. This can also be left blank.

*clonotypes*
+  Can be used to isolate specific clonotype sequences, ensure the call matches the sequences you would like to visualize.

```{r}
compareClonotypes(combined, numbers = 10, samples = c("PX_P", "PX_T"), 
                    cloneCall="aa", graph = "alluvial")
```

Last of the basic analysis visualizations is the relative usage of vgenes of the TCR, using *vizVgenes()*.

*TCR*
+  For T-AB: TCR1 == TCRA and TCR2 == TCRB
+  For T-GD: TCR1 == TCRG and TCR2 == TCRD

*facet.x and facet.y*
+  Any variable within the combined() object

*fill*
+  Color the bars by variable

```{r}
vizVgenes(combined, TCR="TCR1", facet.x = "sample", facet.y = "ID")
```

***

# More Advanced Clonal Analysis

After we have completed the basic processing and summary functions in scRepertoire, we can begin to explore the clonotypes of the single-cell data in more detail. 

## Clonal Space Homeostasis

By examining the clonal space, we are effectively looking at the relative space occupied by clones at specific proportions. Another way to think about this would be thinking of the total immune receptor sequencing run as a measuring cup. In this cup, we will fill liquids of different viscosity - or different number of clonal proportions. Clonal space homeostasis is asking what percentage of the cup is filled by clones in distinct proportions (or liquids of different viscosity, to extend the analogy). The proportional cut points are set under the **cloneType** variable in the function and can be adjusted, at baseline the bins are as follows:

*cloneTypes*
+  Rare = .0001
+  Small = .001
+  Medium = .01
+  Large = .1
+  Hyperexpanded = 1

```{r Figure3A}
clonalHomeostasis(combined, cloneCall = "gene")
clonalHomeostasis(combined, cloneCall = "aa")
```


## Clonal Proportion

Like clonal space homeostasis above, clonal proportion acts to place clones into separate bins. The key difference is instead of looking at the relative proportion of the clone to the total, the *clonalProportion()* function will rank the clones by total number and place them into bins. 

The **split** represents ranking of clonotypes by copy or frequency of occurrence, meaning 1:10 are the top 10 clonotypes in each sample. The default bins are under the **split** variable in the function and can be adjusted, but at baseline they are as follows. 

*split*
+  10
+  100
+  1000 
+  10000
+  30000
+  100000

```{r Figure3B}
clonalProportion(combined, cloneCall = "gene") 
clonalProportion(combined, cloneCall = "nt") 
```


## Overlap Analysis

If you are interested in measures of similarity between the samples loaded into scRepertoire, using *clonalOverlap()* can assist in the visualization. Two methods currently can be performed in *clonalOverlap()* 1) overlap coefficient and 2) Morisita index. The former is looking at the overlap of clonotypes scaled to the length of unique clonotypes in the smaller sample. The Morisita index is more complex, it is an ecological measure of the dispersion of individuals within a population, incorporating the size of the population.

```{r Figure3C}
clonalOverlap(combined, cloneCall = "gene+nt", method = "morisita")
```

Another recent addition to scRepertoire is the ability to cluster the samples by the clone size distribution using *clonesizeDistribution()* adapted from the [powerTCR](https://bioconductor.org/packages/release/bioc/html/powerTCR.html) R package. Please read and cite the respective [citation](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006571) if using this function to analyze the similarities of sample clone size distributions. In this function, method refers to the method for which the hierarchical clustering will be based on. 

```{r}
clonesizeDistribution(combined, cloneCall = "gene+nt", 
                        method="ward.D2")
```

## Diversity Analysis

Diversity can also be measured for samples or by other variables. Diversity is calculated using four metrics: 1) Shannon, 2) inverse Simpson, 3) Chao1, and 4) Abundance-based Coverage Estimator (ACE). With the former two generally used to estimate baseline diversity and Chao/ACE indices used to estimate the richness of the samples.

```{r Figure3D}
clonalDiversity(combined, cloneCall = "gene", group = "samples")
clonalDiversity(combined, cloneCall = "gene", group = "ID")
```

***

# Interacting with mRNA Expression

As mentioned previously, this data set is derived from work performed in the laboratory of [Weizhou Zhang]("https://pathology.ufl.edu/faculty/experimental-pathology/weizhou-zhang-ph-d/"). For the purposes of the vignette, we have randomly sampled 100 cells from the the fully integrated Seurat object to minimize the size of the package. A full version of the Seurat object is available via [GitHub](https://github.com/ncborcherding/scRepertoire) under the *Getting Data* header. We will use both Seurat and SingleCellExperiment (SCE) with scater to perform the further visualizations in tandem. 

```{r Figure4A}
library(SingleCellExperiment)
library(Seurat)
library(scater)
screp_example <- get(data("screp_example"))
sce <- suppressMessages(UpdateSeuratObject(screp_example))
sce <- as.SingleCellExperiment(screp_example)

#Seurat Format
DimPlot(screp_example)

##Single Cell Experiment Format
plotUMAP(sce, colour_by = "seurat_clusters")
```

Here you can see we have 12 total clusters (C1-12), which we have labeled as such for simplicity. We can also get a little more granular information on the number of cells by using the *table()* function. 

```{r}
table(screp_example$seurat_clusters)
```


Next we can take the clonotypic information and attach it to our Seurat object using the *combineExpression()* function. **Importantly**, the major requirement for the attachment is matching contig cell barcodes and barcodes in the row names of the meta data of the Seurat or SCE object. If these do not match, the attachment will fail. Based on ease, we suggest you make the changes to the Seurat object row names.

We can call (cloneCall) the 4 variations of clonotypes: 1) genes, 2) CDR3 amino acid sequence, 3) CDR3 nucleotide sequence, or 4) genes and CDR3 nucleotide sequence. The attaching function will also calculate the frequency of the clonotype based on the **groupBy** variable. If blank, **groupBy** will calculate frequencies of clonotypes by individual run, but because we have 6 samples of paired peripheral and tumor T cells, we are actually going to use the groupBy variable to call "sample" in order to calculate frequencies across both the peripheral blood and tumor T cells of the same patient.

Lastly, in order to categorize the frequency, we have the variable **cloneTypes** which acts as a bin to place labels. As a default, **cloneTypes** is set to equal c(Single = 1, Small = 5, Medium = 20, Large = 100, Hyperexpanded = 500). This is because the highest repeated clonotype is in Patient 3 with just under 500 clones. If your data has a clone with greater expansion, you should readjust the cut points. 

```{r}
screp_example <- combineExpression(combined, screp_example, cloneCall="gene", groupBy = "sample")

sce <- combineExpression(combined, sce, cloneCall = "gene", groupBy = "sample")
```

We first want to look at the distribution of peripheral versus tumor T cells. We can use the same color scheme as the rest of the scRepertoire package by calling the object **colorblind_vector** using the following hex codes. 

```{r Figure4B_1}
colorblind_vector <- colorRampPalette(c("#FF4B20", "#FFB433", "#C6FDEC", "#7AC5FF", "#0348A6"))

DimPlot(screp_example, group.by = "Type") + NoLegend() +
    scale_color_manual(values=colorblind_vector(2))
```

We can also look at the composition of each cluster by comparing the proportion of the cluster comprising peripheral blood versus tumor T cells. We can do this by first forming a table of the cluster and type of cells, then scaling the rows of the table by the total number of cells sequenced. 

```{r Figure4B_2}
table <- table(screp_example$Type, Idents(screp_example))
table[1,] <- table[1,]/sum(table[1,]) #Scaling by the total number of peripheral T cells
table[2,] <- table[2,]/sum(table[2,]) #Scaling by the total number of tumor T cells
table <- as.data.frame(table)
table$Var2 <- factor(table$Var2, 
                    levels = c("C1", "C2", "C3", "C4", "C5", "C6", 
                               "C7", "C8", "C9", "C10", "C11", "C12"))

ggplot(table, aes(x=Var2, y=Freq, fill=Var1)) + 
  geom_bar(stat="identity", position="fill", color="black", lwd=0.25) + 
  theme(axis.title.x = element_blank()) + 
scale_fill_manual(values = c("#FF4B20","#0348A6")) + 
  theme_classic() + 
    theme(axis.title = element_blank()) + 
    guides(fill=FALSE)
```

Now we can look at the distribution of the clonotype bins by first ordering the **clonoType** as a factor, this prevents the coloring from being in alphabetical order. Next we use the *DimPlot()* function call in Seurat with our scale_color_manual additional layer. 

```{r Figure4C}
slot(screp_example, "meta.data")$cloneType <- factor(slot(screp_example, "meta.data")$cloneType, 
                levels = c("Hyperexpanded (100 < X <= 500)", "Large (20 < X <= 100)", 
                            "Medium (5 < X <= 20)", "Small (1 < X <= 5)", 
                            "Single (0 < X <= 1)", NA))

DimPlot(screp_example, group.by = "cloneType") +
    scale_color_manual(values = c(rev(colorblind_vector(5))), na.value="grey")

plotUMAP(sce, colour_by = "cloneType")
```


We can also look at the clonotypes by calling specific sequences in the *highlightClonotypes()* below. In order to highlight the clonotypes, we first need to use the **cloneCall** the type of sequence we will be using and then the specific sequences themselves using **sequence**. Below you can see the steps to highlight the two most prominent sequences *"CAVNGGSQGNLIF_CSAEREDTDTQYF"* with a frequency = 482 (clonotype 1) and *"NA_CATSATLRVVAEKLFF"* with a frequency = 287 (Clonotype2). 

```{r Figure4E}
seurat <- highlightClonotypes(screp_example, cloneCall= "aa", 
                sequence = c("CAVNGGSQGNLIF_CSAEREDTDTQYF", "NA_CATSATLRVVAEKLFF"))
DimPlot(seurat, group.by = "highlight")


```

We can also look at count of cells by cluster assigned into specific frequency ranges, by using the *occupiedscRepertoire()* function and selecting the **x.axis** to display cluster or other variables in the meta data of the single cell object. 

```{r}
occupiedscRepertoire(sce, x.axis = "cluster")
```


After all the metadata has been modified, we can look at clonotypes across multiple categories using the *alluvialClonotypes()* function. To understand the basic concepts of this graphing method, I'd highly recommend reading [this post](https://cran.r-project.org/web/packages/ggalluvial/vignettes/ggalluvial.html), essentially we are able to use the plots to examine the interchange of categorical variables. Because this function will produce a graph with each clonotype arranged by called stratifications, this will take some time depending on the size of your total cells. To expedite, we will actually subset the Seurat object before using *alluvialClonotypes()*.

```{r Figure4F}

alluvialClonotypes(screp_example, cloneCall = "gene", 
                   y.axes = c("Patient", "cluster", "Type"), 
                   color = "TRAV12-2.TRAJ42.TRAC_TRBV20-1.TRBJ2-3.TRBD2.TRBC2") + 
    scale_fill_manual(values = c("grey", colorblind_vector(1)))

alluvialClonotypes(screp_example, cloneCall = "gene", 
                   y.axes = c("Patient", "cluster", "Type"), 
                   color = "cluster") 

alluvialClonotypes(sce, cloneCall = "gene", 
                   y.axes = c("Patient", "seurat_clusters", "Type"), 
                   color = "TRAV12-2.TRAJ42.TRAC_TRBV20-1.TRBJ2-3.TRBD2.TRBC2") + 
    scale_fill_manual(values = c("grey", colorblind_vector(1)))

alluvialClonotypes(sce, cloneCall = "gene", 
                   y.axes = c("Patient", "seurat_clusters", "Type"), 
                   color = "seurat_clusters") 
```

Like alluvial graphs, we can also visualize the interconnection of clusters using the chord diagrams from the circlize R package. The first step is getting the data frame output to feed into the *chordDiagram()* function in circlize, which can be done using *getCirclize()*. This will calculate the relative number of clonotypes shared based on the **groupBy** variable using the product of *combineExpression()*. 

```{r}
library(circlize)
library(scales)

circles <- getCirclize(screp_example, groupBy = "cluster")

#Just assigning the normal colors to each cluster
grid.cols <- hue_pal()(length(unique(seurat@active.ident)))
names(grid.cols) <- levels(seurat@active.ident)

#Graphing the chord diagram
chordDiagram(circles, self.link = 1, grid.col = grid.cols)
```


For users that would like greater ability to use the meta data in the Seurat objects to perform the analysis that scRepertoire provides, there is also the option of using the *expression2List()* function that will take the meta data and output the data as a list by cluster.

```{r}
combined2 <- expression2List(screp_example, group = "cluster")
combined3 <- expression2List(sce, group = "cluster")
```

## 1. Clonal Diversity
```{r}
clonalDiversity(combined2, cloneCall = "nt")
clonalDiversity(combined3, cloneCall = "nt")
```

## 2. Clonal Homeostasis
```{r}
clonalHomeostasis(combined2, cloneCall = "nt")
clonalHomeostasis(combined3, cloneCall = "nt")
```

## 3. Clonal Proportion
```{r}
clonalProportion(combined2, cloneCall = "nt")
clonalProportion(combined3, cloneCall = "nt")
```

## 4. Clonal Overlap
```{r}
clonalOverlap(combined2, cloneCall="aa", method="overlap")
clonalOverlap(combined3, cloneCall="aa", method="overlap")
```

***

# Conclusion

If you have any questions, comments or suggestions, feel free to visit the github repository or [email me](mailto:ncborch@gmail.com).

```{r}
sessionInfo()
```