---
title: Starting work with scRepertoire.
author: 
  name: Nick Borcherding
  email: ncborch@gmail.com
  affiliation: Washington University in St. Louis, School of Medicine, St. Louis, MO, USA
date: "November 17th, 2023"
output:
  BiocStyle::html_document:
    toc_float: true
package: scRepertoire
vignette: >
  %\VignetteIndexEntry{Using scRepertoire}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8} 
---

```{r, echo=FALSE, results="hide", message=FALSE}
knitr::opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)
library(BiocStyle)
library(scater)
```

# Introduction

scRepertoire is designed to take filter contig outputs from the 10x Genomics Cell Ranger pipeline, process that data to assign clonotype based on two TCR or Ig chains and analyze the clonotype dynamics. The latter can be separated into 1) clonotype-only analysis functions, such as unique clonotypes or clonal space quantification, and 2) interaction with mRNA expression data using Seurat or SingleCellExperiment packages.

## Loading Libraries

```{r}
suppressMessages(library(scRepertoire))
```

## Loading Data

### What data to load into scRepertoire?

scRepertoire functions using the **filtered_contig_annotations.csv** output from the 10x Genomics Cell Ranger. This file is located in the ./outs/ directory of the VDJ alignment folder. To generate a list of contigs to use for scRepertoire:  

* load the **filtered_contig_annotations.csv** for each of the samples.  
* make a list in the R environment.  

```{r, eval=FALSE}
S1 <- read.csv(".../Sample1/outs/filtered_contig_annotations.csv")
S2 <- read.csv(".../Sample2/outs/filtered_contig_annotations.csv")
S3 <- read.csv(".../Sample3/outs/filtered_contig_annotations.csv")
S4 <- read.csv(".../Sample4/outs/filtered_contig_annotations.csv")

contig_list <- list(S1, S2, S3, S4)
```

### Other alignment workflows

Beyond the default 10x Genomic Cell Ranger pipeline outputs, scRepertoire supports the following **single-cell** formats: 

* AIRR  
* BD Rhapsody Multiomic Immune Profiling  
* [Immcantation](https://immcantation.readthedocs.io/en/stable/index.html)  
* JSON-formatted contig data  
* [MiXCR](https://pubmed.ncbi.nlm.nih.gov/25924071/)  
* [Omniscope](https://www.omniscope.ai/)  
* [TRUST4](https://pubmed.ncbi.nlm.nih.gov/33986545/)  
* [WAT3R](https://pubmed.ncbi.nlm.nih.gov/35674381/)  

```loadContigs()``` can be given a directory where the sequencing experiments are located and it will recursively load and process the contig data based on the file names. Alternatively, ```loadContigs()``` can be given a list of data frames and process the contig data


```{r, eval=FALSE, tidy = FALSE}
#Directory example
contig.output <- c("~/Documents/MyExperiment")
contig.list <- loadContigs(input = contig.output, 
                           format = "TRUST4")

#List of data frames example
S1 <- read.csv("~/Documents/MyExperiment/Sample1/outs/barcode_results.csv")
S2 <- read.csv("~/Documents/MyExperiment/Sample2/outs/barcode_results.csv")
S3 <- read.csv("~/Documents/MyExperiment/Sample3/outs/barcode_results.csv")
S4 <- read.csv("~/Documents/MyExperiment/Sample4/outs/barcode_results.csv")

contig_list <- list(S1, S2, S3, S4)
contig.list <- loadContigs(input = contig.output, 
                           format = "WAT3R")
```

### Multiplexed Experiment

It is now easy to create the contig list from a multiplexed experiment by first generating a single-cell RNA object (either Seurat or Single Cell Experiment), loading the filtered contig file and then using `createHTOContigList()`. This function will return a list separated by the **group.by** variable(s). 

This function depends on the match of barcodes between the single-cell object and contigs. If there is a prefix or different suffix added to the barcode, this will result in no contigs recovered. Currently, it is recommended you do this step before the integration, as integration workflows commonly alter the barcodes. There is a **multi.run** variable that can be used on the integrated object. However, it assumes you have modified the barcodes with the Seurat pipeline (automatic addition of _# to end), and your contig list is in the same order.

```{r, eval = F, tidy = FALSE}
contigs <- read.csv(".../outs/filtered_contig_annotations.csv")

contig.list <- createHTOContigList(contigs, 
                                   Seurat.Obj, 
                                   group.by = "HTO_maxID")
```

## Example Data in scRepertoire

scRepertoire comes with a data set from T cells derived from four patients with acute respiratory distress to demonstrate the functionality of the R package. More information on the data set can be found in the corresponding [manuscript](https://pubmed.ncbi.nlm.nih.gov/33622974/). The samples consist of paired peripheral-blood (B) and bronchoalveolar lavage (L), effectively creating 8 distinct runs for T cell receptor (TCR) enrichment. We can preview the elements in the list by using the head function and looking at the first contig annotation. 

The built-in example data is derived from the 10x Cell Ranger pipeline, so it is ready to go for downstream processing and analysis. 

```{r tidy = FALSE}
data("contig_list") #the data built into scRepertoire

head(contig_list[[1]])
```

***

# Combining Contigs into Clones

## combineTCR

**input.data**  

* List of *filtered_contig_annotations.csv* data frames from the 10x Cell Ranger.  
* List of data processed using `loadContigs()`.  

**samples** and **ID**  

* Grouping variables for downstream analysis and will be added as prefixes to 
prevent issues with duplicate barcodes **(optional)**.  

**removeNA**  

* TRUE - Filter to remove any cell barcode with an NA value in at least one of the chains.  
* FALSE - Include and incorporate cells with 1 NA value **(default)**.   

**removeMulti**  

* TRUE - Filter to remove any cell barcode with more than 2 immune receptor chains.  
* FALSE - Include and incorporate cells with > 2 chains **(default)**.  

**filterMulti**  

* TRUE - Isolate the top 2 expressed chains in cell barcodes with multiple chains.  
* FALSE - Include and incorporate cells with > 2 chains **(default)**.  

The output of `combineTCR()` will be a list of contig data frames that will be reduced to the reads associated with a single cell barcode. It will also combine the multiple reads into clone calls by either the nucleotide sequence (**CTnt**), amino acid sequence (**CTaa**), the VDJC gene sequence (**CTgene**), or the combination of the nucleotide and gene sequence (**CTstrict**).

```{r tidy = FALSE}
combined.TCR <- combineTCR(contig_list, 
                           samples = c("P17B", "P17L", "P18B", "P18L", 
                                            "P19B","P19L", "P20B", "P20L"),
                           removeNA = FALSE, 
                           removeMulti = FALSE, 
                           filterMulti = FALSE)

head(combined.TCR[[1]])
```

## combineBCR

`combineBCR()` is analogous to `combineTCR()` with 2 major changes: **1)** Each barcode can only have a maximum of 2 sequences, if greater exists, the 2 with the highest reads are selected; **2)** The **strict** definition of a clone is based on the normalized Levenshtein edit distance of CDR3 nucleotide sequences and V-gene usage. For more information on this approach, please see the respective [citation](https://pubmed.ncbi.nlm.nih.gov/34161770/). This definition allows for the grouping of BCRs derived from the same progenitor that have undergone mutation as part of somatic hypermutation and affinity maturation.

**threshold**  
The level of similarity in sequences to group together. **Default** is 0.85.

\[
\text{threshold}(s, t) = 1-\frac{\text{Levenshtein}(s, t)}{\frac{\text{length}(s) + \text{length}(t)}{2}}
\]

**call.related.clones**  
Calculate the normalized edit distance (**TRUE**) or skip the calculation (**FALSE**). Skipping the edit distance calculation may save time, especially in the context of large data sets, but is not recommended. 


```{r tidy = FALSE}
BCR.contigs <- read.csv("https://www.borch.dev/uploads/contigs/b_contigs.csv")
combined.BCR <- combineBCR(BCR.contigs, 
                           samples = "P1", 
                           threshold = 0.85)

head(combined.BCR[[1]])
```

***

# Additional Processing

## addVariable

What if there are more variables to add than just sample and ID? We can add them by using the `addVariable()` function. All we need is the **variable.name** of the variable you'd like to add and the specific character or numeric values (**variables**). As an example, here we add the **Type** in which the samples were processed and sequenced. 

```{r tidy = FALSE}
combined.TCR <- addVariable(combined.TCR, 
                            variable.name = "Type", 
                            variables = rep(c("B", "L"), 4))

head(combined.TCR[[1]])
```

## subsetClones

Likewise, we can remove specific list elements after `combineTCR()` using the `subsetClones()` function. In order to subset, we need to identify the vector we would like to use for subsetting (**name**) and the variable values to subset (**variables**). Below, we isolate just the 2 sequencing results from P18L and P18B. 

```{r, tidy = FALSE}
subset1 <- subsetClones(combined.TCR, 
                        name = "sample", 
                        variables = c("P18L", "P18B"))

head(subset1[[1]])
```

Alternatively, we can also just select the list elements after `combineTCR()` or `combineBCR()`.

```{r}
subset2 <- combined.TCR[c(3,4)]
head(subset2[[1]])
```

## exportClones  

After assigning the clone by barcode, we can export the paired clonotypes using `exportClones()` to save for later use or to use in other pipelines. 

**format**  

* "paired" - Export the paired sequences **(default)**.  
* "airr" - Export data in an AIRR-compliant format.  
* "TCRMatch" - Export TCRB chain information. 
    
**write.file**    

* TRUE, save the file.  
* FALSE, return a data.frame.  

**dir**  
directory location to save the csv  

**file.name**  
the csv file name  

```{r, eval = FALSE, tidy = FALSE}
exportClones(combined, 
             write.file = TRUE,
             dir = "~/Documents/MyExperiment/Sample1/"
             file.name = "clones.csv"
```
 
***

# Basic Clonal Visualizations

**cloneCall**  
* "gene" - use the VDJC genes comprising the TCR/Ig  
* "nt" - use the nucleotide sequence of the CDR3 region  
* "aa" - use the amino acid sequence of the CDR3 region  
* "strict" - use the VDJC genes comprising the TCR + the nucleotide sequence of the CDR3 region. This is the [proper definition of clonotype](https://www.ncbi.nlm.nih.gov/pubmed/19568742). For ```combineBCR()``` strict refers to the edit distance clusters + Vgene of the Ig. 

It is important to note that the clonotype is called using essentially the combination of genes or nt/aa CDR3 sequences for both loci. As of this implementation of scRepertoire, clonotype calling is not incorporating small variations within the CDR3 sequences. As such the **gene** approach will be the most sensitive, while the use of **nt** or **aa** is moderately so, and the most specific for clonotypes being **strict**. Additionally, the clonotype call is trying to incorporate both loci, *i.e.*, both **TCRA** and **TCRB** chains and if a single cell barcode has multiple sequences identified (*i.e.*, 2 TCRA chains expressed in one cell). Using the 10x approach, there is a subset of barcodes that only return one of the immune receptor chains. The unreturned chain is assigned an **NA** value.

## clonalQuant

The first function to explore the clones is `clonalQuant()` to return the total or relative numbers of unique clones. 

**scale**  

* TRUE - relative percent of unique clones scaled by the total size of the clonal repertoire.
* FALSE - Report the total number of unique clones **(default)**.

**chain**  

+ "both" for combined chain visualization  
+ "TRA", "TRB", "TRD", "TRG", "IGH" or "IGL" to select single chain

```{r tidy = FALSE}
clonalQuant(combined.TCR, 
            cloneCall="strict", 
            chain = "both", 
            scale = TRUE)
```

Another option here is to be able to define the visualization by data classes. Here, we used the `combineTCR()` list to define the **Type** variable as part of the naming structure. We can use the **group.by** to specifically use a column in the data set to organize the visualization.

```{r}
clonalQuant(combined.TCR, 
            cloneCall="gene", 
            group.by = "Type", 
            scale = TRUE)
```

## clonalAbundance

We can also examine the relative distribution of clones by abundance. Here `clonalAbundance()` will produce a line graph with a total number of clones by the number of instances within the sample or run. Like above, we can also group.by this by vectors within the contig object using the **group.by** variable in the function.

```{r tidy = FALSE}
clonalAbundance(combined.TCR, 
                cloneCall = "gene", 
                scale = FALSE)
```

`clonalAbundance()` output can also be converted into a density plot, which may allow for better comparisons between different repertoire sizes, by setting **scale** = TRUE. 
```{r}
clonalAbundance(combined.TCR, 
                cloneCall = "gene", 
                scale = TRUE)
```


## clonalLength

We can look at the length distribution of the CDR3 sequences by calling the `lengtheContig()` function. Importantly, unlike the other basic visualizations, the **cloneCall** can only be **"nt"** or **"aa"**. Due to the method of calling clones as outlined above, the length should reveal a multimodal curve, this is a product of using the **NA** for the unreturned chain sequence and multiple chains within a single barcode. 

**chain**  

+ "both" for combined chain visualization  
+ "TRA", "TRB", "TRD", "TRG", "IGH" or "IGL" to select single chain  

```{r tidy = FALSE}
clonalLength(combined.TCR, 
             cloneCall="aa", 
             chain = "both") 

clonalLength(combined.TCR, 
             cloneCall="aa", 
             chain = "TRA", 
             scale = TRUE) 
```

## clonalCompare

We can also look at clones between samples and changes in dynamics by using the `clonalCompare()` function.

**samples**  

* Can be used to isolate specific samples based on the name of the list element

**graph**  

* "alluvial" - graph imaged below  
* "area" - graph by area of the respective clone  

**top.clones**  

* The top number of clones to graph, this will be calculated based on the frequency of the individual sample. This can also be left blank.  

**clones**  

* Can be used to isolate specific clone sequences, ensure the call matches the sequences you would like to visualize.  

**highlight.clones**  

* Specifically color certain clones, other clones will be greyed out.

**relabel.clones**  

* Simplify the isolated clones to numerical designations to allow for a tidier visualization

```{r tidy = FALSE}
clonalCompare(combined.TCR, 
                  top.clones = 10, 
                  samples = c("P17B", "P17L"), 
                  cloneCall="aa", 
                  graph = "alluvial")
```

We can also choose to highlight specific clones, such as in the case of *"CVVSDNTGGFKTIF_CASSVRRERANTGELFF"* and *"NA_CASSVRRERANTGELFF"* using the **highlight.clones** parameter. In addition, we can simplify the plot to label the clones as clones 1:19.

```{r tidy = FALSE}
clonalCompare(combined.TCR, 
              top.clones = 10,
              highlight.clones = c("CVVSDNTGGFKTIF_CASSVRRERANTGELFF", "NA_CASSVRRERANTGELFF"),
              relabel.clones = TRUE,
              samples = c("P17B", "P17L"), 
              cloneCall="aa", 
              graph = "alluvial")
```

Alternatively, if we only want to show specific clones, we can use the **clones** parameter.

```{r}
clonalCompare(combined.TCR, 
              clones = c("CVVSDNTGGFKTIF_CASSVRRERANTGELFF", "NA_CASSVRRERANTGELFF"),
              relabel.clones = TRUE,
              samples = c("P17B", "P17L"), 
              cloneCall="aa", 
              graph = "alluvial")
```


## clonalScatter

```clonalScatter()``` will organize two repertoires, quantify the relative clone sizes, and produce a scatter plot comparing the two samples. 

**x.axis** and **y.axis** 

* Names of the list element to place on the x-axis and y-axis - so for example *"P18B"* and *"P18L"* 

**dot.size**  

* "total" to display the total number of clones between the x- and y-axis  
* Name of the list element to use for size calculation  
 
**graph**  

* "proportion" for the relative proportion the clone represents across all clones  
* "count" for the total count of clones by sample  

```{r tidy = FALSE}
clonalScatter(combined.TCR, 
              cloneCall ="gene", 
              x.axis = "P18B", 
              y.axis = "P18L",
              dot.size = "total",
              graph = "proportion")
```

***

# Visualizing Clonal Dynamics

## clonalHomeostasis

By examining the clonal space, we effectively look at the relative space occupied by clones at specific proportions. Another way to think about this would be to think of the total immune receptor sequencing run as a measuring cup. In this cup, we will fill liquids of different viscosity - or different numbers of clonal proportions. Clonal space homeostasis asks what percentage of the cup is filled by clones in distinct proportions (or liquids of different viscosity, to extend the analogy). The proportional cut points are set under the **cloneSize** variable in the function and can be adjusted.

**cloneSize**  

* Rare = 0.0001  
* Small = 0.001  
* Medium = 0.01  
* Large = 0.1  
* Hyperexpanded = 1  

```{r tidy = FALSE}
clonalHomeostasis(combined.TCR, 
                  cloneCall = "gene")

```

We can reassign the proportional cut points for cloneSize, which can drastically alter the visualization and analysis. 

```{r tidy = FALSE}
clonalHomeostasis(combined.TCR, 
                  cloneCall = "gene",
                  cloneSize = c(Rare = 0.001, Small = 0.01, Medium = 0.1, Large = 0.3, Hyperexpanded =
    1))
```

In addition, we can use the **group.by** parameter to look at the relative proportion of clones between groups - such as tissue.

```{r tidy = FALSE}

combined.TCR <- addVariable(combined.TCR, 
                            variable.name = "Type", 
                            variables = rep(c("B", "L"), 4))

clonalHomeostasis(combined.TCR, 
                  group.by = "Type",
                  cloneCall = "gene")
```

## clonalProportion

Like clonal space homeostasis above, clonal proportion places clones into separate bins. The key difference is that instead of looking at the relative proportion of the clone to the total, the `clonalProportion()` function will rank the clones by total number and place them into bins. 

The **clonalSplit** represents the ranking of clonotypes by copy or frequency of occurrence, meaning 1:10 are the top 10 clonotypes in each sample. The default bins are under the **clonalSplit** variable in the function and can be adjusted, but they are as follows at baseline. 
 
**clonalSplit**  

* 10  
* 100  
* 1000  
* 10000 
* 30000 
* 100000 

```{r tidy = FALSE}
clonalProportion(combined.TCR, 
                 cloneCall = "gene") 

clonalProportion(combined.TCR, 
                 cloneCall = "nt",
                 clonalSplit = c(1, 5, 10, 100, 1000, 10000)) 
```

***

# Summarizing Repertoires

## vizGenes

A visualization of the relative usage of genes of the TCR or BCR, using `vizGenes()`. There is some functional crossover between `vizGenes()` and two functions below called  `percentGenes()` and `percentVJ()`. But `vizGenes()` is more adaptable to allow for comparisons across chains, scaling, etc. 

**x.axis**  

* Gene segment to separate the x-axis, such as "TRAV", "TRBD", "IGKJ"

**y.axis** 

* Gene segment or categorical variable to visualize along the y-axis

**plot**  

* "barplot" for a bar chart  
* "heatmap" for a heatmap  

**scale**  

* TRUE to scale the graph by number of genes per sample  
* FALSE to report raw numbers  

**order**  

* "gene" to order by gene name  
* "variance" to order by variance between the separate variable categories  

```{r tidy = FALSE}
vizGenes(combined.TCR, 
         x.axis = "TRBV",
         y.axis = NULL,
         plot = "barplot",  
         scale = TRUE)
```

`vizGenes()` can also be used to look at the usage of genes in a single chain. For example, say we are interested in the difference in TRB V and J usage between lung and peripheral blood samples - we can easily take a look at this using the following code:

```{r tidy = FALSE}
#Peripheral Blood
vizGenes(combined.TCR[c(1,3,5,7)], 
         x.axis = "TRBV",
         y.axis = "TRBJ",
         plot = "heatmap",  
         scale = TRUE)

#Lung
vizGenes(combined.TCR[c(2,4,6,8)], 
         x.axis = "TRBV",
         y.axis = "TRBJ",
         plot = "heatmap",  
         scale = TRUE)
```

For the P17 patient samples, what if we are interested in chain pairings, we can look at TRBV and TRAV at the same time using them as inputs to **x.axis** and **y.axis**. 

```{r tidy = FALSE}
vizGenes(combined.TCR[c(1,2)], 
         x.axis = "TRBV",
         y.axis = "TRAV",
         plot = "heatmap", 
         scale = FALSE)
```

## percentAA

Quantify the proportion of amino acids along the cdr3 sequence with `percentAA()`. By default, the function will pad the sequences with NAs up to the maximum of **aa.length**. Sequences longer than **aa.length** will be removed before visualization **(default aa.length = 20)**. 

```{r, message = FALSE, tidy = FALSE}
percentAA(combined.TCR, 
          chain = "TRB", 
          aa.length = 20)
```

## positionalEntropy

We can also quantify the level of entropy/diversity across amino acid residues along the cdr3 sequence. `positionalEntropy()` combines the quantification by residue of ``percentAA()`` with the diversity calls in ```clonalDiversity()```. Importantly, because diversity can be affected by size of sample, similar to ```clonalDiversity()```, ```positionalEntropy()``` will also downsample/bootstrap the calculation. Positions without variance will have a value reported as 0 for the purposes of comparison.

**method**  

* "shannon" - Shannon Diversity  
* "inv.simpson" - Inverse Simpson Diversity 
* "norm.entropy" - Normalized Entropy  

```{r}
positionalEntropy(combined.TCR, 
                  chain = "TRB", 
                  aa.length = 20)
```

## percentGenes

Quantify the proportion of V or J gene usage with `percentGenes()`. Like `percentAA()`, we select the chain of interest and then indicate the gene of interest with the **gene** parameter. Two major limitations of `percentGenes()` are, 1) the function quantifies only V or J genes, and 2) the quantification of the genes are limited to all the V or J genes seen across the samples, not all possible V or J genes. 

```{r tidy = FALSE}
percentGenes(combined.TCR, 
             chain = "TRB", 
             gene = "Vgene")
```

We can also use the output `percentGenes()` for dimensional reduction to summarise the gene usage by sample. This can be done with a simple principal component analysis (below) or even more complex reductions. 

```{r tidy = FALSE}
df.genes <- percentGenes(combined.TCR, 
                         chain = "TRB", 
                         gene = "Vgene", 
                         exportTable = TRUE)

#Performing PCA
pc <- prcomp(df.genes)

#Getting data frame to plot from
df <- as.data.frame(cbind(pc$x[,1:2], rownames(df.genes)))
df$PC1 <- as.numeric(df$PC1)
df$PC2 <- as.numeric(df$PC2)

#Plotting
ggplot(df, aes(x = PC1, y = PC2)) + 
  geom_point(aes(fill =df[,3]), shape = 21, size = 5) + 
  guides(fill=guide_legend(title="Samples")) + 
  scale_fill_manual(values = hcl.colors(nrow(df), "inferno")) + 
  theme_classic() 
```

## percentVJ

Quantify the proportion of V and J gene usage with `percentVJ()`. Like `percentGenes()`, this function will quantify the percentage of V and J paired together across individual repertoires. The output can be visualized using a heatmap or as input for further dimensional reduction.

```{r tidy = FALSE}
percentVJ(combined.TCR, 
          chain = "TRB")

df.genes <- percentVJ(combined.TCR, 
                      chain = "TRB", 
                      exportTable = TRUE)

#Performing PCA
pc <- prcomp(df.genes)

#Getting data frame to plot from
df <- as.data.frame(cbind(pc$x[,1:2], rownames(df.genes))) 
df$PC1 <- as.numeric(df$PC1)
df$PC2 <- as.numeric(df$PC2)

#Plotting
ggplot(df, aes(x = PC1, y = PC2)) + 
  geom_point(aes(fill =df[,3]), shape = 21, size = 5) + 
  guides(fill=guide_legend(title="Samples")) + 
  scale_fill_manual(values = hcl.colors(nrow(df), "inferno")) + 
  theme_classic() 
```

## percentKmer

Another quantification of the composition of the CDR3 sequence is to define motifs by sliding across the amino acid or nucleotide sequences at set intervals resulting in substrings or kmers. 

**motif.length**  

* Numerical value for the length of the kmer.

**top.motifs**  

* Display the most variable genes determined by mean absolute deviation (MAD).

```{r tidy = FALSE}
percentKmer(combined.TCR, 
            cloneCall = "aa",
            chain = "TRB", 
            motif.length = 3, 
            top.motifs = 25)

percentKmer(combined.TCR, 
            cloneCall = "nt",
            chain = "TRB", 
            motif.length = 3, 
            top.motifs = 25)
```

***

# Comparing Clonal Diversity and Overlap

## clonalDiversity

Diversity can also be measured for samples or by other variables. Diversity **metrics** calculated, include: **"shannon"**, **"inv.simpson"**, **"norm.entropy"**, **"gini.simpson"**, **"chao1"**, and **"ACE"**. Please see the manual for more information on each metric and the underlying calculations. 

Inherent in diversity calculations is a bias for increasing diversity with increasing repertoire size. ```clonalDiversity()``` will automatically downsample to the smallest repertoire size and perform bootstrapping to return the mean diversity estimates. If the output of diversity values are strange or minimally variable, it is likely due to a sample with small repertoire size.

**n.boots**  
The number of calculations to perform **(default = 100)**.  

**return.boots**  

* TRUE: Return all the calculations.  
* FALSE: Return only the mean values **(default)**.  

**skip.boots**  
Skip the bootstrapping calculations.


```{r tidy = FALSE}
clonalDiversity(combined.TCR, 
                cloneCall = "gene", 
                n.boots = 20)

```

As a default, ```clonalDiversity()``` will return all the metrics calculated - **"shannon"**, **"inv.simpson"**, **"norm.entropy"**, **"gini.simpson"**, **"chao1"**, and **"ACE"**. Selecting a single or a subset of these methods using the **metrics** parameter.

```{r}
#Return only a subset of metrics
clonalDiversity(combined.TCR, 
                metrics = c("shannon", "ACE"),
                cloneCall = "gene", 
                n.boots = 20)
```

## clonaRarefaction

We can also use Hill numbers to estimate the rarefaction, or estimating species richness, using the the abundance of clones across groupings. Underlying the rarefaction calculation is the use of observed receptor of abundance to compute diversity. 

**hill.numbers** 

+ 0 - species-richness  
+ 1 - Shannon  
+ 2 - Simpson  

**plot.type**  

+ 1 - sample-size-based rarefaction/extrapolation  
+ 2 - sample completeness curve  
+ 3 - coverage-based rarefaction/extrapolation curve  

This function relies on the [iNEXT](https://cran.r-project.org/web/packages/iNEXT/index.html) with the accompanying [manuscript](http://chao.stat.nthu.edu.tw/wordpress/paper/120_pdf_appendix.pdf). Like the other wrapping functions in scRepertoire, please cite the original work. The sample completeness curve (**plot.type** = 2), may not show full sample coverage due to the size/diversity of the input data.

### Rarefaction using Species Richness (q = 0)

```{r, message=FALSE, tidy = FALSE}
clonalRarefaction(combined.TCR,
                  plot.type = 1,
                  hill.numbers = 0,
                  n.boots = 2)

clonalRarefaction(combined.TCR,
                  plot.type = 2,
                  hill.numbers = 0,
                  n.boots = 2)

clonalRarefaction(combined.TCR,
                  plot.type = 3,
                  hill.numbers = 0,
                  n.boots = 2)
```

### Rarefaction using Shannon Diversity (q = 1)

```{r tidy = FALSE}
clonalRarefaction(combined.TCR,
                  plot.type = 1,
                  hill.numbers = 1,
                  n.boots = 2)

clonalRarefaction(combined.TCR,
                  plot.type = 2,
                  hill.numbers = 1,
                  n.boots = 2)

clonalRarefaction(combined.TCR,
                  plot.type = 3,
                  hill.numbers = 1,
                  n.boots = 2)
```

## clonalSizeDistribution

Another method for modeling the repertoire distribution is a discrete gamma-GPD spliced threshold model, proposed by [Koch et al.](https://pubmed.ncbi.nlm.nih.gov/30485278/) The spliced model models the repertoire and allows for the application of a power law distribution for larger clonal-expanded sequences and a Poisson distribution for smaller clones. After fitting the models, repertoires can be compared using Euclidean distance. 

If using this function, please read/cite [Koch et al.](https://pubmed.ncbi.nlm.nih.gov/30485278/) and check out the [powerTCR](https://bioconductor.org/packages/release/bioc/html/powerTCR.html) R package.

```{r tidy = FALSE}
clonalSizeDistribution(combined.TCR, 
                       cloneCall = "aa", 
                       method= "ward.D2")
```

## clonalOverlap

If you are interested in measures of similarity between the samples loaded into scRepertoire, using `clonalOverlap()` can assist in the visualization. 

The underlying `clonalOverlap()` calculation varies by the **method** parameter, more information on the exact calculations are available in the manual. 

**method**

* "overlap" - overlap coefficient 
* "morisita" - Morisita's overlap index
* "jaccard" - Jaccard index
* "cosine" - cosine similarity
* "raw" - exact number of overlapping clones

```{r tidy = FALSE}
clonalOverlap(combined.TCR, 
              cloneCall = "strict", 
              method = "morisita")

clonalOverlap(combined.TCR, 
              cloneCall = "strict", 
              method = "raw")
```

***
# Combining Clones and Single-Cell Objects

The data in the scRepertoire package is derived from a [study](https://pubmed.ncbi.nlm.nih.gov/33622974/) of acute respiratory stress disorder in the context of bacterial and COVID-19 infections. The internal single cell data (`scRep_example()`) built in to scRepertoire is randomly sampled 500 cells from the fully integrated Seurat object to minimize the package size. We will use both Seurat and Single-Cell Experiment (SCE) with scater to perform further visualizations in tandem. 

## Preprocessed Single-Cell Object

```{r tidy = FALSE}
scRep_example <- get(data("scRep_example"))

#Making a Single-Cell Experiment object
sce <- Seurat::as.SingleCellExperiment(scRep_example)
```


## combineExpression

After processing the contig data into clones via `combineBCR()` or `combineTCR()`, we can add the clonal information to the single-cell object using `combineExpression()`. 

**Importantly**, the major requirement for the attachment is matching contig cell barcodes and barcodes in the row names of the meta data of the Seurat or Single-Cell Experiment object. If these do not match, the attachment will fail. Based on ease, we suggest making changes to the single-cell object barcodes.

### Calculating cloneSize

Part of `combineExpression()` is calculating the clonal frequency and proportion, placing each clone into groups called **cloneSize**. The default **cloneSize** argument uses the following bins: c(Rare = 1e-4, Small = 0.001, Medium = 0.01, Large = 0.1, Hyperexpanded = 1), which can be modified to include more/less bins or different names. 

Clonal frequency and proportion is dependent on the repertoires being compared, which we can modify the calculation using the **group.by** parameter, such as grouping by the *Patient* variable from above. If **group.by** is not set, `combineExpression()` will calculate clonal frequency, proportion, and **cloneSize** as a function of individual sequencing runs. In addition, **cloneSize** can use the frequency of clones when **proportion** = FALSE. 

We can look at the default cloneSize groupings using the Single-Cell Experiment object we just created above with using **group.by** set to the *sample* variable used in ```combineTCR()``` :

```{r tidy = FALSE}
sce <- combineExpression(combined.TCR, 
                         sce, 
                         cloneCall="gene", 
                         group.by = "sample", 
                         proportion = TRUE)

#Define color palette 
colorblind_vector <- hcl.colors(n=7, palette = "inferno", fixup = TRUE)

plotUMAP(sce, colour_by = "cloneSize") +
    scale_color_manual(values=rev(colorblind_vector[c(1,3,5,7)]))
```

Alternatively, if we want **cloneSize** to be based on the frequency of the clone, we can set **proportion** = FALSE and we will need to change the **cloneSize** bins to integers. If we have not inspected our clone data, setting the upper limit of the clonal frequency might be difficult - ```combineExpression()``` will automatically adjust the upper limit to fit the distribution of the frequencies. To demonstrate this, check out the Seurat object below: 

```{r tidy = FALSE}
scRep_example <- combineExpression(combined.TCR, 
                                   scRep_example, 
                                   cloneCall="gene", 
                                   group.by = "sample", 
                                   proportion = FALSE, 
                                   cloneSize=c(Single=1, Small=5, Medium=20, Large=100, Hyperexpanded=500))

Seurat::DimPlot(scRep_example, group.by = "cloneSize") +
    scale_color_manual(values=rev(colorblind_vector[c(1,3,4,5,7)]))
```

### Combining both TCR and BCR 

If we have TCR/BCR enrichment or want to add info for gamma-delta and alpha-beta T cells, we can make a single list and use ```combineExpression()```. 

**Major note** if there are duplicate barcodes (if a cell has both Ig and TCR), the immune receptor information will not be added. It might be worth checking cluster identities and removing incongruent barcodes in the products of ```combineTCR()``` and ```combineBCR()```. 

As an anecdote, the [testing data](https://support.10xgenomics.com/single-cell-vdj/datasets/6.0.1/SC5v2_Melanoma_5Kcells_Connect_single_channel_SC5v2_Melanoma_5Kcells_Connect_single_channel) we used to improve this function had 5-6% of barcode overlap.

```{r, eval=FALSE, tidy = FALSE}
#This is an example of the process, which will not be evaluated during knit
TCR <- combineTCR(...)
BCR <- combineBCR(...)
list.receptors <- c(TCR, BCR)


seurat <- combineExpression(list.receptors, 
                            seurat, 
                            cloneCall="gene", 
                            proportion = TRUE)
```

*** 

# Visualizations for Single-Cell Objects

## clonalOverlay 

Using the dimensional reduction graphs as a reference, we can also generate an overlay of the position of clonally-expanded cells using `clonalOverlay()`. 

**reduction** 

* The dimensional reduction for the visualization, **(default = "pca")**

**freq.cutpoint**  

* lowest clonal frequency or proportion to generate the contour plot


**bins** 

* the number of contours to draw

`clonalOverlay()` can be used to look across all cells or faceted by a meta data variable using **facet.by**. The overall dimensional reduction will be maintained as we facet, while the contour plots will adjust based on the **facet.by** variable. The coloring of the dot plot is based on the active identity of the single-cell object. 

This visualization was authored by Dr. Francesco Mazziotta and inspired by Drs. Carmona and Andreatta and their work with [ProjectTIL](https://github.com/carmonalab/ProjecTILs), which is a great pipeline to annotated T cell subtypes.

```{r tidy = FALSE}
#Adding patient information
scRep_example$Patient <- substr(scRep_example$orig.ident, 1,3)

clonalOverlay(scRep_example, 
              reduction = "umap", 
              freq.cutpoint = 1, 
              bins = 10, 
              facet.by = "Patient") + 
              guides(color = "none")
```

## clonalNetwork 

Similar to `clonalOverlay()`, we can look at the network interaction of clones shared between clusters along the single-cell dimensional reduction using `clonalNetwork()`. This function shows the relative proportion of clones from the starting node, with the ending node indicated by the arrow. 

Filtering for clones can be accomplished using 3 methods:  

**filter.clones**  

* Select a number to isolate the clones comprising the top n number of cells (e.g., **filter.clones** = 2000)  
* Select "min" to make sure all groups are scaled to the size of the minimum group

**filter.identity**  

* For the identity chosen to visualize, show the to and from network connections for a single group  

**filter.proportion**

* Remove clones that comprise less than a certain proportion of clones in groups.

**filter.graph**

* Remove the reciprocal edges from the half of the graph, allowing for cleaner visualization.
                          

```{r tidy = FALSE}
#ggraph needs to be loaded due to issues with ggplot
library(ggraph)

#No Identity filter
clonalNetwork(scRep_example, 
              reduction = "umap", 
              group.by = "seurat_clusters",
              filter.clones = NULL,
              filter.identity = NULL,
              cloneCall = "aa")

```

We can look at the clonal relationships relative to a single cluster using the **filter.identity** parameter. 

```{r, tidy = FALSE}
#Examining Cluster 3 only
clonalNetwork(scRep_example, 
              reduction = "umap", 
              group.by = "seurat_clusters",
              filter.identity = 3,
              cloneCall = "aa")
```

We can also use **exportClones** parameter to quickly get clones that are shared across multiple identity groups, along with the total number of clones in the data set.

```{r tidy = FALSE}
shared.clones <- clonalNetwork(scRep_example, 
                               reduction = "umap", 
                               group.by = "seurat_clusters",
                               cloneCall = "aa", 
                               exportClones = TRUE)
head(shared.clones)
```

## highlightClones

We can also look at the clones by calling specific sequences in the `highlightclones()` below. In order to highlight the clones, we first need to use the **cloneCall**, the type of sequence we will be using, and then the specific sequences themselves using **sequence**. Below, we can see the steps to highlight the most prominent sequences *CAERGSGGSYIPTF_CASSDPSGRQGPRWDTQYF*, and *CARKVRDSSYKLIF_CASSDSGYNEQFF*.

```{r tidy = FALSE}
scRep_example <- highlightClones(scRep_example, 
                    cloneCall= "aa", 
                    sequence = c("CAERGSGGSYIPTF_CASSDPSGRQGPRWDTQYF", 
                                 "CARKVRDSSYKLIF_CASSDSGYNEQFF"))

Seurat::DimPlot(scRep_example, group.by = "highlight") + 
  ggplot2::theme(plot.title = element_blank())
```

## clonalOccupy

We can also look at the count of cells by cluster assigned into specific frequency ranges by using the `clonalOccupy()` function and selecting the **x.axis** to display cluster or other variables in the meta data of the single cell object.

**proportion**  

* can be used to look at relative level groupings  

**label**  

* will still return the absolute number of clones  

```{r tidy = FALSE}
clonalOccupy(scRep_example, 
              x.axis = "seurat_clusters")

clonalOccupy(scRep_example, 
                     x.axis = "ident", 
                     proportion = TRUE, 
                     label = FALSE)
```

## alluvialClones

After the metadata has been modified, we can look at clones across multiple categories using the `alluvialClones()` function. We are able to use the plots to examine the interchange of categorical variables. Because this function will produce a graph with each clone arranged by called stratification, this will take some time depending on the size of the repertoire.

To understand the basic concepts of this graphing method and the ggalluvial R package, we recommend reading [this post](https://cran.r-project.org/web/packages/ggalluvial/vignettes/ggalluvial.html).  

```{r tidy = FALSE}
#Adding type information
scRep_example$Type <- substr(scRep_example$orig.ident, 4,4)

alluvialClones(scRep_example, 
               cloneCall = "aa", 
               y.axes = c("Patient", "ident", "Type"), 
               color = c("CVVSDNTGGFKTIF_CASSVRRERANTGELFF", "NA_CASSVRRERANTGELFF")) + 
    scale_fill_manual(values = c("grey", colorblind_vector[3]))


alluvialClones(scRep_example, 
                   cloneCall = "gene", 
                   y.axes = c("Patient", "ident", "Type"), 
                   color = "ident") 
```

## getCirclize

Like alluvial graphs, we can also visualize the interconnection of clusters using the chord diagrams from the [circlize R package](https://jokergoo.github.io/circlize_book/book/). 

The first step is getting the data frame output to feed into the `chordDiagram()` function in circlize, which can be done using `getCirclize()`. 

This will calculate the relative number of clones shared based on the **group.by** variable using the product of `combineExpression()`. It is important to note `getCirclize()` will create a matrix the size of the **group.by** variable and then simplify into instructions to be read by the circlize R package. The output is the total number of unique and shared clones by the **group.by** variable. 

```{r tidy = FALSE}
library(circlize)
library(scales)

circles <- getCirclize(scRep_example, 
                       group.by = "seurat_clusters")

#Just assigning the normal colors to each cluster
grid.cols <- hue_pal()(length(unique(scRep_example$seurat_clusters)))
names(grid.cols) <- unique(scRep_example$seurat_clusters)

#Graphing the chord diagram
chordDiagram(circles, self.link = 1, grid.col = grid.cols)
```

This can also be used if we want to explore just the lung-specific T cells by just subsetting the single-cell object. For the sake of this vignette, we can also look at setting **proportion** = TRUE to get a scaled output.

```{r}
subset <- subset(scRep_example, Type == "L")

circles <- getCirclize(subset, group.by = "ident", proportion = TRUE)

grid.cols <- scales::hue_pal()(length(unique(subset@active.ident)))
names(grid.cols) <- levels(subset@active.ident)

chordDiagram(circles, 
             self.link = 1, 
             grid.col = grid.cols, 
             directional = 1, 
             direction.type =  "arrows",
             link.arr.type = "big.arrow")
```

***

# Quantifying Clonal Bias

## StartracDiversity

From the excellent work by [Lei Zhang, et al.](https://www.nature.com/articles/s41586-018-0694-x), the authors introduce new methods for looking at clones by cellular origins and cluster identification. Their [STARTRAC](https://github.com/Japrin/STARTRAC) software has been adapted to work with scRepertoire and please read and cite their excellent work. 

In order to use the ```StartracDiversity()``` function, you will need to include the product of the ```combinedExpression()``` function. The second requirement is a column header in the meta data of the Seurat object that has tissue of origin. In the example data, **type** corresponds to the column "Type", which includes the "P" and "T" classifiers. The indices can be subsetted for a specific patient or examined overall using the **by** variable. Importantly, the function uses only the strict definition of a clone of the VDJC genes and the CDR3 nucleotide sequence. 

**The indices output includes:**  

*  expa - Clonal Expansion  
*  migr - Cross-tissue Migration  
*  tran - State Transition  

```{r tidy = FALSE}
StartracDiversity(scRep_example, 
                  type = "Type", 
                  group.by = "Patient")
```

## clonalBias

A new metric proposed by [Massimo et al](https://pubmed.ncbi.nlm.nih.gov/35829695/), ```clonalBias()```, like STARTRAC is a clonal metric that seeks to quantify how individual clones are skewed towards a specific cellular compartment or cluster.

**split.by**   

* Variable used to calculate the baseline frequencies  

**group.by**  

* The compartment/variable used for the purpose of the comparison  

**min.expand**  

* Cut point for frequency **(default = 10)**

```{r, message = FALSE, tidy = FALSE}
clonalBias(scRep_example, 
           cloneCall = "aa", 
           split.by = "Patient", 
           group.by = "seurat_clusters",
           n.boots = 10, 
           min.expand =0)
```

***

# Clustering by Edit Distance

## clonalCluster

The nucleotide or amino acid sequences of the chains can be used to cluster clonotypes by examining the edit distance of the sequences. This approach is underlying the `combineBCR()` function but now can be applied to the B or T cell receptors at the level of nucleotides (**sequence = "nt"**) or amino acids (**sequence = "aa"**). It will add a cluster to the end of each list element by generating a network connected by the similarity in sequence. This network is directed by the **threshold** variable, where 0.85 is the normalized mean edit distance.

Edit-distance based clusters will have the following format: 

* [chain:] + :Cluster + [number]  e.g., TRA:Cluster.1

**Cluster** denotes if the cluster was called using the normalized Levenshtein distance, which takes the edit distance calculated between 2 sequences and divides that by the mean of the sequence lengths. Unconnected sequences will have *NA* values.

### Basic use

```{r tidy = FALSE}
sub_combined <- clonalCluster(combined.TCR[[2]], 
                              chain = "TRA", 
                              sequence = "aa", 
                              threshold = 0.85, 
                              group.by = NULL)

head(sub_combined[,c(1,2,13)])
```

### Clustering with a single-cell object

If performed over the number of samples, such as the list elements, **group.by** can used to calculate only the clusters on the setting of patient sample (**group.by** = "Patient") or tissue type (**group.by** = "Type"). This will add the selected group to the beginning of the cluster designation. We can also call ```clonalCluster()``` directly on a Single-Cell Object.

```{r tidy = FALSE}
#Define color palette 
colorblind_vector <- hcl.colors(n=7, palette = "inferno", fixup = TRUE)

scRep_example <- clonalCluster(scRep_example, 
                               chain = "TRA", 
                               sequence = "aa", 
                               threshold = 0.85, 
                               group.by = "Patient")

Seurat::DimPlot(scRep_example, group.by = "TRA_cluster") +
    scale_color_manual(values =  hcl.colors(n=length(unique(scRep_example@meta.data[,"TRA_cluster"])), "inferno")) + 
  Seurat::NoLegend() + 
  theme(plot.title = element_blank())
```

Using ```clonalCluster()```, we can also return an igraph object of all the related sequences using **exportGraph** = TRUE. The returned igraph object contains only the sequences that have at least one connection with another sequence. The igraph can then be directly visualized (below) or used for downstream analysis (see the igraph [website](https://r.igraph.org/)). 

### Returning an igraph object:

```{r tidy = FALSE}
#Clustering Patient 19 samples
igraph.object <- clonalCluster(combined.TCR[c(5,6)],
                               chain = "TRB",
                               sequence = "aa",
                               group.by = "sample",
                               threshold = 0.85, 
                               exportGraph = TRUE)

#Setting color scheme
col_legend <- factor(igraph::V(igraph.object)$group)
col_samples <- hcl.colors(3,"inferno")[as.numeric(col_legend)]
color.legend <- factor(unique(igraph::V(igraph.object)$group))

#Plotting
plot(
  igraph.object,
  vertex.size     = sqrt(igraph::V(igraph.object)$size),
  vertex.label    = NA,
  edge.arrow.size = .25,
  vertex.color    = col_samples
)
legend("topleft", legend = levels(color.legend), pch = 16, col = unique(col_samples), bty = "n")
```

***

# Conclusion  

This has been a general overview of the capabilities of scRepertoire from the initial processing and visualization to attach to the mRNA expression values in a single-cell object. If you have any questions, comments, or suggestions, please visit the GitHub repository or [email me](mailto:ncborch@gmail.com).

### Session Info

```{r}
sessionInfo()
```