---
title: "Introduction to ClustIRR"
author: "Kai Wollek and Simo Kitanovski"
date: "`r Sys.Date()`"
output:
    BiocStyle::html_document
vignette: >
    %\VignetteIndexEntry{Introduction to ClustIRR}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    fig.height = 5,
    fig.width = 5,
    fig.align = "center"
)
```

```{r}
library(ClustIRR)
library(knitr)
library(visNetwork)
```


# Introduction
Adaptive immune repertoires of B- and T-cell receptors (BCRs/TCRs) play an 
important role in protecting the host against genetically diverse and rapidly 
evolving pathogens, such as viruses, bacteria, or cancers. The BCR and TCR
sequence diversity originates in part due to V(D)J recombination, in which 
different germline-encoded genes are joined to form immune receptors (IRs). 
As a result of this process, practically every newly formed naive mature T cell 
and B cell is equipped with a distinct IR, and this allows them to recognize 
distinct sets of antigens. 

B-cells bind antigens directly via the complementarity determining regions 
(CDR) of their BCRs, and T-cells recognize antigenic peptides presented 
by major histocompatibility (MHC) molecules via the CDRs of their TCRs. Antigen 
recognition may lead to B/T  cell activation, and in such a case, the cells 
start to proliferate rapidly, forming antigen-specific clones that are capable 
of mounting effective immune response. 

Recent studies have shown that similarity in TCR sequences implies shared 
antigen specificity between receptors. Hence, by clustering of TCR sequences of
a repertoire derived by high-throughput sequencing (HT-seq), we can identify 
groups of TCRs with shared antigen specificity, which is essential for the 
development of cancer immunotherapies, vaccines, antiviral drugs, etc. 

This vignette introduces `r Biocpkg("ClustIRR")`, a computational method 
for clustering of immune receptor repertoires.

# Installation
`r Biocpkg("ClustIRR")` is freely available as part of Bioconductor, filling
the gap that currently exists in terms of software for quantitative analysis 
of adaptive immune repertoires. 

To install `r Biocpkg("ClustIRR")` please start R and enter:

```{r, eval=FALSE}
if(!require("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("ClustIRR")
```


# ClustIRR algorithm
The algorithm of `r Biocpkg("ClustIRR")` performs clustering of IR sequences 
to find groups of IRs with similar specificity. 

```{r graphic, echo = FALSE, fig.align="left", out.width='90%'}
knitr::include_graphics("../inst/extdata/logo.png")
```


## Input

The main input of `r Biocpkg("ClustIRR")` are two repertoires, i.e., sets of 
amino acid sequences of the complementarity determining region 3 (CDR3). The 
CDR3s may come from one T cell receptor chain (e.g., \ only CDR3$\alpha$s or 
only CDR3$\beta$s) or from both chains (CDR3$\alpha\beta$). The two sets 
represent two repertoires, and each element in the sets represents a T-cell:

* `s`: IR repertoire under investigation (case sample)
* `r`: reference IR repertoire (control/reference sample)

**Hint:** `s` and `r` may also contain CDR3s from $\gamma\delta$ T-cells 
(CDR3$\gamma$ and CDR3$\delta$) or B-cells (CDR3H and CDR3L). This is 
explained in case study 4.

Let's have a look at an example data set which we will use as input:

```{r}
data("CDR3ab")
```

```{r}
# take the first 500 CDR3b sequences from the data -> s
# s <- base::data.frame(CDR3b = CDR3ab$CDR3b[1:500])
s <- base::data.frame(CDR3b = c(CDR3ab$CDR3b[1:500], "CASSSSPGDQEQFF"))

# take 5000 CDR3b sequences 501:5500 from the data -> r
r <- base::data.frame(CDR3b = CDR3ab$CDR3b[501:5500])
```


```{r}
str(s)
```


```{r}
str(r)
```


## Clustering

`r Biocpkg("ClustIRR")` employs two clustering strategies:

* **local** clustering: detects enrichment of motifs in `s` compared to `r`
* **global** clustering: identifies pairs of CDR3s in `s` that have similar 
sequences

The rationale behind these two clustering strategies is the following: two 
identical CDR3 sequences have the same specificity. Similar CDR3 sequences 
(e.g., CDR3 sequences that differ by only one amino acid) may have similar 
specificity. Global clustering is designed to find pairs of CDR3s that are
globally (based on the complete CDR3 sequences) similar. 

We also know that two CDR3s with significantly different sequences may still 
recognize the same peptide[^1] if they share a motif in their core regions 
(e.g., identical 4-mer). Such "useful" motifs may be enriched in `s`  but not 
in `r`, and local clustering aims to identify them.


### Local clustering

CDR3 sequences are segmented into overlapping **motifs** ($k$-mers), where $k$ 
is specified by setting the input $ks = 4$. 

Example of segmenting CDR3 sequence into 4-mers:

```{r}
cdr3 <- "CASSTTTGTGELFF"
k <- 4
base::colnames(stringdist::qgrams(cdr3, q = k))
```


$k$-mers found in the **core region** of the CDR3 loop are more likely to 
establish contact with an antigenic peptide than the $k$-mers found at the 
flanks of the CDR3 sequence. Hence, the user is encouraged to remove a few 
amino acids from the flanks of each CDR3 sequence. This can be done by changing
the control input from `control$trim_flank_aa,` e.g.:
  
  * `control$trim_flank_aa = 0`: no trimming
  * `control$trim_flank_aa = 3`: trim three amino acids from both flanks of the 
  CDR3 sequence

An example of trimming CDR3 flanks and segmenting the core of the CDR3 sequence 
into 4-mers is shown below. We now focus on five overlapping motifs found in 
the core region of the `CASSTTTGTGELFF`.

```{r}
t <- 3
cdr3_trimmed <- base::substr(x = cdr3, start = t + 1, stop = nchar(cdr3) - t)
base::colnames(stringdist::qgrams(cdr3_trimmed, q = k))
```
 

A motif is considered enriched if the following (user-defined) criteria are 
satisfied:

1. `control$local_min_o`: minimum motif frequency in `s`
2. `control$local_min_ove`: minimum ratio of observed vs. expected (OvE) 
    relative motif frequency, with $OvE=\dfrac{f_s}{n_s}/\dfrac{f_r}{n_r}$
    * $f_{s}$ and $f_{r}$: motif frequencies in repertoires `s` and `r`
    * $n_{s}$ and $n_{r}$: total number of motifs in repertoires `s` and `r`
      
3. `control$local_max_fdr`: maximum false discovery rate (FDR). Corrected 
    p-value, computed by Fisher's exact test (actually hypergeometic test)


### Global clustering

For global clustering, `r Biocpkg("ClustIRR")` quantifies the dissimilarity 
between pairs of CDR3 sequences using Hamming distances. Two CDR3 sequences 
with Hamming distance $\leq x$ are considered globally similar, where $x$ is 
the user-defined threshold `control$global_max_dist` (default = 1). 

With this, `r Biocpkg("ClustIRR")` provides a very simple and intuitive 
heuristic for identifying globally similar CDR3s. But this approach also 
has drawbacks: 

1. CDR3 sequences with different lengths are not compared
2. Hamming distance, by definition, does not take into account the properties 
of the replaced amino acids

To overcome these challenges, `r Biocpkg("ClustIRR")` provides a second input
option (see red input in `r Biocpkg("ClustIRR")` workflow), which allows the 
user to provide a matrix of globally similar CDR3 sequences computed by 
complementary approaches (e.g., *tcrdist*). This input can be provided via 
the input parameter `control$global_pairs`.


## Output

The main function in `r Biocpkg("ClustIRR")` is `cluster_irr`. This function 
returns an S4 object of class `clust_irr`. The object contains two sublists 
(slots):

1. clustering results: tables and lists
2. processed inputs: processed version of the input data (`s`) and parameters

We will describe the inputs, outputs, and the algorithm of 
`r Biocpkg("ClustIRR")` with the help of the following case studies. 


# Case study 1: simple TCR repertoire analysis with `r Biocpkg("ClustIRR")`

In this example, we will insert the motif *LEAR* in the core regions of the 
first 20 CDR3b sequences of repertoire `s`. With this, we simulate enrichment 
of this motif. 

This motif is not enriched in repertoire `r`, and `r Biocpkg("ClustIRR")` 
should be able to detect *LEAR* as enriched:

```{r}
# insert motif LEAR
base::substr(x = s$CDR3b[1:20], start = 6, stop = 9) <- "LEAR"
```


... and then we perform clustering with `r Biocpkg("ClustIRR")`:

```{r}
o <- cluster_irr(s = s,
                 r = r,
                 version = 2,
                 ks = 4,
                 cores = 1,
                 control = list(trim_flank_aa = 3))
```

## Local clustering results
In the following table we see that `r Biocpkg("ClustIRR")` has detected 
enrichment of *LEAR* in `s` compared to `r`: 

```{r}
# extract local motifs
local_motifs <- get_clustirr_clust(o)$CDR3b$local$m

# display only passed motifs
knitr::kable(local_motifs[local_motifs$pass == TRUE, ], row.names = FALSE)
```

Interestingly, the motif *LLEA* is also enriched. This is probably because 
*LLEA* and *LEAR* share the substring *LEA*, hence, the enrichment of *LLEA* 
can be seen as a byproduct of the inserted motif *LEAR*.

This can also be seen from the lower frequency of *LLEA* ($f_s$=5) in `s` 
compared to *LEAR* ($f_s$=20). For *LEAR* we see FDR$\approx 10^{-15}$, 
which is significantly smaller than the FDR$\approx 10^{-2}$ observed for 
*LLEA*. Finally, for *LEAR* we see OvE$\approx 100$, whereas for *LLEA*'s  
OvE$=\infty$ (*LLEA* has `f_r=0`; division by 0 when calculating OvE). 


## Global clustering results

In our data we had only one pair of globally similar CDR3 sequences, i.e.
the CDR3 sequences *CAS**S**PLEARGYTF* and *CAS**R**PLEARGYTF* which differ 
by one amino acid at position 4. `r Biocpkg("ClustIRR")` has identified this:

```{r}
# display globally similar pairs
knitr::kable(get_clustirr_clust(o)$CDR3b$global, row.names = FALSE)
```

## Putting it all together $\rightarrow$ graph

To interpret the `r Biocpkg("ClustIRR")` output we can inspect the tables
of locally/globally similar CDR3s. However, sometimes these tables can be
massive and difficult to interpret. Hence, we provide the functions `get_graph`, 
which allows us to generate an undirected graph (`r CRANpkg("igraph")` object) 
from the main outputs of `r Biocpkg("ClustIRR")`, and `plot_graph` for 
visualization of the graph.

Each vertex in the graph is a T-cell clone, and we draw an edge between two 
vertices if a) they are globally similar; or b) if they share an enriched 
motif (locally similar). 

Let's visualize the graph output for this case study:

```{r, fig.width=5, fig.height=4, fig.align='center'}
par(mai = c(0,0,0,0))
plot_graph(clust_irr = o)
```

&nbsp;

### Vertices

The graph shows a cluster of 20 T-cell clones (vertices). These are the 20 
clones with CDR3 sequences in which we inserted the motif *LEAR*. The remaining 
clones (about 480) are shown as singleton vertices.

We also wee two vertices connected by an edge. These are the two clones that
are globally similar.

We can also plot an **interactive** graph with `r CRANpkg("visNetwork")`. 

```{r, fig.width=5, fig.height=4, fig.align='center'}
plot_graph(clust_irr = o, as_visnet = TRUE)
```


### Edges
Between a pair of vertices, we draw an edge if: a) they are globally similar; 
or b) they share an enriched motif. Furthermore, each chain type may have an
edge if the repertoire contains paired information, e.g., on $\beta$ and 
$\alpha$ chain CDR3 sequences.

The **color**, **linetype** and **thickness** of the edges are determined as
follows. 

  * edge colors
      
      * purple: local CDR3 similarity
      * green: global CDR3 similarity
      * black: local and global CDR3 similarity
      
  * edge linetypes
    
      * dashed: similarity between CDR3$\beta$, CDR3$\delta$ and CDR3H
      * dotted: similarity between CDR3$\alpha$, CDR3$\gamma$ and CDR3L
      * solid: similarity between CDR3s from both chains (e.g. CDR3$\alpha$ 
        and CDR3$\beta$)

  * edge thickness: number of edges between two clones


# Case study 2: analysis of TCR repertoire with large expanded clone

In this case study, we assume that the data contains a clone with 20 
T-cells ($\approx$4% of the size of the initial repertoire). The T-cells 
in the expanded clone have the same CDR3b sequence *CATSRPDGLAQYF*. 
`r Biocpkg("ClustIRR")` should detect most motifs at the core of 
*CATSRPDGLAQYF* as enriched while also reporting that all cells within 
have globally similar (in fact identical) CDR3b sequences.

Let's insert a clone in dataset `s`:

```{r}
# create a clone of 10 T-cells
clone <- base::data.frame(CDR3b = rep("CATSRPDGLAQYF", times = 20))

# append the clone to the original repertoire 's'
s <- base::rbind(s, clone)
```

... and once again perform clustering with `r Biocpkg("ClustIRR")`:

```{r}
o <- cluster_irr(s = s,
                 r = r,
                 version = 2,
                 ks = 4,
                 cores = 1,
                 control = list(trim_flank_aa = 3))
```

## Local clustering results
`r Biocpkg("ClustIRR")` once again reports enrichment of *LEAR*, but also of 
many additional motifs that are part of the core of *CATSRPDGLAQYF*, such as 
*DGLA*, *PDGL*, *RPDG*, *SRPG*, etc.

```{r, fig.align='center'}
# extract local motifs
local_motifs <- get_clustirr_clust(o)$CDR3b$local$m

# display only passed motifs
knitr::kable(local_motifs[local_motifs$pass == TRUE, ], row.names = FALSE)
```

## Global clustering results

Once again, `r Biocpkg("ClustIRR")` finds the same pair of globally similar 
CDR3 sequences. However, remember that this time our repertoire `s` contains 
20 identical CDR3s belonging to the expanded clone's T-cells. Each pair 
of CDR3 sequences in the clone are globally similar. Meanwhile, the CDR3 
sequences *CAS**S**PLEARGYTF* and *CAS**R**PLEARGYTF* differ by one amino 
acid at position 4.

Let's check how the global and local similarities are represented in the
graph (see next section).


## Graph output

Let's plot the clustering results:

```{r}
# plot the clust_irr object
plot_graph(clust_irr = o, as_visnet = TRUE)
```

&nbsp;

We see one connected component, which is identical to the one we saw in case 
study 1. Furthermore, we see a large vertex. This represents the clonal 
expansion, where the size of the vertex scales as the logarithm of the number 
of T-cells in the clone.

Clonally expanded cells are globally similar to each other. If the specific 
clonal expansion is only found in `s` but not in `r`, then it is likely that
we will also see an enrichment of certain motifs from the core of the 
corresponding CDR3 sequences. In summary, CDR3s of expanded clones are similar 
in terms of the global sequences but may also be locally similar.


# Case study 3: analysis of TCR repertoire with paired $\alpha\beta$ TCR chains

Single-cell technology allows us to sequence entire TCR repertoires, and
to extract the sequences of both TCR chains: $\beta$ and $\alpha$.

`r Biocpkg("ClustIRR")` can analyze such data. Clustering is performed 
separately using the CDR3 sequences of each chain.

Let's create the input data. We create two repetoires: 

  * repertoire `S0`: 1,000 CDR3$\beta$ and CDR3$\alpha$ sequence pairs. 
  * repertoire `S1`: 1,000 CDR3$\beta$ and CDR3$\alpha$ sequence pairs. 
  
Imagine that `S0` and `S1` are two TCR repertoires of a cancer patient, 
sequenced before and after cancer therapy, respectively.

```{r}
data("CDR3ab")

S0 <- base::data.frame(CDR3a = CDR3ab$CDR3a[3001:4000],
                      CDR3b = CDR3ab$CDR3b[3001:4000])

S1 <- base::data.frame(CDR3a = CDR3ab$CDR3a[5001:6000],
                      CDR3b = CDR3ab$CDR3b[5001:6000])
```

TCR repertoire `S0` has 

  * two enriched motifs: *LEAR* and *REAL*
  * three expanded clones
      * clone 1: 100 T-cells (CDR3a: CASSEGEQFF and CDR3b CASSLLARAEQFF)
      * clone 2: 50 T-cells (CDR3a: CASSLESPLHF and CDR3b: CASSLEEEEEEPLHF)
      * clone 3: 10 T-cells (CDR3a: CASSLESPLHF and CDR3b: CASSLAAAAASPLHF)

```{r}
# insert motif LEAR in CDR3b
base::substr(x = S0$CDR3b[1:10], start = 6, stop = 9) <- "LEAR"
# insert motif REAL in CDR3a
base::substr(x = S0$CDR3a[50:59], start = 6, stop = 9) <- "REAL"
```


```{r}
# create 3 clones
clone_1 <- data.frame(CDR3a = rep(x = "CASSEGEQFF", times = 100),
                      CDR3b = rep(x = "CASSLLARAEQFF", times = 100))

clone_2 <- data.frame(CDR3a = rep(x = "CASSLESPLHF", times = 50),
                      CDR3b = rep(x = "CASSLEEEEEEPLHF", times = 50))

clone_3 <- data.frame(CDR3a = rep(x = "CASSLESPLHF", times = 10),
                      CDR3b = rep(x = "CASSLAAAAASPLHF", times = 10))

# append clones to s
S0 <- rbind(S0, clone_1, clone_2, clone_3)
```


TCR repertoire `S1` has 

  * one enriched motif: *WWWW*
  * one expanded clone 1: 100 T-cells (CDR3a: CASSEGEQFF and CDR3b 
  CASSLLARAEQFF) [same as clone 1 in repertoire `S0`]

```{r}
base::substr(x = S1$CDR3b[1:20], start = 6, stop = 9) <- "WWWW"
S1 <- rbind(S1, clone_1)
```


We will perform two sets of analysis with `r Biocpkg("ClustIRR")`.

First, we will inspect how the specificity structure of repertoire `S1` 
(`s`=`S1`) is modulated compare to repertoire `S0` (`r`=`S0`). Second, we 
will do the reverse, i.e. we will inspect the specificity structure of 
repertoire `S0` (`s`=`S0`) compared to `S1` (`r`=`S1`).

```{r}
o_S1_vs_S0 <- cluster_irr(s = S1,
                          r = S0,
                          version = 2,
                          ks = 4,
                          cores = 1,
                          control = list(trim_flank_aa = 3))


o_S0_vs_S1 <- cluster_irr(s = S0,
                          r = S1,
                          version = 2,
                          ks = 4,
                          cores = 1,
                          control = list(trim_flank_aa = 3))
```

Let's plot the joint graph:

```{r}
# beta & alpha chain
plot_joint_graph(clust_irr_1 = o_S1_vs_S0, 
                 clust_irr_2 = o_S0_vs_S1,
                 as_visnet = TRUE)
```

&nbsp;

## How to interpret the joint graph?

The clones of repertoire `S0` and `S1` are shown are orange and blue 
vertices, respectively. 

Repertoire `S0` contains three expanded clones. Hence, we see three large 
orange vertices. Repertoire `S1` contains only one expanded clone shown as 
a large blue vertex. The remaining vertices are small and likely contain a 
single T-cell.

Within each repertoire, the edges are drawn as explained earlier. 

Meanwhile, in the joint graph we also have edges between the vertices from 
the two repertoires. These will be drawn if a pair of clones in `S0` and `S1` 
have globally similar CDR3 sequences. In this particular example, the vertex 
that represents the expanded clone 1 (large orange and blue vertices) are 
connected by such an edge.


# Case study 4: analysis of TCR repertoires with paired $\gamma\delta$ chains

`r Biocpkg("ClustIRR")` can be employed to study the specificity structure of:

  * $\alpha\beta$ chain TCR repertoires
  * $\gamma\delta$ chain TCR repertoires
  * heavy(H)/light(L) chain BCR repertoires

To carry out this analysis, the user must appropriately configure the 
inputs. Here we demonstrate this point. The rest of the clustering analysis 
proceeds similarly as outlined in case studies 1-3.

## Configuring the input
To analyze $\alpha\beta$ chain TCR repertoires, the input data sets `s` and `r`
are required to have one or both (e.g., if we have paired data as explained in
case study 3) of the columns: `CDR3a` and `CDR3b`.

For the analysis of $\gamma\delta$ chain TCR repertoires the columns `CDR3g` 
and `CDR3d` have to be specified, whereas for the analysis of heavy(H)/light(L) 
chain BCR repertoires `r Biocpkg("ClustIRR")` uses the two columns CDR3h and 
CDR3l.


Dummy example:

```{r}
data("CDR3ab")

# gamma/delta chain TCR data -> notice CDR3g and CDR3d columns of 's' and 'r'
s <- base::data.frame(CDR3g = CDR3ab$CDR3a[4501:5000],
                      CDR3d = CDR3ab$CDR3b[4501:5000])

r <- base::data.frame(CDR3g = CDR3ab$CDR3a[5001:10000],
                      CDR3d = CDR3ab$CDR3b[5001:10000])

```


```{r}
data("CDR3ab")

# heavy/light chain BCR data -> notice CDR3h and CDR3l columns of 's' and 'r'
s <- base::data.frame(CDR3h = CDR3ab$CDR3a[4501:5000],
                      CDR3l = CDR3ab$CDR3b[4501:5000])

r <- base::data.frame(CDR3h = CDR3ab$CDR3a[5001:10000],
                      CDR3l = CDR3ab$CDR3b[5001:10000])

```

The rest of the analysis proceeds as usual, i.e., by calling the function
`cluster_irr`.

```{r session_info}
utils::sessionInfo()
```


[^1]: Glanville, Jacob, et al. "Identifying specificity groups in the T cell 
receptor repertoire." Nature 547.7661 (2017): 94-98.