--- title: "Introduction to ClustIRR" author: "Kai Wollek and Simo Kitanovski" date: "`r Sys.Date()`" output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{Introduction to ClustIRR} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.height = 5, fig.width = 5, fig.align = "center" ) ``` ```{r} library(ClustIRR) library(knitr) library(visNetwork) ``` # Introduction Adaptive immune repertoires of B- and T-cell receptors (BCRs/TCRs) play an important role in protecting the host against genetically diverse and rapidly evolving pathogens, such as viruses, bacteria, or cancers. The BCR and TCR sequence diversity originates in part due to V(D)J recombination, in which different germline-encoded genes are joined to form immune receptors (IRs). As a result of this process, practically every newly formed naive mature T cell and B cell is equipped with a distinct IR, and this allows them to recognize distinct sets of antigens. B-cells bind antigens directly via the complementarity determining regions (CDR) of their BCRs, and T-cells recognize antigenic peptides presented by major histocompatibility (MHC) molecules via the CDRs of their TCRs. Antigen recognition may lead to B/T cell activation, and in such a case, the cells start to proliferate rapidly, forming antigen-specific clones that are capable of mounting effective immune response. Recent studies have shown that similarity in TCR sequences implies shared antigen specificity between receptors. Hence, by clustering of TCR sequences of a repertoire derived by high-throughput sequencing (HT-seq), we can identify groups of TCRs with shared antigen specificity, which is essential for the development of cancer immunotherapies, vaccines, antiviral drugs, etc. This vignette introduces `r Biocpkg("ClustIRR")`, a computational method for clustering of immune receptor repertoires. # Installation `r Biocpkg("ClustIRR")` is freely available as part of Bioconductor, filling the gap that currently exists in terms of software for quantitative analysis of adaptive immune repertoires. To install `r Biocpkg("ClustIRR")` please start R and enter: ```{r, eval=FALSE} if(!require("BiocManager", quietly = TRUE)) { install.packages("BiocManager") } BiocManager::install("ClustIRR") ``` # ClustIRR algorithm The algorithm of `r Biocpkg("ClustIRR")` performs clustering of IR sequences to find groups of IRs with similar specificity. ```{r graphic, echo = FALSE, fig.align="left", out.width='90%'} knitr::include_graphics("../inst/extdata/logo.png") ``` ## Input The main input of `r Biocpkg("ClustIRR")` are two repertoires, i.e., sets of amino acid sequences of the complementarity determining region 3 (CDR3). The CDR3s may come from one T cell receptor chain (e.g., \ only CDR3$\alpha$s or only CDR3$\beta$s) or from both chains (CDR3$\alpha\beta$). The two sets represent two repertoires, and each element in the sets represents a T-cell: * `s`: IR repertoire under investigation (case sample) * `r`: reference IR repertoire (control/reference sample) **Hint:** `s` and `r` may also contain CDR3s from $\gamma\delta$ T-cells (CDR3$\gamma$ and CDR3$\delta$) or B-cells (CDR3H and CDR3L). This is explained in case study 4. Let's have a look at an example data set which we will use as input: ```{r} data("CDR3ab") ``` ```{r} # take the first 500 CDR3b sequences from the data -> s # s <- base::data.frame(CDR3b = CDR3ab$CDR3b[1:500]) s <- base::data.frame(CDR3b = c(CDR3ab$CDR3b[1:500], "CASSSSPGDQEQFF")) # take 5000 CDR3b sequences 501:5500 from the data -> r r <- base::data.frame(CDR3b = CDR3ab$CDR3b[501:5500]) ``` ```{r} str(s) ``` ```{r} str(r) ``` ## Clustering `r Biocpkg("ClustIRR")` employs two clustering strategies: * **local** clustering: detects enrichment of motifs in `s` compared to `r` * **global** clustering: identifies pairs of CDR3s in `s` that have similar sequences The rationale behind these two clustering strategies is the following: two identical CDR3 sequences have the same specificity. Similar CDR3 sequences (e.g., CDR3 sequences that differ by only one amino acid) may have similar specificity. Global clustering is designed to find pairs of CDR3s that are globally (based on the complete CDR3 sequences) similar. We also know that two CDR3s with significantly different sequences may still recognize the same peptide[^1] if they share a motif in their core regions (e.g., identical 4-mer). Such "useful" motifs may be enriched in `s` but not in `r`, and local clustering aims to identify them. ### Local clustering CDR3 sequences are segmented into overlapping **motifs** ($k$-mers), where $k$ is specified by setting the input $ks = 4$. Example of segmenting CDR3 sequence into 4-mers: ```{r} cdr3 <- "CASSTTTGTGELFF" k <- 4 base::colnames(stringdist::qgrams(cdr3, q = k)) ``` $k$-mers found in the **core region** of the CDR3 loop are more likely to establish contact with an antigenic peptide than the $k$-mers found at the flanks of the CDR3 sequence. Hence, the user is encouraged to remove a few amino acids from the flanks of each CDR3 sequence. This can be done by changing the control input from `control$trim_flank_aa,` e.g.: * `control$trim_flank_aa = 0`: no trimming * `control$trim_flank_aa = 3`: trim three amino acids from both flanks of the CDR3 sequence An example of trimming CDR3 flanks and segmenting the core of the CDR3 sequence into 4-mers is shown below. We now focus on five overlapping motifs found in the core region of the `CASSTTTGTGELFF`. ```{r} t <- 3 cdr3_trimmed <- base::substr(x = cdr3, start = t + 1, stop = nchar(cdr3) - t) base::colnames(stringdist::qgrams(cdr3_trimmed, q = k)) ``` A motif is considered enriched if the following (user-defined) criteria are satisfied: 1. `control$local_min_o`: minimum motif frequency in `s` 2. `control$local_min_ove`: minimum ratio of observed vs. expected (OvE) relative motif frequency, with $OvE=\dfrac{f_s}{n_s}/\dfrac{f_r}{n_r}$ * $f_{s}$ and $f_{r}$: motif frequencies in repertoires `s` and `r` * $n_{s}$ and $n_{r}$: total number of motifs in repertoires `s` and `r` 3. `control$local_max_fdr`: maximum false discovery rate (FDR). Corrected p-value, computed by Fisher's exact test (actually hypergeometic test) ### Global clustering For global clustering, `r Biocpkg("ClustIRR")` quantifies the dissimilarity between pairs of CDR3 sequences using Hamming distances. Two CDR3 sequences with Hamming distance $\leq x$ are considered globally similar, where $x$ is the user-defined threshold `control$global_max_dist` (default = 1). With this, `r Biocpkg("ClustIRR")` provides a very simple and intuitive heuristic for identifying globally similar CDR3s. But this approach also has drawbacks: 1. CDR3 sequences with different lengths are not compared 2. Hamming distance, by definition, does not take into account the properties of the replaced amino acids To overcome these challenges, `r Biocpkg("ClustIRR")` provides a second input option (see red input in `r Biocpkg("ClustIRR")` workflow), which allows the user to provide a matrix of globally similar CDR3 sequences computed by complementary approaches (e.g., *tcrdist*). This input can be provided via the input parameter `control$global_pairs`. ## Output The main function in `r Biocpkg("ClustIRR")` is `cluster_irr`. This function returns an S4 object of class `clust_irr`. The object contains two sublists (slots): 1. clustering results: tables and lists 2. processed inputs: processed version of the input data (`s`) and parameters We will describe the inputs, outputs, and the algorithm of `r Biocpkg("ClustIRR")` with the help of the following case studies. # Case study 1: simple TCR repertoire analysis with `r Biocpkg("ClustIRR")` In this example, we will insert the motif *LEAR* in the core regions of the first 20 CDR3b sequences of repertoire `s`. With this, we simulate enrichment of this motif. This motif is not enriched in repertoire `r`, and `r Biocpkg("ClustIRR")` should be able to detect *LEAR* as enriched: ```{r} # insert motif LEAR base::substr(x = s$CDR3b[1:20], start = 6, stop = 9) <- "LEAR" ``` ... and then we perform clustering with `r Biocpkg("ClustIRR")`: ```{r} o <- cluster_irr(s = s, r = r, version = 2, ks = 4, cores = 1, control = list(trim_flank_aa = 3)) ``` ## Local clustering results In the following table we see that `r Biocpkg("ClustIRR")` has detected enrichment of *LEAR* in `s` compared to `r`: ```{r} # extract local motifs local_motifs <- get_clustirr_clust(o)$CDR3b$local$m # display only passed motifs knitr::kable(local_motifs[local_motifs$pass == TRUE, ], row.names = FALSE) ``` Interestingly, the motif *LLEA* is also enriched. This is probably because *LLEA* and *LEAR* share the substring *LEA*, hence, the enrichment of *LLEA* can be seen as a byproduct of the inserted motif *LEAR*. This can also be seen from the lower frequency of *LLEA* ($f_s$=5) in `s` compared to *LEAR* ($f_s$=20). For *LEAR* we see FDR$\approx 10^{-15}$, which is significantly smaller than the FDR$\approx 10^{-2}$ observed for *LLEA*. Finally, for *LEAR* we see OvE$\approx 100$, whereas for *LLEA*'s OvE$=\infty$ (*LLEA* has `f_r=0`; division by 0 when calculating OvE). ## Global clustering results In our data we had only one pair of globally similar CDR3 sequences, i.e. the CDR3 sequences *CAS**S**PLEARGYTF* and *CAS**R**PLEARGYTF* which differ by one amino acid at position 4. `r Biocpkg("ClustIRR")` has identified this: ```{r} # display globally similar pairs knitr::kable(get_clustirr_clust(o)$CDR3b$global, row.names = FALSE) ``` ## Putting it all together $\rightarrow$ graph To interpret the `r Biocpkg("ClustIRR")` output we can inspect the tables of locally/globally similar CDR3s. However, sometimes these tables can be massive and difficult to interpret. Hence, we provide the functions `get_graph`, which allows us to generate an undirected graph (`r CRANpkg("igraph")` object) from the main outputs of `r Biocpkg("ClustIRR")`, and `plot_graph` for visualization of the graph. Each vertex in the graph is a T-cell clone, and we draw an edge between two vertices if a) they are globally similar; or b) if they share an enriched motif (locally similar). Let's visualize the graph output for this case study: ```{r, fig.width=5, fig.height=4, fig.align='center'} par(mai = c(0,0,0,0)) plot_graph(clust_irr = o) ```   ### Vertices The graph shows a cluster of 20 T-cell clones (vertices). These are the 20 clones with CDR3 sequences in which we inserted the motif *LEAR*. The remaining clones (about 480) are shown as singleton vertices. We also wee two vertices connected by an edge. These are the two clones that are globally similar. We can also plot an **interactive** graph with `r CRANpkg("visNetwork")`. ```{r, fig.width=5, fig.height=4, fig.align='center'} plot_graph(clust_irr = o, as_visnet = TRUE) ``` ### Edges Between a pair of vertices, we draw an edge if: a) they are globally similar; or b) they share an enriched motif. Furthermore, each chain type may have an edge if the repertoire contains paired information, e.g., on $\beta$ and $\alpha$ chain CDR3 sequences. The **color**, **linetype** and **thickness** of the edges are determined as follows. * edge colors * purple: local CDR3 similarity * green: global CDR3 similarity * black: local and global CDR3 similarity * edge linetypes * dashed: similarity between CDR3$\beta$, CDR3$\delta$ and CDR3H * dotted: similarity between CDR3$\alpha$, CDR3$\gamma$ and CDR3L * solid: similarity between CDR3s from both chains (e.g. CDR3$\alpha$ and CDR3$\beta$) * edge thickness: number of edges between two clones # Case study 2: analysis of TCR repertoire with large expanded clone In this case study, we assume that the data contains a clone with 20 T-cells ($\approx$4% of the size of the initial repertoire). The T-cells in the expanded clone have the same CDR3b sequence *CATSRPDGLAQYF*. `r Biocpkg("ClustIRR")` should detect most motifs at the core of *CATSRPDGLAQYF* as enriched while also reporting that all cells within have globally similar (in fact identical) CDR3b sequences. Let's insert a clone in dataset `s`: ```{r} # create a clone of 10 T-cells clone <- base::data.frame(CDR3b = rep("CATSRPDGLAQYF", times = 20)) # append the clone to the original repertoire 's' s <- base::rbind(s, clone) ``` ... and once again perform clustering with `r Biocpkg("ClustIRR")`: ```{r} o <- cluster_irr(s = s, r = r, version = 2, ks = 4, cores = 1, control = list(trim_flank_aa = 3)) ``` ## Local clustering results `r Biocpkg("ClustIRR")` once again reports enrichment of *LEAR*, but also of many additional motifs that are part of the core of *CATSRPDGLAQYF*, such as *DGLA*, *PDGL*, *RPDG*, *SRPG*, etc. ```{r, fig.align='center'} # extract local motifs local_motifs <- get_clustirr_clust(o)$CDR3b$local$m # display only passed motifs knitr::kable(local_motifs[local_motifs$pass == TRUE, ], row.names = FALSE) ``` ## Global clustering results Once again, `r Biocpkg("ClustIRR")` finds the same pair of globally similar CDR3 sequences. However, remember that this time our repertoire `s` contains 20 identical CDR3s belonging to the expanded clone's T-cells. Each pair of CDR3 sequences in the clone are globally similar. Meanwhile, the CDR3 sequences *CAS**S**PLEARGYTF* and *CAS**R**PLEARGYTF* differ by one amino acid at position 4. Let's check how the global and local similarities are represented in the graph (see next section). ## Graph output Let's plot the clustering results: ```{r} # plot the clust_irr object plot_graph(clust_irr = o, as_visnet = TRUE) ```   We see one connected component, which is identical to the one we saw in case study 1. Furthermore, we see a large vertex. This represents the clonal expansion, where the size of the vertex scales as the logarithm of the number of T-cells in the clone. Clonally expanded cells are globally similar to each other. If the specific clonal expansion is only found in `s` but not in `r`, then it is likely that we will also see an enrichment of certain motifs from the core of the corresponding CDR3 sequences. In summary, CDR3s of expanded clones are similar in terms of the global sequences but may also be locally similar. # Case study 3: analysis of TCR repertoire with paired $\alpha\beta$ TCR chains Single-cell technology allows us to sequence entire TCR repertoires, and to extract the sequences of both TCR chains: $\beta$ and $\alpha$. `r Biocpkg("ClustIRR")` can analyze such data. Clustering is performed separately using the CDR3 sequences of each chain. Let's create the input data. We create two repetoires: * repertoire `S0`: 1,000 CDR3$\beta$ and CDR3$\alpha$ sequence pairs. * repertoire `S1`: 1,000 CDR3$\beta$ and CDR3$\alpha$ sequence pairs. Imagine that `S0` and `S1` are two TCR repertoires of a cancer patient, sequenced before and after cancer therapy, respectively. ```{r} data("CDR3ab") S0 <- base::data.frame(CDR3a = CDR3ab$CDR3a[3001:4000], CDR3b = CDR3ab$CDR3b[3001:4000]) S1 <- base::data.frame(CDR3a = CDR3ab$CDR3a[5001:6000], CDR3b = CDR3ab$CDR3b[5001:6000]) ``` TCR repertoire `S0` has * two enriched motifs: *LEAR* and *REAL* * three expanded clones * clone 1: 100 T-cells (CDR3a: CASSEGEQFF and CDR3b CASSLLARAEQFF) * clone 2: 50 T-cells (CDR3a: CASSLESPLHF and CDR3b: CASSLEEEEEEPLHF) * clone 3: 10 T-cells (CDR3a: CASSLESPLHF and CDR3b: CASSLAAAAASPLHF) ```{r} # insert motif LEAR in CDR3b base::substr(x = S0$CDR3b[1:10], start = 6, stop = 9) <- "LEAR" # insert motif REAL in CDR3a base::substr(x = S0$CDR3a[50:59], start = 6, stop = 9) <- "REAL" ``` ```{r} # create 3 clones clone_1 <- data.frame(CDR3a = rep(x = "CASSEGEQFF", times = 100), CDR3b = rep(x = "CASSLLARAEQFF", times = 100)) clone_2 <- data.frame(CDR3a = rep(x = "CASSLESPLHF", times = 50), CDR3b = rep(x = "CASSLEEEEEEPLHF", times = 50)) clone_3 <- data.frame(CDR3a = rep(x = "CASSLESPLHF", times = 10), CDR3b = rep(x = "CASSLAAAAASPLHF", times = 10)) # append clones to s S0 <- rbind(S0, clone_1, clone_2, clone_3) ``` TCR repertoire `S1` has * one enriched motif: *WWWW* * one expanded clone 1: 100 T-cells (CDR3a: CASSEGEQFF and CDR3b CASSLLARAEQFF) [same as clone 1 in repertoire `S0`] ```{r} base::substr(x = S1$CDR3b[1:20], start = 6, stop = 9) <- "WWWW" S1 <- rbind(S1, clone_1) ``` We will perform two sets of analysis with `r Biocpkg("ClustIRR")`. First, we will inspect how the specificity structure of repertoire `S1` (`s`=`S1`) is modulated compare to repertoire `S0` (`r`=`S0`). Second, we will do the reverse, i.e. we will inspect the specificity structure of repertoire `S0` (`s`=`S0`) compared to `S1` (`r`=`S1`). ```{r} o_S1_vs_S0 <- cluster_irr(s = S1, r = S0, version = 2, ks = 4, cores = 1, control = list(trim_flank_aa = 3)) o_S0_vs_S1 <- cluster_irr(s = S0, r = S1, version = 2, ks = 4, cores = 1, control = list(trim_flank_aa = 3)) ``` Let's plot the joint graph: ```{r} # beta & alpha chain plot_joint_graph(clust_irr_1 = o_S1_vs_S0, clust_irr_2 = o_S0_vs_S1, as_visnet = TRUE) ```   ## How to interpret the joint graph? The clones of repertoire `S0` and `S1` are shown are orange and blue vertices, respectively. Repertoire `S0` contains three expanded clones. Hence, we see three large orange vertices. Repertoire `S1` contains only one expanded clone shown as a large blue vertex. The remaining vertices are small and likely contain a single T-cell. Within each repertoire, the edges are drawn as explained earlier. Meanwhile, in the joint graph we also have edges between the vertices from the two repertoires. These will be drawn if a pair of clones in `S0` and `S1` have globally similar CDR3 sequences. In this particular example, the vertex that represents the expanded clone 1 (large orange and blue vertices) are connected by such an edge. # Case study 4: analysis of TCR repertoires with paired $\gamma\delta$ chains `r Biocpkg("ClustIRR")` can be employed to study the specificity structure of: * $\alpha\beta$ chain TCR repertoires * $\gamma\delta$ chain TCR repertoires * heavy(H)/light(L) chain BCR repertoires To carry out this analysis, the user must appropriately configure the inputs. Here we demonstrate this point. The rest of the clustering analysis proceeds similarly as outlined in case studies 1-3. ## Configuring the input To analyze $\alpha\beta$ chain TCR repertoires, the input data sets `s` and `r` are required to have one or both (e.g., if we have paired data as explained in case study 3) of the columns: `CDR3a` and `CDR3b`. For the analysis of $\gamma\delta$ chain TCR repertoires the columns `CDR3g` and `CDR3d` have to be specified, whereas for the analysis of heavy(H)/light(L) chain BCR repertoires `r Biocpkg("ClustIRR")` uses the two columns CDR3h and CDR3l. Dummy example: ```{r} data("CDR3ab") # gamma/delta chain TCR data -> notice CDR3g and CDR3d columns of 's' and 'r' s <- base::data.frame(CDR3g = CDR3ab$CDR3a[4501:5000], CDR3d = CDR3ab$CDR3b[4501:5000]) r <- base::data.frame(CDR3g = CDR3ab$CDR3a[5001:10000], CDR3d = CDR3ab$CDR3b[5001:10000]) ``` ```{r} data("CDR3ab") # heavy/light chain BCR data -> notice CDR3h and CDR3l columns of 's' and 'r' s <- base::data.frame(CDR3h = CDR3ab$CDR3a[4501:5000], CDR3l = CDR3ab$CDR3b[4501:5000]) r <- base::data.frame(CDR3h = CDR3ab$CDR3a[5001:10000], CDR3l = CDR3ab$CDR3b[5001:10000]) ``` The rest of the analysis proceeds as usual, i.e., by calling the function `cluster_irr`. ```{r session_info} utils::sessionInfo() ``` [^1]: Glanville, Jacob, et al. "Identifying specificity groups in the T cell receptor repertoire." Nature 547.7661 (2017): 94-98.