---
title: "abseqR: reporting and data analysis functionalities for Rep-Seq datasets of antibody libraries"
author:
- "Jia Hong Fong"
- "Monther Alhamdoosh"

package: "abseqR"
output:
    BiocStyle::pdf_document: default
    BiocStyle::html_document:
vignette: >
    %\VignetteIndexEntry{Introduction to abseqR}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
library(abseqR)
library(png)
library(plotly)
library(grid)
library(gridExtra)
library(BiocStyle)
# knitr::opts_chunk$set(tidy = TRUE)
knitr::opts_chunk$set(message = FALSE)
```

```{r abseqR_vignette_helper_functions, include=FALSE}
sabseqR <- function() { return(Biocpkg("abseqR")); }
#sabseqR <- function() { return("`abseqR`"); }
sabseqPy <- function() { return("[abseqPy](https://github.com/malhamdoosh/abseqPy)"); }
sabseq <- function() { return("`AbSeq`"); }
read_png <- function(path) {
    if (file.exists(path)) {
        #rasterGrob(as.raster(readPNG(path)), interpolate = TRUE)
        # in windows, the path is \ but LaTEX does not appreciate these slashes
        knitr::include_graphics(gsub("\\", "/",
                                     normalizePath(path),
                                     fixed = TRUE))
    } else {
        stop("File at ", path, " not found, fatal.")
    }
}

```

# Introduction

A plethora of high throughput sequencing (HTS) analysis pipelines are
available as open source tools to analyze and validate the quality of Rep-seq [^Repseq] datasets.
[OmicTools](https://omictools.com/search?q=repertoire)
provides a summary of repertoire sequencing tools that implements different
techniques and algorithms in analyzing and visualizing datasets from B-cell
receptors (BCR) and T-cell receptors (TCR). However, high throughput analysis pipelines of antibody library sequencing datasets are scarce.

`r sabseq()` is a comprehensive bioinformatic pipeline for the analysis of sequencing datasets generated from antibody libraries and `r sabseqR()` is one of its packages. The `r sabseq()` suite is implemented as a set of functions
in `R` and `Python` that can be used together to provide insights into the quality of antibody libraries. `r sabseqPy()` processes paired-end
or single-end _FASTA/FASTQ_ files generated from NGS sequencers and converts them into
CSV and [HDF](https://portal.hdfgroup.org/display/support) files.
`r sabseqR()` visualizes the output of `r sabseqPy()` and generates a self-contained
HTML report. Furthermore, `r sabseqR()` provides additional functionalities to
explicitly compare multiple samples and perform further downstream analyses.


`r sabseqR()` provides the following functionalities:

* Visualizations: the output from `r sabseqPy()` is summarized succintly into
  static and interactive plots. The plots are also stored in `Rdata` object files that provide flexibility for
  users to easily customize the aesthetics of any plot generated by `r sabseqR()`.
 
* Interactive reports: the plots generated by `r sabseqR()` can be collated and presented
  in a self-contained HTML document for convenience and ease of sharing.
  
* Sample comparison: `r sabseqR()` overloads the `+` operator
  via the S4 classes `AbSeqCRep` and `AbSeqRep` to compare
  multiple samples with each other. The `comparative reports` include additional
  plots, for example, sample similarity clustering, overlapping clonotypes, etc. The usual plots are also generated for all the compared samples by adding an extra layer of `aes`thetic.
  
## AbSeq core analyses

`r sabseq()` includes, but is not limited to, merging paired-end reads,
annotating V-(D)-J germlines, calculating unique clonotypes,
analyzing primer specificity, facilitating the selection of best restriction enzymes,
predicting frameshifts, identifying functional clones, and
calculating diversity indices and estimations. These analyses are seamlessly
extrapolated to analyze multiple library repertoires simultaneously when
multiple samples are present. Figure \@ref(fig:abseq-wf) depicts the complete
`r sabseq()` workflow. Sequencing files are taken as input to be annotated and
analyzed by `r sabseqPy()` before they are further analyzed and visualized
by `r sabseqR()`.


```{r abseq-wf, fig.cap="AbSeq workflow. Comprehensive analyses and visualizations are generated from input sequencing files using abseqPy and abseqR.", echo=FALSE, out.width="100%"}
read_png("abseq_workflow_MA.jpg")
```


# Installation

`r sabseqR()` can be installed from [bioconductor.org](http://bioconductor.org/) or
its GitHub repository at https://github.com/malhamdoosh/abseqR.

## Bioconductor

To install `r sabseqR()` via the `BiocManager`, type in R console:
```r
if (!require("BiocManager"))
    install.packages("BiocManager")
BiocManager::install("abseqR")
```

## GitHub

To install the development version of `r sabseqR()` from GitHub,
type in R console:
```r
if (!require("BiocManager"))
    install.packages("BiocManager")
BiocManager::install("malhamdoosh/abseqR")
```

## System prerequisites

`r sabseqR()` requires [pandoc](http://pandoc.org/) version _1.19.2.1_ or higher
to render the HTML reports. If `pandoc` cannot be detected while executing `r sabseqR()`,
the HTML report will __not__ be generated. `r sabseqR()` is a cross-platform
library and will work on any major operating system [^windows_biocparallel].

## R package dependencies

`r sabseqR()` depends on several packages from the [CRAN](https://cran.r-project.org/)
and [Bioconductor](https://bioconductor.org/) repositories:

* `r CRANpkg("RColorBrewer")` provides colour
schemes for maps and graphics. To install it, type in R console
`install.packages("RColorBrewer")`

* `r CRANpkg("VennDiagram")` provides a set of
functions to generate Venn diagrams. To install it, type in R console `install.packages("VennDiagram")`

* `r CRANpkg("circlize")` is a visualization tool
used to summarize the distributions of associations between V-J gene segments. To install it, type in R console `install.packages("circlize")`

* `r CRANpkg("flexdashboard")` is a package
that provides a template for RMarkdown that resembles a grid oriented dashboard and is used to generate the HTML reports. To
install it, type in R console `install.packages("flexdashboard")`

* `r CRANpkg("ggplot2")` is an implementation of
the "Grammar of Graphics" in R. It is used extensively to
generate plots. To install it, type in R console `install.packages("ggplot2")`

* `r CRANpkg("ggcorrplot")` is used to visualize
a correlation matrix using `r CRANpkg("ggplot2")`.
To install it, type in R console `install.packages("ggcorrplot")`

* `r CRANpkg("ggdendro")` provides a set of tools
for drawing dendrograms and tree plots using `r CRANpkg("ggplot2")`. To install it, type
`install.packages("ggdendro")`

* `r CRANpkg("grid")` is used to arrange plots.
It has been integrated into the base R package.

* `r CRANpkg("gridExtra")` provides functions
to work with "grid" graphics and used to arrange grid-based plots
in the HTML reports. To install it, type in R console `install.packages("gridExtra")`

* `r CRANpkg("knitr")` provides the capability to
dynamically generate reports in R. To install it, type in R console `install.packages("knitr")`

* `r CRANpkg("plotly")` is used to translate
`r CRANpkg("ggplot2")` graphs to interactive web-based plots. To install it, type in R console
`install.packages("plotly")`

* `r CRANpkg("plyr")` offers a set of tools used in
this package to apply operations on subsets of data in manageable pieces. To
install it, type in R console `install.packages("plyr")`

* `r CRANpkg("png")` is used to read and display PNG images.
To install it, type in R console `install.packages("png")`

* `r CRANpkg("reshape2")` allows this package to
restructure and aggregate dataframes. To install it, type in R console `install.packages("reshape2")`

* `r CRANpkg("rmarkdown")` converts R Markdown
documents into a variety of formats. To install it, type in R console `install.packages("rmarkdown")`

* `r CRANpkg("vegan")` provides a suite of functions
to calculate diversity and distance statistics between repertoires. To install it,
type in R console `install.packages("vegan")`

* `r Biocpkg("BiocParallel")` is
a package from [Bioconductor](https://bioconductor.org) used to enable parallel computing. To install it, type in R console
```r
if (!require("BiocManager"))
    install.packages("BiocManager")
BiocManager::install("BiocParallel")
```

# Quick start


To leverage all the functionalities provided by `r sabseqR()`, the main
functions to note are `abseqR::abseqReport`, `abseqR::report`, and `+`.
This section uses a small simulated dataset to walk through the use cases of each function.

## Datasets

<!-- MA: add a figure that depcits how the datasets are generated  (probably folder structure) -->

The example dataset is packaged with `r sabseqR()`.
For the sake of brevity, the dataset generation is described under the
[Appendices](#example-data) section.

Briefly, the dataset includes three samples, namely PCR1, PCR2, and PCR3, that was generated using _in silico_ simulations.
`r sabseqPy()` was then used to analyze the datasets and
the output directory argument `--outdir` specified in `r sabseqPy()` was initiated with the value `"./ex/"`.


```{r abseq-sim-data, fig.cap="Generation of the three simulated datasets. The flowchart on the left shows the generation of the datasets in `\"./ex/\"` while the folder structure on the right shows the output generated by `abseqPy` on the three datasets.", echo=FALSE, out.width="100%"}
read_png("simulation_flowchart_full.jpg")
```

### Fetch data files {#obtain-dataset}

The output of `r sabseqPy()` on the simulated datasets is first fetched into a local directory as follows:

```{r abseqR_copy_data_files, results=FALSE}
# substitute with any directory that you have read/write access to
sandboxDirectory <- tempdir()

# path to provided data (comes installed with abseqR)
exdata <- system.file("extdata", "ex", package = "abseqR")

# copy the provided data to sandboxDirectory
file.copy(exdata, sandboxDirectory, recursive = TRUE)
```

Then, the following commands can be executed in R console to verify that the three `PCR` datasets are fetched successfully:

```{r abseqR_data_contents}
# dataPath now holds the path to a local copy of the data
dataPath <- file.path(sandboxDirectory, "ex")
# the sample names can be found inside the auxiliary directory
list.files(path = file.path(dataPath, "auxiliary"))
```


## Basic analysis {#basic-analysis}

After [obtaining the datasets](#obtain-dataset), the `abseqReport` function from `r sabseqR()` is invoked to visualize the different analysis results as follows: 

```{r abseqR_quick_run, results=FALSE, warning=FALSE}
# This section will visualize all the datasets individually
# and compare PCR1 with PCR2 with PCR3

# Interim solution
if (Sys.info()["sysname"] == "Darwin") {
    BPPARAM <- BiocParallel::SerialParam()
} else {
    BPPARAM <- BiocParallel::bpparam()
}

# you should use report = 3 to generate a HTML report
samples <- abseqReport(dataPath,
                       compare = c("PCR1, PCR2, PCR3"),
                       report = 1,
                       BPPARAM = BPPARAM)

# ignore the message:
# "Sample output directory <path> is different from provided path
#   <path> assuming directory was moved"
# This warning message tells us that the directory has
# been moved (we copied the provided examples to "dataPath")
```

This creates plots for all samples included in `dataPath`.
In addition, The `compare = c("PCR1, PCR2, PCR3")` argument specifies
that samples `PCR1`, `PCR2`, and `PCR3` are explicitly
compared against each other. Other possible values for
`compare`, `report` , and `BPPARAM` will be discussed in detail
in later sections ([here](#abseqReport-compare), [here](#fine-tune),
and [here](#parallelization)).


## HTML reports' directory structure

Figure \@ref(fig:abseq-final-folder-structure) shows the folder structure of `./ex/`
after `r sabseqR()` completes.

```{r abseq-final-folder-structure, fig.cap="Folder structure after `abseqR` completes. The landing page `index.html` provides a convenient way to navigate around the standalone HTML reports in `html\\_files/`.", echo=FALSE}
grid.raster(readPNG("post_report_fs.png"))
```


Invoking `abseqReport` generates plots in the same folder as the
corresponding data files within the `auxiliary` directory. They are then
collated together in an HTML document found in the `report` directory.

The `report` directory is structured as follows:

* `index.html` file is the entry point to browse AbSeq's HTML reports. It summarizes the `r sabseq()` analysis and provides a convenient way
for navigating individual and comparative analysis results. For example, within this file, there are links to
the reports generated for `PCR1`, `PCR2`, `PCR3` and `PCR1 vs PCR2 vs PCR3`.

* `html_files` directory contains HTML files that are used build the individual and comparative reports. They can be accessed directly or via the main page `index.html`.

In conclusion, the individual sample reports in `html_files` can
be shared as-is, but `index.html` __must__ be accompanied by the `html_files` directory and thus it is recommended to share the entire `report` folder.

## Comparative analysis

### Comparative analysis using a single dataset {#abseqReport-compare}

This section describes the possible values for `abseqReport`'s
`compare` parameter. In the previous section, `abseqReport` was called with
`compare = c("PCR1, PCR2, PCR3")`. This 
compares the three samples all together. However,
it is also possible to compare only a subset of samples in the `dataPath` folder, multiple subsets of samples, or none at all.

The `compare` parameter accepts a vector of one or more strings. Each string
denotes a comparison between samples separated by commas, for example,
`compare = c("PCR1, PCR2, PCR3")`[^compare-ws].

If sample comparison is not required, then the following can be simply invoked `samples <- abseqReport(dataPath)`.

Example of other combinations:

```{r abseqR_abseqReport_compare_argument, results=FALSE, warning=FALSE}
# Example of 1 comparison
# Analyze all samples, but only compare PCR1 with PCR2
pcr1.pcr2 <- abseqReport(dataPath,
                         compare = c("PCR1, PCR2"),
                         report = 0)

# Example of 2 comparisons
# Analyze all samples. In addition, compare:
#  * PCR1 with PCR2
#  * PCR2 with PCR3
multiComparison <- abseqReport(dataPath,
                               compare = c("PCR1, PCR2", "PCR2, PCR3"),
                               report = 0)
```

<!-- using hidden chunks can cause problem for users who attempt to reproduce your results. Please remove all hidden chunks or make them visible -->

__Note__, `abseqReport` always returns S4 objects of the class __AbSeqRep__ for each sample in the `dataPath` directory regardless of the value of the `compare` argument as illustrated next:

```{r abseqR_show_return_of_abseqReport, results='hold'}
# compare = c("PCR1,PCR2")
names(pcr1.pcr2)

# compare = c("PCR1, PCR2", "PCR2 ,PCR3")
names(multiComparison)
```

<!-- MA: I guess it would be nice if you overwrite the `show` method for AbSeqRep and AbSeqCRep so that the output looks nice when an AbSeqRep object is thrown in the console, i.e.g, if you type pcr1.pcr2$PCR1 -->

The next subsection explains the motivation behind this behaviour.

### Comparative analysis using multiple datasets {#customize-cmp}

When constructing antibody libraries, users might be interested in comparing Ab repertoires from different stages of the construction process. Usually, each stage has its own sequencing runs and thus would be analyzed indepedent of others. The `report` function in `r sabseqR()` was written to enable this behaviour as illustrated next.  

[Previously](#basic-analysis), the S4 objects of three samples of our toy example loaded into a variable named `samples`. As a hypothetical example,
if the reports show an interesting observation between `PCR1` and `PCR3`,
it might be of interest to isolate the two samples from the rest. 

This code shows how the `+` operator can be used to create customized comparisons as follows:

```{r abseqR_refine_comparison_simple, results=FALSE, warning=FALSE}
# recall that samples is a named list where each element's name
# is the sample's own name (see names(samples))

# use report = 3 if the results should be collated in a HTML document
pcr1.pcr3 <- samples[["PCR1"]] + samples[["PCR3"]]
refinedComparison <- report(pcr1.pcr3,
                            outputDir = file.path(sandboxDirectory,
                                                  "refined_comparison"),
                            report = 1)
```

Here, the `+` operator creates a new comparison between `PCR1` and `PCR3` of class *AbSeqCRep*. S4 objects of this class can be passed to the `report` function to generate a standalone HTML report of this particular comparison.
Similar to `abseqReport`, this function returns S4 objects of the individual samples - `PCR1` and `PCR3`.

```{r abseqR_refine_comparison_sample_return_value, eval=TRUE}
names(refinedComparison)
```

Similarly, samples can be loaded from two different `r sabseqPy()`'s directories as illustrated in the following example: 


```{bash abseqR_different_dirs, eval=FALSE}
# first abseqPy run on SRR dataset from control group
abseq --file1 fasta/SRR_ACGT_CTRL.fasta --outdir SRR_CTRL

# second abseqPy run on SRR dataset from experiment group
abseq --file1 fasta/SRR_ACGT_EXP.fasta --outdir SRR_EXP
```

analyzing these samples in `r sabseqR()`:

```{r abseqR_adv_load_samples, eval=FALSE}
SRRControl <- abseqReport("SRR_CTRL")
SRRExp <- abseqReport("SRR_EXP")
```

then comparing _all_ samples in `SRRControl` with _all_ samples in `SRRExp`
can be done using [`+` and `report`](#customize-cmp).

```{r abseqR_adv_compare_old_new, eval=FALSE}
# short for SRRControl[[1]] + SRRControl[[2]] + ... + SRRExp[[1]] + ...
all.samples <- Reduce("+", c(SRRControl, SRRExp))
report(all.samples, outputDir = "SRRControl_vs_SRRExp")
```

> **Important**: The sample names in `SRR_CTRL` and `SRR_EXP` _must_ be unique.


In conclusion, the `+` operator along with the `report` function enables users to carry out a wide range of customized downstream analyses on the output of `r sabseqPy()`.

# Advanced Examples

## Lazy loading

In the previous section, the `SRRControl` dataset was loaded using
`SRRControl <- abseqReport("SRR_CTRL")`, which will **generate** all plots and reports by default. However, this dataset might have already been analyzed and users are interested in only loading the S4 objects of the samples. This can be efficiently carried out by using the `report=0` argument as follows: 

```{r abseqR_adv_lazy_load, eval=FALSE}
# in essence, this is identical to SRRControl <- abseqReport("SRR_CTRL"),
# but will not regenerate plots and reports!
SRRControl.lazy <- abseqReport("SRR_CTRL", report = 0)

# exactly the same return value
stopifnot(names(SRRControl.lazy) == names(SRRControl))
```

## Alternative reporting options  {#fine-tune}

In the previous section, the `report` parameter of `abseqReport` was used to
load the samples in `SRRControl` without actually plotting any data. 

The `report` argument can accept one of four possible values as follows:

1. `abseqReport("SRR_CTRL", report = 0)` does not generate plots and HTLM reports and only returns a named list of S4 objects. 

2. `abseqReport("SRR_CTRL", report = 1)` generates static plots in PNG format but **does not** generate HTML reports. 

3. `abseqReport("SRR_CTRL", report = 2)` generates static plots in PNG format in addition to HTML reports.

4. `abseqReport("SRR_CTRL", report = 3)` generates static plots in PNG format and interactive plots in the HTML reports using [plotly](https://cran.r-project.org/package=plotly). This is the default behaviour when the `report` argument is not specified.

## Parallelization {#parallelization}

One of `abseqReport`'s parameters is `BPPARAM`, which is used to pass arguments into the `BiocParallel::bplapply` function for customizing parallelization. More
information regarding the accepted values to `BPPARAM` can be found at
BiocParallel's
[page](https://bioconductor.org/packages/release/bioc/html/BiocParallel.html).

Below is a simplified example of using 4 cores and serializing the loop.

```{r abseqR_adv_bpparam, eval=FALSE}

# use 4 CPU cores
nproc <- 4
samples <- abseqReport(dataPath,
                       BPPARAM = BiocParallel::MulticoreParam(nproc))


# run sequentially - no multiprocessing
samples <- abseqReport(dataPath,
                       BPPARAM = BiocParallel::SerialParam())
```

# Interpretation of report's figures {#abseqR-interpret}

```{r abseqR_plot_path_setup, include=FALSE}
# [s]ingle sample [dir]ectory
sdir <- file.path(dataPath, "auxiliary", "PCR1")
# [m]ulti sample [dir]ectory
mdir <- file.path(dataPath, "auxiliary", "PCR1_vs_PCR2_vs_PCR3")
```

This section presents the plots generated by `r sabseqR()` on the
dataset discussed above and provides some insights on how to interpret them.

The visualizations and analyses can be broken down into 5 categories:

1. Sequence quality analysis
2. Abundance analysis
3. Productivity analysis
4. Diversity analysis
5. Comparative analysis

## Sequence quality analysis

The plots described in this section can be found in the `Summary` and `Quality` tabs of the HTML report.

### Sequence length

The sequence length distribution is a simple way of validating the quality
of a sequencing run. The histogram is expected to show a large proportion of
sequences falling within the expected lengths.

```{r seq-len, fig.cap="Sequence length distribution of PCR1. Histogram of (merged) sequence lengths.", echo=FALSE, out.width="100%"}
read_png(file.path(sdir, "annot", "PCR1_all_clones_len_dist.png"))
```

Figure \@ref(fig:seq-len) plots sequence length (x-axis) against proportion (y-axis). A similar plot with outliers removed can be found in the `Summary` tab.

### Alignment quality

`r sabseqPy()` filters low quality sequences based on the quality of the sequence alignment against germline sequence databases and thus the following parameters can be used:

1. Alignment bitscore
2. Subject start index
3. Query start index
4. Alignment length

However, setting the
optimal cut-off thresholds for these parameters is challenging. Stringent values could filter too many sequences while lenient values could retain
low quality sequences.

The alignment quality heatmaps generated in the `Quality` tab of the HTML report shows the relationship
between alignment lengths and these alignment parameters to help determine the percentage of sequences falling within a given range and thus inform the selection of cut-off thresholds.

For example, Figure \@ref(fig:align-qual) shows one of the 5 heatmaps: bitscore
against V germline alignment length. `r sabseqPy()` has some recommendations on the values of these parameters to retain good quality sequences. 

```{r align-qual, fig.cap="Bitscore vs V germline alignment length in PCR1. Heatmap of bitscore vs alignment length shows percentage of sequences in a given value range.", echo=FALSE, out.width="100%"}
read_png(file.path(sdir, "abundance", "PCR1_igv_align_quality_bitscore_hm_static.png"))
```

### Insertions, deletions, and mismatches

Indels (Insertions or deletions) and mismatches
can be used as an indicator of sequence quality. 

Figure \@ref(fig:indel) shows the proportions of indels in `PCR1`. This graph plots the 
rate of indels (y-axis) in the V germlines of `PCR1` against the number of indels (x-axis).
A similar plot for rate of mismatches in V germlines can be found in the same directory. A high rate of indels in the germline sequences might indicate a quality problem with the library because this would affect the functionality of the sequenced clones. However, this could be due to the sequencing quality depending on which sequencing technology is used. For example, long read sequencing technologies tend to produce more indel errors than short read sequencing technologies. 

```{r indel, fig.cap="Rate of insertions and deletions in PCR1. The percentage of indels found within the V germlines of PCR1.", echo=FALSE, out.width="100%"}
read_png(file.path(sdir, "abundance", "PCR1_igv_gaps_dist.png"))
```

## Abundance and association analysis

The plots generated in this section can be found in the `Abundance` tab of the HTML report.

### V-(D)-J germline abundance

The proportions of V-(D)-J germlines is essential in some experiment designs.
For example, it can be used to validate that the germline abundance of an
in-house antibody library is in-line with the donor antibody repertoire.
Experiments that artificially amplify certain germline families can also
be validated similarly using this analysis.

```{r vgermline, fig.cap="V family germline distribution in PCR1. The percentage of IGHV families after germline annotation using IgBLAST.", echo=FALSE, out.width="100%"}
read_png(file.path(sdir, "abundance", "PCR1_igv_dist_family_level.png"))
```

Figure \@ref(fig:vgermline) shows the distribution of IGHV families in `PCR1`. Similar plots can be generated for individual V germlines genes and for the D and J germlines.


### V-J germline associations

This plot visualizes the recombination process of V and J germlines.
Figure \@ref(fig:vjassoc) summarizes the associations between V
and J family germlines in a plot generated using `r CRANpkg("circlize")`.

```{r vjassoc, fig.cap="V-J family germline association in PCR1. Segment size represents germline proportion while the thickness of the arcs shows the proportion of associations between V and J germline families. This plot was heavily inspired by [VDJTools](https://github.com/mikessh/vdjtools/))", echo=FALSE, out.width="100%"}
read_png(file.path(sdir, "abundance", "PCR1_vjassoc.png"))
```

This plot can be used to check whether the Ab library is biased towards a particular combination of germline genes due to cloning errors. 

## Productivity analysis

The plots described in this section can be found in the `Productivity` tab in `PCR1`'s HTML report. The main factors affecting the productiveness of a clone by
`r sabseq()`'s interpretation are:

1. Concordance of the coding frames of V and J germline genes
3. Presence of stop codons
2. Presence of indels within a framework or complementarity-determining region

Any sequence that satisfies at least one of the above condition will be classified
as unproductive and thus it is unlikely that it will express a functional antibody.

Figure \@ref(fig:prod-summ) summarizes the productivity analysis results of `PCR1`. Factors that cause sequences to be non-functional are colour coded as follows:

1. Green -- sequences _without_ stop codons but have a frameshift
2. Cyan -- sequences that are _in-frame_ but contain at least one stop codon
3. Purple -- sequences that contain at least one stop codon _and_ have a frameshift

```{r prod-summ, fig.cap="Productivity rate of sequences in PCR1. This plot shows the percentage of productive and unproductive sequences, the reason for unproductive sequences are colour coded.", echo=FALSE, out.width="100%"}
read_png(file.path(sdir, "productivity", "PCR1_productivity.png"))
```

A good antibody library should have as low unproductive clones as possible. Cloning strategies that are used to clone sequences from the donor libraries or used to construct the Ab library play a key role in this aspect of the library quality. 

### V-J frameshifts

Figure \@ref(fig:frameshift) shows the percentage of clones that are
out-of-frame due to either misaligned V-J coding frames or to the presence of non-multiple of three-indels in one of the framework or complementarity-determining regions.

```{r frameshift, fig.cap="Rate of in-frame and out-of-frame sequences in PCR1. Misaligned V-J frame or non-multiple of 3 indels in either one of the framework or complementarity-determining regions causes a frameshift and therefore renders the sequence unproductive.", echo=FALSE, out.width="100%"}
read_png(file.path(sdir, "productivity", "PCR1_vjframe_dist.png"))
```

### Stop codons

Figure \@ref(fig:stopcod) shows the hot spots for stop codons segregated by
framework and complementarity-determining regions.

```{r stopcod, fig.cap="Proportion of stop codons in any given CDR or FR of out-of-frame sequences in PCR1. The percentages show the frequency of stop codons in a given region relative to the total number of stop codons present.", echo=FALSE, out.width="100%"}
read_png(file.path(sdir, "productivity", "PCR1_stopcodon_region_outframe.png"))
```

The figure above shows the percentage of stop codons in the FR and CDR regions of _out-of-frame_ sequences.
As discussed earlier, these stop codons may be introduced due to cloning or sequencing errors, hence
a similar plot for _in-frame_ sequences can also be found within the same tab.

### Indels and mismatches

Some sequences are productive despite having indels and mismatches. This occurs
when indels are multiple of three and mismatches do not introduce stop codons.
The following figures show the proportion of indels and mismatches for each germline,
framework region, and complementarity-determining region on _productive sequences_ (unless
specified otherwise).

Figure \@ref(fig:mismatches), Figure \@ref(fig:gaps), and Figure \@ref(fig:gaps-out)
plots the proportion of mismatches in productive sequences,
indels in productive sequences, and indels in out-of-frame (unproductive) sequences
for framework region 3 (FR3).
The motivation behind these plots is to quickly identify the quality of productive
sequences.

```{r mismatches, fig.cap="Proportion of mismatches in productive sequences of PCR1.", echo=FALSE, out.width="100%"}
read_png(file.path(sdir, "productivity", "PCR1_fr3_mismatches_dist.png"))
```

```{r gaps, fig.cap="Proportion of indels in productive sequences of PCR1.", echo=FALSE, out.width="100%"}
read_png(file.path(sdir, "productivity", "PCR1_fr3_gaps_dist.png"))
```

```{r gaps-out, fig.cap="Proportion of indels in out-of-frame sequences of PCR1.", echo=FALSE, out.width="100%"}
read_png(file.path(sdir, "productivity", "PCR1_fr3_gaps_dist_out_of_frame.png"))
```

Ideally, the number of mismatches in framework regions and IGJ
are expected to be low because they are relatively conserved regions. Similarly,
the number of indels in productive sequences are expected to be low or some
multiple of 3.

Similar plots are generated for other FR and CDR regions, IGV, IGD, and IGJ.

<!-- TODO: missing plot 1: skipped IGV productive dist,
IGV inframe unproductive dist, IGV out-of-frame dist-->


## Diversity analysis

<!-- MA: define what is a clonotype first -->

The plots described in this section can be found in the `Diversity` tab of the HTML report. `r sabseq()` only conducts diversity analysis on clones that are productive.

### Clonotype-based analysis

To estimate repertoire diversity, `r sabseqR()` employs three commonly used techniques:

**Duplication-level analysis** in which the number of times a clone appears in the sequenced sample is calculated. Figure \@ref(fig:duplication) plots the proportion of sequences (y-axis) that appear once (singletons), twice (doubletons), 
and at higher-orders (x-axis). The higher the percentage of singletones and doubletones, the more diverse the library would likely be. 

**Rarefaction analysis** in which bootstrapping is used to estimate the *richness* of a library by calculating the proportion of unique sequences at different sample sizes. Figure \@ref(fig:rarefaction) plots the number of deduplicated clonotypes (y-axis)
againse sample sizes (x-axis). For each sample size, five samples are drawn and the mean with confidence intervals are calculated. In a highly diverse library, the percentage of unique clones should significantly increase as the sample size increases. 

**Capture-recapture analysis** in Figure \@ref(fig:recapture) plots the percentage of clonotypes recaptured (y-axis) in
a capture-recapture experiment at different sample sizes (x-axis).
For each sample size, the percentage of recaptured clonotypes is calculated for five repeated capture-recapture experiments and the mean and confidence intervals are reported.

In below figures, the complementarity-determining regions (CDRs)
are used to define a "clonotype". Similar plots can be generated for the framework regions (FRs) and the entire variable domain sequences. 

```{r duplication, fig.cap="Proportion of duplicated clonotypes in PCR1. Higher-order duplication levels starting from 10 are binned.", echo=FALSE, out.width="100%"}
read_png(file.path(sdir, "diversity", "PCR1_cdr_duplication.png"))
```

```{r rarefaction, fig.cap="Rarefaction curve of clonotypes in PCR1. The number of deduplicated sequences are taken over the mean of 5 resampling rounds, where the shaded areas indicate 95\\% confidence interval.", echo=FALSE, out.width="100%"}
read_png(file.path(sdir, "diversity", "PCR1_cdr_rarefaction.png"))
```

```{r recapture, fig.cap="Capture-recapture of clonotypes in PCR1. The number of recaptured sequences are taken over the mean of 5 resampling rounds, where the shaded areas indicate 95\\% confidence interval.", echo=FALSE, out.width="100%"}
read_png(file.path(sdir, "diversity", "PCR1_cdr_recapture.png"))
```

### Spectratype analysis

Spectratypes are histograms of the clonotype lengths calculated based on the amino acid sequences. Figure \@ref(fig:spectra) shows a CDR3 spectratype.
Spectratypes for other FRs and CDRs are available in the same tab. In a good quality library, the framework regions would have quite conserved lengths while CDRs show high length diversity. CDR3 spectratype tends to follow a normal distribution in libraries cloned from naive repertoires. 

```{r spectra, fig.cap="CDR3 amino acid spectratype for PCR1 with outliers removed.", echo=FALSE, out.width="100%"}
read_png(file.path(sdir, "diversity", "spectratypes", "PCR1_cdr3_spectratype_no_outliers.png"))
```

### Position-specific analysis

This analysis examines the diversity of Ab library at each amino acid position of the variable domain by estimating the utilization of each of the 20 amino acids at each position. Position-specific frequency matrices (PSFMs) are calculated by aligning all the sequences of a region of interest to anchor at the first position and then the frequency of each amino acid is calculated accordingly. Two PSFMs are calculated: (1) the unscaled PSFM, in which the frequencies are calculated based on the total number of observed sequences per sample *at each position* and (2) the sacled PSFM, in which the frequencies are calculated based on the total number of observed sequences per sample. 

Figure \@ref(fig:comp) shows the PSFM of CDR3 in `PCR1`. Amino acids are coloured based on their physiochemical properties. The left plot shows the unscaled composition logo and the right plot
shows the scaled composition logo. Similar plots for other FRs and CDRs are available in the same tab.

```{r comp, fig.cap="Amino acid composition logo for PCR1's CDR3. The unscaled (left) and the scaled (right) composition logos are colour coded and grouped by their physicochemical properties.", echo=FALSE, fig.wide=TRUE}
unscaled <- file.path(sdir, "diversity", "composition_logos", "CDR3", "PCR1_cumulative_logo.png")
scaled <- file.path(sdir, "diversity", "composition_logos", "CDR3", "PCR1_cumulative_logo_scaled.png")
img1 <-  rasterGrob(as.raster(readPNG(unscaled)), interpolate = TRUE)
img2 <- rasterGrob(as.raster(readPNG(scaled)), interpolate = TRUE)
grid.arrange(img1, img2, ncol = 2)
```


## Comparative analysis

The plots described in this section can be found in the `Clonotypes` tab of
`PCR1 vs PCR2 vs PCR3`'s HTML report.

Since comparative analysis deals with _overlapping_ clonotypes,
this analysis only applies when `compare` was supplied with at
least one sample comparison. Earlier, the call to `abseqReport`
had `compare = c("PCR1, PCR2, PCR3")`, therefore
`PCR1`, `PCR2,` and `PCR3` are compared against each other.

Throughout this analysis, a clonotype is synonymous to its CDR3 amino acid sequence.

### Over-representation analysis

Figure \@ref(fig:topclones) offers a simple overview of the top 10 over-represented clones found
in each sample. Since the clonotypes are colour coded, overlapping clonotypes
can easily be identified within the top 10 of each samples. Note that the
proportions are scaled relative to the top 10 clones in the respective samples.

This plot complements the scatter plot mentioned above. It displays the most abundant clonotypes
in each sample with the amino acid sequence in the legend. 

```{r topclones, fig.cap="Top 10 clonotypes from PCR1, PCR2, and PCR3. The clonotype proportions displayed are scaled relative to the top 10 clones in each sample.", echo=FALSE, out.width="100%"}
read_png(file.path(mdir, "clonotype_analysis", "PCR1_PCR2_PCR3_top10Clonotypes.png"))
```


### Overlapping analysis

#### Multi-sample analysis

While Figure \@ref(fig:topclones) is capable of showing overlapping clones,
it is restricted to the top 10 over-represented clones from each sample. Figure \@ref(fig:overlap)
aims to overcome the restriction by using a venn diagram to visualize the number
of overlapping (and non-overlapping) clones from each sample. Each number within
the venn diagram shows the number of unique clonotypes that are overlapping (in
an intersection) or are non-overlapping (not in any intersection).
That is, by taking the sum of all the numbers in a sample segment,
it becomes the number of unique clonotypes found in that sample.


```{r overlap, fig.cap="Number of unique overlapping clones found in PCR1, PCR2, and PCR3. The numbers in an intersection denotes the number of unique clones that are shared between the samples involved in the said intersection.", echo=FALSE, out.width="100%"}
read_png(file.path(mdir, "clonotype_analysis", "PCR1_PCR2_PCR3_cdr3_clonotypeIntersection.png"))
```

Note that this venn diagram will __not__ be plotted if there are more than 5
samples.

#### Two-sample analysis

In order to visualize the correlation between any pair of samples, `r sabseqR()` plots
a scatter plot of every possible combination. Figure \@ref(fig:scatter) shows
one of them, plotting the clonotype frequencies in `PCR2` against `PCR1`.

The scatter plot:

* has $log_{10}$-scaled clonotype frequencies
* has a point for each clonotype
    * the coordinate of each point denotes the $log_{10}(frequency)$ in the sample
    on the x and y axis respectively
    * point sizes are mapped to the mean frequency of a clonotype
* has marginal density plots colour coded as such:
    * blue: density of _overlapping clonotypes_
    * purple: density of _non-overlapping clonotypes_
    * grey: density of _all clonotypes_
    

```{r scatter, fig.cap="Log-scaled scatter plot of clonotype frequencies in PCR1 and PCR2. Point size denotes mean frequency of the 2 samples and marginal density plots are colour coded by overlapping (blue), non-overlapping (purple), and all (grey) clonotypes.", echo=FALSE, out.width="100%"}
read_png(file.path(mdir, "clonotype_analysis", "PCR1_vs_PCR2_clone_scatter.png"))
```

This plot is heavily inspired by [VDJTools](https://github.com/mikessh/vdjtools/).

### Correlation analysis

In addition to Figure \@ref(fig:scatter), the linear correlation of clonotype frequencies
between samples can be directly quantified using Pearson's correlation coefficient.
Figure \@ref(fig:corr) shows the plot generated by `r CRANpkg("ggcorrplot")` used to visualize
pearson coefficients. A similar plot using Spearman's correlation coefficient (rank-based)
is also available in the same directory.

```{r corr, fig.cap="Pearson correlation between PCR1, PCR2, and PCR3. Values denote the pearson correlation coefficient of the clonotypes frequencies between the samples. These values will appear with a cross 'x' to signify an insignificant coefficient if the p-value of the coefficient is larger than 0.05.", echo=FALSE, out.width="100%"}
read_png(file.path(mdir, "clonotype_analysis", "pearson.png"))
```


### Clustering analysis

The `r CRANpkg("vegan")` package was used to calculate distances between samples.
The distances between samples are calculated using its clonotype frequencies by
applying methods from Morisita-Horn's overlap index, Jaccard index, and Dice's coefficient.

Figure \@ref(fig:cluster) shows a dendrogram plotted using
Morisita-Horn's overlap index. The length of each line denotes the distance
between the 2 samples or clusters it is connected to.
Other dendrograms using Jaccard and Dice's formula are available in the same directory.

```{r cluster, fig.cap="Morisita-Horn distances of PCR1, PCR2, and PCR3. The distances are calculated using clonotype frequencies and are visualized as the length of the lines connecting samples or clusters.", echo=FALSE, out.width="100%"}
read_png(file.path(mdir, "clonotype_analysis", "morisita_horn.png"))
```


# Appendices

## Datasets {#example-data}

The datasets used in the above examples
was obtained from a combination of synthetic sample datasets generated
using [MiXCR](https://github.com/milaboratory/mixcr)'s program 
[here](http://files.milaboratory.com/mixcr/paper/mixcr-test-1.2-SNAPSHOT.jar). 
Firstly, three distinct samples were generated, each simulated with the following parameters
in `MiXCR`:

```{r mixr-test-params, echo=FALSE, message=FALSE, warnings=FALSE, results='asis'}
tabl <- "
| Parameter     | sample 1      | sample 2      | sample 3      |
|---------------|---------------|---------------|---------------|
| reads         |  10000        |  10000        | 10000         |
| clones        |  5000         |  5000         | 2000          |
| seed          |  4228         |  2428         | 2842          |
| conf          | MiSeq-300-300 | MiSeq-300-300 | MiSeq-300-300 |
| loci          |  IGH          | IGH           | IGH           |
| species       | hsa           | hsa           | hsa           |
"
cat(tabl)
```

Following that, an arbitrary number of sequences were randomly
drawn from each of the three samples and randomly amplified.
This process was repeated 3 times, resulting in a final repertoire
of three samples, named PCR1, PCR2, and PCR3.

Finally, these three samples were analyzed by `r sabseqPy()`. The command used to
analyze these samples are as follows:

```{bash abseqR__data_abseqPy_run, eval=FALSE}
abseq -y params.yml
```

where the contents of `params.yml` is:

```yaml
# params.yml
defaults:
    bitscore: 300
    sstart: 1-3
    alignlen: 250
    outdir: ex
    task: all
    threads: 1
---
file1: PCR1.fasta
name: PCR1
---
file1: PCR2.fasta
name: PCR2
---
file1: PCR3.fasta
name: PCR3
```

`r sabseqPy()`'s analysis output on these three samples are contained
within the dataset described in this vignette.

## Session Info

This vignette was rendered in the following environment:
```{r abseqR_session_info, echo=FALSE}
sessionInfo()
```


# References

<!-- add some academic references about the topic, find some from our manuscript that are related to software tools so authors can understand the scopt of our work. Add a reference to our potential paper with (under preparation) -->

[^Repseq]: Repertoire sequencing.
[^windows_biocparallel]: The performace offered by
[BiocParallel](https://bioconductor.org/packages/release/bioc/html/BiocParallel.html) may differ across different operating systems.
[^compare-ws]: Trailing and leading whitespace between sample names are trimmed. That is, "PCR1,PCR2" is identical to
"PCR1 , PCR2".