---
title: "seqCAT: The High Throughput Sequencing Cell Authentication Toolkit"
author: "Erik Fasterius"
date: "`r Sys.Date()`"
package: "`r pkg_ver('seqCAT')`"
output:
    BiocStyle::html_document:
        toc_float: true
        number_sections: true
bibliography: bibliography.bib
vignette: >
    %\VignetteIndexEntry{seqCAT: The High Throughput Sequencing Cell Authentication Toolkit}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---

```{r Options, echo = FALSE}
knitr::opts_chunk$set(fig.align = "center", message = FALSE)
```

# Introduction

This vignette describes the use of the *seqCAT* package for
authentication, characterisation and evaluation of two or more *High Throughput
Sequencing* samples (HTS; RNA-seq or whole genome sequencing). The principle
of the method is built upon previous work, where it was demonstrated that
analysing the entirety of the variants found in HTS data provides unprecedented
statistical power and great opportunities for functional evaluation of genetic
similarities and differences between biological samples *[@Fasterius2017]*.

seqCAT work by creating *Single Nucelotide Variant* (SNV) profiles
of every sample of interest, followed by comparisons between each set to find
overall genetic similarity, in addition to detailed analyses of the
differences. By analysing your data with this workflow you will not only be
able to authenticate your samples to a high degree of confidence, but you will
also be able to investigate what genes and transcripts are affected by SNVs
differing between your samples, what biological effect they will have, and
more. seqCAT's workflow consists of three separate steps:

    1.  Creation of SNV profiles
    2.  Comparisons of SNV profiles
    3.  Authentication, characterisation and evaluation of profile comparisons

Each step has its own section(s) below demonstrating how to perform the
analyses. Input data should be in the form of [VCF files][vcf-format], *i.e*
output from variant callers such as the [Genome Analysis ToolKit][gatk] and
annotated with software such as [SnpEff][snpeff].

## Installation

The latest stable release of this package can be found on
[Bioconductor][bioc-home] and installed using the `biocLite` function:

```{r Installation, eval = FALSE}
source("https://bioconductor.org/biocLite.R")
biocLite("seqCAT")
```

This will also install any missing packages requires for full functionality,
should they not already exist in your system. If you haven't installed
Bioconductor, you can do so by simply calling `biocLite()` without specifying a
package, and it will be installed for you. You can read more about this at
Bioconductor's [installation page][bioc-install]. You can find the development
version of seqCAT on [GitHub][github].

# Creation of SNV profiles

The first step of the workflow is to create the SNV profile of each sample,
which can then be compared to each other. In order to decrease the computation
time for large comparison sets and to facilitate re-analyses with different
parameters each SNV profile is saved on the harddrive as a normal `.txt` file.
While computation time is usually not an issue for simple binary comparisons
(*i.e.* comparisons with only two samples), this can quickly become a concern
for analyses where samples are compared to several others (A vs B, A vs C, ...,
and so on); this is doubly true for annotated VCF files.

The creation of a SNV profile includes filtering of low-confidence
variants and removal of variants below a sequencing depth threshold (`10` by
default). For annotated VCF files, only records with the highest SNV impact
(*i.e.* [impact][] on protein function) for each variant is kept, as they are
most likely to affect the biology of the cells. Creation of annotated SNV
profiles is also implemented in Python ([section 2.2][Create profiles faster
with Python]), which is faster than the standard implementation in R ([section
2.1][Create profiles with R]).

## Create profiles with R

Throughout this vignette we will be using some example data, `example.vcf.gz`,
which comes from the initial publication of the general process of this method
*[@Fasterius2017]*. It is a simplified multi-sample VCF file on a subset of
chromosome 12 (containing all variants up to position `25400000`, in order to
keep the file size low) for three different colorectal cancer cell lines:
*HCT116*, *HKE3* and *RKO*.

```{r Create SNV profile with R}
# Load the package
library("seqCAT")

# List the example VCF file
vcf <- system.file("extdata", "example.vcf.gz",
                   package = "seqCAT")

# Create two SNV profiles
create_profile(vcf, "HCT116", "hct116_profile.txt")
create_profile(vcf, "RKO", "rko_profile.txt", filter_depth = 15)
```

This creates SNV profiles for the two samples found in the example data
(`HCT116` and `RKO`) and saves them as `hct116.profile.txt` and
`rko_profile.txt` in the current directory, respectively. The profile of the
second sample was created with a non-standard filter for sequencing depth
(`15`), which should only be done if you want a stricter criteria for your
profile (such as when you're only interested in higher-than-standard confidence
variants).

## Create profiles faster with Python

Annotated SNV profiles can also be created with [Python][python-home], another
scripting language, if you have installed it. You will also need to install the
[PyVCF][pyvcf] module, in order for it to run. The `Python` version can create
SNV profiles approximately five to ten times quicker than its `R` equivalent
for annotated VCF files. This is not important for most users, but is
nevertheless included for cases with many annotated VCF files where extra speed
is desirable.

```{r Create SNV profile with Python, eval = FALSE}
create_profile(vcf, "RKO", "RKO_profile.txt", python = TRUE)
```

## Create COSMIC profiles

It is also possible to to compare your samples' variants to some external
source. Such a source is the *Catalogue of somatic mutations in cancer*, or
*COSMIC*. COSMIC has over a thousand cell line-specific mutational profiles,
and is thus a very useful resource if you are working with cell lines.

In order to use the COSMIC cell line database, you need to sign up for an
account at their [website][cosmic-home] and get permission to download their
files (which is given free of charge to academia and non-profit organisation,
but requires a commersial license for for-profit organisations). The required
file is the one named `CosmicCLP_MutantExport.tsv.gz`, listed under *complete
mutational data* [here][cosmic-downloads]. As redistributing this file is not
allowed, this package includes an extremely minimal subset of the original 
file, only useful for examples in this vignette and unit testing. *Do not* use
this file for your own analyses, as your results will neither be complete nor
accurate!

The first thing to check is to see if your specific cell line is available in
the database, which can be accomplished using the `list_cosmic` function:

```{r List COSMIC}
file <- system.file("extdata", "subset_CosmicCLP_MutantExport.tsv.gz",
                    package = "seqCAT")
cell_lines <- list_cosmic(file)
head(cell_lines)
```

This gives us a simple vector containing all the available cell lines in the
COSMIC database (this version of the file is for the GRCh37 assembly). You can
search it for a cell line of your choice:

```{r Search COSMIC}
any(grepl("HCT116", cell_lines))
```

All COSMIC-related functions perform some simplification of cell line names (as
there is variation in the usage of dashes, dots and other symbols), and are
case-insensitive. When you have asserted that your cell line of interest is
available, you can then read the profile for that cell line using the
`read_cosmic` function:

```{r Read COSMIC}
cosmic <- read_cosmic(file, "HCT116")
head(cosmic)
```

You now have a small, COSMIC SNV profile for your cell line, which you can
compare to any other profile you may have data for (more on this below). You
can also check how many variants are listed in COSMIC for your particular cell:

```{r Count COSMIC}
length(cosmic)
```

Here we only see a single variant for the HCT116 cell line, which is only
because of the extreme small subset of the COSMIC databse being used here.
HCT116 has, in fact, over 2000 listed COSMIC SNVs, making it one of the more
abundantly characterised cell lines available (as most cell lines has only a
few hundred SNVs listed in COSMIC). A COSMIC profile of a couple of hundred
variants is more common, though, and any analysis based only on COSMIC variants
is thus inherently limited.

# Comparing SNV profiles

## Comparing full profiles

Once each relevant sample has its own SNV profile the comparisons can be
performed. First, each profile is read using the `read_profile` function,
which outputs `GRanges` objects for fast and efficient comparisons.

```{r Read SNV profiles}
hct116 <- read_profile("hct116_profile.txt", "HCT116")
rko <- read_profile("rko_profile.txt", "RKO")
head(hct116)
```

SNV profiles contain most of the relevant annotation data from the original VCF
file, including SNV impacts, gene/transcript IDs and mutational (rs) ID. The
`DP` (depth) field lists the total sequencing depth of this variant, while the
specific allelic depths can be found in `AD1` and `AD2`. The alleles of each
variant can be found in `A1` and `A2`.

Once each profile has been read, the genotypes of the overlapping variants
between them can be compared using the `compare_profiles` function. Only
variants found in both profiles are considered to overlap, as similarity
calculations between profiles where some variants only have confident calls in
one of the samples are inappropriate. An SNV is considered a match if it has an
identical genotype in both profiles.

```{r Compare profiles}
hct116_rko <- compare_profiles(hct116, rko)
head(hct116_rko)
```

The resulting dataframe retains all the information from each input profile
(including any differing annotation, should they exist), and lists the depths
and alleles by adding the sample names as suffixes to the relevant column
names. An optional parameter, `mode`, can also be supplied: the default value
(`"intersection"`) discards any non-overlapping variants in the comparison,
while setting it to `"union"` will retain them.

```{r Compare profiles (union)}
hct116_rko_union <- compare_profiles(hct116, rko, mode = "union")
head(hct116_rko_union)
```

## Comparing to COSMIC profiles

If you are working with cell lines and only want to analyse a subset of your
data or as a orthogonal method complementary to others, you could compare your
profile to a COSMIC profile. This works in the same way as comparing to another
full profile, but gives slightly different output:

```{r Compare with COSMIC}
hct116_cosmic <- compare_profiles(hct116, cosmic)
head(hct116_cosmic)
```

You can use all the functions for downstream analyses for comparisons with
COSMIC data, but your options for functional analyses will be limited, given
that the COSMIC database is biased towards well-known and characterised
mutations. It is, however, an excellent way to authenticate your cell lines and
to assert the status of the mutations that exist in the analysed cells.

# Evaluating binary comparisons

## Similarity and global statistics

When you have your matched, overlapping SNVs, it's time to analyse and
characterise them. The first thing you might want to check are the global
similarities and summary statistics, which can be done with the
`calculate_similarity` function. The `concordance` is simply the number of
matching genotypes divided by the total number of overlapping variants, while
the `similarity score` is a weighted measure of the concordance in the form of
a binomial experiment, taking into account the number of overlapping variants
available:

$$Similarity = \frac{s + a}{n + a + b}$$

... where `s` is the number of matching genotypes, `n` is the total number of 
overlapping SNVs, `a` and `b` being the parameters used to weigh the
concordance in favour of comparisons with more overlaps. The default
parameters of `1` and `5` were selected to yield an equivalent cutoff to one
suggested by Yu *et al.* (2015), which results in a lower limit 44 of perfectly
matching overlapping variants with a similarity score of 90. The similarity
score is thus a better measure of biological equivalency than just the
concordance.

```{r Calculate similarities}
similarity <- calculate_similarity(hct116_rko)
similarity
```

Here, you can see a summary of the relevant statistics for your particular
comparison: the number of total variants from each profile (if the comparison
was done with `mode = "union"`, otherwise this number will just be equivalent
to the overlaps), the number of overlaps between your two samples, the number
of matching genotypes, their concordance as well as their similarity score. The
cutoff used by Yu *et al.* for cell line authenticity was `90 %` for their 48
SNP panel, something that could be considered the baseline for this method as
well. The score, `68.7`, is well below that cutoff, and we can thus be certain
that these two cells are indeed not the same (as expected). While hard
thresholds for similarity are inadvisable, a general guideline is that
comparisons with scores above `90` can be considered similar while those below
can be considered dissimilar. While a score just below `90` does not mean that
the cells definitely are different, it *does* mean that more rigorous
evaluation needs to be performed in order to ensure their biological 
equivalency. Are there specific genes or regions that are of special interest,
for example? If so, it might be informative to specifically investigate the
similarity there (more on this [below][Evaluation of specific chromosomes,
regions, genes and transcripts]).

You may additionally change the parameters of the score (if you, for example,
want a stricter calculation). You may also supply the `calculate_similarity`
function with an existing dataframe with summary data produced previously, in
order to aggregate scores and statistics for an arbitrary number of
comparisons.

```{r Calculate similarities iteratively}
# Create and read HKE3 profile
create_profile(vcf, "HKE3", "hke3_profile.txt")
hke3 <- read_profile("hke3_profile.txt", "HKE3")

# Compare HCT116 and HKE3
hct116_hke3 <- compare_profiles(hct116, hke3)

# Add HCT116/HKE3 similarities to HCT116/RKO similarities
similarities <- calculate_similarity(hct116_hke3,
                                     similarity, a = 1, b = 10)
similarities
```

Notice that the new `similarities` dataframe contains both the comparisons of
HCT116/RKO and HCT116/HKE3, and we can clearly see that HCT116 and HKE3 are
indeed very similar, as expected (HKE3 was derived from HCT116). This is true
even when using a higher value for the `b` parameter. Any number of samples can
be added using the `calculate_similarity` function, for use in further
downstream analyses.

## Evaluation of SNV impacts

An SNV's [impact] represent the putative effect that variant may have on the
function of the resulting protein, and ranges from HIGH through MODERATE, LOW
and MODIFIER, in decreasing order of magnitude. HIGH impact variants may, for
example, lead to truncated proteins due to the introduction of a stop codon,
while MODIFIER variants have little to no effect on the protein at all. While
there is no guarantee that a specific phenotype arises from a HIGH rather than
a MODERATE impact variant (for example), it may be informative to look at the
impact distribution of the overlapping SNVs between two profiles. This can
easily be performed by the `plot_impacts` function:

```{r Impact distributions}
impacts <- plot_impacts(hct116_rko)
impacts
```

This function takes a comparison dataframe as input and plots the impact
distribution of the overlapping variants. It has a number of arguments with
defaults, such as if you want to add text with the actual numbers to the plot
(`annotate = TRUE` by default), if you want to show the legend (`legend =
TRUE` by default) and what colours you want to plot the match-categories with
(`palette = c("#0D2D59", "#1954A6")` by default, two shades of blue). We can
see that most of the SNVs are present in the MODIFIER impact category, and that
there is not a single mismatched HIGH impact SNV. (You can also visualise the
impact distribution between your sample and the COSMIC database in exactly the
same way.)

You might also want to look at only a subset of variants, *e.g.* only the
variants with HIGH or MODERATE impacts, which is easily achieved with some data
manipulation:

```{r Subset impacts}
hct116_rko_hm <- hct116_rko[hct116_rko$impact == "HIGH" |
                            hct116_rko$impact == "MODERATE", ]
nrow(hct116_rko_hm)
```

## Evaluation of specific chromosomes, regions, genes and transcripts

You might be interested in a specific chromosome or a region on a chromosome,
and it might be useful to work with data for only that subset. This operation
is easily performed on a comparison dataframe:

```{r Subset chromosome or region}
hct116_rko_region <- hct116_rko[hct116_rko$chr == 12 &
                                hct116_rko$pos >= 25000000 &
                                hct116_rko$pos <= 30000000, ]
head(hct116_rko_region)
```

You might also be interested in a specific gene or transcript, of special
importance to your study:

```{r Subset gene or transcript}
hct116_rko_eps8_t <- hct116_rko[hct116_rko$ENSTID == "ENST00000281172", ]
hct116_rko_vamp1 <- hct116_rko[hct116_rko$ENSGID == "ENSG00000139190", ]
hct116_rko_ldhb <- hct116_rko[hct116_rko$gene == "LDHB", ]
head(hct116_rko_ldhb)
```

Here we see two mutations in the LDHB gene, one mismatching MODIFIER variant
and one matching LOW variant. This is a good approach to check for known
mutations in your dataset. For example, the HCT116 cell line is supposed to
have a KRAS<sup>G13D</sup> mutation. We might look for this using its known
`rsID` or position:

```{r Subset KRAS}
hct116_rko_kras <- hct116_rko[hct116_rko$rsID == "rs112445441", ]
hct116_rko_kras <- hct116_rko[hct116_rko$chr == 12 &
                              hct116_rko$pos == 25398281, ]
nrow(hct116_rko_kras)
```

The reason that we don't find this particular variant in the HCT116 vs. RKO
comparison is that it is not present in the RKO profile, either because it
isn't a mutation in RKO or because there was no confident variant call for that
particular position. The `compare_profiles` function only looks at overlapping
positions, so we will have to look at the individual profiles instead. `seqCAT`
has two functions to help with this: `list_variants` and `plot_variant_list`.

The `list_variants` function looks for the genotypes of each specified variant
in each provided SNV profile. First, let's create a small set of interesting
variants we want to look closer at:

```{r Create known variants list}
known_variants <- data.frame(chr  = c(12, 12, 12, 12),
                             pos  = c(25358650, 21788465, 21797029, 25398281),
                             gene = c("LYRM5", "LDHB", "LDHB", "KRAS"),
                             stringsAsFactors = FALSE)
known_variants
```

The minimum information needed are the `chr` and `pos` columns, any additional
columns (such as `gene`, here) will just be passed along for later use. We can
now pass this set (along with our SNV profiles) to the `list_variants`
function:

```{r List variants}
variant_list <- list_variants(list(hct116, rko), known_variants)
variant_list
```

While this gives you a nice little list of the genotypes of your specified
variants, we can also visualise this using the `plot_variant_list` function. It
takes a slightly modified version of the output from the `list_variants`
function: it may only contain the genotype columns. We thus need to create row
names to identify the variants, like this:

```{r, Plot variant list}
# Set row names to "chr: pos (gene)"
row.names(variant_list) <- paste0(variant_list$chr, ":", variant_list$pos,
                                  " (", variant_list$gene, ")")

# Remove "chr", "pos" and "gene" columns
to_remove <- c("chr", "pos", "gene")
variant_list <- variant_list[, !names(variant_list) %in% to_remove]

# Plot the genotypes in a grid
genotype_grid <- plot_variant_list(variant_list)
genotype_grid
```

This gives us an easily overviewed image of what variants are present in which
samples, and their precise genotype. We can see that the KRAS<sup>G13D</sup>
mutation is indeed present in the HCT116, but not in RKO. We can also see that
RKO has a homozygous `G/G` genotype for one of the LDHB variants, while HCT116
is heterozygous (`T/G`) for the same. (Please note that this data was aligned
and analysed using the GRCh37 / hg19 assembly and that listed positions might
not be accurate for other assemblies.)

# Evaluating multiple comparisons

Many scientific studies compare more than just two datasets, not to mention
meta-studies and large-scale comparisons. It is therefore important to be able
to characterise and evaluate many-to-one or many-to-many cases as well - the
`seqCAT` package provides a number of functions and procedures for doing so. 

## Performing multiple profile comparisons

The first step of such an analysis is to create and read SNV profiles for each
sample that is to be evaluated (please see [section 2][Creation of SNV
profiles]). The example data used here has three different samples: HCT116,
HKE3 and RKO. The `compare_many` function is a helper function for creating
either one-to-many or many-to-many SNV profile comparisons, and returns a
`list` of the global similarities for all combinations of profiles and their
respective data (for downstream analyses):

```{r Many-to-many comparisons}
# Create list of SNV profiles
profiles <- list(hct116, hke3, rko)

# Perform many-to-many comparisons
many <- compare_many(profiles)
many[[1]]
```

We can here see the summary statistics of all three combinations of the cell
lines in the example data. Notice that `compare_many` will only perform a
comparison that has not already been performed, *i.e.* it will not perform the
RKO vs. HCT116 comparison if it has already performed HCT116 vs. RKO.
Also notice that it does perform self-comparisons (*i.e.* HCT116 vs.
HCT116), which is useful for downstream visualisations.

The similarities are stored in the first element of the results (`many[[1]]`),
while the data for each comparison is stored in the second (`many[[2]]`). The
second element is itself also a list, whose indices correspond to the row names
of the similarity object. If we, for example, are interested in the HKE3
self-comparison, we can see that its row name is `4`. We can then access its
data like this:

```{r HKE3 self-comparisons}
hke3_hke3 <- many[[2]][[4]]
head(hke3_hke3)
```

You may also specify the `a` and `b` similarity score parameters, as above. If
you are interested in only a one-to-many comparison (for cases when you have a
"true" baseline profile to compare against), you can do this by also specifying
the `one = <profile>` parameter in the function call. This is useful if you
have a COSMIC profile to compare against, for example:

```{r COSMIC-to-many comparisons}
many_cosmic <- compare_many(profiles, one = cosmic)
many_cosmic[[1]]
```

It is important to note that performing many comparisons like this may take
quite some time, depending on the number of profiles and how much data each
profile has. By returning all the data in a list you may then save each
comparison to a file, for later re-analysis without having to re-do the
comparisons.

## Visualising multiple comparisons

A useful and straightforward way of visualising multiple profile comparisons is
to use a heatmap. We can use the summary statistics listed in the similarity
object from above as input to the function `plot_heatmap`, which gives you a
simple overview of all your comparisons:

```{r Plot heatmap, out.width = "60 %"}
heatmap <- plot_heatmap(many[[1]])
heatmap
```

Here we see a blue colour gradient for the similarity score of the three cell
lines, which are clustered according to their similarity (using `cluster =
TRUE`, as default). You may change the size of the text annotations using
`annotation_size = 5` (default) or suppress them entirely (`annotate = FALSE`).
You may also suppress the legend (`legend = FALSE`), change the main colour of
the gradient (`colour = "#1954A6"` by default) or change the limits of the
gradient (`limits = c(0, 50, 90, 100)` by default). The choice of gradient
limits are based on clarity (comparisons with a similarity score less than 50,
*i.e.* those that likely have too few overlapping variants to begin with, are
suppressed) and the previously mentioned 90 % concordance threshold
*[@Yu2015]*.

This heatmap makes it clear that HCT116 and HKE3 are, indeed, very similar to
each other, while RKO differs from them both. These types of heatmaps can be
created for an arbitrary number of samples, which will then give a great
overview of the global similarities of all the samples studied. This can be
used to evaluate the quality of the datasets (*e.g.* to see which comparisons
have very few overlaps), find similarity clusters and potential unexpected
outliers. If a sample stands out in a heatmap such as this, that is grounds for
further investigation, using both the methods described above and more
classical evaluations of sequencing data (read quality, adapter contamination,
alignments, variant calling, and so on).

```{r Remove temporary files, echo = FALSE, results = "hide"}
file.remove("hct116_profile.txt")
file.remove("rko_profile.txt")
file.remove("hke3_profile.txt")
```

# Citation {-}

If you are using seqCAT to analyse your samples, please cite the
article in which the general methodology was first published.

> **A novel RNA sequencing data analysis method for cell line authentication**
> <br/> Fasterius, E., Raso, C., Kennedy, S., Kolch, W., Al-Khalili C. *et al.*
> <br/> PloS One, 12(2), e0171435. (2017)
> <br/> doi: http://doi.org/10.1371/journal.pone.0171435

# Session info {-}

```{r Session info, echo = FALSE}
sessionInfo()
```

# References

[bioc-home]: http://bioconductor.org/
[bioc-install]: http://bioconductor.org/install/

[cosmic-home]: http://cancer.sanger.ac.uk/cosmic
[cosmic-downloads]: http://cancer.sanger.ac.uk/cell_lines/download

[gatk]: http://www.internationalgenome.org/wiki/Analysis/variant-call-format

[github]: https://github.com/fasterius/seqCAT

[impact]: http://snpeff.sourceforge.net/SnpEff_manual.html#eff
[python-home]: https://www.python.org/
[pyvcf]: https://pypi.python.org/pypi/PyVCF
[snpeff]: http://snpeff.sourceforge.net/
[vcf-format]: https://software.broadinstitute.org/gatk/