---
title: "Inversion genotyping with scoreInvHap"
author: 
- name: Carlos Ruiz-Arenas
  affiliation: 
  - &isglobal ISGlobal, Centre for Research in Environmental Epidemiology (CREAL)
  - &brge Bioinformatics Research Group in Epidemiolgy (BRGE)
- name: Alejandro Caceres
  affiliation: 
  - *isglobal
  - *brge
- name: Juan R. Gonzalez
  affiliation: 
  - *isglobal
  - *brge
  email: juanr.gonzalez@isglobal.org
date: "`r Sys.Date()`"
package: "`r pkg_ver('scoreInvHap')`"
output: 
  BiocStyle::html_document:
    number_sections: true
    toc: yes
    fig_caption: yes
    toc_float: true
vignette: >
  %\VignetteIndexEntry{Inversion genotyping with scoreInvHap}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---


```{r global_options, include=FALSE}
knitr::opts_chunk$set(warning=FALSE, message=FALSE)
```

# Introduction

`r Rpackage("scoreInvHap")` genotypes inversions using SNP data. This method computes a similarity score between the available SNPs of an individual and experimental references for the three inversion-genotypes; NN: non-inverted homozygous, NI: inverted heterozygous and II: inverted homozygous. Individuals are classified into the inversion-genotypes with the highest score. There are other approaches to genotype inversions from SNP data: [inveRsion](https://bioconductor.org/packages/release/bioc/html/inveRsion.html), [invClust](http://nar.oxfordjournals.org/content/early/2015/02/05/nar.gkv073.full) or [PFIDO](https://github.com/MaxSalm/pfido). However, these approaches have limitations including:

* __they are based in population sample inferences__. When a new individual is included in the study the whole population sample needs to be reanalyzed. As they depend on dimensionality reduction methods, they are slow in the analysis of several individuals.  

* __they can only handle limited number of samples__. They need large samples sizes for accurate inferences. 
* __They are highly sensitive to SNP array densities__.  Their accuracy depends on how clear the population sample can be partitioned, which depends on the amount and quality of informative SNPs. 
 
* __They need external information to label the inversion-genotype clusters__. External validation of the clustering is performed at the end of the inference. If there are more than three clusters, it is not clear how which inversion genotype should be assigned to the clusters.  

`r Rpackage("scoreInvHap")` overcomes these difficulties by using a set of reference genotypes from the inversion of interest. The methods is a supervised classifier  that genotypes each individual according to their SNP similarities to the reference genotypes across the inverted segment. The classifier in particular uses the linkage desequillibrium (R^2^) between the SNPs and the inversion genoptypes, and the SNPs of each reference inversion-genotypes. 

# Inversion characterization

The package is loaded as usual

```{r, load_package}
library(scoreInvHap)
```

`r Rpackage("scoreInvHap")` takes as input SNP data formated in either `SNPMatrix` or `VCF` Bioconductor’s objects. In the case of having data in `SNPMatrix` format, a list with two elements is required:

* genotypes: a SNPMatrix with individuals in rows and SNPs in columns
* map: a data.frame with the SNPs annotation. It *must* contain the columns *allele.1* and *allele.2* with the alleles of the SNPs. 

If data is originally available in PLINK files (.bed, .bim), we can use the functions of  `r Biocpkg("snpStats")` to load the data as `SNPMatrix` object 

```{r, eval=FALSE}
library(snpStats)

## From a bed 
snps <- read.plink("example.bed")

## From a pedfile
snps <- read.pedfile("example.ped", snps = "example.map")
```


In both cases, data is readily formatted for `r Rpackage("scoreInvHap")`. 

If data is available in a variant call format (.vcf) file, we can load it in R into a `VCF` object, using the `r Biocpkg("VariantAnnotation")` package. `r Rpackage("scoreInvHap")` includes a small vcf in `r Rpackage("scoreInvHap")` as a demo. This file contains genotyped and imputed SNPs within inversion 7p11.2 for 30 European individuals from the 1000 Genomes project. We can load the vcf as follows

```{r, Load SNPs, message=FALSE}
library(VariantAnnotation)
vcf_file <- system.file("extdata", "example.vcf", package = "scoreInvHap")
vcf <- readVcf(vcf_file, "hg19")
vcf
```

The object `vcf` contains 380 SNPs and 30 individuals. 

`r Rpackage("scoreInvHap")` references were created using 1000 Genomes data. Thus, SNPs are annotated to the hg19 build in the plus strand. We have included a function that checks if input SNPs correspond to `r Rpackage("scoreInvHap")` references:

```{r}
check <- checkSNPs(vcf)
check
vcf <- check$genos
```

The function `checkSNPs` checks if the alleles in the SNP object are equal to those in `r Rpackage("scoreInvHap")` references. The function also tests if the frequencies are similar in the input data and in the references. `checkSNPs` returns a list with three elements. wrongAlleles are the SNPs with different alleles, wrongFreqs are the SNPs with different allele frequencies and `genos` is an object with the genotype data without SNPs failing any of the checks.  

Now, we illustrate how to perform the inversion genotyping of inv7p11.2 for these data. The inversion calling requires two pieces of information:

- sample genotypes and their annotation (argument `SNPlist`),
- name of the inversion

Currently, there are 20 inversion references included in `r Rpackage("scoreInvHap")`. We included a table with their coordinates and scoreInvHap labels at the end of this document. 

inv7p11.2 is labeled as `inv7_005` in `r Rpackage("scoreInvHap")`. We then obtain the  inversion genotypes for the 30 individuals in our dataset as follows  

```{r, classify}
res <- scoreInvHap(SNPlist = vcf, inv = "inv7_005")
res
```

This function has a parameter called `BPPARAM` that allows paralell computing using `BiocParallel` infrastructure. For example, parallel computation using 8 cores would be run by executing 

```{r, classify_par, eval=FALSE}
res <- scoreInvHap(SNPlist = vcf, inv = "inv7_005",
                   BPPARAM = BiocParallel::MulticoreParam(8))
```


The results of `scoreInvHap` are encapsulated in a object of class `scoreInvHapRes`. This object contains the classification of the individuals into the inversion-genotypes, as well as the similarity scores. We can retrieve results with the `classification` and the  `scores`  functions 

```{r, scoreInvHap results}
# Get classification
head(classification(res))
# Get scores
head(scores(res))
```

Inversion genotypes is returned as a factor, which includes the individual names present in the `snpMatrix` or `VCF`. Thus, inversion classification can be readily used in down-stream association tests. 

## Quality control

We can retrieve other values that are useful to evaluate the quality of the inference. For each individual, we can obtain the highest similarity score and the difference between the highest similarity score and the second highest:

```{r, scoreInvHap scores}
# Get max score
head(maxscores(res))
# Get difference score
head(diffscores(res))
```

Classification is good when the highest score is close to 1 and the other scores are small. This means that the SNPs in the individual are almost identical to one of the reference genotypes and  different from the rest. We can use the difference between the highest score and the second highest score as a quality measure. We can have a visual evaluation of the quality parameters with the function `plotScores`:

```{r}
plotScores(res, pch = 16, main = "QC based on scores")
```

The horizontal line is a threshold for quality, set to 0.1 but can be changed in the parameter `minDiff` in the function `scoreInvHap`. This default value considers that the SNPs  of the individual are at least 10% more similar to the selected reference than to second one. `plotScores` can be customized. 

Another quality measure is based on the number of SNPs used in the computation. `r Rpackage("scoreInvHap")` allows individuals having different SNP coverages within the inverted regions. SNPs with no information are excluded from the computation of the scores. To reflect this we report a SNP call rate:

```{r}
# Get Number of scores used
head(numSNPs(res))
# Get call rate
head(propSNPs(res))
```

The number of SNPs must always be taken into account to evaluate the performance of the computation. It is highly recommended to use, at least, 15 SNPs in any inversion-calling. `plotCallRate`  can be used to visualize the call rate

```{r}
plotCallRate(res, main = "Call Rate QC")
```

The vertical line is the minimum recommended call rate to consider a sample. By default, it is set to 0.9 but can be changed with the parameter `callRate`.

The function `classification` have two parameters that selects samples based on these two QC parameters. The argument `minDiff` sets the minimum difference between the highest and the second highest score. The argument `callRate` sets the minimum call rate of a sample to pass the QC. By default, both arguments are set to 0 so all the samples are included: 

```{r}
## No filtering
length(classification(res))
## QC filtering
length(classification(res, minDiff = 0.1, callRate = 0.9))
```

Finally, the function `classification` has the argument `inversion` that, when it is set to FALSE,  haplotype-genotypes are called instead of inversion-genotuypes. This is useful for inversions that have more than one haplotype per inversion status:

```{r}
## Inversion classification
table(classification(res))
## Haplotype classification
table(classification(res, inversion = FALSE))
```

## Imputed data

When SNPs data are imputed, we obtain three different types of results: the best-guess, the dosage and the posterior probabilities. By default, `scoreInvHap` uses the best-guess when computing the similarity scores. However, we can also use posterior probabilities setting the argument `probs` to TRUE:

```{r, classify imputed}
res_imp <- scoreInvHap(SNPlist = vcf, inv = "inv7_005", probs = TRUE)
res_imp
```

In this case, the samples were identically classified in both cases:

```{r, compare classifications}
table(PostProbs = classification(res_imp), 
      BestGuess = classification(res))
```

# Other features

There are two additional parameters of `scoreInvHap` than can reduce computing time: `R2` and `BPPARAM`. `R2` indicates the minimum R^2^ that a SNP should have with the inversion to be included in the score. The less number of SNPs the less time it takes. By default, all SNPs are included in the computation. On the other hand, `BPPARAM` requires an instance of `BiocParallelParam`, which allows to parallelize the computation of the score. You can find more information about this class in its help page (`?bpparam`) and in the `r Biocpkg("BiocParallel")` vignette. 


# Inversions included in scoreInvHap

The following table describes the inversion includes in `scoreInvHap`:


| scoreInvHap Label | Locus | Length (Kb) | Num. Haplos | 
|---|--:|--:|--:|
| inv1_004 | 1p22.1 | 0.77 | 2 |
| inv1_008 | 1q31.1 | 1.2  | 2 |
| inv2_002 | 2p22.3 | 0.72 | 2 |
| inv2_013 | 2q22.1 | 4.25 | 2 |
| inv3_003 | 3q26.1 | 2.28 | 4 (3/1) |
| inv6_002 | 6p21.33 | 0.87 | 2 |
| inv6_006 | 6q23.1 | 4.13 | 2 |
| inv7_003 | 7p14.3 | 5.25 | 2 |
| inv7_005 | 7p11.2 | 73.9 | 4 (2/2) |
| inv7_011 | 7q11.22 | 12.7 | 2 |
| inv7_014 | 7q36.1 | 2.08 | 2 |
| inv8_001 | 8p23.1 | 3925 | 2 |
| inv11_001 | 11p12 | 4.75 | 2 |
| inv11_004 | 11q13.2 | 1.38 | 3 (2/1) |
| inv12_004 | 12q21.2 | 19.3 | 2 |
| inv12_006 | 12q21.2 | 1.03 | 3 (2/1) |
| inv14_005 | 14q23.3 | 0.86 | 2 |
| inv16_009 | 16p11.2 | 364.2 | 2 |
| inv17_007 | 17q21.31 | 711 | 2 |
| inv21_005 | 21q21.3 | 1.06 | 4 (3/1) |
| invX_006 | Xq13.2 | 90.8 | 4 (3/1) |

: (\#tab:table) Inversions included in scoreInvHap. Length: inversion length in Kb in hg19. Num. Haplos: Number of haplotypes supported by the inversion. In parenthesis, number of standard and inverted haplotypes for inversions with more than two haplotypes. 


This information is also available in `inversionGR`:
```{r}
data(inversionGR)
inversionGR
```


# sessionInfo

```{r}
sessionInfo()
```