---
title: "Genetic distance calculation from genotype shifts of markers"
output: BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{Get-Started-With-comapr}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
  
author: Ruqian Lyu
date: "`r format(Sys.time(), '%d %B, %Y')`"
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(comapr)
library(GenomicRanges)
library(BiocParallel)
```

# Installation

Install via BiocManager as follow:

```{r install, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("comapr")
```

# Introduction

`comapr` lets you interrogate the genotyping results for a sequence of markers 
across the chromosome and detect meiotic crossovers from genotype shifts. In 
this document, we demonstrate how genetic distances are calculated from genotyping 
results for a group of samples using functions available from `comapr`.


# The demo data set of genotyping results

The package includes a small data object that contains 70 markers and their genotyping
results for 22 samples. The 22 samples are the BC1F1 progenies generated by 

- Generating F1 hybrid mice by crossing an inbred C57BL/6J mouse and an inbred FVB/NJ mouse 
- Crossing F1 hybrid mice with inbred FVB/NJ

Therefore, the genotype shifts detected in BC1F1 samples represent crossovers that
have happened during meiosis for the F1 parents.

```{r}
BiocParallel::register(SerialParam())
BiocParallel::bpparam()
```

```{r}
data(snp_geno_gr)
data(parents_geno)
```

## Format the genotype

In order to detect crossovers from the 22 samples' genotype results (BC1F1 
samples), we need to format the result into `Homo_ref`, `Homo_alt` and `Het`
encodings. This can be simply done through comparing with the parents' genotypes.

Here, the genotype noted as "Fail" will be converted to NA. Please also note that
we supplied "Homo_ref" as one of the fail options because "Homo_ref" is not in
the possible genotypes of markers in BC1F1 samples.


```{r}
corrected_geno <- correctGT(gt_matrix = mcols(snp_geno_gr),
                            ref = parents_geno$ref,
                            alt = parents_geno$alt,
                            fail = "Fail",
                            wrong_label = "Homo_ref")

mcols(snp_geno_gr) <- corrected_geno
```

The `corrected_geno` matrix 

```{r}
head(mcols(snp_geno_gr)[,1:5])

```

Note that there are missing values in this resulting matrix that can be resulted
from:

- No genotype data from the genotyping result
- The genotype is wrong and it is converted to `NA`

## Find outlier samples/markers

In this step, we try to identify markers that have `NA` genotype across many 
samples or samples that have a lot markers failed for removal. We use the 
`countGT` function for find bad markers/samples.

```{r}
genotype_counts <- countGT(mcols(snp_geno_gr))

genotype_counts$plot
```

The number of markes and samples are saved in a list returned by `countGT`

```{r}
genotype_counts$n_markers
genotype_counts$n_samples
```

We now filter out markers/samples by using function `filterGT`. `min_markers` 
specifies at least how many markers a sample needs to be kept. Likewise for 
`min_samples`.

A printed message contains information about how much markers or samples have 
been filtered.

```{r}
corrected_geno <- filterGT(snp_geno_gr,
                           min_markers = 30,
                           min_samples = 2)
```

## Find sample duplicates

Sample duplicates are identified by finding samples that share exactly same genotypes
across all available markers. `findDupSamples` can be applied and a threshold 
value is provided and used as a cut-off on the percentage of same genotype markers
the duplicated samples should share.

```{r}
dups <- findDupSamples(mcols(corrected_geno),
                       threshold = 0.99)
dups
```

Now we remove the duplicated samples.

```{r}
mcols(corrected_geno) <- mcols(corrected_geno)[,colnames(mcols(corrected_geno))!="X98"]
#corrected_geno
```


## Count crossovers

Crossovers are detected and counted through examining the patterns of genotypes 
along the chromosome. When there is a shift from one genotype block to another,
a crossover is observed. This is done through calling `countCOs` function which
returns a `GRange` object with crossover counts for the list of marker intervals.

The crossover count values in the columns can be non-integer when one observed
crossover can not be determined to be completely distributed to the marker interval
in the corresponding row. The observed crossover is then distributed to the adjacent
intervals proportionally to their interval base pair sizes.

```{r}
marker_gr_cos <- countCOs(corrected_geno)

marker_gr_cos[1:5,1:5]
```

## Calculate genetic distance 

The genetic distances of marker intervals are calcuated based on the crossover 
rates via applying mapping function, `Kosambi` or `Haldane` by calling the
`calGeneticDist` function. The returned genetic distances are in unit of
centiMorgan.

```{r}
dist_gr <- calGeneticDist(marker_gr_cos,
                          mapping_fun = "k")

dist_gr[1:5,]
```

## Calculate genetic distance for equally binned intervals 

Alternatively, instead of returning the genetic distances in supplied marker 
intervals, we can specify a `bin_size` which tells `calGeneticDist` to return 
calculated genetic distances for equally sized chromosome bins.

```{r}
dist_bin_gr <- calGeneticDist(marker_gr_cos,bin_size = 1e6)

dist_bin_gr[1:5,]
```


## Total genetic distances

With genetic distances calculated, we can do a sum of all genetic distances across
all marker intervals. We can see that we got the same total genetic distances
for marker based intervals and the equally binned intervals.

```{r}
sum(dist_bin_gr$kosambi_cm)
sum(dist_gr$kosambi_cm)

```

## Plotting genetic distance for each bin

`comapr` also includes functions for visulising genetic distances of marker 
intervals or binned intervals.


```{r}
plotGeneticDist(dist_bin_gr,chr = "1")
```


## Plot cumulative distances

We can also plot the cumulative genetic distances of certain chromosomes

```{r}
plotGeneticDist(dist_bin_gr,chr=c("1"),cumulative = TRUE)

```

Multiple chromosomes

```{r,fig.height=10,fig.width=8}
plotGeneticDist(dist_bin_gr,cumulative = TRUE)
```

## Whole Genome Genetic distances Plot

`comapr` implements a whole genome plot function too, that takes all chromosomes
available in the result and plot a cumulative genetic distances by cumulatively 
summing all intervals across all chromosomes.


```{r}
plotWholeGenome(dist_bin_gr)
```


# Sessioninfo

```{r sessioninfo}
sessionInfo()
```