---
title: "VariantExperiment methods"
author: 
- name: Qian Liu
  affiliation: Roswell Park Comprehensive Cancer Center, Buffalo, NY
- name: Martin Morgan
  affiliation: Roswell Park Comprehensive Cancer Center, Buffalo, NY
date: "last edit: 08/14/2019"
output:
    BiocStyle::html_document:
        toc: true
        toc_float: true
package: VariantExperiment
vignette: |
    %\VignetteIndexEntry{VariantExperiment-methods}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---

```{r options, eval=TRUE, echo=FALSE}
options(showHeadLines=3)
options(showTailLines=3)
```

# User instruction

This vignette is about the conversion methods and statistical
functions that are enabled on `VariantExperiment` objects, for the
analysis of genotyping or DNA-seq data sets. If you want to learn more
about the implementation of the `VariantExperiment` class, and basic
methods, please refer to the other vignette. 

# Introduction

The package of `VariantExperiment` is implemented to represent VCF/GDS
files using standard SummarizedExperiment metaphor. It is a container
for high-through genetic/genomic data with GDS back-end, and is
interoperable with the statistical functions/methods that are
implemented in `SeqArray` and `SeqVarTools` that are designed for GDS
data. The `SummarizedExperiment` metaphor also gets the benefit of
common manipulations within _Bioconductor_ ecosystem that are more
user-friendly.

First, we load the package into R session.
```{r Load, message=FALSE}
library(VariantExperiment)
```

# Coercion methods

To take advantage of the functions and methods that are defined on
`SummarizedExperiment`, from which the `VariantExperiment` extends, we
have defined coercion methods from `VCF` and `GDS` to
`VariantExperiment`.

## From `VCF` to `VariantExperiment`

The coercion function of `makeVariantExperimentFromVCF` could
convert the `VCF` file directly into `VariantExperiment` object. To
achieve the best storage efficiency, the assay data are saved in
`GDSArray` format, and the annotation data are saved in
`DelayedDataFrame` format (with no option of ordinary `DataFrame`),
which could be retrieved by `rowData()` for feature related
annotations and `colData()` for sample related annotations (Only when
`sample.info` argument is specified). 

```{r makeVariantExperimentFromVCF}
vcf <- SeqArray::seqExampleFileName("vcf")
ve <- makeVariantExperimentFromVCF(vcf)
ve
```

Internally, the `VCF` file was converted into a on-disk `GDS` file,
which could be retrieved by:
```{r retrieve GDS}
gdsfile(ve)
```

assay data is in `GDSArray` format:
```{r makeVariantExperimentFromVCF2}
assay(ve, 1)
```

feature-related annotation is in `DelayedDataFrame` format:
```{r makeVariantExperimentFromVCF3}
rowData(ve)
``` 

User could also have the opportunity to save the sample related
annotation info directly into the `VariantExperiment` object, by
providing the file path to the `sample.info` argument, and then
retrieve by `colData()`.

```{r sampleInfo}
sampleInfo <- system.file("extdata", "Example_sampleInfo.txt",
                          package="VariantExperiment")
ve <- makeVariantExperimentFromVCF(vcf, sample.info = sampleInfo)
colData(ve)
```

Arguments could be specified to take only certain info columns or format
columns from the vcf file. 

```{r makeVariantExperimentFromVCFArgs}
ve1 <- makeVariantExperimentFromVCF(vcf, info.import=c("OR", "GP"))
rowData(ve1)
```
In the above example, only 2 info entries ("OR" and "GP") are read
into the `VariantExperiment` object.

The `start` and `count` arguments could be used to specify the start
position and number of variants to read into `Variantexperiment`
object.

```{r makeVariantExperimentFromVCFArgs_startCount}
ve2 <- makeVariantExperimentFromVCF(vcf, start=101, count=1000)
ve2
```
For the above example, only 1000 variants are read into the
`VariantExperiment` object, starting from the position of 101. 

## From `GDS` to `VariantExperiment`

The coercion function of `makeVariantExperimentFromGDS` coerces
`GDS` files into `VariantExperiment` objects directly, with the assay
data saved as `GDSArray`, and the `rowData()/colData()` in
`DelayedDataFrame` by default (with the option of ordinary `DataFrame`
object).

```{r makeVariantExperimentFromGDS}
gds <- SeqArray::seqExampleFileName("gds")
ve <- makeVariantExperimentFromGDS(gds)
ve
```
```{r makeVariantExperimentFromGDS2}
rowData(ve)
colData(ve)
```
Arguments could be specified to take only certain annotation columns
for features and samples. All available data entries for
`makeVariantExperimentFromGDS` arguments could be retrieved by the
`showAvailable()` function with the gds file name as input.

```{r showAvailable}
showAvailable(gds)
```

Note that the `infoColumns` from gds file will be saved as columns
inside the `rowData()`, with the prefix of
"info_". `rowDataOnDisk/colDataOnDisk` could be set as `FALSE` to
save all annotation data in ordinary `DataFrame` format.

```{r makeVariantExperimentFromGDSArgs}
ve3 <- makeVariantExperimentFromGDS(gds,
                                    rowDataColumns = c("ID", "ALT", "REF"),
                                    infoColumns = c("AC", "AN", "DP"),
                                    rowDataOnDisk = TRUE,
                                    colDataOnDisk = FALSE)
rowData(ve3)  ## DelayedDataFrame object 
colData(ve3)  ## DataFrame object
```

# Lazy data operations

`VariantExperiment` supports basic subsetting operations using `[`,
`[[`, `$`, and ranged-based subsetting operations using
`subsetByOverlap`. 

**NOTE** that after a set of lazy operations, you need to call
`saveVariantExperiment` function to synchronize the on-disk file
associated with the in-memory representation by providing a file
path. Statistical functions could only work on synchronized
`VariantExperiment` object, or error will return.

Refer to the "VariantExperiment-class" vignette for more details.

# Statistical functions

Many statistical functions and methods are defined on
`VariantExperiment` objects, most of which has their generic defined
in _Bioconductor_ package of `SeqArray` and `SeqVarTools`. These
functions could be called directly on `VariantExperiment` object as
input, with additional arguments to specify based on user's need. More
details please refer to the vignettes of [SeqArray][] and
[SeqVarTools][].

[SeqArray]: https://bioconductor.org/packages/SeqArray
[SeqVarTools]: https://bioconductor.org/packages/SeqVarTools

Here is a list of the statistical functions with brief description:

statistical functions | Description
--------------------- | ------------
seqAlleleFreq         | Calculates the allele frequencies
seqAlleleCount        | Calculates the allele counts 
seqMissing            | Calculates the missing rate for variant/sample
seqNumAllele          | Calculates the number of alleles (for ref/alt allele)
hwe                   | Exact test for Hardy-Weinberg equilibrium on Single-Nucleotide Variants
inbreedCoeff          | Calculates the inbreeding coefficient by variant/sample
pca                   | Calculates the eigenvalues and eignevectors with Principal Component Analysis
titv                  | Calculate transition/transversion ratio overall or by sample
refDosage             | Calculate the dosage of reference allele (matrix with integers of 0/1/2)
altDosage             | Calculate the dosage of alternative allele (matrix with integers of 0/1/2)
countSingletons       | Count singleton variants for each sample
heterozygosity        | Calculate heterozygosity rate by sample or by variants
homozygosity          | Calculate homozygosity rate by sample or by variants
meanBySample          | Calculate the mean value of a variable by sample over all variants
isSNV                 | Flag a single nucleotide variant 
isVariant             | Locate which samples are variant for each site

Here are some examples in calculating the sample missing rate, hwe, titv ratio and the count of singletons for each sample. 
```{r stats}
## sample missing rate
mr.samp <- seqMissing(ve, per.variant = FALSE)
head(mr.samp)
## hwe
hwe <- hwe(ve)
head(hwe)
## titv ratio by sample / overall
titv <- titv(ve, by.sample=TRUE)
head(titv)
titv(ve, by.sample=FALSE)
## countSingletons
countSingletons(ve)
```

As we have noted in the other vignette, after the subsetting by
`[`, `$` or `Ranged-based operations`, we need to save the new
`VariantExperiment` object to synchronize the gds file (on-disk)
associated with the subset of data (in-memory representation) before
any statistical analysis. Otherwise, an error will be returned.

# Future work
As a feature addition, we want to add the option of `VCFArray` in
saving the `assay` data in the step of
`makeVariantExperimentFromVCF`. We also seek to implement the
[SQLDataFrame][] in representation of the annotation data. We also
plan to connect _Bioconductor_ package [VariantAnnotation][] to
implement the variant filtering and annotation functions based on
`VariantExperiment` format, and with that, to develop a pipeline for
using `VariantExperiment` object as the basic data structure for
DNA-sequencing data analysis.

[SQLDataFrame]: https://github.com/Bioconductor/SQLDataFrame
[VariantAnnotation]: http://bioconductor.org/packages/VariantAnnotation

# Session Info
```{r sessionInfo}
sessionInfo()
```