---
title: "VariantExperiment: A RangedSummarizedExperiment Container for Large-Scale Variant Data with GDS Backend"
valauthor: 
- name: Qian Liu
  affiliation: Roswell Park Comprehensive Cancer Center, Buffalo, NY
- name: Martin Morgan
  affiliation: Roswell Park Comprehensive Cancer Center, Buffalo, NY
date: "last edit: 04/10/2024"
output:
    BiocStyle::html_document:
        toc: true
        toc_float: true
package: VariantExperiment
vignette: |
    %\VignetteIndexEntry{VariantExperiment-class}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---

```{r options, eval=TRUE, echo=FALSE}
options(showHeadLines=3)
options(showTailLines=3)
```

<!-- # User instruction -->

<!-- This package includes 2 vignettes. This One is about the class -->
<!-- definitions, basic methods and operations of `VariantExperiment` that -->
<!-- are defined or directly inherited from `SummarizedExperiment`. For -->
<!-- users who want to apply `VariantExperiment` directly to their analysis -->
<!-- high-throughput genotyping or DNA-seq data sets, please refer to the -->
<!-- other vignette. -->

# Introduction 

With the rapid development of the biotechnologies, the sequencing
(e.g., DNA, bulk/single-cell RNA, etc.) and other types of biological
data are getting increasingly larger-profile. The memory space in R
has been an obstable for fast and efficient data processing, because
most available _R_ or _Bioconductor_ packages are developed based on
in-memory data manipulation. [SingleCellExperiment][] has achieved
efficient on-disk saving/reading of the large-scale count data as
[HDF5Array][] objects. However, there was still no such light-weight
containers available for high-throughput variant data (e.g., DNA-seq,
genotyping, etc.).

We have developed [VariantExperiment][], a _Bioconductor_ package to
contain variant data into `RangedSummarizedExperiment` object. The
package converts and represent VCF/GDS files using standard
`SummarizedExperiment` metaphor. It is a container for high-through
variant data with GDS back-end. 

In `VariantExperiment`, The high-throughput variant data is saved in
[DelayedArray][] objects with GDS back-end. In addition to the
light-weight `Assay` data, it also supports the on-disk saving of
annotation data for both features and samples (corresponding to
`rowData/colData` respectively) by implementing the
[DelayedDataFrame][] data structure. The on-disk representation of
both assay data and annotation data realizes on-disk reading and
processing and saves _R_ memory space significantly. The interface of
`RangedSummarizedExperiment` data format enables easy and common
manipulations for high-throughput variant data with common
[SummarizedExperiment][] metaphor in _R_ and _Bioconductor_.

[VariantExperiment]: https://bioconductor.org/packages/VariantExperiment 
[SummarizedExperiment]: https://bioconductor.org/packages/SummarizedExperiment
[SingleCellExperiment]: https://bioconductor.org/packages/SingleCellExperiment
[DelayedArray]: https://bioconductor.org/packages/DelayedArray
[HDF5Array]: https://bioconductor.org/packages/HDF5Array
[GDSArray]: https://bioconductor.org/packages/GDSArray
[DelayedDataFrame]: http://bioconductor.org/packages/DelayedDataFrame

# Installation

1. Download the package from Bioconductor: 

```{r getPackage, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("VariantExperiment")
```
Or install the development version of the package from Github.
```{r, eval = FALSE}
BiocManager::install("Bioconductor/VariantExperiment") 
``` 

2. Load the package into R session.
```{r Load, message=FALSE}
library(VariantExperiment)
```

# Background

## GDSArray

[GDSArray][] is a _Bioconductor_ package that represents `GDS` files as
objects derived from the [DelayedArray][] package and `DelayedArray`
class. It converts `GDS` nodes into a `DelayedArray`-derived data
structure. The rich common methods and data operations defined on
`GDSArray` makes it more _R_-user-friendly than working with the GDS
file directly. 

<!-- The array data from GDS files are always returned with -->
<!-- the first dimension being features (genes/variants/snps) and the -->
<!-- second dimension being `samples`. This feature is consistent with the -->
<!-- assay data saved in `SummarizedExperiment`, and makes the `GDSArray` -->
<!-- package more interoperable with other established _Bioconductor_ data -->
<!-- infrastructure.    -->

The `GDSArray()` constructor takes 2 arguments: the file path and the
GDS node name (which can be retrieved with the `gdsnodes()` function)
inside the GDS file.

```{r, GDSArray}
library(GDSArray)
file <- GDSArray::gdsExampleFileName("seqgds")
gdsnodes(file)
GDSArray(file, "genotype/data")
GDSArray(file, "sample.id")
```
More details about `GDS` or `GDSArray` format can be found in the
vignettes of the [gdsfmt][], [SNPRelate][], [SeqArray][], [GDSArray][]
and [DelayedArray][] packages.

[gdsfmt]: https://bioconductor.org/packages/gdsfmt
[SNPRelate]: https://bioconductor.org/packages/SNPRelate
[SeqArray]: https://bioconductor.org/packages/SeqArray

## DelayedDataFrame

[DelayedDataFrame][] is a _Bioconductor_ package that implements
delayed operations on `DataFrame` objects using standard `DataFrame`
metaphor. Each column of data inside `DelayedDataFrame` is represented
as 1-dimensional `GDSArray` with on-disk GDS file. Methods like
`show`,`validity check`, `[`, `[[` subsetting, `rbind`, `cbind` are
implemented for `DelayedDataFrame`.  The `DelayedDataFrame` stays lazy
until an explicit realization call like `DataFrame()` constructor or
`as.list()` triggered. More details about [DelayedDataFrame][] data
structure could be found in the vignette of [DelayedDataFrame][]
package.

# `VariantExperiment` class

## `VariantExperiment` class

`VariantExperiment` class is defined to extend
`RangedSummarizedExperiment`. The difference would be that the assay
data are saved as `DelayedArray`, and the annotation data are saved by
default as `DelayedDataFrame` (with option to save as ordinary
`DataFrame`), both of which are representing the data on-disk with
`GDS` back-end. 


Conversion methods into `VariantExperiment` object are
defined directly for `VCF` and `GDS` files. <!-- (Check the method
vignette --> <!-- for more details). --> Here we show one simple
example to convert a DNA-sequencing data in GDS format into
`VariantExperiment` and some class-related operations.

```{r makeVariantExperimentFromGDS}
ve <- makeVariantExperimentFromGDS(file)
ve
```

In this example, the sequencing file in GDS format was converted into
a `VariantExperiment` object, with all available assay data saved into
the `assay` slot, all available feature annotation nodes into
`rowRanges/rowData` slot, and all available sample annotation nodes
into `colData` slot. The available values for each arguments in
`makeVariantExperimentFromGDS()` function can be retrieved using the
`showAvailable()` function.

```{r showAvailable}
args(makeVariantExperimentFromGDS)
showAvailable(file)
```

## slot accessors

Assay data are in `GDSArray` format, and could be retrieve by the
`assays()/assay()` function. **NOTE** that when converted into a
`VariantExperiment` object, the assay data will be checked and
permuted, so that the first 2 dimensions always match to features
(variants/snps) and samples respectively, no matter how are the
dimensions are with the original GDSArray that can be constructed.

```{r makeVariantExperimentFromGDS2}
assays(ve)
assay(ve, 1)
GDSArray(file, "genotype/data")  ## original GDSArray from GDS file before permutation
```

In this example, the original `GDSArray` object from genotype data was
`2 x 90 x 1348`. But it was permuted to `1348 x 90 x 2` when
constructed into the `VariantExperiment` object.


The `rowData()` of the `VariantExperiment` is by default saved in
`DelayedDataFrame` format. We can use `rowRanges()` / `rowData()` to
retrieve the feature-related annotation file, with/without a
GenomicRange format.

```{r rrrd}
rowRanges(ve)
rowData(ve)
```

sample-related annotation is by default in `DelayedDataFrame` format,
and could be retrieved by `colData()`.

```{r colData}
colData(ve)
``` 

The `gdsfile()` will retrieve the gds file path associated with the
`VariantExperiment` object.
```{r gdsfile}
gdsfile(ve)
```

Some other getter function like `metadata()` will return any metadata
that we have saved inside the `VariantExperiment` object.
```{r metaData}
metadata(ve)
``` 
# Coercion methods

To take advantage of the functions and methods that are defined on
`SummarizedExperiment`, from which the `VariantExperiment` extends, we
have defined coercion methods from `VCF` and `GDS` to
`VariantExperiment`.

## From `VCF` to `VariantExperiment`

The coercion function of `makeVariantExperimentFromVCF` could
convert the `VCF` file directly into `VariantExperiment` object. To
achieve the best storage efficiency, the assay data are saved in
`DelayedArray` format, and the annotation data are saved in
`DelayedDataFrame` format (with no option of ordinary `DataFrame`),
which could be retrieved by `rowData()` for feature related
annotations and `colData()` for sample related annotations (Only when
`sample.info` argument is specified). 

```{r makeVariantExperimentFromVCF}
vcf <- SeqArray::seqExampleFileName("vcf")
ve <- makeVariantExperimentFromVCF(vcf, out.dir = tempfile())
ve
```

Internally, the `VCF` file was converted into a on-disk `GDS` file,
which could be retrieved by:
```{r retrieve GDS}
gdsfile(ve)
```

assay data is in `DelayedArray` format:
```{r makeVariantExperimentFromVCF2}
assay(ve, 1)
```

feature-related annotation is in `DelayedDataFrame` format:
```{r makeVariantExperimentFromVCF3}
rowData(ve)
``` 

User could also have the opportunity to save the sample related
annotation info directly into the `VariantExperiment` object, by
providing the file path to the `sample.info` argument, and then
retrieve by `colData()`.

```{r sampleInfo}
sampleInfo <- system.file("extdata", "Example_sampleInfo.txt",
                          package="VariantExperiment")
vevcf <- makeVariantExperimentFromVCF(vcf, sample.info = sampleInfo)
colData(vevcf)
```

Arguments could be specified to take only certain info columns or format
columns from the vcf file. 

```{r makeVariantExperimentFromVCFArgs}
vevcf1 <- makeVariantExperimentFromVCF(vcf, info.import=c("OR", "GP"))
rowData(vevcf1)
```

In the above example, only 2 info entries ("OR" and "GP") are read
into the `VariantExperiment` object.

The `start` and `count` arguments could be used to specify the start
position and number of variants to read into `Variantexperiment`
object.

```{r makeVariantExperimentFromVCFArgs_startCount}
vevcf2 <- makeVariantExperimentFromVCF(vcf, start=101, count=1000)
vevcf2
```
For the above example, only 1000 variants are read into the
`VariantExperiment` object, starting from the position of 101. 

## From `GDS` to `VariantExperiment`

The coercion function of `makeVariantExperimentFromGDS` coerces `GDS`
files into `VariantExperiment` objects directly, with the assay data
saved as `DelayedArray`, and the `rowData()/colData()` in
`DelayedDataFrame` by default (with the option of ordinary `DataFrame`
object).

```{r makeVariantExperimentFromGDS_seq}
gds <- SeqArray::seqExampleFileName("gds")
ve <- makeVariantExperimentFromGDS(gds)
ve
```

Arguments could be specified to take only certain annotation columns
for features and samples. All available data entries for
`makeVariantExperimentFromGDS` arguments could be retrieved by the
`showAvailable()` function with the gds file name as input.

```{r showAvailable1}
showAvailable(gds)
```

Note that the `infoColumns` from gds file will be saved as columns
inside the `rowData()`, with the prefix of
"info.". `rowDataOnDisk/colDataOnDisk` could be set as `FALSE` to
save all annotation data in ordinary `DataFrame` format.

```{r makeVariantExperimentFromGDSArgs}
ve3 <- makeVariantExperimentFromGDS(gds,
                                    rowDataColumns = c("allele", "annotation/id"),
                                    infoColumns = c("AC", "AN", "DP"),
                                    rowDataOnDisk = TRUE,
                                    colDataOnDisk = FALSE)
rowData(ve3)  ## DelayedDataFrame object 
colData(ve3)  ## DataFrame object
```

## customization for certain gds types

For GDS formats of `SEQ_ARRAY` (defined in [SeqArray][] as
`SeqVarGDSClass` class) and `SNP_ARRAY` (defined in [SNPRelate][] as
`SNPGDSFileClass` class), we have made some customized transfer of
certain nodes when reading into `VariantExperiment` object for users'
convenience.

The `allele` node in `SEQ_ARRAY` gds file is converted into 2 columns
in `rowData()` asn `REF` and `ALT`. 

```{r}
veseq <- makeVariantExperimentFromGDS(file,
                                      rowDataColumns = c("allele"),
                                      infoColumns = character(0))
rowData(veseq)
```
The `snp.allele` node in `SNP_ARRAY` gds file was converted into 2
columns in `rowData()` as `snp.allele1` and `snp.allele2`.

```{r makeVariantExperimentFromGDS_snp}
snpfile <- SNPRelate::snpgdsExampleFileName()
vesnp <- makeVariantExperimentFromGDS(snpfile,
                                      rowDataColumns = c("snp.allele"))
rowData(vesnp)
```
# Subsetting methods

`VariantExperiment` supports basic subsetting operations using `[`,
`[[`, `$`, and ranged-based subsetting operations using
`subsetByOverlap`. 

## two-dimensional subsetting
```{r, 2d}
ve[1:10, 1:5]
```

## `$` subsetting

The `$` subsetting can be operated directly on `colData()` columns,
for easy sample extraction. **NOTE** that the `colData/rowData` are
(by default) in the `DelayedDataFrame` format, with each column saved
as `GDSArray`. So when doing subsetting, we need to use `as.logical()`
to convert the 1-dimensional `GDSArray` into ordinary vector.

```{r colDataExtraction}
colData(ve)
ve[, as.logical(ve$family == "1328")]
```

subsetting by `rowData()` columns.

```{r rowDataExtraction}
rowData(ve)
ve[as.logical(rowData(ve)$REF == "T"),]
```

## Range-based operations

`VariantExperiment` objects support all of the `findOverlaps()`
methods and associated functions. This includes `subsetByOverlaps()`,
which makes it easy to subset a `VariantExperiment` object by an
interval.

```{r overlap}
ve1 <- subsetByOverlaps(ve, GRanges("22:1-48958933"))
ve1
```

In this example, only 23 out of 1348 variants were retained with the
`GRanges` subsetting.

# Save / load `VariantExperiment` object

Note that after the subsetting by `[`, `$` or `Ranged-based
operations`, and you feel satisfied with the data for downstream
analysis, you need to save that `VariantExperiment` object to
synchronize the gds file (on-disk) associated with the subset of data
(in-memory representation) before any statistical analysis. Otherwise,
an error will be returned.

<!-- For example, after we subset the `ve` by `GRanges("22:1-48958933")`, -->
<!-- and we want to calculate the hwe based on the 23 variants, an error -->
<!-- will be generated indicating that we need to sync the on-disk and -->
<!-- in-memory representations. -->

<!-- ```{r saveLoad, eval=FALSE} -->
<!-- hwe(ve1) -->
<!-- ## Error in .saveGDSMaybe(gdsfile) : use -->
<!-- ##   'saveVariantExperiment()' to synchronize on-disk and -->
<!-- ##   in-memory representations -->
<!-- ``` -->
0
## save `VariantExperiment` object

Use the function `saveVariantExperiment` to synchronize the on-disk
and in-memory representation. This function writes the processed data
as `ve.gds`, and save the _R_ object (which lazily represent the
backend data set) as `ve.rds` under the specified directory. It
finally returns a new `VariantExperiment` object into current R
session generated from the newly saved data.

```{r saveVE}
a <- tempfile()
ve2 <- saveVariantExperiment(ve1, dir=a, replace=TRUE, chunk_size = 30)
```
## load `VariantExperiment` object

You can alternatively use `loadVariantExperiment` to load the
synchronized data into R session, by providing only the file
directory. It reads the `VariantExperiment` object saved as `ve.rds`, as lazy
representation of the backend `ve.gds` file under the specific
directory.

```{r loadVE}
ve3 <- loadVariantExperiment(dir=a)
gdsfile(ve3)
all.equal(ve2, ve3)
```

<!-- Now we are all set for any downstream analysis as needed.  -->
<!-- ```{r newVEstats} -->
<!-- head(hwe(ve2)) -->
<!-- ``` -->

# Session Info

```{r sessionInfo}
sessionInfo()
```