--- title: "VariantExperiment: A RangedSummarizedExperiment Container for VCF/GDS Data with GDS Backend" author: - name: Qian Liu affiliation: Roswell Park Comprehensive Cancer Center, Buffalo, NY - name: Martin Morgan affiliation: Roswell Park Comprehensive Cancer Center, Buffalo, NY date: "last edit: 08/14/2019" output: BiocStyle::html_document: toc: true toc_float: true package: VariantExperiment vignette: | %\VignetteIndexEntry{VariantExperiment-class} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r options, eval=TRUE, echo=FALSE} options(showHeadLines=3) options(showTailLines=3) ``` # User instruction This package includes 2 vignettes. This One is about the class definitions, basic methods and operations of `VariantExperiment` that are defined or directly inherited from `SummarizedExperiment`. For users who want to apply `VariantExperiment` directly to their analysis high-throughput genotyping or DNA-seq data sets, please refer to the other vignette. # Introduction As the sequencing data (DNA-seq, RNA-seq, single cell RNA-seq...) gets increasingly larger-profile, the memory space in R has been an obstable for fast and efficient data processing, because most available _R_ or _Bioconductor_ packages are developed based on in-memory data manipulation. [SingleCellExperiment][] has achieved efficient on-disk saving/reading of the large-scale count data as [HDF5Array][] objects. However, there was still no such light-weight containers available for high-throughput DNA-seq / genotyping data. We have developed [VariantExperiment][], a _Bioconductor_ package for saving the `VCF/GDS` format data into `RangedSummarizedExperiment` object. The high-throughput genetic/genomic data are saved in [GDSArray][] objects, a direct extension of [DelayedArray][] with GDS back-end. In addition to the light-weight `Assay` data, We have also realized the on-disk saving of annotation data for both features/samples (corresponding to `rowData/colData`) by implementing the [DelayedDataFrame][] data structure. The on-disk representation of both assay data and annotation data realizes on-disk reading and processing and saves _R_ memory space significantly. The interface of `RangedSummarizedExperiment` data format enables easy and common manipulations for high-throughput genetic/genomic data with common [SummarizedExperiment][] metaphor in _R_ and _Bioconductor_. [VariantExperiment]: https://bioconductor.org/packages/VariantExperiment [SummarizedExperiment]: https://bioconductor.org/packages/SummarizedExperiment [SingleCellExperiment]: https://bioconductor.org/packages/SingleCellExperiment [DelayedArray]: https://bioconductor.org/packages/DelayedArray [HDF5Array]: https://bioconductor.org/packages/HDF5Array [GDSArray]: https://bioconductor.org/packages/GDSArray [DelayedDataFrame]: http://bioconductor.org/packages/DelayedDataFrame # Installation 1. Download the package from Bioconductor: ```{r getPackage, eval=FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("VariantExperiment") ``` Or install the development version of the package from Github. ```{r, eval = FALSE} BiocManager::install(“Bioconductor/VariantExperiment”) ``` 2. Load the package into R session. ```{r Load, message=FALSE} library(VariantExperiment) ``` # Background ## GDSArray [GDSArray][] is a _Bioconductor_ package that represents `GDS` files as objects derived from the [DelayedArray][] package and `DelayedArray` class. It converts `GDS` nodes into a `DelayedArray`-derived data structure. The rich common methods and data operations defined on `GDSArray` makes it more _R_-user-friendly than working with the GDS file directly. The array data from GDS files are always returned with the first dimension being features (genes/variants/snps) and the second dimension being `samples`. This feature is consistent with the assay data saved in `SummarizedExperiment`, and makes the `GDSArray` package more interoperable with other established _Bioconductor_ data infrastructure. The `GDSArray()` constructor takes 2 arguments: the file path and the GDS node name inside the GDS file. ```{r, GDSArray} file <- SeqArray::seqExampleFileName("gds") #> GDSArray(file, "genotype") #> deactivate temporarily 3/5/20 GDSArray(file, "sample.id") ``` More details about `GDS` or `GDSArray` format can be found in the vignettes of the [gdsfmt][], [SNPRelate][], [SeqArray][], [GDSArray][] and [DelayedArray][] packages. [gdsfmt]: https://bioconductor.org/packages/gdsfmt [SNPRelate]: https://bioconductor.org/packages/SNPRelate [SeqArray]: https://bioconductor.org/packages/SeqArray ## DelayedDataFrame [DelayedDataFrame][] is a _Bioconductor_ package that implements delayed operations on `DataFrame` objects using standard `DataFrame` metaphor. Each column of data inside `DelayedDataFrame` is represented as 1-dimensional `GDSArray` with on-disk GDS file. Methods like `show`,`validity check`, `[`, `[[` subsetting, `rbind`, `cbind` are implemented for `DelayedDataFrame`. The `DelayedDataFrame` stays lazy until an explicit realization call like `DataFrame()` constructor or `as.list()` incurred. More details about [DelayedDataFrame][] data structure could be found in the vignette of [DelayedDataFrame][] package. # `VariantExperiment` class ## `VariantExperiment` class `VariantExperiment` class is defined to extend `RangedSummarizedExperiment`. The difference would be that the assay data are saved as `GDSArray`, and the annotation data are saved by default as `DelayedDataFrame` (with option to save as ordinary `DataFrame`), both of which are representing the data on-disk with `GDS` back-end. There are coercion methods defined for `VCF` and `GDS` files into `VariantExperiment` objects (Check the method vignette for more details). Here we take an example for illustration. ```{r makeVariantExperimentFromGDS, eval=FALSE} ve <- makeVariantExperimentFromGDS(gds) ve ``` ## slot accessors assay data are in `GDSArray` format, and could be retrieve by the `assays()/assay()` function. ```{r makeVariantExperimentFromGDS2, eval=FALSE} assays(ve) assay(ve, 1) ``` The `rowData()` of the `VariantExperiment` is saved in `DelayedDataFrame` format. We can use `rowRanges()` / `rowData()` to retrieve the feature-related annotation file, with/without a GenomicRange format. ```{r rrrd, eval=FALSE} rowRanges(ve) rowData(ve) ``` sample-related annotation is in `DelayedDataFrame` format, and could be retrieved by `colData()`. ```{r colData, eval=FALSE} colData(ve) ``` The `gdsfile()` will retrieve the gds file path associated with the `VariantExperiment` object. ```{r gdsfile, eval=FALSE} gdsfile(ve) ``` Some other getter function like `metadata()` will return any metadata that we have saved inside the `VariantExperiment` object. ```{r metaData, eval=FALSE} metadata(ve) ``` # Subsetting methods The `VariantExperiment` object supports subsetting methods with `[`, `$` and `subsetByOverlap` etc. as in `SummarizedExperiment`. ## two-dimensional subsetting ```{r, 2d, eval=FALSE} ve[1:10, 1:5] ``` ## `$` subsetting The `$` subsetting could be operated directly on `colData()` columns, for easy sample extraction. Note that the `colData/rowData` are in the `DelayedDataFrame` format, with each column saved as `GDSArray`. So when doing subsetting, we need to use `as.logical()` to convert the 1-dimensional `GDSArray` into ordinary vector. ```{r colDataExtraction, eval=FALSE} colData(ve) ve[, as.logical(ve$family == "1328")] ``` subsetting by `rowData()` columns. ```{r rowDataExtraction, eval=FALSE} rowData(ve) ve[as.logical(rowData(ve)$REF == "T"),] ``` ## Range-based operations `VariantExperiment` objects support all of the `findOverlaps()` methods and associated functions. This includes `subsetByOverlaps()`, which makes it easy to subset a `VariantExperiment` object by an interval. ```{r overlap, eval=FALSE} ve1 <- subsetByOverlaps(ve, GRanges("22:1-48958933")) ve1 ``` # Save / load `VariantExperiment` object Note that after the subsetting by `[`, `$` or `Ranged-based operations`, and you feel satisfied with the data for downstream analysis, you need to save that `VariantExperiment` object to synchronize the gds file (on-disk) associated with the subset of data (in-memory representation) before any statistical analysis. Otherwise, an error will be returned. For example, after we subset the `ve` by `GRanges("22:1-48958933")`, and we want to calculate the hwe based on the 23 variants, an error will be generated indicating that we need to sync the on-disk and in-memory representations. ```{r saveLoad, eval=FALSE} hwe(ve1) ## Error in .saveGDSMaybe(gdsfile) : use ## 'saveVariantExperiment()' to synchronize on-disk and ## in-memory representations ``` ## save `VariantExperiment` object Use the function `saveVariantExperiment` to synchronize the on-disk and in-memory representation. This function writes the processed data as `ve.gds`, and save the _R_ object (which lazily represent the backend data set) as `ve.rds` under the specified directory. It finally returns a new `VariantExperiment` object into current R session generated from the newly saved data. ```{r saveVE, eval=FALSE} a <- tempfile() ve2 <- saveVariantExperiment(ve1, dir=a, replace=TRUE) ``` ## load `VariantExperiment` object You can alternatively use `loadVariantExperiment` to load the synchronized data into R session, by providing only the file directory. It reads the `VariantExperiment` object saved as `ve.rds`, as lazy representation of the backend `ve.gds` file under the specific directory. ```{r loadVE, eval=FALSE} ve3 <- loadVariantExperiment(dir=a) gdsfile(ve3) all.equal(ve2, ve3) ``` Now we are all set for any downstream analysis as needed. ```{r newVEstats, eval=FALSE} head(hwe(ve2)) ``` # Session Info ```{r sessionInfo} sessionInfo() ```