---
title: "GDSArray: Representing GDS files as array-like objects"
author:
- name: Qian Liu
  affiliation: Roswell Park Comprehensive Cancer Center, Buffalo, NY
- name: Hervé Pagès
  affiliation: Fred Hutchinson Cancer Research Center, Seattle, WA
- name: Martin Morgan
  affiliation: Roswell Park Comprehensive Cancer Center, Buffalo, NY
date: "last edit: 3/16/2018"
output:
    BiocStyle::html_document:
        toc: true
        toc_float: true
package: GDSArray
vignette: |
    %\VignetteIndexEntry{GDSArray: Representing GDS files as array-like objects}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---

```{r options, eval=TRUE, echo=FALSE}
options(showHeadLines=3)
options(showTailLines=3)
```

# Introduction

[GDSArray][] is a _Bioconductor_ package that represents GDS files as
objects derived from the `DelayedArray` class of the [DelayedArray][]
package. It converts a GDS node in the file to a `DelayedArray`-derived
data structure. The common methods and data operations defined on
`GDSArray` make it more _R_-user-friendly than working with the GDS file
directly. Array data from GDS files are always returned with the first
dimension being `variants/snps` and the second dimension being
`samples`. This layout is consistent with the assay data saved in
`SummarizedExperiment`, and makes the `GDSArray` package interoperable
with other established _Bioconductor_ data infrastructure.

[GDSArray]: https://bioconductor.org/packages/GDSArray
[DelayedArray]: https://bioconductor.org/packages/DelayedArray

# Package installation

1. Download the package from Bioconductor.

```{r getPackage, eval=FALSE}
## Install BiocManager first if it is not already available.
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("GDSArray")
```

2. Load the package into the R session.
```{r Load, message=FALSE}
library(GDSArray)
```

# GDS format introduction

## Genomic Data Structure (GDS)

The _Bioconductor_ package [gdsfmt][] provides a high-level R interface
to CoreArray Genomic Data Structure (GDS) data files, which are designed
for large-scale datasets, especially data that are much larger than the
available random-access memory.

The GDS format has been widely used in genetic/genomic research for
high-throughput genotyping or sequencing data. Two major classes extend
`gds.class`: `SNPGDSFileClass`, suited for genotyping data (e.g., GWAS),
and `SeqVarGDSClass`, designed specifically for DNA-sequencing data. The
file format attribute of these classes is set to `SNP_ARRAY` and
`SEQ_ARRAY`, respectively. Many functions have been written on top of
these classes for common data operations and statistical analysis.

More details about the GDS format can be found in the vignettes of the
[gdsfmt][], [SNPRelate][], and [SeqArray][] packages.

[gdsfmt]: https://bioconductor.org/packages/gdsfmt
[SNPRelate]: https://bioconductor.org/packages/SNPRelate
[SeqArray]: https://bioconductor.org/packages/SeqArray

# `GDSArray`, `GDSMatrix`, and `GDSFile`

`GDSArray` represents GDS files as `DelayedArray` instances. It defines
methods such as `dim` and `dimnames`, and it inherits array-like
operations and methods from `DelayedArray`, e.g., subsetting with `[`.

## GDSArray, GDSMatrix, and DelayedArray

The `GDSArray()` constructor takes as arguments the file path and the
name of the GDS node inside the GDS file. It always returns an object
whose rows are features (genes / variants / snps) and whose columns are
samples. This is consistent with the assay data inside
`SummarizedExperiment`.
```{r, GDSArray}
## Example GDS file shipped with the SeqArray package.
file <- SeqArray::seqExampleFileName("gds")
GDSArray(file, "genotype")
```

A `GDSMatrix` is a 2-dimensional `GDSArray`, and is returned from the
`GDSArray()` constructor automatically if the input GDS node is
2-dimensional.

```{r, GDSMatrix}
GDSArray(file, "annotation/format/DP")
```

## `GDSFile`

`GDSFile` is a light-weight class that represents GDS files. Its `$`
method supports completion of any available GDS node. It can also be
used as a convenient `GDSArray` constructor: if the `current_path` slot
of the `GDSFile` object represents a valid GDS node, a `GDSArray` is
returned. Otherwise, the `GDSFile` object is returned with an updated
`current_path`.

```{r, GDSFile}
gf <- GDSFile(file)
gf$annotation$info
gf$annotation$info$AC
```

Try typing `gf$ann` and pressing the `tab` key to trigger completion.

## `GDSArray` methods

### Slot accessors

- `seed` returns the `GDSArraySeed` of the `GDSArray` object.

```{r, seedAccessor}
ga <- GDSArray(file, "genotype")
seed(ga)
```

- `gdsfile` returns the file path of the corresponding GDS file.

```{r, gdsfileAccessor}
gdsfile(ga)
```

### Available GDS nodes

`gdsnodes()` takes a GDS file path or a `GDSFile` object as input, and
returns all nodes that can be converted to `GDSArray` instances. The
returned GDS node names can be passed to the `name=` argument of the
`GDSArray()` constructor.

```{r, gdsnodes}
gdsnodes(file)
## The file path and the GDSFile object give the same node names.
identical(gdsnodes(file), gdsnodes(gf))
GDSArray(file, name=gdsnodes(file)[2])
```

### `dim()`, `dimnames()`

`dimnames()` on a `GDSArray` returns an unnamed list in which the length
of each element matches the corresponding value returned by `dim()`.

```{r, dims}
ga <- GDSArray(file, "annotation/format/DP")
dim(ga)
class(dimnames(ga))
lengths(dimnames(ga))
```

### `[` subsetting

`GDSArray` instances can be subset, following the usual _R_
conventions, with numeric or logical vectors; logical vectors are
recycled to the appropriate length.
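
The recycling rule mentioned above is the ordinary base _R_ behaviour, so it can be checked on a small in-memory matrix before trying it on a GDS-backed object. The following is a minimal sketch; the matrix `m` and the names `recycled` and `explicit` are made up for illustration and are not part of the package.

```{r, recycling-sketch}
## Illustrative base R matrix standing in for a 2-dimensional array
## (hypothetical data, not read from a GDS file).
m <- matrix(1:12, nrow = 4, ncol = 3)

## c(TRUE, FALSE) is recycled along the 4 rows to TRUE FALSE TRUE FALSE,
## so every other row, starting with the first, is kept.
recycled <- m[c(TRUE, FALSE), ]

## The same selection written with explicit row indices.
explicit <- m[c(1, 3), ]

identical(recycled, explicit)
```

The same indexing forms applied to a `GDSArray`:
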
```{r, methods}
ga[1:3, 10:15]
## The logical vector is recycled along the first dimension.
ga[c(TRUE, FALSE), ]
```

### Some numeric calculations

```{r, numeric}
dp <- GDSArray(file, "annotation/format/DP")
dp
log(dp)
## Keep the rows whose mean value is below 60.
dp[rowMeans(dp) < 60, ]
```

## Internals: `GDSArraySeed`

The `GDSArraySeed` class represents the 'seed' for the `GDSArray`
object. It is not exported from the [GDSArray][] package. Seed objects
should contain the GDS file path, and are expected to satisfy the "seed
contract", i.e., to support `dim()` and `dimnames()`.

```{r, GDSArraySeed}
## GDSArraySeed is not exported, hence the ':::' operator.
seed <- GDSArray:::GDSArraySeed(file, "genotype")
seed
```

The seed can be used to construct a `GDSArray` instance.

```{r, GDSArray-from-GDSArraySeed}
GDSArray(seed)
```

Calling the `DelayedArray()` constructor with a `GDSArraySeed` object as
argument returns the same content as calling the `GDSArray()`
constructor on the same `GDSArraySeed`.

```{r, da}
class(DelayedArray(seed))
```

# sessionInfo

```{r, sessionInfo}
sessionInfo()
```