GDSArray 1.14.1
GDSArray is a Bioconductor package that represents GDS files as objects derived from the DelayedArray package and DelayedArray class. It converts a GDS node in the file to a DelayedArray-derived data structure. The rich set of common methods and data operations defined on GDSArray makes it more R-user-friendly than working with the GDS file directly.
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("GDSArray")
library(GDSArray)
The Bioconductor package gdsfmt provides a high-level R interface to CoreArray Genomic Data Structure (GDS) data files, which are designed for large-scale datasets, especially data that are much larger than the available random-access memory.
More details about the GDS format can be found in the vignettes of the gdsfmt, SNPRelate, and SeqArray packages.
GDSArray, GDSMatrix, and GDSFile
GDSArray represents GDS files as DelayedArray instances. It has methods like dim and dimnames defined, and it inherits array-like operations and methods from DelayedArray, e.g., the subsetting method of [.
The GDSArray() constructor takes as arguments the file path and the GDS node inside the GDS file. The GDSArray() constructor is intended to return the object with rows being features (genes / variants / SNPs) and the columns being “samples”, which is consistent with the assay data inside SummarizedExperiment. Note that the examples in this document print dimensions in the order stored in the GDS node (for instance, 90 samples by 1348 variants). FIXME: should GDSArray() return that dim?
file <- gdsExampleFileName("seqgds")
## This is a SeqArray GDS file
GDSArray(file, "genotype/data")
## <2 x 90 x 1348> array of class GDSArray and type "integer":
## ,,1
## [,1] [,2] [,3] [,4] ... [,87] [,88] [,89] [,90]
## [1,] 3 3 0 3 . 0 0 0 0
## [2,] 3 3 0 3 . 0 0 0 0
##
## ,,2
## [,1] [,2] [,3] [,4] ... [,87] [,88] [,89] [,90]
## [1,] 3 3 0 3 . 0 0 0 0
## [2,] 3 3 0 3 . 0 0 0 0
##
## ...
##
## ,,1347
## [,1] [,2] [,3] [,4] ... [,87] [,88] [,89] [,90]
## [1,] 0 0 0 0 . 0 0 0 0
## [2,] 0 0 0 0 . 0 0 0 0
##
## ,,1348
## [,1] [,2] [,3] [,4] ... [,87] [,88] [,89] [,90]
## [1,] 3 3 0 3 . 3 3 3 3
## [2,] 3 3 1 3 . 3 3 3 3
A GDSMatrix is a 2-dimensional GDSArray, and will be returned from the GDSArray() constructor automatically if the input GDS node is 2-dimensional.
GDSArray(file, "annotation/format/DP/data")
## <90 x 1348> matrix of class GDSMatrix and type "integer":
## [,1] [,2] [,3] [,4] ... [,1345] [,1346] [,1347] [,1348]
## [1,] 0 0 12 15 . 6 5 4 0
## [2,] 0 0 17 4 . 10 8 7 0
## [3,] 107 92 247 177 . 28 15 26 3
## ... . . . . . . . . .
## [88,] 81 84 217 110 . 36 61 92 0
## [89,] 67 47 134 111 . 46 57 71 2
## [90,] 156 150 417 195 . 78 101 144 2
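As a quick check of the dispatch described above, the following sketch compares the objects returned for a 3-dimensional and a 2-dimensional node. It assumes the SeqArray example file shipped with the package and uses only node names that appear in this document; is() comes from the methods package.

```r
library(GDSArray)

# Example SeqArray GDS file shipped with the package (as used in this document)
file <- gdsExampleFileName("seqgds")

gt <- GDSArray(file, "genotype/data")              # 3-dimensional node
dp <- GDSArray(file, "annotation/format/DP/data")  # 2-dimensional node

length(dim(gt))       # number of dimensions of the genotype array
length(dim(dp))       # number of dimensions of the depth array
is(gt, "GDSMatrix")   # expected FALSE: not 2-dimensional
is(dp, "GDSMatrix")   # expected TRUE: 2-dimensional nodes become GDSMatrix
is(dp, "GDSArray")    # a GDSMatrix is still a GDSArray
```

Code that needs matrix-only operations can use such a check to guard against receiving a higher-dimensional node.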
GDSFile
The GDSFile is a lightweight class to represent GDS files. It has a $ completion method to complete any possible GDS nodes. It can be used as a convenient GDSArray constructor if the current_path slot of the GDSFile object represents a valid GDS node. Otherwise, it returns the GDSFile object with an updated current_path.
gf <- GDSFile(file)
gf$annotation$info
## class: GDSFile
## file: /home/biocbuild/bbs-3.14-bioc/R/library/SeqArray/extdata/CEU_Exon.gds
## current node: annotation/info
## subnodes:
## annotation/info/AA
## annotation/info/AC
## annotation/info/AN
## annotation/info/DP
## annotation/info/HM2
## annotation/info/HM3
## annotation/info/OR
## annotation/info/GP
## annotation/info/BN
gf$annotation$info$AC
## <1348> array of class GDSArray and type "integer":
## [1] [2] [3] [4] . [1345] [1346] [1347] [1348]
## 4 1 6 128 . 2 11 1 1
Try typing gf$ann and pressing the tab key for completion.
GDSArray methods
seed returns the GDSArraySeed of the GDSArray object.
gt <- GDSArray(file, "genotype/data")
seed(gt)
## GDSArraySeed
## File: /home/biocbuild/bbs-3.14-bioc/R/library/SeqArray/extdata/CEU_Exon.gds
## Array node: genotype/data
## Dim: 2 x 90 x 1348
gdsfile returns the file path of the corresponding GDS file.
gdsfile(gt)
## [1] "/home/biocbuild/bbs-3.14-bioc/R/library/SeqArray/extdata/CEU_Exon.gds"
gdsnodes() takes the GDS file path or GDSFile object as input, and returns all nodes that can be converted to GDSArray instances. The returned GDS node names can be used as input for the GDSArray(name=) constructor.
gdsnodes(file)
## [1] "sample.id" "variant.id"
## [3] "position" "chromosome"
## [5] "allele" "genotype/data"
## [7] "genotype/~data" "genotype/extra.index"
## [9] "genotype/extra" "phase/data"
## [11] "phase/~data" "phase/extra.index"
## [13] "phase/extra" "annotation/id"
## [15] "annotation/qual" "annotation/filter"
## [17] "annotation/info/AA" "annotation/info/AC"
## [19] "annotation/info/AN" "annotation/info/DP"
## [21] "annotation/info/HM2" "annotation/info/HM3"
## [23] "annotation/info/OR" "annotation/info/GP"
## [25] "annotation/info/BN" "annotation/format/DP/data"
## [27] "annotation/format/DP/~data" "sample.annotation/family"
identical(gdsnodes(file), gdsnodes(gf))
## [1] TRUE
varname <- gdsnodes(file)[2]
GDSArray(file, varname)
## <1348> array of class GDSArray and type "integer":
## [1] [2] [3] [4] . [1345] [1346] [1347] [1348]
## 1 2 3 4 . 1345 1346 1347 1348
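Because node names are plain character strings, they can be looped over to summarize several nodes at once. The sketch below is only an illustration: it restricts itself to four nodes already shown in this document, and the helper variable names (wanted, node_dims) are made up for the example.

```r
library(GDSArray)

file <- gdsExampleFileName("seqgds")

# Nodes demonstrated elsewhere in this document; keep only those that
# gdsnodes() reports as convertible to GDSArray.
wanted <- c("variant.id", "annotation/info/AC",
            "annotation/format/DP/data", "genotype/data")
nodes <- wanted[wanted %in% gdsnodes(file)]

# Build one GDSArray per node and record its dimensions as a string.
node_dims <- vapply(
    nodes,
    function(node) paste(dim(GDSArray(file, node)), collapse = " x "),
    character(1)
)
node_dims
```

For the example file, the reported dimensions should match the headers printed in this document, for instance "90 x 1348" for annotation/format/DP/data.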
dim(), dimnames()
dimnames(GDSArray) returns an unnamed list with the same length as the value returned by dim(GDSArray). Each element is either NULL or the names for the corresponding dimension.
dp <- GDSArray(file, "annotation/format/DP/data")
dim(dp)
## [1] 90 1348
class(dimnames(dp))
## [1] "list"
lengths(dimnames(dp))
## [1] 0 0
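The relationship between dim() and dimnames() can be verified directly. This is a minimal sketch that assumes the same example file and node as above.

```r
library(GDSArray)

file <- gdsExampleFileName("seqgds")
dp <- GDSArray(file, "annotation/format/DP/data")

dn <- dimnames(dp)

# One list element per dimension
length(dn) == length(dim(dp))

# The list itself carries no names
is.null(names(dn))

# Every element is NULL here, since this node has no dimension names
all(vapply(dn, is.null, logical(1)))
```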
[ subsetting
GDSArray instances can be subset, following the usual R conventions, with numeric or logical vectors; logical vectors are recycled to the appropriate length.
dp[1:3, 10:15]
## <3 x 6> matrix of class DelayedMatrix and type "integer":
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 59 49 88 55 46 47
## [2,] 33 22 16 9 7 7
## [3,] 276 271 145 89 70 151
dp[c(TRUE, FALSE), ]
## <45 x 1348> matrix of class DelayedMatrix and type "integer":
## [,1] [,2] [,3] [,4] ... [,1345] [,1346] [,1347] [,1348]
## [1,] 0 0 12 15 . 6 5 4 0
## [2,] 107 92 247 177 . 28 15 26 3
## [3,] 0 0 17 0 . 4 4 4 0
## ... . . . . . . . . .
## [43,] 0 0 11 5 . 3 3 1 0
## [44,] 3 4 9 2 . 4 3 4 0
## [45,] 67 47 134 111 . 46 57 71 2
log(dp)
## <90 x 1348> matrix of class DelayedMatrix and type "double":
## [,1] [,2] [,3] ... [,1347] [,1348]
## [1,] -Inf -Inf 2.484907 . 1.386294 -Inf
## [2,] -Inf -Inf 2.833213 . 1.945910 -Inf
## [3,] 4.672829 4.521789 5.509388 . 3.258097 1.098612
## ... . . . . . .
## [88,] 4.394449 4.430817 5.379897 . 4.5217886 -Inf
## [89,] 4.204693 3.850148 4.897840 . 4.2626799 0.6931472
## [90,] 5.049856 5.010635 6.033086 . 4.9698133 0.6931472
dp[rowMeans(dp) < 60, ]
## <52 x 1348> matrix of class DelayedMatrix and type "integer":
## [,1] [,2] [,3] [,4] ... [,1345] [,1346] [,1347] [,1348]
## [1,] 0 0 12 15 . 6 5 4 0
## [2,] 0 0 17 4 . 10 8 7 0
## [3,] 0 0 11 1 . 3 1 1 0
## ... . . . . . . . . .
## [50,] 0 0 6 0 . 2 0 0 0
## [51,] 0 0 11 5 . 3 3 1 0
## [52,] 3 4 9 2 . 4 3 4 0
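The results above are DelayedMatrix objects, so the subsetting and log() steps are recorded rather than evaluated immediately. The sketch below shows one way to bring a small subset into memory as an ordinary matrix, using as.matrix() from the DelayedArray framework, and spells out the logical recycling with base R. Realizing a large array this way may use a lot of memory, so it is done here only on a 3 x 6 block.

```r
library(GDSArray)

file <- gdsExampleFileName("seqgds")
dp <- GDSArray(file, "annotation/format/DP/data")

# Delayed subset: still backed by the GDS file
sub <- dp[1:3, 10:15]
is(sub, "DelayedMatrix")

# Realize the small block as an in-memory base R matrix
m <- as.matrix(sub)
is.matrix(m)
dim(m)

# Logical recycling: c(TRUE, FALSE) is repeated along the 90 rows,
# which selects every other row.
keep <- rep_len(c(TRUE, FALSE), nrow(dp))
sum(keep)
nrow(dp[c(TRUE, FALSE), ]) == sum(keep)
```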
GDSArraySeed
The GDSArraySeed class represents the ‘seed’ for the GDSArray object. It is not exported from the GDSArray package. Seed objects should contain the GDS file object (of class gds.class), the GDS file path, and the GDS node name, and are expected to satisfy the seed contract for implementing a DelayedArray backend, i.e. to support dim() and dimnames().
gds <- openfn.gds(file)
seed <- GDSArray:::GDSArraySeed(gds, "genotype/data")
seed
## GDSArraySeed
## File: /home/biocbuild/bbs-3.14-bioc/R/library/SeqArray/extdata/CEU_Exon.gds
## Array node: genotype/data
## Dim: 2 x 90 x 1348
closefn.gds(gds)
The seed can be used to construct a GDSArray instance.
GDSArray(seed)
## <2 x 90 x 1348> array of class GDSArray and type "integer":
## ,,1
## [,1] [,2] [,3] [,4] ... [,87] [,88] [,89] [,90]
## [1,] 3 3 0 3 . 0 0 0 0
## [2,] 3 3 0 3 . 0 0 0 0
##
## ,,2
## [,1] [,2] [,3] [,4] ... [,87] [,88] [,89] [,90]
## [1,] 3 3 0 3 . 0 0 0 0
## [2,] 3 3 0 3 . 0 0 0 0
##
## ...
##
## ,,1347
## [,1] [,2] [,3] [,4] ... [,87] [,88] [,89] [,90]
## [1,] 0 0 0 0 . 0 0 0 0
## [2,] 0 0 0 0 . 0 0 0 0
##
## ,,1348
## [,1] [,2] [,3] [,4] ... [,87] [,88] [,89] [,90]
## [1,] 3 3 0 3 . 3 3 3 3
## [2,] 3 3 1 3 . 3 3 3 3
The DelayedArray() constructor with a GDSArraySeed object as argument will return the same content as the GDSArray() constructor over the same GDSArraySeed.
class(DelayedArray(seed))
## [1] "GDSArray"
## attr(,"package")
## [1] "GDSArray"
sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] GDSArray_1.14.1 DelayedArray_0.20.0 IRanges_2.28.0
## [4] S4Vectors_0.32.3 MatrixGenerics_1.6.0 matrixStats_0.61.0
## [7] Matrix_1.4-0 BiocGenerics_0.40.0 gdsfmt_1.30.0
## [10] BiocStyle_2.22.0
##
## loaded via a namespace (and not attached):
## [1] XVector_0.34.0 bslib_0.3.1 compiler_4.1.2
## [4] BiocManager_1.30.16 jquerylib_0.1.4 GenomeInfoDb_1.30.0
## [7] zlibbioc_1.40.0 bitops_1.0-7 tools_4.1.2
## [10] digest_0.6.29 jsonlite_1.7.2 evaluate_0.14
## [13] lattice_0.20-45 rlang_0.4.12 SeqArray_1.34.0
## [16] yaml_2.2.1 parallel_4.1.2 xfun_0.29
## [19] fastmap_1.1.0 GenomeInfoDbData_1.2.7 stringr_1.4.0
## [22] knitr_1.37 Biostrings_2.62.0 sass_0.4.0
## [25] grid_4.1.2 R6_2.5.1 rmarkdown_2.11
## [28] bookdown_0.24 magrittr_2.0.1 htmltools_0.5.2
## [31] GenomicRanges_1.46.1 stringi_1.7.6 RCurl_1.98-1.5
## [34] crayon_1.4.2