---
title: "R Interface to CoreArray Genomic Data Structure (GDS) Files"
author: "Xiuwen Zheng"
date: "May 3, 2016"
output:
  html_document:
    theme: spacelab
    toc: yes
  pdf_document:
    toc: yes
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{Introduction to GDS Format}
  %\VignetteEngine{knitr::rmarkdown}
---

# Introduction

The package gdsfmt provides a high-level R interface to CoreArray Genomic Data Structure (GDS) data files, which are portable across platforms and use a hierarchical structure to store multiple scalable array-oriented data sets with metadata information. It is suited for large-scale datasets, especially for data which are much larger than the available random-access memory. The package [gdsfmt](http://www.bioconductor.org/packages/release/bioc/html/gdsfmt.html) offers efficient operations specifically designed for integers of fewer than 8 bits, since a single genetic/genomic variant, such as a single-nucleotide polymorphism (SNP), usually occupies fewer bits than a byte. Data compression and decompression are also supported with relatively efficient random access.

# Installation of the package gdsfmt

To install the package [gdsfmt](http://www.bioconductor.org/packages/release/bioc/html/gdsfmt.html), you need a current version (>=2.14.0) of [R](http://www.r-project.org). After installing R you can run the following commands from the R command shell to install the package [gdsfmt](http://www.bioconductor.org/packages/release/bioc/html/gdsfmt.html).

Install the package from the Bioconductor repository:

```{r eval=FALSE}
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("gdsfmt")
```

Install the development version from GitHub:

```{r eval=FALSE}
library("devtools")
install_github("zhengxwen/gdsfmt")
```

The `install_github()` approach requires that you build from source, i.e.
`make` and compilers must be installed on your system (see the [R FAQ](http://cran.r-project.org/faqs.html) for your operating system); you may also need to install dependencies manually.

# High-level R functions

## Creating a GDS file and variable hierarchy

An empty GDS file can be created by `createfn.gds()`:

```{r}
library(gdsfmt)
gfile <- createfn.gds("test.gds")
```

```{r echo=FALSE}
set.seed(1000)
```

Now, a file handle associated with "test.gds" is saved in the R variable *gfile*.

The GDS file can contain a hierarchical structure to store multiple GDS variables (or GDS nodes) in the file, and various data types are allowed (see the documentation of `add.gdsn()`), including integer, floating-point number and character.

```{r}
add.gdsn(gfile, "int", val=1:10000)
add.gdsn(gfile, "double", val=seq(1, 1000, 0.4))
add.gdsn(gfile, "character", val=c("int", "double", "logical", "factor"))
add.gdsn(gfile, "logical", val=rep(c(TRUE, FALSE, NA), 50), visible=FALSE)
add.gdsn(gfile, "factor", val=as.factor(c(NA, "AA", "CC")), visible=FALSE)
add.gdsn(gfile, "bit2", val=sample(0:3, 1000, replace=TRUE), storage="bit2")

# list and data.frame
add.gdsn(gfile, "list", val=list(X=1:10, Y=seq(1, 10, 0.25)))
add.gdsn(gfile, "data.frame", val=data.frame(X=1:19, Y=seq(1, 10, 0.5)))
```

```{r}
folder <- addfolder.gdsn(gfile, "folder")
add.gdsn(folder, "int", val=1:1000)
add.gdsn(folder, "double", val=seq(1, 100, 0.4), visible=FALSE)
```

Users can display the file content by typing `gfile` or `print(gfile)`:

```{r}
gfile
```

`print(gfile, ...)` has an argument *all* to control the display of file content. By default, *all=FALSE*; if *all=TRUE*, all contents in the file are shown, including hidden variables and folders. The GDS variables *logical*, *factor* and *folder/double* are hidden.

```{r}
print(gfile, all=TRUE)
```

The asterisk indicates attributes attached to a GDS variable.
The attributes can be used in the R environment to interpret the variable as *logical*, *factor*, *data.frame* or *list*.

`index.gdsn()` can locate the GDS variable by a *path*:

```{r}
index.gdsn(gfile, "int")
index.gdsn(gfile, "list/Y")
index.gdsn(gfile, "folder/int")
```

```{r}
# close the GDS file
closefn.gds(gfile)
```

## Writing Data

Array-oriented data sets can be written to the GDS file. There are several ways to write data to a GDS variable.

```{r}
gfile <- createfn.gds("test.gds")
```

### R function *add.gdsn*

Users can pass an R variable to the function `add.gdsn()` directly. `show()` provides a preview of the GDS variable.

```{r}
n <- add.gdsn(gfile, "I1", val=matrix(1:15, nrow=3))
show(n)
```

### R function *write.gdsn*

Users can specify the arguments *start* and *count* to write a subset of data. A value of -1 in *count* means the full size of that dimension, and the corresponding element in *start* should be 1. The values in *start* and *count* should be within the dimension range.

```{r}
write.gdsn(n, rep(0,5), start=c(2,1), count=c(1,-1))
show(n)
```

### R function *append.gdsn*

Users can append new data to an existing GDS variable.

```{r}
append.gdsn(n, 16:24)
show(n)
```

### R function *assign.gdsn*

Users can call `assign.gdsn()` to replace specific values, or to subset or reorder the data variable.

```{r}
# initialize
n <- add.gdsn(gfile, "mat", matrix(1:48, 6))
show(n)

# substitute
assign.gdsn(n, .value=c(9:14,35:40), .substitute=NA)
show(n)

# subset
assign.gdsn(n, seldim=list(rep(c(TRUE, FALSE),3), rep(c(FALSE, TRUE),4)))
show(n)

# initialize and subset
n <- add.gdsn(gfile, "mat", matrix(1:48, 6), replace=TRUE)
assign.gdsn(n, seldim=list(c(4,2,6,NA), c(5,6,NA,2,8,NA,4)))
show(n)

# initialize and sort into descending order
n <- add.gdsn(gfile, "mat", matrix(1:48, 6), replace=TRUE)
assign.gdsn(n, seldim=list(6:1, 8:1))
show(n)
```

### Create a large-scale data set

**1)** When the size of a dataset is larger than the system memory, users cannot add a GDS variable via `add.gdsn()` directly. If the dimension is pre-defined, users can specify the dimension size in `add.gdsn()` to allocate data space, and then call `write.gdsn()` repeatedly to fill in a small subset of the data space at a time.

```{r}
(n2 <- add.gdsn(gfile, "I2", storage="int", valdim=c(100, 2000)))

for (i in 1:2000) {
    write.gdsn(n2, seq.int(100*(i-1)+1, length.out=100),
        start=c(1,i), count=c(-1,1))
}

show(n2)
```

**2)** Call `append.gdsn()` to append new data when the initial size is zero. If a compression algorithm is specified in `add.gdsn()` (e.g., *compress="ZIP"*), users should call `append.gdsn()` instead of `write.gdsn()`, since data has to be compressed sequentially.

```{r}
(n3 <- add.gdsn(gfile, "I3", storage="int", valdim=c(100, 0), compress="ZIP"))

for (i in 1:2000) {
    append.gdsn(n3, seq.int(100*(i-1)+1, length.out=100))
}

readmode.gdsn(n3)  # finish writing with the compression algorithm
show(n3)
```

```{r}
# close the GDS file
closefn.gds(gfile)
```

## Reading Data

```{r}
gfile <- createfn.gds("test.gds")
add.gdsn(gfile, "I1", val=matrix(1:20, nrow=4))
add.gdsn(gfile, "I2", val=1:100)
closefn.gds(gfile)
```

`read.gdsn()` can load all data to an R variable in memory.

```{r}
gfile <- openfn.gds("test.gds")
n <- index.gdsn(gfile, "I1")

read.gdsn(n)
```

### Subset reading *read.gdsn* and *readex.gdsn*

A subset of data can be specified via the arguments *start* and *count* in the R function `read.gdsn()`. Alternatively, pass a list of logical or numeric index vectors to `readex.gdsn()`.

```{r}
# read a subset
read.gdsn(n, start=c(2, 2), count=c(2, 3))
read.gdsn(n, start=c(2, 2), count=c(2, 3), .value=c(6,15), .substitute=NA)
```

```{r}
# read a subset
readex.gdsn(n, list(c(FALSE,TRUE,TRUE,FALSE), c(TRUE,FALSE,TRUE,FALSE,TRUE)))
readex.gdsn(n, list(c(1,4,3,NA), c(2,NA,3,1,3,1)))
readex.gdsn(n, list(c(1,4,3,NA), c(2,NA,3,1,3,1)), .value=NA, .substitute=-1)
```

### Apply a user-defined function marginally

A user-defined function can be applied marginally to a GDS variable via `apply.gdsn()`. *margin=1* applies the function row by row, and *margin=2* applies it column by column.

```{r}
apply.gdsn(n, margin=1, FUN=print, as.is="none")
apply.gdsn(n, margin=2, FUN=print, as.is="none")

# close the GDS file
closefn.gds(gfile)
```

# Examples

To create a simple GDS file,

```{r}
gfile <- createfn.gds("test.gds")
n1 <- add.gdsn(gfile, "I1", val=1:100)
n2 <- add.gdsn(gfile, "I2", val=matrix(1:20, nrow=4))

gfile
```

## Output to a text file

`apply.gdsn()` can be used to export a GDS variable to a text file. If the GDS variable is a vector,

```{r}
fout <- file("text.txt", "wt")
apply.gdsn(n1, 1, FUN=cat, as.is="none", file=fout, fill=TRUE)
close(fout)

scan("text.txt")
```

The arguments *file* and *fill* are defined in the function `cat()`.

If the GDS variable is a matrix:

```{r}
fout <- file("text.txt", "wt")
apply.gdsn(n2, 1, FUN=cat, as.is="none", file=fout, fill=4194304)
close(fout)

readLines("text.txt")
```

The number 4194304 is the maximum number of columns on a line used in printing vectors.

## Transpose a matrix

`permdim.gdsn()` can be used to transpose an array by permuting its dimensions.
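
As a minimal sketch of the `permdim.gdsn()` route, the chunk below swaps the two dimensions of a small matrix in place. It assumes the usage `permdim.gdsn(node, dimidx)`, where *dimidx* lists the dimensions in their new order; the temporary file and the node name "m" are illustrative only and are not used elsewhere in this vignette.

```{r}
library(gdsfmt)

# work in a separate temporary file so that "test.gds" is left untouched
tmp.fn <- tempfile(fileext=".gds")
f.tmp <- createfn.gds(tmp.fn)

# a 4x5 integer matrix stored in a GDS node (illustrative name "m")
m <- matrix(1:20, nrow=4)
node <- add.gdsn(f.tmp, "m", val=m)

# permute the dimensions: c(2,1) puts the second dimension first
permdim.gdsn(node, c(2,1))

# the node should now hold a 5x4 matrix
m.t <- read.gdsn(node)
dim(m.t)

# close and remove the temporary file
closefn.gds(f.tmp)
unlink(tmp.fn, force=TRUE)
```

If the permutation behaves as assumed, `m.t` should equal `t(m)`.
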
Alternatively, `apply.gdsn()` can write the data returned from the user-defined function *FUN* directly to a target GDS node *target.node* when *as.is="gdsnode"* and *target.node* are both given. The function `c()` in R is a generic function that combines its arguments, and it passes all data to the target GDS node in the following code:

```{r}
n.t <- add.gdsn(gfile, "transpose", storage="int", valdim=c(5,0))

# apply the function over rows of matrix
apply.gdsn(n2, margin=1, FUN=c, as.is="gdsnode", target.node=n.t)

# matrix transpose
read.gdsn(n.t)

# close the GDS file
closefn.gds(gfile)
```

## Floating-point number vs. packed real number

In computing, floating point is a method of representing an approximation of a real number in a way that supports a trade-off between range and precision; a number that can be represented exactly has the form "*significand* $\times$ 2^*exponent*^". A packed real number in GDS format is defined as "*int* $\times$ scale $+$ offset", where *int* can be an 8-bit, 16-bit or 32-bit signed integer. In some cases, the strategy of packed real numbers can significantly improve the compression ratio for real numbers.
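
The definition "*int* $\times$ scale $+$ offset" can be illustrated in plain R without touching a GDS file. The helpers `pack_real16()` and `unpack_real16()` below are hypothetical and are not part of gdsfmt; they only emulate the arithmetic under the assumption of a plain signed 16-bit range, and they ignore how the package encodes missing values.

```{r}
# hypothetical helper (not a gdsfmt function): real value -> integer code
pack_real16 <- function(x, scale, offset=0)
{
    # round to the nearest representable step
    code <- as.integer(round((x - offset) / scale))
    # assumed range of a signed 16-bit integer
    stopifnot(all(code >= -32768L & code <= 32767L))
    code
}

# hypothetical helper (not a gdsfmt function): integer code -> real value
unpack_real16 <- function(code, scale, offset=0)
{
    code * scale + offset
}

x <- c(0, 0.001, 0.25, 0.12345, 1)
code <- pack_real16(x, scale=0.001)
code

# the rounding error of each value is at most half a step, i.e. scale/2
err <- abs(unpack_real16(code, scale=0.001) - x)
max(err)
```

Values that lie on the grid defined by *scale* and *offset* are recovered up to floating-point rounding, while a value such as 0.12345 is rounded to the nearest step. Each code needs 2 bytes instead of the 8 bytes of a double, so the uncompressed storage is about a quarter of the 64-bit size.
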

```{r}
set.seed(1000)
val <- sample(seq(0,1,0.001), 50000, replace=TRUE)
head(val)

gfile <- createfn.gds("test.gds")

add.gdsn(gfile, "N1", val=val)
add.gdsn(gfile, "N2", val=val, compress="ZIP", closezip=TRUE)
add.gdsn(gfile, "N3", val=val, storage="float")
add.gdsn(gfile, "N4", val=val, storage="float", compress="ZIP", closezip=TRUE)
add.gdsn(gfile, "N5", val=val, storage="packedreal16", scale=0.001, offset=0)
add.gdsn(gfile, "N6", val=val, storage="packedreal16", scale=0.001, offset=0,
    compress="ZIP", closezip=TRUE)

gfile
```

```{r echo=FALSE}
# size of the node "N<i>" in kilobytes
KB <- function(i)
{
    s <- objdesp.gdsn(index.gdsn(gfile, paste("N", i, sep="")))$size
    sprintf("%.1f KB", s/1000)
}

# size of the node "N<i>" relative to 64-bit storage of val
Ratio <- function(i)
{
    s <- objdesp.gdsn(index.gdsn(gfile, paste("N", i, sep="")))$size
    r <- 100*s / (8*length(val))
    sprintf("%.1f%%", r)
}

# mean absolute difference between val and the values read back
Epsilon <- function(i)
{
    ans <- mean(abs(val - read.gdsn(index.gdsn(gfile, paste0("N",i)))))
    sprintf("%0.3g", ans)
}
```

| Variable | Type | Compression Method | Size | Ratio | Machine epsilon^1^ |
|:---------|:-----|:------------------:|-----:|------:|----------------:|
| **N1** | 64-bit floating-point number | --- | `r KB(1)` | `r Ratio(1)` | `r Epsilon(1)` |
| **N2** | 64-bit floating-point number | ZIP | `r KB(2)` | `r Ratio(2)` | `r Epsilon(2)` |
| **N3** | 32-bit floating-point number | --- | `r KB(3)` | `r Ratio(3)` | `r Epsilon(3)` |
| **N4** | 32-bit floating-point number | ZIP | `r KB(4)` | `r Ratio(4)` | `r Epsilon(4)` |
| **N5** | 16-bit packed real number | --- | `r KB(5)` | `r Ratio(5)` | `r Epsilon(5)` |
| **N6** | 16-bit packed real number | ZIP | `r KB(6)` | `r Ratio(6)` | `r Epsilon(6)` |

^1^: the error due to rounding in floating-point arithmetic, computed here as the mean absolute difference between the original values and the values read back from the file.

```{r}
# close the GDS file
closefn.gds(gfile)
```

## Limited random-access of compressed data

* A random 0/1 sequence of 10,000,000 32-bit integers
    * in each 32 bits, one bit stores the random 0/1 value and the others are zero
    * the lower bound of the compression percentage is 1/32 = 3.125%
* Testing:
    * read a 32-bit integer at each of 10,000 random positions
    * the compression ratio is maximized for each method
* Compression methods: none, ZIP, ZIP_ra, LZ4, LZ4_ra, LZMA, LZMA_ra
    * ZIP_ra, LZ4_ra and LZMA_ra: data stored in the file are composed of multiple independent compressed blocks

```{r eval=FALSE}
set.seed(100)
# 10,000,000 random 0,1 sequence of 32-bit integers
val <- sample.int(2, 10*1000*1000, replace=TRUE) - 1L
table(val)
```

```
## val
##       0       1
## 4999138 5000862
```

```{r eval=FALSE}
# create a GDS file
f <- createfn.gds("test.gds")

# compression algorithms (LZMA_ra:32K is the lower bound of LZMA_ra)
compression <- c("", "ZIP.max", "ZIP_ra.max:16K", "LZ4.max", "LZ4_ra.max:16K",
    "LZMA", "LZMA_ra:32K")

# save
for (i in seq_along(compression))
    print(add.gdsn(f, paste0("I", i), val=val, compress=compression[i],
        closezip=TRUE))

# close the file
closefn.gds(f)

cleanup.gds("test.gds")
```

* System configuration:
    * MacBook Pro, Retina, 13-inch, Late 2013, 2.8 GHz Intel Core i7, 16 GB 1600 MHz DDR3
    * R 3.2.4

```{r eval=FALSE}
# open the GDS file
f <- openfn.gds("test.gds")

# 10,000 random positions
set.seed(1000)
idx <- sample.int(length(val), 10000)

# enumerate each compression method
dat <- vector("list", length(compression))
for (i in seq_len(length(compression))) {
    cat("Compression:", compression[i], "\n")
    n <- index.gdsn(f, paste0("I", i))
    print(system.time({
        dat[[i]] <- sapply(idx, FUN=function(k) read.gdsn(n, start=k, count=1L))
    }))
}

# check
for (i in seq_len(length(compression)))
    stopifnot(identical(dat[[i]], dat[[1L]]))

# close the file
closefn.gds(f)
```

| Compression Method    | Raw  | ZIP    | ZIP_ra | LZ4   | LZ4_ra | LZMA  | LZMA_ra |
|:----------------------|:-----|:-------|:-------|:------|:-------|:------|:--------|
| Data Size (MB)        | 38.1 | 1.9    | 2.1    | 2.8   | 2.9    | 1.4   | 1.4     |
| Compression Percent   | 100% | 5.08%  | 5.42%  | 7.39% | 7.60%  | 3.65% | 3.78%   |
| Reading Time (second) | 0.21 | 202.64 | 2.97   | 84.43 | 0.84   | 462.1 | 29.7    |

## Checksum for Data Integrity

Users can create hash function digests (e.g., md5, sha1, sha256, sha384, sha512) to verify data integrity; md5 is the default digest algorithm. For example,

```{r}
# create a GDS file
f <- createfn.gds("test.gds")
n <- add.gdsn(f, "raw", rnorm(1115), compress="ZIP", closezip=TRUE)
digest.gdsn(n, action="add")

print(f, attribute=TRUE)

closefn.gds(f)
```

Reopen the file and verify data integrity:

```{r}
f <- openfn.gds("test.gds")
n <- index.gdsn(f, "raw")

get.attr.gdsn(n)$md5

digest.gdsn(n, action="verify")  # NA indicates "not applicable"

closefn.gds(f)
```

# Stylish Terminal Output in R

If the R package [crayon](http://cran.r-project.org/package=crayon) is installed in the R environment, `print()` can display the contents of a GDS file in different colours. For example, on Apple Mac,

![crayon output](crayon_show.jpg)

Users can disable crayon terminal output with `options(gds.crayon=FALSE)`:

```
File: 1KG_autosome_phase3_shapeit2_mvncall_integrated_v5_20130502_genotypes.gds (3.4G)
+    [  ] *
|--+ sample.id   { VStr8 2504 ZIP_ra(27.15%), 5.4K }
|--+ snp.id   { Int32 81271745 ZIP_ra(34.58%), 112.4M }
|--+ snp.rs.id   { VStr8 81271745 ZIP_ra(38.67%), 193.1M }
|--+ snp.position   { Int32 81271745 ZIP_ra(39.73%), 129.1M }
|--+ snp.chromosome   { VStr8 81271745 ZIP_ra(0.10%), 190.2K }
|--+ snp.allele   { VStr8 81271745 ZIP_ra(17.05%), 57.3M }
|--+ genotype   { Bit2 2504x81271745 ZIP_ra(5.66%), 2.9G } *
\--+ snp.annot   [  ]
   |--+ qual   { Float32 81271745 ZIP_ra(0.10%), 316.1K }
   \--+ filter   { VStr8 81271745 ZIP_ra(0.15%), 592.0K }
```

# Session Information

```{r}
sessionInfo()
```

```{r echo=FALSE}
unlink(c("test.gds", "text.txt"), force=TRUE)
```