DelayedTensor 1.8.0
Authors: Koki Tsuyuzaki [aut, cre]
Last modified: 2023-10-24 14:41:21.735014
Compiled: Tue Oct 24 16:56:11 2023
Biological systems have very complicated structures, as illustrated in this figure (https://f1000research.com/slides/9-1260).
For example, inside the cell, DNA sequences are folded in the nucleus, RNA molecules are transcribed from the DNA, proteins are translated from the RNAs, and finally, the proteins are related to cellular functions. Outside the cell, there are also many signals such as bacterial infection, added chemical reagents, drugs, lifestyle, and so on. Changes in these molecular types and phenomena ultimately give rise to phenotypes such as disease, BMI, and morphology. It is not possible to measure all molecular types or phenomena simultaneously, so one or two of them are chosen and measured exhaustively. This approach is called an omics study and is widely used. For example, genomics measures all DNA sequences, and transcriptomics measures all RNA molecules.
There is a need for a framework that can handle and analyze such heterogeneous data structures in a unified manner and provide biological interpretation. Tensors are a mathematical framework that can be very useful in such a situation.
A tensor can be considered a generalized form of data representation (see https://f1000research.com/slides/9-1260). For example, a scalar value, a vector, and a matrix are also called a 0th-order tensor, a 1st-order tensor, and a 2nd-order tensor, respectively. If a dataset has three “modes” (1. height, 2. width, and 3. depth), it is called a 3rd-order tensor.
In this sense, any data is basically a tensor, but in most cases the term tensor implies a 3rd-order or higher-order tensor.
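As a minimal illustration of these definitions, the sketch below builds tensors of order 0 to 3 using only base R objects. The helper `order_of` is hypothetical (it is not part of any package) and simply counts the modes; treating a length-one vector as a 0th-order tensor is a convention assumed here.

```r
# Tensors of increasing order, represented with base R objects.
scalar  <- 3.14                            # 0th-order tensor
vec     <- c(1, 2, 3)                      # 1st-order tensor (one mode)
mat     <- matrix(1:6, nrow = 2)           # 2nd-order tensor (two modes)
tensor3 <- array(1:24, dim = c(2, 3, 4))   # 3rd-order tensor (height x width x depth)

# Hypothetical helper: the order of a tensor is its number of modes.
# Objects without a dim attribute are treated as scalars (length 1)
# or vectors (length > 1).
order_of <- function(x) {
  if (is.null(dim(x))) as.integer(length(x) > 1) else length(dim(x))
}

sapply(list(scalar, vec, mat, tensor3), order_of)
```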
Tensor decomposition is an extension of matrix decomposition. If we have a third-order tensor (gene × tissue × condition), tensor decomposition lets us extract a small number of patterns (see https://f1000research.com/slides/9-1260).
The extracted vectors are collected into multiple matrices, which are called factor matrices. The scalar values are collected into a small tensor, which is called the core tensor.
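To make the roles of the core tensor and the factor matrices concrete, here is a minimal base R sketch that assumes a Tucker-style model, in which the data tensor is rebuilt by multiplying the core tensor by one factor matrix along each mode. The helper `mode_product` and all variable names are hypothetical and the values are random; this is not the implementation used by rTensor or DelayedTensor.

```r
# Hypothetical helper: mode-n product of tensor `x` with matrix `m`.
mode_product <- function(x, m, n) {
  d <- dim(x)
  perm <- c(n, seq_along(d)[-n])
  # Unfold: mode n indexes the rows, all other modes the columns
  unfolded <- matrix(aperm(x, perm), nrow = d[n])
  res <- m %*% unfolded
  d_new <- d
  d_new[n] <- nrow(m)
  # Fold back and restore the original order of the modes
  aperm(array(res, dim = d_new[perm]), order(perm))
}

set.seed(1)
core        <- array(rnorm(8), dim = c(2, 2, 2))  # small core tensor
A_gene      <- matrix(rnorm(10), nrow = 5)        # 5 genes      x 2 patterns
A_tissue    <- matrix(rnorm(8),  nrow = 4)        # 4 tissues    x 2 patterns
A_condition <- matrix(rnorm(6),  nrow = 3)        # 3 conditions x 2 patterns

# Rebuild a gene x tissue x condition tensor from the core and the factors
recon <- mode_product(core, A_gene, 1)
recon <- mode_product(recon, A_tissue, 2)
recon <- mode_product(recon, A_condition, 3)
dim(recon)
```

Each entry of `recon` is a sum over the core tensor weighted by one row from each factor matrix, which is why a few patterns can describe a much larger tensor.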
A tensor is more than just a multi-dimensional array; as we will see later,
there are various operations that are specific to tensors, such as reshaping,
mode-wise statistics, and various tensor products.
These operations are essential in the analysis of tensor data and
the implementation of tensor decomposition algorithms.
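As a rough illustration of two of these operations, the following sketch uses only base R on a small in-memory array. It shows a mode-2 unfolding (reshaping a 3rd-order tensor into a matrix) and mode-wise sums and means; the variable names are illustrative only.

```r
x <- array(1:24, dim = c(2, 3, 4))

# Mode-2 unfolding (matricization): mode 2 indexes the rows,
# the remaining modes (1 and 3) are flattened into the columns.
unfold_mode2 <- matrix(aperm(x, c(2, 1, 3)), nrow = dim(x)[2])
dim(unfold_mode2)

# Mode-wise statistics: collapse mode 3, keeping modes 1 and 2.
sum_over_mode3  <- apply(x, c(1, 2), sum)
mean_over_mode3 <- apply(x, c(1, 2), mean)
dim(sum_over_mode3)
```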
Although the standard array
of the R language can represent
tensors of arbitrary order, it does not provide tensor-specific operations.
Therefore, many R users manipulate tensor data by using
the functions implemented in the rTensor package for now.
Although rTensor is very useful,
it assumes the input object to be an in-memory array.
On the other hand, tensors can easily become huge
as the order and the size of each mode increase,
and may no longer fit in memory.
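A quick back-of-the-envelope calculation shows how fast the size grows. The sketch below assumes a dense tensor of double-precision values (8 bytes per element); `tensor_gib` is a hypothetical helper. With 1000 elements per mode, a 3rd-order tensor already needs roughly 7.5 GiB.

```r
# Approximate in-memory size (in GiB) of a dense double-precision tensor:
# number of elements times 8 bytes per element.
tensor_gib <- function(dims) prod(dims) * 8 / 1024^3

tensor_gib(c(1000, 1000))              # 2nd-order tensor
tensor_gib(c(1000, 1000, 1000))        # 3rd-order tensor
tensor_gib(c(1000, 1000, 1000, 10))    # 4th-order tensor
```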
DelayedTensor is implemented for such extremely large tensor data. DelayedTensor provides some of the functions of rTensor by reimplementing them with DelayedArray. DelayedArray is a framework that allows us to use data on disk as if it were a standard array in R. Several out-of-core backends, such as HDF5Array and TileDBArray, can be used with DelayedArray, and calculations can be performed incrementally by implementing functions that support “block processing”.
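The idea behind block processing can be sketched in a few lines of base R. This is only a conceptual illustration, not the DelayedArray API: the hypothetical helper `block_sum` visits a matrix a few columns at a time and accumulates a running total, so that only one block has to be held in memory at any moment.

```r
# Hypothetical helper: sum all elements of `x` by visiting blocks of
# at most `block_len` columns.
block_sum <- function(x, block_len) {
  starts <- seq(1, ncol(x), by = block_len)
  total <- 0
  for (s in starts) {
    e <- min(s + block_len - 1, ncol(x))
    # With an on-disk backend, this subsetting step would be a read from disk
    block <- x[, s:e, drop = FALSE]
    total <- total + sum(block)
  }
  total
}

x <- matrix(seq_len(20), nrow = 4)
block_sum(x, block_len = 2)   # same value as sum(x)
```

The result does not depend on the block length; the block length only controls how much data is loaded at once.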
The functionality of DelayedTensor is fourfold.
Block-Processing Tensor Reshaping: Operations such as unfolding a higher-order tensor into a matrix, and folding it back, can be performed while taking the block size into account.
Block-Processing Tensor Arithmetic: Calculation of sums and averages for each mode, and operations such as the Hadamard product, Kronecker product, and Khatri-Rao product, can be performed while taking the block size into account.
Block-Processing Tensor Decomposition: Some of the tensor decomposition algorithms implemented in rTensor have been reimplemented using the reshaping and arithmetic functions of DelayedTensor.
Block-Processing einsum: In addition to the rTensor functions, the einsum function, which is well known as a NumPy (Python) function, has also been reimplemented based on DelayedArray and the block processing framework. einsum is a very powerful preprocessing method for merging multiple tensor datasets into a single tensor.
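For readers unfamiliar with the products named above, the following base R sketch shows what each one computes on two small in-memory matrices. It does not use DelayedTensor or block processing, and since base R has no einsum function, the contraction "ij,jk->ik" is written out as an explicit loop to show what the subscript notation means.

```r
A <- matrix(1:4, nrow = 2)
B <- matrix(5:8, nrow = 2)

# Hadamard product: element-wise multiplication of same-sized arrays
hadamard_AB <- A * B

# Kronecker product: every element of A scales a full copy of B (4 x 4)
kronecker_AB <- kronecker(A, B)

# Khatri-Rao product: column-wise Kronecker product (4 x 2);
# A and B must have the same number of columns
khatri_rao_AB <- sapply(seq_len(ncol(A)), function(j) {
  as.vector(kronecker(A[, j], B[, j]))
})

# einsum-style contraction "ij,jk->ik": multiply and sum over the shared index j
contract_ik <- matrix(0, nrow(A), ncol(B))
for (i in seq_len(nrow(A))) {
  for (k in seq_len(ncol(B))) {
    contract_ik[i, k] <- sum(A[i, ] * B[, k])
  }
}
```

The contraction above is ordinary matrix multiplication (`A %*% B`); the einsum notation extends the same "multiply, then sum over shared indices" rule to tensors of any order.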
Although what is executed inside the functions is very different, the function names, argument names, and value names are exactly the same as those of rTensor, so that rTensor users can easily adopt them.
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.18-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocStyle_2.30.0
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.33 R6_2.5.1 bookdown_0.36
## [4] fastmap_1.1.1 xfun_0.40 cachem_1.0.8
## [7] knitr_1.44 htmltools_0.5.6.1 rmarkdown_2.25
## [10] cli_3.6.1 sass_0.4.7 jquerylib_0.1.4
## [13] compiler_4.3.1 tools_4.3.1 evaluate_0.22
## [16] bslib_0.5.1 yaml_2.3.7 BiocManager_1.30.22
## [19] jsonlite_1.8.7 rlang_1.1.1