---
title: Using the `ScaledMatrix` class
author:
- name: Aaron Lun
  email: infinite.monkeys.with.keyboards@gmail.com
date: "Revised: 12 December 2020"
output:
  BiocStyle::html_document:
    toc_float: true
package: ScaledMatrix
vignette: >
  %\VignetteIndexEntry{Using the ScaledMatrix}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo=FALSE, results="hide", message=FALSE}
knitr::opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)
```

# Overview

The `ScaledMatrix` provides yet another method of running `scale()` on a matrix.
In other words, these three operations are equivalent:

```{r}
mat <- matrix(rnorm(10000), ncol=10)

smat1 <- scale(mat)
head(smat1)

library(DelayedArray)
smat2 <- scale(DelayedArray(mat))
head(smat2)

library(ScaledMatrix)
smat3 <- ScaledMatrix(mat, center=TRUE, scale=TRUE)
head(smat3)
```

The biggest difference lies in how they behave in downstream matrix operations.

- `smat1` is an ordinary matrix, with the scaled and centered values fully realized in memory. Nothing too unusual here.
- `smat2` is a `DelayedMatrix` and undergoes block processing, whereby chunks are realized and operated on, one at a time. This sacrifices speed for greater memory efficiency by avoiding a copy of the entire matrix. In particular, it preserves the structure of the original `mat`, e.g., a sparse or file-backed representation.
- `smat3` is a `ScaledMatrix` that refactors certain operations so that they can be applied to the original `mat` without any scaling or centering. This takes advantage of the original data structure to speed up matrix multiplication and row/column sums, albeit at the cost of numerical precision.

# Matrix multiplication

Given an original matrix $\mathbf{X}$ with $n$ columns, a vector of column centers $\mathbf{c}$ and a vector of column scaling values $\mathbf{s}$, our scaled matrix can be written as:

$$
\mathbf{Y} = (\mathbf{X} - \mathbf{1} \cdot \mathbf{c}^T) \mathbf{S}
$$

where $\mathbf{1}$ is a column vector of ones with one entry per row of $\mathbf{X}$, and $\mathbf{S} = \text{diag}(s_1^{-1}, ..., s_n^{-1})$.
If we wanted to right-multiply it with another matrix $\mathbf{A}$, we would have:

$$
\mathbf{YA} = \mathbf{X}\mathbf{S}\mathbf{A} - \mathbf{1} \cdot \mathbf{c}^T \mathbf{S}\mathbf{A}
$$

The right-most term is simply the outer product of $\mathbf{1}$ with the row vector $\mathbf{c}^T\mathbf{S}\mathbf{A}$, i.e., the same correction vector is subtracted from every row, which is cheap to compute.
More importantly, the first term uses the matrix multiplication operator for $\mathbf{X}$ applied to $\mathbf{S}\mathbf{A}$, which allows us to exploit highly efficient algorithms for certain data representations, e.g., sparse matrices.

```{r}
library(Matrix)
mat <- rsparsematrix(20000, 10000, density=0.01)
smat <- ScaledMatrix(mat, center=TRUE, scale=TRUE)

blob <- matrix(runif(ncol(mat) * 5), ncol=5)
system.time(out <- smat %*% blob)

# The slower way with block processing.
da <- scale(DelayedArray(mat))
system.time(out2 <- da %*% blob)
```

The same logic applies for left-multiplication and cross-products, as sketched below.
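As a quick illustration, here is a minimal sketch re-using `mat`, `smat` and `da` from above. The `lhs` and `cmat` operands are hypothetical examples introduced for this demonstration, and we assume that the `crossprod()` generic dispatches to a `ScaledMatrix` method in the same manner as `%*%`.

```{r}
# Left-multiplication with a small hypothetical left-hand operand.
lhs <- matrix(runif(5 * nrow(mat)), nrow=5)
system.time(lout <- lhs %*% smat)

# The slower way with block processing.
system.time(lout2 <- lhs %*% da)

# Cross-product, equivalent to t(cmat) %*% smat.
cmat <- matrix(runif(nrow(mat) * 5), ncol=5)
system.time(cout <- crossprod(cmat, smat))
```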
This allows us to easily speed up high-level operations involving matrix multiplication by just switching to a `ScaledMatrix`, e.g., in approximate PCA algorithms from the `r Biocpkg("BiocSingular")` package.

```{r}
library(BiocSingular)
set.seed(1000)
system.time(pcs <- runSVD(smat, k=10, BSPARAM=IrlbaParam()))
```

# Other utilities

Row and column sums are special cases of matrix multiplication and can be computed quickly:

```{r}
system.time(rowSums(smat))
system.time(rowSums(da))
```

Subsetting, transposition and renaming of the dimensions are all supported without loss of the `ScaledMatrix` representation:

```{r}
smat[,1:5]
t(smat)
rownames(smat) <- paste0("GENE_", 1:20000)
smat
```

Other operations will cause the `ScaledMatrix` to collapse to the general `DelayedMatrix` representation, at which point block processing will be used.

```{r}
smat + 1
```

# Caveats

For the most part, the implementation of the multiplication assumes that the $\mathbf{A}$ matrix and the matrix product are small compared to $\mathbf{X}$.
It is also possible to multiply two `ScaledMatrix`es together if the underlying matrices have efficient operators for their product.
However, if these assumptions do not hold, the `ScaledMatrix` offers little benefit to offset its increased overhead.

It is also worth noting that this speed-up is not entirely free.
The expression above involves subtracting two matrix products with potentially large values, which runs the risk of catastrophic cancellation.
The example below demonstrates how a `ScaledMatrix` is more susceptible to loss of precision than a normal `DelayedArray`:

```{r}
set.seed(1000)
mat <- matrix(rnorm(1000000), ncol=100000)
big.mat <- mat + 1e12

# The 'correct' value, unaffected by numerical precision.
ref <- rowMeans(scale(mat))
head(ref)

# The value from scale'ing a DelayedArray.
library(DelayedArray)
smat2 <- scale(DelayedArray(big.mat))
head(rowMeans(smat2))

# The value from a ScaledMatrix.
library(ScaledMatrix)
smat3 <- ScaledMatrix(big.mat, center=TRUE, scale=TRUE)
head(rowMeans(smat3))
```

In most practical applications, though, this does not seem to be a major concern, especially as most values (e.g., log-normalized expression matrices) lie close to zero anyway.

# Session information {-}

```{r}
sessionInfo()
```