---
title: "Package development from _beachmat_"
author: "Aaron Lun"
package: beachmat
output:
  BiocStyle::html_document:
    toc_float: yes
vignette: >
  %\VignetteIndexEntry{1. Developer guide}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo=FALSE, results="hide", message=FALSE}
require(knitr)
opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)
```

# Background

The `r Biocpkg("beachmat")` package provides a C++ API for handling submatrices generated by `r Biocpkg("DelayedArray")`'s block processing mechanism.
This enables Bioconductor packages to use C++ for high-performance processing of data stored in almost any matrix representation, ranging from ordinary dense matrices and sparse matrices from the `r CRANpkg("Matrix")` package to file-backed or even cloud-based matrices (e.g., `r Biocpkg("HDF5Array")`, `r Biocpkg("TileDBArray")`).
Packages use `r Biocpkg("DelayedArray")` block processing to realize blocks of data into memory as sparse or dense matrices, which can then be easily consumed by C++ code via `r Biocpkg("beachmat")`'s API.
The aim is to abstract away the specific class (and even type!) of matrix object in R to facilitate development of C++ extensions.

# Historical context

Back in 2017, we had two broad strategies for how to make our various C++ functions work with alternative matrix representations.
The first approach was implemented in versions 1 and 2 of `r Biocpkg("beachmat")`, which took the `Rcpp::RObject` in C++, figured out the underlying representation, and then chose appropriate methods to read a vector of values from a given row or column.
This worked reasonably well and allowed people to write C++ code in a manner that is agnostic to the matrix representation.
Most importantly, it was a plug-and-play replacement for the usual `r CRANpkg("Rcpp")` `.column()` and `.row()` methods for matrix access, which allowed client packages to quickly get up and running with their existing C++ code.

The second strategy was to perform block-wise looping in R using `DelayedArray::blockApply`.
This would read in a block of values as an ordinary matrix that would then be passed into the C++ code, meaning that developers would only have to worry about handling ordinary (column-major) matrices in their C++ code.
The problem was that realization to a dense matrix would sacrifice some of the efficiencies of a sparse matrix format.
Also, implementing this would involve more refactoring, as the looping would be shared across both C++ and R - one loop over blocks in R, and another loop within each block in C++ to do the desired computation.

The equation has now changed with proper sparse support in `r Biocpkg("DelayedArray")`.
Block processing can now yield both dense and sparse submatrices (depending on the underlying representation), thus avoiding the previous loss of efficiency from coercion to a dense format.
With R-based blockwise looping, we get easy access to `r Biocpkg("BiocParallel")`-based parallelization; we avoid the smelly calls back into R from C++ that are necessary to handle matrices that are not of a known type; and we get a more user-tunable method of handling large matrix outputs via `RealizationSink`s.
Block processing is now the preferred mechanism for handling large matrices for independent computations across rows or columns.

So, how does `r Biocpkg("beachmat")` fit into this new paradigm?
Inside the block processing loop, we may wish to call a C++ function that takes the block and operates on it.
This block may be dense or sparse, but the C++ function may not care if it just wants to get, e.g., the individual row vectors out for further processing.
In this context, `r Biocpkg("beachmat")` can continue to provide agnostic data access from the realized blocks so that developers don't have to worry about handling different formats in their C++ code.
Alternatively, if the sparse format lends itself to a more efficient algorithm, `r Biocpkg("beachmat")` also provides dedicated sparse classes to operate only on the non-zero values.

# Linking instructions

The _beachmat_ package currently has several dependencies:

- The compiler should support the C++11 standard.
  This usually requires GCC version 4.8.1 or higher.
  You need to tell the build system to use C++11 by modifying the `SystemRequirements` field of the `DESCRIPTION` file:

      SystemRequirements: C++11

- `r CRANpkg("Rcpp")` should be installed.
  You also need to ensure that `r CRANpkg("Rcpp")` is loaded when your package is loaded.
  This requires addition of `Rcpp` to the `Imports` field of the `DESCRIPTION` file:

      Imports: Rcpp

  ... and a corresponding `importFrom` specification in the `NAMESPACE` file:

      importFrom(Rcpp, sourceCpp)

  (The exact function to be imported doesn't matter, as long as the namespace is loaded.
  Check out the `r CRANpkg("Rcpp")` documentation for more details.)

`r Biocpkg("beachmat")` is a header-only library, so linking is as simple as writing:

```
LinkingTo: Rcpp, beachmat
```

... in your `DESCRIPTION` file.
(This ensures the inclusion of the `r CRANpkg("Rcpp")` headers.)
In C or C++ code files, use standard techniques to include header definitions, e.g., `#include "beachmat3/numeric_matrix.h"` (see `r Biocpkg("beachmat", vignette="examples.html", label="here")` for more details).
Header files are available for perusal at the following location (enter in an R session):

```{r headers}
system.file(package="beachmat", "include")
```

If you see an `expected unqualified-id before 'using'` error during compilation, this usually means that the version of GCC used to compile `r Biocpkg("beachmat")` is out of date and does not fully support C++11.

# Examples of C++ code

## Reading agnostic inputs

The API itself has [comprehensive documentation](https://ltla.github.io/beachmat), so we will not go into detail here.
Instead, we will focus on the most common usage patterns.
Let's say we wanted to compute column sums from a logical, integer or numeric matrix-like object:

```{Rcpp}
#include "beachmat3/beachmat.h"
#include <vector>
#include <numeric>

// Not necessary in a package context:
// [[Rcpp::depends(beachmat)]]

// [[Rcpp::export(rng=false)]]
Rcpp::NumericVector column_sums (Rcpp::RObject mat) {
    auto ptr = beachmat::read_lin_block(mat);
    std::vector<double> workspace(ptr->get_nrow());

    Rcpp::NumericVector output(ptr->get_ncol());
    for (size_t i = 0; i < ptr->get_ncol(); ++i) {
        auto colptr = ptr->get_col(i, workspace.data());
        output[i] = std::accumulate(colptr, colptr + ptr->get_nrow(), 0.0);
    }

    return output;
}
```

The `read_lin_block()` function sets up a pointer to the underlying matrix data, from which values can be extracted with the `get_col()` method (or `get_row()`, if we were instead interested in the rows).
This may involve memory allocations, so we provide the method with a pointer to a "workspace" for it to put values in.
The `get_col()` method then returns a pointer to a vector of data values for each requested column `i`, which we accumulate and store in `output` to be returned to the R session.
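
For completeness, the same pattern works for row-wise access.
The sketch below is a hypothetical `row_sums()` analogue (not evaluated here); it assumes that `get_row()` mirrors `get_col()` and fills a workspace of length equal to the number of columns.

```cpp
#include "beachmat3/beachmat.h"
#include <vector>
#include <numeric>

// Hypothetical row-wise analogue of column_sums(), assuming get_row()
// behaves like get_col() but with a workspace of length get_ncol().
// [[Rcpp::export(rng=false)]]
Rcpp::NumericVector row_sums (Rcpp::RObject mat) {
    auto ptr = beachmat::read_lin_block(mat);
    std::vector<double> workspace(ptr->get_ncol());

    Rcpp::NumericVector output(ptr->get_nrow());
    for (size_t i = 0; i < ptr->get_nrow(); ++i) {
        auto rowptr = ptr->get_row(i, workspace.data());
        output[i] = std::accumulate(rowptr, rowptr + ptr->get_ncol(), 0.0);
    }

    return output;
}
```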

To demonstrate `column_sums()`:

```{r}
column_sums(matrix(rnorm(1000), ncol=10))
```

This code works with either dense or sparse blocks generated by `r Biocpkg("DelayedArray")` block processing.
Additionally, it automatically supports all numeric-type data that might be in `mat`, i.e., logical, integer or double-precision values.
All values are cast to the type of the workspace - in this case, a double-precision vector, but it is equally possible to cast to an integer array by creating an `int` workspace.
This means that we can avoid writing multiple versions of code to handle different types of inputs.

```{r}
# Integer
column_sums(matrix(rpois(1000, 10), ncol=10))

# Logical
column_sums(matrix(rbinom(1000, 1, 0.5)==1, ncol=10))

# Sparse
column_sums(Matrix::rsparsematrix(100, 10, 0.1))
```

The pointer returned by `get_col()` may point to the workspace, so it is only valid until the next use of that workspace; callers should not rely on it after any subsequent call to `get_col()`.
In addition, there is no guarantee that the workspace is actually populated with data values, as some matrix representations may not require a copy to access the underlying data.
If a copy into the workspace is required, the following pattern can be used:

```{Rcpp}
#include "beachmat3/beachmat.h"
#include <vector>
#include <numeric>
#include <algorithm>

// Not necessary in a package context:
// [[Rcpp::depends(beachmat)]]

// [[Rcpp::export(rng=false)]]
Rcpp::NumericVector column_sums_copy (Rcpp::RObject mat) {
    auto ptr = beachmat::read_lin_block(mat);
    std::vector<double> workspace(ptr->get_nrow());

    Rcpp::NumericVector output(ptr->get_ncol());
    for (size_t i = 0; i < ptr->get_ncol(); ++i) {
        auto colptr = ptr->get_col(i, workspace.data());

        // Checking if colptr is already pointing to the workspace.
        if (colptr != workspace.data()) {
            // If not, forcibly copying contents into the workspace.
            std::copy(colptr, colptr + ptr->get_nrow(), workspace.data());
        }

        output[i] = std::accumulate(workspace.begin(), workspace.end(), 0.0);
    }

    return output;
}
```

## Reading sparse inputs

For some applications, it is possible to implement a more efficient algorithm for sparse data that only uses the non-zero entries.
We can use `read_lin_sparse_block()` to create a pointer to a sparse C++ matrix, which overloads the `get_col()` and `get_row()` methods to return a listing of the non-zero entries.
This is demonstrated below, where the column sums are computed by iterating only over the non-zero entries.
Note that we need two workspace vectors this time, one for the non-zero values and another for their indices.
(Again, there is no guarantee that the workspaces are populated, so a manual copy should be performed if this is necessary.)

```{Rcpp}
#include "beachmat3/beachmat.h"
#include <vector>
#include <numeric>

// Not necessary in a package context:
// [[Rcpp::depends(beachmat)]]

// [[Rcpp::export(rng=false)]]
Rcpp::NumericVector column_sums_sparse (Rcpp::RObject mat) {
    auto ptr = beachmat::read_lin_sparse_block(mat);
    std::vector<double> workspace_x(ptr->get_nrow());
    std::vector<int> workspace_i(ptr->get_nrow());

    Rcpp::NumericVector output(ptr->get_ncol());
    for (size_t i = 0; i < ptr->get_ncol(); ++i) {
        auto indices = ptr->get_col(i, workspace_x.data(), workspace_i.data());
        auto xptr = indices.x;
        auto iptr = indices.i; // row indices, not used here.
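        // indices.n gives the number of non-zero entries in this column.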
        auto nnzero = indices.n;
        output[i] = std::accumulate(xptr, xptr + nnzero, 0.0);
    }

    return output;
}
```

To demonstrate:

```{r}
column_sums_sparse(Matrix::rsparsematrix(100, 10, 0.1))

# Errors out as an ordinary matrix isn't sparse:
try(column_sums_sparse(matrix(rpois(1000, 10), ncol=10)))
```

In practice, we would like our C++ function to be able to handle both sparse and dense inputs from block processing.
We can do so by checking whether our matrix `is_sparse()` and, if it is, promoting it to a sparse instance as shown below.
This allows developers to smoothly switch between algorithms according to the observed representation of the data.

```{Rcpp}
#include "beachmat3/beachmat.h"
#include <vector>
#include <numeric>

// Not necessary in a package context:
// [[Rcpp::depends(beachmat)]]

// [[Rcpp::export(rng=false)]]
Rcpp::NumericVector column_sums_flexible (Rcpp::RObject mat) {
    auto ptr = beachmat::read_lin_block(mat);
    Rcpp::NumericVector output(ptr->get_ncol());
    std::vector<double> workspace(ptr->get_nrow());

    if (ptr->is_sparse()) {
        auto sptr = beachmat::promote_to_sparse(ptr); // note, ptr becomes NULL.
        std::vector<int> workspace_i(sptr->get_nrow());

        for (size_t i = 0; i < sptr->get_ncol(); ++i) {
            auto indices = sptr->get_col(i, workspace.data(), workspace_i.data());
            auto xptr = indices.x;
            output[i] = std::accumulate(xptr, xptr + indices.n, 0.0);
        }
    } else {
        for (size_t i = 0; i < ptr->get_ncol(); ++i) {
            auto vec = ptr->get_col(i, workspace.data());
            output[i] = std::accumulate(vec, vec + ptr->get_nrow(), 0.0);
        }
    }

    return output;
}
```

To demonstrate:

```{r}
column_sums_flexible(Matrix::rsparsematrix(100, 10, 0.1))
column_sums_flexible(matrix(rpois(1000, 10), ncol=10))
```

## Creating matrix outputs

For the most part, output matrices should be generated using `Rcpp::Matrix` objects.
If these outputs are anticipated to be large, we suggest using the `r Biocpkg("DelayedArray")` `RealizationSink` infrastructure to store the results in, e.g., a file-backed matrix across block processing iterations.
On occasion, sparse outputs may need to be generated, so `r Biocpkg("beachmat")` provides some helpful utilities for doing so.

The first and more general approach is relevant when the number of non-zero entries is not known in advance.
This assumes that you can collect all non-zero entries into a `std::map` for storage:

```{Rcpp}
#include "beachmat3/beachmat.h"
#include <map>

// Not necessary in a package context:
// [[Rcpp::depends(beachmat)]]

// [[Rcpp::export(rng=false)]]
Rcpp::RObject generate_sparse_general() {
    std::map<std::pair<int, int>, double> storage;
    storage[std::make_pair(1, 0)] = 0.1; // column 2, row 1
    storage[std::make_pair(2, 5)] = 0.2; // column 3, row 6
    storage[std::make_pair(0, 3)] = 0.3; // column 1, row 4
    storage[std::make_pair(5, 4)] = 0.4; // column 6, row 5

    // Create a dgCMatrix with 7 rows and 10 columns.
    return beachmat::as_gCMatrix(7, 10, storage);
}
```

To demonstrate:

```{r}
generate_sparse_general()
```
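
For reference, the same matrix can be built directly in R from the (row, column, value) triplets listed in the comments above.
This is not part of the `r Biocpkg("beachmat")` API; it is just a cross-check that should match the output of `generate_sparse_general()`, given the (column, row) ordering of the map keys:

```{r}
# Same triplets as in the C++ code above, using 1-based indices.
Matrix::sparseMatrix(i=c(1, 6, 4, 5), j=c(2, 3, 1, 6),
    x=c(0.1, 0.2, 0.3, 0.4), dims=c(7, 10))
```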

Alternatively, a more specific approach is to replace the non-zero values of an existing `*gCMatrix` object with another vector corresponding to new values in the same order and position as the original values.
This is more efficient for operations that do not change the positions of the non-zero entries.

```{Rcpp}
#include "beachmat3/beachmat.h"

// Not necessary in a package context:
// [[Rcpp::depends(beachmat)]]

// [[Rcpp::export(rng=false)]]
Rcpp::RObject generate_sparse_specific(Rcpp::RObject mat) {
    auto ptr = beachmat::read_lin_sparse_block(mat);
    auto nnzero = ptr->get_nnzero();
    Rcpp::LogicalVector thing(nnzero, 1);
    return beachmat::as_gCMatrix(mat, thing);
}
```

To demonstrate:

```{r}
library(Matrix)
mat <- rsparsematrix(10, 10, 0.1)
generate_sparse_specific(mat)
```

# With block processing

Once we have written our desired function, we can perform `r Biocpkg("DelayedArray")` block processing on any matrix-like input.
This can be done directly using the `blockApply()` function, or we can use `r Biocpkg("beachmat")`'s wrappers to conveniently split up the input matrix into blocks of consecutive rows or columns:

```{r}
library(beachmat)
blockedColSums <- function(x, ...) {
    out <- colBlockApply(x, FUN=column_sums, ...)
    unlist(out) # combining results across blocks.
}
```

This function is now capable of operating on any matrix-like object that can be subjected to block processing.
Roughly speaking, if an object has two dimensions and an `as.array()` method, it can be processed correctly.

```{r}
# Ordinary matrices:
x1 <- matrix(rnorm(1000), ncol=20)
blockedColSums(x1)

# Works for Matrix classes:
x2 <- Matrix(x1)
blockedColSums(x2)

# Works for DelayedMatrix objects:
library(DelayedArray)
x3 <- DelayedArray(x1)
blockedColSums(x3)
```

As hinted above, this process can be easily parallelized by simply passing `BPPARAM=` to the `colBlockApply()` function.
Note that this only has a benefit for large matrices, where the gains from parallelization offset the associated overhead.

```{r, eval=.Platform$OS.type=="unix"}
library(BiocParallel)
blockedColSums(x3, BPPARAM=MulticoreParam(2))
```

# Session information {-}

```{r}
sessionInfo()
```