beachmat 1.4.0
This document describes the use of the beachmat API for accessing data in R matrices. We will demonstrate the API on numeric matrices, though same semantics are used for matrices of other types (e.g., logical, integer, character). First, we include the relevant header file:
#include "beachmat/numeric_matrix.h"
A double-precision matrix object dmat
is handled in C++ by passing the SEXP
struct from .Call
to create_numeric_matrix
:
auto dptr = beachmat::create_numeric_matrix(dmat);
This creates a unique pointer that points to an object of the numeric_matrix
base class.
The exact derived class that is actually instantiated depends on the type of matrix in dmat
, though the behaviour of the user-level functions are not affected by this detail.
Additional notes
auto
keyword just avoids the need to write the full type of the returned pointer, which is std::unique_ptr<beachmat::numeric_matrix>
.
We use unique pointers to control ownership and smoothly handle destruction and memory deallocation at the end of the function.std::exception
class, containing an informative error message.
These should be caught and handled gracefully by the end-user code, otherwise a segmentation fault will probably occur.
See the error-handling mechanism in Rcpp for how to deal with these exceptions.The get_nrow()
method returns the number of rows in the matrix:
size_t nrow = dptr->get_nrow();
The get_ncol()
method returns the number of columns in the matrix:
size_t ncol = dptr->get_ncol();
The get_matrix_type()
method returns the type of matrix representation pointed to by dptr
.
This can be tested against constants like beachmat::SIMPLE
or beachmat::SPARSE
.
beachmat::matrix_type mat_type = dptr->get_matrix_type();
The yield()
method returns the original R matrix that was used to create dptr
.
Rcpp::RObject original = dptr->yield();
The get_col()
method fills an iterator in
to an Rcpp vector with values from a column c
of the matrix.
There should be at least nrow
accessible elements, i.e., *in
and *(in+nrow-1)
should be valid entries.
dptr->get_col(
c, /* size_t */
in /* Rcpp::Vector::iterator */
);
Extraction of a range of the column can be specified with the first
and last
arguments.
This will fill in
with values at column c
from row first
to last-1
.
There should be at least last-first
accessible elements, i.e., *in
and *(in+last-first-1)
should be valid entries.
dptr->get_col(
c, /* size_t */
in, /* Rcpp::Vector::iterator */
first, /* size_t */
last /* size_t */
);
No value is returned by either of these methods.
Note that c
should be a zero-indexed integer in [0, ncol)
.
Similarly, both first
and last
should be in [0, nrow]
and zero-indexed, with the additional requirement that last >= first
.
The get_row()
method takes an iterator in
to a Rcpp vector and fills it with values at row r
.
There should be at least ncol
accessible elements, i.e., *in
and *(in+ncol-1)
should be valid entries.
dptr->get_row(
r, /* size_t */
in /* Rcpp::Vector::iterator */
);
Extraction of a range of the row can be specified with the first
and last
arguments.
This will fill in
with values at row r
from column first
to last-1
.
There should be at least last-first
accessible elements, i.e., *in
and *(in+last-first-1)
should be valid entries.
dptr->get_row(
r, /* size_t */
in, /* Rcpp::Vector::iterator */
first, /* size_t */
last /* size_t */
);
No value is returned by either of these methods.
Again, r
should be a zero-indexed integer in [0, nrow)
.
Both first
and last
should be in [0, ncol]
and zero-indexed, with the additional requirement that last >= first
.
The get()
method returns a double-precision value at the matrix entry for row r
and column c
.
Both r
and c
should be zero-indexed integers in [0, nrow)
and [0, ncol)
respectively.
double val = dptr->get(
r, /* size_t */
c /* size_t */
);
If the object in
is a Rcpp::NumericVector::iterator
instance, matrix entries will be extracted as double-precision values.
If it is a Rcpp::IntegerVector::iterator
instance, matrix entries will be extracted as integers with implicit conversion from the double-precision type in dptr
.
It is also possible to use a Rcpp::LogicalVector::iterator
, though see the warnings below.
The get_const_col()
method returns an iterator to a Rcpp vector pointing to first
row of column c
.
The arguments are the same as get_col()
1 Users can also specify first
and last
, though we will not do so here for simplicity. with work
being equivalent to in
.
The type of work
and the output iterator must correspond to the data type of the matrix - in this case, both should be Rcpp::NumericVector::iterator
objects.
Rcpp::NumericVector::iterator it = dptr->get_const_col(
c, /* size_t */
work, /* Rcpp::NumericVector::iterator */
);
For simple/dense matrices, this method is more efficient than get_col()
as it returns the iterator without needing to copy data into work
.
For other matrices, this function simply calls get_col()
to copy the data into the vector starting at work
, and then returns an iterator equal to work
.
Thus, for general use, work
must point to a writeable block of memory, even if it does not get used when dptr
points to a simple/dense matrix.
Additional notes
work
to be read-only after a call to get_const_col()
.
In particular, the underlying data will be modified if get_const_col()
is called again with the same work
.
This has some consequences if a process calls get_const_col()
multiple times to obtain more than one iterators for further computation.
In such cases, developers should ensure that each get_const_col()
call uses a work
pointing to a different vector.The get_const_col_indexed()
method is guaranteed to index at least all the “non-zero” entries in column c
of the matrix.
For numeric types, this represents non-zero values (obviously), while for character matrices, this represents non-empty strings.
auto indexinfo = dptr->get_const_col_indexed(
c, /* size_t */
work, /* Rcpp::NumericVector::iterator */
first, /* size_t */
last /* size_t */
);
It returns a const_col_indexed_info
object, which is a tuple consisting of:
size_t
, specifying the number of entries in column c
from rows [first, last)
.Rcpp::IntegerVector::iterator
, pointing to a vector containing the row index for each entry in column c
from rows [first, last)
.Rcpp::NumericVector::iterator
, pointing to a vector containing the value of each entry in column c
from rows [first, last)
.The data type of work
and the output iterator in (3) must correspond to that of the matrix - in this case, both should be Rcpp::NumericVector::iterator
objects.
Again, first
and last
do not have to be specified and will default to the entire column.
This method is quite efficient for dgCMatrix
objects, as it will directly return iterators to the indices and values of the non-zero entries only.
No copying is performed, and the zero values do not have to be explicitly generated as they would be in get_col
.
For all other matrices, get_const_col
is called2 The comments above regarding the read-only nature of work
also apply here. and iterator (2) is pointed at an internal array of consecutive integers (which should be treated as read-only).
Note that this method is not guaranteed to index only the non-zero entries. The specific representation may choose to return only the non-zero entries or all of them.
The get_cols()
method fills an iterator in
to an Rcpp vector with values from multiple columns of the matrix.
The idx
iterator should point to an array of integers of length n
, containing the column indices to use for extraction.
The indices should be zero-based and strictly increasing, i.e., no duplicates.
dptr->get_cols(
idx, /* Rcpp::IntegerVector::iterator */
n, /* size_t */
in, /* Rcpp::Vector::iterator */
first, /* size_t */
last /* size_t */
);
For each column, the range of values in [first, last)
are extracted.
If first
and last
are not specified, the range will default to [0, nrow)
.
Thus, there should be at least n*(last-first)
accessible elements pointed to by in
.
This method will extract values in column-major format.
That is, if one were to compute a submatrix containing the selected columns and the chosen row range, that submatrix would be available in column-major form in in
.
No value is returned by this method.
The get_rows()
method fills an iterator in
to an Rcpp vector with values from multiple rows of the matrix.
The idx
iterator should point to an array of integers of length n
, containing the column indices to use for extraction.
The indices should be zero-based and strictly increasing, i.e., no duplicates.
dptr->get_rows(
idx, /* Rcpp::IntegerVector::iterator */
n, /* size_t */
in, /* Rcpp::Vector::iterator */
first, /* size_t */
last /* size_t */
);
For each row, the range of values in [first, last)
are extracted.
If first
and last
are not specified, the range will default to [0, ncol)
.
Thus, there should be at least n*(last-first)
accessible elements pointed to by in
.
Like get_cols()
, this method will extract values in column-major format.
That is, if one were to compute a submatrix containing the selected columns and the chosen row range, that submatrix would be available in column-major form in in
.
Note that this means that contiguous elements in in
are not from the same row!
Rather, they will be from the same column, but only from the rows specified by idx
.
No value is returned by this method.
To create logical, integer and character matrices, include the following header files:
#include "beachmat/logical_matrix.h"
#include "beachmat/integer_matrix.h"
#include "beachmat/character_matrix.h"
The dispatch function changes correspondingly for logical matrix lmat
, integer matrix imat
or character matrix cmat
.
Each function creates a unique pointer to a *_matrix
of the appropriate type.
// creates a std::unique_ptr<beachmat::logical_matrix>
auto lptr=beachmat::create_logical_matrix(lmat);
// creates a std::unique_ptr<beachmat::integer_matrix>
auto iptr=beachmat::create_integer_matrix(imat);
// creates a std::unique_ptr<beachmat::character_matrix>
auto cptr=beachmat::create_character_matrix(cmat);
Equivalent methods are available for each matrix type with appropriate changes in type.
For integer and logical matrices, get()
will return an integer.
in
can be any type previously described for numeric_matrix
objects.
work
should be an iterator
to a Rcpp::IntegerVector
for integer matrices or a Rcpp::LogicalVector
for logical matrices.
Similar changes apply to the iterator in the tuple from get_const_col_indexed()
.
For character matrices, all iterators should be of type Rcpp::StringVector::iterator
, and get()
will return a Rcpp::String
.
Additional notes
in
is a Rcpp::LogicalVector::iterator
for non-logical matrices, the result may not behave as expected.
For numeric_matrix
instances, double-precision values in (-1, 1)
are coerced to zero due to double-to-integer casting in C++.
This is not consistent with the behaviour in R for non-zero values, which are coerced to TRUE
.
For integer_matrix
instances, integer values are not coerced to {0, 1}
when they are assigned to *in
.
Thus, even though the interpretation is correct, the vector produced will not be equivalent to the result of an as.logical
call.
As a general rule, it is unwise to use Rcpp::LogicalVector::iterator
s for anything other than logical_matrix
access.character_matrix
data, we do not return raw const char*
pointers to the C-style string.
Rather, the Rcpp::String
class is used as it provides a convenient wrapper around the underlying CHARSXP
.
This ensures that the string is stored in R’s global cache and is suitably protected against garbage collection.The following matrix classes are directly supported by the API:
matrix
, dgeMatrix
, dgCMatrix
, HDF5Matrix
matrix
, RleMatrix
, HDF5Matrix
matrix
, lgeMatrix
, lgCMatrix
, HDF5Matrix
matrix
, HDF5Matrix
The API will also directly support DelayedMatrix
objects using the above matrices as backends and containing only subsetting or transposition operations.
It is possible to directly support arbitrary user-supplied matrices, see here for more details.
For all other matrices, the API indirectly supports data access via a block processing mechanism.
This involves a call to R to realize a block of the matrix (containing the requested row or column) as a dense contiguous array.
A block is realized so that further requests to rows/columns within the same block do not involve a new call to R.
The size of the blocks can be controlled using methods in the DelayedArray package, see ?blockGrid
for details.
Additional notes
yield
method can be used to obtain the original Rcpp::RObject
for input to RcppArmadillo or RcppEigen.
This functionality is generally limited to base matrices, though there is also limited support for sparse matrices in these libraries.The clone()
method returns a unique pointer to a numeric_matrix
instance of the same type as that pointed to by dptr
.
auto dptr_copy = dptr->clone();
This is useful as direct use of the beachmat API is not thread-safe for simultaneous calls to the get
methods from different threads.
Some methods use cached class members for greater efficiency, and simultaneous calls will cause race conditions.
It is the responsibility of the calling function to coordinate data access across threads.
To this end, the clone
method can be called to generate a unique pointer to a new *_matrix
instance, which can be used concurrently in another thread.
This is fairly cheap as the underlying matrix data are not copied.
An example of parallelized beachmat code using OpenMP might look like this:
#pragma omp parallel
{
beachmat::numeric_matrix* rptr=NULL;
std::unique_ptr<beachmat::numeric_matrix> uptr=nullptr;
if (omp_get_thread_num()==0) {
rptr=dptr.get();
} else {
uptr=dptr->clone();
rptr=uptr.get();
}
size_t col;
const size_t NC=rptr->get_ncol();
Rcpp::NumericVector output(rptr->get_nrow());
#pragma omp for schedule(static)
for (col=0; col<NC; ++col) {
// Do parallel operation here.
rptr->get_col(col, output.begin());
}
}
The start of the parallel region uses the existing dptr
in the master thread and clones a new matrix in the other threads.
The parallelized for
loop then uses rptr
to avoid race conditions in cached variables.
Note that a static schedule may be faster than other schedule types, as several of the matrix implementations in beachmat are optimized for consecutive row/column access.