---
title: "Using HDF5-backed matrices with beachmat"
author: "Aaron Lun"
package: beachmat.hdf5
output: 
  BiocStyle::html_document:
    toc_float: yes
vignette: >
  %\VignetteIndexEntry{User guide}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}  
---

```{r, echo=FALSE, results="hide", message=FALSE}
require(knitr)
opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)
```

# Overview 

`r Biocpkg("beachmat.hdf5")` provides a C++ API to extract numeric data from the HDF5-backed matrices of the `r Biocpkg("HDF5Array")` package.
This extends the `r Biocpkg("beachmat")` package to the matrix representations in the [**tatami_hdf5**](https://github.com/tatami-inc/tatami_hdf5) library.
By including this package, users and developers can enable **tatami**-compatible C++ code to operate natively on file-backed data via the HDF5 C library.

# For users 

Users can simply load the package in their R session:

```{r}
library(beachmat.hdf5)
```

This will automatically extend `r Biocpkg("beachmat")`'s functionality to `r Biocpkg("HDF5Array")` matrices.
Any package code based on `r Biocpkg("beachmat")` will now be able to access HDF5 data natively without any further work.

# For developers

Developers should read the `r Biocpkg("beachmat")` developer guide if they have not done so already.

Developers can import `r Biocpkg("beachmat.hdf5")` in their packages to guarantee native support for `r Biocpkg("HDF5Array")` classes.
This registers additional `initializeCpp()` methods that initialize the appropriate C++ representations for these classes.
Of course, this adds some more dependencies to the package, which may or may not be acceptable;
some developers may prefer to leave this choice to the user or hide it behind an optional parameter to reduce the installation burden 
(e.g., if HDF5-backed matrices are not expected to be a common input in the package workflow).
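
One way to leave the choice to the user is sketched below, assuming that `r Biocpkg("beachmat.hdf5")` is listed under `Suggests` rather than `Imports`.
The helper name `enableNativeHdf5()` and its `native.hdf5=` argument are hypothetical and only serve as an illustration.
The sketch also assumes that loading the namespace is enough to register the extra `initializeCpp()` methods.

```{r}
# Hypothetical helper for a downstream package; neither the function name
# nor the 'native.hdf5' argument is part of any existing API.
enableNativeHdf5 <- function(native.hdf5=TRUE) {
    if (!native.hdf5) {
        return(FALSE)
    }
    # requireNamespace() returns FALSE, rather than throwing an error,
    # if the package is not installed.
    requireNamespace("beachmat.hdf5", quietly=TRUE)
}

# Opting out never attempts to load the package.
enableNativeHdf5(native.hdf5=FALSE)
```

If the helper returns `FALSE`, the calling code can proceed as usual and rely on `r Biocpkg("beachmat")`'s default handling of the matrix.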

It's worth noting that `r Biocpkg("beachmat")` by itself will already work with `HDF5Matrix`, `H5SparseMatrix`, etc. objects even without loading `r Biocpkg("beachmat.hdf5")`.
However, this is less efficient, because package C++ code must call back into R to extract the matrix data via `DelayedArray::extract_array()` and friends.
Importing `r Biocpkg("beachmat.hdf5")` provides native support without the need for calls to R functions.
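
As a minimal sketch of what this looks like from the R side, the chunk below writes a small simulated matrix to a temporary HDF5 file and initializes a C++ representation from the resulting `HDF5Matrix`.
The simulated data, the temporary file and the dataset name `"counts"` are arbitrary choices for illustration.

```{r}
library(beachmat.hdf5)
library(HDF5Array)

# Simulate a small count matrix and write it to a temporary HDF5 file.
# The file location and dataset name are arbitrary.
y <- matrix(rpois(1000, lambda=5), nrow=50, ncol=20)
h5path <- tempfile(fileext=".h5")
h5mat <- writeHDF5Array(y, filepath=h5path, name="counts")
class(h5mat)

# With beachmat.hdf5 loaded, this should use the HDF5-specific method
# instead of falling back to block-wise extraction through R.
ptr <- beachmat::initializeCpp(h5mat)
```

The returned object is intended to be passed to package C++ code, following the conventions described in the `r Biocpkg("beachmat")` developer guide.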

# In-memory caching

The `initializeCpp()` methods for the `r Biocpkg("HDF5Array")` classes have an optional `memorize=` parameter.
If this is `TRUE`, the entire matrix is loaded from the HDF5 file into memory and stored in a global cache on first use.
Any subsequent calls to `initializeCpp()` on the same matrix instance will re-use the cached value.

In-memory caching is intended for functions or workflows that need to iterate through the matrix multiple times.
By setting `memorize=TRUE`, developers can pay an up-front loading cost to avoid the repeated penalty of disk access on subsequent iterations.
This assumes that the matrix is small enough to be held in memory in its entirety.

For long-running analyses, users may call `beachmat::flushMemoryCache()` to clear the cache.
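
The caching workflow described above might look like the following sketch.
The simulated matrix, the temporary file and the dataset name `"values"` are again arbitrary; only `memorize=` and `flushMemoryCache()` come from the packages themselves.

```{r}
library(beachmat.hdf5)
library(HDF5Array)

y <- matrix(runif(1000), nrow=50, ncol=20)
h5mat <- writeHDF5Array(y, filepath=tempfile(fileext=".h5"), name="values")

# First call: the matrix is read from the HDF5 file and stored in the cache.
ptr1 <- beachmat::initializeCpp(h5mat, memorize=TRUE)

# Second call on the same matrix instance: the cached copy is re-used,
# so the file does not need to be read again.
ptr2 <- beachmat::initializeCpp(h5mat, memorize=TRUE)

# Clear the cache once the in-memory copy is no longer needed.
beachmat::flushMemoryCache()
```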

# Session information {-}

```{r}
sessionInfo()
```