%\VignetteIndexEntry{rhdf5} %\VignetteKeywords{HDF5} %\VignettePackage{rhdf5} \documentclass[10pt,a4paper]{article} <>= BiocStyle::latex() @ \RequirePackage{amsfonts,amsmath,amstext,amssymb,amscd} \usepackage{graphicx} \usepackage{xcolor,colortbl} %\usepackage{verbatim} %\usepackage{hyperref} %\usepackage{color} %\definecolor{darkblue}{rgb}{0.2,0.0,0.4} %\topmargin -1.5cm %\oddsidemargin -0cm % read Lamport p.163 %\evensidemargin -0cm % same as oddsidemargin but for left-hand pages %\textwidth 17cm %\textheight 24.5cm %\parindent0em %\newcommand{\lib}[1]{{\mbox{\normalfont\textsf{#1}}}} %\newcommand{\file}[1]{{\mbox{\normalfont\textsf{'#1'}}}} %\newcommand{\R}{{\mbox{\normalfont\textsf{R}}}} %\newcommand{\Rfunction}[1]{{\mbox{\normalfont\texttt{#1}}}} %\newcommand{\Robject}[1]{{\mbox{\normalfont\texttt{#1}}}} %\newcommand{\Rpackage}[1]{{\mbox{\normalfont\textsf{#1}}}} %\newcommand{\Rclass}[1]{{\mbox{\normalfont\textit{#1}}}} %\newcommand{\code}[1]{{\mbox{\normalfont\texttt{#1}}}} %\newcommand{\email}[1]{\mbox{\href{mailto:#1}{\textcolor{darkblue}{\normalfont{#1}}}}} %\newcommand{\web}[2]{\mbox{\href{#2}{\textcolor{darkblue}{\normalfont{#1}}}}} \SweaveOpts{keep.source=TRUE,eps=FALSE} \begin{document} \title{rhdf5 - HDF5 interface for R} \author{Bernd Fischer} \maketitle \tableofcontents \section{Introduction} The package is an R interface for HDF5. On the one hand it implements \R{}~interfaces to many of the low level functions from the C interface. On the other hand it provides high level convenience functions on \R{}~ level to make a usage of HDF5 files more easy. \section{Installation of the HDF5 package} To install the package \Biocpkg{rhdf5}, you need a current version (>2.15.0) of \R{}~(www.r-project.org). After installing \R{}~you can run the following commands from the \R{}~command shell to install the bioconductor package \Biocpkg{rhdf5}. <>= source("http://bioconductor.org/biocLite.R") biocLite("rhdf5") @ \section{High level \R~-HDF5 functions} \subsection{Creating an HDF5 file and group hierarchy} An empty HDF5 file is created by <>= library(rhdf5) h5createFile("myhdf5file.h5") @ The HDF5 file can contain a group hierarchy. We create a number of groups and list the file content afterwards. <>= h5createGroup("myhdf5file.h5","foo") h5createGroup("myhdf5file.h5","baa") h5createGroup("myhdf5file.h5","foo/foobaa") h5ls("myhdf5file.h5") @ \subsection{Writing and reading objects} \label{sec_writeMatrix} Objects can be written to the HDF5 file. Attributes attached to an object are written as well, if \Robject{write.attributes=TRUE} is given as argument to \Robject{h5write}. Note that not all \R{}-attributes can be written as HDF5 attributes. <>= A = matrix(1:10,nr=5,nc=2) h5write(A, "myhdf5file.h5","foo/A") B = array(seq(0.1,2.0,by=0.1),dim=c(5,2,2)) attr(B, "scale") <- "liter" h5write(B, "myhdf5file.h5","foo/B") C = matrix(paste(LETTERS[1:10],LETTERS[11:20], collapse=""), nr=2,nc=5) h5write(C, "myhdf5file.h5","foo/foobaa/C") df = data.frame(1L:5L,seq(0,1,length.out=5), c("ab","cde","fghi","a","s"), stringsAsFactors=FALSE) h5write(df, "myhdf5file.h5","df") h5ls("myhdf5file.h5") D = h5read("myhdf5file.h5","foo/A") E = h5read("myhdf5file.h5","foo/B") F = h5read("myhdf5file.h5","foo/foobaa/C") G = h5read("myhdf5file.h5","df") @ If a dataset with the given \Robject{name} does not yet exist, a dataset is created in the HDF5 file and the object \Robject{obj} is written to the HDF5 file. If a dataset with the given \Robject{name} already exists and the datatype and the dimensions are the same as for the object \Robject{obj}, the data in the file is overwritten. If the dataset already exists and either the datatype or the dimensions are different, \Rfunction{h5write} fails. \subsection{Writing and reading with subsetting, chunking and compression} The \Biocpkg{rhdf5} package provides two ways of subsetting. One can specify the submatrix with the \R{}-style index lists or with the HDF5 style hyperslabs. Note, that the two next examples below show two alternative ways for reading and writing the exact same submatrices. Before writing subsetting or hyperslabbing, the dataset with full dimensions has to be created in the HDF5 file. This can be achieved by writing once an array with full dimensions as in Section \ref{sec_writeMatrix} or by creating a dataset. Afterwards the dataset can be written sequentially. \paragraph{Influence of chunk size and compression level} The chosen chunk size and compression level have a strong impact on the reading and writing time as well as on the resulting file size. In an example an integer vector of size 10e7is written to an HDF5 file. The file is written in subvectors of size 10'000. The definition of the chunk size influences the reading as well as the writing time. In the chunk size is much smaller or much larger than actually used, the runtime performance decreases dramatically. Furthermore the file size is larger for smaller chunk sizes, because of an overhead. The compression can be much more efficient when the chunk size is very large. The following figure illustrates the runtime and file size behaviour as a function of the chunk size for a small toy dataset. \begin{center} \includegraphics[width=0.75\textwidth]{../inst/demoChunkSize/chunksize.pdf} \end{center} After the creation of the dataset, the data can be written sequentially to the HDF5 file. Subsetting in \R{}-style needs the specification of the argument index to \Rfunction{h5read} and \Rfunction{h5write}. <>= h5createDataset("myhdf5file.h5", "foo/S", c(5,8), storage.mode = "integer", chunk=c(5,1), level=7) h5write(matrix(1:5,nr=5,nc=1), file="myhdf5file.h5", name="foo/S", index=list(NULL,1)) h5read("myhdf5file.h5", "foo/S") h5write(6:10, file="myhdf5file.h5", name="foo/S", index=list(1,2:6)) h5read("myhdf5file.h5", "foo/S") h5write(matrix(11:40,nr=5,nc=6), file="myhdf5file.h5", name="foo/S", index=list(1:5,3:8)) h5read("myhdf5file.h5", "foo/S") h5write(matrix(141:144,nr=2,nc=2), file="myhdf5file.h5", name="foo/S", index=list(3:4,1:2)) h5read("myhdf5file.h5", "foo/S") h5write(matrix(151:154,nr=2,nc=2), file="myhdf5file.h5", name="foo/S", index=list(2:3,c(3,6))) h5read("myhdf5file.h5", "foo/S") h5read("myhdf5file.h5", "foo/S", index=list(2:3,2:3)) h5read("myhdf5file.h5", "foo/S", index=list(2:3,c(2,4))) h5read("myhdf5file.h5", "foo/S", index=list(2:3,c(1,2,4,5))) @ The HDF5 hyperslabs are defined by some of the arguments \Robject{start}, \Robject{stride}, \Robject{count}, and \Robject{block}. These arguments are not effective, if the argument \Robject{index} is specified. <>= h5createDataset("myhdf5file.h5", "foo/H", c(5,8), storage.mode = "integer", chunk=c(5,1), level=7) h5write(matrix(1:5,nr=5,nc=1), file="myhdf5file.h5", name="foo/H", start=c(1,1)) h5read("myhdf5file.h5", "foo/H") h5write(6:10, file="myhdf5file.h5", name="foo/H", start=c(1,2), count=c(1,5)) h5read("myhdf5file.h5", "foo/H") h5write(matrix(11:40,nr=5,nc=6), file="myhdf5file.h5", name="foo/H", start=c(1,3)) h5read("myhdf5file.h5", "foo/H") h5write(matrix(141:144,nr=2,nc=2), file="myhdf5file.h5", name="foo/H", start=c(3,1)) h5read("myhdf5file.h5", "foo/H") h5write(matrix(151:154,nr=2,nc=2), file="myhdf5file.h5", name="foo/H", start=c(2,3), stride=c(1,3)) h5read("myhdf5file.h5", "foo/H") h5read("myhdf5file.h5", "foo/H", start=c(2,2), count=c(2,2)) h5read("myhdf5file.h5", "foo/H", start=c(2,2), stride=c(1,2),count=c(2,2)) h5read("myhdf5file.h5", "foo/H", start=c(2,1), stride=c(1,3),count=c(2,2), block=c(1,2)) @ \subsection{Saving multiple objects to an HDF5 file (h5save)} A number of objects can be written to the top level group of an HDF5 file with the function \Rfunction{h5save} (as analogonanalogous to the \R{}~function \Rfunction{save}). <>= A = 1:7; B = 1:18; D = seq(0,1,by=0.1) h5save(A, B, D, file="newfile2.h5") h5dump("newfile2.h5") @ \subsection{List the content of an HDF5 file} The function \Rfunction{h5ls} provides some ways of viewing the content of an HDF5 file. <>= h5ls("myhdf5file.h5") h5ls("myhdf5file.h5", all=TRUE) h5ls("myhdf5file.h5", recursive=2) @ \subsection{Dump the content of an HDF5 file} The function \Rfunction{h5dump} is similar to the function \Rfunction{h5ls}. If used with the argument \Robject{load=FALSE}, it produces the same result as \Rfunction{h5ls}, but with the group structure resolved as a hierarchy of lists. If the default argument \Robject{load=TRUE} is used all datasets from the HDF5 file are read. <>= h5dump("myhdf5file.h5",load=FALSE) D <- h5dump("myhdf5file.h5") @ \subsection{Reading HDF5 files with external software} The content of the HDF5 file can be checked with the command line tool \software{h5dump} (available on linux-like systems with the HDF5 tools package installed) or with the graphical user interface \software{HDFView} (http://www.hdfgroup.org/hdf-java-html/hdfview/) available for all major platforms. <>= system("h5dump myhdf5file.h5") @ \warning{Please note, that arrays appear as transposed matrices when opening it with a C-program (\software{h5dump} or \software{HDFView}). This is due to the fact the fastest changing dimension on C is the last one, but on \R~ it is the first one (as in Fortran).} \section{64-bit integers} \R{} does not support a native datatype for 64-bit integers. All integers in \R{} are 32-bit integers. When reading 64-bit integers from a HDF5-file, you may run into troubles. \Biocpkg{rhdf5} is able to deal with 64-bit integers, but you still should pay attention. As an example, we create an HDF5 file that contains 64-bit integers. <>= x = h5createFile("newfile3.h5") D = array(1L:30L,dim=c(3,5,2)) d = h5createDataset(file="newfile3.h5", dataset="D64", dims=c(3,5,2),H5type="H5T_NATIVE_INT64") h5write(D,file="newfile3.h5",name="D64") @ There are three different ways of reading 64-bit integers in \R{}. \Rfunction{H5Dread} and \Rfunction{h5read} have the argument \Robject{bit64conversion} the specify the conversion method. By setting {\bf \Robject{bit64conversion='int'}}, a coercing to 32-bit integers is enforced, with the risc of data loss, but with the insurance that numbers are represented as native integers. <>= D64a = h5read(file="newfile3.h5",name="D64",bit64conversion="int") D64a storage.mode(D64a) @ {\bf \Robject{bit64conversion='double'}} coerces the 64-bit integers to floating point numbers. doubles can represent integers with up to 54-bits, but they are not represented as integer values anymore. For larger numbers there is still a data loss. <>= D64b = h5read(file="newfile3.h5",name="D64",bit64conversion="double") D64b storage.mode(D64b) @ {\bf \Robject{bit64conversion='bit64'}} is recommended way of coercing. It represents the 64-bit integers as objects of class \Rclass{integer64} as defined in the package \CRANpkg{bit64}. Make sure that you have installed \CRANpkg{bit64}. \warning{The datatype \Rclass{integer64} is not part of base \R{}, but defined in an external package. This can produce unexpected behaviour when working with the data.} When choosing this option the package \CRANpkg{bit64} will be loaded. <>= D64c = h5read(file="newfile3.h5",name="D64",bit64conversion="bit64") D64c class(D64c) @ \subsection{Large integer data types} The following table gives an overview of the limits of the different integer representations in R and in HDF5. \definecolor{Gray1}{gray}{0.85} \definecolor{Gray2}{gray}{0.8} \definecolor{Gray3}{gray}{0.75} \definecolor{LightBlue}{rgb}{0.88,0.88,1} \newcommand{\N}{-} \newcommand{\Y}{\cellcolor{Gray2}+} \begin{tabular}{|l|r||c|c|c||c|c|c|c|} \hline \rowcolor{LightBlue} \multicolumn{2}{|c||}{value} & \multicolumn{3}{c||}{R-datatype} & \multicolumn{4}{c|}{HDF5 datatype}\\ \rowcolor{LightBlue} \multicolumn{2}{|c||}{} & integer & double & integer64 & I32 & U32 & I64 & U64 \\ \hline \hline $\phantom{-}2^{64}$ & 18446744073709551616 & \phantom{\vdots} - \phantom{\vdots}& - & - & - & - & - & - \\ \hline $\phantom{-}2^{64}-1$ & 18446744073709551615 & \phantom{\vdots} - \phantom{\vdots}& - & - & - & - & - & \Y \\\hline $\phantom{-}$\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \cellcolor{Gray2}\vdots \\ \hline $\phantom{-}2^{63}$ & 9223372036854775808 & \phantom{\vdots} - \phantom{\vdots} & - & - & - & - & - & \Y \\\hline $\phantom{-}2^{63}-1$ & 9223372036854775807 & \phantom{\vdots} - \phantom{\vdots} & - & \Y & - & - & \Y & \Y \\\hline $\phantom{-}$\vdots & \vdots & \vdots & \vdots & \cellcolor{Gray2}\vdots & \vdots & \vdots & \cellcolor{Gray2}\vdots & \cellcolor{Gray2}\vdots \\ \hline $\phantom{-}2^{53}$ & 9007199254740992 & \phantom{\vdots} - \phantom{\vdots} & - & \Y & - & - & \Y & \Y \\\hline $\phantom{-}2^{53}-1$ & 9007199254740991 & \phantom{\vdots} - \phantom{\vdots} & \Y & \Y & - & - & \Y & \Y \\\hline $\phantom{-}$\vdots & \vdots & \vdots & \cellcolor{Gray2}\vdots & \cellcolor{Gray2}\vdots & \vdots & \vdots & \cellcolor{Gray2}\vdots & \cellcolor{Gray2}\vdots \\ \hline $\phantom{-}2^{32}$ & 4294967296 & \phantom{\vdots} - \phantom{\vdots} & \Y & \Y & - & - & \Y & \Y \\\hline $\phantom{-}2^{32}-1$ & 4294967295 & \phantom{\vdots} - \phantom{\vdots} & \Y & \Y & - & \Y & \Y & \Y \\\hline $\phantom{-}$\vdots & \vdots & \vdots & \cellcolor{Gray2}\vdots & \cellcolor{Gray2}\vdots & \vdots & \cellcolor{Gray2}\vdots & \cellcolor{Gray2}\vdots & \cellcolor{Gray2}\vdots \\ \hline $\phantom{-}2^{31}$ & 2147483648 & \phantom{\vdots} - \phantom{\vdots} & \Y & \Y & - & \Y & \Y & \Y \\\hline $\phantom{-}2^{31}-1$ & 2147483647 & \phantom{\vdots}\Y\phantom{\vdots} & \Y & \Y & \Y & \Y & \Y & \Y \\\hline $\phantom{-}$\vdots & \vdots & \cellcolor{Gray2}\vdots & \cellcolor{Gray2}\vdots & \cellcolor{Gray2}\vdots & \cellcolor{Gray2}\vdots & \cellcolor{Gray2}\vdots & \cellcolor{Gray2}\vdots & \cellcolor{Gray2}\vdots \\ \hline $\phantom{-}2^0$ & 1 & \phantom{\vdots}\Y\phantom{\vdots} & \Y & \Y & \Y & \Y & \Y & \Y \\\hline $\phantom{-}0$ & 0 & \phantom{\vdots}\Y\phantom{\vdots} & \Y & \Y & \Y & \Y & \Y & \Y \\\hline $-2^0$ & -1 & \phantom{\vdots}\Y\phantom{\vdots} & \Y & \Y & \Y & - & \Y & - \\\hline $\phantom{-}$\vdots & \vdots & \cellcolor{Gray2}\vdots & \cellcolor{Gray2}\vdots & \cellcolor{Gray2}\vdots & \cellcolor{Gray2}\vdots & \vdots & \cellcolor{Gray2}\vdots & \vdots \\ \hline $-2^{31}+1$ & -2147483647 & \phantom{\vdots}\Y\phantom{\vdots} & \Y & \Y & \Y & - & \Y & - \\\hline $-2^{31}$ & -2147483648 & \cellcolor{Gray2}\phantom{\vdots}NA\phantom{\vdots} & \Y & \Y & \Y & - & \Y & - \\\hline $-2^{31}-1$ & -2147483649 & \phantom{\vdots} - \phantom{\vdots} & \Y & \Y & - & - & \Y & - \\\hline $\phantom{-}$\vdots & \vdots & \vdots & \cellcolor{Gray2}\vdots & \cellcolor{Gray2}\vdots & \vdots & \vdots & \cellcolor{Gray2}\vdots & \vdots \\ \hline $-2^{53}+1$ & -9007199254740991 & \phantom{\vdots} - \phantom{\vdots} & \Y & \Y & - & - & \Y & - \\\hline $-2^{53}$ & -9007199254740992 & \phantom{\vdots} - \phantom{\vdots} & - & \Y & - & - & \Y & - \\\hline $\phantom{-}$\vdots & \vdots & \vdots & \vdots & \cellcolor{Gray2}\vdots & \vdots & \vdots & \cellcolor{Gray2}\vdots & \vdots \\ \hline $-2^{63}+1$ & -9223372036854775807 & \phantom{\vdots} - \phantom{\vdots} & - & \Y & - & - & \Y & - \\\hline $-2^{63}$ & -9223372036854775808 & \phantom{\vdots} - \phantom{\vdots} & - & \cellcolor{Gray2}NA & - & - & \Y & - \\\hline $-2^{63}-1$ & -9223372036854775809 & \phantom{\vdots} - \phantom{\vdots} & - & - & - & - & - & - \\\hline \end{tabular} From the table it becomes obvious that some integer values in HDF5 files cannot be displayed in R. Note that this can happen for both 64-bit integer as well as for unsigned 32-bit integer. When generating an HDF5 file, it is recommended to use signed 32-bit integers. \section{Low level HDF5 functions} \subsection{Creating an HDF5 file and a group hierarchy} Create a file. <>= library(rhdf5) h5file = H5Fcreate("newfile.h5") h5file @ and a group hierarchy <>= h5group1 <- H5Gcreate(h5file, "foo") h5group2 <- H5Gcreate(h5file, "baa") h5group3 <- H5Gcreate(h5group1, "foobaa") h5group3 @ \subsection{Writing data to an HDF5 file} Create 4 different simple and scalar data spaces. The data space sets the dimensions for the datasets. <>= d = c(5,7) h5space1 = H5Screate_simple(d,d) h5space2 = H5Screate_simple(d,NULL) h5space3 = H5Scopy(h5space1) h5space4 = H5Screate("H5S_SCALAR") h5space1 H5Sis_simple(h5space1) @ Create two datasets, one with integer and one with floating point numbers. <>= h5dataset1 = H5Dcreate( h5file, "dataset1", "H5T_IEEE_F32LE", h5space1 ) h5dataset2 = H5Dcreate( h5group2, "dataset2", "H5T_STD_I32LE", h5space1 ) h5dataset1 @ Now lets write data to the datasets. <>= A = seq(0.1,3.5,length.out=5*7) H5Dwrite(h5dataset1, A) B = 1:35 H5Dwrite(h5dataset2, B) @ To release resources and to ensure that the data is written on disk, we have to close datasets, dataspaces, and the file. There are different functions to close datasets, dataspaces, groups, and files. <>= H5Dclose(h5dataset1) H5Dclose(h5dataset2) H5Sclose(h5space1) H5Sclose(h5space2) H5Sclose(h5space3) H5Sclose(h5space4) H5Gclose(h5group1) H5Gclose(h5group2) H5Gclose(h5group3) H5Fclose(h5file) @ % We can now check how it looks like on disk. % <>= % system("h5dump newfile2.h5") % @ % Please note, that arrays appear as transposed matrices when opening it with a C-program (h5dump). This is due to the fact the fastest changing dimension on C is the last one, but on \R~ it is the first one (as in Fortran). \section{Session Info} <>= toLatex(sessionInfo()) @ \end{document}