%\VignetteIndexEntry{4. Random Numbers in BiocParallel} %\VignetteKeywords{parallel, Infrastructure} %\VignettePackage{BiocParallel} %\VignetteEngine{knitr::knitr} \documentclass{article} <>= BiocStyle::latex() @ <>= suppressPackageStartupMessages({ library(BiocParallel) }) @ \newcommand{\BiocParallel}{\Biocpkg{BiocParallel}} \title{Random Numbers in BiocParallel} \author{Martin Morgan\footnote{\url{Martin.Morgan@RoswellPark.org}}} \date{Edited: 7 September, 2021; Compiled: \today} \begin{document} \maketitle \tableofcontents \section{Scope} \BiocParallel{} enables use of random number streams in a reproducible manner. This document applies to the following \Rcode{*Param()}: \begin{itemize} \item \Rcode{SerialParam()}: sequential evaluation in a single \emph{R} process. \item \Rcode{SnowParam()}: parallel evaluation in multiple independent \emph{R} processes. \item \Rcode{MulticoreParam())}: parallel evaluation in \emph{R} sessions running in forked threads. Not available on Windows. \end{itemize} The \Rcode{*Param()} can be used for evaluation with: \begin{itemize} \item \Rcode{bplapply()}: \Rcode{lapply()}-like application of a user-supplied function \Rcode{FUN} to a vector or list of elements \Rcode{X}. \item \Rcode{bpiterate()}: apply a user-supplied function \Rcode{FUN} to an unknown number of elements resulting from successive calls to a user-supplied function \Rcode{ITER}. \end{itemize} The reproducible random number implementation also supports: \begin{itemize} \item \Rcode{bptry()} and the \Rcode{BPREDO=} argument, for re-evaluation of elements that fail (e.g., because of a bug in \Rcode{FUN}). \end{itemize} \section{Essentials} \subsection{Use of \Rcode{bplapply()} and \Rcode{RNGseed=}} Attach \BiocParallel{} and ensure that the version is greater than 1.27.5 <>= library(BiocParallel) stopifnot( packageVersion("BiocParallel") > "1.27.5" ) @ For reproducible calculation, use the \Rcode{RNGseed=} argument in any of the \Rcode{*Param()} constructors. <>= result1 <- bplapply(1:3, runif, BPPARAM = SerialParam(RNGseed = 100)) result1 @ Repeating the calculation with the same value for \Rcode{RNGseed=} results in the same result; a different random number seed results in different results. <>= result2 <- bplapply(1:3, runif, BPPARAM = SerialParam(RNGseed = 100)) stopifnot( identical(result1, result2) ) result3 <- bplapply(1:3, runif, BPPARAM = SerialParam(RNGseed = 200)) result3 stopifnot( !identical(result1, result3) ) @ Results are invariant across \Rcode{*Param()} <>= result4 <- bplapply(1:3, runif, BPPARAM = SnowParam(RNGseed = 100)) stopifnot( identical(result1, result4) ) if (!identical(.Platform$OS.type, "windows")) { result5 <- bplapply(1:3, runif, BPPARAM = MulticoreParam(RNGseed = 100)) stopifnot( identical(result1, result5) ) } @ Parallel backends can adjust the number of \Rcode{workers} (processes performing the evaluation) and \Rcode{tasks} (how elements of \Rcode{X} are distributed between workers). Results are invariant to these parameters. This is illustrated with \Rcode{SnowParam()}, but applies also to \Rcode{MulticoreParam()}. <>= result6 <- bplapply(1:3, runif, BPPARAM = SnowParam(workers = 2, RNGseed = 100)) result7 <- bplapply(1:3, runif, BPPARAM = SnowParam(workers = 3, RNGseed = 100)) result8 <- bplapply( 1:3, runif, BPPARAM = SnowParam(workers = 2, tasks = 3, RNGseed = 100) ) stopifnot( identical(result1, result6), identical(result1, result7), identical(result1, result8) ) @ Subsequent sections illustrate results with \Rcode{SerialParam()}, but identical results are obtained with \Rcode{SnowParam()} and \Rcode{MulticoreParam()}. \subsection{Use with \Rcode{bpiterate()}} \Rcode{bpiterate()} allows parallel processing of a 'stream' of data as a series of tasks, with a task consisting of a portion of the overall data. It is useful when the data size is not known or easily partitioned into elements of a vector or list. A real use case might involve iterating through a BAM file, where a task represents successive records (perhaps 100,000 per task) in the file. Here we illustrate with a simple example -- iterating through a vector \Rcode{x = 1:3} <>= ITER_FUN_FACTORY <- function() { x <- 1:3 i <- 0L function() { i <<- i + 1L if (i > length(x)) return(NULL) x[[i]] } } @ \Rcode{ITER\_FUN\_FACTORY()} is used to create a function that, on each invocation, returns the next task (here, an element of \Rcode{x}; in a real example, perhaps 100000 records from a BAM file). When there are no more tasks, the function returns \Rcode{NULL} <>= ITER <- ITER_FUN_FACTORY() ITER() ITER() ITER() ITER() @ In our simple example, \Rcode{bpiterate()} is performing the same computations as \Rcode{bplapply()} so the results, including the random number streams used by each task in \Rcode{bpiterate()}, are the same <>= result9 <- bpiterate( ITER_FUN_FACTORY(), runif, BPPARAM = SerialParam(RNGseed = 100) ) stopifnot( identical(result1, result9) ) @ \subsection{Use with \Rcode{bptry()}} \Rcode{bptry()} in conjunction with the \Rcode{BPREDO=} argument to \Rcode{bplapply()} or \Rcode{bpiterate()} allows for graceful recovery from errors. Here a buggy \Rcode{FUN1()} produces an error for the second element. \Rcode{bptry()} allows evaluation to continue for other elements of \Rcode{X}, despite the error. This is shown in the result. <>= FUN1 <- function(i) { if (identical(i, 2L)) { ## error when evaluating the second element stop("i == 2") } else runif(i) } result10 <- bptry(bplapply( 1:3, FUN1, BPPARAM = SerialParam(RNGseed = 100, stop.on.error = FALSE) )) result10 @ \Rcode{FUN2()} illustrates the flexibility of \Rcode{bptry()} by fixing the bug when \Rcode{i == 2}, but also generating incorrect results if invoked for previously correct values. The identity of the result to the original computation shows that only the error task is re-computed, and that the random number stream used by the task is identical to the original stream. <>= FUN2 <- function(i) { if (identical(i, 2L)) { ## the random number stream should be in the same state as the ## first time through the loop, and rnorm(i) should return ## same result as FUN runif(i) } else { ## if this branch is used, then we are incorrectly updating ## already calculated elements -- '0' in the output would ## indicate this error 0 } } result11 <- bplapply( 1:3, FUN2, BPREDO = result10, BPPARAM = SerialParam(RNGseed = 100, stop.on.error = FALSE) ) stopifnot( identical(result1, result11) ) @ \subsection{Relationship between \Rcode{RNGseed=} and \Rcode{set.seed()}} The global random number stream (influenced by \Rcode{set.seed()}) is ignored by \BiocParallel{}, and \BiocParallel{} does NOT increment the global stream. <>= set.seed(200) value <- runif(1) set.seed(200) result12 <- bplapply(1:3, runif, BPPARAM = SerialParam(RNGseed = 100)) stopifnot( identical(result1, result12), identical(value, runif(1)) ) @ When \Rcode{RNGseed=} is not used, an internal stream (not accessible to the user) is used and \BiocParallel{} does NOT increment the global stream. <>= set.seed(100) value <- runif(1) set.seed(100) result13 <- bplapply(1:3, runif, BPPARAM = SerialParam()) stopifnot( !identical(result1, result13), identical(value, runif(1)) ) @ \subsection{\Rcode{bpstart()} and random number streams} In all of the examples so far \Rcode{*Param()} objects are passed to \Rcode{bplapply()} or \Rcode{bpiterate()} in the 'stopped' state. Internally, \Rcode{bplapply()} and \Rcode{bpiterate()} invoke \Rcode{bpstart()} to establish the computational environment (e.g., starting workers for \Rcode{SnowParam()}). \Rcode{bpstart()} can be called explicitly, e.g., to allow workers to be used across calls to \Rcode{bplapply()}. The cluster random number stream is initiated with \Rcode{bpstart()}. Thus <>= param <- bpstart(SerialParam(RNGseed = 100)) result16 <- bplapply(1:3, runif, BPPARAM = param) bpstop(param) stopifnot( identical(result1, result16) ) @ This allows a second call to \Rcode{bplapply} to represent a continuation of a random number computation -- the second call to \Rcode{bplapply()} results in different random number streams for each element of \Rcode{X}. <>= param <- bpstart(SerialParam(RNGseed = 100)) result16 <- bplapply(1:3, runif, BPPARAM = param) result17 <- bplapply(1:3, runif, BPPARAM = param) bpstop(param) stopifnot( identical(result1, result16), !identical(result1, result17) ) @ \subsection{Relationship between \Rcode{bplapply()} and \Rcode{lapply()}} The results from \Rcode{bplapply()} are different from the results from \Rcode{lapply()}, even with the same random number seed. This is because correctly implemented parallel random streams require use of a particular random number generator invoked in specific ways for each element of \Rcode{X}, as outlined in the Implementation notes section. <>= set.seed(100) result20 <- lapply(1:3, runif) stopifnot( !identical(result1, result20) ) @ \section{Implementation notes} The implementation uses the L'Ecuyer-CMRG random number generator (see \Rcode{?RNGkind} and \Rcode{?parallel::clusterSetRNGStream} for additional details). This random number generates independent streams and substreams of random numbers. In \BiocParallel{}, each call to \Rcode{bpstart()} creates a new stream from the L'Ecuyer-CMRG generator. Each element in \Rcode{bplapply()} or \Rcode{bpiterate()} creates a new substream. Each application of \Rcode{FUN} is therefore using the L'Ecuyer-CMRG random number generator, with a substream that is independent of the substreams of all other elements. Within the user-supplied \Rcode{FUN} of \Rcode{bplapply()} or \Rcode{bpiterate()}, it is a mistake to use \Rcode{RNGkind()} to set a different random number generator, or to use \Rcode{set.seed()}. This would in principle compromise the independence of the streams across elements. \section{\Rcode{sessionInfo()}} <>= sessionInfo() @ \end{document}