---
title: "Package Quick Start Guide"
author: 
- name: Jiefei Wang
  affiliation: Roswell Park Comprehensive Cancer Center, Buffalo, NY
date: "`r Sys.Date()`"
output:
    BiocStyle::html_document:
        toc: true
        toc_float: true
vignette: >
  %\VignetteIndexEntry{quickStart}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
package: SharedObject
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library("SharedObject")
```
# Introduction

The `SharedObject` package is designed for sharing data across
multiple R processes, where all processes can read the data located in
the same memory location. This sharing mechanism has the potential to
save the memory usage and reduce the overhead of data transmission in
the parallel computing. The use of the package arises from many
data-science subjects such as high-throughput gene data analysis, in
which case A paralle computing is desirable and the data is very
large. Blindly calling an export function such as `clusterExport` will
duplicate the data for each process and it is obviously unnecessary if
the data is read-only in the parallel computing. The `sharedObject`
package can share the data without duplications and is able to reduce
the time cost. A new set of R APIs called `ALTREP` is used to provide
a seamless experience when sharing an object.

# Quick example

We first illustrate the package with an example. In the example, we
create a cluster with 4 cores and share an n-by-n matrix `A`, we use
the function `share` to create the shared object `A_shr` and call the
function `clusterExport` to export it:

```{r}
library(parallel)
## Initiate the cluster
cl=makeCluster(1)
## create data
n=3
A=matrix(runif(n^2),n,n)
## create shared object
A_shr=share(A)
## export the shared object
clusterExport(cl,"A_shr")

stopCluster(cl)
```

As the code shows above, the procedure of sharing a shared object is
similar to the procedure of sharing an R object, except that we
replace the matrix `A` with a shared object `A_shr`. Notably, there is
no different between the matrix `A` and the shared object `A_shr`. The
shared object `A_shr` is neither an S3 nor S4 object and its behaviors
are exactly the same as the matrix `A`, so there is no need to change
the existing code to work with the shared object. We can verify this
through 
```{r} 
## check the data 
A 
A_shr 
## check the properties
attributes(A) 
attributes(A_shr) 
## check the class 
class(A)
class(A_shr) 
``` 
Users can treate the shared object as a matrix and do
operations on it as usual.

# Supported data types

Currently, the package supports `atomic`(aka `vector`), `matrix` and
`data.frame` data structures. `List` is not allowed for the
`sharedObject` function but users can create a shared object for each
child of the list.

Please note that `data.frame` is fundamentally a list of
vectors. Sharing a `data.frame` will share its vector elements, not
the `data.frame` itself. Therefore, adding or replace a column in a
shared `data.frame` will not affect the shared memory. Users should
avoid such behaviors.

The type of `integer`, `numeric`, `logical` and `raw` are available
for sharing. `string` is not supported.


# Check object class

In order to distinguish a shared object, the package provide several
functions to examine the internal data structure

```{r}
## Check if an object is of an ALTREP class
is.altrep(A)
is.altrep(A_shr)

## Check if an object is a shared object
## This works for both vector and data.frame
is.shared(A)
is.shared(A_shr)
```

The function `is.altrep` only checks if an object is an ALTREP
object. Since the shared object class inherits ALTREP class, the
function returns `TRUE` for a shared object. However, R also creates
ALTREP object in some cases(e.g. A=1:10, A is an ALTREP object), this
function may fail to check determine whether an object is a shared
object. `is.shared` is the most suitable way to check the shared
object. For `data.frame` type, it return `TRUE` only when all of its
vector elements are shared objects.

There are several properties with the shared object, one can check
them via

```{r}
## get a summary report
getSharedProperties(A_shr)

## Internal function to check the properties
## All properties can be accessed via the similar way
.getProperty(A_shr,"dataId")
.getProperty(A_shr,"processId")
.getProperty(A_shr,"typeId")

## Public function to check the properties
getCopyOnWrite(A_shr)
getSharedSubset(A_shr)
getSharedCopy(A_shr)
```

Please see the advanced topic in the next section to see which
properties are changable and how to change them in a proper way.

# Advanced topic: Copy-On-Write

Because all cores are using the shared object `A_shr` located in the
same memory location, a reckless change made on the matrix `A_shr` in
one process will immediately be broadcasted to the other process. To
prevent users from changing the values of a shared object without
awareness, a shared object will duplicate itself if a change of its
value is made. Therefore, the code like

```{r}
A_shr2=A_shr
A_shr[1,1]=10

A_shr
A_shr2
```

will result in a memory dulplication. The matrix `A_shr2` is not
affected. This default behavior can be overwritten by passing an
argument `copyOnWrite` to the function `share`. For example

```{r}
A_shr=share(A,copyOnWrite=FALSE)
A_shr2=A_shr
A_shr[1,1]=10

A_shr
A_shr2
```

A change in the matrix `A_shr` cause a change in `A_shr2`. This
feature could be potentially useful to return the result from each R
process without additional memory allocation, so `A_shr` can be both
the initial data and the final result. However, due to the limitation
of R, only copy-on-write feature is fully supported, not the
reverse. it is possible to change the value of a shared object
unexpectly.

```{r}
A_shr=share(A,copyOnWrite=FALSE)
-A_shr
A_shr
```

The above example shows an unexpected result when the copy-on-write
feature is off. Simply calling an unary function can change the values
of a shared object. Therefore, for the safty of the naive user, the
copy-on-write feature is active by default. For the experienced user,
the the copy-on-write feature can be altered via `setCopyOnwrite`
funtion. There is no return value for the function.

```{r}
A_shr=share(A,copyOnWrite=FALSE)
#Assign A_shr to another object
A_shr2=A_shr
#change the value of A_shr
A_shr[1,1]=10
#Both A_shr and A_shr2 are affected
A_shr
A_shr2
#Enable copy-on-write
setCopyOnWrite(A_shr,TRUE)
#The unary function does not affect the variable A_shr
-A_shr
A_shr

getCopyOnWrite(A_shr)
```

These flexibilities provide us a way to do safe operations during the
computation and return the results without memory duplications.

## Warning

If a high-precision value is assigned to a low-precision shared
object, An implicit type conversion will be triggered for correctly
storing the change. The resulting object would be a regular R object,
not a shared object. Therefore, the change will not be broadcasted
even if the copy-on-write feature is off. The most common senario is
to assign a numeric value to an integer shared object. Users should be
caution with the data type that a shared object is using.

# Advanced topic: shared subset and shared copy

The options `sharedSubset` controls whether to create a shared object
when subsetting a shared object. `sharedCopy` determines if the
duplication of a shared object is still a shared object. For
performance consideration, the default settings are
`sharedSubset=TRUE` and `sharedCopy=FALSE`, but they can be
overwritten via:

```{r}
A_shr=share(A,sharedSubset=FALSE,sharedCopy=TRUE)
getSharedProperties(A_shr)
```

Please note that `sharedCopy` is only effective when
`copyOnWrite=TRUE`.

# Last word on Linux system

There is a certain limitation on how many shared objects a process can create on Linux system. In case if you see the error message "Too many open files", it means either you have explictly created too many shared objects, or you have implicitly generated too many shared subsets via `[` operator. You can turn `sharedSubset` off for reducing the number of opened files, or check your system settings to increase the number of opened files that one process can have.


# Session Information

```{r}
sessionInfo()
```