---
title: "BiocSklearn -- exposing python Scikit machine learning elements for Bioconductor"
author: "Vincent J. Carey, stvjc at channing.harvard.edu, Shweta Gopaulakrishnan, reshg at channing.harvard.edu, Samuela Pollack, spollack at jimmy.harvard.edu"
date: "`r format(Sys.time(), '%B %d, %Y')`"
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{BiocSklearn overview}
  %\VignetteEncoding{UTF-8}
output:
  BiocStyle::html_document:
    highlight: pygments
    number_sections: yes
    theme: united
    toc: yes
---

# Introduction

Scientific computing in python is well-established.  This
package takes advantage of new work at Rstudio that fosters
python-R interoperability.  Identifying
good practices of interface
design will require extensive discussion and experimentation,
and this package takes an initial step in this direction.

A key motivation is experimenting with an incremental PCA
implementation with very large out-of-memory data.  We have
also provided an interface to the sklearn.cluster.KMeans
procedure.

# Basic concepts

```{r dsetup,echo=FALSE,results="hide",include=FALSE}
suppressPackageStartupMessages({
library(BiocSklearn)
library(BiocStyle)
})
```

## Module references

The package includes a list of references to python
modules.

```{r loadup}
library(BiocSklearn)
```

## Python documentation

We can acquire python documentation of included modules with
reticulate's `py_help`:  The following result could
get stale:
```
skd = reticulate::import("sklearn")$decomposition
py_help(skd)
Help on package sklearn.decomposition in sklearn:

NAME
    sklearn.decomposition

FILE
    /Users/stvjc/anaconda2/lib/python2.7/site-packages/sklearn/decomposition/__init__.py

DESCRIPTION
    The :mod:`sklearn.decomposition` module includes matrix decomposition
    algorithms, including among others PCA, NMF or ICA. Most of the algorithms of
    this module can be regarded as dimensionality reduction techniques.

PACKAGE CONTENTS
    _online_lda
    base
    cdnmf_fast
    dict_learning
    factor_analysis
    fastica_
    incremental_pca
...
```

## Importing data for direct handling by python functions

The reticulate package is designed to limit the amount
of effort required to convert data from R to python
for natural use in each language.

```{r doimp, eval=TRUE}
np = reticulate::import("numpy", convert=FALSE, delay_load=TRUE)
irloc = system.file("csv/iris.csv", package="BiocSklearn")
irismat = np$genfromtxt(irloc, delimiter=',')
```

To examine a submatrix, we use the take method from numpy.
The bracket format seen below notifies us that we are not looking at
data native to R.

```{r dota, eval=TRUE}
np$take(irismat, 0:2, 0L )
```

# Dimension reduction with sklearn: illustration with iris dataset

We'll use R's prcomp as a first
test to demonstrate performance of the sklearn modules
with the iris data.

```{r dor}
fullpc = prcomp(data.matrix(iris[,1:4]))$x
```

## PCA

We have a python representation of the iris data.  We
compute the PCA as follows:
```{r dopc1}
ppca = skPCA(data.matrix(iris[,1:4]))
ppca
```
This returns an object that can be reused through python methods.
The numerical transformation is accessed via `getTransformed`.
```{r lk1}
tx = getTransformed(ppca)
dim(tx)
head(tx)
```

Concordance with the R computation can be checked:
```{r lkconc}
round(cor(tx, fullpc),3)
```

## Incremental PCA

A computation supporting _a priori_ bounding of memory
consumption is available.  In this procedure one can
also select the number of principal components to compute.

In August 2022 this chunk is blocked.  Basilisk discipline is needed.
```{r doincr, eval=FALSE}
ippca = skIncrPCA(irismat) # problematic, needs basilisk cover
ippcab = skIncrPCA(irismat, batch_size=25L)
round(cor(getTransformed(ippcab), fullpc),3)
```

## Manual incremental PCA with explicit chunking

This procedure can be used when data are provided
in chunks, perhaps from a stream.  We iteratively
update the object, for which there is no container
at present.  Again the number of components computed
can be specified.
```{r dopartial, eval=FALSE}
ta = np$take # provide slicer utility
ipc = skPartialPCA_step(ta(irismat,0:49,0L))
ipc = skPartialPCA_step(ta(irismat,50:99,0L), obj=ipc)
ipc = skPartialPCA_step(ta(irismat,100:149,0L), obj=ipc)
ipc$transform(ta(irismat,0:5,0L))
fullpc[1:5,]
```

# Interoperation with HDF5 matrix

We have extracted methylation data for the Yoruban
subcohort of CEPH from the yriMulti package.  Data
from chr6 and chr17 are available in an HDF5 matrix
in this BiocSklearn package.  A reference to the
dataset through the h5py File interface is created by
`H5matref`.

Please run `example(H5matref)` for illustration.

## How to expand scope of BiocSklearn

Consider the problem reported at [slack](https://community-bioc.slack.com/archives/CLUJWDQF4/p1661525488687379),
in which
```
>>> import sklearn.impute
>>> X = [[0, 1, 3], [3, 4, 5]]
>>> gen = sklearn.metrics.pairwise_distances_chunked(X)
>>> for chunk in gen:
...     print(chunk)
```
cannot be done in a certain docker container application.

We introduce the following function in BiocSklearn/R, 
omitting the 'chunked' element:
```
skPWD = function(mat, ...) {
 proc = basilisk::basiliskStart(bsklenv) # avoid package-specific import
 on.exit(basilisk::basiliskStop(proc))
 basilisk::basiliskRun(proc, function(mat, ...) {
     sk = reticulate::import("sklearn") 
     sk$metrics$pairwise_distances(mat, ...)
   }, mat=mat, ...)
}
```

`bsklenv` is defined for the BiocSklearn package as

```
# necessary for python module control
bsklenv <- basilisk::BasiliskEnvironment(envname="bsklenv",
    pkgname="BiocSklearn",
    packages=c("scikit-learn==1.0.2", "h5py==3.6.0", "pandas==1.3.5", "joblib==1.0.0"))

```


# Conclusions

We need more applications and profiling.