---
title: "BiocSklearn -- exposing python Scikit machine learning elements for Bioconductor"
author: "Vincent J. Carey, stvjc at channing.harvard.edu, Shweta Gopaulakrishnan, reshg at channing.harvard.edu, Samuela Pollack, spollack at jimmy.harvard.edu"
date: "`r format(Sys.time(), '%B %d, %Y')`"
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{BiocSklearn overview}
  %\VignetteEncoding{UTF-8}
output:
  BiocStyle::html_document:
    highlight: pygments
    number_sections: yes
    theme: united
    toc: yes
---

# Introduction

Scientific computing in python is well-established. This package takes advantage of new work at RStudio that fosters python-R interoperability. Identifying good practices of interface design will require extensive discussion and experimentation, and this package takes an initial step in this direction.

A key motivation is experimenting with an incremental PCA implementation on very large out-of-memory data. We have also provided an interface to the `sklearn.cluster.KMeans` procedure.

# Basic concepts

```{r dsetup,echo=FALSE,results="hide",include=FALSE}
suppressPackageStartupMessages({
library(BiocSklearn)
library(BiocStyle)
})
```

## Module references

The package includes a list of references to python modules.

```{r loadup}
library(BiocSklearn)
```

## Python documentation

We can acquire python documentation of included modules with reticulate's `py_help`. The following result could get stale:

```
skd = reticulate::import("sklearn")$decomposition
py_help(skd)
Help on package sklearn.decomposition in sklearn:

NAME
    sklearn.decomposition

FILE
    /Users/stvjc/anaconda2/lib/python2.7/site-packages/sklearn/decomposition/__init__.py

DESCRIPTION
    The :mod:`sklearn.decomposition` module includes matrix decomposition
    algorithms, including among others PCA, NMF or ICA. Most of the algorithms of
    this module can be regarded as dimensionality reduction techniques.
PACKAGE CONTENTS
    _online_lda
    base
    cdnmf_fast
    dict_learning
    factor_analysis
    fastica_
    incremental_pca
    ...
```

## Importing data for direct handling by python functions

The reticulate package is designed to limit the amount of effort required to convert data from R to python for natural use in each language.

```{r doimp, eval=TRUE}
np = reticulate::import("numpy", convert=FALSE, delay_load=TRUE)
irloc = system.file("csv/iris.csv", package="BiocSklearn")
irismat = np$genfromtxt(irloc, delimiter=',')
```

To examine a submatrix, we use the `take` method from numpy. The bracket format seen below notifies us that we are not looking at data native to R.

```{r dota, eval=TRUE}
np$take(irismat, 0:2, 0L )
```

# Dimension reduction with sklearn: illustration with iris dataset

As a first check of the sklearn modules, we compute a reference PCA of the iris data with R's `prcomp`.

```{r dor}
fullpc = prcomp(data.matrix(iris[,1:4]))$x
```

## PCA

We have a python representation of the iris data (`irismat`), but `skPCA` is applied here to the corresponding R matrix:

```{r dopc1}
ppca = skPCA(data.matrix(iris[,1:4]))
ppca
```

This returns an object that can be reused through python methods. The numerical transformation is accessed via `getTransformed`.

```{r lk1}
tx = getTransformed(ppca)
dim(tx)
head(tx)
```

Concordance with the R computation can be checked:

```{r lkconc}
round(cor(tx, fullpc),3)
```

## Incremental PCA

A computation supporting _a priori_ bounding of memory consumption is available. In this procedure one can also select the number of principal components to compute. As of August 2022 the following chunk is not evaluated; it needs to be run under basilisk control.

```{r doincr, eval=FALSE}
ippca = skIncrPCA(irismat) # problematic, needs basilisk cover
ippcab = skIncrPCA(irismat, batch_size=25L)
round(cor(getTransformed(ippcab), fullpc),3)
```

## Manual incremental PCA with explicit chunking

This procedure can be used when data are provided in chunks, perhaps from a stream.
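
The row ranges used for the chunked updates in this section are 0-based, as numpy expects, whereas R rows are 1-based. As a minimal base-R sketch (the names `chunk_size`, `chunk_starts` and `chunks` are illustrative and not part of BiocSklearn), the mapping between the two conventions for three 50-row chunks of iris might look like this:

```r
# Sketch only: illustrative names, no BiocSklearn or python calls involved.
n = nrow(iris)                                   # 150 rows
chunk_size = 50L
chunk_starts = seq(0L, n - 1L, by = chunk_size)  # 0-based starts: 0, 50, 100

chunks = lapply(chunk_starts, function(s) {
  py_idx = s:(min(s + chunk_size, n) - 1L)       # 0-based rows, as passed to numpy's take
  list(py_idx = py_idx, r_idx = py_idx + 1L)     # matching 1-based R rows
})

# First and last 0-based index of each chunk, one column per chunk
sapply(chunks, function(ch) range(ch$py_idx))
```

Under this convention, python rows `0:4` correspond to R rows `1:5`, which matters when transformed chunks are compared against rows of `fullpc`.
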
We iteratively update the fitted object, for which there is no container class at present. Again the number of components computed can be specified.

```{r dopartial, eval=FALSE}
ta = np$take # provide slicer utility
ipc = skPartialPCA_step(ta(irismat,0:49,0L))
ipc = skPartialPCA_step(ta(irismat,50:99,0L), obj=ipc)
ipc = skPartialPCA_step(ta(irismat,100:149,0L), obj=ipc)
ipc$transform(ta(irismat,0:4,0L)) # 0-based rows 0..4 match R rows 1..5
fullpc[1:5,]
```

# Interoperation with HDF5 matrix

We have extracted methylation data for the Yoruban subcohort of CEPH from the yriMulti package. Data from chr6 and chr17 are available in an HDF5 matrix in this BiocSklearn package.

A reference to the dataset through the h5py File interface is created by `H5matref`. Please run `example(H5matref)` for illustration.

## How to expand scope of BiocSklearn

Consider the problem reported at [slack](https://community-bioc.slack.com/archives/CLUJWDQF4/p1661525488687379), in which the following python session cannot be run in a certain docker container application:

```
>>> import sklearn.impute
>>> X = [[0, 1, 3], [3, 4, 5]]
>>> gen = sklearn.metrics.pairwise_distances_chunked(X)
>>> for chunk in gen:
...     print(chunk)
```

We introduce the following function in BiocSklearn/R, omitting the 'chunked' element:

```
skPWD = function(mat, ...) {
 proc = basilisk::basiliskStart(bsklenv) # avoid package-specific import
 on.exit(basilisk::basiliskStop(proc))
 basilisk::basiliskRun(proc, function(mat, ...) {
   sk = reticulate::import("sklearn")
   sk$metrics$pairwise_distances(mat, ...)
   }, mat=mat, ...)
}
```

`bsklenv` is defined for the BiocSklearn package as

```
# necessary for python module control
bsklenv <- basilisk::BasiliskEnvironment(envname="bsklenv",
    pkgname="BiocSklearn",
    packages=c("scikit-learn==1.0.2", "h5py==3.6.0", "pandas==1.3.5", "joblib==1.0.0"))
```

# Conclusions

We need more applications and profiling.