---
title: >
  How to use `r Biocpkg("iSEE")` with big data
author:
- name: Kevin Rue-Albrecht
  affiliation:
  - &id4 MRC WIMM Centre for Computational Biology, University of Oxford, Oxford, OX3 9DS, UK
  email: kevinrue67@gmail.com
- name: Federico Marini
  affiliation:
  - &id1 Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI), Mainz
  - Center for Thrombosis and Hemostasis (CTH), Mainz
  email: marinif@uni-mainz.de
- name: Charlotte Soneson
  affiliation:
  - &id3 Institute of Molecular Life Sciences, University of Zurich
  - SIB Swiss Institute of Bioinformatics
  email: charlottesoneson@gmail.com
- name: Aaron Lun
  email: infinite.monkeys.with.keyboards@gmail.com
date: "`r BiocStyle::doc_date()`"
package: "`r BiocStyle::pkg_ver('iSEE')`"
output:
  BiocStyle::html_document:
    toc_float: true
vignette: >
  %\VignetteIndexEntry{6. Using iSEE with big data}
  %\VignetteEncoding{UTF-8}
  %\VignettePackage{iSEE}
  %\VignetteKeywords{GeneExpression, RNASeq, Sequencing, Visualization, QualityControl, GUI}
  %\VignetteEngine{knitr::rmarkdown}
editor_options:
  chunk_output_type: console
bibliography: iSEE.bib
---

**Compiled date**: `r Sys.Date()`

**Last edited**: 2018-03-08

**License**: `r packageDescription("iSEE")[["License"]]`

```{r setup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    error = FALSE,
    warning = FALSE,
    message = FALSE,
    crop = NULL
)
stopifnot(requireNamespace("htmltools"))
htmltools::tagList(rmarkdown::html_dependency_font_awesome())
unlink("assay.h5")
```

```{r, eval=!exists("SCREENSHOT"), include=FALSE}
SCREENSHOT <- function(x, ...) knitr::include_graphics(x)
```

# Overview

Some tweaks can be performed to enable `r Biocpkg("iSEE")` to run efficiently on large datasets.
This includes datasets with many features (methylation, SNPs) or many columns (cytometry, single-cell RNA-seq).
To demonstrate some of this functionality, we will use a dataset from the `r Biocpkg("TENxPBMCData")` package:

```{r}
library(TENxPBMCData)
sce.pbmc <- TENxPBMCData("pbmc68k")
sce.pbmc$Library <- factor(sce.pbmc$Library)
sce.pbmc
```

# Using out-of-memory matrices {#out-of-memory}

Many `SummarizedExperiment` objects store assay matrices as in-memory matrix-like objects, be they ordinary matrices or alternative representations such as sparse matrices from the `r CRANpkg("Matrix")` package.
For example, if we looked at the Allen data, we would see that the counts are stored as an ordinary matrix.

```{r}
library(scRNAseq)
sce.allen <- ReprocessedAllenData("tophat_counts")
class(assay(sce.allen, "tophat_counts"))
```

In situations involving large datasets and limited computational resources, storing the entire assay in memory may not be feasible.
Rather, we can represent the data as a file-backed matrix where contents are stored on disk and retrieved on demand.
Within the Bioconductor ecosystem, the easiest way of doing this is to create an `HDF5Matrix`, which uses an [HDF5 file](https://support.hdfgroup.org/HDF5/whatishdf5.html) to store all the assay data.
We see that this has already been done for the 68K PBMC dataset:

```{r}
counts(sce.pbmc, withDimnames=FALSE)
```

Despite the dimensions of this matrix, the `HDF5Matrix` object occupies very little space in memory.

```{r}
object.size(counts(sce.pbmc, withDimnames=FALSE))
```

However, parts of the data can still be read in on demand.
For all intents and purposes, the `HDF5Matrix` appears to be an ordinary matrix to downstream applications and can be used as such.

```{r}
first.gene <- counts(sce.pbmc)[1,]
head(first.gene)
```

This means that we can use the 68K PBMC `SingleCellExperiment` object in `iSEE()` without any extra work.
The app below shows the distribution of counts for everyone's favorite gene _MALAT1_ across libraries.
Here, `iSEE()` is simply retrieving data on demand from the `HDF5Matrix` without ever loading the entire assay matrix into memory.
This enables it to run efficiently on arbitrarily large datasets with limited resources.

```{r}
library(iSEE)
app <- iSEE(sce.pbmc, initial=
    list(RowDataTable(Selected="ENSG00000251562", Search="MALAT1"),
        FeatureAssayPlot(XAxis="Column data", XAxisColumnData="Library",
            YAxisFeatureSource="RowDataTable1")
    )
)
```

```{r, echo=FALSE}
SCREENSHOT("screenshots/bigdata-hdf5.png")
```

Generally speaking, these HDF5 files are written once by a process with sufficient computational resources (i.e., memory and time).
We typically create `HDF5Matrix` objects using the `writeHDF5Array()` function from the `r Biocpkg("HDF5Array")` package.
After the file is created, the objects can be read many times in more resource-constrained environments.

```{r}
sce.h5 <- sce.allen

library(HDF5Array)
# Write the in-memory counts to disk and swap in the file-backed representation.
assay(sce.h5, "tophat_counts", withDimnames=FALSE) <-
    writeHDF5Array(assay(sce.h5, "tophat_counts"), file="assay.h5", name="counts")

class(assay(sce.h5, "tophat_counts", withDimnames=FALSE))
# Confirm that the backing file now exists on disk.
file.exists("assay.h5")
```

It is worth noting that `iSEE()` does not know or care that the data is stored in an HDF5 file.
The app is fully compatible with any matrix-like representation of the assay data that supports `dim()` and `[,`.
As such, `iSEE()` can be directly used with other memory-efficient objects like the `DeferredMatrix` and `LowRankMatrix` from the `r BiocStyle::Biocpkg("BiocSingular")` package, or perhaps the `ResidualMatrix` from the `r Biocpkg("batchelor")` package.

```{r, echo=FALSE}
unlink("assay.h5")
```

# Downsampling points {#downsampling}

It is also possible to downsample points to reduce the time required to generate the plot.
This involves subsetting the dataset so that only the most recently plotted point for an overlapping set of points is shown.
In this manner, we avoid wasting time in plotting many points that would not be visible anyway.
To demonstrate, we will re-use the 68K PBMC example and perform downsampling on the feature assay plot; we can see that its aesthetics are largely similar to the non-downsampled counterpart above.

```{r}
library(iSEE)
app <- iSEE(sce.pbmc, initial=
    list(RowDataTable(Selected="ENSG00000251562", Search="MALAT1"),
        FeatureAssayPlot(XAxis="Column data", XAxisColumnData="Library",
            YAxisFeatureSource="RowDataTable1",
            VisualChoices="Point", Downsample=TRUE,
            VisualBoxOpen=TRUE
        )
    )
)
```

```{r, echo=FALSE}
SCREENSHOT("screenshots/bigdata-downsample.png")
```

Downsampling is possible in all `iSEE()` plotting panels that represent features or samples as points.
We can turn on downsampling for all such panels using the relevant field in `iSEEOptions`, which spares us the hassle of setting `Downsample=` individually in each panel constructor.

```{r, eval=FALSE}
iSEEOptions$set(downsample=TRUE)
```

The downsampling only affects the visualization and the speed of the plot rendering.
Any interactions with other panels occur as if all of the points were still there.
For example, if one makes a brush, all of the points therein will be selected regardless of whether they were downsampled.

```{r, echo=FALSE}
brushed <- list(xmin = 0.4, xmax = 1.6672635882221, ymin = 21.106156820944, ymax = 62.899238475283,
    coords_css = list(xmin = 41.5274754930861, xmax = 102.504791259766,
        ymin = 200.009613037109, ymax = 365.009613037109),
    coords_img = list(xmin = 53.986301369863, xmax = 133.25766825691,
        ymin = 260.012487411041, ymax = 474.512479543227),
    img_css_ratio = list(x = 1.30001404440901, y = 1.29999995231628),
    mapping = list(x = "X", y = "Y", group = "GroupBy"),
    domain = list(left = 0.4, right = 8.6, bottom = -5.1, top = 107.1,
        discrete_limits = list(
            x = list("1", "2", "3", "4", "5", "6", "7", "8"))),
    range = list(left = 53.986301369863, right = 566.922374429224,
        bottom = 609.013698630137, top = 33.1552511415525),
    log = list(x = NULL, y = NULL), direction = "xy",
    brushId = "FeatureAssayPlot1_Brush", outputId = "FeatureAssayPlot1")

library(iSEE)
app <- iSEE(sce.pbmc, initial=
    list(RowDataTable(Selected="ENSG00000251562", Search="MALAT1"),
        FeatureAssayPlot(XAxis="Column data", XAxisColumnData="Library",
            YAxisFeatureSource="RowDataTable1", YAxisFeatureName="ENSG00000251562",
            VisualChoices="Point", Downsample=TRUE,
            VisualBoxOpen=TRUE, BrushData=brushed
        )
    )
)
```

```{r, echo=FALSE}
SCREENSHOT("screenshots/bigdata-downsample2.png")
```

The downsampling resolution determines the degree to which points are considered to be overlapping.
Decreasing the resolution will downsample more aggressively, improving plotting speed but potentially affecting the fidelity of the visualization.
This may compromise the aesthetics of the plot when the size of the points is small, in which case an increase in resolution may be required at the cost of speed.

Obviously, downsampling will not preserve overlays for partially transparent points, but any reliance on partial transparency is probably not a good idea in the first place when there are many points.
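
To make the role of the resolution more concrete, the chunk below gives a minimal base R sketch of grid-based downsampling.
It is only an illustration of the idea described in this section, not the code that `iSEE()` runs internally; the helper `downsample_by_grid()` and its `resolution=` argument are hypothetical names introduced here.
The plotting area is divided into `resolution` by `resolution` cells, and only the last point falling in each cell is retained, on the assumption that it would be drawn on top of the others.

```{r}
# Hypothetical helper (not part of iSEE): returns a logical vector that
# is TRUE for the last point falling into each cell of a square grid.
downsample_by_grid <- function(x, y, resolution = 200) {
    # Map a coordinate vector to integer bin indices in 1..resolution.
    bin <- function(v) {
        rng <- range(v)
        if (rng[1] == rng[2]) {
            return(rep(1, length(v)))
        }
        pmin(resolution, floor((v - rng[1]) / (rng[2] - rng[1]) * resolution) + 1)
    }
    cell <- paste(bin(x), bin(y))
    # With fromLast=TRUE, every point that shares a cell with a later
    # point is flagged as a duplicate, so negating keeps the last one.
    !duplicated(cell, fromLast = TRUE)
}

set.seed(100)
x <- rnorm(1e5)
y <- rnorm(1e5)
keep.fine <- downsample_by_grid(x, y, resolution = 200)
keep.coarse <- downsample_by_grid(x, y, resolution = 50)

# Number of points that would be plotted at each setting.
c(all = length(x), fine = sum(keep.fine), coarse = sum(keep.coarse))
```

In this sketch, a lower resolution yields fewer and larger cells, so fewer points survive; this mirrors the trade-off between speed and fidelity described above.
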

# Changing the interface

One can generally improve the speed of the `iSEE()` interface by only initializing the app with the desired panels.
For example, it makes little sense to spend time rendering a `RowDataPlot` when only the `ReducedDimensionPlot` is of interest.
Specification of the initial state is straightforward with the `initial=` argument, as described in a `r Biocpkg("iSEE", "configure.html", "previous vignette")`.

On occasion, there may be alternative panels with more efficient visualizations for the same data.
The prime example is the `ReducedDimensionHexPlot` class from the `r Biocpkg("iSEEu")` package; this will create a hexplot rather than a scatter plot, thus avoiding the need to render each point in the latter.

# Comments on deployment

It is straightforward to host `r Biocpkg("iSEE")` applications on hosting platforms like [Shiny Server](https://rstudio.com/products/shiny/shiny-server/) or [RStudio Connect](https://rstudio.com/products/connect/).
All one needs to do is to create an `app.R` file that calls `iSEE()` with the desired parameters, and then follow the instructions for the target platform.
For a better user experience, we suggest setting a minimum number of processes to avoid the initial delay from R start-up.

It is also possible to deploy and host Shiny apps on [shinyapps.io](https://www.shinyapps.io/), a platform as a service (PaaS) provided by RStudio.
In many cases, users will need to configure the settings of their deployed apps, in particular selecting larger instances to provide sufficient memory for the app.
The maximum of 1GB of memory available to free accounts may not be sufficient to deploy large datasets, in which case you may consider [using out-of-memory matrices](#out-of-memory), filtering your dataset (e.g., removing lowly detected features), or going for a paid account.
Detailed instructions to get started are available at .
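
As a rough sketch of the `app.R` file mentioned above, the chunk below re-uses the 68K PBMC example and the downsampled panel configuration from this vignette.
The choice of dataset and panels is purely illustrative, and the chunk is not evaluated here; it assumes that the target platform has the relevant packages installed and can access (or download) the data.

```{r, eval=FALSE}
# app.R: illustrative sketch only; adapt the dataset and panels as needed.
library(iSEE)
library(TENxPBMCData)

# Load the file-backed dataset once, when the R process starts.
sce.pbmc <- TENxPBMCData("pbmc68k")
sce.pbmc$Library <- factor(sce.pbmc$Library)

# The last expression must evaluate to the app object returned by iSEE().
iSEE(sce.pbmc, initial=
    list(RowDataTable(Selected="ENSG00000251562", Search="MALAT1"),
        FeatureAssayPlot(XAxis="Column data", XAxisColumnData="Library",
            YAxisFeatureSource="RowDataTable1",
            VisualChoices="Point", Downsample=TRUE)
    )
)
```

Restricting `initial=` to a small number of panels and enabling `Downsample=` keeps both start-up time and rendering time low on shared hosts.
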
For example, see the [isee-shiny-contest](https://rstudio.cloud/project/230765) app, [winner of the 1st Shiny Contest](https://blog.rstudio.com/2019/04/05/first-shiny-contest-winners/).

# Session Info {.unnumbered}

```{r sessioninfo}
sessionInfo()
# devtools::session_info()
```