--- title: "Work with Big Datasets" author: "Zuguang Gu ( z.gu@dkfz.de )" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{5. Work with Big Datasets} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, echo = FALSE, message = FALSE} library(markdown) library(knitr) knitr::opts_chunk$set( error = FALSE, tidy = FALSE, message = FALSE, fig.align = "center") options(width = 100) options(rmarkdown.html_vignette.check_title = FALSE) library(cola) ``` *cola* can be idealy applied to datasets with intermediate number of samples (around 500), however, the running time and the memory usage might not be acceptable if the number of samples gets very huge, e.g., more than thousands. Nevertheless, we can propose the following two strategies to partially solve this problem. 1. A randomly sampled subset of samples which take relatively short running time (e.g., 100-200 samples) can be firstly applied with cola, from which a specific combination of top-value method and partitioning method that gives the best results can be pre-selected. Later user can only apply these two specific methods to the complete dataset. This would be much faster than blindly running cola with many methods in sequence. 2. The random subset of samples can be treated as a training set to generate a classification of cells. Then, the class labels of the remaining samples can be predicted, e.g. by testing the distance to the centroid of each cell group, without rerunning consensus partitioning on them. cola implements a `predict_classes()` function for this purpose (see the vignette ["Predict classes for new samples"](predict.html) for details). Note, since these two strategies are performed by sampling a small subset of cells from the cohort, the cell clusters with small size might not be detectable. In the following examples, we use pseudo code to demonstrate the ideas. Assuming the full matrix is in an object `mat`. We randomly sample 200 samples from it: ```{r, eval = FALSE} ind = sample(ncol(mat), 200) mat1 = mat[, ind] ``` **Strategy 1**. First we can apply cola on `mat1`: ```{r, eval = FALSE} rl = run_all_consensus_partition_methods(mat1, ...) cola_report(rl, ...) ``` Then find a best combination of top-value method and partitioining method. Assuming they are saved in objects `tm` and `pm`. Next run `consensus_partition()` on `mat` only with `tm` and `pm`: ```{r, eval = FALSE} res = consensus_partition(mat, top_value_method = tm, partition_method = pm, ...) ``` **Strategy 2**. Similar as Strategy 1, we get the `ConsensusPartition` object from methods `tm` and `pm`, which was applied on `mat1`: ```{r, eval = FALSE} res = rl[tm, pm] ``` Note the consensus partition object `res` is only based on a subset of original samples. With `res`, the classes of remaining samples can be predicted: ```{r, eval = FALSE} mat2 = mat[, setdiff(seq_len(ncol(mat)), ind)] mat2 = t(scale(t(mat2))) cl = predict_classes(res, mat2) ``` Please note, by default `mat1` is scaled in cola analysis, correspondingly, `mat2` should also be row-scaled. You can also directly send `mat` to `predict_classes()` function: ```{r, eval = FALSE} cl = predict_classes(res, t(scale(t(mat)))) ``` ## The `DownSamplingConsensusPartition` class *cola* implements a new `DownSamplingConsensusPartition` class for **Strategy 2** mentioned above. It runs consensus partitioning only on a subset of samples and the classes of other samples are predicted by `predict_classes()` internally. 
The `DownSamplingConsensusPartition` class inherits from the `ConsensusPartition` class, so the functions that can be applied to `ConsensusPartition` objects can also be applied to `DownSamplingConsensusPartition` objects, with only minor differences.

In the following example, we demonstrate the usage of the `DownSamplingConsensusPartition` class with the Golub dataset. The Golub dataset was previously used to generate the data object `golub_cola`, from which we can extract the matrix and the annotations with the `get_matrix()` and `get_anno()` functions.

To perform down sampling consensus partitioning, use the helper function `consensus_partition_by_down_sampling()`. This function basically runs `consensus_partition()` on a subset of samples and later predicts the classes for all samples with the `predict_classes()` function. Here we set the `subset` argument to 50, which means 50 samples are randomly sampled from the whole dataset.

```{r, eval = FALSE}
data(golub_cola)
m = get_matrix(golub_cola)

set.seed(123)
golub_cola_ds = consensus_partition_by_down_sampling(m, subset = 50,
    anno = get_anno(golub_cola), anno_col = get_anno_col(golub_cola),
    top_value_method = "SD", partition_method = "kmeans")
```

The object `golub_cola_ds` is already shipped with the package, so we simply load it.

```{r}
data(golub_cola_ds)
golub_cola_ds
```

The summary of `golub_cola_ds` is very similar to that of a `ConsensusPartition` object, except that it mentions the object is generated from 50 samples randomly sampled from the 72 samples.

All the functions that can be applied to the `ConsensusPartition` class can also be applied to the `DownSamplingConsensusPartition` class, with only minor differences. `get_classes()` returns the predicted classes for _all_ samples:

```{r}
class = get_classes(golub_cola_ds, k = 2)
nrow(class)
class
```

There is an additional column named `p`, which is the p-value for predicting the class labels. For details of how the p-value is calculated, please refer to the vignette ["Predict classes for new samples"](predict.html).

If the argument `k` is not specified, or `k` is specified as a vector, the class labels for all values of `k` are returned. You can also set the `p_cutoff` argument so that class labels with a p-value larger than this cutoff are set to `NA`.

```{r}
get_classes(golub_cola_ds, p_cutoff = 0.05)
```

Several functions use the `p_cutoff` argument, which controls the number of "usable samples", i.e., the samples with reliable classifications. These functions are `get_classes()`, `test_to_known_factors()`, `dimension_reduction()` and `get_signatures()`.

For the `dimension_reduction()` function, the samples with a p-value higher than `p_cutoff` are marked by crosses. The samples that were not selected in the random sampling are drawn as smaller dots.

```{r, fig.width = 8, fig.height = 7, out.width = "500"}
dimension_reduction(golub_cola_ds, k = 2)
```

For the `get_signatures()` function, the signatures are only searched for in the samples with p-values less than `p_cutoff`.

```{r, fig.width = 8, fig.height = 7, out.width = "500"}
get_signatures(golub_cola_ds, k = 2)
```

## Session info

```{r}
sessionInfo()
```