--- title: "distinct: a method for differential analyses via hierarchical permutation tests" author: - name: Simone Tiberi affiliation: - &IMLS Institute for Molecular Life Sciences, University of Zurich, Switzerland - &SIB SIB Swiss Institute of Bioinformatics, University of Zurich, Switzerland email: simone.tiberi@uzh.ch package: "`r BiocStyle::pkg_ver('distinct')`" date: "`r format(Sys.Date(), '%m/%d/%Y')`" bibliography: References.bib vignette: > %\VignetteIndexEntry{distinct: a method for differential analyses via hierarchical permutation tests} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} output: BiocStyle::html_document --- --- ```{r setup, echo=FALSE, results="hide"} knitr::opts_chunk$set(tidy=FALSE, cache=TRUE, dev="png", message=TRUE, error=FALSE, warning=TRUE) ``` # Introduction *distinct* is a statistical method to perform differential testing between two or more groups of distributions; differential testing is performed via hierarchical non-parametric permutation tests on the cumulative distribution functions (cdfs) of each sample. While most methods for differential expression target differences in the mean abundance between conditions, *distinct*, by comparing full cdfs, identifies, both, differential patterns involving changes in the mean, as well as more subtle variations that do not involve the mean (e.g., unimodal vs. bi-modal distributions with the same mean). *distinct* is a general and flexible tool: due to its fully non-parametric nature, which makes no assumptions on how the data was generated, it can be applied to a variety of datasets. It is particularly suitable to perform differential state analyses on single cell data (e.g., differential analyses within sub-populations of cells), such as single cell RNA sequencing (scRNA-seq) and high-dimensional flow or mass cytometry (HDCyto) data. At present, covariates are not allowed, and only 2-group comparisons are implemented. In future releases, we will allow for covariates and for differential testing between more than 2 groups. A pre-print will follow in the coming months. To access the R code used in the vignettes, type: ```{r vignettes, eval=FALSE} browseVignettes("distinct") ``` Questions relative to *distinct* should be reported as a new issue at *[BugReports](https://github.com/SimoneTiberi/distinct/issues)*. To cite *distinct*, type: ```{r citation, eval=FALSE} citation("distinct") ``` ## Bioconductor installation *distinct* is available on [Bioconductor](https://bioconductor.org/packages/distinct) and can be installed with the command: ```{r Bioconductor_installation, eval=FALSE} if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager") BiocManager::install("distinct") ``` # Differential State analysis Differential state analyses aim at investigating differential patterns between conditions in sub-populations of cells. To use *distinct* one needs data from two or more groups of samples (i.e., experimental conditions), with at least 2 samples (i.e., biological replicates) per group. Given a single-cell RNA-sequencing (scRNA-seq) or a high dimensional flow or mass cytometry (HDCyto) dataset, cells need first to be clustered in groups via some form of clustering algorithms; *distinct* is then applied to identify differential patterns between groups, within each cluster of cells. ## Input data Load the example dataset, consisting of a subset of 6 samples (3 individuals observed across 2 conditions) and 100 genes selected from the `Kang18_8vs8()` object of the *muscData* package. ```{r load-example-data, message = FALSE} library(SingleCellExperiment) data("Kang_subset", package = "distinct") Kang_subset ``` Columns `ind` and `stim` of the `colData` indicate the indivual id and the experimental condition (control or stimulated) of each cell, while column `sample_id` shows the sample id, needed for the differential anlyses. Column `cell` represents the cell type, which defines the clustering structure of cells: differential testing between conditions is performed separately for each cluster of cells. Note that, if cell clustering label was unknown, we would need to cluster cells into groups via some clustering algorithm. ```{r visualize colData} colData(Kang_subset) ``` The experimental design compares two groups (stim vs ctrl) with 3 biological replicates each. ```{r visualize experimental design} Kang_subset@metadata$experiment_info ``` ## Differential analyses within sub-populations of cells Load *distinct*. ```{r load_distinct, message=FALSE} library(distinct) ``` Create the design of the study: ```{r create_design} samples = Kang_subset@metadata$experiment_info$sample_id group = Kang_subset@metadata$experiment_info$stim design = model.matrix(~group) # rownames of the design must indicate sample ids: rownames(design) = samples design ``` Perform differential state testing between conditions. Parameter `name_assays_expression` specifies the input data (logcounts) in `assays(x)`, while `name_cluster` and `name_sample` define the column names of `colData(x)` containing the clustering of cells (cell) and the id of individual samples (sample_id). The group we would like to test for is in the second column of the design, therefore we will specify: column_to_test = 2. Note that the sample names in `colData(x)$name_sample` have to be the same ones as those in `rownames(design)` (although not necessarily in the same order). ```{r check_sample_names} rownames(design) unique(colData(Kang_subset)$sample_id) ``` In order to obtain a finer ranking for the most significant genes, if computational resources are available, we encourage users to increase `P_4` (i.e., the number of permutations when a raw p-value is < 0.001) and set `P_4 = 20,000` (by default `P_4 = 10,000`). We strongly encourage using normalized data, such as counts per million (CPM) or log2-CPM (e.g., `logcounts` as created via `scater::logNormCounts`). ```{r differential-analyses} set.seed(61217) res = distinct_test(x = Kang_subset, name_assays_expression = "logcounts", name_cluster = "cell", name_sample = "sample_id", design = design, column_to_test = 2, min_non_zero_cells = 20, n_cores = 2) ``` ### Handling covariates and batch effects Covariates (such as batch effects), if present, can be added to the design matrix. In each cluster of cells, we fit a linear model, with covariates as predictors, and regress them out by performing differential analyeses on the residuals. By separately fitting a linear model on each cluster, we allow the effect of covariates to vary from cluster to cluster. When specifying covariates, we highly recommend using log-normalized data, such as log2-CPMs (e.g., `logcounts` as created via `scater::logNormCounts`), because it is generally assumed that covariates (and particularly batch effects) have an approximately linear effect on the log or log2 scale of counts. Assume samples are associated to three different batches; we modify the design to also include batches. ```{r define-batches} batch = factor(c("A", "B", "C", "A", "B", "C")) design = model.matrix(~group + batch) # rownames of the design must indicate sample ids: rownames(design) = samples design ``` We proceed as before to perform differential testing. Again, we specify the column of the design to be tested via `column_to_test = 2`. ```{r differential-analyses-with-batches} set.seed(61217) res_batches = distinct_test(x = Kang_subset, name_assays_expression = "logcounts", name_cluster = "cell", name_sample = "sample_id", design = design, column_to_test = 2, min_non_zero_cells = 20, n_cores = 2) ``` ## Visualizing results Results are reported as a `data.frame`, where columns `gene` and `cluster_id` contain the gene and cell-cluster name, while `p_val`, `p_adj.loc` and `p_adj.glb` report the raw p-values, locally and globally adjusted p-values, via Benjamini and Hochberg (BH) correction. In locally adjusted p-values (`p_adj.loc`) BH correction is applied in each cluster separately, while in globally adjusted p-values (`p_adj.glb`) BH correction is performed to the results from all clusters. We can further compute the fold change (FC) and log2-FC between groups. To compute FCs, use normalized data, such as CPMs; do not use logarithm transformed data (e.g., logcounts). ```{r compute-log2_FC} res = log2_FC(res = res, x = Kang_subset, name_assays_expression = "cpm", name_group = "stim", name_cluster = "cell") ``` `log2_FC` computes the log-FC between the first and the second level of the group id, in this case beween controls (numerator) and stimulated samples (denominator). ```{r visualize-group-levels} levels(colData(Kang_subset)$stim) head(res[,9:10], 3) ``` To use a different level (i.e., stim/ctrl), we can reorder the levels before running `log2_FC2`. ```{r reorder-group-levels} # set "stim" as 1st level: colData(Kang_subset)$stim = relevel(colData(Kang_subset)$stim, "stim") levels(colData(Kang_subset)$stim) res_2 = log2_FC(res = res, x = Kang_subset, name_assays_expression = "cpm", name_group = "stim", name_cluster = "cell") head(res_2[,9:10], 3) ``` We can visualize significant results via `top_results` function. ```{r visualize-results} head(top_results(res)) ``` We can also visualize significant results for a specific cluster of cells. ```{r visualize-results-one-cluster} top_results(res, cluster = "Dendritic cells") ``` By default, results from 'top_results' are sorted by (globally) adjusted p-value; they can also be sorted by log2-FC. ```{r visualize-results-one-cluster-sort-by-FC} top_results(res, cluster = "Dendritic cells", sort_by = "log2FC") ``` We can further filter results to visualize significant up- or down-regulated results only. Here we visualize the down-regulated gene-cluster results; i.e., results with lower expression in 'ctlr' group compared to 'stim'. ```{r visualize-results-down} top_results(res, up_down = "down", cluster = "Dendritic cells") ``` ## Plotting significant results Density plot of one significant gene (ISG15) in `Dendritic cells` cluster. ```{r plot_densitied} plot_densities(x = Kang_subset, gene = "ISG15", cluster = "Dendritic cells", name_assays_expression = "logcounts", name_cluster = "cell", name_sample = "sample_id", name_group = "stim") ``` Instead of one curve per sample, we can also plot aggregated group-level curves by setting `group_level = TRUE`. ```{r plot_densitied_group_level} plot_densities(x = Kang_subset, gene = "ISG15", cluster = "Dendritic cells", name_assays_expression = "logcounts", name_cluster = "cell", name_sample = "sample_id", name_group = "stim", group_level = TRUE) ``` CDF plot of one significant gene (ISG15) in `Dendritic cells` cluster. ```{r plot_cdfs} plot_cdfs(x = Kang_subset, gene = "ISG15", cluster = "Dendritic cells", name_assays_expression = "logcounts", name_cluster = "cell", name_sample = "sample_id", name_group = "stim") ``` Violin plots of significant genes in `Dendritic cells` cluster. ```{r plotExpression} # select cluster of cells: cluster = "Dendritic cells" sel_cluster = res$cluster_id == cluster sel_column = Kang_subset$cell == cluster # select significant genes: sel_genes = res$p_adj.glb < 0.01 genes = as.character(res$gene[sel_cluster & sel_genes]) # make violin plots: library(scater) plotExpression(Kang_subset[,sel_column], features = genes, exprs_values = "logcounts", log2_values = FALSE, x = "sample_id", colour_by = "stim", ncol = 3) + guides(fill = guide_legend(override.aes = list(size = 5, alpha = 1))) + theme(axis.text.x = element_text(angle = 45, hjust = 1)) ``` Visualize the concordance of differential results between cell clusters. We select as significant genes with globally adjusted p-value below 0.01. ```{r UpsetR plot} library(UpSetR) res_by_cluster = split( ifelse(res$p_adj.glb < 0.01, 1, 0), res$cluster_id) upset(data.frame(do.call(cbind, res_by_cluster)), nsets = 10, nintersects = 20) ``` # Session info ```{r sessionInfo} sessionInfo() ```