1 Installation

The package can be installed from Bioconductor:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("scDDboost")

Issues can be reported at https://github.com/wiscstatman/scDDboost/issues.

2 Introduction

scDDboost scores the evidence that a gene is differentially distributed (DD) across two conditions in single-cell RNA-seq data. The higher resolution of single-cell data brings several challenges for analysis; in particular, the distribution of gene expression tends to have a high prevalence of zeros and multiple modes. To account for these characteristics, and drawing on biological intuition, we view the expression values as sampled from a pool of cells that mixes distinct cellular subtypes, where the subtypes are defined without reference to the condition label. Consequently, any distributional change is fully determined by a change in subtype proportions. One subtlety is that not every change in proportions leads to a distributional change. Some genes may be equivalently expressed across several subtypes; even if the individual subtype proportions differ between conditions, the distribution of such a gene is unchanged as long as the proportions aggregated over those subtypes remain the same. For example:

The proportions of subtypes 1 and 2 change between the two conditions. The gene is not DD if subtypes 1 and 2 have the same expression level.

If subtypes 1 and 2 have different expression levels, the distribution of the gene differs between the two conditions.
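The two cases above can be checked numerically. The sketch below is an illustration only: it assumes three subtypes, Poisson-distributed expression within each subtype, and made-up means and proportions. None of these choices are taken from the scDDboost model, and `mixture_pmf` is a hypothetical helper.

```r
# Hypothetical helper: probability mass function of a mixture of Poisson
# components, one component per subtype, weighted by subtype proportions.
mixture_pmf <- function(props, means, x = 0:60) {
  components <- sapply(seq_along(props),
                       function(k) props[k] * dpois(x, means[k]))
  rowSums(components)
}

# Subtypes 1 and 2 swap weight between conditions; their combined
# proportion is 0.6 in both conditions.
props_c1 <- c(0.2, 0.4, 0.4)  # condition 1
props_c2 <- c(0.4, 0.2, 0.4)  # condition 2

# Case 1: subtypes 1 and 2 share the same mean expression.
means_same <- c(5, 5, 20)
diff_same <- max(abs(mixture_pmf(props_c1, means_same) -
                     mixture_pmf(props_c2, means_same)))

# Case 2: subtypes 1 and 2 have different mean expression.
means_diff <- c(5, 12, 20)
diff_diff <- max(abs(mixture_pmf(props_c1, means_diff) -
                     mixture_pmf(props_c2, means_diff)))

diff_same  # no difference between conditions: the gene is not DD
diff_diff  # clearly positive: the gene is DD
```

In the first case only the aggregated proportion of subtypes 1 and 2 enters the mixture, so the change in individual proportions is invisible at the level of this gene.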

3 Posterior probability of a gene being DD

pdd is the core function; it quantifies the posterior probability of DD for each input gene. Let’s look at an example.

suppressMessages(library(scDDboost))

Next, we load the toy simulated example object that we will use for identifying and classifying DD genes.

data(sim_dat)

Verify that this object is a member of the SingleCellExperiment class and that it contains 200 cells and 1000 genes. The colData slot (which contains a data frame of metadata for the cells) should have a column that contains the biological condition or grouping of interest. In this example data, that variable is the conditions column. Note that the input data need to be a matrix of normalized counts. We now run the function pdd:

data_counts <- SummarizedExperiment::assays(sim_dat)$counts
conditions <- SummarizedExperiment::colData(sim_dat)$conditions
rownames(data_counts) <- seq_len(1000)

##here we use 2 cores to compute the distance matrix
bp <- BiocParallel::MulticoreParam(2)
D_c <- calD(data_counts,bp)

ProbDD <- pdd(data = data_counts,cd = conditions, bp = bp, D = D_c)

The user needs to specify four inputs: the dataset, the condition labels, a BiocParallel parameter object that sets the number of CPU cores used for computation, and a distance matrix of cells. The other input parameters have default settings.
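Before a long run it can help to confirm that the inputs agree in shape. The following is a minimal sketch in base R, assuming genes are in rows and cells in columns (as in the example above), one condition label per cell, and a square cells-by-cells distance matrix. `check_pdd_inputs` is a hypothetical helper, not a function of scDDboost, and the toy data are made up.

```r
# Hypothetical helper (not part of scDDboost): shape checks for the
# dataset, the condition labels, and the distance matrix.
check_pdd_inputs <- function(data, cd, D) {
  n_cells <- ncol(data)
  stopifnot(
    "data must be a numeric genes-by-cells matrix" =
      is.matrix(data) && is.numeric(data),
    "cd needs one label per cell" = length(cd) == n_cells,
    "cd must contain exactly two conditions" = length(unique(cd)) == 2,
    "D must be a cells-by-cells matrix" =
      is.matrix(D) && all(dim(D) == n_cells)
  )
  invisible(TRUE)
}

# Toy inputs: 20 genes, 6 cells, 3 cells per condition.
set.seed(1)
toy_counts <- matrix(rpois(20 * 6, lambda = 3), nrow = 20)
toy_cd <- rep(c(1, 2), each = 3)
toy_D <- as.matrix(dist(t(toy_counts)))  # dist() works on rows, so transpose

ok <- check_pdd_inputs(toy_counts, toy_cd, toy_D)
```

The check applies to the distance-matrix case only; it would reject a vector of cluster labels supplied in place of the distance matrix.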

3.1 Clustering of cells

We provide a default method for computing the distance matrix, implemented in calD; in general, pdd accepts any valid distance matrix. The user can also supply cluster labels rather than a distance matrix for the argument D, but the random distancing mechanism, which relies on the distance matrix, will then be disabled, and random should be set to FALSE.
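As an illustration of the two alternatives, the sketch below builds a cell-by-cell distance matrix and a vector of cluster labels with base R only. The toy counts, the log transform, the Euclidean metric, average-linkage clustering, and the choice of three clusters are all assumptions made for the example; they are not what calD does. Whether pdd expects a plain matrix or a dist object should be checked against the package documentation.

```r
set.seed(1)
# Hypothetical toy data: 50 genes x 30 cells (not the package's sim_dat).
toy_counts <- matrix(rpois(50 * 30, lambda = 5), nrow = 50,
                     dimnames = list(NULL, paste0("cell", 1:30)))

# Cells are columns, so transpose before calling dist(), which
# computes distances between rows.
D_custom <- as.matrix(dist(t(log1p(toy_counts)), method = "euclidean"))

# Alternative input: one cluster label per cell, here from
# average-linkage hierarchical clustering cut into 3 groups.
labels_custom <- cutree(hclust(as.dist(D_custom), method = "average"), k = 3)

dim(D_custom)         # one row and one column per cell
table(labels_custom)  # number of cells per cluster
```

Following the description above, `D_custom` would be passed as D, whereas `labels_custom` would be passed as D together with random set to FALSE.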

For the number of subtypes, we provide a default function detK, which selects the smallest number of subtypes such that the ratio of the within-cluster difference to the between-cluster difference falls below a threshold (the default is 1).

If the user has another way to determine \(K\), \(K\) should be specified in pdd.

## determine the number of subtypes
K <- detK(D_c)
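To make the idea of a within-to-between criterion concrete, here is a sketch of one possible rule of this kind. It is not the implementation of detK: the helpers `within_between_ratio` and `choose_k` are hypothetical, the ratio is assumed to be the mean within-cluster distance divided by the mean between-cluster distance, clusters come from average-linkage hierarchical clustering, and the threshold of 0.3 is specific to this toy scale and unrelated to the default of detK.

```r
# Hypothetical helper: mean within-cluster distance divided by
# mean between-cluster distance, over all pairs of cells.
within_between_ratio <- function(D, labels) {
  same <- outer(labels, labels, "==")
  upper <- upper.tri(D)  # count each pair of cells once
  mean(D[upper & same]) / mean(D[upper & !same])
}

# Hypothetical helper: smallest K whose ratio falls below the threshold.
# k_max should be well below the number of cells.
choose_k <- function(D, threshold, k_max = 10) {
  tree <- hclust(as.dist(D), method = "average")
  for (k in 2:k_max) {
    if (within_between_ratio(D, cutree(tree, k = k)) < threshold) {
      return(k)
    }
  }
  k_max
}

# Toy data: 60 cells in two dimensions, drawn around 3 well-separated centers.
set.seed(42)
centers <- rbind(c(0, 0), c(10, 0), c(0, 10))
cells_xy <- centers[rep(1:3, each = 20), ] + matrix(rnorm(120), ncol = 2)
D_toy <- as.matrix(dist(cells_xy))

k_hat <- choose_k(D_toy, threshold = 0.3)
k_hat
```

With two clusters, two of the true groups are merged and the within-cluster distances stay large relative to the between-cluster distances; the ratio drops sharply once each group has its own cluster.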

If we set the threshold at 5%, that is, if we call a gene DD when its posterior probability of DD exceeds 0.95, the estimated DD genes are:

EDD <- which(ProbDD > 0.95)

Note that one minus the posterior probability returned by pdd is the local false discovery rate (lFDR) of a gene, so thresholding it directly gives a conservative set of DD genes. We can gain further power by controlling the false discovery rate instead. Index the genes by \(g = 1,2,...,G\), let \(p_g = 1 - P(DD_g | \text{data})\) be the lFDR of gene \(g\), and let \(p_{(1)} \leq ... \leq p_{(G)}\) be the lFDRs ranked from smallest to largest. To control the false discovery rate at 5%, the positive set consists of the \(s^*\) genes with the smallest lFDR, where \[s^* = \max\left\{s : \frac{\sum_{i = 1}^s p_{(i)}}{s} \leq 0.05\right\}\]

EDD <- getDD(ProbDD,0.05)

Function getDD extracts the estimated DD genes using the above transformation.
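The selection rule can be written in a few lines of base R. The sketch below is an illustration of the rule as stated above, taking the lFDR of a gene to be one minus its posterior probability of DD; `select_dd` is a hypothetical helper and is not claimed to reproduce the internals of getDD. The posterior probabilities are made up.

```r
# Hypothetical helper: indices of genes selected at a target FDR,
# given posterior probabilities of DD.
select_dd <- function(prob_dd, fdr = 0.05) {
  lfdr <- 1 - prob_dd                  # local false discovery rate
  ord <- order(lfdr)                   # genes from smallest to largest lFDR
  running_mean <- cumsum(lfdr[ord]) / seq_along(ord)
  # The running mean of sorted values never decreases, so the admissible
  # set sizes form a prefix and their count equals the largest one.
  s_star <- sum(running_mean <= fdr)
  sort(ord[seq_len(s_star)])
}

# Made-up posterior probabilities for six genes.
toy_prob <- c(0.99, 0.97, 0.90, 0.93, 0.50, 0.20)

hard_threshold <- which(toy_prob > 0.95)    # genes 1 and 2
fdr_controlled <- select_dd(toy_prob, 0.05) # genes 1, 2 and 4
```

Gene 4 has an lFDR of 0.07, above 0.05, yet the average lFDR over genes 1, 2 and 4 is about 0.037, so it enters the FDR-controlled set. This is the gain in power over the hard threshold.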

Session Information

sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.18-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] S4Vectors_0.40.0     IRanges_2.36.0       GenomicRanges_1.54.0
## [4] scDDboost_1.4.0      ggplot2_3.4.4        BiocStyle_2.30.0    
## 
## loaded via a namespace (and not attached):
##  [1] SummarizedExperiment_1.32.0 gtable_0.3.4               
##  [3] xfun_0.40                   bslib_0.5.1                
##  [5] caTools_1.18.2              Biobase_2.62.0             
##  [7] lattice_0.22-5              vctrs_0.6.4                
##  [9] tools_4.3.1                 bitops_1.0-7               
## [11] generics_0.1.3              stats4_4.3.1               
## [13] parallel_4.3.1              tibble_3.2.1               
## [15] fansi_1.0.5                 cluster_2.1.4              
## [17] blockmodeling_1.1.5         pkgconfig_2.0.3            
## [19] KernSmooth_2.23-22          Matrix_1.6-1.1             
## [21] EBSeq_2.0.0                 desc_1.4.2                 
## [23] lifecycle_1.0.3             GenomeInfoDbData_1.2.11    
## [25] compiler_4.3.1              farver_2.1.1               
## [27] brio_1.1.3                  gplots_3.1.3               
## [29] munsell_0.5.0               codetools_0.2-19           
## [31] GenomeInfoDb_1.38.0         htmltools_0.5.6.1          
## [33] sass_0.4.7                  RCurl_1.98-1.12            
## [35] yaml_2.3.7                  pillar_1.9.0               
## [37] crayon_1.5.2                jquerylib_0.1.4            
## [39] BiocParallel_1.36.0         SingleCellExperiment_1.24.0
## [41] cachem_1.0.8                DelayedArray_0.28.0        
## [43] magick_2.8.1                abind_1.4-5                
## [45] mclust_6.0.0                gtools_3.9.4               
## [47] tidyselect_1.2.0            digest_0.6.33              
## [49] dplyr_1.1.3                 bookdown_0.36              
## [51] labeling_0.4.3              rprojroot_2.0.3            
## [53] fastmap_1.1.1               grid_4.3.1                 
## [55] colorspace_2.1-0            cli_3.6.1                  
## [57] SparseArray_1.2.0           magrittr_2.0.3             
## [59] S4Arrays_1.2.0              utf8_1.2.4                 
## [61] withr_2.5.1                 scales_1.2.1               
## [63] rmarkdown_2.25              XVector_0.42.0             
## [65] matrixStats_1.0.0           RcppEigen_0.3.3.9.3        
## [67] evaluate_0.22               knitr_1.44                 
## [69] testthat_3.2.0              Oscope_1.32.0              
## [71] rlang_1.1.1                 Rcpp_1.0.11                
## [73] glue_1.6.2                  BiocManager_1.30.22        
## [75] BiocGenerics_0.48.0         pkgload_1.3.3              
## [77] jsonlite_1.8.7              R6_2.5.1                   
## [79] MatrixGenerics_1.14.0       zlibbioc_1.48.0