Predicting cell cycle phase using peco

Installation

To install and load the package, run:

install.packages("devtools")
library(devtools)
install_github("jhsiao999/peco")

peco uses SingleCellExperiment class objects.

library(peco)
library(SingleCellExperiment)
#> Loading required package: SummarizedExperiment
#> Loading required package: MatrixGenerics
#> Loading required package: matrixStats
#> 
#> Attaching package: 'MatrixGenerics'
#> The following objects are masked from 'package:matrixStats':
#> 
#>     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
#>     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
#>     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
#>     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
#>     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
#>     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
#>     colWeightedMeans, colWeightedMedians, colWeightedSds,
#>     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
#>     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
#>     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
#>     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
#>     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
#>     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
#>     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
#>     rowWeightedSds, rowWeightedVars
#> Loading required package: GenomicRanges
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#>     as.data.frame, basename, cbind, colnames, dirname, do.call,
#>     duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
#>     lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#>     pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
#>     tapply, union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#> 
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:utils':
#> 
#>     findMatches
#> The following objects are masked from 'package:base':
#> 
#>     I, expand.grid, unname
#> Loading required package: IRanges
#> Loading required package: GenomeInfoDb
#> Loading required package: Biobase
#> Welcome to Bioconductor
#> 
#>     Vignettes contain introductory material; view with
#>     'browseVignettes()'. To cite Bioconductor, see
#>     'citation("Biobase")', and for packages 'citation("pkgname")'.
#> 
#> Attaching package: 'Biobase'
#> The following object is masked from 'package:MatrixGenerics':
#> 
#>     rowMedians
#> The following objects are masked from 'package:matrixStats':
#> 
#>     anyMissing, rowMedians
library(doParallel)
#> Loading required package: foreach
#> Loading required package: iterators
#> Loading required package: parallel
library(foreach)

Overview

peco is a supervised approach for PrEdicting cell cycle phase in a COntinuum using single-cell RNA sequencing data. The R package provides functions to build a training dataset and functions to predict cell cycle phase on a continuum from existing training data.

Our work demonstrated that peco is able to predict continuous cell cycle phase using a small set of cyclic genes: CDK1, UBE2C, TOP2A, HISTH1E, and HISTH1C (identified as cell cycle marker genes in studies of yeast (Spellman et al., 1998) and HeLa cells (Whitfield et al., 2002)).

Below we provide two use cases. Vignette 1 shows how to use the built-in training dataset to predict continuous cell cycle phase. Vignette 2 shows how to make a training dataset and build a predictor from it.

Users can also view the vignettes via browseVignettes("peco").

About the training dataset

training_human stores built-in training data of 101 significant cyclic genes. Below are the slots contained in training_human:

predict.yy: a gene-by-sample matrix (101 by 888) that stores predicted cyclic expression values
cellcycle_peco_reordered: cell cycle phase on a unit circle (angle), ordered from 0 to 2π
cellcycle_function: a list of 101 functions corresponding to the top 101 cyclic genes identified in our dataset
sigma: standard error associated with cyclic trends of gene expression
pve: proportion of variance explained by the cyclic trend

data("training_human")
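To make the sigma and pve slots more concrete, here is a minimal base R sketch on simulated data. The trend function, the noise level, and the names sigma_hat and pve_hat are hypothetical and only illustrate the idea; the exact computation inside peco may differ.

```r
set.seed(1)

# Simulated cell cycle phases on the unit circle (angles in [0, 2*pi))
theta <- sort(runif(200, min = 0, max = 2 * pi))

# Hypothetical cyclic trend for one gene, standing in for one entry
# of cellcycle_function
trend <- function(angle) 1.5 * cos(angle - pi / 4)

# Simulated expression: cyclic trend plus Gaussian noise
y <- trend(theta) + rnorm(length(theta), sd = 0.5)

# Spread of expression around the trend (analogous to sigma)
residuals_y <- y - trend(theta)
sigma_hat <- sd(residuals_y)

# Share of expression variance captured by the trend (analogous to pve)
pve_hat <- 1 - var(residuals_y) / var(y)
```

A gene with a strong cyclic pattern has a pve close to 1, while a gene whose expression does not depend on phase has a pve near 0.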

Predict cell cycle phase using gene expression data

peco works with SingleCellExperiment objects from Bioconductor. The example below takes a SingleCellExperiment object as input and predicts cell cycle phase.

sce_top101genes includes 101 genes and 888 single-cell samples and one assay slot of counts.

data("sce_top101genes")
assays(sce_top101genes)
#> List of length 1
#> names(1): counts

Transform the expression values to quantile-normalized counts-per-million (CPM) values. peco uses the cpm_quantNormed slot as input data for predictions.

sce_top101genes <- data_transform_quantile(sce_top101genes)
#> computing on 2 cores
assays(sce_top101genes)
#> List of length 3
#> names(3): counts cpm cpm_quantNormed
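The transformation can be pictured with a small base R sketch on a toy count matrix. This is an assumption about the general technique (CPM scaling followed by per-gene mapping to standard normal quantiles), not a copy of the code inside data_transform_quantile; the matrix and variable names are hypothetical.

```r
set.seed(1)

# Hypothetical count matrix: 4 genes (rows) by 50 cells (columns)
counts <- matrix(rpois(4 * 50, lambda = 20), nrow = 4,
                 dimnames = list(paste0("gene", 1:4), paste0("cell", 1:50)))

# Counts per million: scale each cell (column) by its library size.
# Here the library size is the sum over the toy genes only.
lib_size <- colSums(counts)
cpm <- t(t(counts) / lib_size) * 1e6

# Per gene, replace each value by the standard normal quantile of its rank
cpm_quant_normed <- t(apply(cpm, 1, function(gene_cpm) {
  qqnorm(gene_cpm, plot.it = FALSE)$x
}))
dimnames(cpm_quant_normed) <- dimnames(cpm)
```

After this step every gene has the same marginal distribution across cells, so genes with very different expression levels contribute on a comparable scale.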

Apply the prediction model using function cycle_npreg_outsample and generate prediction results contained in a list object pred_top101genes.

pred_top101genes <- cycle_npreg_outsample(
    Y_test=sce_top101genes,
    sigma_est=training_human$sigma[rownames(sce_top101genes),],
    funs_est=training_human$cellcycle_function[rownames(sce_top101genes)],
    method.trend="trendfilter",
    ncores=1,
    get_trend_estimates=FALSE)

pred_top101genes$Y contains a SingleCellExperiment object with the predicted cell cycle phase in the colData slot.

head(colData(pred_top101genes$Y)$cellcycle_peco)
#> 20170905-A01 20170905-A02 20170905-A03 20170905-A06 20170905-A07 20170905-A08 
#>     1.099557     4.680973     2.481858     4.303982     4.052655     1.413717
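Conceptually, a prediction of this kind can be sketched as a search over candidate angles: for each cell, find the angle at which the known cyclic trends best explain the observed expression. The base R sketch below assumes a Gaussian likelihood and a grid of 100 angles; the three trend functions, the sigma values, and predict_phase are hypothetical and are not the internals of cycle_npreg_outsample.

```r
set.seed(1)

# Hypothetical cyclic trends for three genes (stand-ins for funs_est)
funs <- list(
  g1 = function(theta) cos(theta),
  g2 = function(theta) sin(theta),
  g3 = function(theta) cos(theta - pi / 3)
)
# Hypothetical per-gene standard errors (stand-ins for sigma_est)
sigma <- c(g1 = 0.3, g2 = 0.3, g3 = 0.3)

# Candidate phases: midpoints of 100 equal bins on [0, 2*pi)
grid <- 2 * pi * (seq_len(100) - 0.5) / 100

# Expected expression at every candidate phase: genes in rows, grid in columns
mu <- t(vapply(funs, function(f) f(grid), numeric(length(grid))))

# For one cell, return the grid angle with the highest Gaussian log-likelihood
predict_phase <- function(y) {
  resid <- (y - mu) / sigma  # y and sigma recycle down the gene rows
  loglik <- colSums(-0.5 * resid^2 - log(sigma))
  grid[which.max(loglik)]
}

# Simulate one cell at a known phase and predict it back
true_phase <- grid[30]
y_cell <- mu[, 30] + rnorm(3, sd = 0.05)
predicted_phase <- predict_phase(y_cell)
```

With low noise, predicted_phase should land on or near true_phase. More genes with distinct peak positions make the likelihood sharper.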

Visualize the prediction results for one gene. Below we choose CDK1 (“ENSG00000170312”). Because CDK1 is a known cell cycle gene, this visualization serves as a sanity check on the fit. The fitted function training_human$cellcycle_function[["ENSG00000170312"]], plotted as blue points, was obtained from our training data.

plot(y=assay(pred_top101genes$Y,"cpm_quantNormed")["ENSG00000170312",],
     x=colData(pred_top101genes$Y)$theta_shifted, main = "CDK1",
     ylab = "quantile normalized expression")
points(y=training_human$cellcycle_function[["ENSG00000170312"]](seq(0,2*pi, length.out=100)),
       x=seq(0,2*pi, length.out=100), col = "blue", pch =16)

Visualize cyclic expression trend based on predicted phase

Visualize the prediction results for the first six genes. Use fit_cyclical_many to estimate the cyclic function of each gene from the input data.

# predicted cell time in the input data
theta_predict = colData(pred_top101genes$Y)$cellcycle_peco
names(theta_predict) = rownames(colData(pred_top101genes$Y))

# expression values of the first 6 genes in the input data
yy_input = assay(pred_top101genes$Y,"cpm_quantNormed")[1:6,]

# apply trendfilter to estimate cyclic gene expression trend
fit_cyclic <- fit_cyclical_many(Y=yy_input, 
                                theta=theta_predict)
#> computing on 2 cores

gene_symbols = rowData(pred_top101genes$Y)$hgnc[rownames(yy_input)]

par(mfrow=c(2,3))
for (i in 1:6) {
    plot(y=yy_input[i,],
         x=fit_cyclic$cellcycle_peco_ordered,
         main = gene_symbols[i],
         ylab = "quantile normalized expression")
    points(y=fit_cyclic$cellcycle_function[[i]](seq(0,2*pi, length.out=100)),
           x=seq(0,2*pi, length.out=100), col = "blue", pch =16)
}
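To illustrate what estimating a cyclic function from phase and expression means, the sketch below fits a first-order harmonic regression with base R. This is a deliberately simple stand-in and not the trend filtering used by fit_cyclical_many; the simulated data and the name cyclic_fun are hypothetical.

```r
set.seed(1)

# Simulated predicted phases and expression values for one gene
theta <- runif(150, min = 0, max = 2 * pi)
y <- 2 * sin(theta) + rnorm(150, sd = 0.4)

# Harmonic regression: the cos/sin terms make the fitted trend periodic
fit <- lm(y ~ cos(theta) + sin(theta))
coefs <- coef(fit)

# Fitted cyclic function, usable like an entry of cellcycle_function
cyclic_fun <- function(angle) {
  unname(coefs[1] + coefs[2] * cos(angle) + coefs[3] * sin(angle))
}

# Evaluate the trend on a regular grid, e.g. for overlaying on a scatter plot
grid <- seq(0, 2 * pi, length.out = 100)
trend_on_grid <- cyclic_fun(grid)
```

Because the trend is built from cos and sin terms, it takes the same value at 0 and 2π, which is the property a cyclic expression trend needs.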

Session information

sessionInfo()
#> R version 4.3.1 (2023-06-16)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.18-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] doParallel_1.0.17           iterators_1.0.14           
#>  [3] foreach_1.5.2               SingleCellExperiment_1.24.0
#>  [5] SummarizedExperiment_1.32.0 Biobase_2.62.0             
#>  [7] GenomicRanges_1.54.0        GenomeInfoDb_1.38.0        
#>  [9] IRanges_2.36.0              S4Vectors_0.40.0           
#> [11] BiocGenerics_0.48.0         MatrixGenerics_1.14.0      
#> [13] matrixStats_1.0.0           peco_1.14.0                
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.2.0          viridisLite_0.4.2        
#>  [3] vipor_0.4.5               dplyr_1.1.3              
#>  [5] viridis_0.6.4             bitops_1.0-7             
#>  [7] fastmap_1.1.1             RCurl_1.98-1.12          
#>  [9] pracma_2.4.2              digest_0.6.33            
#> [11] rsvd_1.0.5                lifecycle_1.0.3          
#> [13] magrittr_2.0.3            compiler_4.3.1           
#> [15] rlang_1.1.1               sass_0.4.7               
#> [17] tools_4.3.1               igraph_1.5.1             
#> [19] utf8_1.2.4                yaml_2.3.7               
#> [21] knitr_1.44                S4Arrays_1.2.0           
#> [23] DelayedArray_0.28.0       abind_1.4-5              
#> [25] BiocParallel_1.36.0       grid_4.3.1               
#> [27] fansi_1.0.5               beachmat_2.18.0          
#> [29] colorspace_2.1-0          ggplot2_3.4.4            
#> [31] scales_1.2.1              cli_3.6.1                
#> [33] mvtnorm_1.2-3             rmarkdown_2.25           
#> [35] crayon_1.5.2              generics_0.1.3           
#> [37] DelayedMatrixStats_1.24.0 genlasso_1.6.1           
#> [39] scuttle_1.12.0            ggbeeswarm_0.7.2         
#> [41] cachem_1.0.8              geigen_2.3               
#> [43] zlibbioc_1.48.0           assertthat_0.2.1         
#> [45] XVector_0.42.0            vctrs_0.6.4              
#> [47] boot_1.3-28.1             Matrix_1.6-1.1           
#> [49] jsonlite_1.8.7            BiocSingular_1.18.0      
#> [51] BiocNeighbors_1.20.0      ggrepel_0.9.4            
#> [53] irlba_2.3.5.1             beeswarm_0.4.0           
#> [55] scater_1.30.0             jquerylib_0.1.4          
#> [57] glue_1.6.2                codetools_0.2-19         
#> [59] gtable_0.3.4              circular_0.5-0           
#> [61] ScaledMatrix_1.10.0       munsell_0.5.0            
#> [63] tibble_3.2.1              pillar_1.9.0             
#> [65] htmltools_0.5.6.1         conicfit_1.0.4           
#> [67] GenomeInfoDbData_1.2.11   R6_2.5.1                 
#> [69] sparseMatrixStats_1.14.0  evaluate_0.22            
#> [71] lattice_0.22-5            bslib_0.5.1              
#> [73] Rcpp_1.0.11               gridExtra_2.3            
#> [75] SparseArray_1.2.0         xfun_0.40                
#> [77] pkgconfig_2.0.3