Introduction to scToppR

scToppR is an API wrapper for ToppGene. It takes the output of Seurat’s FindAllMarkers, presto’s wilcoxauc, or any similarly formatted table that contains columns for genes, groups of cells (clusters or cell types), average log fold changes, and p-values.
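
As a rough illustration of the expected input shape, the sketch below builds a tiny marker table in base R and checks that the four kinds of columns are present. The three rows are copied from the head(pbmc.markers) output shown later in this vignette; the `required` vector and the check itself are illustrative assumptions, not part of scToppR.

```r
# Toy marker table: genes, cell groups, average log fold changes, p-values.
# Values are copied from the first rows of the pbmc.markers table in this vignette.
markers <- data.frame(
  gene       = c("RPS12", "RPS6", "RPS27"),
  cluster    = c("0", "0", "0"),
  avg_log2FC = c(0.7387061, 0.6934523, 0.7372604),
  p_val_adj  = c(1.746248e-139, 9.349729e-139, 6.393206e-137),
  stringsAsFactors = FALSE
)

# Illustrative check (an assumption, not an scToppR function): confirm the
# column names you intend to pass to toppFun exist in the table.
required <- c(gene_col = "gene", cluster_col = "cluster",
              logFC_col = "avg_log2FC", p_val_col = "p_val_adj")
missing_cols <- setdiff(required, colnames(markers))
if (length(missing_cols) > 0) {
  stop("Missing columns: ", paste(missing_cols, collapse = ", "))
}

str(markers)
```

For a table produced by another tool, the same check applies with that table's own column names.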

As an introduction, this vignette will work with the FindAllMarkers output from Seurat’s PBMC 3k clustering tutorial: https://satijalab.org/seurat/articles/pbmc3k_tutorial.html

You can follow that tutorial and get the markers file from this line:

pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE)

Alternatively, this markers table is included in the scToppR package:

library(scToppR)
data("pbmc.markers")
head(pbmc.markers)
#>               p_val avg_log2FC pct.1 pct.2     p_val_adj cluster  gene
#> RPS12 1.273332e-143  0.7387061 1.000 0.991 1.746248e-139       0 RPS12
#> RPS6  6.817653e-143  0.6934523 1.000 0.995 9.349729e-139       0  RPS6
#> RPS27 4.661810e-141  0.7372604 0.999 0.992 6.393206e-137       0 RPS27
#> RPL32 8.158412e-138  0.6266075 0.999 0.995 1.118845e-133       0 RPL32
#> RPS14 5.177478e-130  0.6336957 1.000 0.994 7.100394e-126       0 RPS14
#> RPS25 3.244898e-123  0.7689940 0.997 0.975 4.450053e-119       0 RPS25

With this data we can run the function toppFun to get results from ToppGene.

toppData <- toppFun(markers = pbmc.markers,
                    topp_categories = NULL, 
                    cluster_col = "cluster", 
                    gene_col = "gene",
                    p_val_col = "p_val_adj",
                    logFC_col = "avg_log2FC")
#> This function returns data generated from ToppGene (https://toppgene.cchmc.org/)
#> 
#> Any use of this data must be done so under the Terms of Use and citation guide established by ToppGene.
#> 
#> Terms of Use: https://toppgene.cchmc.org/navigation/termsofuse.jsp
#> Citations: https://toppgene.cchmc.org/help/publications.jsp
#> Working on cluster: 0 
#> Working on cluster: 1 
#> Working on cluster: 2 
#> Working on cluster: 3 
#> Working on cluster: 4 
#> Working on cluster: 5 
#> Working on cluster: 6 
#> Working on cluster: 7 
#> Working on cluster: 8

It is important to tell toppFun the names of the relevant columns, such as those for clusters and genes. Setting topp_categories to NULL runs toppFun on all ToppGene categories; you may instead provide one or more specific categories as a list. To see all ToppGene categories, use the function get_ToppCats():

get_ToppCats()
#>  [1] "GeneOntologyMolecularFunction" "GeneOntologyBiologicalProcess"
#>  [3] "GeneOntologyCellularComponent" "HumanPheno"                   
#>  [5] "MousePheno"                    "Domain"                       
#>  [7] "Pathway"                       "Pubmed"                       
#>  [9] "Interaction"                   "Cytoband"                     
#> [11] "TFBS"                          "GeneFamily"                   
#> [13] "Coexpression"                  "CoexpressionAtlas"            
#> [15] "ToppCell"                      "Computational"                
#> [17] "MicroRNA"                      "Drug"                         
#> [19] "Disease"
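
Because category names are long, a misspelling is easy to make. The following is a minimal sketch, in base R, of checking requested names against the list printed above before querying ToppGene. The names `known_cats`, `wanted`, `valid`, and `unknown` are hypothetical, and the guarded toppFun call assumes that topp_categories accepts a list of category names, as the text describes.

```r
# Category names copied from the get_ToppCats() output above.
known_cats <- c(
  "GeneOntologyMolecularFunction", "GeneOntologyBiologicalProcess",
  "GeneOntologyCellularComponent", "HumanPheno", "MousePheno", "Domain",
  "Pathway", "Pubmed", "Interaction", "Cytoband", "TFBS", "GeneFamily",
  "Coexpression", "CoexpressionAtlas", "ToppCell", "Computational",
  "MicroRNA", "Drug", "Disease"
)

# The last entry is misspelled on purpose to show the check at work.
wanted  <- c("GeneOntologyMolecularFunction", "Pathway",
             "GenoOntologyBiologicalProcess")
unknown <- setdiff(wanted, known_cats)    # names ToppGene would not recognise
valid   <- intersect(wanted, known_cats)  # names that are safe to request

if (length(unknown) > 0) {
  message("Ignoring unknown categories: ", paste(unknown, collapse = ", "))
}

# Only runs in an interactive session with scToppR installed and
# pbmc.markers loaded, since it contacts the ToppGene service.
if (interactive() && requireNamespace("scToppR", quietly = TRUE) &&
    exists("pbmc.markers")) {
  toppSubset <- scToppR::toppFun(markers = pbmc.markers,
                                 topp_categories = as.list(valid),
                                 cluster_col = "cluster",
                                 gene_col = "gene",
                                 p_val_col = "p_val_adj",
                                 logFC_col = "avg_log2FC")
}
```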

You can also set additional parameters in the toppFun function; please check the documentation for more information.

The results of toppFun are organized into a data frame as follows:

head(toppData)
#>                        Category         ID
#> 1 GeneOntologyMolecularFunction GO:0003735
#> 2 GeneOntologyMolecularFunction GO:0005198
#> 3 GeneOntologyMolecularFunction GO:0019843
#> 4 GeneOntologyMolecularFunction GO:1990948
#> 5 GeneOntologyMolecularFunction GO:0055105
#> 6 GeneOntologyMolecularFunction GO:1990932
#>                                               Name       PValue  QValueFDRBH
#> 1               structural constituent of ribosome 4.251802e-99 2.359750e-96
#> 2                     structural molecule activity 3.879953e-47 1.076687e-44
#> 3                                     rRNA binding 2.987291e-27 5.526489e-25
#> 4              ubiquitin ligase inhibitor activity 4.599566e-12 6.381897e-10
#> 5 ubiquitin-protein transferase inhibitor activity 1.261239e-10 1.399975e-08
#> 6                                5.8S rRNA binding 2.725570e-10 2.521152e-08
#>    QValueFDRBY QValueBonferroni TotalGenes GenesInTerm GenesInQuery
#> 1 1.627540e-95     2.359750e-96      19968         183          246
#> 2 7.426002e-44     2.153374e-44      19968         891          246
#> 3 3.811666e-24     1.657947e-24      19968          79          246
#> 4 4.401648e-09     2.552759e-09      19968          10          246
#> 5 9.655745e-08     6.999874e-08      19968          14          246
#> 6 1.738860e-07     1.512691e-07      19968           5          246
#>   GenesInTermInQuery Source URL Cluster
#> 1                 76                  0
#> 2                 80                  0
#> 3                 24                  0
#> 4                  7                  0
#> 5                  7                  0
#> 6                  5                  0
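
Since the result is an ordinary data frame, it can be filtered and ranked with base R. The sketch below rebuilds the six rows printed above (a subset of columns only) and ranks the significant terms. The FoldEnrichment column is a derived quantity introduced here for illustration; it is not returned by ToppGene or scToppR.

```r
# The six rows shown in head(toppData), restricted to a few columns.
top_terms <- data.frame(
  Name = c("structural constituent of ribosome",
           "structural molecule activity",
           "rRNA binding",
           "ubiquitin ligase inhibitor activity",
           "ubiquitin-protein transferase inhibitor activity",
           "5.8S rRNA binding"),
  QValueFDRBH = c(2.359750e-96, 1.076687e-44, 5.526489e-25,
                  6.381897e-10, 1.399975e-08, 2.521152e-08),
  TotalGenes = 19968,
  GenesInTerm = c(183, 891, 79, 10, 14, 5),
  GenesInQuery = 246,
  GenesInTermInQuery = c(76, 80, 24, 7, 7, 5),
  Cluster = "0",
  stringsAsFactors = FALSE
)

# Illustrative metric (an assumption of this sketch): the fraction of query
# genes in the term, divided by the fraction of all genes in the term.
top_terms$FoldEnrichment <-
  (top_terms$GenesInTermInQuery / top_terms$GenesInQuery) /
  (top_terms$GenesInTerm / top_terms$TotalGenes)

# Keep terms passing a 5% Benjamini-Hochberg FDR, most enriched first.
sig <- subset(top_terms, QValueFDRBH < 0.05)
sig <- sig[order(-sig$FoldEnrichment), c("Name", "QValueFDRBH", "FoldEnrichment")]
sig
```

The same subsetting works on the full toppData object, for example after first restricting it to one Category or Cluster.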

Plotting

scToppR can automatically create DotPlots for each ToppGene category. Simply run:

plots <- toppPlot(toppData, category = "GeneOntologyMolecularFunction", clusters = NULL)
#> Multiple clusters entered: function returns a list of ggplots
plots[1]
#> $`0`

This creates a list of plots, one per cluster, for a single category. Here, the category “GeneOntologyMolecularFunction” was requested and the clusters parameter was left at its default, NULL, so all available clusters are used. If multiple clusters are selected, set combine = TRUE to return a patchwork object of plots; leaving combine = FALSE returns a list of ggplot objects. With save = TRUE, the function automatically saves each individual plot using the name format {category}_{cluster}_dotplot.pdf
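
To make the naming pattern concrete, the sketch below builds the file names that the {category}_{cluster}_dotplot.pdf format would give for the nine clusters in this dataset. The variable names are hypothetical, and the guarded toppPlot call assumes only the combine parameter described above.

```r
category <- "GeneOntologyMolecularFunction"
clusters <- 0:8  # the clusters reported by toppFun earlier in this vignette

# File names implied by the {category}_{cluster}_dotplot.pdf pattern.
expected_files <- sprintf("%s_%s_dotplot.pdf", category, clusters)
head(expected_files, 3)

# Only runs interactively with scToppR installed and toppData available.
if (interactive() && requireNamespace("scToppR", quietly = TRUE) &&
    exists("toppData")) {
  # combine = TRUE returns a single patchwork object instead of a list.
  combined <- scToppR::toppPlot(toppData, category = category,
                                clusters = NULL, combine = TRUE)
  combined
}
```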

scToppR can also create balloon plots showing overlapping terms between all clusters.

toppBalloon(toppData, categories = "GeneOntologyMolecularFunction")
#> Balloon Plot: GeneOntologyMolecularFunction
#> Warning in mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]):
#> argument is not numeric or logical: returning NA
#> Warning in mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]):
#> argument is not numeric or logical: returning NA
#> Warning in mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]):
#> argument is not numeric or logical: returning NA

This function also has a save parameter that automatically saves the plots, which is helpful when multiple categories are visualized.

Saving

scToppR can also save the results of the ToppGene query with toppSave. By default it saves a separate file for each cluster; to save one large file, set the parameter split = FALSE. Files are saved as Excel spreadsheets by default, but this can be changed with the format parameter, which must be one of c("xlsx", "csv", "tsv").

toppSave(toppData, filename = "PBMC", split = TRUE, format = "xlsx")
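
For readers who want to see what per-cluster splitting and format validation amount to, here is a base R sketch. It is not toppSave's implementation: `save_by_cluster`, its file-name pattern, and the toy data are hypothetical, and only the csv and tsv formats are covered so that no extra packages are needed.

```r
# Hypothetical helper, not part of scToppR: writes one file per cluster.
save_by_cluster <- function(results, prefix, format = c("csv", "tsv"),
                            dir = tempdir()) {
  format <- match.arg(format)  # stops with an error on a misspelled format
  sep <- if (format == "csv") "," else "\t"
  pieces <- split(results, results$Cluster)  # one data frame per cluster
  paths <- file.path(dir, sprintf("%s_cluster_%s.%s",
                                  prefix, names(pieces), format))
  for (i in seq_along(pieces)) {
    write.table(pieces[[i]], paths[i], sep = sep, row.names = FALSE)
  }
  invisible(paths)
}

# Toy results table; "example term" is a placeholder, not real ToppGene output.
toy <- data.frame(
  Cluster = c("0", "0", "1"),
  Name = c("structural constituent of ribosome", "rRNA binding", "example term"),
  stringsAsFactors = FALSE
)

paths <- save_by_cluster(toy, prefix = "PBMC", format = "csv")
basename(paths)
```

Validating the format with match.arg means a misspelled value fails immediately instead of producing unexpected files.
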
sessionInfo()
#> R Under development (unstable) (2024-10-21 r87258)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] dplyr_1.1.4                 DESeq2_1.47.1              
#>  [3] airway_1.27.0               SummarizedExperiment_1.37.0
#>  [5] Biobase_2.67.0              GenomicRanges_1.59.1       
#>  [7] GenomeInfoDb_1.43.2         IRanges_2.41.2             
#>  [9] S4Vectors_0.45.2            BiocGenerics_0.53.3        
#> [11] generics_0.1.3              MatrixGenerics_1.19.0      
#> [13] matrixStats_1.4.1           scToppR_0.99.1             
#> 
#> loaded via a namespace (and not attached):
#>  [1] gtable_0.3.6            rjson_0.2.23            xfun_0.49              
#>  [4] bslib_0.8.0             ggplot2_3.5.1           lattice_0.22-6         
#>  [7] vctrs_0.6.5             tools_4.5.0             curl_6.0.1             
#> [10] parallel_4.5.0          tibble_3.2.1            fansi_1.0.6            
#> [13] pkgconfig_2.0.3         Matrix_1.7-1            lifecycle_1.0.4        
#> [16] GenomeInfoDbData_1.2.13 compiler_4.5.0          farver_2.1.2           
#> [19] stringr_1.5.1           munsell_0.5.1           codetools_0.2-20       
#> [22] htmltools_0.5.8.1       sass_0.4.9              yaml_2.3.10            
#> [25] pillar_1.9.0            crayon_1.5.3            jquerylib_0.1.4        
#> [28] BiocParallel_1.41.0     cachem_1.1.0            DelayedArray_0.33.3    
#> [31] viridis_0.6.5           abind_1.4-8             locfit_1.5-9.10        
#> [34] tidyselect_1.2.1        zip_2.3.1               digest_0.6.37          
#> [37] stringi_1.8.4           labeling_0.4.3          forcats_1.0.0          
#> [40] fastmap_1.2.0           grid_4.5.0              colorspace_2.1-1       
#> [43] cli_3.6.3               SparseArray_1.7.2       magrittr_2.0.3         
#> [46] patchwork_1.3.0         S4Arrays_1.7.1          utf8_1.2.4             
#> [49] withr_3.0.2             scales_1.3.0            UCSC.utils_1.3.0       
#> [52] rmarkdown_2.29          XVector_0.47.0          httr_1.4.7             
#> [55] gridExtra_2.3           openxlsx_4.2.7.1        evaluate_1.0.1         
#> [58] knitr_1.49              viridisLite_0.4.2       rlang_1.1.4            
#> [61] Rcpp_1.0.13-1           glue_1.8.0              jsonlite_1.8.9         
#> [64] R6_2.5.1                zlibbioc_1.53.0