Prepare a classifier with CHETAH

title: "adverSCarial, generate and analyze the vulnerability of scRNA-seq classifiers to adversarial attacks"
shorttitle: "adverSCarial"
author: Ghislain FIEVET ghislain.fievet@gmail.com
package: adverSCarial
abstract: >
  adverSCarial is an R Package designed for generating and analyzing the vulnerability of scRNA-seq classifiers to adversarial attacks. The package is versatile and provides a format for integrating any type of classifier. It offers functions for studying and generating two types of attacks, single gene attack and max change attack. The single gene attack involves making a small modification to the input to alter the classification. The max change attack involves making a large modification to the input without changing its classification. The package provides a comprehensive solution for evaluating the robustness of scRNA-seq classifiers against adversarial attacks.
output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 2
vignette: >
  %\VignetteIndexEntry{Vign03_adaptClassifier}
  %\VignetteEngine{knitr::knitr}
  %\VignetteEncoding{UTF-8}

Here we demonstrate how to implement a classifier for adverSCarial, using CHETAH, a Bioconductor scRNA-seq classifier, as the example.

de Kanter JK, Lijnzaad P, Candelli T, Margaritis T, Holstege FCP (2019). “CHETAH: a selective, hierarchical cell type identification method for single-cell RNA sequencing.” Nucleic Acids Research. ISSN 0305-1048, doi: 10.1093/nar/gkz543.

Load data

library(adverSCarial)
library(TENxPBMCData)
library(CHETAH)
library(scater)
library(scran)

First, let’s load a training dataset and a test dataset.

train_3k <- TENxPBMCData(dataset = "pbmc3k")
test_4k <- TENxPBMCData(dataset = "pbmc4k")

cell_types_3k <- system.file("extdata", "pbmc3k_cell_types.tsv", package="adverSCarial")
cell_types_3k <- read.table(cell_types_3k, sep="\t")
colData(train_3k)$celltypes <- cell_types_3k$cell_type
colnames(train_3k) <- colData(train_3k)[['Barcode']]
colnames(test_4k) <- colData(test_4k)[['Barcode']]

Then we annotate the cells of test_4k with CHETAH, using train_3k as the reference, and process the data so that the cell types can be visualized.

input <- CHETAHclassifier(input = test_4k, ref_cells = train_3k)
input <- Classify(input = input, 0.00001)
colData(test_4k)$celltypes <- input$celltype_CHETAH

test_4k <- logNormCounts(test_4k)
dec <- modelGeneVar(test_4k)
hvg <- getTopHVGs(dec, prop=0.1)
test_4k <- runPCA(test_4k, ncomponents=25, subset_row=hvg)
test_4k <- runUMAP(test_4k, dimred = 'PCA')

Visualize the results

plotUMAP(test_4k, colour_by="celltypes")

[Figure: UMAP of test_4k, coloured by the CHETAH cell types]

Adapt the classifier

CHETAH is a classifier that, when given a SingleCellExperiment object, returns a cell type for each cell. We need to wrap the classifier so that it can be used by adverSCarial.

Each classifier function has to be formatted as follows to be used with adverSCarial:

    classifier = function(expr, clusters, target){

                # `score` should be numeric, between 0 and 1,
                # 1 being the highest confidence in the cell type classification.
                c("cell type", score)
    }

The expr argument contains the RNA expression values and should be a DelayedMatrix or a SingleCellExperiment. The clusters argument holds the cluster ID of each cell in expr, and target is the ID of the cluster that we want to classify. The function returns a vector containing the classification result and a confidence score.
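To make this contract concrete, here is a minimal sketch of a toy classifier that follows it, written in base R only. The names `toyClassifier`, `geneA` and `geneB`, the labels "T cell" and "B cell", and the decision rule are hypothetical and chosen purely for illustration. The sketch also assumes that `expr` is a plain matrix with cells in rows and genes in columns, rather than a DelayedMatrix or a SingleCellExperiment.

```r
# Hypothetical toy classifier following the (expr, clusters, target) contract.
# Assumption: `expr` is a plain matrix with cells in rows and genes in columns.
toyClassifier <- function(expr, clusters, target) {
    # Keep only the cells that belong to the cluster under study
    cells <- expr[clusters == target, , drop = FALSE]
    # Made-up per-cell rule: "T cell" when geneA is higher than geneB
    calls <- ifelse(cells[, "geneA"] > cells[, "geneB"], "T cell", "B cell")
    # Count each label, most frequent first
    counts <- sort(table(calls), decreasing = TRUE)
    # Return the majority label and the fraction of cells supporting it
    c(names(counts)[1], as.numeric(counts[1]) / length(calls))
}

# Four cells, two genes; the first three cells form cluster "c1"
expr <- matrix(
    c(5, 1,
      4, 2,
      0, 3,
      6, 1),
    ncol = 2, byrow = TRUE,
    dimnames = list(paste0("cell", 1:4), c("geneA", "geneB"))
)
clusters <- c("c1", "c1", "c1", "c2")

res <- toyClassifier(expr, clusters, "c1")
print(res)
```

Because `c()` combines a character label with a number, the returned vector is a character vector, so the score has to be converted back with `as.numeric()` before it is compared with a threshold.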

This is how you can adapt CHETAH for adverSCarial.

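CHETAHClassifier <- function(expr, clusters, target){
    # Annotate every cell of `expr` against the pbmc3k reference
    reference_3k <- train_3k
    input <- CHETAHclassifier(input = expr, ref_cells = reference_3k)
    input <- Classify(input = input, 0.01)
    # Keep only the predictions for the cells of the target cluster
    final_predictions <- input$celltype_CHETAH[clusters == target]
    # Count each predicted cell type, most frequent first
    prediction_counts <- sort(table(final_predictions), decreasing = TRUE)
    # Fraction of the target cells assigned to the most frequent cell type
    ratio <- as.numeric(prediction_counts[1]) / sum(prediction_counts)
    predicted_class <- names(prediction_counts)[1]
    # Report "NA" when fewer than 30% of the cells agree on a cell type
    if (ratio < 0.3){
        predicted_class <- "NA"
    }
    c(predicted_class, ratio)
}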
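The majority-vote step of this wrapper can be checked on its own, without CHETAH. The sketch below isolates it in a helper operating on a plain vector of per-cell labels. The helper name `majorityVote` and its `minRatio` argument are hypothetical, and the label vectors are made up for illustration.

```r
# Hypothetical helper isolating the majority-vote logic of the wrapper.
# `predictions` is a character vector of per-cell labels.
majorityVote <- function(predictions, minRatio = 0.3) {
    # Count each label, most frequent first
    counts <- sort(table(predictions), decreasing = TRUE)
    # Fraction of cells carrying the most frequent label
    ratio <- as.numeric(counts[1]) / sum(counts)
    # Fall back to "NA" when the agreement is below the threshold
    label <- if (ratio < minRatio) "NA" else names(counts)[1]
    c(label, ratio)
}

# Three of four cells agree: the majority label is kept
confident <- majorityVote(c("CD14+ Mono", "CD14+ Mono", "CD14+ Mono", "B cell"))
print(confident)

# Four different labels: the top ratio is 0.25, below 0.3, so "NA" is returned
uncertain <- majorityVote(c("a", "b", "c", "d"))
print(uncertain)
```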

This classifier takes a SingleCellExperiment object as input, so you need to pass the argForClassif="SingleCellExperiment" argument to the adverSCarial functions. If the classifier takes a DelayedMatrix as input, you can keep the default argForClassif="DelayedMatrix" argument.

You can now test the CHETAH classifier with the adverSCarial tools.

Let’s run a max change attack. If you have enough available memory, we recommend using the argForModif="data.frame" option, which is faster.

adv_max_change <- advMaxChange(test_4k, colData(test_4k)$celltypes, "CD14+ Mono",
    CHETAHClassifier, advMethod="perc99", maxSplitSize = 2000,
    argForClassif="SingleCellExperiment", argForModif="data.frame")

Let’s apply this attack and verify whether it is successful.

First we modify the test_4k SingleCellExperiment object on the target cluster, on the genes previously determined. Then we verify that the classification is still CD14+ Mono.

test_4k_adver <- advModifications(test_4k, adv_max_change@values, colData(test_4k)$celltypes, "CD14+ Mono",
	argForClassif="SingleCellExperiment", argForModif="data.frame")

rf_result <- CHETAHClassifier(test_4k_adver, colData(test_4k)$celltypes, "CD14+ Mono")

rf_result
## [1] "CD14+ Mono" "1"
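The check can also be done programmatically. In the minimal sketch below, the result vector is copied literally from the output above so that the snippet runs on its own. The variable names `predicted`, `score` and `attack_holds` are illustrative, and reusing the 0.3 threshold of the wrapper as the success criterion is an assumption.

```r
# Result as printed above, copied literally so the snippet is self-contained
rf_result <- c("CD14+ Mono", "1")

# The classifier returns a character vector: label first, score second
predicted <- rf_result[1]
score <- as.numeric(rf_result[2])

# The max change attack holds if the label is unchanged after modification
# (assumption: the score must also stay at or above the 0.3 threshold)
attack_holds <- identical(predicted, "CD14+ Mono") && score >= 0.3
print(attack_holds)
```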
sessionInfo()
## R version 4.3.0 (2023-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Europe/Paris
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] scran_1.28.1                scater_1.28.0              
##  [3] scuttle_1.10.1              CHETAH_1.16.0              
##  [5] ggplot2_3.4.2               TENxPBMCData_1.18.0        
##  [7] HDF5Array_1.28.1            rhdf5_2.44.0               
##  [9] DelayedArray_0.26.3         S4Arrays_1.0.4             
## [11] Matrix_1.5-4.1              SingleCellExperiment_1.22.0
## [13] SummarizedExperiment_1.30.1 Biobase_2.60.0             
## [15] GenomicRanges_1.52.0        GenomeInfoDb_1.36.0        
## [17] IRanges_2.34.0              S4Vectors_0.38.1           
## [19] BiocGenerics_0.46.0         MatrixGenerics_1.12.0      
## [21] matrixStats_0.63.0          adverSCarial_0.99.38       
## [23] knitr_1.42                 
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3            jsonlite_1.8.4               
##   [3] magrittr_2.0.3                ggbeeswarm_0.7.2             
##   [5] farver_2.1.1                  corrplot_0.92                
##   [7] zlibbioc_1.46.0               vctrs_0.6.2                  
##   [9] memoise_2.0.1                 DelayedMatrixStats_1.22.0    
##  [11] RCurl_1.98-1.12               base64enc_0.1-3              
##  [13] htmltools_0.5.5               AnnotationHub_3.8.0          
##  [15] curl_5.0.0                    BiocNeighbors_1.18.0         
##  [17] Rhdf5lib_1.22.0               htmlwidgets_1.6.2            
##  [19] plyr_1.8.8                    plotly_4.10.1                
##  [21] cachem_1.0.8                  uuid_1.1-0                   
##  [23] igraph_1.4.2                  mime_0.12                    
##  [25] lifecycle_1.0.3               pkgconfig_2.0.3              
##  [27] rsvd_1.0.5                    R6_2.5.1                     
##  [29] fastmap_1.1.1                 GenomeInfoDbData_1.2.10      
##  [31] shiny_1.7.4                   digest_0.6.31                
##  [33] colorspace_2.1-0              AnnotationDbi_1.62.1         
##  [35] dqrng_0.3.0                   irlba_2.3.5.1                
##  [37] ExperimentHub_2.8.0           RSQLite_2.3.1                
##  [39] beachmat_2.16.0               labeling_0.4.2               
##  [41] filelock_1.0.2                fansi_1.0.4                  
##  [43] httr_1.4.6                    compiler_4.3.0               
##  [45] bit64_4.0.5                   withr_2.5.0                  
##  [47] BiocParallel_1.34.1           viridis_0.6.3                
##  [49] DBI_1.1.3                     highr_0.10                   
##  [51] dendextend_1.17.1             rappdirs_0.3.3               
##  [53] bluster_1.10.0                tools_4.3.0                  
##  [55] vipor_0.4.5                   beeswarm_0.4.0               
##  [57] interactiveDisplayBase_1.38.0 httpuv_1.6.11                
##  [59] glue_1.6.2                    rhdf5filters_1.12.1          
##  [61] promises_1.2.0.1              grid_4.3.0                   
##  [63] pbdZMQ_0.3-9                  cluster_2.1.4                
##  [65] reshape2_1.4.4                generics_0.1.3               
##  [67] gtable_0.3.3                  tidyr_1.3.0                  
##  [69] data.table_1.14.8             metapod_1.8.0                
##  [71] BiocSingular_1.16.0           ScaledMatrix_1.8.1           
##  [73] utf8_1.2.3                    XVector_0.40.0               
##  [75] RcppAnnoy_0.0.20              ggrepel_0.9.3                
##  [77] BiocVersion_3.17.1            pillar_1.9.0                 
##  [79] stringr_1.5.0                 limma_3.56.1                 
##  [81] IRdisplay_1.1                 later_1.3.1                  
##  [83] dplyr_1.1.2                   BiocFileCache_2.8.0          
##  [85] lattice_0.21-8                bit_4.0.5                    
##  [87] tidyselect_1.2.0              locfit_1.5-9.7               
##  [89] Biostrings_2.68.1             gridExtra_2.3                
##  [91] edgeR_3.42.2                  xfun_0.39                    
##  [93] statmod_1.5.0                 pheatmap_1.0.12              
##  [95] stringi_1.7.12                lazyeval_0.2.2               
##  [97] yaml_2.3.7                    evaluate_0.21                
##  [99] codetools_0.2-19              tibble_3.2.1                 
## [101] BiocManager_1.30.20           cli_3.6.1                    
## [103] uwot_0.1.14                   IRkernel_1.3.2               
## [105] xtable_1.8-4                  repr_1.1.6                   
## [107] munsell_0.5.0                 Rcpp_1.0.10                  
## [109] bioDist_1.72.0                dbplyr_2.3.2                 
## [111] png_0.1-8                     parallel_4.3.0               
## [113] ellipsis_0.3.2                blob_1.2.4                   
## [115] sparseMatrixStats_1.12.0      bitops_1.0-7                 
## [117] viridisLite_0.4.2             scales_1.2.1                 
## [119] purrr_1.0.1                   crayon_1.5.2                 
## [121] rlang_1.1.1                   cowplot_1.1.1                
## [123] KEGGREST_1.40.0