title: "adverSCarial, generate and analyze the vulnerability of scRNA-seq classifiers to adversarial attacks" shorttitle: "adverSCarial" author: Ghislain FIEVET package: adverSCarial abstract: > adverSCarial is an R Package designed for generating and analyzing the vulnerability of scRNA-seq classifiers to adversarial attacks. The package is versatile and provides a format for integrating any type of classifier. It offers functions for studying and generating two types of attacks, single gene attack and max change attack. The single gene attack involves making a small modification to the input to alter the classification. The max change attack involves making a large modification to the input without changing its classification. The package provides a comprehensive solution for evaluating the robustness of scRNA-seq classifiers against adversarial attacks. output: BiocStyle::html_document: toc: true toc_depth: 2 vignette: > %\VignetteIndexEntry{Vign03_adaptClassifier} %\VignetteEngine{knitr::knitr} %\VignetteEncoding{UTF-8} # Prepare a classifier with `CHETAH` Here we demonstrate how to implement a classifier, and take the example of `CHETAH` a Bioconductor scRNA-seq classifier. de Kanter JK, Lijnzaad P, Candelli T, Margaritis T, Holstege FCP (2019). “CHETAH: a selective, hierarchical cell type identification method for single-cell RNA sequencing.” Nucleic Acids Research. ISSN 0305-1048, doi: 10.1093/nar/gkz543. # Load data ```r library(adverSCarial) library(TENxPBMCData) library(CHETAH) library(scater) library(scran) ``` First let's load a `train` and a `test` dataset. ```r train_3k <- TENxPBMCData(dataset = "pbmc3k") test_4k <- TENxPBMCData(dataset = "pbmc4k") cell_types_3k <- system.file("extdata", "pbmc3k_cell_types.tsv", package="adverSCarial") cell_types_3k <- read.table(cell_types_3k, sep="\t") colData(train_3k)$celltypes <- cell_types_3k$cell_type colnames(train_3k) <- colData(train_3k)[['Barcode']] colnames(test_4k) <- colData(test_4k)[['Barcode']] ``` Then we process the `test_4k` to annotate and visualize the cell types. We annotate cells with `CHETAH`, and process data. ```r input <- CHETAHclassifier(input = test_4k, ref_cells = train_3k) input <- Classify(input = input, 0.00001) colData(test_4k)$celltypes <- input$celltype_CHETAH test_4k <- logNormCounts(test_4k) dec <- modelGeneVar(test_4k) hvg <- getTopHVGs(dec, prop=0.1) test_4k <- runPCA(test_4k, ncomponents=25, subset_row=hvg) test_4k <- runUMAP(test_4k, dimred = 'PCA') ``` Visualize the results ```r plotUMAP(test_4k, colour_by="celltypes") ``` ![plot of chunk chetah dimplot](figure/chetahdimplot1.png) ## Adapt the classifier `CHETAH` is a classifier that, when given a SingleCellExperiment object, returns a specific cell type from each cell. We need to adjust the classifier so that it can be used by *adverSCarial*. Each classifier function has to be formated as follow to be used with *adverSCarial*: ```R classifier = function(expr, clusters, target){ # `score` should be numeric between 0 and 1 # 1 being the highest confidance into the cell type classification. c("cell type", score) } ``` The `expr` argument contrains the RNA expression values, should be a *DelayedMatrix* or a *SingleCellExperiment*. The list `clusters` consists of the cluster IDs for each cell in `expr`, and `target` is the ID of the cluster for which we want to have a classification. The function returns a vector with the classification result, and a trust indice. This is how you can adapt `CHETAH` for `adverSCarial`. ```r CHETAHClassifier <- function(expr, clusters, target){ reference_3k <- train_3k input <- CHETAHclassifier(input = expr, ref_cells = reference_3k) input <- Classify(input = input, 0.01) final_predictions = input$celltype_CHETAH[clusters == target] ratio <- as.numeric(sort(table(final_predictions), decreasing = TRUE)[1]) / sum(as.numeric(sort(table(final_predictions), decreasing = TRUE))) predicted_class <- names(sort(table(final_predictions), decreasing = TRUE)[1]) if ( ratio < 0.3){ predicted_class <- "NA" } c(predicted_class, ratio) } ``` This classifier takes as input a SingleCellExperiment object, you need to specify the `argForClassif="SingleCellExperiment"` argument in *adverSCarial* function. If the classifier takes as input a *DelayedMatrix* you can let the default `argForClassif="DelayedMatrix"` argument. You can now test `CHETAH` classifier with `adverSCarial` tools. Let's run a `maxChangeAttack`. If you have enough available memory we recommand to use the `argForModif="data.frame"` option, which is faster. ```r adv_max_change <- advMaxChange(test_4k, colData(test_4k)$celltypes, "CD14+ Mono", CHETAHClassifier, advMethod="perc99", maxSplitSize = 2000, argForClassif="SingleCellExperiment", argForModif="data.frame") ``` Let's run this attack and verify if it is successful. First we modify the `test_4k` SingleCellExperiment object on the target cluster, on the genes previously determined. Then we verify that classification is still `CD14+ Mono`. ```r test_4k_adver <- advModifications(test_4k, adv_max_change@values, colData(test_4k)$celltypes, "CD14+ Mono", argForClassif="SingleCellExperiment", argForModif="data.frame") rf_result <- CHETAHClassifier(test_4k_adver, colData(test_4k)$celltypes, "CD14+ Mono") rf_result ``` ``` ## [1] "CD14+ Mono" "1" ``` ```r sessionInfo() ``` ``` ## R version 4.3.0 (2023-04-21) ## Platform: x86_64-pc-linux-gnu (64-bit) ## Running under: Ubuntu 22.04.1 LTS ## ## Matrix products: default ## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 ## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0 ## ## locale: ## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C ## [3] LC_TIME=fr_FR.UTF-8 LC_COLLATE=en_US.UTF-8 ## [5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=en_US.UTF-8 ## [7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C ## [9] LC_ADDRESS=C LC_TELEPHONE=C ## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C ## ## time zone: Europe/Paris ## tzcode source: system (glibc) ## ## attached base packages: ## [1] stats4 stats graphics grDevices utils datasets methods ## [8] base ## ## other attached packages: ## [1] scran_1.28.1 scater_1.28.0 ## [3] scuttle_1.10.1 CHETAH_1.16.0 ## [5] ggplot2_3.4.2 TENxPBMCData_1.18.0 ## [7] HDF5Array_1.28.1 rhdf5_2.44.0 ## [9] DelayedArray_0.26.3 S4Arrays_1.0.4 ## [11] Matrix_1.5-4.1 SingleCellExperiment_1.22.0 ## [13] SummarizedExperiment_1.30.1 Biobase_2.60.0 ## [15] GenomicRanges_1.52.0 GenomeInfoDb_1.36.0 ## [17] IRanges_2.34.0 S4Vectors_0.38.1 ## [19] BiocGenerics_0.46.0 MatrixGenerics_1.12.0 ## [21] matrixStats_0.63.0 adverSCarial_0.99.38 ## [23] knitr_1.42 ## ## loaded via a namespace (and not attached): ## [1] RColorBrewer_1.1-3 jsonlite_1.8.4 ## [3] magrittr_2.0.3 ggbeeswarm_0.7.2 ## [5] farver_2.1.1 corrplot_0.92 ## [7] zlibbioc_1.46.0 vctrs_0.6.2 ## [9] memoise_2.0.1 DelayedMatrixStats_1.22.0 ## [11] RCurl_1.98-1.12 base64enc_0.1-3 ## [13] htmltools_0.5.5 AnnotationHub_3.8.0 ## [15] curl_5.0.0 BiocNeighbors_1.18.0 ## [17] Rhdf5lib_1.22.0 htmlwidgets_1.6.2 ## [19] plyr_1.8.8 plotly_4.10.1 ## [21] cachem_1.0.8 uuid_1.1-0 ## [23] igraph_1.4.2 mime_0.12 ## [25] lifecycle_1.0.3 pkgconfig_2.0.3 ## [27] rsvd_1.0.5 R6_2.5.1 ## [29] fastmap_1.1.1 GenomeInfoDbData_1.2.10 ## [31] shiny_1.7.4 digest_0.6.31 ## [33] colorspace_2.1-0 AnnotationDbi_1.62.1 ## [35] dqrng_0.3.0 irlba_2.3.5.1 ## [37] ExperimentHub_2.8.0 RSQLite_2.3.1 ## [39] beachmat_2.16.0 labeling_0.4.2 ## [41] filelock_1.0.2 fansi_1.0.4 ## [43] httr_1.4.6 compiler_4.3.0 ## [45] bit64_4.0.5 withr_2.5.0 ## [47] BiocParallel_1.34.1 viridis_0.6.3 ## [49] DBI_1.1.3 highr_0.10 ## [51] dendextend_1.17.1 rappdirs_0.3.3 ## [53] bluster_1.10.0 tools_4.3.0 ## [55] vipor_0.4.5 beeswarm_0.4.0 ## [57] interactiveDisplayBase_1.38.0 httpuv_1.6.11 ## [59] glue_1.6.2 rhdf5filters_1.12.1 ## [61] promises_1.2.0.1 grid_4.3.0 ## [63] pbdZMQ_0.3-9 cluster_2.1.4 ## [65] reshape2_1.4.4 generics_0.1.3 ## [67] gtable_0.3.3 tidyr_1.3.0 ## [69] data.table_1.14.8 metapod_1.8.0 ## [71] BiocSingular_1.16.0 ScaledMatrix_1.8.1 ## [73] utf8_1.2.3 XVector_0.40.0 ## [75] RcppAnnoy_0.0.20 ggrepel_0.9.3 ## [77] BiocVersion_3.17.1 pillar_1.9.0 ## [79] stringr_1.5.0 limma_3.56.1 ## [81] IRdisplay_1.1 later_1.3.1 ## [83] dplyr_1.1.2 BiocFileCache_2.8.0 ## [85] lattice_0.21-8 bit_4.0.5 ## [87] tidyselect_1.2.0 locfit_1.5-9.7 ## [89] Biostrings_2.68.1 gridExtra_2.3 ## [91] edgeR_3.42.2 xfun_0.39 ## [93] statmod_1.5.0 pheatmap_1.0.12 ## [95] stringi_1.7.12 lazyeval_0.2.2 ## [97] yaml_2.3.7 evaluate_0.21 ## [99] codetools_0.2-19 tibble_3.2.1 ## [101] BiocManager_1.30.20 cli_3.6.1 ## [103] uwot_0.1.14 IRkernel_1.3.2 ## [105] xtable_1.8-4 repr_1.1.6 ## [107] munsell_0.5.0 Rcpp_1.0.10 ## [109] bioDist_1.72.0 dbplyr_2.3.2 ## [111] png_0.1-8 parallel_4.3.0 ## [113] ellipsis_0.3.2 blob_1.2.4 ## [115] sparseMatrixStats_1.12.0 bitops_1.0-7 ## [117] viridisLite_0.4.2 scales_1.2.1 ## [119] purrr_1.0.1 crayon_1.5.2 ## [121] rlang_1.1.1 cowplot_1.1.1 ## [123] KEGGREST_1.40.0 ```