title: "adverSCarial, generate and analyze the vulnerability of scRNA-seq classifiers to adversarial attacks" shorttitle: "adverSCarial" author: Ghislain FIEVET package: adverSCarial abstract: > adverSCarial is an R Package designed for generating and analyzing the vulnerability of scRNA-seq classifiers to adversarial attacks. The package is versatile and provides a format for integrating any type of classifier. It offers functions for studying and generating two types of attacks, single gene attack and max change attack. The single gene attack involves making a small modification to the input to alter the classification. The max change attack involves making a large modification to the input without changing its classification. The package provides a comprehensive solution for evaluating the robustness of scRNA-seq classifiers against adversarial attacks. vignette: > %\VignetteIndexEntry{Vign04_advRandWalkMinChange} %\VignetteEngine{knitr::knitr} %\VignetteEncoding{UTF-8} # advRandWalkMinChange The `CHETAH` classifier is not vulnerable to one gene attack on `DC`cluster. We can use the `advRandWalkMinChange` function to generate a more complex attack, still looking for a minimum change in the input. We will find a list of genes suscpetibles to move the classification by using the `findMarkers` function from `scran`. Then we will use the `advRandWalkMinChange` function to shorten the list as much as possible. # Load data ```r library(adverSCarial) library(TENxPBMCData) library(CHETAH) library(scater) library(scran) library(stringr) ``` We get two pbmc datasets, *train_3k* to train the CHETAH classifier, and *test_4k* to run the adverSCarial functions. ```r train_3k <- TENxPBMCData(dataset = "pbmc3k") test_4k <- TENxPBMCData(dataset = "pbmc4k") # Convert from ensemble id to hgnc symbol test_4k <- sceConvertToHGNC(test_4k) train_3k <- sceConvertToHGNC(train_3k) cell_types_3k <- system.file("extdata", "pbmc3k_cell_types.tsv", package="adverSCarial") cell_types_3k <- read.table(cell_types_3k, sep="\t") colData(train_3k)$celltypes <- cell_types_3k$cell_type colnames(train_3k) <- colData(train_3k)[['Barcode']] colnames(test_4k) <- colData(test_4k)[['Barcode']] ``` Annotation of the *test_4k* dataset with CHETAH, and processing of the SingleCellExperiment object. ```r input <- CHETAHclassifier(input = test_4k, ref_cells = train_3k) input <- Classify(input = input, 0.00001) colData(test_4k)$celltypes <- input$celltype_CHETAH test_4k <- logNormCounts(test_4k) dec <- modelGeneVar(test_4k) hvg <- getTopHVGs(dec, prop=0.1) test_4k <- runPCA(test_4k, ncomponents=25, subset_row=hvg) test_4k <- runUMAP(test_4k, dimred = 'PCA') ``` ```r CHETAHClassifier <- function(expr, clusters, target){ reference_3k <- train_3k input <- CHETAHclassifier(input = expr, ref_cells = reference_3k) input <- Classify(input = input, 0.01) final_predictions = input$celltype_CHETAH[clusters == target] ratio <- as.numeric(sort(table(final_predictions), decreasing = TRUE)[1]) / sum(as.numeric(sort(table(final_predictions), decreasing = TRUE))) predicted_class <- names(sort(table(final_predictions), decreasing = TRUE)[1]) if ( ratio < 0.3){ predicted_class <- "NA" } c(predicted_class, ratio) } ``` # Advanced attacks with `advRandWalkMinChange` We use the `advRandWalkMinChange` function to generate a complex attack, looking for a minimum change of several genes in the input. First step is to look for a list of genes suscpetibles to move the classification. We get this list by using the `findMarkers` function from `scran`. ```r markers <- c("IL7R", "CCR7", "CD14", "LYZ", "S100A4", "MS4A1", "CD8A", "FCGR3A", "MS4A7", "GNLY", "NKG7", "FCER1A", "CST3", "PPBP") test_4k_subset <- test_4k[, test_4k$celltypes %in% c("CD14+ Mono", "DC")] fm_t4k <- findMarkers(test_4k_subset, test_4k_subset$celltypes) genes_4walk <- rownames(fm_t4k[['DC']][abs(fm_t4k[['DC']]$summary.logFC)>1,]) # Remove the officiel markers from the candidates genes_4walk <- genes_4walk[!genes_4walk %in% markers] genes_4walk ``` ``` ## [1] "HLA-DRA" "S100A12" "HLA-DQB1" "HLA-DPA1" "HLA-DPB1" "HLA-DRB1" ## [7] "S100A8" "HLA-DQA1" "S100A9" "CD74" "VCAN" "HLA-DMA" ## [13] "CTSS" "FTL" "FCN1" "TYROBP" "NEAT1" ``` Then we define a list of modifications to test: ```r modifications <- list() modifications[[1]] <- list("perc99") modifications[[2]] <- list("perc1") ``` Then we process to a random walk parameter search on these genes and these modifications: ```r rand_walk_min_change <- advRandWalkMinChange(test_4k, colData(test_4k)$celltypes, "DC", CHETAHClassifier, genes=genes_4walk, modifications=modifications, walkLength=15, argForClassif = 'SingleCellExperiment') head(rand_walk_min_change) ``` ``` ## DataFrame with 6 rows and 22 columns ## prediction odd genesModified typeModified iteration HLA.DRA ## ## 1 UNDETERMINED 0.484375 2 TRUE 15 NA ## 2 CD14+ Mono 0.578125 3 TRUE 14 NA ## 3 UNDETERMINED 0.5 4 TRUE 13 NA ## 4 UNDETERMINED 0.5 5 TRUE 11 perc99 ## 5 CD14+ Mono 0.5 7 TRUE 10 perc99 ## 6 UNDETERMINED 0.59375 8 TRUE 9 perc99 ## S100A12 HLA.DQB1 HLA.DPA1 HLA.DPB1 HLA.DRB1 S100A8 ## ## 1 NA perc1 perc1 NA NA NA ## 2 NA perc1 perc1 NA NA perc99 ## 3 NA perc1 perc1 NA NA perc1 ## 4 NA perc1 perc1 NA NA perc1 ## 5 NA perc1 perc1 NA NA perc1 ## 6 NA perc1 perc1 NA perc1 perc1 ## HLA.DQA1 S100A9 CD74 VCAN HLA.DMA CTSS ## ## 1 NA NA NA NA NA NA ## 2 NA NA NA NA NA NA ## 3 NA NA NA NA perc1 NA ## 4 NA NA NA NA perc1 NA ## 5 NA NA perc1 perc1 perc1 NA ## 6 NA NA perc1 perc1 perc1 NA ## FTL FCN1 TYROBP NEAT1 ## ## 1 NA NA NA NA ## 2 NA NA NA NA ## 3 NA NA NA NA ## 4 NA NA NA NA ## 5 NA NA NA NA ## 6 NA NA NA NA ``` The first line of `rand_walk_min_change` contains the parameter for the attack: ```r best_results <- rand_walk_min_change[1,] best_results <- as.data.frame(best_results[6:ncol(best_results)]) best_results <- best_results[,best_results!="NA"] best_results ``` ``` ## HLA.DQB1 HLA.DPA1 ## 1 perc1 perc1 ``` Then we modify the rna expression matrix to fool the classifier: ```r min_change_attack_rna_matrix <- test_4k for ( i in seq_len(length(colnames(best_results)))){ gene2modif <- colnames(best_results)[i] gene2modif <- str_replace(gene2modif, "\\.", "-") modif <- best_results[1,i] min_change_attack_rna_matrix <- advModifications(min_change_attack_rna_matrix, gene2modif, colData(test_4k)$celltypes, "DC", advMethod=modif, argForClassif="SingleCellExperiment") } ``` And we check it successfully changed the classification. ```r res_classif <- CHETAHClassifier(min_change_attack_rna_matrix, colData(test_4k)$celltypes, "DC") ``` ```r res_classif ``` ``` ## [1] "UNDETERMINED" "0.484375" ``` ```r sessionInfo() ``` ``` ## R version 4.3.0 (2023-04-21) ## Platform: x86_64-pc-linux-gnu (64-bit) ## Running under: Ubuntu 22.04.1 LTS ## ## Matrix products: default ## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 ## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0 ## ## locale: ## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C ## [3] LC_TIME=fr_FR.UTF-8 LC_COLLATE=en_US.UTF-8 ## [5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=en_US.UTF-8 ## [7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C ## [9] LC_ADDRESS=C LC_TELEPHONE=C ## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C ## ## time zone: Europe/Paris ## tzcode source: system (glibc) ## ## attached base packages: ## [1] stats4 stats graphics grDevices utils datasets methods ## [8] base ## ## other attached packages: ## [1] stringr_1.5.0 scran_1.28.1 ## [3] scater_1.28.0 scuttle_1.10.1 ## [5] CHETAH_1.16.0 ggplot2_3.4.2 ## [7] TENxPBMCData_1.18.0 HDF5Array_1.28.1 ## [9] rhdf5_2.44.0 DelayedArray_0.26.3 ## [11] S4Arrays_1.0.4 Matrix_1.5-4.1 ## [13] SingleCellExperiment_1.22.0 SummarizedExperiment_1.30.1 ## [15] Biobase_2.60.0 GenomicRanges_1.52.0 ## [17] GenomeInfoDb_1.36.0 IRanges_2.34.0 ## [19] S4Vectors_0.38.1 BiocGenerics_0.46.0 ## [21] MatrixGenerics_1.12.0 matrixStats_0.63.0 ## [23] adverSCarial_0.99.38 knitr_1.42 ## ## loaded via a namespace (and not attached): ## [1] RColorBrewer_1.1-3 jsonlite_1.8.4 ## [3] magrittr_2.0.3 ggbeeswarm_0.7.2 ## [5] corrplot_0.92 zlibbioc_1.46.0 ## [7] vctrs_0.6.2 memoise_2.0.1 ## [9] DelayedMatrixStats_1.22.0 RCurl_1.98-1.12 ## [11] base64enc_0.1-3 htmltools_0.5.5 ## [13] AnnotationHub_3.8.0 curl_5.0.0 ## [15] BiocNeighbors_1.18.0 Rhdf5lib_1.22.0 ## [17] htmlwidgets_1.6.2 plyr_1.8.8 ## [19] plotly_4.10.1 cachem_1.0.8 ## [21] uuid_1.1-0 igraph_1.4.2 ## [23] mime_0.12 lifecycle_1.0.3 ## [25] pkgconfig_2.0.3 rsvd_1.0.5 ## [27] R6_2.5.1 fastmap_1.1.1 ## [29] GenomeInfoDbData_1.2.10 shiny_1.7.4 ## [31] digest_0.6.31 colorspace_2.1-0 ## [33] AnnotationDbi_1.62.1 dqrng_0.3.0 ## [35] irlba_2.3.5.1 ExperimentHub_2.8.0 ## [37] RSQLite_2.3.1 beachmat_2.16.0 ## [39] filelock_1.0.2 fansi_1.0.4 ## [41] httr_1.4.6 compiler_4.3.0 ## [43] bit64_4.0.5 withr_2.5.0 ## [45] BiocParallel_1.34.1 viridis_0.6.3 ## [47] DBI_1.1.3 dendextend_1.17.1 ## [49] rappdirs_0.3.3 bluster_1.10.0 ## [51] tools_4.3.0 vipor_0.4.5 ## [53] beeswarm_0.4.0 interactiveDisplayBase_1.38.0 ## [55] httpuv_1.6.11 glue_1.6.2 ## [57] rhdf5filters_1.12.1 promises_1.2.0.1 ## [59] grid_4.3.0 pbdZMQ_0.3-9 ## [61] cluster_2.1.4 reshape2_1.4.4 ## [63] generics_0.1.3 gtable_0.3.3 ## [65] tidyr_1.3.0 data.table_1.14.8 ## [67] metapod_1.8.0 BiocSingular_1.16.0 ## [69] ScaledMatrix_1.8.1 utf8_1.2.3 ## [71] XVector_0.40.0 RcppAnnoy_0.0.20 ## [73] ggrepel_0.9.3 BiocVersion_3.17.1 ## [75] pillar_1.9.0 limma_3.56.1 ## [77] IRdisplay_1.1 later_1.3.1 ## [79] dplyr_1.1.2 BiocFileCache_2.8.0 ## [81] lattice_0.21-8 bit_4.0.5 ## [83] tidyselect_1.2.0 locfit_1.5-9.7 ## [85] Biostrings_2.68.1 gridExtra_2.3 ## [87] edgeR_3.42.2 xfun_0.39 ## [89] statmod_1.5.0 pheatmap_1.0.12 ## [91] stringi_1.7.12 lazyeval_0.2.2 ## [93] yaml_2.3.7 evaluate_0.21 ## [95] codetools_0.2-19 tibble_3.2.1 ## [97] BiocManager_1.30.20 cli_3.6.1 ## [99] uwot_0.1.14 IRkernel_1.3.2 ## [101] xtable_1.8-4 repr_1.1.6 ## [103] munsell_0.5.0 Rcpp_1.0.10 ## [105] bioDist_1.72.0 dbplyr_2.3.2 ## [107] png_0.1-8 parallel_4.3.0 ## [109] ellipsis_0.3.2 blob_1.2.4 ## [111] sparseMatrixStats_1.12.0 bitops_1.0-7 ## [113] viridisLite_0.4.2 scales_1.2.1 ## [115] purrr_1.0.1 crayon_1.5.2 ## [117] rlang_1.1.1 cowplot_1.1.1 ## [119] KEGGREST_1.40.0 ```