Metagenomic classification using k-mer based tools like Kraken2 is incredibly fast, but it is notoriously prone to false-positive assignments. The standard approach of applying a single, global Confidence Score (CS) threshold across an entire sample is fundamentally flawed: one size does not fit all. Biological domains (Bacteria, Archaea, Viruses, and Eukaryota) have vastly different database representations, genome sizes, and k-mer uniqueness.
karioCaS (Kraken Confidence Scores for Reliable Domain-Specific Microbiota Inference) is an R package designed to solve this problem. It provides a robust analytical framework to evaluate taxonomic stability across multiple stringency levels, allowing researchers to find the exact inflection point where statistical noise is removed without losing true biological signal.
karioCaS relies on a strict, automated directory architecture. To analyze your data, you must place your Kraken2/Metaphlan-style outputs inside a base directory, specifically within a folder named 000_mpa_original.
Files must follow the naming convention: SAMPLENAME_CSXX.mpa (e.g., SAMPLE01_CS09.mpa for a Confidence Score of 0.9).
For this vignette, we will use the mock data bundled with the package, copying it to a temporary directory to comply with CRAN/Bioconductor file system policies.
# 1. Create a temporary project directory
proj_dir <- file.path(tempdir(), "karioCaS_Tutorial")
dir.create(file.path(proj_dir, "000_mpa_original"), recursive = TRUE)
# 2. Locate and copy the bundled mock data
mock_data_path <- system.file("extdata/your_project_name/000_mpa_original", package = "karioCaS")
mock_files <- list.files(mock_data_path, full.names = TRUE)
file.copy(mock_files, file.path(proj_dir, "000_mpa_original"))
#> [1] TRUE TRUE TRUE TRUE TRUE TRUE
print(paste("Project Directory initialized at:", proj_dir))
#> [1] "Project Directory initialized at: /tmp/Rtmp3DiH0W/karioCaS_Tutorial"The first step is to import the multiple stringency outputs into a cohesive Bioconductor TreeSummarizedExperiment (TSE) object. This step parses the taxonomic strings, infers the phylogenetic ranks, and standardizes the matrices.
# Import data and generate the TSE object (Step 000)
tse_object <- import_karioCaS(project_dir = proj_dir)
# The object is now ready for Bioconductor-native downstream analysis if desired
tse_object
#> class: TreeSummarizedExperiment
#> dim: 2030 6
#> metadata(0):
#> assays(1): counts
#> rownames(2030): d__Archaea d__Archaea|k__Methanobacteriati ...
#> d__Bacteria|p__Pseudomonadota|c__Betaproteobacteria|o__Burkholderiales|f__Comamonadaceae|g__Hydrogenophaga|s__Hydrogenophaga
#> crassostreae
#> d__Bacteria|p__Pseudomonadota|c__Alphaproteobacteria|o__Sphingomonadales|f__Sphingomonadaceae|g__Novosphingobium|s__Novosphingobium
#> sp. KCTC 2891
#> rowData names(10): Taxonomy_Full Domain ... Species Rank
#> colnames(6): SAMPLE01_CS00 SAMPLE01_CS20 ... SAMPLE01_CS80
#> SAMPLE01_CS90
#> colData names(2): Sample_ID Confidence_Score
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> rowLinks: NULL
#> rowTree: NULL
#> colLinks: NULL
#> colTree: NULLNote: This function automatically creates a 000_karioCaS_input_matrix folder containing the .rds object and a human-readable .tsv audit trail.
karioCaS shines in its ability to generate publication-ready PDFs to visually inspect the behavior of taxa as stringency increases. All plotting functions automatically read from the harmonized data and save their outputs into dynamically generated subfolders.
# Note: In an active R session, these commands will generate high-quality
# PDF panels in your project directory.
# Evaluate how many taxa and reads "survive" as stringency increases
taxa_retention(project_dir = proj_dir)
# Evaluate taxonomic depth and Parent-to-Child resolution
taxa_resolution(project_dir = proj_dir, parent_level = "Genus", child_level = "Species")
# Generate abundance heatmaps showing taxa extinction patterns at a given CS
heatmaps_karioCaS(project_dir = proj_dir, analysis_rank = "Species", confidence_score = 40)While visual exploration is crucial, choosing the optimal Confidence Score manually introduces subjectivity.
karioCaS implements a Multi-Strategy Stability Index (SI) engine. Instead of relying on rigid geometric formulas, it analyzes step-wise loss and regime shifts to mathematically identify the optimal threshold (Primary SI). It offers three distinct strategies:
# Calculate the Stability Index for all domains using the default Dynamic method
audit_df <- optimize_CS(project_dir = proj_dir, tax_level = "Species", method = "dynamic")
# View the mathematical assessment for the Primary SIs
head(audit_df[audit_df$SI_Type == "Primary_SI", c("Sample", "Domain", "CS", "Taxa_Count", "Step_Loss_Pct")])
#> # A tibble: 6 × 5
#> Sample Domain CS Taxa_Count Step_Loss_Pct
#> <chr> <chr> <dbl> <int> <dbl>
#> 1 <NA> <NA> NA NA NA
#> 2 SAMPLE01 Bacteria 20 68 0
#> 3 <NA> <NA> NA NA NA
#> 4 <NA> <NA> NA NA NA
#> 5 <NA> <NA> NA NA NA
#> 6 <NA> <NA> NA NA NAThe optimize_CS() function saves an objective audit trail (.rds and .tsv) and generates PDFs highlighting the mathematical shift on the decay curves.
Once the optimal thresholds are defined mathematically, we extract the surviving taxa. Since Archaea might require a CS of 0.20 while Bacteria requires a CS of 0.40, karioCaS allows you to create a “Mosaic” .mpa file using specific thresholds for each domain.
By setting the arguments to "auto", the package natively reads the optimize_CS audit file and applies the mathematically optimal Primary SI for each domain.
# Generate the final mosaic using automated mathematical thresholds
retrieve_selected_taxa(
project_dir = proj_dir,
tax_level = "Species",
CS_B = "auto", # Will pull the Primary SI for Bacteria
CS_A = "secondary", # Attempts to pull a Secondary SI (or fallbacks to Primary)
CS_E = 40, # Manual override (strictly CS40)
CS_V = 0 # Manual fallback (CS0)
)The resulting SAMPLENAME_karioCaS_Mosaic.mpa file, saved in the 1000_final_selection folder, is a highly refined, high-confidence taxonomic profile, scrubbed of statistical noise and ready for downstream ecological or clinical analysis.
#> R version 4.6.0 RC (2026-04-17 r89917)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] karioCaS_0.99.0
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.2.1 dplyr_1.2.1
#> [3] farver_2.1.2 Biostrings_2.80.0
#> [5] S7_0.2.2 fastmap_1.2.0
#> [7] SingleCellExperiment_1.34.0 lazyeval_0.2.3
#> [9] digest_0.6.39 lifecycle_1.0.5
#> [11] tidytree_0.4.7 magrittr_2.0.5
#> [13] compiler_4.6.0 rlang_1.2.0
#> [15] sass_0.4.10 tools_4.6.0
#> [17] utf8_1.2.6 yaml_2.3.12
#> [19] knitr_1.51 labeling_0.4.3
#> [21] S4Arrays_1.12.0 bit_4.6.0
#> [23] DelayedArray_0.38.1 plyr_1.8.9
#> [25] xml2_1.5.2 RColorBrewer_1.1-3
#> [27] TreeSummarizedExperiment_2.20.0 abind_1.4-8
#> [29] BiocParallel_1.46.0 withr_3.0.2
#> [31] purrr_1.2.2 BiocGenerics_0.58.1
#> [33] grid_4.6.0 stats4_4.6.0
#> [35] ggplot2_4.0.3 scales_1.4.0
#> [37] dichromat_2.0-0.1 SummarizedExperiment_1.42.0
#> [39] cli_3.6.6 UpSetR_1.4.0
#> [41] rmarkdown_2.31 crayon_1.5.3
#> [43] ragg_1.5.2 treeio_1.36.1
#> [45] generics_0.1.4 otel_0.2.0
#> [47] tzdb_0.5.0 commonmark_2.0.0
#> [49] ape_5.8-1 cachem_1.1.0
#> [51] stringr_1.6.0 parallel_4.6.0
#> [53] XVector_0.52.0 matrixStats_1.5.0
#> [55] yulab.utils_0.2.4 vctrs_0.7.3
#> [57] Matrix_1.7-5 jsonlite_2.0.0
#> [59] litedown_0.9 IRanges_2.46.0
#> [61] hms_1.1.4 patchwork_1.3.2
#> [63] S4Vectors_0.50.1 bit64_4.8.0
#> [65] systemfonts_1.3.2 tidyr_1.3.2
#> [67] jquerylib_0.1.4 glue_1.8.1
#> [69] codetools_0.2-20 ggtext_0.1.2
#> [71] stringi_1.8.7 gtable_0.3.6
#> [73] GenomicRanges_1.64.0 tibble_3.3.1
#> [75] pillar_1.11.1 rappdirs_0.3.4
#> [77] htmltools_0.5.9 Seqinfo_1.2.0
#> [79] R6_2.6.1 textshaping_1.0.5
#> [81] vroom_1.7.1 evaluate_1.0.5
#> [83] lattice_0.22-9 Biobase_2.72.0
#> [85] markdown_2.0 readr_2.2.0
#> [87] gridtext_0.1.6 bslib_0.11.0
#> [89] Rcpp_1.1.1-1.1 gridExtra_2.3
#> [91] SparseArray_1.12.2 nlme_3.1-169
#> [93] xfun_0.57 fs_2.1.0
#> [95] MatrixGenerics_1.24.0 forcats_1.0.1
#> [97] pkgconfig_2.0.3