---
title: "RFGeneRank: Cross-validated, stable predictive gene ranking for transcriptomics"
author: "Abdulaziz Albeshri"
date: "`r Sys.Date()`"
output: BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{RFGeneRank}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  message = FALSE,
  warning = FALSE
)
```

# Abstract

RFGeneRank is a Bioconductor package that provides a leakage-aware, machine-learning workflow for gene ranking and classification using bulk RNA-seq–derived expression matrices. The package emphasizes reproducible evaluation with cross-validation and out-of-fold (OOF) preprocessing, and provides interpretable gene importance outputs for downstream biological interpretation.

# Introduction and Motivation

High-throughput transcriptomic studies are often collected across different cohorts, sequencing runs, or laboratories. When datasets are integrated, batch effects and confounding can inflate predictive performance if preprocessing is performed outside cross-validation (information leakage). RFGeneRank addresses this by providing a fold-safe workflow for gene ranking and classification that integrates clean inputs, batch-aware options, and convenient reporting.

# Relation to existing Bioconductor packages

Bioconductor provides well-established methodologies for differential expression analysis and batch effect correction. RFGeneRank is not intended to replace these methodologies. Instead, it is designed to complement existing analytical workflows by adding a machine-learning–based gene ranking layer, with leakage-aware evaluation and diagnostic capabilities, implemented in a Bioconductor-compatible framework using standard data containers.
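To make the leakage concern raised in the Introduction concrete, the following base-R sketch contrasts fold-safe standardization with a leaky, global one. It is an illustration only and does not show how RFGeneRank implements its out-of-fold preprocessing. The toy matrix `x`, the fold assignment `folds`, and the other object names are hypothetical. Samples are in rows here, unlike the genes × samples layout used in the rest of this vignette.

```r
# Hypothetical toy data: 40 samples (rows) x 5 features (columns).
set.seed(1)
x <- matrix(rnorm(40 * 5), nrow = 40, ncol = 5)

# Assign each sample to one of 4 folds.
folds <- rep(1:4, length.out = nrow(x))

# Fold-safe standardization: centering and scaling parameters are estimated
# on the training samples only, then applied unchanged to the held-out fold.
x_oof <- matrix(NA_real_, nrow = nrow(x), ncol = ncol(x))
for (f in sort(unique(folds))) {
  train <- folds != f
  mu <- colMeans(x[train, , drop = FALSE])
  sdev <- apply(x[train, , drop = FALSE], 2, stats::sd)
  held_out <- x[!train, , drop = FALSE]
  x_oof[!train, ] <- sweep(sweep(held_out, 2, mu, "-"), 2, sdev, "/")
}

# Leaky alternative for comparison: parameters are estimated on all samples,
# so each held-out sample has influenced its own preprocessing.
x_leaky <- scale(x)

# Largest absolute difference between the two versions.
max(abs(x_oof - x_leaky))
```

The two matrices differ because the global version lets held-out samples contribute to the parameters used to transform them. With simple scaling the effect is small, but the same pattern applies to steps such as batch correction or feature selection, where it can inflate cross-validated performance estimates.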
# Installation

```{r install, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("RFGeneRank")
```

# A small runnable example

This vignette demonstrates the core protocol used in RFGeneRank:

1. Generate a small transcriptomics-like dataset (genes × samples).
2. Run `prepare_data()`.
3. Run `rank_genes()`.
4. Run a small set of downstream functions and basic plots.

```{r simulate-data}
suppressPackageStartupMessages({
  library(RFGeneRank)
  library(SummarizedExperiment)
  library(S4Vectors)
})

set.seed(42)

n_genes <- 300
n_samples <- 60
genes <- paste0("Gene", seq_len(n_genes))
samples <- paste0("Sample", seq_len(n_samples))

# Sample metadata (rows must match sample names)
meta_df <- data.frame(
  state = factor(rep(c("CTRL", "CASE"), each = n_samples / 2),
                 levels = c("CTRL", "CASE")),
  batch = factor(rep(c("B1", "B2"), length.out = n_samples)),
  sex = factor(rep(c("M", "F"), length.out = n_samples)),
  age = round(stats::runif(n_samples, 25, 65)),
  stringsAsFactors = FALSE,
  check.names = TRUE
)

# Transcriptomics-like expression: strictly positive values (log-normal)
expr <- matrix(
  exp(rnorm(n_genes * n_samples, mean = 2.5, sd = 0.6)),
  nrow = n_genes,
  ncol = n_samples,
  dimnames = list(genes, samples)
)

# Inject signal in CASE for a subset of genes (1.8-fold increase)
signal_genes <- genes[1:25]
case_cols <- meta_df$state == "CASE"
expr[signal_genes, case_cols] <- expr[signal_genes, case_cols] * 1.8

# Critical alignment: metadata rownames must match expression colnames
rownames(meta_df) <- colnames(expr)
stopifnot(identical(colnames(expr), rownames(meta_df)))

# Build SummarizedExperiment
se <- SummarizedExperiment(
  assays = list(expr = expr),
  colData = DataFrame(meta_df)
)
se
```

# Step 1: prepare_data()

```{r prepare-data}
# Detect whether the matrix is count-like (integer); our simulated data are
# continuous, so counts_flag is FALSE here.
is_integerish <- function(x) all(abs(x - round(x)) < 1e-8, na.rm = TRUE)
counts_flag <- is_integerish(expr)

se_prep <- prepare_data(
  mats = list(SummarizedExperiment::assay(se, "expr")),
  metas = list(meta_df),  # use data.frame for robustness in vignettes
  label_col = "state",
  batch_col = "batch",
  log1p = counts_flag,
  batch_method = "combat",
  batch_correction_scope = "global"
)
se_prep
```

# Step 2: rank_genes()

```{r rank-genes}
# Class weights: weight 1 for CTRL, weight 2 for CASE
cw <- c(CTRL = 1, CASE = 2)

fit <- rank_genes(
  se_prep,
  label_col = "state",
  cv = "kfold",
  k = 3,
  n_top = 100,
  trees = 300,
  fold_batch_correction = FALSE,
  batch_col = "batch",
  class_weights = cw,
  auto_confounds = FALSE,
  seed = 42
)
fit
```

# Step 3: downstream utilities

## Top genes

```{r top-genes}
top_genes(fit, n = 10)
```

## Signed importance (directionality)

```{r sign-importance}
tab_signed <- sign_importance(
  fit,
  se_prep,
  y = SummarizedExperiment::colData(se_prep)[["state"]],
  method = "mean"
)
head(tab_signed, 10)
```

## Basic plots

```{r plotting-example}
plot_importance(
  fit,
  top = 20,
  map_to_symbol = FALSE
)

plot_roc(fit)

plot_sign_importance(
  fit,
  tab = tab_signed,
  top = 20
)
```

# Session information

```{r session}
sessionInfo()
```