Contents

1 Introduction

jvecfor is a high-performance KNN library for Bioconductor single-cell workflows. It wraps a Java library, built on top of jvector, which implements two algorithms: an approximate HNSW-DiskANN graph search and an exact VP-tree search.

The R package communicates with the JVM via system2() — no JNI, no rJava dependency. The JAR is bundled in inst/java/ and requires no separate installation step beyond having Java 20+ on your PATH.
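To illustrate the no-JNI design, the sketch below shows the general shape of such a `system2()` invocation. The flag vector and JAR path here are hypothetical — the actual command line is internal to jvecfor and may differ:

```r
# Hypothetical sketch of the system2() call pattern — the real CLI flags
# and entry point are internal to jvecfor and may differ.
jar  <- system.file("java", "jvecfor.jar", package = "jvecfor")
args <- c("--add-modules=jdk.incubator.vector", "-jar", jar)
# out <- system2("java", args, stdout = TRUE)   # not run here
args
```

Because the JVM is launched as a plain subprocess, jvecfor works in any R session where `java` resolves on the PATH — no rJava configuration is needed.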

1.1 Function overview

Function Description Returns
fastFindKNN() Core KNN search list($index, $distance)
fastMakeSNNGraph() KNN → shared nearest-neighbor graph igraph object
fastMakeKNNGraph() KNN → k-nearest-neighbor graph igraph object
jvecfor_setup() Copy a custom JAR into the package path (invisibly)

The output of fastFindKNN() is a named list with $index (n × k integer matrix, 1-based) and $distance (n × k numeric matrix), identical in structure to BiocNeighbors::findKNN().

2 Prerequisites and Setup

2.1 Java version

jvecfor requires Java ≥ 20 on PATH. The JAR uses the jdk.incubator.vector module for SIMD acceleration. Check your installation:

system2("java", "--version")  # should print openjdk 20 or higher

If java is not found, install Eclipse Temurin or any OpenJDK 20+ distribution and ensure the java binary is on your PATH.

2.2 Package options

Two global options control runtime behaviour:

Option Default Effect
jvecfor.verbose FALSE Pass --verbose to jvecfor, printing HNSW build progress to stderr.
jvecfor.jar NULL Use a custom JAR instead of the one bundled in inst/java/.

Set them persistently in your .Rprofile or per-session:

options(jvecfor.verbose = TRUE)                        # global verbose mode
options(jvecfor.jar = "/path/to/custom/jvecfor.jar")   # custom JAR

2.3 Custom JAR setup

The bundled JAR is installed automatically. If you have compiled a custom jvecfor build:

# Auto-finds target/jvecfor-*.jar relative to the working directory
jvecfor_setup()

# Or provide the path explicitly
jvecfor_setup(jar_path = "/path/to/jvecfor-x.y.z.jar")

3 Quick Start

library(jvecfor)

# Simulate 300 cells × 30 principal components
set.seed(42)
pca <- matrix(rnorm(300 * 30), nrow = 300, ncol = 30)

# Find 15 nearest neighbours (HNSW-DiskANN approximate, Euclidean)
nn <- fastFindKNN(pca, k = 15)

str(nn)
#> List of 2
#>  $ index   : int [1:300, 1:15] 271 263 125 158 271 241 210 2 153 235 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : NULL
#>   .. ..$ : chr [1:15] "V1" "V2" "V3" "V4" ...
#>  $ distance: num [1:300, 1:15] 0.0508 0.0396 0.0471 0.0409 0.0331 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : NULL
#>   .. ..$ : chr [1:15] "V16" "V17" "V18" "V19" ...

The result is a two-element list. $index is a 300 × 15 integer matrix of 1-based neighbor indices; $distance is a 300 × 15 numeric matrix of Euclidean distances.

4 Core Function: fastFindKNN()

4.1 Output structure

dim(nn$index)     # 300 x 15
#> [1] 300  15
dim(nn$distance)  # 300 x 15
#> [1] 300  15

class(nn$index)     # matrix with integer storage — 1-based, same as BiocNeighbors
#> [1] "matrix" "array"
class(nn$distance)  # matrix with numeric storage
#> [1] "matrix" "array"

# No zeros in index (1-based) and all values within [1, nrow(pca)]
range(nn$index)
#> [1]   1 298

This output is structurally identical to BiocNeighbors::findKNN(), so existing code that passes nn$index to bluster or other Bioconductor tools works without modification.

4.2 Algorithm choice: ANN vs. exact

# Approximate (HNSW-DiskANN) — default, fast, O(log n) query
nn_ann <- fastFindKNN(pca, k = 15, type = "ann")

# Exact (VP-tree) — deterministic, O(n log n) query
nn_knn <- fastFindKNN(pca, k = 15, type = "knn")

# Sanity check: no row of the exact result contains duplicate indices
stopifnot(all(apply(nn_knn$index, 1, function(x) !anyDuplicated(x))))

When to choose ANN vs. exact:

Criterion ANN ("ann") Exact ("knn")
Dataset size n ≥ 5K n < 5K, or benchmark ground truth
Recall ~95–99 % (tunable) 100 %
Speed O(log n) O(n log n)
Metrics euclidean, cosine, dot_product euclidean, cosine
Reproducibility Deterministic (fixed graph) Deterministic
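To quantify the recall trade-off on your own data, a small base-R helper (not part of jvecfor) can compare two index matrices row by row:

```r
# Fraction of approximate neighbours that also appear in the exact result,
# averaged over rows. Both inputs are n x k integer matrices.
knn_recall <- function(approx_idx, exact_idx) {
    stopifnot(dim(approx_idx) == dim(exact_idx))
    per_row <- vapply(seq_len(nrow(approx_idx)), function(i) {
        length(intersect(approx_idx[i, ], exact_idx[i, ])) / ncol(exact_idx)
    }, numeric(1))
    mean(per_row)
}

# Toy check: identical matrices give perfect recall
m <- matrix(1:20, nrow = 4)
knn_recall(m, m)   # 1
```

With the objects above, `knn_recall(nn_ann$index, nn_knn$index)` reports how closely the ANN result tracks exact search, which is useful when tuning the parameters described in section 5.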

4.3 Distance metrics

4.3.1 Euclidean (default)

The L2 distance in PC space. Appropriate when principal components have been whitened (equal variance per component) or when the embedding preserves meaningful inter-cell distances.

nn_euc <- fastFindKNN(pca, k = 15, metric = "euclidean")

4.3.2 Cosine

Measures the angle between cell vectors; ignores magnitude. This is useful when only the direction of the embedding vector matters, for instance with log-normalised count embeddings where library-size effects inflate norms. Normalise rows to unit length before calling to ensure consistent behaviour:

# L2-normalise rows
norms    <- sqrt(rowSums(pca^2))
pca_norm <- sweep(pca, 1, norms, "/")

nn_cos <- fastFindKNN(pca_norm, k = 15, metric = "cosine")

4.3.3 Dot-product (ANN only)

The inner product ⟨u, v⟩. A similarity measure (higher = more similar) rather than a distance. Only valid for type = "ann" — the VP-tree requires a proper metric. Attempting to combine metric = "dot_product" with type = "knn" raises an informative error:

# ANN with dot-product similarity
nn_dot <- fastFindKNN(pca_norm, k = 15, metric = "dot_product", type = "ann")
# Attempting exact + dot_product raises an error
fastFindKNN(pca, k = 15, metric = "dot_product", type = "knn")
#> Error in `.validate_knn_scalars()`:
#> ! 'dot_product' is not a proper metric and cannot be used with type='knn' (VP-tree exact search). Use type='ann', or switch to metric='euclidean' or metric='cosine' for exact search.

4.4 Skipping distance computation

When you only need neighbor indices (e.g. for graph construction), set get.distance = FALSE. This avoids allocating and transmitting the n × k distance matrix, saving roughly 50 % of output parsing time:

nn_idx <- fastFindKNN(pca, k = 15, get.distance = FALSE)
is.null(nn_idx$distance)   # TRUE
#> [1] TRUE

The fastMakeSNNGraph() and fastMakeKNNGraph() wrappers set get.distance = FALSE internally.

5 HNSW-DiskANN Tuning

5.1 Parameter reference

Parameter Default Tuning guidance
M 16 Connections per node. Increase to 32–64 for high-dimensional (>50 dims) data; doubles memory and build time.
ef.search 0 (auto) Beam width. Auto = max(k + 1, 3k). Increase (e.g. 100–500) to trade speed for recall.
oversample.factor 1.0 Fetch ceil(ef × factor) candidates, keep top k. Values 1.5–3.0 improve recall with proportional cost.
pq.subspaces 0 (off) PQ subspaces. See section 5.3. 4–8× speedup with minimal recall loss.

5.2 High-recall configuration

Use this when recall quality matters more than raw speed — for example when producing the KNN graph that will underpin a publication’s cluster assignments or trajectory analysis:

nn_highrecall <- fastFindKNN(
    pca, k = 15,
    M                 = 32L,
    ef.search         = 150L,
    oversample.factor = 2.0
)
dim(nn_highrecall$index)
#> [1] 300  15

5.3 Product Quantization (pq.subspaces)

Product Quantization (PQ) compresses vectors into compact codes before distance computation, yielding a 4–8× search speedup with negligible recall loss. Set pq.subspaces to ncol(X) / 2 as a starting point; the value must evenly divide ncol(X).

# 300 × 20 matrix — use 10 PQ subspaces (= 20 / 2)
set.seed(1)
pca20 <- matrix(rnorm(300 * 20), 300, 20)

nn_default <- fastFindKNN(pca20, k = 15)                      # no PQ
nn_pq      <- fastFindKNN(pca20, k = 15, pq.subspaces = 10L)  # PQ enabled

# Recall overlap (fraction of PQ neighbors that match the default)
shared <- mapply(function(a, b) length(intersect(a, b)),
                    split(nn_default$index, row(nn_default$index)),
                    split(nn_pq$index,      row(nn_pq$index)))
message(sprintf("Mean PQ recall vs. default: %.1f%%", 100 * mean(shared / 15)))
#> Mean PQ recall vs. default: 100.0%

For small matrices (< 1K rows) the JVM startup cost dominates and PQ will not show a wall-clock benefit. The speedup becomes material at n ≥ 10K.

Avoid PQ when ncol(X) is very small (< 8 features): quantisation error dominates and recall drops sharply.
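Since pq.subspaces must evenly divide ncol(X), a small helper (plain base R, not part of jvecfor) can enumerate the valid settings for a given embedding width:

```r
# All divisors of p: the candidate pq.subspaces values for a p-column matrix
valid_pq_subspaces <- function(p) {
    d <- seq_len(p)
    d[p %% d == 0]
}

valid_pq_subspaces(20)   # 1 2 4 5 10 20
```

Of these, mid-range values (here 10 = 20 / 2) are the suggested starting point; 1 subspace over-compresses, while p subspaces give no compression at all.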

A timing comparison for large data:

# Illustrative — run on your own large matrix
nn_default_time <- system.time(fastFindKNN(large_pca, k = 15))
nn_pq_time <- system.time(
    fastFindKNN(large_pca, k = 15,
                pq.subspaces = ncol(large_pca) %/% 2L)
)
nn_default_time["elapsed"] / nn_pq_time["elapsed"]  # expect 4–8×

5.4 Thread management

jvecfor auto-detects the number of physical cores and uses max(1, physical_cores − 1) threads by default. Override with num.threads:

# Single-threaded (reproducible, good for shared HPC nodes)
nn_1t <- fastFindKNN(pca, k = 15, num.threads = 1L)

# Explicit thread count
nn_4t <- fastFindKNN(pca, k = 15, num.threads = 4L)

# Use verbose to confirm the thread count jvecfor actually uses
nn_v <- fastFindKNN(pca, k = 15, num.threads = 2L, verbose = TRUE)

On shared HPC nodes, set num.threads explicitly to avoid consuming all physical cores and interfering with co-located jobs.
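The default described above can be reproduced with base R's parallel package (detectCores() can return NA on unusual platforms, hence the fallback):

```r
# Mirror jvecfor's default: physical cores minus one, never below one thread
physical <- parallel::detectCores(logical = FALSE)
if (is.na(physical)) physical <- 1L
n_threads <- max(1L, physical - 1L)
n_threads
```

Passing the result explicitly via num.threads makes the thread count visible in your script, which is easier to audit on shared nodes than relying on auto-detection.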

6 Graph Construction

The fastMakeSNNGraph() and fastMakeKNNGraph() wrappers build igraph objects directly from the PC matrix, combining fastFindKNN() with bluster’s graph constructors.

6.1 Shared Nearest-Neighbor graph

library(igraph)
#> 
#> Attaching package: 'igraph'
#> The following objects are masked from 'package:stats':
#> 
#>     decompose, spectrum
#> The following object is masked from 'package:base':
#> 
#>     union

g_snn <- fastMakeSNNGraph(pca, k = 15)
class(g_snn)
#> [1] "igraph"
igraph::vcount(g_snn)   # 300 — one vertex per cell
#> [1] 300
igraph::ecount(g_snn)   # weighted undirected edges
#> [1] 34462

The snn.type argument controls how edge weights are computed:

snn.type Edge weight Use case
"rank" (default) Rank-based Jaccard similarity Seurat-compatible; robust to hub nodes
"jaccard" Jaccard similarity Standard set-overlap measure
"number" Count of shared neighbors Simple; unnormalised
g_rank    <- fastMakeSNNGraph(pca, k = 15, snn.type = "rank")
g_jaccard <- fastMakeSNNGraph(pca, k = 15, snn.type = "jaccard")
g_number  <- fastMakeSNNGraph(pca, k = 15, snn.type = "number")

# Edge weight ranges differ between types
summary(igraph::E(g_rank)$weight)
#>      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
#>  0.000001  6.500000  9.000000  8.706247 11.000000 14.500000
summary(igraph::E(g_jaccard)$weight)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#> 0.03226 0.03226 0.06667 0.07545 0.10345 0.45455

6.2 K-Nearest Neighbor graph

# Undirected KNN graph (mutual edges)
g_knn_u <- fastMakeKNNGraph(pca, k = 15, directed = FALSE)

# Directed KNN graph (each cell points to its k neighbors)
g_knn_d <- fastMakeKNNGraph(pca, k = 15, directed = TRUE)

igraph::is_directed(g_knn_d)  # TRUE
#> [1] TRUE

6.3 Community detection

The SNN graph is the standard input for graph-based clustering in single-cell analysis. Here we use the Louvain algorithm:

louvain <- igraph::cluster_louvain(g_snn)
message(
    "Number of detected communities: ",
    length(igraph::communities(louvain))
)
#> Number of detected communities: 4
membership_vec <- igraph::membership(louvain)
table(membership_vec)
#> membership_vec
#>  1  2  3  4 
#> 36 98 70 96

Leiden clustering (when available via igraph ≥ 1.3) gives finer control over community resolution:

leiden <- igraph::cluster_leiden(g_snn, resolution_parameter = 0.5)
table(igraph::membership(leiden))

7 Full Single-Cell Workflow

This end-to-end example simulates three cell populations (clusters) embedded in 30 PCs and recovers them via SNN graph clustering. All steps use only jvecfor and igraph.

# Simulate 300 cells (100 per cluster) with separable cluster centres in PC1–PC2
set.seed(42)
n_per <- 100
centres <- list(c(5, 0), c(-5, 0), c(0, 5))

pca_clust <- do.call(rbind, lapply(seq_along(centres), function(i) {
    m        <- matrix(rnorm(n_per * 30), n_per, 30)
    m[, 1]   <- m[, 1] + centres[[i]][1]
    m[, 2]   <- m[, 2] + centres[[i]][2]
    m
}))
true_labels <- rep(1:3, each = n_per)
# Step 1: find 15 nearest neighbours
nn_wf <- fastFindKNN(pca_clust, k = 15)
# Step 2: build SNN graph
g_wf <- fastMakeSNNGraph(pca_clust, k = 15)
# Step 3: Louvain community detection
lou_wf   <- igraph::cluster_louvain(g_wf)
detected <- igraph::membership(lou_wf)

# Step 4: cross-tabulate detected vs. true labels
print(table(Detected = detected, True = true_labels))
#>         True
#> Detected   1   2   3
#>        1  99   0   0
#>        2   1   0 100
#>        3   0 100   0

# Visualise in the first two PCs (base R — no extra dependencies)
cluster_cols <- c("#E64B35", "#4DBBD5", "#00A087", "#3C5488", "#F39B7F")
plot(
    pca_clust[, 1], pca_clust[, 2],
    col  = cluster_cols[detected],
    pch  = 19, cex = 0.7,
    xlab = "PC 1", ylab = "PC 2",
    main = "Louvain clusters on SNN graph"
)
legend("topright",
       legend = paste("Cluster", sort(unique(detected))),
       col    = cluster_cols[sort(unique(detected))],
       pch    = 19, bty = "n", cex = 0.85)

Figure 1: Louvain clusters (colours) in PC1-PC2 space
Three well-separated populations are recovered.
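A quick agreement score can be read off the confusion matrix above with base R: let each detected community claim its dominant true label and count the hits. This is only a rough summary, not a substitute for a proper metric such as the adjusted Rand index:

```r
# Confusion matrix transcribed from the cross-tabulation above
conf <- matrix(c(99,   0,   0,
                  1,   0, 100,
                  0, 100,   0),
               nrow = 3, byrow = TRUE,
               dimnames = list(Detected = 1:3, True = 1:3))

# Best-match accuracy: dominant true label per detected cluster
sum(apply(conf, 1, max)) / sum(conf)   # 299 / 300
```

Here 299 of 300 cells land in the community matching their simulated population, consistent with the well-separated centres used in the simulation.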

7.1 Using a real SingleCellExperiment

Swap the simulated matrix for reducedDim(sce, "PCA") to apply the same pipeline to real data:

library(SingleCellExperiment)

# Assume sce is a SingleCellExperiment with PCA computed
# (e.g. via scater::runPCA)
pca_mat <- reducedDim(sce, "PCA")

# KNN search
nn_sce <- fastFindKNN(pca_mat, k = 15)

# Graph-based clustering
g_sce      <- fastMakeSNNGraph(pca_mat, k = 15)
lou_sce    <- igraph::cluster_louvain(g_sce)
sce$cluster <- as.factor(igraph::membership(lou_sce))

8 Comparing with BiocNeighbors

jvecfor is designed as a drop-in replacement for BiocNeighbors. The two calls below are functionally equivalent and return identically structured output:

library(BiocNeighbors)

# BiocNeighbors (Annoy backend, the default ANN method)
nn_bn <- BiocNeighbors::findKNN(pca, k = 15, BNPARAM = AnnoyParam())

# jvecfor (HNSW-DiskANN backend) — standalone function
nn_jv <- jvecfor::fastFindKNN(pca, k = 15)

# Same structure
identical(names(nn_bn), names(nn_jv))   # TRUE — both have $index, $distance
identical(dim(nn_bn$index), dim(nn_jv$index))  # TRUE

Results will not be identical because both methods are approximate — they use different ANN algorithms with different recall characteristics.

8.1 Drop-in BiocNeighbors integration

JvecforParam is a BiocNeighborParam subclass, so you can pass it as the BNPARAM argument to any Bioconductor function that accepts one:

library(BiocNeighbors)

# Use jvecfor through the standard BiocNeighbors interface
nn <- findKNN(pca, k = 15, BNPARAM = JvecforParam())

# Works with scran, scater, and other BNPARAM-aware packages:
# library(scran)
# g <- buildSNNGraph(sce, BNPARAM = JvecforParam())
#
# library(scater)
# sce <- runUMAP(sce, BNPARAM = JvecforParam())

# Customise algorithm parameters via the constructor
nn2 <- findKNN(pca, k = 15,
                BNPARAM = JvecforParam(
                    type = "knn",           # exact VP-tree search
                    distance = "Cosine",
                    M = 32L
                ))

8.2 Timing comparison

To benchmark on your own data:

# Replace `pca_large` with your n × p matrix (e.g. n = 100K cells, p = 50 PCs)
t_bn <- system.time(
    BiocNeighbors::findKNN(pca_large, k = 15, BNPARAM = AnnoyParam())
)
t_jv <- system.time(
    fastFindKNN(pca_large, k = 15)
)

message(sprintf("BiocNeighbors: %.1f s", t_bn["elapsed"]))
message(sprintf("jvecfor:       %.1f s", t_jv["elapsed"]))
message(sprintf("Speedup:       %.1fx",  t_bn["elapsed"] / t_jv["elapsed"]))

9 Troubleshooting

9.1 Common errors

Error message Likely cause Resolution
Java not found on PATH java not in shell PATH Install OpenJDK 20+ (adoptium.net); re-launch R after updating PATH.
Java >= 20 is required (found Java X) Outdated JDK Upgrade to OpenJDK 20+; verify with java --version.
jvecfor JAR not found. Run jvecfor_setup() inst/java/ empty Run jvecfor_setup() or reinstall.
jvecfor exited with status N JVM crash or invalid input Set verbose=TRUE to inspect Java stderr.
Very slow first call JVM cold-start overhead Expected; subsequent calls are faster.
Unexpected neighbour count k >= nrow(X) Reduce k; need at least k + 1 observations.
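The last row of the table can be guarded against before calling jvecfor. A plain base-R check (not part of the package) mirrors the k + 1 observation requirement:

```r
# Clamp k so that every point has enough other points to be its neighbours
safe_k <- function(k, n) {
    if (n <= 1L) stop("need at least 2 observations for a KNN search")
    min(k, n - 1L)
}

safe_k(15, 300)   # 15 — unchanged
safe_k(15, 10)    # 9  — clamped to nrow(X) - 1
```

Calling `fastFindKNN(X, k = safe_k(15, nrow(X)))` then degrades gracefully on small inputs instead of erroring.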

9.2 Debugging with verbose mode

Enable verbose output to see HNSW build progress and confirm threading:

options(jvecfor.verbose = TRUE)
nn_debug <- fastFindKNN(pca[1:50, ], k = 5)
options(jvecfor.verbose = FALSE)   # reset

9.3 Checking the Java runtime

# Which java is on PATH?
Sys.which("java")

# Version reported by that java binary
system2("java", "--version", stderr = TRUE)

# Where is the bundled JAR?
system.file("java", package = "jvecfor")
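To check the major version programmatically, a small parser (a hypothetical helper, not exported by jvecfor) can be applied to the first line of the java --version output:

```r
# Extract the major version from a line like "openjdk 21.0.2 2024-01-16"
java_major_version <- function(line) {
    as.integer(sub("^\\D*(\\d+).*$", "\\1", line, perl = TRUE))
}

java_major_version("openjdk 21.0.2 2024-01-16")   # 21
java_major_version("openjdk 20 2023-03-21")       # 20

# In practice:
# ver <- system2("java", "--version", stdout = TRUE)[1]
# stopifnot(java_major_version(ver) >= 20L)
```

This catches the "Java >= 20 is required" failure mode from section 9.1 before any KNN call is made.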

9.4 Using a safe wrapper

For production code that should degrade gracefully when Java is unavailable, wrap calls with tryCatch:

nn <- tryCatch(
    fastFindKNN(pca, k = 15),
    error = function(e) {
        warning("jvecfor unavailable: ", conditionMessage(e),
                "\nFalling back to BiocNeighbors.")
        BiocNeighbors::findKNN(pca, k = 15)
    }
)
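A cheaper availability probe (a base-R helper of our own, not part of jvecfor) avoids paying the JVM start-up cost inside tryCatch when Java is simply absent:

```r
# TRUE when a java binary is visible on PATH; costs no JVM start-up
java_available <- function() {
    nzchar(Sys.which("java"))
}

java_available()
```

Branching on `java_available()` at the top of a pipeline lets you select the BiocNeighbors fallback once, rather than catching the same error on every call.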

10 Session Information

sessionInfo()
#> R version 4.6.0 alpha (2026-03-30 r89742)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] igraph_2.2.2     jvecfor_0.99.7   BiocStyle_2.39.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] Matrix_1.7-5        jsonlite_2.0.0      compiler_4.6.0     
#>  [4] BiocManager_1.30.27 tinytex_0.59        Rcpp_1.1.1         
#>  [7] magick_2.9.1        parallel_4.6.0      cluster_2.1.8.2    
#> [10] jquerylib_0.1.4     BiocParallel_1.45.0 yaml_2.3.12        
#> [13] fastmap_1.2.0       lattice_0.22-9      R6_2.6.1           
#> [16] generics_0.1.4      knitr_1.51          BiocGenerics_0.57.0
#> [19] bluster_1.21.1      bookdown_0.46       bslib_0.10.0       
#> [22] pillar_1.11.1       BiocNeighbors_2.5.4 rlang_1.1.7        
#> [25] cachem_1.1.0        xfun_0.57           sass_0.4.10        
#> [28] otel_0.2.0          cli_3.6.5           magrittr_2.0.4     
#> [31] ps_1.9.2            digest_0.6.39       grid_4.6.0         
#> [34] processx_3.8.7      lifecycle_1.0.5     S4Vectors_0.49.0   
#> [37] vctrs_0.7.2         evaluate_1.0.5      glue_1.8.0         
#> [40] data.table_1.18.2.1 codetools_0.2-20    stats4_4.6.0       
#> [43] rmarkdown_2.31      tools_4.6.0         pkgconfig_2.0.3    
#> [46] htmltools_0.5.9