BiocNeighbors 1.22.0
The BiocNeighbors package provides several algorithms for approximate neighbor searches:
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by simply changing the BNPARAM
argument in findKNN
and queryKNN
.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam()
.
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 2848 4521 3928 6364 7001 4703 1666 2480 1078 3972
## [2,] 6407 8051 1606 8266 9804 3901 979 9696 9847 8034
## [3,] 3268 6710 9705 6519 8288 3434 7658 3646 3511 750
## [4,] 644 5238 3916 8231 2047 2580 5973 7882 8380 8753
## [5,] 5831 8211 5357 1487 4543 5926 8326 3097 9270 321
## [6,] 2642 5718 8416 9595 353 917 2331 6665 9137 7356
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.8405916 0.8798988 0.8870191 0.9181291 0.9606939 0.9607217 0.9621375
## [2,] 0.9910530 1.0142183 1.0160263 1.0423205 1.0513763 1.0766540 1.0766710
## [3,] 0.9836815 0.9932737 0.9981568 1.0231274 1.0285779 1.0355318 1.0406004
## [4,] 0.9610632 0.9665034 0.9712912 0.9984006 1.0192084 1.0245547 1.0364628
## [5,] 0.8947265 0.9010723 1.0180677 1.0241910 1.0603987 1.0669070 1.0714582
## [6,] 0.9048750 0.9258313 0.9498504 0.9655181 1.0334547 1.0394921 1.0454750
## [,8] [,9] [,10]
## [1,] 0.9811712 0.9916555 0.9966623
## [2,] 1.0969291 1.0970849 1.1040561
## [3,] 1.0470090 1.0584370 1.0684750
## [4,] 1.0380205 1.0724257 1.0753788
## [5,] 1.0744734 1.0889053 1.1060251
## [6,] 1.0577204 1.0620967 1.1146837
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 9802 478 2593 1906 2152
## [2,] 9295 429 6912 5393 6771
## [3,] 5647 692 806 6929 7735
## [4,] 3412 2906 5580 6854 8117
## [5,] 7189 4875 7230 9791 5891
## [6,] 7096 3292 9337 862 2863
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.0188084 1.0249070 1.0424213 1.0674503 1.0863097
## [2,] 0.9499870 1.0097837 1.0343519 1.0417926 1.0602092
## [3,] 0.8892059 0.9806697 1.0053345 1.0672708 1.0726628
## [4,] 0.8908395 0.9013096 0.9373750 0.9587816 0.9697466
## [5,] 0.8676614 0.9245446 0.9404544 1.0141488 1.0142174
## [6,] 0.9886166 1.0187125 1.0624171 1.0642477 1.0776553
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam()
.
Most of the options described for the exact methods are also applicable here. For example:
subset
to identify neighbors for a subset of points.get.distance
to avoid retrieving distances when unnecessary.BPPARAM
to parallelize the calculations across multiple workers.BNINDEX
to build the forest once for a given data set and re-use it across calls.The use of a pre-built BNINDEX
is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan"
in AnnoyParam()
or HnswParam()
.
Users are referred to the documentation of each function for specific details on the available arguments.
Both Annoy and HNSW generate indexing structures - a forest of trees and series of graphs, respectively -
that are saved to file when calling buildIndex()
.
By default, this file is located in tempdir()
1 On HPC file systems, you can change TEMPDIR
to a location that is more amenable to concurrent access. and will be removed when the session finishes.
AnnoyIndex_path(pre)
## [1] "F:\\biocbuild\\bbs-3.19-bioc\\tmpdir\\Rtmp0s84m9\\file31a05b7c20b2.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex
.
This can be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex()
, HnswIndex()
.
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
sessionInfo()
## R version 4.4.0 beta (2024-04-15 r86425 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows Server 2022 x64 (build 20348)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=C
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.22.0 knitr_1.46 BiocStyle_2.32.0
##
## loaded via a namespace (and not attached):
## [1] cli_3.6.2 rlang_1.1.3 xfun_0.43
## [4] jsonlite_1.8.8 S4Vectors_0.42.0 htmltools_0.5.8.1
## [7] stats4_4.4.0 sass_0.4.9 rmarkdown_2.26
## [10] grid_4.4.0 evaluate_0.23 jquerylib_0.1.4
## [13] fastmap_1.1.1 yaml_2.3.8 lifecycle_1.0.4
## [16] bookdown_0.39 BiocManager_1.30.22 compiler_4.4.0
## [19] codetools_0.2-20 Rcpp_1.0.12 BiocParallel_1.38.0
## [22] lattice_0.22-6 digest_0.6.35 R6_2.5.1
## [25] parallel_4.4.0 bslib_0.7.0 Matrix_1.7-0
## [28] tools_4.4.0 BiocGenerics_0.50.0 cachem_1.0.8