BiocNeighbors 1.23.0
The BiocNeighbors package provides several algorithms for approximate neighbor searches:
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by simply changing the BNPARAM
argument in findKNN
and queryKNN
.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam()
.
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 3587 5694 94 1490 6448 7210 3381 7180 106 1197
## [2,] 3784 554 7807 4305 8822 604 2570 4475 3576 2023
## [3,] 7812 5933 7619 6470 4313 8819 7126 4645 9969 1945
## [4,] 9156 5294 1826 4399 2278 231 8749 6633 637 3937
## [5,] 4640 2395 9625 9575 8451 6771 8171 4453 4051 7179
## [6,] 411 2979 4292 696 721 5762 6553 2679 6720 4234
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.9021356 0.9205363 1.0283924 1.0352873 1.0475036 1.1139144 1.1379948
## [2,] 0.9382922 1.0280027 1.0463891 1.0553226 1.0582689 1.0705205 1.0706172
## [3,] 0.9774883 0.9986819 1.0080993 1.0239663 1.0283052 1.0491102 1.0815026
## [4,] 0.8917468 0.8962045 0.9015843 0.9178713 0.9240661 0.9269338 0.9366021
## [5,] 0.9703493 1.0482774 1.0751156 1.0929271 1.1002343 1.1190203 1.1224794
## [6,] 0.8080394 0.8767816 0.8910694 0.9096510 0.9143987 0.9386662 0.9452432
## [,8] [,9] [,10]
## [1,] 1.1468021 1.1517737 1.1655892
## [2,] 1.0820990 1.0882543 1.0924572
## [3,] 1.1113075 1.1119481 1.1259694
## [4,] 0.9433661 0.9486399 0.9653121
## [5,] 1.1256322 1.1285431 1.1292713
## [6,] 0.9583816 0.9652805 0.9708000
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 3066 8931 4332 6475 5806
## [2,] 4897 1198 9869 7847 9299
## [3,] 5984 6066 1659 4789 9078
## [4,] 1298 2477 7304 8497 1933
## [5,] 1356 1010 3125 2179 5843
## [6,] 6598 2734 693 5902 3331
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.9725342 0.9938608 1.0042375 1.0421308 1.0623859
## [2,] 0.8331320 0.9495735 0.9558044 0.9673028 0.9724756
## [3,] 0.9676353 1.0097779 1.0232770 1.0241607 1.0251552
## [4,] 0.7896271 0.9158679 0.9505450 1.0054065 1.0286995
## [5,] 0.9504547 1.0188464 1.0624349 1.0747781 1.0818518
## [6,] 0.7524973 0.9294206 1.0867753 1.0924937 1.1076089
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam()
.
Most of the options described for the exact methods are also applicable here. For example:
subset
to identify neighbors for a subset of points.get.distance
to avoid retrieving distances when unnecessary.BPPARAM
to parallelize the calculations across multiple workers.BNINDEX
to build the forest once for a given data set and re-use it across calls.The use of a pre-built BNINDEX
is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan"
in AnnoyParam()
or HnswParam()
.
Users are referred to the documentation of each function for specific details on the available arguments.
Both Annoy and HNSW generate indexing structures - a forest of trees and series of graphs, respectively -
that are saved to file when calling buildIndex()
.
By default, this file is located in tempdir()
1 On HPC file systems, you can change TEMPDIR
to a location that is more amenable to concurrent access. and will be removed when the session finishes.
AnnoyIndex_path(pre)
## [1] "F:\\biocbuild\\bbs-3.20-bioc\\tmpdir\\RtmpAbrBk0\\file14e44da665f7.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex
.
This can be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex()
, HnswIndex()
.
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
sessionInfo()
## R version 4.4.0 RC (2024-04-16 r86468 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows Server 2022 x64 (build 20348)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=C
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.23.0 knitr_1.46 BiocStyle_2.33.0
##
## loaded via a namespace (and not attached):
## [1] cli_3.6.2 rlang_1.1.3 xfun_0.43
## [4] jsonlite_1.8.8 S4Vectors_0.43.0 htmltools_0.5.8.1
## [7] stats4_4.4.0 sass_0.4.9 rmarkdown_2.26
## [10] grid_4.4.0 evaluate_0.23 jquerylib_0.1.4
## [13] fastmap_1.1.1 yaml_2.3.8 lifecycle_1.0.4
## [16] bookdown_0.39 BiocManager_1.30.22 compiler_4.4.0
## [19] codetools_0.2-20 Rcpp_1.0.12 BiocParallel_1.39.0
## [22] lattice_0.22-6 digest_0.6.35 R6_2.5.1
## [25] parallel_4.4.0 bslib_0.7.0 Matrix_1.7-0
## [28] tools_4.4.0 BiocGenerics_0.51.0 cachem_1.0.8