--- title: Finding neighbors in high-dimensional space author: - name: Aaron Lun email: infinite.monkeys.with.keyboards@gmail.com date: "Revised: July 28, 2024" output: BiocStyle::html_document: toc_float: true package: BiocNeighbors vignette: > %\VignetteIndexEntry{Finding neighbors in high-dimensional space} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, echo=FALSE, results="hide", message=FALSE} require(knitr) opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE) library(BiocStyle) self <- Biocpkg("BiocNeighbors") set.seed(99999) ``` # Overview The `r self` package implements a variety of nearest-neighbor (NN) search algorithms with a consistent interface. This includes exact search methods like [vantage point trees](https://dl.acm.org/doi/10.5555/313559.313789), [k-means k-nearest neighbors](https://doi.org/10.1109/IJCNN.2011.6033373), and an exhaustive brute-force search, as well as approximate methods like [Approximate Nearest Neighbors Oh Yeah (Annoy)](https://github.com/spotify/annoy) and [hierarchical navigable small worlds (HNSW)](https://github.com/nmslib/hnswlib). We provides methods to search nearest neighbors within a dataset, query across datasets, and (for some algorithms) find all neighbors within a certain distance. # Finding nearest neighbors The `findKNN()` function will find the `k`-nearest neighbors within a dataset, returning one matrix of neighbor identities and another matrix of (sorted) distances to each neighbor. ```{r} # Mocking up a matrix with 'nobs' points in the rows # and 'ndim' dimensions in the columns. nobs <- 1000 ndim <- 20 data <- matrix(runif(nobs*ndim), ncol=ndim) library(BiocNeighbors) fout <- findKNN(data, k=10) str(fout) ``` Each row of the `index` matrix corresponds to a row in `data` and contains the row indices in `data` that are its nearest neighbors. For example, the 3rd point in `data` has the following nearest neighbors with their (sorted) distances: ```{r} fout$index[3,] fout$distance[3,] ``` Users can easily change the algorithm by setting the `BNPARAM=` argument. For example, the code below switches to the Annoy algorithm for a faster, less accurate search. ```{r} aout <- findKNN(data, k=10, BNPARAM=AnnoyParam()) str(aout) ``` The same mechanism can be used to change the distance metric from the Euclidean default. ```{r} vout <- findKNN(data, k=10, BNPARAM=VptreeParam(distance="Manhattan")) str(vout) ``` If the number of neighbors differs for each point, we can supply an integer vector to `k=` instead. This yields a list of vectors containing the neighbors and their distances to each point. ```{r} var.k <- sample(10, nrow(data), replace=TRUE) # use I() to distinguish between scalar and length-1 vectors. var.out <- findKNN(data, k=I(var.k)) head(var.out$index) head(var.out$distance) ``` `queryKNN()` is a related function that will find the `k`-nearest neighbors in one dataset based on query points in another dataset. Here, the rows of the output matrices correspond to rows of our `query` matrix. ```{r} nquery <- 100 ndim <- 20 query <- matrix(runif(nquery*ndim), ncol=ndim) qout <- queryKNN(data, query, k=5) str(qout) ``` When performing multiple calls to `findKNN()` or `queryKNN()` on the same data, advanced users may prefer to build the index first with `buildIndex()`. This can be efficiently reused without repeating the index construction in each call. ```{r} prebuilt <- buildIndex(data) out1 <- findKNN(prebuilt, k=5) out2 <- queryKNN(prebuilt, query, k=5) ``` # Finding all neighbors within range In some applications, we need to find all neighboring points within a certain distance threshold. This is achieved using the `findNeighbors()` function: ```{r} nobs <- 8000 ndim <- 20 data <- matrix(runif(nobs*ndim), ncol=ndim) fout <- findNeighbors(data, threshold=1) head(fout$index) head(fout$distance) ``` Each entry of the `index` list corresponds to a point in `data` and contains the row indices in `data` that are within `threshold`. For example, the 3rd point in `data` has the following neighbors with the associated distances: ```{r} fout$index[[3]] fout$distance[[3]] ``` Again, we can switch algorithms by specifying `BNPARAM=`. However, keep in mind that not all algorithms (particularly the approximate methods) support this range-based search. ```{r} vparam <- VptreeParam(distance="Manhattan") vout <- findNeighbors(data, threshold=1, BNPARAM=vparam) ``` `queryNeighbors()` is a related function that will identify all points within a certain distance of a query point. Each entry of the returned `index` and `distance` corresponds to a row of `query` and describes its neighbors in `data`. ```{r} nquery <- 100 ndim <- 20 query <- matrix(runif(nquery*ndim), ncol=ndim) qout <- queryNeighbors(data, query, threshold=1) head(qout$index) ``` As described above, advanced users can built the index first before repeated calls to `findNeighbors()` and `queryNeighbors()`. ```{r} prebuilt <- buildIndex(data) out1 <- findNeighbors(prebuilt, threshold=1) out2 <- queryNeighbors(prebuilt, query,threshold=1) ``` If only the number of neighbors is of interest, we can set `get.index=FALSE` and `get.distance=FALSE` in the `findNeighbors()`/`queryNeighbors()` call. This will count the number of neighbors without storing their identities/distances for greater memory efficiency. ```{r} num <- findNeighbors(data, threshold=1, get.index=FALSE, get.distance=FALSE) head(num) ``` # Usage in downstream C++ libraries R/Bioconductor package developers can use `r self` to perform nearest-neighbor searches in their own C++ code. This uses the interfaces in the [**knncolle**](https://github.com/knncolle/knncolle) library to abstract away the underlying search methods. First, we add some linking instructions in the `DESCRIPTION` to make the header files available: ``` LinkingTo: Rcpp, assorthead, BiocNeighbors ``` This includes the `BiocNeighbors.h` file that implements the `BiocNeigbors::Prebuilt` class: ```{r, results="hide"} system.file("include", "BiocNeighbors.h", package="BiocNeighbors") ``` When a `buildIndex()` method returns an external pointer, that pointer is expected to refer to an instance of a `BiocNeighbors::Prebuilt`. Thus, downstream code just needs to accept a pointer and cast it to a `BiocNeighbors::Prebuilt` for use in nearest-neighbor searches. (The same approach can be applied to obtain a `BiocNeighbors::Builder` for construction of new indices.) ```cpp SEXP do_something(SEXP raw_ptr, int k) { BiocNeighbors::PrebuiltPointer ptr(raw_ptr); const auto& prebuilt = ptr->index; auto searcher = prebuilt->initialize(); std::vector indices; std::vector distances; searcher->search(0, k, &indices, &distances); return Rcpp::List::create( Rcpp::IntegerVector(indices.begin(), indices.end()), Rcpp::IntegerVector(distances.begin(), distances.end()) ); } ``` Some extra work is required for cosine-normalized indices, which is indicated by `ptr->cosine == true`. Any query data should also be L2-normalized before use in `Searcher::search()` or `Searcher::search_all()`. (This is not required for overloads that accept an index for searching _within_ a dataset, only for queries between datasets.) # Extending to new methods ## Introduction Developers can extend the `r self` framework to more algorithms through the creation of new `BiocNeighborParam` classes. Users can then switch between algorithms by simply changing the `BNPARAM=` argument in their calls to `findKNN()`, etc. In general, we recommend putting these extensions in a new R/Bioconductor package that adds more methods to the `r self` generics. Developers can choose to write pure R extensions or they can write them partially in C++. ## In R Each extension package should: - Implement a `SomeNewParam` S4 class that inherits from `BiocNeighborParam`. - Implement a `SomeNewIndex` S4 class that inherits from `BiocNeighborIndex` and contains an arbitrary index structure. - Implement a `buildIndex()` S4 method, dispatching on a `matrix` type for `X=` and a `SomeNewParam` type for `BNPARAM=`. This should return an instance of `SomeNewIndex`. - Implement two methods for each of `findKNN()` and `queryKNN()` (and optionally `findNeighbors()` `queryNeighbors()`). One method should dispatch on a `matrix` type for `X=` and a `SomeNewParam` type for `BNPARAM=`. The other method should dispatch on a prebuilt `SomeNewIndex` type for `X=` and an `ANY` type for `BNPARAM=` (as the `BNPARAM=` is ignored for prebuilt indices). ## In C++ Alternatively, each extension package should: - Implement C++ classes that inherit from the `Builder`, `Prebuilt` and `Searcher` interfaces in the [**knncolle**](https://github.com/knncolle/knncolle) library. This requires `LinkingTo: assorthead, BiocNeighbors` in the `DESCRIPTION`. - Implement a `SomeNewParam` S4 class that inherits from `BiocNeighborParam`. - Implement a `SomeNewIndex` S4 class that inherits from `BiocNeighborGenericIndex`. This should contain a `ptr` slot for an external pointer and a `names` slot for the observation names. - Implement a `defineBuilder()` method, dispatching on a `matrix` type for `X=` and a `SomeNewParam` type for `BNPARAM=`. This should call into C++ and return a list containing `builder`, an external pointer to an instance of a `BiocNeighbors::Builder` (see definition in `BiocNeighbors.h`); and `class`, the constructor for the new `SomeNewIndex` class. No new methods are required for `findKNN()`, `queryNeighbors()`, etc. as a `BiocNeighborGeneric` method is already available for each generic. This approach also allows the new method to be used in C++ code of downstream packages that accept a `BiocNeighbors::Prebuilt` instance. # Session information {-} ```{r} sessionInfo() ```