--- title: "Spatial joins" date: "`r Sys.Date()`" output: rmarkdown::html_vignette code-annotations: hover urlcolor: blue vignette: > %\VignetteIndexEntry{Spatial joins} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = identical(tolower(Sys.getenv("NOT_CRAN")), "true"), out.width = "100%" ) # CRAN OMP THREAD LIMIT to avoid CRAN NOTE Sys.setenv(OMP_THREAD_LIMIT = 2) ``` This vignette shows how to use `ddbs_join()` to perform fast spatial join operations on large data sets with three different approaches: 1 **In-memory**: pass `sf` objects and get an `sf` result (DuckDB runs under the hood, no persistent DataBase). 2. **Connected**: pass table names stored in an existing DuckDB connection and get an `sf` result. 3. **Write-to-DB**: same as (2), but write the result to a new DuckDB table. Let's see a few examples. First, let's load a few libraries and our sample data: ```{r} library(duckspatial) # library(mapview) library(sf) # polygons countries_sf <- sf::st_read( system.file("spatial/countries.geojson", package = "duckspatial"), quiet = TRUE ) # random points set.seed(42) n <- 10000 points_sf <- data.frame( id = 1:n, x = runif(n, min = -180, max = 180), y = runif(n, min = -90, max = 90) ) |> sf::st_as_sf(coords = c("x","y"), crs = 4326) ``` # 1) In-memory: pass `sf`, return `sf` The simplest way to perform fast spatial join. You simply pass two `sf` objects, and `ddbs_join()` spins up a temporary DuckDB, runs the join, and returns an `sf.` - **When to use:** quick analysis, prototyping, or when you don’t need to persist intermediate tables. ```{r, message=FALSE} out_sf1 <- ddbs_join( x = points_sf, y = countries_sf, join = "within" ) # quick peek # mapview(out_sf1, zcol="NAME_ENGL") ``` # 2) Connected: pass table names in DuckDB, return `sf` In the second and third approaches, we make use of a connection to an existing DuckDB database. So let's create a fresh DuckDB connection using the `ddbs_create_conn()` function, which automatically install and load DuckDB spatial extension to the connection. ```{r} # create a fresh DuckDB connection conn <- duckspatial::ddbs_create_conn() ``` Now, in this second approach you need first to write your layers to DuckDB, and perform the spatial join by referencing their table names. Like this: ```{r, message=FALSE} # write data to DuckDB ddbs_write_vector(conn, points_sf, "points", overwrite = TRUE) ddbs_write_vector(conn, countries_sf, "countries", overwrite = TRUE) # spatial join inside DuckDB; result returned as sf out_sf2 <- ddbs_join( conn, x = "points", y = "countries", join = "within" ) ``` - **When to use:** iterative workflows, larger-than-memory data, or when you’ll run multiple queries on the same tables. # 3) Write-to-DB: create a new table with the join result The output of approaches 1 and 2 is an `sf` object loaded to your memory. In this third approach, `ddbs_join()` writes a new table in the DuckDB database. You simply need to the `name` of the new table. ```{r, message=FALSE} ddbs_join( conn = conn, x = "points", y = "countries", join = "within", name = "points_in_countries", overwrite = TRUE ) # use the result in SQL (or read back as sf later) # DBI::dbReadTable(conn, "points_in_countries") |> # sf::st_as_sf(wkt = 'geometry') |> # head() ``` - **When to use:** iterative workflows, larger-than-memory data, or when you’ll run multiple queries on the same tables. # Spatial Join Predicates: A spatial predicate is really just a function that evaluates some spatial relation between two geometries and returns true or false, e.g., “does a contain b” or “is a within distance x of b”. The `join` argument accepts the spatial predicates: - `"ST_Intersects"`: Whether a intersects b - `"ST_Contains"`: Whether a contains b - `"ST_ContainsProperly"`: Whether a contains b without b touching a's boundary - `"ST_Within"`: Whether a is within b - `"ST_Overlaps"`: Whether a overlaps b - `"ST_Touches"`: Whether a touches b - `"ST_Equals"`: Whether a is equal to b - `"ST_Crosses"`: Whether a crosses b - `"ST_Covers"`: Whether a covers b - `"ST_CoveredBy"`: Whether a is covered by b - `"ST_DWithin"`: x) Whether a is within distance x of b # Clean up Don’t forget to disconnect from the database. ```{r} duckdb::dbDisconnect(conn) ```