--- title: "Introduction to predictsr" output: html_vignette vignette: > %\VignetteIndexEntry{Introduction to predictsr} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} knitr: opts_chunk: collapse: true comment: '#>' cache: true --- The [`predictsr`](https://github.com/Biodiversity-Futures-Lab/predictsr) package accesses the PREDICTS database (Hudson et al, 2013) from within R, conveniently as a dataframe. It uses the [Natural History Museum Data Portal](https://data.nhm.ac.uk/) to download the latest versions of the PREDICTS database and some related metadata. The [PREDICTS](https://www.nhm.ac.uk/our-science/research/projects/predicts/science.html) database comprises over 4 million measurements of species at sites across the world. All major terrestrial plant, animal, and fungal groups are represented within the data. The most well-represented group in the data are arthropods (mainly insects), covering about 2% of the species known by science There have been 2 public releases of PREDICTS data. The first was in [2016](https://data.nhm.ac.uk/dataset/the-2016-release-of-the-predicts-database-v1-1) and the second was in [2022](https://data.nhm.ac.uk/dataset/release-of-data-added-to-the-predicts-database-november-2022); each consisted of about 3 million and 1 million records, respectively. We include both, as a single dataset, in this package. *However*, the PREDICTS database is under constant development and is continuously growing, with later releases planned. ## Loading the data To get started, let's load in the database into R and poke around. To do so you will use the `LoadPredictsData` function, which pulls in the data from the data portal and saves it locally for you. It reads in both releases into a single dataframe (by default), or you can specify which years to save: ```{r setup} file_predicts <- file.path(tempdir(), "predicts.rds") predicts <- predictsr::LoadPredictsData(file_predicts, extract = c(2016, 2022)) str(predicts) ``` Be patient as this may take a few seconds to run, as it downloads the data from the NHM Data Portal. The first time you run this, it will save the data into `file_predicts`, and store some metadata in a `*.aux.json` file. Subsequent calls will then check the metadata (via a SHA-based invalidation), and if this looks OK, just load the data in `file_predicts`. This saves you having to repeatedly download the data. We can see that there are over 4 million records in the combined PREDICTS extracts. Let's look at a set of summary statistics for the database: ```{r} if (nrow(predicts) > 0) { taxa <- predicts[ !duplicated(predicts[, c("Source_ID", "Study_name", "Taxon_name_entered")]), ] species_counts <- length( unique(taxa[taxa$Rank %in% c("Species", "Infraspecies"), "Taxon"]) ) + nrow(taxa[!taxa$Rank %in% c("Species", "Infraspecies"), ]) print(glue::glue( "This database has {length(unique(predicts$SS))} studies across ", "{length(unique(predicts$SSBS))} sites, in ", "{length(unique(predicts$Country))} countries, and with ", "{species_counts} species." )) } else { print("No data available - check the download!") } ``` There are over 30,000 sites, 101 countries, and 54,863 species in this dataframe, with a couple of important columns that should be noted: - The `SS` column is what we use to identify studies when dealing with PREDICTS data. This is the concantenation of the `Source_ID`, and the `Study_number` columns. 
Let's also check the ranges of sample collection in the database:

```{r}
if (nrow(predicts) > 0) {
  print(glue::glue(
    "Earliest sample collection (midpoint): {min(predicts$Sample_midpoint)}, ",
    "latest sample collection (midpoint): {max(predicts$Sample_midpoint)}"
  ))
} else {
  print("No data available in the PREDICTS database - check the download!")
}
```

## Loading the data without caching

If you don't want to cache the data, you can also just use `GetPredictsData`:

```{r}
predicts <- predictsr::GetPredictsData(extract = c(2016, 2022))
```

This pulls in exactly the same data as `LoadPredictsData`, but won't save it to
disk for you. If you are already using some sort of data pipeline tool (e.g.
[`targets`](https://cran.r-project.org/package=targets)) then this may be your
preferred option.

```{r, echo=FALSE}
# Delete the database locally
unlink(c(file_predicts, paste0(file_predicts, ".aux.json")))
```

## Accessing site-level summaries

We also include access to the site-level summaries from the full release; to
get *these data* you will need the `GetSitelevelSummaries` function. The
function call is very similar; here we pull in the summaries for the same
extracts as above:

```{r}
summaries <- predictsr::GetSitelevelSummaries(extract = c(2016, 2022))
str(summaries)
```

Investigating the summary data more closely, we see that a number of columns
present in `predicts` are missing from `summaries`:

```{r}
if (nrow(predicts) > 0 && nrow(summaries) > 0) {
  print(names(predicts)[!(names(predicts) %in% names(summaries))])
}
```

This is because all of the measurement-level data has been dropped from the
dataframe. We could try to replicate the creation of the `summaries` dataframe
with some `dplyr` operations, roughly as follows.

**Note:** there seem to be some differences, as a couple of sites don't match
up.

```{r}
if (nrow(summaries) > 0) {
  summaries_rep <- predicts |>
    dplyr::mutate(
      Higher_taxa = paste(sort(unique(Higher_taxon)), collapse = ","),
      N_samples = length(Measurement),
      Rescaled_sampling_effort = mean(Rescaled_sampling_effort),
      .by = SSBS
    ) |>
    dplyr::select(
      dplyr::all_of(names(summaries))
    ) |>
    dplyr::distinct() |>
    dplyr::arrange(SSBS)

  summaries_copy <- summaries |>
    subset(SSBS %in% summaries_rep$SSBS) |>
    dplyr::arrange(SSBS)

  all.equal(summaries_copy, summaries_rep)
}
```

## Accessing descriptions of the columns in PREDICTS

As we've seen already, there are 67 (!) columns in the PREDICTS database, and
within the NHM Data Portal releases we have included a description of the data
held in each of these columns. You can access this via the
`GetColumnDescriptions` function:

```{r}
descriptions <- predictsr::GetColumnDescriptions()
str(descriptions)
```

This includes the `Column` name, the resolution of the `Column` (`Applies_to`),
whether it is in the PREDICTS extract (`Diversity_extract`), whether it is in
the site-level summaries (`Site_extract`), the datatype of the `Column`
(`Type`), whether it is guaranteed to be non-empty, any additional `Notes`, and
any information on the range of values the `Column` may be expected to take
(`Validation`).
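For example, using the `Column`, `Type`, and `Notes` fields of this dataframe,
we can pull the documentation for the identifier columns we met earlier (a
quick illustration; any subset of columns works the same way):

```{r}
if (nrow(descriptions) > 0) {
  # Documentation rows for the study and site identifier columns
  print(subset(
    descriptions,
    Column %in% c("SS", "SSBS"),
    select = c(Column, Type, Notes)
  ))
}
```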
## Glossary

To clarify some of the PREDICTS jargon, we include the following table of
definitions, drawn from the dataframes we have worked with so far:

```{r, echo = FALSE, results = 'asis'}
if (nrow(descriptions) > 0) {
  descriptions_sub <- subset(
    descriptions,
    Column %in% c(
      "Source_ID", "SS", "Block", "SSBS", "Diversity_metric_type", "Measurement"
    )
  )

  for (i in seq_along(descriptions_sub$Column)) {
    cat(
      paste0(
        "* `", descriptions_sub$Column[i], "` (`",
        descriptions_sub$Type[i], "`): "
      )
    )

    notes <- descriptions_sub$Notes[i] |>
      (\(s) gsub("\\*", " -", s))() |>
      (\(s) gsub("\n\n", " \n", s))() |>
      (\(s) gsub("\n", " \n", s))()

    if (notes == " ") {
      notes <- "As title."
    }

    cat(notes)
    cat(" \n")
  }
}
```

**Notes:** when we refer to a "study", we typically refer to it by the `SS`
value that identifies it. An "extract" simply refers to a PREDICTS database
release, either from 2016 or 2022 (or both). For complete documentation, see
the supplementary information in Hudson et al. (2017).

## Notes

The [NHM Data Portal API](https://data.nhm.ac.uk/about/download) has no rate
limits, so be considerate with your requests. Make sure you save the data
somewhere, or use a tool like
[`targets`](https://docs.ropensci.org/targets/index.html) to save you from
re-running workflows.

## References

Hudson, Lawrence N., et al. "The PREDICTS database: a global database of how
local terrestrial biodiversity responds to human impacts." Ecology and
Evolution 4.24 (2014): 4701-4735.

Hudson, Lawrence N., et al. "The database of the PREDICTS (projecting responses
of ecological diversity in changing terrestrial systems) project." Ecology and
Evolution 7.1 (2017): 145-188.