--- title: "A. Introduction" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{A. Introduction} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` Original version: 26 September, 2023 # Introduction The AlphaMissense [publication][Science] outlines how a variant of AlphaFold / DeepMind was used to predict missense variant pathogenicity. Supporting data on [Zenodo][] include, for instance 70+M variants across hg19 and hg38 genome builds. The AlphaMissense package allows ready access to the data, downloading individual files to DuckDB databases for ready exploration and integration into *R* and *Bioconductor* workflows. [Science]: https://www.science.org/doi/epdf/10.1126/science.adg7492 [Zenodo]: https://zenodo.org//record/8360242 Install the package from *Bioconductor* or GitHub, ensuring correct *Bioconductor* dependencies. ```{r install, eval = FALSE} if (!"BiocManager" %in% rownames(installed.packages())) install.packages("BiocManager", repos = "https://cran.r-project.org") ``` When the package is available on *Bioconductor*, use ```{r install-Bioconductor, eval = FALSE} if (BiocManager::version() >= "3.19") { BiocManager::install("AlphaMissenseR") } else { stop( "'AlphaMissenseR' requires Bioconductor version 3.19 or later, ", "install from GitHub?" ) } ``` Use the pre-release or devel version with ```{r install-devel, eval = FALSE} remotes::install_github( "mtmorgan/AlphaMissenseR", repos = BiocManager::repositories() ) ``` Load the library. ```{r setup, message = FALSE} library(AlphaMissenseR) ``` Learn about available data by visiting the Zenodo record where data will be retrieved from. ```{r am_browse, eval = FALSE} am_browse() ``` # Discovery, retrieval and use Use `am_available()` to discover data resources available for representation in DuckDB databases. The `cached` column is initially `FALSE` for all data sets; `TRUE` indicates that a data set has been downloaded by `am_data()`, as described below. ```{r am_available} am_available() ``` The available data sets use the most recent record as of 25 September, 2023; this can be changed by specifying an alternative as the `record=` argument or changed globally by setting an environment variable `ALPHAMISSENSE_RECORD` *before* loading the package. Use `am_data()` to download a data resource and store it in a DuckDB database. The `key=` argument is from the column returned by `am_available()`. Files are cached locally (using [BiocFileCache][]) so this operation is expensive only the first time. Each `record=` is stored in a different database. [BiocFileCache]: https://bioconductor.org/packages/BiocFileCache ```{r am_data} tbl <- am_data("hg38") tbl ``` The return value `tbl` is a table from a DuckDB database. A (read-only) connection to the database itself is available with ```{r db_connect} db <- db_connect() ``` This connection remains open throughout the session; call `db_disconnect(db)` or `db_disconnect_all()` to close it at the end of the session. The database contains tables for each key downloaded. As an alternative to `am_available()` / `am_data()`, view available tables and create a [dplyr][] / [dbplyr][] tibble of the table of interest. [dplyr]: https://cran.r-project.org/package=dplyr [dbplyr]: https://cran.r-project.org/package=dbplyr ```{r am_data-duckdb} db_tables(db) tbl <- tbl(db, "hg38") tbl ``` It is fast and straight-forward to summarize the data, e.g., the number of variants assigned to each pathogenicity class. ```{r db-am_class} tbl |> count(am_class) ``` Or the average pathogenicity score in each class... ```{r db-pathogenicity} tbl |> group_by(am_class) |> summarize(n = n(), pathogenecity = mean(am_pathogenicity, na.rm = TRUE)) ``` Or the number of transitions between `REF` and `ALT` nucleotides across all variants. ```{r REF-ALT} tbl |> count(REF, ALT) |> tidyr::pivot_wider(names_from = "ALT", values_from = "n") |> select("REF", "A", "C", "G", "T") |> arrange(REF) ``` It is straight-forward to select variants in individual regions of interest, e.g., the first 200000 nucleoties of chromosome 4. ```{r} tbl |> filter(CHROM == "chr4", POS > 0, POS <= 200000) ``` # Working with Bioconductor This section illustrates how AlphaMissense data can be integrated with other *Bioconductor* workflows, including the [GenomicRanges][] infrastructure and the [ensembldb][] package / [AnnotationHub][] resources. [GenomicRanges]: https://bioconductor.org/packages/GenomicRanges [ensembldb]: https://bioconductor.org/packages/ensembldb [AnnotationHub]: https://bioconductor.org/packages/AnnotationHub ## Genomic ranges The [GenomicRanges][] infrastructure provides standard data structures that allow range-based operations across Bioconductor packages. The `GPos` data structure provides a convenient and memory-efficient representation of single-nucleotide variants like those in the `hg19` and `hg38` AlphaMissense resources. Start by installing (if necessary) the [GenomicRanges][] package. ```{r, eval = FALSE} if (!requireNamespace("GenomicRanges", quietly = TRUE)) BiocManager::install("GenomicRanges") ``` Select the `hg38` data resource, and filter to a subset of variants; `GPos` is an in-memory data structure and can easily manage 10's of millions of variants. ```{r} tbl <- am_data("hg38") |> filter(CHROM == "chr2", POS < 10000000, REF == "G") ``` Use `to_GPos()` to coerce to a `GPos` object. ```{r} gpos <- tbl |> to_GPos() gpos ``` Vignettes in the [GenomicRanges][] package illustrate use of these objects. ```{r, eval = FALSE} utils::browseVignettes("GenomicRanges") ``` ## GRCh38 annotation resources Start by identifying the most recent EnsDb resource for *Homo Sapiens*. ```{r ahub-homo-sap} hub <- AnnotationHub::AnnotationHub() AnnotationHub::query(hub, c("EnsDb", "Homo sapiens")) AnnotationHub::AnnotationHub()["AH113665"] ``` Load the [ensembldb][] library and retrieve the record. Unfortunately, there are many conflicts between function names in different packages, so it becomes necessary to fully resolve functions to the package where they are defined. ```{r AnnotationHub, message = FALSE} library(ensembldb) edb <- AnnotationHub::AnnotationHub()[["AH113665"]] edb ``` ## From *Bioconductor* to *DuckDB* In this section we will work more directly with the database, including writing temporary tables. For illustration purposes, we use a connection that can read and write; in practice, read-only permissions are sufficient for creating temporary tables. ```{r db_rw} db_rw <- db_connect(read_only = FALSE) ``` As an illustration, use [ensembldb][] to identify the exons of the canonical transcript of a particular gene. ```{r tx} bcl2l11 <- edb |> ensembldb::filter( ~ symbol == "BCL2L11" & tx_biotype == "protein_coding" & tx_is_canonical == TRUE ) |> exonsBy("tx") bcl2l11 ``` Munge the data to a tibble, updating the `seqnames` to a column `CHROM` with identifiers such as `"chr1"` (as in the AlphaMissense data). Write the tibble to a temporary table (it will be deleted when `db` is disconnected from the database) so that it can be used in 'lazy' SQL queries with the AlphaMissense data. ```{r temp-table} bcl2l11_tbl <- bcl2l11 |> dplyr::as_tibble() |> dplyr::mutate(CHROM = paste0("chr", seqnames)) |> dplyr::select(CHROM, everything(), -seqnames) db_temporary_table(db_rw, bcl2l11_tbl, "bcl2l11") ``` The temporary table is now available on the `db_rw` connection; the tables will be removed on disconnect, `db_disconnect(db_rw)`. ```{r db_tables-rw} "bcl2l11" %in% db_tables(db_rw) ``` Use `db_range_join()` to join the AlphaMissense data with the ranges defining the exons in our gene of interest. The arguments are the database connection, the AlphaMissense table of interest (this table must have columns `CHROM` and `POS`), the table containing ranges of interest (with columns `CHROM`, `start`, `end`), and the temporary table to contain the results. A range join is like a standard database join, expect that the constraints can be relations, in our case that `POS >= start` and `POS <= end` for each range of interest; implementation details are in this [DuckDB blog][]. The range join uses closed intervals (the start and end positions are included in the query), following *Bioconductor* convention. Writing to a temporary table avoids bringing potentially large datasets into R memory, and makes the table available for subsequent manipulation in the current session. [DuckDB blog]: https://duckdb.org/2022/05/27/iejoin.html ```{r range-join} rng <- db_range_join(db_rw, "hg38", "bcl2l11", "bcl2l11_overlaps") rng ``` This query takes place almost instantly. A larger query of 71M variants against 1000 ranges took about 20 seconds. The usual database and [dplyr][] verbs can be used to summarize the results, e.g., the number of variants in each pathogenecity class in each exon. ```{r} rng |> dplyr::count(exon_id, am_class) |> tidyr::pivot_wider(names_from = "am_class", values_from = "n") ``` It is perhaps instructive to review the range join as a source of inspiration for other computations that might be of interest. ``` sql -- range join of 'hg38' with 'bcl2l11'; overwrite any existing table -- 'bcl2l11_overlaps' DROP TABLE IF EXISTS bcl2l11_overlaps; CREATE TEMP TABLE bcl2l11_overlaps AS SELECT hg38.*, bcl2l11.* EXCLUDE (CHROM) FROM hg38 JOIN bcl2l11 ON bcl2l11.CHROM = hg38.CHROM AND bcl2l11.start <= hg38.POS AND bcl2l11.end >= hg38.POS; ``` As best practice, disconnect from the writable database connection when work is complete. ```{r db_disconnect-rw} db_disconnect(db_rw) ``` ## From *DuckDB* to *Bioconductor* There are likely more straight-forward ways of performing the query in the previous section, e.g., by filtering `hg38` on the relevant transcript id(s), retrieving to *R*, and working with `edb` to classify variants by exon. The transcript we are interested in is `"ENST00000393256"`. Select the relevant variants and, because there are not too many, load into *R*. ```{r filter-variants} variants_of_interest <- am_data("hg38") |> dplyr::filter(transcript_id %like% "ENST00000393256%") ``` Coerce the AlphaMissense data to a `GRanges::GPos` object. ```{r gpos} gpos <- variants_of_interest |> to_GPos() ## make gpos 'genome' and 'seqlevels' like bcl2l11 GenomeInfoDb::genome(gpos) <- "GRCh38" GenomeInfoDb::seqlevelsStyle(gpos) <- "Ensembl" gpos ``` One can then use [GenomicRanges][] functionality, e.g., to count the number of variants in each exon. [GenomicRanges]: https://bioconductor.org/packages/GenomicRanges ```{r countOverlaps} countOverlaps(unlist(bcl2l11), gpos) ``` # Finally Remember to disconnect and shutdown all managed DuckDB connections. ```{r db_disconnect_all} db_disconnect_all() ``` Database connections that are not closed correctly trigger warning messages. # Session information {.unnumbered} ```{r} sessionInfo() ```