---
title: "A. Introduction"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{A. Introduction}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
```

Original version: 26 September 2023

# Introduction

The AlphaMissense [publication][Science] outlines how a variant of
DeepMind's AlphaFold was used to predict missense variant
pathogenicity. Supporting data on [Zenodo][] include, for instance,
70+ million variants across the hg19 and hg38 genome builds. The
AlphaMissenseR package provides ready access to these data,
downloading individual files into DuckDB databases for exploration
and integration into *R* and *Bioconductor* workflows.

[Science]: https://www.science.org/doi/epdf/10.1126/science.adg7492
[Zenodo]: https://zenodo.org//record/8360242

Install the package from *Bioconductor* or GitHub, ensuring correct
*Bioconductor* dependencies.

```{r install, eval = FALSE}
if (!"BiocManager" %in% rownames(installed.packages()))
    install.packages("BiocManager", repos = "https://cran.r-project.org")
```

When the package is available on *Bioconductor*, use

```{r install-Bioconductor, eval = FALSE}
if (BiocManager::version() >= "3.19") {
    BiocManager::install("AlphaMissenseR")
} else {
    stop(
        "'AlphaMissenseR' requires Bioconductor version 3.19 or later, ",
        "install from GitHub?"
    )
}
```

Use the pre-release or devel version with

```{r install-devel, eval = FALSE}
## 'remotes' is required for installation from GitHub
if (!requireNamespace("remotes", quietly = TRUE))
    install.packages("remotes")
remotes::install_github(
    "mtmorgan/AlphaMissenseR",
    repos = BiocManager::repositories()
)
```

Load the library.

```{r setup, message = FALSE}
library(AlphaMissenseR)
```

Learn about available data by visiting the Zenodo record from which
the data are retrieved.

```{r am_browse, eval = FALSE}
am_browse()
```

# Discovery, retrieval and use

Use `am_available()` to discover data resources available for
representation in DuckDB databases. The `cached` column is initially
`FALSE` for all data sets; `TRUE` indicates that a data set has been
downloaded by `am_data()`, as described below.

```{r am_available}
am_available()
```

The available data sets are from the most recent Zenodo record as of
25 September 2023. An alternative record can be used by supplying the
`record=` argument, or globally by setting the environment variable
`ALPHAMISSENSE_RECORD` *before* loading the package.
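
For example, a minimal sketch (not evaluated); the identifier
`"8360242"` is the record in the Zenodo URL above, and `am_available()`
is assumed here as one place the `record=` argument can be supplied.

```{r alternative-record, eval = FALSE}
## select a Zenodo record globally, *before* loading the package
Sys.setenv(ALPHAMISSENSE_RECORD = "8360242")
library(AlphaMissenseR)

## ...or supply the record to an individual call
am_available(record = "8360242")
```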

Use `am_data()` to download a data resource and store it in a DuckDB
database. The `key=` argument is a value from the `key` column
returned by `am_available()`. Files are cached locally (using
[BiocFileCache][]), so the download is expensive only the first
time. Each `record=` is stored in a separate database.

[BiocFileCache]: https://bioconductor.org/packages/BiocFileCache

```{r am_data}
tbl <- am_data("hg38")
tbl
```

The return value `tbl` is a table from a DuckDB database. A
(read-only) connection to the database itself is available with

```{r db_connect}
db <- db_connect()
```

This connection remains open throughout the session; call
`db_disconnect(db)` or `db_disconnect_all()` to close it at the end of
the session.
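
For example (not evaluated here, because the connection is used
throughout this vignette):

```{r db_disconnect-example, eval = FALSE}
## close this specific connection...
db_disconnect(db)

## ...or close all connections managed by the package
db_disconnect_all()
```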

The database contains a table for each key that has been
downloaded. As an alternative to `am_available()` / `am_data()`, view
the available tables and create a [dplyr][] / [dbplyr][] tibble for
the table of interest.

[dplyr]: https://cran.r-project.org/package=dplyr
[dbplyr]: https://cran.r-project.org/package=dbplyr

```{r am_data-duckdb}
db_tables(db)

tbl <- tbl(db, "hg38")
tbl
```

It is fast and straightforward to summarize the data, e.g., the
number of variants assigned to each pathogenicity class.

```{r db-am_class}
tbl |>
    count(am_class)
```

Or the average pathogenicity score in each class...

```{r db-pathogenicity}
tbl |>
    group_by(am_class) |>
    summarize(n = n(), pathogenicity = mean(am_pathogenicity, na.rm = TRUE))
```

Or the number of transitions between `REF` and `ALT` nucleotides
across all variants.

```{r REF-ALT}
tbl |>
    count(REF, ALT) |>
    tidyr::pivot_wider(names_from = "ALT", values_from = "n") |>
    select("REF", "A", "C", "G", "T") |>
    arrange(REF)
```

It is straightforward to select variants in individual regions of
interest, e.g., the first 200,000 nucleotides of chromosome 4.

```{r}
tbl |>
    filter(CHROM == "chr4", POS > 0, POS <= 200000)
```
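
The result is a 'lazy' tibble, with computations performed in the
database. When an in-memory copy is useful for downstream work, the
filtered result can be brought into *R* with `dplyr::collect()`. A
minimal sketch (not evaluated; the object name `chr4_variants` is
illustrative only):

```{r collect-sketch, eval = FALSE}
chr4_variants <-
    tbl |>
    filter(CHROM == "chr4", POS > 0, POS <= 200000) |>
    ## bring the result from DuckDB into an in-memory tibble
    dplyr::collect()
```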

# Working with Bioconductor

This section illustrates how AlphaMissense data can be integrated with
other *Bioconductor* workflows, including the [GenomicRanges][]
infrastructure and the [ensembldb][] package / [AnnotationHub][]
resources.

[GenomicRanges]: https://bioconductor.org/packages/GenomicRanges
[ensembldb]: https://bioconductor.org/packages/ensembldb
[AnnotationHub]: https://bioconductor.org/packages/AnnotationHub

## Genomic ranges

The [GenomicRanges][] infrastructure provides standard data structures
that allow range-based operations across Bioconductor packages. The
`GPos` data structure provides a convenient and memory-efficient
representation of single-nucleotide variants like those in the `hg19`
and `hg38` AlphaMissense resources. Start by installing (if necessary)
the [GenomicRanges][] package.

```{r, eval = FALSE}
if (!requireNamespace("GenomicRanges", quietly = TRUE))
    BiocManager::install("GenomicRanges")
```

Select the `hg38` data resource, and filter to a subset of variants;
`GPos` is an in-memory data structure that easily accommodates tens
of millions of variants. Start by installing (if necessary)

```{r}
tbl <-
    am_data("hg38") |>
    filter(CHROM == "chr2", POS < 10000000, REF == "G")
```

Use `to_GPos()` to coerce to a `GPos` object.

```{r}
gpos <-
    tbl |>
    to_GPos()
gpos
```

Vignettes in the [GenomicRanges][] package illustrate use of these
objects.

```{r, eval = FALSE}
utils::browseVignettes("GenomicRanges")
```

## GRCh38 annotation resources

Start by identifying the most recent EnsDb resource for *Homo
sapiens*.

```{r ahub-homo-sap}
hub <- AnnotationHub::AnnotationHub()
AnnotationHub::query(hub, c("EnsDb", "Homo sapiens"))

## summarize the resource of interest
hub["AH113665"]
```

Load the [ensembldb][] library and retrieve the record. Unfortunately,
there are many conflicts between function names in different packages,
so it is necessary to fully qualify functions with the package in
which they are defined.

```{r AnnotationHub, message = FALSE}
library(ensembldb)
edb <- AnnotationHub::AnnotationHub()[["AH113665"]]
edb
```

## From *Bioconductor* to *DuckDB*

In this section we will work more directly with the database,
including writing temporary tables. For illustration purposes, we use
a connection that can read and write; in practice, read-only
permissions are sufficient for creating temporary tables.

```{r db_rw}
db_rw <- db_connect(read_only = FALSE)
```

As an illustration, use [ensembldb][] to identify the exons of the
canonical transcript of a particular gene.

```{r tx}
bcl2l11 <-
    edb |>
    ensembldb::filter(
        ~ symbol == "BCL2L11" &
            tx_biotype == "protein_coding" &
            tx_is_canonical == TRUE
    ) |>
    exonsBy("tx")
bcl2l11
```

Munge the data to a tibble, transforming `seqnames` into a `CHROM`
column with identifiers such as `"chr1"` (as in the AlphaMissense
data). Write the tibble to a temporary table (it will be deleted when
`db_rw` is disconnected from the database) so that it can be used in
'lazy' SQL queries with the AlphaMissense data.

```{r temp-table}
bcl2l11_tbl <-
    bcl2l11 |>
    dplyr::as_tibble() |>
    dplyr::mutate(CHROM = paste0("chr", seqnames)) |>
    dplyr::select(CHROM, everything(), -seqnames)

db_temporary_table(db_rw, bcl2l11_tbl, "bcl2l11")
```

The temporary table is now available on the `db_rw` connection; it
will be removed on disconnect, i.e., after `db_disconnect(db_rw)`.

```{r db_tables-rw}
"bcl2l11" %in% db_tables(db_rw)
```

Use `db_range_join()` to join the AlphaMissense data with the ranges
defining the exons in our gene of interest. The arguments are the
database connection, the AlphaMissense table of interest (this table
must have columns `CHROM` and `POS`), the table containing ranges of
interest (with columns `CHROM`, `start`, `end`), and the temporary
table to contain the results. A range join is like a standard database
join, except that the constraints can be relations, in our case that
`POS >= start` and `POS <= end` for each range of interest;
implementation details are in this [DuckDB blog][]. The range join
uses closed intervals (the start and end positions are included in the
query), following *Bioconductor* convention. Writing to a temporary
table avoids bringing potentially large datasets into R memory, and
makes the table available for subsequent manipulation in the current
session.

[DuckDB blog]: https://duckdb.org/2022/05/27/iejoin.html

```{r range-join}
rng <- db_range_join(db_rw, "hg38", "bcl2l11", "bcl2l11_overlaps")
rng
```

This query takes place almost instantly. A larger query of 71M
variants against 1000 ranges took about 20 seconds.

The usual database and [dplyr][] verbs can be used to summarize the
results, e.g., the number of variants in each pathogenicity class in
each exon.

```{r}
rng |>
    dplyr::count(exon_id, am_class) |>
    tidyr::pivot_wider(names_from = "am_class", values_from = "n")
```

It is perhaps instructive to review the SQL underlying the range
join, as a source of inspiration for other computations that might be
of interest.

``` sql
-- range join of 'hg38' with 'bcl2l11'; overwrite any existing table
-- 'bcl2l11_overlaps'
DROP TABLE IF EXISTS bcl2l11_overlaps;
CREATE TEMP TABLE bcl2l11_overlaps AS
SELECT
    hg38.*,
    bcl2l11.* EXCLUDE (CHROM)
FROM hg38
JOIN bcl2l11
    ON bcl2l11.CHROM = hg38.CHROM
    AND bcl2l11.start <= hg38.POS
    AND bcl2l11.end >= hg38.POS;
```

As best practice, disconnect from the writable database connection
when work is complete.

```{r db_disconnect-rw}
db_disconnect(db_rw)
```

## From *DuckDB* to *Bioconductor*

There are likely more straightforward ways of performing the query in
the previous section, e.g., by filtering `hg38` on the relevant
transcript id(s), retrieving to *R*, and working with `edb` to
classify variants by exon. The transcript we are interested in is
`"ENST00000393256"`.

Select the relevant variants and, because there are not too many, load
into *R*.

```{r filter-variants}
variants_of_interest <-
    am_data("hg38") |>
    dplyr::filter(transcript_id %like% "ENST00000393256%")
```

Coerce the AlphaMissense data to a `GenomicRanges::GPos` object.

```{r gpos}
gpos <-
    variants_of_interest |>
    to_GPos()
## make gpos 'genome' and 'seqlevels' like bcl2l11
GenomeInfoDb::genome(gpos) <- "GRCh38"
GenomeInfoDb::seqlevelsStyle(gpos) <- "Ensembl"
gpos
```

One can then use [GenomicRanges][] functionality, e.g., to count the
number of variants in each exon.


```{r countOverlaps}
countOverlaps(unlist(bcl2l11), gpos)
```

# Finally

Remember to disconnect from and shut down all managed DuckDB connections.

```{r db_disconnect_all}
db_disconnect_all()
```

Database connections that are not closed correctly trigger warning
messages.

# Session information {.unnumbered}

```{r}
sessionInfo()
```