---
title: "EpiTxDb: Storing and accessing epitranscriptomic information using the AnnotationDbi interface"
author: "Felix G.M. Ernst"
date: "`r Sys.Date()`"
package: EpiTxDb
output:
  BiocStyle::html_document:
    toc: true
    toc_float: true
    df_print: paged
vignette: >
  %\VignetteIndexEntry{EpiTxDb}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: references.bib
---

```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown(css.files = c('custom.css'))
```

# Installation

```{r, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install(c("EpiTxDb","EpiTxDb.Hs.hg38"))
```

# Introduction

The epitranscriptome includes all post-transcriptional modifications of the RNA
and describes and additional layer of information encoded on RNA. Like the term
epigenome it is not about a change in nucleotide sequences, but the addition of
functional elements through modifications.

With the development of high throughput detection strategies for specific RNA
modifications, such as miCLIP and Pseudo-Seq amongst other, a large number of
modified positions have been identified and were summarized via the RMBase 
project [[@Xuan.2017;@Sun.2015]](#References) project.

To make these information avaialble within the Bioconductor universe `EpiTxDb`
was developed, which facilitates the storage of epitranscriptomic information.
More specifically, it can keep track of modification identity, position, the
enzyme for introducing it on the RNA, a specifier which determines the position
on the RNA to be modified and the literature references each modification is
associated with.

# Getting started

```{r, results="hide", include=TRUE, message=FALSE, warning=FALSE}
library(EpiTxDb)
library(EpiTxDb.Hs.hg38)
```

The class `EpiTxDb` is the class for storing the epitranscriptomic data. It 
inherits the inner workings of `AnnotationDb` class from the `AnnotationDbi`
package.

As an example for the vignette the snoRNAdb data [[@Lestrade.2006]](#References)
from the `EpiTxDb.Hs.hg38` package will be used. The data is stored in the
`AnnotationHub` and is downloaded and cached upon the first request.

```{r}
etdb <- EpiTxDb.Hs.hg38.snoRNAdb()
etdb
```

As expected for an `AnnotationDb` class the general accessors are available.

```{r}
keytypes(etdb)
columns(etdb)
head(keys(etdb, "MODID"))
select(etdb, keys = "1",
       columns = c("MODNAME","MODTYPE","MODSTART","MODSTRAND","SNNAME",
                   "RXGENENAME","SPECTYPE","SPECGENENAME"),
       keytype = "MODID")
```

The columns with the prefix `RX` or `SPEC` reference the reaction enzyme and the
location specifier. This can be the same information, but for ribosomal 
modifications from the snoRNAdb it is of course fibrillarin and a snoRNA.

In addition the following accessor for metadata are available as well.

```{r}
species(etdb)
organism(etdb)
seqlevels(etdb)
```

# Accessing RNA modifications

The specialized accessors are `modifications()` and `modificationsBy()`. 
`modifications()` allows for filtering results, whereas `modificationsBy()`
returns all the modifications in batches separated by certain information.

```{r}
modifications(etdb, columns = c("mod_id","mod_type","mod_name",
                                "rx_genename","spec_genename",
                                "ref_type","ref"),
              filter = list(mod_id = 1:3))
```

```{r}
# split by sequence name, usually a transcipt identifier
modificationsBy(etdb, by = "seqnames")
# split modification type
modificationsBy(etdb, by = "modtype")
```

# Shifting coordinates from genomic to transcritomic

Since epitranscriptomic modifications by their nature refere to each individual
it might be of interest to shift the coordinates transctipt coordinates keeping
in mind that with transcript variants multiple options exist for each of the 
transcript maturation process.

```{r, echo = FALSE}
suppressPackageStartupMessages({
  library(TxDb.Hsapiens.UCSC.hg38.knownGene)
  library(BSgenome.Hsapiens.UCSC.hg38)
})
```
```{r, eval = FALSE}
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(BSgenome.Hsapiens.UCSC.hg38)
```
```{r}
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
seqlevels(txdb) <- "chr1"
bs <- BSgenome.Hsapiens.UCSC.hg38

etdb <- EpiTxDb.Hs.hg38.RMBase()

tx <- exonsBy(txdb)
mod <- modifications(etdb, filter = list(sn_name = "chr1"))
length(mod)
```

In the following example we will focus on shifting the coordinates to individual
mature transcripts. However, keep in mind, that premature transcript might be
of interest as well and this can be controlled via the `tx` arguments of
`shiftGenomicToTranscript()`

```{r}
mod_tx <- shiftGenomicToTranscript(mod, tx)
length(mod_tx)
```

Due to multiple matches for each transcript variant the number of modifications
has increased.

With the we can plot the relative positions of modifications by type on 
`chr1` transcripts.

```{r}
mod_tx <- split(mod_tx,seqnames(mod_tx))
names <- Reduce(intersect,list(names(mod_tx),names(tx)))
# Getting the corresponding 5'-UTR and 3'-UTR annotations
fp <- fiveUTRsByTranscript(txdb)
tp <- threeUTRsByTranscript(txdb)
tx <- tx[names]
mod_tx <- mod_tx[names]
fp_m <- match(names,names(fp))
fp_m <- fp_m[!is.na(fp_m)]
tp_m <- match(names,names(tp))
tp_m <- tp_m[!is.na(tp_m)]
fp <- fp[fp_m]
tp <- tp[tp_m]

# Getting lengths of transcripts, 5'-UTR and 3'-UTR
tx_lengths <- sum(width(tx))
fp_lengths <- rep(0L,length(tx))
names(fp_lengths) <- names
fp_lengths[names(fp)] <- sum(width(fp))
tp_lengths <- rep(0L,length(tx))
names(tp_lengths) <- names
tp_lengths[names(tp)] <- sum(width(tp))

# Rescale modifications
# CDS start is at position 1L and cds end at position 1000L
from <- IRanges(fp_lengths+1L, tx_lengths - tp_lengths)
to <- IRanges(1L,1000L)
mod_rescale <- rescale(mod_tx, to, from)

# Construct result data.frame
rel_pos <- data.frame(mod_type = unlist(mcols(mod_rescale,level="within")[,"mod_type"]),
                      rel_pos = unlist(start(mod_rescale)))
rel_pos <- rel_pos[rel_pos$rel_pos < 1500 & rel_pos$rel_pos > -500,]
```

```{r}
library(ggplot2)
ggplot(rel_pos[rel_pos$mod_type %in% c("m6A","m1A","Y"),],
       aes(x = rel_pos, colour = mod_type)) + 
  geom_density()
```

# Session info

```{r}
sessionInfo()
```

<a name="References"></a>

# References