---
title: "Modstrings"
author: "Felix G.M. Ernst & Denis L.J. Lafontaine"
date: "`r Sys.Date()`"
package: Modstrings
abstract: >
  Classes for DNA and RNA sequences containing modified nucleotides
output:
  BiocStyle::html_document:
    toc: true
    toc_float: true
    df_print: paged
vignette: >
  %\VignetteIndexEntry{Modstrings}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
bibliography: references.bib
editor_options: 
  chunk_output_type: console
---

```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown(css.files = c('custom.css'))
```

# Introduction

Most nucleic acids, regardless of their being DNA or RNA, contain modified 
nucleotides, which enhances the normal function of encoding genetic information.
They have usually a regulatory function and/or modify folding behavior and 
molecular interactions.

RNA are nearly always post-transcriptionally modified. Most prominent examples
are of course ribsomal RNA (rRNA) and transfer RNA (tRNA), but in recent years
mRNA was also discovered to be post-transcriptionally modified. In addition, 
many small and long non-coding RNAs are also modified.

In many resources, like the tRNAdb [@Juehling.2009] or the modomics database
[@Boccaletto.2018], modified nucleotides are repertoried. However in the
Bioconductor context these information were not accessible, since they rely
extensively on special characters in the RNA modification alphabet.

Therefore, the `ModRNAString` class was implemented extending the `BString`
class from the `Biostrings` [@Pages.2017] package. It can store RNA sequences
containing special characters of the RNA modification alphabet and thus can
store location and identity of modifications. Functions for conversion to a
tabular format are implemented as well.

The implemented classes inherit most of the functions from the parental 
`BString` class and it derivatives, which allows them to behave like the 
normal `XString` classes within the Bioconductor context. Most of the 
functionality is directly inherited and derived from the `Biostrings` package.

Since a DNA modification alphabet also exists, a `ModDNAString` class was 
implemented as well. For details on the available letters have a look at the
[RNA modification](ModRNAString-alphabet.html) and 
[DNA modification](ModDNAString-alphabet.html alphabet vignettes.

# Creating a `ModRNAString` object

In principle `ModRNAString` and `ModDNAString` objects can be created as any 
other `XString` object. However encoding issue will most certainly come into
play, depending on the modification, the operation system and probably the R
version. This is not a problem of how the data is internally used, but how the
letter is transfered from the console to R and back.

```{r, echo=FALSE}
suppressPackageStartupMessages({
  library(Modstrings)
  library(GenomicRanges)
})
```

```{r, eval=FALSE}
library(Modstrings)
library(GenomicRanges)
```

```{r object_creation, error=TRUE, purl=FALSE}
# This works
mr <- ModRNAString("ACGU7")
# This might work on Linux, but does not on Windows
ModRNAString("ACGU≈")
# This cause a misinterpretation on Windows. Omega gets added as O. 
# This modifys the information from yW-72 (7-aminocarboxypropylwyosine) to 
# m1I (1-methylinosine)
ModRNAString("ACGUΩ")
```

To eliminate this issue the function `modifyNucleotide()` is implemented,
which can use short names or the nomenclature of a modification to add it at the 
desired position.

```{r alphabet, error=TRUE, purl=FALSE}
head(shortName(ModRNAString()))
head(nomenclature(ModRNAString()))
```

```{r object_creation2}
r <- RNAString("ACGUG")
mr2 <- modifyNucleotides(r,5L,"m7G")
mr2
mr3 <- modifyNucleotides(r,5L,"7G",nc.type = "nc")
mr3
```

In addition, one can also use the `alphabet()` function and subset to the
desired modifications.

```{r object_creation3, }
mr4 <- ModRNAString(paste0("ACGU",alphabet(ModRNAString())[33L]))
mr4
```

# Streamlining object creation and modification

To offer a more streamlined functionality, which can take more information as
input, the function `combineIntoModstrings()` is implemented. It takes a
`XString` object and a `GRanges` object with a `mod` column and returns a
`ModString` object. The information in the `mod` column must match the short
name or nomenclature of the particular modification of interest as returned by
the `shortName()` or `nomenclature()` functions as seen above.

```{r object_creation4}
gr <- GRanges("1:5", mod = "m7G")
mr5 <- combineIntoModstrings(r, gr)
mr5
```

`combineIntoModstrings()` is also implemented for `ModStringSet` objects.

```{r object_creation5}
rs <- RNAStringSet(list(r,r,r,r,r))
names(rs) <- paste0("Sequence", seq_along(rs))
gr2 <- GRanges(seqnames = names(rs)[c(1L,1L,2L,3L,3L,4L,5L,5L)],
               ranges = IRanges(start = c(4L,5L,5L,4L,5L,5L,4L,5L),
                                width = 1L),
               mod = c("D","m7G","m7G","D","m7G","m7G","D","m7G"))
gr2
mrs <- combineIntoModstrings(rs, gr2)
mrs
```

The reverse operation is also available via the function `separate()`, which
allows the positions of modifications to be transfered into a tabular format.

```{r object_separation}
gr3 <- separate(mrs)
rs2 <- RNAStringSet(mrs)
gr3
rs2
```

`modifyNucleotides()` and therefore also `combineIntoModstrings()` requires,
that the nucleotides to be modified match the originating base for the
modification. The next chunk fails, since the originating base for m7G is of
course G.

```{r object_creation_test, error=TRUE, purl=FALSE}
modifyNucleotides(r,4L,"m7G")
```

Calls for both functions check the sanity for this operation, so that the next
bit is always `TRUE`.

```{r object_comparison_teaser}
r <- RNAString("ACGUG")
mr2 <- modifyNucleotides(r,5L,"m7G")
r == RNAString(mr2)
```

# Comparing `ModString` objects

`ModString` objects can be directly compared to `RNAString` or `DNAString` 
objects depending on the type (`ModRNA` to `RNA` and `ModDNA` to `DNA`).

```{r object_comparison}
r == ModRNAString(r)
r == mr
rs == ModRNAStringSet(rs)
rs == c(mrs[1L:3L],rs[4L:5L])
```

# Conversion of `ModString` objects

`ModString` objects can be converted into each other. However any conversion 
will remove any information on modifications and revert each nucleotide back to 
its originating nucleotide.

```{r object_conversion}
RNAString(mr)
```

# Quality scaled `ModString`

Quality information can be encoded alongside `ModString` objects by combining it
with a `XStringQuality` object inside a `QualityScaledModStringSet` object. Two
class are implemented: `QualityScaledModRNAStringSet` and 
`QualityScaledModDNAStringSet`. They are usable as expected from a 
`QualityScaledXStringSet` object.

```{r object_qual}
qmrs <- QualityScaledModRNAStringSet(mrs,
                                     PhredQuality(c("!!!!h","!!!!h","!!!!h",
                                                    "!!!!h","!!!!h")))
qmrs
```

They can also be constructed/deconstructed using the functions
`combineIntoModstrings()` and `separate()` and use an additional metadata column
named `quality`. For quality information to persist during construction, set the
argument `with.qualities = TRUE`. If a `QualityScaledModStringSet` is used as an
input to separate, the quality information are returned in the `quality column`.
We choose to avoid clashes with the `score` column and not to recycle it.

```{r object_qual_sep_combine}
qgr <- separate(qmrs)
qgr
combineIntoModstrings(mrs,qgr, with.qualities = TRUE)
```

# Saving and reading `ModString` objects to/from file

The nucleotide sequences with modifications can be saved to a `fasta` or
`fastq` file using the functions `writeModStringSet()`. Reading of these files
is achieved using `readModRNAStringSet()` or `readModDNAStringSet()`. In case of
`fastq` files, the sequences can be automatically read as a
`QualityScaledModRNAStringSet` using `readQualityScaledModRNAStringSet()`
function.

```{r object_io}
writeModStringSet(mrs, file = "test.fasta")
# note the different function name. Otherwise empty qualities will be written
writeQualityScaledModStringSet(qmrs, file = "test.fastq")
mrs2 <- readModRNAStringSet("test.fasta", format = "fasta")
mrs2
qmrs2 <- readQualityScaledModRNAStringSet("test.fastq")
qmrs2
```

Since these functions are specifically designed to work with the modified 
nucleotides within the sequence, they are slower than the analogous functions
from the `Biostrings` package. This is the result of a purely R based 
implementation, whereas `Biostrings` functions are spead up through a C 
backend. This is a potential improvement for future developments, but 
currently special sequence files are limited, so it is not a priority.

```{r object_io_unlink, include=FALSE}
unlink("test.fasta")
unlink("test.fastq")
```

# Pattern matching

Pattern matching is implemented as well as expected for `XString` objects.

```{r object_pattern}
matchPattern("U7",mr)
vmatchPattern("D7",mrs)
mrl <- unlist(mrs)
matchLRPatterns("7ACGU","U7ACG",100L,mrl)
```

# Future development

In principle post-translational modifications of proteins could also be 
implemented. However, a one letter alphabet of post-translational modifications
must be developed first. If you are already aware of such an alphabet and want 
to use it in a Bioconductor context, let us know.

# Import example

This is a quick example showing how sequence information containing modified
nucleotides can be imported into an R session using the `Modstrings` package.
The file needs to be UTF-8 encoded.

```{r }
# read the lines
test <- readLines(system.file("extdata","test.fasta",package = "Modstrings"),
                      encoding = "UTF-8")
head(test,2L)
# keep every second line as sequence, the other one as name
names <- test[seq.int(from = 1L, to = 104L, by = 2L)]
seq <- test[seq.int(from = 2L, to = 104L, by = 2L)]
# sanitize input. This needs to be adapt to the individual case
names <- gsub(" ","_",
              gsub("> ","",
                   gsub(" \\| ","-",
                        names)))
seq <- gsub("-","",gsub("_","",seq))
names(seq) <- names
```
```{r }
# sanitize special characters to Modstrings equivalent
seq <- sanitizeFromModomics(seq)
seq <- ModRNAStringSet(seq)
seq
```
```{r }
# convert the contained modifications into a tabular format
separate(seq)
```

# Sessioninfo

```{r sessioninfo}
sessionInfo()
```

<a name="References"></a>

# References