---
bibliography: knitcitations.bib
csl: template.csl
css: mystyle.css
output:
  BiocStyle::html_document
vignette: |
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{rfaRm}
  \usepackage[utf8]{inputenc}
---


```{r setup, echo=FALSE}
knitr::opts_chunk$set(message=FALSE, fig.path='figures/')
```

<!-- Title block -->
<br>
<p style="text-align:center;font-size:200%;color:Red"> ncRNAtools </p>
<p style="text-align:center;font-size:90%;color:Gray"> Lara Selles Vidal,
Rafael Ayala, Guy-Bart Stan, Rodrigo Ledesma-Amaro </p>
<p style="text-align:center;font-size:90%;color:Gray"> July 24, 2020 </p>
<br>

# Abstract
<p style="font-size:90%"> Non-coding RNA (ncRNA) comprise a variety of RNA 
molecules that are transcribed from DNA but are not translated into proteins.  
They include a large diversity of RNA which perform key cellular functions, 
including ribosomic RNA, transference RNA, ribozymes, small nuclear and small 
nucleolar RNA and multiple RNAs involved in the regulation of gene expression 
and chromatin structure, such as cis-regulatory elements, micro RNA,
interferenceRNA and long non-coding RNA. A key aspect in determining the 
function and properties of ncRNAs is their secondary structure. This package 
provides a set of tools for identifying ncRNAs of interest, determining and 
plotting their secondary structure and importing and exporting the most common 
file-types for ncRNA data.  <br><br> </p>

# Introduction

The key role of ncRNAs in cellular processes has been well established over the 
last two decades, with at least 70% of the human genome reported to be 
transcribed into RNAs of different sizes. Given the fact that only around 28% of
the human genome corresponds to protein-encoding regions (including introns and 
other untranslated regions), it is currently thought that the rest of the genome
encodes a high number of functional RNA molecules that are not translated to 
proteins [@fu2014].

One of the main characteristic features of ncRNAs is their secondary structure, 
defined by the intramolecular pairing of bases through hydrogen bonds. The 
identification of RNA secondary structure is often the basis for determining 
essential bases for ncRNA function, as well as for engineering novel or modified 
ncRNAs, including ribozymes. Furthermore, several methodologies for the 
prediction of RNA threedimensional structure require secondary structure 
information as a key input [@zhiyong2011; @thiel2017]. 

ncRNAtools aims to facilitate the inclusion of ncRNA identification and analysis
within existing genomics and transcriptomics workflows, as well as the 
application of statistical modeling for the prediction of ncRNA properties and 
structure by providing a set of tools for importing, exporting, plotting and 
analyzing ncRNAs and their features.

# Installation

To install ncRNAtools, start R (>=4.0) and run:

```{r tidy = TRUE, eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# The following line initializes usage of Bioc devel, and therefore should be
# skipped in order to install the stable release version
BiocManager::install(version='devel')

BiocManager::install("ncRNAtools")
```

# Software features

A first set of ncRNAtools functionalities allows to search the [RNAcentral](https://rnacentral.org/) 
database of annotated ncRNA [@rnacentral2019]. The RNAcentral database can be 
searched in two different ways: by keywords, and by genomic coordinates. 
Information about a specific entry can also be retrieved.

ncRNAtools also enables the prediction of secondary structure of RNA by simply 
providing a sequence of interest. Secondary structure predictions are performed 
through the [rtools](http://rtools.cbrc.jp/) web server for RNA Bioinformatics [@hamada2016]. 
Currently supported methodologies for secondary structure prediction include 
centroidFold [@hamada2008], centroidHomfold [@hamada2011] and IPknot [@sato2011]. 
Only IPknot can detect pseudoknots. Alternative secondary structures can be 
predicted with the RintW method [@hagio2018]. Utilities to determine paired 
bases from a secondary structure string and viceversa are also available.

Additionally, ncRNAtools provides utilities to calculate a matrix of base pair 
probabilities and visualize it as a heatmap, with the possibility to make a 
composite plot including also a defined secondary structure.

Finally, ncRNAtools supports reading and writing files in the CT and Dot-Bracket
formats, the two most commonly employed formats for storing the secondary 
structure of RNA.

# Examples

## Searching the RNAcentral database

The RNAcentral database can be searched by keywords with the "rnaCentralTextSearch"
function. The function returns a character vector where each element is a string
representing the RNAcentral accession for an entry that matched the search. 
RNAcentral accessions are always of either the "URSXXXXXXXXXX" form, where X is 
any hexadecimal numeral, or "URSXXXXXXXXXX_taxid", where taxid is the NCBI 
taxonomy identifier for the species or strain corresponding to the entry. 
Searches can either include only a keyword or phrase, or be given in the 
field_name:"field value" syntax, with several fields separated by logical 
operators as described in https://rnacentral.org/help/text-search. It should be 
noted that logical operators must be capitalized.

For example, the keyword "HOTAIR" can be passed as the query to the search 
function to identify entries corresponding to the HOTAIR ncRNA:

```{r include = FALSE}
library(ncRNAtools)
library(GenomicRanges)
library(IRanges)
library(S4Vectors)
library(ggplot2)
```

```{r tidy = TRUE}
rnaCentralTextSearch("HOTAIR")
```

With a more refined search, it is possible to identify RNAs corresponding to the
FMN riboswitch present in Bacillus subtilis strains only. It should be noted 
that the quotes enclosing the value for specific fields must be escaped:

```{r tidy = TRUE}
rnaCentralTextSearch("FMN AND species:\"Bacillus subtilis\"")
```

Information about the resulting entries can be retrieved with the 
"rnaCentralRetrieveEntry" function:

```{r tidy = TRUE}
rnaCentralRetrieveEntry("URS000037084E_1423")
```

It is also possible to perform a search of the RNAcentral database by specifying
genomic coordinates with the "rnaCentralGenomicCoordinatesSearch" function. 
A GRanges object must be provided, with sequence names of the **chr?** or simply
**?** format (where **?** denotes any number, the X or Y characters or another 
letter for organisms where chromosomes are named with letters), or, alternatively,
"MT" to denote mitochondrial DNA. The species name must also be provided as a 
string with the scientific name. A list of organisms for which search by genomic
coordinates is supported can be found at https://rnacentral.org/help/genomic-mapping. 
Several ranges can be provided in the same GRanges object, but they must refer
to the same organism.

In the following example, we aim to retrieve the known ncRNA present in Yarrowia
lipolytica (oleaginous yeast) between positions 200000 and 300000 of 
chromosome C, and positions 500000 and 550000 of chromosome D:

```{r tidy = TRUE, tidy.opts=list(width.cutoff = 80)}
## Generate a GRanges object with the genomic ranges to be used to search the RNAcentral database

genomicCoordinates <- GenomicRanges::GRanges(seqnames=S4Vectors::Rle(c("chrC", "chrD")), 
                                             ranges=IRanges::IRanges(c(200000, 500000), c(300000, 550000)))

## Use the generated GRanges object to search the RNAcentral database

RNAcentralHits <- rnaCentralGenomicCoordinatesSearch(genomicCoordinates, "Yarrowia lipolytica")

## Check the number of hits in each provided genomic range

length(RNAcentralHits[[1]]) # 22 known ncRNA between positions 200000 and 300000 of chromosome C
length(RNAcentralHits[[2]]) # No known ncRNA between positions 500000 and 550000 of chromosome D

```

## Predicting the secondary structure of an RNA

The "predictSecondaryStructure" function enables the prediction of the secondary
structure of RNA through three different methods: centroidFold [@hamada2008], 
centroidHomfold [@hamada2011] and IPknot [@sato2011]. IPknot has the advantage 
of being able to detect pseudoknots.

The secondary structure predictions by centroidFold and centroidHomfold are 
output in the Dot-Bracket format, where dots (".") represent unpaired bases and 
pairs of open and closed round brackets ("("-")") represent paired bases. Such 
format cannot unambiguously represent nested structures such as pseudoknots. 
Therefore, the output of IPknot is provided, if required, in the extended 
Dot-Bracket format, where additional pairs of symbols are used to represent such
nested structures.

In the following example, all three methods are used to predict the structure of
a fragment of a tRNA:

```{r tidy = TRUE}
tRNAfragment <- "UGCGAGAGGCACAGGGUUCGAUUCCCUGCAUCUCCA"

centroidFoldPrediction <- predictSecondaryStructure(tRNAfragment, "centroidFold")
centroidFoldPrediction$secondaryStructure

centroidHomFoldPrediction <- predictSecondaryStructure(tRNAfragment, "centroidHomFold")
centroidFoldPrediction$secondaryStructure

IPknotPrediction <- predictSecondaryStructure(tRNAfragment, "IPknot")
IPknotPrediction$secondaryStructure
```

It is also possible to predict not only a single canonical secondary structure, 
but also a set of possible alternative secondary structures. This is achieved 
with the RintW method [@hagio2018], available through the 
"predictAlternativeSecondaryStructures" function.

```{r tidy = TRUE}
tRNAfragment2 <- "AAAGGGGUUUCCC"

RintWPrediction <- predictAlternativeSecondaryStructures(tRNAfragment2)

length(RintWPrediction) # A total of 2 alternative secondary structures were identified by RintW

RintWPrediction[[1]]$secondaryStructure
RintWPrediction[[2]]$secondaryStructure
```

## Calculating and plotting of base pair probability matrices

Both centroidFold and centroidHomfold return not only a prediction of the 
secondary structure of an RNA, but also a list of probabilities for potential 
base pairs for each nucleotide. Such a matrix can be useful in assessing the 
potential presence of alternative secondary structures.

A matrix of base pair probabilities can be generated with the 
"generatePairsProbabilityMatrix", which takes as input the probabilities in the 
format returned by centroidFold and centroidHomFold. The base pair probabilities
matrix can then be plotted with the "plotPairsProbabilityMatrix" function:

```{r tidy = TRUE}
basePairProbabilityMatrix <- generatePairsProbabilityMatrix(centroidFoldPrediction$basePairProbabilities)

plotPairsProbabilityMatrix(basePairProbabilityMatrix)
```

A composite plot, where the top right triangle represents the base pair 
probabilities while the bottom left triangle represents base pairings in a 
specific secondary structure, can be generated with the "plotCompositePairsMatrix"
function. Usually, base pairs in a secondary structure correspond to pairs with 
a high probability. The function takes as its input a matrix of base pair 
probabilities, and a dataframe with bases paired in a given secondary structure.
Such a dataframe can be generated from a secondary structure string with the 
"findPairedBases" function.

In the following example, such a composite plot is generated the secondary
structure predicted by centroidFold for the tRNA fragment:

```{r tidy = TRUE}
pairedBases <- findPairedBases(sequence=tRNAfragment, secondaryStructureString=IPknotPrediction$secondaryStructure)

plotCompositePairsMatrix(basePairProbabilityMatrix, pairedBases)
```

## Reading and writing files and other utilities

ncRNAtools supports reading and writing files in the CT and Dot-Bracket format. 
For a description of each format, see http://projects.binf.ku.dk/pgardner/bralibase/RNAformats.html 
and https://rna.urmc.rochester.edu/Text/File_Formats.html.

CT files are read with the "readCT" function. Some CT files contain a line for 
each nucleotide of the RNA sequence, while others contain lines only for paired 
nucleotides. In the second case, the full sequence must be provided when calling 
"readCT".

The list returned by "readCT" includes an element named "pairsTable", consisting
of a dataframe indicating paired nucleotides. A corresponding secondary 
structure string in the Dot-Bracket format can be generated with the 
"pairsToSecondaryStructure" function.

CT files are written with the "writeCT" function. A complete CT file is always 
output by "writeCT".

Files in the Dot-Bracket format are read and written with the "readDotBracket" 
and "writeDotBracket" functions respectively. If the read Dot-Bracket file 
contains an energy value for the represented structure, it will be included in 
the output of "readDotBracket".

```{r tidy = TRUE}
## Read an example CT file corresponding to E. coli tmRNA

exampleCTFile <- system.file("extdata", "exampleCT.ct", package="ncRNAtools")

tmRNASequence <- "GGGGCUGAUUCUGGAUUCGACGGGAUUUGCGAAACCCAAGGUGCAUGCCGAGGGGCGGUUGGCCUCGUAAAAAGCCGCAAAAAAUAGUCGCAAACGACGAAAACUACGCUUUAGCAGCUUAAUAACCUGCUUAGAGCCCUCUCUCCCUAGCCUCCGCUCUUAGGACGGGGAUCAAGAGAGGUCAAACCCAAAAGAGAUCGCGUGGAAGCCCUGCCUGGGGUUGAAGCGUUAAAACUUAAUCAGGCUAGUUUGUUAGUGGCGUGUCCGUCCGCAGCUGGCAAGCGAAUGUAAAGACUGACUAAGCAUGUAGUACCGAGGAUGUAGGAAUUUCGGACGCGGGUUCAACUCCCGCCAGCUCCACCA"

tmRNASecondaryStructure <- readCT(exampleCTFile, tmRNASequence)

## Write a complete CT file for E. coli tmRNA

tempDir <- tempdir()
testCTFile <- paste(tempDir, "testCTfile.ct", sep="")

tmRNASecondaryStructureString <- pairsToSecondaryStructure(pairedBases=tmRNASecondaryStructure$pairsTable, sequence=tmRNASequence)

writeCT(testCTFile, sequence=tmRNASequence, 
        secondaryStructure=tmRNASecondaryStructureString, sequenceName="tmRNA")

## Read an example Dot-Bracket file

exampleDotBracketFile <- system.file("extdata", "exampleDotBracket.dot", 
                                     package="ncRNAtools")

exampleDotBracket <- readDotBracket(exampleDotBracketFile)

exampleDotBracket$freeEnergy # The structure has a free energy of -41.2 kcal/mol

## Write a Dot-Bracket file

tempDir2 <- tempdir()
testDotBracketFile <- paste(tempDir2, "testDotBracketFile.dot", sep="")

writeDotBracket(testDotBracketFile, sequence=exampleDotBracket$sequence, 
                secondaryStructure=exampleDotBracket$secondaryStructure, 
                sequenceName="Test sequence")
```

ncRNAtools includes an utility to flatten representations of secondary structure
in the extended Dot-Bracket format to basic Dot-Bracket format (i.e., comprising
only the ".", "(" and ")" characters). Although information is potentially lost 
upon such transformation, this is required for some RNA Bioinformatics software,
which can only accept secondary structures in the Dot-Bracket format as input. 
This can be achieved with the "flattenDotBracket" function.

```{r tidy = TRUE}
extendedDotBracketString <- "...((((..[[[.))))]]]..."

plainDotBracketString <- flattenDotBracket(extendedDotBracketString)
```

# Session info

```{r tidy = TRUE}
sessionInfo()
```

# References