---
bibliography: knitcitations.bib
csl: template.csl
css: mystyle.css
output:
BiocStyle::html_document
vignette: |
%\VignetteEngine{knitr::rmarkdown}
%\VignetteIndexEntry{rfaRm}
\usepackage[utf8]{inputenc}
---
```{r setup, echo=FALSE}
knitr::opts_chunk$set(message=FALSE, fig.path='figures/')
```
ncRNAtools
Lara Selles Vidal,
Rafael Ayala, Guy-Bart Stan, Rodrigo Ledesma-Amaro
July 24, 2020
# Abstract
Non-coding RNA (ncRNA) comprise a variety of RNA
molecules that are transcribed from DNA but are not translated into proteins.
They include a large diversity of RNA which perform key cellular functions,
including ribosomic RNA, transference RNA, ribozymes, small nuclear and small
nucleolar RNA and multiple RNAs involved in the regulation of gene expression
and chromatin structure, such as cis-regulatory elements, micro RNA,
interferenceRNA and long non-coding RNA. A key aspect in determining the
function and properties of ncRNAs is their secondary structure. This package
provides a set of tools for identifying ncRNAs of interest, determining and
plotting their secondary structure and importing and exporting the most common
file-types for ncRNA data.
# Introduction
The key role of ncRNAs in cellular processes has been well established over the
last two decades, with at least 70% of the human genome reported to be
transcribed into RNAs of different sizes. Given the fact that only around 28% of
the human genome corresponds to protein-encoding regions (including introns and
other untranslated regions), it is currently thought that the rest of the genome
encodes a high number of functional RNA molecules that are not translated to
proteins [@fu2014].
One of the main characteristic features of ncRNAs is their secondary structure,
defined by the intramolecular pairing of bases through hydrogen bonds. The
identification of RNA secondary structure is often the basis for determining
essential bases for ncRNA function, as well as for engineering novel or modified
ncRNAs, including ribozymes. Furthermore, several methodologies for the
prediction of RNA threedimensional structure require secondary structure
information as a key input [@zhiyong2011; @thiel2017].
ncRNAtools aims to facilitate the inclusion of ncRNA identification and analysis
within existing genomics and transcriptomics workflows, as well as the
application of statistical modeling for the prediction of ncRNA properties and
structure by providing a set of tools for importing, exporting, plotting and
analyzing ncRNAs and their features.
# Installation
To install ncRNAtools, start R (>=4.0) and run:
```{r tidy = TRUE, eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
# The following line initializes usage of Bioc devel, and therefore should be
# skipped in order to install the stable release version
BiocManager::install(version='devel')
BiocManager::install("ncRNAtools")
```
# Software features
A first set of ncRNAtools functionalities allows to search the [RNAcentral](https://rnacentral.org/)
database of annotated ncRNA [@rnacentral2019]. The RNAcentral database can be
searched in two different ways: by keywords, and by genomic coordinates.
Information about a specific entry can also be retrieved.
ncRNAtools also enables the prediction of secondary structure of RNA by simply
providing a sequence of interest. Secondary structure predictions are performed
through the [rtools](http://rtools.cbrc.jp/) web server for RNA Bioinformatics [@hamada2016].
Currently supported methodologies for secondary structure prediction include
centroidFold [@hamada2008], centroidHomfold [@hamada2011] and IPknot [@sato2011].
Only IPknot can detect pseudoknots. Alternative secondary structures can be
predicted with the RintW method [@hagio2018]. Utilities to determine paired
bases from a secondary structure string and viceversa are also available.
Additionally, ncRNAtools provides utilities to calculate a matrix of base pair
probabilities and visualize it as a heatmap, with the possibility to make a
composite plot including also a defined secondary structure.
Finally, ncRNAtools supports reading and writing files in the CT and Dot-Bracket
formats, the two most commonly employed formats for storing the secondary
structure of RNA.
# Examples
## Searching the RNAcentral database
The RNAcentral database can be searched by keywords with the "rnaCentralTextSearch"
function. The function returns a character vector where each element is a string
representing the RNAcentral accession for an entry that matched the search.
RNAcentral accessions are always of either the "URSXXXXXXXXXX" form, where X is
any hexadecimal numeral, or "URSXXXXXXXXXX_taxid", where taxid is the NCBI
taxonomy identifier for the species or strain corresponding to the entry.
Searches can either include only a keyword or phrase, or be given in the
field_name:"field value" syntax, with several fields separated by logical
operators as described in https://rnacentral.org/help/text-search. It should be
noted that logical operators must be capitalized.
For example, the keyword "HOTAIR" can be passed as the query to the search
function to identify entries corresponding to the HOTAIR ncRNA:
```{r include = FALSE}
library(ncRNAtools)
library(GenomicRanges)
library(IRanges)
library(S4Vectors)
library(ggplot2)
```
```{r tidy = TRUE}
rnaCentralTextSearch("HOTAIR")
```
With a more refined search, it is possible to identify RNAs corresponding to the
FMN riboswitch present in Bacillus subtilis strains only. It should be noted
that the quotes enclosing the value for specific fields must be escaped:
```{r tidy = TRUE}
rnaCentralTextSearch("FMN AND species:\"Bacillus subtilis\"")
```
Information about the resulting entries can be retrieved with the
"rnaCentralRetrieveEntry" function:
```{r tidy = TRUE}
rnaCentralRetrieveEntry("URS000037084E_1423")
```
It is also possible to perform a search of the RNAcentral database by specifying
genomic coordinates with the "rnaCentralGenomicCoordinatesSearch" function.
A GRanges object must be provided, with sequence names of the **chr?** or simply
**?** format (where **?** denotes any number, the X or Y characters or another
letter for organisms where chromosomes are named with letters), or, alternatively,
"MT" to denote mitochondrial DNA. The species name must also be provided as a
string with the scientific name. A list of organisms for which search by genomic
coordinates is supported can be found at https://rnacentral.org/help/genomic-mapping.
Several ranges can be provided in the same GRanges object, but they must refer
to the same organism.
In the following example, we aim to retrieve the known ncRNA present in Yarrowia
lipolytica (oleaginous yeast) between positions 200000 and 300000 of
chromosome C, and positions 500000 and 550000 of chromosome D:
```{r tidy = TRUE, tidy.opts=list(width.cutoff = 80)}
## Generate a GRanges object with the genomic ranges to be used to search the RNAcentral database
genomicCoordinates <- GenomicRanges::GRanges(seqnames=S4Vectors::Rle(c("chrC", "chrD")),
ranges=IRanges::IRanges(c(200000, 500000), c(300000, 550000)))
## Use the generated GRanges object to search the RNAcentral database
RNAcentralHits <- rnaCentralGenomicCoordinatesSearch(genomicCoordinates, "Yarrowia lipolytica")
## Check the number of hits in each provided genomic range
length(RNAcentralHits[[1]]) # 22 known ncRNA between positions 200000 and 300000 of chromosome C
length(RNAcentralHits[[2]]) # No known ncRNA between positions 500000 and 550000 of chromosome D
```
## Predicting the secondary structure of an RNA
The "predictSecondaryStructure" function enables the prediction of the secondary
structure of RNA through three different methods: centroidFold [@hamada2008],
centroidHomfold [@hamada2011] and IPknot [@sato2011]. IPknot has the advantage
of being able to detect pseudoknots.
The secondary structure predictions by centroidFold and centroidHomfold are
output in the Dot-Bracket format, where dots (".") represent unpaired bases and
pairs of open and closed round brackets ("("-")") represent paired bases. Such
format cannot unambiguously represent nested structures such as pseudoknots.
Therefore, the output of IPknot is provided, if required, in the extended
Dot-Bracket format, where additional pairs of symbols are used to represent such
nested structures.
In the following example, all three methods are used to predict the structure of
a fragment of a tRNA:
```{r tidy = TRUE}
tRNAfragment <- "UGCGAGAGGCACAGGGUUCGAUUCCCUGCAUCUCCA"
centroidFoldPrediction <- predictSecondaryStructure(tRNAfragment, "centroidFold")
centroidFoldPrediction$secondaryStructure
centroidHomFoldPrediction <- predictSecondaryStructure(tRNAfragment, "centroidHomFold")
centroidFoldPrediction$secondaryStructure
IPknotPrediction <- predictSecondaryStructure(tRNAfragment, "IPknot")
IPknotPrediction$secondaryStructure
```
It is also possible to predict not only a single canonical secondary structure,
but also a set of possible alternative secondary structures. This is achieved
with the RintW method [@hagio2018], available through the
"predictAlternativeSecondaryStructures" function.
```{r tidy = TRUE}
tRNAfragment2 <- "AAAGGGGUUUCCC"
RintWPrediction <- predictAlternativeSecondaryStructures(tRNAfragment2)
length(RintWPrediction) # A total of 2 alternative secondary structures were identified by RintW
RintWPrediction[[1]]$secondaryStructure
RintWPrediction[[2]]$secondaryStructure
```
## Calculating and plotting of base pair probability matrices
Both centroidFold and centroidHomfold return not only a prediction of the
secondary structure of an RNA, but also a list of probabilities for potential
base pairs for each nucleotide. Such a matrix can be useful in assessing the
potential presence of alternative secondary structures.
A matrix of base pair probabilities can be generated with the
"generatePairsProbabilityMatrix", which takes as input the probabilities in the
format returned by centroidFold and centroidHomFold. The base pair probabilities
matrix can then be plotted with the "plotPairsProbabilityMatrix" function:
```{r tidy = TRUE}
basePairProbabilityMatrix <- generatePairsProbabilityMatrix(centroidFoldPrediction$basePairProbabilities)
plotPairsProbabilityMatrix(basePairProbabilityMatrix)
```
A composite plot, where the top right triangle represents the base pair
probabilities while the bottom left triangle represents base pairings in a
specific secondary structure, can be generated with the "plotCompositePairsMatrix"
function. Usually, base pairs in a secondary structure correspond to pairs with
a high probability. The function takes as its input a matrix of base pair
probabilities, and a dataframe with bases paired in a given secondary structure.
Such a dataframe can be generated from a secondary structure string with the
"findPairedBases" function.
In the following example, such a composite plot is generated the secondary
structure predicted by centroidFold for the tRNA fragment:
```{r tidy = TRUE}
pairedBases <- findPairedBases(sequence=tRNAfragment, secondaryStructureString=IPknotPrediction$secondaryStructure)
plotCompositePairsMatrix(basePairProbabilityMatrix, pairedBases)
```
## Reading and writing files and other utilities
ncRNAtools supports reading and writing files in the CT and Dot-Bracket format.
For a description of each format, see http://projects.binf.ku.dk/pgardner/bralibase/RNAformats.html
and https://rna.urmc.rochester.edu/Text/File_Formats.html.
CT files are read with the "readCT" function. Some CT files contain a line for
each nucleotide of the RNA sequence, while others contain lines only for paired
nucleotides. In the second case, the full sequence must be provided when calling
"readCT".
The list returned by "readCT" includes an element named "pairsTable", consisting
of a dataframe indicating paired nucleotides. A corresponding secondary
structure string in the Dot-Bracket format can be generated with the
"pairsToSecondaryStructure" function.
CT files are written with the "writeCT" function. A complete CT file is always
output by "writeCT".
Files in the Dot-Bracket format are read and written with the "readDotBracket"
and "writeDotBracket" functions respectively. If the read Dot-Bracket file
contains an energy value for the represented structure, it will be included in
the output of "readDotBracket".
```{r tidy = TRUE}
## Read an example CT file corresponding to E. coli tmRNA
exampleCTFile <- system.file("extdata", "exampleCT.ct", package="ncRNAtools")
tmRNASequence <- "GGGGCUGAUUCUGGAUUCGACGGGAUUUGCGAAACCCAAGGUGCAUGCCGAGGGGCGGUUGGCCUCGUAAAAAGCCGCAAAAAAUAGUCGCAAACGACGAAAACUACGCUUUAGCAGCUUAAUAACCUGCUUAGAGCCCUCUCUCCCUAGCCUCCGCUCUUAGGACGGGGAUCAAGAGAGGUCAAACCCAAAAGAGAUCGCGUGGAAGCCCUGCCUGGGGUUGAAGCGUUAAAACUUAAUCAGGCUAGUUUGUUAGUGGCGUGUCCGUCCGCAGCUGGCAAGCGAAUGUAAAGACUGACUAAGCAUGUAGUACCGAGGAUGUAGGAAUUUCGGACGCGGGUUCAACUCCCGCCAGCUCCACCA"
tmRNASecondaryStructure <- readCT(exampleCTFile, tmRNASequence)
## Write a complete CT file for E. coli tmRNA
tempDir <- tempdir()
testCTFile <- paste(tempDir, "testCTfile.ct", sep="")
tmRNASecondaryStructureString <- pairsToSecondaryStructure(pairedBases=tmRNASecondaryStructure$pairsTable, sequence=tmRNASequence)
writeCT(testCTFile, sequence=tmRNASequence,
secondaryStructure=tmRNASecondaryStructureString, sequenceName="tmRNA")
## Read an example Dot-Bracket file
exampleDotBracketFile <- system.file("extdata", "exampleDotBracket.dot",
package="ncRNAtools")
exampleDotBracket <- readDotBracket(exampleDotBracketFile)
exampleDotBracket$freeEnergy # The structure has a free energy of -41.2 kcal/mol
## Write a Dot-Bracket file
tempDir2 <- tempdir()
testDotBracketFile <- paste(tempDir2, "testDotBracketFile.dot", sep="")
writeDotBracket(testDotBracketFile, sequence=exampleDotBracket$sequence,
secondaryStructure=exampleDotBracket$secondaryStructure,
sequenceName="Test sequence")
```
ncRNAtools includes an utility to flatten representations of secondary structure
in the extended Dot-Bracket format to basic Dot-Bracket format (i.e., comprising
only the ".", "(" and ")" characters). Although information is potentially lost
upon such transformation, this is required for some RNA Bioinformatics software,
which can only accept secondary structures in the Dot-Bracket format as input.
This can be achieved with the "flattenDotBracket" function.
```{r tidy = TRUE}
extendedDotBracketString <- "...((((..[[[.))))]]]..."
plainDotBracketString <- flattenDotBracket(extendedDotBracketString)
```
# Session info
```{r tidy = TRUE}
sessionInfo()
```
# References