---
title: "mobileRNA: Investigate the RNA mobilome & population-scale changes."
author: "Katie Jeynes-Cupper, Marco Catoni"
date: "`r format(Sys.Date(), '%m/%d/%Y')`"
abstract: >
Genomic analysis can be utilised to identify differences between RNA
populations in two conditions, both in production and abundance. This includes
the identification of RNAs produced by multiple genomes within a biological
system. For example, RNA produced by pathogens within a host. Or for our
purposes, RNAs moving between the roots and shoots across a plant graft
junction of partners with distinct genotypes. To locate and identify these
RNAs genomic pipelines typically align reads consequential to each genome in
the system. This comes with benefits and disadvantages. Here, we address the
main disadvantages, the high levels of data noise and false positives. The
mobileRNA package provides methods to pre-process, analyse and visualise the
sRNA and mRNA populations based on the premise of mapping reads to all
genotypes at the same time. This vignette explains the use of the package and
demonstrates quick or advanced workflows.
output:
BiocStyle::html_document:
toc: true
toc_depth: 3
number_sections: true
theme: united
highlight: tango
fig_width: 5
bibliography: library.bib
vignette: >
%\VignetteIndexEntry{mobileRNA}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
%\VignetteDepends{mobileRNA}
%\VignetteDepends{bioseq}
%\VignetteDepends{dplyr}
%\VignetteDepends{magick}
%\usepackage[utf8]{inputenc}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
fig.align='center',
external=TRUE,
echo=TRUE,
warning=FALSE,
comment = "#>"
)
```
# Introduction
In plants, systemic signalling is an elaborated molecular system which
coordinates plant development, integrating and transmitting the information
perceived from the environment to distant organs. An important role in
long-distance signalling is played by small RNA molecules (sRNAs). The
nucleotide length of a sRNA helps researchers identify the class of sRNA and
predict its functionality. Micro-RNAs (miRNAs) are involved in directing
translational repression and/or the cleavage of messenger RNAs (mRNAs). Whereas
small interfering RNAs (siRNAs) are involved in the maintenance and de novo DNA
methylation and account for the majority of sRNAs in plants. These endogenous
sRNAs can be produced in a tissue and then transported systemically across the
vascular system into recipient organs, where they can induce a molecular
response and coordinate physiological changes. Similarly, messenger RNAs (mRNAs)
can move across distances, and it’s thought they may translate into proteins
which act as transcription factors in the recipient tissues.
Plant grafting can be utilised to create chimeric plant systems composed of two
genotypes, such as different species like tomato and eggplant, or plant
varieties or accessions. Grafting has been used as a method to study RNA
mobilomes and their impact on the phenotype. Yet, it is clear that there is no
standardised genomic approach for the analysis of sequencing data to identify an
RNA mobilome. Here we introduce the R package, mobileRNA, a recommended pipeline
and analysis workflow for the identification of a sRNAs/mRNA mobilome. In
addition, the flexibility supports standard RNA analysis between treatment and
control conditions. For example, to identify sRNA population changes due to the
application of a treatment such as cold/heat stress or exposure to a pest.
`mobileRNA` ultimately assists in pre-processing and analysis including the
characterization of different populations, visualization of the results, and
supporting output for functional analysis.
As stated, this was developed for applications for plant grafting experimental
analysis, however, we believe it could have further applications including the
analysis of dual-host systems.
# Approach
In grafted plants, when different genotypes are used as rootstock and scions,
the sequence variation between the two genomes involved can be used to
discriminate the origin of a sequenced RNA molecule. Therefore, if an RNA
molecule sequenced from one of the grafted partners (scion or rootstock) has
been found to match the genome of the other grafting partner, this could
empirically demonstrate its movement across the graft junction.
Most available genomics approaches to implement this analysis are based on RNA
sequencing, followed by alignment on a genotype of reference and post-alignment
screening of genetic variants to identify molecules which have a better match
for the genotype of the grafted partner. These methods have many limitations,
which might include:
* High dependency on arbitrary thresholds to determined mapped and unmapped RNAseq reads.
* Use of additional statistical tests to discriminate genetic variations from sequencing noise in the post-alignment analysis.
* High false positive rates.
Here, to circumvent such problems we propose a method inspired by the RNAseq
analysis of plant hybrids (Lopez-Gomollon 2022), including an alignment step
performed simultaneously on both genomes involved. The rationale of this
approach considers that alignment tools already implement an algorithm ideal
for the identification of the best matches (according to set parameters) in a
given genome reference, but they do not account for potential matches to DNA
sequences which are not provided as reference. Therefore, the two genomes from
all partners involved in the system are merged in a single FASTA file and used
as a reference for the unique alignment. Ultimately, in a bid to supply the
algorithm with as much information as possible to make the best possible
predictions and placement of sequencing reads to each genome.
The summarised workflow is shown below (Figure 1) where it contains a core
RNA analysis and a mobile sRNA/mRNA analysis. The core
analysis represents the standard workflow for the identification of RNA
populations which have been gained, lost or changed in abundance, for example,
the sRNA population difference between treatment and condition samples, or
similarly in a chimeric system, such as plant grafting, we might want to explore
the native sRNA population from the sample tissue origin (i.e. leaf) which have
been lost or gained or changes in sRNA abundance. While the mobile analysis
represents the workflow for the identification of putative mobile sRNAs or mRNA
in a plant graft system.
As input, the pipeline requires cleaned sRNA or mRNA sequencing reads in FASTQ
format, along with the genome assemblies which represent the genotypes in the
system. The diagram below illustrates the complete workflow using `mobileRNA`,
including essential, optional, and plotting functions.
```{r, echo=FALSE}
cap01<-"Comprehensive diagram of complete mobileRNA workflows. "
```
```{r, fig.align="centre", echo=FALSE, fig.cap=cap01 }
knitr::include_graphics("../man/figures/mobileRNA_graphic_2.png")
```
The analysis approach includes several underlying features to be aware of which
can alter the final output. When working with a chimeric system, the core steps
offer the ability to remove mapping errors by comparing control samples to
treatment samples. If the genotypes in the chimeric system are fairly distantly
related, it is unexpected that unique reads aligned to the mobile genotype will
be found in the control samples. With that in mind, we can assume RNAs with
reads mapped to the mobile genotype from the control samples could be artifacts
or mapping errors. Hence, these RNAs are removed from the analysis when the
parameters are utilized. Note that, if the chimeric system is expected to share
some or a high level of similarity it might be insightful to the analysis to not
remove these sRNA clusters.
At the end of the analysis, the user will have generated a dataframe where rows
represent either sRNA clusters or mRNAs and the columns include information on
the sRNA clusters or mRNAs, individual sample replicates, and more:
#### sRNA Analysis
##### Information on the sRNA cluster:
- `Locus`: Genomic location
- `chr`: Chromosome or scaffold name
- `start` : Start position of the cluster
- `end` : End position of the cluster
- `Cluster`: Cluster Name
- `DicerConsensus` : Consensus dicercall (Calculated by `RNAdicercall()`)
- `DicerCounts` : Number of replicates which contributed to the consensus dicercall (Calculated by `RNAdicercall()`)
- `CountMean` : Count mean (Calculated by `RNAdifferentialAnalysis()`)
- `log2FoldChange` : Log fold change (Calculated by `RNAdifferentialAnalysis()`)
- `pvalue` : p-value (Calculated by `RNAdifferentialAnalysis()`)
- `padjusted` : Adjusted p-value (Calculated by `RNAdifferentialAnalysis()`)
- `logCPM` : Log counts per million (CPM/RPM) (Calculated by `RNAdifferentialAnalysis()`)
##### Information on each sample replicate:
- `DicerCall_` : The nucleotide length of most abundant sRNA
- `Count_` : Number of aligned sRNA reads. As default, these are uniquely aligned (*e.g.* not multi-mapping).
- `RPM_` : Reads per Million
- `FPKM_` : Fragments Per Kilobase of transcript per Million
- `MajorRNA_` : RNA sequence of most abundant sRNA in the cluster
#### For mRNA Analysis
##### Information on the mRNA:
- `Feature`: mRNA name
- `SampleCounts` : Consensus dicercall (Calculated by `RNAdicercall()`)
- `CountMean` : Count mean (Calculated by `RNAdifferentialAnalysis()`)
- `log2FoldChange` : Log fold change (Calculated by `RNAdifferentialAnalysis()`)
- `pvalue` : p-value (Calculated by `RNAdifferentialAnalysis()`)
- `padjusted` : Adjusted p-value (Calculated by `RNAdifferentialAnalysis()`)
- `logCPM` : Log counts per million (CPM/RPM) (Calculated by `RNAdifferentialAnalysis()`)
##### Information on each sample replicate:
- `Count_` : Number of aligned mRNA reads. As default, these are uniquely aligned (*e.g.* not multi-mapping).
- `FPKM_` : Fragments Per Kilobase of transcript per Million
In addition, the user can utilise additional functions to plot a PCA,
distance matrix, sRNA class distribution and heatmaps. Plus, **mobileRNA**
offers functions to assist functional analysis to support the determination
of the biological implications. For instance, within sRNA analysis, the user can
extract the consensus RNA sequence for each sRNA cluster for target prediction
analysis, as well as identifying genomic features associates with each sRNA
clusters, such as genes or repeat regions. This can be utilised, for example,
to explore the RNA expression of sRNA-producing genes in parallel analysis.
# Installation
The latest version of `mobileRNA` can be installed via Bioconductor:
```{r, eval=FALSE, message=FALSE}
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("mobileRNA")
```
It is also available on GitHub:
```{r, eval=FALSE, message=FALSE}
if (!require("devtools")) install.packages("devtools")
devtools::install_github("KJeynesCupper/mobileRNA", ref = "main")
```
Load into R library:
```{r, message=FALSE}
library("mobileRNA")
```
## Installing OS Dependencies
`mobileRNA` works on systems with `R`, and depending on the type of sequencing
data different OS dependencies installed via Conda [@Anaconda] are required for
the alignment step.
For sRNA data, `ShortStack` (>= 4.0) [@Axtell2013] and the dependencies are
required. Please consider that `ShortStack` is not available for Windows, hence,
Windows users will either need to opt to use a virtual machine or
[Windows Subsystem for Linux](https://docs.microsoft.com/en-us/windows/wsl/install-win10)
In either case, both `R` and `ShortStack` will need to be installed and used on
the Linux side. Please head to [ShortStack](https://github.com/MikeAxtell/ShortStack#install-using-conda-recommended)
to see the recommended installation instructions with Conda [@Anaconda].
This will ensure all dependencies are available within the same environment.
For mRNA data, [HISAT2](https://anaconda.org/bioconda/hisat2) [@Kim2015],
[HTSeq](https://anaconda.org/bioconda/htseq) [@Anders2014],
[SAMtools](https://anaconda.org/bioconda/samtools) [@Danecek2021] are required
within the same Conda environment [@Anaconda].
## Example Data
The package includes a simulated data set to replicate the grafting
between eggplant-tomato (*Solanum melongena* - *Solanum lycopersicon*) where
eggplant represents the scion and tomato represents the rootstock. The FASTQ
files represent data extracted from the eggplant leaf tissue. Here we
will locate mobile RNAs produced by tomato roots which have travelled into the
eggplant leaves. In the example data set, there are heterograft replicate,
where each is an individual tomato replicate spiked with the same random set of
tomato sRNA clusters. There are two sets of each for mRNA analysis and three
sets for sRNA analysis. These are known as:
* `heterograft_1`
* `heterograft_2`
* `heterograft_3` (sRNA analysis only)
There are self-graft replicates, where each is an individual tomato
replicate without the spiked tomato sRNA clusters. These are:
* `selfgraft_1`
* `selfgraft_2`
* `selfgraft_3` (sRNA analysis only)
The replicates mirror each other where, for instance, `heterograft_1`
and `selfgraft_1` are the same replicate, either with or without the spiked
clusters.
Hence, we also provide more comprehensive data sets, for small RNA it is called
`sRNA_data` and for messenger RNA it is called `mRNA_data`. Each stores an
example dataframe produced by the importation step with `RNAimport()`. This can
be loaded in the R environment by using the following command:
```{r Load, message=FALSE}
# Load small RNA data
data("sRNA_data")
# Load messenger RNA data
data("mRNA_data")
```
For sRNAseq & mRNAseq alignment step, we provide demo FASTQ files; these have a
reduced number of reads and do not cluster as expected, so should not be used
for downstream analysis. These are stored within `inst/extdata`. Tomato and
Eggplant reduced genome assemblies and annotations have been provided
based on those generated by Hosmani et al., 2019 [@Hosmani2019] and Barchi
et al., 2021 [@Barchi2021].
The simulated data was generated from the data published
[@Li2022, @Qing2022, @Villanueva2023, @TomatoGenomeConsortium2012].
# Example Workflow: Locating the sRNA Mobilome
This is a quick-start example analysis of sRNAseq data to locate putative mobile
root-to-shoot sRNAs from a plant grafting experiment between eggplant and
tomato.
## Quick Start
### Merging Genome Assemblies
Merge the FASTA genome assemblies of tomato and eggplant into a single
reference file stored in your desired directory. Eggplant represents the scion
as genome A, and tomato represents the rootstock (foreign/mobile) as genome B.
```{r, message=FALSE}
fasta_1 <- system.file("extdata","reduced_chr12_Eggplant.fa.gz",
package="mobileRNA")
fasta_2 <-system.file("extdata","reduced_chr2_Tomato.fa.gz",
package="mobileRNA")
# define temporary output directory - replace with your directory
output_assembly_file <- file.path(tempfile("merged_assembly",
fileext = ".fa"))
# merge
merged_reference <- RNAmergeGenomes(genomeA = fasta_1,
genomeB = fasta_2,
output_file = output_assembly_file)
```
### Merging Genome Annotations
Now, repeat for the genome annotations (GFF) - this is required for downstream
analysis or for mRNA mapping.
*It is important that the same alterations are made to each genome in the merged
files. Otherwise, the annotation file will not align with the chromosome names
in the assembly file.*
```{r, message=FALSE}
anno1 <- system.file("extdata","reduced_chr12_Eggplant.gff.gz",
package="mobileRNA")
anno2 <- system.file("extdata","reduced_chr2_Tomato.gff.gz",
package="mobileRNA")
# define temporary output directory
output_annotation_file <- file.path(tempfile("merged_annotation",
fileext = ".gff3"))
# merge annotation files into a single file
merged_annotation <- RNAmergeAnnotations(annotationA = anno1,
annotationB = anno2,
output_file = output_annotation_file)
```
### Alignment
Here we will align our samples containing small RNA sequencing reads to the
merged genome assembly. The `mapRNA()` function invokes a system
command to `ShortStack`; an application which employs `Bowtie` [@Langmead2009]
mapping and performs comprehensive de novo annotation and quantification of
sRNA genes. The function first undertakes de novo sRNA cluster detection
(stored in `1_de_novo_detection`) and then quantification of sRNA genes
(stored in`2_sRNA_results`). Since we are using chimeric samples, we
use `mmap = n` to exclude multi-mapped reads. We exclude multi-mappers as we
currently do not have a method to distinguish whether a read is mapped to
multiple locations within one or both of the genomes in the merged reference.
**PLEASE NOTE:** For the alignment & import demo, we are using fastq files which
have a reduced size for the package. This data will not be used for the
continued analysis step, instead, you will need to load the full dataset.
```{r, eval=FALSE}
# directory containing only sRNAseq samples
samples <- system.file("extdata/sRNAseq",package="mobileRNA")
# output location
output_location <- tempdir()
mapRNA(input = "sRNA",
input_files_dir = samples,
output_dir = output_location,
genomefile = output_assembly_file,
condaenv = "/Users/user-name/miniconda3/envs/ShortStack4",
mmap = "n")
```
If there are issues utilising this function, the manual steps are illustrated in
the [Appendix](#Appendix)
### Import Pre-Processed Data into R
Import the results from the alignment step into R using the `RNAimport()`
function.
```{r, eval=FALSE}
# Directory containing results
results_dir <- file.path(output_location,"2_sRNA_results")
# Sample names and total number of reads, in the same order.
sample_names <- c("selfgraft_demo_1", "selfgraft_demo_2",
"selfgraft_demo_3", "heterograft_demo_1",
"heterograft_demo_2", "heterograft_demo_3")
sRNA_data_demo <- RNAimport(input = "sRNA",
directory = results_dir,
samples = sample_names)
```
This will generate a dataframe where rows represent sRNA clusters and
contains the following columns:
- `Locus`: Genomic location
- `chr`: Name of the chromosome or scaffold
- `start` : Start position of the cluster
- `end` : End position of the cluster
- `Cluster`: Cluster Name
For each sample, the following columns will be present. Where the sample name
follows after the underscore:
- `DicerCall_` : The size of most abundant small RNA size
- `Count_` : Number of uniquely aligned reads that overlap the locus.
- `RPM_` : Reads per Million
- `FPKM_` : Fragments Per Kilobase of transcript per Million
- `MajorRNA_` : RNA sequence of the most abundant sRNA in the cluster
### Import Example Data
Load the full analysis dataset for a more comprehensive analysis:
```{r}
data("sRNA_data")
```
### Identify Potential Mobile sRNA clusters
Select the putative mobile sRNA clusters using `RNAmobile()`. This
requires supplying the function with a unique identifier of the
rootstock genome, which is the prefix "B" to the tomato chromosome names. Here,
the function retains sRNA clusters which were aligned to the tomato genome.
```{r, message=FALSE}
# define control samples
controls <- c("selfgraft_1", "selfgraft_2", "selfgraft_3")
mobile_sRNA <- RNAmobile(input = "sRNA",
data = sRNA_data,
controls = controls,
genome.ID = "B",
task = "keep")
```
```{r, eval=FALSE}
# save output as txt file
write.table(mobile_sRNA, "./sRNA_mobile_output.txt")
```
## Advancing the Analysis
For a more advanced analysis, users can include further steps. Here we
demonstrate a more advanced analysis for the Core RNA analysis and Mobile
RNA analysis, both demonstrated with the sRNAseq dataset.
### Core sRNA Analysis
#### Quality control
A useful step before analysis is to assess the overall similarity between
sample replicates to understand which samples are similar and different. This
is known as sample-level quality control and can help us understand where the
largest variation is introduced, whether the data meets the expectations and if
there are outliers.
##### Plot the distribution of sRNA classes within each sample
Here, the `RNAdistribution()` function can generate a number of different
customized plots to represent the number of sRNA clusters within each
dicercall sRNA class in a sample. In plants, sRNAs are known to be produced
with the length between 20-24 nucleotides, and the lengths signify the sRNA
class and specific functional role in (epi)genetic regulation. If there was an
inconsistent size profile of sRNAs within the cluster, the dicercall is defined
as "N", ie, unclassified.
Plotting the distribution of dicercall sRNA classes within each replicate can
support expectation for samples. For instance, if a significant number of sRNA
clusters are unclassified it might suggest the data contains degraded
RNA fragments, or novel types of sRNA genes.
```{r, echo=FALSE}
cap1 <- "An example facet line graph to show the distribution of sRNA classes within each sample."
cap2 <- "An example facet bar graph to show the distribution of sRNA classes within each sample."
```
```{r}
# plot each replicate as a line, separately, and facet
sample_distribution_line <- RNAdistribution(sRNA_data,
style = "line",
overlap = FALSE,
facet = TRUE,
colour = "darkgreen")
# plot each replicate as a bar, separately, and facet
sample_distribution_bar <- RNAdistribution(sRNA_data,
style = "bar",
facet = TRUE,
colour ="lightblue")
```
Let's view the plots:
```{r, fig.cap=cap1, fig.show="hold", fig.height=10, fig.width=15}
# View plot (only)
sample_distribution_line$plot
```
```{r, fig.cap=cap2, fig.show="hold", fig.height=10, fig.width=15}
# View plot (only)
sample_distribution_bar$plot
```
##### PCA
Principal Component Analysis (PCA) is a useful technique to illustrate sample
distance as it emphasizes the variation through the reduction of dimensions in
the data set. Here, we introduce the function `plotSamplePCA()`
```{r, echo=FALSE}
cap4 <-"An example of a PCA, illustrating the sRNA data set sample similarity"
```
```{r, message=FALSE, fig.cap=cap4, fig.show="hold", fig.height=8, fig.width=15}
groups <- c("Heterograft", "Heterograft", "Heterograft",
"Selfgraft", "Selfgraft", "Selfgraft")
plotSamplePCA(data = sRNA_data, group = groups, size.ratio = 2)
```
##### Distance matrix heatmap
Similarly to a PCA plot, the `plotSampleDistance()` function undertakes
hierarchical clustering with an unbiased log transformation to calculate sample
distance illustrated with a distance matrix heatmap.
```{r, echo=FALSE}
cap5 <-"An example of a heatmap, illustrating the sRNA data set sample similarity"
```
```{r ,message=FALSE, fig.cap=cap5, fig.show="hold"}
plotSampleDistance(sRNA_data)
```
#### Define the consensus dicercall
Have a look at the `sRNA_data` data frame, you will see that for each sample the
sRNA class for a given cluster has been determined (see columns with names
containing with "DicerCall_") which, in this data, will state a number from
20-24. This value represent the length in nucleotides of the most abundant
sRNA within the cluster. For some clusters, there is no particular sRNA which
is more abundant than another, hence, it is stated as "NA" or "N", which is
referred to as being unclassified.
The `RNAdicercall()` function is used to calculate the consensus dicercall
for each sRNA cluster. This is based on the classification predicted for the
cluster by each sample within the analysis. There are several parameters which
will alter the output, including the handling of ties and the
method to draw the consensus from.
When working along the mobile sRNA analysis workflow, the function contains
a specialised parameter which can be utilized. This is `chimeric=TRUE`, plus
with the `genome.ID` and `controls` parameters. In the example below `B_`
represents the prefix added to the mobile/foreign genotype, which is
the rootstock tomato genome. This helps optimise classification by removal of
potential mapping errors.
```{r, message=FALSE}
# define consensus, store as a data summary file.
sRNA_data_dicercall <- RNAdicercall(data = sRNA_data,
chimeric = TRUE,
genome.ID = "B_",
controls = c("selfgraft_1",
"selfgraft_2",
"selfgraft_3"))
```
For the downstream analysis, it can be useful to define distinct groups of
sRNA classes depending on your organism. For plant samples, it is beneficial to
select a group of 24-nt and another containing 21/22-nt sRNAs.
To subset the data, use the `RNAsubset()` function to choose which sRNA
populations.
```{r, message=FALSE}
# Subset data for analysis: 24-nt sRNAs
sRNA_24 <- RNAsubset(sRNA_data_dicercall, class = 24)
# Subset data for analysis: 21/22-nt sRNAs
sRNA_2122 <- RNAsubset(sRNA_data_dicercall, class = c(21, 22))
```
##### Plot the consensus dicercall
The `RNAdistriution()` function can be used to visualize the
distribution of the consensus dicercall classes across the total data set.
```{r, message=FALSE}
consensus_plot <- RNAdistribution(data = sRNA_data_dicercall,
style = "bar",
data.type = "consensus")
```
```{r, echo=FALSE}
cap6<-"An example of the distribution of small RNA consensus dicer classifications. "
```
Now, view the plot:
```{r, fig.cap=cap6, fig.height=10, fig.width=15}
# view
consensus_plot$plot
```
#### Differential analysis with `DESeq2` or `edgeR`
Differential analysis is undertaken to identify RNAs which
are statistically significant to discover quantitative changes in the abundance
levels between the treatment and the control groups. This technique can be
undertaken with either the `DESeq2` [@Love2014] or `edgeR` [@edgeR_GLM]
analytical method.
First, let's re-order the data frame so we can compare control vs treatment
(i.e. Selfgraft vs Heterograft). When the data is imported, it may not be in the
correct order/levels for comparison.
```{r, message=FALSE}
#reorder df
controls <- c("selfgraft_1", "selfgraft_2", "selfgraft_3")
reorder_df <- RNAreorder(sRNA_data_dicercall, controls)
# sample conditions in order within dataframe
groups <- c("Selfgraft", "Selfgraft", "Selfgraft",
"Heterograft", "Heterograft", "Heterograft")
## Differential analysis of whole dataset: DESeq2 method
sRNA_DESeq2 <- RNAdifferentialAnalysis(data = reorder_df,
group = groups,
method = "DESeq2")
```
We can summarise the results using `RNAsummary()`:
```{r, results='asis'}
RNAsummary(sRNA_DESeq2)
```
How about looking at the sRNA population which are statistically significant:
```{r, results='asis'}
RNAsummary(sRNA_DESeq2, alpha=0.05)
```
##### Save output
```{r, eval = FALSE}
write.table(sRNA_DESeq2, "./sRNA_core_dataset.txt")
```
#### Differences in RNA abundance
When comparing treatment to control conditions, it might be the case that
the same sRNA clusters are found within both, yet, there could be difference in
the total abundance of the shared clusters. For instance, for a given sRNA
cluster the samples in the treatment condition might have a greater abundance
than the samples in the control condition.
The statistical analysis calculated the log2FC values for each sRNA cluster by
comparing the normalised counts between treatment and control. Here,
a positive log2FC indicates an increased abundance of transcripts for a given
sRNA cluster in the treatment compared to the control, while negative log2FC
indicates decreased abundance of transcripts for a given sRNA cluster. The
statistical significance of the log2FC is determined by the adjusted p-value.
Here we will filter the data to select sRNA clusters which are statistically
significant, and then plot the results as a heatmap to compare the conditions.
*NOTE:* the data used here will not yield any results as the
treatment and control samples contain the exact same population of eggplant
sRNAs, the only difference in the treatment samples are the spiked tomato
sRNA clusters.
```{r, fig.show="hold"}
# summary of statistical sRNA clusters
RNAsummary(sRNA_DESeq2, alpha=0.05)
# select significant
significant_sRNAs <- sRNA_DESeq2[sRNA_DESeq2$padjusted < 0.05, ]
```
```{r, echo = FALSE}
capp1 <- "Heatmap of statistically significant sRNA clusters"
```
Plot the results:
```{r, fig.show="hold", message=FALSE}
p1 <- plotHeatmap(significant_sRNAs, value = "RPM", row.names = FALSE,
title = "Heatmap of log-transformed FPKM")
```
```{r, fig.show="hold", message=FALSE, fig.cap=capp1}
p1$plot
```
#### Identify gain & loss of RNA populations
The `RNApopulation()` function can be utilised to identify unique
sRNA populations found in the treatment or control conditions. This represents
the sRNA which are gained or lost due to the treatment conditions.
First let's look at the sRNA clusters gained to the treatment condition.
In the chimeric heterografts, we expect that the foreign sRNAs will also be
selected in this pick-up, therefore, we can use the parameter `genome.ID` to
remove sRNA cluster related to the foreign genome.
```{r, message=FALSE}
# select sRNA clusters only found in treatment & not in the control samples
gained_sRNA <- RNApopulation(data = sRNA_DESeq2,
conditions = c("heterograft_1",
"heterograft_2" ,
"heterograft_3"),
chimeric = TRUE,
genome.ID = "B_",
controls = c("selfgraft_1",
"selfgraft_2",
"selfgraft_3"))
# look at number of sRNA cluster only found in treatment
nrow(gained_sRNA)
```
Now, the sRNA clusters lost and only produced in the control condition:
```{r, message=FALSE}
# select sRNA clusters only found in control & not in the treatment samples
lost_sRNA <- RNApopulation(data = sRNA_DESeq2,
conditions = c("selfgraft_1",
"selfgraft_2" ,
"selfgraft_3"),
chimeric = TRUE,
genome.ID = "B_",
controls = c("selfgraft_1",
"selfgraft_2",
"selfgraft_3"))
# look at number of sRNA cluster only found in control
nrow(lost_sRNA)
```
#### Functional analysis of gained sRNA populations
Now we have identified unique populations produced or not produced in our
treatment samples compared to our control samples, we can extract the RNA
sequences to undertaken target prediction and then onward to gene ontology
enrichment analysis.
```{r, message=FALSE}
gained_sRNA_sequences <- RNAsequences(gained_sRNA, method = "consensus" )
```
Moreover, we can identify genomic features associated with these sRNA clusters
which are unique to the treatment and absent in the control (i.e. gained).
```{r, message=FALSE}
gained_sRNA_attributes <- RNAattributes(data = gained_sRNA,
match ="genes",
annotation = output_annotation_file)
```
### Mobile sRNA Analysis
We identify candidate mobile RNAs by identifying those which are mapped
to the genotype representing the mobile RNAs. These can be isolated using the
`RNAmobile()` function. Then the user can undertake further steps to assist
functional analysis.
In respect to the example data set, we are looking to identify sRNAs traveling
from the tomato rootstock to the eggplant scion in the heterografts. Hence, this
function will look to select clusters mapped to the tomato genome and remove
those mapped to the eggplant genome. Previously, the example of the prefix `B_`
was added to the chromosomes of the tomato genome while prefix `A_` added to
the eggplant genome. To remove or keep specific clusters, we align this request
with the `"task"` parameter.
```{r , message=FALSE}
# vector of control names
control_names <- c("selfgraft_1", "selfgraft_2", "selfgraft_3")
## Identify potential tomato mobile molecules
mobile_sRNA <- RNAmobile(input = "sRNA",
data = sRNA_DESeq2,
controls = control_names,
genome.ID = "B_",
task = "keep",
statistical = FALSE)
```
#### Heatmap plots to represent mobile molecules
We can plot our results as a heatmap, which represents the normalised
reads-per-million (RPM) values which have been log transformed.
Here we will plot all potential mobile molecules and those which are
statistically significant:
```{r, echo=FALSE}
cap7 <- "An example heatmap of candidate mobile small RNAs. Where the columns represent the sample replicates and the rows represent the small RNA cluster."
```
```{r,fig.cap=cap7, fig.show="hold" , message=FALSE}
p2 <- plotHeatmap(mobile_sRNA, row.names = FALSE)
p2$plot
```
#### Save output
```{r, eval = FALSE}
write.table(mobile_sRNA, "./candidate_mobile_sRNAs.txt")
```
#### Functional analysis of putative mobile sRNA clusters
Now, we can extrapolate information to assist the prediction of their targets
and role in the biological system. **mobileRNA** offer three different tools to
assist the functional analysis.
**IMPORTANT:** Alterations to the genome assemblies by the `RNAmergeGenomes()`
function must be replicated in the annotations. A merged annotation with the
same amendments can be created with the function `RNAmergeAnnotations()`.
##### Add genomic attributes to sRNA clusters
Each sRNA cluster contains coordinates, these can be matched with coordinates in
an annotation file. A match occurs when the cluster is found within the
coordinates of a feature. If there is a match, the function returns the input
dataframe with additional fields of information from the annotation file.
This enables users to identify the genomic features producing the mobile
sRNAs. Here we will be overlapping the data with genes and adding a buffer
region of 1 kb upstream and downstream of each gene.
```{r}
annotation_file <- system.file("extdata",
"prefix_reduced_chr2_Tomato.gff.gz",
package="mobileRNA")
mobile_attributes <- RNAattributes(data = mobile_sRNA,
match ="genes",
annotation = annotation_file)
```
##### Summarise sRNA cluster overlaps with genomic features
Very similar to before, we can find stricter overlaps between our candidate
sRNA clusters and genomic features. However, this time we can calculate the
number of sRNA clusters which are associated to each type of feature. These
include promoter regions, exon, introns, untranslated regions and repeat
regions. The results can either be displayed in the data frame as an
absolute value or as a percentage of the total:
```{r}
mobile_features <- RNAfeatures(data = mobile_sRNA,
annotation = annotation_file)
print(mobile_features)
# plot as stacked bar chart
features_plot <- plotRNAfeatures(data = sRNA_data,
annotation = annotation_file)
plot(features_plot)
```
##### Retrieve RNA sequence from mobile sRNA clusters
To predict the targets of the mobile sRNA candidates, we need to extract
the RNA sequence. Here, we introduce the `RNAsequences()` function which
extrapolates the most common RNA sequence for a cluster and determines the
complementary sequences.
```{r, message=FALSE}
mobile_sequences <- RNAsequences(mobile_sRNA, method = "consensus")
```
The output consists of a dataframe of 6 columns where rows represent each
putative sRNA cluster. The columns include:
- `Cluster`: name of sRNA cluster
- `Match`: whether the RNA sequence is consistent across replicates (either "No", "Yes" or "Duplicate"; where "Duplicate" indicates a tie)
- `Sequence`: RNA sequence of the most abundant sRNA within a cluster across samples
- `Width`: length of nucleotide sequence
- `Complementary_RNA`: complementary RNA nucleotide sequence
- `Complementary_DNA`: complementary DNA nucleotide sequence
```{r, eval = FALSE}
write.table(mobile_sequences, "./candidate_mobile_sRNA_sequences.txt")
```
To predict the potential targets in a recipient tissue, a target prediction
tool such as `psRNATarget` [@Xinbin2018] can be used. For such a tool, we can
manipulate the output of `RNAsequences()` to match the input style for target
prediction using the following code:
```{r, message=FALSE}
# load `dplyr` package
library(dplyr)
# select the cluster and sequence columns
sequences <- mobile_sequences %>% select(Cluster, Sequence)
# add prefix, remove row with NA
prefix <- ">"
sequences$Cluster <- paste0(prefix, sequences$Cluster)
sequences <- na.omit(sequences)
# convert to required format:
res <- do.call(rbind, lapply(seq(nrow(sequences)),
function(i) t(sequences[i, ])))
```
Save output:
```{r, eval = FALSE, message=FALSE}
# save output
write.table(res, file ="./mobile_sRNA_sequences.txt",
row.names = FALSE, col.names = FALSE, quote = FALSE)
```
Other steps can be taken at this stage including alignment of the mobile sRNA
sequences to identify consensus sequences. For example, this can be undertaken
with the R package `bioseq`:
```{r, eval = FALSE}
# load `bioseq` package
library(bioseq)
# build a RNA vector from a character vector:
mobileRNA_seq <- bioseq::rna(mobile_sequences$Sequence)
# Find a consensus sequence for a set of sequences:
bioseq::seq_consensus(mobileRNA_seq)
```
# Appendix
## Manual sRNA Alignment Pipeline
If users wish to undertake the alignment step independently, the manual steps
are shown below, of which, will need modifying to suit your files and
directory names.
### Core sRNA Analysis
#### Step 1 - De novo sRNA analysis
Here, `ShortStack` [@Axtell2013] aligns the reads (includes multimapped reads)
with Bowtie [@Langmead2009] and undertakes de novo sRNA clustering analysis.
*bash scripting*
```{bash, eval=FALSE}
ShortStack \
--readfile "[fastqfile.fq]" \
--genomefile "[genomefile.fa.gz]" \
--threads 6 \
--mmap u \
--pad 200 \
--mincov 0.5 \
--nohp \
--outdir "[output_dir]"
```
#### Step 2 - Build sRNA cluster list
Step 2 collates all the sRNA loci information from each sample into a text file.
*R scripting*
```{r, eval=FALSE}
# names of samples, represented by folder names produced in previous step
samples <- c("[sample names]")
# location of output folders from previous step
dir <- "[output_dir]"
# merge information of the detected de novo sRNA clusters
gff_alignment <- GenomicRanges::GRangesList()
for (i in samples) {
file_path <- file.path(dir, i, "Results.gff.gz3")
if (file.exists(file_path)) {
gff_alignment[[i]] <- rtracklayer::import.gff.gz(file_path)
} else{
Stop("File does not exist:", file_path, "\n")
}
}
gff_merged <- GenomicRanges::reduce(unlist(gff_alignment),
ignore.strand = TRUE)
gff_merged <- as.data.frame(gff_merged)
colnames(gff_merged)[1] <- "chr"
if('*' %in% gff_merged$strand){
gff_merged <- gff_merged[, -match("strand", colnames(gff_merged))]
}
locifile_txt <- data.frame(Locus = paste0(gff_merged$chr, ":",
gff_merged$start,"-",
gff_merged$end),
Cluster = paste0("cluster_",
seq_len(nrow(gff_merged))))
# save output to location
dir_locifile <- "[output directory]"
loci_out <- file.path(dir_locifile,"locifile.txt")
utils::write.table(locifile_txt, file = loci_out, quote = FALSE,
sep = "\t", row.names = FALSE, col.names = TRUE)
```
#### Step 3 - sRNA clustering
Finally, we repeat the sRNA clustering using the de novo sRNA loci list to
produce our results. The output folders are used as input for the `RNAimport()`
function.
*bash scripting*
```{bash, eval=FALSE}
ShortStack \
--readfile "[fastqfile.fq]" \
--genomefile "[genomefile.fa.gz]" \
--locifile "[locifile]" \
--threads 6 \
--mmap u \
--nohp \
--pad 200 \
--mincov 0.5 \
--outdir "[output_dir_2]"
```
### Mobile sRNA Analysis
#### Step 1 - De novo sRNA analysis
Here, `Bowtie` [@Langmead2009] aligns the reads (excluding multimappers) and
`ShortStack` [@Axtell2013] undertakes de novo sRNA clustering analysis.
*bash scripting*
```{bash, eval=FALSE}
#index reference genome
bowtie-build --threads 6 [genomefile.fa.gz] [genomefile]
# alignment
bowtie -p 6 -v 0 -a -m 1 --sam -x [genomefile] [fastqfile.fq] | samtools view -bS | samtools sort -o [outputfile.bam]
# cluster analysis with ShortStack - use the alignment files as input
ShortStack \
--bamfile "[outputfile.bam]" \
--genomefile "[genomefile.fa.gz]" \
--threads 6 \
--pad 200 \
--mincov 0.5 \
--nohp \
--outdir "[output_dir]"
```
#### Step 2 - Build sRNA cluster list
Step 2 collates all the sRNA loci information from each sample into a text file.
*R scripting*
```{r, eval=FALSE}
# names of samples, represented by folder names produced in previous step
samples <- c("[sample names]")
# location of output folders from previous step
dir <- "[output_dir]"
# merge information of the detected de novo sRNA clusters
gff_alignment <- GenomicRanges::GRangesList()
for (i in samples) {
file_path <- file.path(dir, i, "Results.gff.gz3")
if (file.exists(file_path)) {
gff_alignment[[i]] <- rtracklayer::import.gff.gz(file_path)
} else{
Stop("File does not exist:", file_path, "\n")
}
}
gff_merged <- GenomicRanges::reduce(unlist(gff_alignment),
ignore.strand = TRUE)
gff_merged <- as.data.frame(gff_merged)
colnames(gff_merged)[1] <- "chr"
if('*' %in% gff_merged$strand){
gff_merged <- gff_merged[, -match("strand", colnames(gff_merged))]
}
locifile_txt <- data.frame(Locus = paste0(gff_merged$chr, ":",
gff_merged$start,"-",
gff_merged$end),
Cluster = paste0("cluster_",
seq_len(nrow(gff_merged))))
# save output to location
dir_locifile <- "[output directory]"
loci_out <- file.path(dir_locifile,"locifile.txt")
utils::write.table(locifile_txt, file = loci_out, quote = FALSE,
sep = "\t", row.names = FALSE, col.names = TRUE)
```
#### Step 3 - sRNA clustering
Finally, we repeat the sRNA clustering using the de novo sRNA loci list to
produce our results. The output folders are used as input for the `RNAimport()`
function.
*bash scripting*
```{bash, eval=FALSE}
ShortStack \
--bamfile "[outputfile.bam]" \
--genomefile "[genomefile.fa.gz]" \
--locifile "[locifile.txt]" \
--threads 6 \
--nohp \
--mincov 0.5 \
--pad 200 \
--outdir "[output_dir_2]"
```
## Manual mRNA Alignment pipeline
Below states the manual steps for mRNAseq alignment for mobile mRNA
identification. Keep in mind these commands need modifying to suit your files
and directory names.
These steps utilise HISAT2 [@Kim2015] and HTSeq [@Anders2014] OS software.
If required, index the genome assembly:
*bash scripting*
```{bash, eval=FALSE}
hisat2-build --threads 6 [genomefile.fa.gz] [genomefile]
```
### Single-End Sequencing Reads
*bash scripting*
```{bash, eval=FALSE}
# alignment with hisat2
hist2 -p 6 -x [genomefile] -U [fastqfile_1.fq] -S [outputfile.sam]
# select reads only mapped to one location:
grep "NH:i:1" [outputfile.sam] > [outputfile_uniqueReads.sam]
# reheader the sam file:
samtools view -H [outputfile.sam] > [outputfile_header.txt]
cat [outputfile_header.txt] [outputfile_uniqueReads.sam] > [outputfile_uniqueReads_H.sam]
# convert to bam and sort
samtools view -Sb [outputfile_uniqueReads_H.sam] | samtools sort -o [outputfile_uniqueReads_H.bam]
# make folder to store output for sample:
mkdir [samplename]
# raw count with HTSeq
python -m HTSeq.scripts.count \
--format=bam \
--order=pos \
--stranded=no \
--mode=union \
--nonunique=none \
--type=mRNA \
--idattr=Name \
[outputfile_uniqueReads_H.bam] \
[annotationfile.gff] \
> [./samplename/Results.txt]
```
### Pair-End Sequencing Reads
*bash scripting*
```{bash ,eval=FALSE}
# alignment with hisat2
hist2 -p 6 -x [genomefile] \
-1 [fastqfile_1.fq] \
-2 [fastqfile_2.fq] \
-S [outputfile.sam]
# select reads only mapped to one location:
grep "NH:i:1" [outputfile.sam] > [outputfile_uniqueReads.sam]
# reheader the sam file:
samtools view -H [outputfile.sam] > [outputfile_header.txt]
cat [outputfile_header.txt] [outputfile_uniqueReads.sam] > [outputfile_uniqueReads_H.sam]
# convert to bam and sort
samtools view -Sb [outputfile_uniqueReads_H.sam] | samtools sort -o [outputfile_uniqueReads_H.bam]
# make folder to store output for sample:
mkdir [samplename]
# raw count with HTSeq
python -m HTSeq.scripts.count \
--format=bam \
--order=pos \
--stranded=no \
--mode=union \
--nonunique=none \
--type=mRNA \
--idattr=Name \
[outputfile_uniqueReads_H.bam] \
[annotationfile.gff] \
> [./samplename/Results.txt]
```
# Session Information
```{r}
sessionInfo()
```
# References