---
title: "The sRNADiff package"
author: "Matthias Zytnicki"
date: "`r BiocStyle::doc_date()`"
package: "`r BiocStyle::pkg_ver('srnadiff')`"
output: rmarkdown::html_vignette
bibliography: references.bib
vignette: >
  %\VignetteIndexEntry{Vignette Title}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup_knitr, include = FALSE, cache = FALSE}
library("srnadiff")
library("BiocStyle")
library("knitr")
library("rmarkdown")
opts_chunk$set(message = FALSE, error = FALSE, warning = FALSE,
               cache = FALSE, fig.width = 5, fig.height = 5)
```

sRNADiff aims at finding differentially expressed short RNAs with or without
annotation.  To do so, sRNADiff uses up to 4 different strategies:

- annotation-based,
- naive,
- HMM,
- clustering.


## Example

Raw data have been downloaded from the GEO data set
[GSE62830](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE62830),
provided in
[Viollet *et al.*, 2015](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0126439)
[@data].
Adapters were removed with
[fastx_clipper](http://hannonlab.cshl.edu/fastx_toolkit/commandline.html)
and mapped with
[bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) [@bowtie2]
on the human genome version GRCh38.

This example provides this data, restricted to a small locus on chr14.
It uses the whole genome annotation (with coding genes, etc.)
and extracts miRNAs.

The file `data.csv` contains three columns for each BAM file:

- the file name (`FileName`)
- a human readable sample name (`SampleName`)
- a condition, e.g. `WT` (`Condition`)

```{r, standard, results="hide", warning=FALSE}
dir         <- system.file("extdata", package="srnadiff", mustWork = TRUE)
data        <- read.csv(file.path(dir, "data.csv"))
gtfFile     <- file.path(dir, "Homo_sapiens.GRCh38.76.gtf.gz")
annotation  <- readWholeGenomeAnnotation(gtfFile)
bamFiles    <- file.path(dir, data$FileName)
replicates  <- data$SampleName
conditions  <- factor(data$Condition)
exp         <- sRNADiffExp(annotation, bamFiles, replicates, conditions)
diffRegions <- runAll(exp)
```

For your conveniance, the sample can be loaded with an only command
`exp <- sRNADiffExample()`, so the script boils down to:

```{r, example}
exp         <- sRNADiffExample()
diffRegions <- runAll(exp)
```


## Rationale and scope of the package

There is no real method for finding differentially expressed short RNAs.  The
most used method focusses on miRNAs, and only uses a standard RNA-Seq pipe-line
on these genes.

However, annotated tRF, siRNAs, piRNA, etc. are thus out of the scope of these
analyses.  Several ad hoc method have been used, and this package implements a
unifying method, finding differentially expressed genes or regions of any kind.


## Explanation of the different strategies

Each strategy segments the genome its own way and finds potential differentially
expressed regions.
Every strategy makes assumption of the profiles of the differentially expressed
regions.
These assumptions differ between strategies, and thus it is useful to combined
them.

### Annotation

This is the standard pipe-line, and the method simply provides the list of miRNA
genes extracted from the given GTF file.

If you do not provide a GTF file, this step is skipped.

Unfortunately, there is no clear standard for an annotation format.
We provide here four ways to parse the annotation file.


#### Whole genome file annotation

This GTF file can be found in the central repositories
([NCBI](https://www.ncbi.nlm.nih.gov/), 
[Ensembl](https://www.ensembl.org/index.html)) and
contains all the annotation found in an organism (coding genes, tranposable
element, etc.).
The following function reads the annotation file and extracts the miRNAs.
Annotation files may have different formats, but this command has been tested
on several model organisms (including human) from Ensembl.

```{r, read_whole_genome}
gtfFile     <- file.path(dir, "Homo_sapiens.GRCh38.76.gtf.gz")
annotation  <- readWholeGenomeAnnotation(gtfFile)
```


#### miRBase annotation

[miRBase](http://www.mirbase.org/) [@mirbase] is the central repository for
miRNAs.
If your organism is available, you can download their miRNA annotation in GFF3
format (check the "Browse" tab).
The following function parses a GFF3 miRBase file, and extracts the precursor
miRNAs.

```{r, read_mirbase_pre}
gffFile    <- file.path(dir, "mirbase21_GRCh38.gff3")
annotation <- readMiRBasePreAnnotation(gffFile)
```

As a result, the reads will be counted per pre-miRNA, and the 5' and 3' arms,
the miRNA and the miRNA* will be merged in the same feature.
If you want to separate the two, use:

```{r, read_mirbase_mature}
annotation <- readMiRBaseMatureAnnotation(gffFile)
```


#### Other format

When the previous functions do not work, you can use your own parser with:

```{r, read_annotation}
annotation <- readAnnotation(gtfFile, source="miRNA", feature="gene", name="gene_name")
```

The `source` parameter keeps all the lines such that the second field matches
the given parameter (e.g. `miRNA`).
The `feature` parameter keeps all the lines such that the third field matches
the given parameter (e.g. `gene`).
The name of the feature will be given by the tag `name` (e.g. `gene_name`).
`source`, `feature` and `name` can be `NULL`.
In this case, no selection is performed on `source` or `feature`.
If `name` is null, then a systematic name is given (`annotation_N`).


### Naive

This strategy is the simplest, most commonly used, to find transcription units.
It proceeds in two steps.

The first step merges all the BAM files.  Regions with a minimum number coverage
(summing all the reads in all input BAM files) are kept.

The second steps merges regions that are within a given distance (also provided
by the user).  The default distance can be change using this function:

```{r, set_merge_distance}
exp <- setMergeDistance(exp, 100)
```

This method is efficient when a gene is differentially expressed on the whole
transcript.


### HMM

This strategy first computes the coverage of every input BAM file of every
nucleotide of the genome.
If two genome nucleotides have exactly the same coverage profile, we merge them.
We give these coverage profiles to `r Biocpkg("DESeq2")` [@deseq2], and compute
the p-values.
We thus have a p-value for each position of the genome.
We then use a very crude hidden Markov model (HMM) on each chromosome of the
genome.
The HMM has two states: "not differentially expressed", and "differentially
expressed".
Each state uses a binomial distribution, the probability to emit a p-value
< $t$, and the probability to emit a p-value >= $t$ (where $t$ is a threshold
that can be tuned, initially set to 0.1).
We did not try to optimize the parameters of the HMM, because it takes times,
and do not improve the results.
Instead, the HMM is used a crude (but effective) way to segment the genome.

The parameters of the HMM can be changed using these functions:

```{r, hmm}
exp <- setTransitionProbabilities(exp, noDiffToDiff = 0.0001, diffToNoDiff = 0.01)
exp <- setEmissionProbabilities(exp, probability = 0.9)
exp <- setEmissionThreshold(exp, threshold = 0.1)
```

where:

 * `noDiffToDiff` is the probability to move from the "not differentially
expressed" state to the "differentially expressed" state,
 * `diffToNoDiff` is the probability to move from the "differentially expressed"
state to the "not differentially expressed" state,
 * `probability` is the probability to emit a p-value < $t$ in the
"differentially expressed" state, and a p-value >= $t$ in the "not
differentially expressed" state, where $t$ is the next threshold,
 * `threshold` is the threshold that limits each state.

This method is efficient when a region is highly differentially expressed, even
if it is concealed in a longer transcript.


### Slicing

This stretegy proceeds in three steps:

 * It normalizes counts using a CPM strategy developped in EdgeR.
 * The mean distribution of the coverage of the replicates is computed for each
condition.
 * For each condition, we "slice" the mean coverage distribution: we add
horizontal lines at height h (where h is the minimum depth parameter defined in
the general parameter section), $h+1$, $h+2$, ..., $+\infty$, and the intervals
where the distribution and the horizontal lines meet is a putative region.
 * These putative regions are the output of the strategy.

Obviously, many putative regions may have very similar start and end points.
If two regions differ only by a few base pairs, the smallest regions will be
discarded.

The default values can be changed using these functions:

```{r, slicing}
exp <- setMinDifferences(exp, differences = 50)
```

This method is especially efficient when a differentially expressed region is
large, even though the fold change is moderate.


### General parameters

The three last strategies can be tuned by specifying:

 * the minimum and maximum regions sizes,
 * the minimum depth of the most expressed condition.

The default values can be changed using these functions:

```{r, general_parameters}
exp <- setMinDepth(exp, depth = 10)
exp <- setSizes(exp, minSize = 15, maxSize = 1000)
```


### Combination of the strategies

#### Choice of the strategies

All the regions given by each strategies are then combined into a list of
regions.
You can choose not to use some strategies, use the function

```{r, strategies}
exp <- setStrategies(exp, annotation = TRUE, naive = FALSE, hmm = TRUE, slice = TRUE)
```

where `annotation`, `naive`, `hmm` and `slice` are logical variables, set to
`FALSE` if you want to skip that step.


#### Quantification of the features

The selected regions are then quantified using of the `summarizeOverlaps`
function of the `r Biocpkg("GenomicAlignments")` package.
Notice that a read can overlap two different regions (e.g. extracted from the
naive and the slicing methods), and thus can be counted twice for the
quantification.

You can adjust the minimum number of overlapping nucleotides between a read and
a region to declare a hit, using:
```{r, minOverlap}
exp <- setMinOverlap(exp, minOverlap = 20)
```

`r Biocpkg("DESeq2")` is then used to get the adjusted p-values of these
regions.


### Extracting regions

The regions, with corresponding information provided by `r Biocpkg("DESeq2")`
(mean expression, fold-change, p-value, adjusted p-value, etc.), can be
extracted with this command:

```{r regions}
exp     <- sRNADiffExample()
exp     <- runAll(exp)
regions <- regions(exp, pvalue = 0.5)
```

where `pvalue` is the (adjusted) p-value threshold.  The output in a
`r Biocpkg("GenomicRanges")` object, and the information is accessible with the
`mcols()` function.


## Visualization

You can get the read coverage profile of your selected regions using the
`plotRegion` method.  For instance, if `exp` is your `srnadiff` object, you
can type:

```{r, plot}
plotRegion(exp, regions[1])
```

Each line provides the read density on an input sample.
The dark grey rectangle emphasizes the selected region.


## Misc

### Naming the conditions

The conditions can be sorted (`WT` before `mutant`, `untreated` before
`treated`) this way:

```{r, conditions}
conditions  <- factor(c("wt", "wt", "ko", "ko"), levels=c("wt", "ko"))
```


### Using several cores

The quantification and differential expression steps can be accelerated using
several cores and the following command:

```{r, threads}
exp <- setNThreads(exp, nThreads = 4)
```


### Troubleshooting

While installing the package, if the compiler complains and says

    #error This file requires compiler and library support for the ISO C++ 2011 standard.
    This support is currently experimental, and must be enabled with the -std=c++11 or -std=gnu++11 compiler options.

Add this line

    Sys.setenv("PKG_CXXFLAGS"="-std=c++11")

before installing the package.


# Session information

```{r, session_info}
devtools::session_info()
```


# References