---
title: "Scanning sequences for miRNA binding sites and exploring matches with scanMiR"
author: 
- name: Fridolin Gross
  affiliation: Lab of Systems Neuroscience, D-HEST Institute for Neuroscience, ETH
- name: Pierre-Luc Germain
  affiliation:
    - D-HEST Institute for Neuroscience, ETH
    - Lab of Statistical Bioinformatics, UZH
- name: Michael Soutschek
  affiliation: Lab of Systems Neuroscience, D-HEST Institute for Neuroscience, ETH
package: scanMiR
output:
  BiocStyle::html_document
abstract: |
  This vignettes explores scanMiR's scanning functions.
vignette: |
  %\VignetteIndexEntry{1_scanning}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include=FALSE}
library(BiocStyle)
```

```{r echo = FALSE, out.width="35%", fig.align = 'right'}
knitr::include_graphics(system.file('docs', 'sticker.svg', package = 'scanMiR'))
```

# Scanning
## Background
`r Biocpkg("scanMiR")` can be used to identify potential binding sites given a 
set of miRNAs and a set of transcripts. Furthermore, it determines the type of 
binding site and, given a `KdModel` object, the predicted affinity of the site.

## Basic Scan
The main function used for determining matches of miRNAs in a given set of 
sequences is `findSeedMatches`. It accepts a set of DNA sequences either as a 
character vector or as a [DNAStringSet](https://bioconductor.org/packages/release/bioc/html/Biostrings.html). 
The miRNAs can be provided either as a character vector of seeds/miRNA 
sequences or as a `KdModelList`.

### Using a miRNA Seed
The seed must be given in the form of a (RNA or DNA) sequence of length 7 or 8
(the 8th nucleotide being the final 'A' - it is added if only 7 are given). 
Note that the seed should be given as it would appear in a match in the target 
sequence (i.e. the reverse complement of how it appears in the miRNA).
```{r}
library(scanMiR)

# seed sequence of hsa-miR-155-5p
seed <- "AGCAUUAA"

# load a sample transcript
data("SampleTranscript")

# run scan
matches <- findSeedMatches(SampleTranscript, seed, verbose = FALSE)
matches
```

By default, a [GRanges](https://bioconductor.org/packages/release/bioc/html/GenomicRanges.html) 
object is returned. Apart from the position of the matches, it provides 
information on the type of the putative binding site corresponding to the match.
Setting `ret = "data.frame"` returns the same information as a data.frame.

### Using a miRNA sequence
Alternatively, we can provide the full miRNA sequence, which results in 
additional information on supplementary 3' pairing in the form of an aggregated 
score (see Section \@ref(sec:3pAlignment) for further details).
```{r}
# full sequence of the mature miR-155-5p transcript
miRNA <- "UUAAUGCUAAUCGUGAUAGGGGUU"

# run scan
matches <- findSeedMatches(SampleTranscript, miRNA, verbose = FALSE)
matches
```

We can take a closer look at the alignment of the first match:
```{r}
viewTargetAlignment(matches[1], miRNA, SampleTranscript)
```

Apart from the direct seed match (right), this representation also reveals the 
extensive supplementary 3' pairing (left).

### Using a KdModel
Finally, we can provide the miRNA in the form of a `KdModel` (see the 
[vignette on KdModels]("Kdmodels.html") for further information). In this case 
`findSeedMatches` also returns the predicted affinity value for each match. The 
`log_kd` column contains log(Kd) values multiplied by 1000, where Kd is the 
predicted dissociation constant of miRNA:mRNA binding for the putative binding 
site.
```{r}
# load sample KdModel
data("SampleKdModel")

# run scan
matches <- findSeedMatches(SampleTranscript, SampleKdModel, verbose = FALSE)
matches
```

Running `findSeedMatches` using a `Kdmodel` also returns matches that correspond
to non-canonical binding sites. These are typically of low affinity, but may 
affect repression if several of them are found on the same transcript. The scan 
can be restricted to canonical sites using the option `onlyCanonical = TRUE`.

`KdModel` collections corresponding to all human, mouse and rat mirbase miRNAs 
can be obtained through the 
[scanMiRData](https://github.com/ETHZ-INS/scanMiRData) package.

### About match types

For canonical sites, we use the site types described in [Grimson et al., Molecular Cell 2007](https://www.sciencedirect.com/science/article/pii/S1097276507004078) based on the matching nucleotides form the miRNA seed:

```{r echo = FALSE, out.width="80%", fig.align = 'right', fig.cap="Adapted from Grimson et al. 2007"}
knitr::include_graphics(system.file('docs', 'sitetypes.jpg', package = 'scanMiR'))
```

In addition, we include with two additional site types that are classically
considered non-canonical, but have much stronger evidence of binding
than other non-canonical ones, namely G-bulged sites that have an extra non-matching
G in position 5â€“6 that is bulged out (see [Chi, Hannon & Darnell, 2012](https://www.nature.com/articles/nsmb.2230), and
wobble sites have a single G:U replacement (see [Becker et al., 2019](https://www.sciencedirect.com/science/article/pii/S1097276519304459)).

Also note that all binding sites are reported as the alignment of the seed 
region of the miRNA onto the target mRNA, and as such it is always 8 nucleotide
long, even if only 6 or 7nt are actually matching the seed.

## Further Options

### ORF length
If the transcript sequences are provided as a `DNAStringSet`, one may specify 
the length of the open reading frame region of the transcripts as a metadata 
column in order to distinguish between matches in the ORF and 3'UTR regions.
```{r message = FALSE}
library(Biostrings)

# generate set of random sequences
seqs <- DNAStringSet(getRandomSeq(length = 1000, n = 10))

# add vector of ORF lengths
mcols(seqs)$ORF.length <- sample(500:800, length(seqs))

# run scan
matches2 <- findSeedMatches(seqs, SampleKdModel, verbose = FALSE)
head(matches2)
```

### Supplementary 3' pairing  {#sec:3pAlignment}

Upon binding the seed regions, further complementary pairing of the target to the 3' 
region of the miRNA can increase affinity and further stabilize the binding 
([Schirle, Sheu-Gruttadauria and MacRae, 2014](https://dx.doi.org/10.1126/science.1258040)).
Upon finding a seed match, `scanMiR` performs a local alignment on the upstream
region to identify such complementary 3' binding. This is internally done by the
`get3pAlignment()` function, the arguments to which (e.g. the maximum size of
the gap between the seed binding and the complementary binding) can be passed 
via the `findSeedMatches` argument `p3.params`. By default, when running 
`findSeedMatches` a 3' score is reported in the matches, which roughly 
corresponds to the number of consecutive matching nucleotides (adjusting for 
small gaps and T/G substitutions) within the constraints (see 
`?get3pAlignment` for more detail). More information (such as the size of the 
miRNA and target loops between the two complementary regions) can be reported by
setting `findSeedMatches(..., p3.extra=TRUE)`. In addition, the pairing can be
visualized with `viewTargetAlignment`:

```{r}
viewTargetAlignment(matches[1], SampleKdModel, SampleTranscript)
```


Some forms of 3' bindings can however lead to drastic functional consequences. 
For example, sufficient final complementary at the 3' end of the miRNA can lead 
to Target-Directed miRNA Degradation (TDMD, 
[Sheu-Gruttadauria, Pawlica et al., 2019](https://dx.doi.org/10.1016/j.molcel.2019.06.019)).
`findSeedMatches` will also flag such putative sites in the `notes` column of 
the matches. Finally, while circular RNAs can act as miRNA sponges, some miRNA 
bindings can slice their circular structure 
[Hansen et al., 2011](https://doi.org/10.1038/emboj.2011.359) and free their 
cargo. `findSeedMatches` will also flag such sites in the `notes` column.


### Shadow and Overlapping Matches
The `shadow` argument can be used to take into account the observation that 
sites within the first ~15 nucleotides of the 3'UTR show poor efficiency 
([Grimson et al. 2007](https://www.sciencedirect.com/science/article/pii/S1097276507004078)). 
`findSeedMatches` will treat matches within the first `shadow` positions of the 
UTR in the same way as matches in the ORF region. If no information on ORF 
lengths is provided, it will simply ignore the first `shadow` positions. The 
default setting is `shadow = 0`.

The parameter `minDist` can be used to specify the minimum distance between 
matches of the same miRNA (default 7). If there are multiple matches within 
`minDist`, only the highest affinity match will be considered.


### Aggregation on the fly
With `ret = "aggregated"` one obtains a data.frame that contains the predicted 
repression for each sequence-seed-pair aggregated over all matches along with 
information about the types and number of matches. Parameters for aggregation 
can be specified using `agg.params`. For further details, see Section 
\@ref(sec:aggregating).

# Aggregating Sites {#sec:aggregating}
## Background
`r Biocpkg("scanMiR")` implements aggregation of miRNA sites based on the 
biochemical model from 
[McGeary, Lin et al. (2019)](https://dx.doi.org/10.1126/science.aav1741). 
This model first predicts the occupancy of AGO-miRNA complexes at each 
potential binding site as a function of the measured or estimated dissociation 
constants (Kds). It then assumes an additive effect of the miRNA on the basal 
decay rate of the transcript that is proportional to this occupancy.

The key parameters of this model are:

* `a`: the relative concentration of unbound AGO-miRNA complexes
* `b`: the factor that multiplies with the occupancy and is added to the basal 
decay rate (can be interpreted as the additional repression caused by a single 
bound AGO)
* `c`: the penalty factor for sites that are found within the ORF region

More specifically, the occupancy of a mRNA $m$ by miRNA $g$, with $p$ matches 
in the ORF region and $q$ matches in the 3'UTR region, is given by the 
following equation:
$$
\begin{equation}
N_{m,g} = 
  \sum_{i=1}^{p}\left(\frac{a_g}{a_g + c_{\text{ORF}} 
  K_{d,i}^{\text{ORF}}}\right) +
  \sum_{j=1}^{q}\left(\frac{a_g}{a_g + K_{d,j}^{\text{3'UTR}}}\right)
\end{equation}
$$

The corresponding background occupancy is estimated by substituting the average 
affinity of nonspecifically bound sites (i.e. $K_d = 1.0$):
$$
\begin{equation}
N_{m,g,\text{background}} = 
  \sum_{i=1}^{p}\left(\frac{a_g}{a_g + c_{\text{ORF}}}\right) +
  \sum_{j=1}^{q}\left(\frac{a_g}{a_g + 1}\right)
\end{equation}
$$
In addition to this original model, `scanMiR` includes a coefficient `e` which 
adjusts the Kd values based on the supplementary 3' alignment:

$$
\begin{equation}
N_{m,g} = 
  \sum_{i=1}^{p}\left(\frac{a_g}{a_g + e_{i}c_{\text{ORF}} 
    K_{d,i}^{\text{ORF}}}\right) +
  \sum_{j=1}^{q}\left(\frac{a_g}{a_g + e_{j}K_{d,j}^{\text{3'UTR}}}\right)
\end{equation}
$$

with $e_i = \exp(\text{p3}\cdot\text{p3.score}_i)$. `p3` is a global parameter, 
and $p3.score_i$ is the 3' alignment score (roughly corresponding to the number
of  matched nucleotides, by default capped to 8 and set to 0 if below 3). Of
note,  the default value of `p3` is very small, leading to a very mild  effect.
The importance of complementary binding seems to depend on the miRNA, and at the
moment there is no easy way to predict this from the miRNA sequence. Our
conservative estimate might not attribute sufficient importance  to this factor
for some miRNAs.

The repression is then obtained as the log fold change of the two occupancies:
$$
\text{repression} = \log(1+bN_{m,g,\text{background}}) - \log(1+bN_{m,g})
$$

Because UTR and ORF lengths have been reported to influence the efficacy of 
repression, `scanMiR` also includes an additional modifier to terms handling 
these effects:

$$
\text{repression}_{\text{adj}} = \text{repression}\cdot
(1+f\cdot\text{UTR.length}+h\cdot\text{ORF.length})
$$
While `b`, `c`, `p3`, `f` and `h` are considered global parameters (i.e. the 
same for different miRNAs and transcripts and also across experimental 
contexts), `a` is expected to be different for each miRNA in a 
given experimental condition. However, as shown by 
[McGeary, Lin et al. (2019)](https://dx.doi.org/10.1126/science.aav1741), the 
model performance is robust to changes in `a` over several orders of magnitude. 
Aggregation for all miRNA-transcript pairs for a given data set is therefore 
usually based on a single `a` value.

## Basic Aggregation
Given a `GRanges` or data.frame of matches as returned by `findSeedMatches`, 
aggregation can be performed by the function `aggregateMatches`:
```{r}
agg_matches <- aggregateMatches(matches2)
head(agg_matches)
```
This returns a data.frame with predicted repression values for each 
miRNA-transcript pair along with a count table of the different site types. If 
`matches` does not contain a `log_kd` column, only the count table will be 
returned.

`r Biocpkg("scanMiR")` uses the following 
default parameter values for aggregation that have been determined by fitting 
and validating the model using several experimental data sets:
```{r}
unlist(scanMiR:::.defaultAggParams())
```
Where `coef_utr` and `coef_orf` respectively correspond to the `f` and `h` in
the above formula. To disable these features, they can simply be set to 
zero. `keepSiteInfo` lets you choose whether the site count table should be 
returned. The parameters can be passed directly to `aggregateMatches`, or passed
to `findSeedMatch` when doing aggregation on-the-fly using the `agg.params` 
argument.

# Dealing with very large scans

## Multithreading

To deal with large amounts of sequences and/or seeds, both `findSeedMatches` 
and `aggregateMatches` support multithreading using the 
`r Biocpkg("BiocParallel")` package. This can be activated by passing 
`BP = MulticoreParam(ncores)`.

Depending on the system and the size of the scan (i.e. when including all 
non-canonical sites), mutlithreading can potentially take a large amount of 
memory. To avoid memory issues, the number of seeds processed simultaneously by
`findSeedMatches` can be restricted using the `n_seeds` parameter.
Alternatively, scan results can be saved to temporary files using the 
`useTmpFiles` argument (see `?findSeedMatches` for more detail).

Note that in addition to the multithreading specified in its arguments, 
`aggregateMatches` uses the `r CRANpkg("data.table")` package, which is often 
set to use multi-threading by default (see `?data.table::setDTthreads` for more
information). This can leave to CPU usage higher than specified through the 
`BP` argument of `aggregateMatches`.

## Dealing with large collections of predictions

Binding sites for all miRNAs on all transcripts, especially when including 
non-canonical sites, can easily amount to prohibitive amounts of memory. The 
companion [scanMiRApp](https://github.com/ETHZ-INS/scanMiRApp) package includes 
a class implementing fast indexed access to on-disk GenomicRanges and 
data.frames. The package additionally contains wrapper (e.g. for performing full
transcriptome scans) for common species and for detecting enriched miRNA-target 
pairs, as well as a shiny interface to `scanMiR`.

<br/><br/>

# Session info {.unnumbered}

```{r sessionInfo, echo=FALSE}
sessionInfo()
```