---
title: "Advanced usage: motif and sequence analysis"
shorttitle: "Advanced usage"
author:
- name: Benjamin Jean-Marie Tremblay
  affiliation: University of Waterloo, Waterloo, Canada
  email: b2tremblay@uwaterloo.ca
bibliography: universalmotif.bib
vignette: >
  %\VignetteIndexEntry{Advanced usage: motif and sequence analysis}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
output:
  BiocStyle::pdf_document
---

```{r setup, echo=FALSE}
knitr::opts_chunk$set(collapse=TRUE, comment = "#>")
suppressPackageStartupMessages(library(universalmotif))
suppressPackageStartupMessages(library(Biostrings))
suppressMessages(suppressPackageStartupMessages(library(MotifDb)))
suppressMessages(suppressPackageStartupMessages(library(ggtree)))
data(ArabidopsisPromoters)
data(ArabidopsisMotif)
```

# Introduction

This vignette details the `r Biocpkg("universalmotif")` implementation of higher-order motifs (`multifreq`), motif enrichment analyses, motif P-values, advanced usage related to motif comparison, drawing motif trees, and _de novo_ motif searches with MEME. For an introduction to sequence motifs, see the [introductory](IntroductionToSequenceMotifs.pdf) vignette. For a basic overview of available motif-related functions, see the [motif manipulation](MotifManipulation.pdf) vignette. For sequence-related utilities, see the [sequences](SequenceSearches.pdf) vignette.

# Higher-order motifs

Though PCM, PPM, PWM, and ICM type motifs are still widely used today, a few 'next generation' motif formats have been proposed. These wish to add another layer of information to motifs: positional interdependence. To illustrate this, consider the following sequences:

 # | Sequence
-- | --------
 1 | CAAAACC
 2 | CAAAACC
 3 | CAAAACC
 4 | CTTTTCC
 5 | CTTTTCC
 6 | CTTTTCC
: (\#tab:seqs2) Example sequences.

This becomes the following PPM:

Position |   1 |   2 |   3 |   4 |   5 |   6 |   7
-------- | --- | --- | --- | --- | --- | --- | ---
       A | 0.0 | 0.5 | 0.5 | 0.5 | 0.5 | 0.0 | 0.0
       C | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0
       G | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
       T | 0.0 | 0.5 | 0.5 | 0.5 | 0.5 | 0.0 | 0.0
: (\#tab:ppm2) Position Probability Matrix.

Based on the PPM representation, all three of CAAAACC, CTTTTCC, and CTATACC are equally likely. Though looking at the starting sequences, should CTATACC really be considered so? For transcription factor binding sites, some would say the answer is no. By incorporating this type of information into the motif, this can allow for increased accuracy in motif searching. A few implementations of this include: TFFM by @tffm, BaMM by @bamm, and KSM by @ksm.

The `r Biocpkg("universalmotif")` package implements its' own, rather simplified, version of this concept. Plainly, the standard PPM has been extended to include `k`-letter frequencies, with `k` being any number higher than 1. For example, the 2-letter version of the table \@ref(tab:ppm2) motif would be:

Position |   1 |   2 |   3 |   4 |   5 |   6 
-------- | --- | --- | --- | --- | --- | --- 
      AA | 0.0 | 0.5 | 0.5 | 0.5 | 0.0 | 0.0
      AC | 0.0 | 0.0 | 0.0 | 0.0 | 0.5 | 0.0
      AG | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
      AT | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
      CA | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
      CC | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0
      CG | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
      CT | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
      GA | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
      GC | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
      GG | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
      GT | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
      TA | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
      TC | 0.0 | 0.0 | 0.0 | 0.0 | 0.5 | 0.0
      TG | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
      TT | 0.0 | 0.5 | 0.5 | 0.5 | 0.0 | 0.0
: (\#tab:multi) 2-letter probability matrix.

This format shows the probability of each letter combined with the probability of the letter in the next position. The seventh column has been dropped, since it is not needed; the information in the sixth column is sufficient, and there is no eighth position to draw 2-letter probabilities from. Now, the probability of getting CTATACC is no longer equal to CTTTTCC and CAAAACC. This information is kept in the `multifreq` slot of `universalmotif` class motifs. To add this information, use the `add_multifreq` function.

```{r}
library(universalmotif)

motif <- create_motif("CWWWWCC", nsites = 6)
sequences <- DNAStringSet(rep(c("CAAAACC", "CTTTTCC"), 3))
motif.k2 <- add_multifreq(motif, sequences, add.k = 2)

## Alternatively:
# motif.k2 <- create_motif(sequences, add.multifreq = 2)

motif.k2
```

This information is most useful with functions such as `scan_sequences()` and `enrich_motifs()`. Though other tools in the `r Biocpkg("universalmotif")` can work with `multifreq` motifs (such as `motif_pvalue()`, `compare_motifs()`), keep in mind they are not as well supported as regular motifs (getting P-values from `multifreq` motifs is exponentially slower, and P-values from using `compare_motifs()` for `multifreq` motifs are not available by default). See the [sequences](SequenceSearches.pdf) vignette for using `scan_sequences()` with the `multifreq` slot.

# Enrichment analyses

The `r Biocpkg("universalmotif")` package offers the ability to search for enriched motif sites in a set of sequences via `enrich_motifs()`. There is little complexity to this, as it simply runs `scan_sequences()` twice; once on a set of target sequences, and once on a set of background sequences. After which the results between the two sequences are collated and run through enrichment tests. The background sequences can be given explicitly, or else `enrich_motifs()` will create background sequences on its own by using `shuffle_sequences()` on the target sequences.

Let us consider the following basic example:

```{r}
library(universalmotif)
data(ArabidopsisMotif)
data(ArabidopsisPromoters)

enrich_motifs(ArabidopsisMotif, ArabidopsisPromoters, shuffle.k = 3,
              threshold = 0.001, verbose = 0, progress = FALSE, RC = TRUE)
```

Here we can see that the motif is significantly enriched in the target sequences. Looking for closely at the results, the motif was found 197 times in 41 of the 50 sequences; whereas it was found 37 times in 14 of the 50 background sequences. The `Pval.hits` was calculated by calling `fisher.test` from the `r Rpackage("stats")` package. Another type of test is available which looks for positional enrichment, i.e. whether one area in the target sequences is favoured.

```{r}
library(universalmotif)
data(ArabidopsisMotif)
data(ArabidopsisPromoters)

enrich_motifs(ArabidopsisMotif, ArabidopsisPromoters, shuffle.k = 3,
              search.mode = "positional", verbose = 0, progress = FALSE,
              RC = TRUE)
```

In this case however no such bias exists in the target sequences. There are a number of tests available for positional enrichment; see `?enrich_motifs`.

One final point: always keep in mind the `threshold` parameter, as this will ultimately decide the number of hits found. (A bad threshold can lead to a false negative.) See the [sequences](SequenceSearches.pdf) vignette for a discussion about using P-values to determine the threshold of `scan_sequences()`. Motif P-values are discussed in more details in the next section of this document.

# Motif P-values

Motif P-values are not usually discussed outside of the bioinformatics literature, but are actually quite a challenging topic. For illustrate this, consider the following example motif:

```{r}
library(universalmotif)

m <- matrix(c(0.10,0.27,0.23,0.19,0.29,0.28,0.51,0.12,0.34,0.26,
              0.36,0.29,0.51,0.38,0.23,0.16,0.17,0.21,0.23,0.36,
              0.45,0.05,0.02,0.13,0.27,0.38,0.26,0.38,0.12,0.31,
              0.09,0.40,0.24,0.30,0.21,0.19,0.05,0.30,0.31,0.08),
            byrow = TRUE, nrow = 4)
motif <- create_motif(m, alphabet = "DNA", type = "PWM")
motif
```

Let us then use this motif with `scan_sequences()`:

```{r}
data(ArabidopsisPromoters)

res <- scan_sequences(motif, ArabidopsisPromoters, verbose = 0,
                      progress = FALSE, threshold = 0.8,
                      threshold.type = "logodds")
head(res)
```

Now let us imagine that we wish to rank these matches by P-value. First, we must calculate the match probabilities:

```{r}
## The second match was CTCTTTATTC, with a score of 3.896 (max possible = 6.403)

library(Biostrings)

bkg <- colMeans(oligonucleotideFrequency(ArabidopsisPromoters, 1,
                                         as.prob = TRUE))
bkg
```
Now, use these to calculate the probability of getting CTCTTTATTC.
```{r}
hit.prob <- bkg["A"]^1 * bkg["C"]^3 * bkg["G"]^0 * bkg["T"]^6
hit.prob <- unname(hit.prob)
hit.prob
```

Calculating the probability of a single match was easy, but then comes the challenging part: calculating the probability of all possible matches with a score higher than 3.896, and then summing these. This final sum then represents the probability of finding a match which scores at least 3.896. One way is to list all possible sequence combinations, then filtering based on score; however this "brute force" approach is unreasonable but for the smallest of motifs.

A few algorithms have been proposed to make this more efficient, but the method adopted by the `r Biocpkg("universalmotif")` package is that of @pvalues. The authors propose using a branch-and-bound^[https://en.wikipedia.org/wiki/Branch_and_bound] algorithm (with a few tricks) alongside a certain approximation. Briefly: motifs are first reorganized so that the highest scoring positions and letters are considered first in the branch-and-bound algorithm. Then, motifs past a certain width (in the original paper, 10) are split in sub-motifs. All possible combinations are found in these sub-motifs using the branch-and-bound algorithm, and P-values calculated for the sub-motifs. Finally, the P-values are combined.

The `motif_pvalue()` function modifies this process slightly by allowing the size of the sub-motifs to be specified via the `k` parameter; and additionally, whereas the original implementation can only calculate P-values for motifs with a maximum of 17 positions (and motifs can only be split in at most two), the `r Biocpkg("universalmotif")` implementation allows for any length of motif to be used (and motifs can be split any number of times). Changing `k` allows one to decide between speed and accuracy; smaller `k` leads to faster but worse approximations, and larger `k` leads to slower but better approximations. If `k` is equal to the width of the motif, then the calculation is _exact_.

Now, let us return to our original example:

```{r}
res <- res[1:6, ]
pvals <- motif_pvalue(motif, res$score, progress = FALSE, bkg.probs = bkg)
res2 <- data.frame(motif=res$motif,match=res$match,pval=pvals)[order(pvals), ]
knitr::kable(res2, digits = 22, row.names = FALSE, format = "markdown")
```

The default `k` in `motif_pvalue()` is 6. I have found this to be a good tradeoff between speed and P-value correctness.

To demonstrate the effect that `k` has on the output P-value, consider the following (and also note that for this motif `k = 10` represents an exact calculation):

```{r}
scores <- c(-6, -3, 0, 3, 6)
k <- c(2, 4, 6, 8, 10)
out <- data.frame(k = c(2, 4, 6, 8, 10),
                  score.minus6 = rep(0, 5),
                  score.minus3 = rep(0, 5),
                  score.0 = rep(0, 5),
                  score.3 = rep(0, 5),
                  score.6 = rep(0, 5))

for (i in seq_along(scores)) {
  for (j in seq_along(k)) {
    out[j, i + 1] <- motif_pvalue(motif, scores[i], k = k[j], progress = FALSE,
                                  bkg.probs = bkg)
  }
}

knitr::kable(out, format = "markdown", digits = 10)
```

For this particular motif, while the approximation worsens slightly as `k` decreases, it is still quite reasonable down to `k = 6`. Usually, you should only have to worry about `k` for longer motifs (such as those typically generated by MEME), where the number of sub-motifs increases.

The `r Biocpkg("universalmotif")` also allows for calculating motifs scores from P-values. Similarly to calculating P-values, exact scores can be calculated from small motifs, and approximate scores from big motifs using subsetting. Unlike the P-value calculation however, a uniform background is assumed. This is because approximate scores are calculated by assuming the distribution of possible scores follows a normal distribution. See the following code for a comparison of exact and approximate score calculations.

Starting from a set of P-values:

```{r}
bkg <- c(A=0.25, C=0.25, G=0.25, T=0.25)
pvals <- c(0.1, 0.01, 0.001, 0.0001, 0.00001)
scores <- motif_pvalue(motif, pvalue = pvals, progress = FALSE,
                       bkg.probs = bkg, k = 10)

scores.approx6 <- motif_pvalue(motif, pvalue = pvals, progress = FALSE,
                               bkg.probs = bkg, k = 6)
scores.approx8 <- motif_pvalue(motif, pvalue = pvals, progress = FALSE,
                               bkg.probs = bkg, k = 8)

pvals.exact <- motif_pvalue(motif, score = scores, progress = FALSE,
                            bkg.probs = bkg, k = 10)

pvals.approx6 <- motif_pvalue(motif, score = scores, progress = FALSE,
                              bkg.probs = bkg, k = 6)
pvals.approx8 <- motif_pvalue(motif, score = scores, progress = FALSE,
                              bkg.probs = bkg, k = 8)

res <- data.frame(pvalue = pvals, score = scores,
                  pvalue.exact = pvals.exact,
                  pvalue.k6 = pvals.approx6,
                  pvalue.k8 = pvals.approx8,
                  score.k6 = scores.approx6,
                  score.k8 = scores.approx8)
knitr::kable(res, format = "markdown", digits = 22)
```

Starting from a set of scores:

```{r}
bkg <- c(A=0.25, C=0.25, G=0.25, T=0.25)
scores <- -2:6
pvals <- motif_pvalue(motif, score = scores, progress = FALSE,
                      bkg.probs = bkg, k = 10)

scores.exact <- motif_pvalue(motif, pvalue = pvals, progress = FALSE,
                             bkg.probs = bkg, k = 10)

scores.approx6 <- motif_pvalue(motif, pvalue = pvals, progress = FALSE,
                               bkg.probs = bkg, k = 6)
scores.approx8 <- motif_pvalue(motif, pvalue = pvals, progress = FALSE,
                               bkg.probs = bkg, k = 8)

pvals.approx6 <- motif_pvalue(motif, score = scores, progress = FALSE,
                              bkg.probs = bkg, k = 6)
pvals.approx8 <- motif_pvalue(motif, score = scores, progress = FALSE,
                              bkg.probs = bkg, k = 8)

res <- data.frame(score = scores, pvalue = pvals,
                  pvalue.k6 = pvals.approx6,
                  pvalue.k8 = pvals.approx8,
                  score.exact = scores.exact,
                  score.k6 = scores.approx6,
                  score.k8 = scores.approx8)
knitr::kable(res, format = "markdown", digits = 22)
```
As you can see, results from exact calculations are not _quite_ exact. This is due to the fact that the distributions of P-values and scores are assumed to be continuous, when in reality they are discrete. As for results from approximate calculations: they are fairly close to exact values for larger P-values, but fall off for small P-values (P < 0.001).

# Advanced motif comparison

Here I will explore the effects of some of the options available in `compare_motifs()`. See the relevant section in the [basic motif manipulation](MotifManipulation.pdf) vignette for an introduction to using `compare_motifs()`. Let us begin by comparing the available methods:

```{r}
library(universalmotif)
library(MotifDb)

motifs <- convert_motifs(MotifDb)
motifs <- filter_motifs(motifs, altname = c("M0003_1.02", "M0004_1.02"))
summarise_motifs(motifs)
try_all <- function(motifs, ...) {
  scores <- numeric(8)
  methods <- c("MPCC", "PCC", "MEUCL", "EUCL", "MSW", "SW", "MKL", "KL")
  for (i in 1:8) {
    scores[i] <- compare_motifs(motifs, method = methods[i],
                                progress = FALSE, ...)[1, 2]
  }
  res <- data.frame(method = c("MPCC", "PCC", "MEUCL", "EUCL",
                                "MSW", "SW", "MKL", "KL"),
                     score = scores,
                     max = c(1, NA, sqrt(2), NA, 2, NA, NA, NA),
                     type = c("similarity", "similarity", "distance",
                              "distance", "similarity", "similarity",
                              "distance", "distance"))
  knitr::kable(res, format = "markdown")
}
try_all(motifs)
```

From this you can get a sense of how the different methods perform. See the function documentation at `?compare_motifs` for details on the method implementations. For now, let us explore how changing the other parameters affects the scores. First, we can control whether to allow the reverse complements to be compared as well:

```{r}
try_all(motifs, tryRC = FALSE)
```

In this case not allowing the reverse complement of the motifs did nothing; this is because the current motif orientations had better scores regardless. Next, whether to compare the motifs as PPM or ICM:

```{r}
try_all(motifs, use.type = "ICM")
```

Going from comparing PPM motifs to ICM motifs changes the scores quite a bit; in this case noticeable by increasing the distance between the two. Be careful about jumping to using `use.type = "ICM"` though; keep in mind that the conversion from PPM is ICM is not linear, so the comparison is inherently quite different in nature. Next, `normalise.scores`:

```{r}
try_all(motifs, normalise.scores = TRUE)
```

Again, using this option increases the distance between motifs. To explain this, let's visualize the motif alignment:

```{r}
view_motifs(motifs)
```

From this image you can see that there is a bit of an overhang. What `normalise.scores` does is punish this overhang by multiplying the final score by the ratio of aligned to un-aligned positions. To illustrate this further, we can restrict the alignment so there are as few as possible overhangs:

```{r}
view_motifs(motifs, min.overlap = 99)
try_all(motifs, min.overlap = 99)
try_all(motifs, min.overlap = 99, normalise.scores = TRUE)
```

Now, using `normalise.scores = TRUE` is less punishing. These options are important; turning them off results in the following:

```{r}
try_all(motifs, min.overlap = 1)
view_motifs(motifs, min.overlap = 1)
```

In trying to find the highest possible score, `compare_motifs()` ended up with this alignment, which most would agree is not a fair comparison of the two motifs. The final parameter I will discuss is `min.mean.ic`. When you look at the motifs being compared, they both have low information content regions. The `compare_motifs()` function allows you decide whether you want low information content positions to be scored (positions which don't pass the set threshold will not contribute to the score).

```{r}
motifs2 <- filter_motifs(MotifDb, altname = c("M0100_1.02", "M0104_1.02"))
view_motifs(motifs2, min.mean.ic = 0)
try_all(motifs2, min.mean.ic = 0)
view_motifs(motifs2, min.mean.ic = 0.7)
try_all(motifs2, min.mean.ic = 0.7)
```

In this case, changing the `min.mean.ic` did not alter the alignment, but had a great impact on the scores. Setting this too high can be quite punishing; I have found `min.mean.ic = 0.5` to be a good middle ground.

# Motif trees with ggtree

Additionally, this package introduces the `motif_tree()` function for generating basic tree-like diagrams for comparing motifs. This allows for a visual result from `compare_motifs()`. All options from `compare_motifs()` are available in `motif_tree()`. This function uses the `r Biocpkg("ggtree")` package and outputs a `ggplot` object (from the `r CRANpkg("ggplot2")` package), so altering the look of the trees can be done easily after `motif_tree()` has already been run.

```{r}
library(universalmotif)
library(MotifDb)

motifs <- convert_motifs(MotifDb)
motifs <- filter_motifs(motifs, organism = "Athaliana")[1:100]
tree <- motif_tree(motifs, layout = "daylight", linecol = "family",
                   progress = FALSE)

## Make some changes to the tree in regular ggplot2 fashion:
# tree <- tree + ...

tree
```

While `motif_tree()` works as a quick and convenient tree-building function, it can be inconvenient when more control is required over tree construction. For this purpose, the following code goes through how exactly `motif_tree()` generates trees.


```{r}
library(universalmotif)
library(MotifDb)
library(ggtree)
library(ggplot2)

motifs <- convert_motifs(MotifDb)
motifs <- filter_motifs(motifs, organism = "Athaliana")
motifs <- motifs[sample(seq_along(motifs), 25)]

## Step 1: compare motifs

comparisons <- compare_motifs(motifs, progress = FALSE)

## Step 2: create a "dist" object

# The default metric, MPCC, is a similarity metric
comparisons <- 1 - comparisons

comparisons <- as.dist(comparisons)

# We also want to extract names from the dist object to match annotations
labels <- attr(comparisons, "Labels")

## Step 3: get the comparisons ready for tree-building

# The R package "ape" provides the necessary "as.phylo" function
comparisons <- ape::as.phylo(hclust(comparisons))

## Step 4: incorporate annotation data to colour tree lines

family <- sapply(motifs, function(x) x["family"])
family.unique <- unique(family)

# We need to create a list with an entry for each family; within each entry
# are the names of the motifs belonging to that family
family.annotations <- list()
for (i in seq_along(family.unique)) {
  family.annotations <- c(family.annotations,
                          list(labels[family %in% family.unique[i]]))
}
names(family.annotations) <- family.unique

# Now add the annotation data:
comparisons <- ggtree::groupOTU(comparisons, family.annotations)

## Step 5: draw the tree

tree <- ggtree(comparisons, aes(colour = group), layout = "rectangular") +
          theme(legend.position = "bottom", legend.title = element_blank())

## Step 6: add additional annotations

# If we wish, we can additional annotations such as tip labelling and size

# Tip labels:
tree <- tree + geom_tiplab()

# Tip size:
tipsize <- data.frame(label = labels,
                      icscore = sapply(motifs, function(x) x["icscore"]))

tree <- tree %<+% tipsize + geom_tippoint(aes(size = icscore))

```

# Motif discovery with MEME

The `r Biocpkg("universalmotif")` package provides a simple wrapper to the powerful motif discovery tool MEME [@meme3]. To run an analysis with MEME, all that is required is a set of `XStringSet` class sequences (defined in the `r Biocpkg("Biostrings")` package), and `run_meme()` will take care of running the program and reading the output for use within R.

The first step is to check that R can find the MEME binary in your `$PATH` by running `run_meme()` without any parameters. If successful, you should see the default MEME help message in your console. If not, then you'll need to provide the complete path to the MEME binary. There are two options:

```{r,eval=FALSE}
library(universalmotif)

## 1. Once per session: via `options`

options(meme.bin = "/path/to/meme/bin/meme")

run_meme(...)

## 2. Once per run: via `run_meme`

run_meme(..., bin = "/path/to/meme/bin/meme")
```

Now we need to get some sequences to use with `run_meme()`. At this point we can read sequences from disk or extract them from one of the Bioconductor `BSgenome` packages.

```{r,eval=FALSE}
library(universalmotif)
data(ArabidopsisPromoters)

## 1. Read sequences from disk (in fasta format):

library(Biostrings)

# The following `read*` functions are available in Biostrings:
# DNA: readDNAStringSet
# DNA with quality scores: readQualityScaledDNAStringSet
# RNA: readRNAStringSet
# amino acid: readAAStringSet
# any: readBStringSet

sequences <- readDNAStringSet("/path/to/sequences.fasta")

run_meme(sequences, ...)

## 2. Extract from a `BSgenome` object:

library(GenomicFeatures)
library(TxDb.Athaliana.BioMart.plantsmart28)
library(BSgenome.Athaliana.TAIR.TAIR9)

# Let us retrieve the same promoter sequences from ArabidopsisPromoters:
gene.names <- names(ArabidopsisPromoters)

# First get the transcript coordinates from the relevant `TxDb` object:
transcripts <- transcriptsBy(TxDb.Athaliana.BioMart.plantsmart28,
                             by = "gene")[gene.names]

# There are multiple transcripts per gene, we only care for the first one
# in each:

transcripts <- lapply(transcripts, function(x) x[1])
transcripts <- unlist(GRangesList(transcripts))

# Then the actual sequences:

# Unfortunately this is a case where the chromosome names do not match
# between the two databases

seqlevels(TxDb.Athaliana.BioMart.plantsmart28)
#> [1] "1"  "2"  "3"  "4"  "5"  "Mt" "Pt"
seqlevels(BSgenome.Athaliana.TAIR.TAIR9)
#> [1] "Chr1" "Chr2" "Chr3" "Chr4" "Chr5" "ChrM" "ChrC"

# So we must first rename the chromosomes in `transcripts`:
seqlevels(transcripts) <- seqlevels(BSgenome.Athaliana.TAIR.TAIR9)

# Finally we can extract the sequences
promoters <- getPromoterSeq(transcripts,
                            BSgenome.Athaliana.TAIR.TAIR9,
                            upstream = 0, downstream = 1000)

run_meme(promoters, ...)
```

Once the sequences are ready, there are few important options to keep in mind. One is whether to conserve the output from MEME. The default is not to, but this can be changed by setting the relevant option.

```{r,eval=FALSE}
run_meme(sequences, output = "/path/to/desired/output/folder")
```

The second important option is the search function (`objfun`). Some search functions such as the default `classic` do not require a set of background sequences, whilst some do (such as `de`). If you choose one of the latter, then you can either let MEME create them for you (it will shuffle the target seqeunces) or you can provide them via the `control.sequences` parameter.

Finally, choose how you'd like the data imported into R. Once the MEME program exits, `run_meme()` will import the results into R with `read_meme()`; at this point you can decide if you want just the motifs themselves (`readsites = FALSE`) or if you'd like the original sequence sites as well (`readsites = TRUE`, the default).

There are a wealth of other MEME options available, such as the number of desired motifs (`nmotifs`), the width of desired motifs (`minw`, `maxw`), the search mode (`mod`), assigning sequence weights (`weights`), using a custom alphabet (`alph`), and many others. See the output from `run_meme()` for a brief description of the options, or visit the [online manual](http://meme-suite.org/doc/meme.html) for more details.

# Session info {.unnumbered}

```{r sessionInfo, echo=FALSE}
sessionInfo()
```

# References {.unnumbered}