---
title: "The mpra User's Guide"
author: "Leslie Myint, Kasper Daniel Hansen"
date: "`r doc_date()`"
package: "`r pkg_ver('mpra')`"
bibliography: mpra.bib
abstract: >
  A comprehensive guide to using the mpra package for analyzing
  massively parallel reporter assays (MPRA).
vignette: >
  %\VignetteIndexEntry{mpra User's Guide}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
output: 
  BiocStyle::html_document
---

# Introduction

The `r Biocpkg("mpra")` package provides tools for the analysis of data from massively parallel reporter assays (MPRA).  Specifically, it contains the functionality described in [@mpralm].  The primary analysis purpose is to enable differential analysis of activity measures, but the package can also be used to generate precision weights useful in regression analyses of activity scores on sequence features. The main workhorse of the `r Biocpkg("mpra")` package is the `mpralm()` function which draws on the previously proposed voom framework for RNA-seq analysis [@voom]. In this document, we will be looking at MPRA data from a study comparing episomal and lentiviral versions of MPRA [@Inoue:2017]. We will also look at MPRA data from a study comparing the regulatory activity of different alleles of thousands of SNPs [@Tewhey:2016].

## How to cite

If you are using this package, please cite [@mpralm]. If you are using the `mpralm()` function, it would be appropriate to also cite the voom framework [@voom].

# Dependencies

This document has the following dependencies

```{r dependencies, warning=FALSE, message=FALSE}
library(mpra)
```

# Creating an MPRASet object

In this package, MPRA data are contained in `MPRASet` objects. Because MPRA data do not have a common prescribed format, these objects must be created manually. In this section, we demonstrate how to do this.

`MPRASet` objects must contain DNA and RNA count information because this is the information used to quantify activity levels of the elements being assayed. DNA and RNA count information should be specified as $K \times S$ integer count matrices where $K$ is the total number of barcodes over all elements if barcode-level information is being supplied or the total number of putative regulatory elements (PREs) if element-level information is being supplied. $S$ is the number of samples (typically, the number of independent transfections).

`MPRASet` objects must also contain element identification information. This should be supplied as a character vector of length $K$, the number of rows in the DNA and RNA count matrices. These are any strings used to describe/identify the unique PREs being assayed.

Optionally, the barcode sequences and PRE sequences can be specified as length $K$ character vectors.

In the next sections we provide specific examples for how to specify this information for two common differential analysis settings: tissue and allele comparisons. Although we show simulated data, this information would typically be read from text files.

## Tissue comparison

In tissue comparison studies, the same set of PREs is assayed in two or more cell types. In the following example, the experiment looks at four PREs with three barcodes each. Two tissues (liver and kidney) are studied, and each tissue has four replicates (four independent transfections each).

RNA and DNA count matrices would look as below:

```{r}
E <- 4 # Number of elements
B <- 3 # Number of barcodes
s <- 4 # Samples per tissue
nt <- 2 # Number of tissues

set.seed(434)
rna <- matrix(rpois(E*B*s*nt, lambda = 30), nrow = E*B, ncol = s*nt)
dna <- matrix(rpois(E*B*s*nt, lambda = 30), nrow = E*B, ncol = s*nt)

rn <- as.character(outer(paste0("barcode_", seq_len(B), "_"), paste0("elem_", seq_len(E)), FUN = "paste0"))
cn <- c(paste0("liver_", seq_len(s)), paste0("kidney_", seq_len(s)))

rownames(rna) <- rn
rownames(dna) <- rn
colnames(rna) <- cn
colnames(dna) <- cn

rna
dna
```

PRE identification strings would look as below. When counts are provided at the barcode level, the `eid` character vector will have repeated elements.

```{r}
eid <- rep(paste0("elem_", seq_len(E)), each = B)
eid
```

We may also have PRE sequences as below. These sequences must be specified in a character vector of the same length as `eid` and the same number of rows as `rna` and `dna`.

```{r}
eseq <- replicate(E, paste(sample(c("A", "T", "C", "G"), 10, replace = TRUE), collapse = ""))
eseq <- rep(eseq, each = B)
eseq
```

The above pieces (`rna`, `dna`, `eid`, and `eseq`) can be supplied as arguments to the `MPRASet` constructor function as below. If `barcode` (barcode sequences) or `eseq` (PRE sequences) is not supplied, it must be specified as `NULL`.

```{r}
mpraset_example <- MPRASet(DNA = dna, RNA = rna, eid = eid, eseq = eseq, barcode = NULL)
mpraset_example
```

## Allele comparison

In allele comparison studies, PREs that exist with two or more alleles are assayed. All allelic-versions of the PREs are assayed in the same sample. Because activity comparisons between alleles is desired, these counts must be separated into different columns. In the following example, the experiment looks at four PREs with three barcodes each. There are two alleles per PRE, and there are four replicates (four independent transfections total).

Note that because the different alleles of a single PRE are linked to different barcodes, there is not a natural way to construct the RNA and DNA count matrices as above, where a particular barcode is in a row. Further, sometimes each PRE-allele combination is paired with varying numbers of barcodes. This is yet another reason that DNA and RNA count matrices should not look as above for allelic studies. In the example below, the count matrices shown might look as follows ***before*** they are ready to be made into an `MPRASet` object.

```{r}
E <- 2 # Number of elements
B <- 3 # Number of barcodes
s <- 4 # Total number of samples
nalleles <- 2 # Number of alleles

set.seed(434)
rna <- matrix(rpois(E*B*s*nalleles, lambda = 30), nrow = E*B*nalleles, ncol = s)
dna <- matrix(rpois(E*B*s*nalleles, lambda = 30), nrow = E*B*nalleles, ncol = s)

rn <- expand.grid(barcode = seq_len(B), allele = seq_len(nalleles), elem = seq_len(E))
rn <- paste0("barcode", rn$barcode, "_elem", rn$elem, "_allele", rn$allele)
cn <- paste0("sample", seq_len(s))

rownames(rna) <- rn
rownames(dna) <- rn
colnames(rna) <- cn
colnames(dna) <- cn

rna
dna
```

Most often with allelic studies, we will want to aggregate counts over barcodes to have summarized counts for each PRE-allele combination in each sample as below:

```{r}
agg_output <- lapply(seq_len(E), function(elem_id) {
    pattern1 <- paste0(paste0("elem", elem_id), "_allele1")
    bool_rna_allele1 <- grepl(pattern1, rownames(rna))
    pattern2 <- paste0(paste0("elem", elem_id), "_allele2")
    bool_rna_allele2 <- grepl(pattern2, rownames(rna))
    agg_rna <- c(
        colSums(rna[bool_rna_allele1,]),
        colSums(rna[bool_rna_allele2,])
    )
    names(agg_rna) <- paste0(rep(c("allele1", "allele2"), each = s), "_", names(agg_rna))
    bool_dna_allele1 <- grepl(pattern1, rownames(dna))
    bool_dna_allele2 <- grepl(pattern2, rownames(dna))
    agg_dna <- c(
        colSums(dna[bool_dna_allele1,]),
        colSums(dna[bool_dna_allele2,])
    )
    names(agg_dna) <- paste0(rep(c("allele1", "allele2"), each = s), "_", names(agg_dna))
    list(rna = agg_rna, dna = agg_dna)
})
agg_rna <- do.call(rbind, lapply(agg_output, "[[", "rna"))
agg_dna <- do.call(rbind, lapply(agg_output, "[[", "dna"))
eid <- paste0("elem", seq_len(E))
rownames(agg_rna) <- eid
rownames(agg_dna) <- eid
eseq <- replicate(E, paste(sample(c("A", "T", "C", "G"), 10, replace = TRUE), collapse = ""))
```

With the relevant information defined, we can use the `MPRASet` constructor function as in the first example:

```{r}
mpraset_example2 <- MPRASet(DNA = agg_dna, RNA = agg_rna, eid = eid, eseq = eseq, barcode = NULL)
mpraset_example2
```

# Analysis

## Tissue comparison

While the above section demonstrated how to create `MPRASet` objects, we will use preconstructed objects containing data from a comparison of episomal and lentiviral versions of MPRA [@Inoue:2017].

```{r}
data(mpraSetExample)
```

We create the design matrix with an indicator for the episomal (mutant integrase) samples and fit the precision-weighted linear model with `mpralm`. In MPRA experiments, activity measures are quantified as the log2 ratio of RNA counts over DNA counts. When there is barcode level information (as in this experiment), there are various ways to summarize information over barcodes to compute the final element- and sample-specific log ratios that are used for subsequent statistical modeling.

We have specified `aggregate = "mean"` to indicate that the element- and sample-specific log ratios will be computed by first computing the log ratio of RNA counts over DNA counts for each barcode, then taking the mean over barcodes in a particular element and sample. This is termed the "average estimator" in [@mpralm]. In contrast, specifying `aggregate = "sum"` would indicate the use of the "aggregate estimator", in which counts are first summed over barcodes to create total RNA and DNA counts, and the log ratio activity measure is the ratio of these total counts.

We have specifed `normalize = TRUE` to perform total count normalization on the RNA and DNA libraries. This scales all libraries to have a common size of 10 million reads.

Because this experiment looks at a set of PREs in two different cellular conditions, the different samples (columns of the `MPRASet` object) are independent. Thus we specify `model_type = "indep_groups"` to perform an unpaired analysis. In contrast, if we were performing an allele comparison, we would specify `model_type = "corr_groups"` to performed a paired analysis (indicating that different columns of the `MPRASet` object are linked).

Finally, we specify `plot = TRUE` to plot the relationship between log ratio variability versus element copy number.

```{r, fig=TRUE}
design <- data.frame(intcpt = 1, episomal = grepl("MT", colnames(mpraSetExample)))
mpralm_fit <- mpralm(object = mpraSetExample, design = design, aggregate = "mean", normalize = TRUE, model_type = "indep_groups", plot = TRUE)
```

The resulting fit object can be used with `topTable` from the `r Biocpkg("limma")` package.

```{r}
toptab <- topTable(mpralm_fit, coef = 2, number = Inf)
toptab6 <- head(toptab)
```

Because the element codes are rather long for this dataset, we do some tricks to print the top differential elements:

```{r printToptab10}
rownames(toptab6)
rownames(toptab6) <- NULL
toptab6
```

## Allelic comparison

We will also demonstrate an allelic comparison analysis using the data in [@Tewhey:2016]. In this study, the investigators compare reference and alternate allele versions of sequences containing SNPs. These sequences were believed to be eQTLs based on previous work.

We create a design matrix with an indicator for whether the counts come from the "B" allele (as opposed to the "A" allele). We also create an integer `block_vector` to indicate the actual sample that each column in the DNA and RNA count matrices comes from. In this case, columns 1 and 6 of the DNA and RNA count matrices come from sample 1 (the A and B alleles measured in transfection 1), columns 2 and 7 from sample 2, and so on. This information is needed to accurately model the within-sample correlation for a paired analysis.

```{r, fig=TRUE}
data(mpraSetAllelicExample)

design <- data.frame(intcpt = 1, alleleB = grepl("allele_B", colnames(mpraSetAllelicExample)))
block_vector <- rep(1:5, 2)
mpralm_allele_fit <- mpralm(object = mpraSetAllelicExample, design = design, aggregate = "none", normalize = TRUE, block = block_vector, model_type = "corr_groups", plot = TRUE)

toptab_allele <- topTable(mpralm_allele_fit, coef = 2, number = Inf)
head(toptab_allele)
```

# Session Info

```{r sessionInfo, results='asis', echo=FALSE}
sessionInfo()
```

# References