---
title: "SummarizedBenchmark"
author: "Patrick K. Kimes, Alejandro Reyes"
date: "`r BiocStyle::doc_date()`"
package: "`r BiocStyle::pkg_ver('SummarizedBenchmark')`"
abstract: >
  ""
output:
  BiocStyle::html_document:
    highlight: pygments
    toc: true
    fig_width: 5
vignette: >
  %\VignetteIndexEntry{Data objects and non-standard use.}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  chunk_output_type: console
---

# Data objects

## Preparing the data from Soneson et al.

The data is available for download from ArrayExpress. Expression data for each sample is provided in the RSEM output format. Corresponding information for the ground truth underlying the simulated data is also available, including transcript differential expression status.

First, we download and import the transcript-level TPM values using the `r BiocStyle::pkg_ver('tximport')` package.

```{r download-data}
d <- tempdir()
download.file(url = paste0("https://www.ebi.ac.uk/arrayexpress/files/",
                           "E-MTAB-4119/E-MTAB-4119.processed.3.zip"),
              destfile = file.path(d, "samples.zip"))
unzip(file.path(d, "samples.zip"), exdir = d)

fl <- list.files(d, pattern = "*_rsem.txt", full.names=TRUE)
names(fl) <- gsub("sample(.*)_rsem.txt", "\\1", basename(fl))
library(tximport)
library(readr)

txi <- tximport(fl, txIn = TRUE, txOut = TRUE,
                geneIdCol = "gene_id",
                txIdCol = "transcript_id",
                countsCol = "expected_count",
                lengthCol = "effective_length",
                abundanceCol = "TPM",
                countsFromAbundance = "scaledTPM",
                importer = function(x){ read_tsv(x) })

```

Next, we obtain and load the ground truth information that can be used for evaluating the results of the differential expression analysis.

```{r}
download.file(url = paste0("https://www.ebi.ac.uk/arrayexpress/files/",
                           "E-MTAB-4119/E-MTAB-4119.processed.2.zip"),
              destfile = file.path(d, "truth.zip"))
unzip(file.path(d, "truth.zip"), exdir = d)

truthdat <- read_tsv(file.path(d, "truth_transcript.txt"))
#save( txi, truthdat, file="../data/soneson2016.rda" )
```

# Non-Standard Use

## Manually Constructing a SummarizedBenchmark

So far, this vignette has shown the recommended use of *SummarizedBenchmark*, that enables users to perform benchmarks automatically keeping track of parameters and software versions. However, users can also construct *SummarizedBenchmark* objects from standard `S3` data objects. 

Using data from the `r BiocStyle::Biocpkg("iCOBRA")`package [@Soneson_2016], this part of the vignette demonstrates how to build *SummarizedBenchmark* objects from `S3` objects. The dataset contains differential expression results of three different methods (`r BiocStyle::Biocpkg("limma")`, `r BiocStyle::Biocpkg("edgeR")` and `r BiocStyle::Biocpkg("DESeq2")`) applied to a simulated RNA-seq dataset.

```{r cobraData, message=FALSE, warning=FALSE}
library(iCOBRA)
data(cobradata_example)
```

The process of building a *SummarizedBenchmark* object is similar to the one used to construct a *SummarizedExperiment* object. To build a *SummarizedBenchmark* object, three main objects are required (1) a list where each element corresponds to a *data.frame*, (2) a *DataFrame* with annotations of the methods and (3) when available, a *DataFrame* of ground truths. 

In the *SummarizedBenchmark* object, each output of the methods is considered a different `assay`. For example, using the differential expression dataset example, we can define two assays, q-values and estimated log fold changes. For each `assay`, we arrange the output of the different methods as a matrix where each column corresponds to a method and each row corresponds to each feature (in this case, genes). We will need a list in which each of it's element corresponds to an assay.

```{r arrangeLists}
assays <- list(
  qvalue=cobradata_example@padj,
  logFC=cobradata_example@score )
assays[["qvalue"]]$DESeq2 <- p.adjust(cobradata_example@pval$DESeq2, method="BH")
head( assays[["qvalue"]], 3)
head( assays[["logFC"]], 3)
```

Since these are simulated data, the ground truths for both assays are known. We can format these as a *DataFrame* where each column corresponds to an assay and each row corresponds to a feature.

```{r groundTruths}
library(S4Vectors)
groundTruth <- DataFrame( cobradata_example@truth[,c("status", "logFC")] )
colnames(groundTruth) <- names( assays )
groundTruth <- groundTruth[rownames(assays[[1]]),]
head( groundTruth )
```

Then, the method names are also reformatted as a *DataFrame* 

```{r buildColData}
colData <- DataFrame( method=colnames(assays[[1]]) )
colData
```

A *SummarizedBenchmark* is build using the following command

```{r buildSB}
library(SummarizedBenchmark)
sb <- SummarizedBenchmark(
  assays=assays, 
  colData=colData,
  groundTruth=groundTruth )
```