--- title: "_SummarizedExperiment_ for Coordinating Experimental Assays, Samples, and Regions of Interest" author: "Martin Morgan" date: "Revised: 18 August, 2015" output: BiocStyle::html_document: toc: true vignette: > %\VignetteIndexEntry{SummarizedExperiment} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r style, echo=FALSE, results='asis'} BiocStyle::markdown() ``` ```{r include = FALSE} # download current version of SE diagram download.file("https://docs.google.com/feeds/download/drawings/Export?id=1kiC8Qlo1mhSnLDqkGiRNPSo6GWn3C2duBszCFbJCB-g&exportFormat=svg", "SE.svg") ``` ![Summarized Experiment](SE.svg) # Motivating use case The `SummarizedExperiment` class is used to store rectangular matrices of experimental results, which are commonly produced by sequencing and microarray experiments. Each object stores observations of one or more samples, along with additional meta-data describing both the observations (features) and samples (phenotypes). A key aspect of the `SummarizedExperiment` it that the meta-data and assays are coordinated when subsetting. For instance if you want to exclude a given sample you can do for both the meta-data and assay in one operation, which ensures the meta-data and observed data will remain in sync. Improperly accounting for meta and observational data has resulted in a number of incorrect results and retractions (CITATIONS???), so this is a very desirable property. The `SummarizedExperiment` is in many ways similar to the historical `ExpressionSet`, the main distinction being that `SummarizedExperiment` is more flexible in it's row information, allowing both `GRange` based as well as those described by arbitrary `DataFrame`s. This makes it ideally suited to a variety of experiments, particularly sequencing based experiments such as RNA-Seq and ChIp-Seq. # Parts of a _SummarizedExperiment_ redo fig. 2 of http://www.nature.com/nmeth/journal/v12/n2/abs/nmeth.3252.html - Visual presentation of parts of SE - pseudo-code illustrating manipulation ## Assays The `airway` package contains an example dataset from an RNA-Seq experiment of read counts per gene for airway smooth muscles. This data is stored in a `SummarizedExperiment` object which contains 8 different experimental and assays 64,102 gene transcripts. ```{r} data(airway, package="airway") se <- airway se ``` To retrieve the experiment data from a `SummarizedExperiment` object one can use the `assays()` accessor. Each `SummarizedExperiment` object can have multiple assay datasets. Each of which can be accessed using the `$` operator. The `airway` dataset contains only one assay (`counts`). Here each row represents a gene transcript and each column one of the samples. ```{r assays, eval = FALSE} assays(se)$counts ``` ```{r assays_table, echo = FALSE} knitr::kable(assays(se)$counts[1:10,]) ``` ## 'Row' (regions-of-interest) data The `rowRanges()` accessor can be used to view the range information for each row in the dataset. This data is stored as a `GRangesList` object, where each object in the list corresponds to one gene transcript, and each `GRanges` object corresponds to one gene exon. ```{r rowRanges} rowRanges(se) ``` ### Genomic ranges or data frames? The `SummarizedExperiment` class can hold ranged information in the form of Genomic Range objects (`GRanges` and `GRangesList`), but it can also alternatively be a simple `DataFrame` object (TODO: clarify how to construct this). ```{r DataFrame} counts <- matrix(1:15, 5, 3, dimnames=list(LETTERS[1:5], LETTERS[1:3])) se0 <- SummarizedExperiment(assays=list(counts=counts)) mcols(se0) <- DataFrame(month=month.name[1:5], day=1:5) # Subset all January assays se0[mcols(se0)$month == "January", ] ``` ## 'Column' (sample) data Sample meta-data describing the samples can be accessed using `colData()`, and is a `DataFrame` that can store any number of descriptive columns for each sample row. ```{r colData} colData(se) ``` This sample metadata can be accessed using the `$` accessor for the `SummarizedExperiment` object, which makes it easy to subset the entire object by a given phenotype. ```{r columnSubset} # subset for only those samples treated with dexamethasone se[, se$dex == "trt"] ``` ## Experiment-wide metadata Meta-data describing the experimental methods and publication references can be accessed using `metadata()`. ```{r metadata} metadata(se) ``` Note that `metadata()` is just a simple list, so it is appropriate for _any_ experiment wide metadata the user wishes to save, such as storing model formulas. ```{r metadata-formula} metadata(se)$formula <- counts ~ dex + albut metadata(se) ``` # Putting it together Often, SummarizedExperiment objects are returned by functions written by other packages. However it is possible to create a `SummarizedExperiment` 'by hand'. ```{r construct} nrows <- 200 ncols <- 6 counts <- matrix(runif(nrows * ncols, 1, 1e4), nrows) rowRanges <- GRanges(rep(c("chr1", "chr2"), c(50, 150)), IRanges(floor(runif(200, 1e5, 1e6)), width=100), strand=sample(c("+", "-"), 200, TRUE), feature_id=sprintf("ID%03d", 1:200)) colData <- DataFrame(Treatment=rep(c("ChIP", "Input"), 3), row.names=LETTERS[1:6]) rse <- SummarizedExperiment(assays=SimpleList(counts=counts), rowRanges=rowRanges, colData=colData) rse ``` # Common operations on `SummarizedExperiment` ## Subsetting ## - `[` Performs two dimensional subsetting, just like subsetting a matrix or data frame. ```{r 2d} # subset the first five transcripts and first three samples se[1:5, 1:3] ``` - `$` operates on `colData()` columns, for easy sample extraction. ```{r colDataExtraction} se[, se$cell == "N61311"] ``` ## Getters and setters ## - `rowRanges()` / (`mcols()`), `colData()`, `metadata()` ```{r getSet} counts <- matrix(1:15, 5, 3, dimnames=list(LETTERS[1:5], LETTERS[1:3])) se0 <- SummarizedExperiment(assays=list(counts=counts)) mcols(se0) <- DataFrame(month=month.name[1:5], day=1:5) # Subset all January assays se0[mcols(se0)$month == "January", ] ``` - `assay()` versus `assays()` There exist two accessor functions to extract the assay data from a `SummarizedExperiment` object. `assays()` operates on the entire list of assay data as a whole, while `assay()` operates on only one assay at a time. `assay(x, i)` is simply a convenience function which is equivalent to `assays(x)[[i]]`. ```{r assay_assays} assays(se) assays(se)[[1]][1:5, 1:5] # assay defaults to the first assay if no i is given assay(se)[1:5, 1:5] assay(se, 1)[1:5, 1:5] ``` ## Range-based operations ## - `subsetByOverlaps()` `SummarizedExperiment` objects support all of the `findOverlaps()` methods and associated functions. This includes `subsetByOverlaps()`, which makes it easy to subset a `SummarizedExperiment` object by an interval. ```{r overlap} # Subset for only rows which are in the interval 100,000 to 110,000 of # chromosome 1 roi <- GRanges(seqnames="1", ranges=100000:1100000) subsetByOverlaps(se, roi) ``` # Case study revisited -- _SummarizedExperiment_ in action TODO, although see airways above # Advanced: Extending _SummarizedExperiment_ TODO