--- title: "getDEE2: Programmatic access to the DEE2 RNA expression dataset" author: "Mark Ziemann & Antony Kaspi" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true vignette: > %\VignetteIndexEntry{getDEE2} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ## Background Digital Expression Explorer 2 (or DEE2 for short) is a repository of processed RNA-seq data in the form of counts. It was designed so that researchers could undertake re-analysis and meta-analysis of published RNA-seq studies quickly and easily. As of April 2020, over 1 million SRA runs have been processed. For further information about the resource, refer to the [journal article](https://doi.org/10.1093/gigascience/giz022) and [project homepage](http://dee2.io). This package provides an interface to access these expression data programmatically. ## Getting started ```{r, install, eval = FALSE} if(!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("getDEE2") ``` ```{r, lib} library("getDEE2") ``` ## Searching for datasets of interest starting with accession numbers The first step is to download the list of accession numbers of available datasets with the `getDEE2Metadata` function, specifying a species name. options for species currently are: * athaliana * celegans * dmelanogaster * drerio * ecoli * hsapiens * mmusculus * rnorvegicus * scerevisiae If the species name is incorrect, an error will be thrown. ```{r, getmeta} mdat <- getDEE2Metadata("celegans") head(mdat) ``` If you have a SRA project accession number in mind already (eg: SRP009256) then we can see if the datasets are present. ```{r, filtermeta} mdat[which(mdat$SRP_accession %in% "SRP009256"),] ``` DEE2 data is centred around SRA run accessions numbers, these SRR_accessions can be obtained like this: ```{r, getSRRs} mdat1 <- mdat[which(mdat$SRP_accession %in% "SRP009256"),] SRRvec <- as.vector(mdat1$SRR_accession) SRRvec ``` ## Fetching DEE2 data using SRA run accession numbers The general syntax for obtaining DEE2 data is this: `getDEE2(species,SRRvec,metadata,outfile="NULL",counts="GeneCounts")` First, the function queries the metadata to make sure that the requested datasets are present. If metadata is not specified, then it will download a 'fresh' copy of the metadata. It then fetches the requested expression data and constructs a [SummarizedExperiment object](http://bioconductor.org/packages/SummarizedExperiment). The 'counts' parameter controls the type of counts provided: * `GeneCounts` STAR gene level counts (this is the default) * `TxCounts` Kallisto transcript level counts * `Tx2Gene` transcript counts aggregated (sum) to the gene level. If 'outfile' is defined, then files will be downloaded to the specified path. If it is not defined, then the files are downloaded to a temporary directory and deleted immediately after use. The SRR numbers need to exactly match those in SRA. Here is an example of using the SRR vector as defined above. ```{r, example1} suppressPackageStartupMessages(library("SummarizedExperiment")) x <- getDEE2("celegans",SRRvec,metadata=mdat,counts="GeneCounts") x # show sample level metadata colData(x)[1:7] # show the counts head(assays(x)$counts) ``` You can directly specify the SRR accessions in the command line, but be sure to type them correctly. In case SRR accessions are not present in the database, there will be a warning message. ```{r, testabsent} x <- getDEE2("celegans",c("SRR363798","SRR363799","SRR3581689","SRR3581692"), metadata=mdat,counts="GeneCounts") ``` In this case the accessions SRR3581689 and SRR3581692 are *A. thaliana* accessions and therefore not present in the *C. elegans* accession list. ## Downstream analysis DEE2 data are perfectly suitable for downstream analysis with [edgeR](https://bioconductor.org/packages/edgeR/), [DESeq2](https://bioconductor.org/packages/DESeq2/), and many other gene expression and pathway enrichment tools. For more information about working with SummarizedExperiment refer to the [rnaseqGene package](https://bioconductor.org/packages/rnaseqGene/) which describes a workflow for differential gene expression of SummarizedExperiment objects. ## Legacy function The function to obtain DEE2 in the legacy format is provided for completeness but is no longer recommended. It gives DEE2 data in the form of a list object with slots for gene counts, transcript counts, gene length, transcript length, quality control data, sample metadata summary, sample metadata (full) and any absent datasets. ```{r, legacy} x <- getDEE2("celegans",SRRvec,metadata=mdat,legacy=TRUE) names(x) head(x$GeneCounts) head(x$TxCounts) head(x$QcMx) head(x$GeneInfo) head(x$TxInfo) ``` ## Large project bundles The DEE2 webpage has processed many projects containing dozens to thousands of runs ([available here](http://dee2.io/huge)). These large project datasets are easiest to access with the "bundles" functionality described here. The three functions are: 1. `list_bundles` downloads a list of available bundles for a species 2. `query_bundles` checks whether a particular SRA project or GEO series accession number is available 3. `getDEE2_bundle` fetches the expression data for a particular accession and loads it as a SummarizedExperiment object In this first example, we search for a dataset with SRA project accession number SRP058781 and load the gene level counts. ```{r,bundle1} bundles <- list_bundles("athaliana") head(bundles) query_bundles(species="athaliana",query="SRP058781", col="SRP_accession",bundles=bundles) x <- getDEE2_bundle("athaliana", "SRP058781", col="SRP_accession",counts="GeneCounts") assays(x)$counts[1:6,1:4] ``` Similarly, it is possible to search with GEO series numbers, as in the next example. ```{r,bundle2} x <- getDEE2_bundle("drerio", "GSE106677", col="GSE_accession",counts="GeneCounts") assays(x)$counts[1:6,1:4] ``` ## Session Info ```{r,sessioninfo,message=FALSE} sessionInfo() ```