---
title: "getDEE2: Programmatic access to the DEE2 RNA expression dataset"
author: "Mark Ziemann & Antony Kaspi"
date: "`r Sys.Date()`"
output: 
    rmarkdown::html_vignette:
        toc: true
vignette: >
    %\VignetteIndexEntry{getDEE2}
    %\VignetteEngine{knitr::rmarkdown}
    \usepackage[utf8]{inputenc}
---

## Background

Digital Expression Explorer 2 (or DEE2 for short) is a repository of processed
RNA-seq data in the form of counts. It was designed so that researchers could
undertake re-analysis and meta-analysis of published RNA-seq studies quickly 
and easily. As of April 2020, over 1 million SRA runs have been processed.

For further information about the resource, refer to the
[journal article](https://doi.org/10.1093/gigascience/giz022) and
[project homepage](http://dee2.io).

This package provides an interface to access these expression data
programmatically. 

## Getting started

```{r, install, eval = FALSE}
if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("getDEE2")
```

```{r, lib}
library("getDEE2")
```

## Searching for datasets of interest starting with accession numbers

The first step is to download the list of accession numbers of available
datasets with the `getDEE2Metadata` function, specifying a species name.
options for species currently are:

* athaliana
* celegans
* dmelanogaster
* drerio
* ecoli
* hsapiens
* mmusculus
* rnorvegicus
* scerevisiae

If the species name is incorrect, an error will be thrown.

```{r, getmeta}
mdat <- getDEE2Metadata("celegans")
head(mdat)
```

If you have a SRA project accession number in mind already (eg: SRP009256)
then we can see if the datasets are present. 

```{r, filtermeta}
mdat[which(mdat$SRP_accession %in% "SRP009256"),]
```

DEE2 data is centred around SRA run accessions numbers, these SRR_accessions
can be obtained like this:

```{r, getSRRs}
mdat1 <- mdat[which(mdat$SRP_accession %in% "SRP009256"),]
SRRvec <- as.vector(mdat1$SRR_accession)
SRRvec
```

## Fetching DEE2 data using SRA run accession numbers

The general syntax for obtaining DEE2 data is this:

`getDEE2(species,SRRvec,metadata,outfile="NULL",counts="GeneCounts")`

First, the function queries the metadata to make sure that the requested
datasets are present. If metadata is not specified, then it will download
a 'fresh' copy of the metadata. It then fetches the requested expression
data and constructs a 
[SummarizedExperiment object](http://bioconductor.org/packages/SummarizedExperiment).
The 'counts' parameter controls the type of counts provided:

* `GeneCounts` STAR gene level counts (this is the default)

* `TxCounts` Kallisto transcript level counts

* `Tx2Gene` transcript counts aggregated (sum) to the gene level.

If 'outfile' is defined, then files will be downloaded to the specified path.
If it is not defined, then the files are downloaded to a temporary directory 
and deleted immediately after use.

The SRR numbers need to exactly match those in SRA.

Here is an example of using the SRR vector as defined above. 

```{r, example1}
suppressPackageStartupMessages(library("SummarizedExperiment"))
x <- getDEE2("celegans",SRRvec,metadata=mdat,counts="GeneCounts")
x
# show sample level metadata
colData(x)[1:7]
# show the counts
head(assays(x)$counts)
```

You can directly specify the SRR accessions in the command line, but be sure to
type them correctly. In case SRR accessions are not present in the database,
there will be a warning message.

```{r, testabsent}
x <- getDEE2("celegans",c("SRR363798","SRR363799","SRR3581689","SRR3581692"),
    metadata=mdat,counts="GeneCounts")
```

In this case the accessions SRR3581689 and SRR3581692 are *A. thaliana*
accessions and therefore not present in the *C. elegans* accession list. 

## Downstream analysis

DEE2 data are perfectly suitable for downstream analysis with 
[edgeR](https://bioconductor.org/packages/edgeR/), 
[DESeq2](https://bioconductor.org/packages/DESeq2/),
and many other gene expression and pathway enrichment tools.
For more information about working with SummarizedExperiment refer to the
[rnaseqGene package](https://bioconductor.org/packages/rnaseqGene/) which
describes a workflow for differential gene expression of SummarizedExperiment
objects.

## Legacy function

The function to obtain DEE2 in the legacy format is provided for completeness 
but is no longer recommended. It gives DEE2 data in the form of a list object
with slots for gene counts, transcript counts, gene length, transcript length,
quality control data, sample metadata summary, sample metadata (full) and any
absent datasets.

```{r, legacy}
x <- getDEE2("celegans",SRRvec,metadata=mdat,legacy=TRUE)
names(x)
head(x$GeneCounts)
head(x$TxCounts)
head(x$QcMx)
head(x$GeneInfo)
head(x$TxInfo)
```

## Large project bundles

The DEE2 webpage has processed many projects containing dozens to thousands of
runs ([available here](http://dee2.io/huge)). These large project datasets are
easiest to access with the "bundles" functionality described here. The three 
functions are:

1. `list_bundles` downloads a list of available bundles for a species

2. `query_bundles` checks whether a particular SRA project or GEO series
accession number is available

3. `getDEE2_bundle` fetches the expression data for a particular accession
and loads it as a SummarizedExperiment object

In this first example, we search for a dataset with SRA project 
accession number SRP058781 and load the gene level counts.

```{r,bundle1}
bundles <- list_bundles("athaliana")
head(bundles)
query_bundles(species="athaliana",query="SRP058781",
    col="SRP_accession",bundles=bundles)
x <- getDEE2_bundle("athaliana", "SRP058781",
    col="SRP_accession",counts="GeneCounts")
    assays(x)$counts[1:6,1:4]
```

Similarly, it is possible to search with GEO series numbers, as in the next
example.  

```{r,bundle2}
x <- getDEE2_bundle("drerio", "GSE106677",
    col="GSE_accession",counts="GeneCounts")
    assays(x)$counts[1:6,1:4]
```

## Session Info
```{r,sessioninfo,message=FALSE}
sessionInfo()
```