---
title: "Overview of the scRNAseq dataset collection"
author: 
- name: Davide Risso
  affiliation: Division of Biostatistics and Epidemiology, Weill Cornell Medicine
- name: Aaron Lun
  email: infinite.monkeys.with.keyboards@gmail.com
date: "Created: May 25, 2016; Compiled: `r Sys.Date()`"
output:
  BiocStyle::html_document:
    toc: true
package: scRNAseq
vignette: >
  %\VignetteIndexEntry{User's Guide}
  %\VignetteEngine{knitr::rmarkdown}
bibliography: "`r system.file('scripts', 'ref.bib', package='scRNAseq')`"
---

```{r style, echo=FALSE}
knitr::opts_chunk$set(error=FALSE, warning=FALSE, message=FALSE)
```

# Introduction

The `r Biocpkg("scRNAseq")` package provides convenient access to several publicly available data sets 
in the form of `SingleCellExperiment` objects.
The focus of this package is to capture datasets that are not easily read into R with a one-liner from, e.g., `read.csv()`.
Instead, we do the necessary data munging so that users only need to call a single function to obtain a well-formed `SingleCellExperiment`.
For example:

```{r}
library(scRNAseq)
fluidigm <- ReprocessedFluidigmData()
fluidigm
```

Readers are referred to the `r Biocpkg("SummarizedExperiment")` and `r Biocpkg("SingleCellExperiment")` documentation 
for further information on how to work with `SingleCellExperiment` objects.
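
As a brief orientation, the sketch below builds a small toy `SingleCellExperiment` and applies a few common accessors to it.
The toy object is not one of the datasets in this package; the gene names, cell names and `batch` column are made up for illustration.
The same accessors should apply to objects such as `fluidigm`.

```{r}
library(SingleCellExperiment)

# Toy count matrix: 3 hypothetical genes (rows) by 4 hypothetical cells (columns).
toy.counts <- matrix(
    c(0L, 3L, 1L,
      5L, 0L, 2L,
      1L, 1L, 0L,
      0L, 4L, 7L),
    nrow=3,
    dimnames=list(c("GeneA", "GeneB", "GeneC"), paste0("Cell", 1:4))
)

# Per-cell annotation; 'batch' is an invented column.
toy.coldata <- DataFrame(
    batch=c("a", "a", "b", "b"),
    row.names=colnames(toy.counts)
)

toy <- SingleCellExperiment(
    assays=list(counts=toy.counts),
    colData=toy.coldata
)

dim(toy)                  # number of genes, number of cells
assayNames(toy)           # names of the stored assays
counts(toy)[, 1:2]        # count matrix for the first two cells
colData(toy)              # per-cell annotation
toy[, toy$batch == "b"]   # subset to the cells in batch "b"
```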

# Available data sets

The `listDatasets()` function returns all available datasets in `r Biocpkg("scRNAseq")`,
along with some summary statistics and the necessary R command to load them.

```{r}
out <- listDatasets()
```

```{r, echo=FALSE}
out <- as.data.frame(out)
out$Taxonomy <- c(`10090`="Mouse", `9606`="Human", `8355`="Xenopus")[as.character(out$Taxonomy)]
out$Call <- sprintf("`%s`", out$Call)
knitr::kable(out)
```
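
The table can also be filtered programmatically.
The sketch below keeps only the human datasets, relying on the `Taxonomy` and `Call` columns used above and on the NCBI taxonomy identifier 9606 for human; the variable names are illustrative.

```{r}
library(scRNAseq)

all.sets <- listDatasets()

# Keep rows whose NCBI taxonomy identifier is 9606 (human).
# which() drops any rows where the taxonomy is missing.
human.sets <- all.sets[which(all.sets$Taxonomy == 9606), ]

nrow(human.sets)
head(human.sets$Call)
```

Each entry of `Call` is the R command that loads the corresponding dataset.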

If the original dataset was not provided with Ensembl annotation, we can map the identifiers with `ensembl=TRUE`.
Any genes without a corresponding Ensembl identifier are discarded from the dataset.

```{r}
sce <- ZeiselBrainData(ensembl=TRUE)
head(rownames(sce))
```
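
Conceptually, the effect on the row names resembles the following base R sketch.
This is not the package's implementation: the lookup table is a made-up stand-in for a real annotation resource, and the gene symbols and identifiers are placeholders.

```{r}
# Hypothetical gene symbols, as they might appear in the original dataset.
toy.symbols <- c("GeneA", "GeneB", "GeneC", "GeneD")

# Hypothetical symbol-to-Ensembl lookup; GeneB and GeneD have no match.
toy.lookup <- c(GeneA="ENS_PLACEHOLDER_1", GeneC="ENS_PLACEHOLDER_3")

# Toy expression matrix with one row per symbol.
toy.mat <- matrix(1:8, nrow=4,
    dimnames=list(toy.symbols, c("Cell1", "Cell2")))

# Map each symbol; unmatched symbols become NA.
toy.ids <- unname(toy.lookup[toy.symbols])

# Discard rows without an identifier and rename the rest.
keep <- !is.na(toy.ids)
toy.mapped <- toy.mat[keep, , drop=FALSE]
rownames(toy.mapped) <- toy.ids[keep]

toy.mapped
```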

Functions also accept a `location` argument; setting `location=TRUE` loads the gene coordinates into the row ranges of the object.

```{r}
sce <- ZeiselBrainData(ensembl=TRUE, location=TRUE)
head(rowRanges(sce))
```

# Adding new data sets

Please contact us if you have a data set that you would like to see added to this package.
The only requirement is that your data set has publicly available expression values (ideally counts) and sample annotation.
The more difficult/custom the format, the better, 
as its inclusion in this package will provide more value for other users in the R/Bioconductor community.

If you have already written code that processes your desired data set into a `SingleCellExperiment`-like form,
we would welcome a pull request [here](https://github.com/LTLA/scRNAseq).
The process can be expedited by ensuring that you have the following files:

- `inst/scripts/make-X-Y-data.Rmd`, an R Markdown report that creates all components of a `SingleCellExperiment`.
`X` should be the last name of the first author of the relevant study while `Y` should be the name of the biological system.
- `inst/scripts/make-X-Y-metadata.R`, an R script that creates a metadata CSV file at `inst/extdata/metadata-X-Y.csv`.
Metadata files should follow the format described in the `r Biocpkg("ExperimentHub")` documentation.
- `R/XYData.R`, an R source file that defines a function `XYData()` to download the components from ExperimentHub
and create a `SingleCellExperiment` object.
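
To make the naming scheme concrete, the sketch below derives the expected paths from an author name and a biological system.
`contributionFiles()` is a hypothetical helper written for this vignette and is not part of the package.
Lowercasing `X` and `Y` in the file names is an assumption, so check the existing scripts for the actual convention.

```{r}
# Hypothetical helper: list the files expected for a new dataset.
contributionFiles <- function(author, system) {
    # Assumption: file names use lowercase X and Y.
    x <- tolower(author)
    y <- tolower(system)

    # The loader function capitalizes each part, as in ZeiselBrainData().
    capitalize <- function(s) {
        paste0(toupper(substring(s, 1, 1)), tolower(substring(s, 2)))
    }
    fun <- paste0(capitalize(author), capitalize(system), "Data")

    c(
        report=sprintf("inst/scripts/make-%s-%s-data.Rmd", x, y),
        metadata.script=sprintf("inst/scripts/make-%s-%s-metadata.R", x, y),
        metadata.csv=sprintf("inst/extdata/metadata-%s-%s.csv", x, y),
        source=sprintf("R/%s.R", fun),
        loader=sprintf("%s()", fun)
    )
}

contributionFiles("Zeisel", "brain")
```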

Potential contributors are encouraged to examine some of the existing scripts in the package to pick up the coding conventions.
Remember, we're more likely to accept a contribution if it's indistinguishable from something we might have written ourselves!

# References