---
title: "HDCytoData data package"
author: 
  - name: Lukas M. Weber
    affiliation: 
      - &id1 "Institute of Molecular Life Sciences, University of Zurich, Zurich, Switzerland"
      - &id2 "SIB Swiss Institute of Bioinformatics, Zurich, Switzerland"
  - name: Charlotte Soneson
    affiliation: 
      - &id3 "Friedrich Miescher Institute for Biomedical Research, Basel, Switzerland"
      - &id4 "SIB Swiss Institute of Bioinformatics, Basel, Switzerland"
package: HDCytoData
output: 
  BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{HDCytoData data package}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```


# Overview

The `HDCytoData` data package contains a set of publicly available high-dimensional flow cytometry and mass cytometry (CyTOF) datasets, provided as `SummarizedExperiment` and `flowSet` Bioconductor objects. The data objects are hosted on the Bioconductor ExperimentHub web resource.

The objects contain the cell-level expression values, as well as row and column metadata, including sample IDs, group IDs, true cell population labels or cluster labels (where available), channel names, protein marker names, and protein marker classes (cell type or cell state).

These datasets have been used for benchmarking purposes in our previous work and publications, e.g. to benchmark clustering algorithms or methods for differential analysis. They are provided here in the `SummarizedExperiment` and `flowSet` formats to make them easier to access.


# Datasets

The package contains the following datasets, which can be grouped into datasets useful for benchmarking either (i) clustering algorithms or (ii) methods for differential analysis.

- Clustering:
    - Levine_32dim
    - Levine_13dim
    - Samusik_01
    - Samusik_all
    - Nilsson_rare
    - Mosmann_rare

- Differential analysis:
    - Krieg_Anti_PD_1
    - Bodenmiller_BCR_XL

Additional details on each dataset are included in its help file: a description of the dataset (biological context, number of samples, number of cells, number of manually gated cell populations, number and classes of protein markers, etc.), an explanation of the object structures, and references and raw data sources.

The help files can be accessed by the dataset names, e.g. `?Bodenmiller_BCR_XL`.


# How to load data

This section shows how to load the datasets, using one of the datasets (`Bodenmiller_BCR_XL`) as an example.

The datasets can be loaded either with named functions referring directly to the object names, or by using the `ExperimentHub` interface. Both methods are demonstrated below.

See the help files (e.g. `?Bodenmiller_BCR_XL`) for details about the structure of the `SummarizedExperiment` or `flowSet` objects.

Load the datasets using named functions:

```{r}
suppressPackageStartupMessages(library(HDCytoData))

# Load 'SummarizedExperiment' object using named function
Bodenmiller_BCR_XL_SE()

# Load 'flowSet' object using named function
Bodenmiller_BCR_XL_flowSet()
```


Alternatively, load the datasets using the `ExperimentHub` interface:

```{r}
# Create an ExperimentHub instance
ehub <- ExperimentHub()

# Query ExperimentHub instance to find datasets
query(ehub, "HDCytoData")

# Load 'SummarizedExperiment' object using its ExperimentHub ID
ehub[["EH2254"]]

# Load 'flowSet' object using its ExperimentHub ID
ehub[["EH2255"]]
```


# Using the data

Once the datasets have been loaded from ExperimentHub, they can be used as normal within an R session. For example, using the `SummarizedExperiment` form of the dataset loaded above:

```{r}
# Load dataset in 'SummarizedExperiment' format
d_SE <- Bodenmiller_BCR_XL_SE()

# Inspect the object
d_SE
assay(d_SE)[1:6, 1:6]
rowData(d_SE)
colData(d_SE)
```
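
The sketch below illustrates how the row and column metadata can be combined with the expression values. It builds a small toy `SummarizedExperiment` rather than using a downloaded dataset, and it assumes cells in rows and markers in columns. The column names `sample_id`, `group_id`, `marker_name`, and `marker_class` are illustrative assumptions here; see the dataset help files (e.g. `?Bodenmiller_BCR_XL`) for the actual structure of each object.

```{r}
suppressPackageStartupMessages(library(SummarizedExperiment))

# Toy expression matrix (simulated values): 6 cells (rows) x 2 markers (columns)
set.seed(123)
expr <- matrix(
  rexp(12, rate = 1 / 20),
  nrow = 6,
  dimnames = list(NULL, c("marker_A", "marker_B"))
)

# Row metadata: one entry per cell (hypothetical column names)
row_data <- DataFrame(
  sample_id = factor(rep(c("sample1", "sample2"), each = 3)),
  group_id = factor(rep(c("group1", "group2"), each = 3))
)

# Column metadata: one entry per marker (hypothetical column names)
col_data <- DataFrame(
  marker_name = c("marker_A", "marker_B"),
  marker_class = factor(c("type", "state")),
  row.names = c("marker_A", "marker_B")
)

toy_SE <- SummarizedExperiment(
  assays = list(exprs = expr),
  rowData = row_data,
  colData = col_data
)

# Number of cells per sample, from the row metadata
cells_per_sample <- table(rowData(toy_SE)$sample_id)
cells_per_sample

# Expression values for cell type markers only, from the column metadata
is_type <- colData(toy_SE)$marker_class == "type"
type_expr <- assay(toy_SE)[, is_type, drop = FALSE]
dim(type_expr)
```

The same accessors (`assay()`, `rowData()`, `colData()`) apply to the objects loaded from the package, provided the metadata column names are adjusted to match those documented in the help files.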


# Transformation of raw data

Note that flow and mass cytometry data should be transformed prior to performing any downstream analyses, such as clustering. A standard choice is the inverse hyperbolic sine (`asinh`) transform, applied as `asinh(x / cofactor)`, with `cofactor` equal to 5 for mass cytometry (CyTOF) data or 150 for flow cytometry data (see Bendall et al. 2011, Supplementary Figure S2).
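
The `asinh` transform described above can be sketched in base R as follows. This is a minimal illustration on simulated intensities, not on one of the package datasets; the matrix `raw` and its marker names are hypothetical.

```{r}
# Simulated raw intensities: 4 cells (rows) x 3 markers (columns)
set.seed(42)
raw <- matrix(
  rexp(12, rate = 1 / 50),
  nrow = 4,
  dimnames = list(NULL, c("marker_A", "marker_B", "marker_C"))
)

# Cofactor: 5 for mass cytometry (CyTOF) data, 150 for flow cytometry data
cofactor <- 5

# asinh transform: divide by the cofactor, then apply the inverse hyperbolic sine
transformed <- asinh(raw / cofactor)

# The transform is invertible, so the raw values can be recovered with sinh
recovered <- sinh(transformed) * cofactor
all.equal(recovered, raw)
```

For a real dataset, the same expression would typically be applied only to the columns of the expression matrix that contain protein markers, since other columns (for example time or event length, where present) are not expression values.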


# Exploring the data

Interactive visualizations to explore the datasets can be generated from the `SummarizedExperiment` objects using the [iSEE](http://bioconductor.org/packages/iSEE) ("Interactive SummarizedExperiment Explorer") package, available from Bioconductor (Soneson, Lun, Marini, and Rue-Albrecht, 2018). The `iSEE` package provides a Shiny-based graphical user interface to explore single-cell datasets stored in the `SummarizedExperiment` format. For more details, see the `iSEE` package vignettes.