---
title: "Accessing Human Cell Atlas Data"
author: "Maya Reed McDaniel"
date: "March 4th, 2021"
output: BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{Accessing Human Cell Atlas Data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE
)
```
# Motivation & Introduction

The purpose of this package is to make it easy to query the [Human
Cell Atlas Data Portal](https://www.humancellatlas.org/data-portal/)
via their data browser
[API](https://data.humancellatlas.org/apis/api-documentation/data-browser-api).
Visit the [Human Cell Atlas](https://data.humancellatlas.org/) for
more information on the project.

## Installation and getting started

Evaluate the following code chunk to install packages required for
this vignette.

```{r install, eval = FALSE}
## install from Bioconductor if you haven't already
pkgs <- c("httr", "dplyr", "LoomExperiment", "hca")
pkgs_needed <- pkgs[!pkgs %in% rownames(installed.packages())]
BiocManager::install(pkgs_needed)
```

Load the packages into your _R_ session.

```{r setup, message = FALSE}
library(httr)
library(dplyr)
library(LoomExperiment)
library(hca)
```

# Example: Discover and download a 'loom' file

To illustrate use of this package, consider the task of downloading a
'loom' file summarizing single-cell gene expression observed in an HCA
research project. This could be accomplished by visiting the HCA data
portal (at https://data.humancellatlas.org/explore) in a web browser
and selecting projects interactively, but it is valuable to accomplish
the same goal in a reproducible, flexible, programmatic way.  We will
(1) discover projects available in the HCA Data Coordinating Center
that have loom files; and (2) retrieve the file from the HCA and
import the data into _R_ as a 'LoomExperiment' object. For
illustration, we focus on the 'Single cell transcriptome analysis of
human pancreas reveals transcriptional signatures of aging and somatic
mutation patterns' project.

## Discover projects with loom files

Use `projects()` to retrieve all projects in the HCA's default catalog.

```{r}
projects()
```

Use `filters()` to restrict the projects to just those that contain at
least one 'loom' file.

```{r}
project_filter <- filters(fileFormat = list(is = "loom"))
project_tibble <- projects(project_filter)
project_tibble
```

Use standard _R_ commands to further filter projects to the one we are
interested in, with title starting with "Single...".
Extract the unique `projectId` for this project.

```{r}
project_tibble %>%
    filter(startsWith(projectTitle, "Single")) %>%
    t()

projectIds <-
    project_tibble %>%
    filter(startsWith(projectTitle, "Single")) %>%
    dplyr::pull(projectId)

projectId <- projectIds[1]
```
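
The filter-then-pull pattern above can be tried on a small, made-up
tibble without querying the HCA. The sketch below is illustrative only:
`toy_projects` and its contents are hypothetical and merely mimic the
`projectId` and `projectTitle` columns. It shows why the first element
is selected explicitly, since a title prefix may match more than one
project.

```{r}
library(dplyr)

## hypothetical stand-in for the tibble returned by projects()
toy_projects <- tibble(
    projectId = c("id-001", "id-002", "id-003"),
    projectTitle = c(
        "Single cell analysis of tissue A",
        "Bulk profiling of tissue B",
        "Single nucleus survey of tissue C"
    )
)

## a prefix can match several rows, so pull() may return several ids
toy_ids <-
    toy_projects %>%
    filter(startsWith(projectTitle, "Single")) %>%
    pull(projectId)
toy_ids

## keep only the first match, as in the chunk above
toy_id <- toy_ids[1]
toy_id
```

A longer, more specific prefix reduces the chance of matching several
projects.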

## Discover and download the loom file of interest

`files()` retrieves (the first 1000) files from the Human Cell Atlas
data portal. Construct a filter to restrict the files to loom files
from the project we are interested in.

```{r}
file_filter <- filters(
    projectId = list(is = projectId),
    fileFormat = list(is = "loom")
)

# only the two smallest files
file_tibble <- files(file_filter, size = 2, sort = "fileSize", order = "asc")

file_tibble
```

`files_download()` downloads one file for each row of `file_tibble`.
The download involves more than following the `url` column of
`file_tibble`, so copying the url into a browser does not work. We'll
download the files and then immediately import the first one into _R_.

```{r}
file_locations <- file_tibble %>% files_download()

LoomExperiment::import(unname(file_locations[1]),
                       type = "SingleCellLoomExperiment")
```

Note that `files_download()` uses [BiocFileCache](https://bioconductor.org/packages/BiocFileCache),
so individual files are only downloaded once.
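
The 'download once' idea can be sketched in base _R_ as follows. This
is a simplified illustration under stated assumptions, not how
BiocFileCache or `files_download()` is implemented: the helper
`toy_cached_copy()` is hypothetical, and a local file copy stands in
for the network download.

```{r}
## hypothetical helper: copy `src` into `cache_dir` only if it is not
## already there
toy_cached_copy <- function(src, cache_dir) {
    dest <- file.path(cache_dir, basename(src))
    copied <- !file.exists(dest)
    if (copied)
        file.copy(src, dest)
    ## report where the cached copy lives and whether any work was done
    list(path = dest, copied = copied)
}

## set up a throw-away 'remote' file and an empty cache directory
toy_src <- tempfile(fileext = ".txt")
writeLines("example content", toy_src)
toy_cache <- tempfile("toy_cache_")
dir.create(toy_cache)

toy_first <- toy_cached_copy(toy_src, toy_cache)
toy_second <- toy_cached_copy(toy_src, toy_cache)
toy_first$copied     # TRUE: the file was copied on the first request
toy_second$copied    # FALSE: the second request reused the cached copy
```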

# Example: Illustrating access to `h5ad` files

This example walks through the process of file discovery and retrieval
in a little more detail, using `h5ad` files created by the Python
AnnData analysis software and available for some experiments in the
default catalog.

## Projects facets and terms

The first challenge is to understand what file formats are available
from the HCA. Obtain a tibble describing the 'facets' of the data, the
number of terms used in each facet, and the number of distinct values
used to describe projects.

```{r}
projects_facets()
```

Note the `fileFormat` facet, and repeat `projects_facets()` to
discover details about the available file formats.

```{r}
projects_facets("fileFormat")
```

Note the number of uses of the `h5ad` file format (8 when this was
written; the count may differ as the catalog is updated). Use this
format as a filter to discover relevant projects.

```{r}
filters <- filters(fileFormat = list(is = "h5ad"))
projects(filters)
```

## Projects columns

The default tibble produced by `projects()` contains only some of the
information available; the full response is much richer.

### `projects()` as an _R_ `list`

Instead of retrieving the result of `projects()` as a tibble, retrieve
it as a 'list-of-lists'.

```{r}
projects_list <- projects(as = "list")
```

This is a complicated structure. We will use `lengths()`, `names()`,
and standard _R_ list selection operations to navigate this a bit. At
the top level there are three elements.

```{r}
lengths(projects_list)
```

`hits` represents each project as a list, e.g.,

```{r}
lengths(projects_list$hits[[1]])
```

shows that there are 10 different ways in which the first project is
described. Each component is itself a list-of-lists, e.g.,

```{r}
lengths(projects_list$hits[[1]]$projects[[1]])
projects_list$hits[[1]]$projects[[1]]$projectTitle
```

One can use standard _R_ commands to navigate this data structure, and
to, e.g., extract the `projectTitle` of each project.
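
As a minimal sketch of that kind of navigation, the chunk below builds
a tiny, hypothetical list-of-lists (`toy_response`) that mimics only
the `hits` / `projects` / `projectTitle` nesting seen above, and then
extracts each title with `vapply()`. The real response has many more
elements, and a project could in principle have more than one
`projects` entry; this sketch assumes exactly one.

```{r}
## hypothetical list-of-lists mimicking hits -> projects -> projectTitle
toy_response <- list(
    hits = list(
        list(projects = list(list(projectTitle = "Project A"))),
        list(projects = list(list(projectTitle = "Project B")))
    )
)

## visit each hit and extract the title of its first 'projects' entry
toy_titles <- vapply(
    toy_response$hits,
    function(hit) hit$projects[[1]]$projectTitle,
    character(1)
)
toy_titles
```

`vapply()` with `character(1)` fails early if any hit lacks a single
character title, which helps catch unexpected structure.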

### `projects()` as an `lol`

Use `as = "lol"` to create a more convenient way to select, filter and
extract elements from the list-of-lists returned by `projects()`.

```{r}
lol <- projects(as = "lol")
lol
```

Use `lol_select()` to restrict the `lol` to particular paths, and
`lol_filter()` to filter results to paths that are leaves, or that
have specific numbers of entries.

```{r}
lol_select(lol, "hits[*].projects[*]")
lol_select(lol, "hits[*].projects[*]") |>
    lol_filter(n == 44, is_leaf)
```

`lol_pull()` extracts a path from the `lol` as a vector; `lol_lpull()`
extracts paths as lists.

```{r}
titles <- lol_pull(lol, "hits[*].projects[*].projectTitle")
length(titles)
head(titles, 2)
```

### Creating `projects()` tibbles with specific columns

The path or its abbreviation can be used to specify the columns of
the tibble to be returned by the `projects()` query.

Here we retrieve additional details of donor count and total cells by
adding appropriate path abbreviations to a named character
vector. Names on the character vector can be used to rename the path
more concisely, but the paths must uniquely identify elements in the
list-of-lists.

```{r}
columns <- c(
    projectId = "hits[*].entryId",
    projectTitle = "hits[*].projects[*].projectTitle",
    genusSpecies = "hits[*].donorOrganisms[*].genusSpecies[*]",
    donorCount = "hits[*].donorOrganisms[*].donorCount",
    cellSuspensions.organ = "hits[*].cellSuspensions[*].organ[*]",
    totalCells = "hits[*].cellSuspensions[*].totalCells"
)
projects <- projects(filters, columns = columns)
projects
```

Note that the `cellSuspensions.organ` and `totalCells` columns have more than
one entry per project.

```{r}
projects |>
   select(projectId, cellSuspensions.organ, totalCells)
```

In this case, the mapping between `cellSuspensions.organ` and `totalCells`
is clear, but in general more refined navigation of the `lol` structure may be
necessary.

```{r}
projects |>
    select(projectId, cellSuspensions.organ, totalCells) |>
    filter(lengths(totalCells) > 0) |>
    tidyr::unnest(c("cellSuspensions.organ", "totalCells"))
```
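
The unnesting step can be examined in isolation on a small,
hypothetical tibble. `toy_tbl` and its values are invented for
illustration; only the shape (two list-columns whose elements pair up
within each row) mirrors the result above. The sketch relies on the two
list-columns having matching lengths within each row, which is an
assumption that need not hold for every project.

```{r}
library(dplyr)

## hypothetical tibble with paired list-columns; the third row is empty
toy_tbl <- tibble(
    projectId = c("id-001", "id-002", "id-003"),
    organ = list(c("heart", "lung"), "kidney", character(0)),
    totalCells = list(c(100L, 200L), 300L, integer(0))
)

## drop rows without cell counts, then expand both columns in parallel
toy_long <-
    toy_tbl %>%
    filter(lengths(totalCells) > 0) %>%
    tidyr::unnest(c("organ", "totalCells"))
toy_long
```

Each element of `organ` is paired with the element of `totalCells` at
the same position, so `id-001` contributes two rows and `id-002` one.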

Select the following entry, augment the filter, and query available files.

```{r}
projects %>%
    filter(startsWith(projectTitle, "Reconstruct")) %>%
    t()
```

This approach can be used to customize the tibbles returned by the
other main functions in the package, `files()`, `samples()`, and
`bundles()`.

## File download

The relevant file can be selected and downloaded using the technique
in the first example.

```{r}
filters <- filters(
    projectId = list(is = "f83165c5-e2ea-4d15-a5cf-33f3550bffde"),
    fileFormat = list(is = "h5ad")
)
files <-
    files(filters) %>%
    head(1)            # only first file, for demonstration
files %>% t()
```

```{r, eval = FALSE}
file_path <- files_download(files)
```

`"h5ad"` files can be read as SingleCellExperiment objects using the
[zellkonverter][] package.

```{r, eval = FALSE}
## use_hdf5 = TRUE avoids reading a large amount of data from disk into memory
sce <- zellkonverter::readH5AD(file_path, use_hdf5 = TRUE)
sce
```

[zellkonverter]: https://bioconductor.org/packages/zellkonverter

# Example: A multiple file download

```{r}
project_filter <- filters(fileFormat = list(is = "csv"))
project_tibble <- projects(project_filter)

project_tibble %>%
    filter(
        startsWith(
            projectTitle,
            "Reconstructing the human first trimester"
        )
    )

projectId <-
    project_tibble %>%
    filter(
        startsWith(
            projectTitle,
            "Reconstructing the human first trimester"
        )
    ) %>%
    pull(projectId)

file_filter <- filters(
    projectId = list(is = projectId),
    fileFormat = list(is = "csv")
)

## first 4 files will be returned
file_tibble <- files(file_filter, size = 4)

file_tibble %>%
    files_download()
```

# Example: Exploring the pagination feature

`files()`, `bundles()`, and `samples()` can each return many thousands
of results. It is necessary to 'page' through these to see all of
them. We illustrate pagination with `projects()`, retrieving only 30 projects.

Pagination works for the default `tibble` output:

```{r}
page_1_tbl <- projects(size = 30)
page_1_tbl

page_2_tbl <- page_1_tbl %>% hca_next()
page_2_tbl

## should be identical to page_1_tbl
page_2_tbl %>% hca_prev()
```

Pagination also works for `lol` objects:

```{r}
page_1_lol <- projects(size = 5, as = "lol")
page_1_lol %>%
    lol_pull("hits[*].projects[*].projectTitle")

page_2_lol <-
    page_1_lol %>%
    hca_next()
page_2_lol %>%
    lol_pull("hits[*].projects[*].projectTitle")

## should be identical to page_1_lol
page_2_lol %>%
    hca_prev() %>%
    lol_pull("hits[*].projects[*].projectTitle")
```
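
The general idea of paging forward until nothing is left can be
sketched with plain _R_ vectors. Everything here is hypothetical:
`toy_page()` is an invented helper that slices a local vector, and the
'empty page means done' convention is an assumption of this toy, not a
description of how `hca_next()` signals the end of the results.

```{r}
## hypothetical helper: return page `page` of `x`, `size` items per page
toy_page <- function(x, page, size) {
    start <- (page - 1L) * size + 1L
    if (start > length(x))
        return(x[0])                    # past the end: empty page
    end <- min(start + size - 1L, length(x))
    x[start:end]
}

toy_items <- paste0("project-", 1:12)

## walk forward, accumulating results, until an empty page is returned
toy_seen <- character()
toy_page_number <- 1L
repeat {
    toy_current <- toy_page(toy_items, toy_page_number, size = 5L)
    if (length(toy_current) == 0L)
        break
    toy_seen <- c(toy_seen, toy_current)
    toy_page_number <- toy_page_number + 1L
}

## all items were visited exactly once, in order
identical(toy_seen, toy_items)
```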

# Example: Obtaining other data entities

Much like `projects()` and `files()`, `samples()` and `bundles()` allow you to
provide a `filters` object and additional criteria to retrieve data in the
form of samples and bundles, respectively.

```{r}
heart_filters <- filters(organ = list(is = "heart"))
heart_samples <- samples(filters = heart_filters, size = 4)
heart_samples

heart_bundles <- bundles(filters = heart_filters, size = 4)
heart_bundles
```

# Example: Obtaining summaries of project catalogs

HCA experiments are organized into catalogs, each of which can be summarized
with the `summary()` function.

```{r}
heart_filters <- filters(organ = list(is = "heart"))
summary(filters = heart_filters, type = "fileTypeSummaries")
first_catalog <- catalogs()[1]
summary(type = "overview", catalog = first_catalog)
```

# Example: Obtaining details on individual projects, files, samples, and bundles

Each project, file, sample, and bundle has a unique ID which, in
conjunction with its catalog, uniquely identifies it.

```{r}
heart_filters <- filters(organ = list(is = "heart"))
heart_projects <- projects(filters = heart_filters, size = 4)
heart_projects

projectId <-
    heart_projects %>%
    filter(
        startsWith(
            projectTitle,
            "Cells of the adult human"
        )
    ) %>%
    dplyr::pull(projectId)

projects_detail(uuid = projectId)
```


# Session info

```{r sessionInfo}
sessionInfo()
```