--- title: "Accessing Human Cell Atlas Data" author: "Maya Reed McDaniel" date: "March 4th, 2021" output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{Accessing Human Cell Atlas Data} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE ) ``` # Motivation & Introduction The purpose of this package is to make it easy to query the [Human Cell Atlas Data Portal](https://www.humancellatlas.org/data-portal/) via their data browser [API](https://data.humancellatlas.org/apis/api-documentation/data-browser-api). Visit the [Human Cell Atlas](https://data.humancellatlas.org/) for more information on the project. ## Installation and getting started Evaluate the following code chunk to install packages required for this vignette. ```{r install, eval = FALSE} ## install from Bioconductor if you haven't already pkgs <- c("httr", "dplyr", "LoomExperiment", "hca") pkgs_needed <- pkgs[!pkgs %in% rownames(installed.packages())] BiocManager::install(pkgs_needed) ``` Load the packages into your _R_ session. ```{r setup, message = FALSE} library(httr) library(dplyr) library(LoomExperiment) library(hca) ``` # Example: Discover and download a 'loom' file To illustrate use of this package, consider the task of downloading a 'loom' file summarizing single-cell gene expression observed in an HCA research project. This could be accomplished by visiting the HCA data portal (at https://data.humancellatlas.org/explore) in a web browser and selecting projects interactively, but it is valuable to accomplish the same goal in a reproducible, flexible, programmatic way. We will (1) discover projects available in the HCA Data Coordinating Center that have loom files; and (2) retrieve the file from the HCA and import the data into _R_ as a 'LoomExperiment' object. For illustration, we focus on the 'Single cell transcriptome analysis of human pancreas reveals transcriptional signatures of aging and somatic mutation patterns' project. ## Discover projects with loom files Use `projects()` to retrieve all projects in the HCA's default catalog. ```{r} projects() ``` Use `filters()` to restrict the projects to just those that contain at least one 'loom' file. ```{r} project_filter <- filters(fileFormat = list(is = "loom")) project_tibble <- projects(project_filter) project_tibble ``` Use standard _R_ commands to further filter projects to the one we are interested in, with title starting with "Single...". Extract the unique `projectId` for this project. ```{r} project_tibble %>% filter(startsWith(projectTitle, "Single")) %>% t() projectIds <- project_tibble %>% filter(startsWith(projectTitle, "Single")) %>% dplyr::pull(projectId) projectId <- projectIds[1] ``` ## Discover and download the loom file of interest `files()` retrieves (the first 1000) files from the Human Cell Atlas data portal. Construct a filter to restrict the files to loom files from the project we are interested in. ```{r} file_filter <- filters( projectId = list(is = projectId), fileFormat = list(is = "loom") ) # only the two smallest files file_tibble <- files(file_filter, size = 2, sort = "fileSize", order = "asc") file_tibble ``` `files_download()` will download one or more files (one for each row) in `file_tibble`. The download is more complicated than simply following the `url` column of `file_tibble`, so it is not possible to simply copy the url into a browser. We'll download the file and then immediately import it into _R_. ```{r} file_locations <- file_tibble %>% files_download() LoomExperiment::import(unname(file_locations[1]), type ="SingleCellLoomExperiment") ``` Note that `files_download()` uses [BiocFileCache][https://bioconductor.org/packages/BiocFileCache], so individual files are only downloaded once. # Example: Illustrating access to `h5ad` files This example walks through the process of file discovery and retrieval in a little more detail, using `h5ad` files created by the Python AnnData analysis software and available for some experiments in the default catalog. ## Projects facets and terms The first challenge is to understand what file formats are available from the HCA. Obtain a tibble describing the 'facets' of the data, the number of terms used in each facet, and the number of distinct values used to describe projects. ```{r} projects_facets() ``` Note the `fileFormat` facet, and repeat `projects_facets()` to discover detail about available file formats ```{r} projects_facets("fileFormat") ``` Note that there are 8 uses of the `h5ad` file format. Use this as a filter to discover relevant projects. ```{r} filters <- filters(fileFormat = list(is = "h5ad")) projects(filters) ``` ## Projects columns The default tibble produced by `projects()` contains only some of the information available; the information is much richer. ### `projects()` as an _R_ `list` Instead of retrieving the result of `projects()` as a tibble, retrieve it as a 'list-of-lists' ```{r} projects_list <- projects(as = "list") ``` This is a complicated structure. We will use `lengths()`, `names()`, and standard _R_ list selection operations to navigate this a bit. At the top level there are three elements. ```{r} lengths(projects_list) ``` `hits` represents each project as a list, e.g,. ```{r} lengths(projects_list$hits[[1]]) ``` shows that there are 10 different ways in which the first project is described. Each component is itself a list-of-lists, e.g., ```{r} lengths(projects_list$hits[[1]]$projects[[1]]) projects_list$hits[[1]]$projects[[1]]$projectTitle ``` One can use standard _R_ commands to navigate this data structure, and to, e.g., extract the `projectTitle` of each project. ### `projects()` as an `lol` Use `as = "lol"` to create a more convenient way to select, filter and extract elements from the list-of-lists by `projects()`. ```{r} lol <- projects(as = "lol") lol ``` Use `lol_select()` to restrict the `lol` to particular paths, and `lol_filter()` to filter results to paths that are leafs, or with specific numbers of entries. ```{r} lol_select(lol, "hits[*].projects[*]") lol_select(lol, "hits[*].projects[*]") |> lol_filter(n == 44, is_leaf) ``` `lol_pull()` extracts a path from the `lol` as a vector; `lol_lpull()` extracts paths as lists. ```{r} titles <- lol_pull(lol, "hits[*].projects[*].projectTitle") length(titles) head(titles, 2) ``` ### Creating `projects()` tibbles with specific columns The path or its abbreviation can be used to specify the columns of the tibble to be returned by the `projects()` query. Here we retrieve additional details of donor count and total cells by adding appropriate path abbreviations to a named character vector. Names on the character vector can be used to rename the path more concisely, but the paths must uniquely identify elements in the list-of-lists. ```{r} columns <- c( projectId = "hits[*].entryId", projectTitle = "hits[*].projects[*].projectTitle", genusSpecies = "hits[*].donorOrganisms[*].genusSpecies[*]", donorCount = "hits[*].donorOrganisms[*].donorCount", cellSuspensions.organ = "hits[*].cellSuspensions[*].organ[*]", totalCells = "hits[*].cellSuspensions[*].totalCells" ) projects <- projects(filters, columns = columns) projects ``` Note that the `cellSuspensions.organ` and `totalCells` columns have more than one entry per project. ```{r} projects |> select(projectId, cellSuspensions.organ, totalCells) ``` In this case, the mapping between `cellSuspensions.organ` and `totalCells` is clear, but in general more refined navigation of the `lol` structure may be necessary. ```{r} projects |> select(projectId, cellSuspensions.organ, totalCells) |> filter(lengths(totalCells) > 0) |> tidyr::unnest(c("cellSuspensions.organ", "totalCells")) ``` Select the following entry, augment the filter, and query available files ```{r} projects %>% filter(startsWith(projectTitle, "Reconstruct")) %>% t() ``` This approach can be used to customize the tibbles returned by the other main functions in the package, `files()`, `samples()`, and `bundles()`. ## File download The relevant file can be selected and downloaded using the technique in the first example. ```{r} filters <- filters( projectId = list(is = "f83165c5-e2ea-4d15-a5cf-33f3550bffde"), fileFormat = list(is = "h5ad") ) files <- files(filters) %>% head(1) # only first file, for demonstration files %>% t() ``` ```{r, eval = FALSE} file_path <- files_download(files) ``` `"h5ad"` files can be read as SingleCellExperiment objects using the [zellkonverter][] package. ```{r, eval = FALSE} ## don't want large amount of data read from disk sce <- zellkonverter::readH5AD(file_path, use_hdf5 = TRUE) sce ``` [zellkonverter]: https://bioconductor.org/packages/zellkonverter # Example: A multiple file download ```{r} project_filter <- filters(fileFormat = list(is = "csv")) project_tibble <- projects(project_filter) project_tibble %>% filter( startsWith( projectTitle, "Reconstructing the human first trimester" ) ) projectId <- project_tibble %>% filter( startsWith( projectTitle, "Reconstructing the human first trimester" ) ) %>% pull(projectId) file_filter <- filters( projectId = list(is = projectId), fileFormat = list(is = "csv") ) ## first 4 files will be returned file_tibble <- files(file_filter, size = 4) file_tibble %>% files_download() ``` # Example: Exploring the pagination feature The `files()`, `bundles()`, and `samples()` can all return many 1000's of results. It is necessary to 'page' through these to see all of them. We illustrate pagination with `projects()`, retrieving only 30 projects. Pagination works for the default `tibble` output ```{r} page_1_tbl <- projects(size = 30) page_1_tbl page_2_tbl <- page_1_tbl %>% hca_next() page_2_tbl ## should be identical to page_1_tbl page_2_tbl %>% hca_prev() ``` Pagination also works for the `lol` objects ```{r} page_1_lol <- projects(size = 5, as = "lol") page_1_lol %>% lol_pull("hits[*].projects[*].projectTitle") page_2_lol <- page_1_lol %>% hca_next() page_2_lol %>% lol_pull("hits[*].projects[*].projectTitle") ## should be identical to page_1_lol page_2_lol %>% hca_prev() %>% lol_pull("hits[*].projects[*].projectTitle") ``` # Example: Obtaining other data entities Much like `projects()` and `files()`, `samples()` and `bundles()` allow you to provide a `filter` object and additional criteria to retrieve data in the form of samples and bundles respectively ```{r} heart_filters <- filters(organ = list(is = "heart")) heart_samples <- samples(filters = heart_filters, size = 4) heart_samples heart_bundles <- bundles(filters = heart_filters, size = 4) heart_bundles ``` # Example: Obtaining summaries of project catalogs HCA experiments are organized into catalogs, each of which can be summarized with the `summary()` function ```{r} heart_filters <- filters(organ = list(is = "heart")) summary(filters = heart_filters, type = "fileTypeSummaries") first_catalog <- catalogs()[1] summary(type = "overview", catalog = first_catalog) ``` # Example: Obtaining details on individual projects, files, samples, and bundles Each project, file, sample, and bundles has its own unique ID by which, in conjunction with its catalog, can be to uniquely identify them. ```{r} heart_filters <- filters(organ = list(is = "heart")) heart_projects <- projects(filters = heart_filters, size = 4) heart_projects projectId <- heart_projects %>% filter( startsWith( projectTitle, "Cells of the adult human" ) ) %>% dplyr::pull(projectId) projects_detail(uuid = projectId) ``` # Session info ```{r sessionInfo} sessionInfo() ```