---
title: "CuratedAtlasQueryR"
output: "html_vignette"
params:
demo_metadata: true
vignette: >
%\VignetteIndexEntry{CuratedAtlasQueryR}
%\VignetteEncoding{UTF-8}
%\VignetteEngine{knitr::rmarkdown}
knit: >
(function(x, ...){
proj_root <- rprojroot::find_package_root_file() |> normalizePath()
rmarkdown::render(
x,
output_format = "html_document",
params = list(
demo_metadata = TRUE
)
)
rmarkdown::render(
x,
output_file = "README.md",
output_format = "github_document",
output_dir = proj_root,
knit_root_dir = proj_root,
params = list(
demo_metadata = FALSE
)
)
})
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
root_dir <- knitr::opts_knit$get("root.dir")
if (!is.null(root_dir)){
# This hack fixes the relative image paths.
# See https://github.com/rstudio/rmarkdown/issues/2473
knitr::opts_knit$set(
output.dir = root_dir
)
}
proj_root <- rprojroot::find_package_root_file() |> normalizePath()
# Utility function for figures to force them to have the correct path
find_figure <- function(names){
rprojroot::find_package_root_file() |>
file.path("man", "figures", names)
}
METADATA_URL = if (params$demo_metadata)
CuratedAtlasQueryR::SAMPLE_DATABASE_URL else
CuratedAtlasQueryR::DATABASE_URL
```
[![Lifecycle:maturing](https://img.shields.io/badge/lifecycle-maturing-blue.svg)](https://www.tidyverse.org/lifecycle/#maturing)
`CuratedAtlasQuery` is a query interface that allow the programmatic exploration and retrieval of the harmonised, curated and reannotated CELLxGENE single-cell human cell atlas. Data can be retrieved at cell, sample, or dataset levels based on filtering criteria.
Harmonised data is stored in the ARDC Nectar Research Cloud, and most
`CuratedAtlasQuery` functions interact with Nectar via web requests, so a
network connection is required for most functionality.
```{r, echo=FALSE, out.height = c("139px"), out.width = "120x" }
find_figure("logo.png") |> knitr::include_graphics()
```
```{r, echo=FALSE, out.height = c("58px"), out.width = c("155x", "129px", "202px", "219px", "180px")}
c(
"svcf_logo.jpeg",
"czi_logo.png",
"bioconductor_logo.jpg",
"vca_logo.png",
"nectar_logo.png"
) |>
find_figure() |>
knitr::include_graphics()
```
# Query interface
## Installation
```{r, eval=FALSE}
devtools::install_github("stemangiola/CuratedAtlasQueryR")
```
## Load the package
```{r, message=FALSE, warning=FALSE}
library(CuratedAtlasQueryR)
```
## Load and explore the metadata
### Load the metadata
```{r}
# Note: in real applications you should use the default value of remote_url
metadata <- get_metadata(remote_url = METADATA_URL)
metadata
```
The `metadata` variable can then be re-used for all subsequent queries.
### Explore the tissue
```{r}
metadata |>
dplyr::distinct(tissue, file_id)
```
## Download single-cell RNA sequencing counts
### Query raw counts
```{r}
single_cell_counts =
metadata |>
dplyr::filter(
ethnicity == "African" &
stringr::str_like(assay, "%10x%") &
tissue == "lung parenchyma" &
stringr::str_like(cell_type, "%CD4%")
) |>
get_single_cell_experiment()
single_cell_counts
```
### Query counts scaled per million
This is helpful if just few genes are of interest, as they can be compared across samples.
```{r}
single_cell_counts =
metadata |>
dplyr::filter(
ethnicity == "African" &
stringr::str_like(assay, "%10x%") &
tissue == "lung parenchyma" &
stringr::str_like(cell_type, "%CD4%")
) |>
get_single_cell_experiment(assays = "cpm")
single_cell_counts
```
### Extract only a subset of genes
```{r}
single_cell_counts =
metadata |>
dplyr::filter(
ethnicity == "African" &
stringr::str_like(assay, "%10x%") &
tissue == "lung parenchyma" &
stringr::str_like(cell_type, "%CD4%")
) |>
get_single_cell_experiment(assays = "cpm", features = "PUM1")
single_cell_counts
```
### Extract the counts as a Seurat object
This convert the H5 SingleCellExperiment to Seurat so it might take long time and occupy a lot of memory depending on how many cells you are requesting.
```{r}
single_cell_counts_seurat =
metadata |>
dplyr::filter(
ethnicity == "African" &
stringr::str_like(assay, "%10x%") &
tissue == "lung parenchyma" &
stringr::str_like(cell_type, "%CD4%")
) |>
get_seurat()
single_cell_counts_seurat
```
## Save your `SingleCellExperiment`
The returned `SingleCellExperiment` can be saved with two modalities, as `.rds` or as `HDF5`.
### Saving as RDS (fast saving, slow reading)
Saving as `.rds` has the advantage of being fast, andd the `.rds` file occupies very little disk space as it only stores the links to the files in your cache.
However it has the disadvantage that for big `SingleCellExperiment` objects, which merge a lot of HDF5 from your `get_single_cell_experiment`, the display and manipulation is going to be slow.
In addition, an `.rds` saved in this way is not portable: you will not be able
to share it with other users.
```{r}
single_cell_counts |> saveRDS("single_cell_counts.rds")
```
### Saving as HDF5 (slow saving, fast reading)
Saving as `.hdf5` executes any computation on the `SingleCellExperiment` and writes it to disk as a monolithic `HDF5`.
Once this is done, operations on the `SingleCellExperiment` will be comparatively very fast.
The resulting `.hdf5` file will also be totally portable and sharable.
However this `.hdf5` has the disadvantage of being larger than the corresponding `.rds` as it includes a copy of the count information, and the saving process is going to be slow for large objects.
```{r}
single_cell_counts |> HDF5Array::saveHDF5SummarizedExperiment("single_cell_counts", replace = TRUE)
```
## Visualise gene transcription
We can gather all CD14 monocytes cells and plot the distribution of HLA-A across all tissues
```{r, results='hide'}
suppressPackageStartupMessages({
library(ggplot2)
})
# Plots with styling
counts <- metadata |>
# Filter and subset
dplyr::filter(cell_type_harmonised == "cd14 mono") |>
dplyr::filter(file_id_db != "c5a05f23f9784a3be3bfa651198a48eb") |>
# Get counts per million for HCA-A gene
get_single_cell_experiment(assays = "cpm", features = "HLA-A") |>
suppressMessages() |>
# Add feature to table
tidySingleCellExperiment::join_features("HLA-A", shape = "wide") |>
# Rank x axis
tibble::as_tibble()
# Plot by disease
counts |>
dplyr::with_groups(disease, ~ .x |> dplyr::mutate(median_count = median(`HLA.A`, rm.na=TRUE))) |>
# Plot
ggplot(aes(forcats::fct_reorder(disease, median_count,.desc = TRUE), `HLA.A`,color = file_id)) +
geom_jitter(shape=".") +
# Style
guides(color="none") +
scale_y_log10() +
theme_bw() +
theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1)) +
xlab("Disease") +
ggtitle("HLA-A in CD14 monocytes by disease")
```
```{r echo=FALSE}
find_figure("HLA_A_disease_plot.png") |> knitr::include_graphics()
```
```{r, results='hide'}
# Plot by tissue
counts |>
dplyr::with_groups(tissue_harmonised, ~ .x |> dplyr::mutate(median_count = median(`HLA.A`, rm.na=TRUE))) |>
# Plot
ggplot(aes(forcats::fct_reorder(tissue_harmonised, median_count,.desc = TRUE), `HLA.A`,color = file_id)) +
geom_jitter(shape=".") +
# Style
guides(color="none") +
scale_y_log10() +
theme_bw() +
theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1)) +
xlab("Tissue") +
ggtitle("HLA-A in CD14 monocytes by tissue") +
theme(legend.position = "none")
```
```{r echo=FALSE}
find_figure("HLA_A_tissue_plot.png") |> knitr::include_graphics()
```
```{r}
metadata |>
# Filter and subset
dplyr::filter(cell_type_harmonised=="nk") |>
# Get counts per million for HCA-A gene
get_single_cell_experiment(assays = "cpm", features = "HLA-A") |>
suppressMessages() |>
# Plot
tidySingleCellExperiment::join_features("HLA-A", shape = "wide") |>
ggplot(aes(tissue_harmonised, `HLA.A`, color = file_id)) +
theme_bw() +
theme(
axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1),
legend.position = "none"
) +
geom_jitter(shape=".") +
xlab("Tissue") +
ggtitle("HLA-A in nk cells by tissue")
```
## Obtain Unharmonised Metadata
Various metadata fields are *not* common between datasets, so it does not
make sense for these to live in the main metadata table. However, we can
obtain it using the `get_unharmonised_metadata()` function. This function
returns a data frame with one row per dataset, including the `unharmonised`
column which contains unharmnised metadata as a nested data frame.
```{r}
harmonised <- metadata |> dplyr::filter(tissue == "kidney blood vessel")
unharmonised <- get_unharmonised_metadata(harmonised)
unharmonised
```
Notice that the columns differ between each dataset's data frame:
```{r}
dplyr::pull(unharmonised) |> head(2)
```
# Cell metadata
Dataset-specific columns (definitions available at cellxgene.cziscience.com)
`cell_count`, `collection_id`, `created_at.x`, `created_at.y`, `dataset_deployments`, `dataset_id`, `file_id`, `filename`, `filetype`, `is_primary_data.y`, `is_valid`, `linked_genesets`, `mean_genes_per_cell`, `name`, `published`, `published_at`, `revised_at`, `revision`, `s3_uri`, `schema_version`, `tombstone`, `updated_at.x`, `updated_at.y`, `user_submitted`, `x_normalization`
Sample-specific columns (definitions available at cellxgene.cziscience.com)
`sample_`, `sample_name`, `age_days`, `assay`, `assay_ontology_term_id`, `development_stage`, `development_stage_ontology_term_id`, `ethnicity`, `ethnicity_ontology_term_id`, `experiment___`, `organism`, `organism_ontology_term_id`, `sample_placeholder`, `sex`, `sex_ontology_term_id`, `tissue`, `tissue_harmonised`, `tissue_ontology_term_id`, `disease`, `disease_ontology_term_id`, `is_primary_data.x`
Cell-specific columns (definitions available at cellxgene.cziscience.com)
`cell_`, `cell_type`, `cell_type_ontology_term_idm`, `cell_type_harmonised`, `confidence_class`, `cell_annotation_azimuth_l2`, `cell_annotation_blueprint_singler`
Through harmonisation and curation we introduced custom column, not present in the original CELLxGENE metadata
- `tissue_harmonised`: a coarser tissue name for better filtering
- `age_days`: the number of days corresponding to the age
- `cell_type_harmonised`: the consensus call identity (for immune cells) using the original and three novel annotations using Seurat Azimuth and SingleR
- `confidence_class`: an ordinal class of how confident `cell_type_harmonised` is. 1 is complete consensus, 2 is 3 out of four and so on.
- `cell_annotation_azimuth_l2`: Azimuth cell annotation
- `cell_annotation_blueprint_singler`: SingleR cell annotation using Blueprint reference
- `cell_annotation_blueprint_monaco`: SingleR cell annotation using Monaco reference
- `sample_id_db`: Sample subdivision for internal use
- `file_id_db`: File subdivision for internal use
- `sample_`: Sample ID
- `.sample_name`: How samples were defined
# RNA abundance
The `raw` assay includes RNA abundance in the positive real scale (not transformed with non-linear functions, e.g. log sqrt). Originally CELLxGENE include a mix of scales and transformations specified in the `x_normalization` column.
The `cpm` assay includes counts per million.
# Session Info
```{r}
sessionInfo()
```