--- title: "CuratedAtlasQueryR" output: "html_vignette" params: demo_metadata: true vignette: > %\VignetteIndexEntry{CuratedAtlasQueryR} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} knit: > (function(x, ...){ proj_root <- rprojroot::find_package_root_file() |> normalizePath() rmarkdown::render( x, output_format = "html_document", params = list( demo_metadata = TRUE ) ) rmarkdown::render( x, output_file = "README.md", output_format = "github_document", output_dir = proj_root, knit_root_dir = proj_root, params = list( demo_metadata = FALSE ) ) }) --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) root_dir <- knitr::opts_knit$get("root.dir") if (!is.null(root_dir)){ # This hack fixes the relative image paths. # See https://github.com/rstudio/rmarkdown/issues/2473 knitr::opts_knit$set( output.dir = root_dir ) } proj_root <- rprojroot::find_package_root_file() |> normalizePath() # Utility function for figures to force them to have the correct path find_figure <- function(names){ rprojroot::find_package_root_file() |> file.path("man", "figures", names) } METADATA_URL = if (params$demo_metadata) CuratedAtlasQueryR::SAMPLE_DATABASE_URL else CuratedAtlasQueryR::DATABASE_URL ``` [![Lifecycle:maturing](https://img.shields.io/badge/lifecycle-maturing-blue.svg)](https://www.tidyverse.org/lifecycle/#maturing) `CuratedAtlasQuery` is a query interface that allow the programmatic exploration and retrieval of the harmonised, curated and reannotated CELLxGENE single-cell human cell atlas. Data can be retrieved at cell, sample, or dataset levels based on filtering criteria. Harmonised data is stored in the ARDC Nectar Research Cloud, and most `CuratedAtlasQuery` functions interact with Nectar via web requests, so a network connection is required for most functionality. ```{r, echo=FALSE, out.height = c("139px"), out.width = "120x" } find_figure("logo.png") |> knitr::include_graphics() ``` ```{r, echo=FALSE, out.height = c("58px"), out.width = c("155x", "129px", "202px", "219px", "180px")} c( "svcf_logo.jpeg", "czi_logo.png", "bioconductor_logo.jpg", "vca_logo.png", "nectar_logo.png" ) |> find_figure() |> knitr::include_graphics() ``` # Query interface ## Installation ```{r, eval=FALSE} devtools::install_github("stemangiola/CuratedAtlasQueryR") ``` ## Load the package ```{r, message=FALSE, warning=FALSE} library(CuratedAtlasQueryR) ``` ## Load and explore the metadata ### Load the metadata ```{r} # Note: in real applications you should use the default value of remote_url metadata <- get_metadata(remote_url = METADATA_URL) metadata ``` The `metadata` variable can then be re-used for all subsequent queries. ### Explore the tissue ```{r} metadata |> dplyr::distinct(tissue, file_id) ``` ## Download single-cell RNA sequencing counts ### Query raw counts ```{r} single_cell_counts = metadata |> dplyr::filter( ethnicity == "African" & stringr::str_like(assay, "%10x%") & tissue == "lung parenchyma" & stringr::str_like(cell_type, "%CD4%") ) |> get_single_cell_experiment() single_cell_counts ``` ### Query counts scaled per million This is helpful if just few genes are of interest, as they can be compared across samples. ```{r} single_cell_counts = metadata |> dplyr::filter( ethnicity == "African" & stringr::str_like(assay, "%10x%") & tissue == "lung parenchyma" & stringr::str_like(cell_type, "%CD4%") ) |> get_single_cell_experiment(assays = "cpm") single_cell_counts ``` ### Extract only a subset of genes ```{r} single_cell_counts = metadata |> dplyr::filter( ethnicity == "African" & stringr::str_like(assay, "%10x%") & tissue == "lung parenchyma" & stringr::str_like(cell_type, "%CD4%") ) |> get_single_cell_experiment(assays = "cpm", features = "PUM1") single_cell_counts ``` ### Extract the counts as a Seurat object This convert the H5 SingleCellExperiment to Seurat so it might take long time and occupy a lot of memory depending on how many cells you are requesting. ```{r} single_cell_counts_seurat = metadata |> dplyr::filter( ethnicity == "African" & stringr::str_like(assay, "%10x%") & tissue == "lung parenchyma" & stringr::str_like(cell_type, "%CD4%") ) |> get_seurat() single_cell_counts_seurat ``` ## Save your `SingleCellExperiment` The returned `SingleCellExperiment` can be saved with two modalities, as `.rds` or as `HDF5`. ### Saving as RDS (fast saving, slow reading) Saving as `.rds` has the advantage of being fast, andd the `.rds` file occupies very little disk space as it only stores the links to the files in your cache. However it has the disadvantage that for big `SingleCellExperiment` objects, which merge a lot of HDF5 from your `get_single_cell_experiment`, the display and manipulation is going to be slow. In addition, an `.rds` saved in this way is not portable: you will not be able to share it with other users. ```{r} single_cell_counts |> saveRDS("single_cell_counts.rds") ``` ### Saving as HDF5 (slow saving, fast reading) Saving as `.hdf5` executes any computation on the `SingleCellExperiment` and writes it to disk as a monolithic `HDF5`. Once this is done, operations on the `SingleCellExperiment` will be comparatively very fast. The resulting `.hdf5` file will also be totally portable and sharable. However this `.hdf5` has the disadvantage of being larger than the corresponding `.rds` as it includes a copy of the count information, and the saving process is going to be slow for large objects. ```{r} single_cell_counts |> HDF5Array::saveHDF5SummarizedExperiment("single_cell_counts", replace = TRUE) ``` ## Visualise gene transcription We can gather all CD14 monocytes cells and plot the distribution of HLA-A across all tissues ```{r, results='hide'} suppressPackageStartupMessages({ library(ggplot2) }) # Plots with styling counts <- metadata |> # Filter and subset dplyr::filter(cell_type_harmonised == "cd14 mono") |> dplyr::filter(file_id_db != "c5a05f23f9784a3be3bfa651198a48eb") |> # Get counts per million for HCA-A gene get_single_cell_experiment(assays = "cpm", features = "HLA-A") |> suppressMessages() |> # Add feature to table tidySingleCellExperiment::join_features("HLA-A", shape = "wide") |> # Rank x axis tibble::as_tibble() # Plot by disease counts |> dplyr::with_groups(disease, ~ .x |> dplyr::mutate(median_count = median(`HLA.A`, rm.na=TRUE))) |> # Plot ggplot(aes(forcats::fct_reorder(disease, median_count,.desc = TRUE), `HLA.A`,color = file_id)) + geom_jitter(shape=".") + # Style guides(color="none") + scale_y_log10() + theme_bw() + theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1)) + xlab("Disease") + ggtitle("HLA-A in CD14 monocytes by disease") ``` ```{r echo=FALSE} find_figure("HLA_A_disease_plot.png") |> knitr::include_graphics() ``` ```{r, results='hide'} # Plot by tissue counts |> dplyr::with_groups(tissue_harmonised, ~ .x |> dplyr::mutate(median_count = median(`HLA.A`, rm.na=TRUE))) |> # Plot ggplot(aes(forcats::fct_reorder(tissue_harmonised, median_count,.desc = TRUE), `HLA.A`,color = file_id)) + geom_jitter(shape=".") + # Style guides(color="none") + scale_y_log10() + theme_bw() + theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1)) + xlab("Tissue") + ggtitle("HLA-A in CD14 monocytes by tissue") + theme(legend.position = "none") ``` ```{r echo=FALSE} find_figure("HLA_A_tissue_plot.png") |> knitr::include_graphics() ``` ```{r} metadata |> # Filter and subset dplyr::filter(cell_type_harmonised=="nk") |> # Get counts per million for HCA-A gene get_single_cell_experiment(assays = "cpm", features = "HLA-A") |> suppressMessages() |> # Plot tidySingleCellExperiment::join_features("HLA-A", shape = "wide") |> ggplot(aes(tissue_harmonised, `HLA.A`, color = file_id)) + theme_bw() + theme( axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1), legend.position = "none" ) + geom_jitter(shape=".") + xlab("Tissue") + ggtitle("HLA-A in nk cells by tissue") ``` ## Obtain Unharmonised Metadata Various metadata fields are *not* common between datasets, so it does not make sense for these to live in the main metadata table. However, we can obtain it using the `get_unharmonised_metadata()` function. This function returns a data frame with one row per dataset, including the `unharmonised` column which contains unharmnised metadata as a nested data frame. ```{r} harmonised <- metadata |> dplyr::filter(tissue == "kidney blood vessel") unharmonised <- get_unharmonised_metadata(harmonised) unharmonised ``` Notice that the columns differ between each dataset's data frame: ```{r} dplyr::pull(unharmonised) |> head(2) ``` # Cell metadata Dataset-specific columns (definitions available at cellxgene.cziscience.com) `cell_count`, `collection_id`, `created_at.x`, `created_at.y`, `dataset_deployments`, `dataset_id`, `file_id`, `filename`, `filetype`, `is_primary_data.y`, `is_valid`, `linked_genesets`, `mean_genes_per_cell`, `name`, `published`, `published_at`, `revised_at`, `revision`, `s3_uri`, `schema_version`, `tombstone`, `updated_at.x`, `updated_at.y`, `user_submitted`, `x_normalization` Sample-specific columns (definitions available at cellxgene.cziscience.com) `sample_`, `sample_name`, `age_days`, `assay`, `assay_ontology_term_id`, `development_stage`, `development_stage_ontology_term_id`, `ethnicity`, `ethnicity_ontology_term_id`, `experiment___`, `organism`, `organism_ontology_term_id`, `sample_placeholder`, `sex`, `sex_ontology_term_id`, `tissue`, `tissue_harmonised`, `tissue_ontology_term_id`, `disease`, `disease_ontology_term_id`, `is_primary_data.x` Cell-specific columns (definitions available at cellxgene.cziscience.com) `cell_`, `cell_type`, `cell_type_ontology_term_idm`, `cell_type_harmonised`, `confidence_class`, `cell_annotation_azimuth_l2`, `cell_annotation_blueprint_singler` Through harmonisation and curation we introduced custom column, not present in the original CELLxGENE metadata - `tissue_harmonised`: a coarser tissue name for better filtering - `age_days`: the number of days corresponding to the age - `cell_type_harmonised`: the consensus call identity (for immune cells) using the original and three novel annotations using Seurat Azimuth and SingleR - `confidence_class`: an ordinal class of how confident `cell_type_harmonised` is. 1 is complete consensus, 2 is 3 out of four and so on. - `cell_annotation_azimuth_l2`: Azimuth cell annotation - `cell_annotation_blueprint_singler`: SingleR cell annotation using Blueprint reference - `cell_annotation_blueprint_monaco`: SingleR cell annotation using Monaco reference - `sample_id_db`: Sample subdivision for internal use - `file_id_db`: File subdivision for internal use - `sample_`: Sample ID - `.sample_name`: How samples were defined # RNA abundance The `raw` assay includes RNA abundance in the positive real scale (not transformed with non-linear functions, e.g. log sqrt). Originally CELLxGENE include a mix of scales and transformations specified in the `x_normalization` column. The `cpm` assay includes counts per million. # Session Info ```{r} sessionInfo() ```