--- title: "tidyFlowCore" author: - name: Timothy Keyes affiliation: - Stanford University School of Medicine email: tkeyes@stanford.edu output: BiocStyle::html_document: self_contained: yes toc: true toc_float: true toc_depth: 2 code_folding: show date: "`r doc_date()`" package: "`r pkg_ver('tidyFlowCore')`" vignette: > %\VignetteIndexEntry{tidyFlowCore} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", crop = NULL ## Related to https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html ) ``` ```{r vignetteSetup, echo=FALSE, message=FALSE, warning = FALSE} ## Track vignette build time startTime <- Sys.time() ## Bib setup library(RefManageR) ## Write bibliography information bib <- c( R = citation(), BiocStyle = citation("BiocStyle")[1], knitr = citation("knitr")[1], RefManageR = citation("RefManageR")[1], rmarkdown = citation("rmarkdown")[1], sessioninfo = citation("sessioninfo")[1], testthat = citation("testthat")[1], tidyverse = citation("dplyr")[1], HDCytoData = citation("HDCytoData")[1], tidyFlowCore = citation("tidyFlowCore")[1] ) ``` # Basics ## Installing `tidyFlowCore` `R` is an open-source statistical environment which can be easily modified to enhance its functionality via packages. `r Biocpkg("tidyFlowCore")` is an `R` package available via [Bioconductor](http://bioconductor.org), an open-source repository for R packages related to biostatistics and biocomputing. `R` can be installed on any operating system from [CRAN](https://cran.r-project.org/), after which you can install `r Biocpkg("tidyFlowCore")` by using the following commands in your `R` session: ```{r "install", eval = FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) { install.packages("BiocManager") } BiocManager::install("tidyFlowCore") ## Check that you have a valid Bioconductor installation BiocManager::valid() ``` ## Preliminaries `r Biocpkg("tidyFlowCore")` adopts the so-called "tidy" functional programming paradigm developed by Wickham et al. in the `tidyverse` ecosystem of R packages `r Citep(bib[["tidyverse"]])`. For information about the `tidyverse` ecosytem broadly, feel free to reference the (free) [R for Data Science](https://r4ds.hadley.nz/) book, the [tidyverse website](https://www.tidyverse.org/), or [this paper](https://www.biorxiv.org/content/10.1101/2023.09.10.557072v2) describing the larger `tidyomics` project. `r Biocpkg("tidyFlowCore")` integrates the `flowCore` Bioconductor package's data analysis capabilities with those of the `tidyverse`. If you're relatively unfamiliar with the Bioconductor project, you might be interested in [this blog post](http://lcolladotor.github.io/2014/10/16/startbioc/#.VkOKbq6rRuU). ## Asking for help Learning to use `R` and `Bioconductor` can be challenging, so it's important to know where to get help. The main place to ask questions about `tidyFlowCore` is the [Bioconductor support site](https://support.bioconductor.org/). Use the `tidyFlowCore` tag there and look at [previous posts](https://support.bioconductor.org/tag/tidyFlowCore/). You can also ask questions on GitHub or Twitter. But remember, if you're asking for help, follow the [posting guidelines](http://www.bioconductor.org/help/support/posting-guide/). Make sure to include a simple example that reproduces your issue (a ["reprex"](https://reprex.tidyverse.org/articles/learn-reprex.html)) and your session info to help developers understand and solve your problem. ## Citing `tidyFlowCore` If you use `r Biocpkg("tidyFlowCore")` for your research, please use the following citation. ```{r "citation"} citation("tidyFlowCore") ``` # `tidyFlowCore` quick start `tidyFlowCore` allows you to treat `flowCore` data structures like tidy `data.frame`s or `tibbles`. It does so by implementing dplyr, tidyr, and ggplot2 verbs that can be deployed directly on the `flowFrame` and `flowSet` S4 classes. In this section, we give a brief example of how `tidyFlowCore` can enable a data analysis pipeline to use all the useful functions of the `flowCore` package and many of the functions of the `dplyr`, `tidyr`, and `ggplot2` packages. ## Load required packages ```{r, warning = FALSE} library(tidyFlowCore) library(flowCore) ``` ## Read data For our example here, we download some publicly available mass cytometry (CyTOF) data downloadable through the `r Citep(bib[["HDCytoData"]])` package. These data are made available as a `flowCore::flowSet` S4 object, `Bioconductor`'s standard data structure for cytometry data. ```{r example, eval = requireNamespace('tidyFlowCore')} # read data from the HDCytoData package bcr_flowset <- HDCytoData::Bodenmiller_BCR_XL_flowSet() ``` To read more about this dataset, run the following command: ```{r, eval = FALSE} ?HDCytoData::Bodenmiller_BCR_XL_flowSet ``` ## Data transformation The `flowCore` package natively supports multiple types of data preprocessing and transformations for cytometry data through the use of its `tranform` class. For example, if we want to apply the standard arcsinh transformation often used for CyTOF data to our current dataset, we could use the following code: ```{r} asinh_transformation <- flowCore::arcsinhTransform(a = 0, b = 1/5, c = 0) transformation_list <- flowCore::transformList( colnames(bcr_flowset), asinh_transformation ) transformed_bcr_flowset <- flowCore::transform(bcr_flowset, transformation_list) ``` Alternatively, we can also use the `tidyverse`'s functional programming paradigm to perform the same transformation. For this, we use the `mutate`-`across` framework via `tidyFlowCore`: ```{r} transformed_bcr_flowset <- bcr_flowset |> dplyr::mutate(across(-ends_with("_id"), \(.x) asinh(.x / 5))) ``` ## Cell type counting Suppose we're interested in counting the number of cells that belong to each cell type (encoded in the `population_id` column of `bcr_flowset`) in our dataset. Using standard `flowCore` functions, we could perform this calculation in a few steps: ```{r} # extract all expression matrices from our flowSet combined_matrix <- flowCore::fsApply(bcr_flowset, exprs) # take out the concatenated population_id column combined_population_id <- combined_matrix[, 'population_id'] # perform the calculation table(combined_population_id) ``` `tidyFlowCore` allows us to perform the same operation simply using the `dplyr` package's `count` function, with output provided in the convenient form of a `tibble`: ```{r} bcr_flowset |> dplyr::count(population_id) ``` `tidyFlowCore` also makes it easy to perform the counting after grouping by other variables in our metadata. For example, we can see that `bcr_flowset` contains data from multiple .FCS files, each of which is associated with a file name. ```{r} flowCore::pData(object = bcr_flowset) ``` `tidyFlowCore` makes it easy to perform grouped tidy operations, like counting, using information in our `flowSet`'s metadata: ```{r} bcr_flowset |> # use the .tidyFlowCore_identifier pronoun to access the name of # each experiment in the flowSet dplyr::count(.tidyFlowCore_identifier, population_id) ``` ## Nesting and unnesting `flowFrame` and `flowSet` data objects have a clear relationship with one another in the `flowCore` application programming interface (API). Essentially, `flowSet`s are nested `flowFrame`s - or, in other words, `flowSet`s are made up of multiple `flowFrame`s! `tidyFlowCore` provides a useful API for converting between `flowSet` and `flowFrame` data structures at various degrees of nesting using the `group`/`nest` and `ungroup`/`unnest` verbs. Note that in the `dplyr` and `tidyr` APIs, `group`/`nest` and `ungroup`/`unnest` are **not** synonyms (grouped `data.frames` are different from nested `data.frames`). However, because of how `flowFrame`s and `flowSet`s are structured, `tidyFlowCore`'s `group`/`nest` and `ungroup`/`unnest` functions have identical behavior, respectively. ```{r} # unnesting a flowSet results in a flowFrame with an additional column, # 'tidyFlowCore_name` that identifies cells based on which experiment in the # original flowSet they come from bcr_flowset |> dplyr::ungroup() ``` ```{r} # flowSets can be unnested and re-nested for various analyses bcr_flowset |> dplyr::ungroup() |> # group_by cell type dplyr::group_by(population_id) |> # calculate the mean HLA-DR expression of each cell population dplyr::summarize(mean_hladr_expression = mean(`HLA-DR(Yb174)Dd`)) |> dplyr::select(population_id, mean_hladr_expression) ``` ## Plotting `tidyFlowCore` also provides a direct interface between `ggplot2` and `flowFrame` or `flowSet` data objects. For example... ```{r} # cell population names, from the HDCytoData documentation population_names <- c( "B-cells IgM-", "B-cells IgM+", "CD4 T-cells", "CD8 T-cells", "DC", "monocytes", "NK cells", "surface-" ) # calculate mean CD20 expression across all cells mean_cd20_expression <- bcr_flowset |> dplyr::ungroup() |> dplyr::summarize(mean_expression = mean(asinh(`CD20(Sm147)Dd` / 5))) |> dplyr::pull(mean_expression) # calculate mean CD4 expression across all cells mean_cd4_expression <- bcr_flowset |> dplyr::ungroup() |> dplyr::summarize(mean_expression = mean(asinh(`CD4(Nd145)Dd` / 5))) |> dplyr::pull(mean_expression) bcr_flowset |> # preprocess all columns that represent protein measurements dplyr::mutate(dplyr::across(-ends_with("_id"), \(.x) asinh(.x / 5))) |> # plot a CD4 vs. CD45 scatterplot ggplot2::ggplot(ggplot2::aes(x = `CD20(Sm147)Dd`, y = `CD4(Nd145)Dd`)) + # add some reference lines ggplot2::geom_hline( yintercept = mean_cd4_expression, color = "red", linetype = "dashed" ) + ggplot2::geom_vline( xintercept = mean_cd20_expression, color = "red", linetype = "dashed" ) + ggplot2::geom_point(size = 0.1, alpha = 0.1) + # facet by cell population ggplot2::facet_wrap( facets = ggplot2::vars(population_id), labeller = ggplot2::as_labeller( \(population_id) population_names[as.numeric(population_id)] ) ) + # axis labels ggplot2::labs( x = "CD20 expression (arcsinh)", y = "CD4 expression (arcsinh)" ) ``` Using some standard functions from the `ggplot2` library, we can create a scatterplot of CD4 vs. CD20 expression in the different cell populations included in the `bcr_flowset` `flowSet`. We can see, unsurprisingly, that both B-cell populations are highest for CD20 expression, whereas CD4+ T-helper cells are highest for CD4 expression. # Reproducibility The `r Biocpkg("tidyFlowCore")` package `r Citep(bib[["tidyFlowCore"]])` was made possible thanks to the following: * R `r Citep(bib[["R"]])` * `r Biocpkg("BiocStyle")` `r Citep(bib[["BiocStyle"]])` * `r CRANpkg("knitr")` `r Citep(bib[["knitr"]])` * `r CRANpkg("RefManageR")` `r Citep(bib[["RefManageR"]])` * `r CRANpkg("rmarkdown")` `r Citep(bib[["rmarkdown"]])` * `r CRANpkg("sessioninfo")` `r Citep(bib[["sessioninfo"]])` * `r CRANpkg("testthat")` `r Citep(bib[["testthat"]])` * `r CRANpkg("tidyverse")` `r Citep(bib[["tidyverse"]])` This package was developed using `r BiocStyle::Biocpkg("biocthis")`. Code for creating the vignette ```{r createVignette, eval=FALSE} ## Create the vignette library("rmarkdown") system.time(render("tidyFlowCore.Rmd", "BiocStyle::html_document")) ## Extract the R code library("knitr") knit("tidyFlowCore.Rmd", tangle = TRUE) ``` Date the vignette was generated. ```{r reproduce1, echo=FALSE} ## Date the vignette was generated Sys.time() ``` Wallclock time spent generating the vignette. ```{r reproduce2, echo=FALSE} ## Processing time in seconds totalTime <- diff(c(startTime, Sys.time())) round(totalTime, digits = 3) ``` `R` session information. ```{r reproduce3, echo=FALSE} ## Session info library("sessioninfo") options(width = 120) session_info() ``` # Bibliography This vignette was generated using `r Biocpkg("BiocStyle")` `r Citep(bib[["BiocStyle"]])` with `r CRANpkg("knitr")` `r Citep(bib[["knitr"]])` and `r CRANpkg("rmarkdown")` `r Citep(bib[["rmarkdown"]])` running behind the scenes. Citations made with `r CRANpkg("RefManageR")` `r Citep(bib[["RefManageR"]])`. ```{r vignetteBiblio, results = "asis", echo = FALSE, warning = FALSE, message = FALSE} ## Print bibliography PrintBibliography(bib, .opts = list(hyperlink = "to.doc", style = "html")) ```