--- title: "Working with aggregate functions" author: - name: Giulia Pais affiliation: | San Raffaele Telethon Institute for Gene Therapy - SR-Tiget, Via Olgettina 60, 20132 Milano - Italia email: giuliapais1@gmail.com, calabria.andrea@hsr.it output: BiocStyle::html_document: self_contained: yes toc: true toc_float: true toc_depth: 2 code_folding: show date: "`r doc_date()`" package: "`r pkg_ver('ISAnalytics')`" vignette: > %\VignetteIndexEntry{Working with aggregate functions} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", crop = NULL ## Related to ## https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html ) ``` ```{r vignetteSetup, echo=FALSE, message=FALSE, warning = FALSE} ## Track time spent on making the vignette startTime <- Sys.time() ## Bib setup library("knitcitations") ## Load knitcitations with a clean bibliography cleanbib() cite_options(hyperlink = "to.doc", citation_format = "text", style = "html") ## Write bibliography information bib <- c( R = citation(), BiocStyle = citation("BiocStyle")[1], knitcitations = citation("knitcitations")[1], knitr = citation("knitr")[1], rmarkdown = citation("rmarkdown")[1], sessioninfo = citation("sessioninfo")[1], testthat = citation("testthat")[1], ISAnalytics = citation("ISAnalytics")[1] ) write.bibtex(bib, file = "aggregate_function_usage.bib") ``` # Introduction In this vignette we're going to explain in detail how to use functions of the aggregate family, namely: 1. `aggregate_metadata` 2. `aggregate_values_by_key` ## How to install ISAnalytics To install the package run the following code: ```{r installBioc, eval=FALSE} ## For release version if (!requireNamespace("BiocManager", quietly = TRUE)) { install.packages("BiocManager") } BiocManager::install("ISAnalytics") ## For devel version if (!requireNamespace("BiocManager", quietly = TRUE)) { install.packages("BiocManager") } # The following initializes usage of Bioc devel BiocManager::install(version = "devel") BiocManager::install("ISAnalytics") ``` To install from GitHub: ```{r installGitHub, eval=FALSE} # For release version if (!require(devtools)) { install.packages("devtools") } devtools::install_github("calabrialab/ISAnalytics", ref = "RELEASE_3_12", dependencies = TRUE, build_vignettes = TRUE ) ## Safer option for vignette building issue devtools::install_github("calabrialab/ISAnalytics", ref = "RELEASE_3_12" ) # For devel version if (!require(devtools)) { install.packages("devtools") } devtools::install_github("calabrialab/ISAnalytics", ref = "master", dependencies = TRUE, build_vignettes = TRUE ) ## Safer option for vignette building issue devtools::install_github("calabrialab/ISAnalytics", ref = "master" ) ``` ```{r} library(ISAnalytics) ``` ## Setting options `ISAnalytics` has a verbose option that allows some functions to print additional information to the console while they're executing. To disable this feature do: ```{r OptVerbose, eval=FALSE} # DISABLE options("ISAnalytics.verbose" = FALSE) # ENABLE options("ISAnalytics.verbose" = TRUE) ``` Some functions also produce report in a user-friendly HTML format, to set this feature: ```{r OptWidg, eval=FALSE} # DISABLE HTML REPORTS options("ISAnalytics.widgets" = FALSE) # ENABLE HTML REPORTS options("ISAnalytics.widgets" = TRUE) ``` # Aggregating metadata We refer to information contained in the association file as "metadata": sometimes it's useful to obtain collective information based on a certain group of variables we're interested in. The function `aggregate_metadata` does just that: according to the grouping variables, meaning the names of the columns in the association file to perform a `group_by` operation with, creates a summary. You can fully customize the summary by providing a "function table" which tells the function which operation should be applied to which column and what name to give to the output column. A default is already supplied: ```{r echo=TRUE} knitr::kable(default_meta_agg()) ``` You can either provide purrr-style lambdas (as given in the example above), or simply specify the name of the function and additional parameters as a list in a separated column. If you choose to provide your own table you should maintain the column names for the function to work properly. ## Typical workflow Import the association file via `import_assocition_file`. If you need more information on import function please view the vignette "How to use import functions". ```{r} withr::with_options(list(ISAnalytics.widgets = FALSE), { path_AF <- system.file("extdata", "ex_association_file.tsv", package = "ISAnalytics" ) root_correct <- system.file("extdata", "fs.zip", package = "ISAnalytics" ) root_correct <- unzip_file_system(root_correct, "fs") association_file <- import_association_file(path_AF, root_correct) }) ``` Perform aggregation: ```{r} aggregated_meta <- aggregate_metadata( association_file, grouping_keys = c( "SubjectID", "CellMarker", "Tissue", "TimePoint" ) ) ``` ```{r echo=FALSE} knitr::kable(aggregated_meta) ``` # Aggregation of values by key `ISAnalytics` contains useful functions to aggregate the values contained in your imported matrices based on a key, aka a single column or a combination of columns contained in the association file that are related to the samples. ## Typical workflow Import your association file (see previous section) and then import your matrices: ```{r} withr::with_options(list(ISAnalytics.widgets = FALSE), { matrices <- import_parallel_Vispa2Matrices_auto( association_file = association_file, root = NULL, quantification_type = c("fragmentEstimate", "seqCount"), matrix_type = "annotated", workers = 2, patterns = NULL, matching_opt = "ANY", multi_quant_matrix = FALSE ) }) ``` The function `aggregate_values_by_key` can perform the aggregation both on the list of matrices and a single matrix. ```{r} # Takes the whole list and produces a list in output aggregated_matrices <- aggregate_values_by_key(matrices, association_file) # Takes a single matrix and produces a single matrix as output aggregated_matrices_single <- aggregate_values_by_key( matrices$seqCount, association_file ) ``` ```{r echo=FALSE} knitr::kable(head(aggregated_matrices_single)) ``` ### Changing parameters to obtain different results The function has several different parameters that have default values that can be changed according to user preference. 1. **Changing the `key` value** You can change the value of the parameter key as you see fit. This parameter should contain one or multiple columns of the association file that you want to include in the grouping when performing the aggregation. The default value is set to `c("SubjectID", "CellMarker", "Tissue", "TimePoint")` (same default key as the `aggregate_metadata` function). ```{r} agg1 <- aggregate_values_by_key( x = matrices$seqCount, association_file = association_file, key = c("SubjectID", "ProjectID") ) ``` ```{r echo=FALSE} knitr::kable(head(agg1)) ``` 2. **Changing the `lambda` value** The `lambda` parameter indicates the function(s) to be applied to the values for aggregation. `lambda` must be a named list of either functions or purrr-style lambdas: if you would like to specify additional parameters to the function the second option is recommended. The only important note on functions is that they should perform some kind of aggregation on numeric values: this means in practical terms they need to accept a vector of numeric/integer values as input and produce a SINGLE value as output. Valid options for this purpose might be: `sum`, `mean`, `median`, `min`, `max` and so on. ```{r} agg2 <- aggregate_values_by_key( x = matrices$seqCount, association_file = association_file, key = "SubjectID", lambda = list(mean = ~ mean(.x, na.rm = TRUE)) ) ``` ```{r echo=FALSE} knitr::kable(head(agg2)) ``` Note that, when specifying purrr-style lambdas (formulas), the first parameter needs to be set to `.x`, other parameters can be set as usual. You can also use in `lambda` functions that produce data frames or lists. In this case all variables from the produced data frame will be included in the final data frame. For example: ```{r} agg3 <- aggregate_values_by_key( x = matrices$seqCount, association_file = association_file, key = "SubjectID", lambda = list(describe = psych::describe) ) agg3 ``` 3. **Changing the `value_cols` value** The `value_cols` parameter tells the function on which numeric columns of x the functions should be applied. NOte that every function contained in `lambda` will be applied to every column in `value_cols`: resulting columns will be named as "original name_function applied". ```{r} ## Obtaining multi-quantification matrix comp <- comparison_matrix(matrices) agg4 <- aggregate_values_by_key( x = comp, association_file = association_file, key = "SubjectID", lambda = list(sum = sum, mean = mean), value_cols = c("seqCount", "fragmentEstimate") ) ``` ```{r echo=FALSE} knitr::kable(head(agg4)) ``` 4. **Changing the `group` value** The `group` parameter should contain all other variables to include in the grouping besides `key`. By default this contains `c("chr", "integration_locus", "strand", "GeneName", "GeneStrand")`. You can change this grouping as you see fit, if you don't want to add any other variable to the key, just set it to NULL. ```{r} agg5 <- aggregate_values_by_key( x = matrices$seqCount, association_file = association_file, key = "SubjectID", lambda = list(sum = sum, mean = mean), group = c(mandatory_IS_vars()) ) ``` ```{r echo=FALSE} knitr::kable(head(agg5)) ``` # Reproducibility The `r Biocpkg("ISAnalytics")` package `r citep(bib[["ISAnalytics"]])` was made possible thanks to: * R `r citep(bib[["R"]])` * `r Biocpkg("BiocStyle")` `r citep(bib[["BiocStyle"]])` * `r CRANpkg("knitcitations")` `r citep(bib[["knitcitations"]])` * `r CRANpkg("knitr")` `r citep(bib[["knitr"]])` * `r CRANpkg("rmarkdown")` `r citep(bib[["rmarkdown"]])` * `r CRANpkg("sessioninfo")` `r citep(bib[["sessioninfo"]])` * `r CRANpkg("testthat")` `r citep(bib[["testthat"]])` This package was developed using `r BiocStyle::Githubpkg("lcolladotor/biocthis")`. `R` session information. ```{r reproduce3, echo=FALSE} ## Session info library("sessioninfo") options(width = 120) session_info() ``` # Bibliography This vignette was generated using `r Biocpkg("BiocStyle")` `r citep(bib[["BiocStyle"]])` with `r CRANpkg("knitr")` `r citep(bib[["knitr"]])` and `r CRANpkg("rmarkdown")` `r citep(bib[["rmarkdown"]])` running behind the scenes. Citations made with `r CRANpkg("knitcitations")` `r citep(bib[["knitcitations"]])`. ```{r results = "asis", echo = FALSE, warning = FALSE, message = FALSE} ## Print bibliography bibliography() ```