--- title: "Sharing analyses with ISAnalytics" author: - name: Giulia Pais affiliation: | San Raffaele Telethon Institute for Gene Therapy - SR-Tiget, Via Olgettina 60, 20132 Milano - Italia email: giuliapais1@gmail.com, calabria.andrea@hsr.it output: BiocStyle::html_document: self_contained: yes toc: true toc_float: true toc_depth: 2 code_folding: show date: "`r doc_date()`" package: "`r pkg_ver('ISAnalytics')`" vignette: > %\VignetteIndexEntry{sharing_analyses} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", crop = NULL ## Related to ## https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html ) ``` ```{r vignetteSetup, echo=FALSE, message=FALSE, warning = FALSE} ## Bib setup library("RefManageR") ## Write bibliography information bib <- c( R = citation(), BiocStyle = citation("BiocStyle")[1], knitr = citation("knitr")[1], RefManageR = citation("RefManageR")[1], rmarkdown = citation("rmarkdown")[1], sessioninfo = citation("sessioninfo")[1], testthat = citation("testthat")[1], ISAnalytics = citation("ISAnalytics")[1], eulerr = citation("eulerr")[1] ) ``` # Introduction In this vignette we explain in more detail how to perform sharing analyses with `r Biocpkg("ISAnalytics")` and its dedicated sharing functions. ```{r echo=FALSE} inst_chunk_path <- system.file("rmd", "install_and_options.Rmd", package = "ISAnalytics") ``` ```{r child=inst_chunk_path} ``` # Shared integration sites An integration site is always characterized by a triple of values: `(chr, integration_locus, strand)`, hence these attributes are always present in integration matrices. ```{r} library(ISAnalytics) data("integration_matrices") data("association_file") ``` ```{r echo=FALSE} print(head(integration_matrices)) ``` We can aggregate our data in different ways according to our needs (to know more about this topic take a look at the vignette `vignette("aggregate_function_usage", package = "ISAnalytics")`), obtaining therefore different groups. Each group has an associated set of integration sites. ```{r} ## Aggregation by standard key agg <- aggregate_values_by_key(integration_matrices, association_file, value_cols = c("seqCount", "fragmentEstimate")) agg <- agg %>% dplyr::filter(TimePoint %in% c("0030", "0060")) ``` ```{r echo=FALSE} print(agg, width = Inf) ``` An integration site is *shared* between two or more groups if the same triple is observed in all the groups considered. # Automated sharing counts ISAnalytics provides the function `is_sharing()` for computing automated sharing counts. The function has several arguments that can be tuned according to user needs. ## SCENARIO 1: single input data frame and single grouping key ```{r} sharing_1 <- is_sharing(agg, group_key = c("SubjectID", "CellMarker", "Tissue", "TimePoint"), n_comp = 2, is_count = TRUE, relative_is_sharing = TRUE, minimal = TRUE, include_self_comp = FALSE, keep_genomic_coord = TRUE) sharing_1 ``` In this configuration we set: * A single input data frame: `agg` * A single grouping key by setting the argument `grouping_key`. In this specific case, our groups will be identified by a unique combination of `SubjectID`, `CellMarker`, `Tissue` and `TimePoint` * `n_comp` represents the number of comparisons to compute: 2 means we're interested in knowing the sharing for PAIRS of distinct groups * We want to keep the counts of distinct integration sites for each group by setting `is_count` to `TRUE` * `relative_is_sharing` if set to `TRUE` adds sharing expressed as a percentage, more precisely it adds a column `on_g1` that is calculated as the absolute number of shared integrations divided by the cardinality of the first group, `on_g2` is analogous but is computed on the cardinality of the second group and finally `on_union` is computed on the cardinality of the union of the two groups. * By setting the argument `minimal` to `TRUE` we tell the function to avoid redundant comparisons: in this way only **combinations** and not permutations are included in the output table * `include_self_comp` adds rows in the table that are labelled with the same group: these rows always have a 100% sharing with all other groups. There are few scenarios where this is useful, but for now we set it to `FALSE` since we don't need it * `keep_genomic_coord` allows us to keep the genomic coordinates of the shared integration sites as a separate table ### Changing the number of comparisons ```{r} sharing_1_a <- is_sharing(agg, group_key = c("SubjectID", "CellMarker", "Tissue", "TimePoint"), n_comp = 3, is_count = TRUE, relative_is_sharing = TRUE, minimal = TRUE, include_self_comp = FALSE, keep_genomic_coord = TRUE) sharing_1_a ``` Changing the `n_comp` to 3 means that we want to calculate the sharing between 3 different groups. Note that the `shared` column contains the counts of integrations that are shared by **ALL** groups, which is equivalent to a set intersection. Beware of the fact that the more comparisons are requested the more time the computation requires. ### A case when it is useful to set `minimal = FALSE` Setting `minimal = FALSE` produces all possible permutations of the groups and the corresponding values. In combination with `include_self_comp = TRUE`, this is useful when we want to know the sharing between pairs of groups and plot results as a heatmap. ```{r} sharing_1_b <- is_sharing(agg, group_key = c("SubjectID", "CellMarker", "Tissue", "TimePoint"), n_comp = 2, is_count = TRUE, relative_is_sharing = TRUE, minimal = FALSE, include_self_comp = TRUE) sharing_1_b heatmaps <- sharing_heatmap(sharing_1_b) ``` The function `sharing_heatmap()` automatically plots sharing between 2 groups. There are several arguments to this function that allow us to obtain heatmaps for the absolute sharing values or the relative (percentage) values. ```{r} heatmaps$absolute heatmaps$on_g1 heatmaps$on_union ``` Beware of the fact that calculating all permutations takes longer than calculating only distinct combinations, therefore if you don't need your results plotted or you have more than 2 groups to compare you should stick with `minimal = TRUE` and `include_self_comp = FALSE`. ## SCENARIO 2: single input data frame and multiple grouping keys In the first scenario, groups were homogeneous, that is, they were grouped all with the same key. In this other scenario we want to start with data contained in just one data frame but we want to compare sets of integrations that are grouped differently. To do this we give as input a **list of keys** through the argument `group_keys`. ```{r} sharing_2 <- is_sharing(agg, group_keys = list( g1 = c("SubjectID", "CellMarker", "Tissue", "TimePoint"), g2 = c("SubjectID", "CellMarker"), g3 = c("CellMarker", "Tissue") )) sharing_2 ``` There are a few things to keep in mind in this case: * The arguments `group_key` (notice the absence of plural), `n_comp` and `include_self_comp` are ignored: the number of comparisons is automatically detected by counting the provided keys and a self comparison doesn't make sense since group labels are different * If you provide a list of identical keys or just one key you fall back to scenario 1 ## SCENARIO 3: multiple input data frame and single grouping key Providing multiple input data frames and the same grouping key is an effective way to reduce the number of comparisons performed. Let's make an example: suppose we're interested in comparing groups labelled by a unique combination of `SubjectID`, `CellMarker`, `Tissue` and `TimePoint`, but this time we want the first group to contain only integrations relative to `PT001_MNC_BM_0030` and the second group to contain only integrations relative to `PT001_MNC_BM_0060`. We're going to filter the original data frame in order to obtain only relevant data in 2 separated tables and then proceed by calling the function. ```{r} first_sample <- agg %>% dplyr::filter(SubjectID == "PT001", CellMarker == "MNC", Tissue == "BM", TimePoint == "0030") second_sample <- agg %>% dplyr::filter(SubjectID == "PT001", CellMarker == "MNC", Tissue == "BM", TimePoint == "0060") sharing_3 <- is_sharing(first_sample, second_sample, group_key = c("SubjectID", "CellMarker", "Tissue", "TimePoint"), is_count = TRUE, relative_is_sharing = TRUE, minimal = TRUE) sharing_3 ``` Once again the arguments `n_comp` and `include_self_comp` are ignored: the number of comparisons is equal to the number of data frames in input. ## SCENARIO 4: multiple input data frame and multiple grouping keys Finally, the most general scenario is when we have multiple data frames with multiple keys. In this case the number of data frames must be equal to the number of provided keys and grouping keys are applied in order ( data frame 1 is grouped with key 1, data frame 2 is grouped with key 2, and so on). ```{r} df1 <- agg %>% dplyr::filter(TimePoint == "0030") df2 <- agg %>% dplyr::filter(TimePoint == "0060") df3 <- agg %>% dplyr::filter(Tissue == "BM") keys <- list(g1 = c("SubjectID", "CellMarker", "Tissue"), g2 = c("SubjectID", "Tissue"), g3 = c("SubjectID", "CellMarker", "Tissue")) sharing_4 <- is_sharing(df1, df2, df3, group_keys = keys) sharing_4 ``` Notice that in this example the keys for g1 and g3 are the same, meaning the labels of the groups are actually the same, however you should remember that the groups contain a **different set of integration sites** since they come from different data frames. # Plotting sharing results When we have more than 2 comparisons it is convenient to plot them as venn or euler diagrams. ISAnalytics has a fast way to do that through the functions `is_sharing()` and `sharing_venn()`. ```{r} sharing_5 <- is_sharing(agg, group_keys = list( g1 = c("SubjectID", "CellMarker", "Tissue", "TimePoint"), g2 = c("SubjectID", "CellMarker"), g3 = c("CellMarker", "Tissue") ), table_for_venn = TRUE) sharing_5 ``` The argument `table_for_venn = TRUE` will add a new column `truth_tbl_venn` that contains corresponding truth tables for each row. ```{r} sharing_plots1 <- sharing_venn(sharing_5, row_range = 1, euler = TRUE) sharing_plots2 <- sharing_venn(sharing_5, row_range = 1, euler = FALSE) ``` Say that we're interested in plotting just the first row of our sharing data frame. Then we can call the function `sharing_venn` and specify in the `row_range` argument the index 1. Note that this function requires the package `r CRANpkg("eulerr")` to work. The argument `euler` indicates if the function should produce euler or venn diagrams instead. Once obtained the lists of euler/venn objects we can plot them by simply calling the function `plot()`: ```{r} plot(sharing_plots1[[1]]) plot(sharing_plots2[[1]]) ``` There are several options that can be set, for this please refer to [eulerr docs](https://jolars.github.io/eulerr/reference/plot.euler.html). # Reproducibility `R` session information. ```{r reproduce3, echo=FALSE} ## Session info library("sessioninfo") options(width = 120) session_info() ``` # Bibliography This vignette was generated using `r Biocpkg("BiocStyle")` `r Citep(bib[["BiocStyle"]])` with `r CRANpkg("knitr")` `r Citep(bib[["knitr"]])` and `r CRANpkg("rmarkdown")` `r Citep(bib[["rmarkdown"]])` running behind the scenes. Citations made with `r CRANpkg("RefManageR")` `r Citep(bib[["RefManageR"]])`. ```{r vignetteBiblio, results = "asis", echo = FALSE, warning = FALSE, message = FALSE} ## Print bibliography PrintBibliography(bib, .opts = list(hyperlink = "to.doc", style = "html")) ```