--- title: "Introduction to TreeSummarizedExperiment" author: - name: Ruizhu HUANG affiliation: - Institute of Molecular Life Sciences, University of Zurich. - SIB Swiss Institute of Bioinformatics. - name: Charlotte Soneson affiliation: - Institute of Molecular Life Sciences, University of Zurich. - SIB Swiss Institute of Bioinformatics. - name: Mark Robinson affiliation: - Institute of Molecular Life Sciences, University of Zurich. - SIB Swiss Institute of Bioinformatics. package: TreeSummarizedExperiment output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{Tree Aggregation} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: chunk_output_type: console bibliography: TreeSE_vignette.bib --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE) ``` # Introduction The `TreeSummarizedExperiment` class is an extension of the `SingleCellExperiment` class [@Aaron2019]. It's used to store rectangular data of experimental results as in a `SingleCellExperiment`, and also supports the storage of a hierarchical structure and its link information to the rectangular data. # TreeSummarizedExperiment {#tse-class} ## Anatomy of TreeSummarizedExperiment ```{r strTSE, echo=FALSE, fig.cap= "The structure of the TreeSummarizedExperiment class."} knitr::include_graphics("tse.png") ``` Compared with the `SingleCellExperiment` class, `TreeSummarizedExperiment` has four more slots. * `rowTree`: the hierarchical structure on the rows of the `assays` tables. * `rowLinks`: the link between rows of the `assays` tables and the `rowTree`. * `colTree`: the hierarchical structure on the columns of the `assays` tables. * `colLinks`: the link information between columns of `assays` tables and the `colTree`. The `rowTree` and `colTree` could be empty (`NULL`) if no trees are available. Correspondingly, the `rowLinks` and `colLinks` would be `NULL. All the other slots in `TreeSummarizedExperiment` are inherited from `SingleCellExperiment`. The slots `rowTree` and `colTree` only accept the tree data as the `phylo` class. If the tree is available in other formats, one would need to convert it to `phylo` with other R packages. For example, the package `r Biocpkg("treeio")` provides 12 functions to import different tree formats and output `phylo` object in the slot `phylo`. ## Toy data ```{r} suppressPackageStartupMessages({ library(TreeSummarizedExperiment) library(S4Vectors) library(ggtree) library(ape)}) ``` We generate a **assay_data** with observations of 5 entities collected from 4 samples. ```{r} # assays data assay_data <- rbind(rep(0, 4), matrix(1:16, nrow = 4)) colnames(assay_data) <- paste(rep(LETTERS[1:2], each = 2), rep(1:2, 2), sep = "_") rownames(assay_data) <- paste("entity", seq_len(5), sep = "") assay_data ``` The descriptions of the `r nrow(assay_data)` entities and `r ncol(assay_data)` samples are given in the **row_data** and **col_data**, respectively. ```{r} # row data row_data <- DataFrame(var1 = sample(letters[1:2], 5, replace = TRUE), var2 = sample(c(TRUE, FALSE), 5, replace = TRUE), row.names = rownames(assay_data)) row_data # column data col_data <- DataFrame(gg = c(1, 2, 3, 3), group = rep(LETTERS[1:2], each = 2), row.names = colnames(assay_data)) col_data ``` The hierarchical structure of the `r nrow(assay_data)` entities is denoted as **row_tree**. The hierarchical structure of the `r ncol(assay_data)` samples is denoted as **col_tree**. We create them by using the function `rtree` from the package `r CRANpkg("ape")`. ```{r} # Toy tree 1 set.seed(1) row_tree <- rtree(5) class(row_tree) # Toy tree 2 set.seed(4) col_tree <- rtree(4) col_tree$tip.label <- colnames(assay_data) col_tree$node.label <- c("All", "GroupA", "GroupB") ``` The created trees are `phylo` objects. The `phylo` object is actually a list with at least four elements: `edge`, `tip.label`, `edge.length`, and `Nnode`. ```{r} class(row_tree) str(row_tree) ``` The package `r Biocpkg("ggtree")` [@Yu2017] has been used to visualize the tree. The node labels and node numbers are in blue and orange texts, respectively. The **row_tree** has no labels for internal nodes. ```{r rtree, fig.cap="\\label{rtree} The structure of the row tree"} # Visualize the row tree ggtree(row_tree, size = 2) + geom_text2(aes(label = node), color = "darkblue", hjust = -0.5, vjust = 0.7, size = 6) + geom_text2(aes(label = label), color = "darkorange", hjust = -0.1, vjust = -0.7, size = 6) ``` The **col_tree** has labels for internal nodes. ```{r ctree, fig.cap="\\label{ctree} The structure of the column tree"} # Visualize the column tree ggtree(col_tree, size = 2) + geom_text2(aes(label = node), color = "darkblue", hjust = -0.5, vjust = 0.7, size = 6) + geom_text2(aes(label = label), color = "darkorange", hjust = -0.1, vjust = -0.7, size = 6) ``` ## The construction of `TreeSummarizedExperiment` The `TreeSummarizedExperiment` class is used to store the toy data: **assay_data**, **row_data**, **col_data**, **col_tree** and **row_tree**, To correctly store data, the link information between the rows (or columns) of **assay_data** and the nodes of the **row_tree** (or **col_tree**) is requried to provide via a charactor vector `rowNodeLab` (or `colNodeLab`). Those columns or rows that don't match with any node of the tree structure are removed with warnings. The link data between the `assays` tables and the tree data is automatically generated in the construction. Below shows an example to construct `TreeSummarizedExperiment` without the column tree. ```{r} # provide the node labels in rowNodeLab node_lab <- row_tree$tip.label row_tse <- TreeSummarizedExperiment(assays = list(assay_data), rowData = row_data, colData = col_data, rowTree = row_tree, rowNodeLab = node_lab) ``` When printing out **row_tse**, we see a similar message as `SingleCellExperiment` with four additional lines about `rowLinks`, `rowTree`, `colLinks` and `colTree`. Here, **row_tse** stores a row tree (`phylo` object), and the `rowLinks` has `r nrow(row_tse)` rows that is exactly the same as the number of rows in the `assays` tables. More details about the link data could be found in Section \@ref(linkData). ```{r} row_tse ``` If the row tree and the column tree are both available, the `TreeSummarizedExperiment` could be constructed similarly as below. Here, the column names of the `assays` table match with the node labels used in the column tree. So, we could omit the step of providing `colNodeLab`. ```{r} all(colnames(assay_data) %in% c(col_tree$tip.label, col_tree$node.label)) both_tse <- TreeSummarizedExperiment(assays = list(assay_data), rowData = row_data, colData = col_data, rowTree = row_tree, rowNodeLab = node_lab, colTree = col_tree) ``` Compared to **row_tse**, **both_tse** includes also a column tree. The column link data (`colLinks`) with `r ncol(both_tse)` rows is automatically generated. The number of rows in the link data is decided by the column dimension of the `assays` tables. ```{r} both_tse ``` ## The accessor functions ### Assays, rowData, colData, and metadata For slots inherited from the `SingleCellExperiment` class, the accessors are exactly the same as shown in `r Biocpkg("SingleCellExperiment")`. ```{r} # to get the first table in the assays (count <- assays(both_tse)[[1]]) ``` ```{r} # to get row data rowData(both_tse) ``` ```{r} # to get column data colData(both_tse) ``` ```{r} # to get metadata: it's empty here metadata(both_tse) ``` ### rowLinks, colLinks {#linkData} The row link and column link could be accessed via `rowLinks` and `colLinks`, respectively. The output would be a `LinkDataFrame` object. The `LinkDataFrame` class is extended from the `DataFrame` class with the restriction that it has at least four columns: **nodeLab**, **nodeLab\_alias**, **nodeNum**, and **isLeaf**. More details about the `DataFrame` class could be found in the `r Biocpkg("S4Vectors")` package. * nodeLab: the labels of nodes on the tree * nodeLab\_alias: the alias labels of nodes on the tree * nodeNum: the numbers of nodes on the tree * isLeaf: it's to indicate whether the node is a leaf node When a `phylo` tree is available in the `rowTree`, we could see a `LinkDataFrame` object in the `rowLinks`. The number of rows of `rowLinks` data matches with the number of rows of `assays` tables. ```{r} (rLink <- rowLinks(both_tse)) ``` ```{r} class(rLink) showClass("LinkDataFrame") ``` ```{r} nrow(rLink) == nrow(both_tse) ``` Similarly, the number of rows of `colLinks` data matches with the number of columns of `assays` table. ```{r} (cLink <- colLinks(both_tse)) nrow(cLink) == ncol(both_tse) ``` If the tree is not available, the corresponding link data is `NULL`. ```{r} colTree(row_tse) colLinks(row_tse) ``` The link data is automatically generated when constructing the `TreeSummarizedExperiment` object. We highly recommend users not to modify it manually; otherwise the link might be broken. For R packages developers, we show in the Section \@ref(modifyLink) about how to update the link. ## The subseting function We could use `[` to subset the `TreeSummarizedExperiment`. To keep track of the original data, the `rowTree` and `colTree` stay the same in the subsetting. ```{r} sub_tse <- both_tse[1:2, 1] sub_tse ``` The annotation data on the row and column dimension is changed accordingly. ```{r} # The first four columns are from rowLinks data and the others from rowData cbind(rowLinks(sub_tse), rowData(sub_tse)) ``` ```{r} # The first four columns are from colLinks data and the others from colData cbind(colLinks(sub_tse), colData(sub_tse)) ``` # Aggregation The aggregation is allowed on the row and the column dimension. ## The column dimension {#aggCol} Here, we show the aggregation on the column dimension. The `TreeSummarizedExperiment` object is assigned to the argument `x`. The desired aggregation level is given in `colLevel`. The level could be specified via the node label (the orange texts in Figure \@ref(fig:ctree)) or the node number (the blue texts in Figure \@ref(fig:ctree)). We could further decide how to aggregate via the argument `FUN`. ```{r} # use node labels to specify colLevel aggCol <- aggValue(x = both_tse, colLevel = c("GroupA", "GroupB"), FUN = sum) # or use node numbers to specify colLevel aggCol <- aggValue(x = both_tse, colLevel = c(6, 7), FUN = sum) ``` ```{r} assays(aggCol)[[1]] ``` The `rowData` doesn't change, but the `colData` adjusts with the change of the \code{assays} table. For example, the column **group** has the `A` value for `GroupA` because the descendant nodes of `GroupA` all have the value `A`; the column **gg** has the `NA` value for `GroupA` because the descendant nodes of `GroupA` have different values, (1 and 2). ```{r} # before aggregation colData(both_tse) # after aggregation colData(aggCol) ``` The `colLinks` is updated to link the new rows of `assays` tables and the column tree. ```{r} # the link data is updated colLinks(aggCol) ``` From the Figure \@ref(fig:rtree), we could see that the nodes 6 and 7 are labelled with `GroupA` and `GroupB`, respectively. This agrees with the column link data. ## The row dimension {#aggRow} It's similar to the aggregation on the row dimension, except that the level should be specified via `rowLevel`. ```{r} agg_row <- aggValue(x = both_tse, rowLevel = 7:9, FUN = sum) ``` Now, the output `assays` table has 3 rows. ```{r} assays(agg_row)[[1]] ``` We could see which row corresponds to which nodes via the `rowLinks` data. ```{r} rowLinks(agg_row) ``` The Figure \@ref(fig:rtree) shows that the nodes 7, 8 and 9 have no labels. Therefore, the `nodeLab` column in `LinkData` of the row data has missing value. They are all internal nodes and hence the column `isLeaf` has only `FALSE` value. ## Both dimensions The aggregation on both row and column dimensions could be performed in one step using the same function specified via `FUN`. If different functions are required for different dimension, it's suggested to do it in two steps as described in Section \@ref(aggRow) and Section \@ref(aggCol) because the order of aggregation might matter. ```{r} agg_both <- aggValue(x = both_tse, colLevel = c(6, 7), rowLevel = 7:9, FUN = sum) ``` As expected, we obtain a table with 3 rows (`rowLevel = 7:9`) and 2 columns (`colLevel = c(6, 7)`). ```{r} assays(agg_both)[[1]] ``` # Aggregation on the taxonomic table In some case, the information of the hierarchical structure is available as a `data.frame` instead of the `phylo` object mentioned above. To do the work listed above, we could convert the `data.frame` to the `phylo` class. The function `toTree` outputs the hierarchical information into a `phylo` object. If the data set is large, we suggest to allow `cache = TRUE` to speed up the aggregation step. ```{r} # The toy taxonomic table taxa <- data.frame(Kindom = rep("A", 5), Phylum = c("B1", rep("B2", 4)), Class = c("C1", "C2", "C3", "C3", NA), OTU = c("D1", "D2", "D3", "D4", NA)) # convert to a phylo tree taxa_tree <- toTree(data = taxa, cache = FALSE) ggtree(taxa_tree)+ geom_text2(aes(label = node), color = "darkblue", hjust = -0.5, vjust = 0.7, size = 6) + geom_text2(aes(label = label), color = "darkorange", hjust = -0.1, vjust = -0.7, size = 6) + geom_point2() ``` ```{r} # construct a TreeSummarizedExperiment object taxa_tse <- TreeSummarizedExperiment(assays = list(assay_data), rowData = row_data, rowTree = taxa_tree, rowNodeLab = taxa_tree$tip.label) ``` Here is about how to aggregate to the phylum level. ```{r} # specify the level taxa_lab <- c(taxa_tree$tip.label, taxa_tree$node.label) ii <- startsWith(taxa_lab, "Phylum:") (l1 <- taxa_lab[ii]) # aggregate agg_taxa <- aggValue(x = taxa_tse, rowLevel = l1, FUN = sum) ``` ```{r} assays(agg_taxa)[[1]] ``` ```{r} rowData(agg_taxa) ``` The aggregation could be on any freely combined level. ```{r} # specify the level l2 <- c("Class:C3", "Phylum:B1") # aggregate agg_any <- aggValue(x = taxa_tse, rowLevel = l2, FUN = sum) ``` ```{r} assays(agg_any)[[1]] ``` ```{r} rowData(agg_any) ``` # Additional ## More about functions working on the `phylo` object. Here, we show some functions as examples to manipulate or to extract information from the `phylo` object. More functions could be found in other packages, such as `r CRANpkg("ape")` [@ape2018], `r CRANpkg("tidytree")`. These functions might be useful when R package developers want to create their own functions to work on the `TreeSummarizedExperiment` class. Below shows the node label (black texts) and node number (blue texts) of each node on an example tree. ```{r} ggtree(tinyTree, branch.length = "none") + geom_text2(aes(label = label), hjust = -0.3) + geom_text2(aes(label = node), vjust = -0.8, hjust = -0.3, color = 'blue') ``` ### print out nodes of the tree We could specify to print out all nodes (`type = "all"`), the leaves (`type = "leaf"`) or the internal nodes (`type = "internal"`). ```{r} printNode(tree = tinyTree, type = "all") ``` ### Count the number of nodes ```{r} # The number of leaves countLeaf(tree = tinyTree) # The number of nodes (leaf nodes and internal nodes) countNode(tree = tinyTree) ``` ### Translation between the node label and the node number The translation between the labels and the numbers of nodes could be achieved by the function `transNode`. ```{r} transNode(tree = tinyTree, node = c(12, 1, 4)) ``` ```{r} transNode(tree = tinyTree, node = c("t4", "Node_18")) ``` ### find the descendants To get descendants that are on the leaf level, we could set the argument `only.leaf = TRUE`. ```{r} # only the leaf nodes findOS(tree = tinyTree, node = 17, only.leaf = TRUE) ``` The argument `only.leaf = FALSE` is set to get all descendants ```{r} # all descendant nodes findOS(tree = tinyTree, node = 17, only.leaf = FALSE) ``` ### find the sibling node The input `node` could be either the node label or the node number. ```{r} # node = 5, node = "t4" are the same node findSibling(tree = tinyTree, node = 5) findSibling(tree = tinyTree, node = "t4") ``` ### find the share node This would find the first node that joined by the specified nodes (`node`) in the path to the root. ```{r} shareNode(tree = tinyTree, node = c(5, 6)) ``` ### identify a leaf node ```{r} isLeaf(tree = tinyTree, node = 5) isLeaf(tree = tinyTree, node = 17) ``` ### calculate the distance between two nodes The distance between any two nodes on the tree could be calculated by `distNode`. ```{r} distNode(tree = tinyTree, node = c(1, 5)) ``` ### prune tree via leaves We could specify the leaf nodes `rmLeaf` to remove parts of a tree. If `mergeSingle = TRUE`, the internal node that is connected to the removed leaf nodes is removed too; otherwise, it is kept. ```{r} NT1 <- pruneTree(tree = tinyTree, rmLeaf = c(4, 5), mergeSingle = TRUE) ggtree(NT1, branch.length = "none") + geom_text2(aes(label = label), color = "darkorange", hjust = -0.1, vjust = -0.7) + geom_point2() NT2 <- pruneTree(tree = tinyTree, rmLeaf = c(4, 5), mergeSingle = FALSE) ggtree(NT2, branch.length = "none") + geom_text2(aes(label = label), color = "darkorange", hjust = -0.1, vjust = -0.7) + geom_point2() ``` ### convert a `phylo` object to a matrix Each row gives a path that connects a leaf and the root. ```{r} matTree(tree = tinyTree) ``` ## Customize functions for the `TreeSummarizedExperiment` class {#modifyLink} We show examples about how to create functions for the `TreeSummarizedExperiment`. R package developers could customize their functions based on the functions provided above on the `phylo` object or develop their own ones. Here, a function `rmRows` is created to remove entities (on rows) that have zero in all samples (on columns) in the first `assays` table. ```{r} # dat: a TreeSummarizedExperiment rmRows <- function(dat) { # calculate the total counts of each row count <- assays(dat)[[1]] tot <- apply(count, 1, sum) # find the row with zero in all columns ind <- which(tot == 0) # remove those rows out <- dat[-ind, ] return(out) } (rte <- rmRows(dat = both_tse)) rowLinks(rte) ``` The function `rmRows` doesn't update the tree data. To update the tree, we could do it as below with the help of `ape::drop.tip`. ```{r} updateRowTree <- function(tse, dropLeaf) { ## -------------- new tree: drop leaves ---------- oldTree <- rowTree(tse) newTree <- ape::drop.tip(phy = oldTree, tip = dropLeaf) ## -------------- update the row link ---------- # track the tree track <- trackNode(oldTree) track <- ape::drop.tip(phy = track, tip = dropLeaf) # row links rowL <- rowLinks(tse) rowL <- DataFrame(rowL) # update the row links: # 1. use the alias label to track and updates the nodeNum # 2. the nodeLab should be updated based on the new tree using the new # nodeNum # 3. lastly, update the nodeLab_alias rowL$nodeNum <- transNode(tree = track, node = rowL$nodeLab_alias, message = FALSE) rowL$nodeLab <- transNode(tree = newTree, node = rowL$nodeNum, use.alias = FALSE, message = FALSE) rowL$nodeLab_alias <- transNode(tree = newTree, node = rowL$nodeNum, use.alias = TRUE, message = FALSE) rowL$isLeaf <- isLeaf(tree = newTree, node = rowL$nodeNum) rowNL <- as(rowL, "LinkDataFrame") ## update the row tree and links newDat <- BiocGenerics:::replaceSlots(tse, rowLinks = rowNL, rowTree = list(phylo = newTree)) return(newDat) } ``` Now the row tree has four leaves. ```{r} # find the mismatch between the rows of the 'assays' table and the leaves of the # tree row_tree <- rowTree(rte) row_link <- rowLinks(rte) leaf_tree <- printNode(tree = row_tree,type = "leaf")$nodeNum leaf_data <- row_link$nodeNum[row_link$isLeaf] leaf_rm <- setdiff(leaf_tree, leaf_data) ntse <- updateRowTree(tse = rte, dropLeaf = leaf_rm) ``` ```{r} ntse rowLinks(ntse) ``` # Session Info ```{r} sessionInfo() ``` # Reference