The TreeSummarizedExperiment class is an extension of
the SingleCellExperiment class (Lun
and Risso 2020). It’s used to store rectangular data of
experimental results as in a SingleCellExperiment, and also
supports the storage of a hierarchical structure and its link
information to the rectangular data.
Compared
to the
SingleCellExperiment objects,
TreeSummarizedExperiment has five additional slots:
rowTree: the hierarchical structure on the rows of the
assays.rowLinks: the link information between rows of the
assays and the rowTree.colTree: the hierarchical structure on the columns of
the assays.colLinks: the link information between columns of the
assays and the colTree.referenceSeq (optional): the reference sequence data
per feature (row).The rowTree and\(/\)or
colTree can be left empty (NULL) if no trees
are available; in this case, the rowLinks and\(/\)or colLinks are also set to
NULL. All other TreeSummarizedExperiment slots
are inherited from SingleCellExperiment.
The rowTree and colTree slots require the
tree to be an object of the phylo class. If a tree is
available in an alternative format, it can often be converted to a
phylo object using dedicated R packages (e.g.,
r Biocpkg("treeio") (Wang et al.
2019)).
The referenceSeq slot is optional. It accepts the
reference sequence data of features either as DNAStringSet
or DNAStringSetList. More details are in @ref(refSeq).
Functions in the TreeSummarizedExperiment
package fall in two main categories: operations on the
TreeSummarizedExperiment object or operations on the tree
(phylo) objects. The former includes constructors and
accessors, and the latter serves as “pieces” to be assembled as
accessors or functions that manipulate the
TreeSummarizedExperiment object. Given that more than 200 R
packages make use of the phylo class, there are many
resources (e.g., ape) for users to
manipulate the small “pieces” in addition to those provided in TreeSummarizedExperiment.
We generate a toy dataset that has observations of 6 entities
collected from 4 samples as an example to show how to construct a
TreeSummarizedExperiment object.
library(TreeSummarizedExperiment)
# assays data (typically, representing observed data from an experiment)
assay_data <- rbind(rep(0, 4), matrix(1:20, nrow = 5))
colnames(assay_data) <- paste0("sample", 1:4)
rownames(assay_data) <- paste("entity", seq_len(6), sep = "")
assay_data## sample1 sample2 sample3 sample4
## entity1 0 0 0 0
## entity2 1 6 11 16
## entity3 2 7 12 17
## entity4 3 8 13 18
## entity5 4 9 14 19
## entity6 5 10 15 20
The information of entities and samples are given in the row_data and col_data, respectively.
# row data (feature annotations)
row_data <- data.frame(Kingdom = "A",
Phylum = rep(c("B1", "B2"), c(2, 4)),
Class = rep(c("C1", "C2", "C3"), each = 2),
OTU = paste0("D", 1:6),
row.names = rownames(assay_data),
stringsAsFactors = FALSE)
row_data## Kingdom Phylum Class OTU
## entity1 A B1 C1 D1
## entity2 A B1 C1 D2
## entity3 A B2 C2 D3
## entity4 A B2 C2 D4
## entity5 A B2 C3 D5
## entity6 A B2 C3 D6
# column data (sample annotations)
col_data <- data.frame(gg = c(1, 2, 3, 3),
group = rep(LETTERS[1:2], each = 2),
row.names = colnames(assay_data),
stringsAsFactors = FALSE)
col_data## gg group
## sample1 1 A
## sample2 2 A
## sample3 3 B
## sample4 3 B
The hierarchical structure of the 6 entities and
r ncol(assay_data) samples are denoted as
row_tree and col_tree, respectively.
The two trees are phylo objects randomly created with
rtree from the package ape. Note that
the row tree has 5 rather than 6 leaves; this is used later to show that
multiple rows in the assays are allowed to map to a single
node in the tree.
library(ape)
# The first toy tree
set.seed(12)
row_tree <- rtree(5)
# The second toy tree
set.seed(12)
col_tree <- rtree(4)
# change node labels
col_tree$tip.label <- colnames(assay_data)
col_tree$node.label <- c("All", "GroupA", "GroupB")We visualize the tree using the package ggtree (Yu et al. 2017). Here, the internal nodes of the row_tree have no labels as shown in Figure @ref(fig:plot-rtree).
library(ggtree)
library(ggplot2)
# Visualize the row tree
ggtree(row_tree, size = 2, branch.length = "none") +
geom_text2(aes(label = node), color = "darkblue",
hjust = -0.5, vjust = 0.7, size = 5.5) +
geom_text2(aes(label = label), color = "darkorange",
hjust = -0.1, vjust = -0.7, size = 5.5) The structure of the row tree. The node labels and the node numbers are in orange and blue text, respectively.
The col_tree has labels for internal nodes.
# Visualize the column tree
ggtree(col_tree, size = 2, branch.length = "none") +
geom_text2(aes(label = node), color = "darkblue",
hjust = -0.5, vjust = 0.7, size = 5.5) +
geom_text2(aes(label = label), color = "darkorange",
hjust = -0.1, vjust = -0.7, size = 5.5)+
ylim(c(0.8, 4.5)) +
xlim(c(0, 2.2))The structure of the column tree. The node labels and the node numbers are in orange and blue text, respectively.
TreeSummarizedExperimentThe TreeSummarizedExperiment class is used to store the
toy data generated in the previous section: assay_data,
row_data, col_data,
col_tree and row_tree. To correctly
store data, the link information between the rows (or columns) of
assay_data and the nodes of the
row_tree (or col_tree) can be provided
via a character vector rowNodeLab (or
colNodeLab), with length equal to the number of rows (or
columns) of the assays; otherwise the row (or column) names
are used. Those columns or rows with labels that are not present among
the node labels of the tree are removed with warnings. The link data
between the assays tables and the tree data is
automatically generated in the construction.
The row and column trees can be included simultaneously in the
construction. Here, the column names of assay_data can
be found in the node labels of the column tree, which enables the link
to be created between the column dimension of
assay_data and the column tree
col_tree. If the row names of
assay_data are not in the node labels of
row_tree, we would need to provide their corresponding
node labels (row_lab) to rowNodeLab in the
construction of the object. It is allowed to have multiple rows or/and
columns mapped to a node, for example, the same leaf label is used for
the first two rows in row_lab.
# all column names could be found in the node labels of the column tree
all(colnames(assay_data) %in% c(col_tree$tip.label, col_tree$node.label))## [1] TRUE
# provide the node labels in rowNodeLab
tip_lab <- row_tree$tip.label
row_lab <- tip_lab[c(1, 1:5)]
row_lab## [1] "t3" "t3" "t2" "t1" "t5" "t4"
both_tse <- TreeSummarizedExperiment(assays = list(Count = assay_data),
rowData = row_data,
colData = col_data,
rowTree = row_tree,
rowNodeLab = row_lab,
colTree = col_tree)## class: TreeSummarizedExperiment
## dim: 6 4
## metadata(0):
## assays(1): Count
## rownames(6): entity1 entity2 ... entity5 entity6
## rowData names(4): Kingdom Phylum Class OTU
## colnames(4): sample1 sample2 sample3 sample4
## colData names(2): gg group
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: a LinkDataFrame (6 rows)
## rowTree: 1 phylo tree(s) (5 leaves)
## colLinks: a LinkDataFrame (4 rows)
## colTree: 1 phylo tree(s) (4 leaves)
When printing out both_tse, we see a similar message
as SingleCellExperiment with four additional lines for
rowLinks, rowTree, colLinks and
colTree.
For slots inherited from the SingleCellExperiment class,
the accessors are exactly the same as shown in SingleCellExperiment.
These accessors are both setters and getters.
## sample1 sample2 sample3 sample4
## entity1 0 0 0 0
## entity2 1 6 11 16
## entity3 2 7 12 17
## entity4 3 8 13 18
## entity5 4 9 14 19
## entity6 5 10 15 20
## DataFrame with 6 rows and 4 columns
## Kingdom Phylum Class OTU
## <character> <character> <character> <character>
## entity1 A B1 C1 D1
## entity2 A B1 C1 D2
## entity3 A B2 C2 D3
## entity4 A B2 C2 D4
## entity5 A B2 C3 D5
## entity6 A B2 C3 D6
## DataFrame with 4 rows and 2 columns
## gg group
## <numeric> <character>
## sample1 1 A
## sample2 2 A
## sample3 3 B
## sample4 3 B
## list()
For new slots, we provide rowTree (and
colTree) to retrieve the row (column) trees, and
rowLinks (and colLinks) to retrieve the link
information between assays and nodes of the row (column)
tree. If the tree is not available, the corresponding link data is
NULL.
##
## Phylogenetic tree with 5 tips and 4 internal nodes.
##
## Tip labels:
## t3, t2, t1, t5, t4
##
## Rooted; includes branch length(s).
##
## Phylogenetic tree with 4 tips and 3 internal nodes.
##
## Tip labels:
## sample1, sample2, sample3, sample4
## Node labels:
## All, GroupA, GroupB
##
## Rooted; includes branch length(s).
rowTree and colTree can work not only as
getters but also as setters. The replacement requires that the row/col
names of the TSE object can be matched to node labels
of the new row/col tree; otherwise changeTree should be
used with rowNodeLab or colNodeLab available
@ref(change-tree).
new_tse <- both_tse
# a new tree
new_tree <- rtree(nrow(new_tse))
new_tree$tip.label <- rownames(new_tse)
# the original row tree is replaced with the new tree
rowTree(new_tse) <- new_tree
identical(rowTree(new_tse), rowTree(both_tse))## [1] FALSE
## [1] TRUE
rowLinks and colLinks only work as
getters.
## LinkDataFrame with 6 rows and 5 columns
## nodeLab nodeLab_alias nodeNum isLeaf whichTree
## <character> <character> <integer> <logical> <character>
## entity1 t3 alias_1 1 TRUE phylo
## entity2 t3 alias_1 1 TRUE phylo
## entity3 t2 alias_2 2 TRUE phylo
## entity4 t1 alias_3 3 TRUE phylo
## entity5 t5 alias_4 4 TRUE phylo
## entity6 t4 alias_5 5 TRUE phylo
## LinkDataFrame with 4 rows and 5 columns
## nodeLab nodeLab_alias nodeNum isLeaf whichTree
## <character> <character> <integer> <logical> <character>
## sample1 sample1 alias_1 1 TRUE phylo
## sample2 sample2 alias_2 2 TRUE phylo
## sample3 sample3 alias_3 3 TRUE phylo
## sample4 sample4 alias_4 4 TRUE phylo
The link data objects are of the LinkDataFrame class,
which extends the DataFrame class with the restriction that
it has at least four columns:
nodeLab: the labels of nodes on the treenodeLab_alias: the alias labels of nodes on the
treenodeNum: the numbers of nodes on the treeisLeaf: whether the node is a leaf nodewhichTree: the name of the tree that is mapped to (See
vignette 2).More details about the DataFrame class could be found in
the S4Vectors
R/Bioconductor package.
## [1] "LinkDataFrame"
## attr(,"package")
## [1] "TreeSummarizedExperiment"
## Class "LinkDataFrame" [package "TreeSummarizedExperiment"]
##
## Slots:
##
## Name: rownames nrows elementType elementMetadata
## Class: character_OR_NULL integer character DataFrame_OR_NULL
##
## Name: metadata listData
## Class: list list
##
## Extends:
## Class "DFrame", directly
## Class "LinkDataFrame_Or_NULL", directly
## Class "DataFrame", by class "DFrame", distance 2
## Class "SimpleList", by class "DFrame", distance 2
## Class "RectangularData", by class "DFrame", distance 3
## Class "List", by class "DFrame", distance 3
## Class "DataFrame_OR_NULL", by class "DFrame", distance 3
## Class "Vector", by class "DFrame", distance 4
## Class "list_OR_List", by class "DFrame", distance 4
## Class "Annotated", by class "DFrame", distance 5
## Class "vector_OR_Vector", by class "DFrame", distance 5
The link data is automatically generated when constructing the
TreeSummarizedExperiment object. We highly recommend users
not to modify it manually; otherwise the link might be broken. For R
packages developers, we show in the Section @ref(modifyLink) about how
to update the link.
In addition to tree data, reference sequence data can be stored per
feature in a TreeSummarizedExperiment object.
The data must match the number of rows in the object and can either
be added to the object upon creation or later on with the accessor
function referenceSeq.
## DNAStringSet object of length 6:
## width seq names
## [1] 4 AGCT entity1
## [2] 4 AGCT entity2
## [3] 4 AGCT entity3
## [4] 4 AGCT entity4
## [5] 4 AGCT entity5
## [6] 4 AGCT entity6
Both DNAStringSet or DNAStringSetList can
be used, so that a single or multiple sequences can be stored per
feature.
Now, one new line in the message is shown for the
referenceSeq slot.
## class: TreeSummarizedExperiment
## dim: 6 4
## metadata(0):
## assays(1): Count
## rownames(6): entity1 entity2 ... entity5 entity6
## rowData names(4): Kingdom Phylum Class OTU
## colnames(4): sample1 sample2 sample3 sample4
## colData names(2): gg group
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: a LinkDataFrame (6 rows)
## rowTree: 1 phylo tree(s) (5 leaves)
## colLinks: a LinkDataFrame (4 rows)
## colTree: 1 phylo tree(s) (4 leaves)
## referenceSeq: a DNAStringSet (6 sequences)
A TreeSummarizedExperiment object can be subset in two
different ways: [ to subset by rows or columns, and
subsetByNode to subset by nodes of a tree. As the numeric
ID of a node changes with the cut of a phylo tree, to keep
track of the original data, we do not update the tree structure in the
subsetting.
## class: TreeSummarizedExperiment
## dim: 2 1
## metadata(0):
## assays(1): Count
## rownames(2): entity1 entity2
## rowData names(4): Kingdom Phylum Class OTU
## colnames(1): sample1
## colData names(2): gg group
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: a LinkDataFrame (2 rows)
## rowTree: 1 phylo tree(s) (5 leaves)
## colLinks: a LinkDataFrame (1 rows)
## colTree: 1 phylo tree(s) (4 leaves)
## referenceSeq: a DNAStringSet (2 sequences)
## [1] TRUE
rowData, rowLinks, colData,
and colLinks are updated accordingly.
## DataFrame with 2 rows and 4 columns
## Kingdom Phylum Class OTU
## <character> <character> <character> <character>
## entity1 A B1 C1 D1
## entity2 A B1 C1 D2
## LinkDataFrame with 2 rows and 5 columns
## nodeLab nodeLab_alias nodeNum isLeaf whichTree
## <character> <character> <integer> <logical> <character>
## entity1 t3 alias_1 1 TRUE phylo
## entity2 t3 alias_1 1 TRUE phylo
# The first four columns are from colLinks data and the others from colData
cbind(colLinks(sub_tse), colData(sub_tse))## DataFrame with 1 row and 7 columns
## nodeLab nodeLab_alias nodeNum isLeaf whichTree gg
## <character> <character> <integer> <logical> <character> <numeric>
## sample1 sample1 alias_1 1 TRUE phylo 1
## group
## <character>
## sample1 A
To subset by nodes, we specify the node by its node label or node
number. Here, entity1 and entity2 are both mapped to
the same node t3, so both of them are retained.
## LinkDataFrame with 2 rows and 5 columns
## nodeLab nodeLab_alias nodeNum isLeaf whichTree
## <character> <character> <integer> <logical> <character>
## entity1 t3 alias_1 1 TRUE phylo
## entity2 t3 alias_1 1 TRUE phylo
Subsetting simultaneously in both dimensions is also allowed.
node_tse <- subsetByNode(x = both_tse, rowNode = "t3",
colNode = c("sample1", "sample2"))
assays(node_tse)[[1]]## sample1 sample2
## entity1 0 0
## entity2 1 6
The current tree can be replaced by a new one using
changeTree. If the hierarchical information is available as
a data.frame with each column representing a taxonomic
level (e.g., row_data), we provide toTree to
convert it into a phylo object.
## DataFrame with 6 rows and 4 columns
## Kingdom Phylum Class OTU
## <character> <character> <character> <character>
## entity1 A B1 C1 D1
## entity2 A B1 C1 D2
## entity3 A B2 C2 D3
## entity4 A B2 C2 D4
## entity5 A B2 C3 D5
## entity6 A B2 C3 D6
# convert it to a phylo tree
taxa_tree <- toTree(data = taxa)
# Viz the new tree
ggtree(taxa_tree)+
geom_text2(aes(label = node), color = "darkblue",
hjust = -0.5, vjust = 0.7, size = 5.5) +
geom_text2(aes(label = label), color = "darkorange",
hjust = -0.1, vjust = -0.7, size = 5.5) +
geom_point2()The structure of the taxonomic tree that is generated from the taxonomic table.
A mapping to match nodes of the two trees is required if nodes are labeled differently.
## class: TreeSummarizedExperiment
## dim: 6 4
## metadata(0):
## assays(1): Count
## rownames(6): entity1 entity2 ... entity5 entity6
## rowData names(4): Kingdom Phylum Class OTU
## colnames(4): sample1 sample2 sample3 sample4
## colData names(2): gg group
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: a LinkDataFrame (6 rows)
## rowTree: 1 phylo tree(s) (6 leaves)
## colLinks: a LinkDataFrame (4 rows)
## colTree: 1 phylo tree(s) (4 leaves)
## referenceSeq: a DNAStringSet (6 sequences)
## LinkDataFrame with 6 rows and 5 columns
## nodeLab nodeLab_alias nodeNum isLeaf whichTree
## <character> <character> <integer> <logical> <character>
## entity1 D1 alias_1 1 TRUE phylo
## entity2 D2 alias_2 2 TRUE phylo
## entity3 D3 alias_3 3 TRUE phylo
## entity4 D4 alias_4 4 TRUE phylo
## entity5 D5 alias_5 5 TRUE phylo
## entity6 D6 alias_6 6 TRUE phylo
Since it may be of interest to report or analyze observed data on
multiple resolutions based on the provided tree, the
TreeSummarizedExperiment package offers functionionality to
flexibly aggregate data to different levels of a tree.
Here, we show the aggregation along the column dimension. The desired
aggregation level is given in the colLevel argument, which
can be specified via the node label (orange text in Figure
@ref(fig:plot-ctree)) or the node number (blue text in Figure
@ref(fig:plot-ctree)). We could further specify how to aggregate via the
argument colFun.
# use node labels to specify colLevel
agg_col <- aggTSE(x = taxa_tse,
colLevel = c("GroupA", "GroupB"),
colFun = sum)
# or use node numbers to specify colLevel
agg_col <- aggTSE(x = taxa_tse, colLevel = c(6, 7), colFun = sum)## alias_6 alias_7
## entity1 0 0
## entity2 7 27
## entity3 9 29
## entity4 11 31
## entity5 13 33
## entity6 15 35
The rowData does not change, but the
colData adjusts with the change of the assays
table. For example, the column group has the
A value for GroupA because the descendant
nodes of GroupA all have the value A; the
column gg has the NA value for
GroupA because the descendant nodes of GroupA
have different values, (1 and 2).
## DataFrame with 4 rows and 2 columns
## gg group
## <numeric> <character>
## sample1 1 A
## sample2 2 A
## sample3 3 B
## sample4 3 B
## DataFrame with 2 rows and 2 columns
## gg group
## <numeric> <character>
## alias_6 NA A
## alias_7 3 B
We can decide which columns of colData to keep in the
final output using colDataCols. In situation with big data,
this can speed up the aggregation by dropping data that is not relevant.
More arguments for aggTSE (e.g., whichAssay,
BPPARAM) are available to customize the aggregation or to
provide parallel computation (?aggTSE).
agg_col <- aggTSE(x = taxa_tse, colLevel = c(6, 7),
colFun = sum, colDataCols = "group")
colData(agg_col)## DataFrame with 2 rows and 1 column
## group
## <character>
## alias_6 A
## alias_7 B
The colLinks is updated to link the new rows of
assays tables and the column tree.
## LinkDataFrame with 2 rows and 5 columns
## nodeLab nodeLab_alias nodeNum isLeaf whichTree
## <character> <character> <integer> <logical> <character>
## alias_6 GroupA alias_6 6 FALSE phylo
## alias_7 GroupB alias_7 7 FALSE phylo
From Figure @ref(fig:plot-ctree), nodes 6 and 7 are labeled with
GroupA and GroupB, respectively, which agrees
with the column link data.
Similarly, we could aggregate the data to the phylum level by
providing the names of the internal nodes that represent the phylum
level (see taxa_one below).
# the phylum level
taxa <- c(taxa_tree$tip.label, taxa_tree$node.label)
(taxa_one <- taxa[startsWith(taxa, "Phylum:")])## [1] "Phylum:B1" "Phylum:B2"
# aggregation
agg_taxa <- aggTSE(x = taxa_tse, rowLevel = taxa_one, rowFun = sum)
assays(agg_taxa)[[1]]## sample1 sample2 sample3 sample4
## alias_8 1 6 11 16
## alias_10 14 34 54 74
The user is nonetheless free to choose nodes from different taxonomic ranks. Note that not all rows in the original table are included in one of the aggregated rows. Similarly, it is possible for a row to contribute to multiple aggregated rows
# A mixed level
taxa_mix <- c("Class:C3", "Phylum:B1")
agg_any <- aggTSE(x = taxa_tse, rowLevel = taxa_mix, rowFun = sum)
rowData(agg_any)## DataFrame with 2 rows and 4 columns
## Kingdom Phylum Class OTU
## <character> <character> <character> <logical>
## alias_12 A B2 C3 NA
## alias_8 A B1 C1 NA
The aggregation on both dimensions could be performed in one step
using aggTSE. The aggregate functions for the row and the
column are specificed via rowFun and colFun,
respectively. The aggregation order is determined using
rowFirst. Here, we set rowFirst = FALSE to
firstly aggregate on the column dimension, and then on the row
dimension.
agg_both <- aggTSE(x = both_tse, colLevel = c(6, 7),
rowLevel = 7:9, rowFun = sum,
colFun = mean, rowFirst = FALSE)As expected, we obtain a table with 3 rows
(rowLevel = 7:9) and 2 columns
(colLevel = c(6, 7)).
## alias_6 alias_7
## alias_7 8.0 28.0
## alias_8 19.5 49.5
## alias_9 12.0 32.0
phylo object.Next, we highlight some functions to manipulate and/or to extract
information from the phylo object. Further operations can
be found in other packages, such as ape (Paradis and Schliep 2019), tidytree(Yu 2020). These functions are useful when users
want to customize functions for the
TreeSummarizedExperiment class.
To show these functions, we use the tree shown in Figure @ref(fig:plot-exTree).
data("tinyTree")
ggtree(tinyTree, branch.length = "none") +
geom_text2(aes(label = label), hjust = -0.1, size = 5.5) +
geom_text2(aes(label = node), vjust = -0.8,
hjust = -0.2, color = 'orange', size = 5.5) An example tree with node labels and numbers in black and orange texts, respectively.
The translation between the node labels and node numbers can be
achieved by the function convertNode.
## [1] "Node_12" "t2" "t9"
## t4 Node_18
## 5 18
To get descendants that are at the leaf level, we could set the
argument only.leaf = TRUE for the function
findDescendant.
## $Node_17
## [1] 4 5 6
When only.leaf = FALSE, all descendants are
returned.
## $Node_17
## [1] 4 5 6 18
We list some functions that might also be useful in Table @ref(tab:phyloFun). More are available in the package, and we encourage users to contribute their functions that might be helpful for others.
| Functions | Goal |
|---|---|
| printNode | print out the information of nodes |
| countNode | count the number of nodes |
| distNode | give the distance between a pair of nodes |
| matTree | list paths of a tree |
| findAncestor | find ancestor nodes |
| findChild | find child nodes |
| findSibling | find sibling nodes |
| shareNode | find the first node shared in the paths of nodes to the root |
| unionLeaf | find the union of descendant leaves |
| trackNode | track nodes by adding alias labels to a phylo object |
| isLeaf | test whether a node is a leaf node |
TreeSummarizedExperiment
classHere, we show an example on how to write custom functions for
TreeSummarizedExperiment objects. To extract data of
specific leaves, we created a function subsetByLeaf by
combining functions working on the phylo class (e.g.,
convertNode, keep.tip, trackNode,
isLeaf) with the accessor function
subsetByNode. Here, convertNode,
trackNode and isLeaf are available in
TreeSummarizedExperiment, and keep.tip is from
the r CRANpkg("ape") package. Since the node number of a
node is changed after pruning a tree with keep.tip,
trackNode is provided to track the node and further update
the link between the data and the new tree.
# tse: a TreeSummarizedExperiment object
# rowLeaf: specific leaves
subsetByLeaf <- function(tse, rowLeaf) {
# if rowLeaf is provided as node labels, convert them to node numbers
if (is.character(rowLeaf)) {
rowLeaf <- convertNode(tree = rowTree(tse), node = rowLeaf)
}
# subset data by leaves
sse <- subsetByNode(tse, rowNode = rowLeaf)
# update the row tree
## -------------- new tree: drop leaves ----------
oldTree <- rowTree(sse)
newTree <- ape::keep.tip(phy = oldTree, tip = rowLeaf)
## -------------- update the row link ----------
# track the tree
track <- trackNode(oldTree)
track <- ape::keep.tip(phy = track, tip = rowLeaf)
# row links
rowL <- rowLinks(sse)
rowL <- DataFrame(rowL)
# update the row links:
# 1. use the alias label to track and updates the nodeNum
# 2. the nodeLab should be updated based on the new tree using the new
# nodeNum
# 3. lastly, update the nodeLab_alias
rowL$nodeNum <- convertNode(tree = track, node = rowL$nodeLab_alias,
message = FALSE)
rowL$nodeLab <- convertNode(tree = newTree, node = rowL$nodeNum,
use.alias = FALSE, message = FALSE)
rowL$nodeLab_alias <- convertNode(tree = newTree, node = rowL$nodeNum,
use.alias = TRUE, message = FALSE)
rowL$isLeaf <- isLeaf(tree = newTree, node = rowL$nodeNum)
rowNL <- new("LinkDataFrame", rowL)
## update the row tree and links
BiocGenerics:::replaceSlots(sse,
rowLinks = rowNL,
rowTree = list(phylo = newTree))
}The row tree is updated after the subsetting. It now has only two
leaves, t2 and t3.
## class: TreeSummarizedExperiment
## dim: 3 4
## metadata(0):
## assays(1): Count
## rownames(3): entity1 entity2 entity3
## rowData names(4): Kingdom Phylum Class OTU
## colnames(4): sample1 sample2 sample3 sample4
## colData names(2): gg group
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: a LinkDataFrame (3 rows)
## rowTree: 1 phylo tree(s) (2 leaves)
## colLinks: a LinkDataFrame (4 rows)
## colTree: 1 phylo tree(s) (4 leaves)
## referenceSeq: a DNAStringSet (3 sequences)
## LinkDataFrame with 3 rows and 5 columns
## nodeLab nodeLab_alias nodeNum isLeaf whichTree
## <character> <character> <integer> <logical> <character>
## entity1 t3 alias_1 1 TRUE phylo
## entity2 t3 alias_1 1 TRUE phylo
## entity3 t2 alias_2 2 TRUE phylo
## R version 4.5.2 (2025-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] ggplot2_4.0.1 ggtree_4.0.4
## [3] ape_5.8-1 TreeSummarizedExperiment_2.18.0
## [5] Biostrings_2.78.0 XVector_0.50.0
## [7] SingleCellExperiment_1.32.0 SummarizedExperiment_1.40.0
## [9] Biobase_2.70.0 GenomicRanges_1.62.1
## [11] Seqinfo_1.0.0 IRanges_2.44.0
## [13] S4Vectors_0.48.0 BiocGenerics_0.56.0
## [15] generics_0.1.4 MatrixGenerics_1.22.0
## [17] matrixStats_1.5.0 BiocStyle_2.38.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 xfun_0.56 bslib_0.10.0
## [4] htmlwidgets_1.6.4 lattice_0.22-7 vctrs_0.7.1
## [7] tools_4.5.2 yulab.utils_0.2.4 parallel_4.5.2
## [10] tibble_3.3.1 pkgconfig_2.0.3 Matrix_1.7-4
## [13] ggplotify_0.1.3 RColorBrewer_1.1-3 S7_0.2.1
## [16] lifecycle_1.0.5 farver_2.1.2 compiler_4.5.2
## [19] treeio_1.34.0 codetools_0.2-20 fontLiberation_0.1.0
## [22] fontquiver_0.2.1 ggfun_0.2.0 htmltools_0.5.9
## [25] sys_3.4.3 buildtools_1.0.0 sass_0.4.10
## [28] yaml_2.3.12 lazyeval_0.2.2 pillar_1.11.1
## [31] crayon_1.5.3 jquerylib_0.1.4 tidyr_1.3.2
## [34] BiocParallel_1.44.0 DelayedArray_0.36.0 cachem_1.1.0
## [37] abind_1.4-8 fontBitstreamVera_0.1.1 nlme_3.1-168
## [40] tidyselect_1.2.1 aplot_0.2.9 digest_0.6.39
## [43] dplyr_1.1.4 purrr_1.2.1 labeling_0.4.3
## [46] maketools_1.3.2 fastmap_1.2.0 grid_4.5.2
## [49] cli_3.6.5 SparseArray_1.10.8 magrittr_2.0.4
## [52] patchwork_1.3.2 S4Arrays_1.10.1 withr_3.0.2
## [55] gdtools_0.4.4 scales_1.4.0 rappdirs_0.3.4
## [58] rmarkdown_2.30 evaluate_1.0.5 knitr_1.51
## [61] gridGraphics_0.5-1 rlang_1.1.7 ggiraph_0.9.3
## [64] Rcpp_1.1.1 glue_1.8.0 tidytree_0.4.7
## [67] BiocManager_1.30.27 jsonlite_2.0.0 R6_2.6.1
## [70] systemfonts_1.3.1 fs_1.6.6