---
title: "Introduction to TreeSummarizedExperiment"
author: 
- name: Ruizhu HUANG
  affiliation: 
  - Institute of Molecular Life Sciences, University of Zurich.
  - SIB Swiss Institute of Bioinformatics.
- name: Charlotte Soneson
  affiliation: 
  - Institute of Molecular Life Sciences, University of Zurich.
  - SIB Swiss Institute of Bioinformatics.
- name: Mark Robinson
  affiliation: 
  - Institute of Molecular Life Sciences, University of Zurich.
  - SIB Swiss Institute of Bioinformatics.
package: TreeSummarizedExperiment
output: 
  BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{Tree Aggregation}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  chunk_output_type: console
bibliography: TreeSE_vignette.bib
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
```


# Introduction 

The `TreeSummarizedExperiment` class is an extension of the
`SingleCellExperiment` class [@Aaron2019]. It's used to store rectangular data
of experimental results as in a `SingleCellExperiment`, and also supports the
storage of a hierarchical structure and its link information to the rectangular
data.

# TreeSummarizedExperiment {#tse-class}

## Anatomy of TreeSummarizedExperiment

```{r strTSE, echo=FALSE, fig.cap= "The structure of the TreeSummarizedExperiment class."}
knitr::include_graphics("tse.png")
```
Compared with the `SingleCellExperiment` class, `TreeSummarizedExperiment` has
four more slots.

* `rowTree`: the hierarchical structure on the rows of the `assays` tables.
* `rowLinks`: the link between rows of the `assays` tables and the `rowTree`.
* `colTree`: the hierarchical structure on the columns of the `assays` tables.
* `colLinks`: the link information between columns of `assays` tables and the
`colTree`.

The `rowTree` and `colTree` could be empty (`NULL`) if no trees are available.
Correspondingly, the `rowLinks` and `colLinks` would be `NULL. All the other
slots in `TreeSummarizedExperiment` are inherited from `SingleCellExperiment`.

The slots `rowTree` and `colTree` only accept the tree data as the `phylo`
class. If the tree is available in other formats, one would need to convert it
to `phylo` with other R packages. For example, the package `r Biocpkg("treeio")`
provides 12 functions to import different tree formats and output `phylo` object
in the slot `phylo`.

## Toy data

```{r}
suppressPackageStartupMessages({
    library(TreeSummarizedExperiment)
    library(S4Vectors)
    library(ggtree)
    library(ape)})
```

We generate a **assay_data** with observations of 5 entities collected from 4
samples.
```{r}
# assays data
assay_data <- rbind(rep(0, 4), matrix(1:16, nrow = 4))
colnames(assay_data) <- paste(rep(LETTERS[1:2], each = 2), 
                            rep(1:2, 2), sep = "_")
rownames(assay_data) <- paste("entity", seq_len(5), sep = "")
assay_data
```

The descriptions of the `r nrow(assay_data)` entities and `r ncol(assay_data)`
samples are given in the **row_data** and **col_data**, respectively.
```{r}
# row data
row_data <- DataFrame(var1 = sample(letters[1:2], 5, replace = TRUE),
                    var2 = sample(c(TRUE, FALSE), 5, replace = TRUE),
                    row.names = rownames(assay_data))
row_data
# column data
col_data <- DataFrame(gg = c(1, 2, 3, 3),
                    group = rep(LETTERS[1:2], each = 2), 
                    row.names = colnames(assay_data))
col_data
```

The hierarchical structure of the `r nrow(assay_data)` entities is denoted as
**row_tree**. The hierarchical structure of the `r ncol(assay_data)` samples is
denoted as **col_tree**. We create them by using the function `rtree` from the
package `r CRANpkg("ape")`.
```{r}
# Toy tree 1
set.seed(1)
row_tree <- rtree(5)
class(row_tree)

# Toy tree 2
set.seed(4)
col_tree <- rtree(4)
col_tree$tip.label <- colnames(assay_data)
col_tree$node.label <- c("All", "GroupA", "GroupB")
```

The created trees are `phylo` objects. The `phylo` object is actually a list
with at least four elements: `edge`, `tip.label`, `edge.length`, and `Nnode`.
```{r}
class(row_tree)
str(row_tree)
```

The package `r Biocpkg("ggtree")` [@Yu2017] has been used to visualize the tree.
The node labels and node numbers are in blue and orange texts, respectively. The
**row_tree** has no labels for internal nodes.

```{r rtree, fig.cap="\\label{rtree} The structure of the row tree"}
# Visualize the row tree
ggtree(row_tree, size = 2) +
geom_text2(aes(label = node), color = "darkblue",
                hjust = -0.5, vjust = 0.7, size = 6) +
geom_text2(aes(label = label), color = "darkorange",
            hjust = -0.1, vjust = -0.7, size = 6)
```

The **col_tree** has labels for internal nodes.
```{r ctree, fig.cap="\\label{ctree} The structure of the column tree"}
# Visualize the column tree
ggtree(col_tree, size = 2) +
geom_text2(aes(label = node), color = "darkblue",
                hjust = -0.5, vjust = 0.7, size = 6) +
geom_text2(aes(label = label), color = "darkorange",
            hjust = -0.1, vjust = -0.7, size = 6)
```

## The construction of `TreeSummarizedExperiment`

The `TreeSummarizedExperiment` class is used to store the toy data:
**assay_data**, **row_data**, **col_data**, **col_tree** and **row_tree**, To
correctly store data, the link information between the rows (or columns) of
**assay_data** and the nodes of the **row_tree** (or **col_tree**) is requried
to provide via a charactor vector `rowNodeLab` (or `colNodeLab`). Those columns
or rows that don't match with any node of the tree structure are removed with
warnings. The link data between the `assays` tables and the tree data is
automatically generated in the construction.

Below shows an example to construct `TreeSummarizedExperiment` without the
column tree.
```{r}
# provide the node labels in rowNodeLab
node_lab <- row_tree$tip.label
row_tse <- TreeSummarizedExperiment(assays = list(assay_data),
                                rowData = row_data,
                                colData = col_data,
                                rowTree = row_tree,
                                rowNodeLab = node_lab)
```

When printing out **row_tse**, we see a similar message as
`SingleCellExperiment` with four additional lines about `rowLinks`, `rowTree`,
`colLinks` and `colTree`. Here, **row_tse** stores a row tree (`phylo` object),
and the `rowLinks` has `r nrow(row_tse)` rows that is exactly the same as the
number of rows in the `assays` tables. More details about the link data could be
found in Section \@ref(linkData).

```{r}
row_tse
```

If the row tree and the column tree are both available, the
`TreeSummarizedExperiment` could be constructed similarly as below. Here, the
column names of the `assays` table match with the node labels used in the column
tree. So, we could omit the step of providing `colNodeLab`.
```{r}
all(colnames(assay_data) %in% c(col_tree$tip.label, col_tree$node.label))

both_tse <- TreeSummarizedExperiment(assays = list(assay_data),
                                rowData = row_data,
                                colData = col_data,
                                rowTree = row_tree,
                                rowNodeLab = node_lab,
                                colTree = col_tree)
```

Compared to **row_tse**, **both_tse** includes also a column tree. The column
link data (`colLinks`) with `r ncol(both_tse)` rows is automatically generated.
The number of rows in the link data is decided by the column dimension of the
`assays` tables.
```{r}
both_tse
```


## The accessor functions
### Assays, rowData, colData, and metadata

For slots inherited from the `SingleCellExperiment` class, the accessors are
exactly the same as shown in `r Biocpkg("SingleCellExperiment")`.

```{r}
# to get the first table in the assays
(count <- assays(both_tse)[[1]])
```

```{r}
# to get row data
rowData(both_tse)
```

```{r}
# to get column data
colData(both_tse)
```

```{r}
# to get metadata: it's empty here
metadata(both_tse)
```

### rowLinks, colLinks {#linkData}

The row link and column link could be accessed via `rowLinks` and `colLinks`,
respectively. The output would be a `LinkDataFrame` object. The `LinkDataFrame`
class is extended from the `DataFrame` class with the restriction that it has at
least four columns: **nodeLab**, **nodeLab\_alias**, **nodeNum**, and
**isLeaf**. More details about the `DataFrame` class could be found in the 
`r Biocpkg("S4Vectors")` package.

* nodeLab: the labels of nodes on the tree
* nodeLab\_alias: the alias labels of nodes on the tree
* nodeNum: the numbers of nodes on the tree
* isLeaf: it's to indicate whether the node is a leaf node

When a `phylo` tree is available in the `rowTree`, we could see a
`LinkDataFrame` object in the `rowLinks`. The number of rows of `rowLinks` data
matches with the number of rows of `assays` tables.


```{r}
(rLink <- rowLinks(both_tse))
```

```{r}
class(rLink)
showClass("LinkDataFrame")
```

```{r}
nrow(rLink) == nrow(both_tse)
```

Similarly, the number of rows of `colLinks` data matches with the number of
columns of `assays` table.
```{r}
(cLink <- colLinks(both_tse))
nrow(cLink) == ncol(both_tse)
```

If the tree is not available, the corresponding link data is `NULL`.
```{r}
colTree(row_tse)
colLinks(row_tse)
```

The link data is automatically generated when constructing the
`TreeSummarizedExperiment` object. We highly recommend users not to modify it
manually; otherwise the link might be broken. For R packages developers, we show
in the Section \@ref(modifyLink) about how to update the link.


## The subseting function

We could use `[` to subset the `TreeSummarizedExperiment`. To keep track of the
original data, the `rowTree` and `colTree` stay the same in the subsetting.
```{r}
sub_tse <- both_tse[1:2, 1]
sub_tse
```

The annotation data on the row and column dimension is changed accordingly.
```{r}
# The first four columns are from rowLinks data and the others from rowData
cbind(rowLinks(sub_tse), rowData(sub_tse))
```

```{r}
# The first four columns are from colLinks data and the others from colData
cbind(colLinks(sub_tse), colData(sub_tse))
```


# Aggregation

The aggregation is allowed on the row and the column dimension. 

## The column dimension {#aggCol}

Here, we show the aggregation on the column dimension. The
`TreeSummarizedExperiment` object is assigned to the argument `x`. The desired
aggregation level is given in `colLevel`. The level could be specified via the
node label (the orange texts in Figure \@ref(fig:ctree)) or the node number (the
blue texts in Figure \@ref(fig:ctree)). We could further decide how to aggregate
via the argument `FUN`.

```{r}
# use node labels to specify colLevel
aggCol <- aggValue(x = both_tse, 
                   colLevel = c("GroupA", "GroupB"),
                   FUN = sum)
# or use node numbers to specify colLevel
aggCol <- aggValue(x = both_tse, colLevel = c(6, 7), FUN = sum)
```


```{r}
assays(aggCol)[[1]]
```

The `rowData` doesn't change, but the `colData` adjusts with the change of the
\code{assays} table. For example, the column **group** has the `A` value for
`GroupA` because the descendant nodes of `GroupA` all have the value `A`; the
column **gg** has the `NA` value for `GroupA` because the descendant nodes of
`GroupA` have different values, (1 and 2).
```{r}
# before aggregation
colData(both_tse)
# after aggregation
colData(aggCol)
```

The `colLinks` is updated to link the new rows of `assays` tables and the column
tree.
```{r}
# the link data is updated
colLinks(aggCol)
```
From the Figure \@ref(fig:rtree), we could see that the nodes 6 and 7 are
labelled with `GroupA` and `GroupB`, respectively. This agrees with the
column link data.

## The row dimension {#aggRow}

It's similar to the aggregation on the row dimension, except that the level
should be specified via `rowLevel`.

```{r}
agg_row <- aggValue(x = both_tse, rowLevel = 7:9, FUN = sum)
```

Now, the output `assays` table has 3 rows.

```{r}
assays(agg_row)[[1]]
```

We could see which row corresponds to which nodes via the `rowLinks` data.
```{r}
rowLinks(agg_row)
```
The Figure \@ref(fig:rtree) shows that the nodes 7, 8 and 9  have no labels.
Therefore, the `nodeLab` column in `LinkData` of the row data has missing value.
They are all internal nodes and hence the column `isLeaf` has only `FALSE`
value.

## Both dimensions

The aggregation on both row and column dimensions could be performed in one step
using the same function specified via `FUN`. If different functions are required
for different dimension, it's suggested to do it in two steps as described in
Section \@ref(aggRow) and Section \@ref(aggCol) because the order of aggregation
might matter.

```{r}
agg_both <- aggValue(x = both_tse, colLevel = c(6, 7), 
                    rowLevel = 7:9, FUN = sum)
```

As expected, we obtain a table with 3 rows (`rowLevel = 7:9`) and 2 columns
(`colLevel = c(6, 7)`). 
```{r}
assays(agg_both)[[1]]
```


# Aggregation on the taxonomic table

In some case, the information of the hierarchical structure is available as a
`data.frame` instead of the `phylo` object mentioned above. To do the work
listed above, we could convert the `data.frame` to the `phylo` class.

The function `toTree` outputs the hierarchical information into a `phylo`
object. If the data set is large, we suggest to allow `cache = TRUE` to speed up
the aggregation step.

```{r}
# The toy taxonomic table
taxa <- data.frame(Kindom = rep("A", 5),
                     Phylum = c("B1", rep("B2", 4)),
                     Class = c("C1", "C2", "C3", "C3", NA),
                     OTU = c("D1", "D2", "D3", "D4", NA))
# convert to a phylo tree
taxa_tree <- toTree(data = taxa, cache = FALSE)
ggtree(taxa_tree)+
geom_text2(aes(label = node), color = "darkblue",
                hjust = -0.5, vjust = 0.7, size = 6) +
geom_text2(aes(label = label), color = "darkorange",
            hjust = -0.1, vjust = -0.7, size = 6) +
    geom_point2()
```


```{r}
# construct a TreeSummarizedExperiment object
taxa_tse <- TreeSummarizedExperiment(assays = list(assay_data),
                                   rowData = row_data,
                                   rowTree = taxa_tree,
                                   rowNodeLab = taxa_tree$tip.label)
```

Here is about how to aggregate to the phylum level.
```{r}
# specify the level
taxa_lab <- c(taxa_tree$tip.label, taxa_tree$node.label)
ii <- startsWith(taxa_lab, "Phylum:") 
(l1 <- taxa_lab[ii])
# aggregate
agg_taxa <- aggValue(x = taxa_tse, rowLevel = l1, FUN = sum)
```


```{r}
assays(agg_taxa)[[1]]
```

```{r}
rowData(agg_taxa)
```


The aggregation could be on any freely combined level.
```{r}
# specify the level
l2 <- c("Class:C3", "Phylum:B1")
# aggregate
agg_any <- aggValue(x = taxa_tse, rowLevel = l2, FUN = sum)
```


```{r}
assays(agg_any)[[1]]
```

```{r}
rowData(agg_any)
```


# Additional

## More about functions working on the `phylo` object.

Here, we show some functions as examples to manipulate or to extract information
from the `phylo` object. More functions could be found in other packages, such
as `r CRANpkg("ape")` [@ape2018], `r CRANpkg("tidytree")`. These functions might
be useful when R package developers want to create their own functions to work
on the `TreeSummarizedExperiment` class.

Below shows the node label (black texts) and node number (blue texts) of each
node on an example tree.
```{r}
ggtree(tinyTree, branch.length = "none") +
    geom_text2(aes(label = label), hjust = -0.3) +
    geom_text2(aes(label = node), vjust = -0.8,
               hjust = -0.3, color = 'blue') 
```

### print out nodes of the tree

We could specify to print out all nodes (`type = "all"`), the leaves (`type =
"leaf"`) or the internal nodes (`type = "internal"`).

```{r}
printNode(tree = tinyTree, type = "all")
```

### Count the number of nodes

```{r}
# The number of leaves
countLeaf(tree = tinyTree)

# The number of nodes (leaf nodes and internal nodes)
countNode(tree = tinyTree)
```

### Translation between the node label and the node number

The translation between the labels and the numbers of nodes could be achieved by
the function `transNode`. 
```{r}
transNode(tree = tinyTree, node = c(12, 1, 4))
```

```{r}
transNode(tree = tinyTree, node = c("t4", "Node_18"))
```

### find the descendants 

To get descendants that are on the leaf level, we could set the argument
`only.leaf = TRUE`.
```{r}
# only the leaf nodes
findOS(tree = tinyTree, node = 17, only.leaf = TRUE)
```

The argument `only.leaf = FALSE` is set to get all descendants 
```{r}
# all descendant nodes
findOS(tree = tinyTree, node = 17, only.leaf = FALSE)
```

### find the sibling node

The input `node` could be either the node label or the node number. 
```{r}
# node = 5, node = "t4" are the same node
findSibling(tree = tinyTree, node = 5)
findSibling(tree = tinyTree, node = "t4")
```

### find the share node

This would find the first node that joined by the specified nodes (`node`) in
the path to the root.
```{r}
shareNode(tree = tinyTree, node = c(5, 6))
```


### identify a leaf node
```{r}
isLeaf(tree = tinyTree, node = 5)
isLeaf(tree = tinyTree, node = 17)
```

### calculate the distance between two nodes

The distance between any two nodes on the tree could be calculated by
`distNode`.
```{r}
distNode(tree = tinyTree, node = c(1, 5))
```

### prune tree via leaves

We could specify the leaf nodes `rmLeaf` to remove parts of a tree. If
`mergeSingle = TRUE`, the internal node that is connected to the removed leaf
nodes is removed too; otherwise, it is kept.

```{r}
NT1 <- pruneTree(tree = tinyTree, rmLeaf = c(4, 5),
                mergeSingle = TRUE)
ggtree(NT1, branch.length = "none") +
    geom_text2(aes(label = label), color = "darkorange",
               hjust = -0.1, vjust = -0.7) +
    geom_point2()

NT2 <- pruneTree(tree = tinyTree, rmLeaf = c(4, 5),
                mergeSingle = FALSE)
ggtree(NT2, branch.length = "none") +
    geom_text2(aes(label = label), color = "darkorange",
               hjust = -0.1, vjust = -0.7) +
    geom_point2()
```

### convert a `phylo` object to a matrix

Each row gives a path that connects a leaf and the root.
```{r}
matTree(tree = tinyTree)
```

## Customize functions for the `TreeSummarizedExperiment` class {#modifyLink}

We show examples about how to create functions for the
`TreeSummarizedExperiment`. R package developers could customize their functions
based on the functions provided above on the `phylo` object or develop their own
ones.

Here, a function `rmRows` is created to remove entities (on rows) that have zero
in all samples (on columns) in the first `assays` table.

```{r}
# dat: a TreeSummarizedExperiment
rmRows <- function(dat) {
    # calculate the total counts of each row
    count <- assays(dat)[[1]]
    tot <- apply(count, 1, sum)
    
    # find the row with zero in all columns
    ind <- which(tot == 0)
    
    # remove those rows
    out <- dat[-ind, ]
    return(out)
    
}
(rte <- rmRows(dat = both_tse))
rowLinks(rte)
```

The function `rmRows` doesn't update the tree data. To update the tree, we could
do it as below with the help of `ape::drop.tip`.

```{r}

updateRowTree <- function(tse, dropLeaf) {
    ## -------------- new tree: drop leaves ----------
    oldTree <- rowTree(tse)
    newTree <- ape::drop.tip(phy = oldTree, tip = dropLeaf)
    
    ## -------------- update the row link ----------
    # track the tree
    track <- trackNode(oldTree)
    track <- ape::drop.tip(phy = track, tip = dropLeaf)
    
    # row links
    rowL <- rowLinks(tse)
    rowL <- DataFrame(rowL)
    
    # update the row links: 
    #   1. use the alias label to track and updates the nodeNum
    #   2. the nodeLab should be updated based on the new tree using the new
    #      nodeNum
    #   3. lastly, update the nodeLab_alias
    rowL$nodeNum <- transNode(tree = track, node = rowL$nodeLab_alias,
                              message = FALSE)
    rowL$nodeLab <- transNode(tree = newTree, node = rowL$nodeNum, 
                              use.alias = FALSE, message = FALSE)
    rowL$nodeLab_alias <- transNode(tree = newTree, node = rowL$nodeNum, 
                                    use.alias = TRUE, message = FALSE)
    rowL$isLeaf <- isLeaf(tree = newTree, node = rowL$nodeNum)

    rowNL <- as(rowL, "LinkDataFrame")
    
    ## update the row tree and links
    newDat <- BiocGenerics:::replaceSlots(tse,
                                          rowLinks = rowNL,
                                          rowTree = list(phylo = newTree))
    return(newDat)
    
}


```

Now the row tree has four leaves.
```{r}
# find the mismatch between the rows of the 'assays' table and the leaves of the
# tree
row_tree <- rowTree(rte)
row_link <- rowLinks(rte)
leaf_tree <- printNode(tree = row_tree,type = "leaf")$nodeNum
leaf_data <- row_link$nodeNum[row_link$isLeaf]
leaf_rm <- setdiff(leaf_tree, leaf_data)
ntse <- updateRowTree(tse = rte, dropLeaf = leaf_rm)
```

```{r}
ntse
rowLinks(ntse)
```

# Session Info

```{r}
sessionInfo()
```

# Reference