---
title: "singIST Vignette"
author: "Aitor Moruno-Cuenca"
output: BiocStyle::html_document
vignette: > 
  %\VignetteIndexEntry{singIST Vignette}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Abstract of singIST as a Bioconductor library
Comparative transcriptomics aims to identify common and differing patterns
(organisms, experimental conditions, tissues, and so on) between a set of
conditions by using transcriptomics data The methodology to tackle the problem
will vary greatly depending on which patterns we aim to compare, the amount and
biological nature of them and which omics resolution. **singIST** is a
comparative transcriptomics method that aims to compare a disease model against
a human disease of interest. In drug discovery and pharmaceutical early
development choosing a disease model becomes a fundamental endavour. For this
reason, we present **singIST** as a method that compares the single-cell
transcriptomics profile of a disease model, either In Vivo or In Vitro models,
against a human disease that such model should mimick.

Other Bioconductor packages for comparative transcriptomics are available.
One of such is **HybridExpress** which aims to compare the bulk transcriptomics
(RNA-seq) profile for hybrids relative to their progenitor species. **CoSIA**
is a Bioconductor package that enables organism-to-organism comparative bulk
transcriptomics (RNAseq) across tissues. However, none of these packages are
direct comparators to **singIST* since they are not conceived for single-cell
comparative transcriptomics but for bulk transcriptomics instead.


# Introduction

This Vignette demonstrates how to use the singIST package. singIST quantifies
the direction and mangitude of single-cell transcriptomic changes between a
human reference condition and a disease model at pathway, cell type, and gene
resolution. Throughout this vignette we will work with the following data:

- "cytokine_counts.rda" Contains Human scRNAseq pseudobulk LogNormalized
count matrix. The pseudobulk is aggregated by sample and cell type and contains
all genes belonging to the pathway Cytokine-Cytokine receptor interaction [KEGG]
- "DC_counts.rda" Contains Human scRNAseq pseudobulk LogNormalized
count matrix. The pseudobulk is aggregated by sample and cell type and contains
all genes belonging to the pathway Dendritic Cells in regulating Th1/Th2
Development [BIOCARTA]
- "OXA.rda" 

```{r}
set.seed(003)
library(singIST)
```

# Inputs of singIST

The method is defined around three inputs:

- **Superpathways**: pathway gene sets structured by cell type. 
- **Human scRNA-seq reference (pseudobulk)**: transformed into cell-type
specific pseudobulk, restricted to the genes of the superpathway.
- **Disease model scRNA-seq**: a Seurat or SingleCellExperiment object
containing scRNA-seq readouts of the disease model to assess.

In the following subsections we explicitly construct these inputs. The rest of
the vignette uses these inputs to: (1) learn a human reference model using 
adaptive sparse multi-block PLS-DA, and then (2) evaluate how human-like the
disease model changes are, in the learned human reference space.

## Superpathways

singIST does not operate on a pathway gene set alone. Instead, it defines a
superpathway by three things:

\begin{itemize}
\item A curated **pathway** (e.g., an MSigDB pathway identifier).
\item A set of **cell types** of interest that will be modeled explicitly.
\item A **gene set per cell type**, where each cell type is associated with a 
vector of genes that might or might not be the same
\end{itemize}

In many practical cases, we use the same pathway gene set for every cell type. 
However, singIST also supports **cell-type specific gene sets**. Below we create
two example of superpathways used later in this vignette.

First, because singIST relies on MSigDB curated pathways, we load the MSigDB
gene set collection once and reuse it as necessary. We assume the reference
to compare against is human `org = "hs"` and we extract both ENTREZID
and HGNC gene symbols `id = c("SYM", "EZID")`.

```{r, echo = TRUE, message = FALSE, warning = FALSE}
gse <- msigdb::getMsigdb(org = "hs", id = c("SYM", "EZID"),
                         version = msigdb::getMsigdbVersions()[1])
```

### Example 1: Superpathway with the same pathway gene set for each cell type

In this first example, we define a pathway using its MSigDB metadata, choose the
cell types we want to model, and then assign the same pathway gene set to each
cell type gene set. This is the simplest and most common setup for a 
superpathway.

```{r, echo = TRUE, message = FALSE, warning = FALSE}
# Pathway definition (curated gene set identifier)
dc_pathway <- create_pathway(standard_name = "BIOCARTA_DC_PATHWAY",
                             dbsource = "BIOCARTA",
                             collection = "c2",
                             subcollection = "CP")
# Superpathway = pathway + modeled cell types + per cell type gene sets
dc_superpathway <- create_superpathway(pathway_info = dc_pathway,
                                       celltypes = c("Dendritic Cells",
                                                     "Langerhan Cells",
                                                     "T-cell"),
                                       list())
# Assign gene sets per cell type (retrieve gene set from MSigDB)
dc_superpathway <- setGeneSetsCelltype(dc_superpathway, gse = gse)
```

### Example 2: Superpathway with different gene sets per cell type (optional)

In some applications, you may want different cell types to use different gene
sets. singIST supports this by allowing the user to directly provide a list
of gene vectors, one per cell type. 

Here we demonstrate this capability by taking the curated pathway gene set and 
creating different subsets for each cell type. (This is purely illustrative;
in real analyses these subsets would be defined biologically rather than 
randomly sampled.)

```{r, echo = TRUE, message = FALSE, warning = FALSE}
# Pathway definition (curated gene set identifier)
cytokine_pathway <- create_pathway(
    standard_name = "KEGG_CYTOKINE_CYTOKINE_RECEPTOR_INTERACTION",
    dbsource = "KEGG",
    collection = "c2",
    subcollection = "CP"
)
# Superpathway = pathway + modeled cell types + per cell type gene sets
cytokine_superpathway <- create_superpathway(
    pathway_info = cytokine_pathway,
    celltypes = c("Dendritic Cells", "Keratinocytes", "Langerhan Cells",
                  "Melanocytes", "T-cell"),
    list()
)
# Pull the curated gene set (retrieve gene set from MSigDB)
gene_set <- pullGeneSet(cytokine_pathway, gse = gse)
# Provide a different gene set per cell type
cytokine_superpathway$gene_sets_celltype <- list(
    "Dendritic Cells" = sample(gene_set, 50),
    "Keratinocytes" = sample(gene_set, 100),
    "Langerhan Cells" = sample(gene_set, 10),
    "Melanocytes" = sample(gene_set, 60),
    "T-cell" = sample(gene_set, 160)
)
```

## Human scRNA-seq

The human reference data are used to train an adaptive sparse multi-block PLS
on a superpathway. In this subsection we start from the human scRNA-seq that
have already been aggregated into cell-type pseudobulk log-normalize expression.
For instance, the pseudobulk can be performed with 
`Seurat::AggregateExpression()`. Concretely the pseudobulk matrix:

- Each row corresponds to a pseudobulk profile for a cell type within a sample
(e.g., T-cell_AD1).
- Columns correspond to genes (typically HGNC symbols for human training).
- Values correspond to log-normalized pseudobulk expression.

In addition to the pseudobulk matrix, singIST requires sample level labels
defining a base class and a target class (e.g., HC vs AD). The target class
is the human experimental group that the disease model such mimic precisely, 
while the base class is the human experimental group that such differentiate 
from the target class.

The training procedure includes a tuning step, so the human input is not only 
the pseudobulk matrix and sample labels - it also includes the hyperparameter
search space and the cross-validation design the user wants. In practice,
we tune:

- Block-specific sparsity via a grid of quantile thresholds (one quantile
per modeled cell type). Intuitively, higher quantiles enforce stronger sparsity
(fewer genes retained) in that block.
- Model complexity via `number_PLS` (maximum number of latent components
to test).
- Validation scheme via `folds_CV` and `repetition_CV`.

We store the former settings in a `hyperparameters` list and then combine
all the information into the `superpathway_input` list, which is the main
training input consumed by the training procedure `fitOptimal()/
multiple_fitOptimal()`.

### Example 1: Dendritic Cells in regulating Th1/Th2 Development

```{r, echo = TRUE, message = FALSE, warning = FALSE}
# Load the pseudobulk matrix
file_dc <- system.file("extdata", "DC_counts.rda", package = "singIST")
obj_dc <- load(file_dc)
counts_dc <- get(obj_dc[1])
# Remove the cell types that we are not going to model
counts_dc <- counts_dc[!(sub("_.*$", "", rownames(counts_dc)) %in% 
                        c("MastC-other", "Keratinocytes", "Melanocytes")), ]
```

Next, we define the sample identifiers and class labels.

```{r, echo = TRUE, message = FALSE, warning = FALSE}
# Load the pseudobulk matrix
sample_id_dc <- c("AD1", "AD13", "AD2", "AD3","AD4", "HC1-2", "HC3", "HC4",
                  "HC5")
sample_class_dc <- c(rep("AD", 5), rep("HC", 4))
base_class_dc <- "HC"
target_class_dc <- "AD"
```

Now we define the hyperparameters.

```{r, echo = TRUE, message = FALSE, warning = FALSE}
# Initialize hyperparameter object 
# Choose the array of quantile combinations that will be tested in the Cross
# Validation. Ideally one should choose a wide range of quantiles to test,
# as a narrow selection might not find the globally optimal model. The more
# quantiles one chooses the more computation time it will take, for the sake
# of simplicity we choose a small amount of quantiles to test
quantile_comb_table_dc <- as.matrix(
    RcppAlgos::permuteGeneral(seq(0.1, 0.95, by = 0.3),
                                m = length(dc_superpathway$celltypes), 
                                TRUE))
# Store tuning + CV design
dc_hyperparameters <- create_hyperparameters(
    quantile_comb_table = quantile_comb_table_dc,
    outcome_type = "binary",
    number_PLS = as.integer(3),
    folds_CV = as.integer(1), #LOOCV
    repetition_CV = as.integer(1)
)
```

Finally, we combine all the information into a superpathway_input list.

```{r, echo = TRUE, message = FALSE, warning = FALSE}
dc_superpathway_input <- create_superpathway_input(
    superpathway_info = dc_superpathway,
    hyperparameters_info = dc_hyperparameters,
    pseudobulk_lognorm = counts_dc,
    sample_id = sample_id_dc,
    sample_class = sample_class_dc,
    base_class = base_class_dc,
    target_class = target_class_dc
)
```


### Example 2: Cytokine-Cytokine receptor interaction superpathway

```{r, echo = TRUE, message = FALSE, warning = FALSE}
# Load the pseudobulk matrix
file_cytokine <- system.file("extdata", "cytokine_counts.rda", package = "singIST")
obj_cytokine <- load(file_cytokine)
counts_cytokine <- get(obj_cytokine[1])
# Remove the cell types that we are not going to model
counts_cytokine <- counts_cytokine[!(sub("_.*$", "", rownames(counts_cytokine)) 
                                     %in% 
                                 c("MastC-other")), ]
```


Next, we define the sample identifiers and class labels.

```{r, echo = TRUE, message = FALSE, warning = FALSE}
# Load the pseudobulk matrix
sample_id_cytokine <- c("AD1", "AD13", "AD2", "AD3","AD4", "HC1-2", "HC3",
                        "HC4", "HC5")
sample_class_cytokine <- c(rep("AD", 5), rep("HC", 4))
base_class_cytokine <- "HC"
target_class_cytokine <- "AD"
```

Now we define the hyperparameters.

```{r, echo = TRUE, message = FALSE, warning = FALSE}
# Initialize hyperparameter object 
quantile_comb_table <- as.matrix(
    RcppAlgos::permuteGeneral(seq(0.05, 0.95, by = 0.2),
                                m = length(cytokine_superpathway$celltypes), 
                                TRUE))
# Store tuning + CV design
cytokine_hyperparameters <- create_hyperparameters(
                                    quantile_comb_table = quantile_comb_table,
                                    outcome_type = "binary",
                                    number_PLS = as.integer(2),
                                    folds_CV = as.integer(5),
                                    repetition_CV = as.integer(5))
```

Finally, we combine all the information into a superpathway_input list.

```{r, echo = TRUE, message = FALSE, warning = FALSE}
cytokine_superpathway_input <- create_superpathway_input(
    superpathway_info = cytokine_superpathway,
    hyperparameters_info = cytokine_hyperparameters,
    pseudobulk_lognorm = counts_cytokine,
    sample_id = sample_id_cytokine,
    sample_class = sample_class_cytokine,
    base_class = base_class_cytokine,
    target_class = target_class_cytokine)
```


## Disease model scRNA-seq

The disease model input is a single-cell object (either a `Seurat` or
`SingleCellExperiment`) containing scRNA-seq measurements from the model
system we want to assess. Unlike the human input (which is provided as 
pseudobulk matrices), the disease model is kept at single-cell resolution and is
summarized internally by singIST when computing recapitulation scores.

To use a disease model dataset, singIST needs:

- A class label indicating the base and target experimental groups
in the model (e.g. ETOH vs OXA).
- A cell-type cluster label used to map model cell populations to the human
cell types defined in the superpathway.
- A donor / sample identifier.

Finally, we must provide a cell type mapping from th ehuman superpathway cell 
types (names used in `celltypes = ...`) to the disease model clusters.
A human cell type can map to one or more model clusters. A human cell type
can map to one or more model clusters; it is also valid to provide an empty
mapping (meaning that block will not contribute for that model).

In the following example we load the oxazolone mouse dataset shipped in the
package, standarize metadata column names to those expected by singIST,
define the mapping, and construct a `mapping organism` list via
`create_mapping_organism()`.

### Example: Oxazolone mouse model of Atopic Dermatitis

```{r, echo = TRUE, message = FALSE, warning = FALSE}
file_oxa <- system.file("extdata", "OXA.rda", package = "singIST")
obj_oxa <- load(file_oxa)
oxazolone <- get(obj_oxa[1])
```
Next, we ensure the object contains the metadata fields required by singIST.
The package example dataset already contains the necessary information,
but we rename the relevant columns to the standarized names used downstream:
`class`, `celltype_cluster`, and `donor`.

```{r, echo = TRUE, message = FALSE, warning = FALSE}
# The example object included in singIST uses these column positions
colnames(slot(oxazolone, "meta.data"))[c(5,11,1)] <- c("class",
                                                       "celltype_cluster",
                                                       "donor")
# If the object is a SingleCellExperiment, metadata live in colData()
```

Now, we define the human to model cell type mapping. The names of the list
must match the human cell types used in our superpathways. Values correspond
to one or more model clusters present in `celltype_cluster`.

```{r, echo = TRUE, message = FALSE, warning = FALSE}
celltype_mapping <- list(
                    "Dendritic Cells" = c("cDC2", "cDC1", "migratory DCs"),
                    "Keratinocytes" = c("Keratinocytes"),
                    "Langerhan Cells" = c("LC"), 
                    "Melanocytes" = c(),
                    "T-cell" = c("DETC", "T"))
```

Finally, we combine the model object and mapping into a
`mapping_organism` list. This list specifies the model organism we 
are working with, and which class labels correspond to the base and
target conditions.

```{r, echo = TRUE, message = FALSE, warning = FALSE}
oxa_org <- create_mapping_organism(organism = "Mus musculus", 
                target_class = "OXA", base_class = "ETOH",
                celltype_mapping = list(
                    "Dendritic Cells" = c("cDC2", "cDC1", "migratory DCs"),
                    "Keratinocytes" = c("Keratinocytes"),
                    "Langerhan Cells" = c("LC"), 
                    "Melanocytes" = c(),
                    "T-cell" = c("DETC", "T")),
                counts = oxazolone)
```

# Fit Optimal asmbPLS-DA and compute validation metrics for the optimal model

At this stage we have constructed two key training inputs:
`dc_superpathway_input` and `cytokine_superpathway_input`. The
next step is to train an adaptive sparse multi-block PLS-DA model (asmbPLS-DA)
for each superpathway, while selecting an optimal configuration (sparsity
per cell type block and number of PLS components) under the cross-validation
design specified in the corresponding `hyperparameters` object.

singIST provides two entry points to do it:

- `fitOptimal()` fits one `superpathway_input` list.
- `multiple_fitOptimal()` is a convenience wrapper that fits multiple
superpathways in one call. 

If the CV design is a leave-one-out CV (LOOCV), the functions provide with a
parallelization option to make the computations faster via using 
`parallel = TRUE` and setting the desired paralellization scheme with 
`BiocParallel::register()`.

In addition to fitting the optimal model, singIST can compute model validation
metrics, including permutation based global significance test
(`npermut = 100`) and resampling permutation tests (`type`, e.g.,
jackknife or subsampling). The number of permutation for the tests can be chosen
as desired.

## Example: fitting cytokine-cytokine receptor interaction and dendritic cells
## in regulating 

If the human pseudobulk input contains missing values (e.g., a cell type
block absent for some samples), `fitOptimal() / multiple_fitOptimal()`
automatically imputes them during model fitting. Further details of the imputation
procedure are provided in the Appendix.

```{r, echo = TRUE, message = FALSE, warning = FALSE}
# Register a parallel backend (SnowParam works on Windows/macOS/Linux);
# on Linux you may prefer MulticoreParam.
BiocParallel::register(BiocParallel::SnowParam(workers = 2,
                        exportglobals = FALSE, progressbar = FALSE),
                        default = TRUE) 
# Combine all human training inputs into a list
superpathways <- list(dc_superpathway_input,
                      cytokine_superpathway_input)
# Fit Optimal asmbPLS-DA models + compute validation metrics
# parallel: logical vector indicating whether the CV design should be parallel
# npermut: number of permutations used for permutation based tests
# type : validation strategy (jackknife or subsampling)
# nsubsampling: number of subsampling repeats (only used when type includes
# subsampling).
superpathways_fit <- multiple_fitOptimal(
    superpathways,
    parallel = c(TRUE, FALSE), 
    npermut = c(100),
    type = c("jackknife","subsampling"), 
    nsubsampling = c(NULL, 5)
    )
# Restore serial backend (good practice)
BiocParallel::register(BiocParallel::SerialParam(), default = TRUE) 
```


# Compute recapitulation metrics

After fitting the optimal human reference models (`superpathway_fit`),
the final step is to quantify how well a disease model recapitulates the
human target vs base signal for each superpathway. singIST does this by
applying the disease model changes to the human base class reference
and then evaluating how "target-like" the samples become. The resulting
recapitulation outputs can be summarized at pathway, cell type, and gene
resolution.

There are two functions to do this:

- `singISTrecapitulations()` computes recapitulations for one fitted
superpathway model.
- `multiple_singISTrecapitulations()` is a wrapper that computes 
recapitulations for multiple fitted models at once.

Below we compute recapitulations for both fitted superpathways against the
oxazolone mapping organism `oxa_org`, and then render the results
into a plotting friendly output

## Example: Evaluating Oxazolone mouse model of Atopic Dermatitis for Dendritic
## Cells and cytokine superpathways

By default, the function assumes the organism for which the asmbPLS-DA model
was trained is "hsapiens", it also assumes that the gene annotation for the
mapped organism is "external_gene_name". These parameters are by default
but other cases can also be considered by specifying the parameters
`from_to = "hsapiens"` and
`object_gene_identifiers = "external_gene_name"`.

```{r, echo = TRUE, message = FALSE, warning = FALSE}
# Compute singIST recapitulations for all fitted superpathways
recapitulations_multiple <- multiple_singISTrecapitulations(oxa_org,
                                                            superpathways_fit)
```

In many workflows you will want the outputs in a standarized format that is
convenient for downstream plotting and reporting. The helper function
`render_multiple_outputs()` reshapes the recapitulation objects
accordingly. It makes a list that is ready-to-plot. 

```{r, echo = TRUE, message = FALSE, warning = FALSE}
final_output <- render_multiple_outputs(objects= list(recapitulations_multiple))
```

# Appendix: Missing value handling in human pseudobulk matrices

During CV, singIST may encounter missing entries in the human pseudobulk
predictor (e.g., entire cell type blocks missing for some samples). Rather
than discarding those samples, singIST imputes missing values inside each CV
split in a way that preserves the multiblock structure and avoids information
leakage. This procedure is performed automatically by `fitOptimal()`
and `multiple_fitOptimal()` and the user does not provide any input on
this, this section is merely for informative purposes.

## Problem setup

Assume we have a multi-block dataset for which we aim to impute an asmbPLS-DA,
$D = \{\left(x_i, y_i\right) \}_{i=1}^n$ where $x_i \in \mathbb{R}^p, y_i \in \{0,1\}$.
We perform a K-fold CV, thus splitting the dataset into $K$ folds; $\mathcal{T}_k$
training indices, $\mathcal{V}_k$ validation indices, $k = 1, \dots, K$. 
Then we have a classic train-test dataset
$\left(X_{train}^{(k)}, Y_{train}^{(k)}  \right) \in \{(x_i, y_i): i \in \mathcal{T}_k \}$
and $\left(X_{val}^{(k)}, Y_{val}^{(k)}  \right) \in \{(x_i, y_i): i \in \mathcal{V}_k \}$

The problem lies in that $X$ matrices we might have missing values.

## Step 1: Impute the training fold using multiblock MFA.

We first impute $X_{train}^{(k)}$ using Multple Factor Analysis (MFA) imputation
(via `missMDA::imputeMFA()`), which explicitly accounts for the 
multiblock structure (blocks correspond to modeled cell types). This produces
an imputed training matrix $\widetilde{X}_{train}^{(k)}$.

## Step 2: Impute the validation fold by projection (no re-fitting).

To avoid data leakage, we do not refit the MFA model on the validation data.
Instead, we compute MFA loadings and means from the imputed training fold and
project each validation sample using its observed features only.

Let $\mu \in \mathbb{R}^p$ be the vector of variable means and
$P \in \mathbb{R}^{p \times ncp}$ the matrix of MFA loadings estimated from
$\widetilde{X}_{train}^{(k)}$. For each validation sample
$x_i \in X_{val}^{(k)}$ define the set of observed feature indices:

$$O_i = \{j: (x_i)_j \neq NA\}, \quad |O_i| \geq 1$$

We estimate the latent scores $t_i$ by solving a least-squares problem 
restricted to the observed features:

$$t_i = \arg \min_{u_i \in \mathbb{R}^{ncp}} ||x_{i, O_i}-\left(\mu_{O_i}+P_{O_i}u_i \right)||_2^2 = \left(P_{O_i}^TP_{O_i}\right)^{-1} P_{O_i}^T \left(x_{i,O_i}-\mu_{O_i} \right)$$

We then reconstruct the feature full vector by:

$$\widehat{x}_{ij} = \mu_{j} + P_{j\cdot} t_i$$

The imputation validation entries are defined as:

$$
\begin{align}
\widehat{x}_{ij}  = \begin{cases}
x_{ij}, & j \in O_i\\
\widetilde{x}_{ij}, & j \notin O_i
\end{cases} \quad i \in \mathcal{V}_k
\end{align}
$$

This preserves the same latent space used for training-fold imputation and
prevents leakage from validation into the imputation model. Imputation is
intended for **partial missingness** (e.g., one missing block while other
blocks are observed). If missingness is extensive (e.g., many samples missing
most blocks), model stability and interpretability may degrade; in that case
consider restricting the set of modeled cell types and/or increasing sample
size.

# Session Info

```{r}
utils::sessionInfo()
```