---
title: "Analysis of MCIA Decomposition"
author:
- Max Mattessich
- Joaquin Reyna
- Edel Aron
- Anna Konstorum
date: "Compiled: `r format(Sys.time(), '%B %d, %Y')`"
header-includes:
   - \usepackage{amsmath}
   - \usepackage{amsfonts}
output:
  BiocStyle::html_document:
    dev: 'jpeg'
    df_print: kable
    fig_retina: 1
    number_sections: FALSE
    toc_depth: 3
    toc_float: TRUE
vignette: >
  %\VignetteIndexEntry{Analysis of MCIA Decomposition}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

\newcommand{\mb}[1]{\mathbf{#1}}
\newcommand{\var}{\text{var}}
\newcommand{\cov}{\text{cov}}

```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Introduction

## Motivation

Multiple co-inertia analysis (MCIA) is a member of the family of joint 
dimensionality reduction (jDR) methods that extend unsupervised dimension reduction 
techniques such as Principal Components Analysis (PCA) and Non-negative Matrix 
Factorization (NMF) to datasets with multiple data blocks 
(alternatively called views) ([Cantini, 2021](https://doi.org/10.1038/s41467-020-20430-7)).

Here, we present a new implementation in R of MCIA, `nipalsMCIA`, 
that uses an extension of Non-linear Iterative Partial Least Squares (NIPALS) 
to solve the MCIA optimization problem ([Hanafi, 2011](https://doi.org/10.1016/j.chemolab.2010.05.010)). 
This implementation has several features, including speed-up over approaches that
employ the Singular Value Decomposition (SVD), several options for pre-processing 
and deflation to customize algorithm performance, methodology to perform 
out-of-sample global embedding, and analysis and visualization capabilities to 
maximize result interpretation. 

While there exist additional implementations of MCIA (e.g. [mogsa](https://www.bioconductor.org/packages/release/bioc/html/mogsa.html), 
[omicade4](https://bioconductor.org/packages/release/bioc/html/omicade4.html)), 
ours is unique in providing a pipeline that incorporates pre-processing data 
options including those present in the original development of MCIA 
(including a theoretically grounded calculation of inertia, or total variance) 
with an iterative solver that shows speed-up for larger datasets, and is 
explicitly designed for simultaneous ease of use as a tool for multi-view data 
decomposition as well as a foundation for theoretical and computational 
development of MCIA and related methodology.  A manuscript detailing our 
implementation is forthcoming.

## Overview

In this vignette, we will cover the most important functions
within the `nipalsMCIA` package as well as downstream analyses that can help
interpret the MCIA decomposition using a cancer data set from
[Meng et al., 2016](https://doi.org/10.1093/bib/bbv108) that includes 21
subjects with three data blocks. The data blocks include mRNA levels
(12895 features), microRNA levels (537 features) and protein levels
(7016 features). 

The `nipals_multiblock` function performs MCIA using the NIPALS algorithm.  
`nipals_multiblock` outputs a decomposition that includes a low-dimensional 
embedding of the data in the form of global scores, and the contributions of the data 
blocks (block score weights) and features (global loadings) to these same global scores.  
`nipalsMCIA` provides several additional functions to visualize, analyze, 
and interpret these results.

The `nipals_multiblock` function accepts as input a [MultiAssayExperiment](https://bioconductor.org/packages/release/bioc/html/MultiAssayExperiment.html)
(MAE) object. Such objects represent a modern classed-based approach to organizing
multi-omics data in which each assay can be stored as an individual experiment
alongside relevant metadata for samples and experiments.  If users
have a list of data blocks with matching sample names (and optional sample-level
metadata), we provide a simple conversion function (`simple_mae.R`) to generate an 
MAE object.  For more sophisticated MAE object construction,
please consult the MAE documentation.

In the context of the NCI-60 data set and this vignette, we will show you the
power of MCIA to find important relationships between mRNA, microRNA and
proteins. More specifically, we will show you how to interpret the global
factor scores in [Part 1: Interpreting Global Factor Scores] and global loadings
in [Part 2: Interpreting Global Loadings].

## Installation

```{r installation-github, eval = FALSE}
# install.packages("devtools")
devtools::install_github("Muunraker/nipalsMCIA", ref = "code-development",
                         force = TRUE, build_vignettes = TRUE) # devel version
```

```{r installation-bioconductor, eval = FALSE}
# after acceptance
# install.packages("BiocManager")
BiocManager::install("nipalsMCIA")
```

```{r load-packages, message = FALSE}
library(ComplexHeatmap)
library(dplyr)
library(fgsea)
library(ggplot2)
library(ggpubr)
library(nipalsMCIA)
library(stringr)

# NIPALS starts with a random vector
set.seed(42)
```

## Preview of the NCI-60 dataset

The NCI-60 data set has been included with the `nipalsMCIA` package and is
easily available as shown below:

```{r intro-load-data}
# load the dataset which uses the name data_blocks
data(NCI60)

# examine the contents
data_blocks$miRNA[1:3, 1:3]
data_blocks$mrna[1:3, 1:3]
data_blocks$prot[1:3, 1:3]
```

To convert data_blocks into an MAE object we provide the `simple_mae()`
function:
```{r convert-to-mae}
data_blocks_mae <- simple_mae(data_blocks, row_format = "sample")
```

# Running and reviewing the MCIA output

We can compute the MCIA decomposition for $r$ global factors.  For our example,
we take $r=10$.

```{r intro-mcia, warning = FALSE, message = FALSE}
set.seed(42)
mcia_results <- nipals_multiblock(data_blocks_mae,
                                  col_preproc_method = "colprofile",
                                  num_PCs = 10, tol = 1e-12, plots = "none")
```

The result is an NipalsResult object containing several outputs from the decomposition:

```{r intro-mcia-results}
slotNames(mcia_results)
```

We describe the first two in more detail below, and will discuss several others
in the remainder of the vignette.  For additional details on the decomposition, see
([Hanafi, 2011](https://doi.org/10.1016/j.chemolab.2010.05.010), 
[Mattesich, 2022](http://hdl.handle.net/10427/CZ30Q6773)).

## Brief overview of the Global Scores Matrix ($F$)

The `global_scores` matrix is represented by $\mb F$ with dimensions
$n \times r$, where $n$ is the number of samples and $r$ is the number of
factors chosen by using the `num_PCs = r` argument. Each column $j$ of this matrix
represents the global scores for factor $j$,

$$
\mb F = \begin{pmatrix}
    | & |&  & |\\
    \mb f^{(1)} &\mb f^{(2)} & \dots & \mb f^{(r)}\\
    | & |&  & |
  \end{pmatrix} \in \mathbb{R}^{n \times r}
$$

This matrix encodes a low-dimensional representation of the data set, with the
$i$-th row representing a set of $r$-dimensional coordinates for the $i$-th
sample.

## Brief overview of the Global Loadings Matrix ($A$)

The `global_loadings` matrix is represented by $\mb A$ that is $p \times r$,
where $p$ is the number of features across all omics and $r$ is as before.
Each column $j$ of this matrix represents the global loadings for factor $j$,
i.e.

$$
\mb A = \begin{pmatrix}
    | & |&  & |\\
    \mb a^{(1)} &\mb a^{(2)} & \dots & \mb a^{(r)}\\
    | & |&  & |
  \end{pmatrix} \in \mathbb{R}^{p \times r}
$$

This matrix encodes the contribution (`loading`) of each feature to the global 
score.

The remainder of this vignette will be broken down into two sections,
[Part 1: Interpreting Global Factor Scores] and [Part 2: Interpreting Global Loadings]
where we show how to interpret $\mb F$ and $\mb A$, respectively.

# Part 1: Interpreting Global Factor Scores

## nipals_multiblock() Generates Basic Visualizations

In the introduction we showed how to calculate the MCIA decomposition using
`nipals_multiblock()` but used the parameter `plots = "none"` to avoid the
default plotting behavior of this function. By default, this function will
generate two plots which help establish an initial intuition for the MCIA
decomposition. Here we will re-run `nipals_multiblock()` with the default `plots`
parameter (`all`):

```{r part-1-mcia, warning = FALSE, message = FALSE, fig.dim = c(7, 4)}
set.seed(42)
mcia_results <- nipals_multiblock(data_blocks_mae,
                                  col_preproc_method = "colprofile",
                                  num_PCs = 10, tol = 1e-12)
```

The first plot visualizes each sample using factor 1 and 2 as a lower
dimensional representation (factor plot).

- Each sample is represented by 4 points, a center point (solid block
dot) which represents the global factor score, a mRNA factor score (square), a
miRNA factor score (circle), and a protein factor score (triangle).
- The last three omic-specific block factor scores are connected to the global
factor score. If a block factor scores is plotted far from its
corresponding global factor score, then this is an indication that the block
does not agree with/contribute to the trend found by the global decomposition.
- As an example, we can take a look at the global factor score at $(-1.1, 0.4)$.
The block factor scores are all quite near which suggests all three omics are
contributing somewhat equally. This is in contrast to the global factor score at
$(-0.3, 0.9)$ where the mRNA factor score is close, but the mRNA and miRNA are
far from their respective global factor score.

The second plot is a scree plot of normalized singular values corresponding to 
the variance explained by each factor.

## Visualizing a Factor Plot with Only Global Factor Scores

For clustering, it is useful to look at global factor scores without block
factor scores. The `projection_plot()` function can be used to generate such a
plot using `projection = "global"`.

```{r part-1-visualize-clusters, fig.dim = c(5, 5)}
projection_plot(mcia_results, projection = "global", orders = c(1, 2))
```

In addition, scores can be colored by a meaningful label such as cancer type
which is highly relevant to NCI-60. To do so, the colData slot of the associated
MAE object must be loaded with sample-level metadata prior to invoking
`projection_plot()`. The sample metadata is composed of row names corresponding to
the primary sample names of the MAE object, and columns contain different
metadata (e.g. age, disease status, etc). For instance, each of the 21 samples
in the NCI-60 dataset represents a cell line with one of three cancer types:
CNS, Leukemia, or Melanoma. We have provided this metadata as part of the
`data(NCI60)` dataset and we next show how it can be included in the resulting MAE
object using the `colData` parameter in `simple_mae()`:

```{r part-1-metadata}
# preview of metadata
head(metadata_NCI60)

# loading of mae with metadata
data_blocks_mae <- simple_mae(data_blocks, row_format = "sample",
                              colData = metadata_NCI60)
```

We now rerun `nipals_multiblock()` using the updated MAE object, where the
`colData` is passed to the `metadata` slot of the `NipalsResult` instance,
```{r part-1-mcia-again, warning = FALSE, message = FALSE}
# adding metadata as part of the nipals_multiblock() function
set.seed(42)
mcia_results <- nipals_multiblock(data_blocks_mae,
                                  col_preproc_method = "colprofile",
                                  plots = "none",
                                  num_PCs = 10, tol = 1e-12)
```

The `color_col` argument of `projection_plot()` can then be used to determine
which column of `metadata` is used for coloring the individual data points, in
this case `cancerType`. `color_pal` is used to assign a color palette and
requires a vector of colors (i.e. c('blue', 'red', 'green')). To help create
this vector we also provide `get_metadata_colors()`, a helper function (used 
below) that can be used with a `scales::<function>` to return an appropriate
vector of colors. Note: colors are applied by lexicographically sorting the list
of unique metadata values then assigning the first color to the first value,
second with second and so on.

```{r part-1-visualize-clusters-color-col, fig.dim = c(5, 5)}
# meta_colors = c("blue", "grey", "yellow") can use color names
# meta_colors = c("#00204DFF", "#7C7B78FF", "#FFEA46FF") can use hex codes
meta_colors <- get_metadata_colors(mcia_results, color_col = 1,
                                   color_pal_params = list(option = "E"))

projection_plot(mcia_results, projection = "global", orders = c(1, 2),
                color_col = "cancerType", color_pal = meta_colors,
                legend_loc = "bottomleft")
```
Using this plot one can observe that global factor scores for factor 1 and 2 can
separate samples into their cancer types.

## Visualizing the Clustering of Samples by Factor Scores

A heatmap can be used to cluster samples based on global scores across all factors
using `global_scores_heatmap()`. The samples can be colored by the associated metadata 
using `color_cor` + `color_pal`as shown below.


```{r part-1-global-heatmap-colored, fig.dim = c(6, 4)}
global_scores_heatmap(mcia_results,
                      color_col = "cancerType", color_pal = meta_colors)
```
Like the projection plot, one can observe that the clustering can differentiate
between cancer type quite effectively.

# Part 2: Interpreting Global Loadings

In addition to the global scores matrix, MCIA also calculates a global loadings
matrix $A$ that is $p\times r$.  

## Pseudoeigenvalues Representing the Contribution of Each Omic to the Global Factor Score

The global loadings matrix is calculated using a weighted sum of the block loadings.
These weights can be interpreted as the contribution of each omic to the global
factor score (i.e. *pseudoeigenvalue*), and can be visualized using the function `block_weights_heatmap()`.

```{r part-2-block-heatmap, fig.dim = c(4.5, 3)}
block_weights_heatmap(mcia_results)
```

For example, we can observe that factors 7-9 have the strongest contribution
from the protein data.

## Visualize All Feature Loadings on Two Axes

MCIA returns feature loadings for each factor that can be used to understand
important contributions. To get a sense of which omics features show high
contribution to each factor, one can plot feature loadings for any two omics
using the `vis_load_plot` function. The features that take the most extreme
values will be the strongest contributors to that factor.

```{r part-2-loadings, fig.dim = c(6, 4)}
# colors_omics = c('red', 'blue', 'green')
# colors_omics <- get_colors(mcia_results, color_pal = colors_omics)
colors_omics <- get_colors(mcia_results)

vis_load_plot(mcia_results, axes = c(1, 4), colors_omics = colors_omics)
```
From this plot, we can deduce that individual miRNA features tend to
have the most extreme values in factors 1 and 4, thereby contributing most strongly
to the global scores for those factors.

## Scree Plot: Visualizing the Top Features per Factor

To dive into each factor users can generate a scree plot of top loading 
values using the following two steps/functions: 1) `ord_loadings()` which
calculates a list of ranked feature loadings, and 2) `vis_load_ord()` which 
takes as input ranked values from step (1) to generate a scree plot. Features
can be ranked based on ascending or descending values and/or absolute value. 
Users can also plot features from all omics together or focus on a specific
omic by using a corresponding block name (i.e. mrna). 

### Factor 1

Here, we return and visualize two ranked lists for Factor 1: using all omics,
or just mrna.

```{r part-2-factor-1, fig.dim = c(8, 3)}
# define the loadings
all_pos_1 <- ord_loadings(mcia_out = mcia_results, omic = "all",
                          absolute = FALSE, descending = TRUE, factor = 1)
mrna_pos_1 <- ord_loadings(mcia_out = mcia_results, omic = "mrna",
                           absolute = FALSE, descending = TRUE, factor = 1)

# visualization
all_pos_1_vis <- vis_load_ord(gl_f_ord = all_pos_1, omic_name = "all",
                              colors_omics = colors_omics)
mrna_pos_1_vis <- vis_load_ord(gl_f_ord = mrna_pos_1, omic_name = "mrna",
                               colors_omics = colors_omics)

ggpubr::ggarrange(all_pos_1_vis, mrna_pos_1_vis)
```

The first plot shows that that miRNA has the overal top positive feature loadings for
factor 1, whereas the second plot allows us to focus on only mRNA signals,
where genes like *GNS* and *WASL* come up as top contributors.

### Factor 2

Here, we return and visualize two ranked lists for Factor 2: using all omics,
or just protein, this time ranking features by magnitude (absolute value).

```{r part-2-factor-2, fig.dim = c(8, 3)}
# define the loadings
all_abs_2 <- ord_loadings(mcia_out = mcia_results, omic = "all",
                          absolute = TRUE, descending = TRUE, factor = 2)
prot_abs_2 <- ord_loadings(mcia_out = mcia_results, omic = "prot",
                           absolute = TRUE, descending = TRUE, factor = 2)

# visualization
all_abs_2_vis <- vis_load_ord(gl_f_ord = all_abs_2, omic_name = "all",
                              colors_omics = colors_omics)
prot_abs_2_vis <- vis_load_ord(gl_f_ord = prot_abs_2, omic_name = "prot",
                               colors_omics = colors_omics)

ggpubr::ggarrange(all_abs_2_vis, prot_abs_2_vis)
```

For factor 2, we again see that across all omics, miRNA is playing the strongest
role. One important difference (relative to factor 1) is made by ranking the
values according to the absolute value. Doing so allows us to identify 9 (out
of 15) features with a strong negative loading value. The right plot shows that
the negatively values protein features have the largest absolute values, indicating 
that they may play a distinct role from the miRNA in global scores for this factor. 

### Factor 4

We return and visualize a ranked list for Factor 4 with the top 60 features.

```{r part-2-factor-4, fig.dim = c(10, 4)}
# define the loadings
all_4 <- ord_loadings(mcia_results, omic = "all", absolute = FALSE,
                      descending = TRUE, factor = 4)

# visualization
all_4_plot <- vis_load_ord(gl_f_ord = all_4, colors_omics = colors_omics,
                           n_feat = 60)
all_4_plot
```
One can observe that while miRNA dominate the top features, proteins are also 
represented.

## Pathway Analysis for the Top Factors using Data from Gene-Centric Omics Blocks

The NCI-60 data set includes gene expression data, and its corresponding global
loading matrix is a gene by factor matrix. We can learn more about the pathways
over-represented in a given factor by running the global loadings for the factor 
through a gene set enrichment analysis (GSEA). Here, we look at factors 1 and 3.
We run `gsea_report()`, which reports on the p-value of the most significant 
pathway as well as the total number of significant pathways for each factor. 

### Gather Data and Generate the Report

```{r part-2-pathways, echo = TRUE, message = FALSE, warning = FALSE, results = 'hide'}
# extract mRNA global loadings
mrna_gfscores <- nmb_get_gl(mcia_results)
mrna_rows <- str_detect(row.names(mrna_gfscores), "_mrna")
mrna_gfscores <- mrna_gfscores[mrna_rows, ]

# rename rows to contain HUGO based gene symbols
row.names(mrna_gfscores) <- str_remove(rownames(mrna_gfscores), "_[0-9]*_.*")

# load pathway data
path.database <- "https://data.broadinstitute.org/gsea-msigdb/msigdb/release/6.2/c2.cp.reactome.v6.2.symbols.gmt"
pathways <- fgsea::gmtPathways(gmt.file = path.database)

# generate the GSEA report
geneset_report <- gsea_report(metagenes = mrna_gfscores, path.database,
                              factors = seq(1, 3), pval.thr = 0.05, nproc = 8)
```

### Investigating the GSEA Summary Table

The report comes in the form of a list where the first element is a data frame
with summary level of the GSEA analysis per factor. Ideally, each factor is
capturing a select number of pathways with a high significance. From this
report (below) we can see that the most significant pathway is associated with
Factor 3 and that there is a large variation in the number of total (significant)
pathways ranging from 7 (Factor 8) to 143 (Factor 4).

```{r include = FALSE}
# # Apply scientific notation to min_pval
geneset_report[[1]]$min_pval <- sprintf("%.2e", geneset_report[[1]]$min_pval)
```

```{r part-2-genesets}
geneset_report[[1]]
```

As just mentioned, Factor 3 contains the most enriched gene set so we can
re-run GSEA for this factor in order to get a full list of enrichment scores
across all gene sets:

```{r part-2-gsea-for-factor-3, echo = TRUE, message = FALSE, warning = FALSE, results = 'hide'}
# re-running GSEA
factor3_paths <- fgseaMultilevel(pathways, stats = mrna_gfscores[, 3],
                                 nPermSimple = 10000, minSize = 15,
                                 nproc = 4, maxSize = 500)
```

```{r include=FALSE}
# Order by adjusted p value
factor3_paths <- factor3_paths[order(factor3_paths$padj), ]

# Apply scientific notation to padj
to_sci <- function(x) {
    return(sprintf("%.2e", x))
}
factor3_paths[,"padj"] <- sapply(factor3_paths[,"padj"], to_sci)

# Clean up pathway name
clean_pathway_name <- function(x) {
    replaced_value <- str_replace_all(x , "_", " ")
    if (nchar(replaced_value) > 49) {
      substring_value <- substr(replaced_value, 1, 50) 
      return(paste0(substring_value, '...'))
    }
    else {
      return(replaced_value)
    }
}
factor3_paths[, "pathway"] <- apply(X = factor3_paths[,"pathway"],
                                    MARGIN = 1, FUN = clean_pathway_name)
```

```{r part-2-gsea-for-factor-3-viz}
head(factor3_paths[, c("pathway", "padj")])
```

We observe that the most significant gene set is the REACTOME_CELL_CYCLE gene set. 
Below we can take a look at the list of genes from REACTOME_CELL_CYCLE that
likely contribute to factor 3. This analysis can be repeated as necessary to
make sense of other gene based factor loadings. 

```{r part-2-significant}
# extracting REACTOME_CELL_CYCLE, the most significant gene set
sig_path3 <- factor3_paths[min(factor3_paths$padj) == factor3_paths$padj, ][1, ]
sig_path3$leadingEdge[[1]][1:10]
```

# Session Info

<details>
  <summary>**Session Info**</summary>
  
```{r session-info}
sessionInfo()
```