---
title: " Vignette illustrating the use of MOFA on single-cell multi-omics data"
author: "Ricard Argelaguet and Britta Velten"
output:
  BiocStyle::html_document:
    toc: true
package: MOFA
vignette: >
  %\VignetteIndexEntry{MOFA: Application to a single-cell multi-omics data set}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

This vignette shows how MOFA can be used to disentangle the heterogeneity in single-cell DNA methylation and RNA expression (scMT) data.  

Briefly, the data set consists of 87 mouse embryonic stem cells (mESCs), comprising of 16 cells cultured in ‘2i’ media, which induces a naive pluripotency state, and 71 serum-grown cells, which commits cells to a primed pluripotency state poised for cellular differentiation (Angermueller, 2016). 


# Load data and create MOFA object
The data is stored as a `MultiAssayExperiment` object.
Notice that there are 4 views, one with normalized RNA expression assay and 3 views with DNA methylation information. Each DNA methylation view is a different genomic context (i.e. Enhancers, Promoters and CpG Islands) and each feature is an individual CpG site.
```{r, warning=FALSE, message=FALSE}
library(MultiAssayExperiment)
library(MOFA)
library(MOFAdata)
library(ggplot2)
```

```{r}
data("scMT_data")
scMT_data
```

First, we create the MOFA object:
```{r}
MOFAobject <- createMOFAobject(scMT_data)
MOFAobject
```

The function `plotDataOverview` can be used to obtain an overview of the structure of the data:
```{r}
plotDataOverview(MOFAobject, colors=c("#31A354","#377EB8","#377EB8","#377EB8"))
```

# Fit the MOFA model
The next step is to train the model. Internally, this is done via Python, so make sure you have the corresponding package installed (see installation instructions and read the FAQ if you have problems).

## Define options

### Define data options
Next, we define data options. The most important are:  

* **scaleViews**: logical indicating whether to scale views to have unit variance. As long as the scale of the different data sets is not too high, this is not required. Default is `FALSE`.

* **removeIncompleteSamples**: logical indicating whether to remove samples that are not profiled in all omics. The model can cope with missing assays, so this option is not required. Default is `FALSE`. 

```{r}
DataOptions <- getDefaultDataOptions()
DataOptions
```


### Define model options
Next, we define model options. The most important are:  

* **numFactors**: number of factors (default is 0.5 times the number of samples). By default, the model will only remove a factor if it explains exactly zero variance in the data. You can increase this threshold on minimum variance explained by setting `TrainOptions$dropFactorThreshold` to a value higher than zero.  

* **likelihoods**: likelihood for each view. Usually we recommend gaussian for continuous data, bernoulli for binary data and poisson for count data. By default, the model tries to guess it from the data.  

* **sparsity**: do you want to use sparsity? This makes the interpretation easier so it is recommended (Default is `TRUE`).  

```{r}
ModelOptions <- getDefaultModelOptions(MOFAobject)
ModelOptions
```

### Define training options
Next, we define training options. The most important are:  

* **maxiter**: maximum number of iterations. Ideally set it large enough and use the convergence criteria `TrainOptions$tolerance`.  

* **tolerance**: convergence threshold based on change in the evidence lower bound. For an exploratory run you can use a value between 1.0 and 0.1, but for a "final" model we recommend a value of 0.01.  

* **DropFactorThreshold**: hyperparameter to automatically learn the number of factors based on a minimum variance explained criteria. Factors explaining less than 'DropFactorThreshold' fraction of variation in all views will be removed. For example, a value of 0.01 means that factors that explain less than 1\% of variance in all views will be discarded. By default this it zero, meaning that all factors are kept unless they explain no variance at all. Here we use a threshold of 1%. 
```{r}
TrainOptions <- getDefaultTrainOptions()
TrainOptions$seed <- 2018

# Automatically drop factors that explain less than 1% of variance in all omics
TrainOptions$DropFactorThreshold <- 0.01

TrainOptions
```

## Prepare MOFA
`prepareMOFA` internally performs a set of sanity checks and fills the `DataOptions`, `TrainOptions` and `ModelOptions` slots of the `MOFAobject`
```{r}
MOFAobject <- prepareMOFA(
  MOFAobject, 
  DataOptions = DataOptions,
  ModelOptions = ModelOptions,
  TrainOptions = TrainOptions
)
```

## Run MOFA
Now we are ready to train the `MOFAobject`, which is done with the function `runMOFA`. This step can take some time (around 5 min with default parameters for a single trial). For illustration we provide an existing trained `MOFAobject`
```{r, eval=FALSE}
MOFAobject <- runMOFA(MOFAobject)
```

```{r}
filepath <- system.file("extdata", "scMT_model.hdf5", package = "MOFAdata")
MOFAobject <- loadModel(filepath, MOFAobject)
MOFAobject
```


# Analyse a trained MOFA model
After training, we can explore the results from MOFA. Here we provide a semi-automated pipeline to disentangle and characterize all the identified sources of variation (the factors).  

**Part 1: Disentangling the heterogeneity**  

Calculation of variance explained by each factor in each view. This is probably the most important plot that MOFA generates, as it summarises the entire heterogeneity of the dataset in a single figure.  

**Part 2: Characterization of individual factors**    

* Inspection of top features with highest loadings: the loading is a measure of feature importance, so features with high loading are the ones driving the heterogeneity captured by the factor.  
* Feature set enrichment analysis (where set annotations are present, e.g. gene sets for mRNA views).  
* Ordination of samples by factors to reveal clusters and/or gradients: this is similar to what is traditionally done with Principal Component Analysis or t-SNE.  

Other analyses, including imputation of missing values and prediction of clinical data are also available. See below for a short guide on how to impute data. Detailed vignettes will follow soon.

## Disentangling the heterogeneity, calculation of variance explained by each factor in each view
This is done by `calculateVarianceExplained` (to get the numerical values) and `plotVarianceExplained` (to get the plot).  
The resulting figure gives an overview of which factors are active in which view(s). If a factor is active in more than one view, this means that is capturing shared signal (co-variation) between features of different data modalities.  

In this data set MOFA identified 3 Factors with a minimum variance of 1%. While *Factor 1* is shared across all data modalities (12% variance explained in the RNA data and between 53% and 72% in the methylation data sets), Factors 2 and 3 are active primarily in the RNA data
```{r}
# Calculate the variance explained (R2) per factor in each view 
r2 <- calculateVarianceExplained(MOFAobject)
r2$R2Total

# Variance explained by each factor in each view
head(r2$R2PerFactor)

# Plot it
plotVarianceExplained(MOFAobject)
```


## Characterisation of individual factors

### Inspection of loadings for Factor 1
Plotting the RNA expression gene loadings for *Factor 1*, we can see that it is enriched for naive pluripotency marker genes such as Rex1/Zpf42, Tbx3, Fbxo15 and Essrb. Hence, based on previous studies (Mohammed et al, 2016) we can hypothesise that *Factor 1* captures a transition from a naive pluripotent state to a primed pluripotent states.
```{r}
# Plot all weights and highlight specific gene markers
plotWeights(
  object = MOFAobject, 
  view = "RNA expression", 
  factor = 1, 
  nfeatures = 0, 
  abs = FALSE, 
  manual = list(c("Zfp42","Esrrb","Morc1","Fbxo15",
                  "Jam2","Klf4","Tcl1","Tbx3","Tex19.1"))
)

# Plot top 10 genes
plotTopWeights(
  object = MOFAobject,
  view = "RNA expression", 
  factor = 1, 
  nfeatures = 10
)
```

Also, instead of looking at the "abstract" weights, it is useful to observe, in the original data, the heterogeneity captured by a Factor. This can be done using the `plotDataHeatmap` function.
```{r}
# Add metadata to the plot
factor1 <- sort(getFactors(MOFAobject,"LF1")[,1])
order_samples <- names(factor1)
df <- data.frame(
  row.names = order_samples,
  culture = getCovariates(MOFAobject, "culture")[order_samples],
  factor = factor1
)

plotDataHeatmap(
  object = MOFAobject, 
  view = "RNA expression", 
  factor = "LF1", 
  features = 20, 
  transpose = TRUE, 
  show_colnames=FALSE, show_rownames=TRUE, # pheatmap options
  cluster_cols = FALSE, annotation_col=df # pheatmap options
)
```

We can now connect these transcriptomic changes to coordinated changes on the DNA methylation. As *Factor 1* is active in all genomic contexts, it suggests that there is a massive genome-wide DNA methylation remodelling. This can confirmed by inspecting the weights in the DNA methylation views, as we do here for the CpG sites in enhancers: Notice that most features (CpG sites) have a negative weight, which suggests that their DNA methylation decrease in a manner that is inversely proportional to the direction of *Factor 1*.
```{r}
plotWeights(
  object = MOFAobject,
  view = "Met Enhancers", 
  factor = 1, 
  nfeatures = 0,
  abs = FALSE,
  scale = FALSE
)
```
Similar observations can be made for the CpG sites in other genomic contexts, here CpG Islands or Promotors, by plotting the weights of the respective views (i.e. "Met CpG Islands" or "Met Promoters").

As done before, let's observe the heterogeneity captured by *Factor 1* in the original data space. This clearly confirms that most of the CpG sites are getting methylated as cells progress in *Factor 1* from naive to primed pluripotent stem cells
```{r}
plotDataHeatmap(
  object = MOFAobject, 
  view = "Met Enhancers", 
  factor = 1, 
  features = 500,
  transpose = TRUE,
  cluster_rows=FALSE, cluster_cols=FALSE, # pheatmap options
  show_rownames=FALSE, show_colnames=FALSE, # pheatmap options
  annotation_col=df  # pheatmap options
)
```

### Inspection of loadings for Factor 2
A similar analysis for *Factor 2* reveals that it captures a second axis of differentiation from the primed pluripotency state to a differentiated state, with highest RNA loadings for known differentiation markers such as keratins and annexins.
```{r}
# Plot all weights and highlight specific gene markers
plotWeights(
  object = MOFAobject, 
  view="RNA expression", 
  factor = 2, 
  nfeatures = 0, 
  manual = list(c("Krt8","Cald1","Anxa5","Tagln",
                  "Ahnak","Dsp","Anxa3","Krt19")), 
  scale = FALSE,
  abs = FALSE
)

# Plot top 10 genes
plotTopWeights(
  object = MOFAobject, 
  view="RNA expression", 
  factor = 2, 
  nfeatures = 10
)
```

Interestingly, the $R^2$ plot suggests that this second axis of variability is not associated with DNA methylation. We can confirm this by plotting the weights (they are all zero) and the heatmaps (no coherent pattern) as we have done before:
```{r}
plotWeights(
  object = MOFAobject, 
  view = "Met Enhancers", 
  factor = 2, 
  nfeatures = 0, 
  abs = FALSE,
  scale = FALSE
)
```

```{r, message=FALSE}
factor2 <- sort(getFactors(MOFAobject,"LF2")[,1])
order_samples <- names(factor2)
df <- data.frame(
  row.names = order_samples,
  culture =  getCovariates(MOFAobject, "culture")[order_samples],
  factor = factor2
)
plotDataHeatmap(
  object = MOFAobject, 
  view = "Met Enhancers", 
  factor = 2, 
  features = 500,
  transpose = TRUE,
  cluster_rows=FALSE, cluster_cols=FALSE,  # pheatmap options
  show_rownames=FALSE, show_colnames=FALSE,  # pheatmap options
  annotation_col=df  # pheatmap options
)
```

## Ordination of samples by factors
Samples can be visualized along factors of interest using the `plotFactorScatter` and `plotFactorBeeswarm` functions.  

In this data set, the **combination of Factor 1 and Factor 2 captured the entire differentiation trajectory from naive pluripotent cells via primed pluripotent cells to differentiated cells**. This illustrates the importance of learning continuous latent factors rather than discrete sample assignments.
```{r, message=FALSE}
p <- plotFactorScatter(
  object = MOFAobject, 
  factors=1:2, 
  color_by = "culture")

p + scale_color_manual(values=c("lightsalmon","orangered3"))
```

Finally, *Factor 3* captured the cellular detection rate, a known technical covariate that represents the number of expressed genes.
```{r, message=FALSE}
plotFactorBeeswarm(
  object = MOFAobject,
  factors = 3, 
  color_by = "cellular_detection_rate"
)
```


## Customized analysis
For customized exploration of weights and factors, you can directly fetch the variables from the `MOFAobject` using 'get' functions: `getWeights`, `getFactors` and `getTrainData`. As an example, here we do a scatterplot between Factor 3 and the true Cellular Detection Rate values
```{r, message=FALSE}
cdr <- colMeans(getTrainData(MOFAobject)$`RNA expression`>0,na.rm=TRUE)
factor3 <- getFactors(MOFAobject,
                      factors=3)

foo <- data.frame(factor = as.numeric(factor3), cdr = cdr)
ggplot(foo, aes_string(x = "factor", y = "cdr")) + 
  geom_point() + xlab("Factor 3") +
  ylab("Cellular Detection Rate") +
  stat_smooth(method="lm") +
  theme_bw()
```


# Further functionalities 

## Imputation of missing observations
The factors can be used to impute missing data (see Methods of the publication). This is done using the `impute` function, which stores the imputed data in the `ImputedData` slot of the `MOFAobject`. It can be accessed via the `getImputedData` function:
```{r}
MOFAobject <- impute(MOFAobject)
nonImputedMethylation <- getTrainData(MOFAobject,
                                      view="Met CpG Islands")[[1]]
imputedMethylation <- getImputedData(MOFAobject,
                                     view="Met CpG Islands")[[1]]
```

```{r}
# non-imputed data
pheatmap::pheatmap(nonImputedMethylation[1:100,1:20],
         cluster_rows = FALSE, cluster_cols = FALSE,
         show_rownames = FALSE, show_colnames = FALSE)

# imputed data
pheatmap::pheatmap(imputedMethylation[1:100,1:20],
         cluster_rows = FALSE, cluster_cols = FALSE,
         show_rownames = FALSE, show_colnames = FALSE)
```

## Clustering of samples based on latent factors
Samples can be clustered according to their values on the latent factors using the `clusterSamples` function.
```{r, message=FALSE}
# kmeans clustering with K=3 using Factor 1
set.seed(1234)
clusters <- clusterSamples(MOFAobject, k=3, factors=1)

# Scatterplot colored by the predicted cluster labels,
# and shaped by the true culture conditions
plotFactorScatter(
  object = MOFAobject,
  factors = 1:2,
  color_by = clusters,
  shape_by = "culture"
)
```

# SessionInfo
```{r}
sessionInfo()
```