---
title: "Low Dimensional Projection of Cytometry Samples"
author:
    - name: Philippe Hauchamps
    - name: Laurent Gatto
package: CytoMDS
abstract: >
 This vignette describes the functionality implemented in the `CytoMDS`
 package. `CytoMDS` provides support for low dimensional projection of a set  
 of cytometry samples, using concepts such as Earth Mover's (EMD) distance,   
 and Multi Dimensional Scaling (MDS).  
 This vignette is distributed under a CC BY-SA 4.0 license.
output:
  BiocStyle::html_document:
    toc_float: true
bibliography: CytoMDS.bib
vignette: >
  %\VignetteIndexEntry{Low Dimensional Projection of Cytometry Samples}
  %\VignetteEngine{knitr::rmarkdown}
  %%\VignetteKeywords{FlowCytometry, QualityControl, DimensionReduction, multidimensionalScaling, Software, Visualization}
  %\VignetteEncoding{UTF-8}
---

```{r style, echo = FALSE, results = 'hide'}
BiocStyle::markdown()
```
# Installation and loading dependencies

To install this package, start R and enter (un-commented):

```{r}
# if (!require("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")
# 
# BiocManager::install("CytoMDS")
```

We now load the packages needed in the current vignette:

```{r rlibs, results=FALSE}
suppressPackageStartupMessages(library(HDCytoData))
library(CytoMDS)
library(ggplot2)
library(patchwork)
```

# Introduction

The `CytoMDS` package implements low dimensional visualization of cytometry 
samples, in order to visually assess distances between them. This, in turn, 
can greatly help the user to identify quality issues like batch effects 
or outlier samples, and/or check the presence of potential sample clusters 
that might align with the experimental design.  

The `CytoMDS` algorithm combines, on the one hand, the concept of 
*Earth Mover's Distance (EMD)* [@Orlova2016-jm], a.k.a. *Wasserstein metric* 
and, on the other hand, the metric *Multi Dimensional Scaling (MDS)* algorithm 
for the low dimensional projection [@De_Leeuw2009-aw]. 

Besides projection itself, the package also provides some diagnostic tools 
for both checking the quality of the MDS projection, as well as interpreting 
the axes of the projection (see below sections).

# Illustrative dataset

The illustrative dataset that is used throughout this vignette is a mass
cytometry (*CyTOF*) dataset from [@Bodenmiller2012-re], and provided in the 
Bioconductor *HDCytoData* data package (@Weber2019-qp).   

This dataset consists of 16 paired samples (8 times 2) of peripheral blood 
cells from healthy individuals. Among each sample pair, one sample 
- the reference - was left un-stimulated, while the other sample was 
stimulated with B cell receptor / Fc receptor cross-linker (BCR-XL).  

This public dataset is known to contain a strong differential expression signal 
between the two conditions (stimulated vs un-stimulated) and as been used in 
recent work to benchmark differential analysis algorithms ([@Weber2019-pz]) 
and to design mass cytometry data analysis pipelines ([@Nowicka2017-dh]).

In the `CytoMDS`package, as in the current vignette, 
matrices of cytometry events intensities, corresponding to one sample, 
are stored as `flowCore::flowFrame` [@flowCore] objects. 
Samples of a particular cytometry dataset are then stored 
as a `flowCore::flowSet` object, which is a collection of flowFrame's, 
i.e. one flowFrame per sample. Therefore, we load the flowSet version 
of the *BodenMiller2012* dataset, obtained from the `HDCytoData` package.

```{r loadDataSet}
BCRXL_fs <- HDCytoData::Bodenmiller_BCR_XL_flowSet()
BCRXL_fs
```

In regular flowSet's, the experimental design information is typically  
stored in the `phenoData` slot, and this is also the way `CytoMDS` expects to 
get its input. However, `HDCytoData` has chosen to store the experimental 
design information in a slightly different way, hence the need to convert 
the data as follows:

```{r convertPhenoData}
phenoData <- flowCore::pData(BCRXL_fs)
additionalPhenoData <- 
    keyword(BCRXL_fs[[1]], "EXPERIMENT_INFO")$EXPERIMENT_INFO
phenoData <- cbind(phenoData, additionalPhenoData)

flowCore::pData(BCRXL_fs) <- phenoData
```

We also select channels/markers that are biologically relevant, i.e. both the 
cell type and cell state markers, and store them for further use. We discard 
the typical housekeeping markers that are founds in flowFrames like *time* and 
Cell_length, etc. In total, these mass cytometry samples contain intensities 
for 24 biologically relevant markers.

```{r selectMarkers}
markerInfo <- keyword(BCRXL_fs[[1]], "MARKER_INFO")$MARKER_INFO
chClass <- markerInfo$marker_class

table(chClass)

chLabels <- markerInfo$channel_name[chClass != "none"]
(chMarkers <- markerInfo$marker_name[chClass != "none"])
```

The first step consists in scale transforming the raw data. Indeed, 
distances between samples make more sense with scaled transformed signal, 
in which distributional differences are more readable and usable for 
downstream analysis.

Here, since we are dealing with mass cytometry samples, we use the classical 
`arcsinh()` transformation with 5 as *co-factor*, as described elsewhere
([@Nowicka2017-dh]). 

```{r scaleTransform}
trans <- arcsinhTransform(
    transformationId="ArcsinhTransform", 
    a = 0, 
    b = 1/5, 
    c = 0)

BCRXL_fs_trans <- transform(
    BCRXL_fs,
    transformList(chLabels, trans))
```


# Pairwise sample Earth Mover's Distances

## Calculating distances between samples

We can now calculate pairwise *Earth Mover's Distances (EMD)* 
between all samples of our dataset.   

This is done by calling the `pairwiseEMDDist()` method The simplest way to  
use this method is by providing directly a `flowCore::flowSet`, containing all
samples, as input parameter. Note that, for heavy datasets that include 
a lot of samples, this can create memory issues. To handle this case, `CytoMDS` 
provides other ways to call the `pairwiseEMDDist()` function 
(see 'Handling heavy datasets' section).

Using the `channels` argument, it is possible to restrict the EMD calculation 
to some of the channels. Here we simply pass as input the biologically 
relevant markers selected in the previous section.

```{r distCalc}
pwDist <- pairwiseEMDDist(
    BCRXL_fs_trans,
    channels = chMarkers,
    verbose = FALSE
)
pwDistMatrix <- as.matrix(pwDist)
```

The return value of the `pairwiseEMDDist` function is a `DistSum` object.
We can use the `as.matrix()` method to to convert this object into a matrix, 
here a symmetric square matrix, with as many rows (columns) as input samples 
(extract shown here below for the scale-transformed *Bodenmiller2012* dataset).

```{r distShow}
round(pwDistMatrix[1:10, 1:10], 2)
```

One relevant way to visualize this distance matrix is to draw the histogram 
of pairwise distances, as shown in the below plot. 

```{r distShowHist, fig.align='center', fig.wide = TRUE}
distVec <- pwDistMatrix[upper.tri(pwDist)]
distVecDF <- data.frame(dist = distVec)
pHist <- ggplot(distVecDF, mapping = aes(x=dist)) + 
    geom_histogram(fill = "darkgrey", col = "black", bins = 15) + 
    theme_bw() + ggtitle("EMD distances for Bodenmiller2012 dataset")
pHist
```

## Individual marker contribution in the distance matrix

Since in *CytoMDS*, the *EMD* is calculated as an approximation, summing over
the one-dimensional marginal marker unidimensional distributions, 
it is possible to obtain an individual contribution of each marker 
to the distance matrix. This can be done using the `distByFeature()` method: 

```{r distByFeatureTable}
DF <- distByFeature(pwDist)
DF[order(DF$percentage, decreasing = TRUE),]
```

These individual marker contributions can also be displayed visually, using 
the `ggplotDistFeatureImportance()` method: 

```{r distByFeaturePlot, fig.align='center', fig.wide = TRUE}
pBar <- ggplotDistFeatureImportance(pwDist)
pBar
```

# Metric Multidimensional scaling

## Calculating the MDS projection

Once the pairwise distance matrix has been calculated, computing the *Multi 
Dimensional Scaling (MDS)* projection is done 
by calling the `computeMetricMDS()` function. In its simplest form, only 
the distance `DistSum` object (or a distance matrix) needs to be passed 
to the function. In that case, the number of dimensions to use in the MDS 
is automatically set in order to reach a specific value 
for a projection quality indicator, i.e. the *target pseudo R square*, 
which in turn is set by default to 0.95 
(see Quality of projection - diagnostic tools section).  

Note that the *Smacof* algorithm [@De_Leeuw2009-aw], 
used to compute the MDS projection, is stochastic, so it is sensitive 
to the 'seed' used. Therefore, in cases where reproducible results from 
one run to another is required, it is advised to set the `seed` argument 
to a fixed value.

The returned value of the `computeMetricMDS()` function is an object of the 
*MDS* class. This object can  be queried to get e.g. the number of dimensions 
that was effectively used, or the obtained pseudo RSquare, as shown 
in the following code chunk:

```{r MDSCalc}
mdsObj <- computeMetricMDS(pwDist, seed = 0)
show(mdsObj)
```

## Plotting the MDS projection

Plotting the obtained MDS projection is done using `ggplotSampleMDS()`. 
If no `phenoData` parameter is passed, then, by default, 
numbers are used as labels, and samples are represented as black dots.

```{r plotMDS1_simplest, fig.align='center', fig.asp = 1, fig.wide = TRUE}
ggplotSampleMDS(mdsObj)
```

However, by providing a 'phenoData' dataframe to the `ggplotSampleMDS()` 
function, the corresponding variables can be used 
for highlighting sample points with different colours and/or shapes. 
Here below, the previous plot is enhanced with red and blue colours, 
distinguishing samples from the two conditions. Also, we have added 
more meaningful labels to each data point, corresponding to patient id's.  

On this plot, one can clearly see a clear separation between samples of the 
two different conditions. These 2 sample groups are different along the x 
axis, which corresponds to the first projection direction, explaining 46.69 % 
of the variability contained in the MDS projection with 4 dimensions 
(as indicated in the subtitle). This clear separation between 2 condition 
clusters highlights the strong biological signal differentiating these two 
groups of samples. Further in this vignette, we shall try to assign a user 
interpretation to this x axis direction (see 'Aid to interpreting projection 
axes' section).

```{r plotMDS_colours_shapes_1_2, fig.align='center', fig.asp = 0.9, fig.wide = TRUE}
p12 <- ggplotSampleMDS(
    mdsObj,
    pData = phenoData,
    projectionAxes = c(1,2),
    pDataForColour = "group_id",
    pDataForLabel = "patient_id"
)
p12
```

Given that 4 dimensions were used in the metric MDS algorithm, the user can 
visualize the MDS projection using any combination of two axes, 
for example axes 2 and 3, or 3 and 4, as below:

```{r plotMDS_colours_shapes_2_3_4, fig.align='center', fig.width = 9, fig.height = 12}
p23 <- ggplotSampleMDS(
    mdsObj,
    pData = phenoData,
    projectionAxes = c(2,3),
    pDataForColour = "group_id",
    pDataForLabel = "patient_id"
)

p34 <- ggplotSampleMDS(
    mdsObj,
    pData = phenoData,
    projectionAxes = c(3,4),
    pDataForColour = "group_id",
    pDataForLabel = "patient_id"
)
p23 / p34
```

These plots reveal two important findings: 
1. Dots of the 2 samples groups are well-mixed in both views, showing that 
the biological difference between the two sample groups is mainly concentrated 
along the first projection axis.
2. Dots corresponding to samples from the same patient id are most of the time 
very close to each other. We could conclude that the variability contained in 
axes 2, 3 and 4 mostly represent biological variation between different 
individuals, and not (or much less) the effect of stimulation. 

## Quality of projection - diagnostic tools

In order to be able to trust the projected distances obtained on the `CytoMDS` 
plots, a couple of projection quality indicators can be verified : 
- the *pseudo RSquare* indicator shows what percentage of the variability 
contained in the pairwise distance matrix is actually contained in the 
low dimensional MDS projection. It is analog to the statistical *RSquare* 
for a linear regression model, i.e. the closer to one the *pseudo RSquare* is, 
the better. Note that the *pseudo RSquare* refers to the variability contained 
in __ALL__ dimensions of the MDS projection, not only the two plotted axes.
- *nDim* is the number of dimensions of the projection that was needed 
to obtain the corresponding *pseudo RSquare* (here 4 dimensions).
- the percentage of variation that is captured along each axis (coordinates), 
is to be interpreted with respect to the total variability __that is 
captured by the MDS projection__, not the total variability.  

For example, in the previous section, the MDS projection using 4 dimensions is 
able to capture 98.15% (*pseudo RSquare*) of the initial variability contained 
in the calculated pairwise distance matrix. __Of these 98.15%__, 
46.69% is in turn captured by axis 1, 29.95% is captured by axis 2, 15.38% is 
captured by axis 3 and 7.99% is captured by axis 4.

Another useful projection quality diagnostic tool is the *Shepard diagram*. 
On this diagram, each dot represents a single distance between a sample pair, 
with as x coordinate the original (high dimensional) distance between 
the two samples, and as y coordinate the projected low dimensional distance, 
as obtained by the MDS projection algorithm.  

In the *Shepard* diagram, the ideal situation is when all dots are located 
on a straight line passing through the (0,0) and (1,1) points. 
In the below diagram, one can notice that all points are very near 
the ideal straight line, hence the distance projections can be trusted.

```{r plotMDSShepard, fig.align='center', fig.wide = TRUE, fig.asp = 1}
ggplotSampleMDSShepard(mdsObj)
```

## Additional options for the MDS projection

In this section, we describe a couple of additional options available 
to the user when calculating the MDS projection using `computeMetricMDS()`.

First, instead of letting the algorithm choose itself the number of dimensions, 
it is also possible to assign it explicitly, for example to 2, as in below:

```{r MDSCalc.nDim2, fig.align='center', fig.asp = 0.85, fig.wide = TRUE}
 mdsObj2 <- CytoMDS::computeMetricMDS(pwDist, seed = 0, nDim = 2)
 ggplotSampleMDS(mdsObj2,
                 pData = phenoData,
                 projectionAxes = c(1,2),
                 pDataForColour = "group_id",
                 pDataForLabel = "patient_id",
                 flipYAxis = TRUE)
```

Note that the obtained projection on 2 axes, although similar, is not exactly 
the same as the one obtained when visualizing the first two axis of the 
MDS projection obtained before, with 4 dimensions. Actually, this is a feature 
of the Metric MDS projection, although it might appear a bit counter-intuitive 
at first.  

Second, it is also possible to adjust the number of dimensions indirectly, 
by setting an explicit *pseudo Rsquare* target. In that case the algorithm will
increase the number of dimensions until reaching the required quality target, 
instead of the by default 0.95 target.  

The below example shows how to obtain a *pseudo R Square* of at least 0.99. 
As a result, the needed number of dimensions is now 6, instead of 4.

```{r MDSCalc.Rsq99, fig.align='center', fig.asp = 0.85, fig.wide = TRUE}
mdsObj3 <- CytoMDS::computeMetricMDS(pwDist, seed = 0, targetPseudoRSq = 0.99)
ggplotSampleMDS(mdsObj3,
                 pData = phenoData,
                 projectionAxes = c(1,2),
                 pDataForColour = "group_id",
                 pDataForLabel = "patient_id")
```

The corresponding Shepard diagram is obtained as below, showing that the dots 
are even more concentrated around the ideal straight line, than before:

```{r plotMDSShepard.Rsq99, fig.align='center', fig.wide = TRUE, fig.asp = 1}
ggplotSampleMDSShepard(mdsObj3)
```

## Aid to interpreting projection axes

### Bi-plots

With MDS projections, it is possible to (try to) associate some axis directions 
to specific sample characteristics. The idea is to calculate the 
correlation of well chosen sample statistics w.r.t. the axes of projection, so 
that these correlations can be represented on a correlation circle, which is 
overlaid on the projection plot. This plot set-up is called a 'bi-plot'.

In order produce a bi=plot, the user first needs to calculate chosen statistics 
of interest for which they want to assess the association with the axis 
directions. Typically, one chooses channel specific statistics, 
like e.g. the mean, the standard deviation, or any quantile that might be 
of interest. However, any statistics that can be calculated for each sample 
can be used (number of events,...)

Here below, we provide an example where the user overlays the median of 
the different channels, on a bi-plot with MDS projection axes 1 and 2.  

On the bi-plot, each arrow - here representing a channel median - is located 
at coordinates equal to its respective Pearson correlation w.r.t. the two axis. 

Here, one can identify that the x axis has a strong positive correlation with 
the median of markers 'CD4', 'CD33', 'CD20', 'pNFkB', 'pp38', 'pBtk' 
and 'pSlp76', and a strong negative correlation with the median of marker 
'HLA-DR'. The y axis has a strong negative correlation with the median 
of a number of markers: 'CD123', 'pStat1', 'pStat5', 'pSHP2', 'pAkt', 'pZap70' 
and 'pPlcg2'.

```{r plotMDS_medians}
medians <- channelSummaryStats(BCRXL_fs_trans, 
                               channels = chLabels, 
                               statFUNs = median)
```

```{r plotMDS_biplot, fig.align='center', fig.asp = 0.8, fig.wide = TRUE}
ggplotSampleMDS(mdsObj,
                pData = phenoData,
                pDataForColour = "group_id",
                displayPointLabels = FALSE,
                displayArrowLabels = TRUE,
                repelArrowLabels = TRUE,
                biplot = TRUE,
                extVariables = medians)
```

Note that, on the bi-plots, only the arrows of length greater or equal to 
a specific threshold (by default set at 0.8) are represented, in order to not 
overwhelm the plot with arrows, especially when the dataset uses a high 
dimensional panel.  

It is however possible to adjust this threshold by explicitly setting 
the `arrowThreshold` argument. For example, in the below plot, the threshold 
is set set to 0.6, showing, for example, that x axis is also strongly 
negatively correlated with the median of 'pS6' and 'CD7' markers.

```{r plotMDS_biplot_0.6, fig.align='center', fig.asp = 0.8, fig.wide = TRUE}
ggplotSampleMDS(mdsObj,
                pData = phenoData,
                pDataForColour = "group_id",
                displayPointLabels = FALSE,
                displayArrowLabels = TRUE,
                repelArrowLabels = TRUE,
                biplot = TRUE,
                extVariables = medians,
                arrowThreshold = 0.6)
```

In terms of biological interpretation, since the x axis is the direction along 
which there is a clear separation between stimulated and (reference) 
un-stimulated samples, the biplot suggests that channels for which 
the median appears as strongly correlated with the x axis, should also show 
visible distributional difference between the two sample groups.  

In order to check this, we can use the `ggplotMarginalDensities` method 
provided in the `CytoMDS` package, as follows: 

```{r plotMarginalDensities, fig.align='center', fig.asp = 1.2, fig.wide = TRUE}
ggplotMarginalDensities(
    BCRXL_fs_trans, 
    channels = chLabels,
    pDataForColour = "group_id",
    pDataForGroup = "sample_id")
```
And indeed, we can notice that the Reference samples (in red) tend to 
show higher intensity values for markers 'pNFkB', 'pp38', 'CD4', 'CD20', 
'CD33', 'pSlp76', 'pBtk', while BCR-XL stimulated samples (in blue) tend to 
show higher internsity valeus for markers 'pS6', 'HLA-DR' and 'CD7'. All these 
markers where identified as strongly (resp. positively and negatively) 
correlated with the x axis on the bi-plot.  

### Bi-plot wrapping

Instead of drawing one single bi-plot related to a specific type of statistics, 
for example channel medians as before, we can also try to associate the axes 
to different channel statistics at once. In the next plot, we draw bi-plots 
for channel medians, 25% and 75% quantiles, and standard deviations at once. 

The 'facets-alike' bi-plot, or *bi-plot wrapping*, is obtained thanks to the 
`ggplotSampleMDSWrapBiplots()` function, which internally calls 
`ggplotSampleMDS()` function several times, and arrange the obtained outputs 
on a single plot.   

```{r plotMDS_stats}
statFUNs = c("median" = stats::median,
             "Q25" = function(x, na.rm) {
                 stats::quantile(x, probs = 0.25)
             },
             "Q75" = function(x, na.rm) {
                 stats::quantile(x, probs = 0.75)
             },
             "standard deviation" = stats::sd)
chStats <- channelSummaryStats(BCRXL_fs_trans,
                               channels = chLabels, 
                               statFUNs = statFUNs)
```

```{r plotMDS_biplot_facetting, fig.align='center', fig.asp = 1, fig.wide = TRUE}
ggplotSampleMDSWrapBiplots(
    mdsObj,
    extVariableList = chStats,
    ncol = 2,
    pData = phenoData,
    pDataForColour = "group_id",
    displayPointLabels = FALSE,
    displayArrowLabels = TRUE,
    repelArrowLabels = TRUE,
    displayLegend = FALSE)
```

Sometimes, as is the case on the plots below, a high number of channel 
statistics are strongly correlated with the bi-plot axes, so that the plot 
is hardly readable, due to too many arrows displayed.  

In that case, it is advised to generate series of bi-plots, part of the 
channel statistics, in order to better identify the strongly correlated ones. 
One example is provided below, showing the correlation between the axes and the 
standard deviation of each channel:

```{r plotMDS_biplot_stddev, fig.align='center', fig.asp = 1.0, fig.wide = TRUE}
stdDevs <- list(
    "std dev of channels 1 to 6" = chStats[["standard deviation"]][,1:6],
    "std dev of channels 7 to 12" = chStats[["standard deviation"]][,7:12],
    "std dev of channels 13 to 18" = chStats[["standard deviation"]][,13:18],
    "std dev of channels 19 to 24" = chStats[["standard deviation"]][,19:24]
)

ggplotSampleMDSWrapBiplots(
    mdsObj,
    ncol = 2,
    extVariableList = stdDevs,
    pData = phenoData,
    pDataForColour = "group_id",
    displayPointLabels = FALSE,
    displayArrowLabels = TRUE,
    repelArrowLabels = TRUE)
```



# Handling large datasets

Computing Earth Mover's Distances between all sample pairs of large datasets 
(e.g. with hundreds of samples), is a heavy computational task.  

First, loading the whole data set as a `flowCore::flowSet()` in RAM at once, 
might not be possible due to its size. Second, calculating a matrix of 
pairwise distances, has a computational complexity of O(N2), which can lead to 
very long computation times for large datasets.  

Therefore, the `CytoMDS` package provides several mechanisms allowing to
mitigate these issues.

## Loading flow frames dynamically during distance matrix computation

In order to be able to handle datasets of greater size than the available 
computer RAM, the `pairwiseEMDDist()` function allows for an alternative input 
mode, where:
- the input samples are NOT provided directly via a `flowCore::flowSet`, or  
a list of expression matrices, but
- the user provides the nb of samples, and a user-written expression matrix 
loading function that will be called to dynamically load the *i*th sample 
- as an expression matrix - upon request, and optionally additional arguments.

Typically, the expression matrix loading function provided by the user shall 
describe how to read the *i*th sample from disk. 

## Using BiocParallel to parallelize distance matrix computation

Also, `CytoMDS` pairwise distance calculation supports parallelization of 
distance matrix computation, through the use of `BiocParallel` package.  

When parallelization is used, the calculation engine will automatically 
create worker tasks corresponding to the calculation of blocks 
of the distance matrix.  

In the below, we provide an example with the BodenMiller2012 dataset. 
Note that this example has illustrative purpose only, as in fact this dataset 
is small enough to fully reside in memory and to not require parallelization 
of distance calculation. 

In case where 'on-the-fly' expression matrices in-memory loading is required, 
we advise, as a preliminary step, to store the samples, previously scale 
transformed, on disk. Here we do it in a temporary directory. 

```{r temporary_store}
storageLocation <- suppressMessages(base::tempdir())

nSample <- length(BCRXL_fs_trans)
fileNames <- file.path(
    storageLocation,
    paste0("BodenMiller2012_TransformedSample", 
           sprintf("%02d.rds", seq_len(nSample))))

for (i in seq_len(nSample)) {
    saveRDS(BCRXL_fs_trans[[i]],
            file = fileNames[i])
}
```

Then, the `pairwiseEMDDist()` method can be called, now specifying a number of 
samples, a expression matrix loading function, and a `BiocParallel::SnowParam()` 
backbone for parallelization of the computations. 

```{r distance_calc_loading_otf_parallel}
bp <- BiocParallel::SnowParam(
    stop.on.error = FALSE,
    progressbar = TRUE)
pwDistLast <- suppressWarnings(pairwiseEMDDist(
    x = nSample,
    channels = chMarkers,
    loadExprMatrixFUN  = function(exprMatrixIndex, files, channels, markers){
        ff <- readRDS(file = files[exprMatrixIndex ])
        exprMat <- flowCore::exprs(ff)[, channels, drop = FALSE]
        colnames(exprMat) <- markers
        exprMat
    },
    loadExprMatrixFUNArgs = list(
        files = fileNames, 
        channels = chLabels,
        markers = chMarkers),
    BPPARAM = bp))
```
The obtained distances - as shown in the below histogram - are exactly 
the same as before.

```{r dist1ShowHistAgain, fig.align='center', fig.wide = TRUE}
pwDistLastMat <- as.matrix(pwDistLast)
distVecLast <- pwDistLastMat[upper.tri(pwDistLastMat)]
distVecDFLast <- data.frame(dist = distVecLast)
pHistLast <- ggplot(distVecDFLast, mapping = aes(x=dist)) +
    geom_histogram(fill = "darkgrey", col = "black", bins = 15) +
    theme_bw() + 
    ggtitle(
        "EMD distances for Bodenmiller2012 dataset", 
        subtitle =  "on the fly memory loading and parallel computation")
pHistLast
```


# Expression matrices as input instead of flowFrames
Standard use of `CytoMDS` involves providing cytometry sample data, contained 
in `flowCore` standard cytometry data structures (`flowSet` and `flowFrame`) 
as input to Earth Mover's distance calculation.  

However, `CytoMDS` can also accept more generic input data, as a list 
of expression matrices, stored in standard R matrices. For instance, this 
allows using `CytoMDS` visualizations for other type of data than cytometry, as 
long as each sample can be represented by a matrix with any number of rows 
(cells), and fixed columns/features (markers).

Here below is an illustrating toy example with some randomly simulated data.  

Let us first simulate 10 samples with 1,000 rows each and 10 features.
Samples are split into 2 groups: `conditionA` and `conditionB` with some
increased expression std deviation for 3 features in `conditionB`, 
compared to `conditionA`.

```{r simulExprMatrices}
nSample <- 10
conditions <- factor(c(rep("conditionA", 5), rep("conditionB", 5)))
nRow <- 1000
nFeat <- 10
nDiffFeat <- 3
diffFeats = c(2, 3, 9)
stdDevFactor = 1.5

exprMatrixList <- mapply(
    seq(nSample),
    conditions,
    FUN = function(i, condition) {
        exprMatrix <- matrix(rnorm(nRow*nFeat), nrow = nRow)
        if (condition == "conditionB") {
            exprMatrix[, diffFeats] <- exprMatrix[, diffFeats] * stdDevFactor
        }
        colnames(exprMatrix) <- paste0("Feat", seq(nFeat))
        exprMatrix
    },
    SIMPLIFY = FALSE
)

names(exprMatrixList) <- paste0("sample", seq(nSample))

```

We can now generate a pairwise sample distance matrix, based on the previously 
simulated matrix list, and show the corresponding histogram: 

```{r distExprMatrices}

pwDistExpr <-  pairwiseEMDDist(
    exprMatrixList
)
```

```{r distExprMatShowHist, fig.align='center', fig.wide = TRUE}
pwDistExprmat <- as.matrix(pwDistExpr)
distVecExpr <- pwDistExprmat[upper.tri(pwDistExprmat)]
distVecDFExpr <- data.frame(dist = distVecExpr)
pHistExpr <- ggplot(distVecDFExpr, mapping = aes(x=dist)) +
    geom_histogram(fill = "darkgrey", col = "black", bins = 15) +
    theme_bw() + 
    ggtitle(
        "EMD distances for simulated expression matrices")
pHistExpr
```

With the generated pairwise distance matrix, we can now proceed with the 
`CytoMDS` workflow, as in the previous examples:

```{r MDSCalcExprMat}
mdsObjExpr <- computeMetricMDS(pwDistExpr, seed = 0)
show(mdsObjExpr)
```

```{r plotMDSExprMat, fig.align='center', fig.asp = 0.9, fig.wide = TRUE}
phenoDataExpr <- data.frame(sampleId = seq(nSample), cond = conditions)
p12 <- ggplotSampleMDS(
    mdsObjExpr,
    pData = phenoDataExpr,
    projectionAxes = c(1,2),
    pDataForColour = "cond",
    pDataForLabel = "sampleId"
)
p12
```

```{r biplotsMDSExprMat, fig.align='center', fig.asp = 0.9, fig.wide = TRUE}
statFunctions <- list(mean = base::mean,
                      std_dev = stats::sd)
chStatsExpr <- channelSummaryStats(exprMatrixList,
                                   statFUNs  = statFunctions)
p <- CytoMDS::ggplotSampleMDSWrapBiplots(
    mdsObj = mdsObjExpr,
    extVariableList = chStatsExpr,
    pData = phenoDataExpr,
    projectionAxes = c(1,2),
    pDataForColour = "cond",
    displayPointLabels = FALSE,
    repelArrowLabels = TRUE,
    ncol = 2,
    arrowThreshold = 0.9
)
p

```


# Session information {-}

```{r sessioninfo, echo=FALSE}
sessionInfo()
```

# References {-}