--- title: "Low Dimensional Projection of Cytometry Samples" author: - name: Philippe Hauchamps - name: Laurent Gatto package: CytoMDS abstract: > This vignette describes the functionality implemented in the `CytoMDS` package. `CytoMDS` provides support for low dimensional projection of a set of cytometry samples, using concepts such as Earth Mover's (EMD) distance, and Multi Dimensional Scaling (MDS). This vignette is distributed under a CC BY-SA 4.0 license. output: BiocStyle::html_document: toc_float: true bibliography: CytoMDS.bib vignette: > %\VignetteIndexEntry{Low Dimensional Projection of Cytometry Samples} %\VignetteEngine{knitr::rmarkdown} %%\VignetteKeywords{FlowCytometry, QualityControl, DimensionReduction, multidimensionalScaling, Software, Visualization} %\VignetteEncoding{UTF-8} --- ```{r style, echo = FALSE, results = 'hide'} BiocStyle::markdown() ``` # Installation and loading dependencies To install this package, start R and enter (un-commented): ```{r} # if (!require("BiocManager", quietly = TRUE)) # install.packages("BiocManager") # # BiocManager::install("CytoMDS") ``` We now load the packages needed in the current vignette: ```{r rlibs, results=FALSE} suppressPackageStartupMessages(library(HDCytoData)) library(CytoMDS) library(ggplot2) library(patchwork) ``` # Introduction The `CytoMDS` package implements low dimensional visualization of cytometry samples, in order to visually assess distances between them. This, in turn, can greatly help the user to identify quality issues like batch effects or outlier samples, and/or check the presence of potential sample clusters that might align with the experimental design. The `CytoMDS` algorithm combines, on the one hand, the concept of *Earth Mover's Distance (EMD)* [@Orlova2016-jm], a.k.a. *Wasserstein metric* and, on the other hand, the metric *Multi Dimensional Scaling (MDS)* algorithm for the low dimensional projection [@De_Leeuw2009-aw]. Besides projection itself, the package also provides some diagnostic tools for both checking the quality of the MDS projection, as well as interpreting the axes of the projection (see below sections). # Illustrative dataset The illustrative dataset that is used throughout this vignette is a mass cytometry (*CyTOF*) dataset from [@Bodenmiller2012-re], and provided in the Bioconductor *HDCytoData* data package (@Weber2019-qp). This dataset consists of 16 paired samples (8 times 2) of peripheral blood cells from healthy individuals. Among each sample pair, one sample - the reference - was left un-stimulated, while the other sample was stimulated with B cell receptor / Fc receptor cross-linker (BCR-XL). This public dataset is known to contain a strong differential expression signal between the two conditions (stimulated vs un-stimulated) and as been used in recent work to benchmark differential analysis algorithms ([@Weber2019-pz]) and to design mass cytometry data analysis pipelines ([@Nowicka2017-dh]). In the `CytoMDS`package, as in the current vignette, matrices of cytometry events intensities, corresponding to one sample, are stored as `flowCore::flowFrame` [@flowCore] objects. Samples of a particular cytometry dataset are then stored as a `flowCore::flowSet` object, which is a collection of flowFrame's, i.e. one flowFrame per sample. Therefore, we load the flowSet version of the *BodenMiller2012* dataset, obtained from the `HDCytoData` package. ```{r loadDataSet} BCRXL_fs <- HDCytoData::Bodenmiller_BCR_XL_flowSet() BCRXL_fs ``` In regular flowSet's, the experimental design information is typically stored in the `phenoData` slot, and this is also the way `CytoMDS` expects to get its input. However, `HDCytoData` has chosen to store the experimental design information in a slightly different way, hence the need to convert the data as follows: ```{r convertPhenoData} phenoData <- flowCore::pData(BCRXL_fs) additionalPhenoData <- keyword(BCRXL_fs[[1]], "EXPERIMENT_INFO")$EXPERIMENT_INFO phenoData <- cbind(phenoData, additionalPhenoData) flowCore::pData(BCRXL_fs) <- phenoData ``` We also select channels/markers that are biologically relevant, i.e. both the cell type and cell state markers, and store them for further use. We discard the typical housekeeping markers that are founds in flowFrames like *time* and Cell_length, etc. In total, these mass cytometry samples contain intensities for 24 biologically relevant markers. ```{r selectMarkers} markerInfo <- keyword(BCRXL_fs[[1]], "MARKER_INFO")$MARKER_INFO chClass <- markerInfo$marker_class table(chClass) chLabels <- markerInfo$channel_name[chClass != "none"] (chMarkers <- markerInfo$marker_name[chClass != "none"]) ``` The first step consists in scale transforming the raw data. Indeed, distances between samples make more sense with scaled transformed signal, in which distributional differences are more readable and usable for downstream analysis. Here, since we are dealing with mass cytometry samples, we use the classical `arcsinh()` transformation with 5 as *co-factor*, as described elsewhere ([@Nowicka2017-dh]). ```{r scaleTransform} trans <- arcsinhTransform( transformationId="ArcsinhTransform", a = 0, b = 1/5, c = 0) BCRXL_fs_trans <- transform( BCRXL_fs, transformList(chLabels, trans)) ``` # Pairwise sample Earth Mover's Distances ## Calculating distances between samples We can now calculate pairwise *Earth Mover's Distances (EMD)* between all samples of our dataset. This is done by calling the `pairwiseEMDDist()` method The simplest way to use this method is by providing directly a `flowCore::flowSet`, containing all samples, as input parameter. Note that, for heavy datasets that include a lot of samples, this can create memory issues. To handle this case, `CytoMDS` provides other ways to call the `pairwiseEMDDist()` function (see 'Handling heavy datasets' section). Using the `channels` argument, it is possible to restrict the EMD calculation to some of the channels. Here we simply pass as input the biologically relevant markers selected in the previous section. ```{r distCalc} pwDist <- pairwiseEMDDist( BCRXL_fs_trans, channels = chMarkers, verbose = FALSE ) pwDistMatrix <- as.matrix(pwDist) ``` The return value of the `pairwiseEMDDist` function is a `DistSum` object. We can use the `as.matrix()` method to to convert this object into a matrix, here a symmetric square matrix, with as many rows (columns) as input samples (extract shown here below for the scale-transformed *Bodenmiller2012* dataset). ```{r distShow} round(pwDistMatrix[1:10, 1:10], 2) ``` One relevant way to visualize this distance matrix is to draw the histogram of pairwise distances, as shown in the below plot. ```{r distShowHist, fig.align='center', fig.wide = TRUE} distVec <- pwDistMatrix[upper.tri(pwDist)] distVecDF <- data.frame(dist = distVec) pHist <- ggplot(distVecDF, mapping = aes(x=dist)) + geom_histogram(fill = "darkgrey", col = "black", bins = 15) + theme_bw() + ggtitle("EMD distances for Bodenmiller2012 dataset") pHist ``` ## Individual marker contribution in the distance matrix Since in *CytoMDS*, the *EMD* is calculated as an approximation, summing over the one-dimensional marginal marker unidimensional distributions, it is possible to obtain an individual contribution of each marker to the distance matrix. This can be done using the `distByFeature()` method: ```{r distByFeatureTable} DF <- distByFeature(pwDist) DF[order(DF$percentage, decreasing = TRUE),] ``` These individual marker contributions can also be displayed visually, using the `ggplotDistFeatureImportance()` method: ```{r distByFeaturePlot, fig.align='center', fig.wide = TRUE} pBar <- ggplotDistFeatureImportance(pwDist) pBar ``` # Metric Multidimensional scaling ## Calculating the MDS projection Once the pairwise distance matrix has been calculated, computing the *Multi Dimensional Scaling (MDS)* projection is done by calling the `computeMetricMDS()` function. In its simplest form, only the distance `DistSum` object (or a distance matrix) needs to be passed to the function. In that case, the number of dimensions to use in the MDS is automatically set in order to reach a specific value for a projection quality indicator, i.e. the *target pseudo R square*, which in turn is set by default to 0.95 (see Quality of projection - diagnostic tools section). Note that the *Smacof* algorithm [@De_Leeuw2009-aw], used to compute the MDS projection, is stochastic, so it is sensitive to the 'seed' used. Therefore, in cases where reproducible results from one run to another is required, it is advised to set the `seed` argument to a fixed value. The returned value of the `computeMetricMDS()` function is an object of the *MDS* class. This object can be queried to get e.g. the number of dimensions that was effectively used, or the obtained pseudo RSquare, as shown in the following code chunk: ```{r MDSCalc} mdsObj <- computeMetricMDS(pwDist, seed = 0) show(mdsObj) ``` ## Plotting the MDS projection Plotting the obtained MDS projection is done using `ggplotSampleMDS()`. If no `phenoData` parameter is passed, then, by default, numbers are used as labels, and samples are represented as black dots. ```{r plotMDS1_simplest, fig.align='center', fig.asp = 1, fig.wide = TRUE} ggplotSampleMDS(mdsObj) ``` However, by providing a 'phenoData' dataframe to the `ggplotSampleMDS()` function, the corresponding variables can be used for highlighting sample points with different colours and/or shapes. Here below, the previous plot is enhanced with red and blue colours, distinguishing samples from the two conditions. Also, we have added more meaningful labels to each data point, corresponding to patient id's. On this plot, one can clearly see a clear separation between samples of the two different conditions. These 2 sample groups are different along the x axis, which corresponds to the first projection direction, explaining 46.69 % of the variability contained in the MDS projection with 4 dimensions (as indicated in the subtitle). This clear separation between 2 condition clusters highlights the strong biological signal differentiating these two groups of samples. Further in this vignette, we shall try to assign a user interpretation to this x axis direction (see 'Aid to interpreting projection axes' section). ```{r plotMDS_colours_shapes_1_2, fig.align='center', fig.asp = 0.9, fig.wide = TRUE} p12 <- ggplotSampleMDS( mdsObj, pData = phenoData, projectionAxes = c(1,2), pDataForColour = "group_id", pDataForLabel = "patient_id" ) p12 ``` Given that 4 dimensions were used in the metric MDS algorithm, the user can visualize the MDS projection using any combination of two axes, for example axes 2 and 3, or 3 and 4, as below: ```{r plotMDS_colours_shapes_2_3_4, fig.align='center', fig.width = 9, fig.height = 12} p23 <- ggplotSampleMDS( mdsObj, pData = phenoData, projectionAxes = c(2,3), pDataForColour = "group_id", pDataForLabel = "patient_id" ) p34 <- ggplotSampleMDS( mdsObj, pData = phenoData, projectionAxes = c(3,4), pDataForColour = "group_id", pDataForLabel = "patient_id" ) p23 / p34 ``` These plots reveal two important findings: 1. Dots of the 2 samples groups are well-mixed in both views, showing that the biological difference between the two sample groups is mainly concentrated along the first projection axis. 2. Dots corresponding to samples from the same patient id are most of the time very close to each other. We could conclude that the variability contained in axes 2, 3 and 4 mostly represent biological variation between different individuals, and not (or much less) the effect of stimulation. ## Quality of projection - diagnostic tools In order to be able to trust the projected distances obtained on the `CytoMDS` plots, a couple of projection quality indicators can be verified : - the *pseudo RSquare* indicator shows what percentage of the variability contained in the pairwise distance matrix is actually contained in the low dimensional MDS projection. It is analog to the statistical *RSquare* for a linear regression model, i.e. the closer to one the *pseudo RSquare* is, the better. Note that the *pseudo RSquare* refers to the variability contained in __ALL__ dimensions of the MDS projection, not only the two plotted axes. - *nDim* is the number of dimensions of the projection that was needed to obtain the corresponding *pseudo RSquare* (here 4 dimensions). - the percentage of variation that is captured along each axis (coordinates), is to be interpreted with respect to the total variability __that is captured by the MDS projection__, not the total variability. For example, in the previous section, the MDS projection using 4 dimensions is able to capture 98.15% (*pseudo RSquare*) of the initial variability contained in the calculated pairwise distance matrix. __Of these 98.15%__, 46.69% is in turn captured by axis 1, 29.95% is captured by axis 2, 15.38% is captured by axis 3 and 7.99% is captured by axis 4. Another useful projection quality diagnostic tool is the *Shepard diagram*. On this diagram, each dot represents a single distance between a sample pair, with as x coordinate the original (high dimensional) distance between the two samples, and as y coordinate the projected low dimensional distance, as obtained by the MDS projection algorithm. In the *Shepard* diagram, the ideal situation is when all dots are located on a straight line passing through the (0,0) and (1,1) points. In the below diagram, one can notice that all points are very near the ideal straight line, hence the distance projections can be trusted. ```{r plotMDSShepard, fig.align='center', fig.wide = TRUE, fig.asp = 1} ggplotSampleMDSShepard(mdsObj) ``` ## Additional options for the MDS projection In this section, we describe a couple of additional options available to the user when calculating the MDS projection using `computeMetricMDS()`. First, instead of letting the algorithm choose itself the number of dimensions, it is also possible to assign it explicitly, for example to 2, as in below: ```{r MDSCalc.nDim2, fig.align='center', fig.asp = 0.85, fig.wide = TRUE} mdsObj2 <- CytoMDS::computeMetricMDS(pwDist, seed = 0, nDim = 2) ggplotSampleMDS(mdsObj2, pData = phenoData, projectionAxes = c(1,2), pDataForColour = "group_id", pDataForLabel = "patient_id", flipYAxis = TRUE) ``` Note that the obtained projection on 2 axes, although similar, is not exactly the same as the one obtained when visualizing the first two axis of the MDS projection obtained before, with 4 dimensions. Actually, this is a feature of the Metric MDS projection, although it might appear a bit counter-intuitive at first. Second, it is also possible to adjust the number of dimensions indirectly, by setting an explicit *pseudo Rsquare* target. In that case the algorithm will increase the number of dimensions until reaching the required quality target, instead of the by default 0.95 target. The below example shows how to obtain a *pseudo R Square* of at least 0.99. As a result, the needed number of dimensions is now 6, instead of 4. ```{r MDSCalc.Rsq99, fig.align='center', fig.asp = 0.85, fig.wide = TRUE} mdsObj3 <- CytoMDS::computeMetricMDS(pwDist, seed = 0, targetPseudoRSq = 0.99) ggplotSampleMDS(mdsObj3, pData = phenoData, projectionAxes = c(1,2), pDataForColour = "group_id", pDataForLabel = "patient_id") ``` The corresponding Shepard diagram is obtained as below, showing that the dots are even more concentrated around the ideal straight line, than before: ```{r plotMDSShepard.Rsq99, fig.align='center', fig.wide = TRUE, fig.asp = 1} ggplotSampleMDSShepard(mdsObj3) ``` ## Aid to interpreting projection axes ### Bi-plots With MDS projections, it is possible to (try to) associate some axis directions to specific sample characteristics. The idea is to calculate the correlation of well chosen sample statistics w.r.t. the axes of projection, so that these correlations can be represented on a correlation circle, which is overlaid on the projection plot. This plot set-up is called a 'bi-plot'. In order produce a bi=plot, the user first needs to calculate chosen statistics of interest for which they want to assess the association with the axis directions. Typically, one chooses channel specific statistics, like e.g. the mean, the standard deviation, or any quantile that might be of interest. However, any statistics that can be calculated for each sample can be used (number of events,...) Here below, we provide an example where the user overlays the median of the different channels, on a bi-plot with MDS projection axes 1 and 2. On the bi-plot, each arrow - here representing a channel median - is located at coordinates equal to its respective Pearson correlation w.r.t. the two axis. Here, one can identify that the x axis has a strong positive correlation with the median of markers 'CD4', 'CD33', 'CD20', 'pNFkB', 'pp38', 'pBtk' and 'pSlp76', and a strong negative correlation with the median of marker 'HLA-DR'. The y axis has a strong negative correlation with the median of a number of markers: 'CD123', 'pStat1', 'pStat5', 'pSHP2', 'pAkt', 'pZap70' and 'pPlcg2'. ```{r plotMDS_medians} medians <- channelSummaryStats(BCRXL_fs_trans, channels = chLabels, statFUNs = median) ``` ```{r plotMDS_biplot, fig.align='center', fig.asp = 0.8, fig.wide = TRUE} ggplotSampleMDS(mdsObj, pData = phenoData, pDataForColour = "group_id", displayPointLabels = FALSE, displayArrowLabels = TRUE, repelArrowLabels = TRUE, biplot = TRUE, extVariables = medians) ``` Note that, on the bi-plots, only the arrows of length greater or equal to a specific threshold (by default set at 0.8) are represented, in order to not overwhelm the plot with arrows, especially when the dataset uses a high dimensional panel. It is however possible to adjust this threshold by explicitly setting the `arrowThreshold` argument. For example, in the below plot, the threshold is set set to 0.6, showing, for example, that x axis is also strongly negatively correlated with the median of 'pS6' and 'CD7' markers. ```{r plotMDS_biplot_0.6, fig.align='center', fig.asp = 0.8, fig.wide = TRUE} ggplotSampleMDS(mdsObj, pData = phenoData, pDataForColour = "group_id", displayPointLabels = FALSE, displayArrowLabels = TRUE, repelArrowLabels = TRUE, biplot = TRUE, extVariables = medians, arrowThreshold = 0.6) ``` In terms of biological interpretation, since the x axis is the direction along which there is a clear separation between stimulated and (reference) un-stimulated samples, the biplot suggests that channels for which the median appears as strongly correlated with the x axis, should also show visible distributional difference between the two sample groups. In order to check this, we can use the `ggplotMarginalDensities` method provided in the `CytoMDS` package, as follows: ```{r plotMarginalDensities, fig.align='center', fig.asp = 1.2, fig.wide = TRUE} ggplotMarginalDensities( BCRXL_fs_trans, channels = chLabels, pDataForColour = "group_id", pDataForGroup = "sample_id") ``` And indeed, we can notice that the Reference samples (in red) tend to show higher intensity values for markers 'pNFkB', 'pp38', 'CD4', 'CD20', 'CD33', 'pSlp76', 'pBtk', while BCR-XL stimulated samples (in blue) tend to show higher internsity valeus for markers 'pS6', 'HLA-DR' and 'CD7'. All these markers where identified as strongly (resp. positively and negatively) correlated with the x axis on the bi-plot. ### Bi-plot wrapping Instead of drawing one single bi-plot related to a specific type of statistics, for example channel medians as before, we can also try to associate the axes to different channel statistics at once. In the next plot, we draw bi-plots for channel medians, 25% and 75% quantiles, and standard deviations at once. The 'facets-alike' bi-plot, or *bi-plot wrapping*, is obtained thanks to the `ggplotSampleMDSWrapBiplots()` function, which internally calls `ggplotSampleMDS()` function several times, and arrange the obtained outputs on a single plot. ```{r plotMDS_stats} statFUNs = c("median" = stats::median, "Q25" = function(x, na.rm) { stats::quantile(x, probs = 0.25) }, "Q75" = function(x, na.rm) { stats::quantile(x, probs = 0.75) }, "standard deviation" = stats::sd) chStats <- channelSummaryStats(BCRXL_fs_trans, channels = chLabels, statFUNs = statFUNs) ``` ```{r plotMDS_biplot_facetting, fig.align='center', fig.asp = 1, fig.wide = TRUE} ggplotSampleMDSWrapBiplots( mdsObj, extVariableList = chStats, ncol = 2, pData = phenoData, pDataForColour = "group_id", displayPointLabels = FALSE, displayArrowLabels = TRUE, repelArrowLabels = TRUE, displayLegend = FALSE) ``` Sometimes, as is the case on the plots below, a high number of channel statistics are strongly correlated with the bi-plot axes, so that the plot is hardly readable, due to too many arrows displayed. In that case, it is advised to generate series of bi-plots, part of the channel statistics, in order to better identify the strongly correlated ones. One example is provided below, showing the correlation between the axes and the standard deviation of each channel: ```{r plotMDS_biplot_stddev, fig.align='center', fig.asp = 1.0, fig.wide = TRUE} stdDevs <- list( "std dev of channels 1 to 6" = chStats[["standard deviation"]][,1:6], "std dev of channels 7 to 12" = chStats[["standard deviation"]][,7:12], "std dev of channels 13 to 18" = chStats[["standard deviation"]][,13:18], "std dev of channels 19 to 24" = chStats[["standard deviation"]][,19:24] ) ggplotSampleMDSWrapBiplots( mdsObj, ncol = 2, extVariableList = stdDevs, pData = phenoData, pDataForColour = "group_id", displayPointLabels = FALSE, displayArrowLabels = TRUE, repelArrowLabels = TRUE) ``` # Handling large datasets Computing Earth Mover's Distances between all sample pairs of large datasets (e.g. with hundreds of samples), is a heavy computational task. First, loading the whole data set as a `flowCore::flowSet()` in RAM at once, might not be possible due to its size. Second, calculating a matrix of pairwise distances, has a computational complexity of O(N2), which can lead to very long computation times for large datasets. Therefore, the `CytoMDS` package provides several mechanisms allowing to mitigate these issues. ## Loading flow frames dynamically during distance matrix computation In order to be able to handle datasets of greater size than the available computer RAM, the `pairwiseEMDDist()` function allows for an alternative input mode, where: - the input samples are NOT provided directly via a `flowCore::flowSet`, or a list of expression matrices, but - the user provides the nb of samples, and a user-written expression matrix loading function that will be called to dynamically load the *i*th sample - as an expression matrix - upon request, and optionally additional arguments. Typically, the expression matrix loading function provided by the user shall describe how to read the *i*th sample from disk. ## Using BiocParallel to parallelize distance matrix computation Also, `CytoMDS` pairwise distance calculation supports parallelization of distance matrix computation, through the use of `BiocParallel` package. When parallelization is used, the calculation engine will automatically create worker tasks corresponding to the calculation of blocks of the distance matrix. In the below, we provide an example with the BodenMiller2012 dataset. Note that this example has illustrative purpose only, as in fact this dataset is small enough to fully reside in memory and to not require parallelization of distance calculation. In case where 'on-the-fly' expression matrices in-memory loading is required, we advise, as a preliminary step, to store the samples, previously scale transformed, on disk. Here we do it in a temporary directory. ```{r temporary_store} storageLocation <- suppressMessages(base::tempdir()) nSample <- length(BCRXL_fs_trans) fileNames <- file.path( storageLocation, paste0("BodenMiller2012_TransformedSample", sprintf("%02d.rds", seq_len(nSample)))) for (i in seq_len(nSample)) { saveRDS(BCRXL_fs_trans[[i]], file = fileNames[i]) } ``` Then, the `pairwiseEMDDist()` method can be called, now specifying a number of samples, a expression matrix loading function, and a `BiocParallel::SnowParam()` backbone for parallelization of the computations. ```{r distance_calc_loading_otf_parallel} bp <- BiocParallel::SnowParam( stop.on.error = FALSE, progressbar = TRUE) pwDistLast <- suppressWarnings(pairwiseEMDDist( x = nSample, channels = chMarkers, loadExprMatrixFUN = function(exprMatrixIndex, files, channels, markers){ ff <- readRDS(file = files[exprMatrixIndex ]) exprMat <- flowCore::exprs(ff)[, channels, drop = FALSE] colnames(exprMat) <- markers exprMat }, loadExprMatrixFUNArgs = list( files = fileNames, channels = chLabels, markers = chMarkers), BPPARAM = bp)) ``` The obtained distances - as shown in the below histogram - are exactly the same as before. ```{r dist1ShowHistAgain, fig.align='center', fig.wide = TRUE} pwDistLastMat <- as.matrix(pwDistLast) distVecLast <- pwDistLastMat[upper.tri(pwDistLastMat)] distVecDFLast <- data.frame(dist = distVecLast) pHistLast <- ggplot(distVecDFLast, mapping = aes(x=dist)) + geom_histogram(fill = "darkgrey", col = "black", bins = 15) + theme_bw() + ggtitle( "EMD distances for Bodenmiller2012 dataset", subtitle = "on the fly memory loading and parallel computation") pHistLast ``` # Expression matrices as input instead of flowFrames Standard use of `CytoMDS` involves providing cytometry sample data, contained in `flowCore` standard cytometry data structures (`flowSet` and `flowFrame`) as input to Earth Mover's distance calculation. However, `CytoMDS` can also accept more generic input data, as a list of expression matrices, stored in standard R matrices. For instance, this allows using `CytoMDS` visualizations for other type of data than cytometry, as long as each sample can be represented by a matrix with any number of rows (cells), and fixed columns/features (markers). Here below is an illustrating toy example with some randomly simulated data. Let us first simulate 10 samples with 1,000 rows each and 10 features. Samples are split into 2 groups: `conditionA` and `conditionB` with some increased expression std deviation for 3 features in `conditionB`, compared to `conditionA`. ```{r simulExprMatrices} nSample <- 10 conditions <- factor(c(rep("conditionA", 5), rep("conditionB", 5))) nRow <- 1000 nFeat <- 10 nDiffFeat <- 3 diffFeats = c(2, 3, 9) stdDevFactor = 1.5 exprMatrixList <- mapply( seq(nSample), conditions, FUN = function(i, condition) { exprMatrix <- matrix(rnorm(nRow*nFeat), nrow = nRow) if (condition == "conditionB") { exprMatrix[, diffFeats] <- exprMatrix[, diffFeats] * stdDevFactor } colnames(exprMatrix) <- paste0("Feat", seq(nFeat)) exprMatrix }, SIMPLIFY = FALSE ) names(exprMatrixList) <- paste0("sample", seq(nSample)) ``` We can now generate a pairwise sample distance matrix, based on the previously simulated matrix list, and show the corresponding histogram: ```{r distExprMatrices} pwDistExpr <- pairwiseEMDDist( exprMatrixList ) ``` ```{r distExprMatShowHist, fig.align='center', fig.wide = TRUE} pwDistExprmat <- as.matrix(pwDistExpr) distVecExpr <- pwDistExprmat[upper.tri(pwDistExprmat)] distVecDFExpr <- data.frame(dist = distVecExpr) pHistExpr <- ggplot(distVecDFExpr, mapping = aes(x=dist)) + geom_histogram(fill = "darkgrey", col = "black", bins = 15) + theme_bw() + ggtitle( "EMD distances for simulated expression matrices") pHistExpr ``` With the generated pairwise distance matrix, we can now proceed with the `CytoMDS` workflow, as in the previous examples: ```{r MDSCalcExprMat} mdsObjExpr <- computeMetricMDS(pwDistExpr, seed = 0) show(mdsObjExpr) ``` ```{r plotMDSExprMat, fig.align='center', fig.asp = 0.9, fig.wide = TRUE} phenoDataExpr <- data.frame(sampleId = seq(nSample), cond = conditions) p12 <- ggplotSampleMDS( mdsObjExpr, pData = phenoDataExpr, projectionAxes = c(1,2), pDataForColour = "cond", pDataForLabel = "sampleId" ) p12 ``` ```{r biplotsMDSExprMat, fig.align='center', fig.asp = 0.9, fig.wide = TRUE} statFunctions <- list(mean = base::mean, std_dev = stats::sd) chStatsExpr <- channelSummaryStats(exprMatrixList, statFUNs = statFunctions) p <- CytoMDS::ggplotSampleMDSWrapBiplots( mdsObj = mdsObjExpr, extVariableList = chStatsExpr, pData = phenoDataExpr, projectionAxes = c(1,2), pDataForColour = "cond", displayPointLabels = FALSE, repelArrowLabels = TRUE, ncol = 2, arrowThreshold = 0.9 ) p ``` # Session information {-} ```{r sessioninfo, echo=FALSE} sessionInfo() ``` # References {-}