%
%\VignetteIndexEntry{FlowSOM}
%\VignetteDepends{FlowSOM}
%\VignetteKeywords{}
%\VignettePackage{FlowSOM}
%

\documentclass[english]{article}
\usepackage[T1]{fontenc}
\usepackage[latin9]{inputenc}
\usepackage{babel}
<<style-Sweave, eval=TRUE, echo=FALSE, results=tex>>=
BiocStyle::latex()
@ 

\begin{document}
\SweaveOpts{concordance=TRUE}
\begin{center}
\bioctitle[FlowSOM]{Using self-organizing maps for visualization and 
interpretation of cytometry data}
\par\end{center}{\Large \par}

\begin{center}
Katrien Quintelier, Artuur Couckuyt, Annelies Emmaneel, 
Sofie Van Gassen, Yvan Saeys
\par\end{center}

\begin{center}
Ghent University
\par\end{center}

\begin{center}
{\footnotesize September, 2014\bigskip{}
\bigskip{}
}
\par\end{center}{\footnotesize \par}

\begin{center}
\textbf{Abstract\bigskip{}
}
\par\end{center}

This vignette describes a protocol for analyzing high-dimensional cytometry 
data using \Biocpkg{FlowSOM}, a clustering and visualization algorithm based 
on a self-organizing map (SOM). FlowSOM is used to distinguish cell populations
from cytometry data in an unsupervised way and can help to gain deeper insights
in fields such as immunology and oncology. 

\bigskip{}
\bigskip{}

\section{Installation}

<<>>==
# install.packages("BiocManager")
# BiocManager::install("FlowSOM")
@

\section{Preprocessing}
FlowSOM handles different inputs, such as a flowFrame, a flowSet or an array of 
filepaths. For this purpose we will make use of a flowFrame. This allows
easier preprocessing. We start with compensating the data and then we transform 
the data with the logicle function. For CyTOF data an arcsinh transformation is 
preferred which is also found in the \Biocpkg{flowCore} package. Besides 
compensation and transformation, we also recommend cleaning the data by removing
margin events and by using cleaning algorithms.

<<>>=
fileName <- system.file("extdata", "68983.fcs", package="FlowSOM")
ff <- flowCore::read.FCS(fileName)

# Compensation
comp <- flowCore::keyword(ff)[["SPILL"]]
ff <- flowWorkspace::compensate(ff, comp)

# Transformation
transformList <- flowCore::estimateLogicle(ff, channels = colnames(comp))
ff <- flowWorkspace::transform(ff, transformList)
@


\section{The easy way}
The easiest way to use this package is using the wrapper function 
\Rfunction{FlowSOM}. It has less options than using the separate functions, 
but in general it has enough power. It returns a list, of which the first
item is the FlowSOM object (as required as input by many functions in this 
package) and the second item is the result of the metaclustering.

<<>>=
set.seed(42)
library(FlowSOM)

fSOM <- FlowSOM(ff,
                # Input options:
                compensate = FALSE, 
                transform = FALSE,
                scale = FALSE,
                # SOM options:
                colsToUse = c(9, 12, 14:18), xdim = 7, ydim = 7,
                # Metaclustering options:
                nClus = 10)
@

<<fig=TRUE>>=
p <- PlotStars(fSOM, backgroundValues = fSOM$metaclustering)
print(p, newpage = FALSE)
@

A pdf file with most relevant figures can be generated using the FlowSOMmary 
function.

<<>>=
FlowSOMmary(fsom = fSOM,
            plotFile = "FlowSOMmary.pdf")
@

From the resulting object you can get the cluster and metacluster label for
every individual cell.
<<>>=
head(GetClusters(fSOM))
head(GetMetaclusters(fSOM))
@

\section{Reading the data}

The FlowSOM package has several input options.

The first possibility is to use an array of character strings,
specifying paths to files or directories. When given a path to a
directory, all files in the directory will be considered. This process
does not happen recursively. You can specify a pattern to use only a
selection of the files. The default pattern is \Rcode{".fcs"}, making
sure that only fcs-files are selected.  When you are already working
with your data in \R{}, it might be easier to use a \Rclass{flowFrame}
or \Rclass{flowSet} from the \Biocpkg{flowCore} package as input. This
is also supported.  If multiple paths or a \Rclass{flowSet} are
provided, all data will be concatenated. You should check and apply 
normalization if needed using other packages.

When reading the data, several preprocessing options are available. The data 
can be automatically compensated using a specified matrix, or using the 
\Rcode{\$SPILL} variable from the fcs-file (when \Rcode{compensate} is 
\Rcode{TRUE} but no value is given for \Rcode{spillover}). 
The data can be transformed for specified columns. If no columns are 
provided, all columns from the spillover matrix will be transformed.
Finally, the data can be scaled. By default, it will scale to a mean of zero 
and standard deviation of one. However, specific scaling parameters can be set 
(see the base \R{} \Rfunction{scale} function for more detail).

\medskip{}

\noindent 
<<>>=
set.seed(42)
library(flowCore)
library(FlowSOM)

fileName <- system.file("extdata", "68983.fcs", package = "FlowSOM")
fSOM <- ReadInput(fileName, compensate = TRUE, transform = TRUE, 
                    toTransform = c(8:18), scale = TRUE)

ff <- suppressWarnings(flowCore::read.FCS(fileName))
fSOM <- ReadInput(ff, compensate = TRUE, transform = TRUE,
                    toTransform = c(8:18), scale = TRUE)
@

\noindent \medskip{}

This function returns a FlowSOM object, which is actually a \Rclass{list} 
containing several parameters. The data is stored as a matrix in 
\Rcode{\$data}, and all parameter settings to read the data are also stored. 
The begin and end indices of the subsets from the different files can be 
found in \Rcode{\$metadata}.

<<>>=
str(fSOM, max.level = 2)
@

\section{Building the self-organizing map}

The next step in the algorithm is to build a self-organizing map. Several
parameters for the self-organizing map algorithm can be provided, such as the
dimensions of the grid, the learning rate, the number of times the dataset has
to be presented. However, the most important parameter to decide is on which
columns the self-organizing map should be trained. This should contain all the
parameters that are useful to identify cell types, and exclude parameters of
which you want to study the behavior on all cell types such as activation
markers.

The \Rfunction{BuildSOM} function expects a FlowSOM object as input, and will
return a FlowSOM object with all information about the self organizing map
added in the \Rcode{\$map} parameter of the FlowSOM object.
\medskip{}
<<>>=
fSOM <- BuildSOM(fSOM, colsToUse = c(9, 12, 14:18))
str(fSOM$map, max.level = 2)
@

\section{Building the minimal spanning tree}

The third step of FlowSOM is to build the minimal spanning tree. 
This will again return a FlowSOM object, with extra information contained in
the \Rcode{\$MST} parameter.

\medskip{}
<<>>=
fSOM <- BuildMST(fSOM)
str(fSOM$MST, max.level =  1)
@

Once this step is finished, the FlowSOM object can be used for visualization.
You can plot the nodes in several layouts ("MST": Minimal spanning tree
(default),"grid": SOM grid, "matrix": alternative layout. You can make use of 
\Rcode{PlotFlowSOM} or you can use the \Rcode{PlotStars} function. 
\medskip{}
<<fig=TRUE>>=
PlotStars(fSOM)
@

<<fig=TRUE>>=
PlotStars(fSOM, view = "grid")
@

<<fig=TRUE>>=
set.seed(1)
tsne <- Rtsne::Rtsne(fSOM$map$codes, perplexity = 6)
PlotStars(fSOM, view = tsne$Y, maxNodeSize = 2)
@

If you do not want the size to depend on the number of cells assigned to a 
node, you can use the parameter \Rcode{equalNodeSize = T} in every plotting
function.
<<fig=TRUE>>=
p <- PlotStars(fSOM, equalNodeSize = TRUE)
print(p, newpage = FALSE)
@


It might also be interesting to compare with a manual gating. The 
\Rcode{cellTypes} can be any factor which has a value for each individual
cell, so you can also map other clustering results.
<<>>=
wsp_file <- system.file("extdata", "gating.wsp", package = "FlowSOM")

# Specify the cell types of interest for assigning one label per cell
cell_types <- c("B cells",
                "gd T cells", "CD4 T cells", "CD8 T cells",
                "NK cells","NK T cells")

# Parse the FlowJo workspace
gatingResult <- GetFlowJoLabels(fileName, 
                                wsp_file,
                                cell_types = cell_types)
@

<<fig=TRUE>>=
PlotPies(fSOM, cellTypes = gatingResult$manual)
@

If you are interested in one specific marker, you can use the 
\Rcode{PlotMarker} function.

<<fig=TRUE>>=
p <- PlotMarker(fSOM, "PE-Cy5-A")
print(p, newpage = FALSE)
@
<<fig=TRUE>>=
p <- PlotMarker(fSOM, "CD19")
print(p, newpage = FALSE)
@

If you need to refer to the nodes, it might be useful to number them.
<<fig=TRUE>>=
PlotNumbers(fSOM)
@

You can use this number for a 2D scatter plot indicating the node values.
<<fig=TRUE>>=
plot <- Plot2DScatters(fSOM, 
                       channelpairs = list(c("PE-Texas Red-A", "Pacific Blue-A")),
                       clusters = list(c(81, 82, 91, 92, 93)),
                       plotFile = NULL)
print(plot[[1]])
@


\section{Meta-clustering the data}

The fourth step of the FlowSOM algorithm is to perform a meta-clustering of
the data. This can be the first step in further analysis of the data, and
often gives a good approximation of manual gating results.

If you have background knowledge about the number of cell types you are
looking for, it might be optimal to provide this number to the algorithm.

<<>>=
metaClustering <- as.character(metaClustering_consensus(fSOM$map$codes,k = 7))
@

<<fig=TRUE>>=
PlotPies(fSOM,
         cellTypes=gatingResult$manual,
         backgroundValues = metaClustering)
@

<<fig=TRUE>>=
PlotLabels(fSOM, labels = metaClustering)
@

You can also extract the meta-clustering for each cell individually
<<>>=
metaClustering_perCell <- GetMetaclusters(fSOM, metaClustering)
table(metaClustering_perCell)
@

\section{Detecting nodes with a specific pattern}
If you do not have a manual gating to map on the tree, it might be time-
consuming to interpret all the different nodes. Therefore, you can also
query the tree to indicate nodes similar to a specified pattern. This function
is still being optimized, so make sure to check the marker values to see if it
corresponds to your expectations.

<<>>=
# Look for CD8+ ab T cells
query <- c("PE-Cy7-A" = "high", #CD3
            "APC-Cy7-A" = "high", #TCRb
            "Pacific Blue-A" = "high") #CD8
query_res <- QueryStarPlot(fSOM, query, equalNodeSize = TRUE, plot = FALSE)

cellTypes <- factor(rep("Unlabeled", fSOM$map$nNodes), 
                    levels=c("Unlabeled", "CD8 T cells"))
cellTypes[query_res$selected] <- "CD8 T cells"
@

<<fig=TRUE>>=
p <- PlotStars(fSOM, backgroundValues=cellTypes, 
               backgroundColor=c("#FFFFFF00","#0000FF"))
print(p, newpage = FALSE)
@

\section{Comparing different groups}
It is possible to compare between groups with the FlowSOM package as well. The tree should be build on either a concatenation of all files, or a representative subset of all cell types. Then a list identifying which files belong to specific groups should be defined, and the differences will be computed. For a smaller number of samples, you can look at the fold change between the groups. For coloring, a threshold is used. A threshold of 0.50 means the difference should be at least 50\% of the max of both groups, which corresponds with a 2-fold change. The higher the threshold, the stricter, a threshold of 0 will color each node. For a larger number of samples you can also use a wilcox test. This will be selected when a value is provided for the p\_thresh parameter.

<<>>=
library(FlowSOM)
set.seed(1)

# Build FlowSom result
 fileName <- system.file("extdata", "68983.fcs", package = "FlowSOM")
 ff <- flowCore::read.FCS(fileName)
 ff <- flowCore::compensate(ff, ff@description$SPILL)
 ff <- flowCore::transform(ff,
           flowCore::transformList(colnames(ff@description$SPILL),
                                     flowCore::logicleTransform()))
 flowSOM.res <- FlowSOM(ff, scale = TRUE, colsToUse = c(9, 12, 14:18),
                          nClus = 10)
   
# Create new data
# To illustrate the output, we here generate new fcs files (with more 
# cells in metaclusters 1 and 9).
# In practice you would not generate any new file but use your different
# files from your different groups
 for(i in 1:5){
   flowCore::write.FCS(ff[sample(nrow(ff), 1000), ], 
                       file = paste0("ff_tmp", i, ".fcs"))  
 }
 for(i in 6:10){
    flowCore::write.FCS(ff[c(sample(nrow(ff),500),
                          sample(which(GetMetaclusters(flowSOM.res) == 1), 250),
                          sample(which(GetMetaclusters(flowSOM.res) == 9), 250)), ], 
                     file = paste0("ff_tmp", i, ".fcs"))
 }                  
 
# Get the count matrix
 percentages <- GetFeatures(fsom = flowSOM.res, 
                            files = paste0("ff_tmp",1:10,".fcs"), 
                            type = "percentages")
  
# Perform the statistics
groups <- list("Group 1" = paste0("ff_tmp", 1:5, ".fcs"), 
               "Group 2" = paste0("ff_tmp", 6:10, ".fcs"))
MC_stats <- GroupStats(percentages[["metacluster_percentages"]], groups)
C_stats <- GroupStats(percentages[["cluster_percentages"]], groups)

# Process the fold changes vector
fold_changes <- C_stats["fold changes", ]
fold_changes <- factor(ifelse(fold_changes < -3, "Underrepresented compared to Group 1",
                       ifelse(fold_changes > 3, "Overrepresented compared to Group 1",
                       "--")), levels = c("--", "Underrepresented compared to Group 1",
                       "Overrepresented compared to Group 1"))
fold_changes[is.na(fold_changes)] <- "--"
@

<<fig=TRUE>>=
# Show in figure
## Fold change
gr_1 <- PlotStars(flowSOM.res, title = "Group 1", 
                  nodeSizes = C_stats["medians Group 1", ], 
                  refNodeSize = max(C_stats[c("medians Group 1", "medians Group 2"),]),
                  backgroundValues = fold_changes,
                  backgroundColors = c("white", "red", "blue"), 
                  list_insteadof_ggarrange = TRUE)
gr_2 <- PlotStars(flowSOM.res, title = "Group 2", 
                  nodeSizes = C_stats["medians Group 2", ], 
                  refNodeSize = max(C_stats[c("medians Group 1", "medians Group 2"),]),
                  backgroundValues = fold_changes,
                  backgroundColors = c("white", "red", "blue"), 
                  list_insteadof_ggarrange = TRUE)
p <- ggpubr::ggarrange(plotlist = list(gr_1$tree, gr_2$tree, 
                                       gr_2$starLegend, gr_2$backgroundLegend), 
                       ncol = 2, nrow = 2,
                       heights = c(4, 1))
print(p, newpage = FALSE)
@

<<fig=TRUE>>=
## p values
p <- PlotVariable(flowSOM.res, 
                  title = "Fold change group 1 vs. group 2",
                  variable = C_stats["fold changes", ])

print(p, newpage = FALSE)
@

<<fig=TRUE>>=
## volcano plot
p <- ggplot2::ggplot(data.frame("-log10 p values" = c(C_stats[4, ], 
                                                      MC_stats[4, ]), 
                                "log10 fold changes" = c(C_stats[7, ], 
                                                         MC_stats[7, ]), 
                                "feature" = c(colnames(C_stats), colnames(MC_stats)),
                                "metacluster" = c(as.character(flowSOM.res$metaclustering),
                                                  levels(flowSOM.res$metaclustering)),
                                "type" = c(rep("C", ncol(C_stats)),
                                           rep("MC", ncol(MC_stats))),
                                check.names = FALSE), 
                     ggplot2::aes(x = `log10 fold changes`, 
                                  y = `-log10 p values`,
                                  size = type,
                                  col = metacluster)) +
  ggplot2::xlim(-3, 3) +
  ggplot2::ylim(0, 3) +
  ggplot2::geom_point() +
  ggplot2::theme_minimal() +
  ggplot2::geom_vline(xintercept = 0, col = "darkgrey") +
  ggplot2::geom_hline(yintercept = -log10(0.05), col = "darkgrey") +
  ggplot2::scale_size_manual(values =c("C" = 1, "MC" = 2)) 

print(p, newpage = FALSE)
@

\section{Summary}

In summary, the FlowSOM package provides some new ways to look at cytometry
data.
It can help to keep an overview of how all markers are behaving on different
cell types, and to reduce the probability of overlooking interesting things
that are present in the data.
\end{document}

<<>>=
sessionInfo()
@