%\VignetteIndexEntry{Mapping cell populations in flow cytometry data flowMap-FR}
%\VignetteDepends{flowMap}
%\VignetteEngine{knitr::knitr}

% To compile this document
% library('knitr'); rm(list=ls()); knit('flowMap.Rnw')


\documentclass[12pt]{article}

<<knitr, echo=FALSE, results="hide">>=
library("knitr")
opts_chunk$set(tidy=FALSE,dev="png",fig.show="hide",
               fig.width=10,fig.height=8,
               message=FALSE)
@ 


<<style-knitr, eval=TRUE, echo=FALSE, results="asis">>=
BiocStyle::latex()
@


<<loadpackage, echo=FALSE>>=
# in order to print version number below
library("flowMap")
@

\author{Chiaowen Joyce Hsiao$^{1*}$, Yu Qian$^{2}$, Richard H. Scheuermann$^{2}$ \\[1em] \small{$^{1}$ Center for Bioinformatics and Computational Biology (CBCB) and} \\ \small{and Applied Mathematics and Scientific Computation,} \\ \small{University of Maryland, College Park, US;} \\ \small{$^{2}$ Department of Informatics, J. Craig Venter Institute (JCVI), San Diego, US} \\ \small{\texttt{$^*$joyce.hsiao1 (at) gmail.com}} }


\title{User's guide to flowMap}

\begin{document}

\maketitle

\begin{abstract}
  \Biocpkg{flowMap} is a stand-alone software for mapping cell populations in multiple flow cytometry (FCM) data. A core issue in cell population matching is the ability of the algorithm to accurately quantify similarity between cell populations differing in proportion, size, and levels of expression markers. Our algorithm implements the Friedman-Rafskty (FR) statistic to compare similarity between cell populations across multiple flow cytometry samples. Users can generate a matrix of FR statistic, which can then be used to determine cell population groups across samples. The method can be incorporated in any standard flow cytometry sample processing pipeline at any stage where comparison of cell populations is required. We demonstrated the ability of the FR statistic in scenaiors of biological and technical differences between flow cytometry samples in C. Hsiao, M. Liu, R. Stanton, M. McGee, Y. Qian, and R. H. Scheuermann: Mapping cell populations in flow cytometry data for cross-sample comparison using the Friedman-Rafsky Test (2014) \cite{flowMap}. 
\end{abstract}


<<options, results='hide', echo=FALSE>>=
options(digits=3, width=80, prompt=" ", continue=" ")
@

\newpage

\tableofcontents


\section{Introduction}
In this vignette, we show how to use \Biocpkg{flowMap} to compare cell populations cell populations across flow cytometry samples, and to visualize the results. 

\section{Data prepartion}
\Biocpkg{flowMap} input accepts any flow cytometry sample files with identified cell populations. Thus, users can use \Biocpkg{flowMap} at any step of a FCM workflow to quantify similarity of cell populations. In order to use flowMap as a downstream analysis tool to compare phenotypes, the input FCM sample files need to have been preprocessed for debris filtering, transformation, and marker expression alignment.\\ 

Each FCM sample input needs to be in the matrix form of numeric values. Rows correspond to the events (cells) and the columns correspond to marker expression measurements. The last column of the sample matrix is usually named as \Robject{id}, serving as cell population membership index containing numeric values.\\

\subsection{Example data} Here's a flow cytometry sample in \emph{txt} format. There are 9 cell populations identified in the sample from 20,000 events measured in 4 feature markers (CD14,CD23,CD3,CD19,id). \Robject{id} index the cell population membership of each event from 1, 2, 3 through 9.

<<dataPrepExample, echo=TRUE>>=
sam1 <- read.table(system.file("extdata/sample.txt",package="flowMap"),
                   header=T)
str(sam1)
table(sam1$id)
@


\section{FR statistic to quantify similarity between cell populations}

The Friedman-Rafsky statistic is based on the minimum spanning tree (MST) algorithm which computes the extent to which the two cell populations overlap in their shared feature space \cite{FriedmanRafsky}. Because the runtime of the MST algorithm is quadratic in the number of events, we devised a downsampling scheme to estimate the FR statistics in a single cell population pair comparison. First, for any cell population pair comparison, the events are combined to form \Rcode{pooled data}. Next, samples containing the same number of events (default: 200) are taken from the pooled data. Each event in the pooled data may be sampled more than once. Key idea is to maintain a constant ratio of events from the two cell populations across samples. In each of the random samples, a MST is identified, followed by the FR statistic computation. The estimated FR statistic for each cell population pair comparison is based on the median of the FR statistics across the random samples.\\

Below are examples of MST finding when comparing two FCM samples. \emph{Sample 1} and \emph{Sample 2} each contains 9 cell populations.\\

<<displayData>>==
sam1 <- read.table(system.file("extdata/sample.txt",package="flowMap"),
                   header=T)
sam2 <- read.table(system.file("extdata/sample.txt",package="flowMap"),
                   header=T)
table(sam1$id)
table(sam2$id)
@


\subsection{Visualize the minimum spanning tree} 

\subsubsection{Example 1}
When events with different cell population membership are distant from each other, or in other words, events with the same membership congregate, the FR statistic determines the two cell populations to be dissimilar from each other. CP1 from Sample 1 is compared with CP3 from Sample 2 in the MST below. 100 events is sampled from the pooled data combinig events from the two cell populations, with the ratio of the cell population membership kept the same as that in the pooled data before sampling.

<<makeplotSelfruns1,dev='pdf',echo=TRUE,warning=FALSE>>==
mat1 = sam1[sam1$id==1,]
mat2 = sam2[sam2$id==3,]

# combine events from the two cell populations to 
# make pooled data
mat = rbind(mat1,mat2)

# sample 100 events from the pooled data
sampleSize = 100

# among the 100 events, sample events from the two cell populations
# such that the ratio of the cell population membership is the same
# as that in the pooled data
nn1 = round(sampleSize*table(mat$id)[1]/nrow(mat))
nn2 = round(sampleSize*table(mat$id)[2]/nrow(mat))
submat = rbind(mat1[sample(nrow(mat1),nn1),],mat2[sample(nrow(mat2),nn2),])
colnames(submat)[5] = "sam"

# plot MST of the 100 events
g1 = makeFRMST(submat)
par(mar=c(0,0,0,0))
plot(g1$g,vertex.label.cex=0.01,
     layout=layout.fruchterman.reingold(g1$g))
@

\incfig[h]{figure/makeplotSelfruns1}{.5\textwidth}{Example 1: MST of sample events from two dissimilar cell populations}{Blue nodes belong to Sample 1 CP1, and red nodes belong to Sample 2 CP 3.}


\subsubsection{Example 2}
This is another example of two cell populations being dissimilar from each other. Events in the pooled data tend to congregate on the tree with events of the same cell population membership.

<<makeplotSelfruns2,dev='pdf',echo=TRUE,warning=FALSE>>==
mat1 = sam1[sam1$id==4,]; mat2 = sam2[sam2$id==5,]
mat = rbind(mat1,mat2)
sampleSize = 100
nn1 = round(sampleSize*table(mat$id)[1]/nrow(mat))
nn2 = round(sampleSize*table(mat$id)[2]/nrow(mat))
submat = rbind(mat1[sample(nrow(mat1),nn1),],mat2[sample(nrow(mat2),nn2),])
colnames(submat)[5] = "sam"
g1 = makeFRMST(submat)
par(mar=c(0,0,0,0))
plot(g1$g,vertex.label.cex=0.01,layout=layout.fruchterman.reingold(g1$g))
@

\incfig[h]{figure/makeplotSelfruns2}{.5\textwidth}{Example 2: MST of sample events from two dissimilar cell populations}{Blue nodes belong to sample 1 CP4, and red nodes belong to sample 2 CP5.}


\newpage

\subsubsection{Example 3}
When two cell populations share a similar feature space, events of different cell population membership tend to distribute evenly on the tree. Below is an example of Sample 1 CP6 compared with Sample 2 CP6.

<<makeplotSelfruns3,dev='pdf',echo=TRUE,warning=FALSE>>==
mat1 = sam1[sam1$id==6,]
mat2 = sam2[sam2$id==6,]
mat1$id=1; mat2$id=2
mat = rbind(mat1,mat2)
sampleSize = 100
nn1 = round(sampleSize*table(mat$id)[1]/nrow(mat))
nn2 = round(sampleSize*table(mat$id)[2]/nrow(mat))
submat = rbind(mat1[sample(nrow(mat1),nn1),],mat2[sample(nrow(mat2),nn2),])
colnames(submat)[5] = "sam"
g1 = makeFRMST(submat)
par(mar=c(0,0,0,0))
plot(g1$g,vertex.label.cex=0.01,layout=layout.fruchterman.reingold(g1$g))
@

\incfig[h]{figure/makeplotSelfruns3}{.5\textwidth}{Example 3: MST of sample events from two similar cell populations}{Blue nodes belong to sample 1 CP6, and red nodes belong to sample 2 CP6.}


\section{Mapping cell populations across FCM samples}

\Biocpkg{flowMap} directly computes the similarity betweeen cell populations across FCM samples and provides results in a table format. As a proof-of-concept, we compared a FCM sample against itself.

\subsection{Import data}
The example data contains 9 identified cell populations.
<<displayData2>>==
sam1 <- read.table(system.file("extdata/sample.txt"
            ,package="flowMap"),header=T)
sam2 <- read.table(system.file("extdata/sample.txt"
            ,package="flowMap"),header=T)
table(sam1$id)
table(sam2$id)
@


\subsection{Parameter setting in FR statistic computation}
In order to optimize runtime, we devised a downsampling scheme to estimate FR statistics. The central idea is to draw random samples from the pooled data combining events in a cell population comparison and to estimate the FR statistics of the pooled data from the random samples. Parameters required are: number of random samples (\Robject{ndraws}; Default: 200), size of each random sample (\Robject{sampleSize}; Default: 200), sampling method (\Robject{sampleMethod}; Default: proportional), and number of processing cores (\Robject{ncores}; Default; maximum number of cores available in the computing environment). The parallel computing function in \Biocpkg{flowMap} is built upon the \Biocpkg{doParallel} package.\\

The number of random samples and the size of each random samples determine the precision (variability of the FR statistics across random samples) and the accuracy (deviation of the sample FR statistics from the true FR statistic) of the estimated FR statistic when comparing any two cell populations. As illustrated in Hsiao et al., (2014), the ranks of cell population pairs remain the same when increasing the number of random samples to 500 or when increasing the size of the random sample to 500 events. Users are advised to use the default paramter setting when mapping cell populations when mapping cell populations.\\ 


\subsection{Compute FR statistics}
To compare any two FCM samples, users can use \Rfunction{getFRest} to obtain a matrix of estimated FR statistics comparing any two cell populations across samples. Rows and columns in the result matrix corresponds to the first and the second input sample in the \Rfunction{getFRest} function. For example, the (2,1) entry in the result matrix corresponds to the FR statistic comparing Sample 1 CP2 and Sample 2 CP1. The (4,3) entry in the result matrix corresponds to the FR statistic comparing Sample 1 CP4 and Sample 2 CP3.

<<compareSampleSelf,dev='pdf',warning=FALSE>>=
res1 = getFRest(sam1,sam2,sampleMethod="proportional",sampleSize=100,
                ndraws=100,estStat="median",ncores=NULL)
res1@ww
library(gplots)
par(mar=c(0,0,0,0))
heatmapCols <- colorRampPalette(c("red","yellow","white","blue"))(50)
heatmap.2(res1@ww,trace="none",col=heatmapCols,symm=FALSE,dendrogram="none",
          Rowv=FALSE,Colv=FALSE,xlab="Sample 2",ylab="Sample 1")
@

\incfig[h]{figure/compareSampleSelf}{.5\textwidth}{Heatmap comparing cell populations between Sample 1 and Sample 2}{The matched populations (in the diagonal) are quantified with significantly smaller FR statistic values than the mismatched populations (correspond to off-diagonal entries). The (\emph{i},\emph{j}) cell contains the FR statistic of comparing the Sample 1 \emph{i}-th cell population with Sample 2 \emph{j}-th cell population}


Users can also extract p-values of the FR statistics in the slot \Robject{pNorm}. Heatmap of p-values shows that the p-values of the matched cell population pairs (diagonal entries) are clearly separated from the p-values of the mismatched cell population pairs (off-diagonal entries). 

<<plotSelfpval,dev="pdf",echo=TRUE>>==
library(gplots)
par(mar=c(0,0,0,0))
heatmapCols <- colorRampPalette(c("red","yellow","white","blue"))(50)
heatmap.2(res1@pNorm,trace="none",col=heatmapCols,symm=FALSE,dendrogram="none",
          Rowv=FALSE,Colv=FALSE,xlab="Sample 2",ylab="Sample 1")
@
\incfig[h]{figure/plotSelfpval}{.5\textwidth}{Heatmap of the FR statistics p-values comparing Sample 1 with Sample 2}{The p-values of the matched populations (in the diagonal) are significantly larger than the mismatched cell populations.}


<<plotSelfpvalhist,dev="pdf",echo=TRUE>>=
hist(res1@pNorm,xlab="log10 p-value histogram",main="")
@

\incfig[h]{figure/plotSelfpvalhist}{.5\textwidth}{Histogram of the FR statistics p-values mapping cell populations between Sample 1 with Sample 2}{The bimodal distribution of the log10 p-values indicates a clear gap beteween p-values of matched cell population pairs versus p-values of mistmatched cell population pairs. The matched pairs have significantly larger p-values (\emph{p}>.4) than the mismatched pairs (\emph{p}<.01).}


\subsection{Generate a multi-sample similarity matrix for clustering}

\Rfunction(makeDistmat) generates a complete similarity matrix containing FR statistics that is useful for multiple sample comparisons. Using the above Sample 1 versus Sample 2 example, user can obtain a 18-by-18 matrix for hierarchical clustering.

<<plotMultipval,dev="pdf",echo=TRUE>>==
resMulti = makeDistmat(samples=list(sam1,sam2),sampleSize=100,ndraws=100)
require(gplots)
par(mar=c(0,0,0,0))
heatmapCols <- colorRampPalette(c("red","yellow","white","blue"))(50)
heatmap.2(resMulti$distmat,trace="none",col=heatmapCols,symm=FALSE,dendrogram="none",
          Rowv=FALSE,Colv=FALSE)
@
\incfig[h]{figure/plotMultipval}{.5\textwidth}{Heatmap of the Sample 1 versus Sample 2 comparison similarity matrix}{Row and column labels 1.x and 2.y denote Sample 1 cell populations and Sample 2 cell populations, respectively.}


\newpage

\section{SessionInfo}
The last part of this vignette calls for the function \Rfunction{sessionInfo}, which reports the computing environment in the session, including the R version number and all the packages used. Users should check their computing environment for consistency with this document as a first step in evaluating the errors/issues while using the \Biocpkg{flowMap}.

<<sessionInfo,results="asis",echo=FALSE, eval=TRUE>>=
toLatex(sessionInfo())
@

<<resetOptions, results='hide', echo=FALSE>>=
options(prompt="> ", continue="+ ")
@


<<closeSockets,echo=FALSE>>=
closeSockets <- function() {
    allCon <- showConnections()
    socketCon <- as.integer(rownames(allCon)[allCon[, "class"] == "sockconn"])
    sapply(socketCon, function(ii) close.connection(getConnection(ii)) )
}
closeSockets()
@

\bibliography{flowMap}

\end{document}