%\VignetteIndexEntry{CONFESS} %\VignettePackage{BiocStyle} %\VignetteEngine{knitr::knitr} \documentclass{article} \usepackage[T1]{fontenc} <>= BiocStyle::latex() @ \title{CONFESS v1.0} \author{Diana HP Low\thanks{dlow at imcb.a-star.edu.sg}, Efthymios Motakis\thanks{efthymios.motakis at riken.jp} \\ Institute of Molecular and Cell Biology, (A*STAR), Singapore \\RIKEN Yokohama, Japan} \begin{document} \maketitle \tableofcontents{} \newpage \section{Preliminaries} \Biocpkg{CONFESS} is a customized cell detection and signal estimation model for images coming from the Fluidigm C1 system. Applied to the HeLa \Biocpkg{CONFESSdata} dataset, our method estimated the cell cycle phase of hundreds of samples from their fluorescence signals and enabled us to study the spatio-temporal dynamics of the HeLa cell cycle. \subsection{Loading the packages} To load the \Biocpkg{CONFESS} package: <>= library(CONFESS) @ \subsection{Data pre-processing} The sample set of 14 raw C01 images are available in the \Biocpkg{CONFESSdata} package on Bioconductor. Alternatively, the complete set of 378 images can be downloaded \href{http://single-cell.clst.riken.jp/bluk\_download/cell\_images/}{here}. They include images for each of the following sets: \begin{itemize} \item Bright Field (BF) images \href{http://single-cell.clst.riken.jp/bluk\_download/cell\_images/RawSC\_BF.zip}{RawSC\_BF.zip} \item Red and Green Channels (Ch) images \href{http://single-cell.clst.riken.jp/bluk\_download/cell\_images/RawSC\_red\_green.zip}{RawSC\_red\_green.zip} \end{itemize} \Biocpkg{CONFESS} can take as input raw BMP or JPEG image files, or text-converted files. C01 files can be coverted with an external program like \href{http://fiji.sc/Fiji}{ImageJ Fiji}. To do the conversion with Fiji, go to ImageJ (Fiji): Process --> Batch --> Convert, with the following options: "Output format"=Text Image. "Interpolation"=Bilinear and "scale factor"=1. Select the option "Read Images using Bio-Formats" and "convert". You should get all txt files. \section{Fluorescence estimation} \subsection{Reading in image/text files} The function \Rfunction{readFiles} reads the image/text filenames. If image data is used, all images should be in a single directory referenced by iDirectory. \Rcode{BFdirectory} and \Rcode{CHdirectory} would then reference the output directory for the Bright Field and Channel images. If the image files have been converted to text, \Rcode{iDirectory} can be left empty, and \Rcode{BFdirectory} and \Rcode{CHdirectory} will now point to the input text files folders. These data should be stored in two different folders. This function will also report (and discard) any inconsistencies in the files being read (eg. BF file present but missing Red/Green channel). \Rcode{separator} separates the image type (BF and channel characteristic types defined in \Rcode{image.type}) from the rest of the sample name ID (consisting of the run ID and the well ID). A typical example of a sample name that is separated by "\_" from the image type is the .C01 Bright Field image 1772-062-248\_A01\_BF.C01. String 1772-062-248\_A01 is the joined Run and Well ID (also separated by "\_"). At this function though \Rcode{separator} refers to the one separating 1772-062-248\_A01.C01 and BF.C01 strings. In this example, we read text-converted files available in the \Biocpkg{CONFESSdata} dataset. <>= data_path<-system.file("extdata",package="CONFESSdata") files<-readFiles(iDirectory=NULL, BFdirectory=paste(data_path,"/BF",sep=""), CHdirectory=paste(data_path,"/CH",sep=""), separator = "_",image.type = c("BF","Green","Red"), bits=2^16) @ \newpage \subsection{Image spot estimation} To estimate the spots we need to specify a set of parameters. \Rcode{correctionAlgorithm} should be \Rcode{FALSE} in this estimation stage. If the parameter subset is not defined, all files read in with the \Rfunction{readFiles} function will be analysed. \Rcode{foregroundCut} defines a series of cut-offs that separate the spot (a potential cell) from the background. The cut-offs are empirically picked. For this reason, it is often helpful to train the dataset by picking a subset of well-defined, single-spot images and check the algorithm"s performance using different values (e.g. the above vs seq(0.8,0.96,0.02)). In noisy data we have found that low cut-offs produce the best results. If the spot"s fluorescence signal is too weak to be detected or simply the cell is not present, \Biocpkg{CONFESS} will perform capture site recognition (BF modeling) to estimate the pixel coordinates of the spot that, here, it is assumed to have a rectangular shape. The size of the side of this pseudo-spot is defined by \Rcode{BFarea}. The signal is then quantified within this area. Note that only in BF modeling the spot is assumed to have a rectangular shape. Otherwise we do not make any assumptions on the shape and the size of the spot. <>= estimates <- spotEstimator(files=files,foregroundCut=seq(0.6,0.76,0.02), BFarea=7,correctionAlgorithm=FALSE,savePlot="screen") @ \begin{figure}[!htp] \begin{center} \includegraphics[width=6in]{spotEstimator.jpeg} \caption{Example spot from spotEstimator} \end{center} \end{figure} \newpage \subsection{Quality control (identification of outliers)} The next step uses visual and statistical inspection tools for the identification of possible outliers. The spot estimates (of the samples stored in \Robject{estimates}) enter in the function through \Rfunction{defineLocClusters}. The way of processing these data is controlled by out.method whose possible values are: \begin{enumerate} \item \Rcode{interactive.clustering} : estimates concentric circles that mark the outliers (e.g. all dots outside the circle with with a particular radius are outliers) \item \Rcode{interactive.manual} : enables the user to select the outliers manually by point-and-click on the plot. \end{enumerate} The function mainly produces run- and well- specific plots that enable the user to pick outlier locations. We have noticed that the Well IDs exhibit specific directionality (half of them are facing right and the rest left) that affects the position of the capture site. The output integrates the first-step estimates and the quality control. The code below shows the quality control process under \Rcode{interactive.manual}. \Rcode{interactive.clustering} requires at least 15 samples in each run- / well- category (it will exit with an error here). When applied on all data, it enables the user to select outliers by entering a pre-calculated radius in an auto-generated message similar to the one below. <>= clu <- defineLocClusters(LocData=estimates,out.method="interactive.manual") #"Hit Enter to move to the next image or A + Enter to Abort:" @ \begin{figure}[!htp] \begin{center} \includegraphics[width=6in]{defineLocClusters.jpeg} \caption{Example cluster from defineLocClusters} \end{center} \end{figure} Option \Rcode{interactive.manual} enables the user to select suspicious data manually by point-and-click on the plot. The algorithm will select \underline{the closest spot only once} of the location that has been selected by the point-and-click (thus in closely located spots one can click as many times as he wants around to make sure that everything is selected). In Windows the user clicks on any number of spots that could be potential outliers and completes the procedure by clicking on the "Stop" button appearing on the top/left of the console. In Linux/Mac the "Stop" button is replaced by right clicking anywhere on the image. In Rstudio the process stops with the "Esc" button. \textbf{Important Note}: Please follow the pipeline"s instructions on how to stop this process at any time you wish. Closing the plot window may cause a fatal error and abnormal exit. To stop the process simply press A and Enter (Abort) when prompted! \subsection{Re-estimation step (for outliers)} The selected samples now undergo BF modelling using \Rfunction{spotEstimator} with \Rcode{correctionAlgorithm = TRUE}. The potential outliers that had been originally estimated by BF modelling are not re-estimated. They are only kept in a separate slot for manual inspection by our graphical tools. \Rcode{QCdata} contains the output of the quality control step. \Rcode{median.correction=TRUE} instructs the function to shift all locations with outlier BF modeling estimates (more than \Rcode{cutoff=50} pixels away from the bulk of the estimates) to the median of the bulk estimates (denoted as "confidence" in the output table). <>= estimates.2 <- spotEstimator(files=files,subset=clu$Outlier.indices,foregroundCut=seq(0.6,0.76,0.02), correctionAlgorithm=TRUE,QCdata=clu,savePlot="screen") @ \begin{figure}[!htp] \begin{center} \includegraphics[width=5in]{spotEstimator2.jpeg} \caption{Example spot from running spotEstimator for outliers} \end{center} \end{figure} \subsection{Final estimation} The final output is given by \Rfunction{LocationMatrix}. Apart from listing the locations and the signal estimates, it quantifies the cell existence via a set of user defined filters. The possible filters are: \begin{itemize} \item Size: it keeps spots whose size is higher than a cut-off. E.g. \Rcode{filter.by = matrix(c("Size",50),ncol=2)} with the cut-off being 50 pixels. \item Estimation.Type: it keeps spots that have been estimated only by Fluorescence-based estimation (i.e. it excludes BF modeling based estimates). E.g. \Rcode{filter.by = matrix(c("Estimation.Type","Fluorescence-based"),ncol=2)}. \item Pvalue: it keeps spots whose Pvalue for the signal-to-noise ratio (being significant) IN AT LEAST ONE CHANNEL is higher than a cut-off. E.g. \Rcode{filter.by = matrix(c("Pvalue",0.01),ncol=2)} will keep all spots with sufficiently low P-values in at least one channel. \item FDR: keep the spots whose FDR adjusted Pvalue for the signal-to-noise ratio (being significant) IN AT LEAST ONE CHANNEL is higher than a cut-off. E.g. \Rcode{filter.by = matrix(c("FDR",0.01),ncol=2)}. \item StN: keep the spots whose signal-to-noise ratio IN AT LEAST ONE CHANNEL is higher than a cut-off, for example \Rcode{filter.by = matrix(c("StN",1),ncol=2)}. \item Pvalue/(image.type): keep the spots whose Pvalue for the signal-to-noise ratio (being significant) OF A PARTICULAR CHANNEL is higher than a cut-off. The channel ID replaces the "(image.type)" expression. The channel IDs are user defined in \texttt{readFiles()}. E.g. \Rcode{filter.by = matrix(c("Pvalue/Green",0.01),ncol=2)} will keep all spots with sufficiently low P-values at the Green channel. \item FDR/(image.type): keep the spots whose FDR adjusted Pvalue for the signal-to-noise ratio (being significant) OF A PARTICULAR CHANNEL is higher than a cut-off (see above). E.g.\Rcode{filter.by = matrix(c("FDR/Green",0.01),ncol=2)}. \item StN/(image.type): keep the spots whose signal-to-noise ratio OF A PARTICULAR CHANNEL is higher than a cut-off (see above). E.g. \Rcode{filter.by = matrix(c("StN/Green",1),ncol=2)}. \item Out.Index: keep the spots that are estimated with confidence (remove outliers). E.g.\Rcode{filter.by = matrix(c("Out.Index","confidence"),ncol=2)}. \item Other.Spots: keep the spots of high quality images, i.e. images where the algorithm does not find any unmatched spots (across channels) or the number of matched spots is lower than a cut-off. E.g.\Rcode{filter.by = matrix(c("Other.Spots",0),ncol=2)} will keep the images that contain only matched spots across channels. \end{itemize} A typical example of filtering with more than one filter is the following: <>= Results <- LocationMatrix(data=estimates.2, filter.by = matrix(c("FDR","Out.Index",0.005,"confidence"),ncol=2)) Results$Output[1:3,] @ This will keep all "confident" spots with FDR adjusted P-value \textless 0.005 (in any of the channels). The filters allow the "spot" to "cell" transition by quantifying certain properties that only cells should have. The output of \Rfunction{LocationMatrix} should always be manually curated via inspection of the \Rfunction{spotEstimator} plots. The output table summarizes the estimates of the image analysis step. Most of the columns are self-explained. "fore" corresponds to foreground (spot) and "back" to the background signal. "Other.Spots" indicates whether \Biocpkg{CONFESS} the location of all other spot signals found in a particular image. In the example of sample "1772-062-248\_A03" it notifies us that it estimated the location "X = 30, Y = 204" in the Green channel and "X = 262, Y = 368" in the Red Channel (the latter has been correctly kept as reliable in the X, Y columns). Finally "Cells" is the transition column that reports predicted cells (marked with 1s) via the various filters that we set. \section{Adjustment and estimation of cell cycle phases / pseudotime} \subsection{Fluorescence adjustment on estimations from Part 1} To perform signal adjustment we use the relevant data of the previous step. <>= step1 <- createFluo(from.file=system.file("extdata", "Results_of_image_analysis.txt", package = "CONFESS"),separator="_") @ \Rcode{separator} separates the run from the well ID and it is kept here because \Rfunction{createFluo} can be based on data from a file that is not necessarily obtained from the above process (but it should have the above format!). \subsection{Data inspection by batch (chip)} In \Robject{step1} we store the sample IDs, the raw fluorescence signals, the areas and the Run IDs of all reliable samples into a list. We can quickly check the distribution of the signals by run/batch using the following commands: <>= print(unique(step1$batch)) @ \subsubsection{If data consists of more than 1 chip run/batch} This step would perform signal adjustment by \Rcode{BGmethod="normexp"} for image background within a mixtures of regression model with maximum \Rcode{maxMix=3} mixture components (the optimal number is estimated) that is also correcting for the run (batch) effects. Plots will be generated to show how each channel has been normalized and adjusted. The \Rcode{seed} parameter serves only for reproducibility of the results. <>= step2<-Fluo_adjustment(data=step1,transformation="log",maxMix=3,prior.pi=0.1, flex.reps = 50, single.batch.analysis=5, savePlot="screen",seed=999) @ \begin{figure}[!hbtp] \begin{center} \includegraphics[width=5in]{fluo_adjustment.jpeg} \caption{Fluo\_adjustment} \end{center}\end{figure} \newpage Data will now be clustered and the group numbers will be labelled on the centroids. <>= step2.1<-getFluo(data=step2) step3 <- Fluo_inspection(data=step2.1,altFUN="kmeans",B.kmeans=5,savePlot="screen") @ \begin{figure}[!htpb] \begin{center} \includegraphics[width=5in]{fluo_inspection_GAP.jpeg} \caption{Fluo\_inspection GAP statistics} \end{center}\end{figure} \begin{figure}[!htpb] \begin{center} \includegraphics[width=5in]{fluo_inspection_clusters.jpeg} \caption{Fluo\_inspection clusters} \end{center}\end{figure} The user should select the starting cluster, i.e. the start of the progression path. In circular paths like this one (representing a cell cycle process) any cluster can serve as starting. In this example, we select the bottom/left one (\Rcode{path.start=3}) that is well-separated from the previous and looks like a starting point. \Rcode{pathEstimator} will estimate the progression path assumed to progressing in a clockwise and circular manner. <>= step3.1<-pathEstimator(step3,path.start=3,path.type=c("circular","clockwise")) step4<-Fluo_modeling(data=step3.1,init.path=step3.1$Path,VSmethod="DDHFmv", CPmethod="ECP",CPgroups=5,CPpvalue=0.01,CPmingroup=10) step5<-Fluo_ordering(data=step4,savePlot="screen") @ \begin{figure}[!htpb] \begin{center} \includegraphics[width=5in]{fluo_ordering.jpeg} \caption{Fluo\_ordering} \end{center} \end{figure} \Robject{step4} and \Robject{step5} contain the main analytical steps that perform both cell ordering and cluster estimation by fluorescence signal. The plots show the distinct clusters (top) and the channel differences by estimated pseudotime (bottom). The data have been transformed by \Rcode{VSmethod="DDHFmv"} and the change-point analysis was run by \Rcode{CPmethod="ECP"} that identifies distinct clusters at significance level \Rcode{CPpvalue=0.01}. The main output of \Rfunction{Fluo\_ordering} is the following table with the estimates of interest: <>= head(step5$Summary_results,3) @ \newpage \subsubsection{If data consists of a single chip run/batch} If the data come from a single run, \Rfunction{Fluo\_adjustment} is replaced by \Rfunction{getFluo\_byRun} that performs only image background correction. The rest of the steps are as before. Note that in the example code below, we run the analysis on a subset of the original data, i.e. only those coming from \Rcode{batch=1}. The data have been selected by \Rfunction{FluoSelection\_byRun} that can be used to select either specific batch(es) or specific data across all batches. An example of the latter is provided in the next section. <>= step1.1<-FluoSelection_byRun(data=step1,batch=1) step2<-getFluo_byRun(data=step1.1,BGmethod="normexp",savePlot="screen") step3<-Fluo_inspection(data=step2,fixClusters=0,altFUN="kmeans",k.max=15, savePlot="screen") step3.1 <- pathEstimator(step3,path.start=2,path.type=c("circular","clockwise")) step4 <- Fluo_modeling(data=step3.1,init.path=step3.1$Path,VSmethod="DDHFmv", CPmethod="ECP",CPpvalue=0.01) step5<-Fluo_ordering(data=step4,savePlot="screen") @ \subsection{Cross-validation to assess the stability of the estimates} \Rfunction{Fluo\_CV\_prep} takes the initial data (step1) and automatically runs \Robject{step2}-\Robject{step4} for one or more reference runs (\Rcode{single.batch.analysis}). The reference run defines which run we use to adjust the data at \Rfunction{Fluo\_adjustment()}). Finally, cross-validation is done with \Rfunction{Fluo\_CV\_modeling} as follows: \begin{enumerate} \item remove samples at random and generate "new clusters". \item put back the excluded samples and re-estimate them (based on the new clusters). \item Repeat this B \textgreater 2 times. \item summarize and evaluate how different your original data estimates are from the cross-validation estimates \end{enumerate} <>= step1 <- createFluo(from.file=system.file("extdata", "Results_of_image_analysis.txt", package = "CONFESS")) steps2_4<-Fluo_CV_prep(data=step1,init.path = rep("bottom/left",2), path.type=c("circular","clockwise"),maxMix=3, single.batch.analysis = 5,transformation = "log",prior.pi = 0.1, flex.reps=5,areacut=49,fixClusters=0,altFUN="kmeans", k.max=15,VSmethod="DDHFmv",CPmethod="ECP",CPgroups=5, B.kmeans=5,CPpvalue=0.01,CPmingroup=15,savePlot="OFF",seed=999) steps2_4cv.1<-Fluo_CV_modeling(data=steps2_4,B=10,batch=1:4,perc.cutoff=0.6,q=0.9, f=0.99,seed.it=TRUE,pseudotime.cutoff=20,savePlot="screen") @ \begin{figure}[!htpb] \begin{center} \includegraphics[width=5in]{fluo_CV_modeling.jpeg} \caption{Fluo\_CV\_modeling 1} \end{center} \end{figure} You can do the same thing by removing and re-estimating a whole batch. In \Rfunction{Fluo\_CV\_modeling} we must set B=1 because the whole batch is removed once and there is no random samples to be excluded iteratively as above. The command below will remove and re-estimate batch 1. <>= steps2_4cv.2<-Fluo_CV_modeling(data=steps2_4,B=1,batch=1:4,perc.cutoff=0.6,q=0.9, f=0.99,seed.it=TRUE,pseudotime.cutoff=20,savePlot="screen") @ \begin{figure}[!htpb] \begin{center} \includegraphics[width=5in]{fluo_CV_modeling_2.jpeg} \caption{Fluo\_CV\_modeling 2} \end{center} \end{figure} The output of \Rfunction{Fluo\_CV\_modeling} is a list with the results of the original estimates (\Robject{step5}) and the CV-based estimates. The latter are newly predicted pseudotimes and clusters by two methods: median / original and median / null. They both integrate the information of the CV and the originally estimated pseudotimes by building kmean clusters of the \Rcode{B} CV estimates for each sample i and defining pseudotime(i) = median(pseudotime(set1,i)) where set1 is a subset of the \Rcode{B} pseudotimes that exhibit some similarity. The similarity is assessed by k-means clustering. This subset should contain a large percentage of the \Rcode{B} data (higher than a user defined cut-off) and it"s median should be lower than the q-th (user defined) quantile of the average differences between the original and the CV-estimated pseudotimes across all samples. Under the former method, if the CV estimated pseudotimes do not satisfy the above requirements then the algorithm returns pseudotime(i) = median(pseudotime(set2,i)) where set2 is the cluster of \Rcode{B} pseudotimes that minimizes modulus(median (pseudotimes(set2,i)) - original.pseudotimes). Under the latter method, if the CV estimated pseudotimes do not satisfy the above requirements then the algorithm returns pseudotime(i) = NULL. The NULL data should be removed from the analysis. Any of the above slots can be used in \Rfunction{Fluo\_ordering} to obtain new / alternative results. \end{document}