%\VignetteIndexEntry{An introduction to OperaMate} %\VignetteKeywords{Preprocessing, CellBasedAssays} %\VignettePackage{OperaMate} \documentclass[10pt, oneside]{article} <>= if (requireNamespace("BiocStyle", quietly = TRUE)) { BiocStyle::latex() } @ \usepackage{hyperref} \newcommand{\thetitle}{OperaMate: data importing, processing and analysis for Opera High Content Screening System} \title{\textsf{\textbf{\thetitle}}} \author{Chenglin Liu\\[1em]Shanghai Jiaotong University,\\ Shanghai, China\\ \texttt{cliu@sjtu.edu.cn} \and Yixue Li\\[1em]Shanghai Jiaotong Univeristy,\\ Shanghai, China\\ \texttt{yxli@sibs.ac.cn}} \begin{document} \maketitle \tableofcontents \section{Introduction} The \Rpackage{OperaMate} is intended to analyze protein intensity data derived from the images from PerkinElmer's Opera High Content Screening System (\url{http://www.perkinelmer.com/pages/020/cellularimaging/products/opera.xhtml}) and intepreted by Columbus Image Data Storage and Analysis System (\url{http://www.perkinelmer.com/pages/020/cellularimaging/products/columbus.xhtml}). PerkinElmer's Opera High Content Screening System is the state-of-the-art confocal microplate imaging solution for high throughput screening. The system works with complex disease models and offers various solutions like protein expression, RNAi screening. Its compatible tool Columbus image analysis system exports intensity data by analyzing the Opera images, but lacks further processing and analysis to uncover the biological meaning underlying the image data. Hence, we develop an R \cite{Rpackage} package \Robject{OperaMate} especially to process the intensity data exported by Columbus system, which can fulfills the procedure of data importing, processing and analysis in an easy and efficient way. \section{Getting Started} \Rpackage{OperaMate} integrates the entire process of data importing, processing and analysis into one function, which is easy to operate for the users who are new to the R language. However, it is also very convinient to customize each step according to the pipeline built by this function, checking the immediate results of each step, and re-performing a specific step with different parameters to achieve the desire of the users. \subsection{Configration}\label{section:config} In order to process and analyze data by one function, the configuration file should be supplied. The templete can be obtained together with the package named param.txt. Users are required to fill the blanks after colons. Leave them as blank if NULL is expected. An example can be obtained by : <>= file <- file.path(system.file("Test", package = "OperaMate"), "demoData", "demoParam.txt") ## file.show(file) @ Required column names of \Robject{genemap}: \begin{itemize} \item Barcode: character, the barcode of the plates. \item Well: character, the well ID. \item GeneSymbol: character, the annotated gene names of the well. \end{itemize} An example can be obtained by : <>= file <- file.path(system.file("Test", package = "OperaMate"), "demoData", "genemap.csv") genemap <- read.csv(file) head(genemap, n = 5) @ \subsection{Running OperaMate} <>= library(OperaMate) configFile <- file.path(system.file("Test", package = "OperaMate"), "demoData", "demoParam.txt") operaReport <- operaMate(configFile, gDevice = "png") names(operaReport) @ See the mannual for more details about this function. \subsection{Description of output} All the reports and figures of the demo data can be find at <>= tempdir() @ \section{Class declaration} \subsection{The expData class}\label{section:expData} Each \Robject{expData} object stores data of one imaging analysis report of the Opera system generatd by Columbus$^{TM}$ Image Data Storage and Analysis System. The report includes different types of data of one plate, distinguished by the paramters. The format of the report is set in the analysis system. \Rpackage{OperaMate} supports two most popular formats by now: the matrix and the table. The class requires the following information: \begin{enumerate} \item \Robject{name}, name of the plate \item \Robject{path}, path of the importing file \item \Robject{rep.id}, replicate ID \item \Robject{exp.id}, barcode of the experiment \item \Robject{format}, report format \begin{enumerate} \item \Robject{Matrix}: matrix format in the Columbus analysis system \item \Robject{Tab}: table format in the Columbus analysis system \end{enumerate} \end{enumerate} An example of creating a new expData object is as follows: <>= onePlate <- expData(name = "DSIMGA02-s1", path = file.path( system.file("Test",package = "OperaMate"), "Matrix", "130504-s1-02.txt" ), rep.id = "s1", exp.id = "DSIMGA02", format = "Matrix") @ %def However, it is highly recommended to import all files using the function \Robject{loadAll}. Details will be referred in Section \ref{section:import}. \subsection{The cellData class} An object of \Robject{expData} class stores all data corresponding to one parameter of the reports. It organizes the data as a matrix with rows are well IDs and columns the plate IDs. This class requires to provide the name of the object, which must be one parameter of the report. Additional information includes: \begin{enumerate} \item \Robject{posctrwell}, the well IDs of the positive controls, e.g. B05 \item \Robject{negctrwell}, the well IDs of the negative controls, e.g. B05 \item \Robject{expwell}, the well IDs of the samples, e.g. C12 \item \Robject{norm.method}, one method from "MP", "PMed", "Z", "Ctr", "None" \item \Robject{QC.threshold}, the thresholds for quality control \end{enumerate} The \Robject{expData} objects provide different places for different levels of the data: the raw data in the \Robject{origin.data} slot; the normalized data in the \Robject{norm.data} slot and data after quality control in the \Robject{qc.data} slot. In addition, the general quality of each plate is stored in the \Robject{plate.quality} slot. An example of creating a new cellData object is as follows: <>= oneCell <- cellData(name = "Average Intensity of Nuclei", positive.ctr = c("H02", "J02", "L02"), negative.ctr = c("C23", "E23", "G23")) oneCell @ %def \section{Data importing and processing} \subsection{Raw data importing and re-organization}\label{section:import} \Rpackage{OperaMate} imports all reports of the Columbus analysis system to \Robject{expData} objects using the function \Robject{loadAll}, and then re-organizes them to several \Robject{cellData} objects corresponding to different parameters of the reports. The function \Robject{loadAll} requires all reports are of the same data format and are placed in the same location. The structure of the file name can be specified by an example, e.g. egFilename = list(eg.filename = "0205-s2-01.txt", rep.id = "s2", exp.id = "01", sep = "-", barcode = "DSIMGA01"). If the file formats can meet these requirements, assign \Robject{cellformat} as ''Matrix'' or ''Tab'', and pass the location of the files to \Robject{datapath}. Otherwise, you need to specify the information of each file by a data.frame variable. An example can be referred by \Rcode{data(platemap)}. Then, you can import the data and construct \Robject{cellData} objects as follows: <>= datapath <- file.path(system.file("Test", package = "OperaMate"), "Matrix") lstPlates <- loadAll(cellformat = "Matrix", datapath = datapath, egFilename <- list(eg.filename = "Tab.130504-s1-01.txt", rep.id = "s1", exp.id = "01", sep = "-", barcode = "DSIMGA01") ) oneCell <- cellData(name = "Average Intensity of Nuclei") oneCell <- cellLoad(oneCell, lstPlates, neglect.well = c("*02", "*23"), positive.ctr = c("H02", "J02", "L02"), negative.ctr = c("C23", "E23", "G23")) str(oneCell["origin.data"]) @ If the reports are of other formats, you can redefine the function \Robject{parseTemplete}. See the manual for more information. \subsection{Data normalization} Data normalization is to reducing the systematical technical variance among the raw data. Technical sources of variation are unavoidable during experiments, and different amounts of variance occur in different rounds of experiments, which are referred as ''batch effect''. As to the plate-based assays, different techincal variations are added to different plates, and different locations of well are suffered with different amounts of noise, especially the edge wells. The latter phenomenon is called ''edge effects''. Good normalization methods help to attenuate the batch effects. The \Rpackage{OperaMate} package provides several normalization methods including ''MP'', ''PMed'', ''Z"'' ''Ctr''. Ctr method divides data by the mean of the plate controls. This approach is often favored by biologists. However, as to the large sample screening, it is usually not as accurate as methods which take all samples into consideration. All of the other methods consider all samples, and use the majority of samples as a negative reference. The ''PMed'' method divides data by the median of their corresponding plates. ''MP'' employs the Tukey's median polish procedure, and divides data by the median of their corresponding plates and wells recursively. "Z" is the robust z-score method which firstly substracts data by the median of their plates, and then divides them by the median absolute deviation of the plates. The default normalization method is ''MP''. However, if no normalization methods are expected, you can assign the \Robject{norm.method} as ''None''. An example is as follows: <>= oneCell <- cellNorm(oneCell, norm.method = "MP") str(oneCell["norm.data"]) @ %def Contrasting colors are very useful to visualize the batch effects. The \Rpackage{OperaMate} package provides two ways, hierarchical clustering (heatmap), the boxplot of each plate data. Heatmap method performs hieratical clustering to data matrix, and a large region of distinguishing color indicates the corresponding data with different technical variations. Boxplots visualize the distribution of the data in each plate. Examples are shown as shown in Figure \ref{fig:normplot}. <>= cellViz(oneCell, data.type = c("raw", "norm"), plateID = 1:6, outpath = tempdir()) @ \begin{figure}[tp] \begin{center} \includegraphics{normplot.png} \caption{\label{fig:normplot}% Batch effect visualization. (a-1) Boxplot of raw data.; (a-2) boxplot of norm data; (b-1) heatmap of raw data; (b-2) heatmap of norm data.} \end{center} \end{figure} In addition, you can check a specific plate. <>= cellViz(oneCell, data.type = c("raw", "norm"), plateID = 1, outpath = tempdir()) @ \subsection{Quality control }\label{section:qc} Before using the data for biological analysis, you should verify that your data passes quality control checks. \Robject{OperaMate} provides several ways to do quality control. ''PlateCorrelation'' checks if the duplicated plates have similar distributions. Pearson correlations are performed between every pair of duplicated plates, and the low correlation means bad replicates (default: 0.8). ''Zfactor'' checks the z factors of the wells of each plate. The data from the bad plates/wells are replaced by the mean of their duplicates if replace.badPlateData is TRUE. The qualities of the plates are stored in the \Robject{plate.quality} slot. Next, it performs the quality control well by well. It detects abnormal standardard deviations (sds) of replicated wells considering the mean values. Boxplot shows the abnormal sds as the outliers (black points outside the box), as shown in Figure \ref{fig:qcplot}. The cell numbers are loaded by the following function: <>= cell.cellNum <- cellData(name = "Cells.Analyzed") cell.cellNum <- cellLoad(cell.cellNum, lstPlates, neglect.well = c("*02", "*23"), positive.ctr = c("H02", "J02", "L02"), negative.ctr = c("C23", "E23", "G23")) oneCell <- cellNumLoad(oneCell, cell.cellNum) @ Quality control is performed by the following function: <>= oneCell <- cellQC(oneCell, qcType = c("plateCorrelation", "wellSd", "cellNumber"), qc.threshold = c(correlation = 0.7), outpath = tempdir()) str(oneCell["qc.data"]) head(oneCell["plate.quality"]) @ \begin{figure}[tp] \begin{center} \includegraphics{qcplot.png} \caption{\label{fig:qcplot} (a) Pearson correlation of replicated plates; (b) standard deviations of replicated wells binned by mean values.} \end{center} \end{figure} \section{Hit identification and biological analysis}\label{section:sigDA} The ''hits'' are the samples which are meaningfully different from the negative controls. The compulsory method is the t-test between the samples and the controls. The hits are the samples those significantly differ from the negative controls based on t-test ( default: p-value less than 0.05). This method is combined with the following methods to reduce a high false positives: \begin{enumerate} \item \Robject{ksd}, $mean \pm k\:standard\,deviation$. The hits are the samples those surpass k (default: 3) standard deviation relative to the mean. This approach is often used with z-score normalization. \item \Robject{kmsd}, $median \pm k\:median\,absolute\,deviation$. The hits are the samples those surpass k (default: 3) median absolute deviation relative to median. It is more robust than \Robject{ksd} as to a nonnormal distribution. \item \Robject{stable}, stable distribution method. You can check the fitness to the stable distribution by the QQ plot. \end{enumerate} An example is as follows: <>= oneCell <- cellSig(oneCell, method = "stable", th = c(0.05, 0.05), outpath = tempdir()) names(oneCell["Sig"]) @ The fitness of stable distribution to the data can be visualized, as shown in Figure \ref{fig:sig}(a). In addition, the signicance of the hits can be exihibited by a volvano plot,as shown in Figure \ref{fig:sig}(b). <>= labels <- c("Axin1") names(labels) <- c("DSIMGA04:C07") cellSigPlot(oneCell, highlight.label = labels, outpath = tempdir()) @ \begin{figure}[tp] \begin{center} \includegraphics{sigplot.png} \caption{\label{fig:sig} (a) QQ plot between data and stable distribution; (b) volcano plot between the log2 intensity and the log10 p-value in the multiple t-tests. The red points are hits statistically and quantitatively significantly different from the negative controls. (c) The counts of the detected hits. (d) The functions of the detected hits.} \end{center} \end{figure} Moreover, the potential biological meanings can be interpreted with the help of gProfileR package. You are required to provide the well-gene specification table. The description of \Robject{genemap} is referred in Section \ref{section:config}. The p-values barplot can be visualized, as shown in Figure \ref{fig:sig}(d) <>= genemap <- read.csv(file.path(system.file("Test", package = "OperaMate"), "demoData", "genemap.csv"), stringsAsFactors = FALSE) chart <- cellSigAnalysis(oneCell, genemap, organism = "mmusculus") head(chart, n = 5) cellSigAnalysisPlot(chart, prefix = oneCell@name, outpath = tempdir()) @ \section{Summarization} At last, all processed data and the significant hits of all data type are summarized to a single report, which is easy to check using Microsoft Office Excel or other softwares. The numbers of hits are visualized by a histogram, as shown in Figure \ref{fig:sig}(c). <>= report <- generateReport(list(oneCell), genemap, verbose = FALSE, plot = FALSE) head(report, n = 5) @ \section{Acknowledgement} We thank Li Mao and his advisor Lin Li for providing the original screening data generated by the Columbus system. The example data in the package are synthesis data generated based on their providing data. \section{Session info} <>= sessionInfo() @ \bibliography{OperaMate-vignette} \end{document}