% \VignetteIndexEntry{biosvd} % \VignetteDepends{} % \VignetteKeywords{biosvd} % \VignettePackage{biosvd} \documentclass[10pt]{article} \usepackage{times} \usepackage{hyperref} \usepackage{Sweave} \usepackage{amssymb} \usepackage[T1]{fontenc} \usepackage[utf8]{inputenc} \usepackage{authblk} \textwidth=6.5in \textheight=8.5in \oddsidemargin=-.1in \evensidemargin=-.1in \headheight=-.3in \title{The biosvd package for high-throughput data processing, outlier detection, noise removal and dynamic modeling} \author[1]{Anneleen Daemen\thanks{daemena@gene.com}} \author[1]{Matthew Brauer\thanks{matthejb@gene.com}} \affil[1]{Department of Bioinformatics and Computational Biology, Genentech Inc., South San Francisco, CA} \date{\today} \renewcommand\Authands{ and } \begin{document} \maketitle \tableofcontents \newpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Introduction} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Singular Value Decomposition (SVD) is a popular method, suited for longitudinal data processing and modeling. Simply stated, SVD decomposes data to a linear combination of main major modes of intensity. SVD for the analysis of genome-wide expression data was introduced by Alter and colleagues in 2000, with the major modes referred to as eigengenes and eigenarrays [Alter et al, 2000]. Sorting the expression data to these modes revealed clusters of genes or arrays with similar function or biologic phenotype. Here, we extend the application of SVD to any data type. This BioConductor R package allows for high-throughput data processing, outlier detection, noise removal and dynamic modeling, based on the framework of Singular Value Decomposition. It provides the user with summary graphs and an interactive html report. This gives a global picture of the dynamics of expression/intensity levels, in which individual features and assays are classified in groups of similar regulation and function or similar cellular state and biological phenotype. \subsection{Singular Value Decomposition (SVD)} Singular Value Decomposition is a linear transformation of a data set $E$ from the $M$ features x $N$ assays space to a reduced $L$ eigenfeatures x $L$ eigenassays space, with \(L=\min \{M,N\}\). In mathematical terms, this corresponds to \(E = U \Sigma V^T\). $U$ and $V^T$ define the $M$ features x $L$ eigenassays and the $L$ eigenfeatures x $N$ assays orthonormal basis sets. Each column in $U$ corresponds to a left singular vector, representing genome-wide expression, proteome-wide abundance or metabolome-wide intensity in the $k$-th eigenassay. Accordingly, each row in $V^T$ is called a right singular vector and represents the expression, abundance or intensity of the $k$-th eigenfeature across all assays. \vspace{3 mm} $\Sigma$ is a diagonal matrix with expression of each eigenfeature restricted to the corresponding eigenassay, reflecting the decoupling and decorrelation of the data. The eigenexpression levels along the diagonal indicate the relative significance of each $\{$eigenfeature, eigenassay$\}$-pair. The relative fraction of overall expression that the $k$-th eigenfeature and eigenassay capture is called {\it eigenexpression fraction} and is defined as \(f_l = \epsilon^2_l / \sum_{k=1}^L \epsilon_k^2\). Finally, data complexity is expressed as the Shannon entropy, defined as \(0 \leqslant \frac{-1}{log(L)} \sum_{k=1}^L f_k log(f_k) \leqslant 1\). An entropy of 0 corresponds to an ordered and redundant data set, with all expression captured by a single $\{$eigenfeature, eigenassay$\}$-pair. On the other hand, the entropy is 1 in case of a disordered and random data set with all $\{$eigenfeature, eigenassay$\}$-pairs equally expressed. We refer to \cite{Alter00} for a more detailed description of SVD. \subsection{biosvd Package Overview} \includegraphics[scale=0.8]{biosvd-package-overview.png} \vspace{3 mm} The biosvd package consists of 4 main functions. First, \tt compute \rm reduces the input data set from the feature x assay space to the reduced diagonalized eigenfeature x eigenassay space, with the eigenfeatures and eigenassays unique orthonormal superpositions of the features and assays, respectively. Results of SVD applied to the data can subsequently be inspected based on the graphs generated with \tt plot\rm. These graphs include a heatmap of the eigenfeature x assay matrix ($V^T$), a heatmap of the feature x eigenassay matrix ($U$), a bar plot with the eigenexpression fractions of all eigenfeatures, and the levels of the eigenfeatures across the assays, and polar plots of the assays and features, displaying the features/assays according to their correlation with two selected eigenfeatures/eigenassays. These graphs aid in deciding which eigenfeatures and eigenassays to filter out (i.e., eigenfeatures representing steady state, noise, or experimental artifacts) (\tt exclude\rm). Filtering out steady-state expression/intensity corresponds to centering the expression/intensity patterns at steady-state level (arithmetic mean of intensity $\sim$ 0). \vspace{3 mm} Secondly, the three functions \tt compute\rm, \tt plot \rm and \tt exclude \rm can be applied to the variance in the data, in order to filter out steady-scale variance. This corresponds to a normalization by the steady scale of expression/intensity variance (geometric mean of variance $\sim$ 1). \vspace{3 mm} Thirdly, after possible removal of steady state expression, steady-scale variance, noise and experimental artifacts, SVD is re-applied to the normalized data, followed by the generation of plots with \tt plot \rm, and a summary txt file with \tt report\rm, containing the list of features, their coordinates, radius and phase in the polar plot, and any additional feature data provided by the user. \subsection{Case Studies Overview} In this vignette, three case studies are provided. In the first 2 case studies, the expression pattern of genes throughout the cell cycle are studied in yeast and human, respectively. As use of this package is not restricted to expression data, the third case study focuses on cellular metabolites in bacteria and yeast after carbon and nitrogen starvation. \vspace{3 mm} The essential data must be provided as a feature x assay matrix, a data frame, ExpressionSet or an object from class eigensystem obtained from a former run. For the examples provided in this vignette, ExpressionSets were created containing the gene x sample expression data with gene annotation and sample information, or the metabolite x assay intensity data with metabolite annotation and assay information. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Case Study 1: Yeast Cell Cycle Expression, Alpha-factor Block.} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% As a first example, Spellman and colleagues created a comprehensive catalog of genes in {\it Saccharomyces cerevisiae} whose transcript levels vary periodically within the cell cycle \cite{Spellman98}. To this end, mRNA levels in samples from yeast cultures were synchronized in G1 phase with $\alpha$ factor arrest. After release of the $\alpha$ factor, cells were sampled every 7 minutes over a timespan of 140 minutes, during which the cells synchronously completed two cell cycles. The gene x sample expression data comprise the (un-logtransformed) ratio of gene expression to reference mRNA from an asynchronous yeast culture. For each sample, the cell cycle phase is known as determined by Spellman {\it et al}. For 800 cell cycle-regulated genes, the phase in which these genes reach their peak expression was determined by Spellman {\it et al} based on published timing of the expression of known cell cycle-regulated genes. \vspace{3 mm} We start the example by loading the data and all required libraries: <>= library(biosvd) data(YeastData_alpha) colnames(pData(YeastData))[match("Cell.cycle.stage", colnames(pData(YeastData)))] <- "Cellcycle.sample" colnames(fData(YeastData))[match("Cell.cycle.stage", colnames(fData(YeastData)))] <- "Cellcycle.gene" YeastData @ The input data set is first reduced from the gene x sample space to the reduced diagonalized eigenfeature x eigenassay space. <>= eigensystem <- compute(YeastData) @ The eigenfeatures and eigenassays can subsequently be inspected based on the graphs generated with \tt plot\rm. All plot-related settings can be specified with an object of class \tt EigensystemPlotParam \rm. The default settings for all plots are as follows: <>= params <- new("EigensystemPlotParam") params @ Up to 10 figures are displayed (\tt figure(params)=TRUE\rm) or saved as a pdf file (default \tt figure(params)=FALSE\rm), using the \tt plots(params) \rm argument as follows: \begin{itemize} \item {\bf fraction}: bar plot with the eigenexpression fractions of all eigenfeatures \item {\bf zoomedFraction}: bar plot with the eigenexpression fractions of the eigenfeatures after removal of the dominant eigenfeature(s) \item {\bf scree}: screeplot for the eigenexpression fractions \item {\bf eigenfeatureHeatmap}: heatmap of the eigenfeature x assay matrix with use of a given contrast factor \tt contrast(params)\rm \item {\bf eigenassayHeatmap}: heatmap of the eigenfeature x assay matrix with use of a given contrast factor \tt contrast(params)\rm \item {\bf sortedHeatmap}: heatmap of the feature x assay matrix with use of a given contrast factor \tt contrast(params)\rm, with features ordered according to their correlation with two eigenfeatures, specified with \tt whichPolarAxes(params)\rm \item {\bf lines}: expression/intensity levels of specified eigenfeatures across the assays (\tt whichEigenfeatures(params)\rm, by default 1-4) \item {\bf allLines}: expression/intensity levels of all eigenfeatures across the assays \item {\bf eigenfeaturePolar}: polar plot for the features according to their correlation with two eigenfeatures, specified with \tt whichPolarAxes(params)\rm \item {\bf eigenassayPolar}: polar plot for the assays according to their correlation with two eigenassays, specified with \tt whichPolarAxes(params)\rm \end{itemize} For this example, the bar plot with all eigenfeatures (\tt fraction\rm), the expression levels of all eigenfeatures across the samples (\tt allLines\rm), and the heatmap of the eigenfeature x assay matrix (\tt eigenfeatureHeatmap\rm) are generated, with {\it YeastData} as prefix for visualization purpose (\tt prefix(params)\rm). The bar plot shows that the first eigenfeature captures 89\% of the overall relative expression in the experiment. The entropy of the data set is therefore low (0.18 $\ll$ 1). The expression level of the first eigenfeature across the samples as displayed in the \tt allLines \rm graph shows a time-invariant relative expression during the cell cycle. The low entropy in combination with the steady-state expression captured in the first eigenfeature suggests that the underlying processes are manifested by weak perturbations of a steady state of expression. The second eigenfeature describes an initial transient increase in relative expression superimposed over time-invariant relative expression, and is therefore inferred to represent response to synchronization in the cell cycle. Inspecting the remaining eigenfeature patterns across the samples reveals that eigenfeature 8 and 10 to 18 all show rapidly varying relative expression during the cell cycle. They can therefore be considered as noise. The heatmap of the eigenfeatures by samples reveals the same information, with a clear constant expression for eigenfeature 1. <>= fractions(eigensystem)[[1]] plots(params) <- "fraction" figure(params) <- TRUE prefix(params) <- "YeastData" plot(eigensystem, params) @ <>= plots(params) <- "allLines" plot(eigensystem, params) @ <>= plots(params) <- "eigenfeatureHeatmap" plot(eigensystem, params) @ We now use \tt exclude \rm to filter out steady-state expression captured by eigenfeature 1, an experimental artifact captured by eigenfeature 2 (i.e. initial response to synchronization in the cell cycle), and noise captured by eigenfeatures 8 and 10 to 18. <>= eigensystem <- exclude(eigensystem,excludeEigenfeatures=c(1,2,8,10:18)) @ Subsequently, we apply the same strategy to the variance in the data. The first eigenfeature now captures 88\% of overall information, representing a time-invariant scale of expression variance, with an entropy of 0.23. This eigenfeature is therefore removed from the data set. Besides exclusion of the specified eigenfeatures, \tt exclude \rm regenerates the eigensystem for the normalized expression data after removal of steady-scale variance. <>= eigensystem <- compute(eigensystem, apply='variance') entropy(eigensystem) fractions(eigensystem)[[1]] plots(params) <- "lines" plot(eigensystem, params) eigensystem <- exclude(eigensystem, excludeEigenfeatures=1) @ Now that steady-state expression, steady-scale variance, experimental artifacts and noise have been removed, a summary txt file and key graphs are generated. The txt file contains for all genes the cartesian coordinates, radius and phase in the eigenfeature polar plot, and annotation information that was provided with the input ExpressionSet. For the coloring of the genes and samples in the polar plots and heatmap, the cell cycle phase information from the ExpressionSet \tt YeastData \rm is used. We specify \tt assayColorMap(params)\rm and \tt featureColorMap(params)\rm, using variable \tt Cellcycle.sample\rm and \tt Cellcycle.gene\rm in the respective \tt pData\rm and \tt fData\rm objects of the ExpressionSet \tt YeastData \rm. Given that we don't want to specify the coloring for the various cell cycle phases ourselves, we set these color maps to \tt NA\rm. \vspace{3 mm} The polar plots show the genes and samples in the subspace spanned by two selected eigenfeatures and eigenassays, respectively. Because the first and second eigenfeature capture together more than 45\% of the overall normalized expression, these eigenfeatures were used as subspace (default for \tt whichPolarAxes(params)\rm). These two \{eigenfeature, eigenassay\}-pairs are sufficient to approximate the expression data when genes and samples have most of their normalized expression in this subspace (i.e., $0.5 <$ radius $< 1$). <>= assayColorMap(params) <- list(Cellcycle.sample=NA) plots(params) <- "eigenassayPolar" plot(eigensystem, params) @ <>= featureColorMap(params) <- list(Cellcycle.gene=NA) plots(params) <- "eigenfeaturePolar" plot(eigensystem, params) @ <>= fractions(eigensystem)[c(1,2)] report(eigensystem, params) @ As can be seen on the assay polar plot, eigenassay 1 is positively associated with mainly samples from G1 phase, whilst this eigenassay is negatively associated with samples from the S, G2 and M phases. Eigenassay 2 is positively associated with G1 and S, whilst negatively associated with M and M/G1. The right upper quadrant is therefore dominated by samples from G1, the right lower quadrant by S and G2, and the left lower quadrant by M. Genes in each of these quadrants in the feature polar plot can subsequently be linked to and interpreted in function of each of these cell cycle phases. The genes are colored according to their expression correlation with known cell cycle-regulated genes (as determined by Spellman {\it et al}), with mainly G1 genes in the right upper quadrant, S and G2 genes in the right lower quadrant, and M genes in the left lower quadrant, confirming the findings obtained with the biosvd package. \vspace{3 mm} Data were also sorted according to the first and second eigenfeature, and displayed in a gene x sample heatmap using the \tt sortedHeatmap\rm option in \tt plots(params)\rm. This heatmap with genes sorted according to the selected eigenfeatures shows a traveling wave of expression throughout the cell cycle. This visualization further aids in interpreting how genes are regulated by or function in the cell cycle. <>= plots(params) <- "sortedHeatmap" plot(eigensystem, params) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Case Study 2: Human HeLa Cell Cycle Expression} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Secondly, a similar experiment was performed in the human HeLa cervical carcinoma cell line \cite{Whitfield02}. Cells were arrested at the beginning of S phase by using a double thymidine block. Upon release from the thymidine block, cells were sampled every 1-2 hours for 44 hours during which the cells completed three cell cycles. Similar as for the Yeast experiment, expression data comprise the un-logtransformed ratio of gene expression to reference mRNA from an asynchronous HeLa culture. Moreover, cell cycle phase is known for each sample. For >850 genes that were identified by Whitfield {\it et al} to be periodically expressed during the cell cycle, the phase was determined based on correlation with genes known to be expressed in each cell cycle phase (e.g. {\it cyclin E1} at the G1/S boundary, {\it RAD51} in S phase, and {\it TOP2A} in G2). \vspace{3 mm} Similar as for the first case study, we load the HeLa data and compute the eigensystem. In this case, the first eigenfeature captures more than 90\% of the relative expression, with an entropy of 0.20. As can be seen on the plot of the expression level of eigenfeatures across samples, this eigenfeature represents steady-state expression. Besides eigenfeature 1, we also decided to remove eigenfeatures 7, 10, 11 and 12, all showing rapidly varying expression during the cell cycle. <>= data(HeLaData_exp_DoubleThym_2) colnames(pData(HeLaData))[match("Cell.cycle.stage", colnames(pData(HeLaData)))] <- "Cellcycle.sample" colnames(fData(HeLaData))[match("Cell.cycle.stage", colnames(fData(HeLaData)))] <- "Cellcycle.gene" HeLaData eigensystem <- compute(HeLaData) fractions(eigensystem)[[1]] entropy(eigensystem) params <- new("EigensystemPlotParam") plots(params) <- "allLines" figure(params) <- TRUE plot(eigensystem, params) eigensystem <- exclude(eigensystem,excludeEigenfeature=c(1,7,10:12)) @ As a second step, we apply the same strategy to the variance in the data. A time-invariant scale of expression variance was captured by the first eigenfeature and therefore removed. Regarding the plots generated by \tt plot\rm, these are not shown but saved as pdf files because we set \tt figure(params) \rm to \tt FALSE\rm. <>= eigensystem <- compute(eigensystem, apply='variance') entropy(eigensystem) fractions(eigensystem)[[1]] plots(params) <- c("eigenfeatureHeatmap", "fraction", "lines") figure(params) <- FALSE prefix(params) <- "HeLaData" plot(eigensystem, params) eigensystem <- exclude(eigensystem, excludeEigenfeatures=1) @ As final step, we now generate the summary txt file, the eigenassay/eigenfeature polar plots, and the sorted gene x sample heatmap for the first and second eigenfeature. Similar as before, the cell cycle phase information for both the genes and samples from the ExpressionSet \tt HeLaData \rm is used for coloring of the polar plots. This time, we pre-defined the colors for the various cell cycle phases by specifying various arguments of the EigensystemPlotParam object. The gene x sample heatmap with genes sorted according to eigenfeatures 1 and 2 displays a traveling wave of expression throughout the cell cycle. <>= report(eigensystem, params) @ <>= cellcycle.col.map <- c("orange2", "darkgreen", "blue2", "red2") names(cellcycle.col.map) <- c("S", "G2", "M", "G1") assayColorMap(params) <- list(Cellcycle.sample=cellcycle.col.map) plots(params) <- "eigenassayPolar" figure(params) <- TRUE plot(eigensystem, params) @ <>= cellcycle.col.map <- c("orange2", "darkgreen", "blue3", "magenta3", "red3") names(cellcycle.col.map) <- c("S", "G2", "G2/M", "M/G1", "G1/S") featureColorMap(params) <- list(Cellcycle.gene=cellcycle.col.map) plots(params) <- "eigenfeaturePolar" plot(eigensystem, params) @ <>= plots(params) <- "sortedHeatmap" plot(eigensystem, params) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Case Study 3: Starvation Metabolomics} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% To show that use of this R package is not restricted to expression data, this third example features metabolomics data. Brauer and colleagues studied metabolic response to starvation in two microbes, {\it Escherichia coli} and {\it Saccharomyces cerevisae}, to determine whether metabolome response to nutrient deprivation is similar across both organisms \cite{Brauer06}. Sixty-eight cellular metabolites were analyzed by LC-MS/MS in both bacteria and yeast, after nutrient starvation with carbon and nitrogen. Cells were sampled for 8 hours. The metabolomics data comprise the log-transformed relative metabolite concentration changes with respect to experiment initiation at time point 0 hours. \vspace{3 mm} This case study not only differs from the two previous case studies in data type, but also in heterogeneity with four experiments combined into this data set (bacteria - carbon, bacteria - nitrogen, yeast - carbon, and yeast - nitrogen). This increased heterogeneity in the data explains the much higher entropy compared to the two previous studies (0.51), with the first and second eigenfeature capturing 42\% and 29\% of the relative intensity, respectively. Contrary to the two previous case studies, eigenfeature 1 no longer captures steady-state expression but shows a decreasing trend over time for each of the experiments, regardless of organism (B for bacteria vs. Y for yeast) and nutrient (C for carbon vs. N for nitrogen). Because we are not interested in a generic starvation response, we decided to remove the first eigenfeature at the metabolite intensity level. Furthermore, eigenfeatures 11, 12, and 14 to 24 rapidly vary along the assays and can therefore be considered as noise. <>= data(StarvationData) StarvationData eigensystem <- compute(StarvationData) fractions(eigensystem)[c(1,2)] params <- new("EigensystemPlotParam") plots(params) <- c("fraction") figure(params) <- TRUE prefix(params) <- "StarvationData" plot(eigensystem, params) @ <>= plots(params) <- "lines" plot(eigensystem, params) @ <>= plots(params) <- "allLines" plot(eigensystem, params) eigensystem <- exclude(eigensystem,excludeEigenfeature=c(1,11,12,14:24)) @ After removal of noise and the organism- and nutrient-inspecific starvation response, we investigated the variance in the data. Contrary to the two previous case studies, no steady-scale variance was present in the data and therefore no eigenfeatures were removed at the variance level. <>= eigensystem <- compute(eigensystem, apply='variance') plots(params) <- "lines" plot(eigensystem, params) eigensystem <- exclude(eigensystem, excludeEigenfeatures=0) @ Finally, the summary txt file and sorted metabolite x assay heatmap is generated, using the default eigenfeatures 1 and 2 for the calculation of cartesian and polar coordinates. It is also possible to change those settings, and obtain results for eigenfeatures 3 and 4. For the metabolite x assay heatmap, species information was used for the coloring of the assays. This heatmap allows determining the organism- and nutrient-specific metabolites. <>= report(eigensystem, params) plots(params) <- "sortedHeatmap" assayColorMap(params) <- list(Species=NA) plot(eigensystem, params) whichPolarAxes(params) <- c(4,3) prefix(params) <- "StarvationData34" report(eigensystem, params) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Session Info} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% These analyses were done using the following versions of R, the operating system, and add-on packages: <>= toLatex(sessionInfo(), locale=FALSE) @ \begin{thebibliography}{9} \bibitem{Alter00} Alter O, Brown PO, and Botstein D. \emph{Singular value decomposition for genome-wide expression data processing and modeling}. Proc Nat Acad Sci U.S.A. 97(18), 10101-10106 (2000). \bibitem{Spellman98} Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, and Futcher B. \emph{Comprehensive identification of cell cycle-regulated genes of the Yeast {\it Saccharomyces cerevisiae} by microarray hybridization}. Mol Biol Cell 9, 3273-3297 (1998). \bibitem{Whitfield02} Whitfield ML, Sherlock G, Saldanha AJ, Murray JI, Ball CA, Alexander KE, Matese JC, Perou CM, Hurt MM, Brown PO, and Botstein D. \emph{Identification of genes periodically expressed in the human cell cycle and their expression in tumors}. Mol Biol Cell 13, 1977-2000 (2002). \bibitem{Brauer06} Brauer MJ, Yuan J, Bennett BD, Lu W, Kimball E, Botstein D, and Rabinowitz JD. \emph{Conservation of the metabolomic response to starvation across two divergent microbes}. Proc Nat Acad Sci U.S.A. 103(51), 19302-19307 (2006). \end{thebibliography} \end{document}