\documentclass[11pt]{article} \usepackage{amsmath} \usepackage{times} %\usepackage{hyperref} \usepackage[numbers]{natbib} \usepackage{graphicx} <>= BiocStyle::latex(bibstyle="unsrt") @ \title{Multiple Co-inertia Analysis of Multiple OMICS Data using \Rpackage{omicade4}} \author{Chen Meng and Amin Moghaddas Gholami} \begin{document} \SweaveOpts{concordance=TRUE} % \VignetteIndexEntry{Using omicade4} \setkeys{Gin}{width=0.6\textwidth} \maketitle \tableofcontents \setlength\parindent{0pt} % ================================================= Introduction ================================================= \section{Introduction} Modern "omics" technologies enable quantitative monitoring of the abundance of various biological molecules in a high-throughput manner, accumulating an unprecedented amount of quantitative information on a genomic scale. Systematic integration and comparison of multiple layers of information is required to provide deeper insights into biological systems. Multivariate approaches have been applied successfully in the analysis of high throughput "omics" data. Principal component analysis (PCA) has been shown to be useful in exploratory analysis of linear trends in biological data \cite{pca}. Culhane and colleagues employed a two table coupling method (co-inertia analysis, CIA) to examine covariant gene expression patterns between microarray datasets from two different platforms \cite{cia2003}. Although PCA is available in several R packages, the \Biocpkg{ade4} and \Biocpkg{made4} contain many additional multivariate statistical methods including methods for analysis of one data table, coupling of two data tables or multi-table analysis \cite{ade4, ade4ii}. These methods for integrating multiple datasets make these particular packages very attractive for analysis of multi-omics data. \Biocpkg{omicade4} is developed as an extension to \Biocpkg{ade4} and \Biocpkg{made4} to facilitate input and analysis of more than two omics datasets. \Biocpkg{omicade4} provides functions for multiple co-inertia analysis and for graphical representation, so that the general similarity of different datasets could be easily interpreted. The method could be applied when several set of variables (genes, transcripts, proteins) are measured on the same set of individuals (cell lines, patients). This vignette provides a case study on a toy NCI-60 dataset to show the usage of this package. In addition, the package provides methods for S3 class \Rclass{cia}, which encapsulates results from the co-inertia analysis by \Rfunction{cia} function from \Biocpkg{made4}. Therefore, functions from \Biocpkg{made4} and \Biocpkg{ade4} are also called in this vignette. For more information please refer to \cite{omicade4} and several recent reviews. % ================================================= Quick start ================================================= \section{Quick Start} The package includes example data from four different microarray platforms (i.e., Agilent, Affymetrix HGU 95, Affymetrix HGU 133 and Affymetrix HGU 133plus 2.0) on the NCI-60 cell lines. The package and datasets are loaded by the commands: <>= library(omicade4) data(NCI60_4arrays) @ \Robject{NCI60\_4arrays} is a list containing the NCI-60 microarray data with only few hundreds of genes randomly selected in each platform to keep the size of the Bioconductor package small. However, the full datasets are available in \cite{cellminer}. % ------------------------------------------ Data view ----------------------------------------- \subsection{Data Overview} MCIA links the individuals (samples in column) in different datasets and thus the columns will be linked between the multiple datasets. Thus we have to ensure that the order of samples (the columns) in all datasets is the same before performing MCIA. The number of variables (genes) does not need to be the same. We can check the dimension of each dataset in the list by <>= sapply(NCI60_4arrays, dim) @ And check whether samples are ordered correctly <>= all(apply((x <- sapply(NCI60_4arrays, colnames))[,-1], 2, function(y) identical(y, x[,1]))) @ Before performing the MCIA, we can use hierarchical clustering to have a general idea about similarity of cell lines, which can be done with the following command. We will compare the clustering result with MCIA. <>= layout(matrix(1:4, 1, 4)) par(mar=c(2, 1, 0.1, 6)) for (df in NCI60_4arrays) { d <- dist(t(df)) hcl <- hclust(d) dend <- as.dendrogram(hcl) plot(dend, horiz=TRUE) } @ \begin{figure}[h!] \begin{center} \includegraphics{omicade4-cluster_data_separately} \caption{The hierarchical clustering of NCI-60 cell lines} \label{clust} \end{center} \end{figure} % ------------------------------------------ MCIA ----------------------------------------- \subsection{Data Exploration with Multiple Co-inertia Analysis} The main function \Rfunction{mcia} can be used to perform multiple co-inertia analysis: <>= mcoin <- mcia(NCI60_4arrays, cia.nf=10) class(mcoin) @ It returns an object of class \Rclass{mcia}. There are several methods that could be applied on this class. To visualize the result, one can use \Rfunction{plot} directly. However, because there are nine cancer types, we want to distinguish the cell lines by their original cancer type. This can be done by defining a phenotype factor in \Rfunction{plot}. The following commands create a vector to indicate the cell line groups. <>= cancer_type <- colnames(NCI60_4arrays$agilent) cancer_type <- sapply(strsplit(cancer_type, split="\\."), function(x) x[1]) cancer_type @ Next, we plot the result for the first two principal components <>= plot(mcoin, axes=1:2, phenovec=cancer_type, sample.lab=FALSE, df.color=1:4) @ \begin{figure}[h!] \begin{center} \includegraphics{omicade4-plot_mcia_cancertype} \caption{The MCIA plot of NCI-60 data} \label{mcia12} \end{center} \end{figure} This command produces a 4-panel figure as shown in figure \ref{mcia12}. The top left panel is the sample space, where each cell line is projected. Shapes represent samples in different platforms. Same cell lines are linked by edges. The shorter the edge, the better the correlation of samples in different platforms. In our sample plot, a relatively high correlation of all microarray datasets is depicted by the short edges. Furthermore, in most cancer types except lung cancer and breast cancer, cell lines having the same origin are closely projected, which indicates high homogeneity of these cancer types. This agrees with the hierarchical clustering (figure \ref{clust}). \vspace{8 mm} The next interesting question is which genes are responsible for defining the coordinates of samples. The top right panel is the variable (gene) space, e.g., genes from different platforms, which are distinguished by colors and shapes, are projected on this space. In this panel, a gene that is particularly highly expressed in a certain cell line will be located on the direction of this cell line. The farther away towards the outer margin, the stronger the association is. Equally, genes projected on the opposite direction from the origin indicate that they are lost or down regulated in those cell lines. From this sense, since the melanoma cell lines are highly weighted on the positive side of the horizontal axis in the first panel, the corresponding melanoma highly expressed genes are on the same direction. The following command could be used to select melanoma associated genes according to the coordinate of genes in that space <>= melan_gene <- selectVar(mcoin, a1.lim=c(2, Inf), a2.lim=c(-Inf, Inf)) melan_gene @ The first column represents gene names, the subsequent columns indicate which genes are identified in which platforms, and the last column is a statistic of the total number of platforms identifying the corresponding gene in the selected region. \vspace{8 mm} The bottom left panel in figure \ref{mcia12} shows the eigenvalue for each eigenvector. The barplot represents the absolute eigenvalues. The dots linked by lines indicate the proportion of variance for the eigenvectors. Cyan bars indicate the eigenvectors kept in the analysis. In this case, we kept 10 eigenvectors, and the top three axes have a relative large eigenvalue according to the scree plot. Therefore, not only the top two axes, but also the third one could lead to some interesting findings. Different axes could be explored by changing the \Rcode{axes} argument in \Rfunction{plot}, such as: <>= plot(mcoin, axes=c(1, 3), phenovec=cancer_type, sample.lab=FALSE, df.color=1:4) plot(mcoin, axes=c(2, 3), phenovec=cancer_type, sample.lab=FALSE, df.color=1:4) @ Finally, the bottom right panel in figure \ref{mcia12} shows the pseudo-eigenvalues space of all datasets, which indicates how much variance of an eigenvalue is contributed by each dataset. In this example, the HGU 95 is highly weighted on the first axis. Therefore, this dataset contributes the most variance on this axis among four datasets. However, the HGU 133 plus 2.0 data highly contribute to the second axis. Note that we selected some melanoma related genes by limiting the first axis using \Rfunction{selectVar} function, where we identified four genes in Agilent and HGU 95 platforms comparing to only one gene in the HGU 133 plus 2.0 platform, which is in agreement with the result suggested by this plot. \vspace{8 mm} In addition, the function \Rfunction{plotVar} could be used to visualize the gene space, given a list of genes of interest. Let's get back to the melanoma genes again, we know that {\it S100B} and {\it S100A1} are detected in more than one dataset. Now, we want to know where these genes are projected on the gene space. This could be visualized by <>= geneStat <- plotVar(mcoin, var=c("S100B", "S100A1"), var.lab=TRUE) @ \begin{figure}[h!] \begin{center} \includegraphics{omicade4-plotvar_mcia} \caption{visualization of genes of interest in CIA} \label{genespace1} \end{center} \end{figure} @ The output plot is shown in figure \ref{genespace1}. \bibliography{omicade4} \end{document}