\documentclass[a4paper]{article}
\usepackage{natbib}
\usepackage[latin1]{inputenc}
\usepackage{authblk}
\usepackage{amsmath}
\usepackage{afterpage}
\usepackage{subfigure}
\usepackage[figuresright]{rotating}
\let\proglang=\textsf
\let\pkg=\textsf
% \VignetteIndexEntry{kimod A K-tables approach to integrate multiple Omics-Data in R}
\title{\pkg{kimod}: A K-tables approach to integrate multiple Omics-Data in \proglang{R}}
\author[1]{M L Zingaretti}
\author[2]{J A Demey Zambrano}
\author[2]{J L Vicente Villard\'on}
\author[3]{J R Demey}
\affil[1]{IAPCBA-IAPCH, Universidad Nacional de Villa Mar\'ia}
\affil[2]{Departamento de Estad\'istica, Universidad de Salamanca}
\affil[3]{Fellow Prometeo Senescyt, Escuela Superior Polit\'ecnica del Litoral (ESPOL)}
\begin{document}
\SweaveOpts{concordance=TRUE}
\maketitle
\begin{abstract}
\emph{kimod} performs multivariate analysis of k-tables. In particular, it implements the STATIS methodology, designed to handle multiple data tables that quantify sets of variables collected on the same observations. The package works with mixed data and introduces the following improvements: distance options (for numeric and/or categorical variables) for each of the tables; bootstrap resampling on the residual matrix of the STATIS compromise, which yields confidence ellipses for the projected observations; and regression biplots that project all variables onto the compromise matrix. Goodness-of-fit criteria are then used to select variables and to relate observations to variables. The package can also cluster variables that are strongly related to each other and therefore carry the same information. Since the main purpose of the package is the analysis of omics data, it includes example data from four different microarray platforms of the NCI-60 cell lines.
\end{abstract}

\section{Introduction}
In recent years, microarray data have not only gained great importance but have also become increasingly available to the public. The ``omics'' technologies provide quantitative measurements of hundreds of biological variables of complex nature and make it possible to study simultaneously, across multiple datasets, the expression levels of thousands of genes under given treatments or diseases. However, the joint analysis of the different subspaces generated by these technologies, and of their relations, is not simple. Several statistical methods have been developed to handle these problems and to compute a consensus from several data matrices. STATIS-ACT \citep{DesPlan,escoufier1976} is one family of methods concerned with the analysis of data arising from several configurations, and it is a powerful technique for comparing subspaces. The aim of this package is to combine STATIS, Biplot \citep{Gabriel1971,Demey2008} and cluster methodologies to study the relationships between gene expression in multiple omics datasets measured on the same biological samples, or the expression of the same genes under different experimental conditions.

\section{STATIS methodology}
The STATIS methodology is a family of exploratory techniques for multivariate data analysis based on linear algebra, and especially on Euclidean vector spaces (ACT stands for \emph{Analyse Conjointe de Tableaux}; STATIS stands for \emph{Structuration des Tableaux A Trois Indices de la Statistique}). It was devised for multiway data, on the basic idea of computing Euclidean distances between configurations of points \citep{escoufier1973}. In studies of genetic diversity, STATIS makes it possible to determine the contribution of each observation to the Euclidean distance between the subspaces defined by the molecular markers and the morphological traits.
Formally, the central idea of the technique is to compare configurations of the same observations obtained in different circumstances. We therefore need a measure of similarity between two configurations, which is equivalent to defining a distance between the corresponding scalar product matrices. These matrices are:
\begin{equation}
\label{eq1}
W_{k}=X_{k}X_{k}^{T}
\end{equation}
where $X_k$ is the $k$-th data table, with observations in rows. We can use the classic Euclidean norm
\begin{equation}
\|W_{1}-W_{2}\|^{2}=\sum_{i}\sum_{j}[(W_{1}-W_{2})_{ij}]^{2}=\mathrm{Tr}[(W_{1}-W_{2})^{2}]
\end{equation}
In some cases, when the variables are not all continuous, the scalar product cannot be computed. In the DISTATIS approach, $K$ distance matrices between observations are computed instead of scalar products (see equation \ref{eq1}); these matrices are then transformed into cross-product matrices, to which the cross-product approach to STATIS is applied \citep[see][]{Abdi2007,Abdi2012}. In these works the authors proposed only the Euclidean metric. In this package we extend the approach by incorporating different metrics, which extends the use of STATIS-ACT to other types of variables. Three aspects are considered in the application of the method: the study of the interstructure, the definition of the compromise space and the graphical representation of the trajectories.

\subsection{Steps of STATIS}
\begin{enumerate}
\item Interstructure: define a distance between the configuration matrices $W_k$ and build the $K \times K$ matrix of scalar products between them; then use the spectral decomposition of this matrix to project all studies onto a low-dimensional space.
\item Compromise: define the $n \times n$ matrix $W=\sum_{k=1}^{K}\alpha_{k}W_{k}$, the linear combination of the $W_k$ that is most related to each $W_k$. Then use its singular value decomposition to plot all observations in the consensus space.
\item Trajectories: these give an idea of the importance and the direction of the change in position of each observation between the stages $k$ and $k'$.
\end{enumerate}

\subsection{Sampling Variability and Biplot Analysis}
Following \citet{Demey2008a}, the results of a data analysis are incomplete unless they give information about the stability of the solution, showing that the structure detected by the analysis is not random. There are several ways to do this, including small perturbations of the data, resampling techniques and permutations. As with other ordination techniques, the sensitivity of the solutions of K-tables methods has received hardly any attention. Therefore, as part of this work, we study the sampling stability of the projections of individuals, variables or both on the compromise matrix of the various methods. Specifically, the bootstrap \citep{efron1993} is proposed for building confidence regions for the projection of each individual on the compromise matrix ($W$).
%This is acquiring from the projection of self- decomposition of said matrix and is called $P$.
To estimate the sampling variability, $B$ configurations of the matrix $W$ must be generated by an algorithm that operates on the matrix of residuals, as detailed below. The rank-$q$ eigen-decomposition of $W$ gives:
\begin{equation}
\hat{W}=U_{q}D_{q}V_{q}^{T}
\label{ec2}
\end{equation}
The objective is then to find a configuration $P$ in a lower-dimensional Euclidean space. A lower-dimensional approximation can be obtained from equation \ref{ec2} (usually $q=2$). $W$ can be decomposed as $W=\hat{W}+\epsilon$, where $\epsilon$ is a matrix of residuals with the same properties as $W$ and $\hat{W}$, and $\hat{W}$ is the rank-$q$ estimate.

<<>>=
library("kimod")
@
<<>>=
data(NCI60Selec_ESet)
@
Once the dataset is loaded, we check its class using the \texttt{class()} command:
<<>>=
class(NCI60Selec_ESet)
@
Then, we check the dimensions of the datasets.
<<>>=
lapply(NCI60Selec_ESet,dim)
@
Finally, we check whether all tables have the same observations:
<<>>=
Tissues<-c(rep("Breast",5),rep("CNS",6),rep("Colon",7),
rep("Leukemia",6),rep("Melanoma",10),rep("Lung",9),
rep("Ovarian",7),rep("Prostate",2),rep("Renal",8))
@
The next command returns an array with the row names of all tables:
<<>>=
Names<-sapply(NCI60Selec_ESet,rownames)
@
If the following command returns \texttt{TRUE}, all matrices have the same observations:
<<>>=
unique(apply(Names[,-1],2,function(y)identical(y,Names[,1])))
@
Once the preprocessing of the experiment data is completed, the STATIS method can be carried out by calling the \texttt{DiStatis} function of the \textbf{kimod} package:
<<>>=
Z1<-DiStatis(NCI60Selec_ESet)
@
<<>>=
class(Z1)
@
\texttt{Z1} is an object of the \texttt{DiStatis} S4 class. When it is printed, the main slots of \texttt{Z1} are shown: distance.methods (the kind of distance or scalar product calculated in each study), the inertia of the vectorial correlation, the Euclidean image of the studies, the compromise matrix, the $P$ matrix used to project all observations onto the consensus space, the representation quality of the observations, and the trajectories (i.e., the rows of the initial tables projected onto the compromise structure). To obtain the Euclidean image of the studies, run:
<<>>=
RVPlot(Z1)
@
Figure \ref{figure2} shows the relative contributions of each of the tables to components 1 and 2. We can see that Study 1 (corresponding to the Agilent platform) contributes least to the compromise.
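
The interstructure and compromise steps behind this plot can be illustrated with a minimal base-\proglang{R} sketch. It is not the code used inside \pkg{kimod}: the three tables are illustrative column blocks of the built-in \texttt{mtcars} data (the same 32 cars in every block), and the names \texttt{sk\_tabs}, \texttt{sk\_W}, \texttt{rv\_coef}, \texttt{sk\_RV}, \texttt{sk\_alpha}, \texttt{sk\_comp} and \texttt{sk\_P} are hypothetical. The sketch assumes standardized columns and takes the weights $\alpha_k$ from the first eigenvector of the matrix of vectorial correlations, rescaled to sum to one, which is one common choice.
<<>>=
# Three illustrative tables measured on the same observations
# (assumption: blocks of mtcars, not the NCI-60 tables).
sk_tabs <- list(block1 = as.matrix(mtcars[, 1:4]),
                block2 = as.matrix(mtcars[, 5:7]),
                block3 = as.matrix(mtcars[, 8:11]))

# Cross-product matrices W_k = X_k X_k^T of the standardized tables.
sk_W <- lapply(sk_tabs, function(X) tcrossprod(scale(X)))

# Vectorial correlation (RV) between two symmetric cross-product
# matrices: Tr(AB) / sqrt(Tr(AA) Tr(BB)).
rv_coef <- function(A, B) sum(A * B) / sqrt(sum(A * A) * sum(B * B))

# Interstructure: K x K matrix of RV coefficients between the tables.
sk_K <- length(sk_W)
sk_RV <- matrix(1, sk_K, sk_K,
                dimnames = list(names(sk_W), names(sk_W)))
for (i in seq_len(sk_K)) for (j in seq_len(sk_K))
  sk_RV[i, j] <- rv_coef(sk_W[[i]], sk_W[[j]])

# Weights alpha_k: first eigenvector of the RV matrix, rescaled to sum to 1.
sk_u1 <- abs(eigen(sk_RV, symmetric = TRUE)$vectors[, 1])
sk_alpha <- sk_u1 / sum(sk_u1)

# Compromise: weighted sum of the W_k, and its first two coordinates.
sk_comp <- Reduce(`+`, Map(function(a, W) a * W, sk_alpha, sk_W))
sk_eig <- eigen(sk_comp, symmetric = TRUE)
sk_P <- sk_eig$vectors[, 1:2] %*% diag(sqrt(sk_eig$values[1:2]))

round(sk_RV, 2)
round(sk_alpha, 2)
@
In this sketch, tables that agree with the others receive larger weights, which is the sense in which a study contributes more or less to the compromise.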
\begin{figure}[h]
\centering
\begin{center}
<<>>=
RVPlot(Z1,barPlot=FALSE)
@
\caption{Contribution of all tables to the compromise.}
\label{figure2}
\end{center}
\end{figure}
To obtain the projection of the observations on the compromise, run:
<<>>=
Tissues<-c(rep("Breast",5),rep("CNS",6),rep("Colon",7),
rep("Leukemia",6),rep("Melanoma",10),rep("Lung",9),
rep("Ovarian",7),rep("Prostate",2),rep("Renal",8))
@
<<>>=
Colours<-c(rep(colors()[657],5),rep(colors()[637],6),
rep(colors()[537],7),rep(colors()[552],6),rep(colors()[57],10),
rep(colors()[300],9),rep(colors()[461],7),rep(colors()[450],2),
rep(colors()[432],8))
@
<<>>=
CompPlot(Z1,xlabBar="",colObs=Colours,pch=15,las=1,
cex=2,legend=FALSE,barPlot=FALSE,cex.main=0.6,cex.lab=0.6,
cex.axis=0.6)
legend("topleft",unique(Tissues),col=unique(Colours),
bty="n",pch=16,cex=1)
@
Figure \ref{figure3} shows the projection of the cell lines onto the first two principal components of the compromise structure. The leukemia, melanoma and colon cell lines are clearly distinguished from the others. However, one melanoma cell line has a profile similar to those of the carcinomas (CNS, renal, ovarian, lung). Furthermore, the breast cancer cell lines vary widely: some samples group with colon tissues and others with CNS.
\begin{figure}[h]
\centering
\begin{center}
<<>>=
CompPlot(Z1,xlabBar="",colObs=Colours,pch=15,las=1,
cex=2,legend=FALSE,barPlot=FALSE,cex.main=0.6,cex.lab=0.6,
cex.axis=0.6)
legend("topleft",unique(Tissues),col=unique(Colours),
bty="n",pch=16,cex=1)
@
\caption{Compromise Plot. Projection of all tumoral tissues in the consensus space.}
\label{figure3}
\end{center}
\end{figure}
The sampling variability is obtained with the \texttt{Bootstrap} and \texttt{BootPlot} functions. \texttt{Bootstrap} receives as argument an object of the DiStatis class, and \texttt{BootPlot} draws the Sample-Variability Plot (see figure \ref{figure1}). The slot \texttt{Comparisions.Boot} shows the differences between observations, using the Bonferroni correction, for all dimensions.
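
The idea of bootstrapping the residual matrix, with $W=\hat{W}+\epsilon$ as in the subsection on sampling variability, can be illustrated with a short base-\proglang{R} sketch before calling the package functions. It is only a sketch under stated assumptions and does not reproduce the algorithm implemented in \texttt{Bootstrap}: the matrix \texttt{bs\_W} is an illustrative cross-product matrix built from the built-in \texttt{mtcars} data, the residuals are resampled with replacement from the upper triangle of $\epsilon$, and every bootstrap matrix is projected onto the original axes so that the replicates are comparable. All object names are hypothetical.
<<>>=
set.seed(1)  # only to make the sketch reproducible

# Illustrative stand-in for the compromise matrix W (assumption).
bs_W <- tcrossprod(scale(as.matrix(mtcars)))

# Rank-q estimate W_hat and residual matrix E = W - W_hat.
bs_q <- 2
bs_eig <- eigen(bs_W, symmetric = TRUE)
bs_U <- bs_eig$vectors[, 1:bs_q]
bs_d <- bs_eig$values[1:bs_q]
bs_W_hat <- bs_U %*% diag(bs_d) %*% t(bs_U)
bs_E <- bs_W - bs_W_hat

# Observed coordinates of the observations, P = U_q D_q^(1/2).
bs_P_obs <- bs_U %*% diag(sqrt(bs_d))

# Bootstrap: resample residuals, rebuild a symmetric W*, and project
# it onto the original axes with W* U_q D_q^(-1/2).
bs_n <- 200
bs_up <- upper.tri(bs_E, diag = TRUE)
bs_P_star <- replicate(bs_n, {
  E_b <- matrix(0, nrow(bs_E), ncol(bs_E))
  E_b[bs_up] <- sample(bs_E[bs_up], replace = TRUE)
  E_b <- E_b + t(E_b) - diag(diag(E_b))  # keep the matrix symmetric
  (bs_W_hat + E_b) %*% bs_U %*% diag(1 / sqrt(bs_d))
})

dim(bs_P_star)  # observations x dimensions x replicates
@
The spread of the replicated points of each observation around its observed position in \texttt{bs\_P\_obs} is what a confidence ellipse summarizes.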
<<>>=
B<-Bootstrap(Z1)
BootPlot(B,Points=FALSE,cex.lab=0.7,cex.axis=0.7,
las=1,xlimi=c(-0.003,0.002),ylimi=c(-0.005,0.007),
legend=FALSE,col=Colours)
legend("topleft",unique(Tissues),col=unique(Colours),
bty="n",pch=16,cex=1)
Comparisions.Boot(B)
@
Figure \ref{figure4} shows that the melanoma tissues have high internal variability. Moreover, from \texttt{slot(B,"Comparisions.Boot")} we can see that the Colon and Leukemia tissues and the \texttt{BR.MCF} and \texttt{LC\_NCI\_H522} cell lines separate from the others.
\begin{figure}[h]
\begin{center}
<<>>=
BootPlot(B,Points=FALSE,cex.lab=0.7,cex.axis=0.7,
las=1,xlimi=c(-0.003,0.002),ylimi=c(-0.005,0.007),
legend=FALSE,col=Colours)
legend("topleft",unique(Tissues),col=unique(Colours),
bty="n",pch=16,cex=1)
@
\caption{Sample-Variability Plot}
\label{figure4}
\end{center}
\end{figure}
To select the genes responsible for the projections of the tissues, and to explore gene expression profiles, we can use the \texttt{SelectVar} function, whose main argument is an object of the DiStatis class. This function builds the biplot for continuous responses, using an external procedure to obtain the regressors of the linear model (see section 4). It also allows genes to be selected using goodness-of-fit measures of the biplot models: adjusted $R^2$, p-value with Bonferroni correction, AIC or BIC. The percentage of selected variables is a user input (see figure \ref{figure5}).
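
The selection criterion can be illustrated with a minimal base-\proglang{R} sketch, which is not the implementation of \texttt{SelectVar}. Each variable is regressed on two coordinates, and the adjusted $R^2$ and the Bonferroni-corrected p-value of the fit are recorded. The data (the built-in \texttt{mtcars} table), the use of the first two principal components as a stand-in for the compromise coordinates, the 50\% cut-off and all object names are assumptions made for illustration; in particular, the cut-off rule may differ from the way \texttt{SelectVar} interprets \texttt{perc}.
<<>>=
# Illustrative standardized data (assumption: mtcars, not gene expression).
rb_X <- scale(as.matrix(mtcars))

# Stand-in for the compromise coordinates: first two principal components.
rb_P <- prcomp(rb_X)$x[, 1:2]

# Regress every variable on the two coordinates and keep the adjusted
# R^2 and the p-value of the overall F test.
rb_fit <- apply(rb_X, 2, function(y) {
  s <- summary(lm(y ~ rb_P))
  f <- unname(s$fstatistic)
  c(adj.r2 = s$adj.r.squared,
    p.value = pf(f[1], f[2], f[3], lower.tail = FALSE))
})

# Bonferroni correction over all variables.
rb_p_bonf <- p.adjust(rb_fit["p.value", ], method = "bonferroni")

# Keep the variables whose adjusted R^2 is at or above the median
# (hypothetical cut-off rule).
rb_keep <- rb_fit["adj.r2", ] >= quantile(rb_fit["adj.r2", ], 0.5)
colnames(rb_X)[rb_keep]
@
Variables that are well explained by the two coordinates are the ones that can be read reliably on the biplot, which is why a goodness-of-fit measure is a natural selection criterion.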
<<>>=
M1<-SelectVar(Z1,Crit="R2-Adj",perc=0.95)
layout(matrix(c(1,1,1,1,1,1,2,2),nrow=1))
Biplot(M1,labelObs = FALSE,labelVars=FALSE,
colObs=Colours,Type="SQRT",las=1,cex.axis=0.8,
cex.lab=0.8,xlimi=c(-3,3),ylimi=c(-3,3))
plot(0,type='n',axes=FALSE,ann=FALSE)
legend("topright",unique(Tissues),col=unique(Colours),
bty="n",pch=15,cex=1)
@
\begin{figure}[h]
\begin{center}
<<>>=
layout(matrix(c(1,1,1,1,1,1,2,2),nrow=1))
Biplot(M1,labelObs = FALSE,labelVars=FALSE,
colObs=Colours,Type="SQRT",las=1,cex.axis=0.8,
cex.lab=0.8,xlimi=c(-3,3),ylimi=c(-3,3))
plot(0,type='n',axes=FALSE,ann=FALSE)
legend("topright",unique(Tissues),col=unique(Colours),
bty="n",pch=15,cex=1)
@
\caption{Biplot. Projection of the selected genes on the compromise.}
\label{figure5}
\end{center}
\end{figure}
Besides, if the \texttt{Groups} argument of the \texttt{Biplot} function is \texttt{TRUE}, the variables are clustered using the Euclidean distance and the Ward algorithm (see figure \ref{figure6}).
\begin{figure}[h]
\begin{center}
<<>>=
layout(matrix(c(1,1,1,1,1,1,2,2),nrow=1))
Biplot(M1,labelObs = FALSE,labelVars=FALSE,
colObs=Colours,Type="SQRT",las=1,cex.axis=0.8,
cex.lab=0.8,xlimi=c(-3,3),ylimi=c(-3,3),Groups=TRUE,NGroups=4)
plot(0,type='n',axes=FALSE,ann=FALSE)
legend("topright",unique(Tissues),col=unique(Colours),
bty="n",pch=15,cex=1)
@
\caption{Biplot. Projection of the selected genes on the compromise.}
\label{figure6}
\end{center}
\end{figure}
Finally, to see the relationships between gene clusters and tissues, the \texttt{GroupProj} function can be used; its main argument is an object of the SelectVar class. This function uses the cluster package \citep{cluster}, which is loaded automatically by our package.
<<>>=
A1<-GroupProj(M1,method="ward",metric="euclidean",NGroups=4)
head(SortList(A1)[[1]])
@
The list shows that the genes of cluster 1 are over-expressed in melanoma and CNS tissues and under-expressed in colon and leukemia (black in figure \ref{figure6}).
\\ The genes in cluster 2 are over-expressed in Breast, CNS, Lung, Renal, Ovarian and Colon tissues and under-expressed in melanoma and leukemia (red in figure \ref{figure6}).
\\ Cluster 3 is related to under-expression in colon and leukemia tissues and to over-expression mainly in CNS and melanoma (green in figure \ref{figure6}).
\\ Finally, cluster 4 is associated with high expression in colon and leukemia tissues and in the breast cell lines BR.MCF7 and BR.T47D. The list of all gene clusters is obtained with:
<<>>=
A1<-GroupProj(M1,method="ward",metric="euclidean",NGroups=4)
Groups(A1)
@
\bibliographystyle{apalike}
\bibliography{cites}
\section*{Session Info}
<<>>=
sessionInfo()
@
\end{document}