% -*- mode: noweb; noweb-default-code-mode: R-mode; -*-
%\VignetteIndexEntry{MergeMaid primer}
%\VignetteKeywords{MergeMaid, expression}
%\VignetteDepends{MergeMaid}
%\VignettePackage{MergeMaid}
%documentclass[12pt, a4paper]{article}
\documentclass[12pt]{article}

\usepackage{amsmath}
\usepackage{hyperref}
\usepackage[authoryear,round]{natbib}

\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}

\author{Xiaogang Zhong, Leslie Cope, Elizabeth Garrett-Mayer, Giovanni Parmigiani}

\begin{document}
\title{Description of MergeMaid}
\maketitle

\section{Introduction}

MergeMaid is designed to facilitate multi-study analysis. The merging function generates objects that can efficiently support a variety of joint analyses. Visualization tools allow for exploration of the data without requiring normalization across platforms. We have updated the package by adding a quick approximate calculation of the integrative correlation.

Version 2.1.6 of MergeMaid includes the following primary functions, with corresponding data classes:

\begin{center}
\begin{tabular}{|lp{5in}|}
\hline
{\it mergeExprs} & Merge datasets into an object of class {\bf mergeExpressionSet}.\\
{\it intCor} & Compute integrative correlation coefficients,\\
 & returns an object of class {\bf mergeCor}.\\
{\it modelOutcome} & Fit various models to the data, \\
 & models currently available include linear regression, logistic regression, and Cox proportional hazards regression,\\
 & returns an object of class {\bf mergeCoeff}.\\
\hline
\end{tabular}
\end{center}

In addition, there are a number of functions for the manipulation, retrieval and visualization of data. These functions depend on the data class for which they are defined and are discussed below.

\begin{description}

\item[The mergeExprs function and the mergeExpressionSet class]
The primary data class in the MergeMaid package is the \verb+mergeExpressionSet+, based on the \verb+ExpressionSet+ class defined in Bioconductor. The function \verb+mergeExprs+ returns an object of class \verb+mergeExpressionSet+, which is required by all analytic functions included in the package.

A \verb+mergeExpressionSet+ object contains the following slots:

\begin{center}
\begin{tabular}{|lp{5in}|}
\hline
{\it data} & a list of \verb+ExpressionSet+ objects, one per study\\
{\it geneStudy} & incidence matrix indicating which genes are measured in each study. \\
{\it notes} & \\
\hline
\end{tabular}
\end{center}

The standard way to build a \verb+mergeExpressionSet+ object is with the function \verb+mergeExprs+. This function accepts expression data in a variety of formats, including \verb+ExpressionSet+ objects, simple matrices of expression values, and other \verb+mergeExpressionSet+ objects. Any combination of these is acceptable.

Merging is based on user-supplied gene IDs (e.g. Genbank, Unigene, or LocusLink IDs). These IDs should make up the row names of each expression data matrix. Frequently an expression array will include multiple probesets for some genes, and these may be assigned the same gene ID. This presents a special problem when merging data across platforms, and it becomes important when carrying out an analysis on the merged data (e.g. regression or survival analysis) for which genes need to be unambiguously matched. In general, appropriate measures are left to the user at the time of ID assignment. To prevent potential problems, replicates within a dataset that still share the same ID are averaged during the merging process.

There are a number of functions to access and manipulate the data in a \verb+mergeExpressionSet+:

\begin{center}
\begin{tabular}{|lp{5in}|}
\hline
{\it exprs} & returns the contents of the \verb+data+ slot\\
{\it geneStudy } & returns the contents of the \verb+geneStudy+ slot\\
{\it notes} & returns the contents of the \verb+notes+ slot\\
{\it names} & returns study names\\
{\it geneNames} & returns the entire list of gene IDs\\
{\it phenoData } & returns a list containing the phenodata (if any) included for each study\\
{\it [} & returns a \verb+mergeExpressionSet+ object containing only the indicated studies\\
{\it intersection} & returns a single \verb+ExpressionSet+ containing all studies and all common genes\\
{\it notes$<$-} & replaces the contents of the \verb+notes+ slot\\
{\it names$<$-} & replaces the study names\\
{\it geneNames$<$-} & replaces gene IDs. \\
{\it plot} & Draw scatterplots to compare integrative correlations for genes. \\
\hline
\end{tabular}
\end{center}

The two main analytic functions in the package are defined for \verb+mergeExpressionSet+ objects as well, but are discussed in separate sections, as each has an associated class.

\item[The intCor function and the mergeCor class]
When working with data from different sources, it is important to identify those genes which are measured in similar ways in the various datasets and can therefore be used in joint analyses. MergeMaid includes a gene reproducibility index called the {\bf integrative correlation coefficient}, which is calculated using the function \verb+intCor+. Within each study, and for each pair of genes, we calculate the correlation coefficient of expression values across subjects. By examining whether, for a specific gene, these correlations agree across studies, we can quantify the reproducibility of results without relying on direct comparison of expression across platforms. The integrative correlation provides a reproducibility score for each gene. This analysis is unsupervised in that consistency is measured without using information about sample phenotypes.
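
The description above can be written out as a formula. The following is only a sketch: the symbols $\rho^{(s)}_{gg'}$, $\bar{\rho}^{(s)}_{g}$ and $I^{(s,t)}_{g}$ are notation introduced here for illustration and are not part of the package, and the computation carried out by \verb+intCor+ may differ in detail, for example when an approximate calculation is used. Let $\rho^{(s)}_{gg'}$ denote the correlation, across the subjects of study $s$, between the expression values of genes $g$ and $g'$, and let $\bar{\rho}^{(s)}_{g}$ be the mean of $\rho^{(s)}_{gg'}$ over all genes $g' \neq g$ common to the studies being compared. Under these assumptions, the agreement for gene $g$ between studies $s$ and $t$ can be measured by the Pearson correlation of the two sets of within-study correlations:
\[
I^{(s,t)}_{g} =
\frac{\sum_{g' \neq g} \bigl(\rho^{(s)}_{gg'} - \bar{\rho}^{(s)}_{g}\bigr)\bigl(\rho^{(t)}_{gg'} - \bar{\rho}^{(t)}_{g}\bigr)}
{\sqrt{\sum_{g' \neq g} \bigl(\rho^{(s)}_{gg'} - \bar{\rho}^{(s)}_{g}\bigr)^{2}\,
\sum_{g' \neq g} \bigl(\rho^{(t)}_{gg'} - \bar{\rho}^{(t)}_{g}\bigr)^{2}}} .
\]
A value of $I^{(s,t)}_{g}$ close to 1 indicates that gene $g$ has a similar pattern of co-expression with the other genes in both studies. Only within-study correlations enter the formula, which is why no normalization across platforms is needed.
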

The output from the \verb+intCor+ function is an object of class \verb+mergeCor+, containing integrative correlation coefficients for a single \verb+mergeExpressionSet+ object. Such an object contains the following slots:

\begin{center}
\begin{tabular}{|lp{5in}|}
\hline
{\it pairwise.cors} & matrix containing the integrative correlation for each pair of studies. \\
{\it max.cors} & vector representing maximal canonical correlation (pairwise canonical correlations) for each pair of studies.\\
\hline
\end{tabular}
\end{center}

If $n$ is the number of studies, then for $1 \leq i < j \leq n$ the pairwise correlation of correlations for studies $i$ and $j$ is stored in column $(i-1)(n-1) - (i-2)(i-1)/2 + (j-i)$ of the \verb+pairwise.cors+ slot. For example, with $n=4$ the columns correspond to the study pairs $(1,2)$, $(1,3)$, $(1,4)$, $(2,3)$, $(2,4)$, $(3,4)$ in that order, so the pair $(2,3)$ is stored in column $1 \cdot 3 - 0 + 1 = 4$. The {\it total integrative correlation} for each gene is obtained by averaging the $n(n-1)/2$ pairwise integrative correlations.

The methods available for this class are:

\begin{center}
\begin{tabular}{|lp{5in}|}
\hline
{\it pairwise.cors} & Accessor function for the pairwise.cors slot \\
{\it max.cors} & Accessor function for the maximal canonical correlation (pairwise canonical correlations) for each pair of studies. \\
{\it integrative.cors} & Accessor function, returns total integrative correlation for each gene. \\
\hline
\end{tabular}
\end{center}

In addition, there is a function called \verb+intcorDens+, which plots a smooth density curve for the observed distribution of integrative correlation coefficients together with the null distribution density curve obtained by permuting expression values. These plots can be used to help identify a useful threshold of reproducibility. Since the permutation requires the original expression data, this function is defined for \verb+mergeExpressionSet+ objects rather than for \verb+mergeCor+ objects, but in spirit it belongs here.

\item[The modelOutcome function and the mergeCoeff class]
The function \verb+modelOutcome+ calculates gene/study specific coefficients for a variety of models.
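
As an illustration of what a gene/study specific coefficient means, the three model families can be sketched as follows. This sketch assumes that each gene enters its model as a single covariate, fitted separately within each study. The symbols $x_{gsi}$, $y_{si}$, $\alpha_{gs}$, $\beta_{gs}$ and $h_{0,gs}$ are notation introduced here and are not taken from the package, so the exact parameterization used by \verb+modelOutcome+ may differ. Let $x_{gsi}$ be the expression of gene $g$ for subject $i$ in study $s$, and let $y_{si}$ be that subject's outcome. Then
\begin{align*}
\text{linear regression:} \quad & \mathrm{E}[y_{si}] = \alpha_{gs} + \beta_{gs}\, x_{gsi}, \\
\text{logistic regression:} \quad & \operatorname{logit} \Pr(y_{si} = 1) = \alpha_{gs} + \beta_{gs}\, x_{gsi}, \\
\text{Cox regression:} \quad & h_{si}(t) = h_{0,gs}(t)\, \exp(\beta_{gs}\, x_{gsi}),
\end{align*}
where $h_{si}(t)$ is the hazard for subject $i$ at time $t$ and $h_{0,gs}(t)$ is an unspecified baseline hazard. In each case $\beta_{gs}$ is the coefficient for gene $g$ in study $s$, and comparing $\beta_{gs}$ across studies for a fixed gene shows whether its association with the outcome is reproducible.
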

The output from the \verb+modelOutcome+ function is an object of class \verb+mergeCoeff+. Such an object contains the following slots:

\begin{center}
\begin{tabular}{|lp{5in}|}
\hline
{\it coeffs} & a matrix of coefficients, rows=genes, columns=studies\\
{\it coeff.std} & matrix of standardized coefficients \\
{\it zscore} & matrix of z-scores for the coefficients \\
\hline
\end{tabular}
\end{center}

Only three models are currently implemented in MergeMaid: linear regression, logistic regression, and Cox proportional hazards regression.

Methods for this class include:

\begin{center}
\begin{tabular}{|lp{5in}|}
\hline
{\it coeff} & Accessor function for the coeff slot. \\
{\it coeffstd} & Accessor function for the coeff.std slot. \\
{\it zscore} & Accessor function for the zscore slot. \\
{\it plot} & Draw scatterplots to compare coefficients from different studies.\\
\hline
\end{tabular}
\end{center}

The plot function is actually defined for the matrix class, rather than for the \verb+mergeCoeff+ class. The usual syntax is \verb+plot(coeff(mergeCoeff))+, which first extracts the coefficient matrix and then plots it.

\end{description}

\end{document}