% % NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is % likely to be overwritten. % % \VignetteIndexEntry{Using Categories to Analyze Microarray Data} %\VignetteDepends{Biobase, genefilter, hgu95av2.db, RColorBrewer, KEGG.db} %\VignetteSuggests{Rgraphviz} %\VignetteKeywords{EDA, graphs, ontology, visualization} %\VignettePackage{Category} \documentclass[11pt]{article} \usepackage{times} \usepackage{hyperref} \usepackage[authoryear,round]{natbib} \usepackage{times} \usepackage{comment} \textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \newlength{\smallfigwidth} \setlength{\smallfigwidth}{6cm} \newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textsf{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\MAT}[1]{{\bf #1}} \newcommand{\VEC}[1]{{\bf #1}} \newcommand{\Amat}{{\MAT{A}}} %%notationally this is going to break \newcommand{\Emat}{{\MAT{E}}} \newcommand{\Xmat}{{\MAT{X}}} \newcommand{\Xvec}{{\VEC{X}}} \newcommand{\xvec}{{\VEC{x}}} \newcommand{\Zvec}{{\VEC{Z}}} \newcommand{\zvec}{{\VEC{z}}} \newcommand{\calG}{\mbox{${\cal G}$}} \bibliographystyle{plainnat} \title{Using Categories to Model Genomic Data} \author{R. Gentleman} \begin{document} \maketitle \section*{Introduction} The analysis of genomic data is an important challenge in bioinformatics. The particular aspect of that problem that we will address is the analysis of gene expression data in conjunction with category data. Category data, are data that map from specific entities (genes in this case) to categories, or classes. In some cases the mapping will be a partition, so that each gene maps to one and only one category, but in most cases genes will map to multiple categories. One important aspect of category data is that the categories and the mappings from entities to categories are known \textit{a priori} and are not determined from the experimental data. There are very many biological examples of category data, and to some extent they may be approached using similar statistical methods. Examples include the mapping from genes to chromosomes (which is nearly a partition), the mappings from genes to pathways, the mappings from genes to GO \citep{GO} classifications, or the mappings from genes to protein complexes. The categories themselves may have complex relationships, as is the case with GO and protein complexes, but for now we concentrate on the mappings between genes and categories. To make some of the concepts concrete and to provide extensive, but related, examples we will make use of a microarray data set \citep{Chiaretti2004}. The data come from a clinical trial in acute lymphoblastic leukemia (ALL). We will focus our attention on the patients with B-cell derived ALL, and in particular on comparing the group with BCR/ABL (9;22 translocation) to those with no observed cytogenetic abnormalities. Category analysis is similar to the approach taken in \cite{Mootha2003}. However, category analysis can be viewed as an extension of that methodology that makes its application both simpler and richer. Among the important concepts is the notion that the search for sets of differentially expressed genes is not always the right approach for analyzing gene expression data. Given that a microarray typically measures the levels of coordinated gene expression averaged over a few thousands of cells it seems that a more holistic approach is sometimes warranted. We are often more interested in categories where the constituent genes show coordinated changes in expression over the experimental conditions than in sets of differentially expressed genes. The methods presented in this paper are one approach to taking a more global view. <>= library("Biobase") library("annotate") library("hgu95av2.db") library("KEGG.db") library("genefilter") library("Category") library("ALL") @ Let's consider the comparison of two different groups, or phenotypes and we assume that there are some number of DNA microarrays that have been obtained for each group. The actual method of comparing the expression levels in the two groups is, in some sense, irrelevant to the subsequent discussion and readers can easily substitute their own favorite methods. We refer readers to \cite{WileyEncy} or \cite{Smyth2004} for a general discussion of some of the issues involved in gene filtering. Here we will use a $t$-test. First we subset the ALL data to the two phenotypes that we would like to compare, those with BCR/ABL and those with no cytogenetic abnormalities, labeled NEG. <>= ## subset of interest: 37+42 samples data(ALL) esetA <- ALL[, intersect(grep("^B", as.character(ALL$BT)), which(as.character(ALL$mol) %in% c("NEG", "BCR/ABL")))] esetA@annotation = "hgu95av2" esetA$mol.biol = factor(esetA$mol.biol) esetASub = nsFilter(esetA, var.cutoff=0.5)$eset ##set up some colors BCRcols = ifelse(esetASub$mol == "BCR/ABL","goldenrod", "skyblue") library("RColorBrewer") cols = brewer.pal(10, "RdBu") @ \subsection*{Category Analysis} We have a set of data where there have been $G$ measurements on each of $n$ samples. We use $\Emat$ to denote the $G$ by $n$ data matrix. We consider the case where a univariate test statistic can be computed for each entity (gene in our case) and denote the resulting $G$ vector by $\xvec$. There is a given fixed set of categories, $\mathcal{C}$, and a set of entities (genes) $\mathcal{G}$ from which we can compute the incidence matrix $\Amat$, where $a[i,j] =1$ if entity $j$ is in category $i$. The question of interest is the identification of the elements of $\zvec = \Amat \xvec$ that are unusually large or small. \subsection*{Implementation} Consider the two-sample problem. Assume that there are $n$ microarrays available, and that they have collected data on $n$ samples under two conditions. Suppose that we have chosen to use a test statistics, $T$. There are several different methods of generating values of $\xvec_T$ under the null hypothesis that there is no difference between the two conditions. These include permuting the sample labels, carrying out a bootstrap simulation, or using any one of a number of other methods for generating a reference distribution. Once this reference distribution, $\{\xvec^b\}_{b=1}^B$ has been computed, it induces a distribution on $\zvec$, where $\zvec^b = \Amat \xvec^b$. Hence, for each $z_i$ we can compute marginal tests of whether that particular $z_i$ is extreme relative to the joint distribution. \subsubsection*{Parametric Assumptions} Suppose that $\Xvec$ is multivariate $N(\mu, \Sigma)$. The statistics are computed as $\Zvec = \Amat \Xvec$, and hence $\Zvec$ also follows a multivariate Normal distribution with mean $\Amat \mu$ and variance $\Amat \Sigma \Amat^\prime$. But if $\Sigma$ is unknown then it is not possible to carry out inference on $\Zvec$. For our situation $\Sigma$ is too large to easily be estimated. We note that if $\Xvec$ is made up of two sample $t$-statistics with $n$ reasonably large then the elements of $\Xvec$ are approximately $N(0,1)$ random variables. If the genes were independent $\Zvec$ divided by the square root of the row sums of $\Amat$ will itself be approximately multivariate Normal with mean zero and $\Sigma$ will be the $m$ by $m$ identity matrix. We use this approximation below. To carry this out we make use of the \Rfunction{rowttests} function in the \Rpackage{genefilter} package. <>= ttests = rowttests(esetASub, "mol.biol") ##find the probes that we are going to use fL = findLargest(featureNames(esetASub), abs(ttests$statistic), "hgu95av2") fL2 = probes2Path(fL, "hgu95av2") length(fL2) inBoth = fL %in% names(fL2) fL = fL[inBoth] @ In the computation to get \Robject{fL2} we first found all duplicate mappings to LocusLink identifiers and then selected the one with the largest $t$--statistic. Note that we passed the absolute value of the observed $t$--statistic in and so obtain the most extreme value. Then we remove all LocusLink identifiers that do not map to any known pathway and find that we have \Sexpr{length(fL2)} genes left. In the next code segment, we first reorder the genes by the \Robject{fl2} values and then compute $t$-tests, on a per row basis, using \Robject{mol.biol} variable in the phenotypic data to define the groups. <>= eS = esetASub[match(names(fL2), featureNames(esetASub)),] tobs = rowttests(eS, "mol.biol") @ Next we create the adjacency matrix that maps genes/probes to pathways. We will make the decision that we are not interested in pathways that have fewer than 5 members that are in the experimentally observed genes. <>= ##compute the Amat, and rearrange it to follow the order of genes Amat = t(PWAmat("hgu95av2")) AmER = Amat[,names(fL)] rs = rowSums(AmER) ##FIXME: you need to decide how big a pathway you care about... AmER2 = AmER[rs>5,] rs2 = rs[rs>5] nCats = length(rs2) @ There are \Sexpr{nCats} pathways (categories) that will be used for the analysis. <>= ##compute observed stats tA = AmER2 %*% tobs$statistic tA = tA/sqrt(rs2) names(tA) = row.names(AmER2) @ And now we can examine the resultant qq-plot, which is shown in Figure~\ref{fig:qq}. \begin{figure}[tp] \centering <>= qqnorm(tA) @ \includegraphics[width=1.25\smallfigwidth]{Category-qqplot} \caption{\label{fig:qq}% A qq-plot where each point represents a different category (in this case a pathway). There is one pathway with a remarkably low value. } \end{figure} In Figure~\ref{fig:qq} we see that there is one pathway for which the aggregate statistic seems to be unusually large, and negative. While there are a number of pathways that seem to have elevated levels of activity among those with BCR/ABL, the ribosome pathway stands out with a large and negative aggregate statistic. We can find that pathway and examine the data a bit more. Then in the next section we make use of the permutation approach to assess significance. <>= byTT = names(tA)[ tA< -7] @ \begin{figure}[tp] \centering <>= KEGGmnplot(byTT, eS, group=eS$mol, data="hgu95av2", main=paste(KEGGPATHID2NAME[[byTT]], paste("Overall:", round(tA[byTT], 3)), sep="\n")) @ \includegraphics[width=1.25\smallfigwidth]{Category-KEGGmnplot} \caption{\label{fig:KEGGmn} The mean plot for the \Sexpr{KEGGPATHID2NAME[[byTT]]} pathway. Each point represents a gene in the pathway and the $x$-value is determined by the mean expression in the BCR/ABL group while the $y$-value is determined by the mean in the NEG group.} \end{figure} Figure~\ref{fig:KEGGmn} suggests that the level of activity among genes involved in the Ribosome pathway seems larger in those labeled NEG than the samples from patients with BCR/ABL. Perhaps indicating that ribosomal activity is suppressed in those with BCR/ABL. We can further investigate the relationship between gene expression and the Ribosome pathway by examining heatmaps for this pathway. First, we examine the heatmap for those genes that were selected by our filtering approach, Figure~\ref{fig:hmRib1}, the colors in the top bar for Figure~\ref{fig:hmRib1} are gold for BCR/ABL and blue for NEG, while in Figure~\ref{fig:hmRib2} they are green for males and grey for females. <>= hmcol <- rev(colorRampPalette(brewer.pal(10, "RdBu"))(256)) spcol <- ifelse(eS$mol.biol=="BCR/ABL", "goldenrod", "skyblue") @ \begin{figure}[tp] \centering <>= tmp1 = KEGG2heatmap(byTT, eS, data="hgu95av2", main=paste(KEGGPATHID2NAME[[byTT]], paste("Overall:", round(tA[byTT], 3)), sep="\n"), col=hmcol, ColSideColors=spcol) @ \includegraphics[width=1.25\smallfigwidth]{Category-Ribosomeheatmap} \caption{\label{fig:hmRib1} The heatmap for the category with the Ribosome pathway, using all gene.} \end{figure} Here the patterns of expression do not seem to corroborate the finding. Yes there are differences in the patient samples with respect to the expression of mRNA for ribosomal mRNAs, but the differences are not so striking. So what is going on? Well, one of the genes gives us a hint - if you examine the heatmap carefully you will notice that one probe seems to be expressed in one set of samples and not in another - and the probe is \verb+41214_at+, which corresponds to RPS4Y1, which is a sex-linked ribosomal protein. If we then redraw the heatmap but color the sidebar according to sex, (green for males and slate for females) we see that these probes almost exactly separate the males and females (and suggest that there may be some errors in the meta-data). \begin{figure}[tp] \centering <>= spcol2 <- ifelse(eS$sex=="M", "lightgreen", "slategrey") tmp2 = KEGG2heatmap(byTT, eS, data="hgu95av2", main=paste(KEGGPATHID2NAME[[byTT]], paste("Overall:", round(tA[byTT], 3)), sep="\n"), col=hmcol, ColSideColors=spcol2) @ \includegraphics[width=1.25\smallfigwidth]{Category-HMbysex} \caption{\label{fig:hmRib2}% A heatmap of the selected probes for mRNA in the Ribosome pathway. The color bar is green for males and grey for females.} \end{figure} So we have found something that appears to be real, at least biological, but which may well be uninteresting. Gender is slightly confounded with the relationship between BCR/ABL and NEG cytogenetics and hence we will sometimes find effects that may be more correctly attributed to sex. To better understand this effect it will be necessary to modify the testing procedure so that sex can be adjusted for. This is reasonably straightforward and involves fitting a per gene linear model, with both a BCR/ABL effect and a sex effect and then using the standardized coefficients for the BCR/ABL effect in the GSEA analysis. \subsection*{A permutation distribution} <>= NPERM = 500 set.seed(444) @ If you are uncomfortable with the Normal-theory argument given in the previous section then it is important to assess the significance of the observed test statistics with respect to a reference distribution. To that end, we consider permuting the sample labels (that is which of the two groups, \verb+BCR/ABL+ or \verb+NEG+, a patient belongs to). In the code below we consider \Sexpr{NPERM} permutations. <>= v1 = ttperm(exprs(eS), eS$mol.biol, B=NPERM) permDm <- matrix(0.0, nrow=length(v1$perms[[1]]$statistic), ncol=length(v1$perms)) for (j in 1:ncol(permDm)) { permDm[ , j] <- v1$perms[[j]]$statistic } permD = AmER2 %*% permDm ##no need to do this second step - if we don't do if for tobs permD2 = sweep(permD, 1, sqrt(rs2), "/") pvals = matrix(NA, nr=nCats, ncol=2) dimnames(pvals) = list(row.names(AmER2), c("Lower", "Upper")) for(i in 1:nCats) { pvals[i,1] = sum(permD2[i,] < tA[i])/NPERM pvals[i,2] = sum(permD2[i,] > tA[i])/NPERM } ord1 = order(pvals[,1]) lowC = (row.names(pvals)[ord1])[pvals[ord1,1]< 0.05] highC = row.names(pvals)[pvals[,2] < 0.05] getPathNames(lowC) @ <>= lnhC = length(highC) @ There are \Sexpr{lnhC} pathways that have $p$-values less than $0.05$ where the mean is higher in the BCR/ABL group. We print the <>= getPathNames(highC)[1:5] @ Notice that we have used quite a large $p$-value, although our adjustment should be for the \Sexpr{nCats} categories that we are testing, and so it will not be too dramatic. We can visualize the differences in group means, just as we did before. These are shown in Figures~\ref{fig:mn1} through \ref{fig:mn3}. \begin{figure}[tp] \centering <>= KEGGmnplot(highC[lnhC], eS, group=eS$mol, data="hgu95av2", main=paste(KEGGPATHID2NAME[[highC[lnhC]]], paste("Overall:", round(tA[highC[lnhC]], 3)), sep="\n"), pch=16, col="blue") @ \includegraphics[width=1.25\smallfigwidth]{Category-mnplot1} \caption{\label{fig:mn1}% The per category mean plot for pathways that are deemed to be differentially expressed using a permutation approach.} \end{figure} \begin{figure}[tp] \centering <>= KEGGmnplot(highC[lnhC-1], eS, group=eS$mol, data="hgu95av2", main=paste(KEGGPATHID2NAME[[highC[lnhC-1]]], paste("Overall:", round(tA[highC[lnhC-1]], 3)), sep="\n")) @ \includegraphics[width=1.25\smallfigwidth]{Category-mnplot2} \caption{\label{fig:mn2}% The per category mnplot for pathways that are deemed to be differentially expressed using a permutation approach.} \end{figure} \begin{figure}[tp] \centering <>= KEGGmnplot(highC[lnhC-2], eS, group=eS$mol, data="hgu95av2", main=paste(KEGGPATHID2NAME[[highC[lnhC-2]]], paste("Overall:", round(tA[highC[lnhC-2]], 3)), sep="\n")) @ \includegraphics[width=1.25\smallfigwidth]{Category-mnplot3} \caption{\label{fig:mn3}% The per category mean plot pathway that are deemed to be differentially expressed using a permutation approach.} \end{figure} Unfortunately, as we can see from the visualizations none of these category plots are especially compelling. If we return to the category \Sexpr{KEGGPATHID2NAME[[byTT]]} then we see that the permutation $p$-value is \Sexpr{pvals[byTT, 2]}. %%FIXME: this one is a heatmap so I needed to move it \begin{figure}[tp] \centering <>= tmp2 = KEGG2heatmap(highC[lnhC-2], eS, data="hgu95av2", main=paste(KEGGPATHID2NAME[[byTT]], paste("Overall:", round(tA[highC[lnhC-2]], 3)), sep="\n"), col=hmcol, ColSideColors=spcol) @ \includegraphics[width=1.25\smallfigwidth]{Category-HMperm} \caption{\label{fig:HMperm}% } \end{figure} \section*{Using other functions} In the examples above we used, perhaps the simplest form of per category statistic, the summation. Extending the model to deal with virtually any other per-category function is quite simple and the cade to do this is available as the \Rfunction{applyByCategory} function. <>= byCmeds = applyByCategory(tobs$statistic, AmER2, FUN=median) byCrankTest = applyByCategory(tobs$statistic, AmER2, FUN=wilcox.test) ranS = sapply(byCrankTest, function(x) x$statistic) ranpv = sapply(byCrankTest, function(x) x$p.value) @ \bibliography{cat} \end{document}