\documentclass[11pt]{article} %\copyrightyear{} \pubyear{} % \VignetteIndexEntry{Linear models in GSEA} %\VignettePackage{GSEAlm} \usepackage{fullpage} %\usepackage[authoryear,round]{natbib} \usepackage{subfigure} \setlength{\parindent}{0.2in} \setlength{\parskip}{0.1in} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textsf{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\ID}[1]{{\texttt{#1}}} %\newcommand{\texttt{BCR/ABL}}[1]{\texttt{BCR/ABL }} % \newcommand{\negf}[1]{\texttt{NEG }} % \linespread{1.2} \usepackage{natbib} \bibliographystyle{natbib} \begin{document} %\firstpage{1} \title{GSEAlm Package} \author{Assaf P. Oron, Robert Gentleman (with contributions from S. Falcon and Z. Jiang) \\ aoron@fhcrc.org,rgentlm@fhcrc.org} % \address{Computational Biology, 1100 Fairview Ave. N. M2-B876, % P.O. Box 19024, Seattle, Washington 98109-1024} \maketitle \begin{abstract} The \texttt{GSEAlm} package contains tools for linear-model inference and diagnostics, tailored for Gene Set Enrichment Analysis (GSEA). We demonstrate the package's main features, using the acute lymphoblastic leukemia (ALL) dataset, comparing the \texttt{BCR/ABL} and \texttt{NEG} phenotypes via GSEA based on chromosome-band mapping of genes. If you want to follow the text while running the code, please use the code from the \texttt{GSEAlm.R} file in the same directory where this document is found, because there are some code excerpts which were hidden from view in the tutorial below. A more up-to-date and academically oriented exposition of the tools coded in this package is available at \citet{OronEtAl08}. \end{abstract} \section{Background and Dataset} \label{sec:background} Gene set enrichment analysis \citep{Subramanian2005} is an important new approach to the analysis of gene expression data, and it has already been extended and generalized in a number of ways \citep{Tian2005, Kim2005, Jiang2007}. %%% FIXME (Assaf): I added this semi-philosophical paragraph, thinking %%% it can help understand what we're up to. What do you %%% think? In the same vein I %%% suggest to re-order the discussion so that we begin with %%% diagnostics rather than with the potential for expanding the model. %% %% FIXME:RG - we do compute per gene test statistics, but not just %% t-tests, and others that use GSEA do not, so you might want to say %% one approach is, rather than characterize all of GSEA. Expression analysis in general and GSEA in particular can be viewed as a cascade of successive data reductions: \begin{enumerate} \item Biochemical hybridization information is reduced to a set of pixel images (one per sample); \item Images are pre-processed to produce a $G\times n$ matrix of normalized average expression estimates ($G$ genes, $n$ samples); \item This matrix is then filtered out to remove irrelevant genes or samples (non-specific filtering); \item Per-gene statistics are calculated; \item Finally, these statistics are used to calculate gene-set-level statistics, which help identify differentially expressed or otherwise interesting gene-sets. \end{enumerate} This process is necessary, in order to bring the amount of information generated by the microrray experiment down to a scientifically relevant and manageable level, while retaining core features. However, given the immense extent of data reduction, it makes both theoretical and practical sense to find ways to check the process. This is where the linear-model toolset comes in handy. A linear model assumes that the mean of the response variable has a linear relationship with the explanatory variable(s). In gene-expression terminology, the typical generic model could be written as: \begin{equation} y_{gi} = \beta_{g0} + \sum_{j=1}^pX_{ij}\beta_{gj} + \epsilon_{gi}, \label{eq:modelgeneric} \end{equation} where \begin{itemize} \item $y_{gi}$ is the gene expression value of gene $g$ in sample $i$; \item $p$ is the number of explanatory variables in the model; \item $X_{ij}$ is the value of the $j$-th variable for the $i$-th sample. For dichotomous variables such as phenotype, one typically sets $X$ to zero or one (e.g., \texttt{NEG} will be zero and \texttt{BCR/ABL} one); \item $\beta_{gj}$ is the true value of the effect of variable $j$ upon the expression of gene $g$; and \item $\epsilon_{gi}$ is a random error (``noise''), often assumed to follow a normal distribution with mean zero and variance $\sigma^{2}$. \end{itemize} The data and model are used to calculate a fitted value for each observation, denoted as $\hat{y}_{gi}$. The residuals, $e_{gi}\equiv y_{gi}-\hat{y}_{gi}$ can be naively seen as estimates for the random errors, but they play a far greater role as will be seen below. Note that applying linear models to gene expression in the way outlined here, involves fitting the same model form to all genes separately and simultaneously. <>= library("annotate") library("GOstats") library("graph") #require("Rgraphviz", quietly=TRUE) #library("Rgraphviz") library("hgu95av2.db") library("genefilter") library("ALL") library("lattice") library("RColorBrewer") HMcols = rev(brewer.pal(10,"RdBu")) cols = brewer.pal(10, "BrBG") @ We load the \Rpackage{GSEAlm} package. Then we load the data and perform some initial filtering. The example dataset is from a clinical trial in acute lymphoblastic leukemia (ALL). This dataset is available from Bioconductor, under the name \Rpackage{ALL}. <<>>= library("GSEAlm") @ %%FIXME: RG says a few words about what and why are needed. <>= data(ALL) ### Some filtering; ### see explanation below code bcellIdx <- grep("^B", as.character(ALL$BT)) bcrOrNegIdx <- which(as.character(ALL$mol.biol) %in% c("NEG", "BCR/ABL")) esetA <- ALL[ , intersect(bcellIdx, bcrOrNegIdx)] esetA$mol.biol = factor(esetA$mol.biol) # recode factor ### Non-specific filtering esetASub <- nsFilter(esetA,var.cutoff=0.6,var.func=sd)$eset @ The \Rpackage{ALL} dataset contains \Sexpr{dim(ALL)[1]} genes and \Sexpr{dim(ALL)[2]} samples. We are interested, among other things, in comparing the \texttt{BCR/ABL} and \texttt{NEG} phenotypes, appearing under the \texttt{mol.biol} variable (hereafter: ``the phenotype effect''). There are \Sexpr{dim(esetA)[2]} samples with one of the two phenotypes; we remove all the rest. Standard non-specific filtering of genes was also performed in order to remove most of the unexpressed genes. The common assumption is that only about $40\%$ of genes are expressed in typical tissues, and therefore we remove the $60\%$ with less-variable expression across samples, using standard deviation as the threshold criterion. The filtered dataset contains \Sexpr{ncol(esetASub)} samples and \Sexpr{nrow(esetASub)} unique genes. Next we identify gene-sets based on chromosomal location. Such an analysis is of clinical interest, as ALL has been associated with chromosomal irregularities. We mapped the chromosomal location of each gene in the filtered dataset, using tools and methods described in \citet{FalconGent2007} and available in the \Rpackage{Category} package. The resulting gene-set structure is hierarchical, and can be represented as a tree graph. More details follow the code excerpt. <>= minBandSize = 5 haveMAP = sapply(mget(featureNames(esetASub), hgu95av2MAP), function(x) !all(is.na(x))) workingEset = esetASub[haveMAP, ] entrezUniv = unlist(mget(featureNames(workingEset), hgu95av2ENTREZID)) ### Creating incidence matrix and keeping the graph structure AgraphChr=makeChrBandGraph("hgu95av2.db",univ=entrezUniv) AmatChr = makeChrBandInciMat(AgraphChr) AmatChr3 = AmatChr[rowSums(AmatChr)>=minBandSize,] # Re-ordering incidence matrix columns egIds = sapply(featureNames(workingEset), function(x) hgu95av2ENTREZID[[x]]) idx = match(egIds, colnames(AmatChr)) AmatChr3 = AmatChr3[, idx] colnames(AmatChr3)=featureNames(workingEset) # Updating our graph to include only the bands that actually # appear in the matrix (doing it a bit carefully though...) # AmatChr3 = AmatChr[!duplicated(AmatChr),] AgraphChr3 = subGraph(c("ORGANISM:Homo sapiens",rownames(AmatChr3)),AgraphChr) # AgraphChr3 = subGraph(rownames(AmatChr3),AgraphChr) @ There are \Sexpr{nrow(AmatChr3)} bands containing at least \Sexpr{minBandSize} genes, and there are \Sexpr{ncol(AmatChr3)} genes mapped to such chromosome bands. To utilize our \Rpackage{GSEAlm} toolset, we need a chromosome-band {\bf incidence matrix}: a matrix with as many rows as gene-sets, and as many columns as genes. Matrix elements are $1$ if the gene belongs to the gene-set, and $0$ otherwise. Rather than use the convenient single-step function \texttt{MAPAmat} to create this matrix, here we first create a hierarchical chromosome tree using \texttt{makeChrBandGraph}, and then generate the matrix from the tree using \texttt{makeChrBandInciMat} (all 3 functions are found in the \Rpackage{Category} package). This way we retain the tree in our working memory, which will prove useful down the road. Note that the tree's trunk begins with the species-identifying node \texttt{\Sexpr{nodes(AgraphChr3)[1]}}, meaning that complete chromosomes are represented as the first-level branches. \section{The $t$-Test as a Linear Model, and Basic Diagnostics} We arrive at step 4 of the data-reduction cascade outlined in the introduction: individual-gene statistics, which in our case are $t$-tests. A point often overlooked is that $t$-tests are identical to a linear model with a single dichotomous explanatory variable (hereafter: ``factor''). This means that even for $t$-tests we can make use of linear-model tools. To demonstrate that the two approaches yield equivalent effect estimates, Fig. \ref{fig:lm1} plots the $t$-statistics produced by the \texttt{rowttests} function from the \Rpackage{genefilter} package, vs. the analogous statistics produced by a one-factor phenotype model using the \texttt{lmPerGene} function from the \Rpackage{GSEAlm} package. The values are identical except for a sign flip: the two functions have opposing conventions for factor level ordering (\texttt{rowttests} calculates "first in alphabetical order minus second", while \texttt{lmPerGene} uses the opposite which is similar to the \texttt{lm} default). In our particular application, the former convention makes more sense since the absence of mutation (i.e., "\texttt{NEG}") is a more natural reference. We therefore reverse factor level ordering, via the standard \texttt{relevel} function. <>= lmPhen <- lmPerGene(workingEset,~mol.biol) ##fit the t-tests model tobsChr <- rowttests(workingEset,"mol.biol") ## fit it via the linear-model interface lmEsts = lmPhen$tstat[2,] plot (tobsChr$stat,lmEsts,main="The t-test as a Linear Model", xlab="T-test t-statistic",ylab="One-Factor Linear Model t-statistic") ### Re-leveling the factor workingEset$mol.biol<-relevel(workingEset$mol.biol,ref="NEG") lmPhen <- lmPerGene(workingEset,~mol.biol) lmEsts = lmPhen$tstat[2,] @ \begin{figure} \centering \includegraphics[height=0.4\textwidth,width=0.4\textwidth]{lmPhen} \caption{\label{fig:lm1}T-statistics for gene-level phenotype effect, from the \texttt{rowttests} and \texttt{lmPerGene} functions, plotted against each other.} \end{figure} Using the linear-model function instead of just a $t$-test, enables us to directly calculate residuals. We have as many residuals as raw data points -- in our case, \Sexpr{nrow(workingEset)} by \Sexpr{ncol(workingEset)}. For the residual analysis, it is better to use normalized residuals, so that different genes will have similar magnitude. There are several ways to achieve this, producing normalized, internally-Studentized or externally-Studentized residuals \citep{CookWeisberg82}. The \texttt{getResidPerGene} function allows for all these options (and also the raw un-normalized residuals), with externally-Studentized residuals being the default. Arranging Studentized residuals by sample helps identify individual samples that tend to exhibit high or low expression levels. Similarly, we can look for samples with noticeably more or less variable residuals. If a particular sample is systematically different from the rest, one may suspect imperfect normalization or other technical issues. Otherwise, this may indicate some real underlying phenomena. Fig. \ref{fig:resid1} summarizes all Studentized residuals by sample (one box per sample), arranged by phenotype, using the \texttt{resplot} function. Several samples catch the eye, especially in the \texttt{NEG} phenotype (left) where the residuals from samples \ID{68001}, \ID{28001} and \ID{16009} are predominantly negative. In the \texttt{BCR/ABL} phenotype (right), residuals from sample \ID{84004} appear to be systematically higher than all the rest. Note that these features are {\bf not} evident from boxplots of raw expression intensities.\footnote{To see this, try replacing variable \texttt{lmPhenRes} by \texttt{workingEset} in the 'resplot' command below; you'll also need to remove the constraint \texttt{lims=c(-5,5)}. } Note that \Rfunction{lmPerGene} also allows for an intercept-only model (by omitting the formula) -- so you can examine residuals for patterns even without assuming any phenotype effect. <>= lmPhenRes <- getResidPerGene(lmPhen) resplot(resmat=exprs(lmPhenRes),fac=workingEset$mol,cex.main=.7,cex.axis=.6, horiz=TRUE,lims=c(-5,5),xname="",col=5,cex=.3) @ \begin{figure} \centering \includegraphics[height=0.8\textheight,width=0.8\textwidth]{SimpleResBoxplot} \caption{\label{fig:resid1}Raw residuals from a linear model of gene expression on phenotype for all genes of the filtered \Rpackage{ALL} dataset, grouped by sample and arranged by phenotype (\texttt{NEG} on left, \texttt{BCR/ABL} on right).} \end{figure} \section{Gene-Set Residuals} %%FIXME: we also propose to use y ~ 1+ x, where x = 0/1 depending on %%membership in S, and that is what I use pretty much exclusively now, %%it might be good to think about if that approach is suitable here We proceed to the next GSEA step: producing summary statistics of the phenotype effect by chromosome band. There are several ways, all roughly equivalent, to achieve this \citep{Subramanian2005,Jiang2007,EfronTibshi08}. Here we chose the statistic of \citet{Jiang2007} (hereafter, ``the J-G statistic''), for having the simplest theoretical properties, and for other reasons that will become apparent shortly. The J-G statistic can be defined as \begin{equation} \tau_{S} = \sum_{g \in S} \left(t_{g}-t_{ref}\right) / \sqrt{|S|}, \label{eq:gstau} \end{equation} where $t_{g}$ is the phenotype-effect $t$-statistic for gene $g$, $t_{ref}$ is some estimate of the overall mean shift (e.g., the median of all $t$-statistics in the experiment) and $|S|$ is the size of gene set $S$. Since gene-gene correlations may induce a size effect on $\tau_S$ (Oron and Gentleman, forthcoming), and since the sample size is large enough, we recommend basing inference upon phenotype-label (``column'') permutation $p$-values rather than upon comparing $\tau_S$ to standard Normal or $t$ distributions. Function \Rfunction{gsealmPerm} provides a reasonably fast implementation of this test for all chromosome bands simultaneously. We skip this test for now, and focus on diagnostics instead. Similarly to the gene-set statistics, we can create gene-set aggregate residuals by sample, using a formula analogous to (\ref{eq:gstau}): \begin{equation} r_{Si} = \sum_{g \in S} r_{gi} / \sqrt{|S|}, \label{eq:gser} \end{equation} where $r_{gi}$ is the Studentized residual from sample $i$ and gene $g$. We will refer to the $r_{Si}$ as ``GS residuals''. Both GS main effect statistics and GS residuals can be generated from individual-gene information, using the \texttt{GSNormalize} function. Figure \ref{fig:hm1} displays a heatmap of the $r_{Si}$, with chromosome bands in rows and samples in columns. Red indicates positive values and blue negative values. Only the lowest possible hierarchical level in each chromosome band (subject to the constraint of at least \Sexpr{minBandSize} genes) was used, in order to avoid redundancies and overlap-related distortions. Since we retained the chromosome's tree structure, this is easily achieved using the \texttt{leaves} function from the \Rpackage{graph} package. Both rows and columns were simultaneously re-ordered to identify clusters with highly correlated residual patterns. Note that we prefer to measure similarity via correlation, rather than Euclidean distance which is the default for the \texttt{heatmap} function. This is because we are interested in finding groups of samples (or chromosome bands) that exhibit the same qualitative behavior with respect to the dataset's baseline. <>= ## now we are going to aggregate residuals over chromosome bands stdrAchr=GSNormalize(exprs(lmPhenRes),AmatChr3) #rAchr = AmatChr3 %*% exprs(lmPhenRes) #rAsqrSums = sqrt(rowSums(AmatChr3)) #stdrAchr = sweep(rAchr, 1, rAsqrSums, FUN="/") @ <>= # brColors <- ifelse(colnames(stdrAchr) %in% branch1,"brown", "grey") # molColors <- ifelse(workingEset$mol=="BCR/ABL","brown", "grey") kinetColors <- ifelse(workingEset$kinet=="hyperd.","brown", "grey") onecor=function(x) as.dist(1-cor(t(x))) # To get correlation-based heatmap ### In the heatmap we only use the lowest-level bands, or "leaves" of the graph ChrLeaves=leaves(AgraphChr3,"out") ### for safety ChrLeaves=ChrLeaves[ChrLeaves %in% rownames(AmatChr3)] LeafGenes=which(colSums(AmatChr3[ChrLeaves,])>0) bandHeatmap=heatmap(stdrAchr[ChrLeaves,],scale="row",col = HMcols, ColSideColors=kinetColors,keep.dendro=TRUE,distfun=onecor, labRow=FALSE,xlab="Sample",ylab="Chromosome band") @ \begin{figure} \centering \includegraphics[height=0.7\textheight,width=0.7\textwidth]{rAExChrHmap} \caption{\label{fig:hm1}GS residuals from the linear model of gene expression on phenotype for each chromosome band (row) and sample (column). Red colors indicate positive residuals, and blues indicate negative residuals. Both rows and columns are clustered according to a correlation-based similarity measure. The colored band above the map indicates the value of the \texttt{kinet} variable -- grey for diploid, red for hyperdiploid and white for unknown.} \end{figure} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Fig. \ref{fig:hm1} exhibits some intriguing patterns. First, one of the samples identified above as having low residuals -- \ID{28001} -- catches the eye as a narrow predominantly-blue vertical strip, again suggesting a technical normalization issue with this sample. More interesting from a modeling perspective is the apparent block or checkerboard pattern of the heatmap. This pattern indicates a potential association between certain samples and the expression level of certain chromosomal locations, an association not explained by phenotype. A skeptic might ask whether perhaps the block pattern is only an artifact of the heatmap's graphical method, which allows for grouping and reordering of both rows and columns. For this purpose, Fig. \ref{fig:hm_fake} displays the heatmap of a similarly sized, randomly-generated dataset. Since from the residual plot on Fig. \ref{fig:resid1} we know that different samples are prone to some normalization-related offset, the data were generated by superimposing independent errors on sample-specific baselines that (randomly) differ between columns. This error structure generates a heatmap with strong vertical features. On the other hand, in the absence of any real correlation between gene-sets and expression, horizontal features, if any, are mild and localized -- and therefore there is no apparent block pattern. <>= colbase=rexp(79,rate=3)*sample(c(-1,1),size=79,replace=T) randres=sweep(matrix(rnorm(length(ChrLeaves)*79),ncol=79),2,colbase) heatmap(randres, scale="row",col = HMcols, ColSideColors=kinetColors,keep.dendro=TRUE,distfun=onecor, ,labRow=FALSE,xlab="Sample",ylab="Chromosome band") @ \begin{figure} \centering \includegraphics[height=0.7\textheight,width=0.7\textwidth]{FakeHmap} \caption{\label{fig:hm_fake} Randomly generated GS residuals with a different baseline for each sample (column). Sample baselines were generated from an exponential distribution, and individual-point noise from a normal distribution. Otherwise, layout and details are identical to those of Fig. \ref{fig:hm1}.} \end{figure} So it seems that there may be unaccounted-for factors associated with groups of chromosome bands. Among the \Sexpr{ncol(pData(workingEset))} descriptive variables available for each sample, we found the ``\texttt{kinet}'' variable to be strongly correlated with the observed pattern. This variable indicates whether the samples exhibit hyperdiploidy, i.e., having substantially more than $46$ chromosomes. Clearly, this property could be associated with increased expression of certain chromosomes -- and if there are specific hyperdiploidy patterns common in ALL, this could help explain the block pattern of Fig. \ref{fig:hm1}. The \texttt{kinet} variable is illustrated as a colored band at the top of the heatmap, with red indicating hyperdiploid samples, grey diploid samples, and white samples with unknown status. Even though hyperdiploid subjects are less than a quarter of the sample size, they form a strong majority in a cluster of samples characterized by predominantly negative residuals over most chromosome bands, and strongly positive residuals over the remaining bands. We conclude that adding ``\texttt{kinet}'' to the model may be an idea worth pursuing. It should be noted, though, that while such a model expansion might help account for most of the block pattern, it will not be of any use for minority members -- especially the $6$ diploid samples in ``sample cluster 1'', and the $8$ hyperdiploid samples not belonging to either cluster. Hence, some remnant (or a milder mirror-image) of the block pattern would probably still show after model expansion. Another factor of interest, which can be used as a benchmark because it certainly plays a role in the expression of some genes, is sex. We conclude that adding sex, too, as a covariate in the model may be of use (a more detailed discussion of Y-chromosome expression and sex assignment in this dataset is found in \citet{OronEtAl08}). %Genes on the Y chromosome can only be expressed among males. Y %chromosome bands occupy only two adjacent rows in Fig. \ref{fig:hm1}, %and therefore the sex-associated pattern is not highly visible. Yet, %GS residuals for the entire Y-chromosome does show a marked gender effect, %as expected (Fig. \ref{fig:Yres}). Interestingly, the pattern %is somewhat noisy, to the extent that one may suspect some gender %mis-assignment problems in the dataset documentation, especially regarding %several male samples whose Y-chromosome GS residuals are quite %negative. A closer inspection of this issue can be provided by looking %at all individual-gene Y chromosome residuals, and clustering them by sample %(Fig. \ref{fig:YresidsDetail}); this can be achieved via the \texttt{resplot} function. %The graphical inspection does not yield a clear-cut answer to %the question whether such errors had occurred. There are a few outlying %samples in each gender, but none of them show ``typical'' opposite-gender expression levels. %Moreover, there is also a high degree of overall variability. % %In any case, % % %<>= % %resY <- stdrAchr["Y",] %boxplot(resY ~ workingEset$sex,range=1,names=c("Female","Male"),ylab="GS Residual", % cex=0.8,cex.axis=1.5,boxwex=.3,cex.lab=1.5) %lines(c(-1,3),rep(0,2),col=2) %@ % %\begin{figure}[!h] \centering % \includegraphics[height=0.3\textwidth,width=0.3\textwidth]{YrABoxplot} %\caption{\label{fig:Yres} Boxplot of GS residuals for the Y-chromosome, obtained from a linear %model with only phenotype as predictor and grouped by sex.} %\end{figure} % %<>= %resplot(GSname="Y",resmat=exprs(lmPhenRes),gnames=c("Female","Male"),incidence=AmatChr3,fac=workingEset$sex,cex.axis=0.7,cex=0.4) %@ %\begin{figure} [!h] % \centering \includegraphics[height=0.7\textheight,width=0.8\textwidth]{YResDetailedBoxplot} %\caption{\label{fig:YresidsDetail}Raw residuals of Y-chromosome genes from % linear model of gene expression on phenotype for the \Rpackage{ALL} dataset, % grouped by sample and arranged by sex (females on top, males on bottom).} %\end{figure} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%% START OF 3-FACTOR LINEAR MODEL PART %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{\label{sec:lmWsex}GSEA and Diagnostics, Using a 3-Variable Model} The previous section's findings suggest a $3$-variable model with phenotype, hyperdiploidy and sex; we'd like to implement it using \texttt{lmPerGene}. There are $4$ samples with missing data for this model; we can either give them up and use the remaining $75$, or {\bf impute} the data by artificially filling the missing values, and thus retain all $79$ samples. Here we choose the former, mostly because it is very unclear whether to assign sample \ID{25006} as diploid or hyperdiploid. However, if one chooses to impute, the following code excerpts creates a mirror \Robject{expressionSet} object named \Robject{imputeEset}; substitute for \Robject{workingEset} in all subsequent code to see the difference (there should be some difference in residual patterns, but very little in bottom-line inference). <>= sampleIDs=rownames(pData(workingEset)) imputeEset=workingEset imputeEset$sex[is.na(workingEset$sex)] <- 'M' imputeEset$kinet[is.na(workingEset$kinet)] <- 'dyploid' ### This is the questionable sample; try once as diploid ### (by skipping the following line), and once as hyperdiploid imputeEset$kinet[which(sampleIDs=="25006")]<-'hyperd.' @ We continue with the original (un-imputed) dataset; \Rfunction{lmPerGene} as a default removes samples with missing observations, just like \Rfunction{lm}. We also generate residuals, GS $\tau$ statistics and GS residuals. <>= lmExpand <- lmPerGene(workingEset,~mol.biol+sex+kinet) lmExpandRes <- getResidPerGene(lmExpand,type="extStudent") lmExpandTees <- t(lmExpand$tstat[2:4,]) lmExpandBandTees<-GSNormalize(lmExpandTees,AmatChr3) GSresidExpand=GSNormalize(exprs(lmExpandRes),AmatChr3) @ The GS residual heatmap is shown on Fig. \ref{fig:hmExpanded}. The patterns are more predominantly vertical than in Fig. \ref{fig:hm1}, but not as exclusively vertical as in Fig. \ref{fig:hm_fake}. Most of the remnant patterns resembling Fig. \ref{fig:hm1}'s blocks seems to be related to minority members, as explained earlier. Hyperdiploidy (depicted on the top bar) does not appear to be associated with any sample cluster anymore.\footnote{Note that we have to be a bit careful in setting up the heatmap, since we have $4$ columns less than in the single-factor one; we take advantage of the fact that residuals are an \Robject{expressionSet} type object to redefine the color vector on the top bar.} <>= kinetColors2 <- ifelse(lmExpandRes$kinet=="hyperd.","brown", "grey") bandHeatmapExp=heatmap(GSresidExpand[ChrLeaves,], scale="row",col = HMcols, ColSideColors=kinetColors2,keep.dendro=TRUE,distfun=onecor, labRow=FALSE,xlab="Sample",ylab="Chromosome band") @ \begin{figure} \centering \includegraphics[height=0.7\textheight,width=0.7\textwidth]{lm3FacHeatmap} \caption{\label{fig:hmExpanded}GS residuals from the $3$-factor linear model. Layout and details are similar to those of Fig. \ref{fig:hm1}.} \end{figure} <>= nperm=125 flagp=0.01 @ We proceed to inference. As mentioned above, due to within-sample correlations the safe approach to inference is via label permutation tests. For a multiple regression model, we can infer about one factor of interest only by permuting its labels {\bf within each subgroup defined by a level combination of the adjusting factors}. For example, the phenotype effect will be evaluated by separately and simultaneously permuting the phenotype labels of hyperdiploid females, hyperdiploid males, diploid females and diploid males, without mixing labels between the four groups. Using this approach the adjusting factors must be categorical (the factor of interest itself can be continuous). Function \texttt{gsealmPerm} performs such a permutation analysis. The factors are given as the RHS of a standard \texttt{R} model formula, with the first factor being the one of interest. If inference is needed for other factors, the same function can be called with a different factor ordering in the formula. In this document, we run only \Sexpr{nperm} permutations to save compilation time. Below is a list of flagged chromosome bands, at the \Sexpr{flagp} level on either side, for the phenotype effect. Theoretically we approach the p-value as an alert flag, rather than an exact probability value (which it is not, due to both the hierarchical structure and multiple testing). In any case, to make the list less redundant we look only at the chromosome-tree ``leaves'' as before. The first list is of bands down-expressed on the \texttt{BCR/ABL} phenotype, followed by a list of those up-expressed on \texttt{BCR/ABL}. Note that we have set \texttt{removeShift=TRUE} (this is not an argument of \Rfunction{gsealmPerm} itself, but rather passed on to \Rfunction{GSNormalize}) -- meaning that differential expression is estimated vs. the baseline gene-level expression gap between the phenotypes (calculated by default as the median of gene-level $t$-statistics). <>= pvalsExpand=gsealmPerm(workingEset[LeafGenes,],~mol.biol+sex+kinet,AmatChr3[ChrLeaves,LeafGenes],nperm=nperm,removeShift=TRUE) pvalsExpand[pvalsExpand[,1]>= %pvalsSex.Exp=gsealmPerm(workingEset,~sex+kinet+mol.biol,AmatChr3,nperm=nperm) %pvalsSex.Exp[rownames(AmatChr3) %in% ChrLeaves & pvalsSex.Exp[,1]>= chrNames=c(as.character(1:22),"X") pvalsHyper=gsealmPerm(workingEset,~kinet+mol.biol+sex,AmatChr3[chrNames,],nperm=nperm) pvalsHyper[pvalsHyper[,2]>= GSChromeResid=GSresidExpand[chrNames,] hyperHmap=heatmap(GSChromeResid[pvalsHyper[,2]<0.1,lmExpandRes$kinet=="hyperd."], scale="row",col = HMcols,keep.dendro=TRUE,distfun=onecor,xlab="Sample ID",ylab="Chromosome") @ \begin{figure} \centering \includegraphics[height=0.3\textheight,width=0.45\textwidth]{hyperdiploidHeatmap} \caption{\label{fig:hmHyperd} Complete-chromosome GS residuals from the $3$-factor linear model, for chromosomes flagged as over-expressed under hyperdiploidy ($p<0.1$), and for the hyperdiploid subjects only.} \end{figure} % <>= % resplot(GSname="X",resmat=exprs(lmExpandRes),incidence=AmatChr3,fac=workingEset$kinet,gnames=c("Diploid","Hyperdiploid"),prefix="Chromosome",notch=T) % @ % \begin{figure} % \centering % \includegraphics[height=0.25\textheight,width=0.5\textwidth]{Xresplot} % \caption{\label{fig:Xresplot}Standardized % individual-gene residuals from the $3$-factor linear model for X % chromosome genes, displayed by sample and split into the diploid % (top) and hyperdiploid (bottom) groups.} % \end{figure} \subsubsection{Influence Analysis} Recall that some samples (notably \ID{28001}) were visually flagged as potentially problematic outliers. What should we do about them? If there is still access to earlier-stage data (physical samples or raw images), it is always a good idea to inspect questionable samples directly. In the absence of such access, one needs to answer the practical question: how strongly does a single outlying observation affect model inference? Since our model includes only dichotomous factors, fitting it boils down to calculating group averages -- for $2$ groups in the $1$-factor case and for $8$ groups in the $3$-factor case. The smaller the group, the more leverage each sample in the group exerts (sample leverage values are obtained via the \texttt{Leverage} function). A measure known as Cook's $D$ \citep{CookWeisberg82} incorporates a sample's leverage together with the magnitude of its residual, to calculate the square distance by which a single observation ``moves'' the model's parameter estimates.\footnote{The distance is measured in $p$-dimensional parameter space and normalized by the standard error.} The \texttt{CooksDPerGene} function performs this calculation for all samples and genes simultaneously. In the GSEA framework, we would probably like to aggregate individual-gene $D$ values within each chromosome band, in an analogous manner to $\tau_{Si}$ and $r_{Si}$. However, normalizing by square-root makes little sense since Cook's $D$ is positive, not pivotal around zero. Instead, we suggest using \begin{equation} D_{Si} = \sqrt{\sum_{g \in S} D_{gi} / |S|} \label{eq:gsD}, \end{equation} which is simply the root-mean $D$ value over the gene-set. This, too, can be accomplished via the \texttt{GSNormalize} function. $D_{Si}$ provides a measure of the typical amount by which the sample in question affects $t$-statistics for genes in the band. We are looking for samples whose overall $D_{Si}$ values, across different gene-sets, are substantially larger than the overall baseline. Since the $1$-factor model splits the dataset into two roughly equal-sized groups, the leverage of all samples is roughly the same, and we expect this model's $D_{Si}$ values to convey a similar message to Fig. \ref{fig:hm1}. For that model, even the strongest outlier samples do not appear to be substantially away from the pack (Fig. \ref{fig:Cook}, top). The story is different for the $3$-factor model (Fig. \ref{fig:Cook}, bottom). There is at least one highly influential sample (relative to the baseline): sample \ID{28024}. This sample has become influential for the expanded model, because it belongs to a female subject with hyperdiploidy -- one of the smallest among the $8$ groups (by contrast, sample \ID{28001} belongs to the largest group, and hence its diminished influence on the $3$-factor model). Additionally, \ID{28024}'s expression levels were quite variable to begin with: its Studentized residuals have the second-highest IQR (see Fig. \ref{fig:resid1}). Again, if there is access to raw data in some form, it is a good idea to inspect them and see why this sample is so highly variable. Otherwise, a sensitivity analysis (repeating the regression without \ID{28024} and comparing the results) may be in order -- although in our particular case, the sample's overall influence does not appear to warrant taking this step. <>= cookie1=CooksDPerGene(lmPhen) cookie3=CooksDPerGene(lmExpand) BandCooks1=GSNormalize(cookie1,AmatChr3,fun2=identity) BandCooks3=GSNormalize(cookie3,AmatChr3,fun2=identity) BandCooks1Base=BandCooks1[ChrLeaves,] BandCooks3Base=BandCooks3[ChrLeaves,] layout(1:2) boxplot(sqrt(BandCooks1Base)~col(BandCooks1Base),names=sampleIDs,pch='+', cex=.2,las=3,cex.axis=.4, main="Influence by sample and Chromosome Band: 1-Factor Model", xlab="Sample ID",ylab="Cook's D (Band Root-Mean)",boxwex=.5, cex.main=0.8,cex.lab=0.7,col=5) boxplot(sqrt(BandCooks3Base)~col(BandCooks3Base), names=sampleIDs[!is.na(workingEset$kinet)],pch='+', cex=.2,las=3,cex.axis=.4, main="Influence by Sample and Chromosome Band: 3-Factor Model", xlab="Sample ID",ylab="Cook's D (Band Root-Mean)",boxwex=.5, cex.main=0.8,cex.lab=0.7,col=5,ylim=c(0,0.4)) @ \begin{figure} \centering \includegraphics[height=0.7\textheight,width=0.9\textwidth]{CooksDplot} % \includegraphics{CooksDplot} \caption{\label{fig:Cook} Chromosome-band root-mean Cook's $D$ values, summarized by sample, for the 1-factor (top) and 3-factor (bottom) models.} \end{figure} The $D_{Si}$ measure can also be used to check whether there are any problem gene-sets, i.e., chromosome bands influenced by an unduly large number of outlying samples. In our case, there are no such bands.\footnote{Graph not shown here; it can be easily produced by constructing a boxplot by rows rather than columns in the code above -- you'll also need to set \texttt{names=rownames(AmatChr3[ChrLeaves,])}.} The \Rpackage{GSEAlm} package also offers two other commonly encountered diagnostics: \texttt{dffitsPerGene} and \texttt{dfbetasPerGene}, each using a different measure of how individual samples affect estimation. According to recent theory \citep{LaMotte1999,Jensen2001}, all three influence diagnostics convey probabilistically equivalent information from the strict perspective of outlier identification. However, these functions are still useful for visual problem identification. \newpage %\bibliographystyle{natbib} \bibliography{cat} \section*{Session Information} \scriptsize{ <>= toLatex(sessionInfo()) @ } \end{document}