% \VignetteIndexEntry{The Iterative Signature Algorithm for Gene Expression Data} \documentclass{article} \usepackage{ragged2e} \usepackage{url} \newcommand{\Rfunction}[1]{\texttt{#1()}} \newcommand{\Rpackage}[1]{\texttt{#1}} \newcommand{\Rclass}[1]{\texttt{#1}} \newcommand{\Rargument}[1]{\textsl{#1}} \newcommand{\filename}[1]{\texttt{#1}} \newcommand{\variable}[1]{\texttt{#1}} \SweaveOpts{cache=TRUE} \SweaveOpts{prefix.string=plot} \setlength{\parindent}{2em} \begin{document} \title{The Iterative Signature Algorithm for Gene Expression Data} \author{G\'abor Cs\'ardi} \maketitle \tableofcontents \RaggedRight <>= options(width=60) options(continue=" ") try(X11.options(type="xlib"), silent=TRUE) @ \section{Introduction} The Iterative Signature Algorithm (ISA)~\cite{sa,isa,isamod} is a biclustering method. The input of a biclustering method is a matrix and the output is a set of biclusters that fulfill some criteria. A bicluster is a block of the potentially reordered input matrix. Most commonly, biclustering algorithms are used on microarray expression data, to find gene sets that are coexpressed across a subset of the original samples. In the ISA papers the biclusters are called transription modules (TM), we will often refer them under this name in the following. This tutorial specifically deals with the modular analysis of gene expression data. Section~\ref{sec:isa} gives a short summary of how ISA works. If you need more information of the underlying math or want to apply it to other data, then please see the referenced papers, the vignette titled ``The Iterative Signature Algorithm'' in the \Rpackage{isa2} R package, or the ISA homepage at \url{http://www.unil.ch/cbg/homepage/software.html}. \section{Preparing the data} \subsection{Loading the data} First, we load the required packages and the data to analyze. ISA is implemented in the \Rpackage{eisa} and \Rpackage{isa2} packages, see Section~\ref{sec:isapackages} for a more elaborated summary about the two packages. It is enough to load the \Rpackage{eisa} package, \Rpackage{isa2} and other required packages are loaded automatically: <>= library(eisa) @ In this tutorial we will use the data in the \Rpackage{ALL} package. <>= library(ALL) library(hgu95av2.db) data(ALL) @ This is a data set from a clinical trial in acute lymphoblastic leukemia and it contains \Sexpr{ncol(ALL)} samples altogether. \section{Simple ISA runs} The simplest way to run ISA is to choose the two threshold parameters and then call the \Rfunction{ISA} function on the \Rclass{ExpressionSet} object. The threshold parameters tune the size of the modules, less stringent (i.e. smaller) values result bigger, less correlated modules. The optimal values depend on your data and some experimentation is needed to determine them. Since running ISA might take a couple of minutes and the results depend on the random number generator used, the ISA run is commented out from the next code block, and we just load a precomputed set of modules that is distributed with the \Rpackage{eisa} package. <>= thr.gene <- 2.7 thr.cond <- 1.4 set.seed(1) # to get the same results, always # modules <- ISA(ALL, thr.gene=thr.gene, thr.cond=thr.cond) data(ALLModulesSmall) modules <- ALLModulesSmall @ This first applies a non-specific filter to the data set and then runs ISA from \Sexpr{formals(eisa:::ISA)$no.seeds} % $ random seeds (the default). See Section~\ref{sec:detailed-isa} if the default parameters are not appropriate for you and need more control. \section{Inspect the result} The \Rfunction{ISA} function returns an \Rclass{ISAModules} object. By typing in its name we can get a brief summary of the results: <>= modules @ There are various other \Rclass{ISAModules} methods that help to access the modules themselves and the ISA parameters that were used for the run. Calling \Rfunction{length} on \variable{modules} returns the number of ISA modules in the set, \Rfunction{dim} gives the dimension of the input expression matrix: the number of features (after the filtering) and the number of samples: <>= length(modules) dim(modules) @ Functions \Rfunction{featureNames} and \Rfunction{sampleNames} return the names of the features and samples, just like the functions with the same name for an \Rclass{ExpressionSet}: <>= featureNames(modules)[1:5] sampleNames(modules)[1:5] @ The \Rfunction{getNoFeatures} function returns a numeric vector, the number of features (probesets in our case) in each module. Similarly, \Rfunction{getNoSamples} returns a numeric vector, the number of samples in each module. \Rfunction{pData} returns the phenotype data of the expression set as a data frame. The \Rfunction{getOrganism} function returns the scientific name of the organism under study, \Rfunction{annotation} the name of the chip. For the former the appropriate annotation package must be installed. <>= getNoFeatures(modules) getNoSamples(modules) colnames(pData(modules)) getOrganism(modules) annotation(modules) @ The double bracket indexing operator (`\verb+[[+') can be used to select some modules from the complete set, the result is another, smaller \Rclass{ISAModules} object. The following selects the first five modules. <>= modules[[1:5]] @ The single bracket indexing operator can be used to restrict an \Rclass{ISAModules} object to a subset of features and/or samples. E.g. selecting all features that map to a gene on chromosome~1 can be done with <>= chr <- get(paste(annotation(modules), sep="", "CHR")) chr1features <- sapply(mget(featureNames(modules), chr), function(x) "1" %in% x) modules[chr1features,] @ Similarly, selecting all B-cell samples can be performed with <>= modules[ ,grep("^B", pData(modules)$BT)] @ \Rfunction{getFeatureNames} lists the probesets (more precisely, the feature names coming from the \Rclass{ExpressionSet} object) in the modules. It returns a list, here we just print the first entry. <>= getFeatureNames(modules)[[1]] @ The \Rfunction{getSampleNames} function does the same for the samples. Again, the sample names are taken from the \Rclass{ExpressionSet} object that was passed to \Rfunction{ISA}: <>= getSampleNames(modules)[[1]] @ ISA biclustering is not binary, every feature (and similarly, every sample) has a score between -1 and 1; the further the score is from zero the stronger the association between the feature (or sample) and the module. If two features both have scores with the same sign, then they are correlated, if the sign of their scores are opposite, then they are anti-correlated. You can query the scores of the features with the \Rfunction{getFeatureScores} function, and similarly, the \Rfunction{getSampleScores} function queries the sample scores. You can supply the modules you want to query as an optional argument: <>= getFeatureScores(modules, 3) getSampleScores(modules, 3) @ You can also query the scores in a matrix form, that is probably better if you need many or all of them at the same time. The \Rfunction{getFeatureMatrix} and \Rfunction{getSampleMatrix} functions are defined for this. The probes/samples that are not included in a module will have a zero score by definition. <>= dim(getFeatureMatrix(modules)) dim(getSampleMatrix(modules)) @ Objects from the \Rclass{ISAModules} class store various information about the ISA run and the convergence of the seeds. Information associated with the individual seeds can be queried with the \Rfunction{seedData} function, it returns a data frame, with as many rows as the number of seeds and various seed-level information, e.g. the number of iterations required for the seed to converge. See the manual page of \Rfunction{ISA} for details. <>= seedData(modules) @ The \Rfunction{runData} function returns additional information about the ISA run, see the \Rfunction{ISA} manual page for details. <>= runData(modules) @ \section{Enrichment calculations} The \Rpackage{eisa} package provides some functions to perform enrichment tests for the gene sets corresponding to the ISA modules against various databases. These tests are usually simplified and less sophisticated versions than the ones in the \Rpackage{Category}, \Rpackage{GOstats} or \Rpackage{topGO} packages, but they are much faster and this is important if we need to perform them for many modules. \subsection{Gene Ontology} To perform enrichment analysis against the Gene Ontology database, all you have to do is to supply your \Rclass{ISAModules} object to the \Rfunction{ISAGO} function. <>= GO <- ISAGO(modules) @ The \Rfunction{ISAGO} function requires the annotation package of the chip, e.g. for the ALL data, the \Rpackage{hgu95av2.db} package is required. The \variable{GO} object is a list with three elements, these correspond to the GO ontologies, they are: biological function, cellular component and molecular function, in this order. <>= GO @ We can see the number of categories tested, this is different for each ontology, as they have different number of terms. The gene universe size is also different, because it contains only genes that have at least one annotation in the given category. For extracting the results themselves, the \Rfunction{summary} function can be used, this converts them to a simple data frame. A $p$-value limit can be supplied to \Rfunction{summary}. Note, that since \Rfunction{ISAGO} calculates enrichment for many gene sets (i.e. for all biclusters), \Rfunction{summary} returns a list of data frames, one for each bicluster. A table for the first module: <>= summary(GO$BP, p=0.001)[[1]][,-6] @ We omitted the sixth column of the result, because it is very wide and would look bad in this vignette. This column is called \variable{drive} and lists the Entrez IDs of the genes that are in the intersection of the bicluster and the GO category; or in other words, the genes that drive the enrichment. These genes can also be obtained with the \Rfunction{geneIdsByCategory} function. The following returns the genes in the first module and the third GO BP category. (The GO categories are ordered according to the enrichment $p$-values, just like in the output of \Rfunction{summary}.) <>= geneIdsByCategory(GO$BP)[[1]][[3]] @ You can use the \Rpackage{GO.db} package to obtain more information about the enriched GO categories. <>= sigCategories(GO$BP)[[1]] library(GO.db) mget(na.omit(sigCategories(GO$BP)[[1]][1:3]), GOTERM) @ In addition, the following functions are implemented to work on the objects returned by \Rfunction{ISAGO}: \Rfunction{htmlReport}, \Rfunction{pvalues}, \Rfunction{geneCounts}, \Rfunction{oddsRatios}, \Rfunction{expectedCounts}, \Rfunction{universeCounts}, \Rfunction{universeMappedCount}, \Rfunction{geneMappedCount}, \Rfunction{geneIdUniverse}. These functions do essentially the same as they counterparts for \Rclass{GOHyperGResult} objects, see the documentation of the \Rpackage{GOstats} package. The only difference is, that since here we are testing a list of gene sets (=biclusters), they calculate the results for all gene sets and usually return lists. \subsubsection{Multiple testing correction} By default, the \Rfunction{ISAGO} function performs multiple testing correction using the Holm method, this can be changed via the \texttt{correction} and \texttt{correction.method} arguments. See the manual page of the \Rfunction{ISAGO} function for details, and also the \Rfunction{p.adjust} function for the possible multiple testing correction schemes. %\subsection{Transcription regulation from the DBD database} %TODO \section{How ISA works}% \label{sec:isa} \subsection{ISA iteration} ISA works in an iterative way. For an $E (m\times n)$ input matrix it starts from a seed vector $r_0$, which is typically a sparse 0/1 vector of length $m$. The non-zero elements in the seed vector define a set of genes in $E$. Then the transposed of $E$, $E'$ is multiplied by $r_0$ and the result is thresholded. The thresholding is an important step of the ISA, without thresholding ISA would be equivalent to a (not too effective) numerical singular value decomposition (SVD) algorithm. Currently thresholding is done by calculating the mean and standard deviation of the vector and keeping only elements that are further than a given number of standard deviations from the mean. Using the ``direction'' parameter, one can keep values that are (a) significantly higher (``up''); (b) lower (``down'') than the mean; or (c) both (``updown''). The thresholded vector $c_0$ is the (sample) \emph{signature} of $r_0$. Then the (gene) signature of $c_0$ is calculated, $E$ is multiplied by $c_0$ and then thresholded to get $r_1$. This iteration is performed until it converges, i.e. $r_{i-1}$ and $r_i$ are \emph{close}, and $c_{i-1}$ and $c_i$ are also close. The convergence criteria, i.e. what \emph{close} means, is by default defined by high Pearson correlation. It is very possible that the ISA finds the same module more than once; two or more seeds might converge to the same module. The function \Rfunction{ISAUnique} eliminates every module from the result of \Rfunction{ISAIterate} that is very similar (in terms of Pearson correlation) to the one that was already found before. It might be also apparent from the description of ISA, that the biclusters are soft, i.e. they might have an overlap in their genes, samples, or both. It is also possible that some genes and/or samples of the input matrix are not found to be part of any ISA biclusters. Depending on the stringency parameters in the thresholding (i.e. how far the values should be from the mean), it might even happen that ISA does not find any biclusters. \subsection{Parameters} The two main parameters of ISA are the two thresholds (one for the genes and one for the samples). They basically define the stringency of the modules. If the gene threshold is high, then the modules will have very similar genes. If it is mild, then modules will be bigger, with less similar genes than in the first case. The same applies to the sample threshold and the samples of the modules. \subsection{Random seeding and smart seeding} By default (i.e. if the \Rfunction{ISA} function is used) the ISA is performed from random sparse starting seeds, generated by the \Rfunction{generate.seeds} function. This way the algorithm is completely unsupervised, but also stochastic: it might give different results for different runs. It is possible to use non-random seeds as well. If you have some knowledge about the data or are interested in a particular subset of genes/samples, then you can feed in your seeds into the \Rfunction{ISAIterate} function directly. In this case the algorithm is deterministic, for the same seed you will always get the same results. Using smart (i.e. non-random) seeds can be considered as a semi-supervised approach. We show an example of using smart seeds in Section~\ref{sec:detailed-isa}. \subsection{Normalization} Using in silico data we observed that ISA has the best performance if the input matrix is normalized (see \Rfunction{ISANormalize}). The normalization produces two matrices: $E_r$ and $E_c$. $E_r$ is calculated by transposing $E$ and centering and scaling its expression values for each sample (see the \Rfunction{scale} R function). $E_c$ is calculated by centering and scaling the genes of $E$. $E_r$ is used to calculate the sample signature of genes and $E_c$ is used to calculate the gene signature of the samples. It is possible to use another normalization, or not to use normalization at all; the user has to construct an \Rclass{ISAExpressionSet} object containing the three matrices corresponding to the raw data, the gene-wise normalized data and the sample-wise normalized data. This object can be passed to the \Rfunction{ISAIterate} function. The matrices are not required to be different, the user can supply the raw data matrix three times, if desired. \subsection{Gene and sample scores} In addition to finding biclusters in the input matrix, the ISA also assigns scores to the genes and samples, separately for each module. The scores are between minus one and one and they are by definition zero for the genes/samples that are not included in the module. For the non-zero entries, the further the score of a gene/samples is from zero, the stronger the association between the gene/sample and the module. If the signs of two genes/samples are the same, then they are correlated, if they have opposite signs, then they are anti-correlated. \section{Bicluster coherence and robustness measures} \subsection{Coherence} Madeira and Oliviera\cite{madeira04} define various coherence scores for biclusters, these measure how well the rows and or columns are correlated. It is possible to use these measures for ISA as well, after converting the output of ISA to a \Rclass{biclust} object. We use the \texttt{Bc} object that was created in Section~\ref{sec:biclust}. Here are the measures for the first bicluster: <>= library(biclust) Bc <- as(modules, "Biclust") constantVariance(exprs(ALL), Bc, number=1) additiveVariance(exprs(ALL), Bc, number=1) multiplicativeVariance(exprs(ALL), Bc, number=1) signVariance(exprs(ALL), Bc, number=1) @ You can use \Rfunction{sapply} to perform the calculation for many or all modules, e.g. for this data set `constant variance' and `additive variance' are not the same: <>= cv <- sapply(seq_len(Bc@Number), function(x) constantVariance(exprs(ALL), Bc, number=x)) av <- sapply(seq_len(Bc@Number), function(x) additiveVariance(exprs(ALL), Bc, number=x)) cor(av, cv) @ Please see the manual pages of these functions and the paper cited above for more details. \subsection{Robustness} The \Rpackage{eisa} package uses a measure that is related to coherence; it is called robustness. Robustness is a generalization of the singular value of a matrix. If there were no thresholding during the ISA iteration, then ISA would be equivalent to a numerical method for singular value decomposition and robustness would be the same the principal singular value of the input matrix. If the \Rfunction{ISA} function was used to find the transcription modules, then the robustness measure is used automatically to filter the results. This is done by first scrambling the input matrix and then running ISA on it. As ISA is an unsupervised algorithm it usually finds some (although less and smaller) modules even in such a scrambled data set. Then the robustness scores are calculated for the proper and the scrambled modules and only (proper) modules that have a higher score than the highest scrambled module are kept. The robustness scores are stored in the seed data during this process, so you can check them later: <>= seedData(modules)$rob @ \section{The \Rpackage{isa2} and \Rpackage{eisa} packages}% \label{sec:isapackages} ISA and its companion functions for visualization, functional enrichment calculation, etc. are distributed in two separate R packages, \Rpackage{isa2} and \Rpackage{eisa}. \Rpackage{isa2} contains the implementation of ISA itself, and \Rpackage{eisa} specifically deals with supplying expression data to \Rpackage{isa2} and visualizing the results. If you analyze gene expression data, then we suggest using the interface provided in the \Rpackage{eisa} package. For other data, use the \Rpackage{isa2} package directly. \section{Finer control over ISA parameters}% \label{sec:detailed-isa} The \Rfunction{ISA} function takes care of all steps performed in a modular study, and for each step it uses parameters, that work reasonably well. In some cases, however, one wants to access these steps individually, to use custom parameters instead of the defaults. In this section, we will still use the acute lymphoblastic leukemia gene expression data from the \Rpackage{ALL} package. \subsection{Non-specific filtering} The first step of the analysis typically involves non-specific filtering of the probesets. The aim is to eliminate the probesets that do not show variation across the samples, as they only contribute noise to the data. By default (i.e. if the \Rfunction{ISA} function is called) this is performed using the \Rpackage{genefilter} package, and the default filter is based on the inter-quantile ratio of the probesets' expression values, a robust measure of variance. If other filters are desired, then these can be implemented by using the functions of the \Rpackage{genefilter} package directly. Possible filtering techniques include using the AffyMetrix present/absent calls produced by the \Rfunction{mas5calls} function of the \Rpackage{affy} package, but this requires the raw data, so in this vignette we use a simple method based on variance and minimum expression value: only probesets that have a variance of at least \variable{varLimit} and that have at least \variable{kLimit} samples with expression values over \variable{ALimit} are kept. <>= library(genefilter) varLimit <- 0.5 kLimit <- 4 ALimit <- 5 flist <- filterfun(function(x) var(x)>varLimit, kOverA(kLimit,ALimit)) ALL.filt <- ALL[genefilter(ALL, flist), ] @ The original expression set had \Sexpr{nrow(ALL)} features, the filtered one has only \Sexpr{nrow(ALL.filt)}. \subsection{Entrez Id matching} In this step we match the probesets to Entrez identifiers and remove the ones that don't map to any Entrez gene. <>= ann <- annotation(ALL.filt) library(paste(ann, sep=".", "db"), character.only=TRUE) ENTREZ <- get( paste(ann, sep="", "ENTREZID") ) EntrezIds <- mget(featureNames(ALL.filt), ENTREZ) keep <- sapply(EntrezIds, function(x) length(x) >= 1 && !is.na(x)) ALL.filt.2 <- ALL.filt[keep,] @ To reduce ambiguity in the interpretation of the results, we might also want to keep only single probeset for each Entrez gene. The following code snipplet keeps the probeset with the highest variance. <>= vari <- apply(exprs(ALL.filt.2), 1, var) larg <- findLargest(featureNames(ALL.filt.2), vari, data=annotation(ALL.filt.2)) ALL.filt.3 <- ALL.filt.2[larg,] @ \subsection{Normalizing the data} The ISA works best, if the expression matrix is scaled and centered. In fact, the two sub-steps of an ISA step require expression matrices that are normalized differently. The \Rfunction{ISANormalize} function can be used to calculate the normalized expression matrices; it returns an \Rclass{ISAExpressionSet} object. This is a subclass of \Rclass{ExpressionSet}, and contains three expression matrices: the original raw expression, the row-wise (=gene-wise) normalized and the column-wise (=sample-wise) normalized expression matrix. The normalized expression matrices can be queried with the \Rfunction{featExprs} and \Rfunction{sampExprs} functions. <>= ALL.normed <- ISANormalize(ALL.filt.3) ls(assayData(ALL.normed)) dim(featExprs(ALL.normed)) dim(sampExprs(ALL.normed)) @ \subsection{Generating starting seeds for the ISA} The ISA is an iterative algorithm that starts with a set of input seeds. An input seed is basically a set of probesets and the ISA stepwise refines this set by 1) including other probesets in the set that are coexpressed with the input probesets and 2) removing probesets from it that are not coexpressed with the rest of the input set. The \Rfunction{generate.seeds} function generates a set of random seeds (i.e. a set of random gene sets). See its documentation if you need to change the sparsity of the seeds. <>= set.seed(3) random.seeds <- generate.seeds(length=nrow(ALL.normed), count=100) @ In addition to random seeds, it is possible to start the ISA iteration from ``educated'' seeds, i.e. gene sets the user is interested in, or a set of samples that are supposed to have coexpressed genes. We create another set of starting seeds here, based on the type of acute lymphoblastic leukemia: ``B'', ``B1'', ``B2'', ``B3'', ``B4'' or ``T'', ``T1'', ``T2'', ``T3'' and ``T4''. <>= type <- as.character(pData(ALL.normed)$BT) ss1 <- ifelse(grepl("^B", type), -1, 1) ss2 <- ifelse(grepl("^B1", type), 1, 0) ss3 <- ifelse(grepl("^B2", type), 1, 0) ss4 <- ifelse(grepl("^B3", type), 1, 0) ss5 <- ifelse(grepl("^B4", type), 1, 0) ss6 <- ifelse(grepl("^T1", type), 1, 0) ss7 <- ifelse(grepl("^T2", type), 1, 0) ss8 <- ifelse(grepl("^T3", type), 1, 0) ss9 <- ifelse(grepl("^T4", type), 1, 0) smart.seeds <- cbind(ss1, ss2, ss3, ss4, ss5, ss6, ss7, ss8, ss9) @ The \variable{ss1} seed includes all samples, but their sign is opposite for B-cell leukemia samples and T-cell samples. This way ISA is looking for sets of genes that are differently regulated in these two groups of samples. \variable{ss2} contains only B1 type samples, so here we look for genes that are specific to this variant of the disease. The other seeds are similar, for the other subtypes. \subsection{Performing the ISA iteration} We perform the ISA iterations for our two sets of seeds separately. The two threshold parameters we use here were chosen after some experimentation; these result modules of the ``right'' size. <>= modules1 <- ISAIterate(ALL.normed, feature.seeds=random.seeds, thr.feat=1.5, thr.samp=1.8, convergence="cor") modules2 <- ISAIterate(ALL.normed, sample.seeds=smart.seeds, thr.feat=1.5, thr.samp=1.8, convergence="cor") @ \subsection{Dropping non-unique modules} \Rfunction{ISAIterate} returns the same number of modules as the number of input seeds; these are, however, not always meaningful, the input seeds can converge to an all-zero vector, or occasionally they may not converge at all. It is also possible that two or more input seeds converge to the same module. The \Rfunction{ISAUnique} function eliminates the all-zero or non-convergent input seeds and keeps only one instance of the duplicated ones. <>= modules1.unique <- ISAUnique(ALL.normed, modules1) modules2.unique <- ISAUnique(ALL.normed, modules2) length(modules1.unique) length(modules2.unique) @ \Sexpr{length(modules1.unique)} modules were kept for the first set of seeds and \Sexpr{length(modules2.unique)} for the second set. \subsection{Dropping non-robust modules} The \Rfunction{ISAFilterRobust} function filters a set of modules by running ISA with the same parameters on the scrambled data set and then calculating a robustness score, both for the real modules and the ones from the scrambled data. The highest robustness score obtained from the scrambled data is used as a threshold to filter the real modules. <>= modules1.robust <- ISAFilterRobust(ALL.normed, modules1.unique) modules2.robust <- ISAFilterRobust(ALL.normed, modules2.unique) length(modules1.robust) length(modules2.robust) @ We still have \Sexpr{length(modules1.robust)} modules for the first set of seeds and \Sexpr{length(modules2.unique)} for the second set. \section{More information} For more information about the ISA, please see the references below. The ISA homepage at \url{http://www.unil.ch/cbg/homepage/software.html} has example data sets, and all ISA related tutorials and papers. \section{Session information} The version number of R and packages loaded for generating this vignette were: <>= toLatex(sessionInfo()) @ \bibliographystyle{apalike} \bibliography{EISA} \end{document}