\documentclass[a4paper]{article} \usepackage[utf8]{inputenc} %\VignetteIndexEntry{UsersGuide} %\VignetteKeyword{GOexpress} \title{GOexpress: Visualise microarray and RNAseq data using gene ontology annotations} \author{Kévin Rue-Albrecht, Paul A. McGettigan, Belinda Hernández,\\ David A. Magee, Nicolas C. Nalpas, Andrew C. Parnell,\\ Stephen V. Gordon, and David E. MacHugh} \begin{document} \SweaveOpts{concordance=TRUE} \setkeys{Gin}{width=1\textwidth} <>= options(useFancyQuotes="UTF-8") @ \maketitle \tableofcontents \section[Introduction]{Introduction} \subsection[The origin and purpose of \texttt{GOexpress}]{The origin and purpose of \texttt{GOexpress}} The idea leading to the \texttt{GOexpress} R package emerged from a set of plotting functions I regularly copy-pasted across various complex multifactorial transcriptomics studies from both microarray and RNA-seq platforms. Those functions were repeatedly used to visualise the expression profile of genes across groups of samples, to annotate technical gene identifiers from both microarray and RNA-seq platforms (i.e. probesets, Ensembl gene identifiers) with their associated gene name, and to evaluate the clustering of samples based on genes participating in a common cellular function or location (i.e. gene ontology). While developing the \texttt{GOexpress} package and discussing its features with colleagues and potential users, a few more features were added, to enhance and complement the initial functions, leading to the present version of the package. Complex multifactorial experiments have become the norm in many research fields, thanks to the decrease in cost of high-throughput transcriptomics platforms and the barcoding/multiplexing of samples on the RNA-seq platform. While much effort has been (correctly!) spent on the development of adequate statistical frameworks for the processing of raw expression data, much of the genewise visualisation is left to the end-user. However, data summarisation and visualisation can be a daunting task in multifactorial experiments, or require large amounts of copy-pasting to investigate the expression profile of a handful or genes and cellular pathways. Tested on multiple RNA-seq and microarray datasets, \texttt{GOexpress} offers an extendable set of data-driven plotting functions readily applicable to the output of widely used analytic packages estimating (differential) gene expression. Once the initial analysis and filtering of \texttt{GOexpress} results is complete --- literaly two command lines ---, each gene and gene ontology is accessible by a single line of code to produce high-quality graphics. In short, the \texttt{GOexpress} package is a software package developed based on real experimental datasets to ease the visualisation and interpretation of multifactorial transcriptomics data by bioinformaticians and biologists, while striving to keep it a simple, fast, and intuitive toolkit. Notably, the use of the \texttt{biomaRt} package enables \texttt{GOexpress} to support and annotate gene expression matrix from any species and any microarray platform present in the Ensembl BioMart: \texttt{http://www.ensembl.org/biomart/martview}. \subsection[Purpose of this document]{Purpose of this document} This User's Guide was intended as a helpful description of the main features implemented in the \texttt{GOexpress} package, as well as a tutorial taking the user through a typical analysis pipeline that \texttt{GOexpress} was designed for. While an example usage will be provided for each function of the package, the many arguments of each function cannot realistically be demonstrated in this Guide, and we kindly ask users to also read the individual R vignettes accompanying the corresponding package functions for further details. \section[Before you start]{Before you start} \subsection[Installation]{Installation} Installing \texttt{GOexpress} should be as easy as running the two lines below: <>= source("http://bioconductor.org/biocLite.R") biocLite("GOexpress") @ Installation issues should be reported to the Bioconductor mailing list. \subsection[Getting help]{Getting help} The \texttt{GOexpress} package is still at an early stage of development and may require some fine-tuning or yet undetected bug fixes. Please contact the maintainer with a copy of the error message and the command run. Despite our efforts to repeatedly test the \texttt{GOexpress} package on in-house datasets of both microarray and RNA-seq platforms, and of human and bovine origin, many of the species and microarrays present in the Ensembl BioMart have not been tested yet. Do contact the package maintainer if you feel that the error is on our part: <>= maintainer("GOexpress") @ Interesting suggestions for additional package functions, or improvement of existing ones are most welcome and may be implemented when time allows. Alternatively, we also encourage users to fork the GitHub repository of the project, develop and test their own feature(s), and finally generate a pull request to integrate it to the original repository (\texttt{https://github.com/kevinrue/GOexpress}). As for all Bioconductor packages, the Bioconductor support site is the best place to seek advice with a large and active community of Bioconductor users. More detailed information is available at \texttt{http://www.bioconductor.org/help/support}. \subsection[Citing \texttt{GOexpress}]{Citing \texttt{GOexpress}} The work underlying \texttt{GOexpress} has not been formally published yet. A manuscript is in preparation and will be submitted to a bioinformatics journal in due course. In the meantime, users of the package are encouraged to cite the \texttt{GOexpress} package using the \texttt{citation} method of the \texttt{utils} package, as follows: <>= citation(package="GOexpress") @ \section[Quick start]{Quick start} \subsection[Input data]{Input data} Despite their different underlying technologies, microarray and RNA-seq analytic pipelines typically yield a matrix measuring the expression level of each gene in each sample. Commonly, this expression matrix will be filtered to retain only genes qualified as ``informative'' (e.g. > 1 cpm in at least N replicates; N being the number of replicates for a given set of experimental conditions); genes lowly expressed are removed to limit the False Discovery Rate (FDR) of differentially expressed genes induced by the larger variability of expression at the lower end of the dynamic range. \texttt{GOexpress} requires this prefiltered normalised expression matrix to be accompanied by an \texttt{AnnotatedDataFrame} object of the \texttt{Biobase} package providing phenotypic information for each of those samples (e.g. unique identifier, treatment, time-point). \texttt{GOexpress} expects those two variables in an \texttt{ExpressionSet} container of the \texttt{Biobase} package, both simplifying the manipulation of the data and, most importantly, ensuring interoperatibility with other packages that handle Bioconductor \texttt{ExpressionSet} objects. The other fields of the \texttt{ExpressionSet} container may be left empty as \texttt{GOexpress} does not currently access them. To use the analytic part of the \texttt{GOexpress} package, the phenotypic data-frame --- \texttt{phenodata} slot of the \texttt{ExpressionSet} --- must contain at least one column containing an experimental factor made of two or more levels in the strict meaning of ``factor'' and ``levels'' in the R programming language. The above \texttt{ExpressionSet} and the name of the column containing such a factor are the minimal two input variables required for the \texttt{GO\_analyse} method to work. Additional arguments may be required, in particular for microarray datasets, but those are discussed later in section \ref{sec:MicroarrayArguments}. In the examples below, we will use the sample dataset \texttt{AlvMac} provided with the package and made of a subset of 100 bovine Ensembl gene identifiers (rows) measured in 117 samples (columns). This sample data includes a data-frame detailing a number of phenotypic information fields describing each sample. Let us load those two variables in the workspace, and load the \texttt{GOexpress} packge as well: <<>>= data(AlvMac) # import the training dataset library(GOexpress) # load the GOexpress package @ Now, the expression matrix and phenotypic data of the \texttt{ExpressionSet} container can be accessed using dedicated methods from the \texttt{Biobase} package: <<>>= exprs(AlvMac)[1:5,1:5] # Subset of the expression data head(pData(AlvMac)) # Subset of the phenotypic information @ An advantage of the \texttt{ExpressionSet} container is that it takes care of the compatibility between the expression matrix and the phenotypic information data-frame. For instance, it will check that samples names do not differ between expression matrix and phenotypic information. However, importantly , it will not check one key requirement of \texttt{GOexpress}, namely that the expression matrix must have row names made of unique and supported feature identifiers (i.e. microarray probeset or Ensembl gene identifier) which will be later queried against the Ensembl BioMart server to fetch gene and gene ontology annotations. Users can visually inspect that adequate row names are used in the expression matrix: <<>>= head(rownames(exprs(AlvMac))) # Subset of gene identifiers @ Note that in the training dataset, the \texttt{Time} column of \texttt{pData(targets)} is an R factor while the \texttt{Timepoint} column is a numeric vector. The fomer is useful for grouping the samples for the analysis, while the latter is better suited to plot gene expression profiles while respecting the relative distance between the time-points. See section \ref{sec:expressionPlots} for examples using of a numeric value or a factor as the variable of the X-axis. \subsection[Main analysis]{Main analysis} In this example, we search for GO terms enriched in genes that cluster best the samples according their \texttt{Treatment} level. But first, let us make sure that the \texttt{Treatment} column of \texttt{pData(targets)} is indeed an R factor: <<>>= is.factor(AlvMac$Treatment) # assertion test AlvMac$Treatment # visual inspection @ In this case, it is already a properly formatted factor. If that was not the case, the following line of code would convert the column to an R factor and allow to continue the analysis (note that in some cases, it may be preferrable to order the different levels of a factor, for an example see factor \texttt{Time}): <<>>= AlvMac$Treatment <- factor(AlvMac$Treatment) @ Now, we use the random forest statistical framework to score each gene on its ability to cluster samples from different treatments separately. Then the method will score each GO term by aggregating the average rank of its associated genes: <<>>= AlvMac_results <- GO_analyse(eSet = AlvMac, f = "Treatment") @ At this stage, it is a good idea to save the result variable into an R data-file using the \texttt{save} function. Firstly, because the stochastic aspect of the sampling approach implemented by the \texttt{randomForest} package may return slightly different scores in each run (as opposed to the ANOVA scores). Secondly, because the BioMart web service may occasionally be temporarily down, disabling this essential step of the analysis. Note that the above command does not specify which species the data originated from. As mentioned in the output messages, the first feature identifier in the expression matrix was used to determine the corresponding species and type of data. This is a fairly straightforward process for Ensembl gene identifiers (e.g. in the prefix `ENSBTA', `BT' indicates \textit{Bos taurus}). \label{sec:MicroarrayArguments} However, it can be more difficult to identify the microarray used to obtain a certain dataset, as many Affymetrix chips contain probesets named with the pattern `AFFX.*'. In cases where the microarray platform cannot be detected automatically, we recommend users to use the \texttt{microarray} argument of the \texttt{GO\_analyse} method. The list of valid values for the microarray argument is available in the \texttt{microarray2dataset} data frame which can be loaded in the workspace using: <<>>= data(microarray2dataset) @ This step is the longest step in the pipeline, but merely takes a couple of minutes to analyse approximately 12,000 genes in 117 samples with default parameters of 1,000 trees build using approximately 220 random predictor genes per iteration on a standard Ubuntu 12.04 server. The output variable of the analysis summarises the parameters of the analysis and can easily be browsed with standard R methods: <<>>= names(AlvMac_results) # Data slot names str(AlvMac_results) # Details of data slots head(AlvMac_results$GO, n=5) # Ranked table of GO terms (subset) head(AlvMac_results$genes, n=5) # Ranked table of genes (subset) head(AlvMac_results$mapping) # Gene to gene ontology mapping table (subset) @ \subsection[Filtering of results]{Filtering of results} \label{sec:Filtering} In the above raw results of the analysis, all three types (i.e. namespaces) of GO terms are merged in a single ranked table. It is however possible to extract the GO terms corresponding to a single namespace. <<>>= BP <- subset_scores(result = AlvMac_results, namespace = "biological_process") MF <- subset_scores(result = AlvMac_results, namespace = "MF") # alias for "molecular_function" CC <- subset_scores(result = AlvMac_results, namespace = "CC") # cellular_component @ Importantly, an early-identified bias of the scoring function is that GO terms associated with fewer genes are favored at the top of the ranking table. This is due to the fact that it is much easier for a group of 5 genes (e.g. ``\texttt{B cell apoptotic process}'') to have an high average rank than it is for a group of ~6,000 genes (e.g. ``\texttt{protein binding}''). Indeed, the highest possible average rank of 5 genes is 3 while it is 3,000 for a group of 6,000 genes. However, in our experience, this bias has some advantages. First, it implicitely favors specific and well-defined GO terms (e.g. ``\texttt{negative regulation of T cell apoptotic process}'') as opposed to vague and uninformative GO terms (e.g. ``\texttt{cytoplasm}''). Secondly, we observed many top-ranking GO terms associated with a single gene. Those GO terms are consequently susceptible to single-gene events and artefacts in the expression data, as opposed to GO terms with a reasonable number of associated genes. Using the above filtering function, it is straightforward to filter out those GO terms with only a handful of associated genes: <<>>= BP.5 <- subset_scores(result = AlvMac_results, namespace = "biological_process", total = 5) # requires 5 or more associated genes MF.10 <- subset_scores(result = AlvMac_results, namespace = "molecular_function", total = 10) CC.15 <- subset_scores(result = AlvMac_results, namespace = "cellular_component", total = 15) @ On the other hand, GO terms associated with thousands of genes have a very limited sensitivity, as they require a proportionally large number of associated genes to cluster the sample groups in order to emerge among the top ranking scores. Furthermore, the inherent hierarchical structure and ``granularity'' of gene ontology terms can easily be browsed by using increasingly large values of the \texttt{total} filter. Note that this filter retains only GO terms associated with a minimal given count of genes in the Ensembl BioMart. It is also possible to use the \texttt{data} argument to filter for GO terms associated with a certain count of genes in the given expression dataset, although this approach is obviously more data-dependent and less robust. \subsection[Details of the top-ranking GO terms]{Details of the top-ranking GO terms} Once the GO terms are scored and ranked (and filtered), the top-ranking GO terms in the filtered object are those most enriched in genes with expression levels that best cluster the predefined groups of samples based on the levels of the factor considered (\texttt{raw\_results\$factor}). In this example, we list the top-10 filtered ``Biological Process'' GO terms extracted above and their statistics (automatically ranked by increasing average rank of their associated genes): <<>>= head(BP.5$GO, n=10) @ \subsection[Hierarchical clustering of samples based on gene expression associated with a GO term]{Hierarchical clustering of samples based on gene expression associated with a GO term} In the previous section, we identified the GO terms most enriched for genes clustering samples according to their treatment. We will now generate for the top-ranked GO term (``\texttt{toll-like receptor 4 signaling pathway}'') a heatmap to visualise simulatenously the clustering of samples and the expression level of each gene in each sample: \begin{center} <>= heatmap_GO(go_id = "GO:0034142", result = BP.5, eSet=AlvMac, cexRow=0.4, cexCol=1, cex.main=1, main.Lsplit=30) @ \end{center} Importantly, users should be aware that the highest ranked GO term may not cluster samples at all; all GO terms are scored in the process, and even if none of them return a suitable clustering, there will still be a higest score. In this example, we can observe a group of ``Control'' (i.e. untreated; green color) samples clustering together at the bottom of the heatmap. Re-labelling of samples by \texttt{Group} (i.e. combination of treatment and time-point) reveals that those samples are mainly 24 and 48 hours-post-infection control samples: \begin{center} <>= heatmap_GO(go_id = "GO:0034142", result = BP.5, eSet=AlvMac, f="Group", cexCol=1, cex.main=1, main.Lsplit=30) @ \end{center} Alternatively, it is possible to focus only on the hirearchical clustering of samples. The following code will build a dendrogram clustering samples using the expression data of the subset of genes associated with the ``\texttt{toll-like receptor 4 signaling pathway}'' gene ontology: \begin{center} <>= cluster_GO(go_id = "GO:0034142", result = BP.5, eSet=AlvMac, cex.main=1, cex=0.4, main.Lsplit=30) @ \end{center} Similarly to the \texttt{heatmap\_GO} method above, among the various arguments of the \texttt{cluster\_GO} function, the argument \texttt{f} can be used to specify an alternative column present in \texttt{pData(targets)} to label the samples in the dendrogram according to that phenotypic information. Obviously, if the same set of genes is being used, the clustering of samples will remain the same no matter the column chosen; however, the alternative labelling of samples may reveal relevant sub-clustering of samples. \begin{center} <>= cluster_GO(go_id = "GO:0034142", result = BP.5, eSet=AlvMac, f = "Time", cex.main=1, cex=0.4, main.Lsplit=30) @ \end{center} \subsection[Details of genes associated with a GO term]{Details of genes associated with a GO term} Following the identification of relevant GO terms in the above sections, users may want to have a closer look at the individual genes associated with a given GO term: <<>>= table_genes(go_id = "GO:0034142", result = BP.5) @ Note that the default behaviour of the above function is to return a table of all the genes associated with the GO term based on the annotations retrieved from the Ensembl BioMart. For obvious reasons, genes present in the BioMart but absent from the expression dataset will be absent from the score table and consequently lack annotations. It is possible to restrict the above table to only genes present in the expression dataset using the \texttt{data.only} argument. If only the feature identifiers associated with a given GO identifier are needed, users may use the function below: <<>>= list_genes(go_id = "GO:0034142", result = BP.5) @ \subsection[Expression profile of a gene by group]{Expression profile of a gene by group} \subsubsection[Using the unique feature identifier]{Using the unique feature identifier} \label{sec:expressionPlots} In the above section, we listed the genes associated with a particular gene ontology. In our example, the respective score and rank of each gene estimates the capacity of the gene to cluster the samples according to the treatment factor. The genes that best cluster the samples will have the highest scores and the lowest ranks. Those genes will likely produce the expression profiles with the most consistent differential expression between the treatment groups over time. Here is one example: \begin{center} <>= expression_plot(gene_id = "ENSBTAG00000047107", result = BP.5, eSet=AlvMac, x_var = "Timepoint", title.size=1.5, legend.title.size=10, legend.text.size=10, legend.key.size=15) @ \end{center} Note that "\texttt{Timepoint}" is another column of \texttt{pData(targets)}. That column does not encode a factor, but a numeric vector. This difference enables the plotting function to respect the relative distance between the time-points for an output more representative of the actual time-scale. To investigate the impact of other factors on the expression level of the same gene, users are encouraged to use the \texttt{f} and \texttt{x\_var} arguments to specify alternate grouping factor and X variable, respectively. Note that the \texttt{geom\_smooth} of the \texttt{ggplot2} package may fail if a minimal number of replicates is not available to calculate proper confidence intervals. Here is another valid example separating samples by animal on the X axis. \begin{center} <>= expression_plot(gene_id = "ENSBTAG00000047107", result = BP.5, eSet=AlvMac, x_var = "Animal", title.size=1.5, axis.text.angle=90, legend.title.size=10, legend.text.size=10, legend.key.size=15) @ \end{center} \subsubsection[Using the associated gene name]{Using the associated gene name} It is also possible to visualise the expression profile of genes from their associated gene names if any. This is a more human-friendly version of the function presented in the previous subsection: \begin{center} <>= expression_plot_symbol(gene_symbol = "BIKBA", result = BP.5, eSet=AlvMac, x_var = "Timepoint", title.size=1.5, legend.title.size=10, legend.text.size=10, legend.key.size=15) @ \end{center} However, the benefits of this feature are balanced by the fact that genes lacking an associated gene name cannot be visualised in this manner, and that some gene symbols are associated with multiple Ensembl gene identifiers and probesets (e.g. `RPL36A'). In the latter case, we turned the ambiguity into an additional useful feature: a lattice is created, and each of the multiple features associated with the given gene symbol are plotted simultaneously in the lattice. Subsequently, each of the sub-figures plotted may be re-plotted by itself using the \texttt{index} argument as indicated in the accompanying message printed in the R console. \subsection[Expression profile of a gene by individual sample series]{ Expression profile of a gene by individual sample series} \subsubsection[Using the unique feature identifier]{Using the unique feature identifier} It may be useful to track and visualise the expression profile of genes in each individual sample series, rather than their average. This could help identify outliers within sample groups, or visually compare paired samples, for instance. In the \texttt{AlvMac} dataset, samples from each of the animals were subjected to all three treatments in parallel (i.e. paired samples). In the figure below, a sample series is defined by a given \texttt{Animal} and a given \texttt{Treatment}. Each sample series is then tracked over time, and coloured according to the \texttt{Treatment} factor (default, factor stored in \texttt{raw\_results\$factor}): \begin{center} <>= AlvMac$Animal.Treatment <- paste(AlvMac$Animal, AlvMac$Treatment, sep="_") expression_profiles(gene_id = "ENSBTAG00000047107", result = AlvMac_results, eSet=AlvMac, x_var = "Timepoint", line.size=1, seriesF="Animal.Treatment", linetypeF="Animal", legend.title.size=10, legend.text.size=10, legend.key.size=15) @ \end{center} In the figure above, the \texttt{linetypeF} helps to highlight samples from an animal which start at unusually high expression values, while those samples progressively return to expression values similar to other samples in their respective treatment groups. If omitted, the \texttt{linetypeF} argument will mirror the \texttt{colourF}, which can be useful for colour-blind people. Alternatively, a single line-type can be applied to all groups using the \texttt{argument} as follows: \begin{center} <>= expression_profiles(gene_id = "ENSBTAG00000047107", result = AlvMac_results, eSet=AlvMac, x_var = "Timepoint", lty=rep(1,10), # use line-type 1 for all 10 groups seriesF="Animal.Treatment", linetypeF="Animal", legend.title.size=10, legend.text.size=10, legend.key.size=15, line.size=1) @ \end{center} \subsubsection[Using the associated gene name]{Using the associated gene name} Similarly to the \texttt{expression\_plot} method, an alternative was implemented to use gene names instead of Ensembl gene identifiers. An example: <>= expression_profiles_symbol(gene_symbol="TNIP3", result = AlvMac_results, x_var = "Timepoint", linetypeF="Animal", line.size=1, eSet=AlvMac, lty=rep(1,10), seriesF="Animal.Treatment", title.size=1.5, legend.title.size=10, legend.text.size=10, legend.key.size=15) @ \subsection[Comparison of univariate effects on gene expression]{Comparison of univariate effects on gene expression} While the analysis is restricted to the evaluation of a single factor, it can be helpful to compare the relative impact of all known factors present in the the accompanying \texttt{phenoData} on the gene expression in the different groups of samples. In other words, given a GO term identifier this feature will generate a plot for each associated gene, where the mean (default; can be changed) expression level will be computed for each level of each factor and compared to one another: \begin{center} <>= plot_design(go_id = "GO:0034134", result = BP.5, eSet=AlvMac, ask = FALSE, factors = c("Animal", "Treatment", "Time", "Group"), main.Lsplit=30) @ \end{center} \section[Additional controls and advanced functions]{Additional controls and advanced functions} \subsection[Subsetting an ExpressionSet to specific sample groups]{Subsetting an ExpressionSet to specific sample groups} It is straightforward to subset an \texttt{ExpressionSet} by extracting given columns (i.e. samples) and rows (i.e. gene features). Nevertheless, the \texttt{randomForest} package is quite sensitive to the definition of R factors; for instance, the \texttt{randomForest} package will crash if a factor is declared to have 3 levels (e.g. "A", "B", and "C"), while the \texttt{ExpressionSet} only contains samples for two of them (e.g. "A" and "B"). A simple fix is to update the known levels of the factor after having subsetted the \texttt{ExpressionSet}: <<>>= AlvMac$Time <- factor(AlvMac$Time) @ Note that this operation preservers the order of ordered factors. However, subsetting an \texttt{ExpressionSet} by rows and columns does not automatically update the known levels of each factor to the remaining levels. For this task, the method \texttt{subEset} takes a named list, where item names must be column names from the \texttt{phenoData} slots and item values must be present in the corresponding columns, to return a subset of the original \texttt{ExpressionSet} and return only the samples which match those values: <<>>= subEset(eSet=AlvMac, subset=list(Time=c("2H","6H","24H"), Treatment=c("CN","MB"))) @ \subsection[Overlapping genes between GO terms]{Overlapping genes between GO terms} Often, the top-ranked cellular functions identified by pathway analysis and gene-set enrichment analysis tools may significantly overlap each other in terms of genes, which limits the number of genes identified by those approaches and blurrs the precise cellular function represented by the core gene-set among a list of closely related yet different functions. Typically, overlapping sets are representated as Venn diagrams. The example below will produce a Venn diagram of the gene-sets associated with the five top-ranked ``Biological Process'' ontologies, and display it on screen, as the filename argument is left to the default \texttt{NULL} value (if a filename is given, the diagram will be printed to a TIFF file , no matter the extension specified): <>= overlap_GO(go_ids = head(BP.5$GO$go_id, n=5), result = BP.5, filename=NULL) @ This method calls the \texttt{venn.diagram} method of the \texttt{VennDiagram} package, offering a high level of control on the resulting Venn diagram. A detailed description of the available arguments are accessible in the corresponding R documentation file. \subsection[Distribution of scores]{Distribution of scores} Users might be interested in the general distribution of score and rank statistics produced by \texttt{GOexpress}. The distribution of scores may be represented as a histogram: \begin{center} <>= hist_scores(result = BP.5, labels = TRUE) @ \end{center} Alternatively, quantile values can be returned for default or customised percentiles: <<>>= quantiles_scores(result = BP.5) @ \subsection[Reordering by score]{Reordering by score} While scores are more prone to extreme outlier values and may slightly fluctuate between multiple runs of the random forest algorithm; ranks of genes and subsequently average ranks of GO terms are less prone to such variations and may be more reliable estimators of the importance of genes and cellular functions. Therefore, the default behaviour of \texttt{GOexpress} is to use the rank and average rank metrics to order genes and GO terms, respectively, in the returned score tables. It is however possible to re-order the tables in the output variable according to the score metric (or revert back to the original one) as in the example below: <<>>= BP.5.byScore <- rerank(result = BP.5, rank.by = "score") @ \section[Statistics]{Statistics} \subsection[Overview]{Overview} \texttt{GOexpress} was initially created from a set of gene-based, and later ontology-based, visualisation functions. Following the integration of those various plotting functions at the core of the \texttt{GOexpress} package, the need for a ranking of genes and GO terms soon became apparent in order to rapidly identify those that best cluster the samples according to the experimental factor studied, resulting in the clearest dendrograms and heatmaps. A two-fold procedure was implemented: \begin{enumerate} \item Using the available expression data, each gene present in the dataset is scored, evaluating its ability to cluster the predefined groups of samples. Genes are ranked according to their respective score; ties are resolved by assigning the lowest rank R to all G genes, giving the rank G+n to the next gene(s). The genewise scoring functions implemented are described in the following subsections, and all were validated on in-house datasets cross-checked with widely validated analytic pipelines (e.g. Ingenuity\textregistered Pathway Analysis, SIGORA) \item Using the above ranks and scores, each GO term in the Ensembl BioMart is assigned the mean score and the mean rank of all the genes associated with it in the BioMart. Genes present in the BioMart but absent from the given dataset are assigned a score of 0 (minimum valid score; indicates no power to discriminates the predefined groups of sample) and a rank equal to the number of genes present in the entire dataset plus one (worst rank, while preserving discrete continuity of the ranking). \end{enumerate} Importantly, the statistics performed to rank GO terms and genes do \emph{not} influence the behaviour of the subsequent plotting functions; heatmaps, dendrograms and gene expression profiles are purely driven by the expression data, sample phenotype annotations provided and GO terms ontologies imported from the Ensembl BioMart server, without any transformation applied to the data. Therefore, users are encouraged to use and suggest alternative relevant scoring and ranking strategies, which could prioritise GO terms and genes in different ways. A current acknowledged bias is the higher scoring of GO terms associated with fewer genes, which shows some advantages as discussed in section \ref{sec:Filtering}. \subsection[Random Forest]{Random Forest} We implemented the Random Forest framework to answer the question: ``How well does each gene in the dataset cluster predefined groups of samples?''. The random forest consists of multiple decision trees. Each tree is built based on a bootstrap sample (sample with replacement) of observations and a random sample of variables. The \texttt{randomForest} package first calculates the Gini index (Breiman et al, 1984) for each node in each tree. The Gini index is a measure of homogeneity from 0 (homogeneous) to 1 (heterogeneous). The decrease in the Gini index resulting from a split on a variable is then calculated for each node and averaged for each variable over all the trees in the model. The variable with the biggest mean decrease in the Gini index is then considered the most important. Technically, \texttt{GOexpress} extracts the \texttt{MeanDecreaseGini} value from the \texttt{importance} slot of the \texttt{randomForest} output and uses this value as the score for each gene. A key feature of the Random Forest framework is the implicit handling of interactions between genes, given a sufficient number of trees generated and genes sampled. Indeed, at each node in the decision tree, genes are sampled from the expression data and tested for their individual capacity to improve the partitioning reached in the previous node. The larger the number of trees and genes sampled, the more complete the coverage of interactions will be. \subsection[One-way Analysis of Variance (ANOVA)]{One-way Analysis of Variance (ANOVA)} We implemented the ANOVA statistical framework to answer the question: "How different is the expression level of each gene in the dataset in the different groups of samples?". Given the expression level of a gene in all the samples, the one-way ANOVA determines the ratio of the variance between the groups compared to the variance within the groups, summarised as an F statistic that \texttt{GOexpress} uses as a score for each gene. Simply put, if samples from the same groups show gene expression values similar to each other while samples from different groups show different levels of expression, those genes will produce a higher score. This score cannot be less than 0 (variance between groups insignificant to variance within groups), while very large ratios can easily be reached for genes markedly different between groups. Contrary to the random forest framework, the Analysis of Variance makes important assumptions on the data: namely, the independence of observations, the normality of residuals, and the equality of variances in all groups. While the former two are the responsibility of the user to verify, the latter is taken care of by \texttt{GOexpress}. Indeed, the \texttt{oneway.test} method of the package \texttt{stats} is used with parameter \texttt{var.equal} set to to \texttt{FALSE}. While this reduces the sensitivity of the test, all genes are affected by this correction based on the relative amount of variance in the different predefined groups of samples. Finally, it is once again important to note that the ANOVA only evaluates univariate changes, while the random forest framework implicitely allows for interactions between genes. \section[Notes]{Notes} \subsection[Authors' contributions]{Authors' contributions} Conception and development of the \texttt{GOexpress} package was carried out by KR-A with contributions by PAM, under the supervision of SVG and DEM. Experimental data used for testing was generated and analysed with the help of DAM and NC. Integration of the random forest statistical frameworks was advised by BH and AP. This User's Guide was prepared by KR-A, and edited by PAM, BH, DAM, NCN, AP, SVG, and DEM. \subsection[Acknowledgement]{Acknowledgments} Special thanks to Dr. Kate Killick for testing and feedback. My thanks to the Bioconductor staff and in particular to Hervé Pagès for the helpful feedback which improved the standards of the code and documentation. Last but not least, thanks to the University College Dublin "OpenSequencing" and the "Virtual Institute of Bioinformatics and Evolution" (VIBE) communities where I first presented raw versions of \texttt{GOexpress} and got valuable feedback and advice in return, underlying a significant number of updated and new features. \end{document}