% \VignetteEngine{knitr} % \VignetteIndexEntry{An introduction to clusterProfiler} % \VignettePackage{clusterProfiler}} % To compile this document, run the commands within R % require(cacheSweave); Sweave('clusterProfiler.Rnw', driver=cacheSweaveDriver()); tools::texi2dvi("clusterProfiler.tex", pdf=TRUE) \documentclass[12pt]{article} \usepackage[a4paper,left=3cm,top=2cm,bottom=3cm,right=3cm,ignoreheadfoot]{geometry} \pagestyle{empty} \usepackage[usenames,dvipsnames]{xcolor} \usepackage{helvet} \usepackage[titletoc]{appendix} \usepackage{tocloft} \setlength{\parindent}{0em} \setlength{\parskip}{.5em} \renewcommand{\familydefault}{\sfdefault} \newcommand{\Rcode}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\small\texttt{#1}}} \newcommand{\Rfunction}[1]{\Robject{#1}} \newcommand{\Rpackage}[1]{\textit{#1}} \newcommand{\Rparameter}[1]{\textit{#1}} \newcommand{\Rclass}[1]{\textit{#1}} \newcommand{\R}{\textsf{R}} \usepackage[ bookmarks,% colorlinks,% linktoc=section,% linkcolor=RedViolet,% citecolor=RedViolet,% urlcolor=RedViolet,% linkbordercolor={1 1 1},% citebordercolor={1 1 1},% urlbordercolor={1 1 1},% raiselinks,% plainpages,% pdftex ]{hyperref} \usepackage{cite} \renewcommand{\floatpagefraction}{0.9} \usepackage{sectsty} \sectionfont{\sffamily\bfseries\color{RoyalBlue}\sectionrule{0pt}{0pt}{-1ex}{1pt}} \subsectionfont{\sffamily\bfseries\color{RoyalBlue}} \subsubsectionfont{\sffamily\bfseries\color{RoyalBlue}} \usepackage{fancyhdr} \pagestyle{fancy} \fancyhead{} \fancyfoot{} \lfoot{}\cfoot{}\rfoot{} \renewcommand{\headrule}{} \usepackage{graphicx} %\usepackage{xstring} <>= library(DOSE) library(GO.db) library(org.Hs.eg.db) library(pathview) library(clusterProfiler) library(knitr) opts_chunk$set(tidy=TRUE,tidy.opts=list(keep.blank.line=FALSE, width.cutoff=50),out.truncate=80,out.lines=6,cache=TRUE,dev='pdf',include=TRUE,fig.width=6,fig.height=6,resolution=150) @ \newcommand{\fixme}[1]{{\textbf{Fixme:} \textit{\textcolor{blue}{#1}}}} \title{\textsf{\textbf{Using \Rpackage{clusterProfiler} to identify and compare functional profiles of gene lists}}} \author{Guangchuang Yu \\ School of Biological Sciences \\ The University of Hong Kong, Hong Kong SAR, China \\ email: \texttt{guangchuangyu@gmail.com}} \begin{document} \maketitle <>= options(digits=3, width=80, prompt=" ", continue=" ") @ \tableofcontents \section{Introduction} In recently years, high-throughput experimental techniques such as microarray, RNA-Seq and mass spectrometry can detect cellular moleculars at systems-level. These kinds of analyses generate huge quantitaties of data, which need to be given a biological interpretation. A commonly used approach is via clustering in the gene dimension for grouping different genes based on their similarities \cite{yu2010}. To search for shared functions among genes, a common way is to incorporate the biological knowledge, such as Gene Ontology (GO) and Kyoto Encyclopedia of genes and Genomes (KEGG), for identifying predominant biological themes of a collection of genes. After clustering analysis, researchers not only want to determine whether there is a common theme of a particular gene cluster, but also to compare the biological themes among gene clusters. The manual step to choose interesting clusters followed by enrichment analysis on each selected cluster is slow and tedious. To bridge this gap, we designed \Rpackage{clusterProfiler} \cite{yu2012}, for comparing and visualizing functional profiles among gene clusters. \section{Citation} Please cite the following articles when using \Rpackage{clusterProfiler}. \\ G Yu, LG Wang, Y Han, QY He. clusterProfiler: an R package for comparing biological themes among gene clusters. \textit{OMICS: A Journal of Integrative Biology}. 2012, 16(5), 284-287. \\ \section{Supported organisms} At present, \Rpackage{clusterProfiler} about 20 species as shown below: \begin{itemize} \item \textit{Arabidopsis} \item \textit{Anopheles} \item \textit{Bovine} \item \textit{Canine} \item \textit{Chicken} \item \textit{Chimp} \item \textit{E coli strain K12} \item \textit{E coli strain Sakai} \item \textit{Fly} \item \textit{Human} \item \textit{Malaria} \item \textit{Mouse} \item \textit{Pig} \item \textit{Rat} \item \textit{Rhesus} \item \textit{Worm} \item \textit{Xenopus} \item \textit{Yeast} \item \textit{Zebrafish} \end{itemize} These species are all supported by GO and KEGG analyses. GO analyses also support \textit{Coelicolor} and \textit{Gondii}. \section{Gene Ontology Classification} In \Rpackage{clusterProfiler}, \Rfunction{groupGO} is designed for gene classification based on GO distribution at a specific level. <>= require(DOSE) data(geneList) gene <- names(geneList)[abs(geneList) > 2] head(gene) ggo <- groupGO(gene=gene, organism="human", ont="BP", level=3, readable=TRUE) head(summary(ggo)) @ \section{Enrichment Analysis} \subsection{Hypergeometric model} Enrichment analysis \cite{boyle2004} is a widely used approach to identify biological themes. Here we implement hypergeometric model to assess whether the number of selected genes associated with disease is larger than expected. To determine whether any terms annotate a specified list of genes at frequency greater than that would be expected by chance, \Rpackage{clusterProfiler} calculates a p-value using the hypergeometric distribution: $ p = 1 - \displaystyle\sum_{i = 0}^{k-1} \frac{ {M \choose i} {{N-M} \choose {n-i}} } { {N \choose n} } $ In this equation, \textit{N} is the total number of genes in the background distribution, \textit{M} is the number of genes within that distribution that are annotated (either directly or indirectly) to the node of interest, \textit{n} is the size of the list of genes of interest and \textit{k} is the number of genes within that list which are annotated to the node. The background distribution by default is all the genes that have annotation. P-values were adjusted for multiple comparison, and q-values were also calculated for FDR control. \subsection{Gene set enrichment analysis} A common approach in analyzing gene expression profiles was identifying differential expressed genes that are deemed interesting. The enrichment analysis we demonstrated previous were based on these differential expressed genes. This approach will find genes where the difference is large, but it will not detect a situation where the difference is small, but evidenced in coordinated way in a set of related genes. Gene Set Enrichment Analysis (GSEA) \cite{subramanian_gene_2005} directly addresses this limitation. All genes can be used in GSEA; GSEA aggregates the per gene statistics across genes within a gene set, therefore making it possible to detect situations where all genes in a predefined set change in a small but coordinated way. Since it is likely that many relevant phenotypic differences are manifested by small but consistent changes in a set of genes. \\ \\ Genes are ranked based on their phenotypes. Given a priori defined set of gens \textit{S} (e.g., genes shareing the same \textit{GO} or \textit{KEGG} category), the goal of GSEA is to determine whether the members of \textit{S} are randomly distributed throughout the ranked gene list (\textit{L}) or primarily found at the top or bottom. \\ \\ There are three key elements of the GSEA method: \begin{itemize} \item Calculation of an Enrichment Score.\\ The enrichment score (\textit{ES}) represent the degree to which a set \textit{S} is over-represented at the top or bottom of the ranked list \textit{L}. The score is calculated by walking down the list \textit{L}, increasing a running-sum statistic when we encounter a gene in \textit{S} and decreasing when it is not. The magnitude of the increment depends on the gene statistics (e.g., correlation of the gene with phenotype). The \textit{ES} is the maximum deviation from zero encountered in the random walk; it corresponds to a weighted Kolmogorov-Smirnov-like statistic \cite{subramanian_gene_2005}. \item Esimation of Significance Level of \textit{ES}.\\ The \textit{p-value} of the \textit{ES} is calculated using permutation test. Specifically, we permute the gene labels of the gene list \textit{L} and recompute the \textit{ES} of the gene set for the permutated data, which generate a null distribution for the \textit{ES}. The \textit{p-value} of the observed ES is then calculated relative to this null distribution. \item Adjustment for Multiple Hypothesis Testing.\\ When the entire \textit{GO} or \textit{KEGG} gene sets is evaluated, \Rpackage{clusterProfiler} adjust the estimated significance level to account for multiple hypothesis testing and also \textit{q-values} were calculated for FDR control. \end{itemize} \subsection{GO enrichment analysis} <>= ego <- enrichGO(gene=gene, universe = names(geneList), organism="human", ont="CC", pvalueCutoff=0.01, readable=TRUE) head(summary(ego)) @ <>= ego2 <- gseGO(geneList=geneList, organism="human", ont="CC", nPerm=100, minGSSize=120, pvalueCutoff=0.01, verbose=FALSE) head(summary(ego2)) @ \subsection{KEGG pathway enrichment analysis} <>= kk <- enrichKEGG(gene=gene, organism="human", pvalueCutoff=0.01, readable=TRUE) head(summary(kk)) @ <>= kk2 <- gseKEGG(geneList=geneList, organism="human", nPerm=100, minGSSize=120, pvalueCutoff=0.01, verbose=FALSE) head(summary(kk2)) @ \subsection{DO enrichment analysis} Disease Ontology (DO) enrichment analysis is implemented in \Rpackage{DOSE}, please refer to the package vignettes. The \Rfunction{enrichDO} function is very useful for identifying disease association of interesting genes, and function \Rfunction{gseAnalyzer} function is designed for gene set enrichment analysis of \textit{DO}. \subsection{Reactome pathway enrichment analysis} With the demise of KEGG (at least without subscription), the KEGG pathway data in Bioconductor will not update and we encourage user to analyze pathway using \Rpackage{ReactomePA} which use Reactome as a source of pathway data. The function call of \Rfunction{enrichPathway} and \Rfunction{gsePathway} in \Rpackage{ReactomePA} is consistent with \Rfunction{enrichKEGG} and \Rfunction{gseKEGG}. \subsection{Function call} The function calls of \Rfunction{groupGO}, \Rfunction{enrichGO}, \Rfunction{enrichKEGG}, \Rfunction{enrichDO} and \Rfunction{enrichPathway} are consistent. The input parameters of \textit{gene} is a vector of entrezgene (for human and mouse) or ORF (for yeast) IDs, and \textit{organism} should be supported species (please refer to the manual of the specific function). For gene set enrichment analysis, the function of \Rfunction{gseGO}, \Rfunction{gseKEGG}, \Rfunction{gseAnalyzer} and \Rfunction{gsePathway} need extra paramter \textit{nPerm} to specify the permutation number. For GO analysis, \textit{ont} must be assigned to one of "BP", "MF", and "CC" for biological process, molecular function and cellular component, respectively. In \Rfunction{groupGO}, the \textit{level} specify the GO level for gene projection. In enrichment analysis, the \textit{pvalueCutoff} is to restrict the result based on their pvalues and the adjusted p values. \textit{Q-values} were also calculated for controlling false discovery rate (FDR). The \textit{readable} is a logical parameter to indicate the input gene IDs will map to gene symbols or not. \subsection{Visualization} The output of \Rfunction{groupGO}, \Rfunction{enrichGO} and \Rfunction{enrichKEGG} can be visualized by bar plot, enrichment map and category-gene-network plot. It is very common to visualize the enrichment result in bar or pie chart. We believe the pie chart is misleading and only provide bar chart. \subsubsection{barplot} <>= barplot(ggo, drop=TRUE, showCategory=12) @ <>= barplot(ego, showCategory=8) @ \subsubsection{enrichMap} Enrichment map can be viusalized by \Rfunction{enrichMap}, which support results obtained from hypergeometric test and gene set enrichment analysis. <>= enrichMap(ego) @ <>= enrichMap(ego2) @ \subsubsection{cnetplot} In order to consider the potentially biological complexities in which a gene may belong to multiple annotation categories and provide information of numeric changes if available, we developed \Rfunction{cnetplot} function to extract the complex association. <>= cnetplot(ego, categorySize="pvalue", foldChange=geneList) @ <>= cnetplot(kk, categorySize="geneNum", foldChange=geneList) @ \subsubsection{gseaplot} Running score of gene set enrichment analysis and its association of phenotype can be visualized by \Rfunction{gseaplot}. <>= gseaplot(kk2, geneSetID = "hsa04145") @ \subsubsection{pathview from pathview package} \Rpackage{clusterProfiler} users can also use \Rfunction{pathview} from the \Rpackage{pathview} \cite{luo_pathview} to visualize KEGG pathway. The following example illustrate how to visualize "hsa04110" pathway, which was enriched in our previous analysis. <>= require(pathview) hsa04110 <- pathview(gene.data=geneList, pathway.id="hsa04110", species="hsa", limit=list(gene=max(abs(geneList)), cpd=1)) @ \begin{figure}[ht] \centering \includegraphics[width=.9\textwidth,type=png,ext=.png,read=.png]{hsa04110.pathview} \caption{visualize KEGG pathway using pathview} \label{viewKEGG} \end{figure} For further information, please refer to the vignette of \Rpackage{pathview} \cite{luo_pathview}. \section{Biological theme comparison} \Rpackage{clusterProfiler} was developed for biological theme comparison, and it provides a function, \Rfunction{compareCluster}, to automatically calculate enriched functional categories of each gene clusters. <>= data(gcSample) ck <- compareCluster(geneCluster=gcSample, fun="enrichKEGG") plot(ck) @ By default, only top 5 (most significant) categories of each cluster was plotted. User can changes the parameter \textit{showCategory} to specify how many categories of each cluster to be plotted, and if \textit{showCategory} was set to \textit{NULL}, the whole result will be plotted. The dot sizes were based on their corresponding row percentage by default, and user can set the parameter \textit{by} to "count" to make the comparison based on gene counts. The parameter \textit{by} can also set to "rowPercentage" to normalize the dot sizes, since some categories may contain a large number of genes, and make the dot sizes of those small categories too small to compare. The default parameter \textit{by} is setting to "geneRatio", which corresponding to the "GeneRatio" column of the output. To provide the full information, we also provide number of identified genes in each category (numbers in parentheses) when \textit{by} is setting to "rowPercentage" and number of gene clusters in each cluster label (numbers in parentheses) when \textit{by} is setting to "geneRatio", as shown in Figure 3. If the dot sizes were based on "count", the row numbers will not shown. The p-values indicate that which categories are more likely to have biological meanings. The dots in the plot are color-coded based on their corresponding p-values. Color gradient ranging from red to blue correspond to in order of increasing p-values. That is, red indicate low p-values (high enrichment), and blue indicate high p-values (low enrichment). P-values and adjusted p-values were filtered out by the threshold giving by parameter \textit{pvalueCutoff}, and FDR can be estimated by \textit{qvalue}. User can refer to the example in \cite{yu2012}; we analyzed the publicly available expression dataset of breast tumour tissues from 200 patients (GSE11121, Gene Expression Omnibus) \cite{schmidt2008}. We identified 8 gene clusters from differentially expressed genes, and using \Rfunction{compareCluster} to compare these gene clusters by their enriched biological process. Another example was shown in \cite{yu2011}, we calculated functional similarities among viral miRNAs using method described in \cite{yu_new_2011}, and compared significant KEGG pathways regulated by different viruses using \Rfunction{compareCluster}. The comparison function was designed as a general-package for comparing gene clusters of any kind of ontology associations, not only \Rfunction{groupGO}, \Rfunction{enrichGO}, and \Rfunction{enrichKEGG} this package provided, but also other biological and biomedical ontologies, for instance, \Rfunction{enrichDO} from \Rpackage{DOSE} and \Rfunction{enrichPathway} from \Rpackage{ReactomePA} work fine with \Rfunction{compareCluster} for comparing biological themes in disease and reactome pathway perspective. More details can be found in the vignettes of \Rpackage{DOSE} and \Rpackage{ReactomePA}. \section{Session Information} The version number of R and packages loaded for generating the vignette were: <>= toLatex(sessionInfo()) @ \bibliographystyle{unsrt} \bibliography{clusterProfiler} \end{document}