% \VignetteIndexEntry{An introduction to clusterProfiler} % \VignetteDepends{AnnotationDbi, GO.db, org.Hs.eg.db, ggplot2, methods, qvalue, KEGG.db} % \VignetteSuggests{GOSemSim} % \VignetteKeywords{Gene Ontology Analysis} % \VignettePackage{clusterProfiler} %\SweaveOpts{prefix.string=images/fig} \documentclass[a4paper]{article} \usepackage{Sweave} \usepackage{a4wide} \usepackage{times} \usepackage{hyperref} \usepackage[T1]{fontenc} \usepackage[english]{babel} \usepackage{framed} \usepackage{longtable} \usepackage{amsmath} \usepackage{amsfonts} \usepackage{amssymb} \usepackage[authoryear,round]{natbib} \textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rcode}[1]{{\texttt{#1}}} \newtheorem{theorem}{Theorem}[section] \newcommand{\R}{\textsf{R}} \newcommand{\clusterProfiler}{\Rpackage{clusterProfiler}} \newcommand{\compareClusterResult}{\Rclass{compareClusterResult}} \newcommand{\groupGOResult}{\Rclass{groupGOResult}} \newcommand{\enrichGOResult}{\Rclass{enrichGOResult}} \newcommand{\term}[1]{\emph{#1}} \newcommand{\mref}[2]{\htmladdnormallinkfoot{#2}{#1}} \bibliographystyle{plainnat} \title{\Rpackage{clusterProfiler}: an \R\, package for \\ Statistical Analysis and Visualization of Functional Profiles\\ for Genes and Gene Clusters} \author{Guangchuang Yu \\ \\ Jinan University, Guangzhou, China} \begin{document} \maketitle <>= options(width=60) require(clusterProfiler) @ \section{Introduction} In recently years, high-throughput experimental techniques such as microarray and mass spectrometry can identify many lists of genes and gene products. The most widely used strategy for high-throughput data analysis is to identify different gene clusters based on their expression profiles. Another commonly used approach is to annotate these genes to biological knowledge, such as Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG), and identify the statistically significantly enriched categories. These two different strategies were implemented in many bioconductor packages, such as \Rpackage{Mfuzz} and \Rpackage{BHC} for clustering analysis and \Rpackage{GOstats}\citep{Gentleman2007} for GO enrichment analysis. After clustering analysis, researchers not only want to determine whether there is a common theme of a particular gene cluster, but also to compare the biological themes among gene clusters, which have different expression profiles. To bridge this gap, we designed \Rpackage{clusterProfiler}, for comparing functional profiles among gene clusters. This document presents an introduction to the use of \Rpackage{clusterProfiler}, an \R\, package for the analysis of lists of genes and gene clusters based on their GO annotation distribution or enrichment categories of GO and KEGG, and provides methods for visulization. \section{Quick start} The following lines provide a quick and simple example on the use of \Rpackage{clusterProfiler} to explore a set of genes and compare gene clusters. The analysis proceeds as follows: \begin{itemize} \item First a sample dataset is loaded. This dataset contains 5 gene clusters. <>= require(clusterProfiler) data(gcSample) gcSample @ \item Use \Rfunction{groupGO} for genes classification based on GO distribution at a specific level. <>= x <- groupGO(gene=gcSample[[1]], organism="human", ont="CC", level=2, readable=TRUE) summary(x) @ \item Use \Rfunction{enrichGO} for GO enrichment analysis. <>= y <- enrichGO(gene=gcSample[[2]], organism="human", ont="MF", pvalueCutoff=0.01, readable=TRUE) @ \item Use \Rfunction{enrichKEGG} for KEGG pathway enrichment analysis. <>= z <- enrichKEGG(gene=gcSample[[3]], organism="human", pvalueCutoff=0.05, readable=TRUE) summary(z) @ The input parameters of \textit{gene} is a vector of entrez genes or ORF IDs (for yeast), and \textit{organism} must be one of "human", "mouse", and "yeast", according to the gene IDs. For GO analysis, \textit{ont} must be assigned to one of "BP", "MF", and "CC" for biological process, molecular function and cellular component, respectively. In \Rfunction{groupGO}, the \textit{level} specify the GO level for gene projection. In enrichment analysis, the \textit{pvalueCutoff} is to restrict the result based on their pvalues. Consider multiple testing, qvalues are also provided, for estimating FDR. The \textit{readable} is a logical parameter, if TRUE, the gene IDs will map to gene symbols. In addition, these results can be visualized by our \Rfunction{plot} function. For example: <>= plot(x, title="CC Ontology Classification, level 2", font.size=12) @ \begin{figure} \centering \centering \resizebox{1\textwidth}{!}{\includegraphics{groupGO.png}} \caption{Example of gene classification} \label{Fig:groupGO} \end{figure} <>= plot(z, title="KEGG Enrichment") @ \begin{figure} \centering \centering \resizebox{1\textwidth}{!}{\includegraphics{enrichKEGG.png}} \caption{Example of KEGG enrichment analysis} \label{Fig:enrichKEGG} \end{figure} \item Gene clusters can be compared by \textit{compareCluster}, and plotted by bar chart or dot chart. <>= xx <- compareCluster(gcSample, fun=groupGO, organism="human", ont="MF", level=2) plot(xx, title="MF Ontology Distribution Comparison") @ \begin{figure} \centering \centering \resizebox{1\textwidth}{!}{\includegraphics{MFcomparison.png}} \caption{Example of comparing MF ontology distribution using dotplot} \label{Fig:compareClusterPlot-groupGO} \end{figure} <>= yy <- compareCluster(gcSample, fun=enrichGO, organism="human", ont="CC", pvalueCutoff=0.01) plot(yy, title="CC Ontology Enrichment Comparison") @ \begin{figure} \centering \centering \resizebox{1\textwidth}{!}{\includegraphics{CCenrichComparisonDot.png}} \caption{Example of comparing CC ontology enrichment using dot chart} \label{Fig:compareClusterPlot-enrichGO} \end{figure} <>= plot(yy, title="CC Ontology Enrichment Comparison", type="bar", by="count") @ \begin{figure} \centering \centering \resizebox{1\textwidth}{!}{\includegraphics{CCenrichComparisonBar.png}} \caption{Example of comparing CC ontology enrichment using bar chart} \label{Fig:compareClusterPlot-enrichGO} \end{figure} <>= zz <- compareCluster(gcSample, fun=enrichKEGG, organism="human", pvalueCutoff=0.05) plot(zz, title="KEGG Pathway Enrichment Comparison") @ \begin{figure} \centering \centering \resizebox{1\textwidth}{!}{\includegraphics{KEGGcomparison.png}} \caption{Example of comparing KEGG enrichment among gene clusters} \label{Fig:compareClusterPlot-enrichKEGG} \end{figure} By default, only top 5 categories of each cluster was plotted. User can changes the parameter \textit{limit} to specify how many categories of each cluster to be plotted, and if \textit{limit} set to NULL, the whole result will be plotted. By default, the dot sizes were based on their corresponding row percentage, and user can set the parameter \textit{by} to "count" to make the comparison based on gene counts. We chose "percentage" as default parameter to represent the sizes of dots, since some categories may contain a large number of genes, and make the dot sizes of those small categories too small to compare. To provide the full information, we also provide number of identified genes in each category (numbers in parentheses), as shown in Figure 1. If the dot sizes were based on "count", the parentheses will not shown. The p-values indicate that which categories are more likely to have biological meanings. The dots in the image are color-encoded based on their corresponding p-values. Color gradient ranging from red to blue correspond to in order of increasing p-values. red indicate low p-values (high enrichment), and blue indicate high p-values (low enrichment). P-values were filtered out by the threshold giving by parameter \textit{pvalueCutoff}. We also provide q-values, which were calculated by \Rpackage{qvalue}, for user to control false discovery rate. FDR control is necessary since enrichment analysis carrying out hundreds, if not thousands, of tests. \end{itemize} \section{Session Information} The version number of R and packages loaded for generating the vignette were: \begin{verbatim} <>= sessionInfo() @ \end{verbatim} \bibliography{clusterProfiler} \end{document}