% \VignetteIndexEntry{KEGGprofile: Application Examples}
% \VignettePackage{KEGGprofile}
\documentclass[12pt]{article}

\textwidth 6.75in
\textheight 9.5in
\topmargin -.875in
\oddsidemargin -.06in
\evensidemargin -.06in

\usepackage{hyperref}
\hypersetup{
    colorlinks=true, %set true if you want colored links
    linktoc=all,     %set to all if you want both sections and subsections linked
    linkcolor=blue,  %choose some color if you want links to stand out
}

\begin{document}

\title{KEGGprofile: Application Examples}
\author{Shilin Zhao}
\maketitle

\begin{abstract}
In this vignette, we demonstrate the application of KEGGprofile as an annotation and visualization tool for the analysis of multi-type and multi-group high-throughput expression data. Going beyond existing approaches, KEGGprofile combines the KEGG pathway map with the expression profiles of the genes in that pathway, which facilitates a more detailed analysis of specific functional changes within a pathway and of temporal correlations among genes and samples. Here we introduce the data preparation steps and the functions used to visualize pathway gene expression profiles.
\end{abstract}

\section{Introduction}
KEGG is a database resource for understanding high-level functions and utilities of the biological system, such as the cell, from genomic and molecular-level information (\href{http://www.kegg.jp/kegg/}{http://www.kegg.jp/kegg/}). The KEGG pathway database is composed of a large collection of pathway maps covering different biological functions, including metabolism, signal transduction, cellular processes, and disease. It is now a prominent reference knowledge base for the integration and interpretation of large-scale molecular data sets generated by high-throughput experimental technologies. Many tools have been developed for KEGG pathway mapping or functional annotation, but most of them are limited to finding significantly enriched pathways for selected genes.
To further analyze functional changes within a pathway, some tools were developed to map selected genes onto the pathway map, such as Color Pathway in the KEGG Mapper tools\cite{2}. When two samples are compared directly, such as diseased and normal individuals, the gene expression changes in each KEGG pathway can be visualized, which helps in understanding the functional changes between the samples.

With the development of high-throughput experimental technologies, the systematic analysis of complicated biological questions, such as drug stimulation, disease progression and cell differentiation, often involves multiple types or groups of data, including expression at the transcriptome and proteome levels, in disease and normal samples, and at different time points. However, none of the currently available software for KEGG pathway mapping can visualize expression profiles in such complicated data analyses. The only solutions are to generate multiple pathway maps, one for each time point, or to illustrate the expression profile manually, which is inconvenient for visualization and ambiguous for functional analysis.

To address this problem, we developed an R package, KEGGprofile, which provides an easy and automatic pipeline for the analysis and visualization of multi-type and multi-group expression data. With this package, the expression profiles of genes and their annotation in KEGG pathway maps can be integrated. The researcher can then focus directly on the functional changes within a pathway or on the expression correlation between different types of data. This makes the package a valuable tool for systematic profiling and time series data analysis.

\section{Example usage}

\subsection{Data preparation}
NCBI gene IDs (such as 67040, 93683) are used in the KEGG database to represent the genes in a pathway. We therefore need to transform the identifiers in our expression data into NCBI gene IDs.
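As a minimal sketch of such a transformation in base R, assume that a lookup table from gene symbols to NCBI gene IDs is already available. The table 'id\_map' and the data frame 'expr' below are hypothetical and serve only as an illustration; they are not part of KEGGprofile. Rows without a mapping, or mapping to an ID that was already used, are dropped before the row names are replaced:

<<eval=FALSE>>=
## hypothetical lookup table: gene symbol -> NCBI gene ID (illustrative values)
id_map <- c(TP53 = "7157", CDK1 = "983", CCNB1 = "891")
## hypothetical expression data with gene symbols as row names
expr <- data.frame(G1 = c(1.2, -0.4, 0.8, 0.1),
                   S  = c(0.9, 0.3, -1.1, 0.0),
                   row.names = c("TP53", "CDK1", "CCNB1", "UNKNOWN1"))
## look up the NCBI gene ID of every row; unmapped symbols become NA
ncbi <- unname(id_map[row.names(expr)])
## keep only rows with a unique, non-missing NCBI gene ID
keep <- !is.na(ncbi) & !duplicated(ncbi)
expr_ncbi <- expr[keep, , drop = FALSE]
row.names(expr_ncbi) <- ncbi[keep]
expr_ncbi
@
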
After the transformation, KEGGprofile is generally applicable to genomics, transcriptomics and proteomics data.

A previously published data set from a proteome and phosphoproteome analysis across different cell cycle phases was taken as an example\cite{1}. We have prepared the example data in the data directory. It can be imported into the R environment with:

<<>>=
library(KEGGprofile)
data(pro_pho_expr)
data(pho_sites_count)
ls()
colnames(pro_pho_expr)
pro_pho_expr[1:3,1:4]
@

The pro\_pho\_expr object is a data.frame with expression profiles. Columns 1-6 are proteome data and columns 7-12 are phosphoproteome data. The 6 time points are G1, G1/S, Early S, Late S, G2 and Mitosis. For phosphorylation sites mapping to the same gene, the one with the largest variation across the 6 time points is kept. The pho\_sites\_count object is a data.frame with the number of phosphorylation sites quantified for each gene. Here the NCBI gene IDs should be the row.names of every data.frame.

If your expression data does not use NCBI gene IDs, you need to convert it first. We provide a function called 'convertId' for this purpose.

<<>>=
example(convertId)
@

In addition, the package requires the original KEGG pathway maps as backgrounds and the KGML (KEGG XML) files to extract the gene locations in the pathway maps. These files can be downloaded from the KEGG website (\href{http://www.kegg.jp/kegg/}{http://www.kegg.jp/kegg/}), and we also provide a function called \textquoteleft{}download\_KEGGfile\textquoteright{} to do so. The following call downloads the pathway map and KGML file for human pathway '04110' to the working directory:

<<>>=
download_KEGGfile(pathway_id="04110",species='hsa')
@

Here pathway\_id can also be set to \textquoteleft{}all\textquoteright{}, in which case all pathway IDs for human are extracted from the KEGG.db package and the related files are downloaded.

\subsection{Find enriched pathways}
The function 'find\_enriched\_pathway' can be used to find enriched pathways for genes of interest.
The genes of interest can be selected in several ways, such as genes responding to a specific stimulation, or genes with negative correlation between disease and normal samples. The results of statistical tests can also be used. The selected genes are then annotated with the KEGG pathway database, and hypergeometric tests are used to estimate the significance of enrichment. In addition, a criterion on the number of annotated genes in a pathway can be used for pathway selection.

The 'find\_enriched\_pathway' function has a very important parameter, 'download\_latest'. As the KEGG.db package was only updated until 2012, the latest links between genes and pathways can be downloaded from the KEGG database by setting 'download\_latest' to TRUE. This is particularly important when users are interested in non-model organisms that were added to KEGG after 2012.

Here we used highly phosphorylated proteins as candidates for annotation. The criterion was at least 10 quantified phosphorylation sites.

<<>>=
## genes with at least 10 quantified phosphorylation sites
genes<-row.names(pho_sites_count)[which(pho_sites_count>=10)]
## pathway enrichment for the selected genes
pho_KEGGresult<-find_enriched_pathway(genes,species='hsa')
pho_KEGGresult[[1]][,c(1,5)]
@

Then we compared the correlations between protein and phosphorylation levels for the pathways enriched in highly phosphorylated proteins.

<<>>=
plot_pathway_cor(gene_expr=pro_pho_expr,kegg_enriched_pathway=pho_KEGGresult)
@

As the example data came from a study of different cell cycle phases, the Cell cycle pathway (pathway id 04110) was visualized further.

\subsection{Visualization of expression profile on KEGG maps}
In each KEGG pathway map, genes are represented by polygons, and biological relations between genes, such as activation or phosphorylation, are represented by lines. The function 'plot\_pathway' can be used to draw the expression profiles in the pathway map in place of the original gene polygons.
There are two visualization methods to represent gene expression profiles: \textquotedblleft{}background\textquotedblright{} and \textquotedblleft{}lines\textquotedblright{}.

The first method is applicable to analyses with only one sample or one type of data. It divides the gene polygon into several sub-polygons that represent the different time points, and each sub-polygon has a specific background color to represent the expression change at that time point. We used the phosphoproteome changes at 6 time points as an example. First, the function 'col\_by\_value' was used to transform the expression differences between samples into specific colors. After that, the 'plot\_pathway' function, which calls 'plot\_profile' internally, was used to visualize the gene expression profiles in the KEGG pathway. A pathway map named 'hsa04110\_profile\_bg.png' is generated in the working directory.

<<>>=
## the phosphoproteome data
pho_expr<-pro_pho_expr[,7:12]
## keep only genes without missing values
temp<-apply(pho_expr,1,function(x) length(which(is.na(x))))
pho_expr<-pho_expr[which(temp==0),]
## transform the expression difference into specific color
col<-col_by_value(pho_expr,
    col=colorRampPalette(c('green','black','red'))(1024),
    range=c(-6,6))
## visualization by method 'bg'
temp<-plot_pathway(pho_expr,type="bg",bg_col=col,text_col="white",
    magnify=1.2,species='hsa',
    database_dir=system.file("extdata",package="KEGGprofile"),
    pathway_id="04110")
@

The second method plots lines with different colors in the gene polygon to represent different samples or different types of data. The shape of each line is determined by the profile of the gene across the time points. Background colors can also be added to the pathway map to provide more biological information, such as p values or subcellular locations. The proteome and phosphoproteome changes were used as an example for the 'lines' method. First, the function 'col\_by\_value' was used to transform the number of phosphorylation sites quantified for each gene into a specific color, which serves as the background of each gene polygon.
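The idea of binning a count into a background color can be sketched in base R. This is an illustration only and not the implementation of 'col\_by\_value'; the gene names and counts are hypothetical, and treating each interval as closed on the left (so that a count of exactly 10 falls into the last bin) is an assumption:

<<eval=FALSE>>=
## hypothetical numbers of quantified phosphorylation sites per gene
counts <- c(geneA = 0, geneB = 1, geneC = 3, geneD = 4, geneE = 10, geneF = 25)
## bin boundaries: [0,1), [1,4), [4,10), [10,Inf)
breaks <- c(0, 1, 4, 10, Inf)
## one color per bin, from white to khaki2
pal <- colorRampPalette(c("white", "khaki2"))(length(breaks) - 1)
## integer bin index of every gene (left-closed intervals)
bin <- cut(counts, breaks = breaks, right = FALSE, labels = FALSE)
## background color of every gene, named by gene
bg_col <- setNames(pal[bin], names(counts))
bg_col
@
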
Then the 'plot\_pathway' function, which calls 'plot\_profile' internally, was run, and a pathway map named 'hsa04110\_profile\_lines.png' is generated in the working directory.

<<>>=
## transform the number of phosphorylation sites into specific color
col<-col_by_value(pho_sites_count,
    col=colorRampPalette(c('white','khaki2'))(4),
    breaks=c(0,1,4,10,Inf))
## visualization by method 'lines'
temp<-plot_pathway(pro_pho_expr,type="lines",bg_col=col,
    line_col=c("brown1","seagreen3"),
    groups=c(rep("Proteome",6),rep("Phosphoproteome",6)),
    magnify=1.2,species='hsa',
    database_dir=system.file("extdata",package="KEGGprofile"),
    pathway_id="04110",max_dist=5)
@

In this section, we only used the background color of the gene polygon, which represents the number of phosphorylation sites. The colors of the gene name (text\_col) and of the gene polygon border (border\_col) can also be determined with the function 'col\_by\_value' and used to represent other important biological information, such as subcellular locations or correlations between samples.

Here we only demonstrated the application to gene expression data. Compound data are also supported by KEGGprofile; see the examples of the 'plot\_pathway' function for more details.

\section{More details}
To make the visualization process easier, the function 'plot\_pathway' is in fact a wrapper for the download\_KEGGfile, parse\_XMLfile and plot\_profile functions. First, the existence of the KEGG pathway map files (.xml and .png) is checked in database\_dir. If they are missing, the download\_KEGGfile function is used to download them. Then the function parse\_XMLfile is used to parse the XML file into a matrix containing the genes in this pathway together with their names, locations, etc. Finally, the function 'plot\_profile' is used to generate the pathway map.

\bibliographystyle{plain}
\addcontentsline{toc}{section}{\refname}\bibliography{Ref}

\end{document}