% \VignetteIndexEntry{PatientGeneSetsTutorial} % \VignetteKeywords{Patient-oriented gene-set analysis} % \VignettePackage{PatientGeneSets} \documentclass[11pt]{article} \usepackage{epsfig} \usepackage{latexsym} \usepackage{amsmath} \usepackage{amssymb} \usepackage{amsfonts} \usepackage{amsxtra} \usepackage{graphicx,subfigure} \usepackage{vmargin} \usepackage{hyperref} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\texttt{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \parindent 0in \setpapersize{USletter} \setmarginsrb{1truein}{0.5truein}{1truein}{0.5truein}{16pt}{30pt}{0pt}{20truept} \setlength{\emergencystretch}{2em} \usepackage{Sweave} \begin{document} \title{\Rpackage{PatientGeneSets} package} \author{Simina M. Boca \\ Johns Hopkins Bloomberg School of Public Health \\ email: \texttt{sboca@jhsph.edu}, \\ \\ Giovanni Parmigiani \\ Dana-Farber Cancer Institute and \\ Harvard School of Public Health \\ email: \texttt{gp@jimmy.harvard.edu}} \maketitle \tableofcontents \section{Overview} Patient-oriented methods are a novel approach to gene-set analysis which calculate gene-set scores for each sample, then combine them across samples. By comparison, gene-oriented methods calculate a score for each gene across all samples, then combine them into gene-set scores. We find that for cancer mutation data, patient-oriented methods often perform better on both real and simulated data. The \Rpackage{PatientGeneSets} package has functions which perform gene-set analyses for cancer mutation data. It can be used to compute p-values and q-values for $4$ patient-oriented methods described in \cite{BocaEtAl2010}, as well as the gene-oriented method described in \cite{Smyth2004} (see also \cite{SchaefferEtAl2008}), which is in the \Rpackage{limma} package and performs a Wilcoxon test. It can also be used to do the simulations described in \cite{BocaEtAl2010}. The user has a choice between calculating q-values based on the FDR-control method in \cite{BenjaminiAndHochberg1995} or \cite{StoreyAndTibshirani2003}. This package is closely related to the \Rpackage{CancerMutationAnalysis} package, available at \url{http://bcb.dfci.harvard.edu/~gp/software/CancerMutationAnalysis/cma.htm} (\cite{ParmigianiEtAl2007}), which provides methods for gene-level analysis of cancer mutation data. In particular, the \Rfunction{cma.scores} function and the data format are nearly identical, but the \Rpackage{PatientGeneSets} package does not consider the two-phase design when performing gene-set analysis. This vignette represents an introduction on the \Rpackage{PatientGeneSets} package. The primary function \Rfunction{do.gene.set.analysis}, which implements the gene-set analysis methods. The function \Rfunction{sim.data.p.values} performs simulations using either the permutation null or the passenger null (see \cite{BocaEtAl2010}). The function \Rfunction{cma.data} calculates scores for each gene across all samples, as in \cite{SjoblomEtAl2006} and \cite{WoodEtAl2007}. The functions \Rfunction{extract.sims.method} and \Rfunction{combine.sims} are used to manipulate the objects resulting from \Rfunction{sim.data.p.values}. \section{Glioblastoma dataset} We use the glioblastoma dataset from \cite{ParsonsEtAl2008}. When typing \texttt{data(Parsons)}, $4$ objects are loaded: \texttt{CoverageBrain, EventsBySampleBrain, GeneSizes08} and \texttt{MutationsBrain}. For this example, we use KEGG annotations (\cite{KanehisaEtAl2000}, \cite{KanehisaEtAl2006}, \cite{KanehisaEtAl2010}). from the Bioconductor package \Rpackage{KEGG.db} by using the \texttt{KEGGPATHID2EXTID} object. <>= library(PatientGeneSets) data(Parsons) ls() library(KEGG.db) KEGGPATHID2EXTID @ Since the genes in the glioblastoma dataset are annotated by gene-names, while \texttt{KEGGPATHID2EXTID} uses EntrezGene identifiers, we also provide a vector mapping the identifiers to the names, which we obtained by using the \Rpackage{biomaRt} package. <>= data(ID2name) head(ID2name) @ \section{Implementing the gene-set analysis methods} The function \Rfunction{do.gene.set.analysis} returns a data-frame with p-values and q-values for all the methods selected. By default, all the methods are implemented; however this takes several minutes. Setting the \texttt{gene.method} parameter to \texttt{TRUE} implements the gene-oriented method. The other method parameters refer to the patient-oriented methods: \texttt{perm.null.method} refers to the permutation null without heterogeneity, \texttt{perm.null.het.method} to the permutation null with heterogeneity, \texttt{pass.null.method} to the passenger null without heterogeneity, and \texttt{pass.null.het.method} to the passenger null with heterogeneity. See \cite{BocaEtAl2010} for more details on all these methods. The gene-sets we use correspond to the KEGG annotations for endometrial cancer, non-small cell lung cancer, and alanine, aspartate and glutamate metabolism i.e. hsa$05213$, hsa$05223$, and hsa$00250$: <>= as.character(KEGGPATHNAME2ID[c("Endometrial cancer", "Non-small cell lung cancer", "Alanine, aspartate and glutamate metabolism")]) resultsBrain <- do.gene.set.analysis(EventsBySample = EventsBySampleBrain, GeneSizes = GeneSizes08, GeneSets = KEGGPATHID2EXTID[c("hsa05213", "hsa05223", "hsa00250")], Coverage = CoverageBrain, ID2name = ID2name, gene.method = FALSE, perm.null.method = TRUE, perm.null.het.method = FALSE, pass.null.method = TRUE, pass.null.het.method = FALSE) @ The resulting object is a data-frame, with NAs for the methods which were not implemented: <>= resultsBrain @ \subsection{Calculating gene scores} In order to implement the gene-oriented method, we need to have gene-level scores. We can obtain a variety of scores using the \texttt{cma.scores} function. The functions in this package are designed to use the \texttt{logLRT} scores. <>= GeneScores <- cma.scores(cma.data = MutationsBrain, number.genes = nrow(GeneSizes08)) head(GeneScores) @ \section{Simulating data} We now demonstrate the use of the \texttt{sim.data.p.values} function, which simulates datasets under either the permutation or passenger null (see \cite{BocaEtAl2010}), depending on whether \texttt{pass.null} is set to \texttt{TRUE} or \texttt{FALSE}, and calculates the p-values and q-values for those datasets for the selected methods. The simulations may also include spiked-in gene-sets, by using the \texttt{perc.samples} and \texttt{spiked.set.sizes} options. For example, if one desires to have two spiked-in gene-sets, both having $50$ genes, but one having the probability of being altered in any given sample of $0.75$ and the other of $0.95$, then these parameters should be set to \texttt{perc.samples = c(75, 90)} and \texttt{spiked.set.sizes = 50}. The spiked-in gene-sets are artificially created with hypothetical genes (for more details on how they are generated, see \cite{BocaEtAl2010}. To simulate the data without spiked-in sets, under the permutation or passenger null hypotheses, the parameters should be set as following: \texttt{perc.samples = NULL, spiked.set.sizes = NULL}. The object outputted by \texttt{sim.data.p.values} is of the class \texttt{SetMethodsSims}. Note that this code takes several minutes to run: <>= set.seed(831984) resultsSim <- sim.data.p.values(EventsBySample = EventsBySampleBrain, Mutations = MutationsBrain, GeneSizes = GeneSizes08, Coverage = CoverageBrain, GeneSets = KEGGPATHID2EXTID[c("hsa05213", "hsa05223", "hsa00250")], ID2name = ID2name, nr.iter = 2, pass.null = TRUE, perc.samples = c(75, 95), spiked.set.sizes = c(50), show.iter = TRUE, gene.method = FALSE, perm.null.method = TRUE, perm.null.het.method = FALSE, pass.null.method = TRUE, pass.null.het.method = FALSE) resultsSim slotNames(resultsSim) resultsSim@null.dist @ \subsection{Manipulating the \texttt{SetMethodsSims} objects} We provide $2$ functions to help manipulate the \texttt{SetMethodsSims} objects which result from the \texttt{sim.data.p.values} function: \texttt{extract.sims.method} and \texttt{combine.sims}. The \texttt{extract.sims.method} function is used to obtain a single data frame with the p-values or q-values from one of the specific methods. For instance, the command to get the p-values for the permutation null with no heterogeneity method is: <>= extract.sims.method(resultsSim, "p.values.perm.null") @ The function \texttt{combine.sims} may be used to combine $2$ simulations: <>= combine.sims(resultsSim, resultsSim) @ \bibliographystyle{plain} \bibliography{PatientGeneSets} \end{document}