%\VignetteIndexEntry{facopy} %\VignetteDepends{data.table, FactoMineR, IRanges, nnet, reshape2, scales, methods, ggplot2, gridExtra, facopy.annot} %\VignetteKeywords{association,gene-set enrichment, copy number alteration, cancer, recurrence} %\VignettePackage{facopy} \documentclass[11pt]{article} \usepackage{amsmath} \usepackage[authoryear,round]{natbib} \usepackage{hyperref} \usepackage{color} \usepackage{float} \definecolor{facopy}{RGB}{189,21,80} \definecolor{facopydark}{RGB}{131,15,56} \hypersetup{colorlinks=true, linkcolor=facopydark, citecolor=facopydark} \SweaveOpts{echo=FALSE} \setlength{\textheight}{8.5in} \setlength{\textwidth}{6in} \setlength{\topmargin}{-0.25in} \setlength{\oddsidemargin}{0.25in} \setlength{\evensidemargin}{0.25in} \begin{document} \SweaveOpts{concordance=TRUE} \setkeys{Gin}{width=0.99\textwidth} \title{\bf \textcolor{facopy}{facopy}: Feature-based association and gene-set enrichment for copy number alteration analysis in cancer } \author{David Mosen-Ansorena$^1$} \maketitle \begin{center} $^1$Genome Analysis Platform\\ CIC bioGUNE and CIBERehd\\ {\tt dmosen.gn@cicbiogune.es}\\ \end{center} \setcounter{tocdepth}{2} \tableofcontents %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Introduction} This document guides the reader through the use of the {\bf\textcolor{facopy}{facopy}} package. The main aim of the package is to provide fine-tuned copy number alteration (CNA) association modeling. Association is measured directly at the genomic features of interest and, in the case of genes, downstream gene-set enrichment analysis can be performed thanks to novel internal processing of the data. The software opens a way to systematically scrutinize the differences in CNA distribution across tumoral phenotypes, such as those that relate to tumor type, location and progression. Currently, the output format from 12 different methods that analyze data from whole-genome/exome sequencing and SNP microarrays, is supported. \\ Overall, the package was conceived to be user-friendly, provide visual assessment at each step and keep a simple and limited parameterization. \subsection{Example data} The package bundles example data from 20 colorectal cancer samples. The DNA samples were hybridized to SNP microarrays. Then the data was GC-corrected, preprocessed with TumorBoost \citep{Bengtsson2010} and analyzed with Genome Alteration Print (GAP) \citep{Popova2009}. See {\bf\textcolor{facopy}{facopy}}'s paper for more details. For the sake of space and speed, only copy number calls longer than one megabase are included. Although the example data does not include the sexual chromosomes, the package is able to process them. \section{Input} An example analysis on the previously described data follows. First, if you have not already, install the {\bf\textcolor{facopy}{facopy}} package through the \texttt{biocLite()} function, and then load it. Use the devel version of Bioconductor to get the package seamlessly. <>= source("http://www.bioconductor.org/biocLite.R") biocLite("facopy") @ %install.packages("facopy.annot_0.99.0.tar.gz",type="source",repos=NULL) %install.packages("facopy_0.0.1.tar.gz",type="source",repos=NULL) <>= library(facopy) @ The initial step of a {\bf\textcolor{facopy}{facopy}} analysis is to load the data from your input files and from {\bf\textcolor{facopy}{facopy}}'s companion annotation package: \begin{description} \item[Copy number calls] Copy number calls are the output files generated by CNA detection tools that analyze data from HTS (whole genome, exome sequencing) or SNP-microarrays (aCGH, SNP-based). As of now, the output format from the following methods is supported: seqCNA \citep{Mosen-Ansorena2014}, CNAnorm \citep{Gusnanto2012}, FREEC \citep{Boeva2012}, OncoSNP-SEQ \citep{Yau2013}, Patchwork \citep{Mayrhofer2013}, TITAN \citep{Ha2014}, EXCAVATOR \citep{Magi2013}, ExomeCNV \citep{Sathirapongsasuti2011}, GAP \citep{Popova2009}, OncoSNP \citep{Yau2010} and FastCall \citep{Benelli2010}. \item[Phenotypes] Phenotypic information takes the form of one variable per phenotype. A variable can be continuous or categorical (discrete), and does not need to be defined for all the samples. \item[Genomic features] {\bf\textcolor{facopy}{facopy}} is able to analyze data from Homo sapiens genome builds hg18 and hg19, and Mus musculus build mm8. For this latter, genes and microRNAs retrieved from BioMart's Ensembl database \citep{Hubbard2002} can be used as features at which to perform the association. The following sets of features taken from CaSNP \citep{Cao2011} can be used on human data as well: large intergenic non-coding RNAs (lincRNAs), tumor suppressor genes and oncogenes. Note nevertheless that any other set of genomic features may be imported to the analysis from an external source. \end{description} Call \texttt{getFacopyInfo()} to get a list of possible combinations for supported input formats, variable types, genomic features and more: <>= getFacopyInfo() @ Starting with the copy number calls, the \texttt{readCNData()} function takes as parameters, at least, the following two: the folder with the copy number data and the name of the tool used to generate it. Check \texttt{?readCNData} to learn more about the formats that {\bf\textcolor{facopy}{facopy}} understands as input. Here are some examples: <>= # myCalls = readCNData("~/myFolder/", "seqcna") # myCalls = readCNData("~/myFolder/", "gap", pfbFilename="~/myPfb.pfb") # myCalls = readCNData("~/myFolder/", "cnanorm", window=50000) @ For this vignette, the output of \texttt{readCNData()} over the example data has already been left in the \texttt{myCalls} object. Hence, we will just load it into memory through the \texttt{data()} function, like so: <>= data(myCalls) @ The next step is to attach the phenotypic information to the \texttt{myCalls} object. Either if you have it in a file (with headers) or in a data frame you previously created, just call the \texttt{addVariables()} function. You also need to specify, at least, the types of the variables (continuous or categorical). In our case, the \texttt{age} and \texttt{stage} variables are continuous and categorical, respectively. We will leave the combined data in the \texttt{myStudy} object. <>= data(myVariables) class(myVariables) head(myVariables) myStudy = addVariables(myCalls, myVariables, c("continuous","categorical")) @ To complete our input, we will select the genomic features of our interest and the genome build, which should coincide with the one used during the CNA detection. Typically, you will want to select the whole set of genes (\texttt{ensembl}), but there are interesting studies highlighting the relevance of miRNAs (\texttt{mirna}) and lincRNAs (\texttt{lincRNA}) in cancer \citep{Cao2011}. Let me stress that, in order to perform the subsequent gene-set enrichment analysis, it is necessary to indicate some kind of gene collection (\texttt{ensembl}, \texttt{oncogene}, \texttt{tumorsuppressor}, \texttt{cancergene}). <>= myStudy = addFeatures(myStudy, "oncogene", "hg18") summary(myStudy) @ By default, for each genomic feature, only overlapping copy number calls are considered. However, additional upstream and downstream flanks can be set. This might render useful, for instance, in order to include regions upstream of every gene, thus accounting for nearby regulatory elements. \section{The alterations} Two factors are used in {\bf\textcolor{facopy}{facopy}} to discriminate copy number alterations: (1) whether the copy number is lower, equal or greater than 2 (diploid genomes are assumed) and the presence of loss of heterozygosity (LOH). The following relevant combinations of alterations are defined in {\bf\textcolor{facopy}{facopy}}: \begin{description} \itemsep0em \item[\texttt{amplifications}] All amplifications (CN>2). \item[\texttt{deletions}] All deletions (CN<2). \item[\texttt{loh}] All loss of heterozygosity (LOH), regardless of copy number. \item[\texttt{cnas}] All copy number alterations (CN<>2). \item[\texttt{any}] Any kind of alteration. \item[\texttt{all}] An alias for any kind of alteration, same as \texttt{any}. \item[\texttt{onlygain}] Only non-LOH amplifications. \item[\texttt{someloss}] All deletions plus LOH alterations. \end{description} You might also use capital letters anywhere in the names and abbreviations (e.g. \texttt{Amp} for \texttt{amplifications}). Where relevant, these modified names will show up in the graphical output. Depending on the combination of alterations, different designs are available in some of {\bf\textcolor{facopy}{facopy}}'s functions. The simplest design is \texttt{binary}: an alteration exists or it does not. The \texttt{versus} design, for CNAs, assigns a value of -1, 0 or 1 depending on whether a deletion, no copy number change or an amplification exists for a given feature, respectively. The \texttt{vlog} design, for all (any) alterations, assigns a value of -1, 0 or 1 depending on whether a deletion or LOH, no copy number change or an amplification without LOH exists. \section{Analysis} One way to start looking at the data is to plot the results of a Principal Component Analysis (PCA), showing the similarity among the samples. By default, \texttt{plotPCA()} uses a \texttt{binary} design. In the following call, we tell it to use all alterations and to color the points (representing the samples) based on their phenotypes. By default, the thickness of the point rims represents the rank, among all samples, in the amount of total altered base pairs, where a thinner rim indicates fewer altered based pairs. Visual assessment discards correlation between amount of total alterations and the phenotypes. <>= par(mfrow=c(1,2)) pca1 = plotPCA(myStudy, "any", "age") pca2 = plotPCA(myStudy, "any", "stage") head(pca2$eig, 2) @ \subsection{Recurrence assessment} {\bf\textcolor{facopy}{facopy}} offers both graphical and text output to help assessing alteration recurrence. \subsubsection{Summary tables} Text output consists of three summaries that provide an overview of the variables in the study and the alterations in the samples. The following functions generate tables that focus on summarizing or testing a specific aspect of the data, and can write the output to a selected file: \begin{description} \item[\texttt{variableSummary()}] Shows per-alteration type differences among the values in each variable. \item[\texttt{alterationSummary()}] Breaks down alteration frequencies by chromosome arm and two alteration classifications: deletion, normal copy number or amplification, and presenting LOH or not. \item[\texttt{variableCor()}] Performs appropriate statistical tests that measure correlations between pairs of variables. \end{description} <>= variableSummary(myStudy) head(alterationSummary(myStudy)) variableCor(myStudy) @ You can actually call these capitulating functions before you attach the genomic feature annotation. Also, alternatively, you might call the \texttt{variableSummary()}, which collects the three tables and can output them to a folder you indicate: <>= myCallsPreview = preview(myStudy) sapply(myCallsPreview, head, 1) @ \subsubsection{Visualization} The plot generated by \texttt{plotBar()} provides an overview of chromosome arm-wise frequencies of CNA and LOH combinations. It displays, for each chromosome arm, four triangle-shaped points. The two points above the abscissa indicate amplification frequencies and the two points below it, deletion frequencies. The two big points specify the overall amplification and deletion frequencies in the arm, while the small points specify somatic LOH frequencies within either copy number alteration type. Frequencies can be calculated as the median incidence of either altered features or base pairs in a given arm across all samples in the dataset. Chromosome arms of interest can be highlighted. <>= myArms = c("8q","13q","20q","8p","18q") myColors = c(rainbow(15)[1:3], rainbow(15)[10:11]) plotBar(myStudy, TRUE, myArms, myColors, ylim=c(-0.4,1)) @ Data of such complexity, with multiple variables, genes, alteration types and chromosome arms can be broken down in many ways in order to investigate different aspects. \\ \texttt{plotHist()} outputs individual stacked histograms with alteration frequencies for each of the values in a categorical variable. In this example, we choose to display amplifications broken down by tumor stage. Each bar in the histograms represents the number of features (here, genes) with an altered frequency that falls within a specific range. Chromosome arms of interest can be differentially colored here as well. Notice that the bin size (width of each value range) and the maximum value in the Y axis are required. <>= plotHist(myStudy, "amp", "stage", myArms, myColors, bin=0.1, ymax=80) @ Another possibility, which deepens more into the data, is to look at a specific arm within each sample, and color a set of alterations based on the phenotypic value. Given the histograms, there seemed to be no relevant differences in main amplifications based on tumor stage. Therefore, now, let us focus on chromosome arm 8p, which looked pretty deleted overall in the plot bar, and the deletions within it. Each sample's deletions are all depicted within the same line: <>= plotZoom(myStudy, "arm", "8p", "del", "stage") @ Arm 8p is quite more deleted in stage IV (4) samples. Although we do not yet have association results for the genes in this arm, we can, for the sake of providing an example, zoom into gene \emph{PCM1}, which we will see to be associated to tumor stage. <>= plotZoom(myStudy, "feat", "PCM1", "del", "stage") @ \subsection{Association analysis} {\bf\textcolor{facopy}{facopy}} implements a flexible system for the definition of association models between genomic copy number and phenotypes. The association is assessed at every genomic feature by considering for each test only those alterations that overlap with the corresponding feature. \subsubsection{Association models} In the simplest scenario, a single phenotype is measured against the presence of a certain combination of alterations overlapping each feature. The meaning of the phenotype will prompt to simply assess the independence from the copy number or to model the phenotype as either a cause (predictor) or consequence (response) of the copy number differences. Apart from just measuring the presence or absence of a combination of alterations, other designs are possible (see above). \\ Discrete variables can be turned into ordinal variables, while values from continuous variables can be classified into intervals and thus be considered ordinal as well. \\ More complex models are possible by defining variable interactions and strata, see \texttt{?formula}. In addition, null models can also be specified so that, for example, the effect of considering an additional variable in a model can be statistically measured. Finally, advanced users of {\bf\textcolor{facopy}{facopy}} can fully specify their association models in order to define their own statistical tests and interactions between phenotypes and copy number. \\ For a detailed description of the parameters in the \texttt{facopy()} function that allow the aforementioned characteristics, see \texttt{?facopy}. \subsubsection{Examples} In these examples, we will see that \texttt{facopy()} outputs both text and graphical results. \\ Example 1 shows how deletions in gene \emph{PCM1} are indeed significantly associated with tumor stage in our samples. The call to the function is pretty straightforward, so let us take a look at a slightly more advanced use case. \\ By default, the function establishes that the phenotype is a consequence of the copy number differences. In example 2, we indicate that our model establishes our variables to predict the copy number. Specifically, we measure the impact of the interaction between our two variables on the copy number. \\ If you want to visualize the results of the association, set the \texttt{plot} parameter to \texttt{TRUE} (example 3). In such case, you might overlay data from an external database. Namely, the frequencies of the relevant alterations in the desired data set will be displayed as translucent black bars. \texttt{getFacopyInfo()} lists, among other things, the collection of available data sets, which were gathered from the Cancer Genome WorkBench \citep{Zhang2007}. <>= genes = facopy(myStudy, "del", "stage") #1 head(genes[order(genes$p_value),]) genes = facopy(myStudy, "amp", "stage*age", modelPart="predictor") #2 head(genes[order(genes$p_value),]) genes = facopy(myStudy, "amp", "stage", plot=TRUE, db="gsk_colon") #3 @ \subsection{Gene-set enrichment analysis} Gene-set enrichment is a statistical analysis in which biologically relevant sets of related genes (e.g. pathways) are examined for enrichment in genes found during the association. {\bf\textcolor{facopy}{facopy}} performs such analysis over a range of gene-set collections from MSigDB \citep{Liberzon2011} that include Gene Ontology (GO), Reactome, KEGG, miRNA and transcription factor targets, and more. Notice that gene-set enrichment analysis is only available if genes or a subset of genes, such as those that are cancer-related, were selected as features. Before the analysis is run, genes with little influence on gene expression are disregaded, as they are not expected to present relevant mutations. This is achieved by calculating the correlation coefficient (R2) between copy number and expression. You might use your own expression data (tab-delimited with headers) or tell {\bf\textcolor{facopy}{facopy}} to calculate the correlation in a dataset from the cBio project \citep{Cerami2012}: <>= # eCor = calculateCor(myStudy, "~/myExprData.txt") # eCor = calculateCor(myStudy, "mrna_merged_median_Zscores", "coadread_tcga_pub") @ The \texttt{facopyEnrichment()} function uses the correlation coefficients to keep only those genes with values above a given threshold. <>= # facopyEnrichment(myStudy, genes, eCor, "~/myFolder/stageAmpEnrichment") @ Another issue with gene-set enrichment analysis is that it assumes independent association results for each gene. However, this is not the case in cancer copy number analysis. Within a significant cluster, most likely, many of the gene alterations are merely what are known as passengers, in contrast to driver mutations, which do actually have an influence on the phenotype. \\ GRAS (Score column in the enrichment tables, see below) is a two-valued score that involves the examination of gene distribution across chromosome arms, in an attempt to assess the possibility that enrichment derives from the presence of many significant genes in few arm-wide alterations. One should not discard, however, that significantly lower counts are due to co-location of related genes, especially for small numbers of genes. GRAS provides a value that indicates the amount of chromosome arms to which the genes in the enrichment belong and another one for the significance of the enrichment. \begin{table}[H] \begin{tabular}{l|llllllll} Category & Size & ExpCount & Count & OddsR & Pvalue & PvalueAdj & Score & Genes \\ \hline Class A/1 & 6 & 0.197 & 2 & 14.91 & 0.0147 & 1 & 2\_0.807 & ... \\ Apoptosis & 65 & 2.13 & 6 & 3.082 & 0.0190 & 1 & 5\_0.044 & ... \\ Zinc transp. & 7 & 0.230 & 2 & 11.92 & 0.0202 & 1 & 2\_0.718 & ... \\ \end{tabular} \end{table} {\bf\textcolor{facopy}{facopy}} also produces graphical output for the three gene ontologies (Biological Process, Molecular Function and Cellular Component) and for the canonical pathway collections (Reactome, KEGG and BioCarta). In the case of the former, graph trees with enriched terms are produced, whereas, for each enriched pathway, a graph with gene relationships is generated. Only those gene sets with an enrichment significance below the corresponding threshold in the call \texttt{facopyEnrichment()} will be included in the grahical output. \\ All the output from the gene-set enrichment analysis is saved to the indicated folder in the call to \texttt{facopyEnrichment()}. \begin{figure}[hp] \includegraphics[width=0.99\textwidth]{enrichmentGO.pdf} \end{figure} \begin{figure}[hp] \includegraphics[width=0.9\textwidth]{enrichmentPathway.pdf} \end{figure} \pagebreak <>= sessionInfo() @ \newpage \bibliographystyle{apalike} \bibliography{facopy} \end{document}