%\VignetteIndexEntry{GeneBreak} %\VignetteDepends{GeneBreak} %\VignetteKeywords{Recurrent gene break detection on copy number data from array-CGH or sequencing.} %\VignettePackage{GeneBreak} %\VignetteEngine{utils::Sweave} \documentclass{article} \usepackage[colorlinks=true,linkcolor=black,citecolor=black,urlcolor=blue]{ hyperref} \begin{document} \SweaveOpts{concordance=TRUE} \newcommand{\pkg}[1]{\texttt{#1}} \newcommand{\code}[1]{\texttt{#1}} \title{Introduction to GeneBreak} \author{Evert van den Broek\footnote{Correspondence to: Christian Rausch (c.rausch@vumc.nl) or Sanne Abeln (s.abeln@vu.nl)} \cr\& Stef van Lieshout} \maketitle \begin{center} Department of Pathology \\* VU University Medical Center \\* The Netherlands, Amsterdam \\* \end{center} \tableofcontents \clearpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Running GeneBreak} The \pkg{GeneBreak} package aims to systematically identify genes recurrently affected by copy number aberration-associated breakpoint locations that indicate underlying DNA breaks and thereby genes involved in structural variants. This is a short tutorial on how to use the \pkg{GeneBreak} package. It describes an example workflow which uses copy number aberration (CNA) data obtained by analysis of 200 array-CGH (Agilent 180k) samples from advanced colorectal cancers. We used the \pkg{CGHcall} package that can be obtained via Bioconductor (www.bioconductor.org). First, we will start with loading the package. <>= library(GeneBreak) @ <>= options("GeneBreak::verbose"=NA) options(width=75) @ \subsection{Detect breakpoints from copy-number data} Copy number data can be loaded in two ways. We recommend the usage of Bioconductor packages \code{CGHcall} or \code{QDNAseq} to process CNA data from array-CGH or sequencing data respectively. The obtained cghCall/QDNAseq object can directly serve as input for the GeneBreak pipeline. Alternatively, a data.frame with exactly these five columns: "Chromosome", "Start", "End" and "FeatureName" (usually probe or bin identifier) followed by columns with sample data can be provided. In this tutorial we will use a built-in dataset that contains CNA data from chromosome 20: %------------------------------------------------------------ \subsubsection{Loading cghCall object} %------------------------------------------------------------ To load and run the example dataset, which is an object of class \code{CGHcall}, the \pkg{CGHcall} package needs to be installed. \begin{Schunk} \begin{Sinput} > # Install the "CGHcall" package from Bioconductor: > # if (!requireNamespace("BiocManager", quietly=TRUE)) > # install.packages("BiocManager") > # BiocManager::install("CGHcall") \end{Sinput} \end{Schunk} Load the example dataset from \pkg{GeneBreak}: <>= library(CGHcall) data( "copynumber.data.chr20" ) @ Inspection of the loaded data shows an R object of class \code{cghCall} that contains CNA data from 3653 features (array-CGH probes in this case) and 200 samples. <>= copynumber.data.chr20 @ To generate an object of class \code{CopyNumberBreakpoints} with breakpoint locations, run \code{getBreakpoints()}. This will obtain the required information from the \code{cghCall} object and determine the breakpoint locations. <>= breakpoints <- getBreakpoints( data = copynumber.data.chr20 ) breakpoints @ Inspection of the generated object shows that we have copy number data of 3653 features from 200 samples. A total of 1035 individual breakpoint locations were identified. %------------------------------------------------------------ \subsubsection{Loading data from a dataframe} %------------------------------------------------------------ If the CNA data has not been generated by \pkg{CGHcall} or \pkg{QDNAseq}, there is a possibilty of using a \code{data.frame()} as input for \code{GeneBreak}. This allows breakpoint analysis of data from any copy number detection pipeline by importing a text file into \code{getBreakpoints()}. Here we show how to use two data.frames() with segment and (optionally) call values as input for \code{getBreakpoints} instead of a cghCall/QDNAseq object. <>= library(CGHcall) cgh <- copynumber.data.chr20 segmented <- data.frame( Chromosome=chromosomes(cgh), Start=bpstart(cgh), End=bpend(cgh), FeatureName=rownames(cgh), segmented(cgh)) called <- data.frame( Chromosome=chromosomes(cgh), Start=bpstart(cgh), End=bpend(cgh), FeatureName=rownames(cgh), calls(cgh)) breakpoints <- getBreakpoints( data = segmented, data2 = called ) @ Note: the first five column names of the data.frame must exactly be "Chromosome", "Start", "End" and "FeatureName". %------------------------------------------------------------ \subsection{Breakpoint selection by filtering} %------------------------------------------------------------ Next, breakpoints can be filtered by stringent criteria. Different filters can be set (see \code{?bpFilter} for more details). Default setting is "CNA-ass" which means that breakpoints flanked by copy number neutral segments will be filtered out. Note: you need discrete copy number calls (loss,neutral, gain, etc) for this option. <>= breakpointsFiltered <- bpFilter( breakpoints, filter = "CNA-ass" ) breakpointsFiltered @ Inspection of the output shows that 985 CNA-associated breakpoint locations remain following the filter step. %------------------------------------------------------------ \subsection{Identification of genes affected by breakpoints} %------------------------------------------------------------ Identification of genes affected by breakpoints requires execution of the following three steps. %------------------------------------------------------------ \subsubsection{Loading gene annotation data} %------------------------------------------------------------ We need to load gene annotations to be able to identify genes affected by breakpoints in the next step. Gene annotation for human reference genome hg18 (and hg19, hg38) are built-in, but also user-defined annotations can be used. The required columns for this data.frame are "Gene", "Chromosome", "Start" and "End". <>= data( "ens.gene.ann.hg18" ) @ This shows the content of the first six rows of the hg18 gene annotation dataframe: <>= head( ens.gene.ann.hg18 ) @ %------------------------------------------------------------ \subsubsection{Feature-to-gene mapping} %------------------------------------------------------------ Here, the loaded gene annotation information will be added to the GeneBreak object and feature-to-gene mapping will be performed. <>= breakpointsAnnotated <- addGeneAnnotation( breakpointsFiltered, ens.gene.ann.hg18 ) @ Tho show the names of associated features of e.g. the "PCMTD2" gene, give: <>= featuresPerGene ( breakpointsAnnotated , geneName = "PCMTD2" ) @ Gene-associated feature information has been added to \code{breakpointsAnnotated}. Visualisation shows: <>= geneFeatures <- geneInfo( breakpointsAnnotated ) head( geneFeatures[ , c("Gene", "Chromosome", "Start", "End", "featureTotal", "featureNames", "remarks") ] ) @ Possible "remarks" that describe gene position with respect to feature positions are: "A": genes located upstream of the first chromosomal feature (no gene-associated features) "B": genes located downstream of the last chromosomal feature (no gene-associated features) "C": in case of array-CGH probes, the whole gene is located between two features "C": in case of sequencing data, the whole gene is located between start and end of one bin "D": gene represented by one or multiple features "E": gene represented by one or multiple features, but the end of the gene is not covered by any feature "X": no feature covers the chromosome of the gene %------------------------------------------------------------ \subsubsection{Detection of gene-associated breakpoints} %------------------------------------------------------------ In the next step, gene-associated breakpoints will be identified by using \code{bpGenes()}. <>= breakpointGenes <- bpGenes( breakpointsAnnotated ) @ This is an example of the output when selected for broken genes: <>= result_BreakpointGenes <- geneInfo ( breakpointGenes ) head( result_BreakpointGenes[ which ( result_BreakpointGenes$sampleCount > 0 ) , c( "Gene", "Chromosome", "Start", "End", "featureTotal", "nrOfBreakLocations", "sampleCount", "sampleNamesWithBreakpoints") ] ) @ This table shows the genes (rows) and the number of gene-associated features in "featureTotal". The column "nrOfBreakLocations" indicates the number of identified breakpoint locations in the gene across all samples. As a consequence, this is a subset of, and limited by, the total number of gene-associated features. The total of samples that harbor a breakpoint in the gene is given in the column "sampleCount". %------------------------------------------------------------ \subsection{Cohort-based breakpoint statistics} %------------------------------------------------------------ Following identification of (gene) breakpoints per profile, breakpoint events that are significantly recurring across samples will be determined by dedicated statistical analysis. This can be performed at "gene" (breakpoint gene) and/or "feature" (breakpoint location) level. Two different methods of FDR-type correction for multiple testing can be used, the standard Benjamini-Hochberg FDR-type correction ("BH") or dedicated Benjamini-Hochberg FDR-type correction ("Gilbert"). %------------------------------------------------------------ \subsubsection{Detection of recurrent breakpoint genes} %------------------------------------------------------------ The gene-based statistical analysis includes correction for covariates that may influence the probability to be a breakpoint gene including number of breakpoints in a profile, number of gene-associated features and gene length by gene-associated feature coverage. Multiple testing can be applied by the powerful dedicated Benjamini-Hochberg FDR-type correction ("Gilbert") that accounts for the discreteness of the null-distribution. (Reference: Gilbert PB, Appl Statist. 2005;54:143-58) NOTE: when running \code{bpStats()} warnings can be generated by a function (glm.fit) of a dependancy package, this does not harm the analysis. <>= breakpointStatistics <- bpStats( breakpointGenes, level = "gene", method = "Gilbert" ) @ This will return an object of class \code{CopyNumberBreakPointGenes}. By using \code{recurrentGenes()} we can observe the recurrent affected genes with associated P-value and FDR. <>= head( recurrentGenes( breakpointStatistics ) ) @ %------------------------------------------------------------ \subsubsection{Detection of recurrent breakpoint locations} %------------------------------------------------------------ With this step, statistics at breakpoint location (feature) level will be added to the object of class \code{CopyNumberBreakPointGenes}. Here, we recommend to use the less computationally intensive standard Benjamini-Hochberg FDR-type correction for multiple testing, because the breakpoint probability is equal across features per profile, which means that all positions correspond to the same null-distribution. <>= breakpointStatistics <- bpStats( breakpointStatistics, level = "feature", method = "BH" ) @ <>= breakpointStatistics @ By using \code{featureInfo()} we can observe the features and whether they were identified as breakpoints including the calculated FDR values: <>= head( featureInfo( breakpointStatistics ) ) @ %------------------------------------------------------------ \subsection{Visualization of breakpoint frequencies} %------------------------------------------------------------ Breakpoint locations and frequencies can be visualized using \code{bpPlot()}: <>= pdf("bpPlot.png", width=10) @ <>= bpPlot( breakpointStatistics, fdr.threshold = 0.1 ) @ <>= dev.off() @ \begin{figure}[h] \centering \includegraphics{bpPlot} \caption{Graphical representation of CNA-associated chromosomal breakpoint frequencies and their distribution over chromosomes 20. The X-axis depicts the genomic position in Mb. The Y-axis depicts the chromosomal breakpoint frequencies across the series of 200 CRC samples. Breakpoint frequencies are indicated on array-CGH probe-level (vertical black bars) and on gene-level (horizontal red bars). Recurrent breakpoint genes (FDR<0.1) are named. When the gene breakpoint frequency exceeded 15\% (horizontal dashed line), the breakpoint frequency (\%) follows the gene name.} \label{fig:bpPlot} \end{figure} \clearpage \section{Storage of R objects} At any time during the analysis, the GeneBreak objects (and any R objects for that matter) can be saved to disk with: \code{saveRDS}, and in the future be read from the local file with \code{loadRDS} \section{Downloading Gene Annotations} This section describes the steps taken to create the gene annotations used in this package. It may serve as a start for creating your own if required for whatever reason. <>= # gene annotations obtained via Biomart. # HUGO gene names (HGNC symbol), Ensembl_ID and chromosomal location # Used (and most) recent releases: # HG18: release54 # HG19: release75 # HG38: release80 (date: 150629) library(biomaRt) ensembl54 = useMart( host = 'may2009.archive.ensembl.org', biomart = 'ENSEMBL_MART_ENSEMBL', dataset = "hsapiens_gene_ensembl" ) ensembl75 = useMart( host = 'feb2014.archive.ensembl.org', biomart = 'ENSEMBL_MART_ENSEMBL', dataset = "hsapiens_gene_ensembl" ) ensembl80 = useMart( "ensembl", dataset = "hsapiens_gene_ensembl" ) createAnnotationFile <- function( biomartVersion ) { biomart_result <- getBM( attributes = c( "hgnc_symbol", "ensembl_gene_id", "chromosome_name", "start_position", "end_position", "band", "strand" ), mart = biomartVersion ) biomart_result[ ,3] <- as.vector( biomart_result[ ,3] ) idx_x <- biomart_result$chromosome_name == "X" idx_y <- biomart_result$chromosome_name == "Y" biomart_result$chromosome_name[ idx_x ] <- "23" biomart_result$chromosome_name[ idx_y ] <- "24" biomart_genes <- biomart_result[ which(biomart_result[ ,1] != "" & biomart_result[ ,3] %in% c(1:24)) , ] colnames(biomart_genes)[1:5] <- c("Gene","EnsID","Chromosome","Start","End") cat( c( "Biomart version:", biomartVersion@host, "including:", dim(biomart_genes)[1], "genes\n" ) ) return( biomart_genes ) } ens.gene.ann.hg18 <- createAnnotationFile( ensembl54 ) ens.gene.ann.hg19 <- createAnnotationFile( ensembl75 ) ens.gene.ann.hg38 <- createAnnotationFile( ensembl80 ) @ \clearpage \section{Session Information} The version number of R and packages loaded for generating the vignette were: <>= sessionInfo() @ \end{document} % EOF