%\VignetteIndexEntry{Introduction to MutationalPatterns} \documentclass{article} \usepackage{float} \usepackage[natbibapa]{apacite} \bibliographystyle{apacite} <>= BiocStyle::latex(use.unsrturl=FALSE) @ \title{Introduction to \Biocpkg{MutationalPatterns}} \author{Francis Blokzijl \and Roel Janssen \and Ruben van Boxtel \and Edwin Cuppen} \date{\today} \begin{document} \maketitle \tableofcontents \newpage{} <>= options(width=96) library(ggplot2) @ \section{Introduction} Mutational processes leave characteristic footprints in genomic DNA. This package provides an easy-to-use toolset for the characterization and visualization of mutational patterns in base substitution catalogues of e.g. tumour samples or DNA-repair deficient cells. The package covers a wide range of patterns including: mutational signatures, transcriptional strand bias, genomic distribution and association with genomic features, which are collectively meaningful for studying the activity of mutational processes. The package provides functionalities for both extracting mutational signatures \emph{de novo} and inferring the contribution of previously identified mutational signatures in a given sample. MutationalPatterns integrates with common R genomic analysis workflows and allows easy association with (publicly available) annotation data. This package provides a comprehensive set of flexible functions for easy finding and plotting of such mutational patterns in base substitution catalogues. \section{Related packages} \subsection{Comparison to \Biocpkg{SomaticSignatures}} \subsubsection{Similar functionality} SomaticSignatures provides functions for genomic context determination and \emph{de novo} signature extraction using NMF decomposition of 96 trinucleotide count matrices. MutationalPatterns provides this functionality too, but the plotting is different, because MutationalPatterns offers the functionality to extract signatures \emph{de novo} from a 192 feature matrix, with 96 trinucleotide X 2 transcriptional strands. This allows assessment of transcriptional strand bias of mutational signatures, which is important to characterize the underlying mutational mechanism. \subsubsection{Unique functionality} The following functions can be found in MutationalPatterns, but not in SomaticSignatures. \begin{itemize} \item \Rfunction{plot\_contribution}: A function to determine the optimal contribution of previously identified signatures, e.g. cosmic cancer signatures, to reconstruct the mutational profile of just a single sample. This is useful for two reasons: \begin{enumerate} \item for NMF you need many samples with distinct mutational profiles, as it relies on dimensionality reduction. \item In order to further characterize ``known'' mutational signatures and find the underlying mutational mechanisms, this function can be used to determine the contribution of these signatures in different samples, e.g. normal cells, or cells with defective DNA repair mechanisms etc. \end{enumerate} \item \Rfunction{plot\_enrichment\_depletion}: A plotting function and statistical test for enrichment or depletion of mutations in any (publicly available) annotated genomic region such as transcription factor binding site or ``open chromatin''. This is useful for the characterization of mutational mechanisms. \item \Rfunction{strand\_bias\_test}, \Rfunction{plot\_strand\_bias}: A statistical test and a plotting function for transcriptional strand bias in mutation catalogs. \item \Rfunction{plot\_compare\_profiles}: A plotting function to visualize the difference between two 96 mutational profiles and calculate RSS. \end{itemize} \newpage{} \section{Data} To perform mutational pattern analyses you need your VCF datasets and the reference genome. \subsection{List reference genome} List available genomes using \Biocpkg{BSgenome}: <>= library(BSgenome) head(available.genomes()) @ Download and load your reference genome of interest: <>= ref_genome <- "BSgenome.Hsapiens.UCSC.hg19" library(ref_genome, character.only = TRUE) @ \subsection{Load example data} We provided a small example data set with this package. The data consists of somatic mutation catalogues of 9 normal human adult stem cells from 3 tissues \citep{Blokzijl2016}. % You can download larger % sample datafrom the MutationalPatterns-data repository at % \url{https://github.com/CuppenResearch/MutationalPatterns-data/}. Load the MutationalPatterns package: <>= library(MutationalPatterns) @ Locate the example data: <>= vcf_files <- list.files(system.file("extdata", package="MutationalPatterns"), pattern = ".vcf", full.names = TRUE) @ Define corresponding sample names for the datasets: <>= sample_names <- c( "colon1", "colon2", "colon3", "intestine1", "intestine2", "intestine3", "liver1", "liver2", "liver3") @ This package uses \texttt{GRanges} to represent the variant calls. So, to work with data, we need to load it as such. Load the example VCF files into a \texttt{GRangesList}: <>= vcfs <- read_vcfs_as_granges(vcf_files, sample_names, ref_genome) summary(vcfs) @ Store relevant metadata on the samples, such as tissue type, in a character vector: <>= tissue <- c(rep("colon", 3), rep("intestine", 3), rep("liver", 3)) @ \section{Mutation characteristics} Now we have loaded the VCF files and the corresponding reference genome, we can start our search for characteristic mutational footprints. \subsection{Base substitution types} We can retrieve base substitutions from VCF object as "REF>ALT" using \Rfunction{mutations\_from\_vcf}: <>= muts = mutations_from_vcf(vcfs[[1]]) head(muts, 12) @ % alleen de head (eerste 5) outputten zodat de output wat kleiner wordt We can retrieve base substitutions from the VCF object and convert them to the 6 types of base substitution types that are distinguished by convention: C>A, C>G, C>T, T>A, T>C, T>G. For example, when the reference allele is G and the alternative allele is T (G>T), this functions returns the G:C>T:A mutation as a C>A mutation: <>= types = mutation_types(vcfs[[1]]) head(types, 12) @ To retrieve the context (one base upstream and one base downstream) of the positions in the VCF object from the reference genome, you can use the \Rfunction{mutation\_context} function: <>= context = mutation_context(vcfs[[1]], ref_genome) head(context, 12) @ With \Rfunction{type\_context}, you can retrieve the types and contexts for all positions in the VCF object. For the base substitutions that are converted to the conventional base substitution types, the reverse complement of the context is returned. <>= type_context = type_context(vcfs[[1]], ref_genome) head(type_context$types, 12) head(type_context$context, 12) @ Using \Rfunction{mut\_type\_occurrences}, you can count mutation type occurrences for the loaded samples in a list of VCF objects, which can be used to plot the mutation spectrum. <>= type_occurrences <- mut_type_occurrences(vcfs, ref_genome) type_occurrences @ \subsection{Mutation spectrum} A mutation spectrum shows the relative contribution of each mutation type in the base substitution catalogs. The \Rfunction{plot\_spectrum} function plots the mean relative contribution of each of the 6 base substitution types over all samples. Error bars indicate standard deviation over all samples. The $n$ indicates the total number of mutations in the set. <>= p1 = plot_spectrum(type_occurrences) @ Plot the mutation spectrum with distinction between C>T at CpG sites: <>= p2 = plot_spectrum(type_occurrences, CT = TRUE) @ Plot spectrum without legend: <>= p3 = plot_spectrum(type_occurrences, CT = TRUE, legend = FALSE) @ <>= library("gridExtra") grid.arrange(p1, p2, p3, ncol=3, widths=c(3,3,1.75)) @ <>= library("gridExtra") ggsave("combine_plot_spectrum.pdf", grid.arrange(p1, p2, p3, ncol=3, widths=c(3,3,1.75)), width=10, height=3) @ \begin{figure}[H] \begin{center} \includegraphics[width=1.0\textwidth]{combine_plot_spectrum} \end{center} \end{figure} You can facet the per sample group, e.g. plot spectrum for each tissue separately: <>= p4 <- plot_spectrum(type_occurrences, by = tissue, CT = TRUE, legend = TRUE) @ Define your own 7 colors for spectrum plotting: <>= palette <- c("pink", "orange", "blue", "lightblue", "green", "red", "purple") p5 <- plot_spectrum(type_occurrences, CT = TRUE, legend = TRUE, colors = palette) @ <>= ggsave("combine_plot_spectrum_2.pdf", grid.arrange(p4, p5, ncol=2, widths=c(4,2.3)), width=10, height=3) @ <>= grid.arrange(p4, p5, ncol=2, widths=c(4,2.3)) @ \begin{figure}[H] \begin{center} \includegraphics[width=1.0\textwidth]{combine_plot_spectrum_2} \end{center} \end{figure} \subsection{96 Mutation profile} Make 96 trinucleodide mutation count matrix: <>= mut_mat <- mut_matrix(vcf_list = vcfs, ref_genome = ref_genome) @ Plot 96 profile of three samples: \begin{figure}[H] \begin{center} <>= plot_96_profile(mut_mat[,c(1,4,7)]) @ \end{center} \end{figure} \newpage{} \section{Mutational signatures} \subsection{\emph{De novo} mutational signature extraction using NMF} A critical parameter in NMF is the factorization rank, which is the number of mutational signatures. Determine the optimal factorization rank using the NMF package \citep{Gaujoux2010}. As described in their paper: ``...a common way of deciding on the rank is to try different values, compute some quality measure of the results, and choose the best value according to this quality criteria. The most common approach is to choose the smallest rank for which cophenetic correlation coefficient starts decreasing. Another approach is to choose the rank for which the plot of the residual sum of squares (RSS) between the input matrix and its estimate shows an inflection point.'' We can use the NMF package to determine which rank we should use to extract signatures using \Rfunction{extract\_signatures}: Add a small psuedocount to avoid a 0 in the matrix: <>= mut_mat = mut_mat + 0.0001 @ Use the NMF package to generate an estimate plot. <>= library("NMF") estimate = nmf(mut_mat, rank=2:5, method="brunet", nrun=100, seed=123456) @ \begin{figure}[H] \begin{center} <>= plot(estimate) @ \end{center} \end{figure} Extract e.g. 2 mutational signatures from the mutation count matrix: <>= nmf_res <- extract_signatures(mut_mat, rank = 2) @ Provide column names for the signatures: <>= colnames(nmf_res$signatures) <- c("Signature A", "Signature B") @ Plot the 96-profile of the signatures: \begin{figure}[H] \begin{center} <>= plot_96_profile(nmf_res$signatures) @ \end{center} \end{figure} Plot the contribution of the signatures: \begin{figure}[H] \begin{center} <>= rownames(nmf_res$contribution) <- c("Signature A", "Signature B") pc1 <- plot_contribution(nmf_res$contribution, nmf_res$signature, mode = "relative") @ \end{center} \end{figure} Plot the contribution of the signatures in absolute number of mutations: \begin{figure}[H] \begin{center} <>= pc2 <- plot_contribution(nmf_res$contribution, nmf_res$signature, mode = "absolute") grid.arrange(pc1, pc2) @ \end{center} \end{figure} Plot contribution of signatures for subset of samples with index parameter: \begin{figure}[H] \begin{center} <>= pc3 <- plot_contribution(nmf_res$contribution, nmf_res$signature, mode = "absolute", index = c(1,2)) @ \end{center} \end{figure} Flip X and Y coordinates: \begin{figure}[H] \begin{center} <>= pc4 <- plot_contribution(nmf_res$contribution, nmf_res$signature, mode = "absolute", coord_flip = TRUE) grid.arrange(pc3, pc4, ncol=2) @ \end{center} \end{figure} Compare reconstructed mutation profile with original mutation profile: \begin{figure}[H] \begin{center} <>= plot_compare_profiles(mut_mat[,1], nmf_res$reconstructed[,1], profile_names = c("Original", "Reconstructed")) @ \end{center} \end{figure} \subsection{Fit 96 mutation profiles to known signatures} Download signatures from pan-cancer study \citep{Alexandrov2013}: <>= sp_url <- paste("http://cancer.sanger.ac.uk/cancergenome/assets/", "signatures_probabilities.txt", sep = "") cancer_signatures = read.table(sp_url, sep = "\t", header = TRUE) # Reorder (to make the order of the trinucleotide changes the same) cancer_signatures = cancer_signatures[order(cancer_signatures[,1]),] # Only signatures in matrix cancer_signatures = as.matrix(cancer_signatures[,4:33]) @ Fit mutation matrix to cancer signatures. This function finds the optimal linear combination of mutation signatures that most closely reconstructs the mutation matrix by solving a non-negative least-squares constraints problem. <>= fit_res <- fit_to_signatures(mut_mat, cancer_signatures) @ <>= # select signatures with some contribution select <- which(rowSums(fit_res$contribution) > 0) # plot contribution plot_contribution ( fit_res$contribution[select,], cancer_signatures[,select], coord_flip = FALSE, mode = "absolute" ) @ <>= # select signatures with some contribution select <- which(rowSums(fit_res$contribution) > 0) # plot contribution ggsave("plot_contribution_3.pdf", plot_contribution(fit_res$contribution[select,], cancer_signatures[,select], coord_flip = FALSE, mode = "absolute"), width=9, height=5) @ \begin{figure}[H] \begin{center} \includegraphics[width=1.0\textwidth]{plot_contribution_3} \end{center} \end{figure} Compare reconstructed mutation profile of sample 1 using cancer signatures with original profile: \begin{figure}[H] \begin{center} <>= plot_compare_profiles ( mut_mat[,1], fit_res$reconstructed[,1], profile_names = c("Original", "Reconstructed") ) @ \end{center} \end{figure} \section{Transcriptional strand bias} \subsection{Strand bias analysis} For the mutations within genes it can be determined whether the mutation is on the transcribed or non-transcribed strand, which is interesting to study the involvement of transcription-coupled repair. To this end, it is determined whether the "C" or "T" base (since by convention we regard base substitutions as C>X or T>X) are on the same strand as the gene definition. Base substitions on the same strand as the gene definitions are considered "untranscribed", and on the opposite strand of gene bodies as transcribed, since the gene definitions report the coding or sense strand, which is untranscribed. No strand information is reported for base substitution that overlap with more than one gene body. Get gene definitions for your reference genome: <>= # For example get known genes table from UCSC for hg19 using # biocLite("TxDb.Hsapiens.UCSC.hg19.knownGene") library("TxDb.Hsapiens.UCSC.hg19.knownGene") genes_hg19 <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene) @ Get transcriptional strand information for all positions in the first VCF object. Function returns ``-'' for positions outside gene bodies, ``U'' for untranscribed/sense/coding strand, ``T'' for transcribed/anti-sense/non-coding strand. <>= strand = strand_from_vcf(vcfs[[1]], genes_hg19) head(strand, 10) @ Make mutation count matrix with transcriptional strand information (96 trinucleotides * 2 strands = 192 features). NB: only those mutations that are located within gene bodies are counted. <>= mut_mat_s <- mut_matrix_stranded(vcfs, ref_genome, genes_hg19) head(mut_mat_s, 10) @ Perform strand bias analysis: <>= strand_counts <- strand_occurrences(mut_mat_s, by=tissue) head(strand_counts) @ \begin{figure}[H] \begin{center} <>= ps1 <- plot_strand(strand_counts, mode = "relative") @ \end{center} \end{figure} Perform poisson test for strand asymmetry significance testing: <>= strand_bias <- strand_bias_test(strand_counts) @ Plot the effect size (log2(untranscribed/transcribed) of the strand bias. Asteriks indicate significant strand bias. \begin{figure}[H] \begin{center} <>= ps2 <- plot_strand_bias(strand_bias) grid.arrange(ps1, ps2) @ \end{center} \end{figure} %TODO: combine plots of plot_strand & plot_strand_bias, zodat ze onder elkaar staan \subsection{Extract signatures with strand bias} Extract 2 signatures with strand bias: <>= nmf_res_strand <- extract_signatures(mut_mat_s, rank = 2) # Provide signature names colnames(nmf_res_strand$signatures) <- c("Signature A", "Signature B") @ Plot signatures with 192 features: <>= a <- plot_192_profile(nmf_res_strand$signatures) @ Plot strand bias per mutation type for each signature with significance test: <>= b <- plot_signature_strand_bias(nmf_res_strand$signatures) @ <>= grid.arrange(a, b, ncol=2, widths=c(5,2)) @ <>= ggsave("plot_192_profile.pdf", grid.arrange(a, b, ncol=2, widths=c(5,2)), width = 10, height = 5) @ \begin{figure}[H] \begin{center} \includegraphics[width=1.0\textwidth]{plot_192_profile} \end{center} \end{figure} \section{Genomic distribution} \subsection{Rainfall plot} A rainfall plot visualizes mutation types and intermutation distance. Rainfall plots can be used to visualize the distribution of mutations along the genome or a subset of chromosomes. The y-axis corresponds to the distance of a mutation with the previous mutation and is log10 transformed. Drop-downs from the plots indicate clusters or ``hotspots'' of mutations. Make rainfall plot of sample 1 over all autosomal chromosomes <>= # Define autosomal chromosomes chromosomes <- seqnames(get(ref_genome))[1:22] # Make a rainfall plot plot_rainfall ( vcfs[[1]], title = names(vcfs[1]), chromosomes = chromosomes, cex = 1.5, ylim = 1e+09 ) @ <>= # Define autosomal chromosomes chromosomes <- seqnames(get(ref_genome))[1:22] # Make a rainfall plot ggsave("plot_rainfall.pdf", plot_rainfall(vcfs[[1]], title = names(vcfs[1]), chromosomes = chromosomes, cex = 1.5, ylim = 1e+09), width=9, height=3) @ \begin{figure}[H] \begin{center} \includegraphics[width=1.0\textwidth]{plot_rainfall} \end{center} \end{figure} Make rainfall plot of the first sample over chromosome 1: \begin{figure}[H] \begin{center} <>= chromosomes <- seqnames(get(ref_genome))[1] plot_rainfall ( vcfs[[1]], title = names(vcfs[1]), chromosomes = chromosomes[1], cex = 2, ylim = 1e+09 ) @ \end{center} \end{figure} \subsection{Enrichment or depletion of mutations in genomic regions} Test for enrichment or depletion of mutations in certain genomic regions, such as promoters, CTCF binding sites and transcription factor binding sites. To use your own genomic region definitions (based on e.g. ChipSeq experiments) specify your genomic regions in a named list of GRanges objects. Alternatively, use publically available genomic annotation data, like in the example below. \subsubsection{Example: regulation annotation data from Ensembl using \Biocpkg{biomaRt}} The following example displays how to download promoter, CTCF binding sites and transcription factor binding sites regions for genome build hg19 from Ensembl using \Biocpkg{biomaRt}. For other datasets, see the \Biocpkg{biomaRt} documentation \citep{Durinck2005}. To install \Biocpkg{biomaRt}, uncomment the following lines: <>= source("https://bioconductor.org/biocLite.R") biocLite("biomaRt") @ Load the \Biocpkg{biomaRt} package. <>= library(biomaRt) @ Download genomic regions. NB: Here we take some shortcuts by loading the results from our example data. The corresponding code for downloading this data can be found above the command we run: <>= # regulatory <- useEnsembl(biomart="regulation", # dataset="hsapiens_regulatory_feature", # GRCh = 37) ## Download the regulatory CTCF binding sites and convert them to ## a GRanges object. # CTCF <- getBM(attributes = c('chromosome_name', # 'chromosome_start', # 'chromosome_end', # 'feature_type_name', # 'cell_type_name'), # filters = "regulatory_feature_type_name", # values = "CTCF Binding Site", # mart = regulatory) # # CTCF_g <- reduce(GRanges(CTCF$chromosome_name, # IRanges(CTCF$chromosome_start, # CTCF$chromosome_end))) CTCF_g <- readRDS(system.file("states/CTCF_g_data.rds", package="MutationalPatterns")) ## Download the promoter regions and convert them to a GRanges object. # promoter = getBM(attributes = c('chromosome_name', 'chromosome_start', # 'chromosome_end', 'feature_type_name'), # filters = "regulatory_feature_type_name", # values = "Promoter", # mart = regulatory) # promoter_g = reduce(GRanges(promoter$chromosome_name, # IRanges(promoter$chromosome_start, # promoter$chromosome_end))) promoter_g <- readRDS(system.file("states/promoter_g_data.rds", package="MutationalPatterns")) ## Download the promoter flanking regions and convert them to a GRanges object. # flanking = getBM(attributes = c('chromosome_name', # 'chromosome_start', # 'chromosome_end', # 'feature_type_name'), # filters = "regulatory_feature_type_name", # values = "Promoter Flanking Region", # mart = regulatory) # flanking_g = reduce(GRanges( # flanking$chromosome_name, # IRanges(flanking$chromosome_start, # flanking$chromosome_end))) flanking_g <- readRDS(system.file("states/promoter_flanking_g_data.rds", package="MutationalPatterns")) @ Combine all genomic regions (GRanges objects) in a named list: <>= regions <- GRangesList(promoter_g, flanking_g, CTCF_g) names(regions) <- c("Promoter", "Promoter flanking", "CTCF") @ Don't forget to use the same chromosome naming convention consistently: <>= seqlevelsStyle(regions) <- "UCSC" @ \subsection{Test for significant depletion or enrichment in genomic regions} It is necessary to include a list with Granges of regions that were surveyed in your analysis for each sample, that is: positions in the genome at which you have enough high quality reads to call a mutation. This can be determined using e.g. CallableLoci tool by GATK. If you would not include the surveyed area in your analysis, you might for example see a depletion of mutations in a certain genomic region that is solely a result from a low coverage in that region, and therefore does not represent an actual depletion of mutations. We provided an example surveyed region data file with the package. For simplicity, here we use the same surveyed file for each sample. For a proper analysis, determine the surveyed area per sample and use these in your analysis. Download the example surveyed region data: <>= ## Get the filename with surveyed/callable regions surveyed_file <- list.files(system.file("extdata", package = "MutationalPatterns"), pattern = ".bed", full.names = TRUE) ## Import the file using rtracklayer and use the UCSC naming standard library(rtracklayer) surveyed <- import(surveyed_file) seqlevelsStyle(surveyed) <- "UCSC" ## For this example we use the same surveyed file for each sample. surveyed_list <- rep(list(surveyed), 9) @ Test for enrichment or depletion of mutations in your defined genomic regions using a binomial test. For this test, the chance of observing a mutation is calculated as the total number of mutations, divided by the total number of surveyed bases. <>= ## Calculate the number of observed and expected number of mutations in ## each genomic regions for each sample. distr <- genomic_distribution(vcfs, surveyed_list, regions) @ <>= ## Perform the enrichment/depletion test by tissue type. distr_test <- enrichment_depletion_test(distr, by = tissue) head(distr_test) ## Or without specifying the 'by' parameter. distr_test2 <- enrichment_depletion_test(distr) head(distr_test2) @ \begin{figure}[H] \begin{center} <>= plot_enrichment_depletion(distr_test) @ \end{center} \end{figure} \bibliography{references} \section{Session Information} <>= toLatex(sessionInfo()) @ \end{document}