%\VignetteIndexEntry{Introduction to MutationalPatterns} \documentclass{article} \usepackage{float} \usepackage[natbibapa]{apacite} \bibliographystyle{apacite} <>= BiocStyle::latex(use.unsrturl=FALSE) @ \title{Introduction to \Biocpkg{MutationalPatterns}} \author{Francis Blokzijl} \author{Roel Janssen} \author{Ruben van Boxtel} \author{Edwin Cuppen} \affil{University Medical Center Utrecht, Utrecht, The Netherlands} \date{\today} \begin{document} \maketitle \tableofcontents \newpage{} <>= options(width=96) library(ggplot2) @ \section{Introduction} Mutational processes leave characteristic footprints in genomic DNA. This package provides a comprehensive set of flexible functions that allows researchers to easily evaluate and visualize a multitude of mutational patterns in base substitution catalogues of e.g. tumour samples or DNA-repair deficient cells. The package covers a wide range of patterns including: mutational signatures, transcriptional and replicative strand bias, genomic distribution and association with genomic features, which are collectively meaningful for studying the activity of mutational processes. The package provides functionalities for both extracting mutational signatures \emph{de novo} and determining the contribution of previously identified mutational signatures on a single sample level. MutationalPatterns integrates with common R genomic analysis workflows and allows easy association with (publicly available) annotation data. Background on the biological relevance of the different mutational patterns, a practical illustration of the package functionalities, comparison with similar tools and software packages and an elaborate discussion, are described in the MutationalPatterns article, of which a preprint is available at bioRxiv: \url{https://doi.org/10.1101/071761} \newpage{} \section{Data} To perform the mutational pattern analyses, you need to load one or multiple VCF files with single-nucleotide variant calls and the corresponding reference genome. \subsection{List reference genome} List available genomes using \Biocpkg{BSgenome}: <>= library(BSgenome) head(available.genomes()) @ Download and load your reference genome of interest: <>= ref_genome <- "BSgenome.Hsapiens.UCSC.hg19" library(ref_genome, character.only = TRUE) @ \subsection{Load example data} We provided an example data set with this package, which consists of a subset of somatic mutation catalogues of 9 normal human adult stem cells from 3 different tissues \citep{Blokzijl2016}. Load the MutationalPatterns package: <>= library(MutationalPatterns) @ Locate the VCF files of the example data: <>= vcf_files <- list.files(system.file("extdata", package="MutationalPatterns"), pattern = ".vcf", full.names = TRUE) @ Define corresponding sample names for the VCF files: <>= sample_names <- c( "colon1", "colon2", "colon3", "intestine1", "intestine2", "intestine3", "liver1", "liver2", "liver3") @ Load the VCF files into a \texttt{GRangesList}: <>= vcfs <- read_vcfs_as_granges(vcf_files, sample_names, ref_genome) summary(vcfs) @ Define relevant metadata on the samples, such as tissue type: <>= tissue <- c(rep("colon", 3), rep("intestine", 3), rep("liver", 3)) @ \section{Mutation characteristics} \subsection{Base substitution types} We can retrieve base substitutions from the VCF GRanges object as "REF>ALT" using \Rfunction{mutations\_from\_vcf}: <>= muts = mutations_from_vcf(vcfs[[1]]) head(muts, 12) @ We can retrieve the base substitutions from the VCF GRanges object and convert them to the 6 types of base substitution types that are distinguished by convention: C>A, C>G, C>T, T>A, T>C, T>G. For example, when the reference allele is G and the alternative allele is T (G>T), \Rfunction{mut\_type} returns the G:C>T:A mutation as a C>A mutation: <>= types = mut_type(vcfs[[1]]) head(types, 12) @ To retrieve the sequence context (one base upstream and one base downstream) of the base substitutions in the VCF object from the reference genome, you can use the \Rfunction{mut\_context} function: <>= context = mut_context(vcfs[[1]], ref_genome) head(context, 12) @ With \Rfunction{type\_context}, you can retrieve the types and contexts for all positions in the VCF GRanges object. For the base substitutions that are converted to the conventional base substitution types, the reverse complement of the sequence context is returned. <>= type_context = type_context(vcfs[[1]], ref_genome) lapply(type_context, head, 12) @ With \Rfunction{mut\_type\_occurrences}, you can count mutation type occurrences for all VCF objects in the \texttt{GRangesList}. For C>T mutations, a distinction is made between C>T at CpG sites and other sites, as deamination of methylated cytosine at CpG sites is a common mutational process. For this reason, the reference genome is needed for this functionality. <>= type_occurrences <- mut_type_occurrences(vcfs, ref_genome) type_occurrences @ \subsection{Mutation spectrum} A mutation spectrum shows the relative contribution of each mutation type in the base substitution catalogs. The \Rfunction{plot\_spectrum} function plots the mean relative contribution of each of the 6 base substitution types over all samples. Error bars indicate standard deviation over all samples. The total number of mutations is indicated. <>= p1 <- plot_spectrum(type_occurrences) @ Plot the mutation spectrum with distinction between C>T at CpG sites and other sites: <>= p2 <- plot_spectrum(type_occurrences, CT = TRUE) @ Plot spectrum without legend: <>= p3 <- plot_spectrum(type_occurrences, CT = TRUE, legend = FALSE) @ The gridExtra package will be used throughout this vignette to combine multiple plots: <>= library("gridExtra") grid.arrange(p1, p2, p3, ncol=3, widths=c(3,3,1.75)) @ <>= library("gridExtra") ggsave("combine_plot_spectrum.pdf", grid.arrange(p1, p2, p3, ncol=3, widths=c(3,3,1.75)), width=10, height=3) @ \begin{figure}[H] \begin{center} \includegraphics[width=1.0\textwidth]{combine_plot_spectrum} \end{center} \end{figure} You can facet the per sample group, e.g. plot the spectrum for each tissue separately: <>= p4 <- plot_spectrum(type_occurrences, by = tissue, CT = TRUE, legend = TRUE) @ Define your own 7 colors for spectrum plotting: <>= palette <- c("pink", "orange", "blue", "lightblue", "green", "red", "purple") p5 <- plot_spectrum(type_occurrences, CT=TRUE, legend=TRUE, colors=palette) @ <>= ggsave("combine_plot_spectrum_2.pdf", grid.arrange(p4, p5, ncol=2, widths=c(4,2.3)), width=10, height=3) @ <>= grid.arrange(p4, p5, ncol=2, widths=c(4,2.3)) @ \begin{figure}[H] \begin{center} \includegraphics[width=1.0\textwidth]{combine_plot_spectrum_2} \end{center} \end{figure} \subsection{96 mutational profile} Make a 96 trinucleodide mutation count matrix: <>= mut_mat <- mut_matrix(vcf_list = vcfs, ref_genome = ref_genome) head(mut_mat) @ Plot the 96 profile of two samples: <>= plot_96_profile(mut_mat[,c(1,7)]) @ \begin{figure}[H] \begin{center} <>= plot_96_profile(mut_mat[,c(1,7)]) @ \end{center} \end{figure} Plot 96 profile of two samples in a more condensed plotting format: <>= plot_96_profile(mut_mat[,c(1,7)], condensed = TRUE) @ \begin{figure}[H] \begin{center} <>= plot_96_profile(mut_mat[,c(1,7)], condensed = TRUE) @ \end{center} \end{figure} \newpage{} \section{Mutational signatures} \subsection{\textit{De novo} mutational signature extraction using NMF} Mutational signatures are thought to represent mutational processes, and are characterized by a specific contribution of 96 base substitution types with a certain sequence context. Mutational signatures can be extracted from your mutation count matrix, with non-negative matrix factorization (NMF). A critical parameter in NMF is the factorization rank, which is the number of mutational signatures. You can determine the optimal factorization rank using the NMF package \citep{Gaujoux2010}. As described in their paper: ``...a common way of deciding on the rank is to try different values, compute some quality measure of the results, and choose the best value according to this quality criteria. The most common approach is to choose the smallest rank for which cophenetic correlation coefficient starts decreasing. Another approach is to choose the rank for which the plot of the residual sum of squares (RSS) between the input matrix and its estimate shows an inflection point.'' First add a small psuedocount to your mutation count matrix: <>= mut_mat <- mut_mat + 0.0001 @ Use the NMF package to generate an estimate rank plot: <>= library("NMF") estimate <- nmf(mut_mat, rank=2:5, method="brunet", nrun=10, seed=123456) @ And plot it: <>= plot(estimate) @ \begin{figure}[H] \begin{center} <>= plot(estimate) @ \end{center} \end{figure} Extract 2 mutational signatures from the mutation count matrix with \Rfunction{extract\_signatures} (For larger datasets it is wise to perform more iterations by changing the nrun parameter to achieve stability and avoid local minima): <>= nmf_res <- extract_signatures(mut_mat, rank = 2, nrun = 10) @ Assign signature names: <>= colnames(nmf_res$signatures) <- c("Signature A", "Signature B") rownames(nmf_res$contribution) <- c("Signature A", "Signature B") @ Plot the 96-profile of the signatures: <>= plot_96_profile(nmf_res$signatures, condensed = TRUE) @ \begin{figure}[H] \begin{center} <>= plot_96_profile(nmf_res$signatures, condensed = TRUE) @ \end{center} \end{figure} Visualize the contribution of the signatures in a barplot: <>= pc1 <- plot_contribution(nmf_res$contribution, nmf_res$signature, mode = "relative") @ Visualize the contribution of the signatures in absolute number of mutations: <>= pc2 <- plot_contribution(nmf_res$contribution, nmf_res$signature, mode = "absolute") @ Combine the two plots: <>= grid.arrange(pc1, pc2) @ \begin{figure}[H] \begin{center} <>= grid.arrange(pc1, pc2) @ \end{center} \end{figure} Flip X and Y coordinates: <>= plot_contribution(nmf_res$contribution, nmf_res$signature, mode = "absolute", coord_flip = TRUE) @ \begin{figure}[H] \begin{center} <>= plot_contribution(nmf_res$contribution, nmf_res$signature, mode = "absolute", coord_flip = TRUE) @ \end{center} \end{figure} The relative contribution of each signature for each sample can also be plotted as a heatmap with \Rfunction{plot\_contribution\_heatmap}, which might be easier to interpret and compare than stacked barplots. The samples can be hierarchically clustered based on their euclidean distance. The signatures can be plotted in a user-specified order. Plot signature contribution as a heatmap with sample clustering dendrogram and a specified signature order: <>= pch1 <- plot_contribution_heatmap(nmf_res$contribution, sig_order = c("Signature B", "Signature A")) @ Plot signature contribution as a heatmap without sample clustering: <>= pch2 <- plot_contribution_heatmap(nmf_res$contribution, cluster_samples=FALSE) @ Combine the plots into one figure: <>= grid.arrange(pch1, pch2, ncol = 2, widths = c(2,1.6)) @ \begin{figure}[H] \begin{center} <>= grid.arrange(pch1, pch2, ncol = 2, widths = c(2,1.6)) @ \end{center} \end{figure} Compare the reconstructed mutational profile with the original mutational profile: <>= plot_compare_profiles(mut_mat[,1], nmf_res$reconstructed[,1], profile_names = c("Original", "Reconstructed"), condensed = TRUE) @ \begin{figure}[H] \begin{center} <>= plot_compare_profiles(mut_mat[,1], nmf_res$reconstructed[,1], profile_names = c("Original", "Reconstructed"), condensed = TRUE) @ \end{center} \end{figure} \subsection{Find optimal contribution of known signatures} \subsubsection{COSMIC mutational signatures} Download mutational signatures from the COSMIC website: <>= sp_url <- paste("http://cancer.sanger.ac.uk/cancergenome/assets/", "signatures_probabilities.txt", sep = "") cancer_signatures = read.table(sp_url, sep = "\t", header = TRUE) # Match the order of the mutation types to MutationalPatterns standard new_order = match(row.names(mut_mat), cancer_signatures$Somatic.Mutation.Type) # Reorder cancer signatures dataframe cancer_signatures = cancer_signatures[as.vector(new_order),] # Add trinucletiode changes names as row.names row.names(cancer_signatures) = cancer_signatures$Somatic.Mutation.Type # Keep only 96 contributions of the signatures in matrix cancer_signatures = as.matrix(cancer_signatures[,4:33]) @ Plot mutational profile of the first two COSMIC signatures: <>= plot_96_profile(cancer_signatures[,1:2], condensed = TRUE, ymax = 0.3) @ \begin{figure}[H] \begin{center} <>= plot_96_profile(cancer_signatures[,1:2], condensed = TRUE, ymax = 0.3) @ \end{center} \end{figure} Hierarchically cluster the COSMIC signatures based on their similarity with average linkage: <>= hclust_cosmic = cluster_signatures(cancer_signatures, method = "average") # store signatures in new order cosmic_order = colnames(cancer_signatures)[hclust_cosmic$order] plot(hclust_cosmic) @ \subsubsection{Similarity between mutational profiles and COSMIC signatures} The similarity between each mutational profile and each COSMIC signature, can be calculated with \Rfunction{cos\_sim\_matrix}, and visualized with \Rfunction{plot\_cosine\_heatmap}. The cosine similarity reflects how well each mutational profile can be explained by each signature individually. The advantage of this heatmap representation is that it shows in a glance the similarity in mutational profiles between samples, while at the same time providing information on which signatures are most prominent. The samples can be hierarchically clustered in \Rfunction{plot\_cosine\_heatmap}. The cosine similarity between two mutational profiles/signatures can be calculated with \Rfunction{cos\_sim}: <>= cos_sim(mut_mat[,1], cancer_signatures[,1]) @ Calculate pairwise cosine similarity between mutational profiles and COSMIC signatures: <>= cos_sim_samples_signatures = cos_sim_matrix(mut_mat, cancer_signatures) # Plot heatmap with specified signature order plot_cosine_heatmap(cos_sim_samples_signatures, col_order = cosmic_order, cluster_rows = TRUE) @ \subsubsection{Find optimal contribution of COSMIC signatures to reconstruct 96 mutational profiles} In addition to \textit{de novo} extraction of signatures, the contribution of any set of signatures to the mutational profile of a sample can be quantified. This unique feature is specifically useful for mutational signature analyses of small cohorts or individual samples, but also to relate own findings to known signatures and published findings. The \Rfunction{fit\_to\_signatures} function finds the optimal linear combination of mutational signatures that most closely reconstructs the mutation matrix by solving a non-negative least-squares constraints problem. Fit mutation matrix to the COSMIC mutational signatures: <>= fit_res <- fit_to_signatures(mut_mat, cancer_signatures) @ Plot the optimal contribution of the COSMIC signatures in each sample as a stacked barplot. <>= # Select signatures with some contribution select <- which(rowSums(fit_res$contribution) > 10) # Plot contribution barplot plot_contribution(fit_res$contribution[select,], cancer_signatures[,select], coord_flip = FALSE, mode = "absolute") @ <>= # Select signatures with some contribution select <- which(rowSums(fit_res$contribution) > 0) # Plot contribution ggsave("plot_contribution_3.pdf", plot_contribution(fit_res$contribution[select,], cancer_signatures[,select], coord_flip = FALSE, mode = "absolute"), width=9, height=5) @ \begin{figure}[H] \begin{center} \includegraphics[width=1.0\textwidth]{plot_contribution_3} \end{center} \end{figure} Plot relative contribution of the cancer signatures in each sample as a heatmap with sample clustering: <>= plot_contribution_heatmap(fit_res$contribution, cluster_samples = TRUE, method = "complete") @ \begin{figure}[H] \begin{center} <>= plot_contribution_heatmap(fit_res$contribution, cluster_samples = TRUE, method = "complete") @ \end{center} \end{figure} Compare the reconstructed mutational profile of sample 1 with its original mutational profile: <>= plot_compare_profiles(mut_mat[,1], fit_res$reconstructed[,1], profile_names = c("Original", "Reconstructed"), condensed = TRUE) @ \begin{figure}[H] \begin{center} <>= plot_compare_profiles(mut_mat[,1], fit_res$reconstructed[,1], profile_names = c("Original", "Reconstructed"), condensed = TRUE) @ \end{center} \end{figure} Calculate the cosine similarity between all original and reconstructed mutational profiles with \Rfunction{cos\_sim\_matrix}: <>= # calculate all pairwise cosine similarities cos_sim_ori_rec <- cos_sim_matrix(mut_mat, fit_res$reconstructed) # extract cosine similarities per sample between original and reconstructed cos_sim_ori_rec <- as.data.frame(diag(cos_sim_ori_rec)) @ We can use ggplot to make a barplot of the cosine similarities between the original and reconstructed mutational profile of each sample. This clearly shows how well each mutational profile can be reconstructed with the COSMIC mutational signatures. Two identical profiles have a cosine similarity of 1. The lower the cosine similarity between original and reconstructed, the less well the original mutational profile can be reconstructed with the COSMIC signatures. You could use, for example, cosine similarity of 0.95 as a cutoff. <>= # Adjust data frame for plotting with gpplot colnames(cos_sim_ori_rec) = "cos_sim" cos_sim_ori_rec$sample = row.names(cos_sim_ori_rec) @ <>= # Load ggplot2 library(ggplot2) # Make barplot ggplot(cos_sim_ori_rec, aes(y=cos_sim, x=sample)) + geom_bar(stat="identity", fill = "skyblue4") + coord_cartesian(ylim=c(0.8, 1)) + # coord_flip(ylim=c(0.8,1)) + ylab("Cosine similarity\n original VS reconstructed") + xlab("") + # Reverse order of the samples such that first is up # xlim(rev(levels(factor(cos_sim_ori_rec$sample)))) + theme_bw() + theme(panel.grid.minor.y=element_blank(), panel.grid.major.y=element_blank()) + # Add cut.off line geom_hline(aes(yintercept=.95)) @ \begin{figure}[H] \begin{center} <>= # Load ggplot2 library(ggplot2) # Make barplot ggplot(cos_sim_ori_rec, aes(y=cos_sim, x=sample)) + geom_bar(stat="identity", fill = "skyblue4") + coord_cartesian(ylim=c(0.8, 1)) + # coord_flip(ylim=c(0.8,1)) + ylab("Cosine similarity\n original VS reconstructed") + xlab("") + # Reverse order of the samples such that first is up # xlim(rev(levels(factor(cos_sim_ori_rec$sample)))) + theme_bw() + theme(panel.grid.minor.y=element_blank(), panel.grid.major.y=element_blank()) + # Add cut.off line geom_hline(aes(yintercept=.95)) @ \end{center} \end{figure} \section{Strand bias analyses} \subsection{Transcriptional strand bias analysis} For the mutations within genes it can be determined whether the mutation is on the transcribed or non-transcribed strand, which can be used to evaluate the involvement of transcription-coupled repair. To this end, it is determined whether the "C" or "T" base (since by convention we regard base substitutions as C>X or T>X) are on the same strand as the gene definition. Base substitions on the same strand as the gene definitions are considered "untranscribed", and on the opposite strand of gene bodies as "transcribed", since the gene definitions report the coding or sense strand, which is untranscribed. No strand information is reported for base substitution that overlap with more than one gene body on different strands. Get gene definitions for your reference genome: <>= # For example get known genes table from UCSC for hg19 using # biocLite("TxDb.Hsapiens.UCSC.hg19.knownGene") library("TxDb.Hsapiens.UCSC.hg19.knownGene") genes_hg19 <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene) genes_hg19 @ Get transcriptional strand information for all positions in the first VCF object with \Rfunction{mut\_strand}. This function returns ``-'' for positions outside gene bodies, and positions that overlap with more than one gene on different strands. <>= strand = mut_strand(vcfs[[1]], genes_hg19) head(strand, 10) @ Make mutation count matrix with transcriptional strand information (96 trinucleotides * 2 strands = 192 features). NB: only those mutations that are located within gene bodies are counted. <>= mut_mat_s <- mut_matrix_stranded(vcfs, ref_genome, genes_hg19) mut_mat_s[1:5,1:5] @ Count the number of mutations on each strand, per tissue, per mutation type: <>= strand_counts <- strand_occurrences(mut_mat_s, by=tissue) head(strand_counts) @ Perform Poisson test for strand asymmetry significance testing: <>= strand_bias <- strand_bias_test(strand_counts) strand_bias @ Plot the mutation spectrum with strand distinction: <>= ps1 <- plot_strand(strand_counts, mode = "relative") @ Plot the effect size (log2(untranscribed/transcribed) of the strand bias. Asteriks indicate significant strand bias. <>= ps2 <- plot_strand_bias(strand_bias) @ Combine the plots into one figure: <>= grid.arrange(ps1, ps2) @ \begin{figure}[H] \begin{center} <>= grid.arrange(ps1, ps2) @ \end{center} \end{figure} \subsection{Replicative strand bias analysis} The involvement of replication-associated mechanisms can be evaluated by testing for a mutational bias between the leading and lagging strand. The replication strand is dependent on the locations of replication origins from which DNA replication is fired. However, replication timing is dynamic and cell-type specific, which makes replication strand determination less straightforward than transcriptional strand bias analysis. Replication timing profiles can be generated with Repli-Seq experiments. Once the replication direction is defined, a strand asymmetry analysis can be performed similarly as the transcription strand bias analysis. Read example bed file provided with the package with replication direction annotation: <>= repli_file = system.file("extdata/ReplicationDirectionRegions.bed", package = "MutationalPatterns") repli_strand = read.table(repli_file, header = TRUE) # Store in GRanges object repli_strand_granges = GRanges(seqnames = repli_strand$Chr, ranges = IRanges(start = repli_strand$Start + 1, end = repli_strand$Stop), strand_info = repli_strand$Class) # UCSC seqlevelsstyle seqlevelsStyle(repli_strand_granges) = "UCSC" repli_strand_granges @ The GRanges object should have a ``strand\_info'' metadata column, which contains only two different annotations, e.g. ``left'' and ``right'', or ``leading'' and ``lagging''. The genomic ranges cannot overlap, to allow only one annotation per location. Get replicative strand information for all positions in the first VCF object. No strand information ``-'' is returned for base substitutions in unannotated genomic regions. <>= strand_rep <- mut_strand(vcfs[[1]], repli_strand_granges, mode = "replication") head(strand_rep, 10) @ Make mutation count matrix with transcriptional strand information (96 trinucleotides * 2 strands = 192 features). <>= mut_mat_s_rep <- mut_matrix_stranded(vcfs, ref_genome, repli_strand_granges, mode = "replication") mut_mat_s_rep[1:5, 1:5] @ The levels of the "strand\_info" metadata in the GRanges object determines the order in which the strands are reported in the mutation matrix that is returned by \Rfunction{mut\_matrix\_stranded}, so if you want to count right before left, you can specify this, before you run \Rfunction{mut\_matrix\_stranded}: <>= repli_strand_granges$strand_info <- factor(repli_strand_granges$strand_info, levels = c("right", "left")) mut_mat_s_rep2 <- mut_matrix_stranded(vcfs, ref_genome, repli_strand_granges, mode = "replication") mut_mat_s_rep2[1:5, 1:5] @ Count the number of mutations on each strand, per tissue, per mutation type: <>= strand_counts_rep <- strand_occurrences(mut_mat_s_rep, by=tissue) head(strand_counts) @ Perform Poisson test for strand asymmetry significance testing: <>= strand_bias_rep <- strand_bias_test(strand_counts_rep) strand_bias_rep @ Plot the mutation spectrum with strand distinction: <>= ps1 <- plot_strand(strand_counts_rep, mode = "relative") @ Plot the effect size (log2(untranscribed/transcribed) of the strand bias. Asteriks indicate significant strand bias. <>= ps2 <- plot_strand_bias(strand_bias_rep) @ Combine the plots into one figure: <>= grid.arrange(ps1, ps2) @ \begin{figure}[H] \begin{center} <>= grid.arrange(ps1, ps2) @ \end{center} \end{figure} \subsection{Extract signatures with strand bias} Extract 2 signatures from mutation count matrix with strand features: <>= nmf_res_strand <- extract_signatures(mut_mat_s, rank = 2) # Provide signature names colnames(nmf_res_strand$signatures) <- c("Signature A", "Signature B") @ Plot signatures with 192 features: <>= a <- plot_192_profile(nmf_res_strand$signatures, condensed = TRUE) @ Plot strand bias per mutation type for each signature with significance test: <>= b <- plot_signature_strand_bias(nmf_res_strand$signatures) @ Combine the plots into one figure: <>= grid.arrange(a, b, ncol = 2, widths = c(5, 1.8)) @ <>= ggsave("plot_192_profile.pdf", grid.arrange(a, b, ncol = 2, widths = c(5, 1.8)), width = 10, height = 5) @ \begin{figure}[H] \begin{center} \includegraphics[width=1.0\textwidth]{plot_192_profile} \end{center} \end{figure} \section{Genomic distribution} \subsection{Rainfall plot} A rainfall plot visualizes mutation types and intermutation distance. Rainfall plots can be used to visualize the distribution of mutations along the genome or a subset of chromosomes. The y-axis corresponds to the distance of a mutation with the previous mutation and is log10 transformed. Drop-downs from the plots indicate clusters or ``hotspots'' of mutations. Make rainfall plot of sample 1 over all autosomal chromosomes <>= # Define autosomal chromosomes chromosomes <- seqnames(get(ref_genome))[1:22] # Make a rainfall plot plot_rainfall(vcfs[[1]], title = names(vcfs[1]), chromosomes = chromosomes, cex = 1.5, ylim = 1e+09) @ <>= # Define autosomal chromosomes chromosomes <- seqnames(get(ref_genome))[1:22] # Make a rainfall plot ggsave("plot_rainfall.pdf", plot_rainfall(vcfs[[1]], title = names(vcfs[1]), chromosomes = chromosomes, cex = 1.5, ylim = 1e+09), width=9, height=3) @ \begin{figure}[H] \begin{center} \includegraphics[width=1.0\textwidth]{plot_rainfall} \end{center} \end{figure} % Make rainfall plot of the first sample over chromosome 1: % % \begin{figure}[H] % \begin{center} % <>= % chromosomes <- seqnames(get(ref_genome))[1] % plot_rainfall(vcfs[[1]], title = names(vcfs[1]), % chromosomes = chromosomes[1], cex = 2, ylim = 1e+09) % @ % \end{center} % \end{figure} \subsection{Enrichment or depletion of mutations in genomic regions} Test for enrichment or depletion of mutations in certain genomic regions, such as promoters, CTCF binding sites and transcription factor binding sites. To use your own genomic region definitions (based on e.g. ChipSeq experiments) specify your genomic regions in a named list of GRanges objects. Alternatively, use publicly available genomic annotation data, like in the example below. \subsubsection{Example: regulation annotation data from Ensembl using \Biocpkg{biomaRt}} The following example displays how to download promoter, CTCF binding sites and transcription factor binding sites regions for genome build hg19 from Ensembl using \Biocpkg{biomaRt}. For other datasets, see the \Biocpkg{biomaRt} documentation \citep{Durinck2005}. To install \Biocpkg{biomaRt}, uncomment the following lines: <>= source("https://bioconductor.org/biocLite.R") biocLite("biomaRt") @ Load the \Biocpkg{biomaRt} package. <>= library(biomaRt) @ Download genomic regions. NB: Here we take some shortcuts by loading the results from our example data. The corresponding code for downloading this data can be found above the command we run: <>= # regulatory <- useEnsembl(biomart="regulation", # dataset="hsapiens_regulatory_feature", # GRCh = 37) ## Download the regulatory CTCF binding sites and convert them to ## a GRanges object. # CTCF <- getBM(attributes = c('chromosome_name', # 'chromosome_start', # 'chromosome_end', # 'feature_type_name', # 'cell_type_name'), # filters = "regulatory_feature_type_name", # values = "CTCF Binding Site", # mart = regulatory) # # CTCF_g <- reduce(GRanges(CTCF$chromosome_name, # IRanges(CTCF$chromosome_start, # CTCF$chromosome_end))) CTCF_g <- readRDS(system.file("states/CTCF_g_data.rds", package="MutationalPatterns")) ## Download the promoter regions and convert them to a GRanges object. # promoter = getBM(attributes = c('chromosome_name', 'chromosome_start', # 'chromosome_end', 'feature_type_name'), # filters = "regulatory_feature_type_name", # values = "Promoter", # mart = regulatory) # promoter_g = reduce(GRanges(promoter$chromosome_name, # IRanges(promoter$chromosome_start, # promoter$chromosome_end))) promoter_g <- readRDS(system.file("states/promoter_g_data.rds", package="MutationalPatterns")) ## Download the promoter flanking regions and convert them to a GRanges object. # flanking = getBM(attributes = c('chromosome_name', # 'chromosome_start', # 'chromosome_end', # 'feature_type_name'), # filters = "regulatory_feature_type_name", # values = "Promoter Flanking Region", # mart = regulatory) # flanking_g = reduce(GRanges( # flanking$chromosome_name, # IRanges(flanking$chromosome_start, # flanking$chromosome_end))) flanking_g <- readRDS(system.file("states/promoter_flanking_g_data.rds", package="MutationalPatterns")) @ Combine all genomic regions (GRanges objects) in a named list: <>= regions <- GRangesList(promoter_g, flanking_g, CTCF_g) names(regions) <- c("Promoter", "Promoter flanking", "CTCF") @ Use the same chromosome naming convention consistently: <>= seqlevelsStyle(regions) <- "UCSC" @ \subsection{Test for significant depletion or enrichment in genomic regions} It is necessary to include a list with Granges of regions that were surveyed in your analysis for each sample, that is: positions in the genome at which you have enough high quality reads to call a mutation. This can be determined using e.g. CallableLoci tool by GATK. If you would not include the surveyed area in your analysis, you might for example see a depletion of mutations in a certain genomic region that is solely a result from a low coverage in that region, and therefore does not represent an actual depletion of mutations. We provided an example surveyed region data file with the package. For simplicity, here we use the same surveyed file for each sample. For a proper analysis, determine the surveyed area per sample and use these in your analysis. Download the example surveyed region data: <>= ## Get the filename with surveyed/callable regions surveyed_file <- system.file("extdata/callableloci-sample.bed", package = "MutationalPatterns") ## Import the file using rtracklayer and use the UCSC naming standard library(rtracklayer) surveyed <- import(surveyed_file) seqlevelsStyle(surveyed) <- "UCSC" ## For this example we use the same surveyed file for each sample. surveyed_list <- rep(list(surveyed), 9) @ Test for enrichment or depletion of mutations in your defined genomic regions using a binomial test. For this test, the chance of observing a mutation is calculated as the total number of mutations, divided by the total number of surveyed bases. <>= ## Calculate the number of observed and expected number of mutations in ## each genomic regions for each sample. distr <- genomic_distribution(vcfs, surveyed_list, regions) @ <>= ## Perform the enrichment/depletion test by tissue type. distr_test <- enrichment_depletion_test(distr, by = tissue) head(distr_test) @ <>= plot_enrichment_depletion(distr_test) @ \begin{figure}[H] \begin{center} <>= plot_enrichment_depletion(distr_test) @ \end{center} \end{figure} \bibliography{references} \section{Session Information} <>= toLatex(sessionInfo()) @ \end{document}