<!DOCTYPE html> <html> <head> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> <title>signeR</title> <style type='text/css'> * { font-family: sans-serif; } body { margin: auto; width: 80%; } h1 { text-align: center; } h2 { margin-top: 2em; border-bottom: solid 1px gray; } a:link { color: #0087BD; } a:visited { color: #009F6B; } img { display: block; margin: auto; } code, .rcode, .rcode * { font-family: monospace; } #table_of_contents { position: fixed; right: 0; top: 0; padding: 0.5em; background-color: white; border-bottom: solid 1px black; border-left: solid 1px black; } #table_of_contents li { list-style-type: none; } #table_of_contents a { text-decoration: none; } #table_of_contents #full { display: none; } #table_of_contents:hover #full { display: block; } p { text-align: justify; } .message { background-color: rgb(217, 237, 247); padding: 0.4em; border: 1px solid rgb(188, 232, 241); border-radius: 5px; } .legend { width: 60%; margin: auto; } </style> </head> <body> <!-- %\VignetteEngine{knitr::knitr} %\VignetteIndexEntry{signeR} %\VignettePackage{signeR} --> <!--begin.rcode results='hide', echo=FALSE, message=FALSE, warning=FALSE set.seed(42) library(knitr) opts_knit$set( self.contained = TRUE, upload.fun = image_uri ) opts_chunk$set( dev = 'png', dpi = 300, out.width = "700px", out.height = "700px" ) end.rcode--> <div id="table_of_contents"> <h4>Table of contents</h4> <div id="full"> <ul> <li><a href="#toc1">1: Introduction</a></li> <li><a href="#toc2">2: Installation</a></li> <li><a href="#toc3">3: Preparing the input</a> <ul> <li><a href="#toc3.1">3.1: Input from VCF</a></li> <li><a href="#toc3.2">3.2: Input from MAF</a></li> <li><a href="#toc3.3">3.3: Input from tab-delimited file</a></li> <li><a href="#toc3.4">3.4: Known signatures</a></li> </ul> </li> <li><a href="#toc4">4: Estimating the number of mutational processes and their signatures</a></li> <li><a href="#toc5">5: Estimating exposures to known signatures</a> <li><a href="#toc6">6: Results and plots</a> <ul> <li><a href="#toc6.1">6.1: Plot the MCMC paths for the NMF parameters</a></li> <li><a href="#toc6.2">6.2: Plot the signatures</a> <ul> <li><a href="#toc6.2.1"> Signature barplot</a></li> <li><a href="#toc6.2.2"> Signature heatmap</a></li> </ul> </li> <li><a href="#toc6.3">6.3: Plot the exposures</a> <ul> <li><a href="#toc6.3.1"> Exposure boxplot</a></li> <li><a href="#toc6.3.2"> Exposure barplot</a></li> <li><a href="#toc6.3.3"> Exposure heatmap</a></li> </ul> </li> </ul> </li> <li><a href="#toc7">7: Supervised approaches to exposure analysis</a> <ul> <li><a href="#toc7.1">7.1: Differential Exposure Analysis</a></li> <li><a href="#toc7.2">7.2: Sample Classification</a></li> <li><a href="#toc7.3">7.3: Correlation Analysis</a></li> <li><a href="#toc7.4">7.4: Linear Regression</a></li> <li><a href="#toc7.5">7.5: Survival Analysis</a></li> <li><a href="#toc7.6">7.6: Cox Regression</a></li> </ul> </li> <li><a href="#toc8">8: Unsupervised approaches to exposure analysis</a> <ul> <li><a href="#toc8.1">8.1: Hierarquical Clustering</a></li> <li><a href="#toc8.2">8.2: Fuzzy Clustering</a></li> </ul> </li> <li><a href="#toc9">9: Frequently Asked Questions</a> <ul> <li><a href="#toc9.1">9.1: Citing signeR</a></li> <li><a href="#toc9.2">9.2: Compilation errors on OS X</a></li> <li><a href="#toc9.3">9.3: Missing library headers</a></li> </ul> </li> <li><a href="#toc10">10: References</a></li> </div> </div> <h1>signeR</h1> <h4 style='text-align:center'><em>Rodrigo Drummond, Renan Valieris, Rafael Rosales and Israel Tojal da Silva</em></h4> <h2 id="toc1">1: Introduction</h2> <p>Motivation: Cancer is an evolutionary process driven by continuous acquisition of genetic variations in individual cells. The diversity and the complexity of somatic mutational processes is a conspicuous feature orchestrated by DNA damage agents and repair processes, including exogenous or endogenous mutagen exposures, defects in DNA mismatch repair and enzymatic modification of DNA. The identification of the underlying mutational processes are central to the understanding of cancer origin and evolution.</p> <p>The <b>signeR</b> package focuses on the estimation and further analysis of mutational signatures. The functionalities of this package can be divided into three categories. First, it provides tools to process VCF files and generate matrices of SNV mutation counts and mutational opportunities, both defined according to a 3bp context (mutation site and its neighboring 3' and 5' bases). Second, these count matrices are considered as input for the estimation of the underlying mutational signatures and the number of active mutational processes. Third, the package provides tools to correlate the activities of those signatures with other relevant information such as clinical data, in order to draw conclusions about the analyzed genome samples, which can be useful for clinical applications. These include the Differential Exposure Score and the a posteriori sample classification.</p> <p>Although signeR is intended for the estimation of mutational signatures, it actually provides a full Bayesian treatment to the non-negative matrix factorisation (NMF) model. Further details about the method can be found in Rosales & Drummond <i>et al.</i>, 2016 (see <a href="#toc9.1">section 9.1</a> below).</p> <p>This vignette briefly explains the use of <b>signeR</b> through examples.</p> <h2 id="toc2">2: Installation</h2> <p>Before installing, please make sure you have the latest version of R and <a href="http://bioconductor.org/">Bioconductor</a> installed.</p> <p>To install <b>signeR</b>, start R and enter:</p> <!--begin.rcode eval=FALSE install.packages("BiocManager") BiocManager::install("signeR") end.rcode--> <p>For more information, see <a href="http://bioconductor.org/packages/signeR">this page</a>.</p> <p>Once installed the library can be loaded as</p> <!--begin.rcode results='hide', message=FALSE library(signeR) end.rcode--> <p class='message'>OS X users might experience compilation errors due to missing gfortran libraries. Please read <a href="#toc9.2">this section</a>.</p> <h2 id="toc3">3: Preparing the input</h2> <p><b>signeR</b> takes as input a count matrix of samples x features. Each feature is usually an SNV mutation within a 3bp context (96 features, 6 types of SNV mutations and 4 possibilities for the bases at each side of the SNV change). Optionally, an opportunity matrix can also be provided containing the count frequency of the features in the whole analyzed region for each sample. Although not required, this argument is highly recommended because it allows <b>signeR</b> to normalize the feature frequency over the analyzed region.</p> <p>Input matrices can be read both from a VCF, MAF or a tab-delimited file, as described next.</p> <h3 id="toc3.1">3.1: Input from VCF</h3> <p>The <a target="_blank" href="http://www.1000genomes.org/wiki/Analysis/vcf4.0">VCF</a> file format is the most common format for storing genetic variations, the <b>signeR</b> package includes a utility function for generating a count matrix from the VCF: </p> <!--begin.rcode eval=FALSE library(VariantAnnotation) # BSgenome, equivalent to the one used on the variant call library(BSgenome.Hsapiens.UCSC.hg19) vcfobj <- readVcf("/path/to/a/file.vcf", "hg19") mut <- genCountMatrixFromVcf(BSgenome.Hsapiens.UCSC.hg19, vcfobj) end.rcode--> <p>This function will generate a matrix of mutation counts for each sample in the provided VCF.</p> <p>If you have one VCF per sample you can combine the results into a single matrix like this:</p> <!--begin.rcode eval=FALSE mut = matrix(ncol=96,nrow=0) for(i in vcf_files) { vo = readVcf(i, "hg19") # sample name (should pick up from the vcf automatically if available) # colnames(vo) = i m0 = genCountMatrixFromVcf(mygenome, vo) mut = rbind(mut, m0) } dim(mut) # matrix with all samples end.rcode--> <p>The opportunity matrix can also be generated from the reference genome (hg19 in the following case):</p> <!--begin.rcode eval=FALSE library(rtracklayer) target_regions <- import(con="/path/to/a/target.bed", format="bed") opp <- genOpportunityFromGenome(BSgenome.Hsapiens.UCSC.hg19, target_regions, nsamples=nrow(mut)) end.rcode--> <p>Where <code>target.bed</code> is a <a target="_blank" href="https://genome.ucsc.edu/FAQ/FAQformat.html#format1">bed file</a> containing the genomic regions analyzed by the variant caller.</p> <p>If a BSgenome is not available for your genome, you can use a fasta file:</p> <!--begin.rcode eval=FALSE library(Rsamtools) # make sure /path/to/genome.fasta.fai exists ! # you can use "samtools faidx" command to create it mygenome <- FaFile("/path/to/genome.fasta") mut <- genCountMatrixFromVcf(mygenome, vcfobj) opp <- genOpportunityFromGenome(mygenome, target_regions) end.rcode--> <h3 id="toc3.2">3.2: Input from MAF</h3> The <a target="_blank" href="https://docs.gdc.cancer.gov/Encyclopedia/pages/Mutation_Annotation_Format/">MAF</a> file format is another popular format for storing somatic genetic variants, the signeR package also includes a function for generating a count matrix from the MAF: <!--begin.rcode eval=FALSE mut <- genCountMatrixFromMAF(mygenome, "my_file.maf") end.rcode--> <h3 id="toc3.3">3.3: Input from tab-delimited file</h3> <p>By convention, the input file should be tab-delimited with sample names as row names and features as column names. Features should be referred to in the format "base change:triplet", e.g. "C>A:TCG", as can be seen in the example below. Similarly, the opportunity matrix can be provided in a tab-delimited file with the same structure as the mutation counts file. An example of the required matrix format can be seen <a target="_blank" href="https://github.com/TojalLab/signeR/blob/devel/inst/extdata/21_breast_cancers.mutations.txt"> here</a>.</p> <p>This tutorial uses as input the 21 breast cancer dataset described in <a target="_blank" href="http://dx.doi.org/10.1016/j.cell.2012.04.023"> Nik-Zainal et al 2012</a>. For the sake of convenience, this dataset is included with the package and can be accessed by using the <code>system.file</code> function:</p> <!--begin.rcode mut <- read.table(system.file("extdata","21_breast_cancers.mutations.txt", package="signeR"), header=TRUE, check.names=FALSE) opp <- read.table(system.file("extdata","21_breast_cancers.opportunity.txt", package="signeR")) end.rcode--> <h3 id="toc3.4">3.4: Known signatures</h3> <p>signeR analysis can incorporate any previous knowledge about the signatures present in the dataset. If signatures are known in advance, they can be provided as a matrix, which may be used by signeR in two different ways: a starting value that will be updated according to mutation patterns found on present data or a fixed set of parameters, kept unchanged during the estimation of exposures.</p> <p>The signatures matrix shall contain each signature in one column. An example of the required matrix format can be seen <a target="_blank" href="https://github.com/TojalLab/signeR/blob/devel/inst/extdata/Cosmic_signatures_BRC.txt"> here</a>.</p> <p>Along this tutorial a matrix of signatures found in breast cancer, as described in <a target="_blank" href="https://cancer.sanger.ac.uk/cosmic/signatures_v2"> Cosmic database</a>. For the sake of convenience this matrix is included with the package and can be accessed by the <code>system.file</code> function:</p> <!--begin.rcode Pmatrix <- as.matrix(read.table(system.file("extdata","Cosmic_signatures_BRC.txt", package="signeR"), sep="\t", check.names=FALSE)) end.rcode--> <h2 id="toc4">4: Estimating the number of mutational processes and their signatures</h2> <p><b>signeR</b> takes a count matrix as its only required parameter, but the user can provide an opportunity matrix as well. The algorithm allows the assessment of the number of signatures by three options, as follows.</p> <ol type="i"> <li>signeR detects the number of signatures at run time by considering the best NMF factorization rank between 1 and min(G, K)-1, with G = number of genomes and K = number of features (i.e. 96): <!--begin.rcode eval=FALSE signatures <- signeR(M=mut, Opport=opp) end.rcode--> </li> <li>The user can give an interval of the possible numbers of signatures as the parameter nlim. <b>signeR</b> will calculate the optimal number of signatures within this range, for example: <!--begin.rcode eval=FALSE signatures <- signeR(M=mut, Opport=opp, nlim=c(3,7)) end.rcode--> </li> <li><b>signeR</b> can also be run by passing the number of signatures as the parameter nsig. In this setting, the algorithm is faster. For example, the following command will make <b>signeR</b> consider only the rank N=5 to estimate the signatures and their exposures: <!--begin.rcode results='hide', message=FALSE signatures <- signeR(M=mut, Opport=opp, nsig=3, main_eval=100, EM_eval=50, EMit_lim=20) end.rcode--> </li> <li>Finally, when signatures are known in advance, <b>signeR</b> can use them as a starting point for the estimation of signatures in the present dataset. To this end, signatures must be provided in a matrix, as described in item 3.4 above. For example, the following command will make <b>signeR</b> use six Cosmic signatures found on breast cancer as a starting point: <!--begin.rcode eval=FALSE signatures.Pstart <- signeR(M=mut, Opport=opp, P=Pmatrix, fixedP=FALSE, main_eval=100, EM_eval=50, EMit_lim=20) end.rcode--> </li> </ol> <p>The parameters <code>testing_burn</code> and <code>testing_eval</code> control the number of iterations used to estimate the number of signatures (default value is 1000 for both parameters). There are other arguments that may be passed on to signeR. Please have a look at signeR's manual, issued by typing <code>help(signeR)</code>.</p> <p>Whenever <b>signeR</b> is left to decide which number of signatures is optimal, it will search for the rank Nsig that maximizes the median Bayesian Information Criterion (BIC). After the processing is done, this information can be plotted by the following command:</p> <!--begin.rcode eval=FALSE BICboxplot(signatures) end.rcode--> <!--begin.rcode echo=FALSE, results='asis' cat(sprintf("<img src=\"%s\" />\n",image_uri("Model_selection_BICs.png"))) end.rcode--> <p class='legend'>Boxplot of BIC values, showing that the optimal number of signatures for this dataset is 5.</p> <h2 id="toc5">5: Estimating exposures to known signatures.</h2> <p><b>signeR</b> also offers the possibility to estimate exposures to known signatures as, for example, the ones described on <a target="_blank" href="https://cancer.sanger.ac.uk/cosmic/signatures_v2"> Cosmic database</a>. In this case, signatures should be provided in a matrix, as described in item 3.4 above, and should be kept constant during analysis:</p> <!--begin.rcode Pmatrix <- as.matrix(read.table(system.file("extdata","Cosmic_signatures_BRC.txt", package="signeR"), sep="\t", check.names=FALSE)) end.rcode--> <p>The following command will make <b>signeR</b> estimate the exposures to the Cosmic signatures found on breast cancer:</p> <!--begin.rcode eval=FALSE exposures.known.sigs <- signeR(M=mut, Opport=opp, P=Pmatrix, fixedP=TRUE, main_eval=100, EM_eval=50, EMit_lim=20) end.rcode--> <p>Exposures can then be recovered from the signeR output by the following command (as in any signeR analysis):</p> <!--begin.rcode eval=FALSE exposures <- Median_exp(exposures.known.sigs$SignExposures) end.rcode--> <h2 id="toc6">6: Results and plots</h2> <p><b>signeR</b> offers several plots to visualize estimated signatures and their exposures, as well as the convergence of the MCMC used to estimate them.</p> <h3 id="toc6.1">6.1: Plot the MCMC paths for the NMF parameters (P and E matrices)</h3> <p>The following instruction plots the MCMC sample paths for each entry of the signature matrix P and their exposures, i.e. the E matrix. Only post-burnin paths are available for plotting. Those plots are useful for checking if entries have leveled off, reflecting the sampler convergence.</p> <!--begin.rcode Paths(signatures$SignExposures) end.rcode--> <p class='legend'>Each plot shows the entries and exposures of one signature along sampler iterations.</p> <h3 id="toc6.2">6.2: Plot the signatures</h3> <p>Once the processing is done, the estimated signatures can be displayed in two charts, described below. </p> <h4 id="toc6.2.1"> Signature barplot</h4> <p>Signatures can be visualized in a barplot by issuing the following command:</p> <!--begin.rcode SignPlot(signatures$SignExposures) end.rcode--> <p class='legend'>Signatures barplot with error bars reflecting the sample percentiles 0.05, 0.25, 0.75, and 0.95 for each entry.</p> <h4 id="toc6.2.2"> Signature heatmap</h4> <p>Estimated signatures can also be visualized in a heatmap, generated by the following command:</p> <!--begin.rcode SignHeat(signatures$SignExposures) end.rcode--> <p class='legend'>Heatmap showing the entries of each signature.</p> <h3 id="toc6.3">6.3: Plot the exposures</h3> <p>The relative contribution of each signature to the inspected genomes can be displayed in several ways. <b>signeR</b> currently provides three alternatives. </p> <h4 id="toc6.3.1"> Exposure boxplot</h4> <p>The levels of exposure to each signature in all genome samples can also be plotted:</p> <!--begin.rcode ExposureBoxplot(signatures$SignExposures) end.rcode--> <h4 id="toc6.3.2"> Exposure barplot</h4> <p>The contribution of the signatures to the mutations found on each genome sample can also be visualized in a barplot, plotted by the following command:</p> <!--begin.rcode ExposureBarplot(signatures$SignExposures) end.rcode--> <p class='legend'>Barplot showing the contributions of the signatures to genome samples mutation counts.</p> <p>The <b>relative</b> contribution of signatures on each genome sample can also be visualized in a barplot, setting the <b>relative</b> parameter to TRUE:</p> <!--begin.rcode ExposureBarplot(signatures$SignExposures, relative=TRUE) end.rcode--> <p class='legend'>Barplot showing the relative contributions of signatures to genome samples.</p> <h4 id="toc6.3.3"> Exposure heatmap</h4> <p>The levels of exposure to each signature can also be plotted in a heatmap: </p> <!--begin.rcode ExposureHeat(signatures$SignExposures) end.rcode--> <p class='legend'>Heatmap showing the exposures for each genome sample. Samples are grouped according to their levels of exposure to the signatures, as can be seen in the dendrogram on the left.</p> <h2 id="toc7">7: Supervised approaches to exposure analysis</h2> <p>If additional information is available for the samples analyzed <b>signeR</b> is able to evaluate how those relate to the estimated exposures to mutational signatures. When additional data is categorical, differences in exposures among groups can be analyzed and if some of the samples are unlabeled they can be labeled based on the similarity of their exposure profiles to those of labeled samples. In the case of a continuous additional feature, its correlation to estimated exposures can be evaluated. Survival data can also be analyzed and the relation of signatures to survival can be accessed. In every case, analyses are repeated for all samples of the exposure matrix generated by <b>signeR</b> sampler and results are considered significant if they are consistently found on most of them. </p> <h3 id="toc7.1">7.1: Differential Exposure Analysis</h3> <p><b>signeR</b> can highlight signatures that are differentially active among previously defined groups of samples. To perform this task <b>signeR</b> needs a vector of group labels. In this example the samples were divided according to germline mutations at BRCA genes: groups <code>wt</code>, <code>BRCA1+</code> and <code>BRCA2+</code>, taken from the description of the 21 breast cancer data set.</p> <!--begin.rcode # group labels, respective to each row of the mutation count matrix BRCA_labels <- c("wt","BRCA1+","BRCA2+","BRCA1+","BRCA2+","BRCA1+","BRCA1+", "wt","wt","wt","wt","BRCA1+","wt","BRCA2+","BRCA2+","wt","wt","wt", "wt","wt","wt") diff_exposure <- DiffExp(signatures$SignExposures, labels=BRCA_labels) end.rcode--> <p class='legend'><b>Top chart:</b> DES plot showing that the BRCA+ samples were significantly more exposed to signatures S3, S4 and S5. <b>Bottom chart:</b> plots showing the significant differences found when groups are compared against each other. These last plots are generated only when there are more than two groups in the analysis and any signature is found to be differentially active. Groups marked by the same letter are not significantly different from the corresponding signature.</p> <p>The <code>Pvquant</code> vector holds the pvalues quantile of the test for each signature (by default, the 0.5 quantile, i.e. the median). The logarithms of those are considered the Differential Exposure Scores (DES). Signatures with <code>Pvquant</code> values below the cutoff, 0.05 by default, are considered differentially exposed.</p> <!--begin.rcode # pvalues diff_exposure$Pvquant end.rcode--> <p>The <code>MostExposed</code> vector contains the name of the group where each differentially exposed signature showed the highest levels of activity.</p> <!--begin.rcode # most exposed group diff_exposure$MostExposed end.rcode--> <h3 id="toc7.2">7.2: Sample Classification</h3> <p><b>signeR</b> can also classify samples based on their exposures to mutational processes. In order to do this, a vector of target labels must be provided to the function ExposureClassify. Samples corresponding to NA values in this vector will be classified according to the similarity of their mutational profiles to the ones of labeled samples. Several classification algorithms are available: k-nearest neighbor (knn), linear vector quantization (lvq), Logistic regression (logreg), linear discriminant analysis (lda), least absolute shrinkage and selection operator (lasso), naive bayes(nb), support vector machines (svm), random forests (rf) and adaboost (ab). The following example uses the sample labels defined in the DES analysis performed previously.</p> <!--begin.rcode # note that BRCA_labels [15],[20] and [21] are set to NA BRCA_labels <- c("wt","BRCA+","BRCA+","BRCA+","BRCA+","BRCA+","BRCA+","wt","wt", "wt","wt","BRCA+","wt","BRCA+",NA,"wt","wt","wt","wt",NA,NA) Class <- ExposureClassify(signatures$SignExposures, labels=BRCA_labels, method="knn") end.rcode--> <p class='legend'>Barplot showing the relative frequencies of assignment of each unlabeled sample to the selected group.</p> <!--begin.rcode # Final assignments Class$class # Relative frequencies of assignment to selected groups Class$freq # All assigment frequencies Class$allfreqs end.rcode--> <!--########### Checked until here--> <h3 id="toc7.3">7.3: Correlation Analysis</h3> <p> When a continuous feature is available for the samples being analyzed, <b>signeR</b> can evaluate its correlation to exposures to mutational signatures. The following command applies the <code>cor.test</code> function to evaluate the correlation between the provided feature and the levels of exposure of each signature:</p> <!--begin.rcode eval=TRUE # feature values, respective to each row of the mutation count matrix. # The feature simulated below will be correlated with exposures to signature 2. myFeature <- (10^6)*Median_exp(signatures$SignExposures)[2,] + rnorm(21,0,1) ExpCorr <- ExposureCorrelation(signatures$SignExposures, feature=myFeature) end.rcode--> <p class='legend'>P-values boxplot and scatterplots showing the correlations of exposures and the provided feature.</p> <p>The output <code>ExpCorr</code> will contain a list with two vectors: the estimated correlations of the signatures to the feature and their estimated pvalues.</p> <h3 id="toc7.4">7.4: Linear Regression</h3> <p> A continuous feature may also be modeled by its exposure to mutational signatures. The following command applies the <code>glm</code> function to fit a linear model with the provided feature as output of the levels of exposures to mutational signatures:</p> <!--begin.rcode eval=TRUE ExpLM <- ExposureGLM(signatures$SignExposures, feature=myFeature) end.rcode--> <p class='legend'>P-values boxplot showing the relevance of exposures in final model of provided feature.</p> <h3 id="toc7.5">7.5: Survival Analysis</h3> <p> If survival data is available for the samples being analyzed, <b>signeR</b> can evaluate the effect of exposure on survival. The following function performs an independent test for the exposure levels of each mutational signature, as opposed to the <code>Cox regression</code> described in next item. If <code>method = logrank</code> samples are separated according to their exposure and top and bottom groups are compared by the log-rank test. Otherwise, <code>method = cox</code> and a Cox's proportional hazards model are fitted to exposure levels and evaluate their effects upon survival.</p> <!--begin.rcode eval=TRUE, message=FALSE # survival times and censoring, respective to each row of the mutation count matrix. #Time is simulated so that it is correlated to exposures to signature 3. critical_sig<-Median_exp(signatures$SignExposures)[2,] probs<-(critical_sig-min(critical_sig))/(max(critical_sig)-min(critical_sig)) su<-as.matrix(data.frame(time=sample(80:150,21,replace=T), status=sapply(probs,function(p){sample(c(0,1),1,prob=c(p,1-p))}))) ExpSurv <- ExposureSurvival(Exposures=signatures$SignExposures, surv=su, method="logrank") end.rcode--> <p class='legend'>P.values boxplots and plots of survival data comparing, for each signature, sample groups showing top or bottom exposure levels.</p> <!--begin.rcode eval=TRUE, message=FALSE ExpSurv.cox <- ExposureSurvival(Exposures=signatures$SignExposures, surv=su, method="cox") end.rcode--> <p class='legend'>P.values boxplots and forest plot showing the effect of each signature upon survival according to univariate Cox's proportional hazards model.</p> <h3 id="toc7.6">7.6: Cox regression </h3> <p></b>signeR</b> can also evaluate the combined effect on survival of exposure levels to different signatures. The following command fits Cox's proportional hazards model on exposures. It generates a p-values boxplot and a forest plot (from package <b>forestplot</b>) to show the relevance of each mutational signature to survival.</p> <!--begin.rcode eval=TRUE, message=FALSE BRCA_labels <- c("wt","BRCA+","BRCA+","BRCA+","BRCA+","BRCA+","BRCA+", "wt","wt","wt","wt","BRCA+","wt","BRCA+","BRCA+","wt","wt","wt", "wt","wt","wt") ExpSurvMd <- ExposureSurvModel(Exposures=signatures$SignExposures, surv=su, addata=data.frame(as.factor(BRCA_labels))) end.rcode--> <p class='legend'>Forestplot: Cox proportional model p-values and hazard ratios for each signature.</p> <h2 id="toc8">8: Unsupervised approaches to exposure analysis</h2> <p>When no additional information is available for the samples analyzed, <b>signeR</b> can search patterns on estimated exposures. Unsupervised analysis can be performed on all generated samples of the exposure matrix and consistent results are reported as significant.</p> <h3 id="toc8.1">8.1: Hierarchical Clustering</h3> <p>Samples can be hierarchically clustered according to their exposures to mutational signatures. The function <code>HClustExp</code> applies the R function <code>hclust</code> to each sampled exposure matrix. Dendogram branches consistently found in most of the analyses are reported:</p> <!--begin.rcode eval=TRUE # feature values, respective to each row of the mutation count matrix HCE <- HClustExp(signatures$SignExposures,method.dist="euclidean", method.hclust="average") end.rcode--> <p class='legend'>Hierarchical structure found on data, each branch showing the frequency of its occurrence in the analysis of sampled exposure matrices.</p> <h3 id="toc8.2">8.2: Fuzzy Clustering</h3> <p><b>signeR</b> can also perform a Fuzzy clustering of the samples according to their exposures to mutational signatures. The function <code>FuzzyClustExp</code> applies a clustering algorithm (hard or fuzzy) to each sampled exposure matrix and averages the fuzzy membership degrees of samples to clusters. The number of clusters can be defined by the user (Clim parameter), otherwise different numbers of clusters are tested and the one that maximizes the PBMF index is used (the search algorithm used is similar to the one applied to estimate the number of signatures present in a dataset):</p> <!--begin.rcode eval=TRUE FCE <- FuzzyClustExp(signatures$SignExposures, Clim=c(3,7)) end.rcode--> <p class='legend'>Heatmap showing the mean fuzzy membership degrees of samples to clusters.</p> <p>The function's output is a list of three elements: the mean matrix of fuzzy membership degrees of samples to clusters, the full array of all fuzzy membership degrees found by the analysis of sampled exposure matrices and the matrix of membership degrees found by the analysis of median exposures of samples to signatures.</p> <h2 id="toc9">9: Frequently Asked Questions</h2> <h3 id="toc9.1">9.1: Citing signeR</h3> If you use the <b>signeR</b> package in your work, please cite it: <!--begin.rcode citation("signeR") end.rcode--> <h3 id="toc9.2">9.2: Compilation errors on OS X</h3> <p>OS X users might get compilation errors similar to these:</p> <pre> <code>ld: warning: directory not found for option '-L/usr/local/lib/x86_64' ld: library not found for -lgfortran ld: library not found for -lquadmath</code></pre> <p>This problem arises when the machine is missing gfortran libraries necessary to compile RcppArmadillo and signeR. To install the missing libraries, execute these lines on a terminal:</p> <pre> <code>curl -O http://r.research.att.com/libs/gfortran-4.8.2-darwin13.tar.bz2 sudo tar fvxz gfortran-4.8.2-darwin13.tar.bz2 -C /</code></pre> <p>For more information <a href="http://thecoatlessprofessor.com/programming/rcpp-rcpparmadillo-and-os-x-mavericks-lgfortran-and-lquadmath-error/"> see this post</a> and the <a href="http://dirk.eddelbuettel.com/code/rcpp/Rcpp-FAQ.pdf"> Rcpp FAQ, section 2.16</a>.</p> <h3 id="toc9.3">9.3: Missing library headers</h3> <p>Some packages that signeR depends on require that 3rd party library headers be installed. If you see errors like:</p> <pre><code>Error: checking for curl-config... no Cannot find curl-config</code></pre> or <pre><code>libopenblas.so.0 not found</code></pre> <p>It means you need to install these headers with your package manager. For example on Ubuntu:</p> <pre><code>$ sudo apt-get install libcurl4-openssl-dev libopenblas-dev</code></pre> <h2 id="toc10">10: References</h2> <p>L. B. Alexandrov, S. Nik-Zainal, D. C. Wedge, P. J. Campbell, and M. R. Stratton. Deciphering Signatures of Mutational Processes Operative in Human Cancer. Cell Reports, 3(1):246-259, Jan. 2013. [ <a href="http://dx.doi.org/10.1016/j.celrep.2012.12.008">DOI</a> ]</p> <p>A. Fischer, C. J. Illingworth, P. J. Campbell, and V. Mustonen. EMu: probabilistic inference of mutational processes and their localization in the cancer genome. Genome biology, 14(4):R39, Apr. 2013. [ <a href="http://dx.doi.org/10.1186/gb-2013-14-4-r39">DOI</a> ]</p> <h2>Debug Info</h2> <!--begin.rcode sessionInfo() end.rcode--> <!--begin.rcode print(names(dev.cur())) end.rcode--> <div style="margin-bottom:5em"></div> </body> </html>