--- title: "HiLDA: a package for testing the burdens of mutational signatures" author: - name: Zhi Yang affiliation: Department of Preventive Medicine, University of Southern California, Los Angeles, USA email: zyang895@gmail.com date: "`r Sys.Date()`" vignette: | %\VignetteIndexEntry{An introduction to HiLDA} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} output: BiocStyle::html_document: toc_float: true BiocStyle::pdf_document: default abstract: | Instructions on using _HiLDA_ on testing the burdens of mutational signatures. --- ```{r style, echo = FALSE, results = 'asis'} library(BiocStyle) ``` # Introduction The R package **HiLDA** is developed under the Bayesian framework to allow statistically testing whether there is a change in the mutation burdens of mutation signatures between two groups. The mutation signature is defined based on the independent model proposed by Shiraishi's et al. ## Paper - Shiraishi et al. A simple model-based approach to inferring and visualizing cancer mutation signatures, bioRxiv, doi: [http://dx.doi.org/10.1101/019901](http://dx.doi.org/10.1101/019901). - **Zhi Yang**, Priyatama Pandey, Darryl Shibata, David V. Conti, Paul Marjoram, Kimberly D. Siegmund. HiLDA: a statistical approach to investigate differences in mutational signatures, bioRxiv, doi: https://doi.org/10.1101/577452 # Installing and loading the package {#installation} ## Installation ### Bioconductor **HiLDA** requires several CRAN and Bioconductor R packages to be installed. Dependencies are usually handled automatically, when installing the package using the following commands: ``` install.packages("BiocManager") BiocManager::install("HiLDA") ``` [NOTE: Ignore the first line if you already have installed the `r CRANpkg("BiocManager")`.] You can also download the newest version from the GitHub using *devtools*: ``` devtools::install_github("USCbiostats/HiLDA") ``` ### Just Another Gibbs Sampler (JAGS) In order to run HiLDA, one also needs to install an external program called Just Another Gibbs Sampler, JAGS, downloaded from this website http://mcmc-jags.sourceforge.net/. For more details, please follow the INSTALL file to install the program. # Input data `HiLDA` is a package built on some basic functions from `pmsignature` including how to read the input data. Here is an example from `pmsignature` on the input data, *mutation features* are elements used for categorizing mutations such as: * 6 substitutions (C>A, C>G, C>T, T>A, T>C and T>G) * 2 flanking bases (A, C, G and T) * transcription direction. ## Mutation Position Format sample1 chr1 100 A C sample1 chr1 200 A T sample1 chr2 100 G T sample2 chr1 300 T C sample3 chr3 400 T C * The 1st column shows the name of samples * The 2nd column shows the name of chromosome * The 3rd column shows the coordinate in the chromosome * The 4th column shows the reference base (A, C, G, or T). * The 5th colum shows the alternate base (A, C, G, or T). # Workflow ## Get input data Here, *inputFile* is the path for the input file. *numBases* is the number of flanking bases to consider including the central base (if you want to consider two 5' and 3' bases, then set 5). Also, you can add transcription direction information using *trDir*. *numSig* sets the number of mutation signatures estimated from the input data. You will see a warning message on some mutations are being removed. ```{r} library(HiLDA) inputFile <- system.file("extdata/esophageal.mp.txt.gz", package="HiLDA") G <- hildaReadMPFile(inputFile, numBases=5, trDir=TRUE) ``` Also, we also provided a small simulated dataset which contains 10 mutational catalogs andused it for demonstrating the key functions in HiLDA. We start with loading the sample dataset G stored as extdata/sample.rdata. ```{r} load(system.file("extdata/sample.rdata", package = "HiLDA")) class(G) ``` If you'd like to use the USC data in the manuscript, please download the data from the OSF home page https://osf.io/a8dzx/ ## Run tests from `HiLDA` ### Perform the global test and the local test After we read in the sample data G, we can run the local and the global tests from HiLDA. Here, we specify the *inputG* as *G*, the number of mutational signatures to be three, the indices for the reference group to be 1:4, the number of iterations to be 1000. *localTest* being *FALSE* means that a global test is called while it being *TRUE* means that a local test is called instead. ```{r} set.seed(123) hildaGlobal <- hildaTest(inputG=G, numSig=3, localTest=FALSE, refGroup=1:4, nIter=1000) hildaLocal <- hildaTest(inputG=G, numSig=3, localTest=TRUE, refGroup=1:4, nIter=1000) ``` ## Get signatures from *pmsignature* This object is used to provide an initial values for running MCMC sampling to reduce the running time by using the EM algorithm from *pmsignature* package developed by Shiraishi et al. ```{r} Param <- pmgetSignature(G, K = 3) ``` ### Perform the global test and the local test with initial values In a very similar way as running the HiLDA test, one just needs to specify *useInits* to be *Param* returned by the previous function to allow the initial values to be used in the MCMC sampling. ```{r} set.seed(123) hildaGlobal <- hildaTest(inputG=G, numSig=3, useInits = Param, localTest=TRUE, refGroup=1:4, nIter=1000) hildaLocal <- hildaTest(inputG=G, numSig=3, useInits = Param, localTest=TRUE, refGroup=1:4, nIter=1000) ``` ### Assess Convergence of MCMC chains After the MCMC sampling finishes, we can compute the potential scale reduction statistic to examine the convergence of two chains. Usually it is recommended to be less than 1.10. If not, it can be done by increasing the number of *nIter*. ```{r} hildaRhat(hildaGlobal) hildaRhat(hildaLocal) ``` ## Visualize the mutation signatures from pmsignature To allow users to compare the mutational signatures from both pmsignature and HiLDA, this function is used to plot the results from pmsignature. ```{r} pmPlots <- pmBarplot(G, Param, refGroup=1:4, sigOrder=c(1,3,2)) cowplot::plot_grid(pmPlots$sigPlot, pmPlots$propPlot, rel_widths = c(1,3)) ``` ## Visualize the mutation signatures from HiLDA In contrast, the following function is used to plot the results from HiLDA. ```{r} hildaPlots <- hildaBarplot(G, hildaLocal, refGroup=1:4, sigOrder=c(1,3,2)) cowplot::plot_grid(pmPlots$sigPlot, pmPlots$propPlot, rel_widths = c(1,3)) ``` ## Output the posterior distribution of the mean difference in exposures To visualize the 95% credible interval of the mean differences in exposures, the following function plots the differences along with the mutational signatures. ```{r} hildaDiffPlots <- hildaDiffPlot(G, hildaLocal, sigOrder=c(1,3,2)) cowplot::plot_grid(hildaDiffPlots$sigPlot, hildaDiffPlots$diffPlot, rel_widths = c(1,3)) ```