--- title: "megadepth quick start guide" author: - name: David Zhang affiliation: - UCL email: david.zhang.12@ucl.ac.uk - name: Leonardo Collado-Torres affiliation: - &libd Lieber Institute for Brain Development, Johns Hopkins Medical Campus - &ccb Center for Computational Biology, Johns Hopkins University email: lcolladotor@gmail.com output: BiocStyle::html_document: self_contained: yes toc: true toc_float: true toc_depth: 2 code_folding: show date: "`r doc_date()`" package: "`r pkg_ver('megadepth')`" vignette: > %\VignetteIndexEntry{megadepth quick start guide} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", crop = NULL ## Related to https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html ) ``` ```{r vignetteSetup, echo=FALSE, message=FALSE, warning = FALSE} ## Track time spent on making the vignette startTime <- Sys.time() ## Bib setup library("RefManageR") ## Write bibliography information bib <- c( R = citation(), BiocStyle = citation("BiocStyle")[1], knitr = citation("knitr")[1], rmarkdown = citation("rmarkdown")[1], sessioninfo = citation("sessioninfo")[1], testthat = citation("testthat")[1], megadepth = citation("megadepth")[1], megadepthpaper = citation("megadepth")[2], bioconductor2015 = RefManageR::BibEntry( bibtype = "Article", key = "bioconductor2015", author = "Wolfgang Huber and Vincent J Carey and Robert Gentleman and Simon Anders and Marc Carlson and Benilton S Carvalho and Hector Corrada Bravo and Sean Davis and Laurent Gatto and Thomas Girke and Raphael Gottardo and Florian Hahne and Kasper D Hansen and Rafael A Irizarry and Michael Lawrence and Michael I Love and James MacDonald and Valerie Obenchain and Andrzej K Oleś and Hervé Pagès and Alejandro Reyes and Paul Shannon and Gordon K Smyth and Dan Tenenbaum and Levi Waldron and Martin Morgan", title = "Orchestrating high-throughput genomic analysis with Bioconductor", year = 2015, doi = "10.1038/nmeth.3252", journal = "Nature Methods", journaltitle = "Nat Methods" ), RefManageR = citation("RefManageR")[1], rtracklayer = citation("rtracklayer") ) ``` The goal of `r Biocpkg("megadepth")` is to provide an R interface to the command line tool [Megadepth](https://github.com/ChristopherWilks/megadepth) for BigWig and BAM related utilities created by [Christopher Wilks](https://twitter.com/chrisnwilks) `r Citep(bib[["megadepthpaper"]])`. This R package enables **fast** processing of BigWig files on downstream packages such as [dasper](https://bioconductor.org/packages/dasper) and [recount3](https://bioconductor.org/packages/recount3). The [Megadepth](https://github.com/ChristopherWilks/megadepth) software also provides utilities for processing BAM files and extracting coverage information from them. Here is an illustration on how fast `r Biocpkg("megadepth")` is compared to other tools for processing local and remote BigWig files. ```{r "runtime", out.width="100%", echo = FALSE, fig.cap = "Timing results. Timing comparison for processing 1,000 genomic regions on a bigWig file that is available on the local disk or on a remote resource. We compared megadepth against rtracklayer and pyBigWig. megadepth is typically faster that these other software solutions for computing the mean coverage across a set of 1,000 input genomic regions. Check for more details."} knitr::include_graphics("https://raw.githubusercontent.com/LieberInstitute/megadepth/master/analysis/md_rt_pybw_runtime.png") ``` Throughout the documentation we use a capital `M` to refer to the software by Christopher Wilks and a lower case `m` to refer to this R/Bioconductor package. # Basics ## Install `megadepth` `R` is an open-source statistical environment which can be easily modified to enhance its functionality via packages. `r Biocpkg("megadepth")` is a `R` package available via the [Bioconductor](http://bioconductor.org) repository for packages. `R` can be installed on any operating system from [CRAN](https://cran.r-project.org/) after which you can install `r Biocpkg("megadepth")` by using the following commands in your `R` session: ```{r "install", eval = FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) { install.packages("BiocManager") } BiocManager::install("megadepth") ## Check that you have a valid Bioconductor installation BiocManager::valid() ``` ## Required knowledge `r Biocpkg("megadepth")` is based on many other packages and in particular in those that have implemented the infrastructure needed for dealing with RNA-seq data and high-throughput sequencing data in general. You might benefit from being familiar with the BigWig file format and the `r Biocpkg("rtracklayer")` for importing those files into R as well as exporting BED files `r Citep(bib[["rtracklayer"]])`. If you are working with annoation files, `r Biocpkg("GenomicFeatures")` and `r Biocpkg("GenomicRanges")` will also be useful to you. If you are asking yourself the question "Where do I start using Bioconductor?" you might be interested in [this blog post](http://lcolladotor.github.io/2014/10/16/startbioc/#.VkOKbq6rRuU). ## Asking for help As package developers, we try to explain clearly how to use our packages and in which order to use the functions. But `R` and `Bioconductor` have a steep learning curve so it is critical to learn where to ask for help. The blog post quoted above mentions some but we would like to highlight the [Bioconductor support site](https://support.bioconductor.org/) as the main resource for getting help: remember to use the `megadepth` tag and check [the older posts](https://support.bioconductor.org/t/megadepth/). Other alternatives are available such as creating GitHub issues and tweeting. However, please note that if you want to receive help you should adhere to the [posting guidelines](http://www.bioconductor.org/help/support/posting-guide/). It is particularly critical that you provide a small reproducible example and your session information so package developers can track down the source of the error. ## Citing `megadepth` We hope that `r Biocpkg("megadepth")` will be useful for your research. Please use the following information to cite the package and the overall approach. Thank you! ```{r "citation"} ## Citation info citation("megadepth") ``` # Quick start To get started, we need to load the `r Biocpkg("megadepth")` package into our R session. This will load all the required dependencies. ```{r "start", message=FALSE} library("megadepth") ``` Once we have the R package loaded, we need to install the [Megadepth](https://github.com/ChristopherWilks/megadepth) software. We can do so with `install_megadepth()`, which downloads a binary for your OS (Linux, Windows or macOS) ^[Please check [Megadepth](https://github.com/ChristopherWilks/megadepth) for instructions on how to compile the software from source if the binary version doesn't work for you.]. We can then use with an example BigWig file to compute the coverage at a set of regions. ```{r "install_software"} ## Install the latest version of Megadepth install_megadepth(force = TRUE) ``` Next, we might want to use `r Biocpkg("megadepth")` to quantify the coverage at a set of regions of the genome of interest to us. Here we will use two example files that are include in the package for illustration and testing purposes. One of them is a [bigWig file](https://genome.ucsc.edu/goldenPath/help/bigWig.html) that contains the base-pair coverage information for a sample of interest and the second one is [BED file](https://genome.ucsc.edu/FAQ/FAQformat.html#format1) which contains the genomic region coordinates of interest. So we first locate them. ```{r "locate_example_bw"} ## Next, we locate the example BigWig and annotation files example_bw <- system.file("tests", "test.bam.all.bw", package = "megadepth", mustWork = TRUE ) annotation_file <- system.file("tests", "testbw2.bed", package = "megadepth", mustWork = TRUE ) ## Where are they? example_bw annotation_file ``` Once we have located the example files we can proceed to calculating the base-pair coverage for our genomic regions of interest. There are several ways to do this with `r Biocpkg("megadepth")`, but here we use the user-friendly function `get_coverage()`. This function will perform a given operation **op** on the bigWig file for a given set of regions of interest (_annotation_). One of those operations is to compute the mean base-pair coverage for each input region. This is what we'll do with our example bigWig file. ```{r example} ## We can then use megadepth to compute the coverage bw_cov <- get_coverage( example_bw, op = "mean", annotation = annotation_file ) bw_cov ``` `get_coverage()` returns an object that is familiar to `r Biocpkg("GenomicRanges")` users, that is, a `GRanges` object that can be used with other Bioconductor software packages `r Citep(bib[["bioconductor2015"]])`. This example is just the tip of the iceberg, as [Megadepth](https://github.com/ChristopherWilks/megadepth) and thus `r Biocpkg("megadepth")` can do a lot of useful processing operations on BAM and bigWig files. # Users guide ## Command interface [Megadepth](https://github.com/ChristopherWilks/megadepth) is very powerful and can do a lot of different things. The R/Bioconductor package provides two functions for interfacing with [Megadepth](https://github.com/ChristopherWilks/megadepth), `megadepth_cmd()` and `megadepth_shell()`. For the first one, `megadepth_cmd()`, you need to know the actual command syntax you want to use and format it accordingly. If you are more comfortable with R functions, `megadepth_shell()` uses `r BiocStyle::CRANpkg("cmdfun")` to power this interface and capture the standard output stream into R. To make it easier to use, `megadepth` includes functions that simplify the number of arguments, read in the output files, and converts them into R/Bioconductor friendly objects, such as `get_coverage()` illustrated previously in the quick start section. We hope that you'll find `megadepth` and [Megadepth](https://github.com/ChristopherWilks/megadepth) useful for your work. If you are interested in checking how **fast** `megadepth` is, check out the [**speed analysis**](https://github.com/LieberInstitute/megadepth/tree/master/analysis) comparison against other tools. Note that the size of the files used and the number of genomic regions queried will affect the speed comparisons. ```{r "interface_options"} ## R-like interface ## that captures the standard output into R head(megadepth_shell(help = TRUE)) ## Command-like interface megadepth_cmd("--help") ``` ```{r "show_help", echo = FALSE} x <- megadepth_shell(help = TRUE) cat(paste0(x, "\n")) ``` ## BAM to bigWig One use case of [Megadepth](https://github.com/ChristopherWilks/megadepth) is to convert BAM files to bigWig coverage files. To simplify this process and verify that you are not accidentally overwriting valuable files, `r Biocpkg("megadepth")` provides the function `bam_to_bigwig()`. To illustrate this functionality, we first locate an example BAM file. We then generate the output bigWig files ```{r "bam_to_bigwig", eval = !xfun::is_windows()} ## Find the example BAM file example_bam <- system.file("tests", "test.bam", package = "megadepth", mustWork = TRUE ) ## Create the BigWig file ## Currently Megadepth does not support this on Windows example_bw <- bam_to_bigwig(example_bam, overwrite = TRUE) ## Path to the output files generated by bam_to_bigwig() example_bw ``` Currently this functionality does not work on Windows. In which case, you can continue the vignette with the following example bigWig file. ```{r "bam_to_bigwig_windows", eval = xfun::is_windows()} ## On Windows, use the example bigWig file that is already included in ## the R package example_bw <- system.file("tests", "test.bam.all.bw", package = "megadepth", mustWork = TRUE ) ``` ## Summarize coverage over regions Once you have a biWig file, you might want to quantify the mean or total expression across a set of genomic coordinates. bigWig files are typically used by genome browsers, and as part of the `r Biocpkg('recount')` and `r Biocpkg('recount3')` projects we have released thousands of them. Before we expand more complex uses cases, you might be interested in `get_coverage()`. This function will use [Megadepth](https://github.com/ChristopherWilks/megadepth) to create a tab-separated value (TSV) file containing the coverage summary information ^[Sum, mean, min or max base-pair coverage for each region.] for a given input file that can be read into R using `read_coverage()` ^[If you prefer a `tibble`, use `read_coverage_table()`.] as shown below with an example set of genomic regions of interest (_annotation_). ```{r "get_coverage"} ## Next, we locate the example annotation BED file annotation_file <- system.file("tests", "testbw2.bed", package = "megadepth", mustWork = TRUE ) ## Now we can compute the coverage bw_cov <- get_coverage(example_bw, op = "mean", annotation = annotation_file) bw_cov ``` If you are familiar with `r Biocpkg("rtracklayer")`, you'll notice that the coverage summaries are basically the same to the one that can be generated with `rtracklayer::import.bw()`, which is what `r Biocpkg("derfinder")` uses internally. ```{r "rtracklayer_derfinder"} ## Checking with derfinder and rtracklayer bed <- rtracklayer::import(annotation_file) ## The file needs a name names(example_bw) <- "example" ## Read in the base-pair coverage data regionCov <- derfinder::getRegionCoverage( regions = bed, files = example_bw, verbose = FALSE ) ## Summarize the base-pair coverage data. ## Note that we have to round the mean to make them comparable. testthat::expect_equivalent( round(sapply(regionCov[c(1, 3:4, 2)], function(x) mean(x$value)), 2), bw_cov$score, ) ## If we compute the sum, there's no need to round testthat::expect_equivalent( sapply(regionCov[c(1, 3:4, 2)], function(x) sum(x$value)), get_coverage(example_bw, op = "sum", annotation = annotation_file)$score, ) ``` ## BAM to junctions [Megadepth](https://github.com/ChristopherWilks/megadepth) provides utilities that might be of use for future work or that were developed for building `r Biocpkg('recount3')`. One of these features is the possibility to extract locally co-ocurring junctions from a BAM file as described in the [Megadepth documentation](https://github.com/ChristopherWilks/megadepth#megadepth-pathtobamfile---junctions). This feature works only for junctions for which a read or (read pair) has 2 or more junctions. To illustrate this functionality, we will use an example BAM file and generate the locally co-occurring junction table with `bam_to_junctions()`. We'll then read in the data using `read_junction_table()` ^[The `strand` columns have been switched from 0s and 1s to `+` and `-` for the forward and reverse strands, to match frequently used Bioconductor packages.]. `process_junction_table()` can then be used to convert the junctions into a STAR-compatible format. ```{r "bam_to_junctions"} ## Find the example BAM file example_bam <- system.file("tests", "test.bam", package = "megadepth", mustWork = TRUE ) ## Run bam_to_junctions() example_jxs <- bam_to_junctions(example_bam, overwrite = TRUE) ## Path to the output file generated by bam_to_junctions() example_jxs ## Read the data as a tibble using the format specified at ## https://github.com/ChristopherWilks/megadepth#megadepth-pathtobamfile---junctions example_jxs <- read_junction_table(example_jxs) example_jxs process_junction_table(example_jxs) ``` ## Teams involved `r Biocpkg('megadepth')` was made possible to [David Zhang](https://twitter.com/dyzhang32), the author of `r Biocpkg("dasper")`, and a member of the [Mina Ryten](https://snca.atica.um.es/)'s lab at UCL. The `ReCount` family involves the following teams: * [Ben Langmead's lab](http://www.langmead-lab.org/) at JHU Computer Science * [Kasper Daniel Hansen's lab](https://www.hansenlab.org/) at JHBSPH Biostatistics Department * [Leonardo Collado-Torres](http://lcolladotor.github.io/) and [Andrew E. Jaffe](http://aejaffe.com/) from [LIBD](https://www.libd.org/) * [Abhinav Nellore's lab](http://nellore.bio/) at OHSU * [Jeff Leek's lab](http://jtleek.com/) at JHBSPH Biostatistics Deparment * Data hosted by [SciServer from IDIES at JHU](https://www.sciserver.org/) ## Other related tools The `ReCount` team has worked on several software solutions and projects that complement each other and enable you to re-use public RNA-seq data. Another Bioconductor package that you might be highly interested in is `r Biocpkg("snapcount")`, which allows you to use the [Snaptron web services](http://snaptron.cs.jhu.edu/). In particular, `r Biocpkg("snapcount")` is best for queries over a particular subset of genes or intervals across all or most of the samples in `recount2`/`Snaptron`. We remind you that the **main documentation website** for all the `recount3`-related projects is available at [**recount.bio**](https://LieberInstitute/github.io/recount3-docs). Please check that website for more information about how this R/Bioconductor package and other tools are related to each other. **Thank you!** P.S. An [alternative version of this vignette is available](https://LieberInstitute.github.io/megadepth/) that was made using `r CRANpkg("pkgdown")`. # Reproducibility The `r Biocpkg("megadepth")` package `r Citep(bib[["megadepth"]])` was made possible thanks to: * R `r Citep(bib[["R"]])` * `r Biocpkg("BiocStyle")` `r Citep(bib[["BiocStyle"]])` * `r CRANpkg("knitr")` `r Citep(bib[["knitr"]])` * `r CRANpkg("rmarkdown")` `r Citep(bib[["rmarkdown"]])` * `r CRANpkg("RefManageR")` `r Citep(bib[["RefManageR"]])` * `r CRANpkg("sessioninfo")` `r Citep(bib[["sessioninfo"]])` * `r CRANpkg("testthat")` `r Citep(bib[["testthat"]])` This package was developed using `r BiocStyle::Biocpkg("biocthis")`. Code for creating the vignette ```{r createVignette, eval=FALSE} ## Create the vignette library("rmarkdown") system.time(render("megadepth.Rmd", "BiocStyle::html_document")) ## Extract the R code library("knitr") knit("megadepth.Rmd", tangle = TRUE) ``` Date the vignette was generated. ```{r reproduce1, echo=FALSE} ## Date the vignette was generated Sys.time() ``` Wallclock time spent generating the vignette. ```{r reproduce2, echo=FALSE} ## Processing time in seconds totalTime <- diff(c(startTime, Sys.time())) round(totalTime, digits = 3) ``` `R` session information. ```{r reproduce3, echo=FALSE} ## Session info library("sessioninfo") options(width = 120) session_info() ``` # Bibliography This vignette was generated using `r Biocpkg("BiocStyle")` `r Citep(bib[["BiocStyle"]])` with `r CRANpkg("knitr")` `r Citep(bib[["knitr"]])` and `r CRANpkg("rmarkdown")` `r Citep(bib[["rmarkdown"]])` running behind the scenes. Citations made with `r CRANpkg('RefManageR')` `r Citep(bib[['RefManageR']])`. ```{r vignetteBiblio, results = "asis", echo = FALSE, warning = FALSE, message = FALSE} ## Print bibliography PrintBibliography(bib, .opts = list(hyperlink = "to.doc", style = "html")) ```