--- title: "derfinder quick start guide" author: - name: Leonardo Collado-Torres affiliation: - &libd Lieber Institute for Brain Development, Johns Hopkins Medical Campus - &ccb Center for Computational Biology, Johns Hopkins University email: lcolladotor@gmail.com output: BiocStyle::html_document: self_contained: yes toc: true toc_float: true toc_depth: 2 code_folding: show date: "`r doc_date()`" package: "`r pkg_ver('derfinder')`" vignette: > %\VignetteIndexEntry{derfinder quick start guide} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Basics ## Install `r Biocpkg('derfinder')` `R` is an open-source statistical environment which can be easily modified to enhance its functionality via packages. `r Biocpkg('derfinder')` is a `R` package available via the [Bioconductor](http://bioconductor/packages/derfinder) repository for packages. `R` can be installed on any operating system from [CRAN](https://cran.r-project.org/) after which you can install `r Biocpkg('derfinder')` by using the following commands in your `R` session: ```{r 'installDer', eval = FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) { install.packages("BiocManager") } BiocManager::install("derfinder") ## Check that you have a valid Bioconductor installation BiocManager::valid() ``` ## Required knowledge `r Biocpkg('derfinder')` is based on many other packages and in particular in those that have implemented the infrastructure needed for dealing with RNA-seq data. That is, packages like `r Biocpkg('Rsamtools')`, `r Biocpkg('GenomicAlignments')` and `r Biocpkg('rtracklayer')` that allow you to import the data. A `r Biocpkg('derfinder')` user is not expected to deal with those packages directly but will need to be familiar with `r Biocpkg('GenomicRanges')` to understand the results `r Biocpkg('derfinder')` generates. It might also prove to be highly beneficial to check the `r Biocpkg('BiocParallel')` package for performing parallel computations. If you are asking yourself the question "Where do I start using Bioconductor?" you might be interested in [this blog post](http://lcolladotor.github.io/2014/10/16/startbioc/#.VkOKbq6rRuU). ## Asking for help As package developers, we try to explain clearly how to use our packages and in which order to use the functions. But `R` and `Bioconductor` have a steep learning curve so it is critical to learn where to ask for help. The blog post quoted above mentions some but we would like to highlight the [Bioconductor support site](https://support.bioconductor.org/) as the main resource for getting help: remember to use the `derfinder` tag and check [the older posts](https://support.bioconductor.org/t/derfinder/). Other alternatives are available such as creating GitHub issues and tweeting. However, please note that if you want to receive help you should adhere to the [posting guidelines](http://www.bioconductor.org/help/support/posting-guide/). It is particularly critical that you provide a small reproducible example and your session information so package developers can track down the source of the error. We would like to highlight the `r Biocpkg('derfinder')` user [Jessica Hekman](https://support.bioconductor.org/u/6877/). She has used `r Biocpkg('derfinder')` with non-human data, and in the process of doing so discovered some small bugs or sections of the documentation that were not clear. ## Citing `r Biocpkg('derfinder')` We hope that `r Biocpkg('derfinder')` will be useful for your research. Please use the following information to cite the package and the overall approach. Thank you! ```{r 'citation'} ## Citation info citation("derfinder") ``` # Quick start to using to `r Biocpkg('derfinder')` ```{r vignetteSetup, echo=FALSE, message=FALSE, warning = FALSE} ## Track time spent on making the vignette startTime <- Sys.time() ## Bib setup library("RefManageR") ## Write bibliography information bib <- c( derfinder = citation("derfinder")[1], BiocStyle = citation("BiocStyle")[1], knitr = citation("knitr")[3], rmarkdown = citation("rmarkdown")[1], brainspan = RefManageR::BibEntry( bibtype = "Unpublished", key = "brainspan", title = "Atlas of the Developing Human Brain [Internet]. Funded by ARRA Awards 1RC2MH089921-01, 1RC2MH090047-01, and 1RC2MH089929-01.", author = "BrainSpan", year = 2011, url = "http://www.brainspan.org/" ), originalder = citation("derfinder")[2], R = citation(), IRanges = citation("IRanges")[1], sessioninfo = citation("sessioninfo")[1], testthat = citation("testthat")[1], GenomeInfoDb = RefManageR::BibEntry( bibtype = "manual", key = "GenomeInfoDb", author = "Sonali Arora and Martin Morgan and Marc Carlson and H. Pagès", title = "GenomeInfoDb: Utilities for manipulating chromosome and other 'seqname' identifiers", year = 2017, doi = "10.18129/B9.bioc.GenomeInfoDb" ), GenomicRanges = citation("GenomicRanges")[1], ggplot2 = citation("ggplot2")[1], bumphunter = citation("bumphunter")[1], TxDb.Hsapiens.UCSC.hg19.knownGene = citation("TxDb.Hsapiens.UCSC.hg19.knownGene")[1], AnnotationDbi = RefManageR::BibEntry( bibtype = "manual", key = "AnnotationDbi", author = "Hervé Pagès and Marc Carlson and Seth Falcon and Nianhua Li", title = "AnnotationDbi: Annotation Database Interface", year = 2017, doi = "10.18129/B9.bioc.AnnotationDbi" ), BiocParallel = citation("BiocParallel")[1], derfinderHelper = citation("derfinderHelper")[1], GenomicAlignments = citation("GenomicAlignments")[1], GenomicFeatures = citation("GenomicFeatures")[1], GenomicFiles = citation("GenomicFiles")[1], Hmisc = citation("Hmisc")[1], qvalue = citation("qvalue")[1], Rsamtools = citation("Rsamtools")[1], rtracklayer = citation("rtracklayer")[1], S4Vectors = RefManageR::BibEntry( bibtype = "manual", key = "S4Vectors", author = "Hervé Pagès and Michael Lawrence and Patrick Aboyoun", title = "S4Vectors: S4 implementation of vector-like and list-like objects", year = 2017, doi = "10.18129/B9.bioc.S4Vectors" ), bumphunterPaper = RefManageR::BibEntry( bibtype = "article", key = "bumphunterPaper", title = "Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies", author = "Jaffe, Andrew E and Murakami, Peter and Lee, Hwajin and Leek, Jeffrey T and Fallin, M Daniele and Feinberg, Andrew P and Irizarry, Rafael A", year = 2012, journal = "International Journal of Epidemiology" ), derfinderData = citation("derfinderData")[1], RefManageR = citation("RefManageR")[1] ) ``` Here is a very quick example of a DER Finder analysis. This analysis is explained in more detail later on in this document. ```{r 'ultraQuick', eval = FALSE} ## Load libraries library("derfinder") library("derfinderData") library("GenomicRanges") ## Determine the files to use and fix the names files <- rawFiles(system.file("extdata", "AMY", package = "derfinderData"), samplepatt = "bw", fileterm = NULL ) names(files) <- gsub(".bw", "", names(files)) ## Load the data from disk -- On Windows you have to load data from Bam files fullCov <- fullCoverage(files = files, chrs = "chr21", verbose = FALSE) ## Get the region matrix of Expressed Regions (ERs) regionMat <- regionMatrix(fullCov, cutoff = 30, L = 76, verbose = FALSE) ## Get pheno table pheno <- subset(brainspanPheno, structure_acronym == "AMY") ## Identify which ERs are differentially expressed, that is, find the DERs library("DESeq2") ## Round matrix counts <- round(regionMat$chr21$coverageMatrix) ## Round matrix and specify design dse <- DESeqDataSetFromMatrix(counts, pheno, ~ group + gender) ## Perform DE analysis dse <- DESeq(dse, test = "LRT", reduced = ~gender, fitType = "local") ## Extract results mcols(regionMat$chr21$regions) <- c(mcols(regionMat$chr21$regions), results(dse)) ## Save info in an object with a shorter name ers <- regionMat$chr21$regions ers ``` # Introduction `r Biocpkg('derfinder')` is an R package that implements the DER Finder approach `r Citep(bib[['originalder']])` for RNA-seq data. Briefly, this approach is an alternative to feature-counting and transcript assembly. The basic idea is to identify contiguous base-pairs in the genome that present differential expression signal. These base-pairs are grouped into _d_ifferentially _e_xpressed _r_regions (DERs). This approach is annotation-agnostic which is a feature you might be interested in. In particular, `r Biocpkg('derfinder')` contains functions that allow you to identify DERs via two alternative methods. You can find more details on the full [derfinder users guide](derfinder-users-guide.html). # Sample DER Finder analysis This is a brief overview of what a DER Finder analysis looks like. In particular, here we will be identifying expressed regions (ERs) without relying on annotation. Next, we'll identify candidate differentially expressed regions (DERs). Finally, we'll compare the DERs with known annotation features. We first load the required packages. ```{r 'start', message=FALSE} ## Load libraries library("derfinder") library("derfinderData") library("GenomicRanges") ``` Next, we need to locate the chromosome 21 coverage files for a set of 12 samples. These samples are a small subset from the _BrainSpan Atlas of the Human Brain_ `r Citep(bib[['brainspan']])` publicly available data. The function `rawFiles()` helps us in locating these files. ```{r 'locateAMYfiles'} ## Determine the files to use and fix the names files <- rawFiles(system.file("extdata", "AMY", package = "derfinderData"), samplepatt = "bw", fileterm = NULL ) names(files) <- gsub(".bw", "", names(files)) ``` Next, we can load the full coverage data into memory using the `fullCoverage()` function. Note that the _BrainSpan_ data is already normalized by the total number of mapped reads in each sample. However, that won't be the case with most data sets in which case you might want to use the `totalMapped` and `targetSize` arguments. The function `getTotalMapped()` will be helpful to get this information. ```{r 'getData', eval = .Platform$OS.type != "windows"} ## Load the data from disk fullCov <- fullCoverage( files = files, chrs = "chr21", verbose = FALSE, totalMapped = rep(1, length(files)), targetSize = 1 ) ``` ```{r 'getDataWindows', eval = .Platform$OS.type == "windows", echo = FALSE} ## Load data in Windows case foo <- function() { load(system.file("extdata", "fullCov", "fullCovAMY.RData", package = "derfinderData" )) return(fullCovAMY) } fullCov <- foo() ``` Now that we have the data, we can identify expressed regions (ERs) by using a cutoff of 30 on the base-level mean coverage from these 12 samples. Once the regions have been identified, we can calculate a coverage matrix with one row per ER and one column per sample (12 in this case). For doing this calculation we need to know the length of the sequence reads, which in this study were 76 bp long. ```{r 'regionMatrix'} ## Get the region matrix of Expressed Regions (ERs) regionMat <- regionMatrix(fullCov, cutoff = 30, L = 76, verbose = FALSE) ``` `regionMatrix()` returns a list of elements, each with three pieces of output. The actual ERs are arranged in a `GRanges` object named `regions`. ```{r 'exploreRegMatRegs'} ## regions output regionMat$chr21$regions ``` `bpCoverage` is the base-level coverage list which can then be used for plotting. ```{r 'exploreRegMatBP'} ## Base-level coverage matrices for each of the regions ## Useful for plotting lapply(regionMat$chr21$bpCoverage[1:2], head, n = 2) ## Check dimensions. First region is 565 long, second one is 138 bp long. ## The columns match the number of samples (12 in this case). lapply(regionMat$chr21$bpCoverage[1:2], dim) ``` The end result of the coverage matrix is shown below. Note that the coverage has been adjusted for read length. Because reads might not fully align inside a given region, the numbers are generally not integers but can be rounded if needed. ```{r 'exploreRegMatrix'} ## Dimensions of the coverage matrix dim(regionMat$chr21$coverageMatrix) ## Coverage for each region. This matrix can then be used with limma or other pkgs head(regionMat$chr21$coverageMatrix) ``` We can then use the coverage matrix and packages such as `r Biocpkg('limma')`, `r Biocpkg('DESeq2')` or `r Biocpkg('edgeR')` to identify which ERs are differentially expressed. Here we'll use `r Biocpkg('DESeq2')` for which we need some phenotype data. ```{r 'phenoData'} ## Get pheno table pheno <- subset(brainspanPheno, structure_acronym == "AMY") ``` Now we can identify the DERs using a rounded version of the coverage matrix. ```{r 'identifyDERsDESeq2'} ## Identify which ERs are differentially expressed, that is, find the DERs library("DESeq2") ## Round matrix counts <- round(regionMat$chr21$coverageMatrix) ## Round matrix and specify design dse <- DESeqDataSetFromMatrix(counts, pheno, ~ group + gender) ## Perform DE analysis dse <- DESeq(dse, test = "LRT", reduced = ~gender, fitType = "local") ## Extract results mcols(regionMat$chr21$regions) <- c( mcols(regionMat$chr21$regions), results(dse) ) ## Save info in an object with a shorter name ers <- regionMat$chr21$regions ers ``` We can then compare the DERs against known annotation to see which DERs overlap known exons, introns, or intergenic regions. A way to visualize this information is via a Venn diagram which we can create using `vennRegions()` from the `r Biocpkg('derfinderPlot')` package as shown in Figure \@ref(fig:vennRegions). ```{r 'vennRegions', fig.cap = "Venn diagram showing ERs by annotation class."} ## Find overlaps between regions and summarized genomic annotation annoRegs <- annotateRegions(ers, genomicState$fullGenome, verbose = FALSE) library("derfinderPlot") venn <- vennRegions(annoRegs, counts.col = "blue", main = "Venn diagram using TxDb.Hsapiens.UCSC.hg19.knownGene annotation" ) ``` We can also identify the nearest annotated feature. In this case, we'll look for the nearest known gene from the UCSC hg19 annotation. ```{r 'nearestGene'} ## Load database of interest library("TxDb.Hsapiens.UCSC.hg19.knownGene") txdb <- keepSeqlevels(TxDb.Hsapiens.UCSC.hg19.knownGene, "chr21") ## Find nearest feature library("bumphunter") genes <- annotateTranscripts(txdb) annotation <- matchGenes(ers, subject = genes) ## Restore seqlevels txdb <- restoreSeqlevels(txdb) ## View annotation results head(annotation) ## You can use derfinderPlot::plotOverview() to visualize this information ``` We can check the base-level coverage information for some of our DERs. In this example we do so for the first 5 ERs (Figures \@ref(fig:firstfive1), \@ref(fig:firstfive2), \@ref(fig:firstfive3), \@ref(fig:firstfive4), \@ref(fig:firstfive5)). ```{r 'firstfive', fig.cap = paste0("Base-pair resolution plot of differentially expressed region ", 1:5, ".")} ## Extract the region coverage regionCov <- regionMat$chr21$bpCoverage plotRegionCoverage( regions = ers, regionCoverage = regionCov, groupInfo = pheno$group, nearestAnnotation = annotation, annotatedRegions = annoRegs, whichRegions = seq_len(5), txdb = txdb, scalefac = 1, ask = FALSE, verbose = FALSE ) ``` You can then use the `r Biocpkg('regionReport')` package to generate interactive HTML reports exploring the results. If you are interested in using `r Biocpkg('derfinder')` we recommend checking the [derfinder users guide](derfinder-users-guide.html) and good luck with your analyses! # Reproducibility This package was made possible thanks to: * R `r Citep(bib[['R']])` * `r Biocpkg('AnnotationDbi')` `r Citep(bib[['AnnotationDbi']])` * `r Biocpkg('BiocParallel')` `r Citep(bib[['BiocParallel']])` * `r Biocpkg('bumphunter')` `r Citep(bib[['bumphunter']])` and `r Citep(bib[['bumphunterPaper']])` * `r Biocpkg('derfinderHelper')` `r Citep(bib[['derfinderHelper']])` * `r Biocpkg('GenomeInfoDb')` `r Citep(bib[['GenomeInfoDb']])` * `r Biocpkg('GenomicAlignments')` `r Citep(bib[['GenomicAlignments']])` * `r Biocpkg('GenomicFeatures')` `r Citep(bib[['GenomicFeatures']])` * `r Biocpkg('GenomicFiles')` `r Citep(bib[['GenomicFiles']])` * `r Biocpkg('GenomicRanges')` `r Citep(bib[['GenomicRanges']])` * `r CRANpkg('Hmisc')` `r Citep(bib[['Hmisc']])` * `r Biocpkg('IRanges')` `r Citep(bib[['IRanges']])` * `r Biocpkg('qvalue')` `r Citep(bib[['qvalue']])` * `r Biocpkg('Rsamtools')` `r Citep(bib[['Rsamtools']])` * `r Biocpkg('rtracklayer')` `r Citep(bib[['rtracklayer']])` * `r Biocpkg('S4Vectors')` `r Citep(bib[['S4Vectors']])` * `r Biocexptpkg('derfinderData')` `r Citep(bib[['derfinderData']])` * `r CRANpkg('sessioninfo')` `r Citep(bib[['sessioninfo']])` * `r CRANpkg('ggplot2')` `r Citep(bib[['ggplot2']])` * `r CRANpkg('knitr')` `r Citep(bib[['knitr']])` * `r Biocpkg('BiocStyle')` `r Citep(bib[['BiocStyle']])` * `r CRANpkg("RefManageR")` `r Citep(bib[["RefManageR"]])` * `r CRANpkg('rmarkdown')` `r Citep(bib[['rmarkdown']])` * `r CRANpkg('testthat')` `r Citep(bib[['testthat']])` * `r Biocannopkg('TxDb.Hsapiens.UCSC.hg19.knownGene')` `r Citep(bib[['TxDb.Hsapiens.UCSC.hg19.knownGene']])` Code for creating the vignette ```{r createVignette, eval=FALSE} ## Create the vignette library("rmarkdown") system.time(render("derfinder-quickstart.Rmd", "BiocStyle::html_document")) ## Extract the R code library("knitr") knit("derfinder-quickstart.Rmd", tangle = TRUE) ``` Date the vignette was generated. ```{r reproducibility1, echo=FALSE} ## Date the vignette was generated Sys.time() ``` Wallclock time spent generating the vignette. ```{r reproducibility2, echo=FALSE} ## Processing time in seconds totalTime <- diff(c(startTime, Sys.time())) round(totalTime, digits = 3) ``` `R` session information. ```{r reproducibility3, echo=FALSE} ## Session info library("sessioninfo") options(width = 120) session_info() ``` # Bibliography This vignette was generated using `r Biocpkg('BiocStyle')` `r Citep(bib[['BiocStyle']])` with `r CRANpkg('knitr')` `r Citep(bib[['knitr']])` and `r CRANpkg('rmarkdown')` `r Citep(bib[['rmarkdown']])` running behind the scenes. Citations made with `r CRANpkg('RefManageR')` `r Citep(bib[['RefManageR']])`. ```{r vignetteBiblio, results = 'asis', echo = FALSE, warning = FALSE, message = FALSE} ## Print bibliography PrintBibliography(bib, .opts = list(hyperlink = "to.doc", style = "html")) ```