--- title: "Introduction to derfinderHelper" author: - name: Leonardo Collado-Torres affiliation: - &libd Lieber Institute for Brain Development, Johns Hopkins Medical Campus - &ccb Center for Computational Biology, Johns Hopkins University email: leo.collado@libd.org output: BiocStyle::html_document: self_contained: yes toc: true toc_float: true toc_depth: 2 code_folding: show date: "`r doc_date()`" package: "`r pkg_ver('derfinderHelper')`" vignette: > %\VignetteIndexEntry{Introduction to derfinderHelper} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Basics ## Install `r Biocpkg('derfinderHelper')` `R` is an open-source statistical environment which can be easily modified to enhance its functionality via packages. `r Biocpkg('derfinderHelper')` is a `R` package available via the [Bioconductor](http://bioconductor/packages/derfinderHelper) repository for packages. `R` can be installed on any operating system from [CRAN](https://cran.r-project.org/) after which you can install `r Biocpkg('derfinderHelper')` by using the following commands in your `R` session: ```{r 'installDer', eval = FALSE} ## try http:// if https:// URLs are not supported source("https://bioconductor.org/biocLite.R") biocLite("derfinderHelper") ## Check that you have a valid Bioconductor installation biocValid() ``` ## Required knowledge `r Biocpkg('derfinderHelper')` is based on many other packages and in particular in those that have implemented the infrastructure needed for dealing with RNA-seq data. A `r Biocpkg('derfinderHelper')` user is not expected to deal with those packages directly but will need to be familiar with `r Biocpkg('derfinder')`. If you are asking yourself the question "Where do I start using Bioconductor?" you might be interested in [this blog post](http://lcolladotor.github.io/2014/10/16/startBioC/#.VkOKbq6rRuU). ## Asking for help As package developers, we try to explain clearly how to use our packages and in which order to use the functions. But `R` and `Bioconductor` have a steep learning curve so it is critical to learn where to ask for help. The blog post quoted above mentions some but we would like to highlight the [Bioconductor support site](https://support.bioconductor.org/) as the main resource for getting help: remember to use the `derfinder` or `derfinderHelper` tags and check [the older posts](https://support.bioconductor.org/t/derfinder/). Other alternatives are available such as creating GitHub issues and tweeting. However, please note that if you want to receive help you should adhere to the [posting guidelines](http://www.bioconductor.org/help/support/posting-guide/). It is particularly critical that you provide a small reproducible example and your session information so package developers can track down the source of the error. ## Citing `r Biocpkg('derfinderHelper')` We hope that `r Biocpkg('derfinderHelper')` will be useful for your research. Please use the following information to cite the package and the overall approach. Thank you! ```{r 'citation'} ## Citation info citation('derfinderHelper') ``` # Introduction to `r Biocpkg('derfinderHelper')` ```{r vignetteSetup, echo=FALSE, message=FALSE, warning = FALSE} ## Track time spent on making the vignette startTime <- Sys.time() ## Bib setup library('knitcitations') ## Load knitcitations with a clean bibliography cleanbib() cite_options(hyperlink = 'to.doc', citation_format = 'text', style = 'html') # Note links won't show for now due to the following issue # https://github.com/cboettig/knitcitations/issues/63 ## Write bibliography information bib <- c(knitcitations = citation('knitcitations'), derfinderHelper = citation('derfinderHelper')[1], BiocStyle = citation('BiocStyle'), knitr = citation('knitr')[3], rmarkdown = citation('rmarkdown'), BiocParallel = citation('BiocParallel'), R = citation(), IRanges = citation('IRanges'), Matrix = citation('Matrix'), S4Vectors = RefManageR::BibEntry(bibtype = 'manual', key = 'S4Vectors', author = 'Hervé Pagès and Michael Lawrence and Patrick Aboyoun', title = "S4Vectors: S4 implementation of vector-like and list-like objects", year = 2017, doi = '10.18129/B9.bioc.S4Vectors'), devtools = citation('devtools'), testthat = citation('testthat')) write.bibtex(bib, file = 'derfinderHelperRef.bib') ``` `r Biocpkg('derfinderHelper')` `r citep(bib[['derfinderHelper']])` is a small package that was created to speed up the F-statistics approach implemented in the parent package `r Biocpkg('derfinder')`. It contains a single function, `fstats.apply()`, which is used to calculate the F-statistics for a given data matrix, null and an alternative models. The data is generally arranged in an matrix where the rows ($n$) are the genomic features of interest (gene-level summaries, exon-level summaries, or base-level data) and the columns ($m$) represent the samples. The other two main arguments for `fstats.apply()` are the null and alternative model matrices which are $m \times p_0$ and $m \times p$ where $p_0$ is the number of covariates in the null model and $p$ is the number of covariates in the alternative model. The models have to be nested and thus by definition $p > p_0$. The end result is a vector of F-statistics with length $n$, which is run length encoded for memory saving purposes. Other arguments of `fstats.apply()` are related to flow in `r Biocpkg('derfinder')` such as the scaling factor (`scalefac`) used, whether to subset the data (`index`), and if the data was separated into chunks and saved to disk to lower the memory load (`lowMemDir`). Implementation-wise, `adjustF` is useful when the denominator of the F-statistic calculation is too small. Finally, `method` controls how will the F-statistics be calculated. * `Matrix` is the recommended option because it uses around half the memory load of `regular` and can be faster. Specially if the data was saved in this format previously by `r Biocpkg('derfinder')`. * `Rle` uses the least amount of memory but gets very slow as the number of samples increases. Thus making it less than ideal in several cases. * `regular` uses base `R` to calculate the F-statistics and can require a large amount of memory. This is noticeable when using several cores to run `fstats.apply()` on different portions of the data. The F-statistics for each feature $i$ are calculated using the following formula: $$ F_i = \frac{ (\text{RSS0}_i - \text{RSS1}_i)/(\text{df}_1 - \text{df}_0) }{ \text{adjustF} + (\text{RSS1}_i / (p - p_0 - \text{df_1}))} $$ # Example The following section walks through an example. However, in practice, you will probably not use this package directly and it will be used via `r Biocpkg('derfinder')`. ## Data First lets create an example data set where we have information for 1000 features and 16 samples where samples 1 to 4 are from group A, 5 to 8 from group B, 9 to 12 from group C, and 13 to 16 from group D. ```{r 'createData'} ## Create some toy data suppressPackageStartupMessages(library('IRanges')) set.seed(20140923) toyData <- DataFrame( 'sample1' = Rle(sample(0:10, 1000, TRUE)), 'sample2' = Rle(sample(0:10, 1000, TRUE)), 'sample3' = Rle(sample(0:10, 1000, TRUE)), 'sample4' = Rle(sample(0:10, 1000, TRUE)), 'sample5' = Rle(sample(0:15, 1000, TRUE)), 'sample6' = Rle(sample(0:15, 1000, TRUE)), 'sample7' = Rle(sample(0:15, 1000, TRUE)), 'sample8' = Rle(sample(0:15, 1000, TRUE)), 'sample9' = Rle(sample(0:20, 1000, TRUE)), 'sample10' = Rle(sample(0:20, 1000, TRUE)), 'sample11' = Rle(sample(0:20, 1000, TRUE)), 'sample12' = Rle(sample(0:20, 1000, TRUE)), 'sample13' = Rle(sample(0:100, 1000, TRUE)), 'sample14' = Rle(sample(0:100, 1000, TRUE)), 'sample15' = Rle(sample(0:100, 1000, TRUE)), 'sample16' = Rle(sample(0:100, 1000, TRUE)) ) ## Lets say that we have 4 groups group <- factor(rep(toupper(letters[1:4]), each = 4)) ## Note that some groups have higher coverage, we can adjust for this in the model sampleDepth <- sapply(toyData, sum) sampleDepth ``` ## Models Next we create the model matrices for our example data set. Lets say that we want to calculate F-statistics comparing the alternative hypothesis that the group coefficients are not 0 versus the null hypothesis that they are equal to 0, when adjusting for the sample depth. To do so, we create the nested models. ```{r 'createModels'} ## Build the model matrices mod <- model.matrix(~ sampleDepth + group) mod0 <- model.matrix(~ sampleDepth) ## Explore them mod mod0 ``` ## Get F-statistics Finally, we can calculate the F-statistics using `fstats.apply()`. ```{r 'calculateFstats'} library('derfinderHelper') fstats <- fstats.apply(data = toyData, mod = mod, mod0 = mod0, scalefac = 1) fstats ``` We can then proceed to use this information in `r Biocpkg('derfinder')` or in any way you like. # Details We created `r Biocpkg('derfinderHelper')` for calculating F-statistics using `SnowParam()` from `r Biocpkg('BiocParallel')` `r citep(bib[['BiocParallel']])`. Using this form of parallelization requires loading the necessary packages in the child processes. Because `r Biocpkg('derfinder')` takes a long time to load, we shipped off `fstats.apply()` to its own package to improve the speed of the calculations while retaining the memory advantages of `SnowParam()` over `MulticoreParam()`. Note that transforming the data from a `DataFrame` to a `dgCMatrix` takes some time, so the most efficient performance is achieved when the data is converted at the beginning instead of at every permutation calculation. This is done in `derfinder::preprocessCoverage()` when `lowMemDir` is specified. # Reproducibility This package was made possible thanks to: * R `r citep(bib[['R']])` * `r Biocpkg('IRanges')` `r citep(bib[['IRanges']])` * `r CRANpkg('Matrix')` `r citep(bib[['Matrix']])` * `r Biocpkg('S4Vectors')` `r citep(bib[['S4Vectors']])` * `r CRANpkg('devtools')` `r citep(bib[['devtools']])` * `r CRANpkg('knitcitations')` `r citep(bib[['knitcitations']])` * `r CRANpkg('knitr')` `r citep(bib[['knitr']])` * `r Biocpkg('BiocStyle')` `r citep(bib[['BiocStyle']])` * `r CRANpkg('rmarkdown')` `r citep(bib[['rmarkdown']])` * `r CRANpkg('testthat')` `r citep(bib[['testthat']])` Code for creating the vignette ```{r createVignette, eval=FALSE} ## Create the vignette library('rmarkdown') system.time(render('derfinderHelper.Rmd', 'BiocStyle::html_document')) ## Extract the R code library('knitr') knit('derfinderHelper.Rmd', tangle = TRUE) ``` ```{r createVignette2} ## Clean up file.remove('derfinderHelperRef.bib') ``` Date the vignette was generated. ```{r reproducibility1, echo=FALSE} ## Date the vignette was generated Sys.time() ``` Wallclock time spent generating the vignette. ```{r reproducibility2, echo=FALSE} ## Processing time in seconds totalTime <- diff(c(startTime, Sys.time())) round(totalTime, digits=3) ``` `R` session information. ```{r reproducibility3, echo=FALSE} ## Session info library('devtools') options(width = 120) session_info() ``` # Bibliography This vignette was generated using `r Biocpkg('BiocStyle')` `r citep(bib[['BiocStyle']])` with `r CRANpkg('knitr')` `r citep(bib[['knitr']])` and `r CRANpkg('rmarkdown')` `r citep(bib[['rmarkdown']])` running behind the scenes. Citations made with `r CRANpkg('knitcitations')` `r citep(bib[['knitcitations']])`. ```{r vignetteBiblio, results = 'asis', echo = FALSE, warning = FALSE, message = FALSE} ## Print bibliography bibliography() ```