derfinder advanced details and usage

If you wish, you can view this vignette online here.

Overview

This vignette explains in more detail the relationship between the different functions in derfinder (Collado-Torres, Frazee, Jaffe, and Leek, 2014) as well as add-on packages derfinderPlot (vignette), derfinderHelper (vignette) and regionReport (vignette). The vignette also includes some bash scripts for running derfinder in a cluster, although there are probably other ways to do so using only R scripts. For example, by using BatchJobsParam() from BiocParallel.

This vignette assumes that you have read through the introductory vignette.

Lets start by loading the derfinder package.

## Load libraries
library('derfinder')

If you explore the package, you will notice that all exported functions use camel case name and have aliases with underscore names. For example, analyzeChr() and analyze_chr(). You should use whichever naming style you prefer. Regrettably, implementing the same flexibility at the argument naming level is not straightforward.

Advanced arguments

Just like it is shown in the introductory vignette, you can find more information about the advanced arguments using advancedArg().

## URLs to advanced arguemtns
sapply(c('analyzeChr', 'loadCoverage'), advancedArg, browse = FALSE)
##                                                                                                                             analyzeChr 
## "https://github.com/search?utf8=%E2%9C%93&q=advanced_argument+filename%3Aanalyze+repo%3Alcolladotor%2Fderfinder+path%3A%2FR&type=Code" 
##                                                                                                                           loadCoverage 
##    "https://github.com/search?utf8=%E2%9C%93&q=advanced_argument+filename%3Aload+repo%3Alcolladotor%2Fderfinder+path%3A%2FR&type=Code"
## Set browse = TRUE if you want to open them in your browser

These arguments are options we expect not all users will want to change and which would otherwise make the help pages confusing to them. However, all arguments are documented in roxygen2 style in the source code.

Furthermore, note that using the ... argument allows you to specify some of the documented arguments. For example, you might want to control the maxClusterGap from findRegions() in the analyzeChr() call.

Non-human data

If you are working with data from an organism that is not Homo sapiens, then set the global options defining the species and the chrsStyle used. For example, if you are working with Arabidopsis Thaliana and the NCBI naming style, then set the options using the following code:

## Set global species and chrsStyle options
options(species = 'arabidopsis_thaliana')
options(chrsStyle = 'NCBI')

## Then proceed to load and analyze the data

Internally derfinder uses extendedMapSeqlevels() to use the appropriate chromosome naming style given a species in all functions involving chromosome names.

Further note that the argument subject from analyzeChr() is passed to bumphunter::annotateNearest(subject). So if you are using a genome different from hg19 remember to provide the appropriate annotation data or simply use analyzeChr(runAnnotation = FALSE). You might find the discussion Using bumphunter with non-human genomes useful.

Advanced loading data

Controlling loading from BAM files

If you are loading data from BAM files, you might want to specify some criteria to decide which reads to include or not. For example, your data might have been generated by a strand-specific protocol. You can do so by specifying the arguments of scanBamFlag() from Rsamtools.

You can also control whether to include or exclude bases with CIGAR string D (deletion from the reference) by setting the advanced argument drop.D = TRUE in your fullCoverage() or loadCoverage() call.

Unfiltered base-level coverage

Note that in most scenarios, the fullCov object illustrated in the introductory vignette can be large in memory. When making plots or calculating the region level coverage, we don't need the full information. In such situations, it might pay off to create a smaller version by loading only the required data. This can be achieved using the advanced argument which to fullCoverage() or loadCoverage().

However, it is important to consider that when reading the data from BAM files, a read might align partially inside the region of interest. By default such a read would be discarded and thus the base-level coverage would be lower than what it is in reality. The advanced argument protectWhich extends regions by 30 kbp (15 kbp each side) to help mitigate this issue.

We can illustrate this issue with the example data from derfinder. First, we load in the data and generate some regions of interest.

## Find some regions to work with
example('loadCoverage', 'derfinder')
## 
## ldCvrg> datadir <- system.file('extdata', 'genomeData', package='derfinder')
## 
## ldCvrg> files <- rawFiles(datadir = datadir, samplepatt = '*accepted_hits.bam$',
## ldCvrg+     fileterm = NULL)
## 
## ldCvrg> ## Shorten the column names
## ldCvrg> names(files) <- gsub('_accepted_hits.bam', '', names(files))
## 
## ldCvrg> ## Read and filter the data, only for 2 files
## ldCvrg> dataSmall <- loadCoverage(files = files[1:2], chr = '21', cutoff = 0)
## 2014-11-22 21:41:38 loadCoverage: finding chromosome lengths
## 2014-11-22 21:41:38 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/ERR009101_accepted_hits.bam
## 2014-11-22 21:41:39 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/ERR009102_accepted_hits.bam
## 2014-11-22 21:41:39 loadCoverage: applying the cutoff to the merged data
## 2014-11-22 21:41:39 filterData: originally there were 48129895 rows, now there are 454 rows. Meaning that 100 percent was filtered.
## 
## ldCvrg> ## Not run: 
## ldCvrg> ##D ## Export to BigWig files
## ldCvrg> ##D createBw(list('chr21' = dataSmall))
## ldCvrg> ##D 
## ldCvrg> ##D ## Load data from BigWig files
## ldCvrg> ##D dataSmall.bw <- loadCoverage(c(ERR009101 = 'ERR009101.bw', ERR009102 =
## ldCvrg> ##D     'ERR009102.bw'), chr = 'chr21')
## ldCvrg> ##D 
## ldCvrg> ##D ## Compare them
## ldCvrg> ##D mapply(function(x, y) { x - y }, dataSmall$coverage, dataSmall.bw$coverage)
## ldCvrg> ##D 
## ldCvrg> ##D ## Note that the only difference is the type of Rle (integer vs numeric) used
## ldCvrg> ##D ## to store the data.
## ldCvrg> ##D 
## ldCvrg> ## End(Not run)
## ldCvrg> 
## ldCvrg> 
## ldCvrg>
example('getRegionCoverage', 'derfinder')
## 
## gtRgnC> ## Obtain fullCov object
## gtRgnC> fullCov <- list('21'=genomeDataRaw$coverage)
## 
## gtRgnC> ## Assign chr lengths using hg19 information, use only first two regions
## gtRgnC> library('GenomicRanges')
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## 
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## 
## The following object is masked from 'package:stats':
## 
##     xtabs
## 
## The following objects are masked from 'package:base':
## 
##     Filter, Find, Map, Position, Reduce, anyDuplicated, append,
##     as.data.frame, as.vector, cbind, colnames, do.call,
##     duplicated, eval, evalq, get, intersect, is.unsorted, lapply,
##     mapply, match, mget, order, paste, pmax, pmax.int, pmin,
##     pmin.int, rank, rbind, rep.int, rownames, sapply, setdiff,
##     sort, table, tapply, union, unique, unlist, unsplit
## 
## Loading required package: S4Vectors
## Loading required package: stats4
## Loading required package: IRanges
## Loading required package: GenomeInfoDb
## 
## gtRgnC> data(hg19Ideogram, package = 'biovizBase', envir = environment())
## 
## gtRgnC> regions <- genomeRegions$regions[1:2]
## 
## gtRgnC> seqlengths(regions) <- seqlengths(hg19Ideogram)[names(seqlengths(regions))]
## 
## gtRgnC> ## Finally, get the region coverage
## gtRgnC> regionCov <- getRegionCoverage(fullCov=fullCov, regions=regions)
## extendedMapSeqlevels: sequence names mapped from NCBI to UCSC for species homo_sapiens
## 2014-11-22 21:41:39 getRegionCoverage: processing chr21
## 2014-11-22 21:41:40 getRegionCoverage: done processing chr21

Next, we load the coverage again using which but without any padding. We can see how the coverage is not the same by looking at the maximum coverage for each sample.

## Illustrate reading data from a set of regions
test <- loadCoverage(files = files, chr = '21', cutoff = NULL, which = regions, protectWhich = 0, fileStyle = 'NCBI')
## extendedMapSeqlevels: sequence names mapped from UCSC to NCBI for species homo_sapiens
## 2014-11-22 21:41:40 loadCoverage: finding chromosome lengths
## 2014-11-22 21:41:40 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/ERR009101_accepted_hits.bam
## 2014-11-22 21:41:40 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/ERR009102_accepted_hits.bam
## 2014-11-22 21:41:40 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/ERR009105_accepted_hits.bam
## 2014-11-22 21:41:40 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/ERR009107_accepted_hits.bam
## 2014-11-22 21:41:40 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/ERR009108_accepted_hits.bam
## 2014-11-22 21:41:40 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/ERR009112_accepted_hits.bam
## 2014-11-22 21:41:40 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/ERR009115_accepted_hits.bam
## 2014-11-22 21:41:40 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/ERR009116_accepted_hits.bam
## 2014-11-22 21:41:40 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/ERR009131_accepted_hits.bam
## 2014-11-22 21:41:41 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/ERR009138_accepted_hits.bam
## 2014-11-22 21:41:41 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/ERR009144_accepted_hits.bam
## 2014-11-22 21:41:41 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/ERR009145_accepted_hits.bam
## 2014-11-22 21:41:41 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/ERR009148_accepted_hits.bam
## 2014-11-22 21:41:41 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/ERR009151_accepted_hits.bam
## 2014-11-22 21:41:41 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/ERR009152_accepted_hits.bam
## 2014-11-22 21:41:41 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/ERR009153_accepted_hits.bam
## 2014-11-22 21:41:41 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/ERR009159_accepted_hits.bam
## 2014-11-22 21:41:41 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/ERR009161_accepted_hits.bam
## 2014-11-22 21:41:41 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/ERR009163_accepted_hits.bam
## 2014-11-22 21:41:41 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/ERR009164_accepted_hits.bam
## 2014-11-22 21:41:41 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/ERR009167_accepted_hits.bam
## 2014-11-22 21:41:41 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/SRR031812_accepted_hits.bam
## 2014-11-22 21:41:42 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/SRR031835_accepted_hits.bam
## 2014-11-22 21:41:42 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/SRR031867_accepted_hits.bam
## 2014-11-22 21:41:42 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/SRR031868_accepted_hits.bam
## 2014-11-22 21:41:42 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/SRR031900_accepted_hits.bam
## 2014-11-22 21:41:42 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/SRR031904_accepted_hits.bam
## 2014-11-22 21:41:42 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/SRR031914_accepted_hits.bam
## 2014-11-22 21:41:42 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/SRR031936_accepted_hits.bam
## 2014-11-22 21:41:42 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/SRR031958_accepted_hits.bam
## 2014-11-22 21:41:42 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/SRR031960_accepted_hits.bam
## 2014-11-22 21:41:42 loadCoverage: applying the cutoff to the merged data
## 2014-11-22 21:41:42 filterData: originally there were 48129895 rows, now there are 48129895 rows. Meaning that 0 percent was filtered.
## Some reads were ignored and thus the coverage is lower as can be seen below:
sapply(test$coverage, max) - sapply(genomeDataRaw$coverage, max)
## ERR009101 ERR009102 ERR009105 ERR009107 ERR009108 ERR009112 ERR009115 
##         0         0         0         0        -1         0        -1 
## ERR009116 ERR009131 ERR009138 ERR009144 ERR009145 ERR009148 ERR009151 
##        -2        -2        -2        -1        -1        -3        -3 
## ERR009152 ERR009153 ERR009159 ERR009161 ERR009163 ERR009164 ERR009167 
##         0        -3        -3        -3        -1        -3        -3 
## SRR031812 SRR031835 SRR031867 SRR031868 SRR031900 SRR031904 SRR031914 
##         0        -1         0         0         0        -1         0 
## SRR031936 SRR031958 SRR031960 
##         0         0         0

When we re-load the data using some padding to the regions, we find that the coverage matches at all the bases.

## Illustrate reading data from a set of regions

test2 <- loadCoverage(files = files, chr = '21', cutoff = NULL, which = 
regions, protectWhich = 3e4, fileStyle = 'NCBI')
## extendedMapSeqlevels: sequence names mapped from UCSC to NCBI for species homo_sapiens
## 2014-11-22 21:41:43 loadCoverage: finding chromosome lengths
## 2014-11-22 21:41:43 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/ERR009101_accepted_hits.bam
## 2014-11-22 21:41:43 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/ERR009102_accepted_hits.bam
## 2014-11-22 21:41:43 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/ERR009105_accepted_hits.bam
## 2014-11-22 21:41:43 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/ERR009107_accepted_hits.bam
## 2014-11-22 21:41:43 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/ERR009108_accepted_hits.bam
## 2014-11-22 21:41:43 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/ERR009112_accepted_hits.bam
## 2014-11-22 21:41:43 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/ERR009115_accepted_hits.bam
## 2014-11-22 21:41:43 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/ERR009116_accepted_hits.bam
## 2014-11-22 21:41:43 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/ERR009131_accepted_hits.bam
## 2014-11-22 21:41:43 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/ERR009138_accepted_hits.bam
## 2014-11-22 21:41:43 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/ERR009144_accepted_hits.bam
## 2014-11-22 21:41:43 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/ERR009145_accepted_hits.bam
## 2014-11-22 21:41:43 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/ERR009148_accepted_hits.bam
## 2014-11-22 21:41:44 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/ERR009151_accepted_hits.bam
## 2014-11-22 21:41:44 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/ERR009152_accepted_hits.bam
## 2014-11-22 21:41:44 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/ERR009153_accepted_hits.bam
## 2014-11-22 21:41:44 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/ERR009159_accepted_hits.bam
## 2014-11-22 21:41:44 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/ERR009161_accepted_hits.bam
## 2014-11-22 21:41:44 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/ERR009163_accepted_hits.bam
## 2014-11-22 21:41:44 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/ERR009164_accepted_hits.bam
## 2014-11-22 21:41:44 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/ERR009167_accepted_hits.bam
## 2014-11-22 21:41:44 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/SRR031812_accepted_hits.bam
## 2014-11-22 21:41:44 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/SRR031835_accepted_hits.bam
## 2014-11-22 21:41:44 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/SRR031867_accepted_hits.bam
## 2014-11-22 21:41:44 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/SRR031868_accepted_hits.bam
## 2014-11-22 21:41:44 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/SRR031900_accepted_hits.bam
## 2014-11-22 21:41:44 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/SRR031904_accepted_hits.bam
## 2014-11-22 21:41:44 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/SRR031914_accepted_hits.bam
## 2014-11-22 21:41:45 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/SRR031936_accepted_hits.bam
## 2014-11-22 21:41:45 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/SRR031958_accepted_hits.bam
## 2014-11-22 21:41:45 loadCoverage: loading BAM file /tmp/RtmpU2CQy2/Rinst77d16c347744/derfinder/extdata/genomeData/SRR031960_accepted_hits.bam
## 2014-11-22 21:41:45 loadCoverage: applying the cutoff to the merged data
## 2014-11-22 21:41:45 filterData: originally there were 48129895 rows, now there are 48129895 rows. Meaning that 0 percent was filtered.
## Adding some padding to the regions helps get the same coverage
identical(sapply(test2$coverage, max), sapply(genomeDataRaw$coverage, max))
## [1] TRUE
## A more detailed test reveals that the coverage matches at every base
all(mapply(function(x, y) { identical(x, y) }, test2$coverage, genomeDataRaw$coverage))
## [1] TRUE

How much padding you need to use will depend on your specific data set, and you might be comfortable getting approximately the same coverage values for the sake of greatly reducing the memory resources needed.

Input files in a different naming style

If you are under the case where you like to use a specific chromosome naming style but the raw data files use another one, you might need to use the fileStyle argument.

For example, you could be working with Homo sapiens data and your preferred naming style is UCSC (chr1, chr2, ..., chrX, chrY) but the raw data uses NCBI style names (1, 2, ..., X, Y). In that case, use fullCoverage(fileStyle = 'NCBI') or loadCoverage(fileStyle = 'NCBI') depending if you are loading one chromosome or multiple at a time.

Loading data in chunks

If you prefer to do so, fullCoverage() and loadCoverage() can load the data of a chromosome in chunks using GenomicFiles. This is controlled by whether you specify the tilewidth argument.

Notice that you might run into slight coverage errors near the borders of the tiles for the same reason that was illustrated previously when loading specific regions.

This approach is not necessarily more efficient and can be significantly time consuming if you use a small tilewidth.

Flow charts

DER analysis flow chart

The following figure illustrates how most of derfinder's functions interact when performing a base-level differential expression analysis by calculating base-level F-statistics.

DER flow

Flow chart of the different processing steps (black boxes) that can be carried out using derfinder and the functions that perform these actions (in red). Input and output is shown in green boxes. Functions in blue are those applied to the results from multiple chromosomes (mergeResults() and derfinderReport). regionReport functions are shown in orange while derfinderPlot functions are shown in dark purple. Purple dotted arrow marks functions that require unfiltered base-level coverage.

analyzeChr() flow chart

analyzeChr flow

This figure shows in more detail the processing flow in analyzeChr(), which is the main function for identifying candidate differentially expressed regions (DERs) from the base-level F-statistics.

Many fine-tunning arguments can be passed to analyzeChr() to feed into the other functions. For example, you might want to use a smaller chunksize when pre-processing the coverage data (the default is 5 million): specially if you have hundreds of samples.

Another useful argument is scalefac (by default it's 32) which controls the scaling factor to use before the log2 transformation.

Furthermore, you might want to specify maxClusterGap to control the maximum gap between two regions before they are considered to be part of the same cluster.

regionMatrix() flow chart

regionMatrix flow

The above figure shows the functions used internally by regionMatrix() and processing steps. Overall, it is much simpler than analyzeChr().

Functions that use multiple cores

Currently, the following functions can use multiple cores, several of which are called inside analyzeChr().

All parallel operations use SnowParam() from BiocParallel when more than 1 core is being used. Otherwise, SerialParam() is used. Note that if you prefer to specify other types of parallelization you can do so by specifying the BPPARAM.custom advanced argument.

Because SnowParam() requires R to load the necessary packages on each worker, the key function fstats.apply() was isolated in the derfinderHelper package. This package has much faster loading speeds than derfinder which greatly impacts performance on cases where the actual step of calculating the F-statistics is fast.

You may prefer to use MulticoreParam() described in the BiocParallel vignette. In that case, when using these functions use BPPARAM.custom = MulticoreParam(workers = x) where x is the number of cores you want to use. Note that in some systems, as is the case of the cluster used by derfinder's developers, the system tools for assessing memory usage can be misleading, thus resulting in much higher memory loads when using MulticoreParam() instead of the default SnowParam().

Project organization

For each project, we recommend the following organization.

ProjectDir
|-derCoverageInfo
|-derAnalysis
|---analysis01
|---analysis02

That is, a main project directory with two initial directories. One for storing the coverage data, and one for storing each analysis: you might explore different models, cutoffs or other parameters.

You can then use fullCoverage(), save the result and also save the filtered coverage information for each chromosome separately. Doing so will result in the following structure.

ProjectDir
|-derCoverageInfo
|---Chr1Cov.Rdata
|---Chr2Cov.Rdata
...
|---ChrYCov.Rdata
|---fullCov.Rdata
|-derAnalysis
|---analysis01
|---analysis02

Next, you can use analyzeChr() for each of the chromosomes of a specific analysis (say analysis01). Doing so will create several Rdata files per chromosome as shown below. bash scripts can be useful if you wish to submit one cluster job per chromosome. In general, you will use the same model and group information for each chromosome, so saving the information can be useful.

ProjectDir
|-derCoverageInfo
|---Chr1Cov.Rdata
|---Chr2Cov.Rdata
...
|---ChrYCov.Rdata
|---fullCov.Rdata
|-derAnalysis
|---analysis01
|-----models.Rdata
|-----groupInfo.Rdata
|-----chr1/
|-------chunksDir/
|-------logs/
|-------annotation.Rdata
|-------coveragePrep.Rdata
|-------fstats.Rdata
|-------optionsStats.Rdata
|-------regions.Rdata
|-------timeinfo.Rdata
|-----chr2/
...
|-----chrY/
|---analysis02

Then use mergeResults() to pool together the results from all the chromosomes for a given analysis (here analysis01).

ProjectDir
|-derCoverageInfo
|---Chr1Cov.Rdata
|---Chr2Cov.Rdata
...
|---ChrYCov.Rdata
|---fullCov.Rdata
|-derAnalysis
|---analysis01
|-----logs/
|-----fullAnnotatedRegions.Rdata
|-----fullFstats.Rdata
|-----fullNullSummary.Rdata
|-----fullRegions.Rdata
|-----fullTime.Rdata
|-----optionsMerge.Rdata
|-----chr1/
|-------chunksDir/
|-------logs/
|-------annotation.Rdata
|-------coveragePrep.Rdata
|-------fstats.Rdata
|-------optionsStats.Rdata
|-------regions.Rdata
|-------timeinfo.Rdata
|-----chr2/
...
|-----chrY/
|---analysis02

Finally, you might want to use derfinderReport() from regionReport to create a HTML report of the results.

bash scripts

For interacting between bash and R we have found quite useful the getopt package. Here we include an example R script that is controlled by a bash script which submits a job for each chromosome to analyze for a given analysis.

The two files, derfinderAnalysis.R and derAnalysis.sh should live under the derAnalysis directory.

ProjectDir
|-derCoverageInfo
|---Chr1Cov.Rdata
|---Chr2Cov.Rdata
...
|---ChrYCov.Rdata
|---fullCov.Rdata
|-derAnalysis
|---derfinder-Analysis.R
|---derAnalysis.sh
|---analysis01
|---analysis02

Then, you can simply use:

cd /ProjectDir/derAnalysis
sh derAnalysis.sh analysis01

to run analyzeChr() on all your chromosomes.

derfinder-analysis.R

## Run derfinder's analysis steps with timing info

## Load libraries
library("getopt")

## Available at https://github.com/lcolladotor/derfinder
library("derfinder")

## Specify parameters
spec <- matrix(c(
    'DFfile', 'd', 1, "character", "path to the .Rdata file with the results from loadCoverage()",
    'chr', 'c', 1, "character", "Chromosome under analysis. Use X instead of chrX.",
    'mcores', 'm', 1, "integer", "Number of cores",
    'verbose' , 'v', 2, "logical", "Print status updates",
    'help' , 'h', 0, "logical", "Display help"
), byrow=TRUE, ncol=5)
opt <- getopt(spec)

## Testing the script
test <- FALSE
if(test) {
    ## Speficy it using an interactive R session and testing
    test <- TRUE
}

## Test values
if(test){
    opt <- NULL
    opt$DFfile <- "/ProjectDir/derCoverageInfo/chr21Cov.Rdata"
    opt$chr <- "21"
    opt$mcores <- 1
    opt$verbose <- NULL
}

## if help was asked for print a friendly message
## and exit with a non-zero error code
if (!is.null(opt$help)) {
    cat(getopt(spec, usage=TRUE))
    q(status=1)
}

## Default value for verbose = TRUE
if (is.null(opt$verbose)) opt$verbose <- TRUE

if(opt$verbose) message("Loading Rdata file with the output from loadCoverage()")
load(opt$DFfile)

## Make it easy to use the name later. Here I'm assuming the names were generated using output='auto' in loadCoverage()
eval(parse(text=paste0("data <- ", "chr", opt$chr, "CovInfo")))
eval(parse(text=paste0("rm(chr", opt$chr, "CovInfo)")))

## Just for testing purposes
if(test) {
    tmp <- data
    tmp$coverage <- tmp$coverage[1:1e6, ]
    library("IRanges")
    tmp$position[which(tmp$pos)[1e6 + 1]:length(tmp$pos)] <- FALSE
    data <- tmp
}

## Load the models
load("models.Rdata")

## Load group information
load("groupInfo.Rdata")


## Run the analysis with lowMemDir
analyzeChr(chr=opt$chr, coverageInfo=data, models=models, cutoffFstat=1e-06, cutoffType="theoretical", nPermute=1000, seeds=seq_len(1000), maxClusterGap=3000, groupInfo=groupInfo, subject="hg19", mc.cores=opt$mcores, lowMemDir=file.path(tempdir(), paste0("chr", opt$chr) , "chunksDir")), verbose=opt$verbose, chunksize=1e5)

## Done
if(opt$verbose) {
    print(proc.time())
    print(sessionInfo(), locale=FALSE)
}

Remember to modify the the script to fit your project.

derAnalysis.sh

#!/bin/sh

## Usage
# sh derAnalysis.sh analysis01

# Directories
MAINDIR=/ProjectDir
WDIR=${MAINDIR}/derAnalysis
DATADIR=${MAINDIR}/derCoverageInfo

# Define variables
SHORT='derA-01'
PREFIX=$1

# Construct shell files
for chrnum in 22 21 Y 20 19 18 17 16 15 14 13 12 11 10 9 8 X 7 6 5 4 3 2 1
do
    echo "Creating script for chromosome ${chrnum}"

    if [[ ${chrnum} == "Y" ]]
    then
        CORES=2
    else
        CORES=6
    fi

    chr="chr${chrnum}"
    outdir="${PREFIX}/${chr}"
    sname="${SHORT}.${PREFIX}.${chr}"
    cat > ${WDIR}/.${sname}.sh <<EOF
#!/bin/bash
#$ -cwd
#$ -m e
#$ -l mem_free=120G,h_vmem=60G,h_fsize=10G
#$ -N ${sname}
#$ -pe local ${CORES}

echo "**** Job starts ****"
date

# Create output directory 
mkdir -p ${WDIR}/${outdir}
# Make logs directory
mkdir -p ${WDIR}/${outdir}/logs

# run derfinder-analysis.R
cd ${WDIR}/${PREFIX}/

# specific to our cluster
# see http://www.jhpce.jhu.edu/knowledge-base/environment-modules/
module load R/3.1.x
Rscript ${WDIR}/derfinder2-analysis.R -d "${DATADIR}/${chr}Cov.Rdata" -c "${chrnum}" -m ${CORES} -v TRUE

# Move log files into the logs directory
mv ${WDIR}/${sname}.* ${WDIR}/${outdir}/logs/

echo "**** Job ends ****"
date
EOF
    call="qsub .${sname}.sh"
    $call
done

Your cluster might specify memory requirements differently and you might need to use fewer or more cores depending on your data set.

Full example

A fully documented example is available at derfinderExample.

TODO: update the example to use derfinder version 0.99.0.

Summary

This vignette covered the most commonly used advanced arguments, details on how to load data, flow charts explaining the relationships between the functions, the recommended output organization, and example shell scripts for running the analysis.

Origins

This implementation of derfinder has its origins in Alyssa C. Frazee's derfinder (Frazee, Sabunciyan, Hansen, Irizarry, et al., 2014). The statistical methods and implementation by now are very different.

Citing derfinder

Please use:

## Citation info
citation('derfinder')
## 
## Collado-Torres L, Frazee AC, Jaffe AE and Leek JT (2014).
## _derfinder: Annotation-agnostic differential expression analysis
## of RNA-seq data at base-pair resolution_.
## https://github.com/lcolladotor/derfinder - R package version
## 1.0.10, .
## 
## Frazee AC, Sabunciyan S, Hansen KD, Irizarry RA and Leek JT
## (2014). "Differential expression analysis of RNA-seq data at
## single-base resolution." _Biostatistics_, *15 (3)*, pp. 413-426.
## , .

Reproducibility

This package was made possible thanks to:

Code for creating the vignette

## Create the vignette
library('knitrBootstrap') 

knitrBootstrapFlag <- packageVersion('knitrBootstrap') < '1.0.0'
if(knitrBootstrapFlag) {
    ## CRAN version
    library('knitrBootstrap')
    system.time(knit_bootstrap('derfinderAdvanced.Rmd', chooser=c('boot', 'code'), show_code = TRUE))
    unlink('derfinder.md')
} else {
    ## GitHub version
    library('rmarkdown')
    system.time(render('derfinderAdvanced.Rmd', 'knitrBootstrap::bootstrap_document'))
}
## Note: if you prefer the knitr version use:
# library('rmarkdown')
# system.time(render('derfinderAdvanced.Rmd', 'html_document'))
## Clean up
file.remove('derfinderAdvRef.bib')

## Extract the R code
library('knitr')
knit('derfinderAdvanced.Rmd', tangle = TRUE)

Date the vignette was generated.

## [1] "2014-11-22 21:41:45 PST"

Wallclock time spent generating the vignette.

## Time difference of 21.878 secs

R session information.

## Session info---------------------------------------------------------------------------------------
##  setting  value                       
##  version  R version 3.1.2 (2014-10-31)
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language en_US:                      
##  collate  C                           
##  tz       
## Packages-------------------------------------------------------------------------------------------
##  package           * version  date       source        
##  AnnotationDbi       1.28.1   2014-11-23 Bioconductor  
##  BBmisc              1.8      2014-10-30 CRAN (R 3.1.2)
##  BatchJobs           1.5      2014-10-30 CRAN (R 3.1.2)
##  Biobase             2.26.0   2014-11-23 Bioconductor  
##  BiocGenerics      * 0.12.1   2014-11-23 Bioconductor  
##  BiocParallel        1.0.0    2014-11-23 Bioconductor  
##  Biostrings          2.34.0   2014-11-23 Bioconductor  
##  DBI                 0.3.1    2014-09-24 CRAN (R 3.1.2)
##  Formula             1.1.2    2014-07-13 CRAN (R 3.1.2)
##  GenomeInfoDb      * 1.2.3    2014-11-23 Bioconductor  
##  GenomicAlignments   1.2.1    2014-11-23 Bioconductor  
##  GenomicFeatures     1.18.2   2014-11-23 Bioconductor  
##  GenomicFiles        1.2.0    2014-11-23 Bioconductor  
##  GenomicRanges     * 1.18.3   2014-11-23 Bioconductor  
##  Hmisc               3.14.6   2014-11-22 CRAN (R 3.1.2)
##  IRanges           * 2.0.0    2014-11-23 Bioconductor  
##  Matrix              1.1.4    2014-06-15 CRAN (R 3.1.2)
##  R.methodsS3         1.6.1    2014-01-05 CRAN (R 3.1.2)
##  RColorBrewer        1.0.5    2011-06-17 CRAN (R 3.1.2)
##  RCurl               1.95.4.3 2014-07-29 CRAN (R 3.1.2)
##  RJSONIO             1.3.0    2014-07-28 CRAN (R 3.1.2)
##  RSQLite             1.0.0    2014-10-25 CRAN (R 3.1.2)
##  Rcpp                0.11.3   2014-09-29 CRAN (R 3.1.2)
##  RefManageR          0.8.40   2014-10-29 CRAN (R 3.1.2)
##  Rsamtools           1.18.2   2014-11-23 Bioconductor  
##  S4Vectors         * 0.4.0    2014-11-23 Bioconductor  
##  XML                 3.98.1.1 2013-06-20 CRAN (R 3.1.2)
##  XVector             0.6.0    2014-11-23 Bioconductor  
##  acepack             1.3.3.3  2013-05-03 CRAN (R 3.1.2)
##  base64enc           0.1.2    2014-06-26 CRAN (R 3.1.2)
##  bibtex              0.3.6    2013-07-29 CRAN (R 3.1.2)
##  biomaRt             2.22.0   2014-11-23 Bioconductor  
##  bitops              1.0.6    2013-08-17 CRAN (R 3.1.2)
##  brew                1.0.6    2011-04-13 CRAN (R 3.1.2)
##  bumphunter          1.6.0    2014-11-23 Bioconductor  
##  checkmate           1.5.0    2014-10-19 CRAN (R 3.1.2)
##  cluster             1.15.3   2014-09-04 CRAN (R 3.1.2)
##  codetools           0.2.9    2014-08-21 CRAN (R 3.1.2)
##  derfinder         * 1.0.10   2014-11-23 Bioconductor  
##  derfinderHelper     1.0.4    2014-11-23 Bioconductor  
##  devtools          * 1.6.1    2014-10-07 CRAN (R 3.1.2)
##  digest              0.6.4    2013-12-03 CRAN (R 3.1.2)
##  doRNG               1.6      2014-03-07 CRAN (R 3.1.2)
##  evaluate            0.5.5    2014-04-29 CRAN (R 3.1.2)
##  fail                1.2      2013-09-19 CRAN (R 3.1.2)
##  foreach             1.4.2    2014-04-11 CRAN (R 3.1.2)
##  foreign             0.8.61   2014-03-28 CRAN (R 3.1.2)
##  formatR             1.0      2014-08-25 CRAN (R 3.1.2)
##  httr                0.5      2014-09-02 CRAN (R 3.1.2)
##  iterators           1.0.7    2014-04-11 CRAN (R 3.1.2)
##  knitcitations     * 1.0.4    2014-10-28 CRAN (R 3.1.2)
##  knitr               1.8      2014-11-11 CRAN (R 3.1.2)
##  knitrBootstrap    * 0.9.0    2013-10-17 CRAN (R 3.1.2)
##  lattice             0.20.29  2014-04-04 CRAN (R 3.1.2)
##  latticeExtra        0.6.26   2013-08-15 CRAN (R 3.1.2)
##  locfit              1.5.9.1  2013-04-20 CRAN (R 3.1.2)
##  lubridate           1.3.3    2013-12-31 CRAN (R 3.1.2)
##  markdown            0.7.4    2014-08-24 CRAN (R 3.1.2)
##  matrixStats         0.10.3   2014-10-15 CRAN (R 3.1.2)
##  memoise             0.2.1    2014-04-22 CRAN (R 3.1.2)
##  nnet                7.3.8    2014-03-28 CRAN (R 3.1.2)
##  pkgmaker            0.22     2014-05-14 CRAN (R 3.1.2)
##  plyr                1.8.1    2014-02-26 CRAN (R 3.1.2)
##  qvalue              1.40.0   2014-11-23 Bioconductor  
##  registry            0.2      2012-01-24 CRAN (R 3.1.2)
##  rngtools            1.2.4    2014-03-06 CRAN (R 3.1.2)
##  rpart               4.1.8    2014-03-28 CRAN (R 3.1.2)
##  rstudioapi          0.1      2014-03-27 CRAN (R 3.1.2)
##  rtracklayer         1.26.2   2014-11-23 Bioconductor  
##  sendmailR           1.2.1    2014-09-21 CRAN (R 3.1.2)
##  stringr             0.6.2    2012-12-06 CRAN (R 3.1.2)
##  survival            2.37.7   2014-01-22 CRAN (R 3.1.2)
##  xtable              1.7.4    2014-09-12 CRAN (R 3.1.2)
##  zlibbioc            1.12.0   2014-11-23 Bioconductor

Bibliography

This vignette was generated using knitrBootstrap (Hester, 2013) with knitr (Xie, 2014) and rmarkdown (Allaire, McPherson, Xie, Wickham, et al., 2014) running behind the scenes.

Citations made with knitcitations (Boettiger, 2014).

[1] J. Allaire, J. McPherson, Y. Xie, H. Wickham, et al. rmarkdown: Dynamic Documents for R. R package version 0.3.3. 2014. URL: http://CRAN.R-project.org/package=rmarkdown.

[2] S. Arora, M. Morgan, M. Carlson and H. Pages. GenomeInfoDb: Utilities for manipulating chromosome and other 'seqname' identifiers. R package version 1.2.3. 2014.

[3] C. Boettiger. knitcitations: Citations for knitr markdown files. R package version 1.0.4. 2014. URL: http://CRAN.R-project.org/package=knitcitations.

[4] M. Carlson. TxDb.Hsapiens.UCSC.hg19.knownGene: Annotation package for TxDb object(s). R package version 3.0.0. 2014.

[5] L. Collado-Torres, A. C. Frazee, A. E. Jaffe and J. T. Leek. derfinder: Annotation-agnostic differential expression analysis of RNA-seq data at base-pair resolution. https://github.com/lcolladotor/derfinder - R package version 1.0.10. 2014. URL: http://www.bioconductor.org/packages/release/bioc/html/derfinder.html.

[6] L. Collado-Torres, A. E. Jaffe and J. T. Leek. derfinderHelper: derfinder helper package. https://github.com/lcolladotor/derfinderHelper - R package version 1.0.4. 2014. URL: http://www.bioconductor.org/packages/release/bioc/html/derfinderHelper.html.

[7] A. Dabney and J. D. Storey. qvalue: Q-value estimation for false discovery rate control. R package version 1.40.0. 2014.

[8] A. C. Frazee, S. Sabunciyan, K. D. Hansen, R. A. Irizarry, et al. “Differential expression analysis of RNA-seq data at single-base resolution”. In: Biostatistics 15 (3) (2014), pp. 413-426. DOI: 10.1093/biostatistics/kxt053. URL: http://biostatistics.oxfordjournals.org/content/15/3/413.long.

[9] J. Hester. knitrBootstrap: Knitr Bootstrap framework. R package version 0.9.0. 2013. URL: http://CRAN.R-project.org/package=knitrBootstrap.

[10] R. A. Irizarry, M. Ayree, K. D. Hansen and H. C. Hansen. bumphunter: Bump Hunter. R package version 1.6.0. 2014. URL: https://github.com/ririzarr/bumphunter.

[11] Jaffe, A. E, Murakami, Peter, et al. “Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies”. In: International Journal of Epidemiology (2012).

[12] F. E. H. Jr, with contributions from Charles Dupont and many others. Hmisc: Harrell Miscellaneous. R package version 3.14-6. 2014. URL: http://CRAN.R-project.org/package=Hmisc.

[13] M. Lawrence, R. Gentleman and V. Carey. “rtracklayer: an R package for interfacing with genome browsers”. In: Bioinformatics 25 (2009), pp. 1841-1842. DOI: 10.1093/bioinformatics/btp328. URL: http://bioinformatics.oxfordjournals.org/content/25/14/1841.abstract.

[14] M. Lawrence, W. Huber, H. Pagès, P. Aboyoun, et al. “Software for Computing and Annotating Genomic Ranges”. In: PLoS Computational Biology 9 (8 2013). DOI: 10.1371/journal.pcbi.1003118. URL: http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003118}.

[15] M. Morgan, M. Lang and R. Thompson. BiocParallel: Bioconductor facilities for parallel evaluation. R package version 1.0.0. 2014.

[16] M. Morgan, H. Pagès, V. Obenchain and N. Hayden. Rsamtools: Binary alignment (BAM), FASTA, variant call (BCF), and tabix file import. R package version 1.18.2. 2014. URL: http://bioconductor.org/packages/release/bioc/html/Rsamtools.html.

[17] V. Obenchain, M. Love and M. Morgan. GenomicFiles: Distributed computing by file or by range. R package version 1.2.0. 2014.

[18] H. Pages, M. Carlson, S. Falcon and N. Li. AnnotationDbi: Annotation Database Interface. R package version 1.28.1. 2014.

[19] H. Pages, M. Lawrence and P. Aboyoun. S4Vectors: S4 implementation of vectors and lists. R package version 0.4.0. 2014.

[20] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria, 2014. URL: http://www.R-project.org/.

[21] H. Wickham. ggplot2: elegant graphics for data analysis. Springer New York, 2009. ISBN: 978-0-387-98140-6. URL: http://had.co.nz/ggplot2/book.

[22] H. Wickham. “testthat: Get Started with Testing”. In: The R Journal 3 (2011), pp. 5–10. URL: http://journal.r-project.org/archive/2011-1/RJournal_2011-1_Wickham.pdf.

[23] H. Wickham and W. Chang. devtools: Tools to make developing R code easier. R package version 1.6.1. 2014. URL: http://CRAN.R-project.org/package=devtools.

[24] Y. Xie. “knitr: A Comprehensive Tool for Reproducible Research in R”. In: Implementing Reproducible Computational Research. Ed. by V. Stodden, F. Leisch and R. D. Peng. ISBN 978-1466561595. Chapman and Hall/CRC, 2014. URL: http://www.crcpress.com/product/isbn/9781466561595.

[25] T. Yin, M. Lawrence and D. Cook. biovizBase: Basic graphic utilities for visualization of genomic data. R package version 1.14.0. 2014.