1 Common Bioconductor work flows

This very open-ended topic points to some of the most prominent Bioconductor packages for sequence analysis. Use the opportunity in this lab to explore the package vignettes and help pages highlighted below; many of the material will be covered in greater detail in subsequent labs and lectures.

Domain-specific analysis – explore the landing pages, vignettes, and reference manuals of two or three of the following packages.

Important packages for analysis of differential expression include edgeR and DESeq2; both have excellent vignettes for exploration. Additional research methods embodied in Bioconductor packages can be discovered by visiting the biocViews web page, searching for the ‘DifferentialExpression’ view term, and narrowing the selection by searching for ‘RNA seq’ and similar.
Popular ChIP-seq packages include csaw and DiffBind for comparison of peaks across samples, ChIPQC for quality assessment, and ChIPseeker for annotating results (e.g., discovering nearby genes). What other ChIP-seq packages are listed on the biocViews page?
Working with called variants (VCF files) is facilitated by packages such as VariantAnnotation, VariantFiltering, ensemblVEP, and SomaticSignatures; packages for calling variants include, e.g., h5vc and VariantTools.
Several packages identify copy number variants from sequence data, including cn.mops; from the biocViews page, what other copy number packages are available? The CNTools package provides some useful facilities for comparison of segments across samples.
Microbiome and metagenomic analysis is facilitated by packages such as phyloseq and metagenomeSeq.
Metabolomics, chemoinformatics, image analysis, and many other high-throughput analysis domains are also represented in Bioconductor; explore these via biocViews and title searches.

Working with sequences, alignments, common web file formats, and raw data; these packages rely very heavily on the IRanges / GenomicRanges infrastructure.

The Biostrings package is used to represent DNA and other sequences, with many convenient sequence-related functions. Check out the functions documented on the help page ?consensusMatrix, for instance. Also check out the BSgenome package for working with whole genome sequences, e.g., ?"getSeq,BSgenome-method"
The GenomicAlignments package is used to input reads aligned to a reference genome. See for instance the ?readGAlignments help page and vigentte(package="GenomicAlignments", "summarizeOverlaps")
rtracklayer’s import and export functions can read in many common file types, e.g., BED, WIG, GTF, …, in addition to querying and navigating the UCSC genome browser. Check out the ?import page for basic usage.
The ShortRead and Rsamtools packages can be used for lower-level access to FASTQ and BAM files, respectively. Explore the ShortRead vignette and Scalable Genomics labs to see approaches to effectively processing the large files.

Visualization

The Gviz package provides great tools for visualizing local genomic coordinates and associated data.
epivizr drives the epiviz genome browser from within R; rtracklayer provides easy ways to transfer data to and manipulate UCSC browser sessions.
Additional packages include ggbio, OmicCircos, …

Alt Sequencing Ecosystem

2 Computation: Big Data

2.1 Efficient R code

The goal of this section is to highlight practices for writing correct, robust and efficient R code.

2.2 Priorities

Correct: consistent with hand-worked examples (identical(), all.equal())
Robust: supports realistic inputs, e.g., 0-length vectors, NA values, …
Simple: easy to understand next month; easy to describe what it does to a colleague; easy to spot logical errors; easy to enhance.
Fast, or at least reasonable given the speed of modern computers.

2.3 Strategies

Profile
- Look at the script to understand in general terms what it is doing.
- Step through the code to see how it is executed, and to gain an understanding of the speed of each line.
- Time evaluation of select lines or simple chunks of code with system.time() or the microbenchmark package.
- Profile the code with a tool that indicates how much time is spent in each function call or line – the built-in Rprof() function, or packages such as lineprof or aprof

Vectorize – operate on vectors, rather than explicit loops

x <- 1:10
log(x)     ## NOT for (i in seq_along(x)) x[i] <- log(x[i])

##  [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
##  [8] 2.0794415 2.1972246 2.3025851

Pre-allocate memory, then fill in the result

result <- numeric(10)
result[1] <- runif(1)
for (i in 2:length(result))
       result[i] <- runif(1) * result[i - 1]
result

##  [1] 7.074943e-01 1.018198e-01 6.819417e-02 2.886654e-02 1.453849e-02
##  [6] 3.508991e-03 9.874054e-04 6.821778e-04 1.967931e-06 1.836865e-06

Hoist common sub-expressions outside of repeated calculations, so that the sub-expression is only calculated once
- Simple, e.g., ‘hoist’ constant multiplications from a for loop
- Higher-level, e.g., use lm.fit() rather than repeatedly fitting the same design matrix.
Re-use existing, tested code
- Efficient implementations of common operations – tabulate(), rowSums() and friends, %in%, …
- Efficient domain-specific implementations, e.g., snpStats for GWAS linear models; limma for microarray linear models; edgeR, DESeq2 for negative binomial GLMs relevant to RNASeq.
- Reuse others’ work – DESeq2, GenomicRanges, Biostrings, …, dplyr, data.table, Rcpp

2.3.1 Case Study: Pre-allocate and vectorize

Here’s an obviously inefficient function:

f0 <- function(n, a=2) {
    ## stopifnot(is.integer(n) && (length(n) == 1) &&
    ##           !is.na(n) && (n > 0))
    result <- numeric()
    for (i in seq_len(n))
        result[[i]] <- a * log(i)
    result
}

Use system.time() to investigate how this algorithm scales with n, focusing on elapsed time.

system.time(f0(10000))

##    user  system elapsed 
##   0.004   0.000   0.007

n <- 1000 * seq(1, 20, 2)
t <- sapply(n, function(i) system.time(f0(i))[[3]])
plot(t ~ n, type="b")

Remember the current ‘correct’ value, and an approximate time

n <- 10000
system.time(expected <- f0(n))

##    user  system elapsed 
##   0.004   0.000   0.003

head(expected)

## [1] 0.000000 1.386294 2.197225 2.772589 3.218876 3.583519

Revise the function to hoist the common multiplier, a, out of the loop. Make sure the result of the ‘optimization’ and the original calculation are the same. Use the microbenchmark package to compare the two versions

f1 <- function(n, a=2) {
    result <- numeric()
    for (i in seq_len(n))
        result[[i]] <- log(i)
    a * result
}
identical(expected, f1(n))

## [1] TRUE

library(microbenchmark)
microbenchmark(f0(n), f1(n), times=5)

## Unit: milliseconds
##   expr      min       lq     mean   median       uq      max neval
##  f0(n) 2.632667 2.640774 2.659077 2.657991 2.660657 2.703296     5
##  f1(n) 2.487912 2.534903 3.166422 2.702409 3.842308 4.264579     5

Adopt a ‘pre-allocate and fill’ strategy

f2 <- function(n, a=2) {
    result <- numeric(n)
    for (i in seq_len(n))
        result[[i]] <- log(i)
    a * result
}
identical(expected, f2(n))

## [1] TRUE

microbenchmark(f0(n), f2(n), times=5)

## Unit: microseconds
##   expr      min       lq     mean   median       uq      max neval
##  f0(n) 2382.328 2384.400 2640.674 2425.520 2439.865 3571.257     5
##  f2(n)  831.750  856.862  875.623  868.295  887.126  934.082     5

Use an *apply() function to avoid having to explicitly pre-allocate, and make opportunities for vectorization more apparent.

f3 <- function(n, a=2)
    a * sapply(seq_len(n), log)

identical(expected, f3(n))

## [1] TRUE

microbenchmark(f0(n), f2(n), f3(n), times=10)

## Unit: microseconds
##   expr      min       lq      mean   median       uq      max neval
##  f0(n) 2384.027 2388.214 2726.7890 2441.573 2565.606 3949.162    10
##  f2(n)  834.296  855.826  899.9728  884.603  956.382  987.314    10
##  f3(n) 3232.041 3396.959 3660.2407 3446.093 3520.587 4820.226    10

Now that the code is presented in a single line, it is apparent that it could be easily vectorized. Seize the opportunity to vectorize it:

f4 <- function(n, a=2)
    a * log(seq_len(n))
identical(expected, f4(n))

## [1] TRUE

microbenchmark(f0(n), f3(n), f4(n), times=10)

## Unit: microseconds
##   expr      min       lq      mean   median       uq      max neval
##  f0(n) 2378.197 2425.047 2850.2893 2505.837 3554.125 3940.095    10
##  f3(n) 3273.346 3304.547 3410.5855 3382.308 3448.089 3807.240    10
##  f4(n)  204.820  207.267  373.4011  214.985  245.809 1743.434    10

f4() definitely seems to be the winner. How does it scale with n? (Repeat several times)

n <- 10 ^ (5:8)                         # 100x larger than f0
t <- sapply(n, function(i) system.time(f4(i))[[3]])
plot(t ~ n, log="xy", type="b")

Any explanations for the different pattern of response?

Lessons learned:

Vectorizing offers a huge improvement over iteration
Pre-allocate-and-fill is very helpful when explicit iteration is required.
*apply() functions help avoid need for explicit pre-allocation and make opportunities for vectorization more apparent. This may come at a small performance cost, but is generally worth it
Hoisting common sub-expressions can be helpful for improving performance when explicit iteration is required.

2.4 Parallel evaluation

When data are too large to fit in memory, we can iterate through files in chunks or subset the data by fields or genomic positions.

Iteration

Chunk-wise
open(), read chunk(s), close().
e.g., yieldSize argument to Rsamtools::BamFile()
Framework: GenomicFiles::reduceByYield()

Restriction

Limit to columns and / or rows of interest
Exploit domain-specific formats
- BAM files and Rsamtools::ScanBamParam()
- BAM files and Rsamtools::PileupParam()
- VCF files and VariantAnnotation::ScanVcfParam()
Use a data base

Parallel evalutation

After writing efficient R code, adopting iteration and restriction
BiocParallel for common, cross-platform operation

BiocParallel provides a standardized interface for simple parallel evaluation. The package builds provides access to the snow and multicore functionality in the parallel package as well as BatchJobs for running cluster jobs.

General ideas:

Use bplapply() instead of lapply()
Argument BPPARAM influences how parallel evaluation occurs
- MulticoreParam(): threads on a single (non-Windows) machine
- SnowParam(): processes on the same or different machines
- BatchJobsParam(): resource scheduler on a cluster
- DoparParam(): parallel execution with foreach()

2.4.1 Exercise: Sleeping serially and in parallel

This small example motivates the use of parallel execution and demonstrates how bplapply() can be a drop in for lapply.

Use system.time() to explore how long this takes to execute as n increases from 1 to 10. Use identical() and microbenchmark to compare alternatives f0() and f1() for both correctness and performance.

fun sleeps for 1 second, then returns i.

library(BiocParallel)

fun <- function(i) {
    Sys.sleep(1)
    i
}

## serial
f0 <- function(n)
    lapply(seq_len(n), fun)

## parallel
f1 <- function(n)
    bplapply(seq_len(n), fun)

B.5 – Next Steps

8 - 9 May 2017

Contents