---
title: "Bioc2016: RNA-seq EDA"
author: "Michael Love"
date: "26 June 2016"
output: 
  rmarkdown::html_vignette:
    toc: true
    fig_width: 5
    fig_height: 5
vignette: >
  %\VignetteIndexEntry{Low-level exploratory data analysis and methods development for RNA-seq}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Goals of this workshop

* Low-level EDA (exploratory data analysis) of RNA-seq,
  using core Bioconductor functions and base R graphics  
* Examine biases affecting paired-end RNA-seq fragments

We will **not** cover, but you can read about elsewhere:

* Counting RNA-seq fragments in genes
    * Using, e.g. [featureCounts](http://bioinf.wehi.edu.au/featureCounts/), 
    [htseq](http://www-huber.embl.de/HTSeq/doc/overview.html), 
    or [summarizeOverlaps](http://bioconductor.org/packages/GenomicAlignments)
    * Or [tximport](http://bioconductor.org/packages/tximport) following ultra-fast 
    quantification software such as 
    [Sailfish](http://www.cs.cmu.edu/~ckingsf/software/sailfish/), 
    [Salmon](https://combine-lab.github.io/salmon/), or
    [kallisto](https://pachterlab.github.io/kallisto/)
* [Basic differential expression analysis workflow](http://www.bioconductor.org/help/workflows/rnaseqGene/)

# Gene annotation

We start by looking at two genes, USF2 and CHPF. 
Let's obtain some information on these genes from the 
*Homo.sapiens* package. We will pull out the 
transcript ID and name that go along with the gene symbols.

```{r, message=FALSE}
library(Homo.sapiens)
columns(Homo.sapiens)
```

```{r}
g <- list()
g[["USF2"]] <- select(Homo.sapiens, "USF2", c("TXID","TXNAME"), "SYMBOL")
g[["CHPF"]] <- select(Homo.sapiens, "CHPF", c("TXID","TXNAME"), "SYMBOL")
g
```

We can use a *TxDb* object to get the exons for the transcripts
of these two genes. The *exonsBy* function returns a *GRangesList*
object, and by specifying `by="tx"`, within this *GRangesList* 
we can find the GRanges for each exon of every transcript.
In `ebt` we have the exons of every transcript, labelled by
transcript ID. We then use the information from the previous
chunk to pull out the transcripts of USF2 and CHPF:

```{r}
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
ebt <- exonsBy(txdb, by="tx")
head(names(ebt))
usf2 <- ebt[ g[["USF2"]]$TXID ]
chpf <- ebt[ g[["CHPF"]]$TXID ]
```

We can take a look at these:

```{r}
usf2
chpf
```

If we ask about the `width` of the exons, we get an
*IntegerList*. If we follow that command with `sum`, 
we get the length of each transcript.

```{r}
width(usf2)
sum(width(usf2))
```

It will be useful to calculate the total range of the 
exons of the transcripts of the genes:

```{r}
usf2.r <- range(unlist(usf2))
chpf.r <- range(unlist(chpf))
```

# Plotting gene models

Let's now visualize the exons of the transcripts using base R graphics.
First, let's write a line of code that draws the exons
of a single transcript of USF2:

```{r}
library(rafalib)
lens <- length(usf2)
nullplot(start(usf2.r),end(usf2.r),0,2)
segments(start(usf2[[1]]), rep(1,lens[1]), 
           end(usf2[[1]]), rep(1,lens[1]), lwd=3)
```

Now, to avoid repeating this code over and over for each
gene, let's write a general function for drawing the 
exons of each transcript of a gene, 
contained in a `GRangesList` x.

```{r}
plotGRangesList <- function(x,name="") {
  r <- range(unlist(range(x)))
  nullplot(start(r),end(r),0.5,length(x)+0.5,
           main=name,xlab=seqnames(x[[1]][1]))
  lens <- elementNROWS(x)
  for (i in seq_along(x)) {
    segments(start(x[[i]]), rep(i,lens[i]), 
             end(x[[i]]), rep(i,lens[i]), lwd=3)
  }
}
plotGRangesList(usf2, "USF2")
plotGRangesList(chpf, "CHPF")
```

# Reading in alignments

We have provided four BAM files for use in this workshop, so that we can 
explore how real RNA-seq fragments look using Bioconductor.
These four files are associated with GEUVADIS RNA-seq samples, 
aligned using [STAR](https://github.com/alexdobin/STAR) read aligner
version 2.5.0, aligning to the hg19 genome packaged by Illumina iGenomes.
The BAM files contain just a subset of the total number of paired-end reads,
to limit the disk space required for this workshop. The files contain
only the reads that overlap the genes USF2 and CHPF. Full details about
the experiments can be found by searching using the ERR ID on the 
[ENA](http://www.ebi.ac.uk/ena) website.

```{r}
dir <- system.file("extdata", package="bioc2016eda")
samples <- read.csv(file.path(dir,"samples.csv"))
samples
```

We create a vector that points to these files on disk.

```{r, message=FALSE}
library(GenomicAlignments)
bamfiles <- file.path(dir,paste0(samples$run,".bam"))
file.exists(bamfiles)
```

We did not sort or include an index for these BAM files, 
so first we must sort and index them. Normally,
you would sort and index these outside of R
using, e.g. [Samtools](http://www.htslib.org/).

```{r}
for (i in seq_along(bamfiles)) {
  sortBam(bamfiles[i], destination=file.path(dir,paste0(samples$run[i],"_sort")))
  indexBam(file.path(dir,paste0(samples$run[i],"_sort.bam")))
}
bamfiles <- file.path(dir,paste0(samples$run,"_sort.bam"))
file.exists(bamfiles)
```

Now we can read in some paired alignments for the first BAM files, 
specifically those alignments that cover USF2.
We need to remove some of the `seqnames` from the
`usf2.r` ranges, as these do not exist in the BAM header.

```{r}
usf2.r <- keepStandardChromosomes(usf2.r)
gap <- readGAlignmentPairs(bamfiles[1], 
                           param=ScanBamParam(which=usf2.r))
gap
```

# Computing genomic coverage

It's helpful to be able to quickly visualize these alignments,
and one such summary visualization is to see where the reads 
fall on the genome. This plot can be made using the `coverage`
function:

```{r}
cov <- coverage(gap)
cov.genome <- as.integer(cov[usf2.r][["chr19"]])
pos.genome <- seq(from=start(usf2.r),to=end(usf2.r))
plot(pos.genome, cov.genome, xlab="position (genome)", ylab="coverage")
```

# Find compatible overlaps

The function `findCompatibleOverlaps` can be used to do just that:
find which paired alignments are compatible with which transcripts.
"Compatible" in this case means that the splicing must be compatible:
the read junctions must align exactly with the introns of the transcript.
For more details, see `vignette("OverlapEncodings")`.

The returned object is of the class `Hits`, and can be interrogated
with `queryHits` and `subjectHits`. Transcripts 2 and 3 have the most
compatible alignments, though these are also the longest transcripts.

```{r}
fco <- findCompatibleOverlaps(gap, usf2)
fco
countSubjectHits(fco)
```

Note that elements of the query (paired alignments) can be compatible with 
multiple elements in the subject (transcripts). 
We may therefore want to categorize the paired alignments into *equivalence classes*, 
defined by the set of transcripts that the alignments are compatible with.
An early proposal for collapsing RNA-seq reads 
into equivalence classes can be found in 
[Statistical Modeling of RNA-Seq Data](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3846358/)
by Salzman, J., Jiang, H., and Wong, W.H.
Another reference on the use of equivalence classes for RNA-seq data is 
the [Sailfish paper](http://www.ncbi.nlm.nih.gov/pubmed/24752080),
discussing equivalence classes for read k-mers.

Here we will rename each paired alignment using a 1 or 0
if it is or is not compatible with a transcript.
So `1-1-0-0-1` means that a paired alignment is compatible
with transcripts 1, 2, and 5.

```{r}
tab <- table(queryHits(fco), subjectHits(fco))
tab <- as.matrix(tab)
frags <- apply(tab, 1, paste, collapse="-")
frags[1]
```

We can then tabulate the paired alignments using their new
names. Note that we have the same number of alignments
this way as if we count the unique number of query hits.

```{r}
table(frags)
sum(table(frags))
length(unique(queryHits(fco)))
```

We can see by scanning the table that many reads are
not compatible with transcript 1, 4, or 5.

# Map to transcripts

Another useful operation is to map from genomic coordinates to 
transcript coordinates. Note that, if we only wanted to 
have transcript alignments from the start, we could have used
software such as [RSEM](http://deweylab.github.io/RSEM/) or 
[RapMap](https://github.com/COMBINE-lab/RapMap) to produce
a BAM file contain *transcriptome* alignments. 
Such a file will necessarily contain, for many pairs, 
the alignment of the paired reads to the many isoforms of a gene
with which it is compatible. The genomic alignment file
will only contain multiple alignments for pairs when they 
are at different *genomic* loci.

To map our genomic coordinates to transcript coordinates,
we have to map the start and end of each fragment separately, 
because `mapToTranscripts` does not work on
*GAlignmentPairs*. Furthermore, from the man page we can read that,
"A successful mapping occurs when `x` is 
completely within the `transcripts` range".
So we will break apart each paired alignment that is
compatible with our transcript of interest into a start position and
an end position, map these positions to transcript 
coordinates, then rebuild the fragment afterward.

We don't keep track of fragment strand in this example, 
but one could do this simply by taking care of the
plus and minus strand fragments separately.

```{r}
idx <- queryHits(fco)[subjectHits(fco) == 3]
gr <- as(gap[idx],"GRanges")
strand(gr) <- "*"
m2tx.start <- mapToTranscripts(resize(gr, width=1, fix="start"), usf2[3])
m2tx.start
```

Now we do the same for the ends as we did for the starts,
mapping them to the transcript coordinates. 
We then create a *GRanges* in the transcript coordinate
space, using the starts and ends of the paired alignments.

The following code works for a plus strand transcript,
because the position of the fragment given by `start` 
(the left-most position of the range) is to the left
in the transcript coordinates as well.
For a minus strand transcript, we would have to reverse these.

```{r}
m2tx.end <- mapToTranscripts(resize(gr, width=1, fix="end"), usf2[3])
m2tx <- GRanges(seqnames(m2tx.start), 
                IRanges(start(m2tx.start),start(m2tx.end)))
m2tx
```

It will be convenient to turn this set of operations into 
a function so we can repeat the operation for a generic
BAM file and transcript.

```{r}
readTxFragments <- function(file, transcript) {
  r <- range(transcript[[1]])
  r <- keepStandardChromosomes(r)
  # suppress warnings about alignments with ambiguous pairing
  suppressWarnings({ 
    gap <- readGAlignmentPairs(file, param=ScanBamParam(which=r))
  })
  fco <- findCompatibleOverlaps(gap, transcript)
  idx <- queryHits(fco)
  gr <- as(gap[idx],"GRanges")
  strand(gr) <- "*"
  m2tx.start <- mapToTranscripts(resize(gr, width=1, fix="start"), 
                                 transcript)
  m2tx.end <- mapToTranscripts(resize(gr, width=1, fix="end"), 
                               transcript)
  tx.strand <- as.character(strand(transcript)[[1]][1])
  if (tx.strand == "+") {
    m2tx <- GRanges(seqnames(m2tx.start), 
                    IRanges(start(m2tx.start),start(m2tx.end)))
  } else {
    m2tx <- GRanges(seqnames(m2tx.start), 
                    IRanges(start(m2tx.end),start(m2tx.start)))   
  }
  m2tx
}
```

We can then compute transcript coverage for the different files.
We color the coverage by the sequencing center to identify
differences in coverage that correlate.

```{r}
mypar()
nullplot(0,1600,0,900,xlab="position (tx)",ylab="coverage")
for (i in seq_along(bamfiles)) {
  frags <- readTxFragments(bamfiles[i], usf2[3])
  cov.tx <- coverage(frags)
  lines(as.integer(cov.tx[[1]]), col=samples$center[i], lwd=2)
}
```

And the same for the second transcript of CHPF:

```{r}
mypar()
nullplot(0,3100,0,150,xlab="position (tx)",ylab="coverage")
for (i in seq_along(bamfiles)) {
  frags <- readTxFragments(bamfiles[i], chpf[2])
  cov.tx <- coverage(frags)
  lines(as.integer(cov.tx[[1]]), col=samples$center[i], lwd=2)
}
```

# Fragment length distribution

Once we have our fragments in the transcript coordinates
it is easy to observe the fragment length distribution.
For a more comprehensive view, obviously one would want to take
fragments from many or all genes.

```{r}
frags <- readTxFragments(bamfiles[1], usf2[3])
hist(width(frags), col="grey", border="white")
```

# Fragment start and length plot

A density plot of fragment length over fragment start 
was proposed by Evans, S.N., Hower, V., and Pachter, L. 
[Coverage statistics for sequence census methods](http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-430)
and can be seen in Figure 6 in that reference.
This plot is useful for reminding us that RNA-seq data is 2-dimensional,
and for observing a bias affecting both the fragment starts (fragment position closest
to transcript 5') and the ends (closest to transcript 3').
We can see the bias as vertical and downward-sloping 45 degree
lines. We use transparency to allow overplotting at the same
or nearby locations in the (start,length) plane.

```{r}
mypar(4,1,mar=c(2,2,1,1))
for (f in bamfiles) {
  frags <- readTxFragments(f, usf2[3])
  plot(start(frags), width(frags),
       xlim=c(200,1300), ylim=c(50,400),
       pch=15, col=rgb(0,0,0,.2), cex=.5, 
       xlab="", ylab="")
}
```

For CHPF it is more difficult to see the pattern,
as the coverage is lower:

```{r}
mypar(4,1,mar=c(2,2,1,1))
for (f in bamfiles) {
  frags <- readTxFragments(f, chpf[2])
  plot(start(frags), width(frags),
       xlim=c(0,3000), ylim=c(50,400),
       pch=15, col=rgb(0,0,0,.2), cex=.5, 
       xlab="", ylab="")
}
```

# Extract transcript sequence

Up until this point, we have not yet taken a look at the 
transcript sequence. In order to see if the distribution
of paired alignments is correlated with sequence features, 
we will need to extract the transcript sequence.
Again, we will focus on the third transcript of
USF2 and the second of CHPF.
We can extract transcript sequence with the following lines:

```{r, message=FALSE}
library(BSgenome.Hsapiens.UCSC.hg19)
usf2.seq <- extractTranscriptSeqs(Hsapiens, usf2[3])[[1]]
chpf.seq <- extractTranscriptSeqs(Hsapiens, chpf[2])[[1]]
usf2.seq
chpf.seq
```

# Common sequences at read starts

We can look at, for a single transcript, the sequence beneath the read 
closest to the 5' end of the transcript. For uniformly distributed
reads, we would expect to see, for example, all 64 of the 3-mers
showing up equally:

```{r}
frags <- readTxFragments(bamfiles[1], usf2[3])
length(frags) / 4^3
```

Instead we see some 3-mers are much more common, which leads
to the vertical lines in the previous plots.

```{r}
start.seq <- as(Views(usf2.seq, start=start(frags), width=3), "DNAStringSet")
seq.tab <- table(start.seq)
sort(seq.tab, decreasing=TRUE)
```

Likewise, for the 3-mers for the read closest to the 3' end of
the transcript.

```{r}
end.seq <- as(Views(usf2.seq, end=end(frags), width=3), "DNAStringSet")
end.seq <- reverseComplement(end.seq)
seq.tab <- table(end.seq)
sort(seq.tab, decreasing=TRUE)
```

To properly infer the bias at the starts and ends of the 
fragments, one would need to look over many or all transcripts.
Such a bias calculation is performed by a number of software
including Cufflinks, eXpress, Sailfish, Salmon, kallisto and alpine.

# Fragment sequence features

Finally, the last kind of bias we will look into is whether
the fragment sequence itself has an effect in the distribution
of fragments on the transcript. We can extract the fragment
sequence with code similar to the lines above.

```{r}
frag.seq <- as(Views(usf2.seq, start=start(frags), end(frags)), "DNAStringSet")
```

We can easily find summaries of the fragment sequence, for
example, the GC content of the fragments:

```{r}
frag.gc <- letterFrequency(frag.seq, "GC", as.prob=TRUE)
plot(density(frag.gc))
```

Let's calculate what we might expect if fragments
were distributed uniformly along the transcript:

```{r}
uniform.gc <- letterFrequencyInSlidingView(usf2.seq, 
                                           median(width(frags)), 
                                           "GC", as.prob=TRUE) 
plot(density(uniform.gc))
```

Now let's combine these plots, iterating over the different samples
and coloring by sequencing center:

```{r}
plot(density(uniform.gc), ylim=c(0,12), lwd=2,
     xlab="fragment GC content", main="")
for (i in seq_along(bamfiles)) {
  frags <- readTxFragments(bamfiles[i], usf2[3])
  frag.seq <- as(Views(usf2.seq, start=start(frags), end(frags)), 
                 "DNAStringSet")
  frag.gc <- letterFrequency(frag.seq, "GC", as.prob=TRUE)
  lines(density(frag.gc), col=samples$center[i], lwd=2)
}
```

To summarize this plot, it's clear that the samples from one
of the sequencing centers have much less representation 
than the other sequencing center in the range of 0.65-0.75
GC content.

Again, in order to properly calculate this bias, one 
would need to look over many of all transcripts, and 
to adjust for the potentially confounding effect
of the read start sequence bias. Such a calculation
is performed by the alpine software.

# Other QA software

* Within Bioconductor, there is the 
  [ShortRead](http://bioconductor.org/packages/ShortRead) package with the 
  `qa` function for quality assessment
* I strongly recommend using the 
  [FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) 
  software to generate QC reports, followed by the 
  [MultiQC](http://multiqc.info/)
  software to aggregate statistics across multiple samples.

# Session Information

```{r}
sessionInfo()
```