--- title: "Bioc2016: RNA-seq EDA" author: "Michael Love" date: "26 June 2016" output: rmarkdown::html_vignette: toc: true fig_width: 5 fig_height: 5 vignette: > %\VignetteIndexEntry{Low-level exploratory data analysis and methods development for RNA-seq} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Goals of this workshop * Low-level EDA (exploratory data analysis) of RNA-seq, using core Bioconductor functions and base R graphics * Examine biases affecting paired-end RNA-seq fragments We will **not** cover, but you can read about elsewhere: * Counting RNA-seq fragments in genes * Using, e.g. [featureCounts](http://bioinf.wehi.edu.au/featureCounts/), [htseq](http://www-huber.embl.de/HTSeq/doc/overview.html), or [summarizeOverlaps](http://bioconductor.org/packages/GenomicAlignments) * Or [tximport](http://bioconductor.org/packages/tximport) following ultra-fast quantification software such as [Sailfish](http://www.cs.cmu.edu/~ckingsf/software/sailfish/), [Salmon](https://combine-lab.github.io/salmon/), or [kallisto](https://pachterlab.github.io/kallisto/) * [Basic differential expression analysis workflow](http://www.bioconductor.org/help/workflows/rnaseqGene/) # Gene annotation We start by looking at two genes, USF2 and CHPF. Let's obtain some information on these genes from the *Homo.sapiens* package. We will pull out the transcript ID and name that go along with the gene symbols. ```{r, message=FALSE} library(Homo.sapiens) columns(Homo.sapiens) ``` ```{r} g <- list() g[["USF2"]] <- select(Homo.sapiens, "USF2", c("TXID","TXNAME"), "SYMBOL") g[["CHPF"]] <- select(Homo.sapiens, "CHPF", c("TXID","TXNAME"), "SYMBOL") g ``` We can use a *TxDb* object to get the exons for the transcripts of these two genes. The *exonsBy* function returns a *GRangesList* object, and by specifying `by="tx"`, within this *GRangesList* we can find the GRanges for each exon of every transcript. In `ebt` we have the exons of every transcript, labelled by transcript ID. We then use the information from the previous chunk to pull out the transcripts of USF2 and CHPF: ```{r} library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene ebt <- exonsBy(txdb, by="tx") head(names(ebt)) usf2 <- ebt[ g[["USF2"]]$TXID ] chpf <- ebt[ g[["CHPF"]]$TXID ] ``` We can take a look at these: ```{r} usf2 chpf ``` If we ask about the `width` of the exons, we get an *IntegerList*. If we follow that command with `sum`, we get the length of each transcript. ```{r} width(usf2) sum(width(usf2)) ``` It will be useful to calculate the total range of the exons of the transcripts of the genes: ```{r} usf2.r <- range(unlist(usf2)) chpf.r <- range(unlist(chpf)) ``` # Plotting gene models Let's now visualize the exons of the transcripts using base R graphics. First, let's write a line of code that draws the exons of a single transcript of USF2: ```{r} library(rafalib) lens <- length(usf2) nullplot(start(usf2.r),end(usf2.r),0,2) segments(start(usf2[[1]]), rep(1,lens[1]), end(usf2[[1]]), rep(1,lens[1]), lwd=3) ``` Now, to avoid repeating this code over and over for each gene, let's write a general function for drawing the exons of each transcript of a gene, contained in a `GRangesList` x. ```{r} plotGRangesList <- function(x,name="") { r <- range(unlist(range(x))) nullplot(start(r),end(r),0.5,length(x)+0.5, main=name,xlab=seqnames(x[[1]][1])) lens <- elementNROWS(x) for (i in seq_along(x)) { segments(start(x[[i]]), rep(i,lens[i]), end(x[[i]]), rep(i,lens[i]), lwd=3) } } plotGRangesList(usf2, "USF2") plotGRangesList(chpf, "CHPF") ``` # Reading in alignments We have provided four BAM files for use in this workshop, so that we can explore how real RNA-seq fragments look using Bioconductor. These four files are associated with GEUVADIS RNA-seq samples, aligned using [STAR](https://github.com/alexdobin/STAR) read aligner version 2.5.0, aligning to the hg19 genome packaged by Illumina iGenomes. The BAM files contain just a subset of the total number of paired-end reads, to limit the disk space required for this workshop. The files contain only the reads that overlap the genes USF2 and CHPF. Full details about the experiments can be found by searching using the ERR ID on the [ENA](http://www.ebi.ac.uk/ena) website. ```{r} dir <- system.file("extdata", package="bioc2016eda") samples <- read.csv(file.path(dir,"samples.csv")) samples ``` We create a vector that points to these files on disk. ```{r, message=FALSE} library(GenomicAlignments) bamfiles <- file.path(dir,paste0(samples$run,".bam")) file.exists(bamfiles) ``` We did not sort or include an index for these BAM files, so first we must sort and index them. Normally, you would sort and index these outside of R using, e.g. [Samtools](http://www.htslib.org/). ```{r} for (i in seq_along(bamfiles)) { sortBam(bamfiles[i], destination=file.path(dir,paste0(samples$run[i],"_sort"))) indexBam(file.path(dir,paste0(samples$run[i],"_sort.bam"))) } bamfiles <- file.path(dir,paste0(samples$run,"_sort.bam")) file.exists(bamfiles) ``` Now we can read in some paired alignments for the first BAM files, specifically those alignments that cover USF2. We need to remove some of the `seqnames` from the `usf2.r` ranges, as these do not exist in the BAM header. ```{r} usf2.r <- keepStandardChromosomes(usf2.r) gap <- readGAlignmentPairs(bamfiles[1], param=ScanBamParam(which=usf2.r)) gap ``` # Computing genomic coverage It's helpful to be able to quickly visualize these alignments, and one such summary visualization is to see where the reads fall on the genome. This plot can be made using the `coverage` function: ```{r} cov <- coverage(gap) cov.genome <- as.integer(cov[usf2.r][["chr19"]]) pos.genome <- seq(from=start(usf2.r),to=end(usf2.r)) plot(pos.genome, cov.genome, xlab="position (genome)", ylab="coverage") ``` # Find compatible overlaps The function `findCompatibleOverlaps` can be used to do just that: find which paired alignments are compatible with which transcripts. "Compatible" in this case means that the splicing must be compatible: the read junctions must align exactly with the introns of the transcript. For more details, see `vignette("OverlapEncodings")`. The returned object is of the class `Hits`, and can be interrogated with `queryHits` and `subjectHits`. Transcripts 2 and 3 have the most compatible alignments, though these are also the longest transcripts. ```{r} fco <- findCompatibleOverlaps(gap, usf2) fco countSubjectHits(fco) ``` Note that elements of the query (paired alignments) can be compatible with multiple elements in the subject (transcripts). We may therefore want to categorize the paired alignments into *equivalence classes*, defined by the set of transcripts that the alignments are compatible with. An early proposal for collapsing RNA-seq reads into equivalence classes can be found in [Statistical Modeling of RNA-Seq Data](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3846358/) by Salzman, J., Jiang, H., and Wong, W.H. Another reference on the use of equivalence classes for RNA-seq data is the [Sailfish paper](http://www.ncbi.nlm.nih.gov/pubmed/24752080), discussing equivalence classes for read k-mers. Here we will rename each paired alignment using a 1 or 0 if it is or is not compatible with a transcript. So `1-1-0-0-1` means that a paired alignment is compatible with transcripts 1, 2, and 5. ```{r} tab <- table(queryHits(fco), subjectHits(fco)) tab <- as.matrix(tab) frags <- apply(tab, 1, paste, collapse="-") frags[1] ``` We can then tabulate the paired alignments using their new names. Note that we have the same number of alignments this way as if we count the unique number of query hits. ```{r} table(frags) sum(table(frags)) length(unique(queryHits(fco))) ``` We can see by scanning the table that many reads are not compatible with transcript 1, 4, or 5. # Map to transcripts Another useful operation is to map from genomic coordinates to transcript coordinates. Note that, if we only wanted to have transcript alignments from the start, we could have used software such as [RSEM](http://deweylab.github.io/RSEM/) or [RapMap](https://github.com/COMBINE-lab/RapMap) to produce a BAM file contain *transcriptome* alignments. Such a file will necessarily contain, for many pairs, the alignment of the paired reads to the many isoforms of a gene with which it is compatible. The genomic alignment file will only contain multiple alignments for pairs when they are at different *genomic* loci. To map our genomic coordinates to transcript coordinates, we have to map the start and end of each fragment separately, because `mapToTranscripts` does not work on *GAlignmentPairs*. Furthermore, from the man page we can read that, "A successful mapping occurs when `x` is completely within the `transcripts` range". So we will break apart each paired alignment that is compatible with our transcript of interest into a start position and an end position, map these positions to transcript coordinates, then rebuild the fragment afterward. We don't keep track of fragment strand in this example, but one could do this simply by taking care of the plus and minus strand fragments separately. ```{r} idx <- queryHits(fco)[subjectHits(fco) == 3] gr <- as(gap[idx],"GRanges") strand(gr) <- "*" m2tx.start <- mapToTranscripts(resize(gr, width=1, fix="start"), usf2[3]) m2tx.start ``` Now we do the same for the ends as we did for the starts, mapping them to the transcript coordinates. We then create a *GRanges* in the transcript coordinate space, using the starts and ends of the paired alignments. The following code works for a plus strand transcript, because the position of the fragment given by `start` (the left-most position of the range) is to the left in the transcript coordinates as well. For a minus strand transcript, we would have to reverse these. ```{r} m2tx.end <- mapToTranscripts(resize(gr, width=1, fix="end"), usf2[3]) m2tx <- GRanges(seqnames(m2tx.start), IRanges(start(m2tx.start),start(m2tx.end))) m2tx ``` It will be convenient to turn this set of operations into a function so we can repeat the operation for a generic BAM file and transcript. ```{r} readTxFragments <- function(file, transcript) { r <- range(transcript[[1]]) r <- keepStandardChromosomes(r) # suppress warnings about alignments with ambiguous pairing suppressWarnings({ gap <- readGAlignmentPairs(file, param=ScanBamParam(which=r)) }) fco <- findCompatibleOverlaps(gap, transcript) idx <- queryHits(fco) gr <- as(gap[idx],"GRanges") strand(gr) <- "*" m2tx.start <- mapToTranscripts(resize(gr, width=1, fix="start"), transcript) m2tx.end <- mapToTranscripts(resize(gr, width=1, fix="end"), transcript) tx.strand <- as.character(strand(transcript)[[1]][1]) if (tx.strand == "+") { m2tx <- GRanges(seqnames(m2tx.start), IRanges(start(m2tx.start),start(m2tx.end))) } else { m2tx <- GRanges(seqnames(m2tx.start), IRanges(start(m2tx.end),start(m2tx.start))) } m2tx } ``` We can then compute transcript coverage for the different files. We color the coverage by the sequencing center to identify differences in coverage that correlate. ```{r} mypar() nullplot(0,1600,0,900,xlab="position (tx)",ylab="coverage") for (i in seq_along(bamfiles)) { frags <- readTxFragments(bamfiles[i], usf2[3]) cov.tx <- coverage(frags) lines(as.integer(cov.tx[[1]]), col=samples$center[i], lwd=2) } ``` And the same for the second transcript of CHPF: ```{r} mypar() nullplot(0,3100,0,150,xlab="position (tx)",ylab="coverage") for (i in seq_along(bamfiles)) { frags <- readTxFragments(bamfiles[i], chpf[2]) cov.tx <- coverage(frags) lines(as.integer(cov.tx[[1]]), col=samples$center[i], lwd=2) } ``` # Fragment length distribution Once we have our fragments in the transcript coordinates it is easy to observe the fragment length distribution. For a more comprehensive view, obviously one would want to take fragments from many or all genes. ```{r} frags <- readTxFragments(bamfiles[1], usf2[3]) hist(width(frags), col="grey", border="white") ``` # Fragment start and length plot A density plot of fragment length over fragment start was proposed by Evans, S.N., Hower, V., and Pachter, L. [Coverage statistics for sequence census methods](http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-430) and can be seen in Figure 6 in that reference. This plot is useful for reminding us that RNA-seq data is 2-dimensional, and for observing a bias affecting both the fragment starts (fragment position closest to transcript 5') and the ends (closest to transcript 3'). We can see the bias as vertical and downward-sloping 45 degree lines. We use transparency to allow overplotting at the same or nearby locations in the (start,length) plane. ```{r} mypar(4,1,mar=c(2,2,1,1)) for (f in bamfiles) { frags <- readTxFragments(f, usf2[3]) plot(start(frags), width(frags), xlim=c(200,1300), ylim=c(50,400), pch=15, col=rgb(0,0,0,.2), cex=.5, xlab="", ylab="") } ``` For CHPF it is more difficult to see the pattern, as the coverage is lower: ```{r} mypar(4,1,mar=c(2,2,1,1)) for (f in bamfiles) { frags <- readTxFragments(f, chpf[2]) plot(start(frags), width(frags), xlim=c(0,3000), ylim=c(50,400), pch=15, col=rgb(0,0,0,.2), cex=.5, xlab="", ylab="") } ``` # Extract transcript sequence Up until this point, we have not yet taken a look at the transcript sequence. In order to see if the distribution of paired alignments is correlated with sequence features, we will need to extract the transcript sequence. Again, we will focus on the third transcript of USF2 and the second of CHPF. We can extract transcript sequence with the following lines: ```{r, message=FALSE} library(BSgenome.Hsapiens.UCSC.hg19) usf2.seq <- extractTranscriptSeqs(Hsapiens, usf2[3])[[1]] chpf.seq <- extractTranscriptSeqs(Hsapiens, chpf[2])[[1]] usf2.seq chpf.seq ``` # Common sequences at read starts We can look at, for a single transcript, the sequence beneath the read closest to the 5' end of the transcript. For uniformly distributed reads, we would expect to see, for example, all 64 of the 3-mers showing up equally: ```{r} frags <- readTxFragments(bamfiles[1], usf2[3]) length(frags) / 4^3 ``` Instead we see some 3-mers are much more common, which leads to the vertical lines in the previous plots. ```{r} start.seq <- as(Views(usf2.seq, start=start(frags), width=3), "DNAStringSet") seq.tab <- table(start.seq) sort(seq.tab, decreasing=TRUE) ``` Likewise, for the 3-mers for the read closest to the 3' end of the transcript. ```{r} end.seq <- as(Views(usf2.seq, end=end(frags), width=3), "DNAStringSet") end.seq <- reverseComplement(end.seq) seq.tab <- table(end.seq) sort(seq.tab, decreasing=TRUE) ``` To properly infer the bias at the starts and ends of the fragments, one would need to look over many or all transcripts. Such a bias calculation is performed by a number of software including Cufflinks, eXpress, Sailfish, Salmon, kallisto and alpine. # Fragment sequence features Finally, the last kind of bias we will look into is whether the fragment sequence itself has an effect in the distribution of fragments on the transcript. We can extract the fragment sequence with code similar to the lines above. ```{r} frag.seq <- as(Views(usf2.seq, start=start(frags), end(frags)), "DNAStringSet") ``` We can easily find summaries of the fragment sequence, for example, the GC content of the fragments: ```{r} frag.gc <- letterFrequency(frag.seq, "GC", as.prob=TRUE) plot(density(frag.gc)) ``` Let's calculate what we might expect if fragments were distributed uniformly along the transcript: ```{r} uniform.gc <- letterFrequencyInSlidingView(usf2.seq, median(width(frags)), "GC", as.prob=TRUE) plot(density(uniform.gc)) ``` Now let's combine these plots, iterating over the different samples and coloring by sequencing center: ```{r} plot(density(uniform.gc), ylim=c(0,12), lwd=2, xlab="fragment GC content", main="") for (i in seq_along(bamfiles)) { frags <- readTxFragments(bamfiles[i], usf2[3]) frag.seq <- as(Views(usf2.seq, start=start(frags), end(frags)), "DNAStringSet") frag.gc <- letterFrequency(frag.seq, "GC", as.prob=TRUE) lines(density(frag.gc), col=samples$center[i], lwd=2) } ``` To summarize this plot, it's clear that the samples from one of the sequencing centers have much less representation than the other sequencing center in the range of 0.65-0.75 GC content. Again, in order to properly calculate this bias, one would need to look over many of all transcripts, and to adjust for the potentially confounding effect of the read start sequence bias. Such a calculation is performed by the alpine software. # Other QA software * Within Bioconductor, there is the [ShortRead](http://bioconductor.org/packages/ShortRead) package with the `qa` function for quality assessment * I strongly recommend using the [FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) software to generate QC reports, followed by the [MultiQC](http://multiqc.info/) software to aggregate statistics across multiple samples. # Session Information ```{r} sessionInfo() ```