---
title: "Analysing Long Read RNA-Seq data with bambu"
author: Ying Chen, Yuk Kei Wan, Jonathan Göke
output: rmarkdown::html_vignette
vignette: >
    %\VignetteIndexEntry{bambu}
    %\VignetteEngine{knitr::rmarkdown}
    % \VignettePackage{bambu}
    %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,tidy = TRUE,
    warning=FALSE, message=FALSE,
    comment = "##>"
)
```


# Introduction
*[Bambu](https://github.com/GoekeLab/bambu)* is a method for transcript 
discovery and quantification from long read RNA-Seq data. Bambu uses aligned 
reads and genome reference annotations as input, and will return abundance 
estimates for all known transcripts and for newly discovered transcripts.
Bambu uses the information from the reference annotations to correct 
misalignment at splice junctions, then reduces the aligned reads to read 
equivalent classes, and uses this information to identify novel transcripts
across all samples of interest. Reads are then assigned to transcripts,
and expression estimates are obtained using an expectation maximisation 
algorithm. Here, we present an example workflow for analysing Nanopore 
long read RNA-Sequencing data from two human cancer cell lines from the 
Singapore Nanopore Expression Project (SG-NEx).


# Quick start: Transcript discovery and quantification with bambu {#quick-start}

## Installation
You can install bambu from Bioconductor:
```{r, eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("bambu")
BiocManager::install("NanoporeRNASeq")
```

## General Usage
The default mode to run *bambu* is using a set of aligned reads (bam files),
reference genome annotations (gtf file, TxDb object, or bambuAnnotation object),
and reference genome sequence (fasta file or BSgenome). *bambu* will return a 
SummarizedExperiment object with the genomic coordinates for annotated and 
new transcripts and transcript expression estimates. 
We highly recommend using the same annotations that were used for genome 
alignment. If you have a gtf file and fasta file you can run bambu with the
following options:
```{r}
library(bambu)
test.bam <- system.file("extdata",
    "SGNex_A549_directRNA_replicate5_run1_chr9_1_1000000.bam",
    package = "bambu")

fa.file <- system.file("extdata",
    "Homo_sapiens.GRCh38.dna_sm.primary_assembly_chr9_1_1000000.fa",
    package = "bambu")

gtf.file <- system.file("extdata", "Homo_sapiens.GRCh38.91_chr9_1_1000000.gtf",
    package = "bambu")

bambuAnnotations <- prepareAnnotations(gtf.file)

se <- bambu(reads = test.bam, annotations = bambuAnnotations,
    genome = fa.file)
```
*bambu* returns a SummarizedExperiment object which can be accessed as follows:

* `assays(se)` returns the transcript abundance estimates as counts or CPM  
* `rowRanges(se)` returns a GRangesList with all annotated and newly discovered
    transcripts  
* `rowData(se)` returns additional information about each transcript such as the
    gene name and the class of newly discovered transcript  
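
As a rough illustration of how the two abundance units relate, the sketch below converts a toy count matrix to CPM with base R. It assumes the standard counts-per-million definition (counts divided by the per-sample total, times one million). The matrix `toyCounts` and its values are made up for illustration, and the normalisation details inside *bambu* may differ.

```{r}
# toy transcript-by-sample count matrix (values are illustrative only)
toyCounts <- matrix(c(10, 30, 60, 5, 15, 80), nrow = 3,
    dimnames = list(c("tx1", "tx2", "tx3"), c("sampleA", "sampleB")))

# counts per million: scale each sample (column) by its total count
toyCpm <- sweep(toyCounts, 2, colSums(toyCounts), "/") * 1e6
toyCpm
```

By construction, every column of `toyCpm` sums to one million, which makes samples with different sequencing depth comparable.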

# A complete workflow to identify and quantify transcript expression from Nanopore RNA-Seq data {#complete-workflow}
To demonstrate the usage of Bambu, we used long-read RNA-Seq data generated 
using Oxford Nanopore Sequencing from the NanoporeRNASeq package, 
which consists of 6 samples from two human cell lines (K562 and MCF7) 
that were generated by the 
[SG-NEx project](https://github.com/GoekeLab/sg-nex-data).
Each of these cell lines has three replicates, with 1 direct RNA sequencing 
run and 2 cDNA sequencing runs. Reads are aligned to chromosome 22 (GRCh38)
and stored as bam files. In this workflow, we will demonstrate how to apply
*bambu* to these bam files to identify novel transcripts, estimate
transcript expression, visualize the results, and identify differentially 
expressed genes and transcripts. 

## Input data {#input-data}
### Aligned reads (bam files) {#bam-files}            
*bambu* takes genomic alignments saved in bam files. Here we use bam files
from the *NanoporeRNASeq* package, which contain reads aligned to the first
half of human chromosome 22 using *minimap2*.
```{r}
library(bambu)

library(NanoporeRNASeq)
data("SGNexSamples")
SGNexSamples

library(ExperimentHub)
NanoporeData <- query(ExperimentHub(), c("NanoporeRNA", "GRCh38","BAM"))
bamFiles <- Rsamtools::BamFileList(NanoporeData[["EH3808"]],
    NanoporeData[["EH3809"]],NanoporeData[["EH3810"]], NanoporeData[["EH3811"]],
    NanoporeData[["EH3812"]], NanoporeData[["EH3813"]])
```

### Genome sequence (fasta file/ BSgenome object) {#genome-sequences}
*bambu* additionally requires a genome sequence, which is used to correct 
splice junctions in read alignments. We recommend using the same genome
sequence file that was used for alignment.
```{r}
# query ExperimentHub for the genome sequence (fasta) record
genomeSequence <- query(ExperimentHub(), c("NanoporeRNA", "GRCh38","FASTA"))
```
As an option, users can also choose to use a BSgenome object:
```{r}
library(BSgenome.Hsapiens.NCBI.GRCh38)
genomeSequence <- BSgenome.Hsapiens.NCBI.GRCh38
```

### Genome annotations (bambu annotations object/ gtf file / TxDb object) {#annotations}
*bambu* also requires a reference transcript annotations object, which is used 
to correct read alignments, to identify transcripts and genes (and the type
of novel transcripts), and for quantification. The annotation object can be
created from a gtf file: 
```{r}
gtf.file <- system.file("extdata", "Homo_sapiens.GRCh38.91_chr9_1_1000000.gtf",
    package = "bambu")

annotation <- prepareAnnotations(gtf.file)
```
The annotation object can also be created from a TxDb object:
```{r}
# build a TxDb object from the gtf file; makeTxDbFromGFF() is assumed to be
# available from the GenomicFeatures package
txdb <- GenomicFeatures::makeTxDbFromGFF(gtf.file)

annotation <- prepareAnnotations(txdb)
```

The annotation object can be stored and used again for re-running bambu.
Here we will use the annotation object from the *NanoporeRNASeq* package
that was prepared from a gtf file using the `prepareAnnotations` 
function in *bambu*.
```{r}
data("HsChr22BambuAnnotation")
HsChr22BambuAnnotation
```

## Transcript discovery and quantification {#transcript-discovery-quantification}
### Running bambu {#run-bambu}           
Next we apply *bambu* to the input data (bam files, annotations,
genomeSequence). Bambu will perform isoform discovery to extend the provided 
annotation, and then quantify the transcript expression from this extended 
annotation using an Expectation-Maximisation algorithm. Here we will use 1 core,
which can be changed to process multiple files in parallel.
```{r, results = "hide"}
se <- bambu(reads = bamFiles, annotations = HsChr22BambuAnnotation,
    genome = genomeSequence, ncore = 1)
```
```{r}
se
```
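
To give an intuition for the quantification step, the following toy expectation-maximisation loop assigns reads from three read classes to two transcripts. This is only a minimal sketch under simplifying assumptions (no bias correction, made-up counts, a hand-written compatibility matrix). The object names `compatibility`, `classCounts` and `theta` are hypothetical, and the code is not the implementation used inside *bambu*.

```{r}
# toy example: 3 read classes, 2 transcripts (all values are illustrative)
# compatibility[i, j] = 1 if reads in class i are compatible with transcript j
compatibility <- matrix(c(1, 0,
                          1, 1,
                          0, 1), nrow = 3, byrow = TRUE,
    dimnames = list(c("class1", "class2", "class3"), c("tx1", "tx2")))
classCounts <- c(class1 = 30, class2 = 50, class3 = 20)

# start from equal transcript proportions
theta <- c(tx1 = 0.5, tx2 = 0.5)
for (iter in seq_len(100)) {
    # E-step: split each class's reads across its compatible transcripts
    weights <- sweep(compatibility, 2, theta, "*")
    weights <- weights / rowSums(weights)
    # M-step: update proportions from the expected read assignments
    expectedCounts <- colSums(weights * classCounts)
    theta <- expectedCounts / sum(expectedCounts)
}
expectedCounts
```

In this toy case the 50 ambiguous reads of `class2` are split in proportion to the evidence from the uniquely assigned reads, so the estimates settle at 60 reads for `tx1` and 40 reads for `tx2`.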
For the downstream analysis, we will add the condition of interest to the 
`colData` object that describes the samples. Here we are interested in
a comparison of the 2 cell lines:
```{r}
colData(se)$condition <- as.factor(SGNexSamples$cellLine)
```


Optionally, users can choose to apply *bambu* to do quantification only 
(without isoform discovery):
```{r, results = "hide"}
seUnextended <- bambu(reads = bamFiles,
    annotations = HsChr22BambuAnnotation,
    genome = genomeSequence, discovery = FALSE)
```
```{r}
seUnextended
```


### Visualise results {#visualise-results}               
*bambu* provides functions to visualise and explore the results. When multiple
samples are used, we can visualise the correlation and clustering of all
samples with a heatmap:
```{r, fig.width = 8, fig.height = 6}
library(ggplot2)
plotBambu(se, type = "heatmap")
```

Additionally, we can also visualise the correlation with a 2-dimensional
PCA plot.
```{r, fig.width = 8, fig.height = 6}
plotBambu(se, type = "pca")
```

In addition to visualising the correlation between samples, *bambu* also provides
a function to visualise the extended annotation and expression estimation for 
individual genes. Here we look at gene ENSG00000099968 and visualise the
transcript coordinates for annotated and novel isoforms and expression levels 
for these isoforms across all samples. 
```{r, fig.width = 8, fig.height = 10}
plotBambu(se, type = "annotation", gene_id = "ENSG00000099968")
```

### Obtain gene expression estimates from transcript expression {#gene-expression}
Gene expression can be calculated from the transcript expression estimates
returned by *bambu* using the `transcriptToGeneExpression` function.
Looking at the output, we can see that novel genes are identified as well:
```{r}
seGene <- transcriptToGeneExpression(se)
seGene
```
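
Conceptually, gene-level counts are obtained by adding up the counts of all transcripts that belong to the same gene. The base R sketch below shows this aggregation on made-up data. The objects `txCounts` and `txToGene` are hypothetical, and `transcriptToGeneExpression` may handle additional details that this sketch does not cover.

```{r}
# toy transcript-level counts with a transcript-to-gene map (illustrative)
txCounts <- matrix(c(5, 15, 20, 1, 9, 40), nrow = 3,
    dimnames = list(c("tx1", "tx2", "tx3"), c("sampleA", "sampleB")))
txToGene <- c(tx1 = "geneA", tx2 = "geneA", tx3 = "geneB")

# sum transcript counts within each gene, separately for every sample
geneCounts <- rowsum(txCounts, group = txToGene[rownames(txCounts)])
geneCounts
```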

We can again use the `plotBambu` function to visualise the gene expression
data across the 6 samples with a heatmap or PCA plot. As expected, samples
from the same cell line show higher correlation than samples from different
cell lines.
```{r, fig.width = 8, fig.height = 6}
colData(seGene)$groupVar <- SGNexSamples$cellLine
plotBambu(seGene, type = "heatmap")
```


### Save data (gtf/text) {#save-data}
*bambu* includes a function to write the extended annotations, the transcript
and the gene expression estimates that include any newly discovered genes and
transcripts to text files.   
```{r}
save.dir <- tempdir()
writeBambuOutput(se, path = save.dir, prefix = "NanoporeRNASeq_")
```

*bambu* also includes a function that only exports the extended annotations
to a gtf file:
```{r}
save.file <- tempfile(fileext = ".gtf")
writeToGTF(rowRanges(se), file = save.file)
```


## Identifying differentially expressed genes {#DESeq2}
One of the most common tasks when analysing RNA-Seq data is the analysis of 
differential gene expression across a condition of interest. Here we use 
*DESeq2* to find the differentially expressed genes between MCF7 and K562 cell
lines. Similar to using results from *Salmon*, estimates from *bambu* will
first be rounded.
```{r}
library(DESeq2)
dds <- DESeqDataSetFromMatrix(round(assays(seGene)$counts),
                                    colData = colData(se),
                                    design = ~ condition)
dds.deseq <- DESeq(dds)
deGeneRes <- DESeq2::results(dds.deseq, independentFiltering = FALSE)
head(deGeneRes[order(deGeneRes$padj),])
```
A quick summary of differentially expressed genes:
```{r}
summary(deGeneRes)
```

We can also visualise the MA-plot for differentially expressed genes using 
`plotMA(deGeneRes)`. However, MA-plots based on the original log2 fold changes
are affected by the noise associated with low-count genes. As recommended in the
[*DESeq2* tutorial](http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#alternative-shrinkage-estimators),
we apply shrinkage to the effect sizes, which improves the visualization
without requiring arbitrary filtering thresholds.
```{r, fig.width = 8, fig.height = 6}
library(apeglm)
resLFC <- lfcShrink(dds.deseq, coef = "condition_MCF7_vs_K562", type = "apeglm")
plotMA(resLFC, ylim = c(-3,3))
```

## Identifying differential transcript usage {#DEXSeq}
We use *DEXSeq* to detect alternatively used isoforms.
```{r}
library(DEXSeq)
dxd <- DEXSeqDataSet(countData = round(assays(se)$counts),
    sampleData = as.data.frame(colData(se)),
    design = ~sample + exon + condition:exon,
    featureID = rowData(se)$TXNAME,
    groupID = rowData(se)$GENEID)

dxr <- DEXSeq(dxd)

head(dxr)
```

We can visualize the MA-plot:
```{r,fig.width = 8, fig.height = 6}
plotMA(dxr, cex = 0.8 )
```


# Running bambu with a large number of samples {#large-sample-num}
For larger sample numbers we recommend writing the processed data to a file.
This can be done by providing an output directory through the `rcOutDir`
argument:
```{r, eval = FALSE}
se <- bambu(reads = bamFiles, rcOutDir = "./bambu/",
    annotations = HsChr22BambuAnnotation, 
    genome = genomeSequence)
```
# Bambu parameters {#bambu-parameters}
## Advanced Options
For transcript discovery we recommend adjusting the parameters according to 
the number of replicates and the sequencing throughput. The most relevant
parameters are explained here. You can use any combination of these parameters.
### More stringent filtering thresholds imposed on potential novel transcripts

* Keep novel transcripts with a read count of at least 5 in at least 1 sample:
```{r, eval = FALSE}
bambu(reads, annotations, genome,
    opt.discovery = list(min.readCount = 5))
```
* Keep novel transcripts that have at least 2 counts in at least 5 samples:
```{r, eval = FALSE}
bambu(reads, annotations, genome, 
    opt.discovery = list(min.sampleNumber = 5))
```
* Filter out transcripts with relative abundance within gene lower than 10%:
```{r, eval = FALSE}
bambu(reads, annotations, genome, 
    opt.discovery = list(min.readFractionByGene = 0.1))
```
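
To make the relative abundance filter more concrete, the sketch below computes the fraction of each gene's reads that supports each candidate transcript and applies a 10% cut-off. The vectors `readCounts` and `geneId` are made up, and the exact computation performed by *bambu* is an assumption here; the example only illustrates the idea behind the threshold.

```{r}
# toy read counts for candidate transcripts of two genes (illustrative)
readCounts <- c(tx1 = 90, tx2 = 8, tx3 = 2, tx4 = 50, tx5 = 50)
geneId <- c("geneA", "geneA", "geneA", "geneB", "geneB")

# fraction of each gene's reads that support each transcript
readFraction <- readCounts / ave(readCounts, geneId, FUN = sum)

# keep transcripts at or above a 10% relative abundance within their gene
keep <- readFraction >= 0.1
names(readCounts)[keep]
```

Here `tx2` and `tx3` fall below 10% of the reads of `geneA` and would be filtered out, while both transcripts of `geneB` are kept.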
### Quantification without bias correction    

The default estimation automatically does bias correction for expression
estimates. However, you can choose to perform the quantification without
bias correction.
```{r, eval = FALSE}
bambu(reads, annotations, genome, opt.em = list(bias = FALSE))
```
### Parallel computation

bambu allows parallel computation.
```{r, eval = FALSE}
bambu(reads, annotations, genome, ncore = 8)
```
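
Since these options can be combined, a single call might look like the following sketch. It only reuses the arguments shown in this section; `reads`, `annotations` and `genome` are placeholders as in the examples above, the helper name `discoveryOptions` is hypothetical, and the threshold values are illustrative rather than recommended settings.

```{r, eval = FALSE}
# collect the discovery thresholds from this section in a single list
discoveryOptions <- list(min.readCount = 5, min.sampleNumber = 5,
    min.readFractionByGene = 0.1)

# reads, annotations and genome are placeholders for your own inputs
bambu(reads, annotations, genome,
    opt.discovery = discoveryOptions,
    opt.em = list(bias = FALSE), ncore = 8)
```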
See the *[manual](https://github.com/GoekeLab/bambu/blob/master/docs/bambu_0.1.0.pdf)* 
for details on customising other options.

# Getting help {#get-help}

Questions and issues can be raised at the Bioconductor support site 
(once bambu is available through Bioconductor): 
https://support.bioconductor.org. Please tag your posts with bambu.

Alternatively, issues can be raised at the bambu GitHub
repository: https://github.com/GoekeLab/bambu.


# Citing bambu {#cite-bambu}
A manuscript describing bambu is currently in preparation. 
If you use bambu for your research, 
please cite using the following doi: 10.5281/zenodo.3900025.

# Session Information {#session-info}
```{r}
sessionInfo()
```