```{r, echo = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```
IsoformSwitchAnalyzeR
Enabling Identification and Analysis of Isoform Switches with Functional Consequences from RNA-sequencing data
Kristoffer Vitting-Seerup
`r Sys.Date()`
## Abstract
Recent breakthroughs in bioinformatics now allow us to accurately reconstruct and quantify full-length gene isoforms from RNA-sequencing data (via tools such as Cufflinks, Kallisto and Salmon). These tools make it possible to analyze alternative isoform usage, but unfortunately this is rarely done. Thus, RNA-seq data is typically not used to its full potential.
To solve this problem we developed IsoformSwitchAnalyzeR. IsoformSwitchAnalyzeR is an easy-to-use R package that enables statistical identification of isoform switching from RNA-seq derived quantification of novel and/or annotated full-length isoforms. IsoformSwitchAnalyzeR facilitates integration of many sources of (predicted) annotation, including Open Reading Frames (ORF), protein domains (via Pfam), signal peptides (via SignalP), coding potential (via CPAT) and sensitivity to Nonsense Mediated Decay (NMD). The combination of identified isoform switches and their annotation also enables IsoformSwitchAnalyzeR to predict potential functional consequences of the identified isoform switches --- such as loss of protein domains or coding potential --- thereby identifying isoform switches of particular interest. Lastly, IsoformSwitchAnalyzeR provides article-ready visualization methods for isoform switches, and summary statistics describing the genome-wide occurrences of isoform switches, their consequences and the associated alternative splicing.
In summary, IsoformSwitchAnalyzeR enables analysis of RNA-seq data with isoform resolution with a focus on isoform switching (with predicted consequences), thereby expanding the usability of RNA-seq data.
## Table of Content
[Preliminaries]
- [Background and Package Description]
- [Installation]
- [How To Get Help]
[What To Cite] (please remember)
[Quick Start]
- [Workflow Overview]
- [Short Example Workflow] (a.k.a. the "Too long; didn't read" section)
- [Examples of Switch Visualizations]
[Detailed Workflow]
- [Overview]
+ [IsoformSwitchAnalyzeR Background Information]
- [Importing Data Into R]
+ [Data from Cufflinks/Cuffdiff]
+ [Data from Kallisto, Salmon or RSEM]
+ [Data From Other Full-length Transcript Assemblers]
- [Filtering]
- [Identifying Isoform Switches]
+ [Testing Isoform Switches with IsoformSwitchAnalyzeR]
+ [Testing Isoform Switches via DRIMSeq]
+ [Testing Isoform Switches with other Tools]
- [Analyzing Open Reading Frames]
- [Extracting Nucleotide and Amino Acid Sequences]
- [Advice for Running External Sequence Analysis Tools]
- [Importing External Sequence Analysis]
- [Predicting Alternative Splicing]
- [Predicting Switch Consequences]
- [Post Analysis of Isoform Switches with Consequences]
+ [Analysis of Individual Isoform Switching]
+ [Genome-Wide Analysis of Isoform Switching]
[Analyzing Alternative Splicing] (new)
- [Workflow Overview]
- [Genome-Wide Analysis of Alternative Splicing]
[Other workflows]
- [Augmenting ORF Predictions with Pfam Results]
- [Analyze Small Upstream ORFs]
- [Remove Sequences Stored in SwitchAnalyzeRlist]
- [Adding Uncertain Category to Coding Potential Predictions]
- [Quality control of ORF of known annotation]
- [Analyzing the Biological Mechanisms Behind Isoform Switching]
- [Analysing experiments without replicates]
[Frequently Asked Questions, Problems and Errors]
[Final Remarks]
[Sessioninfo]
## Preliminaries
### Background and Package Description
The usage of alternative Transcription Start Sites (aTSS), Alternative Splicing (AS) and alternative Transcription Termination Sites (aTTS) collectively results in the production of different isoforms. Alternative isoforms are widely used, as recently demonstrated by The ENCODE Consortium, which found that on average 6.3 different transcripts are generated per gene, a number which may vary considerably from gene to gene.
The importance of analyzing isoforms instead of genes has been highlighted by many examples showing functionally important changes. One of these examples is pyruvate kinase. In normal adult homeostasis, cells use the adult isoform (M1), which supports oxidative phosphorylation. However, almost all cancer cells use the embryonic isoform (M2), which promotes aerobic glycolysis, one of the hallmarks of cancer. Such shifts in isoform usage are termed 'isoform switching' and cannot be detected when data is analyzed only at the gene level.
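To make this concrete, the toy calculation below shows how a switch can be invisible at the gene level. The numbers are made up for illustration (they are not real pyruvate kinase measurements), and the column names and the "isoform fraction" measure are choices made for this sketch.

```r
# Made-up expression values for two isoforms of one gene in two conditions
expr <- data.frame(
    isoform = c("M1", "M2"),
    normal  = c(80, 20),
    cancer  = c(20, 80)
)

# Gene-level expression is the sum over isoforms: identical in both conditions
gene_normal <- sum(expr$normal)
gene_cancer <- sum(expr$cancer)

# Isoform fraction: the share of the gene's expression coming from each isoform
expr$IF_normal <- expr$normal / gene_normal
expr$IF_cancer <- expr$cancer / gene_cancer

# Change in isoform fraction between conditions (M1 drops by 0.6, M2 gains 0.6)
expr$dIF <- expr$IF_cancer - expr$IF_normal

print(expr)
```

A gene-level comparison sees no change here (both totals are 100), while the isoform-level fractions reveal that the dominant isoform has swapped.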
Tools such as Cufflinks, Salmon and Kallisto allow for reconstruction and quantification of full-length transcripts from RNA-seq data. Such data has the potential to facilitate genome-wide analysis of alternative isoform usage and identification of isoform switching --- but unfortunately these types of analyses are still only rarely done; most analyses are on gene level only.
We hypothesize that there are multiple reasons why RNA-seq data is not used to its full potential:
1) There is still a lack of tools that can identify isoform switches with isoform resolution.
2) Although there are many excellent tools to perform sequence analysis, there is no common framework which allows for integration of the analysis provided by these tools.
3) There is a lack of tools facilitating easy and article-ready visualization of isoform switches.
To solve all these problems, we developed IsoformSwitchAnalyzeR.
IsoformSwitchAnalyzeR is an easy-to-use R package that enables the user to import (novel) full-length isoforms derived from an RNA-seq experiment into R. If annotated transcripts are analyzed, IsoformSwitchAnalyzeR offers integration with the multi-layer information stored in a GTF file, including the annotated coding sequences (CDS). If transcript structures are predicted (either de novo or guided), IsoformSwitchAnalyzeR offers an accurate tool for identifying the dominant ORF of the isoforms. Knowing the positions of the CDS/ORF within each isoform allows for prediction of sensitivity to Nonsense Mediated Decay (NMD) --- the mRNA quality control machinery that degrades isoforms with premature termination codons (PTC).
IsoformSwitchAnalyzeR facilitates identification of isoform switches via newly developed statistical methods that test each individual isoform for differential usage and thereby identify the exact isoforms involved in an isoform switch.
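One common way to operationalize a premature termination codon is the so-called 50 nucleotide rule: a stop codon located more than roughly 50 nt upstream of the last exon-exon junction is treated as premature. The sketch below assumes that rule. The function name, its arguments and the default cutoff are illustrative and are not taken from the IsoformSwitchAnalyzeR API.

```r
# Hypothetical helper: flag a stop codon as premature under the 50 nt rule.
# exon_lengths: exon lengths in transcript order (5' to 3')
# stop_pos:     transcript coordinate of the last nucleotide of the stop codon
# min_dist:     assumed distance cutoff in nucleotides
is_premature_stop <- function(exon_lengths, stop_pos, min_dist = 50) {
    # Single-exon transcripts have no exon-exon junction
    if (length(exon_lengths) < 2) {
        return(FALSE)
    }
    # Transcript coordinates of the exon-exon junctions (ends of all but the last exon)
    junctions <- cumsum(exon_lengths)[-length(exon_lengths)]
    last_junction <- junctions[length(junctions)]
    # Premature if the stop lies more than min_dist upstream of the last junction
    (last_junction - stop_pos) > min_dist
}

# Toy transcript with three exons; the last junction is at position 210
is_premature_stop(c(120, 90, 150), stop_pos = 130)  # stop 80 nt upstream of the last junction
is_premature_stop(c(120, 90, 150), stop_pos = 300)  # stop in the last exon
```

The first call flags the stop codon as premature and the second does not, since a stop codon in the final exon lies downstream of every junction.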
Since we know the exon structure of the full-length isoform, IsoformSwitchAnalyzeR can extract the underlying nucleotide sequence from a reference genome. This enables integration with the Coding Potential Assessment Tool (CPAT), which predicts the coding potential of an isoform and can be used to increase accuracy of ORF predictions. By combining the CDS/ORF isoform positions with the nucleotide sequence, we can extract the most likely amino acid sequence of the CDS/ORF. The amino acid sequence enables integration of analysis of protein domains (via Pfam) and signal peptides (via SignalP), both supported by IsoformSwitchAnalyzeR. Lastly, since the structures of all expressed isoforms from a given gene are known, one can also annotate alternative splicing, including intron retentions, a functionality also implemented in IsoformSwitchAnalyzeR.
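The principle of going from exon structure to sequence can be sketched in base R on a toy example. The "genome" string and exon coordinates below are invented, and taking the first ATG up to the first in-frame stop codon is a deliberate simplification for illustration, not the ORF detection method used by the package.

```r
# Toy "genome" with two exons separated by a GT...AG intron (positions 10-21)
genome <- "GGGATGGCCGTAAGTTTTCAGAAATGATTT"
exons  <- data.frame(start = c(1, 22), end = c(9, 30))

# Extract each exon and concatenate them into the spliced transcript
exon_seqs  <- substring(genome, exons$start, exons$end)
transcript <- paste(exon_seqs, collapse = "")

# Simplified ORF search: first ATG, then read codons until the first in-frame stop
orf_start    <- as.integer(regexpr("ATG", transcript, fixed = TRUE))
codon_starts <- seq(orf_start, nchar(transcript) - 2, by = 3)
codons       <- substring(transcript, codon_starts, codon_starts + 2)
stop_idx     <- which(codons %in% c("TAA", "TAG", "TGA"))[1]
orf          <- paste(codons[seq_len(stop_idx)], collapse = "")

transcript  # "GGGATGGCCAAATGATTT"
orf         # "ATGGCCAAATGA"
```

With real data the same steps would be applied to genomic ranges and a BSgenome object, and the ORF would then be translated with a full codon table (for example via `Biostrings::translate()`).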
Thus, in summary, IsoformSwitchAnalyzeR enables annotation of isoforms with intron retention, ORF, NMD sensitivity, coding potential, protein domains and signal peptides (and many more), resulting in the ability to predict important functional consequences of isoform switches.
IsoformSwitchAnalyzeR contains tools that allow the user to create article-ready visualizations of:
1) Individual isoform switches
2) Genome-wide analysis of isoform switches and their predicted consequences
3) Genome-wide analysis of alternative splicing.
These visualizations are easy to understand and integrate all information gathered throughout the workflow. Examples of visualizations can be found in the [Examples of Switch Visualizations] section.
Lastly, IsoformSwitchAnalyzeR is based on standard Bioconductor classes such as GRanges and BSgenome. Thus, it supports all species and annotation versions supported by the Bioconductor annotation packages.
Back to [Table of Content].
### Installation
IsoformSwitchAnalyzeR is part of the Bioconductor repository and community which means it is distributed with, and dependent on, Bioconductor. Installation of IsoformSwitchAnalyzeR is easy and can be done from within the R terminal. If it is the first time you use Bioconductor, simply copy-paste the following into an R session to install the basic Bioconductor packages:
    source("http://bioconductor.org/biocLite.R")
    biocLite()
If you already have installed Bioconductor, running these two commands will check whether updates for installed packages are available.
After you have installed the basic Bioconductor packages you can install IsoformSwitchAnalyzeR by copy-pasting the following into an R session:
    source("http://bioconductor.org/biocLite.R")
    biocLite("IsoformSwitchAnalyzeR")
This will install the IsoformSwitchAnalyzeR package as well as other R packages that are needed for IsoformSwitchAnalyzeR to work.
If you need to install from the development branch of Bioconductor, you must specify that explicitly. Note that this is for advanced users and should not be done unless you have good reason to:
    source("http://bioconductor.org/biocLite.R")
    useDevel(devel = TRUE)
    biocLite("IsoformSwitchAnalyzeR")
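Note that from Bioconductor 3.8 onwards the `biocLite()` installer was replaced by the BiocManager package. On a recent R installation, a sketch of the equivalent installation (assuming internet access and a Bioconductor release that contains the package) would be:

```r
# Install the BiocManager helper from CRAN if it is not already available
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

# Install IsoformSwitchAnalyzeR and its dependencies from Bioconductor
BiocManager::install("IsoformSwitchAnalyzeR")
```

Which of the two approaches applies depends on the R and Bioconductor versions installed on your system.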
### How To Get Help
This R package comes with plenty of documentation. Much information can be found in the R help files, which can be accessed by running `?functionName` in R (for example `?isoformSwitchTest`). Furthermore, this vignette contains a lot of information and will be continuously updated, so make sure to read both sources carefully, as they contain the answers to the most Frequently Asked Questions, Problems and Errors.
If you want to report a bug/error (found in the newest version of the R package!) please make an issue with a reproducible example at [github](https://github.com/kvittingseerup/IsoformSwitchAnalyzeR) --- remember to add the appropriate label.
If you have unanswered questions or comments regarding IsoformSwitchAnalyzeR, please post them on the associated [google group](https://groups.google.com/forum/#!forum/isoformswitchanalyzer) (please make sure the question was not already answered there).
If you have suggestions for improvements, please put them on [github](https://github.com/kvittingseerup/IsoformSwitchAnalyzeR). This will allow other people to upvote your idea, thereby showing us there is wide support for implementing it.
Back to [Table of Content].
## What To Cite
The analysis performed by IsoformSwitchAnalyzeR is only possible due to a string of other tools and scientific discoveries --- please read this section thoroughly and cite the appropriate articles to give credit where credit is due.
If you are using the:
- **Import of data from Salmon/Kallisto/RSEM** : Please cite reference _10_
- **Inter-library normalization of abundance values** : Please cite reference _10_ and _11_
- **Isoform switch test implemented in IsoformSwitchAnalyzeR** : Please cite both reference _1_ and _2_
- **Isoform switch test implemented in the DRIMSeq package (default)** : Please cite reference _1_ and _3_
- **Prediction of open reading frames (ORF) analysis** : Please cite reference _1_ and _4_
- **Prediction of premature termination codons (PTC) and thereby NMD-sensitivity** : Please cite reference _1_, _4_, _5_ and _6_
- **CPAT** : Please cite reference _7_
- **Pfam** : Please cite reference _8_
- **SignalP** : Please cite reference _9_
- **Prediction of consequences** : Please cite reference _1_
- **Visualizations** (plots) implemented in the IsoformSwitchAnalyzeR package : Please cite reference _1_
- **Alternative splicing analysis** : Please cite both reference _1_ and _4_
References:
1. _Vitting-Seerup et al. **The Landscape of Isoform Switches in Human Cancers.** Cancer Res. (2017)_
2. _Ferguson et al. **P-value calibration for multiple testing problems in genomics.** Stat. Appl. Genet. Mol. Biol. 2014, 13:659-673._
3. _Nowicka et al. **DRIMSeq: a Dirichlet-multinomial framework for multivariate count outcomes in genomics.** F1000Research, 5(0), 1356._
4. _Vitting-Seerup et al. **spliceR: an R package for classification of alternative splicing and prediction of coding potential from RNA-seq data**. BMC Bioinformatics 2014, 15:81._
5. _Weischenfeldt et al. **Mammalian tissues defective in nonsense-mediated mRNA decay display highly aberrant splicing patterns**. Genome Biol 2012, 13:R35_
6. _Huber et al. **Orchestrating high-throughput genomic analysis with Bioconductor**. Nat. Methods, 2015, 12:115-121._
7. _Wang et al. **CPAT: Coding-Potential Assessment Tool using an alignment-free logistic regression model**. Nucleic Acids Res. 2013, 41:e74._
8. _Finn et al. **The Pfam protein families database**. Nucleic Acids Research (2014) Database Issue 42:D222-D230_
9. _Petersen et al. **SignalP 4.0: discriminating signal peptides from transmembrane regions**. Nature Methods, 8:785-786, 2011_
10. _Soneson et al. **Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences.** F1000Research 4, 1521 (2015)._
11. _Robinson et al. **A scaling normalization method for differential expression analysis of RNA-seq data**. Genome Biology (2010)_
## Quick Start
### Workflow Overview
The idea behind IsoformSwitchAnalyzeR is to make it easy to do advanced post-analysis of full-length RNA-seq derived transcripts with a focus on finding, annotating and visualizing isoform switches with functional consequences. In the switch analysis workflow (see [Analyzing Alternative Splicing] for the alternative splicing workflow) IsoformSwitchAnalyzeR performs three high-level tasks:
- Identification of isoform switches.
- Annotation of the transcripts involved in the isoform switches.
- Visualization of predicted consequences of the isoform switches, for individual switches and globally.
A normal workflow for identification and analysis of isoform switches with functional consequences can be divided into two parts (also illustrated below in Figure 1).
**Part 1) Extract Isoform Switches and Their Sequences.** This part includes importing the data into R, identifying isoform switches, annotating those switches with open reading frames (ORF) and extracting the nucleotide and amino acid (peptide) sequences. The latter step enables the usage of external sequence analysis tools such as
* CPAT : The Coding-Potential Assessment Tool, which can be run either locally or via their [webserver](http://lilab.research.bcm.edu/cpat/).
* Pfam : Prediction of protein domains, which can be run either locally or via their [webserver](http://pfam.xfam.org/search#tabview=tab1).
* SignalP : Prediction of Signal Peptides, which can be run either locally or via their [webserver](http://www.cbs.dtu.dk/services/SignalP/).
All of the above steps are performed by the high-level function:
    isoformSwitchAnalysisPart1()
See below for example of usage, and [Detailed Workflow] for details on the individual steps.
**Part 2) Plot All Isoform Switches and Their Annotation.** This part involves importing and incorporating the results of the external sequence analysis, identifying intron retention, predicting functional consequences and plotting i) all genes with isoform switches and ii) summaries of general consequences of switching.
All of this can be done using the function:
    isoformSwitchAnalysisPart2()
See below for usage example, and [Detailed Workflow] for details on the individual steps.
**Alternatively**, if one does not plan to incorporate external sequence analysis, it is possible to run the full workflow using:
    isoformSwitchAnalysisCombined()
This corresponds to running _isoformSwitchAnalysisPart1()_ and _isoformSwitchAnalysisPart2()_ without adding the external results.