```{r, echo = FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "#>") ```

IsoformSwitchAnalyzeR

Enabling Identification and Analysis of Isoform Switches with Functional Consequences from RNA-sequencing data

Kristoffer Vitting-Seerup

`r Sys.Date()`

## Abstract Recent breakthroughs in bioinformatics now allow us to accurately reconstruct and quantify full-length gene isoforms from RNA-sequencing data (via tools such as Cufflinks, StringTie, Kallisto and Salmon). These tools make it possible to analyzing alternative isoform usage, but unfortunatly this is rarely done meaning RNA-seq data is typically not used to its full potential. To solve this problem we developed IsoformSwitchAnalyzeR. IsoformSwitchAnalyzeR is an easy-to use-R package that enables statistical identification of isoform switching from RNA-seq derived quantification of novel and/or annotated full-length isoforms. IsoformSwitchAnalyzeR facilitates integration of many sources of (predicted) annotation such as Open Reading Frame (ORF), protein domains (via Pfam), signal peptides (via SignalP), coding potential (via CPAT) and sensitivity to Non-sense Mediated Decay (NMD) etc. The combination of identified isoform switches and their annotation also enables IsoformSwitchAnalyzeR to predict potential functional consequences of the identified isoform switches --- such as loss of protein domains or coding potential --- thereby identifying isoform switches of particular interest. Lastly, IsoformSwitchAnalyzeR provides article-ready visualization methods for isoform switches, and summary statistics describing the genome-wide occurences of isoform switches, their consequences as well as the associated alternative splicing. In summary, IsoformSwitchAnalyzeR enables analysis of RNA-seq data with isoform resolution with a focus on isoform switching (with predicted consequences), thereby expanding the usability of RNA-seq data.
## Table of Content [Preliminaries] - [Background and Package Description] - [Installation] - [How To Get Help] [What To Cite] (please remember) [Quick Start] - [Overview of Isoform Switch Workflow] - [Example Workflow] (a.k.a. the "Too long; didn't read" section) - [Examples Visualizations] [Detailed Workflow] - [Overview of Detailed Workflow] + [IsoformSwitchAnalyzeR Background Information] - [Importing Data Into R] + [Importing Data from Kallisto, Salmon, RSEM or StringTie] + [Importing Data from Cufflinks/Cuffdiff] + [Importing Data from other Full-length Transcript Assemblers] - [Filtering] - [Identifying Isoform Switches] + [Testing for Isoform Switches via DEXSeq] (recomended) + [Testing for Isoform Switches via DRIMSeq] + [Testing for Isoform Switches with other Tools] - [Analyzing Open Reading Frames] - [Extracting Nucleotide and Amino Acid Sequences] - [Advice for Running External Sequence Analysis Tools] - [Importing External Sequence Analysis] - [Predicting Alternative Splicing] - [Predicting Switch Consequences] - [Removal of annotation data not needed anymore] - [Post Analysis of Isoform Switches with Consequences] + [Analysis of Individual Isoform Switching] + [Genome-Wide Analysis of Isoform Switching] [Analyzing Alternative Splicing] (new) - [Overview of Alternative Splicing Workflow] - [Genome-Wide Analysis of Alternative Splicing] [Other workflows] - [Augmenting ORF Predictions with Pfam Results] - [Analyze Small Upstream ORFs] - [Remove Sequences Stored in SwitchAnalyzeRlist] - [Adding Uncertain Category to Coding Potential Predictions] - [Quality control of ORF of known annotation] - [Analyzing the Biological Mechanisms Behind Isoform Switching] - [Analysing experiments without replicates] [Frequently Asked Questions, Problems and Errors] - [What Quantification Tool(s) Should I Use?] - [How to handle cofounding effects (including batches)] - [What constitute an independent biological replicate?] - etc [Final Remarks] [Sessioninfo]
## Preliminaries ### Background and Package Description The usage of Alternative Transcription Start sites (aTSS), Alternative Splicing (AS) and alternative Transcription Termination Sites (aTTS) are collectively collectively results in the production of different isoforms. Alternative isoforms are widely used as recently demonstrated by The ENCODE Consortium, which found that on average, 6.3 different transcripts are generated per gene; a number which may vary considerably per gene. The importance of analyzing isoforms instead of genes has been highlighted by many examples showing functionally important changes. One of these examples is the pyruvate kinase. In normal adult homeostasis, cells use the adult isoform (M1), which supports oxidative phosphorylation. However, almost all cancer cells use the embryonic isoform (M2), which promotes aerobic glycolysis, one of the hallmarks of cancer. Such shifts in isoform usage are termed 'isoform switching' and cannot be detected at when only analyzing data on gene level. On a more systematic level several recent studies suggest that isoform switches are quite common since they often identify hundres of switche events. Tools such as Cufflinks, Salmon and Kallisto allows for reconstruction and quantification of full-length transcripts from RNA-seq data. Such data has the potential to facilitate genome-wide analysis of alternative isoform usage and identification of isoform switching --- but unfortunately these types of analyses are still only rarely done; most analyses are on gene level only. We hypothesize that there are multiple reasons why RNA-seq data is not used to its full potential: 1) There is still a lack of tools that can identify isoform switches with isoform resolution 2) Although there are many excellent tools to perform sequence analysis, there is no common framework which allows for integration of the analysis provided by these tools. 3) There is a lack of tools facilitating easy and article-ready visualization of isoform switches. To solve all these problems, we developed IsoformSwitchAnalyzeR. IsoformSwitchAnalyzeR is an easy-to-use R package that enables the user to import (novel) full-length derived isoforms from an RNA-seq experiment into R. If annotated transcripts are analyzed, IsoformSwitchAnalyzeR offers integration with the multi-layer information stored in a GTF file including the annotated coding sequences (CDS). If transcript structures are predicted (either de-novo or guided) IsoformSwitchAnalyzeR offers an accurate tool for identifying the dominant ORF of the isoforms. The knowledge of isoform positions for the CDS/ORF allows for prediction of sensitivity to Nonsense Mediated Decay (NMD) --- the mRNA quality control machinery that degrades isoforms with pre-mature termination codons (PTC). IsoformSwitchAnalyzeR facilitates identification of isoform switches via newly developed statistical methods that tests each individual isoform for differential usage and thereby identifies the exact isoforms involved in an isoform switch. Since we know the exon structure of the full-length isoform, IsoformSwitchAnalyzeR can extract the underlying nucleotide sequence from a reference genome. This enables integration with the Coding Potential Assessment Tool (CPAT) which predicts the coding potential of an isoform and can be used to increase accuracy of ORF predictions. By combining the CDS/ORF isoform positions with the nucleotide sequence, we can extract the most likely amino acid sequence of the CDS/ORF. The amino acid sequence enables integration of analysis of protein domains (via Pfam) and signal peptides (via SignalP) --- both supported by IsoformSwitchAnalyzeR. Lastly, since the structures of all expressed isoforms from a given gene are known, one can also annotate alternative splicing - including retentions - a functionality also implemented in IsoformSwitchAnalyzeR. Thus, in summary, IsoformSwitchAnalyzeR enables annotation of isoforms with intron retention, ORF, NMD sensitivity, coding potential, protein domains and signal peptides (and many more), resulting in the ability to predict important functional consequences of isoform switches. IsoformSwitchAnalyzeR contains tools that allow the user to create article-ready visualizations of: 1) Individual isoform switches 2) Genome-wide analysis of isoform switches and their predicted consequences 3) Genome-wide analysis of alternative splicing and isoform switches and their predicted consequences. These visualizations are easy to understand and integrate all information gathered throughout the workflow. Example of visualizations can be found in the [Examples Visualizations] section. Lastly IsoformSwitchAnalyzeR is based on standard Bioconductor classes such as GRanges and BSgenome. Thus, it supports all species- and annotation versions facilitated by the Bioconductor annotation packages. Back to [Table of Content]. ### Installation IsoformSwitchAnalyzeR is part of the Bioconductor repository and community which means it is distributed with, and dependent on, Bioconductor. Installation of IsoformSwitchAnalyzeR is easy and can be done from within the R terminal. If it is the first time you use Bioconductor, simply copy-paste the following into an R session to install the basic Bioconductor packages: install.packages("BiocManager") BiocManager::install() If you already have installed Bioconductor, running these two commands will check whether updates for installed packages are available.
After you have installed the basic Bioconductor packages you can install IsoformSwitchAnalyzeR by copy-pasting the following into an R session: BiocManager::install("IsoformSwitchAnalyzeR") This will install the IsoformSwitchAnalyzeR package as well as other R packages that are needed for IsoformSwitchAnalyzeR to work.
If you need to install from the developmental branch of Bioconductor it is nesseary to explicitely specify that. Please note that this is for advanced uses and should not be done unless you have good reason to. Installation from the developmental branch can be done by copy/pasting: BiocManager::install("IsoformSwitchAnalyzeR", version = "devel")
### How To Get Help This R package comes with plenty of documentation. Much information can be found in the R help files (which can easily be accessed by running the following command in R "?functionName", for example "?importRdata") - here a lot of information can be found in the individual argument description as well as in the details section. This vignette contains a lot of information and will be continously updated, so make sure to read both sources carefully as it contains the answers to the most Frequently Asked Questions, Problems and Errors. If you want to report a bug/error (found in the newest version of the R package!) please make an issue with a reproducible example at [github](https://github.com/kvittingseerup/IsoformSwitchAnalyzeR/issues). If you have unanswered questions or comments regarding IsoformSwitchAnalyzeR or how to run it please post them on the associated [google group](https://groups.google.com/forum/#!forum/isoformswitchanalyzer) (after make sure the question was not already answered). If you have suggestions for improvements, please put them on [github](https://github.com/kvittingseerup/IsoformSwitchAnalyzeR). This will allow other people to upvote your idea, thereby showing us there is wide support of implementing your idea. Back to [Table of Content].
## What To Cite The analysis performed by IsoformSwitchAnalyzeR is only possible due to a string of other tools and scientific discoveries --- please read this section thoroughly and cite the appropriate articles to give credit where credit is due (and allow people to continue both maintaining and developing bioinformatic tools). If you are using the - **Import of data from Salmon/Kallisto/RSEM/StringTie** : Please cite reference _10_. - **Inter-library normalization of abundance values** : Please cite reference _10_ and _11_. - **Isoform switch test implemented utilizing DEXSeq via IsoformSwitchAnalyzeR (Default)** : Please cite reference _1_, _12_ and _13_. - **Isoform switch test implemented in the DRIMSeq package** : Please cite referencea _1_ and _3_. - **Prediction of open reading frames (ORF) analysis** : Please cite reference _1_ and _4_. - **Prediction of pre-mature termination codons (PTC) and thereby NMD-sensitivity** : Please cite refrence _1_, _4_, _5_ and _6_. - **CPAT** : Please cite reference _7_. - **Pfam** : Please cite reference _8_. - **SignalP** : Please cite reference _9_. - **Prediction of consequences** please cite reference _1_. - **Visualizations** (plots) implemented in the IsoformSwitchAnalyzeR package : Please cite reference _1_. - **Alternative splicing analysis** : Please cite both reference _1_ and _4_. - **Genome-wide enrichment analysis** : Please cite both reference _1_ and _2_.
Refrences: 1. _Vitting-Seerup et al. **The Landscape of Isoform Switches in Human Cancers.** Cancer Res. (2017)_ 2. _Vitting-Seerup et al. **IsoformSwitchAnalyzeR: Analysis of changes in genome-wide patterns of alternative splicing and its functional consequences.** bioRxiv (2018)_ 3. _Nowicka et al. **DRIMSeq: a Dirichlet-multinomial framework for multivariate count outcomes in genomics.** F1000Research, 5(0), 1356._ 4. _Vitting-Seerup et al. **spliceR: an R package for classification of alternative splicing and prediction of coding potential from RNA-seq data**. BMC Bioinformatics 2014, 15:81._ 5. _Weischenfeldt et al. **Mammalian tissues defective in nonsense-mediated mRNA decay display highly aberrant splicing patterns**. Genome Biol 2012, 13:R35_ 6. _Huber et al. **Orchestrating high-throughput genomic analysis with Bioconductor**. Nat. Methods, 2015, 12:115-121._ 7. _Wang et al. **CPAT: Coding-Potential Assessment Tool using an alignment-free logistic regression model**. Nucleic Acids Res. 2013, 41:e74._ 8. _Finn et al. **The Pfam protein families database**. Nucleic Acids Research (2014) Database Issue 42:D222-D230_ 9. _Petersen et al. **SignalP 4.0: discriminating signal peptides from transmembrane regions**. Nature Methods, 8:785-786, 2011_ 10. _Soneson et al. **Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences.** F1000Research 4, 1521 (2015)._ 11. _Robinson et al. **A scaling normalization method for differential expression analysis of RNA-seq data**. Genome Biology (2010)_ 12. _Ritchie et al. **limma powers differential expression analyses for RNA-sequencing and microarray studies**. Nucleic Acids Research (2015)_ 13. _Anders et al. **Detecting differential usage of exons from RNA-seq data**. Genome Research (2012)_
## Quick Start ### Overview of Isoform Switch Workflow The idea behind IsoformSwitchAnalyzeR is to make it easy to do advanced post-analysis of full-length RNA-seq derived transcripts quantification with a focus on finding, annotating and visualizing isoform switches with functional consequences. Furthermore IsoformSwitchAnalyzeR also allows for analysis of alternative splicing. Here we will go though the isoform switch workflow. For the workflow of alternative splicing see [Analyzing Alternative Splicing]. If you want to know more about recomendations for bioinformatic tools which can perform the initial isoform quantifications please refere to [What Quantification Tool(s) Should I Use?]. From the isoform quantifications IsoformSwitchAnalyzeR performs three high-level tasks: - Statistical identification of isoform switches. - Annotation of the transcripts involved in the isoform switches. - Visualization of predicted consequences of the isoform switches, for individual switches and globally. Please note that just like any other statistical tool IsoformSwitchAnalyzeR requires independent biological replicates (see [What constitute an independent biological replicate?]) and that [recent benchmarks](https://f1000research.com/articles/7-952/v2) highlights that at least three independent replicates are needed for good statistical performance - most tools have a hard time controling the False Discovery Rate (FDR) with only two replicates so extra caution is needed for interpreting results based on few replicates. A normal workflow for identification and analysis of isoform switches with functional consequences can be divided into two parts (also illustrated below in Figure 1).
**Part 1) Extract Isoform Switches and Their Sequences.** This part includes importing the data into R, identifying isoform switches, annotating those switches with open reading frames (ORF) and extracting the nucleotide and amino acid (peptide) sequences. The latter step enables the usage of external sequence analysis tools such as * CPAT : The Coding-Potential Assessment Tool, which can be run either locally or via their [webserver](http://lilab.research.bcm.edu/cpat/). * Pfam : Prediction of protein domains, which can be run either locally or via their [webserver](https://www.ebi.ac.uk/Tools/hmmer/search/hmmscan). * SignalP : Prediction of Signal Peptides, which can be run either locally or via their [webserver](http://www.cbs.dtu.dk/services/SignalP/). All of the above steps are performed by the high-level function: isoformSwitchAnalysisPart1() See below for example of usage, and [Detailed Workflow] for details on the individual steps.
**Part 2) Plot All Isoform Switches and Their annotation.** This part involves importing and incorporating the results of the external sequence analysis, identifying intron retention, predicting functional consequences and plotting i) all genes with isoform switches and ii) summaries of general consequences of switching. All of this can be done using the function: isoformSwitchAnalysisPart2() See below for usage example, and [Detailed Workflow] for details on the individual steps.
**Alternatively** if one does not plan to incorporate external sequence analysis, it is possible to run the full workflow using: isoformSwitchAnalysisCombined() This corresponds to running _isoformSwitchAnalysisPart1()_ and _isoformSwitchAnalysisPart2()_ without adding the external results. **Figure 1: Workflow overview.** Grey transparent boxes indicate the two parts of a normal workflow for analyzing isoform switches. The individual steps in the two sub-workflows are indicated by arrows. Speech bubbles summarize how this full analysis can be done in a two-step process using the high-level functions (_isoformSwitchAnalysisPart1()_ and _isoformSwitchAnalysisPart2()_). Back to [Table of Content].
### Example Workflow As indicted above, a full, but less customizable, analysis of isoform switches can be done using the two high-level functions _isoformSwitchAnalysisPart1()_ and _isoformSwitchAnalysisPart2()_. This section aims to show how these functions are used as well as illustrate what IsoformSwitchAnalyzeR can be used for. Lets start by loading the R package ```{r, results = "hide", message = FALSE, warning=FALSE} library(IsoformSwitchAnalyzeR) ```
Note that newer versions of RStudio support auto-completion of package names. #### Importing the Data The first step is to import all data needed for the analysis and store them in a switchAnalyzeRlist object. IsoformSwitchAnalyzeR has different functions for importing data from tools such as Salmon/Kallisto/RSEM/StringTie/Cufflinks, but can be used with all isoform-level expression data using implemented general-purpose functions. For the purpose of illustrating data import lets use some data quantified via Salmon. Note the approach for Kallisto/RSEM/StringTie would be identical - for other sources of quantification (including Cufflinks/Cuffdiff) see [Importing Data Into R]. To illustrate the data import lets use some Salmon data included in the IsoformSwitchAnalyzeR package. ```{r} ### Please note # The way of importing files in the following example with # "system.file('pathToFile', package="IsoformSwitchAnalyzeR") is # specialiced way of accessing the example data in the IsoformSwitchAnalyzeR package # and not somhting you need to do - just supply the string e.g. # "/mySalmonQuantifications/" pointing to the parent directory (where # each sample is a seperate sub-directory) to the function. ### Import quantifications salmonQuant <- importIsoformExpression( parentDir = system.file("extdata/",package="IsoformSwitchAnalyzeR") ) ``` Which results in a list containing both count and abundance estimates for each isoform: ```{r} head(salmonQuant$abundance, 2) head(salmonQuant$counts, 2) ```
Apart from the isoform quantification we need three additional pices of information. - The transcript structure of the isoforms (in genomic coordinats). This is typically stored in a GTF file - a file format directly supported by IsoformSwitchAnalyzeR (as described below). - The information about which isoforms belongs to the same gene (also in the GTF file) - A design matrix indicating which of the independent biological replicates belong to which condition (and if there are other covariates that should be taking into account). The design matrix will have to be generated manually to ensure correct grouping: ```{r} myDesign <- data.frame( sampleID = colnames(salmonQuant$abundance)[-1], condition = gsub('_.*', '', colnames(salmonQuant$abundance)[-1]) ) myDesign ```
Note that if you have additional covariates in the data of interest these can also be added to the design matrix to ensure they are taking into account during the statistical analysis. ( Covariates are (unwanted) sources of variation not due to experimental groups you are interested in. This can be anything from a batch effects (e.g. data produced at different points in time) to an effect you are not interested in but which migth influence the result (e.g. sex or age). See [How to handle cofounding effects (including batches)] for more info ). Now have the isoform quantifications, the design matrix and the isoform annotation (in a GTF file which IsoformSwitchAnalyzeR will import itsef) we can now combine and store all the relevant data in a switchAnalyzeRlist. Please note that: * It is highly recommended to both supply count and abundance expression matrixes to the importRdata() but a switchAnalyzeRlist can also be created with only one of them. * It is essential that the isoforms quantified (with Salmon, Kallisto etc) and the annotation stored in the GTF file is identical. We highly recomend using files from [GENCODE genes](http://gencodegenes.org/) - where The "Comprehensive gene annotation" GTF and the "Transcript sequences" fasta file is a perfect pair. For more information (including instructions for how to use Ensemble annotation) refere to the FAQ [The error "The annotation does not fit the expression data"].
Once we have all the nessesary data the easiest way to create a switchAnalyzeRlist is with the importRdata() function - which also directly can handle import and integration of annotation from a GTF file: ```{r, message = FALSE, warning=FALSE} ### Please note # The way of importing files in the following example with # "system.file("extdata/example.gtf.gz", package="IsoformSwitchAnalyzeR")"" is # specialiced way of accessing the example data in the IsoformSwitchAnalyzeR package # and not somhting you need to do - just supply the string e.g. # "/myAnnotation/myQuantified.gtf" to the isoformExonAnnoation argument ### Create switchAnalyzeRlist aSwitchList <- importRdata( isoformCountMatrix = salmonQuant$counts, isoformRepExpression = salmonQuant$abundance, designMatrix = myDesign, isoformExonAnnoation = system.file("extdata/example.gtf.gz", package="IsoformSwitchAnalyzeR"), showProgress = FALSE ) aSwitchList ``` Note that by supplying the GTF file to the "isoformExonAnnoation" argument IsoformSwitchAnalyzeR will automatically import and integrate CDS regions from in the GTF file as the ORF regions used by IsoformSwitchAnalyzeR (if addAnnotatedORFs=TRUE (default)). For more information about the switchAnalyzeRlist see [IsoformSwitchAnalyzeR Background Information].

To illustrate the IsoformSwitchAnalyzeR workflow we will use a smaller example dataset originally quantified with Cufflinks/CuffDiff. The corresponding switchAnalyzeRlist is included in IsoformSwitchAnalyzeR and can loaded as follows: ```{r} data("exampleSwitchList") exampleSwitchList ``` Note that although a isoform switch identification analysis was performed (by CuffDiff), no genes with differential usage were found.
#### Part 1 Before we can run the analysis it is necessary to know that IsoformSwitchAnalyzeR measures isoform usage via isoform fraction (IF) values which quantifies the fraction of the parent gene expression originating from a specific isoform (calculated as / ). Consequently the difference in isoform usage is quantifed as the difference in isoform fraction (dIF) calculated as IF2 - IF1, and these dIF are used to measure the effect size (like fold changes are in gene/isoform expression analysis). We can now run the first part of the isoform switch analysis workflow which filters for non-expressed genes/isoforms, identifies isoform switches, annotates open reading frames (ORF), switches with and extracts both the nucleotide and peptide (amino acid) sequences and output them as two seperate fasta files. ```{r, results = "hide", message = FALSE, warning=FALSE} ### isoformSwitchAnalysisPart1 needs the genomic sequence to predict ORFs # Genome sequences are available from Bioconductor as BSgenome objects: # http://bioconductor.org/packages/release/BiocViews.html#___BSgenome # Here we use the hg19 reference genome - which can be downloaded by # copy/pasting the following two lines into the R terminal: # BiocManager::install("BSgenome.Hsapiens.UCSC.hg19") library(BSgenome.Hsapiens.UCSC.hg19) exampleSwitchList <- isoformSwitchAnalysisPart1( switchAnalyzeRlist = exampleSwitchList, genomeObject = Hsapiens, # the reference to the human BS genome dIFcutoff = 0.3, # Cutoff for finding switches - set high for short runtime in example data # pathToOutput = 'path/to/where/ouput/should/be/' outputSequences = FALSE # prevents outputting of the fasta files used for external sequence analysis ) ``` Note that: 1. In the example above we set _outputSequences=FALSE_ only to make the example data run nicely (e.g. to prevent the function from out the two fasta files of the example data to your computer). When you analyze your own data you want to set _outputSequences=TRUE_ and use the _pathToOutput_ to specify where the files sould be saved. The two files outputted are needed to do the external analysis as described below. 2. The _isoformSwitchAnalysisPart1()_ function has an argument, _overwritePvalues_, which overwrites the result of an already existing switch test (such as those imported by cufflinks) with the result of running _isoformSwitchTestDEXSeq()_. 3. The switchAnalyzeRlist returned by _isoformSwitchAnalysisPart1()_ has been reduced to only contain genes where an isoform switch (as defined by the _alpha_ and _dIFcutoff_ arguments) was identified. This enables much faster runtimes for the rest of the pipeline, as only isoforms from a gene with a switch are analyzed. The number of switching features is easily summarized as follows: ```{r} extractSwitchSummary( exampleSwitchList, dIFcutoff = 0.3 # supply the same cutoff to the summary function ) ```
In a typical workflow the user would here have to use produced fasta files to perfrom the external analysis tools (Pfam (protein domains), SignalP (signal peptides), CPAT (coding potential)). For more information on how to run those tools refere to [Advice for Running External Sequence Analysis Tools]. To illustrate the workflow we will here use the result of running those tools on the example data wich we have also included in our R package.
#### Part 2 The second part of the isoform switch analysis workflow, which includes importing and incorporating external sequence annotation, analysis of alternative splicing, predicting functional consequences and visualizing both the general effects of isoform switches and the individual isoform switches. The combined analysis can be done by: ```{r, results = "hide", message = FALSE} # Please note that in the following the part of the examples using # the "system.file()" commandis not nesseary when using your own # data - just supply the path as a string # (e.g. pathToCPATresultFile = "/myFiles/myCPATresults.txt" ) exampleSwitchList <- isoformSwitchAnalysisPart2( switchAnalyzeRlist = exampleSwitchList, dIFcutoff = 0.3, # Cutoff for finding switches - set high for short runtime in example data n = 10, # if plotting was enabled, it would only output the top 10 switches removeNoncodinORFs = TRUE, # Because ORF was predicted de novo pathToCPATresultFile = system.file("extdata/cpat_results.txt" , package = "IsoformSwitchAnalyzeR"), pathToPFAMresultFile = system.file("extdata/pfam_results.txt" , package = "IsoformSwitchAnalyzeR"), pathToSignalPresultFile = system.file("extdata/signalP_results.txt", package = "IsoformSwitchAnalyzeR"), codingCutoff = 0.725, # the coding potential cutoff we suggested for human outputPlots = FALSE # keeps the function from outputting the plots from this example ) ``` The exampleSwitchList now contains all the information needed to analyze isoform switches and alterantive splicing both for individual genes as well as genome-wide summary statistics or analysis. The number of isoform switches with predicted functional consequences are extracted by setting "filterForConsequences = TRUE": ```{r} extractSwitchSummary( exampleSwitchList, filterForConsequences = TRUE, dIFcutoff = 0.3 # supply the same cutoff to the summary function ) ``` For each of these top genes, a switch plot will be generated if "outputPlots=TRUE" were used. Let's take a closer look at the top candidates: The top genes with isoform switches are: ```{r, message = FALSE} extractTopSwitches(exampleSwitchList, filterForConsequences = TRUE, n=3) ```
### Examples Visualizations Let's take a look at the isoform switch in the SRRM1 gene via the switch plot produced by IsoformSwitchAnalyzeR: ```{r, fig.width=12, fig.height=7, warning=FALSE} switchPlot(exampleSwitchList, gene='SRRM1') ``` From this plot, we first note that the gene expression is not significantly changed (bottom left). Next, we see the large significant switch in isoform usage across conditions (bottom right). By comparing which isoforms are changing (bottom right) to the isoform structure (top) it can be deduced that in hESC it is primarly the short isoforms that is used, but as the cells are differentiated to fibroblasts, there is a change towards mainly using the long isoform. Interestingly, this isoform switch seems to result in a truncation of the PWI domain which is important for DNA/RNA binding. SRRM1 is a splice factor and this switch could lead to the hypothesis that SRRM1 is nonfunctional in hESC but functional in Fibroblasts even though the total output from the gene is the same --- a hypothesis which would naturallt have to be confirmed experimentally. Note that if you want to save this plot as a pdf file via the _pdf_ command, you need to specify "onefile = FALSE". The folloing code chunk will produce a nicely-sized figure: pdf(file = '.pdf', onefile = FALSE, height=5, width = 8) switchPlot(exampleSwitchList, gene='SRRM1') dev.off() Furthermore, note that: - If the switchAnalyzeRList contains multiple comparison, you will also need to specify the 'condition1' and 'condition2' arguments in the _switchPlot()_ function to indicate specifically which comparison you want to plot. - The differential isoform/gene expression analysis is not a part of the IsoformSwitchAnalyzeR workflow but can easily be added as described in [Adding differential gene expression]. - the _switchPlot()_ function have a argument called _IFcutoff_ which requires the Isoform Fraction of an isforom to be larger than IFcutoff (default 0.01) to be included in the plot. Increasing this cutoff can result in "cleaner" plots as minor isoforms will be removed.
To illustrate the genome-wide analysis of consequences of isoform switching in the different comparisons, we will use a larger dataset which is a subset of two of the TCGA Cancer types analyzed in Vitting-Seerup et al 2017. ```{r} data("exampleSwitchListAnalyzed") ``` The first step is to take a look at the number and overlap of isoform switches. The numbers can be extracted via the _extractSwitchSummary()_ function and by setting "filterForConsequences=TRUE" we only extract the features involved in a switch which is predicted to have a functional consequence: ```{r, fig.width=9, fig.height=5} extractSwitchSummary( exampleSwitchListAnalyzed, filterForConsequences=TRUE ) ```
The number, and overlap, between condtions can be viusalized with the _extractSwitchOverlap()_ function: ```{r, fig.width=9, fig.height=5} extractSwitchOverlap( exampleSwitchListAnalyzed, filterForConsequences=TRUE ) ``` We can look more into the details of the consequences by considering each type consequence seperately as follows: ```{r, fig.width=12, fig.height=5} extractConsequenceSummary( exampleSwitchListAnalyzed, consequencesToAnalyze='all', plotGenes = FALSE, # enables analysis of genes (instead of isoforms) asFractionTotal = FALSE # enables analysis of fraction of significant features ) ``` Note the "consequencesToAnalyze" argument enables analysis of only a subset of features. From this summary plot many conclusions are possible. First of all, the most frequent changes are changes affecting protein domains and ORFs. Secondly, intron retention is more commonin LUAD than in COAD. Lastly, when considering oppositeconsequence (e.g. the gain vs loss of protein domains) its quite easy to see they are unevenly distributed (e.g. more protein domain loss than protein domain gain). This uneven distribution can be systematically analyzed using the build in enrichment analysis: ```{r, fig.width=12, fig.height=5} extractConsequenceEnrichment( exampleSwitchListAnalyzed, consequencesToAnalyze='all', analysisOppositeConsequence = TRUE ) ``` For each pair of oppositeconsequences (e.g. protein domain loss vs gain) (y-axis) this plot shows the fraction of switches, affected by either of the opposing consequence, that results in the consequnce indicated (e.g. protein domain loss) (x-axis). If this fraction is significantly different from 0.5 it indicates there is a systematic biases in which consequence is detected. From the analysis above, it is therefore quite clear, that many of the oposing consequences are significantly unevenly distributed. In other words, many types of consequences seems to be used in a group-specific manner.
When comparing the two cancer types (right vs left plot) the overall pattern seems similar - but there might be differences. This can formally be analyzed with the _extractConsequenceEnrichmentComparison()_ function. This function with for each oposing set of consequence (e.g. protein domain loss vs gain), in a pairwise manner contrasts the individual comparisons to assess whether the ratio of loss/gains are indeed significantly different: ```{r, fig.width=12, fig.height=4} extractConsequenceEnrichmentComparison( exampleSwitchListAnalyzed, consequencesToAnalyze=c('domains_identified','intron_retention','coding_potential'), analysisOppositeConsequence = TRUE ) ``` The analysis shows that compared to LUAD, COAD have a significantly higher fraction of switches resulting in protein domain loss, but there is no difference in terms of intron retention loss or switches from noncoding to to coding transcripts.
Such a results leads to the question what cellular mechanism underlies behind these changes. Due to the detailed alternative splicing analysis performed we can analyze this using the built-in summary function: ```{r, fig.width=12, fig.height=4.5} extractSplicingEnrichment(exampleSwitchListAnalyzed) ``` Which is equivivalent to the consequnce enrichment analysis described above. From this analysis, we see that although the patterns of consequences looked somewhat slimiar (as illustrated above), the underlying consequences are quite different. COAD utilizes alternative 3' acceptor sites (A3), multiple exon skipping (MES) and alternative transcription start sites (ATSS) while LUAD utilize intron retention (IR) and alternative transcription termination sites (ATTS) more. For more details (and functionality) regarding splicing analysis see the [Genome-Wide Analysis of Alternative Splicing] section.
Since all the data about isoform and gene expression is saved in the switchAnalyzeRlist, we can also make a set of overview plots: ```{r, fig.width=8, fig.height=4, warning=FALSE} data("exampleSwitchListAnalyzed") ### Vulcano like plot: ggplot(data=exampleSwitchListAnalyzed$isoformFeatures, aes(x=dIF, y=-log10(isoform_switch_q_value))) + geom_point( aes( color=abs(dIF) > 0.1 & isoform_switch_q_value < 0.05 ), # default cutoff size=1 ) + geom_hline(yintercept = -log10(0.05), linetype='dashed') + # default cutoff geom_vline(xintercept = c(-0.1, 0.1), linetype='dashed') + # default cutoff facet_wrap( ~ condition_2) + #facet_grid(condition_1 ~ condition_2) + # alternative to facet_wrap if you have overlapping conditions scale_color_manual('Signficant\nIsoform Switch', values = c('black','red')) + labs(x='dIF', y='-Log10 ( Isoform Switch Q Value )') + theme_bw() ``` As there are many dIF values (effect size) very close to zero, which have a significant isoform switch (black dots above dashed hoizontal line) this nicely illustrates why a cutoffs both on the dIF and the q-value are necessary. Another interesting overview plot can be made as follows: ```{r, fig.width=8, fig.height=4, warning=FALSE} ### Switch vs Gene changes: ggplot(data=exampleSwitchListAnalyzed$isoformFeatures, aes(x=gene_log2_fold_change, y=dIF)) + geom_point( aes( color=abs(dIF) > 0.1 & isoform_switch_q_value < 0.05 ), # default cutoff size=1 ) + facet_wrap(~ condition_2) + #facet_grid(condition_1 ~ condition_2) + # alternative to facet_wrap if you have overlapping conditions geom_hline(yintercept = 0, linetype='dashed') + geom_vline(xintercept = 0, linetype='dashed') + scale_color_manual('Signficant\nIsoform Switch', values = c('black','red')) + labs(x='Gene log2 fold change', y='dIF') + theme_bw() ``` Here, it is clear that changes in gene expression and isoform switches are not necessarily mutually exclusive, as there are many genes which are both differentially expressed (large gene log2FC) and contain isoform switches (color). This also highlights the importance of also analyzing RNA-seq data at isoform-level resolution. Back to [Table of Content].

## Detailed Workflow ### Overview of Detailed Workflow Before you start on this section, we recommend you to read though the [Quick Start] section, as it gives the large overview, introduces some basic concepts and gives a couple of tips not repeated here. Compared to the workflow presented in [Quick Start], a full workflow for analyzing isoform switches has many sub-steps which each can be customized/optimized. In this section, we will go more into depth with each of the steps as well as provide tips and shortcuts for working with IsoformSwitchAnalyzeR. Specifically, each of the main functions behind these steps will be described and illustrated. For a more comprehensive and detailed description of each individual function please refer to the individual function documentation build into the R package (easily accessed via ?functionName). First we will start by getteing an overview of the detailed workflow before diving into the individual parts. Then we will introduce the [IsoformSwitchAnalyzeR Background Information] - all you need to know before doing the detailed analysis. This is followed by a detailed step-by-step isoform switch analysis workflow and lastly an overview of useful [Other Tools in IsoformSwitchAnalyzeR] is provided.
The detailed workflow consists of the following steps (illustrated in Figure 2) which, just like before, can be divided into two parts: **Part 1) Extract Isoform Switches and Their Sequences.** * [Importing Data Into R] + [Importing Data from Kallisto, Salmon, RSEM or StringTie] + [Importing Data from Cufflinks/Cuffdiff] + [Importing Data from other Full-length Transcript Assemblers] * [Filtering] * [Identifying Isoform Switches] * [Analyzing Open Reading Frames] * [Extracting Nucleotide and Amino Acid Sequences] * [Advice for Running External Sequence Analysis Tools] This corresponds to running the following functions in sequential order (which incidentally is just what the _isoformSwitchAnalysisPart1()_ function does): ``` ### Import data into R via one of: myQantifications <- importIsoformExpression() # RSEM/Kallisto/Salmon/StringTie myCuffDb <- readCufflinks() # Cufflinks/Cuffdiff ### Create SwitchAnalyzeRlist mySwitchList <- importRdata() # OR mySwitchList <- importCufflinksData() ### Filter mySwitchList <- preFilter( mySwitchList ) ### Test for isoform switches mySwitchList <- isoformSwitchTestDEXSeq( mySwitchList ) # OR mySwitchList <- isoformSwitchTestDRIMSeq( mySwitchList ) ### If novel isoforms (else use CDS from ORF as explained in importRdata() ) mySwitchList <- analyzeORF( mySwitchList ) ### Extract Sequences mySwitchList <- extractSequence( mySwitchList ) ### Summary extractSwitchSummary(mySwitchList) ``` **Part 2) Plot All Isoform Switches and Their annotation.** - [Importing External Sequence Analysis] - [Predicting Alternative Splicing] - [Predicting Switch Consequences] - [Post Analysis of Isoform Switches with Consequences] + [Analysis of Individual Isoform Switching] + [Genome-Wide Analysis of Isoform Switching] This corresponds to running the following functions in sequential order (just like the _isoformSwitchAnalysisPart2()_ function does): ``` ### Add annotation mySwitchList <- analyzeCPAT( mySwitchList ) mySwitchList <- analyzePFAM( mySwitchList ) mySwitchList <- analyzeSignalP( mySwitchList ) mySwitchList <- analyzeAlternativeSplicing( mySwitchList ) ### Analyse consequences mySwitchList <- analyzeSwitchConsequences( mySwitchList ) ### Visual analysis # Indiviudal switches switchPlotTopSwitches( mySwitchList ) ### Summary extractSwitchSummary(mySwitchList, filterForConsequences = TRUE) ```
For a normal workflow this would then typically be followed by the global analysis of alternative splcing and consequences consequences: ``` # global consequence analysis extractConsequenceSummary( mySwitchList ) extractConsequenceEnrichment( mySwitchList ) extractConsequenceGenomeWide( mySwitchList ) # global splicing analysis extractSplicingSummary( mySwitchList ) extractSplicingEnrichment( mySwitchList ) extractSplicingGenomeWide( mySwitchList ) ``` The combined workflow is visualized here: **Figure 2: Detailed workflow overview.** The grey transparent boxes indicate the two parts of a normal workflow for analyzing isoform switches. The individual steps in the two sub-workflows are indicated by arrows along with a description and the main R functions for performing the steps.
### IsoformSwitchAnalyzeR Background Information #### The switchAnalyzeRlist The switchAnalyzeRlist object is created to specifically contain and summarize all relevant information about the isoforms involved in isoform switches. The switchAnalyzeRlist object is a named list, meaning each entry in the list can be accessed by its name via the '$' symbol or by using "[['entryName']]". A newly created switchAnalyzeRlist object contains 6 entries, and as the isoforms are gradually annotated and analyzed more entries are added. ```{r} data("exampleSwitchList") # A newly created switchAnalyzeRlist + switch analysis names(exampleSwitchList) data("exampleSwitchListAnalyzed") # A fully analyzed switchAnalyzeRlist names(exampleSwitchListAnalyzed) ``` The first entry 'isoformFeatures' is a data.frame where all relevant data about each comparison of an isoform (between conditions), as well as the analysis performed and annotation incooperate via IsoformSwitchAnalyzeR, is stored. Amongst the default information is isoform and gene id, gene and isoform expression as well as the _isoform\_switch\_q_value_ and _isoform\_switch\_q_value_ where the result of the differential isoform analysis is stored. The comparisons made can be identified as "from 'condition\_1' to 'condition\_2'", meaning 'condition\_1' is considered the ground state and 'condition\_2' the changed state. This also means that a positive dIF value indicates that the isoform usage is increased in 'condition\_2' compared to 'condition\_1'. Since the 'isoformFeatures' entry is the most relevant part of the switchAnalyzeRlist object, the most-used standard methods have also been implemented to work directly on isoformFeatures. ```{r} ### Preview head(exampleSwitchList, 2) # tail(exampleSwitchList, 2) ### Dimentions dim(exampleSwitchList$isoformFeatures) nrow(exampleSwitchList) ncol(exampleSwitchList) dim(exampleSwitchList) ```
A very useful functionality implemented in IsoformSwitchAnalyzeR is the _subsetSwitchAnalyzeRlist()_ function, which allows for removal of isoforms and all their associated information across all entries in a switchAnalyzeRlist. The function subsets the switchAnalyzeRlist based on a vector of logicals matching the isoformFeatures entry of the list. ```{r} exampleSwitchList ### Subset subsetSwitchAnalyzeRlist( exampleSwitchList, exampleSwitchList$isoformFeatures$gene_name == 'ARHGEF19' ) ```
Transcript structure information is stored in the exon entry of the switchAnalyzeRlist and contains the genomic coordinates for each exon in each isoform, and a column indicating which isoform it originates from. This information is stored as GenomicRanges (GRanges), which is very useful for overlapping genomic features and interacting with other Bioconductor packages. ```{r} head(exampleSwitchList$exons,2) ``` A full description of the initial switchAnalyzeRlist can be found via _?switchAnalyzeRlist_ and each additional entry added is described in the "value" part of the documentation for the specific function adding the entry. Back to [Table of Content]
#### Function overview With a few exceptions, all functions in IsoformSwitchAnalyzeR can be divided into 6 groups sharing the first part of the function name (very useful with RStudios autocompletion functionality (via the tab botton)): 1. **_import\*()_**: Functions for importing data from different data sources into a switchAnalyzeRlist, for example _importIsoformExpression()_ for importing data from Kallisto, Salmon StringTie or RSEM. 2. **_analyze\*()_**: Functions for analyzing and annotating the switchAnalyzeRlist. For example _analyzeORF()_ for analyzing the ORFs of the isoforms in the switchAnalyzeRlist. 3. **_isoformSwitchTest\*()_**: Functions performing the differential isoform usage test that enables switch identification. For example _isoformSwitchTestDEXSeq()_. 4. **_extract\*()_**: Functions for extracting (summarized) data or data analysis from the switchAnalyzeRlist., e.g. _extractConsequenceSummary()_ for extracting a summary of the number of genes and isoforms with isoform switches including predicted functional consequences. 5. **_switchPlot\*()_**: Functions that allows for visualization of the data in various ways, e.g. _switchPlotTranscript()_ for visualizing the transcripts along with the compiled annotation or _switchPlot()_ for making the Isoform Switch Analysis Plot (see [Examples Visualizations]) 6. **_isoformSwitchAnalysis\*()_**: High-level functions (described in [Example Workflow]) which offer an easy, but less customizable, way of performing the full workflow for analyzing isoform switches, e.g. _isoformSwitchAnalysisPart1()_ for running the first (of two) part of the workflow. The noteworthy exceptions to these categories are the following: createSwitchAnalyzeRlist() preFilter() Where: * _createSwitchAnalyzeRlist()_ is the central constructor of the switchAnalyzeRlist class which is both used by all the _import\()_ functions and for construction of switchAnalyzeRlist from user supplied data sources. Note that this function also adds the 'iso_ref' and 'gene_ref columns to the isoformFeatures entry of a switchAnalyzeRlist. These reference indexes give a unique id to each gene and isoform in each comparison and thereby allow for easy integration of the different downstream analyses. * _preFilter()_ enables removal of non-interesting data such as non-expressed isoforms or genes with only one annotated isoform from a switchAnalyzeRlist. Some additional tools that are essential for the workflow are omitted form this list (more about them later in the [Other Tools in IsoformSwitchAnalyzeR] secion). Note that we have tried to be systematic with function argument names for cutoffs, meaning that if the argument name contains "cutoff" the value supplied is not included, whereas if it contains "min" or "max" the given value is included. Back to [Table of Content]
### Importing Data Into R IsoformSwitchAnalyzeR can analyze the output from any tool that quantifies (annoated and/or de-novo/guided deconvoluted) full-length isoforms. All it requires are the following four sets of data: * Isoform quantification (preferably both counts and abundance estimates (RPKM/FPKM/TPM)) in multiple samples from each condition (independent replicates are essential if differential isoform usage should be identified, see [What constitute an independent biological replicate?]) * The genomic coordinates of the isoform exon structure. * Annotation of which isoform belongs to which gene. * A design matrix indicating which samples belong to which conditions. It is highly recommended to obtain the estimated replicate abundances and use them in the construction of the switchAnalyzeRlist as they better allow for effect size estimates. From these data, a switchAnalyzeRlist (see [The switchAnalyzeRlist] for description) can be constructed. An important addition is that the design matrix (a data.frame with two mandatory columns: sampleID and condition) also allows for incorporation of additional cofactors (sources of variation not of interest for the the analysis you are doing) in the experimental setup (see [How to handle cofounding effects (including batches)] for more information).
To facilitate the usage of IsoformSwitchAnalyzeR, several dedicated wrappers for constructing a switchAnalyzeRlist from different data sources are included in IsoformSwitchAnalyzeR. See the following sections for details about the specific import methods: * [Importing Data from Kallisto, Salmon, RSEM or StringTie] * [Importing Data from Cufflinks/Cuffdiff] * [Importing Data from other Full-length Transcript Assemblers] For recomendations of bioinformatic tools which can perform the quantifications please refere to [What Quantification Tool(s) Should I Use?]
#### Importing Data from Kallisto, Salmon, RSEM or StringTie While RSEM and StringTie are well-established and robust isoform quantification tool, a new generation of tools often refered to as pseudo/quasi-aligners, such as Salmon and Kallisto, are on the rise due to their speed and increased accuracy. All four tools are directly supported by allowing for easy import of the isoform replicate abundance via the _importIsoformExpression()_ function. This function is a easy-to-use wrapper for the tximport R package with extra functionalities. Using tximport, importIsoformExpression() allows the isoform counts to be estimated from the abundance estimates --- which is recomended as it will incorporate the bias correction performed by the tools into the switch identification. The _importIsoformExpression_ function can furthermore perform an inter-library normalization of the abundance estimates which is necessary for a fair comparison of libraries (for in-depth discussion of normalization and why it is needed, see section 2.7 of the [edgeR user guide](http://bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf)). The importIsoformExpression() function is used as follows: ```{r} ### Please note # The way of importing files in the following example with # "system.file('pathToFile', package="IsoformSwitchAnalyzeR") is # specialiced way of accessing the example data in the IsoformSwitchAnalyzeR package # and not somhting you need to do - just supply the string e.g. # "/mySalmonQuantifications/" pointing to the parent directory (where # each sample is a seperate sub-directory) to the function. ### Import Salmon example data in R package salmonQuant <- importIsoformExpression( parentDir = system.file("extdata/", package="IsoformSwitchAnalyzeR"), addIsofomIdAsColumn = TRUE ) ``` The output of _importIsoformExpression_ is a list which (amongst others) contains both replicate count and abundance matrix which can easily be used to generate a switchAnalyzeRlist with the _importRdata()_ function (as described below). Note that _addIsofomIdAsColumn_ argument lets the user control whether the isoform ids are added as row.names (if FALSE) or as a column (if TRUE). Adding it as a column makes the data ready for use with IsoformSwitchAnalyzeR whereas adding ids as row.names makes it the data directly compatible with other exploratory data analysis methods. Once the quantified isoforms are imported into R all you need to supply to the _importRdata()_ is: 1) The estimated replicate isoform count expression and/or (and highly recomended!) the replicate isoform abundance estimates 2) The isoform annotation (either as a GRange or as sting pointing to a GTF file with the isoforms quantified). Note import of gziped GTF file is directly supported. 3) a design matrix indicating which samples are from the same condition. The _importRdata()_ function will then: * If necessary, calculate isoform abundance (FPKM/RPKM values) (ommited if abundances are supplied). * Sum up abundance values from all isoforms belonging to the same gene (indicated by gene_id) to get the gene expression. * For each gene/isoform in each condition (as indicate by designMatrix) summarize gene/isoform expression/usage . * For each pairwise comparison of condition (or as controlled by the _comparisonsToMake_ argument) the mean gene and isoform expression values are then used to calculate log2 fold changes and Isoform Fraction (IF) values. * All this data is then, along with the genomic annotation, concatenated into a switchAnalyzeRlist. Important notes for the use of importRdata are: 1) The design matrix (a data.frame with two mandatory columns: sampleID and condition) also allows for incorporation of additional cofactors in the experimental setup (such as batch effcts) - simply add these to the design matrix as additional columns and IsoformSwitchAnalyzeR will handle the rest (when testing with _isoformSwitchTestDEXSeq()_ (recomended) or _isoformSwitchTestDRIMSeq()_). 2) If the path to a GTF file is supplied to the "isoformExonAnnoation" argument (as a string) it is possible also to import the CoDing Sequence (CDS) annotated in the GTF file (by setting "addAnnotatedORFs=TRUE") as the Open Reading Frame (ORF) which IsoformSwitchAnalyzeR uses for downstream analysis. This can be advantageous if the data provided is a quantification of known annotated transcripts (i.e. isoforms are not the result of any de-novo or guided assembly) since the prediction of ORFs can then be avoided. The R code for using importRdata: ```{R} ### Make design matrix myDesign <- data.frame( sampleID = colnames(salmonQuant$abundance)[-1], condition = gsub('_.*', '', colnames(salmonQuant$abundance)[-1]) ) ### Create switchAnalyzeRlist aSwitchList <- importRdata( isoformCountMatrix = salmonQuant$counts, isoformRepExpression = salmonQuant$abundance, designMatrix = myDesign, isoformExonAnnoation = system.file("extdata/example.gtf.gz", package="IsoformSwitchAnalyzeR"), showProgress = FALSE ) ```
Note that: 1) By supplying the GTF file to the "isoformExonAnnoation" argument IsoformSwitchAnalyzeR will automatically also import and integrate CDS regions annotated in the GTF file as the ORF regions used by IsoformSwitchAnalyzeR (if addAnnotatedORFs=TRUE (default)). 2) IsoformSwichtAnalyzeR also supports the analysis of isoform switches found via other tools/test --- see [Testing for Isoform Switches with other Tools]. 3) The overall unit of isoforms analyzed together is defined by the gene annotation provided. In principle, this overall unit does not have to be genes but could be for example a TSS shared by multiple isoforms --- to utilize this simply provide TSS\_ids instead of gene\_ids in the isoform annotation. 4) If StringTie have been used to predict novel transcripts (transcript assembly/deconvolution) the gene_ids created is based on islands of overlapping transcripts - this means that sometimes multiple genes (defined by gene short name) as combined into one StringTie gene. This will create quite complex switchPlots which might contain multiple gene names (including NA if it is a novel gene). Currently there are no good way of "reversing" this but we are considering how to handle it. The resulting switchAnalyzeRlist can then be used with the rest of the pipeline. The next step in a typical workflow [Filtering]. Back to [Table of Content]
#### Importing Data from Cufflinks/Cuffdiff The Importing Data from Cufflinks/Cuffdiff is of special interest because Cufflinks/CuffDiff is amongst the most widely used tools for reconstructing isoforms (for quantification of annotated isoforms we recommend Kallisto or Salmon instead). Creating a _switchAnalyzeRlist_ from cufflinks data is done with the function: importCufflinksFiles() The _importCufflinksFiles()_ functions is made to use files outputted by a Cufflinks/Cuffdiff run. This also allows the user to run the computationally heavy RNA-seq pipeline with mapping, transcript deconvolution, transcript quantification and differential expression on a cloud-based server (for example the free-to-use tool [galaxy](https://usegalaxy.org)), and then do the post-analysis on isoform switching locally by simply downloading the required files and supplying them to _importCufflinksFiles()_. The _importCufflinksFiles_ function is used as follows: ``` library(cummeRbund) ### Please note # The way of importing file in this example with # "system.file('pathToFile', package="cummeRbund") is # specialiced to access the example data in the cummeRbund package # and not something you need to do - just supply the string e.g. # "/myCuffdiffFiles/myFile.filetype" to the argument testSwitchList <- importCufflinksFiles( pathToGTF = system.file('extdata/chr1_snippet.gtf', package = "cummeRbund"), pathToGeneDEanalysis = system.file('extdata/gene_exp.diff', package = "cummeRbund"), pathToIsoformDEanalysis = system.file('extdata/isoform_exp.diff', package = "cummeRbund"), pathToGeneFPKMtracking = system.file('extdata/genes.fpkm_tracking', package = "cummeRbund"), pathToIsoformFPKMtracking = system.file('extdata/isoforms.fpkm_tracking', package = "cummeRbund"), pathToIsoformReadGroupTracking = system.file('extdata/isoforms.read_group_tracking', package = "cummeRbund"), pathToSplicingAnalysis = system.file('extdata/splicing.diff', package = "cummeRbund"), pathToReadGroups = system.file('extdata/read_groups.info', package = "cummeRbund"), pathToRunInfo = system.file('extdata/run.info', package = "cummeRbund"), fixCufflinksAnnotationProblem=TRUE, quiet=TRUE ) testSwitchList ``` Note that the cufflinks import functions, per default: 1. Correct the annotation problem caused by cufflinks having to consider islands of overlapping transcripts --- this means that sometimes multiple genes (defined by gene name) are combined into one cufflinks gene (XLOC\_XXXXXX) and this gene is quantified and tested for differential expression. Setting _fixCufflinksAnnotationProblem=TRUE_ will make the import function modify the data so that false conclusions are not made in downstream analysis. More specifically, this causes the function to re-calculate expression values, set gene standard error (of mean (measurement), s.e.m) to NA and the p-value and q-value of the differential expression analysis to 1, whereby false conclusions can be prevented. 2. Import the result of Cuffdiff's splicing test and add it to the isoformSwitchList. This can be disabled by setting _addCufflinksSwichTest=FALSE_. The resulting switchAnalyzeRlist can then be used with the rest of the isoform switch analysis pipeline. The next step is typically [Filtering]. Back to [Table of Content]
#### Importing Data from other Full-length Transcript Assemblers As described above, all you need to create a switchAnalyzeRlist is: * The isoform abundance estimates in multiple samples from multiple conditions * The genomic coordinates of the isoform exon structure. * Annotation of which isoform belongs to which gene. * A design matrix indicate which samples belong to which conditions Once you have imported the isoform expression estimates (via _importIsoformExpression()_ or otherwise) as well as the isoform structure and the isoform annotation into R a switchAnalyzeRlist can be constructed using the general-purpose wrapper _importRdata()_ function (as described in the previous section [Importing Data from Kallisto, Salmon, RSEM or StringTie]) Please note that CPM/TPM (Tags Per Million) values should NOT under any circumstances be supplied to _importRdata()_ since they do not take the isoform length into account, meaning that isoform switches where there are a difference in the length of the involved isoforms, will be biased by TPM/CPM values (please note that Transcripts Per Million (here called TxPM but sometimes denoted TPM, from tools such as Kallisto, Salmon and RSEM) are not problematic). To illustrate the problem with TPM/CPM, a simple thought example can be made: Consider two isoforms where isoform 1 exist in 5 copies and isoform 2 exist in 10 copies (a 1:2 relation). If isoform 1 is twice as long as isoform 2, an RNA-seq experiment will generate (approximately) twice as many reads from it --- meaning if CPM/TPM values are used, a false 1:1 relation will be estimated. This is not the case for TxPM/RPKM/FPKM, since the isoform lengths are taken into account.
Alternatives to using _importRdata()_ are either _importGTF()_ or _createSwitchAnalyzeRlist()_. If the _importGTF()_ function is used it will generate a switchAnalyzeRlist containing all transcripts in the GTF file: ```{r, warning=FALSE, message=FALSE} aSwitchList <- importGTF(pathToGTF = system.file("extdata/example.gtf.gz", package="IsoformSwitchAnalyzeR")) ``` ```{r} aSwitchList head(aSwitchList,2) head(aSwitchList$conditions,2) ``` From above, it is observed that dummy variables have been inserted in both _isoformFeatures_ and _conditions_ entries of the switchAnalyzeRlist. This approach is well suited if you just want to annotate a transcriptome and are not interested in expression. If you are interested in expression estimates and isoform switches, it is probably easier to use _importRdata_. Back to [Table of Content]
### Filtering Once you have a switchAnalyzeRlist, there is a good chance that it contains a lot of genes/isoforms that are irrelevant for an analysis of isoform switches. Examples of such could be single isoform genes or non-expressed isoforms. These extra genes/isoforms will make the downstream analysis take (much) longer than necessary. Therefore we have implemented a pre-filtering step to remove these features before continuing with the analysus. Importantly, filtering can enhance the reliability of the downstream analysis as described in detail below. By using _preFilter()_ it is possible to remove genes and isoforms from all aspects of the switchAnalyzeRlist by filtering on: * Multi-isoform genes * Gene expression * Isoform expression * Isoform Fraction (isoform usage) * Unwanted isoform classes * Genes without differential isoform usage Removal of single isoform genes is the default setting in _preFilter()_ since these genes, per definition, cannot have changes in isoform usage. Filtering on isoform expression allows removal of non-used isoforms that only appear in the switchAnalyzeRlist because they were in the isoform/gene annotation used. Furthermore, the expression filtering allows removal of lowly expressed isoforms where the expression levels might be untrustworthy. The filtering on gene expression allows for removal of lowly expressed genes which causes the calculations of the Isoform Fractions (IF) to become untrustworthy. The filter on Isoform Fraction allows removal of isoforms that only contribute minimally to the gene expression thereby speeding up and simplifying the rest of the downstream analysis without losing important isoforms. Filtering on unwanted isoform classes is only an option available if using data form Cufflinks/Cuffdiff since the Tuxedo workflow includes classification of transcript types. This allows for removal of, for example, transcripts classified as "Possible polymerase run-on fragment" or "Repeat". See full description of Cufflinks/Cuffdiff transcript classification [here](http://cole-trapnell-lab.github.io/cufflinks/cuffcompare/#transfrag-class-codes) The _preFilter()_ function is used as follows: ```{r} data("exampleSwitchList") exampleSwitchListFiltered <- preFilter( switchAnalyzeRlist = exampleSwitchList, geneExpressionCutoff = 1, isoformExpressionCutoff = 0, removeSingleIsoformGenes = TRUE ) exampleSwitchListFilteredStrict <- preFilter( switchAnalyzeRlist = exampleSwitchList, geneExpressionCutoff = 10, isoformExpressionCutoff = 3, removeSingleIsoformGenes = TRUE ) ``` Back to [Table of Content]
### Identifying Isoform Switches As identification of isoform switches are essential for IsoformSwitchAnalyzeR, multiple ways of identifying isoform switches are supported. Currently IsoformSwitchAnalyzeR _directly_ supports three different approaches: * DEXSeq : Using DEXSeq (state-of-the art) to test for differentil isoform usage. DEXSeq was originally designed for testing for differential exon usage but have recently been shown to perform exceptionally well for differential isoform usage. The default and recommend in IsoformSwitchAnalyzeR. See [Testing for Isoform Switches via DEXSeq]. * DRIMSeq : Using the DRIMSeq package. See [Testing for Isoform Switches via DRIMSeq]. * Cufflinks/Cuffdiff : the splicing analysis performed by Cuffdiff is automatically imported when importing Cufflinks/Cuffdiff data. See [Importing Data from Cufflinks/Cuffdiff]. However, results of other tools which can test for differential isoform usage at gene and/or isoform level can also be used. See below in [Testing for Isoform Switches with other Tools]. No matter which method is used, all the downstream functionality in IsoformSwitchAnalyzeR (e.g. not only the test) uses two parameters to define a significant isoform switch: 1. The _alpha_ argument indicating the FDR corrected P-value (Q-value) cutoff. 2. The _dIFcutoff_ argument indicating the minimum (absolute) change in isoform usage (dIF). This combination is used since a Q-value is only a measure of the statistical certainty of the difference between two groups and thereby does not reflect the effect size which is measured by the dIF values. IsoformSwitchAnalyzeR supports tests for differential isoform usage performed both at isoform and gene resolution. Note, however, that for gene-level analysis you rely on cutoffs on dIF values for identifying the isoforms involved in a switch --- something that, when compared to the isoform-level analysis, could give false positives. We therefore recommend the use of isoform-level analysis whenever possible.
#### Testing for Isoform Switches via DEXSeq Two major challanges in testing differential isoform usage have been controling false discovery rates (FDR) and applying effect size cutoffs in experimental setups with counfounding effects. Recent studies such as [Love at al](https://f1000research.com/articles/7-952/v1) highlights DEXSeq (developed by Anders et al., see [What To Cite] --- please remember to cite it) as being a good solution as it controls FDR quite well. We have therfore implemented a DEXSeq based test as the default in IsoformSwitchAnalyzeR. This test furthermore utilizes limma to produce effect sizes corrected for confounding effects. The result is _isoformSwitchTestDEXSeq_ which is highly accurate and although it scales badly with the number of samples analyzed it is still faster and more reliable than DEXSeq --- which is why it is currently the default (and reccomended) test in IsoformSwitchAnalyzeR. The _isoformSwitchTestDEXSeq_ function is used as follows: ```{r, warning=FALSE} # Load example data and pre-filter data("exampleSwitchList") exampleSwitchList <- preFilter(exampleSwitchList) # preFilter to remove lowly expressed features # Perform test exampleSwitchListAnalyzed <- isoformSwitchTestDEXSeq( switchAnalyzeRlist = exampleSwitchList, reduceToSwitchingGenes=TRUE ) # Summarize switching features extractSwitchSummary(exampleSwitchListAnalyzed) ``` Please note if the switchAnalyzeRlist already contains the result of a switch test, this will be overwritten by _isoformSwitchTestDEXSeq_. An important argument in _isoformSwitchTestDEXSeq_ is the 'reduceToSwitchingGenes'. When TRUE this arguments will cause the function to to reduce/subset the switchAnalyzeRlist to the genes which each contains at least one differential used isoform, as indicated by the _alpha_ and _dIFcutoff_ cutoffs. This option ensures the rest of the workflow runs significantly faster since isoforms from genes without isoform switching are not analyzed. The ability to correct for confounding factors rely on that these are annotated in the design matrix (as supplied when creating the switchAnalyzeRlist) and stored in the switchAnalyzeRlist (accessed via exampleSwitchList$designMatrix ). The design matrix is a data.frame with two mandatory columns: sampleID and condition but enables incorporation of additional cofactors in the experimental setup (such as batch effcts) - simply add these to the design matrix as additional columns. If additional co-factors have been added _isoformSwitchTestDEXSeq_ will take that into account if correctForConfoundingFactors=TRUE (default). The co-factor corrected effect sizes (IF and dIF values) can be added to the switchAnalyzeRlist overwriting the old dIF and IF values as controled by the overwriteIFvalues argument (default is TRUE). The next step in a IsoformSwitchAnalyzeR workflow would typically be [Analyzing Open Reading Frames] (if isoform predicted de-novo) or [Extracting Nucleotide and Amino Acid Sequences] (if ORF already importet from GTF file (as supported by the import* functions)) Back to [Table of Content]
#### Testing for Isoform Switches via DRIMSeq Recently the DRIMSeq package (By Nowicka et al, see [What To Cite] --- please remember to cite it) was updated to not only analyze genes for differential isoform usage but provide a statistical test for each isoform analyzed thereby allowing for identification of the exact isoforms involved in a switch. IsoformSwitchAnalyzeR supports identification of isoform switches with DRIMSeq via the wrapper _isoformSwitchTestDRIMSeq()_ which performs the full analysis of all isoforms and comparisons in switchAnalyzeRlist. Afterwards, the results are integrated back into the switchAnalyzeRlist which is returned to the user as usual. Note that we support both the use of the gene-level and isoform-level analysis performed by DRIMSeq as controlled by the 'testIntegration' argument. Using this argument, it is also possible to be even more stringent by requiring that both the gene- and isoform-level analysis must be significant. The _isoformSwitchTestDRIMSeq()_ function is used as follows: ```{r, warning=FALSE} # Load example data and pre-filter data("exampleSwitchList") exampleSwitchList <- preFilter(exampleSwitchList, geneExpressionCutoff = 4) # preFilter for fast runtime # Perform test (takes ~10 sec) exampleSwitchListAnalyzed <- isoformSwitchTestDRIMSeq( switchAnalyzeRlist = exampleSwitchList, testIntegration='isoform_only', reduceToSwitchingGenes=TRUE ) # Summarize switching features extractSwitchSummary(exampleSwitchListAnalyzed) ``` Just like for _isoformSwitchTestDEXSeq()_ the 'reduceToSwitchingGenes' argument controls wether to to reduce/subset the switchAnalyzeRlist to the genes which each contains at least one differential used isoform, as indicated by the _alpha_ and _dIFcutoff_ cutoffs. This option ensures the rest of the workflow runs significantly faster since isoforms from genes without isoform switching are not analyzed. Please note: 1. Currently, the DRIMSeq approach can be a bit slow and might require some patience if a large number of isoforms are analyzed --- not something you want to sit around and wait for. We have experienced runtimes of up to 90 minutes per comparison made for a full-size dataset. 2. If an switchAnalyzeRlist already contains the result of a switch test, this will be overwritten by _isoformSwitchTestDRIMSeq()_. 3. The _isoformSwitchTestDRIMSeq()_ function allows the user to pass arguments to the individual sets of a DRIMSeq workflow --- especially the "dmFilterArgs" option could be of interest, since sometimes additional filtering can be needed to make DRIMSeq run (if you get the "non-finite value supplied by optim" error) The next step in a IsoformSwitchAnalyzeR workflow would typically be [Analyzing Open Reading Frames] (if isoform predicted de-novo) or [Extracting Nucleotide and Amino Acid Sequences] (if ORF already importet from GTF file (as supported by the import* functions)) Back to [Table of Content]
#### Testing for Isoform Switches with other Tools IsoformSwichtAnalyzeR also supports the analysis of isoform switches found via other tools. All you need to do is to generate a switchAnalyzeRlist with the corresponding data via one of the import options (see [Importing Data Into R]) and then fill in either the _isoform\_switch\_q\_value_ or _gene\_switch\_q\_value_ columns in the _isoformFeatures_ entry of the switchAnalyzeRlist with the multiple testing corrected p-values (q-values) from the testing tool. If the external tool used has isoform resolution (one test per isoform), the q-values should be added to the _isoform\_switch\_q\_value_ column and the smallest q-value of a given gene (in a specific comparison) should be added to all the isoforms for that gene in the _gene\_switch\_q\_value_ column. If the external tool has lower resolution (e.g. pre-mRNA or gene level resolution) the q-values should only be added to the _gene\_switch\_q\_value_ column (and the _isoform\_switch\_q\_value_ should be set to NA). In this way, IsoformSwitchAnalyzeR can also support non-isoform resolution testing --- note however that: 1) Via the isoform resolution level tests, we often find cases where the significant isoform is not the same isoform as the ones with the largest changes (i.e. large dIF values) and vice versa indicating that a non-isoform resolution might be inadequate. 2) If you supply the result of a non-isoform resolution isoform switch test (to the _gene\_switch\_q\_value_ column) the number of isoforms reported by summary tools such as _extractSwitchSummary_ or _extractConsequenceSummary_ will not be correct and should therefore be ignored. The next step in a IsoformSwitchAnalyzeR workflow would typically be [Analyzing Open Reading Frames] (if isoform predicted de-novo) or [Extracting Nucleotide and Amino Acid Sequences] (if ORF already importet from GTF file (as supported by the import* functions)) Back to [Table of Content]
### Analyzing Open Reading Frames Once the isoform switches have been found, the next step is to annotate the isoforms involved in the isoform switches. The first step in such annotation is to obtain Open Reading Frames (ORFs). If known annotated isoforms were quantified (i.e. you did not perform a (guided) de novo isoform reconstruction), we advide that you use the annotated CDS (Coding Sequence) from the annotation --- which can be imported and integrated though one of the implemented import methods, see [Importing Data Into R]. If you performed (guided) de-novo isoform reconstruction (isoform deconvolution), prediction of ORFs can be done with high accuracy from the transcript sequence alone (see supplementary data in Vitting-Seerup et al 2017 for benchmark). Such prediction can be done with: analyzeORF() This function utilizes the genomic coordinates of each transcript to extract the transcript nucleotide sequence from a reference genome (supplied via the _genomeObject_ argument). In _analyzeORF_ four different methods for predicting the ORF are implemented, suitable for different purposes and circumstances. The four methods are: 1. The 'longest' method. This method identifies the longest ORF based on finding the canonical start and stop codons in the transcript nucleotide sequence. This approach is what the CPAT tool uses in its analysis of coding potential. This is the default as it is the most common use case and has the highest accuracy in benchmarks against known annotated ORFs (>90% accuracy against GENCODE data, Vitting-Seerup et al 2017). 2. The 'mostUpstream' method. This method identifies the most upstream ORF based on finding the canonical start and stop codons in the transcript nucleotide sequence. 3. The 'longestAnnotated' method. This method identifies the longest ORF downstream of an annotated translation start site. It requires known translation start sites are supplied to the _cds_ argument. 4. The 'mostUpstreamAnnoated' method. This method identifies ORF downstream of the most upstream overlapping annotated translation start site. It requires known translation start sites are supplied to the _cds_ argument. One important argument is the _minORFlength_, which will ensure that only ORFs longer than _minORFlength_ nucleotides are annotated. Besides predicting the ORF, information about the most likely stop codon also allows for prediction of Pre-mature Termination Codon (PTC) and thereby sensitivity to degradation via the Nonsense Mediated Decay (NMD) pathway. This analysis is also implemented in the _analyzeORF()_ function and is controlled by the _PTCDistance_ argument. Note that the ORF prediction can be integrated with both the CPAT results (via the _removeNoncodinORFs_ parameter, see ?analyzeCPAT) as well as Pfam results (see [Augmenting ORF Predictions with Pfam Results]) The _analyzeORF()_ function can be used as follows: ```{r} ### This example relies on the example data from the 'Usage of The Statistical Test' section above ### analyzeORF needs the genomic sequence to predict ORFs. # These are readily available from Bioconductor as BSgenome orbjects: # http://bioconductor.org/packages/release/BiocViews.html#___BSgenome # Here we use Hg19 --- which can be downloaded by copy/pasting the following two lines into the R termminal: # BiocManager::install("BSgenome.Hsapiens.UCSC.hg19") library(BSgenome.Hsapiens.UCSC.hg19) exampleSwitchListAnalyzed <- analyzeORF(exampleSwitchListAnalyzed, genomeObject = Hsapiens, showProgress=FALSE) head(exampleSwitchListAnalyzed$orfAnalysis, 3) ``` As seen above, genomic and transcript coordinates for ORF beginning and end, as well as PTC status, are added to the 'orfAnalysis' entry of the switchAnalyzeRlist. If the user wants to use the _longestAnnotated_ or _mostUpstreamAnnoated_ methods, the _analyzeORF()_ function requires known CDS to be supplied as described above. The CDS must be stored as a CDSSeq (see ?CDSSet) and a wrapper for downloading the CDS of the most frequently used datasets from UCSC genome browser is implemented in: getCDS()
### Extracting Nucleotide and Amino Acid Sequences Now that we know the ORF of a transcript, we can obtain the corresponding protein amino acid sequence of the ORF simply by translating the nucleotide sequence of the ORF into amino acids. This opens the possibility for performing both internal and external sequence analysis which enables us to annotate the isoforms involved in isoform switches even further. To facilitate this we have implemented extractSequence() which allows for the extraction of both nucleotide and amino acid sequences from the switchAnalyzeRlist. To facilitate external sequence analysis _extractSequence_ can output fasta files (one file per sequence type) and to facilitate internal sequence analysis the sequences can be added to the switchAnalyzeRlist. An example of the internal sequence analysis is a pairwise comparison of the ORFs in two switching isoforms (see [Predicting Switch Consequences] below). ```{r} ### This example relies on the example data from the 'Analyzing Open Reading Frames' section above exampleSwitchListAnalyzed <- extractSequence( exampleSwitchListAnalyzed, genomeObject = Hsapiens, pathToOutput = '', writeToFile=FALSE # to avoid output when running this example data ) head(exampleSwitchListAnalyzed$ntSequence,2) head(exampleSwitchListAnalyzed$aaSequence,2) ``` Back to [Table of Content]
### Advice for Running External Sequence Analysis Tools The two fasta files that are the output by _extractSequence()_ (if writeToFile=TRUE) can be used as input, to amongst, others: * CPAT : The Coding-Potential Assessment Tool which is a tool for predicting whether an isoform is coding or not. CPAT can be run either locally or via their [webserver](http://lilab.research.bcm.edu/cpat/). * Pfam : Prediction of protein domains, which can be run either locally (using the pfam_scan.pl script as described in the readme found [here](ftp://ftp.ebi.ac.uk/pub/databases/Pfam/Tools/) or via their [webserver](https://www.ebi.ac.uk/Tools/hmmer/search/hmmscan). * SignalP : Prediction of Signal Peptides --- a short N-terminal sequence of a peptide indicating where a protein should be membrane bound or secreted. SignalP can be run either locally or via their [webserver](http://www.cbs.dtu.dk/services/SignalP/). These three tools are the ones currently supported, but if you have additional ideas please do not hesitate to contact us as described in [How To Get Help]. We suggest that the external sequence analysis tools are run as follows: * CPAT : Use default parameters and the nucleotide fasta file (_nt.fasta). If the [webserver](http://lilab.research.bcm.edu/cpat/) was used, download the tab-delimited result file (available at the bottom of the result page). If a stand-alone version was used, just supply the path to the result file. * Pfam : Use default parameters and the amino acid fasta file (_AA.fasta). If the [webserver](https://www.ebi.ac.uk/Tools/hmmer/search/hmmscan) is used you need to copy/paste the result part of the mail you receive into an empty plain text document (notepad, sublimetext, TextEdit or similar (not Word)) and save that to a plain text (txt) file. The path to that file should be supplied. If a stand-alone version was used, just supply the path to the result file. A more detailed walkthrough is found under details in the documentation of the \code{analyzePFAM()} function (?analyzePFAM). * SignalP : Use the amino acid fasta file (_AA.fasta). When using a stand-alone version, SignalP should be run with the '-f summary' option. If using the [webserver](http://www.cbs.dtu.dk/services/SignalP/) SignalP should be run with the parameter "standard" under "Output format" and "No graphics" under "Graphics output". When using the web-server, the results (everything between the two hoizontal lines) should be copy/pasted into an empty plain text document (notepad, sublimetext, TextEdit or similar (not Word)) and save that to plain text (txt) file. This file is then used as input to the function. If a stand-alone version was used, just supply the path to the summary result file. Back to [Table of Content]
### Importing External Sequence Analysis After the external sequence analysis with CPAT, Pfam, and SignalP have been performed (please remember to cite as describe in [What To Cite]), the results can be extracted and incooperated in the switchAnalyzeRlist via (respectively) analyzeCPAT() analyzePFAM() analyzeSignalP() The functions are used as: ```{r} ### Load test data (maching the external sequence analysis results) data("exampleSwitchListIntermediary") exampleSwitchListIntermediary ### Add PFAM analysis exampleSwitchListAnalyzed <- analyzePFAM( switchAnalyzeRlist = exampleSwitchListIntermediary, pathToPFAMresultFile = system.file("extdata/pfam_results.txt", package = "IsoformSwitchAnalyzeR"), showProgress=FALSE ) ### Add SignalP analysis exampleSwitchListAnalyzed <- analyzeSignalP( switchAnalyzeRlist = exampleSwitchListAnalyzed, pathToSignalPresultFile = system.file("extdata/signalP_results.txt", package = "IsoformSwitchAnalyzeR") ) ### Add CPAT analysis exampleSwitchListAnalyzed <- analyzeCPAT( switchAnalyzeRlist = exampleSwitchListAnalyzed, pathToCPATresultFile = system.file("extdata/cpat_results.txt", package = "IsoformSwitchAnalyzeR"), codingCutoff = 0.725, # the coding potential cutoff we suggested for human removeNoncodinORFs = TRUE # because ORF was predicted de novo ) exampleSwitchListAnalyzed ``` Here, the _pathTo\resultFile_ points to the result files constructed as suggested in [Advice for Running External Sequence Analysis Tools] and in the detailed documentation of each of the individual functions. The part of the examples using system.file() is not nesseary when using your own data - just supply the path as a string (e.g. "/myFiles/myCPATresults.txt") Of particular interest is the _removeNoncodinORFs_ argument in _analyzeCPAT()_ since it allows integration of the CPAT and ORF analyses by removing ORFs from isoforms not predicted to be coding by CPAT. This can be particularly useful if isoforms and ORFs have been predicted de novo (both guided or non-guided). Note that if enabled (by setting to TRUE), it will effect all downstream analyses and plots as both analysis of protein domains and signal peptides require that ORFs are annotated (e.g. _analyzeSwitchConsequences_ will not consider the domains (potentially) found by Pfam if the ORFs have been removed). Back to [Table of Content]
### Predicting Alternative Splicing Another type of annotation we easily can obtain, since we know the exon structure of all isoforms in a given gene (with isoform switching), is alternative splicing. Here intron retention events are of particular interest as a consequnce in isoform switches since they represent the largest changes in isoforms. Identification of alternative splicing, alternative transcription start and termination sites can be obtained via the _analyzeAlternativeSplicing()_. ```{r} ### This example relies on the example data from the 'Importing External Sequence Analysis' section above exampleSwitchListAnalyzed <- analyzeAlternativeSplicing(exampleSwitchListAnalyzed, quiet=TRUE) ### overview of number of intron retentions (IR) table(exampleSwitchListAnalyzed$isoformFeatures$IR) ``` Meaning 25 isoforms contain one intron retention (IR) which two isoforms each contain two intron retentions. IsoformSwitchAnalyzeR also supports many types of genone wide analysis of alternative splicing - for a more thorough walkthrough of the analysis of alternative splicing see the [Analyzing Alternative Splicing] workflow. Back to [Table of Content]
### Predicting Switch Consequences If an isoform has a significant change in its contribution to gene expression, there must per definition be reciprocal changes in one (or more) isoforms in the opposite direction, compensating for the change in the first isoform. We utilize this by extracting the isoforms that are significantly differentially used and compare them to the isoforms that are compensating. Using all the information gathered through the workflow described above, the annotation of the isoform(s) used more (positive dIF) can be compared to the isoform(s) used less (negative dIF) and by systematically identify differences annotation we can identify potential function consequences of the isoform switch. Specifically, IsoformSwitchAnalyzeR contains a function _analyzeSwitchConsequences()_ which extracts the isoforms with significant changes in their isoform usage (defined by the _alpha_ and _dIFcutoff_ parameters, see [Identifying Isoform Switches] for details) and the isoform, with a large opposite change in isoform usage (also controlled via the _dIFcutoff_ parameters) that compensate for the changes. Note that if an isoform-level test was *not* used, the gene is require to be significant (defined by the _alpha_ parameter); but, isoforms are then selected purely based on their changes in dIF values. These isoforms are then divided into the isoforms that increase their contribution to gene expression (positive dIF values larger than _dIFcutoff_) and the isoforms that decrease their contribution (negative dIF values smaller than \-_dIFcutoff_). The isoforms with increased contribution are then (in a pairwise manner) compared to the isoform with decreasing contribution. In each of these comparisons the isoforms compared are analyzed for differences in their annotation (controlled by the _consequencesToAnalyze_ parameter). Currently 22 different features of the isoforms can be compared, which include features such as intron retention, coding potential, NMD status, protein domains and the sequence similarity of the amino acid sequence of the annotated ORFs. For a full list, see "Details" on the documentation page for _analyzeSwitchConsequences()_ (by running ?analyzeSwitchConsequences). A more strict analysis can be performed by enabling the _onlySigIsoforms_ argument, which causes _analyzeSwitchConsequences()_ to only consider significant isoforms (defined by the _alpha_ and _dIFcutoff_ parameters) meaning the compensatory changes in isoform usage are ignored unless they themselves are significant. The analysis of consequences can be performed as follows: ```{r} ### This example relies on the example data from the 'Predicting Alternative Splicing' section above # the consequences highlighted in the text above consequencesOfInterest <- c('intron_retention','coding_potential','NMD_status','domains_identified','ORF_seq_similarity') exampleSwitchListAnalyzed <- analyzeSwitchConsequences( exampleSwitchListAnalyzed, consequencesToAnalyze = consequencesOfInterest, dIFcutoff = 0.4, # very high cutoff for fast runtimes showProgress=FALSE ) extractSwitchSummary(exampleSwitchListAnalyzed, dIFcutoff = 0.4, filterForConsequences = FALSE) extractSwitchSummary(exampleSwitchListAnalyzed, dIFcutoff = 0.4, filterForConsequences = TRUE) ``` Meaning that ~5/7 genes with switches (with dIF > 0.4) have isoform switches with functional consequences. For larger dataset with default parameters this is typicall 1/3. Please note that the _analyzeSwitchConsequences()_ function contains many parameters and cutoffs for deciding when a difference in the annotation is sufficiently different (e.g. supplying 'ntCutoff=50' ensures that differences in nucleotide lengths of two ORF are larger than 50 nucleotides --- not just 3 nucleotides (one codon)). Back to [Table of Content]
### Removal of annotation data not needed anymore Now that we have predicted isoform switches and their functional consequnces the switchAnalyzeRlist contains a lot of information no longer nessesary thereby significantly reducing the memmory usage. To remove the excess annotation - specifically replicate quantification and the biological sequnces - the _removeAnnoationData()_ function have been implemented. Please note that this function: * is mainly intended to be used when analyzing a large number of samples * t is highly recomended to save a version of the switchAnalyzeRlist with all the annotation before removeing it - typically done with _base::saveRDS_ or _base::save_. Back to [Table of Content]
### Post Analysis of Isoform Switches with Consequences Now that we have a genome-wide characterization of isoform switches with potential consequences there are a lot of possibilities for analyzing these. IsoformSwitchAnalyzeR currently directly supports two different types of post-analysis (of switches - see [Genome-Wide Analysis of Alternative Splicing] below for analysis of alternative splicing): 1) [Analysis of Individual Isoform Switching] which aims to identify the top genes and provide a visual overview summarizing all the relevant information about an isoform switch single gene. 2) [Genome-Wide Analysis of Isoform Switching] which summarizes and analyzes larger patterns of isoform switches with predicted functional consequences
#### Analysis of Individual Isoform Switching When analyzing the individual genes with isoform switches, the genes/isoforms with the largest changes in isoform usage (i.e. "most" switching genes/isoforms) are of particular interest. IsoformSwitchAnalyzeR can help you obtain these, either by sorting for the smallest q-values (obtaining the genes with the most significant switches) or the largest absolute dIF values (extracting the genes containing the switches wite the largest effect sizes (which are still significant)). Both methods are implemented at both genes and isoforms levels in the _extractTopSwitches()_ function and the sorting algorithm is controlled via the _sortByQvals_ argument. ```{r} ### Load already analyzed data data("exampleSwitchListAnalyzed") ### Let's reduce the switchAnalyzeRlist to only one condition exampleSwitchListAnalyzedSubset <- subsetSwitchAnalyzeRlist( exampleSwitchListAnalyzed, exampleSwitchListAnalyzed$isoformFeatures$condition_1 == 'COAD_ctrl' ) exampleSwitchListAnalyzedSubset ### Extract top switching genes (by q-value) extractTopSwitches( exampleSwitchListAnalyzedSubset, filterForConsequences = TRUE, n = 2, sortByQvals = TRUE ) ### Extract top 2 switching genes (by dIF values) extractTopSwitches( exampleSwitchListAnalyzedSubset, filterForConsequences = TRUE, n = 2, sortByQvals = FALSE ) ``` Note that by setting "n=NA" _extractTopSwitches()_ will return a data.frame with all significant genes/isoforms - ranked as indicated by the _sortByQvals_ argument. Let us take a look at the switching isoforms in the ZAK gene (ranked 7 on list) which plays an important role cell cycle: ```{r} ### Extract data.frame with all switching isoforms switchingIso <- extractTopSwitches( exampleSwitchListAnalyzedSubset, filterForConsequences = TRUE, n = NA, # n=NA: all features are returned extractGenes = FALSE, # when FALSE isoforms are returned sortByQvals = TRUE ) subset(switchingIso, gene_name == 'ZAK') ``` From which we can see the details of both isoforms (since we supplied "extractGenes = FALSE") involved in the isoforms switch in ZAK. The isoform switch can visually analyzed using a switch plot which summarizes all the concatenated information about the isoforms. Specifically the switch plot is a composite plot visualizing: 1) The isoform structures along with the concatenated annotations (including transcript classification, ORF, Coding Potential, NMD sensitivity, annotated protein domains as well as annotated signal peptides) 2) Gene and isoform expression and any annotated differential expression analysis of these 3) Isoform usage including the result of the isoform switch test ```{r, fig.width=12, fig.height=7} switchPlot(exampleSwitchListAnalyzedSubset, gene = 'ZAK') ``` Note that since there is only one comparison in the switchAnalyzeRlist (after the subset), it is not necessary to specify the conditions (which could otherwise be done via the "condition1" and "condition2" arguments). From this plot, we can easily see that the gene is not differentially expressed (bottom left), but there is an isoform switch where the long isoform is used more in COAD cancers than in normal controls. The switch from the short to the long isoform also results in the inclusion of a SAM_1 protein domain. Since the SAM_1 domain typically facilitates protein-protein interactions this switch might result in a ZAK protein which can interact with a different set of proteins in the cancer patients. From this plot, we first note that the gene expression is not significantly changed (bottom left). Next, we see the large significant switch in isoform usage across conditions (bottom right). By comparing which isoforms are changing (bottom right) to the isoform structure (top) it can be seen that the long isoform is used more in COAD cancers compared to healthy thissue. Interestingly, this isoform switch results in the inclusion of a SAM_1 protein domain. Since a SAM_1 domain typically facilitates protein-protein interactions this switch might result in a ZAK protein isoform which compared to the healthy tissue isoform can interact with a new/different set of proteins in the cancer patients.
One advantage of the Isoform Switch Analysis Plot is that it contains all information needed to judge the potential impact of an isoform switch. This also means a systematic screening of the top isoform switches can be made by generating this plot for each of those genes. To facilitate such screening we have implemented _switchPlotTopSwitches()_ which will extract the top _n_ genes (controlled by the _n_ argument) with significant switches (as defined by _alpha_ and _dIFcutoff_) and output a pdf or png version of the corresponding switch plot to the directory given by the _outputDestination_ argument. The function automatically ranks (by p-value or switch size) the switches and supports to either filter for isoform switches with predicted functional consequences or to output those with and those without consequences to separate folders. switchPlotTopSwitches( switchAnalyzeRlist = exampleSwitchListAnalyzed, n = 10, filterForConsequences = FALSE, splitFunctionalConsequences = TRUE ) If you want to output a plot for all significant genes simply set "n=NA". Back to [Table of Content]
#### Genome-Wide Analysis of Isoform Switching A genome-wide analysis is both useful for getting an overview of the extent of isoform switching as well as discovering general patterns. IsoformSwitchAnalyzeR supports this by providing four different summaries/analyses for both the analysis of alternative splicing and isoform switches with predicted consequences. All functions provide a visual overview as well as a data.frame with the summary statistics. The four analysis types supported are: 1) Global summary statistics, implemented in the *extractConsequenceSummary()* and *extractSplicingSummary()* functions, which summarizes the number of switches with predicted consequences and the number of splicing events occurring in the different comparisons, respectively. 2) Analysis of splicing/consequence enrichment, implemented in the *extractConsequenceEnrichment()* and *extractSplicingEnrichment()* functions, which analyzes whether a particular consequence/splice type occurs more frequently than the opposite event (e.g. domain loss vs domain gain) in a giving comparison (e.g. WT->KO1). 3) Comparison of enrichment, implemented in *extractConsequenceEnrichmentComparison()* and *extractSplicingEnrichmentComparison()* functions, which compares the enrichment of a particular consequence/splice type between comparisons (e.g. compares the changes in WT->KO1 vs WT->KO2 ) 4) Analysis of genome-wide changes in isoform usage, implemented in *extractConsequenceGenomeWide()* and *extractSplicingGenomeWide()* functions, which analyses the genome-wide changes in isoform usage for all isoforms with particular opposite pattern events. This type of analysis is particular interesting if the expected difference between conditions is large, since such effects could result in genome-wide changes. The analysis works by simultaneously analyzing all isoforms with a specific feature (e.g. intron retention) for changes in isoform usage. Examples of all four types of analysis have been shown for isoform switches above in [Examples Visualizations] and the correspsonding analysis for alternative splicing can be found below in [Genome-Wide Analysis of Alternative Splicing]. Back to [Table of Content]
### Other Tools in IsoformSwitchAnalyzeR Apart from the analysis presented above, IsoformSwitchAnalyzeR also contains a few other useful tools which will be presented in this section. The central visualization is the switch plot, created with _switchPlot()_ function as shown above. This plot is made from four subplots, which each can be created individualy using: switchPlotTranscript() # Visualizes the transcripts and their annotation switchPlotGeneExp() # Visualizes the gene expression switchPlotIsoExp() # Visualizes the isoform expression switchPlotIsoUsage() # Visualizes the isoform usage As illustrated here: ```{r, fig.width=12, fig.height=3} ### Load already analyzed data data("exampleSwitchListAnalyzed") switchPlotTranscript(exampleSwitchListAnalyzedSubset, gene = 'TNFRSF1B') ``` ```{r, fig.width=3, fig.height=3} switchPlotGeneExp (exampleSwitchListAnalyzedSubset, gene = 'TNFRSF1B', condition1='iPS', condition2='hESC') ``` ```{r, fig.width=4, fig.height=3} switchPlotIsoExp(exampleSwitchListAnalyzedSubset, gene = 'TNFRSF1B', condition1='iPS', condition2='hESC') ``` ```{r, fig.width=4, fig.height=3} switchPlotIsoUsage(exampleSwitchListAnalyzedSubset, gene = 'TNFRSF1B', condition1='iPS', condition2='hESC') ```
Another useful tool implemented is the _isoformToGeneExp()_ function which takes a data.frame of isoform expression and summarizes it up to gene-level expression values. This function is quite useful in conjunction with the _importIsoformExpression()_ function since the _importIsoformExpression()_ function imports isoform-level data --- which can then be summarized to gene level data with _isoformToGeneExp()_. This is quite analogous to what tximport can do --- except it allows for incorporation of the additional features, such as inter-library normalization, implemented in _importIsoformExpression()_.
If the end goal is not to calculate gene expression (from the isoform expression) but instead to calculate isoform fractions we have implemented the _isoformToIsoformFraction()_ function.
The last tool currently built into IsoformSwitchAnalyzeR is the extractExpressionMatrix() function. The expression information stored in the switchAnalyzeRlist's isoformFeatures is ideal for comparing multiple annotations in a specific comparison of two conditions, but is not well-suited for the comparison of multiple conditions. The _extractExpressionMatrix()_ function solves this by converting the average expression information (gene expression, isoform expression or Isoform Fraction values --- as controlled via the _feature_ argument) into a matrix format as illustrated here: ```{r} data("exampleSwitchListIntermediary") ifMatrix <- extractExpressionMatrix(exampleSwitchListAnalyzed) head(ifMatrix) dim(ifMatrix) ```
Such a matrix can be used for global comparisons of multiple conditions and analysis of sample correlations: ```{r, fig.width=4, fig.height=3} # correlation plot ggplot(reshape2::melt(cor(ifMatrix)), aes(x=Var1, y=Var2, fill=value)) + geom_tile() + scale_fill_continuous('Correlation') + labs(x='Condition', y='Condition') + theme_minimal() + theme(axis.text.x=element_text(angle=-90, hjust = 0, vjust=0.5)) # turn x-axis 90 degrees ``` Or expression (via a heatmap): ```{r, fig.width=8, fig.height=4} library(pheatmap) pheatmap(t(ifMatrix), show_colnames=FALSE) ``` Back to [Table of Content]
## Analyzing Alternative Splicing ### Overview of Alternative Splicing Workflow Analysis of alternative splicing, here made possible by a liftover of functionalities from the now deprecated R package spliceR, is an important area of research. IsoformSwitchAnalyzeR is well suited for performing alternative splicing analysis since alterntive splicing litteraly will result in isoform switches. Please note that alternative splicing here is used as a term which covers both Alternative Splicing (AS), Alternative Transcription Start Sites (ATSS, sometimes called alternative first exon) as well as Alternative Transcription Start Sites (ATTS, sometimes called alternative last exon): although this is techncaiily incorrect it is here used for convenience. The workflow implemented in IsoformSwitchAnalyzeR is mostly focused on identifying global changes in alternative splicing but also allows for analysis of individual isoforms/switches. A typical workflow consists of five steps. Please note that step 1-4 are included in the standard switch analysis workflow, meaning if you have already analyzed isoform switches with consequences you can skip 1-4. The five steps corresponds to using the following functions in succession: ``` ### Step (1) Import data and create SwitchAnalyzeRlist mySwitchList <- importCufflinksData() # OR mySwitchList <- importIsoformExpression() # OR mySwitchList <- importRdata() ### Step (2) Prefilter mySwitchList <- preFilter( mySwitchList ) ### Step (3) Identify differentially used isoforms mySwitchList <- isoformSwitchTestDEXSeq( mySwitchList, reduceToSwitchingGenes=FALSE ) # OR mySwitchList <- isoformSwitchTestDRIMSeq( mySwitchList, reduceToSwitchingGenes=FALSE ) ### Step (4) Analyze alternative splicing mySwitchList <- analyzeAlternativeSplicing( mySwitchList ) ### Step (5) Global splicing analysis extractSplicingSummary( mySwitchList ) extractSplicingEnrichment( mySwitchList ) extractSplicingEnrichmentComparison( mySwitchList ) extractSplicingGenomeWide( mySwitchList ) ``` Step 1-4 are shared with the standard switch analysis workflow which is describe above in the [Detailed Workflow] (specifically each step is described in [Importing Data Into R], [Filtering], [Identifying Isoform Switches] and [Predicting Alternative Splicing]) and will therefore not be repeated here. The only difference is that in some cases one might *not* want reduce the switchAnalyzeRList to only contain the genes which contain differentially used isoforms --- as discussed below. This is done by setting _reduceToSwitchingGenes=FALSE_ in the switch test. ### Genome-Wide Analysis of Alternative Splicing The genome-wide analysis of splicing patterns is facilitated by the following four functions: - extractSplicingSummary() - extractSplicingEnrichment() - extractSplicingEnrichmentComparison() - extractSplicingGenomeWide() The extractSplicingSummary function does exactly what it says on the tin - it summarizes the number of events found in each comparison. To showcase it, let's look at some already analyzed example data (meaning step 1-4 have already been done). ```{r} data("exampleSwitchListAnalyzed") exampleSwitchListAnalyzed ``` Note that via the in "Feature analyzed" part of the summary we can see that the analysis of alternative splicing have been done and added to the switchAnalyzeRlist.
```{r, fig.width=12, fig.height=5} extractSplicingSummary( exampleSwitchListAnalyzed, asFractionTotal = FALSE, plotGenes=FALSE ) ``` Note that it is possible to both summarize per gene or per isoform via the _plotGenes_ argument and that the _asFractionTotal_ argument enables analysis the fraction (if TRUE) or Number (if FALSE) of significant isoforms/genes which have a particular splice type. From the summary above, it is quite clear that some of the alternative splicing events are not equally used - most prominently illustrated by the use of ATSS in COAD, where there is (a lot) more gain than losses. To formally analyze this type of uneven alternative splicing usage we have implemented _extractSplicingEnrichment()_. This function summarizes the uneven usage within each comparison by for each alternative splicing type calculate the fraction of events being gains (as opposed to loss) and perform a statistical analysis of this fraction. ```{r, fig.width=12, fig.height=4.5} splicingEnrichment <- extractSplicingEnrichment( exampleSwitchListAnalyzed, splicingToAnalyze='all', returnResult=TRUE, returnSummary=TRUE ) ``` From this plot it is clear to see that some alternative splicing types are preferentially used in the two cancer types, and from the statistical summary also created by _extractSplicingEnrichment()_ (due to the returnSummary=TRUE argument): ```{r, fig.width=12, fig.height=6} subset(splicingEnrichment, splicingEnrichment$AStype == 'ATSS') ``` We can indeed see that ATSS has a very significant skew in COAD. Note that the "splicingToAnalyze" argument implemented in all these summary functions enables analysis of only a subset of the splice types.
Another observation that can be made from the enrichment analysis above is that although the overall trend in usage of alternative splicing is the same the two cancer types there seem to be some differences. These differences can be analyzed using _extractSplicingEnrichmentComparison()_ function which for each splice type, in a pairwise manner contrast the individual condition comparison, to assess whether the ratio of gains are indeed different acorss comparisons. ```{r, fig.width=12, fig.height=4.5} extractSplicingEnrichmentComparison( exampleSwitchListAnalyzed, splicingToAnalyze=c('A3','MES','ATSS'), # Those significant in COAD in the previous analysis returnResult=FALSE # Preventing the summary statistics to be returned as a data.frame ) ``` And indeed only ATSS usage is significantly different between the two cancer types.
All of the above splicing analysis relies on the identification of pairs of isoforms involved in a switch, but the analysis of alternative splicing can also be done from the perspective of the individual splice types. This can be done by asking the question: How does the isoform usage of all isoforms utilizing a particular splicing type change - in other words is all isoforms or only only a subset of isoforms that are affected. This can be analysed with the _extractSplicingGenomeWide()_ function. ```{r, fig.width=10, fig.height=6} extractSplicingGenomeWide( exampleSwitchListAnalyzed, featureToExtract = 'all', # all isoforms stored in the switchAnalyzeRlist splicingToAnalyze = c('A3','MES','ATSS'), # Splice types significantly enriched in COAD plot=TRUE, returnResult=FALSE # Preventing the summary statistics to be returned as a data.frame ) ``` From which it can be seen that all isoforms using ATSS is generally used more, meaning it is a global phenomenon, whereas none of the other splice types are general (i.e. they seem to only happen in a specific subset of the data). Please note that: - The the dots in the violin plots above indicate 25th, 50th (median) and 75th percentiles (just like the box in a boxplot would). - The analysis provided by _extractSplicingGenomeWide()_ should only be expected to give a result when splicing is (severely) affected. - When using the featureToExtract='all' one should not use the reduceToSwitchingGenes=TRUE during the testing of isoform switches (as then it is not all isoforms but only those in genes containing at least one differentially used isoform) - then it will not be 'all' isoforms that can be analyzed. Back to [Table of Content]
## Other workflows ### Augmenting ORF Predictions with Pfam Results The workflow described here is an extension of the workflow enabled in the _analyzeCPAT()_ function via the "removeNoncodinORFs" argument which if TRUE removes ORF information from the isoforms where the CPAT analysis classifies them as non-coding. Here we will "rescue" the ORF of isoforms predicted to be noncoding by CPAT but which have a predicted protein domain. This workflow is an alternative approach to setting "removeNoncodinORFs=TRUE" and should be used instead of that option. Note that this is only recommended if ORFs were predicted (i.e. not imported from a GTF file). After this procedure, isoforms with ORFs will be isoforms with an ORF longer than _minORFlength_ (if specified in _analyzeORF_) which are predicted to be coding by CPAT **OR** have a predicted protein domain (by Pfam). Since the ORF information is stored in the 'orfAnalysis' analysis entry of the switchList we can remove it (by replacing it with NAs) as follows: ``` ### Test data data("exampleSwitchListAnalyzed") exampleSwitchListAnalyzed ### Extract coding isoforms nonCodingIsoforms <- unique(exampleSwitchListAnalyzed$isoformFeatures$isoform_id[ which( ! exampleSwitchListAnalyzed$isoformFeatures$codingPotential ) ]) ### Rescue those with protein domains nonCodingIsoformsRescued <- setdiff(nonCodingIsoforms, exampleSwitchListAnalyzed$domainAnalysis$isoform_id) # nr rescued length(nonCodingIsoforms) - length(nonCodingIsoformsRescued) ### Remove noncoding isoforms ORF annotation sum(is.na(exampleSwitchListAnalyzed$orfAnalysis$orfTransciptStart)) exampleSwitchListAnalyzed$orfAnalysis[ which( exampleSwitchListAnalyzed$orfAnalysis$isoform_id %in% nonCodingIsoformsRescued), 2:ncol(exampleSwitchListAnalyzed$orfAnalysis) ] <- NA exampleSwitchListAnalyzed$isoformFeatures$PTC[which(exampleSwitchListAnalyzed$isoformFeatures$isoform_id %in% nonCodingIsoforms)] <- NA sum(is.na(exampleSwitchListAnalyzed$orfAnalysis$orfTransciptStart)) ``` Back to [Table of Content]
### Analyze Small Upstream ORFs Recent research suggests that small upstream ORFs are far more frequent that previously assumed. It is therefore of particular interest to start analyzing these and here we have indirectly presented a tool which can do just so: _analyzeORF()_. Here we show how one could start such an analysis of small upstream ORFs. ``` # run ORF analysis on longest ORF exampleSwitchListAnalyzed <- analyzeORF(exampleSwitchListAnalyzed, genomeObject = Hsapiens, method='longest') mean(exampleSwitchListAnalyzed$orfAnalysis$orfTransciptLength) # run ORF analysis on most upstream ORF exampleSwitchListAnalyzed2 <- analyzeORF(exampleSwitchListAnalyzed, genomeObject = Hsapiens, orfMethod = 'mostUpstream', minORFlength = 50) mean(exampleSwitchListAnalyzed2$orfAnalysis$orfTransciptLength) # calculate pairwise difference summary( exampleSwitchListAnalyzed2$orfAnalysis$orfTransciptLength - exampleSwitchListAnalyzed$orfAnalysis$orfTransciptLength[ match(exampleSwitchListAnalyzed2$orfAnalysis$isoform_id ,exampleSwitchListAnalyzed$orfAnalysis$isoform_id) ] ) ``` Back to [Table of Content]
### Remove Sequences Stored in SwitchAnalyzeRlist The sequences stored in the SwitchAnalyzeRlist are not needed after consequences have been predicted and can thereby be removed, reducing the object size as well as loading/saving times. This has been done for the example dataset 'exampleSwitchListAnalyzed'. This is simply done as follows ``` summary(exampleSwitchListIntermediary) exampleSwitchListIntermediary$ntSequence <- NULL exampleSwitchListIntermediary$aaSequence <- NULL summary(exampleSwitchListIntermediary) ``` Back to [Table of Content]
### Adding Uncertain Category to Coding Potential Predictions There has been considerable debate about whether the default parameters for the codingPotential calculated for CPAT are too lenient (too low). This will always be a problem when having a single cutoff. One possible solution is to introduce an "unknown" class with medium size coding potential which we can then disregard. Let's start by looking at the distribution of coding potential values. ``` data("exampleSwitchListAnalyzed") hist(exampleSwitchListAnalyzed$isoformFeatures$codingPotentialValue) ``` These coding potential values are summarized by the cutoff supplied in the 'codingPotential' column --- which is what IsoformSwitchAnalyzeR uses in the downstream analysis. ``` ### These are summarized by the cutoff supplied in the 'codingPotential' column --- which is what is used by IsoformSwitchAnalyzeR in the downstream analysis. table(exampleSwitchListAnalyzed$isoformFeatures$codingPotential, exclude = NULL) ``` By simply setting the mid-range values to NA it will cause IsoformSwitchAnalyzeR to ignore them, thereby removing the isoforms with "unknown" coding potential. This can be done as follows: ``` exampleSwitchListAnalyzed$isoformFeatures$codingPotential <- NA exampleSwitchListAnalyzed$isoformFeatures$codingPotential[which(exampleSwitchListAnalyzed$isoformFeatures$codingPotentialValue > 0.75)] <- TRUE exampleSwitchListAnalyzed$isoformFeatures$codingPotential[which(exampleSwitchListAnalyzed$isoformFeatures$codingPotentialValue < 0.25)] <- FALSE table(exampleSwitchListAnalyzed$isoformFeatures$codingPotential, exclude = NULL) ``` Back to [Table of Content]
### Quality control of ORF of known annotation As we have shown there are many problems with known CDS annotations (see supplementary data in Vitting-Seerup et al 2017) we have implemented two ways to ensure a high quality of the CDS imported from a GTF annotation file: 1) When you import the GTF file (via _importRdata()_ or _importGTF()_) you can enable the 'onlyConsiderFullORF' argument which makes sure you only add the ORF stored in the GTF file if it is annotated with both a start and stop codon. 2) When you extract the biological sequences (nucleotide and amino acid) with _extractSequence()_ you can enable the argument 'removeORFwithStop' which will remove ORFs that contain un-annotated stop codons, defined as * when the ORF nucleotide sequences are translated to the amino acid sequence. If enabled, the ORFs will both be removed from the switchAnalyzeRList and from the sequences outputted by _extractSequence()_.
### Analyzing the Biological Mechanisms Behind Isoform Switching The difference between the isoforms involved in an isoform switch can arise by changes in three distinct biological mechanisms: 1) Alternative Transcription Start Site (aTSS) 2) Alternative Splicing (AS) 3) Alternative Transcription Termination Site (aTTS) Since we how the structure of the isoforms involved in a isoform switch we can also analyze which (combination) of these biological mechanisms gives rise to the difference between the two isoforms involved in an isoform switch. This is simply done by, in addition to running _analyzeSwitchConsequences_ with the consequences you find interesting, making a separate consequence analysis of consequences (also with _analyzeSwitchConsequences_) where the consequences you analyze (supplied to the _consequencesToAnalyze_ argument) are: 1) 'tss' --- which will analyze the isoforms for aTSS 2) 'intron_structure' --- which will analyze the isoforms for AS 3) 'tts' --- which will analyze the isoforms for aTSS Then we can simply compare the result of this analysis to the isoform switches with consequences we already have identified to be of interest, and thereby identify which biological mechanisms give rise to the isoform switches with consequence you are interested in. One suggestion for such an analysis is illustrated here: ```{r, results = "hide", message = FALSE} ### Load example data data("exampleSwitchListAnalyzed") ### Reduce datasize for fast runtime selectedGenes <- unique(exampleSwitchListAnalyzed$isoformFeatures$gene_id)[50:100] exampleSwitchListAnalyzedSubset <- subsetSwitchAnalyzeRlist( exampleSwitchListAnalyzed, exampleSwitchListAnalyzed$isoformFeatures$gene_id %in% selectedGenes ) ### analyze the biological mechanisms bioMechanismeAnalysis <- analyzeSwitchConsequences( exampleSwitchListAnalyzedSubset, consequencesToAnalyze = c('tss','tts','intron_structure'), showProgress = FALSE )$switchConsequence # only the consequences are interesting here ### subset to those with differences bioMechanismeAnalysis <- bioMechanismeAnalysis[which(bioMechanismeAnalysis$isoformsDifferent),] ### extract the consequences of interest already stored in the switchAnalyzeRlist myConsequences <- exampleSwitchListAnalyzedSubset$switchConsequence myConsequences <- myConsequences[which(myConsequences$isoformsDifferent),] myConsequences$isoPair <- paste(myConsequences$isoformUpregulated, myConsequences$isoformDownregulated) # id for specific iso comparison ### Obtain the mechanisms of the isoform switches with consequences bioMechanismeAnalysis$isoPair <- paste(bioMechanismeAnalysis$isoformUpregulated, bioMechanismeAnalysis$isoformDownregulated) bioMechanismeAnalysis <- bioMechanismeAnalysis[which(bioMechanismeAnalysis$isoPair %in% myConsequences$isoPair),] # id for specific iso comparison ``` This result is best summarized in a Venn diagram: ```{r, fig.width=3.5, fig.height=3.5} ### Create list with the isoPair ids for each consequencee AS <- bioMechanismeAnalysis$isoPair[ which( bioMechanismeAnalysis$featureCompared == 'intron_structure')] aTSS <- bioMechanismeAnalysis$isoPair[ which( bioMechanismeAnalysis$featureCompared == 'tss' )] aTTS <- bioMechanismeAnalysis$isoPair[ which( bioMechanismeAnalysis$featureCompared == 'tts' )] mechList <- list( AS=AS, aTSS=aTSS, aTTS=aTTS ) ### Create venn diagram library(VennDiagram) myVenn <- venn.diagram(mechList, col='transparent', alpha=0.4, fill=RColorBrewer::brewer.pal(n=3,name='Dark2'), filename=NULL) ### Plot the venn diagram grid.newpage() ; grid.draw(myVenn) ``` From this, we can see the relative importance of each of the three mechanisms, as well as the combination of these. Back to [Table of Content]
### Analysing experiments without replicates Generally we do not recommend analysis of experiments without replicates as it is impossible to assess technical and biological variation. On the other hand, single-replicate pilot studies can be useful for planning larger experiments: hence this small guide. For running IsoformSwitchAnalyzeR without replicates, we have to solely rely on the dIF values to call isoform switches. This can be done by following the steps outlined in [Detailed Workflow] with a few modifications: - Omit the isoform switch test step - Manuallyt set the switch q-value to 1, which can be done as follows: ``` aSwitchList$isoformFeatures$gene_switch_q_value <- 1 ``` - In all subsequent steps of the [Detailed Workflow], make sure IsoformSwitchAnalyzeR ignores the q-value by setting the _alpha_ argument to something larger than 1 (e.g. 1.1). You will have to ignore the resulting warnings IsoformSwitchAnalyzeR produces as it test the validity of the arguments. Back to [Table of Content]
## Frequently Asked Questions, Problems and Errors ### FAQ Table of Content - [What Quantification Tool(s) Should I Use?] - [How to handle cofounding effects (including batches)] - [What constitute an independent biological replicate?] - [Adding differential gene expression] - [The error "The annotation does not fit the expression data"] - [The error "The annotation (count matrix and isoform annotation) contain differences in which isoforms are analyzed..."] - [The error "The supplied design matrix will result in a model matrix that is not full rank"] ### What Quantification Tool(s) Should I Use? Please note that the following is a reflection of my personal convictions and should not be read as "the truth". Generally speaking there are two different ways to obtain transcript quantification which boils down to whether you trust the available transcript annotation or not (or think you RNA-seq data contains important additional isoforms not included in the available annotations). If you generally trust the transcript annotation, state-of-the-art quantifications can be obtained directly from the sequence files (FASTQ) via Salmon or Kallisto. These methods give much more accurate quantifications (and are a lot faster) than for example first mapping to the genome and afterwards quantifying with Cufflinks. You can find more information in the [Kallisto paper](http://www.nature.com/doifinder/10.1038/nbt.3519). For comparison you can find the Salmon paper [here](https://www.nature.com/articles/nmeth.4197) but please not the comparison in figure 1 is a special case. Under normal circumstances Kallisto and Salmon performs quite similarly and and choice is up to personal preference. IsoformSwitchAnalyzeR directly supports the quantifications of both tools (and more) via the importIsoformExpression() function - see [Importing Data Into R]. Kallisto can be downloaded from [here](https://pachterlab.github.io/kallisto/download) and the manual for running Kallisto can be found [here](http://pachterlab.github.io/kallisto/manual.html). Salmon can be downloaded from [here](https://github.com/COMBINE-lab/salmon/releases) and a manual for running Salmon can be found [here](https://salmon.readthedocs.io/en/latest/salmon.html). If you on the other hand do not have any annotation, or want to augment it, you need to do a transcript deconvolution. A transcript deconvolution can be described as a 4-step process: - First you map reads to the genome (with tools such as [STAR](http://labshare.cshl.edu/shares/gingeraslab/www-data/dobin/STAR/STAR.posix/doc/STARmanual.pdf), [HISAT2](https://ccb.jhu.edu/software/hisat2/manual.shtml) etc.). - Then you assemble transcripts (typically guided) with tools such as [Cufflinks](http://cole-trapnell-lab.github.io/cufflinks/manual/) or [StringTie](http://ccb.jhu.edu/software/stringtie/index.shtml?t=manual). Please note that many other tools for solving this problem exists and new methods are still published regularly. Since the transcript assembly is the central step in the analysis, you probably want to read up on these tools yourself to make sure you use the better tools. - The third step is to merge the assembled transcripts from all individual Cufflinks/StringTie/other runs into one combined reference transcripome with tools such as [Cuffmerge](http://cole-trapnell-lab.github.io/cufflinks/cuffmerge/index.html), [StringTie](http://ccb.jhu.edu/software/stringtie/index.shtml?t=manual) or [TACO](https://tacorna.github.io/). - In step 4 this combined transcriptome can be used to quantify all the original FASTQ files with [Cufflinks/CuffDiff](http://cole-trapnell-lab.github.io/cufflinks/manual/), [StringTie]((http://ccb.jhu.edu/software/stringtie/index.shtml?t=manual)), [Salmon](https://salmon.readthedocs.io/en/latest/salmon.html) or [Kallisto](http://pachterlab.github.io/kallisto/manual.html). As an alternative, tools such as [Bayesembler](http://bioinformatics-centre.github.io/bayesembler/) performs step 2-4 jointly. You will need to figure out which combination of tools you choose on your own - but IsoformSwitchAnalyzeR can analyze the data from all of them. For more information see [Importing Data Into R]. Back to [Table of Content]
### How to handle cofounding effects (including batches) "Cofounding effect" is a common description of all sources of variation not of interest for the the analysis you are doing. This can be anything from a batch effects (data/sample produced/analyzed at different locations, timepoints, personel etc) to an effect you are not interested in but which migth influence the result (e.g. sex or age). Taking all this into account to avoide false positive of false negative results can easily be done when testing for isoform switches (or anything else really) when using tools based on linear models. Naturally taking cofounding effects into account is also supported in IsoformSwitchAnalyzeR via the isoformSwitchTestDRIMSeq() and isoformSwitchTestDEXSeq() functions. Both these functions automatically takes additional counfounding effects into account if they have been annotated in the switchAnalyzeRlist. isoformSwitchTestDEXSeq() furthermroe also supports extract of batch corrected effect sizes (IF and dIFs). Counfounding effects should be annotated in 'designMatrix' used when creating the switchAnalyzeRList (via createSwitchAnalyzeRlist() or importRdata()). This is simply done by adding extra columns to the design matrix before creating the switchAnalyzeRlist: So if the design matrix looked like: ```{r} myDesign <- data.frame( sampleID=1:6, condition=c('wt','wt','wt','ko','ko','ko') # WT = wildtype, KO = Knock out ) myDesign ``` Adding a co-founder is simply done as follows: ```{r} myDesign$batch <- factor(c('a','a','b','a','b','b')) myDesign ``` Please note that cofounders can both be categorical (e.g. sex) (indicated by type character or factor) and continous variables (e.g. age) (indicated by numeric or integers) - both will be taken into account. Due to the differenti interpertation it is recomended to use strings/factors for categorical variables (aka not use 1 and 2 to indicate batch in the example above (instead of "a" and "b")). The the conditions analyzed (as indicated by the "condition" column in the design matrix) must however always be categorical (character or factor). Back to [Table of Content]
### What constitute an independent biological replicate? A very nice discussion, which I highly reccomend, can be found in [Vaux et al](http://dx.doi.org/10.1038/embor.2012.36). As a TL;DR here are a couple of guidelines: - Science can under NO circumstances be made from sample sizes of 1 (n=1). To call anything science proof that a result is reproducible is needed. - If all replicates were made at the same time no extrapolations can be made! Running an experiment in parallel does not result in independent samples - it is still only n=1. The experiment will need to be done multiple times. - If all experiments were all done on the same cell line, extrapolation to anything other than that cell line cannot be made. The same actually goes for mouse strains. - If a knock-down is performed multiple different siRNA/shRNAs are needed (both to be independent and due to off target effects). - If a knock-out is created is not enough to only investigate a single cell line originating from single clone. Multiple independent clones are needed. - If you want to investigate a disease, analysing a single mouse or human is not enough - you would need data from multiple patients/mice to extrapolate to other humans/mice. All these replicates which does not allow for extrapolation would be termed independent replicates - and if you ask me the work replicate should never be allowed to be used without specifying whether it is independent or not. Back to [Table of Content]
### Adding differential gene expression Gene differential expression is currently not build into IsoformSwitchAnalyzeR since there are so many other good tools to do this. Some frequently used tools are [edgeR](http://bioconductor.org/packages/edgeR/), [DESeq2](http://bioconductor.org/packages/DESeq2/), [limma]( http://bioconductor.org/packages/limma/) and [Sleuth](https://github.com/pachterlab/sleuth) (for Kallisto users). This list is not exhaustive but clearly shows the variety of well maintained tools out there. For a recent review of these tool we refere you to [Love et al 2018](https://f1000research.com/articles/7-952/v3). No matter which tool you choose we support integration of the result by showing the result in the plot created by the switchPlot() function. To add the result of a differential expression analysis all you need to do is overwrite the following columns: exampleSwitchList$isoformFeatures$gene_q_value exampleSwitchList$isoformFeatures$iso_q_value With the corresponding FDR corrected P-values of a differential expression analysis. We reccomend the base function match() to do this - and do remember to consider conditions (indicated by the "condition_1" and "condition_2" columns) if you have more than one comparison. Back to [Table of Content]
### The error "The annotation does not fit the expression data" This problem can arrise from multiple sources but essentially means that we determined that not all isoforms quantified are found in the annotation or vice versa. Ensuring that it is the exact same isoforms that are pressent in the annotation and quantification is essential because it is the foundation for all downstream analysis with IsoformSwitchAnalyzeR - therefore we need to be quite strict with this. Currently we have documented this problem from three different sources: **Source 1**: The GTF and Fasta files were obtained from [Ensemble](https://www.ensembl.org/info/data/ftp/index.html) - Where files unfortunatly don't match. This most often occure because Ensembl's file structure is a bit complicated (the fasta and GTF files available do not match default). This also mean that to get a combination suitable for IsoformSwitchAnalyzeR a particular combination of files have to be used. Specifically you need to download 3 files: - The cDNA fasta file (called "Homo_sapiens.GRCh38.cdna.all.fa.gz" for human) - The ncRNA fasta file (called "Homo_sapiens.GRCh38.ncrna.fa.gz" for human) - The largest GTF which include haplotype information (called "Homo_sapiens.GRCh38.93.chr_patch_hapl_scaff.gtf.gz" for human) The two fasta files (cdna + ncrna) then needs to be concatenated into one fasta file. This concatenated file (almost perfectly) match the large GTF file. This also mean that you will have to use the concatenated fasta file as a basis for building the Salmon/Kallisto/RSEM index. Please note that since [GENCODE genes](http://gencodegenes.org/) is a parsing of Ensemble data their files might be easier to use. Simply use the "Comprehensive gene annotation" GTF file along with the "Transcript sequences" fasta file - these match. **Source 2**: The GTF and FASTQ file used to quantify with Kallisto/Salmon/RSEM/StringTie contain non-overlapping isoforms which in some cases are due to version numbering being pressent or not. We have found this when using data downloaded from Ensembl. Three possible solutions are: - Update to the lastest IsoformSwitchAnalyzeR version which better handles this. - Use files from [GENCODE genes](http://gencodegenes.org/) instead. Simply use the "Comprehensive gene annotation" GTF file along with the "Transcript sequences" fasta file - these match. - Use a tool to generate the fastq file from the GTF file of interest. This can be done with tool such as the gffread tool included in both [Cufflinks/Cuffdiff](http://cole-trapnell-lab.github.io/cufflinks/file_formats/#the-gffread-utility) and [Stringtie](http://ccb.jhu.edu/software/stringtie/gff.shtml#gffread) **Source 3**: When using RSEM for the quantificaitons some older versions of tximport could cause this problm. To fix this problem you need to update the underlying package tximport. This can be done by copy/pasting this command into R: devtools::install_github("mikelove/tximport") Back to [Table of Content]
### The error "The annotation (count matrix and isoform annotation) contain differences in which isoforms are analyzed..." This error occurs because there is differences in which isoforms are found in the annotation and which isoforms have been quantified. This problem have (so far) only been observed in data quantified with Salmon. It typically occres because during the index building Salmon collapses identical transcripts. A discussion with the Salmon authors of pros and cons of this behaviour can be found [here](https://github.com/COMBINE-lab/salmon/issues/214). We ask the user to make sure why this problem occurs in the user-specific setting, and that removing the non-quantified isoforms is the desired solution, before continuing. Back to [Table of Content]
### The error "The supplied design matrix will result in a model matrix that is not full rank" If a desing matrix is not of full rank it simply means that one or more columns (or combinations of columns) in the design matrix describe the same thing. Imaging a case when you had two paramters describing the same person - say fingerprint and social security number. If you then try to estimate the effect of these factors you can have an infinite number of solutions by having them being opposites (-1 and 1, -2 and 2 etc) which means you cannot estimate a linear model. If you have problems with this the easies thing is to write a post about it on [http://support.bioconductor.org](http://support.bioconductor.org). ## Final Remarks With this vignette, we hope to provide a thorough introduction to IsoformSwitchAnalyzeR and give some examples of what IsoformSwitchAnalyzeR can be used for. We aim to continuously keep IsoformSwitchAnalyzeR up to date and continuously improve it. The latter includes the integration of new tools as they are developed (isoform quantification, isoform switch identification, statistical ORF detection, external sequence analysis etc.), so please feel free to suggest new tools to us (see the [How To Get Help] section for info of how to get in contact). Tools should preferably either be light weight (runnable on an old laptop) and/or available via web-services as that will allow a larger audience to use it. As the continued updates might change the behaviour of certain functions we encurage uses the read the NEWS file distributed with the R package whenever IsoformSwitchAnalyzeR is updated. Back to [Table of Content] ## Sessioninfo ```{r} sessionInfo() ``` Back to [Table of Content]