%\VignetteIndexEntry{spliceR Vignette} %\VignetteKeyword{RNA-Seq} %\VignetteKeyword{alternative splicing} %\VignetteKeyword{PTC} %\VignetteKeyword{nonsense mediated decay} %\VignettePackage{spliceR} \documentclass[11pt]{article} \usepackage[utf8]{inputenc} \textwidth=6.1in \textheight=8.7in \oddsidemargin=0.15in \evensidemargin=0.15in \headheight=0in \headsep=0in \begin{document} \title{spliceR: An R package for classification of alternative splicing and prediction of coding potential from RNA-seq data} \author{Kristoffer Knudsen, Johannes Waage} \date{5 Dec 2013} \maketitle \newpage \tableofcontents \newpage \section{Introduction} \texttt{spliceR} is an R package that allows for full isoform classification of transcripts generated by RNA-seq assemblers such as Cufflinks. spliceR outputs information about classes of alternative splicing, susceptibility of transcripts to the nonsense mediated decay mechanism (NMD), and provides functions for export of data to the GTF format for viewing in genome browsers, and a few plotting functions. \texttt{spliceR} lies at the end of a typical RNA-seq workflow, using full devoncoluted isoforms generated by assemblers, and supports output of genomic locations of spliced elements, facilitating downstream analyses on sequence elements. \texttt{spliceR} stores transcript information in the \texttt{SpliceRList} object, constisting of two \texttt{GRanges}, 'transcript\_features' and 'exon\_features'. The 'transcript\_features' \texttt{GRanges} object contains meta-columns with additional information about each isoform, including parent gene, and expression values. The 'exon\_features' \texttt{GRanges} contains all exons, with a isoform ID column, refering back to 'transcript\_features'. Such a \texttt{SpliceRList} can be generated from a Cufflinks workflow by using the built-in \texttt{prepareCuff} function, directly reading from a cuffDB object created by Cufflinks auxiliary package \texttt{cummeRbund}. Alternatively, a \texttt{SpliceRList} can be generated manually from user-supplied transcript and exon information generated by any RNA-seq assembler. Additionally, \texttt{spliceR} provides annotation of transcripts with nonsense mediated decay susceptibility via \texttt{annotatePTC()}, visualization via \texttt{spliceRPlot()}, generation of GTF files for genome browsers via \texttt{generateGTF()}, and a few other auxilliary functions. \subsection{Alternative splicing classes} \texttt{spliceR} annotates the 'transcript\_features' \texttt{GRanges} with the following splice classes: \begin{itemize} \item Single exon skipping/inclusion (ESI) \item Multiple exon skipping/inclusion (MESI) \item Intron retention/inclusion (ISI) \item Alternative 5'/3' splice sites (A5 / A3) \item Alternative transcription start site (ATSS) / transcription termination site (ATTS) \item Mutually exclusive exons (MEE) \end{itemize} \section{Cufflinks workflow} Cufflinks is part of the Tuxido suite, a well-documented and well-supported array of tools for mapping, assembly and quantification of RNA-seq data. At the end of a Tuxido suite workflow often lies analysis with the \texttt{cummeRbund} package, that converts the output of Tuxido RNA-seq assembler Cufflinks with the analysis and visualisation power of R. \texttt{spliceR} provides the \texttt{prepareCuff()} function for easy conversion of \texttt{cummeRbund}s \texttt{cuffSet} class to \texttt{spliceR}s \texttt{SpliceRList} class. As \texttt{spliceR} requires full information about isoforms and the genomic coordinate of their constituting exons, the transcript GTF file output by Cufflinks must be supplied when generating cuffSet object. Here, we use the built-in data set supplied with \texttt{cummeRbund} to generate a \texttt{SpliceRList}: <>= library("spliceR") dir <- tempdir() extdata <- system.file("extdata", package="cummeRbund") file.copy(file.path(extdata, dir(extdata)), dir) cuffDB <- readCufflinks(dir=dir, gtf=system.file("extdata/chr1_snippet.gtf", package="cummeRbund"),genome="hg19" ,rebuild=TRUE) cuffDB_spliceR <- prepareCuff(cuffDB) @ The \texttt{SpliceRList} is a list, containing two \texttt{GRanges} objects, \texttt{transcript\_features} and \texttt{exon\_features}, containing information about the genomic coordinates and transcript and gene identifiers of transcripts and their constituent exons, respectively. Using \texttt{prepareCuff()} creates additional metacolumns in the \texttt{transcript\_features} \texttt{GRanges} object, allowing for further filtering later in the spliceR workflow. In addition, the \texttt{SpliceRList} object contains information about the sample names, the source of the data, any filters applied, and other data used by \texttt{spliceR}. A number of helper functions are provided to extract the transcript and exon GRanges, as well as the sample names: <>=> myTranscripts <- transcripts(cuffDB_spliceR) myExons <- exons(cuffDB_spliceR) conditions(cuffDB_spliceR) @ Once the have the \texttt{SpliceRList} object ready, it can be further annotated with alternative splicing classes, using, \texttt{spliceR()}, or with information about the coding potential of each transcript with \texttt{annotatePTC()} \subsection{Cufflinks gene annotation problem} Cufflinks uses an internal 'gene id'. This gene id is, according to the official manual, 'a unique identifier describing the gene' . But upon investigation we have found that many Cufflinks gene ids actually contain multiple annotated genes. To our knowledge this problem is most visible when running Cufflinks/Cuffdiff with the Ensembl/Gencode as reference transcriptome, using an unstranded RNA-seq experiment. Under these conditions we find that 200-400 (2-4 percent of all) cuffgenes consists of multiple annotated genes. Our hypothesis is, that the problem arises because there are long Ensembl transcripts spanning the intragenic regions between these genes, thereby tricking Cufflinks into thinking that all transcripts, which originates from this region, must belong to the same gene. The global effect of the Cufflinks annotation problem is that the expression levels of all these individual genes are wrong, since the expression level is an average over a couple of genes. This in turns also affects the differential expression analysis of the genes. The Cufflinks annotation problem does however only affect the expression level of genes, meaning that both transcript expression levels and differential expression tests of these are unaffected by this problem. Our (crude) solution to this problem is to: 1) Identify all affected genes 2) Create a new cuffgene for each real annotated gene within the cuffgenes 3) Recalculate the gene expression according to the new assignment of transcript 4) Set significance test to be negative This should correct gene expression and remove false positive differentially expressed genes. This setting (The 'fixCufflinksAnnotationProblem' setting in prepareCuff()) is on by default. \section{GRanges workflow} For RNA-seq data from other transcripts assemblers than Cufflinks, a \texttt{SpliceRList} can easily be manually created, by supplying the class constructor with transcript and exon \texttt{GRanges}. In the following example, a public available transcript and exon model set is downloaded, and expression and p-values between samples are set to faux values to simulate generic RNA-seq assembled data. First, the UCSC knownGene transcript table is downloaded from the UCSC repository: <>=> session <- browserSession("UCSC") genome(session) <- "hg19" query <- ucscTableQuery(session, "knownGene") tableName(query) <- "knownGene" cdsTable <- getTable(query) @ Next, we download the kgXref table from UCSC, contaning the link between transcript IDs and gene IDs. <>=> tableName(query) <- "kgXref" kgXref <- getTable(query) @ The \texttt{GRanges} containing the transcript features is created (setting faux gene expression, fold change and p values), fetching the gene names from the kgXref table: <>=> knownGeneTranscripts <- GRanges( seqnames=cdsTable$"chrom", ranges=IRanges( start=cdsTable$"txStart", end=cdsTable$"txEnd"), strand=cdsTable$"strand", spliceR.isoform_id = cdsTable$"name", spliceR.sample_1="placeholder1", spliceR.sample_2="placeholder2", spliceR.gene_id=kgXref[match(cdsTable$"name", kgXref$"kgID"), "geneSymbol"], spliceR.gene_value_1=1, spliceR.gene_value_2=1, spliceR.gene_log2_fold_change=log2(1/1), spliceR.gene_p_value=1, spliceR.gene_q_value=1, spliceR.iso_value_1=1, spliceR.iso_value_2=1, spliceR.iso_log2_fold_change=log2(1/1), spliceR.iso_p_value=1, spliceR.iso_q_value=1 ) @ The \texttt{GRanges} containing the exon features is created: <>=> startSplit <- strsplit(as.character(cdsTable$"exonStarts"), split=",") endSplit <- strsplit(as.character(cdsTable$"exonEnds"), split=",") startSplit <- lapply(startSplit, FUN=as.numeric) endSplit <- lapply(endSplit, FUN=as.numeric) knownGeneExons <- GRanges( seqnames=rep(cdsTable$"chrom", lapply(startSplit, length)), ranges=IRanges( start=unlist(startSplit)+1, end=unlist(endSplit)), strand=rep(cdsTable$"strand", lapply(startSplit, length)), spliceR.isoform_id=rep(knownGeneTranscripts$"spliceR.isoform_id", lapply(startSplit, length)), spliceR.gene_id=rep(knownGeneTranscripts$"spliceR.gene_id", lapply(startSplit, length)) ) @ Finally, a \texttt{SpliceRList} object is created from both \texttt{GRanges}, setting assembly, source, and coditions : <>=> knownGeneSpliceRList <- SpliceRList( transcript_features=knownGeneTranscripts, exon_features=knownGeneExons, assembly_id="hg19", source="granges", conditions=c("placeholder1", "placeholder2") ) @ \section{preFiltering} Datasets produced by e.g. Cufflinks in multi-sample setups can quickly become very large, as all pairwise combinations are considered. Many genes and transcripts reported by Cufflinks are based on reference GTF files provided at run time, and may not be expressed in any or all samples. Similarly, Cufflinks may not have confidence in all reported transcripts, marking these models as 'FAIL' or 'LOW DATA'. Filtering these genes and transcripts out of the \texttt{SpliceRList} before running \texttt{spliceR()} and downstream analyses may decrease runtimes considerably. Given here is an example of a prefiltering step, using standard filters. For more on these filters, refer to the help page for \texttt{spliceR()}. NB: Note that \texttt{preSpliceRFilter()} reports the number of total pairwise comparison, while \texttt{spliceR()} reports the total number of unique isoforms. <>=> #cuffDB_spliceR_filtered <- preSpliceRFilter( # cuffDB_spliceR, # filters=c("expressedIso", "isoOK", "expressedGenes", "geneOK") # ) @ \section{SpliceR} \texttt{spliceR()} takes a \texttt{SpliceRList} object and annotates the isoforms contained within with classes of alternative splicing, as well as a few other metacolumns - data could be generated either by \texttt{prepareCuff()} or manually, via a \texttt{GRanges} object. The \texttt{compareTo} parameter, set to either 'preTranscript' or a given sample, tells \texttt{spliceR()} how to sort out isoform comparisons when classifying events. Using the 'preTranscript' parameter, \texttt{spliceR()} creates a hypothetical pre-RNA based on all isoforms for each unique gene id, aggregating all exon information. Each isoform is then compared to this construct when classifying. Giving a sample name urges \texttt{spliceR()} to consider the major isoform (in terms of isoform expression relative to the total gene expression) of that sample as the primary RNA, comparing each isoform to this reference when classifying. \texttt{spliceR()} can be given a number of filters to reduce the overall data size and running time. Most of these filters are only relevant when running on data from cufflinks, as this software generates additional information for each transcript and gene, indicating the overall robustness and confidence in the models. A typical call to \texttt{spliceR()} is shown below, setting common filters and setting the reference to the pre-RNA. For details on all the filters, refer to the help page for \texttt{spliceR()}. <>=> #Commented out due to problems with vignettes and progress bars. mySpliceRList <- spliceR( cuffDB_spliceR, compareTo="preTranscript", filters=c("expressedGenes","geneOK", "isoOK", "expressedIso", "isoClass"), useProgressBar=F ) @ Typical runtime for < 100.000 transcripts for 4 samples on a contemporary workstation should be in the vicinity of 1.5-2 hours. After \texttt{spliceR()} is done, the \texttt{transcriptFeatures} \texttt{GRanges} is populated with several additional columns. These include columns for isoform fraction values (IF) and delta IF. Each splicing class gets a column, indicating how many events of a given class was detected for the respective transcripts. Finally, the "analyzed" columned indicates whether this transcript was carried through filtering. \section{PTC-annotation} \texttt{annotatePTC()} takes a \texttt{SpliceRList} object, a \texttt{BSGenome} sequence object and a \texttt{CDSList} object. The \texttt{BSGenome} sequence object can be downloaded and installed from Bioconductor, and the assembly and chromosome names herein should match the assembly used for mapping with Cufflinks or any other RNA-seq assembler (e.g. BSgenome.Hsapiens.UCSC.hg19 or BSgenome.Mmusculus.UCSC.mm9). The \texttt{CDSList} object is generated using \texttt{getCDS()}, which takes a assembly and gene track names, and automatically retrieves CDS information from the UCSC Genome Browser table (internet connection required). Alternatively, a \texttt{CDSList} object can be created manually with a CDS table of choice (refer to \texttt{?CDSList} for more information). Using the \texttt{CDSList} and the \texttt{BSGenome} object, \texttt{annotatePTC()} retrieves genomic sequence for all the exons contained in the \texttt{SpliceRList} \texttt{exon\_features} \texttt{GRanges}, and iterates over each transcript in the \texttt{transcript\_features} \texttt{GRanges}, concatenating exons, and translating the transcript with a compatible CDS. If more than one compatible CDS is found, the most upstream is used. For each transcript, the \texttt{transcript\_features} \texttt{GRanges} is annotated with stop codon position (in mRNA coordinates), stop codon exon (where 0 is the last exon, -1 is the second last exon and so forth), distance to the final exon-exon junction (in mRNA coordinates), and the genomic location of the used CDS. If no CDS is found, the columns are set to NA. Finally, transcripts are annotated with TRUE/FALSE with respect to nonense mediated decay (NMD) sensitivity; this parameter is set to a default of 50 (meaning, that transcripts with STOPs falling more than 50 nt from the final exon-exon junction, and not in the final exon), but can be changed at runtime. Here follows a typical workflow for \texttt{annotatePTC()}: <>=> ucscCDS <- getCDS(selectedGenome="hg19", repoName="UCSC") require("BSgenome.Hsapiens.UCSC.hg19",character.only = TRUE) #Commented out due to problems with vignettes and progress bars. #PTCSpliceRList <- annotatePTC(cuffDB_spliceR, cds=ucscCDS, Hsapiens, # PTCDistance=50) @ \section{GTF export} After running assembly with cufflinks or another assembler, and perhaps \texttt{spliceR()} and/or \texttt{annotatePTC()}, visualization of trancripts in genome browsers is often helpful. To facilitate this, \texttt{spliceR} provides the generateGTF function. \texttt{generateGTF()} generates a GTF for each sample, and allows for filtering of transcripts, identical to the filters used in \texttt{spliceR()} and \texttt{annotatePTC()}. Furthermore, \texttt{generateGTF()} color codes transcripts according to expression. The exact scoring method is set by the parameter \texttt{scoreMethod} - "local" scores transcripts relative to the transcript most highly expressed within each unique gene identifier, and "global" scores transcripts in relation to the whole sample. A typical call to \texttt{generateGTF()} could look like this: <>=> generateGTF(mySpliceRList, filters= c("geneOK", "isoOK", "expressedGenes", "expressedIso"), scoreMethod="local", useProgressBar=F) @ GTF-files are output to the current working directory, and are names according to sample names, with the \texttt{filePrefix} argument added to each file. \section{Visualizations} After running \texttt{spliceR()}, annotating the spliceRlist object with information about alternative splicing events. the \texttt{spliceRPlot()} function can be used for initial data exploration. \texttt{spliceRPlot()} generates a range of Venn diagrams, with one circle per condition, analyzing different aspects of the data. The \texttt{spliceRPlot()} function have two layers of control. Firstly the user can control what data is used for the plots by the two parameters expressionCutoff and filters. Since the fundamental purpose of the Venn diagrams is to distinguish between transcripts expressed in the different conditions, the expressionCutoff parameter is used to control at what cutoff is regarded as being expressed. Next \texttt{spliceRPlot()} have, like the other core functions of \texttt{spliceR}, a filtering argument, accessed by the filters parameter, which allows for selective subsetting of the data. Since filtering data preparation and conversion might take a couple of minutes, \texttt{spliceRPlot()} returns a spliceRlist object, where intermediate data structures have been stored, which enables the user to skip this step if another plot is made. If the user which to change the expressionCutoff or the filters parameter, the reset argument should be set to TRUE. The second layer of control is to determine what is actually plotted. There are a number of options available through the evaluate parameter, ranging from the number of genes expressed in each of the condition, to the average number of alternative splicing events found in the transcripts expressed in the different conditions. Furthermore the "asType" parameter allows the user to specify a subtype of alternative splicing to evaluate. \begin{center} <>=> #Plot the average number of transcripts pr gene mySpliceRList <- spliceRPlot(mySpliceRList, evaluate="nr_transcript_pr_gene") @ <>=> #Plot the average number of Exon skipping/inclucion evens pr gene mySpliceRList <- spliceRPlot(mySpliceRList, evaluate="mean_AS_transcript", asType="ESI") @ <>=> #Plot the average number of Exon skipping/inclucion evens pr gene, #but only using transcripts that are significantly differntially expressed mySpliceRList <- spliceRPlot(mySpliceRList, evaluate="mean_AS_transcript", asType="ESI", filters="sigIso",reset=TRUE) @ \end{center} <>= sessionInfo() @ \end{document}