--- title: "NBSplice-vignette" author: "Gabriela A. Merino and Elmer A. Fernandez" date: "`r BiocStyle::doc_date()`" package: "`r BiocStyle::pkg_ver('NBSplice')`" abstract: > NBSplice is an R package that uses isoform expression counts to evaluate differential splicing by means of generalized linear models at the gene level. The package allows the estimation of differences between isoforms' relative expression to infer changes in alternative splicing patterns. This vignette explains the use of the package in a typical analysis workflow. NBSplice package version: `r packageVersion("NBSplice")` output: rmarkdown::html_document: highlight: pygments toc: true fig_width: 5 bibliography: library.bib vignette: > %\VignetteIndexEntry{NBSplice-vignette} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding[utf8]{inputenc} %\usepackage[utf8]{inputenc} --- ```{r setup, echo=FALSE, results="hide"} knitr::opts_chunk$set(tidy=FALSE, cache=TRUE, dev="png", message=FALSE, error=FALSE, warning=TRUE) ``` # NBSplice installation `NBSplice` is an R package sitting on Bioconductor. You could download the source file from the Bioconductor site (http://bioconductor.org/) and then use `R CMD install`. Another easy way to install it is by doing: ```{r bioC, eval=FALSE} ## try http:// if https:// URLs are not supported if (!require("BiocManager")) install.packages("BiocManager") BiocManager::install("NBSplice", version="devel") ``` # Analysis workflow Here we show the most basic steps for a differential splicing analysis pipeline using the `NBSplice` R package. ## Input Data To start the analysis, an expression matrix at the isoform/transcript level called `isoCounts`, previously generated by upstream processing tools like *Kallisto* [@bray2016near] or *RSEM* [@li2011rsem], is required. You also need a table describing isoform-gene relationships called `geneIso` and a table of experiment information called `designMatrix`. In particular, this table must contain a column storing the experiment factor which will be evaluated. Here, we assume that this column is called `condition`. ### Transcript expression matrix As input, the `NBSplice` package expects isoform expression counts obtained from RNA-seq in the form of a matrix of numerical values. In this matrix, the count value found in the *i*-th row and the *j*-th column represents the expression counts assigned to isoform *i* in experimental sample *j*. The values in the matrix should be un-normalized. We suggest the use of the effective counts typically reported by the mentioned quantification tools. It is not necessary to round those because `NBSplice` will do it internally. The expression matrix could be easily obtained by the combination of quantification files for all samples. For instance, if Kallisto was used to transcript quantification, the expression matrix should be built combining the `est_counts` columns of those files as is shown in the code below. ```{r expMat built, eval=FALSE} fileNames<-c(paste(rep("C1R",4), 1:4,"/abundance.tsv", sep=""), paste(rep("C2R",4), 1:4,"/abundance.tsv", sep="")) myInfo<-lapply(seq_along(fileNames), function(x){ quant<-read.delim(fileNames[x], header = TRUE) expression<-quant[,"est_counts"] return(expression) }) # Adding the isoform IDS as row names isoform_id<-read.delim(fileNames[x], header = TRUE)[,"target_id"] expressionMatrix<-(do.call(cbind, myInfo)) rownames(expressionMatrix)<-isoform_id # Adding the samples names as column names colnames(expressionMatrix)<-c(paste(rep("C1R",4), 1:4, sep=""), paste(rep("C2R",4), 1:4, sep="")) ``` If another quantification tool was used, you only need to change the name of the quantification files and the name of the columns having estimated counts and isoforms IDs. For instance, for RSEM quantification, isoforms counts and IDS are in the `expected_count` and `transcript_id` columns of the `Sample.isoforms.results` file, respectively. As an example, we have included an `isoCounts` matrix which is a subset of one of the simulated matrix used to demonstrate `NBSplice` utility. ```{r expMat, eval=TRUE} data(isoCounts, package="NBSplice") head(isoCounts) dim(isoCounts) ``` As can be noted, the `isoCounts` matrix represents an experiment with `8` samples were `3122` isoforms were quantified. ### Gene-isoform relationships matrix The `geneIso` matrix must have two columns called `isoform_id` and `gene_id` and as many rows as the `isoCounts` matrix. The *i*-th row of the `geneIso` matrix specifies from which gene was originated the *i*-th isoform. This information will be used by `NBSplice` methods for model fitting and hypothesis tests. This matrix could be easily obtained from the annotation file used for isoform quantification. For instance, if RSEM was used, the matrix can be built taking the first two columns (`transcript_id` and `gene_id`) of the `Sample.isoforms.results` file. In the case of Kallisto, this information is not stored in the quantification file. You could use the `biomaRt` R package to obtain gene IDs for the quantified isoforms or the `refGenome` package to extract those from the annotation file. As an example, we provided a `geneIso` matrix associated with the `isoCounts` object previously loaded. ```{r geneIsoMat, eval=TRUE} data(geneIso, package="NBSplice") head(geneIso) dim(geneIso) ``` This matrix contains annotation information, so, you need to build it everytime that you change your reference transcriptome, downloading the respective information from the database site. ### Experiment data and colName The `designMatrix` table must contain experiment information relevant for the differential splicing analysis. Its number of columns is variable but it should have as rows as columns the `isoCounts` matrix has. Thus, in the `designMatrix` the *i*-th row specifies experimental information associated to the *i*-th sample. It is worth to mention that the order and names of `isoCounts` columns must be the same of the `designMatrix` rows. Currently, `NBSplice` allows two-levels comparison of single-factor experiments (multiple-factor experiment comparisons are in development). In order to do this, it is necessary to specify the argument `colName` which indicates the name of the `designMatrix` column corresponding to that factor. We provided the `designMatrix` object associated to our `isoCounts` matrix which could be loaded doing `data(designMatrix, package="NBSplice")`. Alternatively, this matrix could be created using the code below. ```{r designMat, eval=TRUE} designMatrix<-data.frame(sample=c(paste(rep("C1R", 4), 1:4, sep=""), paste(rep("C2R", 4), 1:4, sep="")), condition=c(rep( "Normal", 4), rep("Tumor", 4))) rownames(designMatrix)<-designMatrix[,"sample"] designMatrix ``` In our case, we must define our `colName` argument indicating that our factor of interest is *condition*. ```{r colData, eval=TRUE} colName<-"condition" levels(designMatrix[,colName]) ``` As can be noted, our experiment involved two experimental conditions: *Normal* and *Tumor*. Note that this column of the `designMatrix` is a `factor` with two levels where the first level, commonly taken as the reference, is *Normal*. ## IsoDataSet class ### Basics `NBSplice` incorporates a new class called `IsoDataSet` to store expression counts and experimental data. An object of this class has the next slots: `counts`, `geneCounts`, `colData`, `isoGeneRel`, `design` and `lowExpIndex`. The first slot will contain the `isoCounts` matrix, whereas, the second one store the expression counts at the gene level. To do this, `NBSplice` incorporates a function which takes every gene of `geneIso` and, for each sample, compute its total expression as the sum of the counts of all its isoforms contained in `isoCounts`. The third and fifth slots correspond to `designMatrix` and `geneIso` objects. The slot called `design` save the formula used to model fitting and hypothesis tests, and is internally defined by `NBSplice` based on the `colName` argument. The last argument, `lowExpIndex`, stores the indexes of low-expressed isoforms. The way to define those isoforms is explained in [Identifying low-expressed isoforms]. ### Creating an object The typical way to build an `IsoDataSet` object is using its constructor which have the same name as the class. In addition to the four arguments previously defined, it is possible to define the `BPPARAM` argument which must be a `BiocParallelParam` instance defining the parallel back-end to be used [@biocparallel]. ```{r objectBuild, eval=TRUE} library(NBSplice) myIsoDataSet<-IsoDataSet(isoCounts, designMatrix, colName, geneIso) ``` `NBSplice` also provides several methods for easy object and slots inspections. ```{r objectInsp, eval=TRUE} show(myIsoDataSet) head(isoCounts(myIsoDataSet)) ``` Note that the `isoCounts` method returns the `counts` slot but it seems to be different to the `IsoCounts` object previously loaded. This is because this method has an additional logical parameter, `CPM` with default value `TRUE`, useful to indicate if expression counts should be returned in *counts per million* (CPM) scale. ### Identifying low-expressed isoforms Low expression transcripts have been widely recognized as a source of noise in model fitting and parameter estimation of RNA-seq expression counts. Typically, count thresholds are used to decide if a gene or an isoform has low expression. However, `NBSplice` assumes that an isoform has a low expression if their absolute or relative expression is lower than specifics thresholds. This assumption is due to the package model infers differential splicing by means of changes in relative expression of isoforms. Consequently, a transcript which has a very low relative expression in both conditions could exhibit a small change but enough to be erroneously detected as significant. The `buildLowExpIdx` method is responsible for the identification of low-expressed isoforms. It takes an `IsoDataSet` object and two expression thresholds and creates a low-expression index and stores this index in the `lowExpIndex` object slot. The two thresholds are called `ratioThres` and `countsThres`. The first one of those is used to set a minimum of relative expression that each isoform should achieve in all experimental samples. The second one indicates a threshold in CPM for mean expression across conditions. The method also needs the specification of the `colName` argument, in order to compute mean expressions per factor level, and admits parallelization using the `BPPARAM` argument. ```{r lowExp, eval=TRUE} myIsoDataSet<-buildLowExpIdx(myIsoDataSet, colName, ratioThres = 0.01, countThres = 1) ``` ### Testing differential gene splicing Once an `IsoDataSet` was built and its low-expressed isoforms were identified, you could use the `NBSplice` for differential splicing evaluation using the `NBTest` method. It provides a user-friendly interface between model fitting and hypothesis testing and the user by means a single code line. The theory behind the `NBSplice` model is explained in [NBSplice strategy]. Briefly, `NBTest` iterates along genes obtaining significance p-values at gene and isoform level indicating the occurrence of differential splicing and significant changes in isoform proportion, respectively. Then, isoform and gene p-values are adjusted. Low-expressed isoforms are ignored for model fitting but their mean relative expression per condition are also computed. The method requires the specification of the `colName`, as well as the statistic assumed for hypothesis test p-values distribution. As the `colName` factor could have more than two levels, it is necessary the specification of an additional argument defining the two levels to be contrasted. By default, the `contrast` argument takes the two first levels of `colName`. `NBTest` returns an `NBSpliceRes` object which stores the obtained results. ```{r NBTest, eval=TRUE} myDSResults<-NBTest(myIsoDataSet, colName, test="F") ``` ```{r object loading2, echo=FALSE, eval=TRUE, results="hide"} data(myDSResults, package="NBSplice") ``` ## NBSpliceRes class ### Basics `NBSpliceRes` is another new class provided by NBSplice to facilitate the manipulation and exploration of the differential splicing results obtained using the `NBTest` method. An object of this class has three slots called `results`, `lowExpIndex` and `contrast`. The first slot stores differential splicing results for all isoforms, whereas the second slot keeps the indexes of low-expressed isoforms and the third one contains the evaluated contrast. Specific methods for slot accession are also provided by `NBSplice` ```{r NBRes, eval=TRUE} head(lowExpIndex(myDSResults)) contrast(myDSResults) head(results(myDSResults)) ``` The `data.frame` containing differential splicing results has the next columns: * `iso`: Isoform name. * `gene`: Gene name. * `ratio_X`: Relative expression of the isoform in experimental condition *X*, which is the first value specified with the `contrast` argument. * `ratio_Y`: Relative expression of the isoform in experimental condition *Y*, which is the second value specified with the `contrast` argument. * `odd`: Ratio between isoform's relative expression in condition *X* and *Y*. * `stat`: Statistic of the hypothesis test performed to decide if the change in isoform's relative expression between *X* and *Y* conditions was significant. * `pval`: P-value at the isoform level corresponding to the statistic `stat`. * `genePval`: P-value at the gene level. * `FDR`: Adjusted p-value at the isoform level. * `genePval`: Adjusted p-value at the gene level. In particular, the `pval` or its adjusted value, `FDR`, is useful for deciding if an isoform experimented significant change in its relative expression due to the change between *X* and *Y* conditions, here called *Normal* and *Tumor*. Whereas, the `genePval` or its adjusted value, `geneFDR`, will help you to decide if a gene evidenced significant changes in its alternative splicing pattern when conditions *X* and *Y* were contrasted. In the example illustrated above, non-reliable isoforms were discarded by means of the `filter` argument of the `results` methods which has `TRUE` as its default value. Non-reliable refers to both low expressed isoforms and those cases where model estimations did not achieve convergence. To obtain the full `results` matrix only you need specifies that `filter` argument is `FALSE`. ```{r lowExpRes, eval=TRUE} head(results(myDSResults, filter = FALSE)) ``` Non-reliable isoforms could be easily detected in the obtained matrix because they present `NA` value in all parameters derived from model fitting and hypothesis test. It is worth to note that although non-reliable isoforms were not analyzed by statistical models, they have been used to compute the total gene counts so in order to avoid overestimation of relative expression of the well-expressed isoforms. ### Results exploration The `NBSpliceRes` class has several methods useful for differential splicing results exploration. #### Getting results of differentially spliced genes `NBSplice` assumes a gene is differentially spliced if its p-value (raw or adjusted) is lower than a significant threshold (default value = 0.05). The results table only for those differentially spliced genes is easily extracted using the `GetDSResults` method. Whereas, the names of those genes is obtained by means of the `GetDSGenes`. Both methods expect at least one argument of class `NBSpliceRes`. In addition, they could receive a logical parameter, `adjusted`, indicating if p-values must be adjusted (default) or not, and a numeric parameter, `p.value`, establishing the significance threshold (default value = 0.05). ```{r getDSGe, eval=TRUE} mySignificantRes<-GetDSResults(myDSResults) head(mySignificantRes) dim(mySignificantRes) myDSGenes<-GetDSGenes(myDSResults) head(myDSGenes) length(myDSGenes) ``` As can be seen, in our example, `NBSplice` detected 184 isoforms with a significant change of relative expression between *Normal* and *Tumor* condition whereas 40 genes were identified as differentially spliced. #### Isoforms' relative expression plot The scatter plot of isoforms' relative expression between the two contrasted conditions is explored using the `plotRatiosDisp`. It returns a `ggplot2` object which could be then easily modified and customized. ```{r plotD, eval=TRUE} plotRatiosDisp(myDSResults) ``` As can be noted, isoforms are colored according to if they came from those genes detected as differentially spliced or not. To do this, the `adjusted` and `p.value` parameters, previously used to call the `getDSResults` method, could be specified in the `plotRatiosDisp` calling. #### Volcano plot The volcano plot is commonly used to explore the dispersion between statistical significance from a statistical test with the magnitude of the change. In the case of `NBSplice` the isoform p-value and the difference between isoform's relative expression per contrasted condition are explored using the `plotVolcano` method. In addition, isoforms are colored according to if they came from those genes detected as differentially spliced or not. ```{r plotv, eval=TRUE} plotVolcano(myDSResults) ``` #### Exploring a differentially spliced gene `NBSplice` also provides two methods to individually explore differentially spliced genes: `GetGeneResults` and `plotGeneResults`. The first one is useful for obtaining the differential expression results of a single gene, specified using the `gene` argument. The second one is a graphical method to generate a bar-plot illustrating the isoforms' relative expression values in the two contrasted conditions. In both cases, it is possible filtering those non-reliable isoforms by means of the `filterLowExpIso` argument. In the case of the `plotGeneResults` method, non-significant isoforms could be also filtered out setting the `filterNotSignificant` argument to `TRUE` which is its default value. As an example, the results obtained for the first analyzed gene are shown below. ```{r plotGene, eval=TRUE} ## Select the first differentially spliced gene gene<-GetDSGenes(myDSResults)[1] GetGeneResults(myDSResults, gene) plotGeneResults(myDSResults, gene) ``` ```{r plotGene2, eval=TRUE} ## Keeping non-reliable and non-significant isoforms plotGeneResults(myDSResults, gene, filterLowExpIso = FALSE, filterNotSignificant = FALSE) ``` # NBSplice strategy `NBSplice`, is a package developed for differential splicing detection starting from isoform quantification data from RNA-seq experiments. Our tool is based on generalized linear models (GLMs) and uses them to infer significant differences in isoforms relative expression values between two experimental conditions. In addition, `NBSplice` not only reveals what genes were under DS but also reveals the identity of the gene isoforms affected by it as well as their relative expression in each experimental condition. ## Model `NBSplice` is based on the use of GLMs to model isoform expression counts. This approach has been widely used for several differential expression tools like `edgeR` [@robinson2010edger] and `DESeq2` [@love2014moderated] for differential gene expression and `DEXSeq` [@anders2012detecting] for differential exon usage analysis. These tools are based in negative binomial (NB) models and, in general, they assume that the counts of the i-th gene/exon in the k-th sequenced sample, denoted as $y_{ik}$, are a random variable with a NB distribution. \begin{equation} y_{ik}\sim NB(\mu_{ik}=p_{ik}s_k, \phi_{i}) \end{equation} This NB distribution is characterized by two parameters: the *mean*, $\mu_{ik}$, and the *dispersion*, $\phi_{i}$. Specifically, previously mentioned R packages assume that $\mu_{ik}$ can be expressed as the product of the real proportion of mRNA fragments of the i-th gene in the k-th sample, $p_{ik}$, with a factor, $s_k$, related to the sequencing library size. The parameter $\phi_{i}$ is directly related to the variance of $y_{ik}$ \begin{equation} V(y_{ik})=\mu_{ik}+\phi_i\mu_{ik} \label{EQ:var} \end{equation} Commonly used methods fit GLMs with a logarithmic link and with the elements $x_{kr}$ of the design matrix to infer differential expression \begin{equation} log(p_{ik})=\sum\limits_rx_{kr}\beta_{ir} \label{EQ:mod} \end{equation} In the most simple case, a treatment-control experiment, $r$ be equal to $1$ and the design matrix only will have one column where each cell indicates if sample $k$ was or not treated. `NBSplice` takes the advantage of the previously described strategy although it proposes an alternative approach. It assumes that isoform expression counts, in CPMs, of the i-th isoform from the j-th gene in the k-th sample, $y_{ijk}$, follow a NB distribution as \begin{equation} y_{ijk}\sim NB(media=p_{ijk}\mu_{jk}, dispersion=\phi_{j}) \label{EQ:eq5} \end{equation} where $p_{ijk}$ represents the proportion of the mRNA fragments from the j-th gene ($\mu_{jk}$) belonging to the i-th isoform and the $\phi_{j}$ is the dispersion of the j-th gene, in the k-th sample. In this context, for each isoform, it is possible to predict its mean relative expression or proportion by means of a GLM fit at the level gene according to \begin{equation} log(p_{ijk})=x_{r.}\beta_{ij} \label{EQ:eq6} \end{equation} where $\beta_{ij}$ represents the change in $p_{ijk}$ due to the effect described by the r-th column of the design matrix **X** ($x_{r.}$). Thus, `NBSplice` proposes, unlike other tools such as `DEXSeq` or `Limma` [@ritchie2015limma], the adjustment of one model for each gene. In particular, this model must consider the different gene isoforms as effects. In the most simple case, where only one factor (*cond*) with two levels ($l=1, ~2$) is analyzed , the previous equation is summarized at: \begin{equation} ln(p_{ijl})=\mu_0+iso_{ij}+cond_l+ iso_{ij}:cond_l \label{EQ:eq7} \end{equation} Once coefficient models are estimated, $\hat{\beta}$, `NBSplice` evaluates if the changes in isoforms proportions due to experimental conditions are significant by means of a linear hypothesis test (Wald test) based on the F-statistic defined below [@greene2017econometric]. \begin{equation} H_0:cond_l+ iso_{ij}:cond_l=0 \\ F=\frac{(H\hat{\beta}-q)'(H \hat{cov}(\hat{\beta})H')^{-1}(H\hat{\beta}-q)}{p} \end{equation} Here, $H$ is the contrast matrix, $q$ is a single element matrix having the value of the right side of the test equation ($q=0$), $\hat{cov}(\hat{\beta})$ is the covariance matrix of $\hat{\beta}$ and $p$ is equal to the number of rows of $H$ ($p=1$). In particular, $H$ is a full rank matrix, conformable with $\hat{\beta}$ so, when they are multiplied, the linear combination of the left side of the test equation is defined. Thus, under the null hypothesis, the F statistics will have $F$ distribution with $1$ and $n-K$ degrees of freedom. Specifically, $n$ is the number of samples and $K$ is the length of $\beta$. When only one experimental factor with two levels is analyzed (treatment-control case), $K$ will be equal to the double of the number of analyzed isoforms of gene $j$, $I_j$. The comparison of the F statistics against its distribution under null hypothesis provides a p-value for each isoform of gene $j$. Given the fact that DS affect the whole gene, once isoform hypothesis tests were performed, `NBSplice` uses the Simes test [@simes1986improved] to combine isoform p-values obtaining a gene p-value. Given a family of null hypothesis, $H_{01}, ..., H_{0I_j}$, to test changes in relative isoform expression of the $I_j$ isoforms from gene $j$, the Simes test evaluates only one global null hypothesis \begin{equation} H_0=\bigcap\limits_{p=1}^{I_j}H_{0i} \end{equation} To do this, the isoform p-values are sorted $p-value_{(1)j}< \dots < p-value_{(m)j}< \dots